wu :: forums (http://www.ocf.berkeley.edu/~wwu/cgi-bin/yabb/YaBB.cgi)
(Message started by: Icarus on Feb 5th, 2003, 9:06pm)

Title: Finding p.d.f. from sample data
Post by Icarus on Feb 5th, 2003, 9:06pm
Those of you who are familiar with my posts may have noticed that my knowledge of Probability and Statistics is pretty thin.

I was asked at work to produce a "bell curve" for a particular random variable, given a set of 72 values. I realized that by "bell curve" my boss wanted the probability density function (pdf). The variable is continuous, and it is reasonable to assume that the pdf is something like that of a normal distribution. What I did, after a couple of false starts (and destroying my first day's work when Excel tricked me into saving a corrupted file over my previous good save, not that I'm bitter about it, not at all! >:(), was the following:


where {X_i}, i <= N, is my data set, and mu and sigma are the mean and standard deviation. I was able to show that if {X_i} were normal, then as N --> oo, this will converge almost surely to the normal pdf

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

My question is: Is this what I should have done? Is there a "standard" method of approximating the pdf of a continuous random variable, given a sample of values (I should think so, this is fairly common stuff), and if so, what is it?

Let me add here, that I am talking about a situation where the form of the pdf is not known. If I knew for sure that {Xi} were normal, then it would be easy. What I am looking for is the best way to approximate a pdf when all you know about it (other than maybe a rough idea of its general shape), is the sample data set.

Title: Re: Finding p.d.f. from sample data
Post by towr on Feb 5th, 2003, 11:44pm
If you're explicitly asked to make a bell curve, I would think you may assume a normal distribution. So estimate the mean and standard deviation, and draw the curve.
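A minimal sketch of that suggestion in Python, using only the standard library. The function names (`fit_normal`, `normal_pdf`, `curve_points`) and the synthetic sample are illustrative assumptions, not anything from the thread; the real 72 measurements would replace `sample`.

```python
import math
import random
import statistics


def fit_normal(sample):
    """Return (mean, sample standard deviation) of the data."""
    return statistics.fmean(sample), statistics.stdev(sample)


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def curve_points(sample, n_points=101, width=4.0):
    """(x, density) pairs covering mean +/- width standard deviations."""
    mu, sigma = fit_normal(sample)
    lo = mu - width * sigma
    step = 2.0 * width * sigma / (n_points - 1)
    return [(lo + i * step, normal_pdf(lo + i * step, mu, sigma))
            for i in range(n_points)]


# Synthetic stand-in data (NOT the measurements discussed in the thread).
random.seed(0)
sample = [random.gauss(100.0, 5.0) for _ in range(72)]
mu, sigma = fit_normal(sample)
print(f"mean = {mu:.3f}, st dev = {sigma:.3f}")
```

The pairs returned by `curve_points` can be pasted into any plotting tool. This only draws the bell curve implied by the two estimates; it says nothing about whether the data are actually normal.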

Title: Re: Finding p.d.f. from sample data
Post by James Fingas on Feb 6th, 2003, 1:56pm
If you know that it's normal, then you just find the mean and standard deviation. However, if you don't know that it's normal, then the best way to proceed is to estimate all the various moments.

The mean is the first moment, the variance is the second central moment, but you'll want to compute them all. Once you have these, you can do some funky thing with Fourier transforms (geez, I never thought I'd want to remember this) and try to match up the moments you got with a known PDF. If it doesn't match up with any PDF you can find, then the best you can do is plot a histogram and try to guess at what the curve is.

Title: Re: Finding p.d.f. from sample data
Post by Icarus on Feb 6th, 2003, 6:35pm
In manufacturing applications, you can't ever depend on a pdf to be normal, or any other known form. What I have are a bunch of aircraft CGs (centers of gravity). They are affected not only by the completely random variations that you might expect to produce a normal distribution, but also by a number of biases, resulting from changes in design, manufacturing procedures, and measuring conditions (guess how much fun it is to try to weigh an airplane on a hot windy day! Close the doors, you're in an oven. Open them, and your results swing wildly back and forth. And since this is in Kansas, it is always windy.) So I am only trying to see what the graph looks like, not fit it to some preconceived idea of what it ought to be.

My boss used the description "bell curve" simply because he does not know what the proper terminology is, or even what issues are involved. It took me a while to remember "probability density function" and I'm a mathematician. He's an engineer who has been doing only management stuff for many years. You can't expect anyone like that to know what's what when it comes to probability and statistics. That's why he asked me! (I'm not complaining here - he's actually a pretty good boss, and does his job well from my standpoint.)

My thought process behind the equations I gave was this: each data point represents a substantiation of that value as a probable outcome. Since the pdf is presumably continuous, any nearby value will have nearly the same probability as the sample, with the likelihood of the same probability dropping off as you get further away. So each data point contributes to the pdf not only at that point but also nearby. The best model for how much to contribute was the normal distribution, with the same variance as the R.V. So I multiplied my values by the normal pdf and added it all up. I then decided to check what happens when the samples come from a normal distribution: do I get the same distribution back as N --> oo? The answer was that I would, but only if I added in the (1-sqrt(2))mu term.
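The idea of letting every data point contribute a normal-shaped bump is close to what is usually called kernel density estimation. The sketch below is not necessarily the exact formula from the first post; it is a plain Gaussian kernel estimate, with hypothetical function names and a synthetic sample. It also shows why the kernel width matters: the estimate is a mixture of N bumps, so its variance is exactly the spread of the data plus the kernel variance. With the kernel width set equal to sigma, the estimate has roughly twice the variance of the data, which is presumably why a rescaling was needed to get the normal pdf back. The `1.06 * sigma * N**(-1/5)` width used here is the common normal-reference rule of thumb, one choice among several.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth):
    """Return f(x) = (1/N) * sum of N(x_i, bandwidth^2) densities."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in sample)

    return density


def rule_of_thumb_bandwidth(sample):
    """Normal-reference rule of thumb: 1.06 * s * N^(-1/5)."""
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def kde_variance(sample, bandwidth):
    """Exact variance of the estimate: data spread plus kernel variance."""
    return statistics.pvariance(sample) + bandwidth ** 2


# Synthetic stand-in data (NOT the measurements discussed in the thread).
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(72)]
s = statistics.stdev(sample)
h = rule_of_thumb_bandwidth(sample)
print(f"data variance          : {statistics.pvariance(sample):.3f}")
print(f"KDE variance, h = sigma: {kde_variance(sample, s):.3f}")
print(f"KDE variance, h = {h:.3f}: {kde_variance(sample, h):.3f}")
```

A wide kernel smooths away skew and secondary lumps and pulls the picture towards a single bell shape, while a narrow kernel follows the data more closely at the cost of a bumpier curve. Comparing a few widths against a histogram is a reasonable sanity check.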

The result I got for the actual data is very close to normal. But I am wondering if that was because the data is truly indicative of a normal distribution, or if my method skews the result towards normal? I suspect that both have a role. Histograms for my data show one main lump, but also hint at skew.

James: I will look at the moments and see what I come up with, but if I bring them all in, what I will get is a discrete uniform pdf with 72 possible values, so calculating all the moments from my 72 data points is not going to work. Perhaps if I just do the lower moments ...

Title: Re: Finding p.d.f. from sample data
Post by william wu on Feb 7th, 2003, 12:56am
Here's a link that could help you:


Edit 2:24 PM 2/24/2003: The above URL seems to have been removed. Here's a copy: http://www.ds.unifi.it/VL/VL_EN/point/index.html

Title: Re: Finding p.d.f. from sample data
Post by James Fingas on Feb 7th, 2003, 12:25pm
There is a good tool available to tell whether a process is truly normal. I think it's called a "normality plot". Basically you make a cumulative histogram, but you adjust the y-axis of the chart according to the normal distribution cdf (the cdf is the integral of the pdf).

If the data are normally distributed, then you get an approximately straight line, with the slope set by the standard deviation. If the data come from another type of distribution, you don't get a straight line. A line with a kink in the middle indicates a binormal process (usually showing the same process under two different sets of conditions, e.g. day/night, operator1/operator2, hot/cold, supplier1/supplier2, etc.). It shouldn't be too hard to figure out how to do this...
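One way to build such a plot by hand, sketched in Python with the standard library only. The plotting positions `(i + 0.5) / N` and the choice to put the data on the vertical axis are assumptions (other conventions exist), and the function names are made up for illustration. With this orientation the fitted slope approximates the standard deviation and the intercept approximates the mean; a correlation well below 1 suggests the points do not lie on a straight line.

```python
import math
import random
from statistics import NormalDist, fmean


def normal_plot_points(sample):
    """Pair each sorted value with the standard normal quantile of its rank."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, st dev 1
    zs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(zs, xs))


def fit_line(points):
    """Least-squares line x = intercept + slope * z, plus the correlation r."""
    zs, xs = zip(*points)
    zbar, xbar = fmean(zs), fmean(xs)
    szz = sum((z - zbar) ** 2 for z in zs)
    sxx = sum((x - xbar) ** 2 for x in xs)
    szx = sum((z - zbar) * (x - xbar) for z, x in points)
    slope = szx / szz
    intercept = xbar - slope * zbar
    r = szx / math.sqrt(szz * sxx)
    return slope, intercept, r


# Synthetic stand-in data (NOT the measurements discussed in the thread).
random.seed(2)
sample = [random.gauss(100.0, 5.0) for _ in range(72)]
slope, intercept, r = fit_line(normal_plot_points(sample))
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.4f}")
```

Plotting the `(z, x)` pairs as a scatter chart gives the picture described above; the kink of a two-condition process shows up as two straight segments with different slopes or offsets.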

Title: Re: Finding p.d.f. from sample data
Post by Icarus on Feb 7th, 2003, 9:42pm
I haven't had a chance to look at the link yet, William, but thanks!

James - I know it's not normal now. Following your suggestion, I linearly transformed the data to make the mean = 0 and st dev = 1, then calculated the first 15 moments, and graphed the corresponding polynomial (the truncated moment generating function). When I compare the graph to that of the moment generating function of the normal distribution, it's pretty clear that the two are not the same. I have not yet tried to switch the moments back into a pdf. The normality plot would have been easier, but I had already done this before reading your post.
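A rough sketch of that comparison in Python, standard library only. The function names are hypothetical, and dividing by N (rather than N - 1) when standardizing is an assumption. For a standard normal variable the odd moments are 0 and the even moments are (k-1)!! = 1, 3, 15, 105, ..., and its moment generating function is exp(t^2 / 2).

```python
import math
import random
from statistics import fmean, pstdev


def standardize(sample):
    """Shift and scale the data to mean 0 and (population) st dev 1."""
    mu, sigma = fmean(sample), pstdev(sample)
    return [(x - mu) / sigma for x in sample]


def raw_moments(z, max_order):
    """Sample moments m_0 .. m_max_order of already standardized data."""
    return [fmean([v ** k for v in z]) for k in range(max_order + 1)]


def normal_moment(k):
    """k-th moment of the standard normal: 0 if k is odd, (k-1)!! if even."""
    if k % 2:
        return 0.0
    return float(math.prod(range(1, k, 2)))


def truncated_mgf(moments, t):
    """Polynomial sum of m_k * t^k / k! over the supplied moments."""
    return sum(m * t ** k / math.factorial(k) for k, m in enumerate(moments))


# Synthetic stand-in data (NOT the measurements discussed in the thread).
random.seed(3)
z = standardize([random.gauss(100.0, 5.0) for _ in range(72)])
sample_m = raw_moments(z, 15)
for k in range(3, 9):
    print(f"k = {k}: sample {sample_m[k]:9.3f}   normal {normal_moment(k):7.1f}")
print(f"truncated MGF at t = 1: {truncated_mgf(sample_m, 1.0):.4f}"
      f"   normal: {math.exp(0.5):.4f}")
```

Sample moments of high order are dominated by the few most extreme values, so with only 72 points they are very noisy; differences in the third and fourth moments (skewness and kurtosis) are usually the more trustworthy part of the comparison.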
