I am using the least squares method to find a linear equation, but I am running into the problems described below.
Does anyone have an idea for this?
Thank you.
I'd like to do it in C++. I don't have any experience with this, but the method seems like it would take a long time. Is that okay?
Excel will do a regression through a set of points, and it is easy to go back and delete some of the points; the regression is then adjusted with those points deleted.
This appears simple, but I have heard that the method has a lot of caveats. Please tell me if you figure out what these caveats are.
As you compute the squared deviations, you apply a threshold test and exclude all points that exceed a certain value.
For the first dataset: 1.3, 1.3, 1.6, 2, 1.8, 7.0, 0.9, 2.1, 5.1
If your deviations are 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1, then yes, the magnitude of the deviations gives the outliers away, and simply excluding any deviation greater than X will work.
But if your deviations are 1.3, 1.4, 1.6, 2, 1.5, 0.9, i.e. close together, you need a different test.
Should you exclude 2 in the example?
Well, take the average of the deviations.
Then set a limit on the deviation from that average and exclude anything beyond it.
So in the example the average deviation is about 1.4 (8.7 / 6 = 1.45), and if we set the exclusion limit at 0.5 we would go back and recalculate excluding 2.
The theory behind this depends on the fact that, if the data are unbiased, the deviations themselves should be normally distributed, so you can use confidence limits on that distribution to set the exclusion limits.
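A minimal C++ sketch of the "limit around the average" test described above. The function name `flagOutliers` and the one-sided comparison are assumptions made for illustration: only deviations that exceed the mean by more than the limit are flagged, on the reasoning that an unusually small deviation is a good fit rather than an outlier.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Returns the indices of the deviations that exceed the mean deviation by
// more than `limit`. One-sided by assumption: small deviations are kept.
std::vector<std::size_t> flagOutliers(const std::vector<double>& deviations,
                                      double limit) {
    std::vector<std::size_t> flagged;
    if (deviations.empty()) return flagged;

    const double mean =
        std::accumulate(deviations.begin(), deviations.end(), 0.0) /
        static_cast<double>(deviations.size());

    for (std::size_t i = 0; i < deviations.size(); ++i) {
        if (deviations[i] - mean > limit) flagged.push_back(i);
    }
    return flagged;
}
```

Called with the six close-together deviations 1.3, 1.4, 1.6, 2, 1.5, 0.9 and a limit of 0.5, this flags only index 3 (the value 2), since 2 - 1.45 = 0.55.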
But first, we don't know the line, so how can I calculate the distance and reject points?
The main caveat is execution speed: compute the line, check whether each point is too far from it, reject the points that are too far, then repeat the whole thing until all remaining points are close enough.
http://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line
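For a line written as y = slope * x + intercept, the perpendicular distance from the linked article reduces to a one-line formula. The sketch below shows it next to the plain vertical distance, which is the quantity an ordinary least-squares fit of y on x works with. The function and parameter names are hypothetical.

```cpp
#include <cmath>

// Perpendicular distance from the point (x0, y0) to the line
// y = slope * x + intercept.
double perpendicularDistance(double slope, double intercept,
                             double x0, double y0) {
    return std::fabs(slope * x0 - y0 + intercept) /
           std::sqrt(slope * slope + 1.0);
}

// Vertical distance: difference in y only, at the same x.
double verticalDistance(double slope, double intercept,
                        double x0, double y0) {
    return std::fabs(y0 - (slope * x0 + intercept));
}
```

The two agree only for a horizontal line; otherwise the perpendicular distance is the smaller one, by a factor of sqrt(slope^2 + 1).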
It is an iterative process.
1. You compute the line
2. You compute the squared deviations
3. You find the mean and the variance of the squared deviations
4. You eliminate some points
5. Goto 1
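The loop above could be sketched in C++ roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the names `Point`, `Line`, `fitLine` and `fitRejectingOutliers` are hypothetical, deviations are measured in y only, and the elimination rule (drop any point whose squared deviation exceeds the mean by more than `k` standard deviations) is just one possible choice that the thread does not prescribe.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
struct Line { double slope, intercept; };

// Step 1: ordinary least-squares fit of y = slope * x + intercept.
// Assumes at least two points with distinct x values.
Line fitLine(const std::vector<Point>& pts) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const Point& p : pts) {
        sx += p.x; sy += p.y; sxx += p.x * p.x; sxy += p.x * p.y;
    }
    const double n = static_cast<double>(pts.size());
    const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return {slope, (sy - slope * sx) / n};
}

// Steps 1-5: refit until a pass rejects nothing. `kept` is reduced in place.
// `k` is an assumed cutoff, in standard deviations of the squared deviations.
Line fitRejectingOutliers(std::vector<Point>& kept, double k = 2.0) {
    Line line = fitLine(kept);
    while (kept.size() > 2) {
        // Step 2: squared deviations in y.
        std::vector<double> sq;
        double mean = 0;
        for (const Point& p : kept) {
            const double d = p.y - (line.slope * p.x + line.intercept);
            sq.push_back(d * d);
            mean += d * d;
        }
        // Step 3: mean and variance of the squared deviations.
        const double n = static_cast<double>(sq.size());
        mean /= n;
        double var = 0;
        for (double s : sq) var += (s - mean) * (s - mean);
        var /= n;
        // Step 4: eliminate points above the cutoff.
        const double cutoff = mean + k * std::sqrt(var);
        std::vector<Point> next;
        for (std::size_t i = 0; i < kept.size(); ++i) {
            if (sq[i] <= cutoff) next.push_back(kept[i]);
        }
        if (next.size() == kept.size() || next.size() < 2) break;
        // Step 5: go back to step 1 with the reduced set.
        kept = next;
        line = fitLine(kept);
    }
    return line;
}
```

Each pass costs time proportional to the number of points and removes at least one point or stops, so the total work is bounded by roughly the square of the point count. That is the execution-speed caveat in concrete terms.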
Yes, I did hint at it.
Question:
Why did you choose a limit of 0.5?
Is there a general way for choosing this value?
Since y is a function of x, you are trying to find the equation of a line that represents that relationship. Once you have such a line, you compute only the squared deviations of the y coordinate. THE X COORDINATES DON'T HAVE ANY DEVIATIONS. They are the same for the data points and the line. Do you realize how silly the question was?
Assume that the dataset here consists of points with coordinates (x, y).
1. Compute the line: OK, by using the least squares method.
2. Compute the squared deviations: how? With x, y, or both?
3. Find the mean and the variance of the squared deviations: with x, y, or both?
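In symbols, and taking deviations in y only as the reply above argues, the three steps could be written as below for points $(x_i, y_i)$, $i = 1, \dots, n$. The symbols $m$, $b$, $d_i$, $\bar{s}$ and $\sigma_s^2$ are introduced here for illustration, and the variance uses the population ($1/n$) form as an assumption.

$$
m = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
b = \frac{\sum_i y_i - m \sum_i x_i}{n}
$$

$$
d_i = y_i - (m x_i + b),
\qquad
\bar{s} = \frac{1}{n} \sum_i d_i^2,
\qquad
\sigma_s^2 = \frac{1}{n} \sum_i \left(d_i^2 - \bar{s}\right)^2
$$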
There is a considerable body of statistical theory available for handling data.
But there is no statistical justification for doing so. The data are the data.
Introducing confidence limits implies that you have a sound statistical model of the distribution of the deviations.
If points are more than 2 or 3 standard deviations away from whatever model fits the other data, it becomes tempting to omit them.