Excluding wrong points in the least squares method

MikeML

Joined Oct 2, 2009
5,444
Excel will do a regression through a set of points, and it is easy to go back and delete some of the points, and the regression is adjusted with the points deleted.
 

Thread Starter

anhnha

Joined Apr 19, 2012
905
Excel will do a regression through a set of points, and it is easy to go back and delete some of the points, and the regression is adjusted with the points deleted.
I'd like to do it in C++. I don't have any experience with this, but the method seems like it would take a long time. Is that okay?
 

Thread Starter

anhnha

Joined Apr 19, 2012
905
As you compute the squared deviations, you apply a threshold test and exclude all points that exceed a certain value.
This appears simple, but I have heard that the method has a lot of caveats. Please tell me if you can figure out what these caveats are.
 

studiot

Joined Nov 9, 2007
4,998
If your deviations are 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1

Then yes, the magnitude of the deviations gives the outliers away, and a simple rule of excluding any deviation greater than X will work.

But if your deviations are 1.3, 1.4, 1.6, 2, 1.5, 0.9, i.e. close together, you need a different test.
Should you exclude the 2 in that example?

Well, take the average of the deviations.

Then set a limit on the deviation from that average and exclude anything beyond it.

So in the example the average deviation is about 1.4, and if we set the exclusion limit at 0.5 we would go back and recalculate, excluding the 2.

The theory of this depends on the fact that, if the data is unbiased, the deviations themselves should be normally distributed, so you can use confidence limits on that distribution to set the exclusion limits.
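A minimal C++ sketch of this idea, assuming the deviations have already been computed elsewhere; the data are the first set above, and the exclusion limit of 2.0 is picked by hand purely for illustration:

Code:
#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    // Deviations from the fitted line (first example set above).
    std::vector<double> deviations{1.3, 1.4, 1.6, 2.0, 1.5, 7.0, 0.9, 2.1, 5.1};
    const double limit = 2.0;   // exclusion limit, chosen by hand here

    // Average deviation.
    const double avg = std::accumulate(deviations.begin(), deviations.end(), 0.0)
                       / deviations.size();

    // Exclude anything further than 'limit' from that average.
    for (double d : deviations) {
        if (std::fabs(d - avg) > limit)
            std::cout << d << " : excluded\n";
        else
            std::cout << d << " : kept\n";
    }
}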
 
Last edited:

studiot

Joined Nov 9, 2007
4,998
Panic mode, those formulae are only for the distance to a straight line.

If the hypothesis were a curved relationship they would not work.

Also, the question was: how would you decide which points are too far away?
We all know you have to reject those.
 

Thread Starter

anhnha

Joined Apr 19, 2012
905
If your deviations are 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1

Then yes, the magnitude of the deviations gives the outliers away, and a simple rule of excluding any deviation greater than X will work.

But if your deviations are 1.3, 1.4, 1.6, 2, 1.5, 0.9, i.e. close together, you need a different test.
Should you exclude the 2 in that example?

Well, take the average of the deviations.

Then set a limit on the deviation from that average and exclude anything beyond it.

So in the example the average deviation is about 1.4, and if we set the exclusion limit at 0.5 we would go back and recalculate, excluding the 2.

The theory of this depends on the fact that, if the data is unbiased, the deviations themselves should be normally distributed, so you can use confidence limits on that distribution to set the exclusion limits.
For the first dataset: 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1
Mean M = 2.54444
Standard deviation SD = 2.07431
OK values: M - SD = 0.47013 < X < 4.61876 = M + SD
Excluded values: 5.1, 7.0

For the second dataset: 1.3, 1.4, 1.6, 2, 1.5, 0.9
Mean M = 1.45
Standard deviation SD = 0.36194
OK values: M - SD = 1.08806 < X < 1.81194 = M + SD
Excluded values: 0.9, 2

If we instead set the exclusion limit at 0.5 around the mean:
OK values: M - 0.5 = 0.95 < X < 1.95 = M + 0.5
Excluded values: 2 (and 0.9, which falls just below the lower limit)

Question:
Why did you choose limit 0.5?
Is there a general way of choosing this value?
 

Papabravo

Joined Feb 24, 2006
21,225
But first, we don't know the line, so how can I calculate the distances and reject points?
It is an iterative process.
  1. You compute the line
  2. You compute the squared deviations
  3. You find the mean and the variance of the squared deviations
  4. You eliminate some points
  5. Goto 1
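A rough C++ sketch of that loop, assuming a straight-line fit y = a + b*x and, as one possible rejection rule, dropping points whose squared deviation exceeds k times the mean squared deviation (the variance could be used to set the cutoff instead, and the choice of k is exactly the open question in this thread):

Code:
#include <cstddef>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Ordinary least squares fit of y = a + b*x; returns {a, b}.
std::pair<double, double> fitLine(const std::vector<Point>& pts) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(pts.size());
    for (const Point& p : pts) {
        sx += p.x;   sy += p.y;
        sxx += p.x * p.x;   sxy += p.x * p.y;
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    const double a = (sy - b * sx) / n;
    return {a, b};
}

// 1. fit, 2. squared deviations, 3. their mean, 4. drop points whose squared
// deviation exceeds k times that mean, 5. refit -- until nothing changes.
std::pair<double, double> fitWithRejection(std::vector<Point> pts,
                                           double k = 4.0, int maxPasses = 10) {
    std::pair<double, double> line = fitLine(pts);
    for (int pass = 0; pass < maxPasses; ++pass) {
        std::vector<double> sq(pts.size());
        double meanSq = 0;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            const double d = pts[i].y - (line.first + line.second * pts[i].x);
            sq[i] = d * d;
            meanSq += sq[i];
        }
        meanSq /= pts.size();

        std::vector<Point> kept;
        for (std::size_t i = 0; i < pts.size(); ++i)
            if (sq[i] <= k * meanSq) kept.push_back(pts[i]);

        // Stop when no point was rejected or too few points remain to refit.
        if (kept.size() == pts.size() || kept.size() < 3) break;
        pts = std::move(kept);
        line = fitLine(pts);
    }
    return line;
}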
 

Thread Starter

anhnha

Joined Apr 19, 2012
905
It is an iterative process.
  1. You compute the line
  2. You compute the squared deviations
  3. You find the mean and the variance of the squared deviations
  4. You eliminate some points
  5. Goto 1
Assume the dataset here consists of points with coordinates (x, y).
1. Compute the line: OK, using the least squares method.
2. Compute the squared deviations: how? With x, y, or both?
3. Find the mean and the variance of the squared deviations: with x, y, or both?
 

Papabravo

Joined Feb 24, 2006
21,225
Assume the dataset here consists of points with coordinates (x, y).
1. Compute the line: OK, using the least squares method.
2. Compute the squared deviations: how? With x, y, or both?
3. Find the mean and the variance of the squared deviations: with x, y, or both?
Since y is a function of x, you are trying to find the equation of a line that represents that relationship. Once you have such a line you compute only the squared deviations of the y coordinate. THE X COORDINATES DON'T HAVE ANY DEVIATIONS. They are the same for the data points and the line. Do you realize how silly the question was?
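In code that boils down to something like this, where a and b are the fitted intercept and slope (just illustrative names):

Code:
// Only the y direction contributes a deviation; x is treated as exact.
double squaredDeviation(double x, double y, double a, double b) {
    const double d = y - (a + b * x);   // vertical distance from point to line
    return d * d;
}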
 

wayneh

Joined Sep 9, 2010
17,498
Without an assignable cause to explain the "bad" data, deleting data is a questionable activity. The best that statistics can do is to identify the data worth looking at more closely. If points are more than 2 or 3 standard deviations away from whatever model fits the other data, it becomes tempting to omit them. But there is no statistical justification for doing so. The data are the data.
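For what it's worth, a small C++ sketch in that spirit, which only flags points for a closer look rather than deleting them; the residuals are assumed to have been computed already, and the 3-sigma default is just the conventional rule of thumb mentioned above:

Code:
#include <cmath>
#include <cstddef>
#include <vector>

// Return the indices of residuals lying more than nSigma standard deviations
// from the mean residual -- candidates to examine, not to delete automatically.
std::vector<std::size_t> flagSuspects(const std::vector<double>& residuals,
                                      double nSigma = 3.0) {
    double mean = 0, var = 0;
    for (double r : residuals) mean += r;
    mean /= residuals.size();
    for (double r : residuals) var += (r - mean) * (r - mean);
    var /= (residuals.size() - 1);          // sample variance
    const double sd = std::sqrt(var);

    std::vector<std::size_t> suspects;
    for (std::size_t i = 0; i < residuals.size(); ++i)
        if (std::fabs(residuals[i] - mean) > nSigma * sd)
            suspects.push_back(i);
    return suspects;
}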
 

studiot

Joined Nov 9, 2007
4,998
But there is no statistical justification for doing so. The data are the data.
There is a considerable body of statistical theory available for handling data.
As with pretty well everything in this universe there is a sliding scale, shades of grey; it is not cut and dried.

The simplest method is a manual estimate, but this is subject to operator bias, so different analysts will obtain different results.
Basic formal methods try to eliminate this, so that all operators obtain the same results from a given set of data.
Least squares is one of the simplest and has the advantage that, via the calculus, it can be applied to any number of variables and to any proposed relationship between them. Extension to many variables takes it beyond Papabravo's comment about the x and y axes in the current example.

The next step up in analytical complexity is to introduce weighting functions or constants.
This can be done when some of the observations are made with a known better accuracy or precision than others.
For instance, readings made with a three-and-a-half-digit voltmeter might be given a lower weighting than those made with a four-and-a-half-digit one.
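A sketch of what that weighting looks like for a straight-line fit, with w as the relative confidence in each reading (e.g. a larger w for the four-and-a-half-digit meter); this is plain weighted least squares, and the names are only illustrative:

Code:
#include <utility>
#include <vector>

struct Obs { double x, y, w; };   // w: relative weight / confidence in this reading

// Weighted least squares fit of y = a + b*x, minimising sum of w*(y - a - b*x)^2.
std::pair<double, double> weightedFit(const std::vector<Obs>& obs) {
    double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
    for (const Obs& o : obs) {
        sw   += o.w;
        swx  += o.w * o.x;
        swy  += o.w * o.y;
        swxx += o.w * o.x * o.x;
        swxy += o.w * o.x * o.y;
    }
    const double b = (sw * swxy - swx * swy) / (sw * swxx - swx * swx);
    const double a = (swy - b * swx) / sw;
    return {a, b};   // intercept, slope
}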

If points are more than 2 or 3 standard deviations away from whatever model fits the other data, it becomes tempting to omit them.
Introducing confidence limits implies that you have a sound statistical model of the distribution of the deviations.
I noted that it is expected to be normal, but this may not be so: the distribution of the deviations could be skewed, in which case your acceptance/rejection criteria will be wrong.

So yes, there is a danger in this, and statistics counters it with weighting.
The problem then becomes one of setting the weights.
 

MrAl

Joined Jun 17, 2014
11,474
Hello there,

It's been a while for me, so I'd have to look a lot of stuff up too, but the basic idea is that once you calculate a given fit, if that fit is statistically good then the number of positive deviations will be about the same as the number of negative deviations. This will rarely happen perfectly, but if the number of positive deviations is much greater than the number of negative ones (or vice versa), then the errors are considered systematic (not random), and so the fit may be wrong or there may be true outliers.

To give a quick example, say we found a fit that is supposed to be a perfectly straight, horizontal line with a constant 'y' value of 10, and say the data points we are comparing against that fit have these 'y' values (list kept short for illustration):
9,11,8,12,9.9,10.2,9.1,10
Here we see 9 has dev=-1, 11 has dev=+1, 8 has dev=-2, 12 has dev=+2, 9.9 has dev=-0.1, 10.2 has dev=+0.2, 9.1 has dev=-0.9, 10 has dev=0.
So with this analysis we see that most of the positive deviations closely match the negative ones (and one point is exact), leaving one unmatched negative deviation to think about. This one (from the 9.1) means there are more negatives than positives, but only by 1. Because there is only one such result, the fit would be considered statistically OK. In fact, if we had four 9's in there (with a larger data set, though) we would probably still consider it OK. But once we get to an excess of 5 or 6 negatives (or positives), we would have to start to believe that the errors are systematic. We might then try a different fit, for example.
I am hoping you can find more about this online, such as calculating the correlation of neighboring points.
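A tiny C++ sketch of that sign check, using the example values above and the constant fit of 10; what counts as "too unbalanced" is left open, as in the post:

Code:
#include <iostream>
#include <vector>

int main() {
    const double fitted = 10.0;   // the constant horizontal-line fit in the example
    const std::vector<double> ys{9, 11, 8, 12, 9.9, 10.2, 9.1, 10};

    int positive = 0, negative = 0;
    for (double y : ys) {
        const double dev = y - fitted;
        if (dev > 0) ++positive;
        else if (dev < 0) ++negative;   // an exact zero counts in neither bin
    }
    std::cout << "positive deviations: " << positive
              << ", negative deviations: " << negative << '\n';
}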

The only way you can eliminate data is if you know something else about the nature of the process the data came from. A good example is one of Radio Shack's old thermistors, where they published a set of data representing the resistance at different temperatures. There was one bad data point in the whole set of maybe 20 values, and the only way we could be sure it was bad was that a thermistor has a known curve shape; comparing the fit to that known shape, we found one point that did not follow it. So rather than change the shape of the fit, we eliminated that point, or even changed it to a predicted value. Amazingly, once we did that, the curve followed the log curve with very little error at any point.

There are many other types of fits that are similar in nature to the sum-of-squares fit. One in particular is more useful because it looks at the percentage error rather than the absolute error, so it would be called something like the "sum of the absolute values of the percentage errors". This is good, for example, when the calculated values differ widely in size but we want about the same percentage error in each, such as when calculating resistor values. If at one point we calculate 100 ohms and at another we calculate 1000 ohms, and both are subject to the same absolute error, then both could be off by, say, 1 ohm; that means the 100 ohm value is off by 1 percent while the 1000 ohm value is off by only 0.1 percent, which is not usually the way we like to do things. We'd usually want the same percentage error. Fitting on percentage error means the 100 ohm calculation could be off by 1 percent and the same for the 1000 ohm resistor, so the 100 ohm value could range from 99 to 101 ohms while the 1000 ohm value would be allowed to be off by 10 ohms, from 990 to 1010 ohms, which is usually the way we like to do things. This spreads the errors more realistically across the range and makes the fit more practical.
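A short sketch of that percentage-error measure, using the 100 ohm / 1000 ohm example; the function name and the 1-ohm errors are purely illustrative:

Code:
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Sum over all points of |fitted - measured| / |measured|, expressed in percent.
double sumAbsPercentError(const std::vector<double>& measured,
                          const std::vector<double>& fitted) {
    double total = 0;
    for (std::size_t i = 0; i < measured.size(); ++i)
        total += 100.0 * std::fabs(fitted[i] - measured[i]) / std::fabs(measured[i]);
    return total;
}

int main() {
    // Both fitted values are off by 1 ohm in absolute terms...
    std::vector<double> measured{100.0, 1000.0};
    std::vector<double> fitted{101.0, 1001.0};
    // ...but this metric charges the 100 ohm point ten times as heavily
    // (1% versus 0.1%), which is usually what we want for resistor values.
    std::cout << sumAbsPercentError(measured, fitted) << " % total\n";   // prints 1.1
}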

There are also fits that are whatever you think is best given the nature of the physical process. For the thermistor data, we would want to try a logarithmic curve fit first because we know that is the nature of these devices.

BTW, in your original post you show a line with several points above it and several below it. Without looking at the actual numerical values, this fit looks like it has systematic errors, because there are quite a few more points above the line than below it. There are not that many points involved, though, so this is really just a guess. If the error is systematic, then the data is not going to be represented very well by a straight line, or at least there is probably a 'curve' that will fit the data better without eliminating points. Of course, again, this is if the nature of the data is not already known, because if it is known to be a straight line beforehand then some of the points MUST be erroneous, and those will be some of the ones above the line.
 
Last edited: