Excluding wrong points in least square method

anhnha · May 19, 2015

I am using the least squares method to fit a linear equation, but I am running into the problem shown below.
Does anyone have an idea how to handle this?
Thank you.

MikeML · May 19, 2015

Excel will do a regression through a set of points, and it is easy to go back and delete some of the points, and the regression is adjusted with the points deleted.

Papabravo · May 19, 2015

As you compute the squared deviations, you apply a threshold test and exclude all points that exceed a certain value.

anhnha · May 19, 2015

MikeML said:
Excel will do a regression through a set of points, and it is easy to go back and delete some of the points, and the regression is adjusted with the points deleted.

I'd like to do it in C++. I don't have any experience with this, but this method seems like it would take a long time. Is that okay?

anhnha · May 19, 2015

Papabravo said:
As you compute the squared deviations, you apply a threshold test and exclude all points that exceed a certain value.

This appears simple, but I have heard that the method has a lot of caveats. Please tell me if you know what these caveats are.

studiot · May 19, 2015

If your deviations are 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1

Then yes, the magnitude of the deviations gives the outliers away, and a simple rule of excluding any deviation greater than X will work.

But if your deviations are 1.3, 1.4, 1.6, 2, 1.5, 0.9, i.e. close together, you need a different test.
Should you exclude 2 in the example?

Well, take the average of the deviations.

Then set a limit of deviation from that average and exclude anything beyond that.

So in the example the average deviation is about 1.4, and if we set the exclusion limit at 0.5 we would go back and recalculate excluding 2.

The theory behind this relies on the fact that, if the data is unbiased, the deviations themselves should be normally distributed, so you can use confidence limits on that distribution to set the exclusion limits.

panic mode · May 19, 2015

The main caveat is execution speed: compute the line, check whether each point is too far from the line, reject the points that are too far, then repeat the whole thing until all remaining points are close enough.
http://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line
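
A minimal C++ sketch of the point-to-line distance from the linked article, for readers who want to try the check described above. The function names are made up for illustration; the formula is the standard one for a line written as a*x + b*y + c = 0.

```cpp
#include <cassert>
#include <cmath>

// Perpendicular distance from the point (x0, y0) to the line a*x + b*y + c = 0.
double point_line_distance(double a, double b, double c, double x0, double y0) {
    const double norm = std::hypot(a, b);
    assert(norm > 0.0);  // a and b must not both be zero
    return std::fabs(a * x0 + b * y0 + c) / norm;
}

// Same distance for a line in slope-intercept form y = m*x + q,
// rewritten as m*x - y + q = 0.
double point_line_distance_slope(double m, double q, double x0, double y0) {
    return point_line_distance(m, -1.0, q, x0, y0);
}
```

Note that this is the perpendicular distance. An ordinary least squares fit minimizes the vertical (y) distance instead, so the two measures differ unless the line is horizontal.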

studiot · May 19, 2015

Panic mode, those formulae are only for the distance to a straight line.

If the hypothesis were a curved relationship they would not work.

Also, the question was how you would decide which points are too far away.
We all know you have to reject those.

anhnha · May 19, 2015

studiot said:
If your deviations are 1.3, 1.4, 1.6, 2, 1.5, 7.0, 0.9, 2.1, 5.1

Then yes, the magnitude of the deviations gives the outliers away, and a simple rule of excluding any deviation greater than X will work.

But if your deviations are 1.3, 1.4, 1.6, 2, 1.5, 0.9, i.e. close together, you need a different test.
Should you exclude 2 in the example?

Well, take the average of the deviations.

Then set a limit of deviation from that average and exclude anything beyond that.

So in the example the average deviation is about 1.4, and if we set the exclusion limit at 0.5 we would go back and recalculate excluding 2.

The theory behind this relies on the fact that, if the data is unbiased, the deviations themselves should be normally distributed, so you can use confidence limits on that distribution to set the exclusion limits.

For the first dataset: 1.3, 1.3, 1.6, 2, 1.8, 7.0, 0.9, 2.1, 5.1
Mean M = 2.56667
Standard deviation SD = 2.06519
OK value: M-SD = 0.50148 < X < 4.63186 = M + SD
Excluded value: 5.1, 7.0

For the second dataset: 1.3, 1.4, 1.6, 2, 1.5, 0.9
Mean = 1.45
Standard deviation = 0.36194
OK value: M-SD = 1.08806 < X < 1.81194 = M + SD
Excluded value: 0.9, 2

If we set the exclusion limit at 0.5 then:
Mean = 1.45
Standard deviation = 0.36194
OK value: M-Limit = 0.95 < X < 1.95 = M + Limit
Excluded value: 0.9, 2

Question:
Why did you choose limit 0.5?
Is there a general way for choosing this value?
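
For anyone who wants to reproduce this kind of calculation in C++, here is a minimal sketch. It assumes the sample standard deviation (dividing by n - 1), which is what matches the figures quoted for the first dataset as typed above. The names `Stats`, `mean_and_sd` and `outside_band`, and the multiplier `k`, are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Stats {
    double mean;
    double sd;  // sample standard deviation (divides by n - 1)
};

Stats mean_and_sd(const std::vector<double>& v) {
    assert(v.size() >= 2);
    double sum = 0.0;
    for (double x : v) sum += x;
    const double mean = sum / v.size();
    double ss = 0.0;  // sum of squared differences from the mean
    for (double x : v) ss += (x - mean) * (x - mean);
    return {mean, std::sqrt(ss / (v.size() - 1))};
}

// Returns the values lying outside the band mean +/- k * sd,
// in their original order.
std::vector<double> outside_band(const std::vector<double>& v, double k) {
    const Stats s = mean_and_sd(v);
    std::vector<double> out;
    for (double x : v) {
        if (std::fabs(x - s.mean) > k * s.sd) out.push_back(x);
    }
    return out;
}
```

With `k = 1.0` and the first dataset as typed above, this flags 7.0 and 5.1. A larger `k` such as 2 or 3 rejects fewer points; which value is appropriate depends on how confident you are that the deviations are normally distributed.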

anhnha · May 19, 2015

panic mode said:
The main caveat is execution speed: compute the line, check whether each point is too far from the line, reject the points that are too far, then repeat the whole thing until all remaining points are close enough.
http://en.wikipedia.org/wiki/Distance_from_a_point_to_a_line

But at first we don't know the line, so how can I calculate the distances and reject points?

Papabravo · May 19, 2015

anhnha said:
But at first we don't know the line, so how can I calculate the distances and reject points?

It is an iterative process.

1. You compute the line
2. You compute the squared deviations
3. You find the mean and the variance of the squared deviations
4. You eliminate some points
5. Goto 1
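
The loop above could be sketched in C++ roughly as follows. This is only one possible reading of "eliminate some points": it drops points whose vertical residual exceeds `k` times the residual standard deviation, and stops when a round drops nothing. The rejection rule, the names, and the parameters `k` and `max_rounds` are assumptions for illustration, not something specified in this thread.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
struct Line { double slope, intercept; };

// Ordinary least squares fit of y = slope * x + intercept.
Line fit_line(const std::vector<Point>& pts) {
    assert(pts.size() >= 2);
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (const Point& p : pts) {
        sx += p.x; sy += p.y; sxx += p.x * p.x; sxy += p.x * p.y;
    }
    const double n = static_cast<double>(pts.size());
    const double denom = n * sxx - sx * sx;
    assert(denom != 0.0);  // all x equal means a vertical line, not handled
    const double slope = (n * sxy - sx * sy) / denom;
    return {slope, (sy - slope * sx) / n};
}

// Fit, measure residuals, drop points with |residual| > k * s, refit,
// and repeat until a round drops nothing or max_rounds is reached.
Line fit_with_rejection(std::vector<Point> pts, double k, int max_rounds,
                        std::vector<Point>* kept = nullptr) {
    Line line = fit_line(pts);
    for (int round = 0; round < max_rounds && pts.size() > 3; ++round) {
        double ss = 0.0;  // sum of squared vertical residuals
        for (const Point& p : pts) {
            const double r = p.y - (line.slope * p.x + line.intercept);
            ss += r * r;
        }
        const double s = std::sqrt(ss / (pts.size() - 2));
        if (s < 1e-12) break;  // essentially a perfect fit already
        std::vector<Point> next;
        for (const Point& p : pts) {
            const double r = p.y - (line.slope * p.x + line.intercept);
            if (std::fabs(r) <= k * s) next.push_back(p);
        }
        if (next.size() == pts.size() || next.size() < 2) break;
        pts = next;
        line = fit_line(pts);
    }
    if (kept) *kept = pts;
    return line;
}
```

Each round costs one pass to fit and two passes over the residuals, so for n points and a handful of rounds the work grows linearly with n.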

anhnha · May 19, 2015

Papabravo said:
It is an iterative process.

1. You compute the line
2. You compute the squared deviations
3. You find the mean and the variance of the squared deviations
4. You eliminate some points
5. Goto 1

Assume that the dataset consists of points with coordinates (x, y).
1. Compute the line: OK, by using the least squares method.
2. Compute the squared deviations: how? With x, y, or both?
3. Find the mean and the variance of the squared deviations: with x, y, or both?

studiot · May 19, 2015

anhnha said:
Question:
Why did you choose limit 0.5?
Is there a general way for choosing this value?

Yes I did hint at it.

Do you know what confidence intervals are in statistics?

Papabravo · May 19, 2015

anhnha said:
Assume that the dataset consists of points with coordinates (x, y).
1. Compute the line: OK, by using the least squares method.
2. Compute the squared deviations: how? With x, y, or both?
3. Find the mean and the variance of the squared deviations: with x, y, or both?

Since y is a function of x, you are trying to find the equation of a line that represents that relationship. Once you have such a line you compute only the squared deviations of the y coordinate. THE X COORDINATES DON'T HAVE ANY DEVIATIONS. They are the same for the data points and the line. Do you realize how silly the question was?
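
As a rough C++ illustration of this point, the sketch below takes an already fitted line and computes the squared deviations in y only, along with their mean and sample variance. The struct and function names are hypothetical, and the slope and intercept are assumed to come from a separate least squares fit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct DeviationSummary {
    std::vector<double> squared;  // (y_i - (slope * x_i + intercept))^2
    double mean;                  // mean of the squared deviations
    double variance;              // sample variance (divides by n - 1)
};

DeviationSummary squared_y_deviations(const std::vector<double>& xs,
                                      const std::vector<double>& ys,
                                      double slope, double intercept) {
    assert(xs.size() == ys.size() && xs.size() >= 2);
    DeviationSummary out{{}, 0.0, 0.0};
    for (std::size_t i = 0; i < xs.size(); ++i) {
        // Only y deviates: the line is evaluated at the data point's own x.
        const double d = ys[i] - (slope * xs[i] + intercept);
        out.squared.push_back(d * d);
        out.mean += d * d;
    }
    out.mean /= out.squared.size();
    for (double q : out.squared) {
        out.variance += (q - out.mean) * (q - out.mean);
    }
    out.variance /= (out.squared.size() - 1);
    return out;
}
```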

wayneh · May 19, 2015

Without an assignable cause to explain the "bad" data, deleting data is a questionable activity. The best that statistics can do is to identify the data worth looking at more closely. If points are more than 2 or 3 standard deviations away from whatever model fits the other data, it becomes tempting to omit them. But there is no statistical justification for doing so. The data are the data.

studiot · May 20, 2015

wayneh said:
But there is no statistical justification for doing so. The data are the data.

There is a considerable body of statistical theory available for handling data.
As with pretty well everything in this universe there is a sliding scale or shades of grey available, it is not cut and dried.

The simplest method is a manual estimate, but this is subject to operator bias, so different analysts will obtain different results.
Basic formal methods try to eliminate this so that all operators will obtain the same results from a given set of data.
Least squares is one of the simplest and has the advantage that it can be applied to any number of variables, via the calculus, and to any proposed relationship between them. Extension to many variables takes it beyond Papabravo's comment about the x and y axes in the current example.

The next step in analytical complexity is to introduce weighting functions or constants.
This can be done when some of the observations are made to a known better accuracy or precision than others.
For instance, readings made with a three and a half digit voltmeter might be given a lower weighting than those made with a four and a half digit one.

wayneh said:
If points are more than 2 or 3 standard deviations away from whatever model fits the other data, it becomes tempting to omit them.

Introducing confidence limits implies that you have a sound statistical model of the distribution of the deviations.
I noted that it is expected to be normal, but this may not be so and the deviations could be skewed, in which case your acceptance/rejection criteria will be wrong.

So yes, there is a danger in this, and statistics counters it with weighting.
The problem then becomes one of setting the weights.
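
To make the weighting idea concrete, here is a minimal C++ sketch of a weighted least squares line fit. Each point's squared y deviation is multiplied by its weight before summing, so a low weight reduces a point's influence without deleting it. The function name and the way weights are passed in are assumptions for illustration; how to choose the weights is left open, as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Line { double slope, intercept; };

// Minimizes sum_i w_i * (y_i - (slope * x_i + intercept))^2.
Line fit_line_weighted(const std::vector<double>& xs,
                       const std::vector<double>& ys,
                       const std::vector<double>& ws) {
    assert(xs.size() == ys.size() && xs.size() == ws.size());
    double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        assert(ws[i] >= 0.0);  // weights must be non-negative
        sw   += ws[i];
        swx  += ws[i] * xs[i];
        swy  += ws[i] * ys[i];
        swxx += ws[i] * xs[i] * xs[i];
        swxy += ws[i] * xs[i] * ys[i];
    }
    const double denom = sw * swxx - swx * swx;
    assert(denom != 0.0);  // needs at least two distinct x with weight > 0
    const double slope = (sw * swxy - swx * swy) / denom;
    return {slope, (swy - slope * swx) / sw};
}
```

Setting every weight to 1 gives the ordinary least squares line, and a weight of 0 has the same effect as leaving that point out.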

MrAl · May 20, 2015

Hello there,

It's been a while for me, so I'd have to look a lot of this up too, but the basic idea is that once you calculate a given fit, if that fit is statistically a good fit then the number of positive deviations will be the same as the number of negative deviations. Since this will rarely happen perfectly, if the number of positive deviations is much greater than the number of negative deviations (or vice versa) then the errors are considered systematic (not random), and so the fit may be wrong or there may be true outliers.

To give a quick example, say we found a fit that was supposed to be a perfectly straight, horizontal line, with a constant 'y' value of 10. Say the data points that the fit is supposed to predict have these 'y' values (list kept short for illustration):
9, 11, 8, 12, 9.9, 10.2, 9.1, 10
Here we see 9 has dev=-1, 11 has dev=+1, 8 has dev=-2, 12 has dev=+2, 9.9 has dev=-0.1, 10.2 has dev=+0.2, 9.1 has dev=-0.9, and 10 has dev=0.
So with this analysis we see that some of the positive deviations match the negative ones exactly or closely, one value sits exactly on the line, and one lone negative deviation is left over, so we have one negative deviation to think about. This one (from the 9.1) means there are more negatives than positives, but only by 1. Because there is only 1 such result, the fit would be considered statistically OK. In fact, if we had four 9's in there (with a larger data set though) we would probably still consider that OK. But once we get to 5 or 6 extra negatives (or positives) we would have to start to believe that the errors are systematic. We might try a different fit then, for example.
I am hoping you can find more about this online, such as calculating the correlation of neighboring points.
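
The counting in the example above can be sketched in C++ as follows. The names are made up for illustration, and the decision about how large an imbalance counts as "systematic" is left to the reader, as in the post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SignCount { int positive, negative, zero; };

// Counts how many observed values lie above, below, and exactly on the fit.
SignCount count_signs(const std::vector<double>& observed,
                      const std::vector<double>& fitted) {
    assert(observed.size() == fitted.size());
    SignCount c{0, 0, 0};
    for (std::size_t i = 0; i < observed.size(); ++i) {
        const double dev = observed[i] - fitted[i];
        if (dev > 0.0) ++c.positive;
        else if (dev < 0.0) ++c.negative;
        else ++c.zero;
    }
    return c;
}
```

For the eight values in the example and a fitted value of 10 everywhere, this gives 3 positive, 4 negative and 1 zero deviation, i.e. one more negative than positive.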

The only way you can eliminate data is if you know something else about the nature of the process from which the data came. A good example is from one of Radio Shack's old thermistors, where they posted a set of data representing the resistance at different temperatures. There was one bad data point in the whole set of maybe 20 values, and the only way we could be sure it was bad was because there is a known curve shape for a thermistor, and comparing the fit to the known shape we found that there was one point that did not fit that shape. So rather than change the shape of the fit we found, we eliminate that point, or even change it to a predicted value. Amazingly, once we do that the curve follows the log curve with very little error at any point.

There are many other types of fits that are similar in nature to the sum of squares fit. One in particular is more useful because it looks at the percentage error rather than the absolute error. So it would be called something like the "sum of absolute value of the percentage error". This would be good, for example, when calculating values that are quite a bit different but where we want about the same percentage error in each calculation, such as when calculating resistor values. If at one point we calculate 100 ohms and at another point we calculate 1000 ohms, and both are subject to the same absolute error, both the 100 ohm and the 1000 ohm value could be off by say 1 ohm. That means the 100 ohm value is off by 1 percent while the 1000 ohm value is off by only 0.1 percent, which is not usually the way we like to do things. We'd usually want the same percentage error. Using a fit based on percentage error means the calculation of the 100 ohm value could be off by 1 percent and the same for the 1000 ohm value, so the 100 ohm value could be from 99 to 101 ohms while the 1000 ohm value would be allowed to be off by 10 ohms, from 990 to 1010 ohms, which is usually the way we like to do things. This means the errors are spread more realistically across the range, which makes the fit more practical.
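
A small C++ sketch of the two error measures being compared here, using the resistor numbers from the paragraph above. The function names are hypothetical, and this only evaluates the two measures for given values; it does not perform the fit that minimizes them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of |predicted - actual| over all points.
double sum_abs_error(const std::vector<double>& actual,
                     const std::vector<double>& predicted) {
    assert(actual.size() == predicted.size());
    double total = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        total += std::fabs(predicted[i] - actual[i]);
    }
    return total;
}

// Sum of |predicted - actual| / |actual| * 100 over all points.
// Every actual value must be nonzero.
double sum_abs_percent_error(const std::vector<double>& actual,
                             const std::vector<double>& predicted) {
    assert(actual.size() == predicted.size());
    double total = 0.0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        assert(actual[i] != 0.0);
        total += std::fabs(predicted[i] - actual[i]) / std::fabs(actual[i]) * 100.0;
    }
    return total;
}
```

With actual values of 100 and 1000 ohms and predictions of 101 and 1001 ohms, the absolute errors are equal (1 ohm each), but the percentage errors are 1 percent and 0.1 percent, so the percentage-based measure penalizes the 100 ohm miss ten times more heavily.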

There are also fits that are whatever you think is best given the nature of the physical process. For the thermistor data, we would want to try a logarithmic curve fit first because we know that is the nature of these devices.

BTW, in your original post you show a line with several points above the line and several below it. Without looking at the actual numerical values, this fit looks like it has systematic errors, because there are quite a few more points above the line than below it. There are not that many points involved, though, so this is really just a guess. If the error is systematic then the nature of the data is not going to be represented very well by a straight line, or at least there is probably a better 'curve' that will fit the data without eliminating points. Of course, again, this is if the nature of the data is not already known, because if it is known beforehand to be a straight line then some of the points MUST be erroneous, and those will be some of the ones above the line.
