Algorithm for finding glitches in waveforms

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
I have a software application that plots waveforms defined by a sequence of floating point numbers (think of the old X Windows program xgraph and you'll get the idea).

I'd like to enhance this program with the ability to detect "significant events". Roughly, I'd define a "significant event" as a part of the waveform whose amplitude differs significantly from that of its neighboring points. For example, a waveform that is all zeros except for one point with an amplitude of 1 has a "significant event" where the 1 occurs.

I'd like to figure out an algorithm that could efficiently find these significant events. A naive approach that works well on "theoretical" waveforms is to subtract each point from its neighbor. This is a simple finite difference approximation to the derivative.
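In numpy terms, the naive version is just a couple of calls (a rough sketch only; the function name and the threshold are arbitrary):

Rich (BB code):
import numpy as np

def naive_glitch_indices(x, thresh):
    """Flag samples whose jump from the previous sample exceeds thresh."""
    d = np.abs(np.diff(x))                 # first difference, length len(x) - 1
    return np.flatnonzero(d > thresh) + 1  # +1 reports the later of the two samples

# all zeros except one spike at index 28000
x = np.zeros(50000)
x[28000] = 1.0
print(naive_glitch_indices(x, 0.5))        # -> [28000 28001]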

However, real world signals contain noise and as most of you know, differentiating a noisy signal leads to more noise.

Can any of you suggest some algorithms that might be useful to find these significant events? I know very little about signal processing.

The application is written in python and the numerical processing uses numpy, an array-processing library written in C. I would like to use only numpy functions, because a python for loop is far too slow when the waveform data can run to millions of points.
 

guitarguy12387

Joined Apr 10, 2008
359
Glitches will have lots of high frequency content. You could use a simple HPF and threshold the results.
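I don't write python day to day, but in scipy terms I'd expect something roughly like this (an untested sketch; the cutoff, filter order and threshold are numbers you'd have to tune for your data):

Rich (BB code):
import numpy as np
from scipy import signal

def hpf_glitch_indices(x, cutoff=0.2, order=4, thresh=0.3):
    """High-pass filter the waveform and flag samples whose filtered
    magnitude exceeds a threshold.  cutoff is a fraction of Nyquist."""
    b, a = signal.butter(order, cutoff, btype='highpass')
    hf = signal.filtfilt(b, a, x)      # zero-phase filtering, so no time shift
    return np.flatnonzero(np.abs(hf) > thresh)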

If you really want to go nuts, you could use wavelets.
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
I'll have to experiment with the high pass filter; thanks for the idea. Any pointers to people doing this in practice?

I haven't a clue how to use wavelets, but my immediate assumption is that the processing cost would be significant.
 

guitarguy12387

Joined Apr 10, 2008
359
Yeah, I'd probably start with an HPF. And if you're doing this in floating point you shouldn't have too many implementation problems.

Tip: If you have access to matlab, USE FDAtool to design your filter. Amazing resource... it will even export C header files and different things like that. No theory required.

If you have zero dsp experience, you might be in for a bit of reading. If that's the case, just use an FIR filter, get the coefficients with FDAtool or the like, and convert some C source code to your language of choice.
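And if matlab isn't an option, scipy can generate FIR coefficients too; roughly like this (untested, and the tap count and cutoff are just guesses):

Rich (BB code):
import numpy as np
from scipy import signal

# design a 101-tap FIR high-pass; cutoff is a fraction of Nyquist
taps = signal.firwin(101, 0.2, pass_zero=False)

# apply it without a python-level loop
x = np.random.randn(100000)              # stand-in for the real waveform
hf = np.convolve(x, taps, mode='same')   # or signal.lfilter(taps, 1.0, x)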

Is this a real time thing or no?
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
I don't have matlab and have no access to it.

I doubt I'll spend the time learning dsp stuff, as the benefit probably isn't worth the work.

It's not a realtime application like you probably meant, but it's a user sitting at a computer and clicking on buttons and things to upload/download waveforms from instrumentation and transform the waveform in memory. Delays of one or two seconds are acceptable, but a delay of 3 seconds or more is probably unacceptable. Since the app runs on PCs running Windows, there will also be older computers out there and the customers running those will be quite annoyed at poor performance.

From a business perspective, this glitch detection is a want, not a must.
 

THE_RB

Joined Feb 11, 2008
5,438
Do you have a screenshot of a typical "glitch"?

It would also be good to see the shape and amplitude of a typical waveform, AND the sample rate, versus what a glitch looks like.
 

guitarguy12387

Joined Apr 10, 2008
359
Gotcha! This piqued my interest... I've started playing with some C++ code for this (free time at work, haha!). If I get it working, I can send you the DSP section, if you can translate it from C/C++.
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
THE_RB: there is no sample rate (or, if you wish, the sample rate is a dimensionless 1). The data are contained in a numpy vector of floating point values. Sometimes the information comes from a digital oscilloscope, but the time information is stripped off, as it's meaningless for the application. There really is no typical waveform, as one of the features of the program is to let you make up any waveform you can imagine.

guitarguy12387: I'd definitely be interested in what you come up with. However, as I mentioned in my first post, I can't do the typical thing I'd do in C or C++, which is drop into a for loop and process individual points. That's just too slow in python. Thus, for performance, I'm limited to the numpy array operations.

That said, I'd still consider using a for loop to do the processing if the algorithm works really well and just tell the users they'll have to bear with the performance hit. I'd love to see what you come up with.

I've attached a couple of pictures. The first shows a "clean" glitch at about index 28000. This is simply one point with an amplitude of about 1. The second picture shows some Gaussian noise added to the waveform. The bottom shows a simple finite difference approximation of the derivative which shows the glitch -- but also shows other stuff that might be interpreted as a glitch. Now imagine the number of points multiplied by 20 to 40 or more and you'll have the gist of the problem I'm interested in finding a solution for. And the data that come from the instruments can always have reasonable amounts of noise on them.
 

Attachments

guitarguy12387

Joined Apr 10, 2008
359
Hmmm, after some matlab investigation, there's not enough power in the high-frequency content of a glitch... I don't think my initial thought would work too well.

Let me think about this some more...
 

Tesla23

Joined May 10, 2009
542
You really have to define what you mean by glitch. One simple definition would be to compare the value at each point with the average of N values on either side. If this exceeded a certain threshold you would declare a glitch. This could be easily implemented by running a sliding average along the data and then doing a point-by-point comparison with the average. Note that most sliding averages will include the actual data point but you can remove this in the comparison. Sliding averages can be done directly with array operations if N is small, or using transform methods (FFT) if N is larger.

You seem to have data with edges, so you may want to modify this algorithm to compare each point with the average of N points before and the average of N points after, and declare a glitch only if BOTH of these comparisons exceed a threshold. If you create an array averaged with a sliding window of N points then you can easily implement the point-by-point comparison by the index into the averaged array, i.e. compare x[i] with av[i-n] and av[i+n], where n is N/2 - you get the idea.
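Sketched in numpy (untested on my part; the window length N and the threshold are arbitrary, and the sample's own contribution is removed from the average as described above):

Rich (BB code):
import numpy as np

def sliding_average_glitches(d, N=21, thresh=0.5):
    """Compare each sample with the mean of a centred window of N
    neighbours, with the sample's own contribution removed."""
    w = np.ones(N) / N
    avg = np.convolve(d, w, mode='same')      # centred sliding average
    avg_excl = (avg - d / N) * N / (N - 1)    # mean of the other N-1 points
    return np.flatnonzero(np.abs(d - avg_excl) > thresh)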
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
Let me explain the context for this "glitch detection" problem. The software can read in data from instruments to acquire a waveform, the user can create a waveform by numerous methods in the program, or a waveform can be loaded from a file. From the programmer's perspective, the waveform could thus be of any form. The program can read in data like that gotten from an oscilloscope (i.e., a pair of vectors where the first is time and the second is amplitude), but the time information is always thrown away and the data in the program is just a single sequence of floating point numbers.

The typical practical vertical resolution is 8 to 14 bits if sending data to and from instruments; however, if the user just wants to create waveforms, their points can be the standard C double floating point of the platform. The number of points allowed in the waveform is 2 to whatever doesn't slow things down horribly. I've tested the program with a waveform of 1e7 points, but it's pretty sluggish. 1e6 to 2e6 is probably the most that folks will use, as a waveform that size can be downloaded from one of the measuring instruments.

The program has to display a sample of the waveform because the typical screen is perhaps 2000 points wide or less. Herein lies the "problem". The high frequency content of the waveform will not be displayed to the user because it's unlikely that short things show up in the sample (I was calling this high-frequency stuff a "glitch" for brevity). I have installed a button that does cause the whole waveform to be drawn, but that can result in over 1e6 calls to a GDI function to draw lines; a 1.2e6 point waveform takes 10-15 seconds to draw. The advantage of drawing all the points is that if there are any glitches like I illustrated above, they will be drawn on the screen.

I have no background in signal processing. Thus, I was wondering if there was an algorithm or heuristic that could be used that would make it more likely these short "significant events" could be shown. The constraint is that I've written the app in python using the numpy array processing library. numpy does things quickly because its algorithms are coded in C, but doing point-by-point processing by a loop in python is too slow for the bigger waveforms (roughly, anything above around 1e5 points). While python allows one to include code written in a compiled language for computation-intensive tasks such as this, I can't use that option for business reasons.

Frankly, I was hoping that I could construct a functional (in the calculus of variations sense) composed of the fast-processing primitives of numpy whose value would alert me to the fact that there were likely glitches in the waveform and that more careful processing was in order. Last year I played around a little with the FFT of the waveforms (which numpy can do) and did simple things like truncating the lower portion of the FFT -- but this wouldn't work here on my examples because those high frequency glitches will essentially have a continuous spectrum that will be lost in the typical high-frequency noise in the waveform's DFT.
 

Tesla23

Joined May 10, 2009
542
I'm no expert on Numpy, but let me explain what I was suggesting.

I suspect that the basis of your glitch detection is to compare a sample with some local average and, to allow for steps, perhaps to compare it with an average of the samples before and an average of the samples after.

Now, computing these averages requires significant processing. For example, imagine you want to create an array x[] where x[i] is the average of N adjacent elements of the data array d[]. This is most easily done using FFTs: simply create a second array w[] consisting of all zeros except for the first N elements, which are 1. Then take the FFT of d, the FFT of w, do an elementwise multiplication and invert the FFT to get x[]. The result is an array where each element is the average of N adjacent elements of d[] (there may be a scaling factor). You should be able to do all this without loops in python. I've just looked at the Numpy documentation and there is a convolve function that does exactly this - it convolves d[] with w[] directly and should make it easier to cope with scaling, indexing and end effects; for large N the FFT route described above will be faster.

Now, for any datapoint d[i], you can find the average of the N values before it and the N values after it in x[i-m] and x[i+k] (there will be some fine tuning to determine m and k, and they depend on the indexing in w[]).

I'm not familiar enough with Numpy to know if you can do the comparisons with their primitives.
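To make the indexing concrete, here is an untested numpy sketch of what I mean (N and the threshold are placeholders, and the offsets are exactly the fine tuning mentioned above):

Rich (BB code):
import numpy as np

def edge_tolerant_glitches(d, N=20, thresh=0.5):
    """Flag d[i] only if it differs from BOTH the mean of the N samples
    before it and the mean of the N samples after it."""
    w = np.ones(N) / N
    m = np.convolve(d, w, mode='full')   # m[k] = mean of d[k-N+1 .. k] away from the ends
    i = np.arange(N, len(d) - N)         # indices with full windows on both sides
    mean_before = m[i - 1]               # mean of d[i-N .. i-1]
    mean_after = m[i + N]                # mean of d[i+1 .. i+N]
    hit = (np.abs(d[i] - mean_before) > thresh) & (np.abs(d[i] - mean_after) > thresh)
    return i[hit]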
 

THE_RB

Joined Feb 11, 2008
5,438
Do you actually need to preserve that "gaussian noise"?

Obvious solutions are based on eliminating HF content, provided the "glitch" you need to remove is of very short duration (is it always 1 sample?), and/or generally removing HF content by averaging a number of samples and possibly restricting the rate of change per sample.

Restricting rate of change can be done very fast, like this:
if(new_samp > (old_samp+10)) new_samp = old_samp+10; // limit to max change of 10
if(new_samp < (old_samp-10)) new_samp = old_samp-10;

And averaging can be done very fast with a single binary divide and an accumulator:
average = (accum/4); // get last average
accum = (accum - average); // subtract 1/4
accum = (accum + new_samp); // add in new sample

If you restrict rate of change first then do the averaging you can effectively remove as much HF content as you need with some very fast executing code.
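I don't use numpy myself, but I gather the averaging above is just a first-order IIR filter, so something like this scipy call should do it in one pass at C speed (a sketch to check rather than production code; alpha = 1/4 matches the divide-by-4 above, and the rate-of-change limit is recursive so it doesn't vectorise as directly):

Rich (BB code):
import numpy as np
from scipy import signal

x = np.random.randn(1000000)    # stand-in for the waveform

# exponential moving average: y[n] = 0.25*x[n] + 0.75*y[n-1]
# (equivalent to the accumulator / divide-by-4 scheme above)
smoothed = signal.lfilter([0.25], [1.0, -0.75], x)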
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
Tesla23:

This sounds very interesting and is something I'm going to have to look into. Alas, I'm currently fighting some bugs in instruments that don't work in an intelligent fashion. I will get back to this hopefully in a few days and will try your technique. Rest assured I will undoubtedly have more questions. :p

THE_RB:

I can't change the user's waveform without his permission. So, if I do give your stuff a try, it will have to be with a copy of the data in memory. These arrays should be less than 18 MB each (I could use an optimization to cut that in half if needed), so making a copy isn't a big deal, but it does slow things down a bit.

Both of your suggestions, limiting the rate of change and averaging, probably require a loop to process individual points in the array. That option isn't available to me because it's too slow.
 

THE_RB

Joined Feb 11, 2008
5,438
You already have to sequentially process every sample in the array! How else are you going to remove or fix a bad sample?

Both the processes I described only need to be performed ONCE per each sample, and only require ONE pass through the data. And they are both extremely fast to execute.

I don't mean to sound rude but if your app doesn't have the processing power to do one pass through the samples and perform a simple operation then it really should not be trying to do digital signal processing. ;)
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
I don't mean to sound rude but if your app doesn't have the processing power to do one pass through the samples and perform a simple operation then it really should not be trying to do digital signal processing. ;)
Oh, absolutely, you're not being rude. I thank you for your help. However, I stated in my first post that I can't use a scripting-language for loop because I already know it's too slow -- it's fine for waveforms on the order of 1e4 points, sluggish for 1e5, and definitely too slow for >1e6. Thus, I have the numpy primitives (which, when they do process things via a for loop, do it at the C library level). Though it's conceivable I could write my own C library, I'm not going to approach that lightly, as it brings other problems associated with packaging and delivery. The app has been in production since last year and I'm also constrained by what the customer is willing to pay to add to it. I'm reasonably confident that glitch detection wouldn't be a must -- it'd be a "nice" if the more important stuff gets solved first.

That said, the numpy environment is fairly rich (it's much like an array-processing language library I designed for myself 25-30 years ago, but I didn't have the computing speed or programming knowledge to implement it). If I were using it every day in, e.g., a scientific programming environment, I'd probably have a better idea of how to do things. But I don't.

I'm doing battle right now with new instrumentation issues, so this glitch detection stuff is on hold. I hope to get back to it in about a week or so.
 

mjhilger

Joined Feb 28, 2011
118
I have stayed out of this discussion as it seems to be on a good course. However, when I first read your post a few days ago, I immediately thought of measuring the gradient. When I began doing image processing in the early 90's we had a chip that computed a weighted sum over a 9x9 neighborhood of surrounding pixels (the per-pixel weighting was multiplied in hardware, the rest was pure addition). When the result changed by more than a fixed threshold (or a dynamic one, like AGC), we recognized an edge - for handwriting, logos, pictures (I was processing your checks with all their funky backgrounds). Your problem seems similar.
Even without a known sampling period, you do have discrete data from a regularly sampled waveform, as I understand from the posts. So applying numerical derivative methods (pure addition/subtraction) exposes those changes, and a second- or third-order difference should yield a good measure.
And while I'm not as up to speed on Z transforms as I should be, that would be a good domain in which to formulate a method as well.
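As a rough illustration of the difference idea (untested; the threshold is arbitrary), a second-order difference responds strongly to a one-sample spike while largely ignoring slow trends:

Rich (BB code):
import numpy as np

x = np.zeros(50000)
x[28000] = 1.0                                 # single-sample glitch

d2 = np.diff(x, n=2)                           # second-order difference
hits = np.flatnonzero(np.abs(d2) > 0.5) + 1    # +1 recentres onto original indices
print(hits)                                    # -> indices around 28000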
 

Thread Starter

someonesdad

Joined Jul 7, 2009
1,583
mjhilger, I thought so too (i.e., using derivatives). But it's complicated by the fact that these million-point waveforms are coming from a measuring instrument having 8 bits of vertical resolution. The quantization error in a simple differencing scheme for the derivative destroys much of the low-frequency information.

Here's an example using numpy (actually, scipy) for an N = 1000 point waveform:

Rich (BB code):
from __future__ import division
from pylab import *
from numpy.random import normal

N, r = 1000, 2*pi
x = sin(arange(0, r, r/N)) + normal(size=N)/50
x *= 127/max(abs(x))
# Convert x to 8 bit signed integers
x = x.astype(int8)
plot(diff(x), label="diff")
plot(x, "r", label="8 bit sine")
title("%d points" % N)
legend()
show()
I've attached a picture of the output showing the derivative of a sine wave with a little noise on it. The diff() function is numpy's calculation of x[n+1] - x[n]. You'd naively expect the blue curve to look like a cosine.

When I first did a diff (the program I've written provides that as one of the features), I was surprised to see an essentially zero waveform rather than the waveform I was expecting (I had a sine displayed on the scope, so I expected to see a cosine). A quick look at the data showed me what was going on. So this is another monkey wrench in the works. However, since the derivative tends to show the high-frequency stuff, it may still be a decent tool to help find glitches and edges (i.e., "interesting" stuff). I'll be looking at this more when I get the scope back.

By the way, on my computer, the diff() function on a 1e7 point waveform takes 63 ms. A python 'for' loop doing exactly the same calculation takes 9.7 seconds, or about 160 times longer. Hence my earlier comments about how the python for loops would slow things down a lot for the user's experience.
 

Attachments

mjhilger

Joined Feb 28, 2011
118
Ok, I think I see the problem. While you need the difference, it must be multiplied by the sample count - in this case 1000 points. I created a spreadsheet to show the difference it makes: series 1 is the sine, series 2 is only the difference, and series 3 is the difference * 500π. See the difference the scaling makes? If you multiply by the 1000 of your sample size you should see a great improvement.

Also, glitches are by definition high frequency; we couldn't care less about the low-frequency changes (unless I don't understand what events you are trying to catch). The attached zip is an excel xls file with a full sine wave across 1000 points, which I used to produce the attached chart. You can see that the difference column contains very small changes. I think this is where you could look for drastic changes, either against a fixed threshold or a dynamic one seeded with a fixed starting value. For the dynamic one you might take the average of the surrounding 10 samples.

Like I was trying to point out with the imaging, the hardware did it serially: it added one pixel and threw one away, and compared the running result to the pixel of interest. Likewise, your actual detection might start 5 or 10 samples in if it is only 1000 points, or, if it is a serial stream, just keep a running group - add one sample and throw one away as the next number comes in. I didn't follow the language you displayed, so I'm not sure how to tell you to implement it there, but I'm sure you know. The goal is to make only one pass over the data and detect the anomaly on the fly. The noise should get averaged out if it is truly random, or, if it begins to interfere, you might have to decrease the sensitivity of your algorithm. I think you'll see it now.
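In the numpy terms of your earlier example, my suggestion looks roughly like this (untested; dx is the sample spacing implied by your arange call, and the smoothing window length is arbitrary):

Rich (BB code):
import numpy as np

# rebuild the noisy 8-bit sine from the earlier post
N, r = 1000, 2*np.pi
x = np.sin(np.arange(0, r, r/N)) + np.random.normal(size=N)/50
x = (x * 127/np.max(np.abs(x))).astype(np.int8)

dx = r / N                              # sample spacing
d = np.diff(x.astype(float)) / dx       # rescaled difference: the underlying cosine
                                        # now has amplitude ~127, but the quantization
                                        # and noise are amplified by the same factor

# a sliding mean knocks the amplified noise back down
w = np.ones(101) / 101
d_smooth = np.convolve(d, w, mode='same')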
 

Attachments
