Reverse Engineering an Error Correcting Code

This blog entry was inspired by the following thread:

Forward Error Correction Identification

The basic gist is that the Original Poster (OP) had a cheap data module for an RF system that, for some reason, he couldn't disclose the make or model of and he didn't have the documentation on its FEC (Forward Error Correction) features. What he wanted to know was whether anyone here could help identify the particular FEC code being used and, presumably from there, was wanting to know how to encode aribitrary data into codewords, decode arbitrary codewords to extract the underlying data, and how to perform the error correction on corrupted data.

By inputting some data strings into it and monitoring the encoded output he was able to determine that the module took each 8-bit data byte from the input stream and independently encoded it into a 12-bit codeword. He then provided several example data/codeword combinations.

So this blog entry is part detective story and part introduction to error-correcting codes. I'm going to sidestep in-depth mathematical discussions on the theory of error correcting codes and algebraic coding theory and, instead, put everything in a context that hopefully most people will be able to follow in a largely intuitive way.

After telling the OP that I really needed the entire set of <data, codeword> pairs to have much hope of figuring out the underlying algorithm (and whether or not it even really was an FEC code), the OP posted 55 such pairs on a Friday (or the 256, or just over 20%) with the indication that he would post the rest on Monday when he went back to work. During that weekend, I had some time (well, I used time that I really should have been doing something else, but that's what I'm doing right now, when you get down to it) to think about how I wanted analyze the data set once I had the whole thing and to write a program (in C) to do it. As it turned out, the limited data set was not only enough to determine the underlying algorithms, but to identify that one of the values in the original data set was wrong (which was confirmed by the OP almost immediately as being a simple typo).

So how was I able to do this? First, let's point out that there are lots of FEC codes out there and many would be all-but-impossible to figure out based on such a small amount of data -- only 82.5 bytes. But the fact that it was block-oriented -- meaning that a given input byte always produced the same 12-bit codeword regardless of which (or how many) bytes preceded that byte in the data stream -- help immensely. Since the block size was eight bits, there were only 256 codewords and since the codewords were only twelve bits, the number of vectors (i.e., potential codewords) was limited to 4096. As a result, even if every possible data word had to be compared to every potential codeword, we were only looking at slightly over a million possibilities, making an exhaustive search through the code space a very reasonable approach, if needed.

Simple Error Correcting Codes

The idea behind most simple error correcting codes is that we have some data that needs to be transmitted and, because we know that we are likely to experience some amount of data corruption during the transmission/reception process, we "encode" the data in such a way that we place carefully chosen redundant information into the data stream. The receiver, who knows that there are probably a few errors in the received bit stream, uses this redundant information in order to detect and correct at least some of the errors. Those that can't be corrected, assuming they can be detected, are commonly handled by simply requesting that the sender retransmit them.

Perhaps the easiest to understand error correcting code is to transmit every data block three times. The receiver examines the blocks and if any two of them agree, it is assumed that the remaining one, if it doesn't agree, has an error. This code, while very simple, is not very efficient. In order to transmit 1 bit of data, we have to transmit 3 bits of information. The "rate" of the code is 1/3. We can do better.

Making an Error Detecting Code

A slightly more sophisticated approach is to use the notion of a "parity" bit, which is nothing more than an additional bit of information that is added and the value of this bit is set so as to make the total number of '1' bits in the set of bits it "covers" (which includes itself) either "even" or "odd", depending on whichever we decide is "better" in our application.

So if we four bits of data, we might add a fifth bit that results in the overal parity of the codeword being even. Hence, for example,

3210    32100
----    -----
0000 -> 00000
1011 -> 10111
1111 -> 11110

P0 = XOR(D3, D2, D1, D0)
The receiver then checks if the parity of the received codeword is even (meaning that there are an even number of '1's in the codeword). If there are, it assumes that no errors occurred. If it isn't, then it knows that an error has occurred. However, it has no idea which bit (or bits) were corrupted. It's possible (in fact, 20% likely) that the data is fine and it was the parity bit itself that was corrupted. But there's no way to know, so the data has to be retransmitted. This is an example of and "error detecting" code.

It should be noted, if not already obvious, that if there are two errors (if two different bits are flipped during the transmission process), that the overal parity will remain unchanged and the receiver will, wrongly, conclude that no errors occurred. This is always possible with any error detecting or error correcting code, because there are always error patterns that will convert one legitimate codeword into another and, thereby, be undetectable. But when we choose (or design) a code to use, we take into account the error probabilities of the tranmission channels that will be in use. The goal, at this level, is not to make the channel error free, but to reduce the residual error rate to a level that is tolerable.

Making an Error Correcting Code

Being able to detect errors is one thing and it is a very useful thing. But it would be nice if, at least most of the time, we did not have to request that the data be retransmitted. In fact, there are many situations in which retransmission is not possible -- the receiver has one shot to get it right.

So let's see what happens with our toy code if we add a couple more parity bits, so now we have:

3210    3210210
----    -------

P2 = XOR(D3, D2, D0)
P1 = XOR(D3. D1, D0)
P0 = XOR(D2, D1, D0)
So the receiver gets a seven bit codeword and assumes that either there are no errors, or perhaps there is, at most, a single error. Keep in mind that ANY of the seven bits, not just data bits, could be the one with the error and, in fact, a given parity bit is just as likely to be wrong as a given data bit.

At this point, let's walk through a particular example to get a feel for how the mechanics work and then look at the general structure of the code to see why the mechanics work.

Let's transmit the data 0x5 (the '0x' means the value that follows is in hexademical -- if you don't know it, look it up).

So we have:

3210    3210210
----    -------
0101 -> 0101010

P2 = XOR(D3, D2, D0) = XOR(0,1,1) = 0
P1 = XOR(D3. D1, D0) = XOR(0,0,1) = 1
P0 = XOR(D2, D1, D0) = XOR(1,0,1) = 0
So now let's consider the one of the seven possible patterns that the receiver might have to work with in the event of a single-bit error. Let's say that D2 was flipped from a '1' to a '0' such that the receiver saw:

Received pattern:
First, we will assume that the data portion was received correctly and, based on that, generate the codeword we should have seen:

3210    3210210
----    -------
0001 -> 0001111

P2 = XOR(D3, D2, D0) = XOR(0,0,1) = 1
P1 = XOR(D3. D1, D0) = XOR(0,0,1) = 1
P0 = XOR(D2, D1, D0) = XOR(0,0,1) = 1
We then XOR the actual received pattern with the expected pattern to produce an "error vector":

0001010 (received pattern)
0001111 ("correct" pattern if data portion correct)
------- XOR
0000101 (error vector)
If the error vector is all zeros, then the receive codeword matches the correct codeword under the assumption that the data portion was error-free, and so we conclude that there were no errors at all.

If the error vector has just a single '1' in it, then we conclude that there was an error, but it was in the parity bit where the '1' is and that the data portion is actually correct.

But if there are multiple '1's in the parity bits (note that there will never be any in the data bits because we are assuming the received data is correct at this point), then it means that we actually do have an error in the data portion (because we are assuming a single bit error). So what we want to know is which data bit, if any would cause the specific set of parity bits that are indicating errors to all flip (and not cause any of the others to flip). By examining the definitions for how the parity bits are calculated, we see that the only bit that affects P2 and P0 but does not affect P1 is D2. Therefore we conclude that D2 is where the error is.

So let's generalize this a bit. To perform the error correction, we do the following:

1) Take the data portion of the received pattern and generate the correct codeword for that pattern.

2) XOR the received pattern with the generated pattern and use the parity portion of the result to decide what to do. If it is zero, then there is no error and if there is a single '1', then the error was in that parity bit. Otherwise, look for the data bit that affects exactly the set of parity bits that are '1' in the XOR result. If there is one, that is where the error lies. If there isn't one, then the assumption of a single-bit error is wrong and the data is unrecoverable.

We can list all of the possible patterns for the three parity bits after performing the XOR and enumerate which bits need to be corrected by providing a mask containing a '1' at the error locations. All we then need to do is XOR this mask with the original received pattern to produce a valid codeword.

To easily see which parity bit patterns are associated with which data bits, we can list the parity relationships as follows:

D3 XX-
D2 X-X
D1 -XX
In the above table, and X means that changing the data bit associated with that row affects the parity bit associated with that column.

This makes if very straightfoward to map out our error correction vectors:

210 3210210 Error Location
--- ------- --------------
000 0000000 (No Error)
001 0000001 P0
010 0000010 P1
011 0010000 D1
100 0000100 P2
101 0100000 D2
110 1000000 D3
111 0001000 D0
Hamming Weight and Hamming Distance

While it hasn't been mentioned up to this point, everything we have done is intrinsically tied to the notion of Hamming weights and Hamming distances.

The "Hamming weight" of a bit pattern is simply the number of bit positions that contain a '1', while the "Hamming distance" between two bit patterns is the number of bit positions in which the two patterns differ.

If we take the bitwise exclusive-OR (XOR) of the two (equal length) codewords, the result is a '1' in any position in which the two patterns differ and a zero in all other positions (i.e., positions in which they are the same). This vector is frequently referred to as the "error vector" between these two patterns, regardless of whether it is being used for any purpose related to errors. The Hamming distance is therefore simply the weight of the error vector between two bit patterns.

The significance of the error vector is that it identifies the number and location of the bits in either of the patterns that have to be flipped (or have errors) to convert that pattern into the other pattern.

Most error correcting codes work by taking the received bit pattern and identifying the legitimate codeword that has the smallest Hamming distance to it. This is precisely what our algorithm above. In fact, the "toy" error correcting code above is not a toy at all. It is the [7,4] Hamming code (except the bits are ordered differently, which has no effect on its behavior) introduced at Bell Labs in 1950 by Richard Hamming (and later named in his honor). It is still in widespread use today for a variety of applications and is the foundation for an entire class of Hamming codes that are ubiquitious throughout the world.

While a bit more subtle, it is also useful to realize that we do not have to alter just one of the patterns. Let's say that two particular bit patterns have a Hamming distance of two. We can pick one of the bits in the error vector and use it to modify the first pattern and the other bit to modify the second pattern. In doing so, we have converted each pattern to the same third pattern and this pattern has a Hamming distance of one to either of the original patterns.

This is the situation with the use of just a single parity bit, such as the example previously with four data bits and one parity bit. All legitimate codewords have a Hamming distance of at least two from any other codeword. But, for any two codewords that are separated by this minimum distance, there is a third bit pattern that is, in a sense, midway between them. If we receive that bit pattern, it could have been the result of a single bit error from either of the two codewords. In fact, it could have resulted from a single bit error from any of five different codewords! There is no way for us to determine which of the five it is (at least not based strictly on the received bit pattern).

However, if the minimum distance between any two codewords is three, then any single bit error in a codeword results in a pattern that is only a distance of one from that codeword but a distnace of at least two to any other codeword. In general, if we want to be able to correct any errors that affect at most n bits, the minimum distance between any two codewords must be (2n+1).

Analyzing the Partial Codebook

To begin with, we have to understand that, without the complete codebook (i.e., all of the data values and their corresponding codewords), we can't prove that the code is anything (though we can prove things that is is not). We could have 255 of the 256 entries and it might look like a legitimate error correcting code but that 256th entry might only be a Hamming distance of one or two from one of the other codes. Conversely, if we only had five of the entries but two of them were separated by a distance that was less than three, we will have proven that the codebook, as a whole, is not a valid error correcting code.

With this in mind, the goal is to see what we can prove (in terms of what the code is not) or tentatively hypothesize (in terms of what the code is) based on the assumption that the remaining codebook entries behave in a manner consistent with the ones in the partial codebook available.

The first step was to identify the minimum distance of the codebook. This was done by simply walking through the list of codewords and computing the minimum Hamming distance between that codeword and each of the remaining codewords in the list. It turned out that this distance was three, lending strong support to the notion that this was, indeed, an FEC code. Furthermore, if it is was a valid FEC, it was also proven that it could only correct single bit errors.

The next step was to probe deeper and try to discover the structure of the code being used. This was once again done by walking through the codebook (several times, in fact). On each pass through the codebook, a single bit position in the 8-bit data was focused on. For each entry, that bit was flipped and the codeword for the corresponding data value was looked up. If it was in the partial codebook, the error vector between the two codewords was computed. After each comparison, a running tally of which bit positions had been affected was updated so that, at the end of the pass (i.e., before moving to the next bit position in the data value), it could be noted which bit positions in the codeword were always affected by changing that bit in the data value, which positions were never affected, and which positions were affected some of the time but not all of the time.

Multilayered Error Correcting Schemes

Also, the use of these codes seldom occurs in isolation. Let's consider an make-believe example. Let's assume that the transmission channel has a 10% probability of receiving any given bit incorrectly. This means that, on average, we will experience errors in 1 bit out of every 100. But they will be randomly spaced meaning that there is the potential to have two errors back-to-back. So let's say that we use our 4-bits of data with parity. -- while we might eliminate, say, 90% of the errors in the transmission, .

But what if, instead,
  • Like
Reactions: Vi Vincent

Blog entry information

Last update

More entries in General

More entries from WBahn

Share this entry