Is there a better way to save bits?

Thread Starter

strantor

Joined Oct 3, 2010
6,782
I wasn't sure whether to post this in the math section or programming; it sorta falls in between

I'm working on a project decoding the data stored in the RFID tags of railcars. They are AEI-ATA (AAR) UHF RFID tags (how about that mouthful; bet the government was involved). They contain a 128-bit binary string into which a surprising amount of data is encoded. The format for encoding this data is really wonky, with the bits for different parameters split up and placed in seemingly random positions of the string, which have to be parsed out and concatenated to interpret. I really question what these guys were thinking when they came up with this.

But one thing that intrigues me is the way they encode the letters of the data field called "initial" (like a callsign, a unique identifier) of the railcar. It consists of 4 letters (A-Z and blank). Since there are 27 possible characters, this could simply and easily be encoded with 5 bits per letter, for a total of 20 bits expressing the 4 letters. 5 bits give 32 possible characters, but only 27 are used, which leaves 5 wasted codes per letter slot.

So to avoid those wasted opportunities they crammed the letters into 19 bits by using arithmetic to combine the letter values into a single number and then encoding that number. Here's an excerpt from the spec explaining how that's done:


[Attachment: Capture.PNG, spec excerpt showing the encoding and decoding formulas]


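In case the attachment doesn't come through, here is my reading of it in code. This is a sketch only: the character values (A=0 through Z=25, blank=26) are my own assumption, chosen to make the 26*27*27*27 count work out; the spec's actual mapping may differ.

Code:
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "   # 27 symbols, blank last

def encode_initial(initial):
    """Combine the 4 characters into one integer (max 511,757 -> 19 bits)."""
    value = 0
    for ch in initial:
        value = value * 27 + ALPHABET.index(ch)   # base-27 place value
    return value

def decode_initial(value):
    """Split the integer back into 4 characters using floor division."""
    chars = ""
    for power in (3, 2, 1, 0):
        index = value // 27**power    # floor division, drop the fraction
        value -= index * 27**power    # remainder feeds the next letter
        chars += ALPHABET[index]
    return chars

print(decode_initial(encode_initial("ATSF")))   # -> "ATSF"

Under a mapping like this the largest encodable value is 511,757, which is why 19 bits suffice.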
Seems like a pretty poor improvement, and I'm wondering if there's a better way it could be done. Better in terms of saving more than one bit, and/or a simpler encoding. I'm thinking not, because 26*27*27*27 = 511,758, which takes a minimum of 19 bits to express (2^18 = 262,144 falls short; 2^19 = 524,288 covers it). But it's still eating at my brain that maybe there's a clever mathematical way to further compress the data. Maybe use some of those bits more than once, I don't know.

Also a point of curiosity: when decoding the letters using the formulae provided, you do not get integers. You must drop (not round) everything to the right of the decimal point. So that says to me that this scheme is still not 100% efficient. There is still some waste. Am I wrong?
 

WBahn

Joined Mar 31, 2012
29,976
Compression works by identifying and exploiting redundancies in the data. With just four characters, that's going to be very difficult. Also, EVERY compression algorithm has inputs that result in the compressed data requiring MORE space than the original data, which means that any compression algorithm you came up with would have the possibility of requiring more than 19 bits for some call signs.

Now, especially with a data field this small, you could enumerate every possible call sign (there are only about half a million of them), determine how many bits each one compresses to, and then just declare that any call sign compressing to more than N bits (presumably with N less than 19) is invalid.

But, in doing so, all you would be doing is restricting the number of unique call signs to a smaller number than now.

Imagine that you have 524,288 rail cars (2^19). Each one needs a unique identifier. Apply whatever manipulation/compression algorithm you want to the data; at the end you will have a string of bits, and you need to be able to match each string to exactly one car, which means that you need 524,288 unique strings. You can't do that with fewer than 19 bits.

As for the rounding, there's no waste there -- it's merely the simplest (conceptually for a human) way that you pick off digits in a number.

Say I have the number Value = 8735 and I want to pick off each digit (so we are working base 10 instead of base 27). I can get the first digit by going

N1 = floor(Value / 10^3) = floor(8735 / 1000) = floor(8.735) = 8

I'm not wasting any information, because I'm not throwing it away; that 0.735 is exactly the starting point for the next step in the process.

Value - N1 * 10^3 = 8735 - 8 * 1000 = 735
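Here's that same process in code, base 10 to match the example (replace 10 with 27 and it's exactly the tag decode):

Code:
value = 8735
digits = []
for power in (3, 2, 1, 0):
    d = value // 10**power    # floor division: 8735 // 1000 -> 8
    digits.append(d)
    value -= d * 10**power    # the "0.735" part, carried to the next step
print(digits)                 # [8, 7, 3, 5]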
 

Papabravo

Joined Feb 24, 2006
21,157
Besides the data bits, are there any strategically placed check bits to allow for error detection and/or correction? If error correction is possible, then separating parts of a message that logically belong together (interleaving) is a way to mitigate the damage of burst errors longer than a certain number of bits. Burst errors are common in RF channels.
 

Thread Starter

strantor

Joined Oct 3, 2010
6,782
Besides the data bits, are there any strategically placed check bits to allow for error detection and/or correction? If error correction is possible, then separating parts of a message that logically belong together (interleaving) is a way to mitigate the damage of burst errors longer than a certain number of bits. Burst errors are common in RF channels.
There are two 2-bit checksums: one in the middle and one at the end.
 

WBahn

Joined Mar 31, 2012
29,976
Sounds like error detection is possible. Error correction -- eh, not so much. You would need a longer checkword.
Error detection is probably good enough. The reader almost certainly has the opportunity to read the chip multiple times until it passes, without the user even realizing it has done so. If it still fails, the user will likely just try again. Given the paucity of bits available, it's probably not justifiable to spend more check bits to get any error-correction ability.
 

Thread Starter

strantor

Joined Oct 3, 2010
6,782
Error detection is probably good enough. The reader almost certainly has the opportunity to read the chip multiple times until it passes, without the user even realizing it has done so. If it still fails, the user will likely just try again. Given the paucity of bits available, it's probably not justifiable to spend more check bits to get any error-correction ability.
Yes, the reader will have all the chances in the world to read it properly. The reader is capable of reading all the railcars of a train going 70 mph as they pass a stationary antenna. This application is in a railyard, so the cars will be either stationary or moving at less than 10 mph.

I am surprised, though, at the weak 2-bit checksum: a 1 in 4 chance of a false positive.
 

WBahn

Joined Mar 31, 2012
29,976
Yes, the reader will have all the chances in the world to read it properly. The reader is capable of reading all the railcars of a train going 70 mph as they pass a stationary antenna. This application is in a railyard, so the cars will be either stationary or moving at less than 10 mph.

I am surprised, though, at the weak 2-bit checksum: a 1 in 4 chance of a false positive.
But there are two of them. I don't know if they both cover the entire data packet or not. If they do, then it's closer to 1 chance in 16.
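If you want to convince yourself of those numbers, a short enumeration does it (this assumes corruption effectively randomizes both the data and the stored check bits):

Code:
from itertools import product

# Count how often a random stored 2-bit check happens to equal a
# random recomputed one: 4 matches out of 16 pairs, i.e. 1 in 4.
matches = sum(stored == computed
              for stored, computed in product(range(4), repeat=2))
print(matches, "of", 16)   # 4 of 16 -> 1 in 4 per checksum
print((1/4) ** 2)          # 0.0625  -> 1 in 16 if both must pass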
 

402DF855

Joined Feb 9, 2013
271
I am surprised, though, at the weak 2-bit checksum: a 1 in 4 chance of a false positive.
Been a while since I did RFID, but I'd swag they have options for more powerful error detection/correction further down the protocol ladder. At least for more recent RFID.
when decoding the letters using the formulae provided, you do not get integers
This is just basic radix conversion, say binary to base 27; you can rearrange things and use floating point if you want, but that would likely be slower.
 

Thread Starter

strantor

Joined Oct 3, 2010
6,782
But there are two of them. I don't know if they both cover the entire data packet or not. If they do, then it's closer to 1 chance in 16.
Checksum #1 covers the first half of the string; checksum #2 covers the second half. I only care about the information in the first half, but I'm evaluating both checksums because I feel like that gets me closer to 1 in 16, though not quite there. 1 in 8, I think? Or maybe not even that. Maybe the logical thing to do is to treat the halves as two separate strings and not impose the requirement of a good checksum on unneeded data. But I feel like calling the integrity of the entire string into question based on a failure of the second half is prudent, even if I'm not knowledgeable enough to quantify how prudent it is.
 

WBahn

Joined Mar 31, 2012
29,976
Checksum #1 covers the first half of the string; checksum #2 covers the second half. I only care about the information in the first half, but I'm evaluating both checksums because I feel like that gets me closer to 1 in 16, though not quite there. 1 in 8, I think? Or maybe not even that. Maybe the logical thing to do is to treat the halves as two separate strings and not impose the requirement of a good checksum on unneeded data. But I feel like calling the integrity of the entire string into question based on a failure of the second half is prudent, even if I'm not knowledgeable enough to quantify how prudent it is.
In theory, if you don't care whether the data in the second half is correct, then it doesn't matter. But having said that, since the likelihood of a false positive (claiming the data is good when it's not) is so high on the first half, I would be tempted to declare the first half bad if it passes but the second half doesn't, on the grounds that corruption likely hit both halves and merely happened to produce a false positive on the first.

A better way is probably to read the data multiple times, require some minimum number of reads that pass their checksums, and then accept whichever packet shows up most often among them. So if you require three passing reads and two of them give you the same data, you assume that that is the real data. Of course, if there is no winner then you keep reading.
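Something like this, for instance (read_tag() and passes_checksums() are placeholders standing in for whatever your reader's serial interface actually provides):

Code:
from collections import Counter

def read_validated(read_tag, passes_checksums, required=2, max_reads=20):
    """Read until `required` identical frames have passed their checksums;
    return the winning frame, or None if no frame wins within max_reads."""
    votes = Counter()
    for _ in range(max_reads):
        frame = read_tag()
        if frame is None or not passes_checksums(frame):
            continue                   # failed read: ignore it and retry
        votes[frame] += 1
        if votes[frame] >= required:   # enough agreeing good reads
            return frame
    return None                        # no winner: keep reading / flag it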
 

Thread Starter

strantor

Joined Oct 3, 2010
6,782
In theory, if you don't care whether the data in the second half is correct, then it doesn't matter. But having said that, since the likelihood of a false positive (claiming the data is good when it's not) is so high on the first half, I would be tempted to declare the first half bad if it passes but the second half doesn't, on the grounds that corruption likely hit both halves and merely happened to produce a false positive on the first.
Well, you're a very smart guy and it sounds like your head is about where mine is, so that's comforting.

A better way is probably to read the data multiple times, require some minimum number of reads that pass their checksums, and then accept whichever packet shows up most often among them. So if you require three passing reads and two of them give you the same data, you assume that that is the real data. Of course, if there is no winner then you keep reading.
That's a great idea. I was just going to keep polling the reader until I got a string that passed both checksums, but now I think I'll implement your suggestion. I will require at least 3 identical reads (or maybe more, however many are feasible) before calling it a good read. That's a great way to supplement the weak checksum. Thank you for the help!
 

402DF855

Joined Feb 9, 2013
271
Most RFID tags are passive, and as such their transmissions are powered by the reader's RF output. A reader is able to read multiple tags during a scan, and in fact a single tag will likely itself be read many times. The reader has to keep track of the tags it sees and ignore any that it deems invalid, whether due to a bad CRC/checksum or even because the tag was read just once (the latter implying it may be an RX anomaly).
 

Thread Starter

strantor

Joined Oct 3, 2010
6,782
Most RFID tags are passive, and as such their transmissions are powered by the reader's RF output. A reader is able to read multiple tags during a scan, and in fact a single tag will likely itself be read many times. The reader has to keep track of the tags it sees and ignore any that it deems invalid, whether due to a bad CRC/checksum or even because the tag was read just once (the latter implying it may be an RX anomaly).
That's correct, and it has crossed my mind to wonder what kind of error checking (if any) goes on at the level of the reader, because the reader has no programming to recognize the 128-bit AAR (Association of American Railroads) format and perform error checking with the provided checksums. But I wonder if, by the time I receive a tag string in my serial buffer, the reader has already scanned the tag several times and gotten the same value each time, essentially verifying that the data is valid (or consistently invalid).
 