I'm working on a project decoding the data stored in the RFID tags of railcars. They are AEI-ATA (AAR) UHF RFID tags (how about that mouthful; bet the government was involved). The contain a 128bit binary string into which is encoded a surprising amount of data. The format for encoding this data is really wonky, with bits for different parameters being split up and placed in random positions of the string, that need to be parsed and concatenated to interpret. I really question what these guys were thinking when they came up with this.

But one thing that intrigues me is the way they encode the letters of the data field called "initial" (like a callsign, unique identifier) of the railcar. It consists of 4 letters (A-Z and blank) . Since there are 27 possible letters, this could simply and easily be encoded with 5 bits per letter, for a total of 20 bits expressing the 4 letters. 5 bits = 32 possible characters but only 27 characters are used, which leaves 5 wasted opportunities per letter slot.

So to avoid those wasted opportunities they crammed the letters into 19 bits by using arithmetic to combine the letter values into a single number and then encoding that number. Here's an excerpt from the spec explaining how that's done:

Seems like a pretty poor improvement and I'm wondering if there's a better way that it could be done. Better in terms of saving more than one bit and/or a simpler cipher. I'm thinking not, because 26*27*27*27 = 511,758, which takes a minimum of 19 bits to express. But it's still eating at my brain that maybe there's a clever mathematical way to further compress the data. Maybe use some of those bits more than once, I don't know.

Also a point of curiosity; when decoding the letters using the formulae provided, you do not get integers. You must drop (not round) all data to the right of the decimal. So that says to me that this scheme is still not 100% efficient. There is still some waste. Am I wrong?