# short string compression

#### bug13

Joined Feb 13, 2012
1,948
Hi team

Just wondering if a way to compress short string, something like these:
064904EC94F2A1CEFFFFAE0E01CE
I am running out of bandwidth... some data are missing... and I don't have control over buffer size / speed of the communication channel.

I have tried this: https://github.com/antirez/smaz, and my string actually enlarge by 7%

#### Papabravo

Joined Feb 24, 2006
15,521
Compression efficiency is associated with repetitive or redundant patterns in data. Compression on short strings is hardly worth the effort.

#### djsfantasi

Joined Apr 11, 2010
7,562
I suspect that the code required to compress/decompress the string will vastly eat up more space than the string alone.

The only chance I see, is to review the string encoding to see if it can be optimized. Can the entire string be broken into sub fields? Can there range of values be represented in fewer bits?

You don’t say what MCU/language you are using. Can you move some values from RAM to ROM?

One last technique, which may also use more memory than its worth, is to bit-encode values instead of placing them into bytes. You might be able to cram three values into two bytes.

#### WBahn

Joined Mar 31, 2012
26,398
Hi team

Just wondering if a way to compress short string, something like these:

I am running out of bandwidth... some data are missing... and I don't have control over buffer size / speed of the communication channel.

I have tried this: https://github.com/antirez/smaz, and my string actually enlarge by 7%
Compression algorithms take advantage of redundancies in the data. Naive algorithms that don't exploit information about what the data is representing often don't do a good job (or at least not nearly as good a job as they could if they did). There is always some overhead in any compression algorithm so the data being compressed needs to be long enough to make up for it. Furthermore, as you've already observed, for ANY compression algorithm a data set can be constructed such that the compressed data will consume more space than the original data. For lossless compression algorithms this is trivially shown by simply invoking the pigeon hole principle.

Your strings are so short that you are almost certainly going to have to exploit the structure of the underlying information to craft either a more compact representation (probably your best bet) or to design a compression algorithm that exploits redundancies in that information (and that probably will be the much harder of the two).

So what information does this string represent? That's the proper place to start.

#### bug13

Joined Feb 13, 2012
1,948
You don’t say what MCU/language you are using. Can you move some values from RAM to ROM?
I am moving data over a communicate channel that I have no control of. The link can do 100 bytes per second, but I will need to move 110 bytes per seconds. I try to think of way to do that.

#### bug13

Joined Feb 13, 2012
1,948
So what information does this string represent? That's the proper place to start.
They are ASCII coded HEX values. The communication link only accept 0 to F in text (and SOF and EOF). It ignores everything else.

#### WBahn

Joined Mar 31, 2012
26,398
ASCII is just an encoding of some information. What is the information!? What does this string MEAN? Are they voltage readings? Are they switch positions? What ranges can they have? How many different possible distinct values can they take on.

The string you gave has 28 nibbles in it, or 112 bits. That's 2^112 possible different strings. Does the information that is being conveyed really have that many possible values?

#### djsfantasi

Joined Apr 11, 2010
7,562
If indeed you are transmitting text (ASCII) values, then your string is at least twice as long (in bits) than you need. To encode 0x0F in text takes two bytes. To encode the same value in binary takes one byte. Huge waste of transmission bandwidth.

If your code is in front or behind the transmission, sending/receiving text to binary to text is not an expensive operation.

It depends on if you are trying to save communications bandwidth or code size. I suspect the latter is your problem. Hence my suggestion to store constants in ROM vs RAM.

#### djsfantasi

Joined Apr 11, 2010
7,562
In addition, you refer to the transmissions as strings. Strings vs char have an expensive code cost. Are you sure they are strings and not char arrays?

#### bug13

Joined Feb 13, 2012
1,948
ASCII is just an encoding of some information. What is the information!? What does this string MEAN? Are they voltage readings? Are they switch positions? What ranges can they have? How many different possible distinct values can they take on.

The string you gave has 28 nibbles in it, or 112 bits. That's 2^112 possible different strings. Does the information that is being conveyed really have that many possible values?
Oh I think I see what you mean now. If I understand you correctly, I can combine a few fields and reduce 2 - 3 bytes. Is that what you mean?

Here is the information:
1. I think I can combine: SUB, CMD, TYPE and SN into two bytes.
2. No control over SN1 - 4.
3. possible combine ID.L ID.H and V.L V.H into 3 bytes
4. possible no control over STAT and RSSI
So I can reduce the string by 3 bytes. Can I do better?

#### WBahn

Joined Mar 31, 2012
26,398
If indeed you are transmitting text (ASCII) values, then your string is at least twice as long (in bits) than you need. To encode 0x0F in text takes two bytes. To encode the same value in binary takes one byte. Huge waste of transmission bandwidth.

If your code is in front or behind the transmission, sending/receiving text to binary to text is not an expensive operation.

It depends on if you are trying to save communications bandwidth or code size. I suspect the latter is your problem. Hence my suggestion to store constants in ROM vs RAM.
But note that he has said a few times that he has no control over the transmission protocol. He MUST give it an ASCII character that is a hex digit, as well as two specific control characters. He can't transmit it in binary because he doesn't control the transmission protocol. So he has to find a way to encode the INFORMATION in as few hex characters as he can.

#### nsaspook

Joined Aug 27, 2009
8,176
They are ASCII coded HEX values. The communication link only accept 0 to F in text (and SOF and EOF). It ignores everything else.
What you can do with that data link depends on that needs to be communicated. Actively changing Telemetry is notoriously hard to compress.

Lossless compression of truly random data is impossible but you can compress predetermined patterns in a data stream with lookup tables and code books. The transmitted letters KJB could stand for the entire King James Bible with the next code P1W1C1 meaning Page #1, Word #1 Character #1.

#### WBahn

Joined Mar 31, 2012
26,398
Oh I think I see what you mean now. If I understand you correctly, I can combine a few fields and reduce 2 - 3 bytes. Is that what you mean?

Here is the information:
1. I think I can combine: SUB, CMD, TYPE and SN into two bytes.
2. No control over SN1 - 4.
3. possible combine ID.L ID.H and V.L V.H into 3 bytes
4. possible no control over STAT and RSSI
So I can reduce the string by 3 bytes. Can I do better?
View attachment 221392
How many distinct combinations of SUB, CMD, TYPE, and SN values are there?

How many distinct combinations of other fields are there?

#### bug13

Joined Feb 13, 2012
1,948
It depends on if you are trying to save communications bandwidth or code size.
I am trying to save communication bandwidth. I have no control over the communication link. The com. link only take '0' to 'F' in char (and SOF and EOF).

Sorry I didn't make it clear, when I say string, I was really mean char array. I am programming in C.

#### bug13

Joined Feb 13, 2012
1,948
The transmitted letters KJB could stand for the entire King James Bible with the next code P1W1C1 meaning Page #1, Word #1 Character #1.
I see, I will think about in the lookup table direction.

#### bug13

Joined Feb 13, 2012
1,948
How many distinct combinations of SUB, CMD, TYPE, and SN values are there?

How many distinct combinations of other fields are there?
Ok, I think I start to get it, along with the help from @nsaspook. So if the maximum combination of those field are 256, I can build a lookup table of 16 x 16, so I can represent those by one byte? Is that what you are hinting at??

#### WBahn

Joined Mar 31, 2012
26,398
Ok, I think I start to get it, along with the help from @nsaspook. So if the maximum combination of those field are 256, I can build a lookup table of 16 x 16, so I can represent those by one byte? Is that what you are hinting at??
Yes.

#### DickCappels

Joined Aug 21, 2008
7,180
If your range is $0 to$F you can put two characters in a single byte.

#### bug13

Joined Feb 13, 2012
1,948
If your range is $0 to$F you can put two characters in a single byte.
My communication link only accept '0' to 'F' in char and SOF/EOF only.

#### DickCappels

Joined Aug 21, 2008
7,180
I'm sorry but I cannot imagine how you can improve on that based on data compression, given that you will need at least a nybble for each byte, if even to show the change of the 16 bit range. Are there any strings that have different meanings from other strings?