short string compression

panic mode

Joined Oct 10, 2011
2,752
yea... that link is wasting 50% of data stream

the
Hi team

Just wondering if there is a way to compress short strings, something like these:


I am running out of bandwidth... some data are missing... and I don't have control over buffer size / speed of the communication channel.

I have tried this: https://github.com/antirez/smaz, and my string actually grew by 7%
there is no magic, it is all about using patterns and reducing complexity of data using lookup table.


btw. did you check readme for this?
did you see the note about compression rates?
it is bad for numbers and your example "064904EC94F2A1CEFFFFAE0E01CE" is mostly numbers.
the best compression rate is for lowercase letters.

smaz.c shows the codebooks used but changing them to optimize would be a bit of effort.

so.... did you consider shifting all nibble values in your string from hexadecimal range '0'-'F' into range of lowercase letters such as 'a'-'p' before compression...? this should maximize what smaz can do for you and maybe instead of +7% change you can get closer to those -46% or so. if that works on other side just reverse process.
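
A minimal sketch of the remapping suggested above, assuming the payload is an uppercase ASCII hex string. The helper names `hex_to_letters` and `letters_to_hex` are made up for illustration and are not part of smaz. The mapping alone does not shrink anything; it only changes which characters the compressor sees, and it has to be reversed on the receiving side.

```python
# Map ASCII hex digits '0'-'F' onto lowercase letters 'a'-'p' and back.
# Hypothetical helpers for illustration only; smaz itself is not called here.
HEX_DIGITS = "0123456789ABCDEF"
LETTERS = "abcdefghijklmnop"

TO_LETTERS = str.maketrans(HEX_DIGITS, LETTERS)
TO_HEX = str.maketrans(LETTERS, HEX_DIGITS)


def hex_to_letters(text: str) -> str:
    """Replace every hex digit with its letter in 'a'-'p'."""
    return text.translate(TO_LETTERS)


def letters_to_hex(text: str) -> str:
    """Reverse of hex_to_letters."""
    return text.translate(TO_HEX)


sample = "064904EC94F2A1CEFFFFAE0E01CE"  # example string from this thread
shifted = hex_to_letters(sample)
restored = letters_to_hex(shifted)
print(shifted, restored == sample)
```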
 

WBahn

Joined Mar 31, 2012
30,062
yea... that link is wasting 50% of data stream

the


there is no magic, it is all about using patterns and reducing complexity of data using lookup table.


btw. did you check readme for this?
did you see the note about compression rates?
it is bad for numbers and your example "064904EC94F2A1CEFFFFAE0E01CE" is mostly numbers.
the best compression rate is for lowercase letters.

smaz.c shows the codebooks used but changing them to optimize would be a bit of effort.

so.... did you consider shifting all nibble values in your string from hexadecimal range '0'-'F' into range of lowercase letters such as 'a'-'p' before compression...? this should maximize what smaz can do for you and maybe instead of +7% change you can get closer to those -46% or so. if that works on other side just reverse process.
And what if he does this? He still has to convert it to a string of hexadecimal characters before sending it to the communications module for transmission. I suspect the data is just too short to get meaningful compression to overcome the overhead.

The reason that smaz doesn't work well for numbers is that the codebook it uses is intended for text messages. So it contains entries for common string fragments, including many HTML tags, that are commonly seen in text message transmission and replaces the fragment with a one-byte index. For anything that isn't in the codebook, it has to encode it as a verbatim string, which takes either one or two bytes of overhead. If the entire string contained nothing usable from the codebook, then it would add two bytes to the unaltered data stream (255 followed by the length of the verbatim string, which would be 28 in this case). That would then add 4 characters to the final output string, thus growing it by about 14%.

If you convert things to lower-case letters (easy to do) you will still only get any compression if the resulting string of lower-case letters just happens to have fragments that are in the 254-entry codebook, which means they spell a common word or string fragment used in text message data. This is going to be pretty unlikely.

You could come up with your own codebook, but if the values in most of the fields can span the full 8-bit value space and are largely uncorrelated, then this will be pointless. You could gain compression on the fields the TS indicated they could combine, but you have limited codebook space and you have to have the code preamble of either one or two bytes.

I still think his best shot is looking at the underlying encoding of the information. It appears that he thinks he can reduce his 14 bytes into 11 bytes by combining some of the fields using a lookup table. If so, that will achieve a 21% compression and he said he needed to get about 10% compression to meet his budget. So that should take care of the problem. Because the string is highly structured, there's no need for any compression preamble at all.
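
A quick check of that arithmetic, assuming every raw byte goes out as two ASCII hex characters (as in the 28-character example string) and ignoring any SOF/EOF framing. The 10% figure is the budget mentioned above.

```python
# Savings from shrinking the raw payload from 14 to 11 bytes.
# Assumes two ASCII hex characters per raw byte and ignores framing.
def hex_chars(raw_bytes: int) -> int:
    """Number of characters on the wire for a given raw payload size."""
    return 2 * raw_bytes


before = hex_chars(14)
after = hex_chars(11)
saving = (before - after) / before
needed = 0.10  # reduction the TS said was required

print(f"{before} -> {after} characters, saving {saving:.1%}")
print("meets budget:", saving >= needed)
```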

If he COULD tweak the transmission setup, he could configure it for 7 data bits instead of 8 and be done with it.
 

WBahn

Joined Mar 31, 2012
30,062
Oh I think I see what you mean now. If I understand you correctly, I can combine a few fields and reduce 2 - 3 bytes. Is that what you mean?

Here is the information:
  1. I think I can combine: SUB, CMD, TYPE and SN into two bytes.
  2. No control over SN1 - 4.
  3. Possibly combine ID.L, ID.H and V.L, V.H into 3 bytes.
  4. Possibly no control over STAT and RSSI.
So I can reduce the string by 3 bytes. Can I do better?
View attachment 221392
We still need more information in order to tell if you can do better.

Again, for the SUB, CMD, TYPE, and SN fields, how many possible values are there for each? Are there any combinations that are not allowed?

The same for the ID.x and V.x fields. How many different combinations and are there any that are not allowed?

When you say no control over SN1 - 4, what does that mean? Clearly you had the ability to manipulate them otherwise you couldn't have used smaz on them in the first place.

Let's say that each of the four could be any value from 0 to 63 (i.e., a six-bit value). By simply combining them you end up with 3 bytes instead of 4. So what is the range of values, and are there any combinations that are not allowed?
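
A minimal sketch of that hypothetical case: four values that each fit in six bits, packed into three bytes. The function names `pack4x6` and `unpack4x6` are invented for illustration, and the 0 to 63 range is only the assumption made above, not a confirmed property of the real fields.

```python
# Pack four 6-bit values (0-63) into 3 bytes and unpack them again.
# The 6-bit range is a hypothetical; fields that use all 8 bits cannot
# be packed this way.
def pack4x6(a: int, b: int, c: int, d: int) -> bytes:
    for value in (a, b, c, d):
        if not 0 <= value <= 63:
            raise ValueError("each value must fit in 6 bits (0-63)")
    word = (a << 18) | (b << 12) | (c << 6) | d  # 24 bits in total
    return word.to_bytes(3, "big")


def unpack4x6(data: bytes) -> tuple:
    if len(data) != 3:
        raise ValueError("expected exactly 3 bytes")
    word = int.from_bytes(data, "big")
    return tuple((word >> shift) & 0x3F for shift in (18, 12, 6, 0))


packed = pack4x6(63, 0, 17, 42)
print(len(packed), unpack4x6(packed))
```

The same idea extends to fields of unequal width, as long as the total number of bits needed is smaller than the number of bits currently being sent.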
 

Analog Ground

Joined Apr 24, 2019
460
If SUB and CMD are fixed values as shown in your table, can you simply throw them away? If they must remain, then it seems they cannot be combined with anything.
 

panic mode

Joined Oct 10, 2011
2,752
do all parts of the message need to be transmitted at same intervals or some parts may only change at different rate?
and one channel can support 100byte/sec
but how many channels are available?
 

WBahn

Joined Mar 31, 2012
30,062
If SUB and CMD are fixed values as shown in your table, can you simply throw them away? If they must remain, then it seems they cannot be combined with anything.
Far from it. Let's say that there are four distinct values of SUB and four distinct values of CMD and they can appear in any combination. That's 16 combinations which can then be encoded into a single hexadecimal character instead of the four currently being used, thus saving the needed three bytes from the transmission stream right there.
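
A minimal sketch of that lookup-table idea. Only 0x06 and 0x49 come from the table in this thread; every other value in the two lists is a placeholder, and the function names are invented for illustration. Both ends of the link would need the same tables.

```python
# Encode a (SUB, CMD) pair as one hex character via a shared lookup table.
# Only 0x06 and 0x49 appear in the thread; the other entries are placeholders.
SUB_VALUES = [0x06, 0x07, 0x08, 0x09]
CMD_VALUES = [0x49, 0x4A, 0x4B, 0x4C]
HEX_DIGITS = "0123456789ABCDEF"


def encode_sub_cmd(sub: int, cmd: int) -> str:
    """Return a single hex character identifying the (sub, cmd) pair."""
    index = SUB_VALUES.index(sub) * len(CMD_VALUES) + CMD_VALUES.index(cmd)
    return HEX_DIGITS[index]


def decode_sub_cmd(ch: str) -> tuple:
    """Recover the (sub, cmd) pair from its hex character."""
    index = int(ch, 16)
    return SUB_VALUES[index // len(CMD_VALUES)], CMD_VALUES[index % len(CMD_VALUES)]


code = encode_sub_cmd(0x06, 0x49)
print(code, decode_sub_cmd(code))
```

Whether this is usable depends on whether other devices on the channel need to see SUB and CMD in their original positions.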
 

Analog Ground

Joined Apr 24, 2019
460
Far from it. Let's say that there are four distinct values of SUB and four distinct values of CMD and they can appear in any combination. That's 16 combinations which can then be encoded into a single hexadecimal character instead of the four currently being used, thus saving the needed three bytes from the transmission stream right there.
The Table in Post #10 has SUB = 0x06 and CMD = 0x49 and not "XX". This could imply these two entries are fixed in value. If this is the case, they would not be combined with other entities as variables. So, if they are fixed, can they simply be removed since the values are known? If they are needed for identification of a subframe, etc., then they cannot be changed in any way.

I guess my question is "Are these fields fixed in value?".


This is the kind of question which might give room for the additional 10% needed.
 

WBahn

Joined Mar 31, 2012
30,062
The Table in Post #10 has SUB = 0x06 and CMD = 0x49 and not "XX". This could imply these two entries are fixed in value. If this is the case, they would not be combined with other entities as variables. So, if they are fixed, can they simply be removed since the values are known? If they are needed for identification of a subframe, etc., then they cannot be changed in any way.

I guess my question is "Are these fields fixed in value?".


This is the kind of question which might give room for the additional 10% needed.
I agree that, if they are truly fixed, they can be eliminated. However, the TS talks about them like they are not static (he talks about being able to combine those two with two others) and are merely an example. I have asked repeatedly for further information about those fields and for some reason the TS won't supply it.

My response was directed at your claim that, "If they must remain, then it seems they cannot be combined with anything."
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
so.... did you consider shifting all nibble values in your string from hexadecimal range '0'-'F' into range of lowercase letters such as 'a'-'p' before compression...? this should maximize what smaz can do for you and maybe instead of +7% change you can get closer to those -46% or so. if that works on other side just reverse process.
My communication channel only accepts '0' - 'F' in char, EOF and SOF. I did try using lower case; it actually increased by more than 7%, but I don't remember the exact number now.

When you say no control over SN1 - 4, what does that mean? Clearly you had the ability to manipulate them otherwise you couldn't have used smaz on them in the first place.
SN1 - 4 are 8-bit numbers (0 - 255). smaz was the first thing I tried, but it didn't work.

If SUB and CMD are fixed values as shown in your table, can you simply throw them away? If they must remain, then it seems they cannot be combined with anything.
There are other packets that I will need to send on the same channel, and also other devices. I have no control over the other devices. The SUB and CMD are for other devices to decode their packets. The fixed values in the table are assigned to me, so I have to use them. However, I can choose other free codes that are not in use yet.

do all parts of the message need to be transmitted at same intervals or some parts may only change at different rate?
and one channel can support 100byte/sec
but how many channels are available?
There is only one channel. There are two bytes of information that I don't need to send all the time, so I may take them off.
 

MrChips

Joined Oct 2, 2009
30,813
Forget compression. There is not enough redundancy in such a short message.
Let's back up for a moment.

064904EC94F2A1CEFFFFAE0E01CE

What are you actually sending?

Is 06 two characters or one byte?
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
I guess my question is "Are these fields fixed in value?".
Just responded in my last reply.

I have asked repeatedly for further information about those fields and for some reason the TS won't supply it.
Sorry WBahn, I got a few ideas from yesterday's (local time here) replies and have been busy trying a few different things. Some of the information you asked for is easy to answer, like SN1 - 4; some is not, e.g. SUB and CMD. Those are assigned to me, but I do have some flexibility to change them, and to change them I need to look into other parts of the system. So I don't have an answer for those yet.
 

WBahn

Joined Mar 31, 2012
30,062
What do you mean that they are assigned to you? Are others monitoring the channel and need to know which strings are meant for them and which are not? If so, then how will they know which are which if your messages are manipulated so as to compress them?
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
Forget compression. There is not enough redundancy in such a short message.
Let's back up for a moment.

064904EC94F2A1CEFFFFAE0E01CE

What are you actually sending?

Is 06 two characters or one byte?
These are all ASCII coded hex, eg '0' - 'F'.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
What do you mean that they are assigned to you? Are others monitoring the channel and need to know which strings are meant for them and which are not? If so, then how will they know which are which if your messages are manipulated so as to compress them?
The first byte is the sub group of the device; there are other devices on the same channel as well. I forgot that myself when I first tried using the smaz lib.

So the SUB needs to stay.

I think I will combine 2 - 3 raw bytes using a lookup table and take away one or two raw bytes that don't need to be sent that often. That should solve my problem.
 

djsfantasi

Joined Apr 11, 2010
9,163
The communication channel only accepts '0' - 'F' in char, SOF and EOF.
‘0’ to ‘9’ in char is 0x30 to 0x39 in hex. Those are the actual values stored in a byte. ‘A’ through ‘F’ in char is 0x41 to 0x46. I didn’t bother looking up SOF and EOF, but hopefully you get the idea.
 

Thread Starter

bug13

Joined Feb 13, 2012
2,002
‘0’ to ‘9’ in char is 0x30 to 0x39 in hex. That is the actual value stored in a byte. ‘A’ through ‘F’ in char is 0x41 to 0x46. I didn’t bother looking up SOF and EOF, but hopefully you get the idea.
Yes, I got the idea :)
What I am saying is, to send 0x09, I need to send it in two bytes [0x30, 0x39]. If I send one byte [0x09], nothing will come out from the other end of the communication link. And I have to use this communication link.
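
A minimal sketch of that encoding, assuming uppercase digits and leaving out the SOF/EOF framing, whose values are not given in this thread. The function names are invented for illustration.

```python
# ASCII-coded hex: every raw byte is sent as two characters from '0'-'F'.
# SOF/EOF framing is omitted because its values are not given in the thread.
def to_ascii_hex(raw: bytes) -> bytes:
    """Encode raw bytes as uppercase ASCII hex, two characters per byte."""
    return raw.hex().upper().encode("ascii")


def from_ascii_hex(wire: bytes) -> bytes:
    """Decode ASCII hex characters back into raw bytes."""
    return bytes.fromhex(wire.decode("ascii"))


wire = to_ascii_hex(b"\x09")
print(list(wire))            # the byte values that actually go on the link
print(from_ascii_hex(wire))  # the raw byte recovered on the other end
```

This is why every raw byte saved by repacking fields removes two characters from the transmitted string.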
 

djsfantasi

Joined Apr 11, 2010
9,163
Yes, I got the idea :)
What I am saying is, to send 0x09, I need to send it in two bytes [0x30, 0x39]. If I send one byte [0x09], nothing will come out from the other end of the communication link. And I have to use this communication link.
Ok! I guess you have got it. I was just clarifying so we all were on the same page. Hope you weren’t insulted.
 