This Community site is new! Please help us build a community around the JNIOR.
Sign up, and help share your knowledge. Please sign-up even if you do not plan to post as a sign of support.
If there is evidence of a demand we will continue to develop the content here.

DEFLATE Compression Algorithm

INTEG's own Real-time Operating System. No 3rd party code here. 100% Supportable.
bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 11:07 am

The game next is to combine pairs of leaves into nodes, And then pairs of leaves and nodes into other nodes. This procedure is repeated until there is only one at the head of the tree. Generally one is directed to combine the lowest two weighted leaves or nodes into a single node of combined weight.

Each time a node is constructed that defines a bit 0/1 for left/right for the one or two members the node contains. By working from the lowest weighted or least frequent leaves and then nodes one ends up building longer codes for that than the higher frequency values which are not touch right away.

The following procedure generates a tree (but not the kind we want just yet) using just such a procedure. The lowest two are combined and the list is resorted. We dump and repeat.

        // generate tree
        while (cnt > 1) {
            
            // take lowest weight nodes and create new
            int left = order[cnt - 2];
            int right = order[cnt - 1];
            Node nd1 = nodes.get(left);
            Node nd2 = nodes.get(right);
            nodes.add(new Node(left, right, nd1.weight + nd2.weight));
            order[cnt - 2] = nodes.size() - 1;
            cnt--;

            // sort
            for (int n = 0; n < cnt - 1; n++) {
                nd1 = nodes.get(order[n]);
                nd2 = nodes.get(order[n + 1]);

                if (nd1.weight < nd2.weight) {
                    int k = order[n];
                    order[n] = order[n + 1];
                    order[n + 1] = k;
                    n -= 2;
                    if (n < -1)
                        n = -1;
                }
            }
            
            //dump
            System.out.println("");
            for (int n = 0; n < cnt; n++) {
                Node nd = nodes.get(order[n]);
                if (nd.left == 0 && nd.right == 0)
                    System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
                else
                    System.out.printf("%d %d-%d %d\n", order[n], nd.left, nd.right, nd.weight);
            }
        }

You can follow the procedure in the output although the resulting tree is not displayed. I might create a way to display the tree but I haven't gone that far yet.

Code: Select all

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

3 0x067 3
5 0x06f 3
1 0x020 2
9 7-8 2
2 0x065 1
4 0x068 1
6 0x070 1

3 0x067 3
5 0x06f 3
1 0x020 2
9 7-8 2
10 4-6 2
2 0x065 1

3 0x067 3
5 0x06f 3
11 10-2 3
1 0x020 2
9 7-8 2

12 1-9 4
3 0x067 3
5 0x06f 3
11 10-2 3

13 5-11 6
12 1-9 4
3 0x067 3

14 12-3 7
13 5-11 6

15 14-13 13

Processing 0.865 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />
Here the new nodes display node reference indexes as left-right instead of a value.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 11:27 am

Now with a tree built I create a recursive routine to walk the tree and assign leaves a code length and binary code. I then collect these leaves into a node array that would allow me to efficiently translate the raw data. Here that array is dumped. You will see the new code later. Here's the table from the prior procedure.
0x020 ' ' count=2 length=3 code 000
0x065 'e' count=1 length=3 code 111
0x067 'g' count=3 length=2 code 01
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 10
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 0010
0x073 's' count=1 length=4 code 0011
if we manually encode the phrase we see that it does in fact compress to just 37 bits.
g  o  ' ' g  o  ' ' g  o  p    h    e   r    s
01 10 000 01 10 000 01 10 1101 1100 111 0010 0011
total 37 bits
But this table does not meet the DEFLATE requirements and cannot fit the shorthand.

By the way here are two other tables from the Duke University page. Both of these are different yet again from our table but each gets the job done. None of these meet the DEFLATE requirement.


2018-01-12_11-22-13.png
2018-01-12_11-22-13.png (37.9 KiB) Viewed 410 times
2018-01-12_11-24-58.png
2018-01-12_11-24-58.png (37.83 KiB) Viewed 410 times

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 11:45 am

Now that third tree is close. But here is the tree that we need to learn to generate from this data.

gophers.png
gophers.png (18.65 KiB) Viewed 410 times

Why? Because the two additional requirements are met.

First the code lengths (depth of the tree) increase from left to right.

Second, for the same code length (depth) the values increase from left to right (lexicographically).

And with this tree we can apply the required shorthand to properly to fit the DEFLATE format. The details of that shorthand I will get into.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 1:19 pm

Let's revise the procedure to consider pairs of nodes from right to left.

Here we first determine the combined weight of the rightmost pair. We will combine that pair and any prior pair whose combined weight is less than or equal to that. We repeat this until we have only one node that being the head of the tree.

        // generate tree
        while (cnt > 1) {
            
            // determine to combined weight of the lowest two nodes
            int left = order[cnt - 2];
            int right = order[cnt - 1];
            Node nd1 = nodes.get(left);
            Node nd2 = nodes.get(right);
            int weight = nd1.weight + nd2.weight;
            
            // Now combine node pairs equal to or less than this weight from
            //  right to left.
            int pos = cnt;
            while (pos >= 2) {
                
                // Get the combined weight of the pair preceeding the pointer. We will
                //  combine the psir if its weight is less than or equal to that of 
                //  the rightmost (least) pair. We stop if not.
                left = order[pos - 2];
                right = order[pos - 1];
                nd1 = nodes.get(left);
                nd2 = nodes.get(right);
                int w = nd1.weight + nd2.weight;
                if (w > weight)
                    break;
                
                // Combine the pair and reduce teh order array.
                nodes.add(new Node(left, right, w));
                order[pos - 2] = nodes.size() - 1;
                for (int n = pos; n < cnt; n++)
                    order[pos - 1] = order[n];
                cnt--;
                
                // onto the the next prior pair
                pos -= 2;
            }
            
            //dump
            System.out.println("");
            for (int n = 0; n < cnt; n++) {
                Node nd = nodes.get(order[n]);
                if (nd.left == 0 && nd.right == 0)
                    System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
                else
                    System.out.printf("%d %d-%d %d\n", order[n], nd.left, nd.right, nd.weight);
            }
        }

Now when this is executed we obtain a different tree. This one actually is the one we seek.
0x020 ' ' count=2 length=3 code 100
0x065 'e' count=1 length=3 code 101
0x067 'g' count=3 length=2 code 00
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 01
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 1110
0x073 's' count=1 length=4 code 1111

Code: Select all

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
10 4-6 2
9 7-8 2

3 0x067 3
5 0x06f 3
12 1-2 3
11 10-9 4

14 3-5 6
13 12-11 7

15 14-13 13
0x020 ' ' count=2 length=3 code 100
0x065 'e' count=1 length=3 code 101
0x067 'g' count=3 length=2 code 00
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 01
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 1110
0x073 's' count=1 length=4 code 1111

Processing 0.840 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev /> 
I may not be ready to claim victory here but this appears to be very promising. Perhaps we should return to a more complicated situation.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 1:35 pm

Alright so we re-enable the LZ77 compression and remove much of the dump output. When we run this on jniorboot.log we get the following table.

Code: Select all

bruce_dev /> jtest jniorboot.log
0x004 '.' count=1 length=11 code 10011110110
0x00a '.' count=9 length=6 code 110000
0x00b '.' count=13 length=6 code 110100
0x00c '.' count=4 length=8 code 10110011
0x00d '.' count=5 length=6 code 110010
0x00e '.' count=8 length=7 code 1010001
0x00f '.' count=3 length=8 code 10011101
0x010 '.' count=4 length=8 code 10111100
0x011 '.' count=8 length=7 code 1001100
0x012 '.' count=2 length=10 code 1001111000
0x013 '.' count=2 length=10 code 1001111001
0x020 ' ' count=41 length=2 code 00
0x028 '(' count=1 length=11 code 10011110111
0x029 ')' count=1 length=11 code 10110111110
0x02c ',' count=6 length=6 code 110110
0x02d '-' count=5 length=6 code 110011
0x02e '.' count=7 length=8 code 10110001
0x02f '/' count=3 length=9 code 100111110
0x030 '0' count=16 length=5 code 11111
0x031 '1' count=12 length=7 code 1011100
0x032 '2' count=12 length=7 code 1011101
0x033 '3' count=6 length=6 code 110111
0x034 '4' count=10 length=7 code 1001001
0x035 '5' count=7 length=8 code 10110100
0x036 '6' count=6 length=7 code 1001010
0x037 '7' count=4 length=8 code 10111101
0x038 '8' count=9 length=6 code 110001
0x039 '9' count=9 length=6 code 100000
0x03a ':' count=10 length=6 code 111000
0x041 'A' count=4 length=7 code 1000100
0x042 'B' count=2 length=9 code 111010000
0x043 'C' count=3 length=9 code 100111111
0x044 'D' count=1 length=11 code 10110111111
0x045 'E' count=2 length=9 code 111010001
0x046 'F' count=1 length=10 code 1110110110
0x047 'G' count=3 length=8 code 11101110
0x048 'H' count=1 length=10 code 1110110111
0x049 'I' count=3 length=8 code 11101111
0x04a 'J' count=1 length=10 code 1011111110
0x04d 'M' count=1 length=10 code 1011111111
0x04e 'N' count=4 length=7 code 1000101
0x04f 'O' count=3 length=8 code 11101010
0x050 'P' count=4 length=9 code 101101100
0x052 'R' count=3 length=8 code 11101011
0x053 'S' count=4 length=9 code 101101101
0x054 'T' count=3 length=7 code 1000110
0x055 'U' count=1 length=10 code 1110100110
0x057 'W' count=1 length=10 code 1110100111
0x05b '[' count=1 length=10 code 1110110100
0x05d ']' count=1 length=10 code 1110110101
0x061 'a' count=8 length=7 code 1001101
0x062 'b' count=5 length=7 code 1010010
0x063 'c' count=7 length=8 code 10110101
0x064 'd' count=9 length=6 code 100001
0x065 'e' count=29 length=4 code 0110
0x066 'f' count=1 length=10 code 1110100100
0x067 'g' count=2 length=10 code 1011011100
0x068 'h' count=2 length=10 code 1011011101
0x069 'i' count=14 length=6 code 101011
0x06b 'k' count=2 length=9 code 101111100
0x06c 'l' count=10 length=6 code 111001
0x06d 'm' count=6 length=7 code 1001011
0x06e 'n' count=9 length=7 code 1010000
0x06f 'o' count=17 length=4 code 0111
0x070 'p' count=5 length=7 code 1010011
0x072 'r' count=17 length=5 code 11110
0x073 's' count=11 length=7 code 1001000
0x074 't' count=15 length=6 code 101010
0x075 'u' count=8 length=8 code 10110000
0x076 'v' count=4 length=8 code 10011100
0x077 'w' count=2 length=9 code 101111101
0x079 'y' count=5 length=8 code 10110010
0x07a 'z' count=1 length=10 code 1110100101
0x100 '.' count=1 length=10 code 1011111100
0x101 '.' count=30 length=3 code 010

Processing 10.284 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev />

Well okay. Just about all you can say is that it does appear that the stuff with the higher counts (frequency) does appear to use the shortest code lengths. Another good clue is the fact that the first alphabet entry that uses the smallest code is represented by a sequence of all zeroes.

I suppose now we get into what I have been calling the shorthand storage format for Huffman table. If this table can be so represented and the table reconstructed from that then we are good to go.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 2:32 pm

Huffman table "Shorthand"

While "shorthand" is my term and no one else's that I've seen, it still refers to efficiently conveying the table. To start I have defined an entry for 285 possible codes each with a count and a binary code. Even with some cute integer packing this is still a lot of bytes. Having to pass the table with the compressed file painfully can cut into the benefit of the compression.

It turns out that if the Huffman table conforms to the two special rules it can be reconstructed from only knowing the code length for each of the alphabet members. So we don't need to include the actual code.

Once we have the code lengths for each alphabet that gets packed further. It's a bit crazy. The array of 285 code lengths contains a lot of repetition. This is packed using a form and run-length encoding where sequences of the same length are defined by the count (run length). Then that data is again (ugh) run through a Huffman encoding which results in just 19 bit lengths. Those are stored in a weird order which is intended to keep those alphabet members whose bit lengths are likely to be 0 near the end as trailing zeroes need not be included. So the entire Huffman table ends up being conveyed in just a handful of bytes. I guess people were really creative back then.

The procedure for reconstructing the Huffman table from the code lengths first requires a count of codes for each length. Let me add that to our table output. Here is the additional output from the execution.
Length=2 Count 1
Length=3 Count 1
Length=4 Count 2
Length=5 Count 2
Length=6 Count 13
Length=7 Count 15
Length=8 Count 14
Length=9 Count 8
Length=10 Count 15
Length=11 Count 4
From this table we can calculate the first binary code assigned to that code length group. Each alphabet member using that code length is then assigned an incremental binary value from that. I can add that calculation to the this table.
Length=2 Count 1 Start Code 00
Length=3 Count 1 Start Code 010
Length=4 Count 2 Start Code 0110
Length=5 Count 2 Start Code 10000
Length=6 Count 13 Start Code 100100
Length=7 Count 15 Start Code 1100010
Length=8 Count 14 Start Code 11100010
Length=9 Count 8 Start Code 111100000
Length=10 Count 15 Start Code 1111010000
Length=11 Count 4 Start Code 11110111110
Okay so I can see that the Huffman table generated DOES NOT conform. As an exercise you can see for yourself. So I wonder where error might be. Hmm...

Code: Select all

bruce_dev /> jtest jniorboot.log
0x004 '.' count=1 length=11 code 10011110110
0x00a '.' count=9 length=6 code 110000
0x00b '.' count=13 length=6 code 110100
0x00c '.' count=4 length=8 code 10110011
0x00d '.' count=5 length=6 code 110010
0x00e '.' count=8 length=7 code 1010001
0x00f '.' count=3 length=8 code 10011101
0x010 '.' count=4 length=8 code 10111100
0x011 '.' count=8 length=7 code 1001100
0x012 '.' count=2 length=10 code 1001111000
0x013 '.' count=2 length=10 code 1001111001
0x020 ' ' count=41 length=2 code 00
0x028 '(' count=1 length=11 code 10011110111
0x029 ')' count=1 length=11 code 10110111110
0x02c ',' count=6 length=6 code 110110
0x02d '-' count=5 length=6 code 110011
0x02e '.' count=7 length=8 code 10110001
0x02f '/' count=3 length=9 code 100111110
0x030 '0' count=16 length=5 code 11111
0x031 '1' count=12 length=7 code 1011100
0x032 '2' count=12 length=7 code 1011101
0x033 '3' count=6 length=6 code 110111
0x034 '4' count=10 length=7 code 1001001
0x035 '5' count=7 length=8 code 10110100
0x036 '6' count=6 length=7 code 1001010
0x037 '7' count=4 length=8 code 10111101
0x038 '8' count=9 length=6 code 110001
0x039 '9' count=9 length=6 code 100000
0x03a ':' count=10 length=6 code 111000
0x041 'A' count=4 length=7 code 1000100
0x042 'B' count=2 length=9 code 111010000
0x043 'C' count=3 length=9 code 100111111
0x044 'D' count=1 length=11 code 10110111111
0x045 'E' count=2 length=9 code 111010001
0x046 'F' count=1 length=10 code 1110110110
0x047 'G' count=3 length=8 code 11101110
0x048 'H' count=1 length=10 code 1110110111
0x049 'I' count=3 length=8 code 11101111
0x04a 'J' count=1 length=10 code 1011111110
0x04d 'M' count=1 length=10 code 1011111111
0x04e 'N' count=4 length=7 code 1000101
0x04f 'O' count=3 length=8 code 11101010
0x050 'P' count=4 length=9 code 101101100
0x052 'R' count=3 length=8 code 11101011
0x053 'S' count=4 length=9 code 101101101
0x054 'T' count=3 length=7 code 1000110
0x055 'U' count=1 length=10 code 1110100110
0x057 'W' count=1 length=10 code 1110100111
0x05b '[' count=1 length=10 code 1110110100
0x05d ']' count=1 length=10 code 1110110101
0x061 'a' count=8 length=7 code 1001101
0x062 'b' count=5 length=7 code 1010010
0x063 'c' count=7 length=8 code 10110101
0x064 'd' count=9 length=6 code 100001
0x065 'e' count=29 length=4 code 0110
0x066 'f' count=1 length=10 code 1110100100
0x067 'g' count=2 length=10 code 1011011100
0x068 'h' count=2 length=10 code 1011011101
0x069 'i' count=14 length=6 code 101011
0x06b 'k' count=2 length=9 code 101111100
0x06c 'l' count=10 length=6 code 111001
0x06d 'm' count=6 length=7 code 1001011
0x06e 'n' count=9 length=7 code 1010000
0x06f 'o' count=17 length=4 code 0111
0x070 'p' count=5 length=7 code 1010011
0x072 'r' count=17 length=5 code 11110
0x073 's' count=11 length=7 code 1001000
0x074 't' count=15 length=6 code 101010
0x075 'u' count=8 length=8 code 10110000
0x076 'v' count=4 length=8 code 10011100
0x077 'w' count=2 length=9 code 101111101
0x079 'y' count=5 length=8 code 10110010
0x07a 'z' count=1 length=10 code 1110100101
0x100 '.' count=1 length=10 code 1011111100
0x101 '.' count=30 length=3 code 010

Length=2 Count 1 Start Code 00
Length=3 Count 1 Start Code 010
Length=4 Count 2 Start Code 0110
Length=5 Count 2 Start Code 10000
Length=6 Count 13 Start Code 100100
Length=7 Count 15 Start Code 1100010
Length=8 Count 14 Start Code 11100010
Length=9 Count 8 Start Code 111100000
Length=10 Count 15 Start Code 1111010000
Length=11 Count 4 Start Code 11110111110

Processing 10.395 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Sat Jan 13, 2018 12:52 pm

Yeah I had at least one glitch but trying to generate the precise tree appropriate for the DEFLATE "shorthand" still eludes me. The search engines these days are much less effective for locating useful technical information than for finding ways to separate me from my money. It seems easier to reinvent the wheel and devise my own algorithm even though I know that a simple procedure is likely documented in numerous pages on the net.

It strikes me that all we need to do is determine the optimum bit length for each of the used alphabet members. It is almost irrelevant as to where in a tree a particular member ends up. Once we have the proper bit length a tree meeting the DEFLATE requirements can be directly created.

Perhaps the simple procedure for generating a valid Huffman tree ignoring the DEFLATE requirements can be employed and without actually building a tree structure. Note that when two leaves are combined you are simply assigning another bit to them regardless of which gets '0' and which gets '1'. The bit length is incremented for the two leaves as it is combined into a node. In fact when you combine two nodes you need only increment the bit length (depth) for all of the members below it. So in creating a node I need only keep track of all of the leaves below it. A simple linked list suffices.

Such an implementation need not even retain intermediate nodes. You just need to maintain the node membership list. You need that so you can advance the bit count for all of the leaves below as the node is combined.

Maybe you follow me to this point or maybe not. I'll go ahead an try an implementation.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Sat Jan 13, 2018 5:44 pm

Okay this new approach is golden! And its fast! Oh and I don't need to build any damn tree!

    static void bufflush(BufferedOutputStream outfile) throws Throwable {
        
        // Determine frequecies by counting each occurrence of a byte value. 
        //  Here we force the end-of-block code that we know we will use.
        int[] sym_cnt = new int[286];
//        sym_cnt[0x100] = 1;
        
        for (int n = 0; n < BUFPTR; n++) {
            
            // Get the byte value. This may be escaped with 0xff.
            int ch = BUFR[n] & 0xff;
            
            // not escaped
            if (ch != 0xff)
                sym_cnt[ch]++;
            
            // escaped
            else {
                
                // Get next byte.
                ch = BUFR[++n] & 0xff;
                
                // may just be 0xff itself
                if (ch == 0xff)
                    sym_cnt[0xff]++;
                
                // length-distance pair
                else {
                    
                    // obtain balance of length and distance values
                    int len = (ch << 8) + (BUFR[++n] & 0xff);
                    int dist = ((BUFR[++n] & 0xff) << 8) + (BUFR[++n] & 0xff);
                    
                    // determine length code to use (257..285)
                    for (int k = 0; k < blen.length; k++) {
                        if (len < blen[k]) {
                            sym_cnt[257 + k]++;
                            break;
                        }
                    }
                    
                    // determine distance code to use (0..29)
                    for (int k = 0; k < bdist.length; k++) {
                        if (dist < bdist[k]) {
                            sym_cnt[k]++;
                            break;
                        }
                    }
                    
                }
            }
        }
        
        // Create node list containing symbols in our alphabet that are found in the
        //  data. This will be sorted and used to assign bit lengths. Note list pointers
        //  are stored +1 to reserve 0 as a list terminator.
        int[] nodes = new int[286];
        int[] cnts = new int[286];
        int nodecnt = 0;
        for (int n = 0; n < 286; n++) {
            if (sym_cnt[n] > 0) {
                nodes[nodecnt] = n + 1;
                cnts[nodecnt] = sym_cnt[n];
                nodecnt++;                
            }
        }
        
        // Determine optimal bit lengths. Here we initialize a bit length array and a
        //  node membership list pointer array. These will be used as we generate
        //  the detail required for Huffman coding.
        int[] sym_len = new int[286];
        int[] sym_ptr = new int[286];
        
        // Perform Huffman optimization. This loops until we've folded all the leaves
        //  into a single head node.
        while (nodecnt > 1) {
            
            // The leaves are sorted by decreasing frequency (counts).
            for (int n = 0; n < nodecnt - 1; n++) {
                if (cnts[n] < cnts[n + 1]) {
                    int k = nodes[n];
                    nodes[n] = nodes[n + 1];
                    nodes[n + 1] = k;
                    k = cnts[n];
                    cnts[n] = cnts[n + 1];
                    cnts[n + 1] = k;
                    if (n > 0)
                        n -= 2;
                }
            }

            // The last two leaves/nodes have the lowest frequencies and are to 
            //  be combined. Here we increment the bit lengths for each and 
            //  merge leaves into a single list of node members.
            int ptr = nodes[nodecnt - 2];
            int add_ptr = nodes[nodecnt - 1];
            while (ptr > 0) {
                sym_len[ptr - 1]++;
                int p = sym_ptr[ptr - 1];
                if (p == 0 && add_ptr > 0) {
                    sym_ptr[ptr - 1] = add_ptr;
                    p = add_ptr;
                    add_ptr = 0;
                }
                ptr = p;
            }
            
            // Combine the last two nodes by adding their frequencies and dropping 
            //  the last.
            cnts[nodecnt - 2] += cnts[nodecnt - 1];
            nodecnt--;
            
        }
        
        // dump nonzero bit lengths
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0)
                System.out.printf("0x%03x '%c' count=%d optimal bits %d\n", n,
                        n >= 0x20 && n < 0x7f ? n : '.', sym_cnt[n], sym_len[n]);
        }
        
        outfile.write(BUFR, 0, BUFPTR);
        BUFPTR = 0;        
    }

Running this on the "go go gophers " test string again with LZ77 disabled yields the desired results.
bruce_dev /> jtest flash/gogo.dat
0x020 ' ' count=2 optimal bits 3
0x065 'e' count=1 optimal bits 3
0x067 'g' count=3 optimal bits 2
0x068 'h' count=1 optimal bits 4
0x06f 'o' count=3 optimal bits 2
0x070 'p' count=1 optimal bits 4
0x072 'r' count=1 optimal bits 4
0x073 's' count=1 optimal bits 4

Processing 0.540 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />

Now I can use this to generate the DEFLATE compatible Huffman table.

If you multiply the frequency (count) times the bit length and add them for this example you get 37 bits which we know is the optimal for this example.

The code with comments above should be reasonably understandable. If there are any questions I can describe what is going on. But basically the routine that counts occurrences of each alphabet symbol is as it was before. Unfortunately that is complicated a bit as I have to process the length-distance pointers to determine the encoding for the tally.

Next we create a list of leaves so we can combine the two with the lowest frequency of occurrence. Here's where we simply increment the bit lengths for node members. The leaf list distills down as it would for the traditional Huffman process.

Next I can reinstate the LZ77 compressor and run this on real data. Then we can then take the process further.

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Sun Jan 14, 2018 2:44 pm

We can now use the procedure detailed in the DEFLATE specification to convert assigned bit lengths into binary codes for compression. We don't need to build a tree.

First we tally the number of symbols in our alphabet the use each bit length.

        // count the occurrence of each bit length
        int[] bits = new int[19];
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0)
                bits[sym_len[n]]++;
        }

With this we can calculate the starting binary code for each bit length. Basically for each bit length we reserve N codes and use the next as a prefix for subsequent bit lengths.

        // determine starting bit code for each bit length
        int[] start = new int[19];
        int c = 0;
        for (int n = 0; n < 19; n++) {
            start[n] = c;
            c = (c + bits[n]) << 1;
        }

This gives us the correct first codes as we see here.
bruce_dev /> jtest flash/gogo.dat
bit length 2 count 2 first code 00
bit length 3 count 2 first code 100
bit length 4 count 4 first code 1100
Now we use these starting codes in assigning the binary codes to each symbol.

        // assign codes to used alphabet symbols
        int[] code = new int[286];
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0) 
                code[n] = start[sym_len[n]]++;
        }

This results are displayed by the attached program.

Code: Select all

bruce_dev /> jtest flash/gogo.dat
bit length 2 count 2 first code 00
bit length 3 count 2 first code 100
bit length 4 count 4 first code 1100

0x020 ' ' count=2 optimal bits 3 100
0x065 'e' count=1 optimal bits 3 101
0x067 'g' count=3 optimal bits 2 00
0x068 'h' count=1 optimal bits 4 1100
0x06f 'o' count=3 optimal bits 2 01
0x070 'p' count=1 optimal bits 4 1101
0x072 'r' count=1 optimal bits 4 1110
0x073 's' count=1 optimal bits 4 1111

Processing 0.654 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />
This gives us all of the codes that we need to compress our data. In this case it is for the example string "go go gophers". Happily we did not need to build any tree structure. And, this Huffman coding is compatible with the DEFLATE specification. We can move forward with the shorthand.
Attachments
JTest.java
(16.52 KiB) Downloaded 19 times

bscloutier
Posts: 401
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Sun Jan 14, 2018 2:58 pm

Curious? Here's what I get using jniorboot.log. The content of that file has changed by the way as I have rebooted the JNIOR during the course of this topic. Here the LZ77 compression has also been re-enabled. the program however does not yet generate the bit stream compressed with these Huffman codes.

Code: Select all

bruce_dev /> jtest jniorboot.log
bit length 4 count 3 first code 0000
bit length 5 count 8 first code 00110
bit length 6 count 20 first code 011100
bit length 7 count 22 first code 1100000
bit length 8 count 11 first code 11101100
bit length 9 count 18 first code 111101110

0x004 '.' count=1 optimal bits 9 111101110
0x00a '.' count=8 optimal bits 6 011100
0x00b '.' count=12 optimal bits 5 00110
0x00c '.' count=5 optimal bits 7 1100000
0x00d '.' count=5 optimal bits 7 1100001
0x00e '.' count=6 optimal bits 6 011101
0x00f '.' count=3 optimal bits 8 11101100
0x010 '.' count=5 optimal bits 7 1100010
0x011 '.' count=7 optimal bits 6 011110
0x012 '.' count=4 optimal bits 7 1100011
0x013 '.' count=4 optimal bits 7 1100100
0x020 ' ' count=41 optimal bits 4 0000
0x028 '(' count=1 optimal bits 9 111101111
0x029 ')' count=1 optimal bits 9 111110000
0x02c ',' count=6 optimal bits 6 011111
0x02d '-' count=5 optimal bits 7 1100101
0x02e '.' count=7 optimal bits 6 100000
0x02f '/' count=3 optimal bits 8 11101101
0x030 '0' count=18 optimal bits 5 00111
0x031 '1' count=15 optimal bits 5 01000
0x032 '2' count=9 optimal bits 6 100001
0x033 '3' count=8 optimal bits 6 100010
0x034 '4' count=9 optimal bits 6 100011
0x035 '5' count=5 optimal bits 7 1100110
0x036 '6' count=4 optimal bits 7 1100111
0x037 '7' count=5 optimal bits 7 1101000
0x038 '8' count=7 optimal bits 6 100100
0x039 '9' count=8 optimal bits 6 100101
0x03a ':' count=9 optimal bits 6 100110
0x041 'A' count=4 optimal bits 7 1101001
0x042 'B' count=2 optimal bits 8 11101110
0x043 'C' count=2 optimal bits 8 11101111
0x044 'D' count=1 optimal bits 9 111110001
0x045 'E' count=2 optimal bits 8 11110000
0x046 'F' count=1 optimal bits 9 111110010
0x047 'G' count=3 optimal bits 7 1101010
0x048 'H' count=1 optimal bits 9 111110011
0x049 'I' count=2 optimal bits 8 11110001
0x04a 'J' count=1 optimal bits 9 111110100
0x04d 'M' count=1 optimal bits 9 111110101
0x04e 'N' count=4 optimal bits 7 1101011
0x04f 'O' count=3 optimal bits 7 1101100
0x050 'P' count=5 optimal bits 7 1101101
0x052 'R' count=3 optimal bits 7 1101110
0x053 'S' count=4 optimal bits 7 1101111
0x054 'T' count=3 optimal bits 7 1110000
0x055 'U' count=1 optimal bits 9 111110110
0x057 'W' count=1 optimal bits 9 111110111
0x05b '[' count=1 optimal bits 9 111111000
0x05d ']' count=1 optimal bits 9 111111001
0x061 'a' count=8 optimal bits 6 100111
0x062 'b' count=5 optimal bits 7 1110001
0x063 'c' count=7 optimal bits 6 101000
0x064 'd' count=9 optimal bits 6 101001
0x065 'e' count=29 optimal bits 4 0001
0x066 'f' count=1 optimal bits 9 111111010
0x067 'g' count=2 optimal bits 8 11110010
0x068 'h' count=2 optimal bits 8 11110011
0x069 'i' count=14 optimal bits 5 01001
0x06b 'k' count=2 optimal bits 8 11110100
0x06c 'l' count=10 optimal bits 6 101010
0x06d 'm' count=6 optimal bits 6 101011
0x06e 'n' count=9 optimal bits 6 101100
0x06f 'o' count=17 optimal bits 5 01010
0x070 'p' count=5 optimal bits 7 1110010
0x072 'r' count=17 optimal bits 5 01011
0x073 's' count=11 optimal bits 6 101101
0x074 't' count=15 optimal bits 5 01100
0x075 'u' count=8 optimal bits 6 101110
0x076 'v' count=4 optimal bits 7 1110011
0x077 'w' count=2 optimal bits 8 11110101
0x079 'y' count=5 optimal bits 7 1110100
0x07a 'z' count=1 optimal bits 9 111111011
0x100 '.' count=1 optimal bits 9 111111100
0x101 '.' count=28 optimal bits 4 0010
0x102 '.' count=6 optimal bits 6 101111
0x103 '.' count=2 optimal bits 8 11110110
0x105 '.' count=1 optimal bits 9 111111101
0x10d '.' count=13 optimal bits 5 01101
0x10e '.' count=4 optimal bits 7 1110101
0x10f '.' count=1 optimal bits 9 111111110
0x110 '.' count=1 optimal bits 9 111111111

Processing 9.181 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

Post Reply