
DEFLATE Compression Algorithm

INTEG's own Real-time Operating System. No 3rd party code here. 100% Supportable.
bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Thu Jan 04, 2018 4:24 pm

Sticking in a couple of placeholder bytes for the length and distance codes, this is representative of the pre-Huffman coding compression ratio for this file.
bruce_dev /> jtest jniorboot.log
Processing 79.124 seconds.
Source 998 bytes.
Result 533 bytes.
Ratio 46.59%

bruce_dev />
Oh it'll be fast in C and in the JANOS kernel. Okay... Huffman.

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 05, 2018 11:21 am

So before getting too deep into generating the Huffman coding with dynamic tables, I figured that it would make sense to write a quick decompressor for my interim LZ77 compression as a check. I modified the compressor to output length and distance codes as shorts, using a 0xff prefix which I then escaped. I will later be able to digest this stream in performing the Huffman coding. The decompressor takes outfile.dat and generates the decompressed newfile.dat.

Well, after compressing and then decompressing, the content of newfile.dat resembled jniorboot.log very closely, but there were a few variances. First, I found the glitch in the step that eliminates any overlap in matched sequences (I shouldn't have modified the distance when also modifying CURPTR). Then I had to address the boundary conditions at the end of the file in order to properly process the entire file (I ended up a couple of bytes short initially). With that we achieved success.
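A minimal sketch of a decompressor for this interim escaped format may help. It assumes the layout the hex dump suggests: a 0xff prefix followed by a two-byte length and a two-byte distance (big-endian), with a literal 0xff byte escaped as 0xff 0xff. The names are hypothetical, not the actual JTest2 code.

```java
import java.util.Arrays;

public class LzDecode {

    // Decode the interim escaped LZ77 stream described above.
    public static byte[] decompress(byte[] in) {
        byte[] out = new byte[64];
        int n = 0;                          // bytes written so far
        int i = 0;
        while (i < in.length) {
            int ch = in[i++] & 0xff;
            int len = 1, src = -1;
            if (ch == 0xff) {
                int next = in[i] & 0xff;
                if (next == 0xff) {
                    i++;                    // escaped literal 0xff
                } else {
                    len = ((in[i] & 0xff) << 8) | (in[i + 1] & 0xff);
                    int dist = ((in[i + 2] & 0xff) << 8) | (in[i + 3] & 0xff);
                    i += 4;
                    src = n - dist;         // start of the back-reference
                }
            }
            while (n + len > out.length)    // grow the output buffer as needed
                out = Arrays.copyOf(out, out.length * 2);
            if (src < 0)
                out[n++] = (byte) ch;       // single raw byte
            else
                for (int k = 0; k < len; k++)
                    out[n++] = out[src + k];    // byte-by-byte copy handles overlap
        }
        return Arrays.copyOf(out, n);
    }
}
```

Copying the back-reference byte by byte is what makes overlapping matches (distance less than length) come out right, since each copied byte becomes available to the bytes that follow it.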

You can see here how we can use MANIFEST to verify file size and content. Note that the MD5 hashes are identical.
bruce_dev /> jtest jniorboot.log
Processing 69.323 seconds.
Source 950 bytes.
Result 671 bytes.
Ratio 1.42:1

bruce_dev /> jtest2

bruce_dev /> manifest jniorboot.log
JNIOR Manifest      Fri Jan 05 11:02:57 EST 2018
  Size                  MD5                  File Specification
 950      dc425a0283e22944b463eeab9e625adb  [Modified] /jniorboot.log
End of Manifest (1 files listed)

bruce_dev /> manifest newfile.dat  
JNIOR Manifest      Fri Jan 05 11:02:59 EST 2018
  Size                  MD5                  File Specification
 950      dc425a0283e22944b463eeab9e625adb  [New] /newfile.dat
End of Manifest (1 files listed)

bruce_dev />
Here is the original content and the resulting compressed format that I have used in bread-boarding this.


bruce_dev /> cat jniorboot.log
01/05/18 07:39:52.913, -- Model 410 v1.6.3 - JANOS Series 4
01/05/18 07:39:52.960, Copyright (c) 2012-2018 INTEG Process Group, Inc., Gibsonia PA USA
01/05/18 07:39:52.980, JANOS written and developed by Bruce Cloutier
01/05/18 07:39:52.999, Serial Number: 614070500
01/05/18 07:39:53.018, File System mounted
01/05/18 07:39:53.039, Registry mounted
01/05/18 07:39:53.089, Network Initialized
01/05/18 07:39:53.109, Ethernet Address: 9c:8d:1a:00:07:ee
01/05/18 07:39:53.229, Sensor Port initialized
01/05/18 07:39:53.284, I/O services initialized
01/05/18 07:39:53.327, FTP server enabled for port 21
01/05/18 07:39:53.347, Protocol server enabled for port 9200
01/05/18 07:39:53.368, WebServer enabled for port 80
01/05/18 07:39:53.390, Telnet server enabled for port 23
01/05/18 07:39:53.414, POR: 5927
01/05/18 07:39:53.439, Cumulative Runtime: 8 Weeks 5 Days 9 Hours 32:22.102
01/05/18 07:39:53.460, Boot Completed [2.3 seconds]

bruce_dev />


bruce_dev /> cat outfile.dat -h
00000000  30 31 2f 30 35 2f 31 38  20 30 37 3a 33 39 3a 35  01/05/18 .07:39:5
00000010  32 2e 39 31 33 2c 20 2d  2d 20 4d 6f 64 65 6c 20  2.913,.- -.Model.
00000020  34 31 30 20 76 31 2e 36  2e 33 20 2d 20 4a 41 4e  410.v1.6 .3.-.JAN
00000030  4f 53 20 53 65 72 69 65  73 20 34 0d 0a ff 00 13  OS.Serie s.4.....
00000040  00 3d 36 30 2c 20 43 6f  70 79 72 69 67 68 74 20  .=60,.Co pyright.
00000050  28 63 29 20 32 30 31 32  2d ff 00 03 00 05 38 20  (c).2012 -.....8.
00000060  49 4e 54 45 47 20 50 72  6f 63 65 73 73 20 47 72  INTEG.Pr ocess.Gr
00000070  6f 75 70 2c 20 49 6e 63  2e 2c 20 47 69 62 73 6f  oup,.Inc .,.Gibso
00000080  6e 69 61 20 50 41 20 55  53 41 ff 00 15 00 5b 38  nia.PA.U SA....[8
00000090  ff 00 03 00 5b ff 00 06  00 82 77 72 69 74 74 65  ....[... ..writte
000000A0  6e 20 61 6e 64 20 64 65  76 65 6c 6f 70 65 64 20  n.and.de veloped.
000000B0  62 79 20 42 72 75 63 65  20 43 6c 6f 75 74 69 65  by.Bruce .Cloutie
000000C0  72 ff 00 15 00 46 39 39  2c ff 00 05 00 c2 61 6c  r....F99 ,.....al
000000D0  20 4e 75 6d 62 65 72 3a  20 36 31 34 30 37 30 35  .Number: .6140705
000000E0  30 30 ff 00 12 00 31 33  2e ff 00 03 00 b9 2c 20  00....13 ......,.
000000F0  46 69 6c 65 20 53 79 73  74 65 6d 20 6d 6f 75 6e  File.Sys tem.moun
00000100  74 65 64 ff 00 15 00 2c  33 ff 00 03 00 5d 52 65  ted...., 3....]Re
00000110  67 69 73 74 72 79 ff 00  1d 00 29 38 ff 00 03 00  gistry.. ..)8....
00000120  29 4e 65 74 77 6f 72 6b  ff 00 03 01 02 69 74 ff  )Network .....it.
00000130  00 03 00 8f 69 7a ff 00  16 00 2c 31 30 ff 00 03  ....iz.. ..,10...
00000140  00 2c 45 74 68 65 72 6e  65 74 20 41 64 64 72 ff  .,Ethern et.Addr.
00000150  00 03 01 3e 3a 20 39 63  3a 38 64 3a 31 61 3a 30  ...>:.9c :8d:1a:0
00000160  30 3a ff 00 03 00 2c 65  65 ff 00 14 00 3c 32 32  0:....,e e....<22
00000170  ff 00 05 00 ee 6e 73 6f  72 20 50 6f 72 74 20 69  .....nso r.Port.i
00000180  ff 00 1e 00 6c 32 38 34  ff 00 03 01 92 2f 4f 20  ....l284 ...../O.
00000190  73 65 72 76 69 ff 00 03  01 a7 ff 00 20 00 31 33  servi... ......13
000001A0  32 37 ff 00 03 01 1e 54  50 ff 00 05 00 31 65 72  27.....T P....1er
000001B0  20 65 6e 61 62 6c ff 00  03 01 8c 66 ff 00 03 00  .enabl.. ...f....
000001C0  71 70 ff 00 04 00 71 32  31 ff 00 15 00 37 34 ff  qp....q2 1....74.
000001D0  00 03 00 37 ff 00 03 02  09 74 6f 63 6f 6c ff 00  ...7.... .tocol..
000001E0  19 00 3c 39 32 ff 00 16  01 93 33 36 ff 00 03 01  ..<92... ..36....
000001F0  93 57 65 62 ff 00 03 01  c7 ff 00 15 00 38 38 ff  .Web.... .....88.
00000200  00 16 00 36 39 ff 00 03  02 40 54 65 6c ff 00 04  ...69... .@Tel...
00000210  01 46 ff 00 19 00 ae 33  ff 00 14 00 3a 34 31 ff  .F.....3 ....:41.
00000220  00 03 01 16 50 4f 52 3a  20 35 39 32 37 ff 00 15  ....POR: .5927...
00000230  00 22 ff 00 04 01 f9 43  75 6d 75 6c 61 74 69 76  .".....C umulativ
00000240  65 20 52 ff 00 03 01 fa  69 6d 65 3a 20 38 ff 00  e.R..... ime:.8..
00000250  03 00 a8 65 6b 73 20 35  20 44 61 79 73 20 39 20  ...eks.5 .Days.9.
00000260  48 6f 75 72 73 20 33 32  3a 32 32 ff 00 03 01 da  Hours.32 :22.....
00000270  32 ff 00 15 00 4d ff 00  04 03 44 42 6f 6f 74 ff  2....M.. ..DBoot.
00000280  00 03 03 49 6d 70 6c 65  ff 00 03 02 44 20 5b 32  ...Imple ....D.[2
00000290  ff 00 03 03 81 73 65 63  6f 6e 64 73 5d 0d 0a     .....sec onds]..

bruce_dev />
Attachments
JTest2.java (3.04 KiB)
JTest.java (9.42 KiB)

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 05, 2018 11:29 am

So at this point I feel like this algorithm generates the optimum LZ77 compression for the data. This should even take into account the lazy matches, however the industry perceives those. When I cast this into C I will work on optimizing the execution.

The only question might be in optimizing distance codes to minimize extra bits. I didn't consider that when pruning the matched sequence list for a block. When those situations occur, there might be a bit or two to save by retaining the closer match. I am not going to worry about that. My feeling is that we won't save anything noticeable, if anything at all.

Now to handle the Huffman coding.

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Mon Jan 08, 2018 12:45 pm

Well, there were a couple of bugs in my coding, which were discovered while testing the approach on much larger files. With those issues fixed, I see that I need to focus on optimizing, because this all-encompassing matching is much too slow (even when considering the Java breadboard).

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Mon Jan 08, 2018 2:32 pm

The approach would find all of the sequence matches against data in the previous 32KB of the stream (the sliding window) for a section of the input stream bounded by non-matching data. Once collected, I would then end up trashing the vast majority of those. That is wasteful of processing time. It was a logical approach when you weren't sure whether a better compression ratio could be obtained through careful selection of sequences. There is this suggestion that better compression is possible if lazy matches are allowed. Without really knowing what those are, the shotgun all-encompassing approach at least guaranteed that you had all of the information you needed to reach the optimum. Let's actually look at this more closely.

Matching

We start a match upon receipt of a data byte. I'm keeping a bidirectional linked list of the occurrences of each byte value. This allows the routine to rapidly create an active match object for each occurrence. Subsequently, as each new byte is received, we check each active match for those that may be extended and those that are no longer useful. When we reach a point where none of the active matches has been extended, we select the longest completed match as the best. For matches of equal length we pick the closest one (lowest distance).

The DEFLATE specification recognizes matches of 3 or more bytes (maximum 258). Why 3? That is because the compression is achieved by replacing a matched sequence with a pointer back to the same data found in the previous 32KB of data (the sliding window). That pointer consists of a distance and a length, and in the worst case it requires about 3 bytes. So replacing shorter sequences on average won't buy you anything. That's for DEFLATE. I am actually using about 5 bytes for this breadboard, but eventually we will be strictly DEFLATE. Obviously, the longer the match the greater the savings. Therefore the best match is the longest and closest (it uses a minimum pointer size).

So for a point in the incoming stream we seek the longest match. If there is no 3 byte match then we output that first byte as is and search using the next one. The results can be impressive especially for text and log files. It's not LZW but it works. It turns out to be good enough for the JNIOR.
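The rule just described, seek the longest match of at least 3 bytes in the window and prefer the closest on ties, can be sketched with a brute-force scan. The actual breadboard uses per-byte linked lists to avoid scanning the whole window, so this is only an illustration with hypothetical names.

```java
public class Matcher {
    static final int WINDOW = 32768, MIN_MATCH = 3, MAX_MATCH = 258;

    // Returns {length, distance} of the best match for data[pos..],
    // or {0, 0} when no match of 3 or more bytes exists in the window.
    public static int[] bestMatch(byte[] data, int pos) {
        int bestLen = 0, bestDist = 0;
        int start = Math.max(0, pos - WINDOW);

        // Scan candidates from the closest position outward; keeping a match
        // only when it is strictly longer means ties go to the closest one.
        for (int cand = pos - 1; cand >= start; cand--) {
            int len = 0;
            while (len < MAX_MATCH && pos + len < data.length
                    && data[cand + len] == data[pos + len])
                len++;
            if (len >= MIN_MATCH && len > bestLen) {
                bestLen = len;
                bestDist = pos - cand;
            }
        }
        return new int[] { bestLen, bestDist };
    }
}
```

Note that the candidate comparison is allowed to run past `pos` into the match itself, which is how matches longer than their distance (repeated runs) are found.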

Lazy Matching

So what is with this lazy matching? Well, imagine a sequence of 3 or more matching bytes located someplace in the sliding window. Then consider that if we ignored that match and searched for matches starting with the next byte, we might find a much longer match someplace else in the sliding window. Do we miss an opportunity for better compression?

I can graphically show the overlap of the best matches. Say the first is 5 in length and the other some 15.
|---|
 |-------------|
Here the vertical bars denote the first and last matching bytes and the dashes the bytes in between. The first would replace 5 data bytes and the second 15, starting a byte later.

If we were to strictly process matches as they are found, we would encode the 5-byte match. And then we would still find the latter 11 bytes of the second sequence (or maybe even another, better sequence). This would encode as two sequences, one right after the other, and require two pointers for a total of maybe 6 bytes.
|---||---------|
2 pointers = 6 bytes
Note that we can always prune a match. We can ignore some of its leading bytes by incrementing the replace position (CURPTR) and decrementing the length. We can even ignore trailing bytes in a match merely by shortening its length. So here we drop the first 4 bytes of the 15-byte sequence that were eclipsed by the initial 5-byte sequence. We don't have to actually do this manipulation as supposedly our search algorithm would find it directly for us.

Now those who get excited by such things would point out that if we absolutely ignored the first 5-byte sequence completely and outputted that first raw byte then we would use just one pointer.
.|-------------|
1 raw byte plus 1 pointer = 4 bytes
And, yes, there is a savings that, depending on how often such a thing occurs, will in fact lead to a better result. This is a lazy match. This is even true when the second sequence is further offset, as seen here.
|---|
  |-------------|

..|-------------|
2 raw bytes plus 1 pointer = 5 bytes
But there is no benefit beyond that. If the two sequences were offset by 3 bytes, then you might as well include the first 3 as a sequence, and you end up using 6 bytes (or less) anyway with 2 pointers.
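The byte arithmetic above can be captured in a tiny cost model. It assumes the rough figures from the discussion: a length-distance pointer costs about 3 bytes in the worst case and a raw literal costs 1 byte. The helper names are hypothetical, not part of the breadboard code.

```java
public class LazyCost {
    static final int POINTER_COST = 3;  // rough worst-case pointer size, per the discussion

    // Greedy: keep the first match, then a second pointer for the remainder.
    public static int greedyCost() {
        return 2 * POINTER_COST;
    }

    // Lazy: emit 'offset' raw bytes, then one pointer for the longer match.
    public static int lazyCost(int offset) {
        return offset + POINTER_COST;
    }
}
```

With an offset of 1 the lazy choice costs 4 bytes against 6 for two pointers, at an offset of 2 it costs 5, and at an offset of 3 both come to 6, which is why there is no benefit beyond that point.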

OK, so for the JNIOR it is likely that these lazy matches aren't necessary. After all, we just want to create a file collection or a graphics file. We aren't really worried about saving every byte. In fact, we are probably more concerned about it getting done quickly.

So using matches as they come works. But... if we can find a way to efficiently accommodate the lazy matching it would be cool.

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 7:36 am

Optimized Program Code

Now that I have a little better understanding of the lazy matching, we can take what we have and move on to the next step in DEFLATE. Later, after I cast this into C, we can decide if it is worth handling the lazy matches. It represents a trade-off between an optimum compression ratio and processing time. For the JNIOR we really are more concerned about the processing time, as the brute-force compression ratios appear more than acceptable. Note that I bet our original processing of all possible matches between runs of unmatched raw data would lead to an even better compression ratio than one including just the lazy matches, but that would be slow.

Speaking of slow, I thought to take a little time to distill our algorithm and to code it so it would execute faster. I know that it is still a breadboard, but I would like to not waste as much time with iterations in debugging. So I have eliminated the Match object and the automatically growing lists. And since we are identifying the best match for a single position, I have eliminated the list of completed matches (RSEQ). I also made an adjustment so as to be able to replay bytes into the matcher should we need to output raw data.

The following is the LZ77 code. Hopefully the comments are sufficient for you to follow the algorithm.

// Our LZ77 compression engine
    static void do_compress(BufferedOutputStream outfile, BufferedInputStream infile, int filesize) 
            throws Throwable {
        
        int ch;
        
        // process uncompressed stream byte-by-byte
        while (filesize > 0) {
            
            // Make sure that there are bytes in the queue to work with. We process bytes from 
            //  the queue using SEQPTR. When SEQPTR reaches the INPTR then we add bytes from the input
            //  stream. The linked lists are updated. 
            if (SEQPTR == INPTR) {
                
                // obtain byte from uncompressed stream
                ch = infile.read();
                filesize--;
                
                // queue data and manage associated linked list
                DATA[INPTR] = (byte)ch;

                // Add byte to the head of the appropriate linked list. Note pointers are stored +1 so
                //  as to use 0 as an end of list marker. Lists are bi-directional so we can trim the 
                //  tail when data is dropped from the queue.
                int ptr = HEAD[ch];
                HEAD[ch] = INPTR + 1;
                FWD[INPTR] = ptr;
                BACK[INPTR] = 0;
                if (ptr != 0)
                    BACK[ptr - 1] = INPTR + 1;

                // advance entry pointer
                INPTR++;
                if (INPTR == WINDOW)
                    INPTR = 0;

                // drop old data from queue when the sliding window is full
                if (INPTR == OUTPTR) {

                    // trim linked list as byte is being dropped
                    if (BACK[OUTPTR] == 0)
                        HEAD[DATA[OUTPTR]] = 0;
                    else
                        FWD[BACK[OUTPTR] - 1] = 0;

                    // push end of queue
                    OUTPTR++;
                    if (OUTPTR == WINDOW)
                        OUTPTR = 0;
                }
            }
            
            // Obtain the next character to process. We are assured of a byte at SEQPTR now.
            //  SEQPTR allows us to replay bytes into the sequence matching.
            ch = DATA[SEQPTR++];
            if (SEQPTR == WINDOW)
                SEQPTR = 0;
            
            // Reset match state. These will define the best match should one be found for 
            //  the current CURPTR.
            int best_distance = 0;
            int best_length = 0;
            
            // If there are no active sequences we create a new set. This uses the linked list
            //  for the byte at CURPTR to initialize a series of potential sequence sites.
            if (MSIZE == 0) {

                // create new active matches for all CH in the queue (except last)
                int ptr = HEAD[ch];
                while (ptr != 0) {
                    if (ptr - 1 != CURPTR) {
                        int distance = CURPTR - ptr + 1;
                        if (distance < 0)
                            distance += WINDOW;

                        DISTANCE[MSIZE] = distance;
                        LENGTH[MSIZE] = 1;
                        MSIZE++;
                    }

                    ptr = FWD[ptr - 1];
                }
                
            }
                
            // Otherwise process the active sequence matches. Here we advance sequences as each
            //  new byte is processed. Of those matches that cannot be extended we keep the
            //  best (longest and closest to CURPTR). We will use the best match if all of the
            //  potential matches end.
            else {
                
                // each active match
                for (int n = MSIZE - 1; 0 <= n; n--) {
                    
                    int p = CURPTR - DISTANCE[n];
                    if (p < 0)
                        p += WINDOW;

                    p += LENGTH[n];
                    if (p >= WINDOW)
                        p -= WINDOW;

                    // Can we extend this match? If so we bump its length and move on to
                    //  the next match.
                    if (DATA[p] == ch && LENGTH[n] < 258) {
                        LENGTH[n]++;

                        if (DISTANCE[n] + LENGTH[n] < WINDOW && filesize > 0)
                            continue;
                    }

                    // Sequence did not get extended. See if it is the best found so far.
                    if (LENGTH[n] >= 3) {
                        
                        // first 
                        if (best_length == 0) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                        
                        // longer
                        else if (LENGTH[n] > best_length) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                        
                        // closer
                        else if (LENGTH[n] == best_length && DISTANCE[n] < best_distance) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                    }

                    // Completed matches are eliminated from the active list. To be quick we
                    //  replace it with the last in the list and reduce the count.
                    MSIZE--;
                    DISTANCE[n] = DISTANCE[MSIZE];
                    LENGTH[n] = LENGTH[MSIZE];
                }
            }
            
            // If there are no active sequence matches at this point we can generate output.
            if (MSIZE == 0) {
                
                // If we have a completed sequence we can output a pointer. These are escaped
                //  into the output buffer for later processing into encoded length-distance
                //  pairs.
                if (best_length != 0) {
                    bufwrite(0xff, outfile);
                    bufint(best_length, outfile);
                    bufint(best_distance, outfile);

                    // Move CURPTR to the next byte after the replaced sequence
                    CURPTR += best_length;
                    if (CURPTR >= WINDOW)
                        CURPTR -= WINDOW;
                }                

                // Otherwise output a raw uncompressed byte. The unmatched byte is sent to
                //  the output stream and we move CURPTR to the next. 
                else {
                    bufbyte(DATA[CURPTR], outfile);
                    CURPTR++;
                    if (CURPTR == WINDOW)
                        CURPTR = 0;
                }

                // Here we reset SEQPTR to process from the next CURPTR location. In the case that
                //  we could not match this replays bytes previously processed so as to not miss
                //  an opportunity.
                SEQPTR = CURPTR;
            }
        }
        
        // If we are done and there are unprocessed bytes left we push them to the output stream.
        while (CURPTR != INPTR) {
            bufbyte(DATA[CURPTR], outfile);
            CURPTR++;
            if (CURPTR == WINDOW)
                CURPTR = 0;
        }
    }

With this in place we are going to move on to see what needs to be done next with our output stream.

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 8:41 am

Our LZ77 compression routine loads an output buffer with processed and hopefully compressed data. When this output buffer fills (say to 64KB) we must process it further. That data is then compressed again using a form of Huffman coding.

Huffman coding for DEFLATE

If you search you can find lots of useful descriptions of Huffman coding. Not all of those will provide the detail for constructing the required tree from the data. Of those that do, most do not lead you to creating a Huffman table compatible with DEFLATE. This is because the Huffman table in DEFLATE is eventually stored using a form of shorthand. That is only possible if the Huffman encoding follows some additional rules. To meet those requirements we need to be careful in constructing our initial dynamic tree.

The DEFLATE specification cryptically defines it:
The Huffman codes used for each alphabet in "deflate" format have two additional rules:

   * All codes of a given bit length have lexicographically consecutive values, in the same order as the
     symbols they represent;

   * Shorter codes lexicographically precede longer codes.

It would be nice if they would avoid words like lexicographically, but you can't have everything. You can also get confused over the term codes versus the binary values of the bytes in the alphabet. And of course shorter refers to bit count. That is perhaps a little more obvious, but here again these must "lexicographically precede" the others.
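Those two rules are what make the codes reconstructible from their bit lengths alone. RFC 1951 (section 3.2.2) spells out the procedure, and a direct transcription of it might look like this (the method names are my own):

```java
public class Canonical {

    // lengths[i] = code length in bits for symbol i (0 = symbol unused).
    // Returns the canonical DEFLATE code value for each symbol.
    public static int[] assignCodes(int[] lengths) {
        int maxBits = 0;
        for (int len : lengths)
            maxBits = Math.max(maxBits, len);

        // Count how many codes exist at each bit length.
        int[] blCount = new int[maxBits + 1];
        for (int len : lengths)
            if (len > 0)
                blCount[len]++;

        // Smallest code value for each bit length; this is what enforces
        // "shorter codes lexicographically precede longer codes".
        int[] nextCode = new int[maxBits + 1];
        int code = 0;
        for (int bits = 1; bits <= maxBits; bits++) {
            code = (code + blCount[bits - 1]) << 1;
            nextCode[bits] = code;
        }

        // Consecutive values within a bit length, assigned in symbol order,
        // giving "lexicographically consecutive values".
        int[] codes = new int[lengths.length];
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] != 0)
                codes[n] = nextCode[lengths[n]]++;
        return codes;
    }
}
```

Running this on the RFC's own example, lengths (3, 3, 3, 3, 3, 2, 4, 4) for symbols A through H, yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111.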

Alphabet

This refers to the set of values that we intend to compress. Obviously this needs to include all byte values (0..255) since we are not constraining our input to ASCII or something. We include all of the possible values in the alphabet (in increasing value) even if some do not appear in the data. That seems obvious, but DEFLATE also defines an end-of-block code (like an EOF) of 256, as well as special codes 257..285 used to represent length codes (for the length-distance pointers we created).

So we will need to encode values from 0 through 285. Okay, that set requires 9 bits and makes life in a world of bytes difficult. Remember how I had to escape my length-distance pointers in the buffer? Anyway, we can handle it in building our trees, as we can define the value of a node as an integer. So for DEFLATE our "alphabet" consists of the numbers 0..285.

Don't be confused if you notice that length codes and also distance codes generally include some "extra" bits. They do and those are simply slipped into the bit stream and are not subjected to Huffman coding. We'll get into that later.

Length codes

These lie outside of the normal byte values 0..255 simply because in decompression we need to recognize them. They are flagged just as I have escaped the same in the output buffer. There are 29 length codes, which are used with extra bits in some cases to encode lengths of 3 to 258. You may recall that we did not create matching sequences of less than 3 bytes, and there is a maximum length of 258 bytes. The 258 maximum, I bet, results from storing length-3 as a byte (0..255) someplace. But I would be very curious as to the thought process that breaks these 256 possible lengths into 29 codes. That is likely based upon some probability distribution or some such thing. It is what it is.

Distance codes

Unlike the length codes, the distance codes do not need to be flagged. We expect a distance code after a length code, and so those use normal byte values already represented in the alphabet (0..29). There are 30 distance codes, some also requiring extra bits, encoding distances from 1..32768. This allows the matched sequence to sit anywhere in that 32KB sliding window.

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 9:27 am

Huffman coding compresses data by representing frequent values with a small number of bits. If a space ' ' (0x20) appears in the data a tremendous number of times, it might get encoded in just 2 bits. That saves 6 bits for every occurrence of a space, which can be a huge savings. The downside is that a rare byte occurring only a few times might be encoded in 10 bits. That actually increases the storage over the original 8-bit byte, but it happens only a few times.

This implies that we know the frequency of each member of our alphabet. That is the first step. We need to count each occurrence of each member of our alphabet that appears in the data.

Here we modify my bufflush() routine that is responsible for emptying the buffer. First we will add a routine to count. There are 286 members in the alphabet (256 byte values, the end-of-block code, and 29 length codes). We create an integer array where we use the value as an index to count occurrences. There is one complication, in that I need to convert my length-distance escaping into the DEFLATE encoding. That entails tables of length and distance ranges so we can decide which of the length and distance codes to use.


    // length code range maximums
    static int[] blen = { 
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35, 
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259 
    };
Here, for each of the 29 length ranges, we specify the largest length plus 1 that it can encode. For a given length we will loop through these ranges to determine the proper alphabet value to use. The DEFLATE lengths are encoded as follows (from RFC 1951):
                 Extra               Extra               Extra
            Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
            ---- ---- ------     ---- ---- -------   ---- ---- -------
             257   0     3       267   1   15,16     277   4   67-82
             258   0     4       268   1   17,18     278   4   83-98
             259   0     5       269   2   19-22     279   4   99-114
             260   0     6       270   2   23-26     280   4  115-130
             261   0     7       271   2   27-30     281   5  131-162
             262   0     8       272   2   31-34     282   5  163-194
             263   0     9       273   3   35-42     283   5  195-226
             264   0    10       274   3   43-50     284   5  227-257
             265   1  11,12      275   3   51-58     285   0    258
             266   1  13,14      276   3   59-66
Similarly we create an array for the 30 distance codes.


    static int[] bdist = { 
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145, 
        8193, 12289, 16385, 24577, 32769 
    };
The DEFLATE specification encodes distances as follows:
                  Extra           Extra               Extra
             Code Bits Dist  Code Bits   Dist     Code Bits Distance
             ---- ---- ----  ---- ----  ------    ---- ---- --------
               0   0    1     10   4     33-48    20    9   1025-1536
               1   0    2     11   4     49-64    21    9   1537-2048
               2   0    3     12   5     65-96    22   10   2049-3072
               3   0    4     13   5     97-128   23   10   3073-4096
               4   1   5,6    14   6    129-192   24   11   4097-6144
               5   1   7,8    15   6    193-256   25   11   6145-8192
               6   2   9-12   16   7    257-384   26   12  8193-12288
               7   2  13-16   17   7    385-512   27   12 12289-16384
               8   3  17-24   18   8    513-768   28   13 16385-24576
               9   3  25-32   19   8   769-1024   29   13 24577-32768
So this buffer flush routine looks as follows. Note that we are not encoding the output in any way yet. This merely determines the counts.

    // length code range maximums
    static int[] blen = { 
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35, 
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259 
    };
    
    static int[] bdist = { 
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145, 
        8193, 12289, 16385, 24577, 32769 
    };

    static void bufflush(BufferedOutputStream outfile) throws Throwable {
        
        // Determine frequencies by counting each occurrence of a byte value
        int[] freq = new int[286];
        for (int n = 0; n < BUFPTR; n++) {
            
            // Get the byte value. This may be escaped with 0xff.
            int ch = BUFR[n] & 0xff;
            
            // not escaped
            if (ch != 0xff)
                freq[ch]++;
            
            // escaped
            else {
                
                // Get next byte.
                ch = BUFR[++n] & 0xff;
                
                // may just be 0xff itself
                if (ch == 0xff)
                    freq[0xff]++;
                
                // length-distance pair
                else {
                    
                    // obtain balance of length and distance values
                    int len = (ch << 8) + (BUFR[++n] & 0xff);
                    int dist = ((BUFR[++n] & 0xff) << 8) + (BUFR[++n] & 0xff);
                    
                    // determine length code to use (257..285)
                    for (int k = 0; k < blen.length; k++) {
                        if (len < blen[k]) {
                            freq[257 + k]++;
                            break;
                        }
                    }
                    
                    // determine distance code to use (0..29)
                    for (int k = 0; k < bdist.length; k++) {
                        if (dist < bdist[k]) {
                            freq[k]++;
                            break;
                        }
                    }
                    
                }
            }
        }
        
        // dump the results
        for (int n = 0; n < 256; n++) {
            if (freq[n] > 0)
                System.out.printf("0x%03x %d\n", n, freq[n]);
        }
        
        outfile.write(BUFR, 0, BUFPTR);
        BUFPTR = 0;        
    }
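The two code-selection loops can be sanity-checked against the tables in RFC 1951 section 3.2.5. This sketch pulls them out as standalone methods (the method names here are mine, not from the routine above) and spot-checks a few values from the spec:

```java
public class CodeCheck {

    // Same upper-bound tables as bufflush() above
    static int[] blen = {
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35,
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259
    };
    static int[] bdist = {
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
        8193, 12289, 16385, 24577, 32769
    };

    // Map a match length (3..258) to its DEFLATE length code (257..285)
    static int lengthCode(int len) {
        for (int k = 0; k < blen.length; k++)
            if (len < blen[k])
                return 257 + k;
        throw new IllegalArgumentException("length out of range: " + len);
    }

    // Map a distance (1..32768) to its DEFLATE distance code (0..29)
    static int distCode(int dist) {
        for (int k = 0; k < bdist.length; k++)
            if (dist < bdist[k])
                return k;
        throw new IllegalArgumentException("distance out of range: " + dist);
    }

    public static void main(String[] args) {
        // Expected values taken from the RFC 1951 tables
        System.out.println(lengthCode(3));    // 257
        System.out.println(lengthCode(10));   // 264
        System.out.println(lengthCode(258));  // 285
        System.out.println(distCode(1));      // 0
        System.out.println(distCode(24577));  // 29
    }
}
```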

And the unsorted results. Here we list only those values that appear in the data.

jtest jniorboot.log
0x004 1
0x00a 9
0x00b 13
0x00c 4
0x00d 5
0x00e 8
0x00f 3
0x010 4
0x011 8
0x012 2
0x013 2
0x020 41
0x028 1
0x029 1
0x02c 6
0x02d 5
0x02e 7
0x02f 3
0x030 16
0x031 12
0x032 12
0x033 6
0x034 10
0x035 7
0x036 6
0x037 4
0x038 9
0x039 9
0x03a 10
0x041 4
0x042 2
0x043 3
0x044 1
0x045 2
0x046 1
0x047 3
0x048 1
0x049 3
0x04a 1
0x04d 1
0x04e 4
0x04f 3
0x050 4
0x052 3
0x053 4
0x054 3
0x055 1
0x057 1
0x05b 1
0x05d 1
0x061 8
0x062 5
0x063 7
0x064 9
0x065 29
0x066 1
0x067 2
0x068 2
0x069 14
0x06b 2
0x06c 10
0x06d 6
0x06e 9
0x06f 17
0x070 5
0x072 17
0x073 11
0x074 15
0x075 8
0x076 4
0x077 2
0x079 5
0x07a 1
0x101 30
0x102 2
0x103 3
0x105 1
0x10c 1
0x10d 13
0x10e 2
0x10f 2
0x110 1

Processing 6.129 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 
There is one omission. We will in fact use one end-of-block code (0x100) and so we will need to force it into the table.
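Forcing it in can be as simple as counting the code once before building the table. A minimal sketch (the helper name is mine, assuming the freq[] array from bufflush()):

```java
public class EobForce {

    // Ensure the end-of-block code (0x100) is represented in the frequency
    // table so it receives a Huffman code; every block ends with exactly one.
    static int[] forceEob(int[] freq) {
        if (freq[0x100] == 0)
            freq[0x100] = 1;
        return freq;
    }

    public static void main(String[] args) {
        int[] freq = new int[286];   // literal/length alphabet as in bufflush()
        forceEob(freq);
        System.out.println(freq[0x100]);   // 1
    }
}
```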

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 9:53 am

I will need to build a tree. So we need to create some kind of node which will have left and right references as well as a potential value and a weight (frequency). Here I will define a class.

    static ArrayList<Node> nodes = new ArrayList<>(512);
    
    static class Node {
        int left;
        int value;
        int weight;
        int right;
        int length;
        int code;
        
        Node() {
        }
        
        // instantiate leaf
        Node(int val, int w) {
            value = val;
            weight = w;
        }
        
        // instantiate node
        Node(int l, int r, int w) {
            left = l;
            right = r;
            weight = w;
        }                
    }

I know that later I will assign individual leaves a code length and a code so those are included in the class as well. Now I will take our array of value frequencies and create leaves for each of those that appear in the data. An ArrayList will store these nodes and grow as we define a full tree.
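For reference, the textbook combining step is to repeatedly merge the two lowest-weight nodes into a parent whose weight is their sum. This is only a generic sketch using a priority queue with arbitrary tie-breaking, not the ordering approach developed in this thread; but for the "go go gophers" frequencies any valid Huffman tree yields the same total of 37 bits:

```java
import java.util.PriorityQueue;

public class HuffSketch {

    static class Node {
        Node left, right;
        int value, weight;
        Node(int v, int w) { value = v; weight = w; }          // leaf
        Node(Node l, Node r) {                                 // internal node
            left = l; right = r;
            weight = l.weight + r.weight;
        }
    }

    // Merge the two lowest-weight nodes until a single root remains.
    static Node build(int[][] pairs) {
        PriorityQueue<Node> q = new PriorityQueue<>((a, b) -> a.weight - b.weight);
        for (int[] p : pairs)
            q.add(new Node(p[0], p[1]));
        while (q.size() > 1)
            q.add(new Node(q.poll(), q.poll()));
        return q.poll();
    }

    // Total encoded size in bits: each leaf's weight times its depth.
    static int cost(Node n, int depth) {
        if (n.left == null)
            return n.weight * depth;
        return cost(n.left, depth + 1) + cost(n.right, depth + 1);
    }

    public static void main(String[] args) {
        // Frequencies for "go go gophers": g,o = 3; space = 2; e,h,p,r,s = 1
        int[][] freqs = { {'g', 3}, {'o', 3}, {' ', 2}, {'e', 1},
                          {'h', 1}, {'p', 1}, {'r', 1}, {'s', 1} };
        Node root = build(freqs);
        System.out.println(root.weight);     // 13, the source length
        System.out.println(cost(root, 0));   // 37 bits
    }
}
```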

I will also create an ordering array that we will use to properly construct the tree. Each entry in this array will reference a node. Initially we will sort this by decreasing frequency. Also, in keeping with the lexicographical requirement, nodes of the same frequency will be ordered by increasing alphabet value. This is where it gets tricky, but we start here.

We replace the frequency dump loop with the following.

        // create node set (0 not used)
        int[] order = new int[286];
        int cnt = 0;
        nodes.add(new Node());  // not used (index 0 is terminator)
        for (int n = 0; n < 286; n++) {
            if (freq[n] > 0) {
                nodes.add(new Node(n, freq[n]));
                order[cnt++] = nodes.size() - 1;
            }
        }
        
        // sort
        for (int n = 0; n < cnt - 1; n++) {
            Node nd1 = nodes.get(order[n]);
            Node nd2 = nodes.get(order[n + 1]);

            if (nd1.weight < nd2.weight || nd1.weight == nd2.weight && nd1.value > nd2.value) {
                int k = order[n];
                order[n] = order[n + 1];
                order[n + 1] = k;
                n -= 2;
                if (n < -1)
                    n = -1;
            }
        }
        
        //dump
        System.out.println("");
        for (int n = 0; n < cnt; n++) {
            Node nd = nodes.get(order[n]);
            System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
        }

And the results of the sort are displayed.

bruce_dev /> jtest jniorboot.log

12 0x020 41
75 0x101 30
55 0x065 29
64 0x06f 17
66 0x072 17
19 0x030 16
68 0x074 15
59 0x069 14
3 0x00b 13
80 0x10d 13
20 0x031 12
21 0x032 12
67 0x073 11
23 0x034 10
29 0x03a 10
61 0x06c 10
2 0x00a 9
27 0x038 9
28 0x039 9
54 0x064 9
63 0x06e 9
6 0x00e 8
9 0x011 8
51 0x061 8
69 0x075 8
17 0x02e 7
24 0x035 7
53 0x063 7
15 0x02c 6
22 0x033 6
25 0x036 6
62 0x06d 6
5 0x00d 5
16 0x02d 5
52 0x062 5
65 0x070 5
72 0x079 5
4 0x00c 4
8 0x010 4
26 0x037 4
30 0x041 4
41 0x04e 4
43 0x050 4
45 0x053 4
70 0x076 4
7 0x00f 3
18 0x02f 3
32 0x043 3
36 0x047 3
38 0x049 3
42 0x04f 3
44 0x052 3
46 0x054 3
77 0x103 3
10 0x012 2
11 0x013 2
31 0x042 2
34 0x045 2
57 0x067 2
58 0x068 2
60 0x06b 2
71 0x077 2
76 0x102 2
81 0x10e 2
82 0x10f 2
1 0x004 1
13 0x028 1
14 0x029 1
33 0x044 1
35 0x046 1
37 0x048 1
39 0x04a 1
40 0x04d 1
47 0x055 1
48 0x057 1
49 0x05b 1
50 0x05d 1
56 0x066 1
73 0x07a 1
74 0x100 1
78 0x105 1
79 0x10c 1
83 0x110 1

Processing 8.873 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 
The first number is the node index, the next is the value, and the last is the count of occurrences. Here we see that the space (0x20) is the most frequent in this file. You can see that I included the end-of-block code (0x100) once and it is as infrequent as a few others.
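As an aside, the ordering used here (decreasing frequency, ties broken by increasing alphabet value) is equivalent to a standard comparator sort. This is just an illustrative sketch with a cut-down leaf type, not the code above:

```java
import java.util.Arrays;

public class OrderSketch {

    static class Leaf {
        int value, weight;
        Leaf(int v, int w) { value = v; weight = w; }
    }

    // Sort by decreasing frequency; break ties by increasing alphabet value.
    static void sort(Leaf[] leaves) {
        Arrays.sort(leaves, (a, b) ->
            a.weight != b.weight ? b.weight - a.weight : a.value - b.value);
    }

    public static void main(String[] args) {
        // A few leaves from the listing above, deliberately out of order
        Leaf[] leaves = { new Leaf(0x72, 17), new Leaf(0x20, 41),
                          new Leaf(0x6f, 17), new Leaf(0x65, 29) };
        sort(leaves);
        for (Leaf l : leaves)
            System.out.printf("0x%03x %d%n", l.value, l.weight);
        // prints 0x020 41, 0x065 29, 0x06f 17, 0x072 17
    }
}
```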

bscloutier
Posts: 390
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 12, 2018 10:16 am

The next step is to construct a Huffman tree. So to simplify things at this point we are going to use a simple phrase as the data and disable the LZ77 aspect. There is a Duke University page describing Huffman coding that uses the phrase "go go gophers". For lack of anything better we will use the same.

With LZ77 disabled and running only that phrase our frequency sort yields the following.

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

Processing 0.649 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev /> 

bruce_dev /> cat flash/gogo.dat
go go gophers
bruce_dev />
We also ignore any end-of-block code.

The Duke page demonstrates that this phrase can be encoded with just 37 bits. It also demonstrates that there are multiple possible Huffman trees that can be created to yield that result. Interestingly, none of the trees shown there meets the DEFLATE requirement. So I am going to determine the procedure that creates the right kind of table.
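The target is the canonical code assignment that RFC 1951 section 3.2.2 spells out: once each symbol has a code length, the actual bit codes are fully determined. A sketch of that procedure, using the RFC's own A..H example (lengths 3,3,3,3,3,2,4,4) rather than the gophers data:

```java
public class Canonical {

    // Given a code length per symbol (0 = unused), assign the canonical
    // codes per RFC 1951: count codes of each length, compute the starting
    // code for each length, then hand out codes in symbol order.
    static int[] assign(int[] lengths, int maxBits) {
        int[] blCount = new int[maxBits + 1];
        for (int len : lengths)
            if (len > 0)
                blCount[len]++;

        int[] nextCode = new int[maxBits + 1];
        int code = 0;
        for (int bits = 1; bits <= maxBits; bits++) {
            code = (code + blCount[bits - 1]) << 1;
            nextCode[bits] = code;
        }

        int[] codes = new int[lengths.length];
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] > 0)
                codes[n] = nextCode[lengths[n]]++;
        return codes;
    }

    public static void main(String[] args) {
        // RFC 1951's example: symbols A..H with lengths 3,3,3,3,3,2,4,4
        int[] codes = assign(new int[] { 3, 3, 3, 3, 3, 2, 4, 4 }, 4);
        for (int c : codes)
            System.out.println(Integer.toBinaryString(c));
        // A..H receive 010 011 100 101 110 00 1110 1111, as in the RFC
    }
}
```

Note how the two rules fall out automatically: codes of the same length are consecutive in symbol order, and shorter codes lexicographically precede longer ones.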
