
DEFLATE Compression Algorithm

bscloutier
Posts: 399
Joined: Thu Sep 14, 2017 12:55 pm

Re: DEFLATE Compression Algorithm

Post by bscloutier » Mon Jan 15, 2018 9:05 am

With the Huffman coding for DEFLATE there is one thing that we need to worry about. We need to limit the bit length to a maximum of 15. I suspect this is not very likely to occur, but the bit length list for the alphabet is run-length encoded using a procedure that can only handle bit lengths of 0 to 15. Codes 16, 17 and 18 are used to signal certain types of repetition. This is where I got the '19' I use in my breadboard code to dimension the bit length arrays. That need only be '16' as the 3 additional codes signal repetition. If the Huffman coding results in a bit length exceeding 15 I will simply have to decrease the size of the block we are encoding until we are good to go.

So now that we have the ability to reasonably compress our data using LZ77 and then to compress it further with Huffman coding, we are ready to generate the DEFLATE formatted payload. It is time to tackle the "craziness" and "shorthand" that I have referred to. We can apply the run-length encoding and the second iteration of Huffman coding that are required. We are ready to generate the DEFLATE format compressed bit stream.

JANOS has been able to process JAR/ZIP files since early in its development. This was required to meet our goal of executing Java directly from the JAR files generated by the compiler. So we have been able to decipher the DEFLATE format. I just hadn't needed to generate it. But the advantage to this is that there is already proven code parsing the DEFLATE structure. Referring to that helps to remove any question when trying to figure out how to generate such a structure.

Rather than drop this topic now that I have the LZ77 and Huffman procedures that I need, I'll take a moment to review the final steps. Let me see if I can clarify some of it here. For our Java breadboard to produce something usable I would not only have to complete the DEFLATE programming but also encapsulate the result in the JAR/ZIP library format. That's more effort than I need given that I will be doing just that at the C level within the OS and I need to get to that soon.


Post by bscloutier » Mon Jan 15, 2018 12:02 pm

DEFLATE Formatted Data

When file data is compressed using DEFLATE and included in a JAR/ZIP library it is represented as a bit stream. That sounds simple enough but because our micro-controller works with bytes and retrieves bytes from a file data stream we need to be concerned with bit order and byte order.

Normally bits in a byte are ordered from right to left with the least significant bit (the first in the bit stream) on the right. Like this:
+--------+
|76543210|
+--------+
The 9th bit in the stream then comes from the next byte and bytes are in sequence in increasing memory addresses (or increasing file position). This order of bits seems only natural as it should.

So if we were to retrieve a 5-bit value from the stream we would obtain the right 5 bits from the first byte using the mask 0x1f. Placing that in a byte of its own would give us the numeric value of the 5-bit data element. The next 5-bit element would use the remaining 3 bits in the first byte and 2 from the right side of the next. We would likely be using a right shift before applying the mask to pull those together.
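As a sketch (not the actual JANOS code), an LSB-first bit reader for these fixed-width data elements might look like this. The function name and signature are my own for illustration:

```c
#include <stdint.h>

/* Read n bits (n <= 16) from the stream LSB-first, starting at *bitpos.
   Data elements are packed starting with the least-significant bit of
   each byte, so the first 5-bit value is simply (buf[0] & 0x1f).  The
   second 5-bit value combines the top 3 bits of byte 0 with the bottom
   2 bits of byte 1, exactly as described above. */
static uint32_t read_bits(const uint8_t *buf, uint32_t *bitpos, int n)
{
    uint32_t value = 0;
    for (int i = 0; i < n; i++, (*bitpos)++)
        value |= ((buf[*bitpos >> 3] >> (*bitpos & 7)) & 1u) << i;
    return value;
}
```

A production version would read a byte at a time and shift, but the bit loop makes the ordering explicit.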

Huffman codes will seem to contradict this. These codes are packed starting with the most-significant bit of the code. In other words the most significant bit of the first Huffman code would be found in the rightmost bit of the first byte. Once you realize that you must process Huffman codes a bit at a time and that you are reading single bit data elements this order makes sense. Pointing out that the Huffman code appears in the byte in reverse order serves to confuse us. But you are reading it a single bit at a time using each bit to decide which direction to descend through the Huffman tree. That means that you need the code's most-significant bit first. We also never know in advance how many bits we are going to need to reach the first leaf of the tree and thus our first encoded member of the alphabet.

Block Format

The DEFLATE bit stream is broken down into a series of 1 or more blocks of bits. In our case when we flush our 64KB LZ77 data buffer we are going to construct a single block. Since we are compressing it you would expect that it will contain much less than 512 Kbits (64KB x 8 bits). For large files we will likely need to flush our buffer multiple times, creating a stream with multiple blocks.

Each block contains a 3-bit header. The first bit is called BFINAL and it is a 1 only for the last block in the stream. The next 2 bits are called BTYPE and these define the data encoding for this block. That means that we could use a different encoding for each block if we felt it to be beneficial. The 2 BTYPE bits give us 4 options.
  00 - no compression
  01 - compressed with fixed Huffman codes
  10 - compressed with dynamic Huffman codes
  11 - reserved (error)
We have been working toward being able to generate blocks of BTYPE 10. We could have used the predefined Huffman tables of BTYPE 01 or not even bothered to compress using BTYPE 00. There may be times when our compression fails to reduce the size of our file data. In that case we could decide to include a block without compressing. But for now we will concern ourselves with BTYPE 10, which includes dynamic (not adaptive) Huffman tables (tables that change block to block).

So to start this defines our block so far. I'll show the stream progressing from left to right with the number of bits for each element shown in parentheses.
 Bit 0         1         2         3         4         5         6         7
+---------+---------+---------+---------+---------+---------+---------+---------
| BFINAL  |     BTYPE (2)     |        Balance of Block . . .
+---------+---------+---------+---------+---------+---------+---------+---------

BFINAL set on the last block in the stream.
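A minimal LSB-first bit writer makes it easy to emit the header. This is a hypothetical sketch with names of my own choosing, not the JANOS implementation:

```c
#include <stdint.h>

/* Minimal LSB-first bit writer: bits fill each byte from the right. */
typedef struct { uint8_t *out; uint32_t bitpos; } bitwriter_t;

static void put_bits(bitwriter_t *bw, uint32_t value, int n)
{
    for (int i = 0; i < n; i++, bw->bitpos++) {
        if ((bw->bitpos & 7) == 0)
            bw->out[bw->bitpos >> 3] = 0;      /* clear each fresh byte */
        bw->out[bw->bitpos >> 3] |= ((value >> i) & 1u) << (bw->bitpos & 7);
    }
}

/* Start a block: BFINAL (1 bit) followed by BTYPE (2 bits). */
static void begin_block(bitwriter_t *bw, int final, int btype)
{
    put_bits(bw, final ? 1u : 0u, 1);
    put_bits(bw, (uint32_t)btype, 2);
}
```

For a final block of BTYPE 10 the first byte starts out as binary xxxxx101 (reading the stream right to left).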


Post by bscloutier » Mon Jan 15, 2018 12:49 pm

Now logically we know that somehow we need to get the Huffman table before we see compressed data so we can decompress that data. We also have seen that the Huffman table can be defined knowing the bit lengths for each of the symbols in our alphabet. In fact we expect that, since I made a big deal about having the right kind of Huffman table for DEFLATE that can be generated from just the bit lengths. So we are looking for that array of bit lengths. Here's where the fun begins.

To start we know that not all of the 286 members of the alphabet will be used in the data. Some entries in that table will be assigned a 0 bit length. We have to assume that the majority of the literal bytes (0..255) will be represented in the data. We also know that the end-of-block code (256) will appear once. So we need an array of bit lengths at least 257 entries long. Beyond that we don't know how many of the sequence length codes (257..285) will be used. But if some of the trailing codes aren't used then the array doesn't need to be 286 entries long. We just need the non-zero bit lengths. So this array will be 257 plus however many length codes we need to cover all of the non-zero ones.

The next element in the bit stream is HLIT. This 5-bit element, when added to 257, defines the count of entries in the bit length array that will be provided. We will assume that the rest are 0. Since there are 29 length codes beyond the end-of-block code we need only know how many of those are used to know the size of the array. That can be passed in a 5-bit element.
       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
HLIT + 257 tells us how many of the 286 bit lengths will be provided. But, don't expect that those will be forthcoming. At least not right away.


Post by bscloutier » Mon Jan 15, 2018 1:34 pm

Next comes something that the breadboard program handled incorrectly. The distance codes for the length-distance pointers are Huffman coded using their own table. The length codes are Huffman coded using the same table as the literal data. This makes sense since when you retrieve the next code you don't know if it is a literal or a length code. You do know that after the length code (and any extra bits) comes the distance code. So compressing those with their own table is a benefit. The DEFLATE specification (RFC 1951) states this but it isn't all that obvious.

So wait! Now we need 2 Huffman tables and therefore 2 arrays of bit lengths. Yes we do.

Next we receive a 5-bit data element, HDIST, which stores the count minus 1. HDIST + 1 defines the number of bit lengths to be supplied for the distance alphabet and the second Huffman table. The specification shows that there are 30 distance codes (0..29) but refers to 32 distance codes. It states that codes 30 and 31 will never occur in the data. These are perhaps reserved to allow for a larger sliding window in the future. There is also the case where a single distance code of zero bits indicates that no distance codes are used at all and that the data is all literals. So to pass a bit length array size of up to 32 the value is stored minus 1.
       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
           8         9        10        11        12        13        14        15
 -----+---------+---------+---------+---------+---------+---------+---------+---------+-----
      |                    HDIST (5)                    |
 -----+---------+---------+---------+---------+---------+---------+---------+---------+-----
Now we know the length of two arrays defining bit lengths for two alphabets. I had alluded to the fact that we would again use Huffman coding to compress the bit length array data. That is the case and so we need yet another Huffman table and therefore a third array of bit lengths. Will it never end??


Post by bscloutier » Mon Jan 15, 2018 2:04 pm

The bit length alphabet includes 3 codes for repetition for a total of 19 codes. The size of the bit length array for this is conveyed in a 4-bit data element, HCLEN, which stores the count minus 4. Note though that this array defines bit lengths for the codes in a very specific order. This was devised to keep those codes that typically have 0 bit length at the end of the array so they can be omitted. The order of the codes is as follows:
16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
That means that we will have to arrange the bit lengths in this order before deciding how many to include in our stream. When reading these we would shuffle them into the proper location.

Our bit stream now looks like this.
       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
           8         9        10        11        12
 -----+---------+---------+---------+---------+---------+-----
      |                    HDIST (5)                    |
 -----+---------+---------+---------+---------+---------+-----
          13        14        15        16
 -----+---------+---------+---------+---------+-----
      |               HCLEN (4)               |
 -----+---------+---------+---------+---------+-----
Well, there is a lot of information so far conveyed in just 2 bytes. Next we will finally receive some bit length data. Following this we are supplied HCLEN + 4 data elements, each 3 bits (0..7), defining the code lengths for the Huffman table that we will use to decode the bit length arrays for the other tables. Note that these are in the predefined order and must be sorted into the bit lengths for the 19 symbol alphabet. Now there are a variable number of these data elements and so I will no longer be able to number the bits nor can I show them all. HCLEN + 4 data elements, however many are required, follow in the stream.
          13        14        15        16        17        18        19   . . . 
 -----+---------+---------+---------+---------+---------+---------+---------+-----
      |               HCLEN (4)               |          CODE16 (3)         |
 -----+---------+---------+---------+---------+---------+---------+---------+-----

 -----+---------+---------+---------+---------+---------+---------+-----
      |          CODE17 (3)         |          CODE18 (3)         |
 -----+---------+---------+---------+---------+---------+---------+-----

 -----+---------+---------+---------+---------+-----
      |          CODE0 (3)          |       . . . 
 -----+---------+---------+---------+---------+-----
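The unshuffling step is simple once you have the raw 3-bit values in hand. Here is a sketch (the function name is mine, for illustration only):

```c
#include <stdint.h>

/* The order in which code lengths for the 19-symbol code length alphabet
   appear in the stream, per RFC 1951. */
static const uint8_t CLEN_ORDER[19] = {
    16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
};

/* Place the HCLEN+4 raw 3-bit values (already pulled from the stream, in
   stream order) into their natural positions 0..18.  Omitted trailing
   entries default to a bit length of 0. */
static void unshuffle_clen(const uint8_t *raw, int count, uint8_t lengths[19])
{
    for (int i = 0; i < 19; i++)
        lengths[i] = 0;
    for (int i = 0; i < count; i++)
        lengths[CLEN_ORDER[i]] = raw[i];
}
```

When generating the stream we would do the reverse: sort our 19 bit lengths into CLEN_ORDER, then trim trailing zeros to pick HCLEN.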



Post by bscloutier » Mon Jan 15, 2018 3:03 pm

We now have a bit length array for our 19 symbol alphabet. At least we are familiar with this from the breadboard program. We can use the procedure to generate the Huffman code table. First we count the number of times each bit length is used. Then we calculate a starting code for each length. And then we assign sequential codes to the alphabet for each bit length group. So we can decipher Huffman codes. It's a good thing too because now what follows in the bit stream are Huffman codes from this table.
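The three steps above are the canonical code construction from RFC 1951 section 3.2.2. A direct sketch in C (helper name mine):

```c
#include <stdint.h>

#define MAX_BITS 15

/* Derive canonical Huffman codes from bit lengths (RFC 1951 3.2.2).
   lengths[i] is the code length for symbol i; 0 means unused. */
static void build_codes(const uint8_t *lengths, int n, uint16_t *codes)
{
    int bl_count[MAX_BITS + 1] = { 0 };
    uint16_t next_code[MAX_BITS + 1];
    uint16_t code = 0;

    for (int i = 0; i < n; i++)                   /* 1) count each length */
        bl_count[lengths[i]]++;
    bl_count[0] = 0;

    for (int bits = 1; bits <= MAX_BITS; bits++) { /* 2) starting codes   */
        code = (uint16_t)((code + bl_count[bits - 1]) << 1);
        next_code[bits] = code;
    }

    for (int i = 0; i < n; i++)                   /* 3) assign in order   */
        if (lengths[i] != 0)
            codes[i] = next_code[lengths[i]]++;
}
```

Running this against the RFC's own example, lengths (3, 3, 3, 3, 3, 2, 4, 4) for symbols A..H, yields codes 010, 011, 100, 101, 110, 00, 1110, 1111.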

Keeping up? We are talking about what is involved in decompressing a DEFLATE stream but really we are doing this so we know what to do when we compress our own data. So now is the time to consider dropping some bread crumbs because if you are going to create a DEFLATE stream you will need to find your way back through this.

Supposedly at this point we can read Huffman codes, locate the symbol they represent and retrieve the bit length array data we were originally looking for. Well, almost. While some of the values we obtain will be bit lengths for the next entry in a bit length array, there are 3 special codes each requiring a different action. The specification describes them as follows.
               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)
Here we see that when we encounter one of the codes 16, 17 or 18 we are required to pull 2, 3 or 7 additional bits respectively from the bit stream which define a repeat count.
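The expansion of one decoded symbol can be sketched as follows. This hypothetical helper assumes the symbol and its extra bits have already been pulled from the stream, and that a code of 16 never appears first (it needs a previous length to repeat):

```c
#include <stdint.h>

/* Apply one decoded code-length symbol (with its already-read extra bits)
   to the output sequence of bit lengths; returns the new write index. */
static int apply_clen_symbol(uint8_t *out, int pos, int sym, uint32_t extra)
{
    if (sym <= 15) {
        out[pos++] = (uint8_t)sym;            /* a literal length 0..15    */
    } else if (sym == 16) {                   /* repeat previous 3..6 times */
        uint8_t prev = out[pos - 1];
        for (uint32_t i = 0; i < 3 + extra; i++)
            out[pos++] = prev;
    } else if (sym == 17) {                   /* repeat zero 3..10 times    */
        for (uint32_t i = 0; i < 3 + extra; i++)
            out[pos++] = 0;
    } else {                                  /* 18: zero 11..138 times     */
        for (uint32_t i = 0; i < 11 + extra; i++)
            out[pos++] = 0;
    }
    return pos;
}
```

Feeding it the spec's example (8, then 16 with extra 11b, then 16 with extra 10b) produces 12 code lengths of 8.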

First we will receive Huffman codes to define HLIT + 257 bit lengths for the literal Huffman table. Then we will receive codes defining HDIST + 1 bit lengths for the distance Huffman table. But don't think the fun ends here.

HLIT and HDIST do not define the count of Huffman codes that follow. If you obtain a code that repeats a value that counts for that many bit lengths. That perhaps makes sense. But to just make things a little trickier, once you acquire the HLIT + 257 codes you immediately start defining the HDIST + 1 codes even if you are performing repetition. Yeah, a single repeat code can take you from one table to the next. If you are repeating some 0 bit lengths trailing in the HLIT bit length array you would just keep going to define any 0 bit lengths in the first part of the HDIST array. The specification says "code lengths form a single sequence of HLIT + HDIST + 258 values."

When you are generating these Huffman codes of course you don't have to force it to be a single sequence. You might just be wasting a few bits. Today that's not a big deal but it certainly must have been 40 years ago.

So start pulling Huffman codes. Remember you process these 1 bit at a time so you are starting with the most-significant bit of some code. With each bit you are either descending through a tree looking for a leaf and the corresponding symbol or otherwise collecting the code looking for a match in a code list. The former is faster but the latter easier to structure in memory (no tree). You proceed to process each symbol to define a bit length or repetition code.
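The "no tree" approach can be sketched as a bit-at-a-time match against the canonical code list. This is illustrative only (the structure names are mine); a real decoder would index by length rather than scan every symbol:

```c
#include <stdint.h>

/* A trivial bit source for the sketch: one stream bit per array entry. */
typedef struct { const uint8_t *bits; int pos; } bitsrc_t;

static uint32_t next_bit(bitsrc_t *s) { return s->bits[s->pos++]; }

/* Decode one symbol by collecting bits MSB-first and looking for a match
   in the canonical code list -- no tree in memory.  lengths[] and codes[]
   are the per-symbol bit lengths and canonical codes. */
static int decode_symbol(const uint8_t *lengths, const uint16_t *codes,
                         int n, bitsrc_t *src)
{
    uint32_t code = 0;
    for (int len = 1; len <= 15; len++) {
        code = (code << 1) | next_bit(src);   /* MSB of the code arrives first */
        for (int i = 0; i < n; i++)
            if (lengths[i] == (uint8_t)len && codes[i] == code)
                return i;
    }
    return -1;    /* corrupt stream */
}
```

With the RFC example table, the bit sequence 00 decodes to F and 1110 decodes to G, as expected.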

Now you have 2 tables of bit lengths and you know how to generate the Huffman codes for the associated alphabets. What follows next in our bit stream are the actual Huffman codes for the compressed data block. Each Huffman code will either define a literal value (byte) or a length code. The byte you just push into your uncompressed stream and the sliding window. In the case of a length code you would retrieve the additional bits defining the length of a matched sequence. You would then use the distance Huffman table for the next code that together with extra bits defines a distance back into the sliding window. Push the referenced string into your uncompressed stream and the sliding window. This is repeated until you encounter the end-of-block code (256). If BFINAL was set for the block you can save your now uncompressed data and you are done. Otherwise another block will follow.
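Pushing the referenced string into the output deserves one note of care. Because the match may overlap the bytes it is producing (distance less than length is how runs are encoded), the copy must proceed a byte at a time. A sketch:

```c
#include <stdint.h>

/* Expand a length/distance pointer by copying from the sliding window.
   When distance < length the match overlaps its own output, so each
   byte must be written before it can be read again. */
static int copy_match(uint8_t *out, int pos, int distance, int length)
{
    for (int i = 0; i < length; i++, pos++)
        out[pos] = out[pos - distance];
    return pos;
}
```

For example, starting from "ab", a distance of 2 with length 4 appends "abab".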

Now we follow this logic backwards to figure out for our own compression effort what we need to do to generate the proper DEFLATE format for our data.


Post by bscloutier » Mon Jan 15, 2018 3:22 pm

Okay, feel free to post questions, comments, corrections or whatever. I would be curious to know if I have helped anyone. I have written these posts as a form of review and preparation for myself. I am now ready to generate some C code in JANOS to make perhaps broad use of DEFLATE.
  • This will allow the existing JAR/ZIP JANOS command to create or modify compressed file collections. That will be helpful in applications that generate massive amounts of log data.
  • It will let JANOS create PNG graphics using drawing and plotting commands. This will allow us to easily display the data acquired by monitoring applications.
  • The WebServer can utilize DEFLATE to more efficiently transfer content. It can really help here as we already serve files (the DCP for example) directly out of a ZIP library. Whereas we presently decompress those files and send uncompressed content, the WebServer could forward the DEFLATE formatted data directly providing a bandwidth benefit.


Post by bscloutier » Tue Jan 16, 2018 7:23 am

The JANOS WebServer uniquely can locate and serve content directly from ZIP file collections. Generally files in a ZIP collection are compressed in DEFLATE format already. The WebServer can detect a browser's ability to accept content in DEFLATE format and transfer the compressed content directly. Why spend the time to decompress before transfer?

Since this does not involve a DEFLATE compressor it was quick to implement. Starting with JANOS v1.6.4 (now in Beta) the WebServer will utilize DEFLATE content encoding when transferring files already compressed in that format provided that the browser can accept it. It works nicely.


Post by bscloutier » Thu Jan 25, 2018 3:56 pm

I've been busy extending the JAR/ZIP command in JANOS to allow you to create, update and freshen an archive file. The first step was to just store files uncompressed. Getting all of that command line logic straight is enough of a headache. Once that was behind me I was ready to implement DEFLATE.

One difference in the JANOS implementation from the approach taken earlier in this thread is that I am going to work with buffers as opposed to a byte stream. Since the JNIOR uses a 64MB heap and generally consumes only a few MB of it I can load an entire file's content in a memory buffer. Yeah, files on the JNIOR aren't very large. This eliminates the queue approach to the sliding window. That helps with matching as it eliminates any need to test pointers for wrap around.

Where I previously used a bidirectional linked list to track occurrences of each byte value in the sliding window, I have gravitated to a series of queues tracking the last 256 matching bytes (or fewer as the case may be) in the 32KB preceding window. There is also no need now to keep an array of active matches since we are not streaming. So a pass through the appropriate queue for the current byte generally delivers a usable match and limits the search times in the interest of speed. I get results within 2% or so of what the PC generates for the same file. This is certainly acceptable for JANOS.
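The per-byte-value queue might be structured roughly like this. This is my own illustrative sketch of the idea, not the actual JANOS data structure:

```c
#include <stdint.h>

#define QLEN 256   /* remember up to the last 256 positions per byte value */

/* Circular queue of recent positions for one byte value.  One of these
   per possible byte value gives the match search a short list of
   plausible starting points within the 32KB window. */
typedef struct { uint32_t pos[QLEN]; int head, count; } posqueue_t;

static void pq_push(posqueue_t *q, uint32_t position)
{
    q->pos[q->head] = position;
    q->head = (q->head + 1) % QLEN;
    if (q->count < QLEN)
        q->count++;           /* saturates; old entries are overwritten */
}
```

The match loop then walks only the queue for the current byte, skipping any positions that have slid out of the 32KB window.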

I use the code length generator algorithm that I devised previously so as to not have to physically generate any Huffman tree. For certain test files this tended to hit the DEFLATE maximum code length limits. So I will describe a variation that seems to avoid that problem.

I will go through the implementation details here over the next day or so and maybe share some of the code. Remember that I am writing in C so...


Post by bscloutier » Fri Jan 26, 2018 11:51 am

DEFLATE Compressor Implementation (C Language)

The goal is to compress using DEFLATE a byte buffer filled with the entire contents of a file or other data. I want a simple function call like this:
FZIP_deflate(fbuff, filesize, &cbuff, &csize)
This function needs to return two things: a buffer containing the DEFLATE formatted bit stream and the compressed length. There are other ways to return these parameters but for JANOS, which has to remain thread-safe (cannot use static variables), this works. The routine returns TRUE if we can successfully compress the data. It might return FALSE if in compressing the data we decide the result isn't going to be worth it. This function is used as follows in the JAR/ZIP command in JANOS.

				// optionally compress
				compressed = FALSE;
				compsize = filesize;
				if (filesize >= 128)
				{
					char *cbuff = NULL;
					uint32_t csize;

					// compress the content
					if (FZIP_deflate(fbuff, filesize, &cbuff, &csize))
					{
						compressed = TRUE;
						compsize = csize;
						MM_free(fbuff);
						fbuff = cbuff;
					}
					else
						MM_free(cbuff);
				}

Here we see that successful compression replaces the uncompressed buffer and modifies the compressed data size. It sets the compressed flag so the file can be properly saved in the JAR or ZIP archive. Like magic we have a compressed file!

Preface

Why am I doing this? I mean there is code out there and people have been able to construct archives and compress files literally for multiple decades. Why reinvent the wheel?

Well there are multiple reasons. First is a design goal for JANOS. This operating system uses no third-party written code. That sounds crazy but what it means is that there is no bug or performance issue that we cannot correct, change, improve or alter. And, this can be done quickly and in a timely fashion. Every bit of code is understood, clearly written and documented. It is written for a single target and not littered with conditional compilation blocks which obfuscate everything. If you support an operating system you might see how you could be envious of this.

Another reason is educational. Now that maybe is selfish but if I am going to be able to fully debug something I need to fully understand it. We cannot tolerate making what seems like a simple bug correction which later turns out to break some other part of the system. The only way to minimize that risk is for me to know exactly what is going on. Yeah there is a JVM in here. Yeah it does TLS v1.2 encryption. It's been fun.

The real problem though is that it is difficult to find good and complete technical information on the net. Yes, there is RFC 1951 defining DEFLATE. It does not tell you how to do it; it just tells you what you need to do. And some aspects of it are not clear until you encounter it (or recreate it) in action. It describes LZ77 but you don't realize that this is very difficult to implement without having it take 5 minutes to compress 100 KB.

There are numerous web pages discussing DEFLATE, some by reputable universities. These usually include a good discussion of Huffman coding. Yet I have found none that creates a Huffman table actually meeting the additional requirements for DEFLATE. If you are going to describe Huffman in connection with DEFLATE, shouldn't it be compatible? Would it have helped if you had actually implemented DEFLATE before describing it?

The procedure to create a compatible Huffman tree is not described anywhere that I have found. Most don't even mention that you need 3 separate Huffman tables (one for literals, one for distance codes and one for code lengths) and that there is a limit of a 15-bit code length for 2 of the tables and 7 bits for the third. Then they say only to "adjust the Huffman table" accordingly. So there is no procedure for generating a less than optimum Huffman tree meeting the DEFLATE restrictions. I had to get creative myself.

Enough of my rant. The result really is that I have had to reinvent the wheel. I am not the only one to have done so. I am going to try to document it here for your edification.
