DEFLATE Compression Algorithm

INTEG's own Real-time Operating System. No 3rd party code here. 100% Supportable.

Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 26, 2018 12:39 pm

Overview

First let me greatly over-simplify the process and provide an outline for the compression procedure.
  1. Perform efficient LZ77, scanning the uncompressed data byte-by-byte and filling a 64 KB interim buffer with raw unmatched literal bytes and escaped length-distance references.
  2. Scan the interim LZ77 data creating two DEFLATE compatible Huffman tables, one for literals and length codes combined and one for distance codes.
  3. Assign code lengths (15 bits max) to the used alphabet for both tables.
  4. Determine the size of the alphabet required for the literal/length and distance Huffman tables.
  5. Combine code lengths into a single run-length compressed array.
  6. Determine the DEFLATE compatible Huffman table needed to code this compressed code length array.
  7. Assign code lengths (7 bits max) to the used alphabet for code lengths.
  8. Sort the resulting code lengths for this 3rd Huffman table into the unique order specified for DEFLATE and determine the length of the array (trim trailing zeroes).
  9. Output the block header marking only the last as BFINAL and output the alphabet sizes.
  10. Output the reordered code lengths.
  11. Output the Huffman codes compressing the run-length encoded combined code length array. Insert extra bits where required.
  12. Output the Huffman codes compressing the LZ77 data. Use the literal table for literals and sequence lengths. Use the distance table for distance codes. Insert extra bits where required.
  13. Output end-of-block code.
  14. If not the final block keep the bit stream going and continue LZ77 at step #1.
Uh. That about summarizes it. All we can do is to push through this step by step. It amounts to two phases. The first compresses the data using LZ77 and determines the Huffman coding requirements. The second outputs a bit stream encoding the Huffman tables and then the actual data. We first determine everything that we need for the bit stream and then generate the bit stream itself.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 26, 2018 1:49 pm

Time-Efficient LZ77 Compression

First off let me note that we end up trading off compression ratio for processing speed. As we had seen in earlier Java experiments, we could discover every possible matching sequence for a block of data and then analyze those matches, selecting the optimum set. This arguably would create the best compression possible. It is certainly doable if we are running on a multi-core GHz processor and it is coded carefully. Still it would be lengthy, and possibly not appropriate even then for some applications. The gain in compression ratio is expected to be only slight and not worth the processing cycles. It is certainly not critical for JANOS. So we will not go down this path.

Another approach is what is called lazy matching. Here the concern is that a matched sequence might prevent a longer matching sequence, starting in the next byte or two, from being used. In the analysis it appears that 1 or 2 bytes may be unnecessarily forced into the output stream in these situations. Those may be rare, but for certain types of data it could be more of a concern. Again, the gain in compression ratio from taking the time to perform lazy matching is assumed not to be necessary for the JNIOR.

As a result we are just going to go ahead and perform straight-up sequence match detection for each byte position in the data. Even with this, the amount of processing involved prevents us from using any kind of brute force scanning. Imagine how much processing would be involved if for each byte in the file we had to compare it directly against 32 KB of bytes in the preceding sliding window. For a large file this starts to become a very large number of processor cycles. It gets even worse when bytes match and you need to check following bytes to determine the usability of the sequence.

I had implemented the brute force search with no trickery at first. Small files produced proper LZ77 output in a short time but a 20 KB JSON file took almost 30 seconds (this is a JNIOR, remember). A large binary file basically stalled the system. The JSON performed more poorly due to the high occurrence of certain characters such as curly braces, quotes and colons. That code has long been discarded or I would include it here for better understanding.

In the prior Java experimentation I used a linked list to track the occurrences of matching byte values. This eliminated the need to scan the entire 32 KB sliding window, saving time. For the JANOS implementation I decided to use a pointer queue for each byte value, each queue holding a maximum of 256 pointers. So basically we would test only the last 256 matching byte positions for usable sequences. This might miss some good sequence matches deep in the sliding window for very frequently occurring byte values, but not for those appearing less often. Again it's a trade-off. It is an interesting approach.

I had figured that I could adjust the depth of these pointer queues. Since there are 256 possible literal values, each with 256 pointers requiring 4 bytes apiece, this results in a matrix of 65,536 pointers or 256 KB of memory. There is room for that in the JANOS heap. Increasing the depth increases the memory requirement as well as the time spent in sequence detection. I was pleased with the results at a depth of 256. Perhaps later I will conduct some experiments plotting the effects of this parameter on compression ratio and execution time.

I will present the resulting routine in the next post.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 26, 2018 2:39 pm

Here is the resulting routine. This handles only the LZ77, leaving all of the Huffman coding to the routine responsible for flushing the interim buffer. Note that I have made both the sliding window and depth parameters adjustable through the Registry. I will use this later to measure performance.

/* -- function ---------------------------------------------------------------
** FZIP_deflate()
**
** Compresses the supplied buffer.
**
** -------------------------------------------------------------------------*/
int FZIP_deflate(char *inb, uint32_t insize, char **outbuf, uint32_t *outsize)
{
	char *obuf;
	int optr;
	int err = FALSE;
	int curptr, len, seqlen;
	char *seqptr, *p1, *p2, *s1, *s2;
	struct bitstream_t stream;
	int *matrix, *mat, *track;
	int ch, trk;
	int window, depth;

	// obtain sliding window size
	window = REG_getRegistryInteger("Zip/Window", 16384);
	if (window < 2048)
		window = 2048;
	else if (window > 32768)
		window = 32768;

	// obtain tracking queue depth
	depth = REG_getRegistryInteger("Zip/Depth", 256);
	if (depth < 16)
		depth = 16;
	else if (depth > 1024)
		depth = 1024;

	// check call
	if (outbuf == NULL || outsize == NULL)
		return (FALSE);

	// initialize bit stream
	memset(&stream, 0, sizeof(struct bitstream_t));
	stream.buffer = MM_alloc(insize + 1024, &_bufflush);

	// create an output buffer
	obuf = MM_alloc(64 * 1024, &FZIP_deflate);
	optr = 0;

	// initialize matrix
	matrix = MM_alloc(256 * depth * sizeof(int), &FZIP_deflate);
	track = MM_alloc(256 * sizeof(int), &FZIP_deflate);

	// process uncompressed stream byte-by-byte
	curptr = 0;
	while (curptr < insize)
	{
		// get current byte value
		ch = inb[curptr];

		// Locate best match. This is the longest match located the closest to the curPtr. This
		//  is intended to be fast at the slight cost in compression ratio. We do not handle lazy
		//  matches or block optimization (selective matching). Only seqlen of 3 or more matter so
		//  we initialize seqlen to 2 to limit unnecessary best match updates. Try to limit cycles in
		//  this loop.
		mat = &matrix[depth * ch];
		trk = track[ch] - 1;
		if (trk < 0)
			trk = depth - 1;
		seqlen = 2;
		p2 = &inb[curptr];
		while (trk != track[ch])
		{
			if (mat[trk] == 0 || curptr - mat[trk] >= window)
				break;

			s1 = p1 = &inb[mat[trk] - 1];
			s2 = p2;
			while (*s1 == *s2)
				s1++, s2++;

			// check for improved match
			len = s1 - p1;
			if (len > seqlen)
			{
				seqptr = p1;
				seqlen = len;
			}

			trk--;
			if (trk < 0)
				trk = depth - 1;
		}

		// track the character
		mat[track[ch]] = curptr + 1;
		track[ch]++;
		if (track[ch] >= depth)
			track[ch] = 0;

		// check validity (match past end of buffer)
		if (curptr + seqlen > insize)
			seqlen = insize - curptr;

		// If we have a good sequence we output a pointer and advance curPtr
		if (seqlen >= 3)
		{
			// check maximum allowable sequence which is 258 bytes but we reserve one
			//  for 0xff escaping
			if (seqlen > 257)
				seqlen = 257;

			// escape length-distance pointer
			obuf[optr++] = 0xff;
			obuf[optr++] = seqlen - 3;
			*(short *)&obuf[optr] = p2 - seqptr;
			optr += 2;

			// advance curPtr
			curptr += seqlen;
			PROC_yield();
		}

		// otherwise we output the raw uncompressed byte and keep searching
		else
		{
			// escape 0xff
			if (*p2 == 0xff)
				obuf[optr++] = 0xff;
			obuf[optr++] = *p2;
			curptr++;
		}

		// flush output buffer as needed. Because we escape 0xff our length-pointer encoding
		//  requires 4 bytes. This at times replaces a 3-byte match and so the compression
		//  ratio in this buffer is compromised. That will be corrected as we move into
		//  Huffman coding. Blocks will then be slightly less than 64KB. We also want to
		//  flush this before over-running it.
		if (optr > 65530)
		{
			if (!_bufflush(obuf, optr, &stream, FALSE))
				err = TRUE;

			// if compression seems fruitless
			if (stream.length > curptr)
				err = TRUE;

			optr = 0;
		}

		if (err)
			break;
	}

	// Flush remaining data
	if (!err && !_bufflush(obuf, optr, &stream, TRUE))
		err = TRUE;

	// if compression was fruitless
	if (stream.length > insize)
		err = TRUE;

	// clean up
	MM_free(obuf);
	MM_free(track);
	MM_free(matrix);

	// return
	*outbuf = stream.buffer;
	*outsize = stream.length;
	return (!err);
}

So the preliminaries are over once the registry parameters are read and the matrix and tracking arrays are allocated. You might notice that when JANOS allocates memory it retains a reference pointer. That is used to locate memory leaks among other things. I can see quickly where any block is allocated.

The main loop processes byte-by-byte through the uncompressed data. After that we merely flush any partially compressed block and are done.

The first part of the loop searches for the best match. The middle either outputs the sequence length-distance reference or the raw unmatched data byte. After that we check to see if we need to flush the interim buffer.

The magic occurs in the search where we employ the pointer matrix. The track array keeps the current position in each byte value's queue. This is where we add the pointer to the current byte once we've searched it. The search runs backward through the queue either until we've checked all of the pointers (wraps back to the starting position) or we run out of the sliding window (pointer reference too far back in the data). This checks for closer matches first. When a longer sequence of characters matches the current position we update our best match.

After that if we have a sequence of 3 or more bytes we output the length-distance reference otherwise we output the current byte and move on to process the next.

I know that I can optimize this some more. I haven't gone through that step of trying to minimize processor cycles. We do use the RX compiler optimization. But... this is my approach for LZ77. There is no hash table or confusing prefix-suffix tricks. It seems to run fast enough at least for our needs at this point. The LZ77 search is where all of the time is consumed.

Next we will look at what happens when the interim buffer is flushed.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 26, 2018 4:46 pm

The FZIP_deflate() routine involves 90+ percent of the processing time and handles only step #1 in the prior outline. The buffer flush routine handles all of the remaining items.

We will be creating three different Huffman tables. These amount to a structure for each alphabet symbol.
struct huff_t {
	uint16_t code;	// bit pattern used when coding the symbol
	uint16_t len;	// assigned code length in bits
	uint16_t cnt;	// frequency: count of occurrences in the data
	uint16_t link;	// linked list used during code length determination
};
Here the cnt will represent the symbol's frequency. It is the count of occurrences of the symbol's value in the data. The len will eventually be the code length in bits assigned to the symbol. The code will hold the bit pattern to be used in coding. And the link will be used in the code length determination routine.

The _bufflush() routine is called with a buffer of LZ77 compressed data. This includes raw data bytes that could not be included in a sequence and length-distance references for matches. Since data can take on all byte values I use a form of escaping. The escape character is 0xFF, and a 0xFF in the data is represented by repeating the escape (e.g. 0xff 0xff). Otherwise the byte following the escape is the sequence length minus 3, and the short value after that is the distance. To make this work I disallow a sequence length of 258 since that would be confused with an escaped 0xFF. I don't see this as having any significant impact. Later I could find a way around that but for now it works.
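To make the interim format concrete, here is a minimal sketch of the two token emitters. These helpers are hypothetical, for illustration only; the actual logic is inline in FZIP_deflate() above. The distance is written high byte first to match the way _bufflush() reads it back.

/* Interim LZ77 buffer format (illustrative sketch, not the JANOS code):
**   literal b (b != 0xFF)            ->  b
**   literal 0xFF                     ->  FF FF
**   match of length L (3..257)
**     at distance D (1..32768)       ->  FF (L-3) D-high D-low
*/
static void emit_literal(char *obuf, int *optr, int b)
{
	if ((b & 0xff) == 0xff)
		obuf[(*optr)++] = 0xff;		// escape the escape character
	obuf[(*optr)++] = b;
}

static void emit_match(char *obuf, int *optr, int len, int dist)
{
	obuf[(*optr)++] = 0xff;			// escape marks a reference
	obuf[(*optr)++] = len - 3;		// 0..254; 258 is disallowed so this never reads as 0xFF
	obuf[(*optr)++] = (dist >> 8) & 0xff;	// distance high byte first,
	obuf[(*optr)++] = dist & 0xff;		//  matching the read in _bufflush()
}

For example, a match of length 10 at distance 300 emits FF 07 01 2C, while a literal 0xFF emits FF FF.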

// Routine applies DEFLATE style Huffman coding to the buffer content.
static int _bufflush(char *obuf, int osize, struct bitstream_t *stream, int final)
{
	struct huff_t *littbl;
	struct huff_t *dsttbl;
	struct huff_t *cnttbl;
	int n, len, c, code, dst, lastc, ncnt;
	int *cntlist, *litcnts, *dstcnts, *bitcnts, *bitlens;
	int hlit, hdist, hclen;
	int *startcd;
	int totdist = 0;
	int err = FALSE;

	// Now we need to construct two Huffman trees (although I am going to avoid
	//  actual trees). One for the literal data and one for the distance codes.
	//  Note that extra bits are just extra bits inserted in the stream.
	littbl = MM_alloc(286 * sizeof(struct huff_t), &_bufflush);
	dsttbl = MM_alloc(30 * sizeof(struct huff_t), &_bufflush);

	// Not a loop. This allows the use of break.
	for (;;)
	{

		// Now we analyze the data to determine frequencies. Note that this is complicated
		//  just a bit because of the escaping that I have had to use. I will have to
		//  temporarily decode length and distance encoding. We'll have to do that again
		//  later when we stream the coding. We will also use one end-of-block code so we
		//  virtually count it first.
		littbl[256].cnt = 1;
		for (n = 0; n < osize; n++)
		{
			// tally literal if not escaped
			if (obuf[n] != 0xff)
				littbl[obuf[n]].cnt++;
			else
			{
				// check and tally escaped 0xff
				if (obuf[++n] == 0xff)
					littbl[0xff].cnt++;
				else
				{
					totdist++;

					// table defined above
					//static const int lcode_ofs[29] = {
					//	3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
					//	35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
					//};

					// determine required length code for lengths (3..258). This code is
					//  coded in the literal table.
					len = (obuf[n++] & 0xff) + 3;
					for (c = 0; c < 29; c++)
						if (lcode_ofs[c] > len)
							break;
					code = 256 + c;
					littbl[code].cnt++;

					// table define above
					//static const int dcode_ofs[30] = {
					//	1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
					//	257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
					//	8193, 12289, 16385, 24577
					//};

					// determine required distance code for distances (1..32768). This code is
					//  coded in the distance table.
					dst = (obuf[n++] & 0xff) << 8;
					dst |= (obuf[n] & 0xff);
					for (c = 0; c < 30; c++)
						if (dcode_ofs[c] > dst)
							break;
					code = c - 1;
					dsttbl[code].cnt++;
				}
			}
		}

So here we create two of the Huffman tables, one for the literal alphabet (0..285) and one for the distance alphabet (0..29). Note that when JANOS allocates memory it is zero filled.

Don't be confused by my use of for(;;) { }. This is not an infinite loop. In fact it is not a loop at all. Rather it allows me to exit the procedure at any point using break; and I just have to remember to place a break; at the very end. There are other ways to achieve the same thing.

The first step is to determine the frequency of symbols from both alphabets. Here we scan the supplied data and count the literals. The escaped length-distance references are translated temporarily into their length codes and distance codes. Those are tallied in the appropriate table. The extra bits are ignored. Length codes are combined with literal data since when they are read you don't know which it will be. The distance codes use their own alphabet. We will have to do this same translation again later when we encode the references for output. Then, of course, we will insert the required extra bits.

Note that this also tallies one end-of-block code (0x100) as we will be using that.
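As a concrete check of that translation, here is a small standalone program (an illustrative harness, not JANOS code) built around the lcode_ofs and dcode_ofs tables quoted in the listing above. A match of length 67 at distance 300 tallies symbol 277 in the literal/length table and symbol 16 in the distance table:

#include <stdio.h>

static const int lcode_ofs[29] = {
	3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
	35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
static const int dcode_ofs[30] = {
	1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
	257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
	8193, 12289, 16385, 24577
};

int main(void)
{
	int len = 67, dst = 300, c;

	// same scan as _bufflush(): stop at the first table entry beyond the value
	for (c = 0; c < 29 && lcode_ofs[c] <= len; c++)
		;
	printf("length %d   -> literal/length symbol %d\n", len, 256 + c);

	for (c = 0; c < 30 && dcode_ofs[c] <= dst; c++)
		;
	printf("distance %d -> distance symbol %d\n", dst, c - 1);
	return (0);
}

The extra bits (the offset of the actual length or distance within the code's range) are not tallied here; as noted above they are simply inserted into the bit stream later when we encode the references.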

If I were to dump these two tables after this they may look something like this. These are just the symbols that occur in a particular set of data - the jniorsys.log file. This is the symbol value followed by its count.

 0x00a 2
 0x00d 2
 0x020 70
 0x027 4
 0x028 11
 0x029 2
 0x02a 2
 0x02b 7
 0x02c 7
 0x02d 20
 0x02e 133
 0x02f 16
 0x030 105
 0x031 153
 0x032 124
 0x033 124
 0x034 103
 0x035 110
 0x036 97
 0x037 90
 0x038 93
 0x039 84
 0x03a 74
 0x03c 1
 0x03d 3
 0x03e 3
 0x041 5
 0x043 4
 0x044 1
 0x046 1
 0x048 1
 0x04a 3
 0x04c 2
 0x04d 4
 0x04e 4
 0x04f 4
 0x050 8
 0x052 8
 0x053 4
 0x054 4
 0x055 1
 0x057 3
 0x05a 1
 0x05f 3
 0x061 34
 0x062 4
 0x063 15
 0x064 26
 0x065 37
 0x066 9
 0x067 9
 0x068 8
 0x069 36
 0x06a 6
 0x06b 3
 0x06c 21
 0x06d 13
 0x06e 36
 0x06f 41
 0x070 24
 0x071 1
 0x072 29
 0x073 20
 0x074 29
 0x075 11
 0x076 6
 0x077 7
 0x078 3
 0x079 6
 0x07a 4
 0x100 1
 0x101 962
 0x102 457
 0x103 136
 0x104 127
 0x105 55
 0x106 71
 0x107 47
 0x108 14
 0x109 67
 0x10a 208
 0x10b 103
 0x10c 40
 0x10d 52
 0x10e 44
 0x10f 15
 0x110 42
 0x111 180
 0x112 287
 0x113 30
 0x114 5
 0x115 12
 0x116 6

 0x002 1
 0x003 2
 0x008 2
 0x009 2
 0x00a 13
 0x00b 150
 0x00c 166
 0x00d 40
 0x00e 127
 0x00f 63
 0x010 157
 0x011 112
 0x012 155
 0x013 116
 0x014 198
 0x015 169
 0x016 240
 0x017 197
 0x018 288
 0x019 215
 0x01a 321
 0x01b 226


Re: DEFLATE Compression Algorithm

Post by bscloutier » Fri Jan 26, 2018 5:08 pm

We need to assign code lengths to these alphabet symbols. Here there are two tables and later we will process a third. So the procedure is handled by a separate routine.

		// The next step is to determine optimum bit lengths for the various codes based upon their
		//  frequency of occurrence. The procedure is to sort the symbols by decreasing frequency
		//  and then to combine pairs incrementing the bit length for every symbol belonging to
		//  the pair. This works and it is fast. I don't need to build any tree. This terminates
		//  if bit length (code length) exceeds 15.
		if (!_bitlength(littbl, 286, 15) || !_bitlength(dsttbl, 30, 15))
		{
			err = TRUE;
			break;
		}

The _bitlength() routine assigns code lengths, creating a possibly slightly less-than-optimal Huffman table whose code lengths do not exceed the 15-bit maximum. The assignments must make the Huffman tables compatible with the DEFLATE requirements.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Mon Jan 29, 2018 4:01 pm

Huffman Tables for DEFLATE

The buffer full of data that we have, which has been compressed using LZ77, will be compressed further using Huffman coding. The DEFLATE format specifies two separate Huffman code sets. One encodes the literal bytes in the data (0..255), the end-of-block code (256), and the sequence match length codes (257..285). The second Huffman code set encodes the distance codes (0..29). We can use a separate table for that because we know when we are reading a distance code, as one always follows a length code. We never know whether we are reading a literal or a length code so those need to be decoded the same way and therefore from the same table.

Previously we scanned the data and counted the occurrences of each symbol. We now know which symbols occur in the data and how frequently. We have defined our alphabets for each Huffman table. Now we need to create the Huffman trees themselves. This is where things get tricky.

Creating a Huffman tree is not very difficult. But creating a Huffman tree that is compatible with the DEFLATE format is quite another thing altogether. The DEFLATE specification dictates that the Huffman trees must meet two additional rules. In fact they need to adhere to three rules. The third is mentioned later in the specification.
  1. All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent.
  2. Shorter codes lexicographically precede longer codes.
  3. The bit length cannot exceed 15 bits for the literal and distance code sets. It cannot exceed 7 bits for the code length set (comes into play much later).
The first trick is to not generate a tree at all. If you create a tree using the standard Huffman approach you are almost guaranteed to not have a tree that is usable for DEFLATE. All you need from that effort are the bit lengths that end up being assigned to each symbol. You can get those from the same procedure without dealing with right and left links and an actual tree structure. You then use the procedure defined in the DEFLATE specification to create the compatible tree.

The standard Huffman approach is to take a list of all the symbols that occur in the data and sort it in descending frequency. Combine the rightmost two least frequent symbols into a node whose frequency is the total of the two symbols. Next re-sort the list so this new node repositions itself according to its combined frequency. Now repeat the process combining the next two rightmost entries, which may be symbols (leaves) or previously created nodes. This continues until you have just one node which is the head of your tree.

We are not going to bother to build the tree structure. We are only going to keep a list of the symbols that fall beneath a node. We are also going to realize that the combination of the two rightmost entries in the sorted list merely increases by one the bit length of each of the new node's member symbols. When we finally reach the point where there is only one node in the list we would have assigned a bit length to every symbol based on its frequency. That is all we need to then create the codes for Huffman coding in DEFLATE format using the procedure for that outlined in the specification.

Here is where things really get confusing. This process doesn't always create a DEFLATE compatible Huffman code set. Sometimes the bit length will exceed 15 (or 7 for the table later). We need a procedure for dealing with that. It amounts to being able to create a less than optimal Huffman tree with bit lengths limited to a maximum. This was a puzzle but I have a way to get it done.

So next I'll take us through examples.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Tue Jan 30, 2018 9:14 am

Let's use an example. Here we have an alphabet of 19 symbols (0..18). The data set consists of 150 of these and after counting the occurrences of each we have the following. To make discussion simpler I will assign these symbols uppercase names. On the right are the results of the tally for each.
 
A  0x000 4
B  0x001 0
C  0x002 0
D  0x003 2
E  0x004 6
F  0x005 3
G  0x006 2
H  0x007 2
J  0x008 53
K  0x009 26
L  0x00a 5
M  0x00b 4
N  0x00c 3
P  0x00d 1
Q  0x00e 1
R  0x00f 1
S  0x010 37
T  0x011 0
U  0x012 0
In the standard approach to creating a Huffman table we ignore the symbols that do not appear in the data and arrange the others in order of decreasing frequency.
Used symbols:
  A    D    E    F    G    H    J    K    L    M    N    P    Q    R    S 
  4    2    6    3    2    2   53   26    5    4    3    1    1    1   37

Sorted by decreasing frequency:
  J    S    K    E    L    A    M    F    N    D    G    H    P    Q    R
 53   37   26    6    5    4    4    3    3    2    2    2    1    1    1
Next we combine the lowest two frequency symbols into a new node with a combined total. I will name the nodes with lowercase characters just so you can track them. The shorter list is then re-sorted before proceeding to repeat. The process continues until there is only one node. Here I will show only the combining action for each step. We will get into more detail afterwards.
Starting set:
  J    S    K    E    L    A    M    F    N    D    G    H    P    Q    R
 53   37   26    6    5    4    4    3    3    2    2    2    1    1    1

Step #1 combines Q and R into (a) with new frequency of 2. The list is resorted.
  J    S    K    E    L    A    M    F    N    D    G    H   (a)   P
 53   37   26    6    5    4    4    3    3    2    2    2    2    1

Step #2 combines (a) and P into (b) with new frequency of 3.
  J    S    K    E    L    A    M    F    N   (b)   D    G    H
 53   37   26    6    5    4    4    3    3    3    2    2    2

Step #3 combines G and H into (c) with new frequency of 4.
  J    S    K    E    L    A    M   (c)   F    N   (b)   D
 53   37   26    6    5    4    4    4    3    3    3    2

Step #4 combines (b) and D into (d) with new frequency of 5.
  J    S    K    E    L   (d)   A    M   (c)   F    N
 53   37   26    6    5    5    4    4    4    3    3

Step #5 combines F and N into (e) with new frequency of 6.
  J    S    K    E   (e)   L   (d)   A    M   (c)
 53   37   26    6    6    5    5    4    4    4

Step #6 combines M and (c) into (f) with new frequency of 8.
  J    S    K   (f)   E   (e)   L   (d)   A
 53   37   26    8    6    6    5    5    4

Step #7 combines (d) and A into (g) with new frequency of 9.
  J    S    K   (g)  (f)   E   (e)   L
 53   37   26    9    8    6    6    5

Step #8 combines (e) and L into (h) with new frequency of 11.
  J    S    K   (h)  (g)  (f)   E
 53   37   26   11    9    8    6

Step #9 combines (f) and E into (i) with new frequency of 14.
  J    S    K   (i)  (h)  (g)
 53   37   26   14   11    9

Step #10 combines (h) and (g) into (j) with new frequency of 20.
  J    S    K   (j)  (i)
 53   37   26   20   14

Step #11 combines (j) and (i) into (k) with new frequency of 34.
  J    S   (k)   K
 53   37   34   26

Step #12 combines (k) and K into (m) with new frequency of 60.
 (m)   J    S
 60   53   37

Step #13 combines J and S into (n) with new frequency of 90.
 (n)  (m)
 90   60

Step #14 combines (n) and (m) into our final node (p) with new frequency of 150.
 (p)
150

We are done.

Did you notice how nodes that we created were often quickly reused in another combination? This is what leads to a tree structure exceeding the maximum code length. Imagine a tree with one long branch down its right edge. It is very common when there are a few symbols that appear with high frequency and the balance are relatively low frequency symbols.

This combining-nodes exercise was all well and good but something else needs to occur to make it useful. In the typical Huffman case, in creating a node you would assign one of the combining nodes to the left link (bit = 0) and the other to the right link (bit = 1). This would then develop a tree structure that would work for you, although most likely not one that is DEFLATE compatible.

I am going to avoid the tree but note that in the act of combination all of the symbols participating will have their code length increased by 1. Combining leaves into a node creates another level in the tree. That's what we are doing. In the code I will use a linked list to simply collect the symbols that are a member of (or lie below) any given node. In combining I will concatenate the member lists for the two leaves/nodes being combined and then increment the code length for each member.

For each step I am going to show the node membership and the code length for our alphabet as we proceed through the process. Hopefully this table will make sense to you.
             Step   1    2    3    4    5    6    7    8    9    10   11   12   13   14  clen
A  0x000 4     0    0    0    0    0    0    0    1g   1g   1g   2j   3k   4m   4m   5p   5
D  0x003 2     0    0    0    0    1d   1d   1d   2g   2g   2g   3j   4k   5m   5m   6p   6
E  0x004 6     0    0    0    0    0    0    0    0    0    1i   1i   2k   3m   3m   4p   4
F  0x005 3     0    0    0    0    0    1e   1e   1e   2h   2h   3j   4k   5m   5m   6p   6
G  0x006 2     0    0    0    1c   1c   1c   2f   2f   2f   3i   3i   4k   5m   5m   6p   6
H  0x007 2     0    0    0    1c   1c   1c   2f   2f   2f   3i   3i   4k   5m   5m   6p   6
J  0x008 53    0    0    0    0    0    0    0    0    0    0    0    0    0    1n   2p   2
K  0x009 26    0    0    0    0    0    0    0    0    0    0    0    0    1m   1m   2p   2
L  0x00a 5     0    0    0    0    0    0    0    0    1h   1h   2j   3k   4m   4m   5p   5
M  0x00b 4     0    0    0    0    0    0    1f   1f   1f   2i   2i   3k   4m   4m   5p   5
N  0x00c 3     0    0    0    0    0    1e   1e   1e   2h   2h   3j   4k   5m   5m   6p   6
P  0x00d 1     0    0    1b   1b   2d   2d   2d   3g   3g   3g   4j   5k   6m   6m   7p   7
Q  0x00e 1     0    1a   2b   2b   3d   3d   3d   4g   4g   4g   5j   6k   7m   7m   8p   8
R  0x00f 1     0    1a   2b   2b   3d   3d   3d   4g   4g   4g   5j   6k   7m   7m   8p   8
S  0x010 37    0    0    0    0    0    0    0    0    0    0    0    0    0    1n   2p   2
In this table we follow the code length (clen) associated with each occurring symbol through the step by step combination process. Here as a symbol is combined into a new node (letter changes) we increment its bit depth or code length. The final column shows the resulting clen for this table.

So we mechanically have shown how to derive the code lengths for an alphabet symbol set with given frequencies. Next we create the DEFLATE Huffman code table for this.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Tue Jan 30, 2018 9:37 am

My code to perform the node creation and code length incrementing looks something like this.

// Establish bit length for the Huffman table based upon frequencies
static int _bitlength(struct huff_t *tbl, int ncodes, int maxb)
{
	uint16_t *list = MM_alloc(ncodes * sizeof(uint16_t), &_bitlength);
	int *freq = MM_alloc(ncodes * sizeof(int), &_bitlength);
	int nlist = 0;
	int n, c, p;
	int ret = TRUE;

	// List all of the symbols used in the data along with their frequencies. Note that
	//  we store pointers +1 so as to keep 0 as a linked list terminator.
	for (n = 0; n < ncodes; n++)
		if (tbl[n].cnt > 0)
		{
			list[nlist] = n + 1;
			freq[nlist] = tbl[n].cnt;
			nlist++;
		}

	// Note that there is a special boundary case when only 1 code is used. In this case
	//  the single code is encoded using 1 bit and not 0.
	if (nlist == 1)
		tbl[list[0] - 1].len = 1;

	// process this list down to a single node
	while (nlist > 1)
	{
		// sort the list by decreasing frequency
		for (n = 0; n < nlist - 1; n++)
			if (freq[n] < freq[n + 1])
			{
				// swap order
				c = list[n];
				list[n] = list[n + 1];
				list[n + 1] = c;
				c = freq[n];
				freq[n] = freq[n + 1];
				freq[n + 1] = c;

				// need to sort back
				if (n > 0)
					n -= 2;
			}

		// Combine the member lists associated with the last two entries. We combine the
		//  linked lists for the two low frequency nodes.
		p = list[nlist - 2];
		while (tbl[p - 1].link)
			p = tbl[p - 1].link;
		tbl[p - 1].link = list[nlist - 1];

		// The new node has the combined frequency.
		freq[nlist - 2] += freq[nlist - 1];
		nlist--;

		// Increase the code length for members of this node.
		p = list[nlist - 1];
		while (p)
		{
			tbl[p - 1].len++;
			p = tbl[p - 1].link;
		}

	}

	MM_free(freq);
	MM_free(list);
	return (ret);
}

You might note the check handling the special boundary case when only one symbol of our alphabet is used. Here we need to use a code length of 1 and not 0.

Now to be fair there is a lot more that I will be adding to this routine before we are done.
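For anyone wanting to experiment, here is a standalone harness (stubbing JANOS's MM_alloc with calloc; illustrative only, not the JANOS source) that runs this same combine procedure on the 19-symbol example from the earlier posts. It should reproduce the clen column derived by hand above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct huff_t {
	uint16_t code, len, cnt, link;
};

// same algorithm as _bitlength(), with calloc standing in for MM_alloc
static void bitlength(struct huff_t *tbl, int ncodes)
{
	uint16_t *list = calloc(ncodes, sizeof(uint16_t));
	int *freq = calloc(ncodes, sizeof(int));
	int nlist = 0, n, c, p;

	// list used symbols (pointers stored +1 so 0 terminates linked lists)
	for (n = 0; n < ncodes; n++)
		if (tbl[n].cnt > 0)
		{
			list[nlist] = n + 1;
			freq[nlist] = tbl[n].cnt;
			nlist++;
		}

	if (nlist == 1)
		tbl[list[0] - 1].len = 1;

	while (nlist > 1)
	{
		// sort by decreasing frequency (equal entries keep their order)
		for (n = 0; n < nlist - 1; n++)
			if (freq[n] < freq[n + 1])
			{
				c = list[n];
				list[n] = list[n + 1];
				list[n + 1] = c;
				c = freq[n];
				freq[n] = freq[n + 1];
				freq[n + 1] = c;
				if (n > 0)
					n -= 2;
			}

		// concatenate the member lists of the two lowest nodes
		p = list[nlist - 2];
		while (tbl[p - 1].link)
			p = tbl[p - 1].link;
		tbl[p - 1].link = list[nlist - 1];
		freq[nlist - 2] += freq[nlist - 1];
		nlist--;

		// one more tree level: bump every member's code length
		for (p = list[nlist - 1]; p; p = tbl[p - 1].link)
			tbl[p - 1].len++;
	}

	free(freq);
	free(list);
}

int main(void)
{
	// frequencies for the 19-symbol example (B, C, T, U unused)
	static const int cnt[19] = {
		4, 0, 0, 2, 6, 3, 2, 2, 53, 26, 5, 4, 3, 1, 1, 1, 37, 0, 0
	};
	struct huff_t tbl[19];
	int n;

	memset(tbl, 0, sizeof(tbl));
	for (n = 0; n < 19; n++)
		tbl[n].cnt = cnt[n];

	bitlength(tbl, 19);
	for (n = 0; n < 19; n++)
		if (tbl[n].cnt)
			printf("0x%03x cnt %-3d clen %d\n", n, tbl[n].cnt, tbl[n].len);
	return (0);
}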


Re: DEFLATE Compression Algorithm

Post by bscloutier » Tue Jan 30, 2018 9:52 am

Now that we know the lengths of the codes that we will be using to compress the data we can predict the compression ratio.
        freq  clen     freq*clen
A  0x000 4     5           20
D  0x003 2     6           12
E  0x004 6     4           24
F  0x005 3     6           18
G  0x006 2     6           12
H  0x007 2     6           12
J  0x008 53    2          106
K  0x009 26    2           52
L  0x00a 5     5           25
M  0x00b 4     5           20
N  0x00c 3     6           18
P  0x00d 1     7            7
Q  0x00e 1     8            8
R  0x00f 1     8            8
S  0x010 37    2           74
                      -----------
                          416 bits (52 bytes)
If we multiply the frequency of each symbol by its code length and total that for the set, we get the total number of bits required to encode the original message. Originally we had 150 bytes or 1200 bits. When we are done we can store that same message in only 52 bytes. We've reduced the data to about one third of its original size.

Let's see how to derive the bit codes that we will use in encoding the data.


Re: DEFLATE Compression Algorithm

Post by bscloutier » Tue Jan 30, 2018 10:59 am

We want to derive the actual binary code patterns for encoding this symbol set. The DEFLATE specification tells us to first count the number of codes for each code length.
N  bl_count[N]
0      0
1      0
2      3
3      0
4      1
5      3
6      5
7      1
8      2
Next we find the numerical value of the smallest code for each code length. The following code is provided by the specification.
        code = 0;
        bl_count[0] = 0;
        for (bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + bl_count[bits-1]) << 1;
            next_code[bits] = code;
        }
In performing this procedure we get the following. Here I will also show the codes in binary form.
N  bl_count[N]  next_code[N]  (binary)
0      0
1      0            0
2      3            0     00
3      0            6     110
4      1           12     1100
5      3           26     11010
6      5           58     111010
7      1          126     1111110
8      2          254     11111110 
Now we assign codes to each symbol based upon its length. The DEFLATE specification provides this code snippet. Basically the above defines the starting code which we increment after each use.
        for (n = 0; n <= max_code; n++) {
            len = tree[n].Len;
            if (len != 0) {
                tree[n].Code = next_code[len];
                next_code[len]++;
            }
        }
And this ends up giving us the following codes for encoding this data.
        freq  clen     code
A  0x000 4     5      11010
D  0x003 2     6      111010
E  0x004 6     4      1100
F  0x005 3     6      111011
G  0x006 2     6      111100
H  0x007 2     6      111101
J  0x008 53    2      00
K  0x009 26    2      01
L  0x00a 5     5      11011
M  0x00b 4     5      11100
N  0x00c 3     6      111110
P  0x00d 1     7      1111110
Q  0x00e 1     8      11111110
R  0x00f 1     8      11111111
S  0x010 37    2      10
Take a few moments to picture what is going on. Basically with just 2 bits you can encode no more than 4 symbols. Since we have more than 4 symbols to encode we cannot use all 4 combinations of two bits. We reserve 1 or more bit combinations as a prefix indicating that an additional bit or more will be needed to identify other symbols. The decompressor will be processing the bit stream 1 bit at a time as it doesn't know in advance how many bits will be needed to identify the next symbol.
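To illustrate that bit-at-a-time decoding, here is a toy sketch (not the JANOS inflater). It linear-searches a few of the codes derived above where a real decoder would use lookup tables:

#include <stdio.h>

struct sym {
	char name;
	int len;	// code length in bits
	int code;	// the code value itself
};

int main(void)
{
	// a few of the codes derived above: J=00, K=01, S=10, E=1100
	static const struct sym tbl[] = {
		{ 'J', 2, 0x0 }, { 'K', 2, 0x1 }, { 'S', 2, 0x2 }, { 'E', 4, 0xc }
	};
	const char *bits = "00101100";		// encodes J, S, E
	int code = 0, len = 0, i, n;

	for (i = 0; bits[i]; i++)
	{
		// extend the candidate code by one bit
		code = (code << 1) | (bits[i] - '0');
		len++;

		// emit a symbol as soon as the candidate matches a known code
		for (n = 0; n < 4; n++)
			if (tbl[n].len == len && tbl[n].code == code)
			{
				putchar(tbl[n].name);
				code = len = 0;
				break;
			}
	}
	putchar('\n');		// prints: JSE
	return (0);
}

Because shorter codes are never a prefix of longer ones, the decoder always knows unambiguously when it has completed a symbol.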

In this symbol set it turns out that we use all but the last combination of two bits, and 11b becomes the prefix accessing the rest of the symbol set. Note how this coincides with the 3 high frequency codes. It then turns out to be most efficient not to use any 3-bit codes and to jump right to 4 bits for the next possible symbol encoding. In fact, for any bit length, if you reserve only the last combination as a prefix then the next bit length has only two combinations (a node). For 3-bit codes here those would be 110b and 111b. If we reserve all of them as prefixes then we can encode more symbols at the next level: for the 4-bit codes here there are 4 combinations, 1100b, 1101b, 1110b, and 1111b. Again this tree decided to use only one of the 4-bit combinations for a symbol.

Another thing to notice is that the final code for the largest bit length corresponds to the rightmost leaf in the tree. For DEFLATE that is lexicographically the longest code and requires a series of 1 bits to reach. So for this 8-bit code the last symbol is identified by 11111111b. This last code should always be 2**N - 1 where N is the largest bit length (code length). Note that '**' indicates exponentiation here: two is raised to the power of N.

If you think about it, the set of code lengths has to be just right for codes to be assigned this way without overflowing those available for any one bit length. This is assured by the procedure. If you try some random code lengths you will quickly see what I mean. In general you will be trying to create an impossible tree.
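Putting the whole assignment together, here is a small self-contained sketch (a hypothetical test harness, not JANOS code) that reproduces the table above from just the code lengths:

#include <stdio.h>

#define MAX_BITS	8
#define NSYMS		15

int main(void)
{
	// code lengths for the example alphabet, in symbol order
	static const char names[NSYMS + 1] = "ADEFGHJKLMNPQRS";
	static const int len[NSYMS] = { 5, 6, 4, 6, 6, 6, 2, 2, 5, 5, 6, 7, 8, 8, 2 };
	int bl_count[MAX_BITS + 1] = { 0 };
	int next_code[MAX_BITS + 1] = { 0 };
	int n, bits, code = 0;

	// 1) count the number of codes for each code length
	for (n = 0; n < NSYMS; n++)
		bl_count[len[n]]++;

	// 2) find the smallest code for each code length (per the specification)
	for (bits = 1; bits <= MAX_BITS; bits++)
	{
		code = (code + bl_count[bits - 1]) << 1;
		next_code[bits] = code;
	}

	// 3) assign lexicographically consecutive codes within each length
	for (n = 0; n < NSYMS; n++)
	{
		code = next_code[len[n]]++;
		printf("%c  clen %d  code ", names[n], len[n]);
		for (bits = len[n] - 1; bits >= 0; bits--)
			putchar(code & (1 << bits) ? '1' : '0');
		putchar('\n');
	}
	return (0);
}

The output matches the code column above, ending with R at 11111111b, the 2**N - 1 value noted earlier.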

But wait!!

This table looks suspiciously like the code length encoding (third Huffman table that we have not discussed as yet). If it is, didn't you mention that it would be limited to a bit depth or code length of 7? This one is 8 bits. Is that okay?

No. It is not okay. You are right. I purposely chose this real-world example which actually does violate that code length rule. So this is not a DEFLATE compatible table, at least not for that third Huffman coding. This table occurs in trying to compress the /etc/JanosClasses.jar file currently on my development JNIOR. And actually before this the literal table exceeds the 15-bit limit. The resulting compression fails.

Most of what you read now tells you to "adjust the Huffman tree" accordingly and plods on without a hint as to how you might do that. You can certainly detach and reattach nodes to get it done, but how do you know that you haven't significantly affected the compression ratio? You could punt and use the fixed Huffman tables afforded by another DEFLATE block type (BTYPE 01). You know that you can't just fiddle with the code lengths because you will end up trying to create an impossible tree. So what now?

Well, I can show you how to get it done.
