The

`FZIP_deflate()`

routine involves 90+ percent of the processing time and handles only step #1 in the prior outline. The buffer flush routine handles all of the remaining items.

We will be creating three different Huffman tables. These amount to a structure for each alphabet symbol.

struct huff_t {
uint16_t code;
uint16_t len;
uint16_t cnt;
uint16_t link;
};

Here the

`cnt`

will represent the symbol's frequency. It is the count of occurrences of the symbol's value in the data. The

`len`

will eventually be the code length in bits assigned to the symbol. The

`code`

will hold the bit pattern to be used in coding. And the

`link`

will be used in the code length determination routine.

The

`_bufflush()`

routine is called with a buffer of LZ77 compressed data. This includes raw data bytes that could not be included in a sequence and length-distance references for matches. Since data can take on all byte values I use a form of escaping. The escape character is 0xFF and a 0xFF in the data is represented by repeating the escape (e.g. 0xff 0xff). Otherwise the next byte is the length of the sequence -3 and the following short value the distance. To make this work I disallow the sequence length of 258 since that would be confused with an escaped 0xFF. I don't see this as having any significant impact. Later I could find a way around that but for now it works.

// Routine applies DEFLATE style Huffman coding to the buffer content.
static int _bufflush(char *obuf, int osize, struct bitstream_t *stream, int final)
{
struct huff_t *littbl;
struct huff_t *dsttbl;
struct huff_t *cnttbl;
int n, len, c, code, dst, lastc, ncnt;
int *cntlist, *litcnts, *dstcnts, *bitcnts, *bitlens;
int hlit, hdist, hclen;
int *startcd;
int totdist = 0;
int err = FALSE;
// Now we need to construct two Huffman trees (although I am going to avoid
// actual trees). One for the literal data and one for the distance codes.
// Note that extra bits are just extra bits inserted in the stream.
littbl = MM_alloc(286 * sizeof(struct huff_t), &_bufflush);
dsttbl = MM_alloc(30 * sizeof(struct huff_t), &_bufflush);
// Not a loop. This allows the use of break.
for (;;)
{
// Now we analyze the data to determine frequencies. Note that this is complicated
// just a bit because of the escaping that I have had to use. I will have to
// temporarily decode length and distance encoding. We'll have to do that again
// later when we stream the coding. We will also use one end-of-block code so we
// virtually count it first.
littbl[256].cnt = 1;
for (n = 0; n < osize; n++)
{
// tally literal if not escaped
if (obuf[n] != 0xff)
littbl[obuf[n]].cnt++;
else
{
// check and tally escaped 0xff
if (obuf[++n] == 0xff)
littbl[0xff].cnt++;
else
{
totdist++;
// table defined above
//static const int lcode_ofs[29] = {
// 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
// 35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
//};
// determine required length code for lengths (3..258). This code is
// coded in the literal table.
len = (obuf[n++] & 0xff) + 3;
for (c = 0; c < 29; c++)
if (lcode_ofs[c] > len)
break;
code = 256 + c;
littbl[code].cnt++;
// table define above
//static const int dcode_ofs[30] = {
// 1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
// 257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
// 8193, 12289, 16385, 24577
//};
// determine required distance code for distances (1..32768). This code is
// coded in the distance table.
dst = (obuf[n++] & 0xff) << 8;
dst |= (obuf[n] & 0xff);
for (c = 0; c < 30; c++)
if (dcode_ofs[c] > dst)
break;
code = c - 1;
dsttbl[code].cnt++;
}
}
}

So here we create two of the Huffman tables, one for the literal alphabet (0..285) and one for the distance alphabet (0..29). Note that when JANOS allocates memory it is zero filled.

Don't be confused by my use of

`for(;;) { }`

. This is not an infinite loop. In fact it is not a loop at all. Rather it allows me to exit the procedure at any point using

`break;`

and I just have to remember to place a

`break;`

at the very end. There are other ways to achieve the same thing.

The first step is to determine the frequency of symbols from both alphabets. Here we scan the supplied data and count the literals. The escaped length-distance references are translated temporarily into their length codes and distance codes. Those are tallied in the appropriate table. The extra bits are ignored. Length codes are combined with literal data since when they are read you don't know which it will be. The distance codes use their own alphabet. We will have to do this same translation again later when we encode the references for output. Then, of course, we will insert the required extra bits.

Note that this also tallies one end-of-block code (0x100) as we will be using that.

If I were to dump these two tables after this they may look something like this. These are just the symbols that occur in a particular set of data - the

`jniorsys.log`

file. This is the symbol value followed by its count.

Code: Select all

```
0x00a 2
0x00d 2
0x020 70
0x027 4
0x028 11
0x029 2
0x02a 2
0x02b 7
0x02c 7
0x02d 20
0x02e 133
0x02f 16
0x030 105
0x031 153
0x032 124
0x033 124
0x034 103
0x035 110
0x036 97
0x037 90
0x038 93
0x039 84
0x03a 74
0x03c 1
0x03d 3
0x03e 3
0x041 5
0x043 4
0x044 1
0x046 1
0x048 1
0x04a 3
0x04c 2
0x04d 4
0x04e 4
0x04f 4
0x050 8
0x052 8
0x053 4
0x054 4
0x055 1
0x057 3
0x05a 1
0x05f 3
0x061 34
0x062 4
0x063 15
0x064 26
0x065 37
0x066 9
0x067 9
0x068 8
0x069 36
0x06a 6
0x06b 3
0x06c 21
0x06d 13
0x06e 36
0x06f 41
0x070 24
0x071 1
0x072 29
0x073 20
0x074 29
0x075 11
0x076 6
0x077 7
0x078 3
0x079 6
0x07a 4
0x100 1
0x101 962
0x102 457
0x103 136
0x104 127
0x105 55
0x106 71
0x107 47
0x108 14
0x109 67
0x10a 208
0x10b 103
0x10c 40
0x10d 52
0x10e 44
0x10f 15
0x110 42
0x111 180
0x112 287
0x113 30
0x114 5
0x115 12
0x116 6
0x002 1
0x003 2
0x008 2
0x009 2
0x00a 13
0x00b 150
0x00c 166
0x00d 40
0x00e 127
0x00f 63
0x010 157
0x011 112
0x012 155
0x013 116
0x014 198
0x015 169
0x016 240
0x017 197
0x018 288
0x019 215
0x01a 321
0x01b 226
```