This sample shows you how to use the Immutable class. The Immutable class represents persistent storage that can be accessed at a low level as an array of primitive types. This information survives reboots and application terminations. You can view Immutable data by typing the ‘nv’ command at the command line of a JNIOR. If you use only one data type, you can create an immutable array of that type directly; if you want your array to hold mixed data types, you have to use a byte array.

Using Immutable Blocks

This example shows the use of an array of longs. The long values in this application are date values that represent when the application was started. The long array is part of an Immutable block, so the application start times don’t go away until they are overwritten or manually removed. This application holds up to the last five application start times.
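The "last five start times" bookkeeping can be sketched in plain Java as follows. This is only an illustration of the array handling: the class name and method are hypothetical, and an ordinary long[] stands in for the persistent array. How the array is obtained from the Immutable class is not shown because those calls are not listed in this article.

```java
import java.util.Arrays;

class StartTimeHistory {
    // The article states that up to five start times are kept.
    static final int MAX_ENTRIES = 5;

    // Shift the existing entries one slot toward the end and store the newest
    // start time at index 0. The oldest entry falls off the end of the array.
    static void recordStart(long[] startTimes, long now) {
        System.arraycopy(startTimes, 0, startTimes, 1, startTimes.length - 1);
        startTimes[0] = now;
    }

    public static void main(String[] args) {
        // Stand-in for the persistent array (assumption). On a JNIOR this array
        // would come from the Immutable class so that it survives reboots.
        long[] startTimes = new long[MAX_ENTRIES];
        recordStart(startTimes, System.currentTimeMillis());
        System.out.println(Arrays.toString(startTimes));
    }
}
```

In this sketch a value of 0 marks a slot that has not been used yet.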


I put the built jar file of this example application into the JNIOR’s flash folder and ran it from the Web UI’s console tab. After the first successful run, I ran the application several more times; each run shows a new application start time added to its immutable array.

Sometimes it is vital to ensure that only one copy of an application is running.  To do that JANOS has provided us the ability to register a process with the operating system.  Using the registerProcess(uid) call we can get the number of processes running with the given unique identifier.  This can become an issue if you have multiple run keys set to start the same application.

Here is the demo application.  The one-instance application is started with the & parameter.  This tells the console session to run the application as a separate process.  The first time the application runs, it returns a process count of 1.  The process is allowed to proceed.

Since the application is running as a separate process from the console session we can start another instance now.  The returned count is now 2.  The code checks for a count greater than 1 and decides to exit.

Using this feature will help you be sure that there is only ever one instance of your application running at a time.  Here is the source for the sample application.

package oneinstance;

import com.integpg.system.JANOS;



public class Oneinstance {

    public static void main(String[] args) {
        // we assign a unique identifier to this application.  This can be ANY string.
        String uid = "abcdef";

        // we register the process with JANOS.  this call will return the number of processes 
        //  running with this uid.  The result should be 1 indicating that our process is now running.
        int processCount = JANOS.registerProcess(uid);
        System.out.println("# of processes running using this UID: " + processCount);
        if (processCount > 1) {
            System.out.println("There is another copy of this application already running. exit now.");
            return;
        }

        // now sleep for a while to give us time to start other instances of this application as a test
        try {
            System.out.println("We are allowed to continue as the only instance of this application");
            Thread.sleep(60000);
        } catch (InterruptedException ex) {
            ex.printStackTrace();
        }
    }

}

The JNIOR is a very flexible and powerful controller. Utilize our bundled or add-on software applications. If those don’t meet your needs, let INTEG quickly develop an application for you.

The JNIOR offers superb functionality with its included and available software. However, if you require a custom application to run on your JNIOR, INTEG can develop it for you, or you can develop it yourself using the JNIOR Software Development Kit (SDK). INTEG has already developed a number of custom applications for a variety of customers. Some of these applications have become our ‘add-on’ applications because they have met the needs of a large group of customers.

Other times the applications have been focused and developed to meet the needs of a specific customer.  After the user requirements are gathered, it doesn’t take long for INTEG to deliver something for the customer to test.  Often we do not receive the devices that the customer wants to interface with.  This makes it tough for INTEG to complete full testing in the office.  Sometimes we write test applications to mimic the communications between the JNIOR and the end device.

If you have an application in mind and want to talk to INTEG about the JNIOR, please call the office at 724-933-9350 or email support@integpg.com.  You can also fill out the contact form.

Thank you for your interest in the JNIOR from INTEG.


JANOS supplies a Message Pump wherein messages of various types may be circulated between processes or applications. These messages may be user defined.
Message numbers below 1024 (0x400) are RESERVED by the system.
The following are the system defined message types.

SM_SHUTDOWN (0x01)

This message is generated by the system prior to shutdown. When received applications MUST forward the message by returning it to the pump and exit in an expeditious fashion. The JNIOR is about to reboot.

SM_PROBE (0x02)

This message is generated by the system periodically. When received, applications MUST forward the message by returning it to the pump. This is used to detect listeners that are no longer responding or that are not properly forwarding messages. The system expects to see this message return to it in a prompt fashion.

SM_GCRUN (0x10)

This message indicates that the Garbage Collection (GC) has completed. When received applications MUST forward the message by returning it to the pump.

SM_WATCHDOG (0x11)

This message is generated by an application watchdog configured to send the message on timer expiration.

SM_SYSLOGMSG (0x12)

System log messages can be sent to an external Syslog Server. This message also passes the log information to listening applications.

SM_PWRLOST (0x20)

When Ride-Thru Power support is available this indicates the loss of external power.

SM_PWRGOOD (0x21)

When Ride-Thru Power support is available this indicates that external power has been restored.

SM_PWRREADY (0x22)

When Ride-Thru Power support is available this indicates that the supply is fully charged and ready to provide maximum holding capacity.

SM_REGUPDATE (0x40)

This message is generated whenever a registry entry is updated or removed. When received, applications MUST forward the message by returning it to the pump.

SM_WEBSTARTUP (0x60)

Message sent when the Web Server process is activated.

SM_WEBSHUTDOWN (0x61)

Message sent when the Web Server process is terminated.

SM_PROTCMDMSG (0x70)

This message is generated when the JNIOR Protocol receives a custom command message. When received an application MUST either forward the message or provide a SM_PROTCMDRESP response.

SM_PROTCMDRESP (0x71)

This message is generated by an application in response to a SM_PROTCMDMSG command message. It is intended for the JNIOR Protocol server. When received applications MUST forward the message by returning it to the pump.

SM_PIPEOPEN (0x80)

This message is sent by the Web Server when a piped websocket connection has been established. The message contains the client IP Address and Port as well as the target message number.

SM_PIPECLOSE (0x81)

This message is sent by the Web Server when a piped websocket connection has terminated. The message contains the client IP Address and Port as well as the original targeted message number.

SM_USER (0x400)

Lowest allowed user defined message number. Applications that intend to exchange messages SHOULD attempt to define globally unique message identifiers. These must be values from 1024 and up. Message numbers below SM_USER are RESERVED by the system.
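The numbering and forwarding rules above can be summarized in a short Java sketch. This is not the JANOS messaging API: the class, the helper methods and the MY_APP_STATUS number are hypothetical, and only the constant values are taken from the list above.

```java
class MessageNumbers {
    // Values copied from the system message list above.
    static final int SM_SHUTDOWN = 0x01;
    static final int SM_PROBE = 0x02;
    static final int SM_GCRUN = 0x10;
    static final int SM_REGUPDATE = 0x40;
    static final int SM_PROTCMDMSG = 0x70;
    static final int SM_PROTCMDRESP = 0x71;
    static final int SM_USER = 0x400;

    // Hypothetical application-defined message number, at or above SM_USER.
    static final int MY_APP_STATUS = SM_USER + 0x123;

    // Message numbers below SM_USER (1024) are reserved by the system.
    static boolean isReserved(int type) {
        return type < SM_USER;
    }

    // True for the system messages that the list above says MUST be returned
    // to the pump. SM_PROTCMDMSG is left out on purpose: it may either be
    // forwarded or answered with SM_PROTCMDRESP.
    static boolean mustForward(int type) {
        switch (type) {
            case SM_SHUTDOWN:
            case SM_PROBE:
            case SM_GCRUN:
            case SM_REGUPDATE:
            case SM_PROTCMDRESP:
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("MY_APP_STATUS reserved? " + isReserved(MY_APP_STATUS));
        System.out.println("SM_PROBE must be forwarded? " + mustForward(SM_PROBE));
    }
}
```

An application's message loop could consult a check like mustForward before deciding whether to consume a message or hand it back to the pump.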

The JNIOR Model 410 looks very much like your familiar model 310 but just how similar is it? This insiders’ guide takes you on a tour of the JANOS Operating System highlighting what’s new and what’s different. It’s a technical peek into the next generation of JNIOR.
This Guide is for those who are very familiar with the JNIOR Model 310, 312 or 314. While the new series of JNIOR provides all of the capabilities that you have come to enjoy there are many (sometimes subtle) improvements. It is here that we will outline those for you. Our goal is to remove any mystery, to make you comfortable with the new generation, and to help you efficiently assimilate these devices into your installations.

1 Appearance

First of all, the Model 410 looks different, but not all that different, from the Model 310. It is important that you be able to continue to use the JNIOR 410 in every application where you previously would have used the 310. For this reason the product housing has not changed, there has been no change to the connections, and the power supply requirements remain the same.
At the same time it is important that you be able to quickly identify a 410 limiting any possible confusion. To this aim the color scheme used in the labeling has changed. The 410 provides a sleek dark look with its predominantly black front label. In addition the GREEN power LED is no longer green. The model 410 uses a BLUE LED to indicate power which can be easily identified from a distance. Other than this color difference the LEDs serve an identical purpose.

2 Connections

All of the connections to the Model 410 are 100% compatible with the Model 310. This guarantees that you can remove a 310 from its application and directly replace it with the new 410. There is one very subtle change which you would likely be hard-pressed to identify unless we point it out. The tab on the Sensor Port jack now faces upwards. We flipped this connection over because, with the JNIOR mounted to the wall and depending on the mounting hardware used, getting your finger under the RJ-12 plug to press the tab was sometimes difficult. This is certainly one of the least significant differences in the product, if not THE least significant, but we did promise to outline ALL of the differences.

3 Under the Hood

With the Model 310 there was no real need to remove the cover to access the circuit board inside. For the most part this is true for the 410 as well. Here there are two differences under the hood that should be mentioned.

3.1 Battery

All JNIORs use a 3V non-rechargeable lithium battery cell. This battery is used to maintain the content of the non-flash memory holding that portion of the file system outside of the /flash folder. The battery has a life expectancy of several years. This is especially true when the JNIOR remains in use and powered during most of that period. With the Model 310 this battery is NOT user replaceable.
The Model 410 uses a standard 3V CR2032 coin cell which is readily available wherever batteries are sold. While we have not experienced any significant issue with the life of the batteries used in the 310 they are quite expensive and do not provide the advantages of a removable cell. If your battery has failed you would notice a lack of any entries in the jniorsys.log file prior to the removal of power. The loss of file data may or may not impact your application. With the 410 at least you can easily replace the battery by simply sliding the old one out of the holder and slipping a new one in. The CR2032 coin cell actually has more capacity than the permanent batteries that we have used in the 31X models.

3.2 Configurable Relays

As you know our relay outputs are normally open (N.O.) dry contact. Those familiar with the Model 312, which has 12 relay outputs and 4 isolated digital inputs, may know that there are two relays which can be configured by internal jumper to alternatively function as normally closed (N.C.). We have carried this feature into the Model 410 where Relay Output 1 and Relay Output 2 can be configured to be either N.O. or N.C. simply by moving the associated jumper. All units come factory configured for N.O. operation.

4 Performance

By far the majority of differences in the new Model 410, those beyond what have been mentioned already, become apparent when power is applied. The most significant of which is product performance. The Model 410 is considerably faster and much more responsive than its predecessor.
This key difference becomes immediately obvious when power is applied. You are likely aware that the YELLOW/ORANGE Status LED to the right of the Power LED on JNIORs remains on during product boot. This is also true for the Model 410; however, try not to be distracted by the BLUE Power LED, as you may miss the boot indication entirely. The Model 410 operating system (OS) literally boots in a second. The Status LED correspondingly merely flashes.
After a second any configured and enabled applications have been launched and have begun to initialize. The network connection becomes active after about 5 seconds as the hardware completes the normal auto-negotiation with the interconnected hub. This means that after only 5 seconds the Model 410 JNIOR is ready to accept browser connections, process FTP transfers or otherwise handle all network traffic. Even the most complicated applications are generally up and running in no more than 15 seconds.

4.1 Processing Core

The Model 410’s leap in performance comes from a significant upgrade in processor. The internal system clock, which governs the pace of machine instructions, is 3 times faster than the Model 310. Furthermore the new processor uses leading-edge technology to combine a variable length instruction set with an advanced instruction pipeline to achieve a one instruction per clock cycle execution rate. On average the Model 310 requires about 3 clock cycles per instruction. On this basis alone we would expect a 10X speed improvement with the Model 410.
The new processor is 32-bit where the Model 310’s processor was only 8-bit. Numbers are generally represented by programming languages as 4-byte (32-bit) integers. The simple act of adding two integers in an 8-bit system requires dozens of machine instructions whereas with a 32-bit processor it takes just one. In our case this difference is perhaps on average a factor of 20. This would imply that the Model 410 might be as much as 200 times faster than the 310.
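The estimate in the two paragraphs above can be written out as a rough product of the stated factors. The rounding of 9 up to 10 is an assumption about how the 10X figure was reached; the text does not show the arithmetic.

```latex
\[
\underbrace{3}_{\text{clock rate}} \times
\underbrace{3}_{\text{cycles per instruction}} = 9 \approx 10,
\qquad
10 \times \underbrace{20}_{\text{32-bit vs. 8-bit}} = 200
\]
```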
In addition, on the Model 410, the operating system code is run out of flash memory internal to the processor which has a 10 nano-second access time. This is extremely fast for flash especially when compared to the external flash used by the Model 310 where access requires 70 nanoseconds. In order for the prior 310 to avoid having to wait for memory, programs are copied to RAM and executed there. The impact on performance in having to copy code and the increased load on memory management cannot be directly quantified.
As a further advantage the new processor has built-in functionality to perform multiplications, divisions and floating point calculations. The benefit of all of this is also difficult to quantify. The bottom line is that on the basis of hardware alone the Model 410 should be some 250 times faster than the 310 if not much more.

4.2 JANOS Operating System

As with any product the operating system brings life to the hardware. This is complex programming that is tightly coupled with the hardware design. In order for INTEG to move the JNIOR line to a new processor platform a new operating system had to be developed. For this new operating system to retain functionality closely matching (and indeed indistinguishable from) the existing 31X line we had to create it completely in-house. Therefore with the new Model 410 we introduce the JNIOR Automation Network Operating System, or “JANOS” for short.
In ancient Roman religion and myth, Janus is the god of beginnings and transitions, and also of gates, doors, passages, endings and time. With this in mind the acronym JANOS is appropriate for an I/O system controlling inputs and outputs.
Unlike its predecessor the JniorOS (built upon the Dallas/Maxim TINI OS) which was written in Java, JANOS is written in C language directly generating efficient machine instructions which are optimized for the new processor core. Where previously the Java code interpreted sequences of bytecodes pacing OS performance, the new operating system performs dramatically faster and significantly more efficiently executing directly in machine language.
As a result the new Model 410 out-performs the Model 310 by orders of magnitude opening up possibilities for new and exciting applications.

5 File Storage

Beyond the improvement in processing speed the Model 410 also has the capability to store more file data. By default the area reserved for the /flash folder is 32MB, which is over 40 times that provided by the Model 310. This provides ample space to store a fully featured website inclusive of graphics, or to store data covering long periods of time in data logging applications. The internal flash memory may even be expanded by special order to as much as 128MB and perhaps beyond.

6 JANOS Command Line

As with its predecessor the Model 410’s command line is accessed through a serial connection to the RS-232 port, via the network through a Telnet connection, or through an application that supports such a connection. The prompt should be very familiar and all of the commands that you have used to manage your JNIORs are there. There are some differences which will be highlighted in the next section.

6.1 Editing

The entry of commands has been improved. In addition to the UP/DOWN ARROW command history that JNIOR provides you may now use a number of other editing keys. The RIGHT/LEFT ARROW keys together with the HOME, END, BACKSPACE, DEL and INS keys can be used to flexibly edit and enter commands. This is welcome relief and a definite improvement over the previously limited editing available in the Model 31X series.

6.2 Auto-fill (TAB)

The TAB key is used to auto-fill file names. While entering any command you may type one or more characters representing the beginning of a file or folder name and then use the TAB keystroke to toggle through potential files and folders matching those characters. This is very useful to avoid having to enter lengthy file names in their entirety. The TAB key can even be used to construct paths to files located deep in the file system. For instance the sequence
cat f[TAB]/j[TAB][RETURN]
will likely execute the following as if entered directly.
cat flash/jnior.ini[RETURN]
The TAB key also enhances the registry command discussed in the next section.

6.3 Prompt Abbreviation (DEL)

The prompt itself displays the current Hostname together with the current working folder (directory). If you have used the cd command to move about folders in the file system and depending on the length of the Hostname defined, your prompts can be quite lengthy. You might be starting to enter commands 1/3rd of the way across a line and depending on the application used for access may have to contend with command line wrap.
If this becomes an issue, you may now use the DEL key immediately at a new prompt to squelch the display of the Hostname. If you do so, all subsequent command line prompts will omit the Hostname. Once another character is typed after the prompt the DEL key will perform its editing function as expected.

6.4 Command History

As with its predecessor the Model 410 provides a command history wherein the UP/DOWN ARROW keys may be used to access and repeat a previously entered command. This is very useful.
The 410 however remembers up to 8 unique command entries sorted so that the UP ARROW supplies the most recently entered commands first. By doubling the number of remembered lines, eliminating duplicates, and sorting by age the command history becomes a more effective time saver. Add this to other editing and auto-fill functionality and you find that the command line is much easier to use.
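The history behavior described above (8 entries, no duplicates, most recent first) can be sketched in Java as follows. This only illustrates the bookkeeping and says nothing about how JANOS implements it; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CommandHistory {
    // The text states that up to 8 unique commands are remembered.
    static final int CAPACITY = 8;

    // Index 0 always holds the most recently entered command.
    private final List<String> entries = new ArrayList<>();

    void add(String command) {
        entries.remove(command);              // drop an earlier duplicate, if any
        entries.add(0, command);              // newest entry goes to the front
        if (entries.size() > CAPACITY) {
            entries.remove(entries.size() - 1); // forget the oldest entry
        }
    }

    // Returns what the given number of UP ARROW presses would recall
    // (1 = most recent), or null when the history is not that deep.
    String recall(int upPresses) {
        if (upPresses < 1 || upPresses > entries.size()) {
            return null;
        }
        return entries.get(upPresses - 1);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        CommandHistory history = new CommandHistory();
        history.add("cat flash/jnior.ini");
        history.add("date -v");
        history.add("cat flash/jnior.ini");   // repeated command moves to the front
        System.out.println(history.recall(1));
        System.out.println(history.recall(2));
    }
}
```

Re-entering an old command moves it to the front instead of storing it twice, which is what keeps the eight slots filled with distinct commands.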

6.5 Custom Command Creation

The java command is used to execute application programs. As will be covered later these are stored in .JAR files. The Model 410 allows you to execute a program using simply its name. So these two command lines equivalently execute the target application. Note also that file names are case-independent.
java MyApp.jar -debug
myapp -debug
This in effect allows you to create a custom command.
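The name-to-file lookup implied above can be sketched in Java as follows. The text does not say which folders JANOS searches, so the list of file names here is a stand-in and the class and method names are hypothetical; only the case-independent match and the .jar extension come from the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CommandResolver {
    // Returns the first file whose name equals command + ".jar", ignoring case.
    static Optional<String> resolve(String command, List<String> fileNames) {
        String wanted = command + ".jar";
        return fileNames.stream()
                .filter(name -> name.equalsIgnoreCase(wanted))
                .findFirst();
    }

    public static void main(String[] args) {
        // Hypothetical folder content used only for this illustration.
        List<String> files = Arrays.asList("jnior.ini", "MyApp.jar");
        System.out.println(resolve("myapp", files).orElse("(no match)"));
    }
}
```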

7 Commands

All of the commands available with the Model 310 are also provided in the new Model 410. In many cases the output may be formatted slightly differently, some additional information might be provided, or there may be additional functionality. We will review each command here focusing on significant differences.

7.1 The arp Command (New!)

7.2 The bye Command (New!)

The bye command terminates the current Command session. It is equivalent to the exit command.

7.3 The cat Command

Ctrl-C can be used to interrupt the listing of a lengthy file. If you cat a lengthy file and decide that you do not need to see the whole thing, hit the Control-C key combination to stop the listing.
The cat command can be used with the -h option to dump the content of a file in hexadecimal. This allows you to view the binary content of a file. For example:

JANOS_Rev04 /> cat -h jniorboot.log
00000000  30 38 2f 31 32 2f 31 33  20 31 39 3a 33 36 3a 31  08/12/13 .19:36:1
00000010  31 2e 35 39 38 2c 20 4d  6f 64 65 6c 20 34 31 30  1.598,.M odel.410
00000020  20 2d 20 4a 41 4e 4f 53  20 76 30 2e 38 2e 35 2d  .-.JANOS .v0.8.5-
00000030  72 63 34 2e 31 0d 0a 30  38 2f 31 32 2f 31 33 20  rc4.1..0 8/12/13.
00000040  31 39 3a 33 36 3a 31 31  2e 35 39 39 2c 20 43 6f  19:36:11 .599,.Co
00000050  70 79 72 69 67 68 74 20  28 63 29 20 32 30 31 32  pyright. (c).2012
00000060  2d 32 30 31 33 20 49 4e  54 45 47 20 50 72 6f 63  -2013.IN TEG.Proc
     .
     .
     .

This is useful in debugging applications that may store information in a binary form.
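For comparison, a similar hexadecimal listing can be produced in Java on a PC. The layout below is inferred from the sample output above (16 bytes per line in two groups of 8, with spaces and non-printable bytes shown as '.'); it is a sketch of the format, not the JANOS implementation, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class HexDump {
    // Formats up to 16 bytes starting at offset as one line of the listing.
    static String formatLine(int offset, byte[] data) {
        StringBuilder hex = new StringBuilder();
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 16; i++) {
            if (i == 8) {            // extra gap between the two groups of 8
                hex.append(' ');
                text.append(' ');
            }
            int index = offset + i;
            if (index < data.length) {
                int b = data[index] & 0xFF;
                hex.append(String.format("%02x ", b));
                // Assumption from the sample: space (0x20) also prints as '.'
                text.append(b > 0x20 && b < 0x7F ? (char) b : '.');
            } else {
                hex.append("   ");   // pad a short final line
            }
        }
        return String.format("%08x  %s %s", offset, hex, text);
    }

    public static void main(String[] args) {
        byte[] data = "08/12/13 19:36:11.598, Model 410\r\n"
                .getBytes(StandardCharsets.US_ASCII);
        for (int offset = 0; offset < data.length; offset += 16) {
            System.out.println(formatLine(offset, data));
        }
    }
}
```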

7.4 The date Command

The date command has 3 new options. The -s option disables the use of Daylight Savings Time for the current Timezone; correspondingly, the -d option enables Daylight Savings Time; and the -v (verbose) option provides additional detail regarding the current time. For example:

JANOS_Rev04 /> date -v
 utc: 1376337540
 Mon Aug 12 15:59:00 EDT 2013
 Current Timezone is EST for the America/New_York area.
 Abbrieviated EDT when Daylist Savings is in effect.
 Daylight Savings Time begins at 02:00 on the Sun on or after Mar 8th.
 Daylight Savings Time ends at 02:00 on the Sun on or after Nov 1st.
 When in effect Daylight Savings Time sets clocks ahead by 1 hour.
 Daylight Savings Time is currently in effect.
JANOS_Rev04 />
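The "Sun on or after" rule in this output can be checked with a few lines of desktop Java. This sketch uses the standard java.time package, which is an assumption about the PC-side environment and not a statement about what the JNIOR's Java runtime provides; the class and method names are hypothetical.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.LocalDate;
import java.time.Month;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.temporal.TemporalAdjusters;

class DstRule {
    // Returns the first Sunday that falls on or after the given date.
    static LocalDate sundayOnOrAfter(int year, Month month, int day) {
        return LocalDate.of(year, month, day)
                .with(TemporalAdjusters.nextOrSame(DayOfWeek.SUNDAY));
    }

    public static void main(String[] args) {
        // Rule from the output: begins on the Sunday on or after Mar 8th,
        // ends on the Sunday on or after Nov 1st.
        System.out.println("DST begins " + sundayOnOrAfter(2013, Month.MARCH, 8));
        System.out.println("DST ends   " + sundayOnOrAfter(2013, Month.NOVEMBER, 1));

        // The utc value shown by date -v, converted for America/New_York.
        ZonedDateTime local = Instant.ofEpochSecond(1376337540L)
                .atZone(ZoneId.of("America/New_York"));
        System.out.println(local);
    }
}
```

For 2013 the rule gives March 10 and November 3, and the utc value converts to 15:59 local time on August 12, which agrees with the second line of the listing.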

7.5 The extern Command

The extern command manages external devices. In JANOS it remembers and displays the addressing and type for each module used with the unit.

JANOS_Rev04 /> extern
  TypeFB_1 = CD111090708109FB  present
  TypeFB_2 = BE111120220410FB  not present
  TypeFB_3 = 79111130517082FB  not present
  TypeFA_1 = C7100511100083FA  not present
  TypeFE_1 = 23111130619007FE  not present
  TypeFD_1 = 4B111110510241FD  not present
JANOS_Rev04 />

Note that the first two TypeFB (4ROUT) devices (TypeFB_1 and TypeFB_2) extend the Relay Output functionality of the unit. On the 410 the first represents Relay Outputs 9 through 12 and the second 13 through 16. JANOS like its predecessor works to maintain the proper association between the Relay Output and the physical module. This is done through this addressing.
Unlike its predecessor, the Model 410 does not automatically forget modules should they be removed and the unit rebooted. This greatly reduces the risk that the order of 4ROUTs and their associated Relay Outputs be confused and improperly assigned. If you do need to reset this addressing, use the -r option as follows:
extern -r
This will ‘remove’ any modules that are no longer present. The 4ROUTs will be reassigned in the order that they are detected.
The Model 410 also scans for new modules and verifies existing modules every 5 seconds. A reboot is no longer required (or use of the extern command) to detect newly connected devices.
Procedure for Assigning 4ROUT Modules
1) Remove all modules.
2) Issue the extern -r command.
3) Connect the 4ROUT module to be associated with Relay Output channels 9 through 12.
4) Wait 5 – 10 seconds (or use extern command until you see that the module is assigned).
5) Connect the 4ROUT for channels 13 through 16.
The 4ROUT and Power 4ROUT modules are TypeFB and are interchangeable as far as channel assignments are concerned. All other modules are addressed directly by their address in all protocols.
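The channel assignment described above can be expressed as a small Java calculation. The count of 8 onboard relays is an assumption inferred from the first module starting at Relay Output 9; the class and method names are hypothetical.

```java
class RelayModuleMap {
    // Assumption: 8 onboard relay outputs, since TypeFB_1 starts at channel 9.
    static final int ONBOARD_RELAYS = 8;
    static final int RELAYS_PER_MODULE = 4;

    // moduleIndex is 1 for TypeFB_1 and 2 for TypeFB_2. Per the text, only
    // the first two TypeFB modules extend the Relay Output channels.
    static int firstChannel(int moduleIndex) {
        if (moduleIndex < 1 || moduleIndex > 2) {
            throw new IllegalArgumentException("only TypeFB_1 and TypeFB_2 map to relay channels");
        }
        return ONBOARD_RELAYS + (moduleIndex - 1) * RELAYS_PER_MODULE + 1;
    }

    static int lastChannel(int moduleIndex) {
        return firstChannel(moduleIndex) + RELAYS_PER_MODULE - 1;
    }

    public static void main(String[] args) {
        for (int module = 1; module <= 2; module++) {
            System.out.println("TypeFB_" + module + " -> Relay Outputs "
                    + firstChannel(module) + " through " + lastChannel(module));
        }
    }
}
```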

7.6 The iolog Command (New!)

7.7 The jar Command (New!)

7.8 The jrupdate Command (New!)

7.9 The manifest Command (New!)

7.10 The mode Command (New!)

7.11 The nv Command (New!)

7.12 The reg Command (New!)

The reg command is an alias (abbreviation) for the registry command.

7.13 The registry Command

Listing Registry Content
The registry command can now be used with wildcards to list matching Registry entries. Wildcards adhere to the DOS standards using the ‘?’ and ‘*’ wildcard characters. For example:

JANOS_Rev04 /> reg Ip*
    IpConfig/Hostname = JANOS_Rev04
    IpConfig/DHCP = disabled
    IpConfig/IPAddress = 10.0.0.71
    IpConfig/SubnetMask = 255.255.255.0
    IpConfig/GatewayIP = 10.0.0.1
    IpConfig/PrimaryDNS = 10.0.0.4
JANOS_Rev04 />
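DOS-style wildcard matching of this kind can be sketched in Java by translating the wildcard into a regular expression. Treating the match as case-independent is an assumption, and the class and method names are hypothetical; this is not how JANOS is known to implement the command.

```java
import java.util.regex.Pattern;

class WildcardMatch {
    // Translates '*' (any run of characters) and '?' (any single character)
    // into a regular expression; every other character is matched literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        // Assumption: keys are compared without regard to case.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean matches(String wildcard, String key) {
        return toPattern(wildcard).matcher(key).matches();
    }

    public static void main(String[] args) {
        String[] keys = {"IpConfig/Hostname", "IpConfig/DHCP", "Run/TaskManager"};
        for (String key : keys) {
            if (matches("Ip*", key)) {
                System.out.println(key);
            }
        }
    }
}
```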

Auto-fill Registry Keys
The TAB key can be used to auto-fill Registry Key names. For example:
reg Ip[TAB]/[TAB][TAB][TAB] =
This results in the following command line wherein you may complete it by adding the IP address.
reg IpConfig/IPAddress =
Note the use of repeated TAB keys to toggle through the various subkeys to end up with the one you want. The keys appear in alphabetical order and if you continue through all available the original command line re-displays. You may then proceed through the list again. This only works for existing defined keys. To define a new key you will need to type it out.
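The cycling behavior described above (matches in alphabetical order, then back to what was typed, then around again) can be sketched in Java as follows. For simplicity this sketch completes whole key names rather than one path segment at a time, which is a simplification of what the command line does; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TabCycler {
    private final String typed;
    private final List<String> matches = new ArrayList<>();
    private int position = -1;   // -1 means the originally typed text is shown

    TabCycler(String typed, List<String> existingKeys) {
        this.typed = typed;
        for (String key : existingKeys) {
            // Keep only existing keys that begin with the typed characters.
            if (key.regionMatches(true, 0, typed, 0, typed.length())) {
                matches.add(key);
            }
        }
        Collections.sort(matches, String.CASE_INSENSITIVE_ORDER);
    }

    // Returns the text to display after one more TAB keystroke.
    String pressTab() {
        position++;
        if (position >= matches.size()) {
            position = -1;       // past the last match: redisplay the original
            return typed;
        }
        return matches.get(position);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "IpConfig/IPAddress", "IpConfig/DHCP", "Run/TaskManager");
        TabCycler cycler = new TabCycler("Ip", keys);
        for (int i = 0; i < 4; i++) {
            System.out.println(cycler.pressTab());
        }
    }
}
```

Because only existing keys are offered as candidates, a new key still has to be typed out in full, as noted above.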
Recall Current Value
You may use the TAB key immediately following the ‘=’ to recall the current value of an existing Registry key. For example:
reg IpConfig/IpAddress =[TAB]
This results in the following line which may then be edited if desired.
reg IpConfig/IPAddress = 10.0.0.71
Specifying Files
The TAB key functions as in any other command line when used following the ‘=’ sign but not immediately after. For example:
reg Run/TaskManager = f[TAB]/T[TAB][RETURN]
This results in the following entry and can be used to easily define the key to start TaskManager.
reg Run/TaskManager = flash/TaskManager.jar
Combined with the newly available LEFT/RIGHT ARROW, BACKSPACE, DEL and INS editing this can greatly improve the command line experience.

7.14 The touch Command (New!)

7.15 The usermod Command (New!)

7.16 The users Command (New!)

8 User Management

9 Registry

10 Web Server

The JANOS Web Server introduces 2 new enhancements in addition to improvements in performance. The JANOS Web Server supports built-in websocket functionality as well as a form of server-side scripting consistent with the Hypertext PreProcessor (PHP).

10.1 Websockets

JANOS allows a web connection to be promoted to the websocket protocol through the HTTP port (default 80). In this case JSON formatted messages can be exchanged over a single persistent connection with the client browser, providing AJAX type services in support of dynamic web pages. This offers an alternative to the older Java Applets and implements the approach which has become the norm for many websites.

10.2 Server-side Scripting

The JANOS Web Server implements a small subset of the well-known scripting language known as PHP. Documented separately this server-side scripting function complements websocket and dynamic HTML by providing the ability to generate context specific web content on demand.

11 JAVA Applications Programming

12 I/O Logging

13 External Modules

14 Firmware Updates

A key advantage with the new Model 410 is the ability to update 100% of the operating firmware. With its predecessor, the Model 31X series, a percentage of the operating firmware is supplied by a third party in binary form. It was not possible to field update that portion of the JNIOR Model 310 code. We in fact had not changed that portion of the operating system through the entire life of the Model 31X product. This has forced us to work-around some (permanent) deficiencies that had been discovered over the years.
JANOS on the other hand was completely developed by INTEG and this includes every single byte of operating code. Correspondingly we are able to update the system in its entirety. We are able to service an issue that may arise and better yet, we are able to extend the function of the operating system in any way conceivable.
Each Model 410 may be updated manually or through programs like the JNIOR Support Tool. The firmware update is supplied in a .UPD file which must first be copied into the JNIOR file system using FTP. This file is typically between 600KB and 700KB and so it is recommended that it be transferred to the /flash folder where the 410 provides ample space. Unlike its predecessor the Model 410 does not automatically detect the .UPD file on boot nor does it automatically remove the file after updating. The jrupdate command is used to install the update. For example:
jrupdate -u flash/filename.upd
This initiates the firmware update. Note that a reboot will be required to complete the OS replacement.

14.1 Java System Library

The Java library is stored in the /etc/JanosClasses.jar file. This is the system built-in library residing in the Read-Only /etc folder. This contains all of the base classes required to build Java applications to run on the JNIOR Model 410. A .UPD file may optionally carry new content for the /etc folder. Note that the folder is replaced immediately upon execution of the jrupdate command in the form shown above.
Since Java applications cache referenced classes, the library .JAR can be swapped while Java applications are running. An application may throw an unexpected exception if it should attempt to load a new class during the update. If this is a concern you might want to stop any running application before performing the update.

14.2 OS Update and Rollback

The Model 31X series maintained two JniorOS images, the field update and the original factory installed OS. These, of course, are images of that part of the OS that can be field updated. It is possible with the 310 to rollback to the factory installed OS. We have found that doing so is rarely desirable given that the factory installation can quickly become outdated. It is likely that JNIOR applications would fail to run under the original OS. The rollback in this case has not been recommended and practically never used.
The Model 410 on the other hand also supports two JANOS images one of which is the currently executing operating system. The other image is a copy of the previously installed version. The rollback then becomes seriously useful in that it will restore any previous version of OS which can be assumed to have been recently operational unlike a potentially aged original factory version. The jrupdate -r command performs the rollback.
In actuality the jrupdate -r command schedules a swap of the OS images on reboot. In this case on reboot the “Saved OS” becomes the executing OS and the previously running OS is “saved”. One can use the jrupdate -r command to toggle between an updated version of JANOS and a prior version. The stats command displays the current and saved versions of the operating system.
The jrupdate -u command as described above merely overwrites the “Saved OS” image with the supplied update and schedules the OS swap on the next reboot. The jrupdate -c command can be used to cancel any scheduled swap.
The Model 410 then performs any scheduled OS swap on boot, bringing up the desired copy of JANOS in its entirety. While we recommend that the reboot be caused using the reboot command, it is not strictly required with the new JNIOR. The Model 410 has been designed to perform properly in the face of the practice of pulling power to force a reboot. This is not the case with the 31X series where issues (namely the loss of configuration changes) may result if the reboot command is not used. We continue to recommend that the reboot command be used with all Model 310, 312 and 314 JNIORs.
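The bookkeeping behind the three jrupdate options can be modeled in a few lines of plain Java. This is only an illustrative sketch of the behavior described above, not JANOS code: the class name, method names and version strings are hypothetical, and it assumes that cancelling a swap leaves the overwritten "Saved OS" image in place.

```java
// Illustrative model only; it mirrors the description above and is not part of JANOS.
public class OsImageModel {
    private String running;         // version currently executing
    private String saved;           // contents of the "Saved OS" image
    private boolean swapScheduled;  // true when a swap is pending for the next boot

    public OsImageModel(String running, String saved) {
        this.running = running;
        this.saved = saved;
    }

    // jrupdate -u: overwrite the saved image with the update and schedule the swap
    public void update(String newVersion) {
        saved = newVersion;
        swapScheduled = true;
    }

    // jrupdate -r: schedule a swap of the two images
    public void rollback() {
        swapScheduled = true;
    }

    // jrupdate -c: cancel any scheduled swap
    public void cancel() {
        swapScheduled = false;
    }

    // reboot: perform the scheduled swap, if there is one
    public void reboot() {
        if (swapScheduled) {
            String previous = running;
            running = saved;
            saved = previous;
            swapScheduled = false;
        }
    }

    public String running() { return running; }
    public String saved() { return saved; }

    public static void main(String[] args) {
        // Hypothetical version numbers, chosen only for the demonstration
        OsImageModel unit = new OsImageModel("v1.6.5", "v1.6.4");
        unit.update("v1.6.6");
        unit.reboot();
        System.out.println(unit.running() + " running, " + unit.saved() + " saved");
        unit.rollback();
        unit.reboot();
        System.out.println(unit.running() + " running, " + unit.saved() + " saved");
    }
}
```

Note how the update discards whatever was previously in the saved image, which is why the rollback always returns to the most recently running version rather than to an older one.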

14.3 Removing Power During Firmware Update

You might be familiar with the warning: “Updating firmware DO NOT remove power.” if you, like everyone else, use new products that support network firmware updates. The assumption is that if you remove power and their update process has not completed you will be left with a product with half an operating system which is potentially no longer operational. This is not a concern with the Model 410 JNIOR.
On reboot the Model 410 handles a scheduled OS swap. The JANOS images are indeed physically swapped in program flash memory. If a traditional memory copy operation were employed then we would indeed need to warn you. But the Model 410 uses an algorithm for the swap that ensures success even if you flip power off and on as fast as you can possibly do so during the procedure. Try it!
The YELLOW/ORANGE status LED flashes slowly during the swap which typically takes only a few seconds. The updated JANOS will subsequently boot in just another second no matter the stability of power during the swap. This was a critical concern because otherwise if power is pulled to effect the reboot and not restored decisively a lazy update process would risk failure. This risk is unacceptable and unlike other products we address the issue head-on with a fault-tolerant update procedure so you can update with confidence.

15 Safe Mode

The Model 410 may be started in SAFE MODE. In this mode applications that are programmed to start automatically through Registry “Run” keys are not started.
In order to access SAFE MODE a jumper must be inserted onto the pins accessible through the small opening between the Ethernet connector and the RS-232 Command Port. The unit is then rebooted or powered up. When the command line mode is subsequently accessed either through the serial connection or via the network, “SAFE MODE” will be indicated below the welcome banner. This is the only indication that the mode has been enabled. The jumper must be removed and the unit must be rebooted in order to exit SAFE MODE.
Note that you may ‘borrow’ a jumper from the N.O./N.C. relay jumpers if you remove the unit’s cover. Do so only if disconnecting that relay will not adversely affect any system connected to it. Use an unused channel if available. Once you are done with SAFE MODE be sure to return the jumper to the original position. Jumpers placed close to the relay output connector are set for Normally Open (N.O.) operation (default).
There are two situations in which SAFE MODE is useful.

15.1 Application Generated Boot Loops

If an application that is programmed to automatically start upon boot misbehaves and immediately causes a reboot, a boot loop will result. In this case the JNIOR will rapidly reboot and it will be impossible to regain control of the unit through normal means. The solution is to insert the SAFE MODE jumper. On the next reboot the application will not be restarted and you will be able to log into command line mode. Once at the command line you can proceed with debugging. The SAFE MODE jumper should be removed and you might want to also remove the application’s “Run” key until you are certain the issue is resolved.

15.2 Forgotten Administrator Username or Password

If you have lost or forgotten the administrator’s password (‘jnior’ user for instance), you will need to contact INTEG to obtain the “backdoor” password for your unit. This password will allow you to log into all accounts (even disabled accounts) but only in SAFE MODE. Once you have logged into the administrator’s account you should use the passwd command to change the administrator’s password.
In case you have removed the default administrator’s accounts (‘jnior’ and ‘admin’) and replaced them with your own account name which now you may have forgotten, SAFE MODE will restore a DISABLED account for ‘jnior’ with the standard default password. You may log into disabled accounts in SAFE MODE using the backdoor password. Use the ‘jnior’ account to manage your user accounts.
A unit should not be left in SAFE MODE as this enables the backdoor password wherever a password is requested. That means that it would be valid for web page login, FTP, etc. as well. This would represent a serious security concern. Remember to remove the SAFE MODE jumper once you are done and reboot!

This sample shows you how to pulse multiple outputs and a single output. The method must take two binary masks. One describing the desired states during the pulse and the other describing which channels will be pulsed.

package com.integpg;
import com.integpg.system.JANOS;
public class PulseOutputs {
    public static void main(String[] args) {
        // to get the states of the outputs use the JANOS class and the getOutputStates method
        int outputStates = JANOS.getOutputStates();
        //print the Output States through telnet (console) connection.
        System.out.println("Output States are: " + outputStates);
        //Pulse 8 Relay Outputs On for 5 seconds (5000 milliseconds) after which outputs will return to previous state.
        //All channels (1111 1111b)
        JANOS.setOutputPulsed(255, 255, 5000);
        //Sleep 10 seconds so that there is a noticeable difference between on and off states.
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ex) {
            ex.printStackTrace();
        }
        int counter = 0;
        while(counter<5){
            //Pulse Channel 5 Relay Output On for 5 seconds (5000 milliseconds) after which output will return to previous state.
            //Channels   8765 4321
            //Channel 5 (0001 0000b)
            JANOS.setOutputPulsed(16, 16, 5000);
            try {
                Thread.sleep(10000);
            } catch (InterruptedException ex) {
                ex.printStackTrace();
            }
            counter++;
            System.out.println("Counter: " + counter);
        }
    }
}
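The literal masks 255 and 16 in the listing above are easier to get right when they are built from channel numbers. The helper below is a small sketch in plain Java. The class and method names are hypothetical, and it assumes eight outputs with channel 1 in the least significant bit, as the comments in the listing indicate.

```java
public class ChannelMasks {

    // Builds a bit mask from 1-based channel numbers (channel 1 = bit 0).
    // The 1..8 range is an assumption based on the 8 relay outputs used above.
    public static int maskOf(int... channels) {
        int mask = 0;
        for (int channel : channels) {
            if (channel < 1 || channel > 8) {
                throw new IllegalArgumentException("channel out of range: " + channel);
            }
            mask |= 1 << (channel - 1);
        }
        return mask;
    }

    public static void main(String[] args) {
        System.out.println(maskOf(5));                       // 16  (0001 0000b)
        System.out.println(maskOf(1, 2, 3, 4, 5, 6, 7, 8));  // 255 (1111 1111b)
        System.out.println(maskOf(1, 3));                    // 5   (0000 0101b)
    }
}
```

With such a helper, the channel 5 pulse from the listing could be written as JANOS.setOutputPulsed(maskOf(5), maskOf(5), 5000), which passes the same two values of 16.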

The classic Hello World application that runs on the JNIOR!

package helloworld;
public class HelloWorld {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        System.out.println("hello world");
    }
}

Write Outputs

Sometimes you want to control the outputs. The outputs may be wired to things like lights, sirens, valves and maybe fans. (Note: You may need our Power 4 Relay Output Module for high loads).  This example will show you how to set the output states programmatically.

package com.integpg;
import com.integpg.system.JANOS;
public class WriteOutputs {
    public static void main(String[] args) {

        // to get the states of the outputs use the JANOS class and the getOutputStates method
        int outputStates = JANOS.getOutputStates();

        //print the Output States through telnet (console) connection.
        System.out.println("Output States are: " + outputStates);

        //set output relay for channel 5 (zero-based index 4): true turns the relay on, false turns it off.
        JANOS.setOutputRelay(4, true);

        //Relay Output 5 should now be on.
        //read the output states again so that the printed value reflects the change.
        outputStates = JANOS.getOutputStates();
        System.out.println("Output States are: " + outputStates);
    }
}
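The example above prints the output states as a single integer. A sketch like the following can make that number easier to read. It is plain Java with hypothetical names, and it assumes the integer uses the same bit layout as the masks in the pulse sample (channel 1 in the least significant bit).

```java
public class OutputStateDecoder {

    // True when the bit for the given 1-based channel is set.
    public static boolean isOn(int states, int channel) {
        return (states & (1 << (channel - 1))) != 0;
    }

    // Renders the low 8 bits as text with channel 8 on the left and channel 1 on the right.
    public static String toBits(int states) {
        StringBuilder sb = new StringBuilder();
        for (int channel = 8; channel >= 1; channel--) {
            sb.append(isOn(states, channel) ? '1' : '0');
            if (channel == 5) {
                sb.append(' '); // visual gap between the two groups of four
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int states = 16; // hypothetical reading with only channel 5 on
        System.out.println(toBits(states));   // 0001 0000
        System.out.println(isOn(states, 5));  // true
        System.out.println(isOn(states, 4));  // false
    }
}
```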

The JNIOR is a very flexible and powerful controller. Utilize our bundled or add-on software applications. If those don’t meet your needs, let INTEG quickly develop an application for you.

The JNIOR offers superb functionality with its included and available software. However, if you require a custom application to run on your JNIOR, INTEG can develop it for you, or you can develop it yourself using the JNIOR Software Development Kit (SDK). INTEG has already developed a number of custom applications for a variety of customers. Some of these applications have become our ‘add-on’ applications because they have met the needs of a large group of customers.

Other times the applications have been focused and developed to meet the needs of a specific customer.  After the user requirements are gathered it doesn’t take long for INTEG to deliver something for the customer to test.  Often we do not receive the devices that the customer wants to interface with.  This makes it tough for INTEG to complete full testing in the office.  Sometimes we write test applications to mimic the communications between the JNIOR and the end device.

If you have an application that you have in mind and want to talk to INTEG about the JNIOR please call the office, 724-933-9350, or email support@integpg.com.  You can also fill out the contact form.

Thank you for your interest in the JNIOR from INTEG.

NOTE: Before creating the setup described in this post on your JNIOR 410, please download the update project below and install it on your 410 using the JNIOR Support Tool. This is required to get the JNIOR to create a DMX connection.

  DmxPort for enabling DMX on 410 [ Mar 17 2021, 3.5 KB, MD5: 299c66717c03a9c9b702716d9d56d095 ]

This article is the first of two addressing the issues encountered in, and a means to simplify, the processing of the DMX Universe data stream using a standard UART serial receiver. The follow-up article is entitled JNIOR as a DMX Fixture Revisited.

The Model 412DMX generates a DMX512 Universe and allows the JNIOR to control DMX fixtures like those used in stage lighting. What if you needed a fixture with relays that can be controlled by DMX? Perhaps you need to output channels over 4-20 mA loops. Maybe you need a 10 VDC output signal to control LED house lighting. Can the JNIOR receive DMX? Can the JNIOR be a DMX Fixture?

We showed you how you could control DMX fixtures with a standard Model 410 in a White Paper available here:

  AN01 DMX512 Implementation [ Jul 20 2017, 101.27 KB, MD5: e1b0203f177d1866e56cfbfdd0e221d4 ]

Now we have the 412DMX JNIOR designed for that purpose. Can the Model 410 also serve as a DMX fixture? Yes, it can. I’ll show you how here and we’ll see how we manage to accommodate some of the unique aspects of the DMX512 format with the JNIOR.

Cabling

We can use the JNIOR Model 410 because the AUX port is compatible with RS-485. In the white paper explaining how the 410 can be used to control DMX fixtures we described an adapter cable taking the DB9 output from the JNIOR and presenting the proper female XLR connector for DMX. Now since a DMX fixture always has both male and female 5-pin XLR connectors, our cabling has to be slightly different. Note that you can do this with the 3-pin XLR (as I have) if that is appropriate for your situation.

Here is an example of one that we put together.

modified Series 4 DMX AUX port cable

This can be constructed by splicing into a standard DMX extension cable. A number of DB9 adapters with screw terminals like the one pictured can be found on Amazon. Note that you will want one with large screws compatible with larger wire sizes. DMX wiring is typically of a larger diameter and you will need to successfully clamp two wires in each of three positions on the adapter.

Here is the pin numbering. Note that wire colors vary.

        Signal           XLR      DB-9 Male
--------------------  ---------  -----------
Signal Ground (GND)       1          5
Data (D-)                 2          2
Data (D+)                 3          8
Not Used (NC)            4,5     1,3,4,6,7,9

This cable allows the JNIOR to be a DMX FIXTURE.

THE RESULTING DMX CONNECTION IS NOT ISOLATED. We recommend using an isolated power supply for the JNIOR and not sharing that voltage with other circuits. Take great care in making ground connections. Note that the JNIOR relay outputs are naturally isolated.

Serial Connection

Connect the adapter to the Model 410 AUX serial port as I have in this photo and connect this to the DMX network. Note that the 412 and 414 are not RS-485 compatible and cannot be used for this purpose.

modified Series 4 DMX AUX port cable

The serial port parameters should be set as follows. This is done through the Dynamic Configuration Pages (DCP) that should come up when accessing the JNIOR using your browser. You enable the RS-485 mode here so the AUX port output doesn’t disrupt the DMX communications before you have a chance to run the DMXFIXTURE application that I will describe. That application will also configure the AUX port just to make sure that all is well.

JNIOR Aux port settings

If you encounter “Applets” instead of the DCP then your Series 4 needs to be updated or you have a Series 3. The latter also cannot be used for this application. You will need to update your JNIOR to JANOS v1.6.6 or later for the functionality to be described here.

Data

With the JNIOR Model 410 wired to the DMX network and the AUX serial port properly configured the unit should be receiving data. There is a simple way to check that. You can see data without any application running just by using the IOLOG command. Here we enter the Console (or Command Line Interface) and use this command.

InfoComm_LED /> help iolog
IOLOG

Options:
 -T             Indicate transitions
 -R             Reset logs
 -A             AUX Serial log
 -S             Sensor Port log
 -O             Output to stdout

Generates jniorio.log file from available logs.

InfoComm_LED /> iolog -ao
--  07/02/18 15:42:46.098
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--80--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--FF--00--FF--80--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--83--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--80--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--FF--00--FF--80--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--83--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--80-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--FF--00--FF--80-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--83--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-80--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--FF--00--FF-    ................
-80--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--83--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-    ................
-00--00--00--00--00--00--00--00--00--00--00--00--00--00--00-        ...............

InfoComm_LED />

If you scan down in the above output and look through the data you will see that there are a couple of channels at 100% (0xFF) and a couple near half (0x80’ish). There are two pretty major issues in trying to read these bytes with standard library read() functions.

  1. How do we know how to find Channel 1? There are many more than 512 bytes shown here. If you read 512 bytes what you get could start anywhere.
  2. The data rate at 250 Kbaud supplies 44 complete channel sets per second! That absolutely will overrun the buffer before you can process any of it. The overrun would likely further obfuscate the data.

The fact is that the standard serial communications routines that you may be used to are just not usable here. JANOS will come to the rescue. But first let’s take a look at the data stream so we understand how this is to be resolved.

DMX Format

The DMX data on the RS-485 lines conforms to standard asynchronous serial data with 8 data bits, 2 stop bits and no parity. The bits are marched out from least significant (LSB) D0 to the most significant (MSB) D7. Each byte is called a “Slot”. The standard implementation transfers a START CODE and 512 channels in a total of 513 slots. The START CODE for normal DMX data is 0x00.

The beginning of the sequence is signaled with a Break Condition. This BREAK can be detected by the fixtures which allows them to synchronize with the stream. After the BREAK comes the START CODE (0x00 – NULL START) followed by the value for Channel 1 on through to Channel 512. Not all 512 channels need to be part of the transmission. The number of channels may vary by DMX controller. The complete implementation provides all 512.
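The figure of roughly 44 channel sets per second quoted earlier follows directly from this format. The sketch below works through the arithmetic in plain Java. The slot format and the 250 Kbaud rate come from the text. The Break and mark-after-break durations are not given in this article, so the values used here are assumed minimum timings. The class and method names are hypothetical.

```java
public class DmxTiming {
    static final int BAUD = 250000;
    static final int BITS_PER_SLOT = 1 + 8 + 2;  // start bit + 8 data bits + 2 stop bits
    static final int SLOTS = 513;                // START CODE + 512 channels
    static final int BREAK_MICROS = 92;          // assumed minimum Break length
    static final int MAB_MICROS = 12;            // assumed minimum mark-after-break

    // Time to send one slot, in microseconds
    static int slotMicros() {
        return BITS_PER_SLOT * 1000000 / BAUD;
    }

    // Time to send one full packet including the Break, in microseconds
    static int packetMicros() {
        return BREAK_MICROS + MAB_MICROS + SLOTS * slotMicros();
    }

    // Whole packets that fit into one second
    static int packetsPerSecond() {
        return 1000000 / packetMicros();
    }

    public static void main(String[] args) {
        System.out.println("slot:   " + slotMicros() + " us");        // 44 us
        System.out.println("packet: " + packetMicros() + " us");      // 22676 us
        System.out.println("rate:   " + packetsPerSecond() + " /s");  // 44 /s
    }
}
```

At 44 microseconds per slot, a new byte arrives far faster than an application polling a buffered serial port can reasonably keep up with, which is the overrun problem described above.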

On the oscilloscope the BREAK looks like this. Here all of the channels are 0x00 and so you see only the STOP BITS. That long low pulse is the BREAK.

The issue is that the BREAK is difficult to handle with the standard serial port. It results in a FRAMING ERROR. During the break the signal is held at a low level. When the receiving serial UART expects the STOP BITS and they aren’t there it throws a FRAMING ERROR. While that can be detected and your application can be notified there still is no way to ensure that the next bytes read from the port are those that follow the break. They may have been buffered some time before. They may be overrun by oncoming data.

In order to handle this and properly capture a reliable channel set, there must be a special function for that purpose in the AUX port class (AUXSerialPort). Of course, being the author of JANOS, I have implemented exactly what we need. And, those details are next…

Packet Capture

To read the DMX Packet (START CODE plus up to 512 Slots/Channels) we need to detect the Break Condition and then reliably collect as many as 513 serial bytes that immediately follow. Under many other RTOS implementations we would need to write an interrupt driven routine both to detect the Break condition and then also to collect the data. The JNIOR executes application programs written in a managed language (Java) and one does not have low level access to write things like serial interrupt routines. That is actually a good thing as the user generally does not have the programming experience. Such low level user programming often leads to unstable/unpredictable operation.

Here we rely on JANOS to maintain reliable operation. Low level interrupt routines have already been implemented to buffer incoming serial data and otherwise issue a notification of errors. Recall that the Break Condition manifests itself as one or more FRAMING ERRORS. But we have already established that reading buffered serial data and receiving asynchronous notifications is not going to be sufficient for capturing a DMX packet. This is where we benefit from having developed JANOS in-house and having authored 100% of it. Here we identify a need and are able to promptly and correctly implement a solution.

AUXSerialPort.readAfterBreak(byte[] buffer)

I have added the readAfterBreak() method to the AUXSerialPort class in the JanosClasses.jar library. From the naming its use is self-explanatory. Here you create a buffer as a byte array and pass it to JANOS. The operating system enables the capture and then blocks the thread until the data collection completes. At the low level JANOS sets up the buffer with a pointer and goes into a kind of ‘armed’ state. The interrupt routine that detects FRAMING errors has a tiny bit of code that checks for an armed capture and ‘triggers’ the collection of data. The interrupt routine that collects and buffers serial bytes from the port has code to set each byte aside into the buffer that you have provided. Once triggered the capture passes into a ‘collection’ mode. When the buffer is full (or when another Break Condition is detected) the capture is ‘complete’ and the application program can proceed now with a byte array containing the DMX data.
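The armed, triggered, collection and complete stages described above can be pictured as a small state machine. The following is a plain Java simulation written only to illustrate that description. It is not the JANOS implementation, all names are hypothetical, and it assumes that repeated framing errors arriving before the first byte belong to the same Break.

```java
// Illustrative simulation of the capture stages; not JANOS source code.
public class BreakCaptureModel {
    enum State { IDLE, ARMED, COLLECTING, COMPLETE }

    private State state = State.IDLE;
    private byte[] buffer;
    private int count;

    // Application side: hand over a buffer and arm the capture.
    public void arm(byte[] buffer) {
        this.buffer = buffer;
        this.count = 0;
        this.state = State.ARMED;
    }

    // Stands in for the interrupt routine that detects framing errors.
    public void onFramingError() {
        if (state == State.ARMED) {
            state = State.COLLECTING;            // the Break triggers the collection
        } else if (state == State.COLLECTING && count > 0) {
            state = State.COMPLETE;              // another Break ends a short packet
        }
    }

    // Stands in for the interrupt routine that receives serial bytes.
    public void onByte(byte value) {
        if (state != State.COLLECTING) {
            return;                              // bytes outside a capture are not stored here
        }
        buffer[count++] = value;
        if (count == buffer.length) {
            state = State.COMPLETE;              // buffer full
        }
    }

    public boolean isComplete() { return state == State.COMPLETE; }
    public int count() { return count; }

    public static void main(String[] args) {
        BreakCaptureModel capture = new BreakCaptureModel();
        byte[] packet = new byte[4];
        capture.arm(packet);
        capture.onByte((byte) 0x55);   // arrives before the Break, so it is skipped
        capture.onFramingError();      // Break detected
        capture.onByte((byte) 0x00);   // START CODE
        capture.onByte((byte) 10);
        capture.onByte((byte) 20);
        capture.onByte((byte) 30);     // buffer is now full
        System.out.println(capture.isComplete() + " " + capture.count());
    }
}
```

In the real method the calling thread simply blocks until the capture completes. The simulation replaces that blocking with explicit calls so the stages can be followed step by step.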

Now to benefit from this new feature, you will need to update your Series 4 to run JANOS v1.6.6 or later. At the moment this is Beta code. We would make it available if you were to want to try this before its release. All you need do is ask.

Next we need to try it out…

DMX Capture Test

Here we create a project in Netbeans (making the few settings needed to target it for JANOS) and create the following test program. This merely takes control of, and fully configures, the AUX port in case it has not been configured through the DCP. Lines 23 and 24 test our new method. The rest merely dumps the byte array content for review.

package dmxfixture;
 
import com.integpg.comm.AUXSerialPort;
import com.integpg.comm.SerialPort;
 
public class Dmxfixture {
 
    public static void main(String[] args) throws Throwable {
 
        // AUX port access and configuration.  We need to open the port to gain exclusive access and
        //  set the proper baud rate and format.  We enable RS-485 mode and make sure that the receivers
        //  are enabled.  With normal RS-485 you would disable the transmit drivers.  Our adapter doesn't
        //  bridge the transmit and receive lines anyway and the DCP configuration automatically disables
        //  the drivers.  It is here for clarity.
        AUXSerialPort aux = new AUXSerialPort();
        aux.open();
        aux.setSerialPortParams(250000, 8, 1, SerialPort.PARITY_NONE);
        aux.setRS485(true);
        aux.enableReceivers(true);
        aux.enableDrivers(false);
        
        // capture a complete frame using our new method
        byte[] data = new byte[513];
        aux.readAfterBreak(data);
        
        // The remainder here is a fancy dump (skipping the START CODE).  Note how JANOS implements the
        //  printf formatting for us.
        for (int i = 1; i < data.length; i++) { 
            if (i % 10 == 1)
                System.out.printf("%04d  ", i);
            System.out.printf("%4d ", data[ i ] & 0xff);
            if (i % 10 == 0)
                System.out.println("");
        }
        System.out.println("");
    }
    
}
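One detail in the dump loop deserves a closer look: Java bytes are signed, so a channel at full level (0xFF) reads as -1 unless it is masked with 0xff, as the listing does. The helper below isolates that step in plain Java. The class and method names are hypothetical, and the packet is filled by hand rather than captured from a DMX network.

```java
public class DmxLevels {

    // Returns the 0..255 level of a 1-based channel from a packet whose
    // element 0 holds the START CODE.
    public static int level(byte[] packet, int channel) {
        if (packet[0] != 0) {
            throw new IllegalStateException("not a NULL START packet");
        }
        return packet[channel] & 0xff;  // mask converts the signed byte to 0..255
    }

    public static void main(String[] args) {
        byte[] packet = new byte[513];  // START CODE plus 512 channels, all zero
        packet[93] = (byte) 0xFF;
        packet[96] = (byte) 0x80;
        System.out.println(packet[93]);         // -1, the raw signed byte
        System.out.println(level(packet, 93));  // 255
        System.out.println(level(packet, 96));  // 128
    }
}
```

Checking the START CODE before trusting the values is a cautious choice on my part. The listing skips element 0 without testing it.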

To run this we first build it in Netbeans. Then using the DCP we open the Folders tab and select the /flash folder. We then drag the dmxfixture.jar file from the project to the /flash folder (it can be executed from the root too). Then under the Console tab we log in and execute the application. The following is the result.

InfoComm_LED /> dmxfixture
0001     0    0    0    0    0    0    0    0    0    0 
0011     0    0    0    0    0    0    0    0    0    0 
0021     0    0    0    0    0    0    0    0    0    0 
0031     0    0    0    0    0    0    0    0    0    0 
0041     0    0    0    0    0    0    0    0    0    0 
0051     0    0    0    0    0    0    0    0    0    0 
0061     0    0    0    0    0    0    0    0    0    0 
0071     0    0    0    0    0    0    0    0    0    0 
0081     0    0    0    0    0    0    0    0    0    0 
0091     0    0  255    0  255  128    0    0    0    0 
0101     0    0    0    0    0    0    0    0    0    0 
0111     0    0    0    0    0    0    0    0    0    0 
0121   131    0    0    0    0    0    0    0    0    0 
0131     0    0    0    0    0    0    0    0    0    0 
0141     0    0    0    0    0    0    0    0    0    0 
0151     0    0    0    0    0    0    0    0    0    0 
0161     0    0    0    0    0    0    0    0    0    0 
0171     0    0    0    0    0    0    0    0    0    0 
0181     0    0    0    0    0    0    0    0    0    0 
0191     0    0    0    0    0    0    0    0    0    0 
0201     0    0    0    0    0    0    0    0    0    0 
0211     0    0    0    0    0    0    0    0    0    0 
0221     0    0    0    0    0    0    0    0    0    0 
0231     0    0    0    0    0    0    0    0    0    0 
0241     0    0    0    0    0    0    0    0    0    0 
0251     0    0    0    0    0    0    0    0    0    0 
0261     0    0    0    0    0    0    0    0    0    0 
0271     0    0    0    0    0    0    0    0    0    0 
0281     0    0    0    0    0    0    0    0    0    0 
0291     0    0    0    0    0    0    0    0    0    0 
0301     0    0    0    0    0    0    0    0    0    0 
0311     0    0    0    0    0    0    0    0    0    0 
0321     0    0    0    0    0    0    0    0    0    0 
0331     0    0    0    0    0    0    0    0    0    0 
0341     0    0    0    0    0    0    0    0    0    0 
0351     0    0    0    0    0    0    0    0    0    0 
0361     0    0    0    0    0    0    0    0    0    0 
0371     0    0    0    0    0    0    0    0    0    0 
0381     0    0    0    0    0    0    0    0    0    0 
0391     0    0    0    0    0    0    0    0    0    0 
0401     0    0    0    0    0    0    0    0    0    0 
0411     0    0    0    0    0    0    0    0    0    0 
0421     0    0    0    0    0    0    0    0    0    0 
0431     0    0    0    0    0    0    0    0    0    0 
0441     0    0    0    0    0    0    0    0    0    0 
0451     0    0    0    0    0    0    0    0    0    0 
0461     0    0    0    0    0    0    0    0    0    0 
0471     0    0    0    0    0    0    0    0    0    0 
0481     0    0    0    0    0    0    0    0    0    0 
0491     0    0    0    0    0    0    0    0    0    0 
0501     0    0    0    0    0    0    0    0    0    0 
0511     0    0 

InfoComm_LED />  

We note that channels are correct. Here we go over to the 412DMX controlling this DMX network and check Kevin’s DMX panel page for comparison.

DMX control panel

Putting it to Work

Now that we can receive a DMX frame and read the individual channels, what can we do with it? I mean, other than dump it?

Well Kevin has defined an eight channel fixture starting at DMX channel 121. The idea being that each channel would correspond to a JNIOR Relay Output. Channel settings from 0-127 would result in an open/off relay and values in the range 128-255 would close the relay. You can imagine any use that you would want given the flexibility that you now have in JNIOR programming. Let’s implement this particular fixture.
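The channel-to-relay rule described above can be sketched in plain Java, independent of the JNIOR. The class and method names here are made up for illustration; the only facts taken from the text are the 128 threshold, the eight consecutive channels, and the frame layout with the START CODE in slot 0.

```java
// Hypothetical helper (not part of the JANOS API): converts eight consecutive
// DMX channels into a relay bit mask, one bit per relay output.
final class RelayMapSketch {

    // frame[0] is the START CODE; frame[1..512] hold channels 1..512.
    static int relayBits(byte[] frame, int startChannel) {
        int bits = 0;
        for (int i = 0; i < 8; i++) {
            int level = frame[startChannel + i] & 0xff;   // treat the byte as unsigned 0-255
            if (level >= 128)
                bits |= (1 << i);                          // 128-255 closes relay i+1
        }
        return bits;
    }

    public static void main(String[] args) {
        byte[] frame = new byte[513];
        frame[121] = (byte) 131;   // channel 121 -> relay 1 closed
        frame[123] = (byte) 127;   // channel 123 -> relay 3 stays open
        frame[128] = (byte) 255;   // channel 128 -> relay 8 closed
        System.out.printf("0x%02X%n", relayBits(frame, 121));   // prints 0x81
    }
}
```

Masking with 0xff and comparing against 128 is equivalent to testing the raw Java byte for a negative value; it is just a little more explicit about the intent.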

The approach will be to sample a DMX packet periodically and set the relays appropriately. There is no need to catch every DMX packet and in fact we are not likely going to be able to do that. We are also going to be considerate of the JNIOR CPU and anything else that the unit might want to be doing. We will sample say every 1/4 second and sleep in between.

Here is the program. It uses an infinite loop to sample the DMX stream about 4 times a second. The starting address must be defined in the Registry. This could be cached, but reading it on every pass means you can change the starting address without rebooting or restarting the DMXFIXTURE program. It is presumed that you would start the DMXFIXTURE program automatically at boot with a Registry Run key.

package dmxfixture;
 
import com.integpg.comm.AUXSerialPort;
import com.integpg.comm.SerialPort;
import com.integpg.system.JANOS;
 
public class Dmxfixture {
 
    public static void main(String[] args) throws Throwable {
 
        // AUX port access and configuration.  We need to open the port to gain exclusive access and
        //  set the proper baud rate and format.  We enable RS-485 mode and make sure that the receivers
        //  are enabled.  With normal RS-485 you would disable the transmit drivers.  Our adapter doesn't
        //  bridge the transmit and receive lines anyway and the DCP configuration automatically disables
        //  the drivers.  It is here for clarity.
        AUXSerialPort aux = new AUXSerialPort();
        aux.open();
        aux.setSerialPortParams(250000, 8, 1, SerialPort.PARITY_NONE);
        aux.setRS485(true);
        aux.enableReceivers(true);
        aux.enableDrivers(false);
 
        // here we create an infinite loop to continuously process the DMX data
        byte[] data = new byte[513];
        for (;;) {
            
            // capture a complete frame
            aux.readAfterBreak(data);
            
            // Obtain the starting address.  If it is invalid or not defined no action is taken.
            int addr = JANOS.getRegistryInt("DMX/Address", 0);
            if (addr > 0 && addr < 505) {
                
                // Although we don't have to we are going to collect all of the relay states
                //  and set them simultaneously.  This will also take advantage of signed values
                //  in Java.  Values in the range 128-255 will appear to be negative if we don't
                //  mask them with 0xff.
                int bits = 0;
                for (int i = 0; i < 8; i++) {
                    if (data[addr++] < 0)
                        bits += (1 << i);
                }
                JANOS.setOutputStates(bits, 0xff);
            }
            
            // sleep for a quarter second
            System.sleep(250);
        }        
    }
    
}

This program should be pretty easy to follow. Let’s test it.

Demonstration

A video can best demonstrate the operation of this program. Here we have a DMX application running on a 412DMX (10.0.0.242) allowing us to vary the channels that we associate with our 410 fixture. A separate Model 410 running our DMXFIXTURE (10.0.0.250) program can be monitored remotely through its DCP page. Here we overlap the two browser entities and we can see how modifying the channel fader results in the relay status change out across the DMX network.

Reliability

Let’s look into potential error conditions and the reliability of this approach. The DMX format typically supplies nearly 44 frames per second. If there is a communications error, due to electrical noise for instance, one and possibly up to a few frames might be in error. For a light fixture this might cause a minute flicker or some small flinch in pointing. But given the frame rate it is quickly corrected and might not even be noticeable. If we are interpreting a frame with our program we need to be extra careful not to trigger a chain of events based upon an error packet.

Typically in data protocols we would have some form of checksum or CRC which we can use to identify an erroneous transmission so it can be ignored. There is no such thing in the DMX512 protocol. So what steps can we take?

Well to start we should verify that the START CODE is the expected NULL START 0x00 and ignore any frame with a different code. The controller might actually be inserting those and we must ignore them. I will adjust the program to check this.
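As a rough sketch of that kind of defensive check, the class below rejects frames without a NULL START and, as an additional assumption of mine rather than anything the program above does, only accepts a fixture's channel window once two consecutive samples agree. All names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical filter, not part of the DMXFIXTURE program: it ignores frames
// whose START CODE is not 0x00 and requires two matching samples in a row.
final class FrameFilterSketch {

    private byte[] previous;   // channel window seen in the last frame with a NULL START

    // True when slot 0 carries the NULL START code.
    static boolean hasNullStart(byte[] frame) {
        return frame.length > 0 && frame[0] == 0x00;
    }

    // Returns the 8-channel window when it can be trusted, otherwise null.
    byte[] accept(byte[] frame, int startChannel) {
        if (!hasNullStart(frame))
            return null;                                   // wrong START CODE: skip it
        byte[] window = Arrays.copyOfRange(frame, startChannel, startChannel + 8);
        boolean stable = Arrays.equals(window, previous);  // false on the very first sample
        previous = window;
        return stable ? window : null;
    }
}
```

The cost of the two-sample rule is one extra sampling period of latency before a change reaches the relays.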

Well… The START CODE is returning 128 (0x80) and the channels appear to be properly registered (e.g. in the right place). Now to look into this.

Synchronization After Break

The DMX512 specification defines the width of the Break Condition as something greater than 92 microseconds. It is important to note that it is something greater than twice that of a single slot time (the time to receive a single byte) of 44 microseconds (11 bit times – start bit, 8 data bits and 2 stop bits). It is not a precise multiple of slot times or even bit times. This forces the receiver to synchronize with each and every packet.
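The timing figures can be checked with a little arithmetic. This sketch assumes the 250000 baud rate used in the program above, the 11 bit times per slot and 92 microsecond break quoted here, and the specification's 12 microsecond minimum Mark After Break. It also assumes a full frame of 513 slots sent back to back.

```java
// Back-of-the-envelope DMX512 timing under the assumptions stated above.
final class DmxTimingSketch {

    static final int BAUD = 250_000;
    static final int BITS_PER_SLOT = 11;       // start bit, 8 data bits, 2 stop bits
    static final int SLOTS = 513;              // START CODE plus 512 channels
    static final double MIN_BREAK_US = 92.0;
    static final double MIN_MAB_US = 12.0;

    static double bitTimeMicros()  { return 1_000_000.0 / BAUD; }
    static double slotTimeMicros() { return BITS_PER_SLOT * bitTimeMicros(); }

    // Shortest possible full frame: break, mark after break, then every slot.
    static double minFrameMicros() {
        return MIN_BREAK_US + MIN_MAB_US + SLOTS * slotTimeMicros();
    }

    static double maxFramesPerSecond() { return 1_000_000.0 / minFrameMicros(); }

    public static void main(String[] args) {
        System.out.println("bit time  (us): " + bitTimeMicros());
        System.out.println("slot time (us): " + slotTimeMicros());
        System.out.println("frame     (us): " + minFrameMicros());
        System.out.printf("frames/s      : %.1f%n", maxFramesPerSecond());
    }
}
```

The result is a 4 microsecond bit, a 44 microsecond slot and a shortest full frame of 22676 microseconds, which is where the figure of roughly 44 frames per second comes from.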

Given this I could make the argument that the Mark After Break should be at least one slot time of 44 microseconds in order to ensure that the leading start bit of the first slot is successfully interpreted. The DMX512 specification however specifies a minimum Mark After Break of 12 microseconds. This puts us at the mercy of the UART design and its ability to synchronize following a Break Condition of arbitrary length. There are a number of possible outcomes that depend on what the UART decides is the first STOP BIT once the Break Condition passes.

  • For example, if the beginning of the Mark After Break is seen as a valid STOP BIT then a 0x00 byte is received AHEAD OF the normal NULL START code 0x00. This extra 0x00 can be interpreted as a valid START CODE but all of the channel slots are off by 1. Channel 2 would have the value for Channel 1. This is an ERROR!
  • If the Mark condition just slightly into the Mark After Break is interpreted as a valid start bit then an extra 0x80 is received AHEAD of the START CODE. This might be seen as a bad packet if the START CODE is verified. Channels are also shifted if values are used. This is an ERROR!
  • The above continues with each bit time advance into the Mark After Break generating an initial extra byte of 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE and 0xFF depending on the length of the Mark After Break. In each case the START CODE would then be considered the Channel 1 value. ERRORs result!
  • With a short Mark After Break the UART might look at a low bit value in the START CODE as a missing STOP BIT and generate yet another FRAMING ERROR. Again depending on the timing the START CODE might be returned as 0x80 with the first STOP BIT actually being interpreted as the MSB. In this case the Channel data is properly positioned. This is actually the most common mode I am seeing in the current set up. It is timing sensitive. This is also an ERROR!

If you follow this logic you might see that it is possible that it may take a couple of regular slot times before the UART grabs something it is happy about. It is all about the synchronization aspect of the hardware design.

The question is how to know when you are receiving valid data and properly aligned slots? Is there a solution to this?

A UART that requires a Marking Condition before attempting to detect a START BIT (falling edge) would function properly. Apparently they don’t work this way. At least not all of them.

UART Issue

The problem that we run into is an ancient design flaw in serial ports.

A Framing Error results when the UART (RX SCI) expects a Stop Bit and none is detected. A Stop Bit is a high (1 Marking) and during a Break Condition the signal is held low (0 Space) so a Framing Error is quickly encountered. Now most descriptions of UART logic suggest that after a Break the UART locates the next Start Bit (0 Space) and that this is detected by a high to low transition of the signal (1 -> 0). Logically it is done that way for asynchronous reception as the UART clock needs to synchronize and then sample the middle of each bit period.

In reality after a Framing Error the UART seems to see the next low (0 Space) as a Start Bit and continues to read bit data. As a result Framing Errors are repeated throughout the break period. A bogus byte value might appear to be properly read if the tail end of the Break Condition aligns with the UART in a way to make the high (1 Marking) after the Break look like a Stop Bit.

The likelihood of this bogus data byte and its content can vary depending on the length of the Break and the length of the Marking after the Break and before actual data is present. Since bytes are serialized LSB first these extra bytes look like one of 0xFF, 0xFE, 0xFC, 0xF8, 0xF0, 0xE0, 0xC0, 0x80 and even 0x00.
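Those values follow directly from the bit order. As a simplified model (my own, ignoring sampling jitter), suppose the UART samples eight data bits that straddle the end of the Break: the earliest samples are low and the last few fall in the Mark. Because the first bit received is the LSB, the high samples land in the upper bits.

```java
// Simplified model of a byte assembled across the Break-to-Mark transition.
final class BogusByteSketch {

    // markBits: how many of the 8 data-bit samples (0-8) fell in the Mark.
    static int bogusByte(int markBits) {
        int value = 0;
        for (int bit = 0; bit < 8; bit++) {
            // Bits arrive LSB first, so the first (8 - markBits) samples are
            // still in the Break (low) and the remaining ones are high.
            if (bit >= 8 - markBits)
                value |= (1 << bit);
        }
        return value;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 8; k++)
            System.out.printf("%d mark bits -> 0x%02X%n", k, bogusByte(k));
    }
}
```

Running it lists 0x00, 0x80, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC, 0xFE and 0xFF, the same set of extra bytes described above.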

If the Marking after Break is brief (only a few bit times) and the alignment falls such that the UART looks at bits in the first byte of data for that magic Stop Bit, you will receive an incorrect value for the first byte. It is conceivable that the UART might take several bytes before synchronizing and providing real data.

If the UART simply fell into a mode whereby it actually did search for the next Start Bit by looking for a valid high to low transition (1 -> 0), you would get a single Framing Error followed by the proper collection of data. But no… after 50+ years we have not addressed this issue. I half recall struggling with this exact thing maybe 30 or so years back now. The fact that it is still a problem is not impressive.

I guess I shouldn’t be surprised in that these hi-tech MCU processors all still include the Real Time Clock (RTC) circuit first designed for the very first digital watches in the late 1960s. This forces us to parse time into Day, Month, Year, Hour, Minute and Seconds as if setting a watch on your wrist. In fact Seconds can only be reset to 00 and not directly set. On boot we have to read the time and reassemble it into Linux or Internet time as a tally of milliseconds since some epoch. Lots of work that causes loss of precision. And the ideal would be a non-volatile battery-backed 64-bit millisecond counter. Sometimes silicon space is limited and this counter would save lots of that. But no… these integrated circuit companies aren’t as swift as we would like to think.

Since DMX512 signals can have different lengths of Break and Marking after Break and these can vary depending on source, and since the protocol has no leading header that can be used in identifying valid frames, we are NOT ABLE to reliably receive data. Note that if the DMX512 Standard had forced the Mark After Break to be at least one data Slot long (> 44 microseconds) then UARTs would likely properly synchronize and reliably present the first byte of data. But the spec does not and the problem is that changing the standard now does not correct all of the DMX controllers already in use all over the world. So it is what it is.

So for us to ensure that we read a valid frame, we need to resort to some trickery, filtering and indeed AI. While that can be fun, it’s unfortunate.

  • Corrected FTP listing issue created by the v1.6.4 release
  • Corrected getRegistryList method memory leak
  • Corrected 412DMX light Flickering
  • Corrected 412DMX NAND Flash processing issue
  • Corrected FTP transfer restart issue

Beginning with JANOS v1.6.4 you will be able to adjust the Time-To-Live (TTL) parameter used by the network stack.

The IpConfig/TTL Registry key defines the lifespan of a network packet. The time-to-live value is a kind of upper bound on the time that an IP datagram can exist in the Internet system. The value is reduced with the passage through a router. If it reaches 0 the packet is discarded. The default value has been increased to 128 from the value of 64 used prior to JANOS v1.6.4.

The TTL setting can be considered to limit the maximum radius (in terms of hops) of the network within reach of the JNIOR. The default setting should allow packets to reach the far end of the globe. A low setting would limit access to the unit as only those in the local vicinity could communicate. In this respect the TTL setting can be used to improve device security.

A very low setting of 1 or 2 would constrain the JNIOR to the local network. One must consider the need to reach Domain Name Servers (DNS) and Network Time Servers (NTP). There may also be the requirement for email transfers wherein the JNIOR needs to reach out to an SMTP Server. To help determine the minimum setting you may be able to use your PC’s TRACERT command to detect the hop count involved in reaching those destinations. The JNIOR does not support a route tracing function.
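To relate a TRACERT hop count to a TTL setting, here is a small model of the decrement rule. It is a sketch only: it assumes every router on the path decrements TTL by exactly 1 and that the hop count is the number TRACERT prints for the destination itself, so a destination at hop N sits behind N-1 routers. The class and method names are hypothetical.

```java
// Hypothetical model of TTL handling along a path, not JANOS code.
final class TtlSketch {

    // ttl:  initial time-to-live placed in the packet by the sender
    // hops: hop number TRACERT reports for the destination
    static boolean reaches(int ttl, int hops) {
        int remaining = ttl;
        for (int router = 1; router < hops; router++) {
            remaining--;                 // each router decrements TTL
            if (remaining <= 0)
                return false;            // packet is discarded at this router
        }
        return remaining > 0;
    }

    // Smallest TTL that still reaches a destination at the given hop count.
    static int minimumTtl(int hops) {
        return hops;
    }
}
```

Under this model a destination that TRACERT shows at hop 13 needs a TTL of at least 13, and a TTL of 1 only reaches hosts on the local network. In practice some margin is sensible because routes change.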

Real World Test

Luckily we have a neat way to test the effect of reducing TTL. We have a JNIOR we call HoneyPot sitting on the open Internet. Naturally it comes under a constant level of attack. For instance there is a fairly constant level of random login attempts on the Telnet port. On the JNIOR the Telnet port provides access to the JANOS command line interface. We log failed login attempts to the /access.log file.

Log files on the JNIOR roll over to BAK files when they reach 64 KB in size. We keep only one BAK file for each log. Typically an application would archive BAK files when longer term logging is desired. A syslog server can be used for the system log /jniorsys.log for longer term logging.

On HoneyPot we have an application that takes the access.log when it rolls over and analyzes the hosts attempting to log into the unit. IP addresses are added to a database (JSON based) covering data from the past 24 hours. The application uses a locating service to identify the geographical location of the host. A simple web page http://honeypot.integpg.com/map.php receives the database and uses the Google Maps API to plot these locations.

By default JANOS uses a TTL of 128. The map typically appears as follows:

If we reduce the TTL to 16 the map changes. Note that this seems to thin out the number of hosts able to communicate with the unit. It does not seem to create a geographical radius.

The thinning effect is useful but one gets the feeling that systems within our own country may no longer be able to communicate with the unit.

The further reduction of TTL to 12 begins to suggest a geographical radius. Note in the following how the unit now seems to be invisible in China. This might suggest that our friends in far away places might actually be using shortcuts in the network to gain access to systems in the United States.

Of course, for a controller the most important aspect of this kind of security is whether or not YOU can access your own unit. In that case you might also use the IP filtering functionality of the device and limit access to only YOU.

One note. With the TTL limited to 16 the HoneyPot unit had trouble reaching some of the pool.ntp.org NTP servers for synchronizing the clock. By limiting the radius of the network you may limit the useful services such as DNS and NTP.

So this test fails in that the service that is used to determine a location for an IP address is about 12 hops away. Here we see it is 13 from inside INTEG.

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved. 
C:\Windows\system32>tracert ip-api.com 
Tracing route to ip-api.com [69.195.146.130]
over a maximum of 30 hops: 

1 <1 ms <1 ms <1 ms 10.0.0.1
2 1 ms 1 ms 1 ms 50-197-34-78-static.hfc.comcastbusiness.net [50.197.34.78] 
3 12 ms 9 ms 9 ms 96.120.62.245 
4 10 ms 9 ms 8 ms te-0-1-1-1-sur01.westdeer.pa.pitt.comcast.net [68.86.146.225] 
5 32 ms 15 ms 15 ms be-62-ar01.mckeesport.pa.pitt.comcast.net [69.139.195.37] 
6 28 ms 21 ms 37 ms be-7016-cr02.ashburn.va.ibone.comcast.net [68.86.91.25] 
7 21 ms 20 ms 20 ms be-10130-pe04.ashburn.va.ibone.comcast.net [68.86.82.214] 
8 20 ms 20 ms 19 ms 23.30.206.206 
9 61 ms 61 ms 72 ms xe-0-2-1.cr2-kan1.ip4.gtt.net [213.254.215.121] 
10 62 ms 52 ms 51 ms ip4.gtt.net [69.174.12.26] 
11 52 ms 53 ms 60 ms 10.0.1.137 
12 * * * Request timed out. 
13 52 ms 51 ms 51 ms us-mo-1.free.ip-api.com [69.195.146.130] 

Trace complete. 
C:\Windows\system32>

And as a result with TTL restricted to 10 I get a lot of these errors.

04/05/18 08:29:12.949
** Uncaught java/io/IOException thrown: "Unable to connect to remote host"
  in java/io/IOException.<init>:(Ljava/lang/String;)V 
  in java/net/PlainSocketImpl.connect:(Ljava/net/InetAddress;I)V 
  in java/net/Socket.<init>:(Ljava/net/InetAddress;ILjava/net/InetAddress;IZ)V 
  in java/net/Socket.<init>:(Ljava/lang/String;I)V 
  in jaccess/JAccess.main:([Ljava/lang/String;)V at line 71

Just a note that I generally create application programs that are not destined for customer deployment with a throws Throwable clause. This ensures that every exception is logged to the errors.log file and I don’t need to busy the code with try-catch structures. The application uses the com.integpg.system.Watchdog class which restarts the application after a timeout. You can see this in the system log up until I removed the TTL restriction.

In summary…

Reducing the TTL reduces the “radius” of the accessible Internet but that does not precisely correspond to a geographic radius. Sites in Russia appear to have access to our Internet vicinity through fewer hops than some citizens in this country. Still it is a good defense in limiting access to the JNIOR so long as the resources your application uses can still be reached.

I had been thinking about this.

In testing by running with a low TTL we ran into problems where the JNIOR had difficulty reaching services it requires (like NTP) while locations perhaps even in Russia could still reach us. It seems to me that the standard large TTL should still be used for all outgoing communications, with a reduced TTL applied only to responses on incoming connections, specifically UDP replies and TCP/IP SYN ACK responses. This would prevent distant (Internet radius wise) hosts from initiating connections or soliciting UDP replies.

The issue with UDP is that the original source TTL is unknown. So we cannot filter on it. The UDP would be received and would be processed. That packet would represent a vulnerability. All we can do is prevent any response from making it back to the malicious host.

The JNIOR Model 410, 412 and 414 each have two available serial ports. Each port provides at least a 3-wire RS-232 interface. A 3-wire connection contains only the Transmit (Tx), Receive (Rx) and Signal Ground (GND) circuits. This is the bare minimum for Duplex communication or interfaces utilizing software handshakes. The Rx line may be omitted if only sending data. Similarly the Tx might be omitted if only receiving data.

In addition to the 3-wire signals the AUX port supports optional hardware handshaking using the Request To Send (RTS) and Clear To Send (CTS) signals. The Model 410 AUX port also provides a configuration for RS-422 and RS-485 communications.

While there are a number of parameters that must be properly configured in order to achieve functional and reliable communication, the biggest issue is (and has always been) proper cabling. If an RS-232 connection is not working and it is the first time the connection has been made, the connections are probably not correct.

Originally the RS-232 standard was created to support the connection of a modem. Before networking the modem was used to extend communications over standard telephone lines. Typically a computer (an IBM 360 for instance) would connect to a modem. At home a user would connect their terminal to another modem and establish a remote connection via dial-up. There are two types of equipment in this scenario: the computer stuff and the communications stuff (modems). The RS-232 standard defines two acronyms for this: DTE and DCE. These are used extensively to define connector types and signal definitions.

This is where the confusion begins. The acronym DTE refers to Data Terminal Equipment and in our example above this includes both the Computer and the Terminal (CRT or Teletype). That would be the stuff that you would be trying to connect together had you not needed the modems. The term DCE is often confused and is meant to refer to Data Circuit-Terminating Equipment or Data Communications Equipment. That being the modem in the above example. It does not stand for Data Computing Equipment which implies the computer. These terms are often confused and, perhaps, never really understood. As a result even the engineers who design the equipment (including myself) often employed the incorrect connectors, signal terminology and pin assignments. So let’s not use these designations.

JNIOR Serial Ports

The JNIOR has a COM port (labelled RS-232) and an AUX port (labelled AUX Serial). Both are DB-9F Female 9-pin D-sub connectors. The AUX port has 4 active signals and the COM port 2. The pin assignments are as follows:

2 >> RS232 TX / RS485 TX-
3 << RS232 RX / RS485 RX-
5    GND
7 << RS232 RTS / RS485 RX+
8 >> RS232 CTS / RS485 TX+

Here is how it shows on the schematic. Note that even the pin numbering on the connector itself can be confused. The (>>) indicates an output. The JNIOR generates a voltage on this pin and it must be connected to an input at the other end. The (<<) indicates an input. This should be connected to an output at the other end. We will cover RS485 in a little bit.

You can see that we do not use DCD, DSR, DTR and RI. These are unconnected. The COM port follows the same assignments but ONLY pins 2, 3, and 5 are used.

Here is the source of additional confusion. The JNIOR transmits data on Pin 2 and therefore from the JNIOR’s point of view THAT is Transmit Data (TX or TxD). But when that signal reaches the other end (say your PC) it is incoming data or Receive Data (RX or RxD). That is because from the point of view at the PC it is data that would be received. So you connect RXD to TXD and vice versa.

Not everyone labels it that way. You will find an input pin labelled TxD. The thinking is that you would connect TxD to TxD. After all you do connect CTS to CTS as the signal is Clear To Send regardless as to who generated it and who is listening to it. The same goes for Request To Send (RTS).

It is not surprising that we sometimes have to grab a voltmeter to see if a pin is generating an RS232 voltage level (an output) or not (an input). Even that can be misleading when pull-up resistors are used. I used to have a couple of really sweet RS-232 break-out boxes. Those have gotten lost but were life savers back in the day. You know, nice colored LEDs showing outputs and jumper wires that you could use to test various cabling solutions before soldering the final cable.

JNIOR to PC Connection

Well today if you want to connect the JNIOR to your PC you will need a USB-To-Serial adapter. You would likely want to do that to gain access to the JANOS Console (command line interface) available over the COM port (115.2Kbaud, 8 data bits, 1 stop bit, no parity). The adapter will present you with a DB-9M Male connector identical to what you would have found on an older PC as a COM or AUX port connection. The connector (DTE) can be directly plugged into the JNIOR COM (or AUX) port (DCE).

Some USB-To-Serial adapters provide a length of cable and others are relatively short. If you need a longer cable then you either use a USB extension or a Male-To-Female Straight-Thru Serial Extension cable. The latter need only be 3-wire unless your application optionally employs the hardware handshake. I will cover that a little later.

You would use this same approach to connect the JNIOR’s AUX port to a PC-based media server or other system that uses the standard PC serial ports. An application on the JNIOR can then send and receive data or commands to the remote server.

Connecting a Device to the JNIOR

If you plan to connect a barcode scanner or other device to the JNIOR then you might need a little help. You may need a 9-pin Gender Changer. There are two kinds: F-F and M-M. You may need the Male-To-Male (M-M) Gender Changer. This has pins on both sides and when plugged into the JNIOR it changes the connector from a Female DB-9F to the equivalent of a Male DB-9M. Unfortunately this does not alter the pin assignments and if the device was designed to be plugged into a PC then you will need a cross-over adapter or cable. The cross-over exchanges pins 2 and 3 (as well as 7 and 8). Remember that you want to always connect an output to an input. Sometimes this is called a Null Modem adapter, the name coming from the need to interconnect two DTE devices without modems.
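The "always connect an output to an input" rule is easy to check mechanically. The sketch below is only an illustration with made-up names: it uses the JNIOR AUX pin directions from the table above, assumes the standard 9-pin PC serial port directions (pin 2 in, 3 out, 7 out, 8 in), and describes a cable as a map from the pin at one end to the pin at the other.

```java
import java.util.Map;

// Hypothetical cable checker: every wire must join an output to an input
// (or ground to ground).
final class CableCheckSketch {

    enum Dir { OUT, IN, GND }

    // Pin directions as seen at each connector.
    static final Map<Integer, Dir> JNIOR_AUX =
            Map.of(2, Dir.OUT, 3, Dir.IN, 5, Dir.GND, 7, Dir.IN, 8, Dir.OUT);
    static final Map<Integer, Dir> PC_PORT =
            Map.of(2, Dir.IN, 3, Dir.OUT, 5, Dir.GND, 7, Dir.OUT, 8, Dir.IN);

    // Cables: pin at the near end -> pin at the far end.
    static final Map<Integer, Integer> STRAIGHT = Map.of(2, 2, 3, 3, 5, 5, 7, 7, 8, 8);
    static final Map<Integer, Integer> CROSSOVER = Map.of(2, 3, 3, 2, 5, 5, 7, 8, 8, 7);

    static boolean compatible(Map<Integer, Dir> near, Map<Integer, Dir> far,
                              Map<Integer, Integer> cable) {
        for (Map.Entry<Integer, Integer> wire : cable.entrySet()) {
            Dir a = near.get(wire.getKey());
            Dir b = far.get(wire.getValue());
            boolean ok = (a == Dir.GND && b == Dir.GND)
                    || (a == Dir.OUT && b == Dir.IN)
                    || (a == Dir.IN && b == Dir.OUT);
            if (!ok)
                return false;   // two outputs or two inputs tied together
        }
        return true;
    }
}
```

With these tables a straight-through cable works between the JNIOR and a PC port, while a device that was itself designed to plug into a PC (and so has the same pin directions as the JNIOR) only passes the check with the cross-over.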

Perhaps in hindsight it would seem that the JNIOR AUX port should have been DTE. In fact in the beginning we did not use a DB9 connector at all and provided screw terminals for the 5 signals since we would be required to connect to either DTE or DCE. The reality is that in Cinema (which was an early and big market for JNIORs) we connected often to media servers (which are essentially PCs) and the current DCE arrangement worked best for those customers. That stuck.

So as a result you end up with stuff like this.

Of course if you are handy with the soldering iron and get some solder-cup DB9 connectors and hoods from Digi-Key, you can clean this up nicely. They had hoped to solve all of this with USB but that has created other issues.

It didn’t help RS-232 that from the beginning no one fully understood how to document it. Some of us might remember the detailed signal diagrams explaining plus and minus 12V states, start and stop bits, and little endian order in the back of manuals. That level of detail was just adding to the confusion.

Here is a modern day failure. This is from a product received in 2017. At first glance you would think this is good documentation.

Here only the boldface signals are available or can be used. Perhaps only those should be shown. But beyond that picky item the important piece that is missing is any indication of what is an output and what is an input. You can naturally make your own assumptions. You might correctly assume that Received Data (RxD) is information generated by (or output from) the remote connection and therefore an input at this connector. The TxD would then be an output. I mean you only have two choices here and chances of being correct are 50/50. If you are working a soldering iron though you won’t appreciate making the wrong guess.

It is not so obvious as to whether the CTS or RTS connection is an input or output. These signals are shown here but are they used? Are they required? Is there an option setting some place of which you should be aware?

So if you have the diagram for the other piece of equipment that you are connecting should you wire straight thru? Do you wire TxD to RxD and vice versa? If that ends up crossing over from pin 3 to pin 2 and vice versa should you also cross over RTS and CTS? Who knows. RS-232 failure.

My point though is that this nice little picture doesn’t eliminate the chance that your cabling or the cable you make might not work. And, if it doesn’t work you don’t have enough information to decide what to change. Come on man! You can do better.

In JANOS 1.6.3 there are new security measures to harden the Beacon protocol. This has been an issue since its inception. Any action that commands or configures the JNIOR will require credentials to be supplied. Those credentials along with a valid NONCE will be evaluated by JANOS to determine if the action or configuration attempt will be allowed.

The NONCE will be supplied at the end of the ALL_INFO packet. To get a valid NONCE an ALL_INFO packet will need to be requested shortly before the NONCE will be used. In the Support Tool, for example, when someone wants to save the configuration we will request the ALL_INFO packet when displaying the new login dialog. Then when the user clicks OK with the new credentials we will have received the NONCE. The nonce is then used along with the credentials to provide the new authentication.

Here is the code from the Support Tool that extracts the NONCE from the ALL_INFO packet if it exists.

        private void ParseAllInfo(ref JniorInfo jnrInfo, BinaryReader br)
        {
            jnrInfo.Gateway = ReadString(br);
            jnrInfo.PrimaryDns = ReadString(br);
            jnrInfo.SecondaryDns = ReadString(br);
            jnrInfo.DNSTimeout = IPAddress.NetworkToHostOrder(br.ReadInt32());
            jnrInfo.DHCPServer = ReadString(br);
            jnrInfo.DomainName = ReadString(br);
            if ("n/a".Equals(jnrInfo.DomainName))
                jnrInfo.DomainName = "";
 
            //stringLength = IPAddress.NetworkToHostOrder(br.ReadInt16());
            jnrInfo.Timezone = ReadString(br); // ASCIIEncoding.ASCII.GetString(br.ReadBytes(stringLength));
 
            jnrInfo.DHCPEnabled = (br.ReadByte() == 1);
 
            // if there is more information then the nonce is provided
            if (br.BaseStream.Position < br.BaseStream.Length)
            {
                jnrInfo.Nonce = ReadString(br);
            }
        }
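For readers working in Java rather than C#, here is a minimal sketch of the same "read the NONCE only if bytes remain" idea. It assumes, based on the commented-out line in the routine above, that each string is sent as a 2-byte big-endian length followed by ASCII bytes. The class name, the helper names and the packet tail built in `main` are made up for illustration and are not part of the Beacon protocol definition.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class AllInfoTailSketch {

    // Assumed encoding: 2-byte big-endian length followed by ASCII bytes.
    static String readString(ByteBuffer in) {
        int length = in.getShort() & 0xFFFF;   // ByteBuffer is big-endian by default
        byte[] raw = new byte[length];
        in.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    static void writeString(ByteArrayOutputStream out, String text) {
        byte[] raw = text.getBytes(StandardCharsets.US_ASCII);
        out.write((raw.length >> 8) & 0xFF);   // high byte of the length first
        out.write(raw.length & 0xFF);
        out.write(raw, 0, raw.length);
    }

    // Builds a hypothetical packet tail: timezone, DHCP flag, optional nonce.
    static byte[] buildTail(String timezone, boolean dhcp, String nonce) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, timezone);
        out.write(dhcp ? 1 : 0);
        if (nonce != null) {
            writeString(out, nonce);
        }
        return out.toByteArray();
    }

    // Returns the nonce, or null when the packet ends after the DHCP flag
    // (as it would for a unit that does not supply one).
    static String readOptionalNonce(byte[] tail) {
        ByteBuffer in = ByteBuffer.wrap(tail);
        readString(in);                        // timezone
        in.get();                              // DHCP enabled flag
        return in.hasRemaining() ? readString(in) : null;
    }

    public static void main(String[] args) {
        System.out.println(readOptionalNonce(buildTail("EST", true, "abc123")));
        System.out.println(readOptionalNonce(buildTail("EST", true, null)));
    }
}
```

Checking for remaining bytes, rather than a version number, keeps the parser working against units that do not append a NONCE.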

Once the NONCE is known, the stored credentials can be used to send the security string with the SET_INFO command.

        public static byte[] SetInfo(JniorInfo jnrInfo)
        {
            using (MemoryStream ms = new MemoryStream())
            using (BinaryWriter bw = new BinaryWriter(ms))
            {
                WriteString(bw, "SET_INFO");
                WriteString(bw, jnrInfo.IPAddress);
                WriteString(bw, jnrInfo.SubnetMask);
                WriteString(bw, jnrInfo.Gateway);
                WriteString(bw, jnrInfo.PrimaryDns);
                WriteString(bw, jnrInfo.SecondaryDns);
                bw.Write(BitConverter.GetBytes(IPAddress.HostToNetworkOrder((Int32)jnrInfo.DNSTimeout)));
                WriteString(bw, jnrInfo.DomainName);
                bw.Write((byte)(jnrInfo.AutoAnnounce ? 1 : 0));
                bw.Write((byte)(jnrInfo.IsNew ? 0 : 1));
                WriteString(bw, jnrInfo.Timezone);
 
                // use the NONCE and the stored jniorinfo.credentials to send the security string.
                SendSecurity(bw, jnrInfo);
 
                return ms.ToArray();
            }
        }

The Support Tool will prompt the user for the credentials every time they are needed.

                        /**
                         *  check to see if the NONCE was filled in via the ALL_INFO packet.  
                         *  This is new in 1.6.3.  if the NONCE was provided we will prompt 
                         *  for credentials.
                         */
                        var nonceAvailable = null != configureJnrInfo.Nonce;
                        if (nonceAvailable)
                        {
                            var loginDlg = new Common.LoginDialog(configureJnrInfo.IPAddress);
                            /**
                             * if the user cancelled providing the credentials then we cancel 
                             * the configuration update
                             */
                            if (DialogResult.Cancel == loginDlg.ShowDialog(this))
                                return;
 
                            /**
                             * update the saved credentials so they can be used when sending the beacon commands
                             */
                            configureJnrInfo.UserName = loginDlg.UserName;
                            configureJnrInfo.Password = loginDlg.Password;
                        }
 
                        BeaconService.Beacon.Broadcast(BeaconService.Beacon.SetInfo(configureJnrInfo), m_configSerial);

Credentials are also needed when issuing a Reboot. In the Support Tool we ask for an updated ALL_INFO packet before displaying the reboot confirmation.

                /**
                 * request that a new ALL_INFO packet be sent with a new NONCE
                 */
                BeaconService.Beacon.Broadcast(BeaconService.Beacon.RequestInfo(), jnrInfo.SerialNumber);
 
                /**
                 * confirm with the user the desire to reboot the selected jnior
                 */
                if (Interaction.MsgBox("Are you sure you want to REBOOT the selected JNIOR?", MsgBoxStyle.YesNo, "Reboot?") == MsgBoxResult.No)
                    return;
 
                /**
                 * check to see if the NONCE was filled in via the ALL_INFO packet.  
                 * This is new in 1.6.3.  if the NONCE was provided we will prompt 
                 * for credentials.
                 */
                var nonceAvailable = null != configureJnrInfo.Nonce;
                if (nonceAvailable)
                {
                    var loginDlg = new Common.LoginDialog(configureJnrInfo.IPAddress);
                    /**
                     * if the user cancelled providing the credentials then we cancel 
                     * the reboot
                     */
                    if (DialogResult.Cancel == loginDlg.ShowDialog(this))
                        return;
 
                    /**
                     * update the saved credentials so they can be used when sending the beacon commands
                     */
                    configureJnrInfo.UserName = loginDlg.UserName;
                    configureJnrInfo.Password = loginDlg.Password;
                }
 
                /**
                 * send the reboot command
                 */
                BeaconService.Beacon.Broadcast(BeaconService.Beacon.Reboot(jnrInfo), jnrInfo.SerialNumber);

And here is the Beacon reboot code.

        public static byte[] Reboot(JniorInfo jnrInfo)
        {
            using (MemoryStream ms = new MemoryStream())
            using (BinaryWriter bw = new BinaryWriter(ms))
            {
                WriteString(bw, "REBOOT");
 
                // use the NONCE and the stored jniorinfo.credentials to send the security string.
                SendSecurity(bw, jnrInfo);
 
                return ms.ToArray();
            }
        }

Digital files are a given in the programming world. They can contain zero to many bytes, each taking one of 256 values. Those bytes could be ordered to represent everything from common text (ASCII) to binary bit streams (compressed data) and everything in between. Files therefore have a size and they also have a timestamp. These days that timestamp represents the date and time of the last modification to the file. They also carry permissions which can control who can access the file or even know of its existence.

There is also location. In part that means where the file is positioned within a directory or folder structure. We are also concerned with the type of media in which the file is stored. You know, did you put the file on a memory stick, on the hard drive or in the Cloud? This aspect of “location” is what we are going to consider here in this topic. Files stored on your JNIOR end up in one of four different areas and yet all appear to be in the same place. Where your file is located can affect performance and the longevity of your data.

Each JNIOR contains multiple memory components, each of which can provide file storage. These are integrated into a single File System. There are 4 types of storage: SRAM, DRAM, Flash and ROM. You actually utilize them all in routine operation. Let’s look into it.

Non-volatile Battery-Backed Static Random Access Memory (SRAM)

When you enter the JNIOR’s Console (Command Line Interface) either through the serial port, using Telnet or by opening the DCP, your working directory is the root of the File System or “/”. By performing a DIR/LS command with the -L option you see content details generally containing the system’s basic log files. You can also see that there are a number of sub-directories or sub-folders. I struggle with terminology here. Do you use “directory” or “folder”? I think that I haphazardly vacillate between the two.

bruce_dev /> dir -l
total 10
drwxrwxrwx   1 root      root           8 Jan 26 08:22 .
drwxrwxrwx   1 root      root           8 Jan 26 08:22 ..
dr-xr-xr-x   1 root      root           1 Dec 31 1999  etc
drwxr-xr-x   1 root      root          58 Jan 26 07:38 flash
drwxrwxrwx   1 root      root           0 Jan 25 15:13 temp
-rw-r--r--   1 root      root       40968 Jan 26 08:13 jniorsys.log
-rw-r--r--   1 root      root         956 Jan 26 08:13 jniorboot.log
-rw-r--r--   1 root      root        1005 Jan 26 08:00 jniorboot.log.bak
-rw-r--r--   1 root      root       40302 Jan 26 07:38 web.log
-rw-r--r--   1 jnior     root       22434 Jan 25 14:53 manifest.json
  1763.2 KB available

bruce_dev />

Now you might immediately notice that there is only 1763 KB available. That’s not very much! Is that it?

No. But the File System root is located in a 2MB SRAM. This content is protected from loss by a battery. In fact, the battery is there for this purpose and to retain the current time and date during power outage. We built some JNIORs with a more expensive 4MB part but eventually realized that it wasn’t necessary. The bulk of your file storage will be located elsewhere.

The advantage of the SRAM is its speed and re-usability. In addition to the file system root, JANOS locates the Registry and other immutable memory blocks here. But space here is limited and it is best to preserve this area for system use. Data stored here does come with the risk of loss. This is a small probability but not an insignificant one. First of all the battery could die. If your JNIOR is powered 24/7 the battery should be there for you for 10 years or more. But if you power down the JNIOR routinely you may get 5 or 6 years out of it. Thankfully the Series 4 batteries are replaceable and you can get them at your local convenience store. Some customers though are happy to leave the dead battery, not caring if their root folder is then volatile.

Perhaps more likely is that you decide to wipe the memory. You may have an application issue that gets the system into a problem condition. It is possible that we might recommend that you “business card” the battery. By that we mean that you remove power from the unit, open it and slip a piece of something (business cards work well) under the battery tab for a few seconds. This clears the SRAM (and the clock). Typically you only lose the logs. The Registry, and therefore your configuration, is backed up by another file stored in another area. But don’t worry, we recommend that procedure very, very infrequently.

If you are programming your own JNIOR you might get yourself into a reboot loop. Basically your application starts up and performs something incorrectly that throws an assertion (system restart). The JNIOR reboots and restarts your application and another assertion ensues. Okay, not a great situation. JANOS eventually will detect some forms of reboot looping and it may decide to reformat the SRAM as a last ditch effort to restore access to your JNIOR. It sounds terrible but again it is a very very rare thing.

The point is that data stored in the root of the file system offers good performance and immediate data retention. It is not your best choice for long term storage. For that you want to use the Flash memory.

Flash File System

Flash memory retains data even in the absence of power. Files written in Flash memory are therefore retained even when the battery is removed. For that reason it is the best location for long term data storage. This is where you should place all of your programs, web site files and whatever else needs to be kept around. Everything under the /flash directory/folder is located in Flash memory.

bruce_dev /> dir -l flash
total 60
drwxr-xr-x   1 root      root          58 Jan 26 07:38 .
drwxrwxrwx   1 root      root           8 Jan 26 08:22 ..
drwxr-xr-x   1 root      root           1 Dec 06 11:15 cinema_backup
drwxr-xr-x   1 jnior     root           2 Dec 10 2015  generators
drwxr-xr-x   1 root      root           1 Jan 15 09:31 logs
drwxr-xr-x   1 root      root           2 Jan 15 09:32 public
drwxr-xr-x   1 root      root           2 Feb 06 2017  somepath
drwxr-xr-x   1 jnior     root          25 Jan 26 07:03 www
-rwxr-xr-x   1 jnior     root        1081 Jan 26 07:37 JTest.jar
-rw-r--r--   1 jnior     root       22434 Jan 25 14:53 manifest.json
-rw-r--r--   1 root      root        5449 Jan 23 08:33 jnior.ini
-rw-r--r--   1 jnior     root          13 Jan 11 15:16 gogo.dat
-rw-r--r--   1 jnior     root      183358 Jan 11 09:44 www.zip
-rwxr-xr-x   1 jnior     root        3043 Jan 05 10:21 JTest2.jar
-rw-r--r--   1 jnior     root         278 Dec 12 13:28 pubkey.pem
-rw-r--r--   1 jnior     root        1092 Dec 08 12:48 honeypot.cer
-rw-r--r--   1 jnior     root         272 Dec 06 13:27 key.pub
-rwxr-xr-x   1 jnior     root       20266 Dec 06 09:31 Cinekey.jar
-rwxr-xr-x   1 jnior     root      313835 Dec 04 13:44 Cinema.jar
-rwxr-xr-x   1 jnior     root        8329 Nov 21 12:04 Hmi.jar
-rwxr-xr-x   1 jnior     root        2189 Oct 04 14:24 JScan.jar
-rwxr-xr-x   1 jnior     root        3201 Sep 29 15:33 JUptime.jar
-rwxr-xr-x   1 jnior     root       58619 Aug 08 15:05 ModbusServer.jar
-rwxr-xr-x   1 jnior     root        4476 Jul 20 2017  Dmx.jar
-rw-r--r--   1 jnior     root         304 May 18 2017  test.txt
-rwxr-xr-x   1 jnior     root      169011 Apr 24 2017  snmp.jar
-rw-r--r--   1 jnior     root        1041 Feb 28 2017  key.pem
-rw-r--r--   1 jnior     root         902 Feb 15 2017  bruce_dev.cer
-rwxr-xr-x   1 root      root        4820 Jan 30 2017  jAccess.jar
-rwxr-xr-x   1 root      root        2174 Jan 23 2017  jPing.jar
-rwxr-xr-x   1 root      root        5651 Jan 23 2017  JManifest.jar
-rwxr-xr-x   1 root      root        1510 Dec 22 2016  ctrlc.jar
-rwxr-xr-x   1 jnior     root       74743 Oct 10 2016  Environ.jar
-rwxr-xr-x   1 jnior     root        9680 Oct 06 2016  ftp.jar
-rwxr-xr-x   1 jnior     root        4180 Aug 16 2016  TimeSearch.jar
-rwxr-xr-x   1 jnior     root        2616 Aug 03 2016  clktest.jar
-rwxr-xr-x   1 jnior     root       13079 Jul 27 2016  rz.jar
-rwxr-xr-x   1 jnior     root        2992 Jul 19 2016  Display.jar
-rwxr-xr-x   1 jnior     root       95325 Jun 30 2016  Buffer.jar
-rwxr-xr-x   1 jnior     root      112411 Jun 08 2016  slaveservice.jar
-rwxr-xr-x   1 jnior     root        5811 Jun 07 2016  UdpTest.jar
-rwxr-xr-x   1 jnior     root        5580 Jun 06 2016  jModule.jar
-rwxr-xr-x   1 jnior     root         969 Jun 02 2016  IntelliJ.jar
-rwxr-xr-x   1 jnior     root        1903 Jun 02 2016  Benchmark.jar
-rwxr-xr-x   1 jnior     root        4532 Mar 08 2016  SerialTest.jar
-rw-r--r--   1 root      root         898 Feb 10 2016  current.key
-rwxr-xr-x   1 jnior     root       32187 Dec 17 2015  serialcontrol.jar
-rwxr-xr-x   1 jnior     root      106794 Dec 10 2015  Utility.jar
-rwxr-xr-x   1 jnior     root      163902 Sep 04 2015  AnalogPresets.jar
-rwxr-xr-x   1 jnior     root        5053 Jul 28 2015  0-10vtest.jar
-rw-r--r--   1 jnior     root         898 Jul 24 2015  jnior1024.key
-rwxr-xr-x   1 jnior     root          56 Jul 10 2015  clean.bat
-rwxr-xr-x   1 jnior     root          17 Jun 30 2015  dirs.bat
-rwxr-xr-x   1 jnior     root        3862 Jun 18 2015  Test4to20.jar
-rwxr-xr-x   1 jnior     root       46590 Jun 18 2015  task.jar
-rwxr-xr-x   1 jnior     root        3601 Jun 18 2015  ThreadTest.jar
-rw-r--r--   1 jnior     root        4311 Jun 08 2015  task.ini
-rwxr-xr-x   1 jnior     root       25266 Jun 05 2015  serialethernet.jar
-rwxr-xr-x   1 jnior     root        2993 Jul 12 2013  4routtest.jar
-rwxr-xr-x   1 jnior     root        3142 Jan 17 2013  jPanel.jar
  26.85 MB flash available

bruce_dev />

Okay so my development unit is full of all kinds of stuff. Here you will notice that even so there is some 26 MB of file storage available. For the JNIOR that is a lot. You aren’t dealing with large graphics files and such on the JNIOR. But if you were to develop a really sophisticated website hosted by the JNIOR you might fill that. If that is the case you might want the new 412DMX.

412dmx_r00 /> dir -l flash
total 22
drwxr-xr-x   1 root      root          20 Jan 11 09:45 .
drwxrwxrwx   1 root      root          16 Jan 23 13:27 ..
drwxr-xr-x   1 jnior     root          13 Oct 17 14:06 www
-rw-r--r--   1 jnior     root      183358 Jan 11 09:45 www.zip
-rw-r--r--   1 root      root        2055 Dec 12 15:16 jnior.ini
-rwxr-xr-x   1 jnior     root        4526 Dec 05 14:15 Dmx.jar
-rwxr-xr-x   1 jnior     root        1597 Nov 17 07:34 ident.jar
-rw-r--r--   1 jnior     root       15584 Nov 07 09:04 manifest.json
-rw-r--r--   1 jnior     root       46000 Oct 12 12:37 string-test.dat
-rw-r--r--   1 jnior     root       20000 Oct 12 12:37 four-byte-test.dat
-rwxr-xr-x   1 jnior     root       42138 Oct 12 12:36 Benchmark.jar
-rw-r--r--   1 jnior     root       65481 Oct 11 11:10 lorem-ipsum.txt
-rwxr-xr-x   1 jnior     root       37110 Oct 05 14:51 MidNiteSolar.jar
-rwxr-xr-x   1 jnior     root       98569 Oct 03 13:51 ModbusClasses.jar
-rwxr-xr-x   1 jnior     root       58620 Oct 03 13:45 ModbusServer.jar
-rwxr-xr-x   1 jnior     root        3971 Oct 03 13:45 Simulator.jar
-rwxr-xr-x   1 jnior     root       95488 May 08 2017  SNMP.jar
-rwxr-xr-x   1 jnior     root      115448 May 08 2017  task.jar
-rwxr-xr-x   1 jnior     root       54247 Feb 03 2017  SlaveService.jar
-rwxr-xr-x   1 jnior     root       87637 Feb 03 2017  serialethernet.jar
-rwxr-xr-x   1 jnior     root       31640 Feb 03 2017  serialcontrol.jar
-rwxr-xr-x   1 jnior     root        9563 Feb 03 2017  ftp.jar
  509.70 MB flash available

412dmx_r00 />

Here there is close to 1/2 GB of file space. Actually we will be shipping the 412DMX with 1/4 GB capacity.

The existing JNIOR line uses a 32 MB serial Flash component. Data is written to and read from this Flash device using a serial (SPI) channel. This memory is therefore slower. This is not an issue though as JANOS uses a sophisticated caching system to handle Flash I/O. And if power is lost in the midst of a lengthy Flash write the device’s integrity is not damaged. The JANOS Flash File System uses a fault tolerant form of transaction processing. In the event of power loss (or crash) the Flash File System rolls back to the last known good configuration. As a result data stored here is likely to remain until purposely deleted. You can reformat the Flash File System but generally there is hardly ever a need to do so.

The 412DMX introduces a different Flash technology to the line. Here we employ a parallel NAND Flash memory. In addition to greater capacity the read and write access timing has significantly improved. Files stored here are accessed with almost the same performance as SRAM. In fact, in the future we may move the File System root to Flash and eliminate the SRAM altogether. Potentially the NAND Flash can be implemented on the 410, 421 and 414 and it will be considered when PCB revisions occur on those models.

Temporary Storage

Files stored in the /temp folder are considered temporary. That folder is actually located in the Heap which as I mentioned is DRAM memory. That memory is reformatted on boot. So the /temp folder always comes up being empty.

bruce_dev /> dir -l /temp
total 2
drwxrwxrwx   1 root      root           0 Jan 25 15:13 .
drwxrwxrwx   1 root      root           8 Jan 26 08:22 ..
  62.87 MB available (temporary)

bruce_dev />

The JNIORs are shipping now with 64 MB of Heap memory. The system normally utilizes only about 3 or 4 MB of that. So the /temp folder has reasonable capacity. This is twice what is available in the standard JNIOR Flash but much less than will be available in the 412DMX Flash. This is a great place to create temporary files. This provides the best performance as well.

We recommend that you transfer UPD files for updates first to the /temp folder. The advantage being that the file disappears once the update has been completed. UPD files are quite large and generally don’t fit into the File System root. You certainly wouldn’t want to leave one in the root for very long. And placing the UPD in Flash is not necessary and slow to accomplish.

An application might first create a file here and, should the procedure complete properly, then move it to long term storage. This is also great for files that will be accessed randomly (using a lot of fseek). You might improve an application’s performance by copying a database to /temp first. It would remain until reboot. Of course that is heap memory and the same memory where a large byte buffer would be allocated. So to improve performance an application might instead read the entire file into a byte buffer and access that directly. The load on the heap would be the same and random access would be greatly simplified.
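Both ideas can be sketched in a few lines of Java. This is only an illustration under some assumptions: it uses plain java.io classes and try-with-resources as found on a desktop JVM, and I have not confirmed here which of these are available to applications under JANOS. The class and method names are hypothetical. The file is copied and then deleted rather than renamed, since the staging and destination folders sit on different media.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class StagedWriteSketch {

    // Writes data to a staging folder first, then copies it to its final
    // folder and removes the staged copy once the copy has succeeded.
    static File writeThenPublish(File tempDir, File finalDir, String name, byte[] data)
            throws IOException {
        File staged = new File(tempDir, name);
        try (OutputStream out = new FileOutputStream(staged)) {
            out.write(data);
        }
        File target = new File(finalDir, name);
        try (InputStream in = new FileInputStream(staged);
             OutputStream out = new FileOutputStream(target)) {
            byte[] buffer = new byte[4096];
            int count;
            while ((count = in.read(buffer)) != -1) {
                out.write(buffer, 0, count);
            }
        }
        if (!staged.delete()) {
            throw new IOException("could not remove staged file " + staged);
        }
        return target;
    }

    // Loads a whole file into a byte array so that later random access
    // becomes simple array indexing instead of repeated seeks.
    static byte[] readAll(File file) throws IOException {
        byte[] data = new byte[(int) file.length()];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(data);
        }
        return data;
    }
}
```

On a JNIOR the call might look like `writeThenPublish(new File("/temp"), new File("/flash"), "results.dat", bytes)`, where the file name is just an example.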

The /etc Folder

Lastly there is the /etc folder. This is not a writable area and it is actually built into JANOS. This is where JANOS provides system files as might be necessary for application execution. That is the case now for the JanosClasses.jar file.

bruce_dev /> dir -l /etc
total 3
dr-xr-xr-x   1 root      root           1 Dec 31 1999  .
drwxrwxrwx   1 root      root           8 Jan 26 08:22 ..
-r-----r--   1 root      root      266601 Jan 11 09:58 JanosClasses.jar
  0 KB available (read only)

bruce_dev />

So since this is read-only there is no space available. This is stored within the processor in its Program ROM. Access is very fast.

It is important though as you can download this JAR file and use it in compiling your applications for the JNIOR. I would recommend instead getting the JAR from us or from this site, as that version contains not only these classes but source stubs and JavaDoc as well. Clearly that would help you more in development.

JNIORs are shipped with a number of default files in /flash. Some of those should be updated when JANOS is updated. In the future there may be additional files included in /etc. So it is something to keep an eye on.

In Summary

The JANOS File System appears to be centrally located and of a single directory structure. Yet it covers storage in a variety of media. One needs to keep this in mind when deciding where to place files either for temporary use or long term availability. Files in different areas experience different performance levels and different risks. Keeping this in mind you can better manage your JNIOR controller and create great applications.
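As a quick reminder of which folder lands in which medium, the mapping described in this topic can be written down as a small Java helper. This is purely an illustrative sketch: the class, enum and method names are mine, it expects absolute paths, and it is not something JANOS provides.

```java
final class StorageAreaSketch {

    enum Medium { SRAM, FLASH, DRAM, ROM }

    // Maps an absolute path to the storage medium described in this topic:
    // /flash is Flash, /temp is heap (DRAM), /etc is program ROM, and
    // everything else in the root is battery-backed SRAM.
    static Medium mediumFor(String path) {
        if (isUnder(path, "/flash")) return Medium.FLASH;
        if (isUnder(path, "/temp")) return Medium.DRAM;
        if (isUnder(path, "/etc")) return Medium.ROM;
        return Medium.SRAM;
    }

    // True when path is the folder itself or something inside it.
    private static boolean isUnder(String path, String folder) {
        return path.equals(folder) || path.startsWith(folder + "/");
    }

    public static void main(String[] args) {
        String[] samples = { "/jniorsys.log", "/flash/jnior.ini", "/temp/update.upd", "/etc/JanosClasses.jar" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + mediumFor(sample));
        }
    }
}
```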

ZIP is an alias for the JAR command. The JAR command gives you the ability to check and extract files from a file collection. JAR and ZIP files are of the same format. JANOS uses JAR files for Java programs which are collections of class files best handled as a group. This is the HELP for the command:

JAR filespec [pattern]

Options:
 -C             Check integrity
 -T             Lists library contents
 -X             Extracts library contents
 -V             Verbose

List/Extract files from a ZIP/JAR library.
Aliases: JAR, ZIP

Even though JAR collections store content generally in a compressed format the files can be quite large. If you ever question the integrity of a JAR/ZIP file you can use this command to verify it. Remember that you can also use the MANIFEST command to verify a file’s checksum.

bruce_dev /> jar -c flash/jAccess.jar                  
 4 entries found
 content verifies!
bruce_dev /> 

bruce_dev /> jar -cv flash/jAccess.jar
  verifying: META-INF/
  verifying: META-INF/MANIFEST.MF
  verifying: jaccess/
  verifying: jaccess/JAccess.class
 4 entries found
 content verifies!
bruce_dev />

You can see that the -V verbose option enumerates the entries as they are verified.

The -T option displays the table of entries in the collection. Recently with JANOS v1.6.3 we have enhanced this listing. Here is an example with and without the verbose option.

bruce_dev /> jar -t flash/jAccess.jar
META-INF/MANIFEST.MF
jaccess/JAccess.class

bruce_dev /> jar -tv flash/jAccess.jar
     Size   Packed          CRC32        Modified
      227      227    0%  6180ffe5  Jan 30 2017 14:40  META-INF/MANIFEST.MF
     4143     4143    0%  639ebba5  Jan 30 2017 14:40  jaccess/JAccess.class

bruce_dev />

Recently I have been interested in implementing DEFLATE compression. The existing JAR/ZIP command in JANOS has been able to decompress DEFLATE (inflate?) for years. We just haven’t had a strong need for creating or modifying an archive on the JNIOR. Beginning with JANOS v1.6.4 which is now in Beta there will be some new capabilities involving DEFLATE.

New to v1.6.4 is a greatly improved JAR/ZIP command that not only can list or test an archive but that can create, update and even freshen them. This would be useful for those who need to retain log files for extended periods of time. The jniorsys.log file compresses some 80% for example. The available command options are as follows:

ZIP libraryfile [filespec]...

Options:
 -V             Verify archive
 -T             List contents
 -X             Extracts contents
 -C             Create new archive
 -U             Update archive
 -F             Freshen archive
 -S,-R          Recurse folders
 -L             Verbose format

List/Add/Extract files from a ZIP/JAR library.
Aliases: JAR, ZIP

Some options have been reassigned. For instance the -V option now implies (V)erify as opposed to (V)erbose as it has been previously. Hopefully those changes will not cause difficulties. It was our opinion that the JAR/ZIP command in the past was relatively obscure and unused.

With this new implementation one or more file specifications inclusive of wildcards may be specified when appropriate. Recursion through the directory/folder structure is no longer assumed. You must use the -S (or -R alias) option for that. Relative paths in the archive are maintained and created as you might expect. I will provide some examples.

The root on my JNIOR contains a few typical files.

bruce_dev /> dir -l
total 10
drwxrwxrwx   1 root      root           8 Jan 25 14:21 .
drwxrwxrwx   1 root      root           8 Jan 25 14:21 ..
dr-xr-xr-x   1 root      root           1 Dec 31 1999  etc
drwxr-xr-x   1 root      root          59 Jan 25 14:21 flash
drwxrwxrwx   1 root      root           0 Jan 25 13:26 temp
-rw-r--r--   1 root      root       37994 Jan 25 14:21 jniorsys.log
-rw-r--r--   1 jnior     root       22280 Jan 25 14:21 manifest.json
-rw-r--r--   1 root      root         953 Jan 25 14:12 jniorboot.log
-rw-r--r--   1 root      root        1002 Jan 25 13:37 jniorboot.log.bak
-rw-r--r--   1 root      root       35938 Jan 25 09:16 web.log
  1853.9 KB available

bruce_dev />

I can now create an archive of these files using the ZIP command. I can use JAR as it is the very same command. It is just an alias. I tend to use the command name appropriate to the archive I am working with. If I am creating a ZIP I use the ZIP command but there is no particular requirement to do so.

bruce_dev /> zip -c test.zip /
 5 files saved
bruce_dev /> 

bruce_dev /> zip test.zip
     Size   Packed          CRC32        Modified
    37994     7545   80%  bce2daff  Jan 25 2018 14:21  jniorsys.log
    35938     5797   84%  d393e4a3  Jan 25 2018 09:16  web.log
     1002      472   53%  afae59c3  Jan 25 2018 13:37  jniorboot.log.bak
      953      458   52%  b473efb0  Jan 25 2018 14:12  jniorboot.log
    22280    10086   55%  06c9451f  Jan 25 2018 14:21  manifest.json
 5 files listed
bruce_dev />

Here I specified the root folder. No wildcard was needed since that is a folder and the command assumes in that case that I mean all of the contents. When the command is issued without an option, a verbose listing is assumed.

Note that the compression ratios are reasonable even though I have made some trade-offs in the interest of speed. The verbose output can provide interesting information. For example here is the same archive creation with the long/verbose output.

bruce_dev /> zip -cl test.zip /
  deflate: /jniorsys.log (37994 bytes)
   saving: jniorsys.log (compressed 80.1%) 0.758 secs
  deflate: /web.log (35938 bytes)
   saving: web.log (compressed 83.9%) 0.547 secs
  deflate: /jniorboot.log.bak (1002 bytes)
   saving: jniorboot.log.bak (compressed 52.9%) 0.044 secs
  deflate: /jniorboot.log (953 bytes)
   saving: jniorboot.log (compressed 51.9%) 0.044 secs
  deflate: /manifest.json (22280 bytes)
   saving: manifest.json (compressed 54.7%) 1.851 secs
 5 files saved
bruce_dev />

Keep in mind when you consider timing that the JNIOR runs on a 100 MHz 32-bit micro-controller and not a multi-core GHz processor.
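If you want to see where those percentages come from, the figures in the listings are consistent with 100 × (1 − packed/size). For jniorsys.log that is 100 × (1 − 7545/37994), or about 80.1%. The Java sketch below computes that value and also runs DEFLATE on a made-up log buffer using the standard java.util.zip.Deflater class on a PC. It is not the JANOS implementation, and the class name and sample log text are assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

final class DeflateRatioSketch {

    // Percentage of space saved, e.g. 37994 -> 7545 bytes is about 80.1%.
    static double percentSaved(long size, long packed) {
        if (size == 0) {
            return 0.0;
        }
        return 100.0 * (1.0 - (double) packed / (double) size);
    }

    // Compresses a buffer as a raw DEFLATE stream, favoring speed over ratio.
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED, true);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int count = deflater.deflate(buffer);
            out.write(buffer, 0, count);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up, repetitive log text standing in for a real log file.
        StringBuilder log = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            log.append("01/25/18 14:21:00.000, sample log line ").append(i).append('\n');
        }
        byte[] raw = log.toString().getBytes(StandardCharsets.US_ASCII);
        byte[] packed = deflate(raw);
        System.out.printf("%d -> %d bytes (compressed %.1f%%)%n",
                raw.length, packed.length, percentSaved(raw.length, packed.length));
    }
}
```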

The (U)pdate option (-U) allows you to add or replace files in the archive. For example:

bruce_dev /> zip -us test.zip *.ini *.bat
 4 files saved
bruce_dev /> 

bruce_dev /> zip test.zip
     Size   Packed          CRC32        Modified
    37994     7545   80%  bce2daff  Jan 25 2018 14:21  jniorsys.log
    35938     5797   84%  d393e4a3  Jan 25 2018 09:16  web.log
     1002      472   53%  afae59c3  Jan 25 2018 13:37  jniorboot.log.bak
      953      458   52%  b473efb0  Jan 25 2018 14:12  jniorboot.log
    22280    10086   55%  06c9451f  Jan 25 2018 14:21  manifest.json
     4311      913   79%  36a57579  Jun 08 2015 12:47  flash/task.ini
     5449     2014   63%  88996b53  Jan 23 2018 08:33  flash/jnior.ini
       56       56    0%  3b661614  Jul 10 2015 08:54  flash/clean.bat
       17       17    0%  6a11f77a  Jun 30 2015 15:17  flash/dirs.bat
 9 files listed
bruce_dev />

Here I have added any INI and BAT files present on the JNIOR.

Yes, the JNIOR can do BAT batch files. These are not scripting files like you may know from MSDOS but still useful. For example I do a lot of testing on my development JNIOR and that ends up creating error files and sometimes dump files. My clean.bat file creates a CLEAN command that removes any errors.log or dump.log file. It also resets the attention flag using the STATS command.

bruce_dev /> cat flash/clean.bat    
@rm errors.log
@rm dump.log
@stats -a
@echo Cleaned

bruce_dev />

If you are concerned that an archive may not have transferred to the JNIOR properly, you can use the (V)erify (-V) option. Here are both the normal and verbose versions of the command.

bruce_dev /> zip -v test.zip
 9 entries found - content verifies!
bruce_dev /> 

bruce_dev /> zip -vl test.zip
  verifying: jniorsys.log (compressed)
  verifying: web.log (compressed)
  verifying: jniorboot.log.bak (compressed)
  verifying: jniorboot.log (compressed)
  verifying: manifest.json (compressed)
  verifying: flash/task.ini (compressed)
  verifying: flash/jnior.ini (compressed)
  verifying: flash/clean.bat
  verifying: flash/dirs.bat
 9 entries found - content verifies!
bruce_dev />

Note that beginning with v1.6.4 this verification not only checks file integrity but also decompresses the entries and verifies CRC32 checksums.
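If you would like to run the same kind of check on a PC before transferring an archive, a rough equivalent can be put together with the standard java.util.zip classes. This is a sketch of the general technique (decompress each entry and compare its CRC32 against the value recorded in the archive), not the JANOS code. The class name is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipVerifySketch {

    // Decompresses every entry, recomputes its CRC32 and compares it with
    // the checksum stored in the archive. Returns the number of entries
    // verified, or throws on the first mismatch.
    static int verify(File archive) throws IOException {
        int verified = 0;
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            byte[] buffer = new byte[4096];
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                CRC32 crc = new CRC32();
                try (InputStream in = zip.getInputStream(entry)) {
                    int count;
                    while ((count = in.read(buffer)) != -1) {
                        crc.update(buffer, 0, count);
                    }
                }
                if (crc.getValue() != entry.getCrc()) {
                    throw new IOException("CRC mismatch: " + entry.getName());
                }
                verified++;
            }
        }
        return verified;
    }

    public static void main(String[] args) throws IOException {
        // Pass the archive path on the command line, e.g. test.zip
        System.out.println(verify(new File(args[0])) + " entries found - content verifies!");
    }
}
```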

Here we see that JAR files can also be processed (regardless of command name).

bruce_dev /> zip -v flash/ModbusServer.jar
 42 entries found - content verifies!
bruce_dev /> 

bruce_dev /> jar -vl flash/ModbusServer.jar
  verifying: META-INF/
  verifying: META-INF/MANIFEST.MF (compressed)
  verifying: appinfo.ini (compressed)
  verifying: com/
  verifying: com/integpg/
  verifying: com/integpg/janoslib/
  verifying: com/integpg/janoslib/datastructures/
  verifying: com/integpg/janoslib/debug/
  verifying: com/integpg/janoslib/io/
  verifying: com/integpg/janoslib/system/
  verifying: com/integpg/janoslib/utils/

The (F)reshen command will update files in an archive ONLY if a newer version of the file is found. This does not add new files to the archive. If you do not provide a file specification the command will attempt to freshen all of the archive contents. For example, we haven’t changed anything and the freshen command does nothing.

bruce_dev /> zip -f test.zip
 nothing to do
bruce_dev />

But if we execute the MANIFEST command which adjusts the manifest.json database then we have a newer version. The archive can then be freshened.

bruce_dev /> manifest -ul
JNIOR Manifest      Thu Jan 25 14:52:55 EST 2018
  Size                  MD5                  File Specification
 37994    5627aaee400338b1b3479842cecabe29  [Updated] /jniorsys.log
 28304    2a8a593cc66fa62117497c28bf565d20  [Added] /test.zip
End of Manifest (2 files listed)

bruce_dev /> zip -f test.zip
 2 files saved
bruce_dev />

bruce_dev /> zip test.zip
     Size   Packed          CRC32        Modified
    35938     5797   84%  d393e4a3  Jan 25 2018 09:16  web.log
     1002      472   53%  afae59c3  Jan 25 2018 13:37  jniorboot.log.bak
      953      458   52%  b473efb0  Jan 25 2018 14:12  jniorboot.log
     4311      913   79%  36a57579  Jun 08 2015 12:47  flash/task.ini
     5449     2014   63%  88996b53  Jan 23 2018 08:33  flash/jnior.ini
       56       56    0%  3b661614  Jul 10 2015 08:54  flash/clean.bat
       17       17    0%  6a11f77a  Jun 30 2015 15:17  flash/dirs.bat
    38036     7559   80%  b2b18320  Jan 25 2018 14:53  jniorsys.log
    22434    10129   55%  059a09d9  Jan 25 2018 14:53  manifest.json
 9 files listed
bruce_dev />

The MANIFEST update both alters the database and posts to the system log file. So two files are updated.
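
The freshen rule used above (replace an entry only when the file on disk is newer, and never add files the archive does not already hold) can be sketched in a few lines of Java. This is only an illustration under my own assumptions, not JANOS code: the class name FreshenSketch, the map-based model of an archive and the timestamps are all made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the freshen rule; not JANOS code.
class FreshenSketch {

    // archive: entry name -> modification time recorded in the archive
    // disk:    file name  -> modification time of the file on disk
    // Returns the entries that should be re-saved, in archive order.
    static List<String> entriesToFreshen(Map<String, Long> archive, Map<String, Long> disk) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> entry : archive.entrySet()) {
            Long onDisk = disk.get(entry.getKey());
            // Skip entries whose file is missing; replace only when disk is newer.
            if (onDisk != null && onDisk > entry.getValue()) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    // Mirrors the scenario above with made-up timestamps.
    static List<String> demo() {
        Map<String, Long> archive = new LinkedHashMap<>();
        archive.put("web.log", 100L);
        archive.put("jniorsys.log", 100L);
        archive.put("manifest.json", 100L);

        Map<String, Long> disk = new LinkedHashMap<>();
        disk.put("web.log", 100L);        // unchanged
        disk.put("jniorsys.log", 200L);   // MANIFEST posted to the log
        disk.put("manifest.json", 200L);  // MANIFEST rewrote the database
        disk.put("flash/new.bat", 200L);  // not in the archive, so never added
        return entriesToFreshen(archive, disk);
    }

    public static void main(String[] args) {
        System.out.println(demo().size() + " files saved");
    }
}
```

Only the archive's own entries are walked, which is why a file that exists solely on disk can never be picked up by a freshen.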

To demonstrate the E(X)tract option I will move the ZIP file to the /temp folder so we don’t overwrite any existing files. Here I will extract the manifest database and take a look at its content.

bruce_dev /> mv test.zip /temp

bruce_dev /> cd /temp

bruce_dev /temp> dir -l
total 3
drwxrwxrwx   1 root      root           1 Jan 25 14:58 .
drwxrwxrwx   1 root      root           8 Jan 25 14:58 ..
-rw-r--r--   1 jnior     root       28361 Jan 25 14:53 test.zip
  61.98 MB available (temporary)

bruce_dev /temp> zip -x test.zip *.json

bruce_dev /temp> dir -l
total 4
drwxrwxrwx   1 root      root           2 Jan 25 14:59 .
drwxrwxrwx   1 root      root           8 Jan 25 14:58 ..
-rw-r--r--   1 jnior     root       28361 Jan 25 14:53 test.zip
-rw-r--r--   1 jnior     root       22434 Jan 25 14:53 manifest.json
  61.95 MB available (temporary)

bruce_dev /temp> cat manifest.json -j
{
  "model":"410",
  "serno":614070500,
  "vers":"v1.6.4-b4",
  "date":"01/25/18 14:52:55",
  "files":{
    "/etc/janosclasses.jar":{
      "length":243492,
      "date":1515682735,
      "md5":"bb85898d4e208a388fb958f1fb90fcc5",
      "crc":"20916587",
      "sha":"a9eb59e9c709ff4ceba82b1e55c841ec5860cc42"
    },
    "/flash/serialcontrol.jar":{
      "length":31344,
      "date":1450364184,
      "md5":"b349e02b7efc64c0dfe5eb74292a5ee6",
      "crc":"3a005104"
    },
    "/flash/serialethernet.jar":{
      "length":25266,
      "date":1433505362,
      "md5":"ee5e266bb8418b4223a666bd046a8c56",
      "crc":"c3961df2"
    },
    "/flash/modbusserver.jar":{
      "length":51907,
      "date":1502219129,
      "md5":"77c16d6134dbd7ec93313fbad2b00d93",
      "crc":"b7456b42",
      "sha":"fad4ecc3d1607aafe0a385a10fb5ee90eff521bd"
    },
    "/flash/snmp.jar":{
      "length":239949,
      "date":1493062048,
      "md5":"b77d35c322ef6645f1eca9d22b29400b",
      "crc":"a4073dcb",
      "sha":"44a3c2b41a2375ef603063cc9b04642903dad973"
    },
    "/flash/www/base64.js":{
      "length":3493,
      "date":1433505378,
      "md5":"1138db1b5a6e165beae3ed81739dd2ec",
      "crc":"baceb6f6"
    },
    "/flash/www/configure/index.html":{
      "length":1349,
      "date":1433505382,
      "md5":"0454014aecfd0b7d9e4ce1efe0979139",
      "crc":"11ba5486"
    },
    "/flash/www/jr310applet.jar":{
      "length":287159,
      "date":1441207703,
      "md5":"f9c4840e7244824b75858a1a40dfb163",
      "crc":"3d1d0c72"
    },
    "/flash/www/jniorprotocol.jar":{
      "length":115148,
      "date":1441207710,
      "md5":"404b40c4293bf3c334e3b88e2fe0dd10",
      "crc":"5143ec4f"
    },
    "/flash/www/jniorprotocolhelpers.jar":{
      "length":34991,
      "date":1433505394,
      "md5":"b08e33e0c21e6c075b9b242bf092b68e",
      "crc":"48990308"
    },
    "/flash/www/task/index.html":{
      "length":1415,
      "date":1433505397,
      "md5":"bbdc32dce371881b3eebd15f5b3fce96",
      "crc":"cdbe02e4"
    },
    "/flash/www/taskmanagerinterface.jar":{
      "length":123052,
      "date":1433505400,
      "md5":"077cddccee476fab552d52a5eefd26a7",
      "crc":"647bb4b3"
    },
    "/flash/www/jquery/jquery-1.9.0.min.js":{
      "length":93071,
      "date":1433505404,
      "md5":"2b869ea9c8edd4c2243c5d44f665f632",
      "crc":"6a2a8434"
    },
    "/flash/www/jquery/jquery-ui.css":{
      "length":33441,
      "date":1433505405,
      "md5":"c6bd2971b8e625f2ae43ede9f655a27b",
      "crc":"0497b7a6"
    },
    "/flash/www/jquery/jquery-ui.min.js":{
      "length":96395,
      "date":1433505409,
      "md5":"8f636d4c90ea0abfcbb25528c635bf7d",
      "crc":"820662f5"
    },
    "/flash/www/vendor/bowser/bowser_0.7.2.min.js":{
      "length":3359,
      "date":1433505412,
      "md5":"61a36d48aad1298b17284b53f6ce3fd1",
      "crc":"22deb9e6"
    },
    "/flash/www/text":{
      "length":1336,
      "date":1434044220,
      "md5":"bab65804218b18b9e1a79f2d8e873259",
      "crc":"dda17d61"
    },
    "/flash/www/cycle":{
      "length":419,
      "date":1434044214,
      "md5":"9eb9bbdae70c1f994ebb7f51b18783b8",
      "crc":"9e496eb9"
    },
    "/flash/slaveservice.jar":{
      "length":73323,
      "date":1465435094,
      "md5":"cd6f5e177d75675607e9523d52e133f7",
      "crc":"9a871cd7"
    },
    "/flash/ftp.jar":{
      "length":9563,
      "date":1475783634,
      "md5":"793e460054f07867685e87f98fd402e6",
      "crc":"36fd641e"
    },
    "/flash/task.ini":{
      "length":4311,
      "date":1433782061,
      "md5":"b1f877ac198306b266311eab557ed1dd",
      "crc":"36a57579"
    },
    "/flash/task.jar":{
      "length":102655,
      "date":1434645611,
      "md5":"1979b16970127f2c38912777cb105133",
      "crc":"ed4d6ad7"
    },
    "/flash/jnior.ini":{
      "length":4874,
      "date":1516714407,
      "md5":"58d36d44e807564035fa88ad63e2b80c",
      "crc":"88996b53",
      "sha":"0f8b5112e66d27fcee64b8fdd9309e4e850f18c7"
    },
    "/jniorsys.log":{
      "length":32844,
      "date":1516908086,
      "md5":"5627aaee400338b1b3479842cecabe29",
      "crc":"bce2daff",
      "sha":"9c10cd81e308e594c47f2f9509721380b2648cdd"
    },
    "/jniorboot.log.bak":{
      "length":1041,
      "date":1516905441,
      "md5":"4f99b5c09ba93b48222183cddb9e7802",
      "crc":"afae59c3",
      "sha":"9442209de78327134b6ab0d87965d6e09c8bdc27"
    },
    "/jniorboot.log":{
      "length":995,
      "date":1516907554,
      "md5":"945b6dcbb03349fa9fd4ef8f91898bb6",
      "crc":"b473efb0",
      "sha":"4c17d7d0f6f2fa3bf7740541ec8104ade157a402"
    },
    "/flash/benchmark.jar":{
      "length":24351,
      "date":1464873509,
      "md5":"987f4044786771f31e0656cf91ed73f3",
      "crc":"1eed095a"
    },
    "/flash/threadtest.jar":{
      "length":3601,
      "date":1434645124,
      "md5":"902ce61cbd2524ca9b83dea335c395d3",
      "crc":"cd2479ff"
    },
    "/flash/test4to20.jar":{
      "length":3862,
      "date":1434659455,
      "md5":"a2e309c9d6dd112e5303aa76d2470740",
      "crc":"976f8208"
    },
    "/flash/dirs.bat":{
      "length":87,
      "date":1435691869,
      "md5":"531d655733ee668d829f9b3bdad96038",
      "crc":"6a11f77a"
    },
    "/flash/www/console/index.php":{
      "length":4347,
      "date":1438974987,
      "md5":"8728680bbc36d369429f7ca2c73cce7d",
      "crc":"c939c423"
    },
    "/flash/clean.bat":{
      "length":56,
      "date":1436532855,
      "md5":"ac9ce6553e1629412fb426b342440493",
      "crc":"3b661614"
    },
    "/flash/jnior1024.key":{
      "length":887,
      "date":1437746752,
      "md5":"b76b5351a92fdcc8d9b6b38ca62d8d71",
      "crc":"7983e14c"
    },
    "/flash/www/config/md5.js":{
      "length":5693,
      "date":1433505379,
      "md5":"a60fec5a81f207ff99ec1b97e3ccad0e",
      "crc":"e2a43d16"
    },
    "/flash/www/config/node.png":{
      "length":253,
      "date":1440435886,
      "md5":"1a8dbfaf1771a06e48dea0e3dc604392",
      "crc":"799c6dfc"
    },
    "/flash/www/config/tabs-styles.css":{
      "length":970,
      "date":1477590404,
      "md5":"68bca7015f51e26ab42199b5eb17a356",
      "crc":"f8870a33"
    },
    "/flash/www/config/tabs.js":{
      "length":3662,
      "date":1449678641,
      "md5":"ff728c86018341548ee70028062c89e0",
      "crc":"1a813112"
    },
    "/flash/www/config/styles.css":{
      "length":4450,
      "date":1504814044,
      "md5":"9ad78cca1b794dbcf9db3c55f1be5f1b",
      "crc":"acbd2e14",
      "sha":"3cf0bbc864840994a49f62d0ae00df6d8eb47ef3"
    },
    "/flash/www/config/comm.js":{
      "length":3541,
      "date":1507912287,
      "md5":"e7d2e56a443176d6150bbcc8b56e1911",
      "crc":"0ac0ed26",
      "sha":"5e66b96227779c5ef3736a7ca891a43cacffbbf1"
    },
    "/flash/www/config/console.js":{
      "length":5137,
      "date":1515680981,
      "md5":"58995da21198553a37d666ef043c289b",
      "crc":"ce8780d4",
      "sha":"bbe576a9bb28caa82306184ac38e8c5e0e1f1243"
    },
    "/flash/www/config/config.js":{
      "length":12639,
      "date":1515676686,
      "md5":"ae2d4b763f10adef65d65f9024ea809e",
      "crc":"cb109f41",
      "sha":"bb80d401bbc977695ee7c79a21487c2bbb3d7564"
    },
    "/flash/www/config/index.php":{
      "length":22103,
      "date":1515677508,
      "md5":"bdf0df657f4988b7e5abe86ac8ce6956",
      "crc":"6cd2ae57",
      "sha":"4d9883b4f3bf833831bb26a54b6b97698f074dd4"
    },
    "/flash/www/jnior.ico":{
      "length":3262,
      "date":1439548680,
      "md5":"1c3b3dda6b10c6259fcf7c068b760f09",
      "crc":"051803eb"
    },
    "/flash/www/favicon.ico":{
      "length":156790,
      "date":1486410493,
      "md5":"07cb90c7f3573eff80222269625ed1dd",
      "crc":"7e367afa",
      "sha":"284add71fe3d3ba48fba059b88ff5143d3964b1d"
    },
    "/flash/analogpresets.jar":{
      "length":163902,
      "date":1441372806,
      "md5":"25eacc647412535e320302d3680ce327",
      "crc":"e6b656fc"
    },
    "/flash/www/config/config.css.php":{
      "length":1045,
      "date":1475072901,
      "md5":"1692861e9abd7f8d81f5b7cf8a176046",
      "crc":"4c386a21"
    },
    "/flash/www/config/inputs.png":{
      "length":18047,
      "date":1443116143,
      "md5":"e2151c93b6cdeaa154d15fab486ae61b",
      "crc":"16290877"
    },
    "/flash/www/config/loading.gif":{
      "length":3236,
      "date":1264096270,
      "md5":"d96f6517e00399c37a9765e045eaaf22",
      "crc":"16f442ed"
    },
    "/flash/jtest.jar":{
      "length":1832,
      "date":1515959298,
      "md5":"051517cc7a8978d97746bb7acb0a57ed",
      "crc":"509a17f2",
      "sha":"beefc003bf3a076871b7eb0df2931db677b2bca1"
    },
    "/flash/www/vendor/angular_1.3.15/angular.min.js":{
      "length":125909,
      "date":1449498838,
      "md5":"ca1a58818682c3e858a585f283ab9beb",
      "crc":"9d8147d7"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap-theme.css":{
      "length":21740,
      "date":1449498835,
      "md5":"c64043a3388612233d7eb947918a9bfc",
      "crc":"638f58a3"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap-theme.css.map":{
      "length":41933,
      "date":1449498838,
      "md5":"c5da8241305bfe7e19919e6e943739eb",
      "crc":"11260772"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap-theme.min.css":{
      "length":19199,
      "date":1449498840,
      "md5":"374df0ad5809a5314b0577802430a272",
      "crc":"8b3c47b7"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap.css":{
      "length":137590,
      "date":1449498845,
      "md5":"ad6381ebfa541b55b0152349c6cabf76",
      "crc":"371e67da"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap.css.map":{
      "length":366866,
      "date":1449498854,
      "md5":"4ba278e0c420d166e5a0eb71545f9509",
      "crc":"b7c9868d"
    },
    "/flash/www/vendor/bootstrap_3.3.0/css/bootstrap.min.css":{
      "length":114011,
      "date":1449498852,
      "md5":"78e7f91c0c4cca415e0683626aa23925",
      "crc":"34387388"
    },
    "/flash/www/vendor/bootstrap_3.3.0/fonts/glyphicons-halflings-regular.eot":{
      "length":20335,
      "date":1449498855,
      "md5":"7ad17c6085dee9a33787bac28fb23d46",
      "crc":"f171b590"
    },
    "/flash/www/vendor/bootstrap_3.3.0/fonts/glyphicons-halflings-regular.svg":{
      "length":62926,
      "date":1449498857,
      "md5":"ff423a4251cf2986555523dfe315c42b",
      "crc":"385cd4ad"
    },
    "/flash/www/vendor/bootstrap_3.3.0/fonts/glyphicons-halflings-regular.ttf":{
      "length":41280,
      "date":1449498858,
      "md5":"e49d52e74b7689a0727def99da31f3eb",
      "crc":"0617f1ff"
    },
    "/flash/www/vendor/bootstrap_3.3.0/fonts/glyphicons-halflings-regular.woff":{
      "length":23320,
      "date":1449498858,
      "md5":"68ed1dac06bf0409c18ae7bc62889170",
      "crc":"cec1a35c"
    },
    "/flash/www/vendor/bootstrap_3.3.0/js/bootstrap.min.js":{
      "length":34653,
      "date":1449498862,
      "md5":"281cd50dd9f58c5550620fc148a7bc39",
      "crc":"32d6c689"
    },
    "/flash/www/vendor/bootstrap_3.3.0/js/bootstrap.js":{
      "length":65813,
      "date":1449498862,
      "md5":"d5a03d9cca57637f008124916b86b585",
      "crc":"f504a7b3"
    },
    "/flash/www/vendor/bootstrap_3.3.0/js/npm.js":{
      "length":484,
      "date":1449498863,
      "md5":"ccb7f3909e30b1eb8f65a24393c6e12b",
      "crc":"cc50e34d"
    },
    "/flash/www/vendor/jquery_1.11.1/jquery-1.11.1.min.map":{
      "length":141680,
      "date":1449498870,
      "md5":"ffbeb16578d8cdf58104889baacbbef2",
      "crc":"e4e92bfd"
    },
    "/flash/www/vendor/jquery_1.11.1/jquery-1.11.1.min.js":{
      "length":95786,
      "date":1449498869,
      "md5":"8101d596b2b8fa35fe3a634ea342d7c3",
      "crc":"804ff984"
    },
    "/flash/www/config/integlogo.png":{
      "length":5773,
      "date":1449163436,
      "md5":"9111308273dadea73f5d09a5e02c7311",
      "crc":"60c4e184"
    },
    "/flash/utility.jar":{
      "length":106794,
      "date":1449773066,
      "md5":"ac559b91b537dfa70720a416f32f2960",
      "crc":"888936f1"
    },
    "/flash/generators/json/colour.js":{
      "length":4327,
      "date":1449774238,
      "md5":"c67e10d0e0e698fcdbbbadcaa55600d4",
      "crc":"19e8a38f"
    },
    "/flash/generators/json/ethernet.js":{
      "length":1409,
      "date":1449774238,
      "md5":"1b6bae08feb93f6bd345a3780c3acb69",
      "crc":"848097a7"
    },
    "/flash/generators/json/inputs.js":{
      "length":2825,
      "date":1449774239,
      "md5":"6959db5a769ff3ceea45bf606bda940a",
      "crc":"c544d780"
    },
    "/flash/generators/json/lists.js":{
      "length":12006,
      "date":1449774239,
      "md5":"5cc489ac77db7a3369b2ffc30cbd3a86",
      "crc":"ba761254"
    },
    "/flash/generators/json/logic.js":{
      "length":4404,
      "date":1449774239,
      "md5":"9cd1cf854976ebb69a6c20a7ac88d2f9",
      "crc":"6c2189f9"
    },
    "/flash/generators/json/loops.js":{
      "length":6040,
      "date":1449774239,
      "md5":"e8e9021b5d4eb2e0cc43f11ad5b3bfd7",
      "crc":"b30a758a"
    },
    "/flash/generators/json/math.js":{
      "length":14673,
      "date":1449774240,
      "md5":"fa22c29efc362e02d8f35838fcca46e5",
      "crc":"8fc62e67"
    },
    "/flash/generators/json/other.js":{
      "length":983,
      "date":1449774240,
      "md5":"dd77f555bc9b50ed17a215d7935f10ab",
      "crc":"3e07810d"
    },
    "/flash/generators/json/outputs.js":{
      "length":3861,
      "date":1449774240,
      "md5":"72a118cd7829b5a510e5a901d8863d6e",
      "crc":"bdd5e320"
    },
    "/flash/generators/json/procedures.js":{
      "length":3945,
      "date":1449774240,
      "md5":"cb9fb880bebb3375273353fafc12dc9c",
      "crc":"20d43aad"
    },
    "/flash/generators/json/text.js":{
      "length":1363,
      "date":1449774241,
      "md5":"a0bd39f638202a0800c100b4eac3cbc3",
      "crc":"b17b24d6"
    },
    "/flash/generators/json/timing.js":{
      "length":2638,
      "date":1449774241,
      "md5":"b1ee803dd8e6e00de74e0a3269f0a2ff",
      "crc":"489061b8"
    },
    "/flash/generators/json/variables.js":{
      "length":1500,
      "date":1449774241,
      "md5":"fecce79a400d5e4e1edbe521699fa604",
      "crc":"cb724c91"
    },
    "/flash/generators/json.js":{
      "length":4115,
      "date":1449774238,
      "md5":"cc72f2468eb970110f3f6f0278f43467",
      "crc":"25a98f30"
    },
    "/flash/www/config/link_to.png":{
      "length":259,
      "date":1450466976,
      "md5":"b1ed68183be4f97ce1793139496dbbb4",
      "crc":"a067876a"
    },
    "/flash/www/config/collapsed.png":{
      "length":232,
      "date":1452087215,
      "md5":"ef7dd392142824ec54b7b7188717411c",
      "crc":"c7bd8428"
    },
    "/flash/www/config/linked.png":{
      "length":174,
      "date":1452088114,
      "md5":"56d2755d08a0857ff6e7750c4b2822dd",
      "crc":"ff59187e"
    },
    "/flash/www/config/expanded.png":{
      "length":238,
      "date":1452097812,
      "md5":"905b26e96849524dd6c37e1878f66779",
      "crc":"68686921"
    },
    "/flash/www/config/registry.js":{
      "length":8276,
      "date":1452271284,
      "md5":"fc35855793b2bbfe577e420f34cb0dda",
      "crc":"6c73e25a"
    },
    "/flash/www/config/deletex.png":{
      "length":240,
      "date":1452284181,
      "md5":"2750f1e60d0222d7f3c0752207fb41e7",
      "crc":"386b823b"
    },
    "/flash/www/config/modules.js":{
      "length":13520,
      "date":1484149578,
      "md5":"5d79964a8ca70cc7dc0504c343be3e3c",
      "crc":"3c09b9e2",
      "sha":"d6f0b3ec60796662acd105694ef39543e3dc50a2"
    },
    "/flash/www/logging.php":{
      "length":4853,
      "date":1463582298,
      "md5":"170c17bd0962f434eebe699129491912",
      "crc":"dce15f4e"
    },
    "/flash/www/slaving.zip":{
      "length":113815,
      "date":1465493787,
      "md5":"b3e85080154b5a7dc10078a6c6fe75c7",
      "crc":"975c987e"
    },
    "/flash/0-10vtest.jar":{
      "length":5053,
      "date":1438104444,
      "md5":"3a7be82077e29c598bdd8694d47805f4",
      "crc":"05e27897"
    },
    "/flash/4routtest.jar":{
      "length":2993,
      "date":1373644405,
      "md5":"14381605ec8f2f0d0dbe34843b7178b8",
      "crc":"8240fc03"
    },
    "/flash/environ.jar":{
      "length":3881,
      "date":1476102546,
      "md5":"8d738f0145516d287174a00dda32dabc",
      "crc":"ff1ecc8b"
    },
    "/flash/current.key":{
      "length":898,
      "date":1455116261,
      "md5":"035a0d79bd6c8258c12111479fe7353e",
      "crc":"cbdd8ffe"
    },
    "/flash/serialtest.jar":{
      "length":4532,
      "date":1457448880,
      "md5":"48fc4bd9421a5cf275b42235d2f4e2cb",
      "crc":"6d86943b"
    },
    "/flash/intellij.jar":{
      "length":969,
      "date":1464918560,
      "md5":"aea445862e32190fa61abc5d97e5b25f",
      "crc":"959a1596"
    },
    "/flash/jmodule.jar":{
      "length":5580,
      "date":1465240063,
      "md5":"af7d42f427d0e711c4a79c8e1c1d341d",
      "crc":"40058988"
    },
    "/flash/udptest.jar":{
      "length":5811,
      "date":1465328251,
      "md5":"5bbc399b4eb1f5ec427ccbf93c8b135d",
      "crc":"3d976325"
    },
    "/flash/buffer.jar":{
      "length":95325,
      "date":1467321013,
      "md5":"0c66b2a130de483b64b91d87471eb952",
      "crc":"5d0819e2"
    },
    "/flash/display.jar":{
      "length":2992,
      "date":1468953410,
      "md5":"efcfc78470e98842f52579c81c088a2d",
      "crc":"5ec67fd0"
    },
    "/flash/rz.jar":{
      "length":13079,
      "date":1469638127,
      "md5":"c4b7e9f4072d64e3dde9fe5a62406a1e",
      "crc":"20367148"
    },
    "/flash/www/config/folder.png":{
      "length":329,
      "date":1454662486,
      "md5":"316b7810fa502618b4e85788a82617a8",
      "crc":"55f20187"
    },
    "/flash/www/config/file.png":{
      "length":286,
      "date":1454662486,
      "md5":"1b75c23448e9c6eed675404f6130491d",
      "crc":"d327c449"
    },
    "/flash/www/config/warning.png":{
      "length":3068,
      "date":1332275646,
      "md5":"9c96d831cfc50fdedfdc980bc2abb2cf",
      "crc":"e90bb05a"
    },
    "/flash/www/config/folders.js":{
      "length":19270,
      "date":1504815735,
      "md5":"c7a59ef1aea3aad95d3315627d3a3b29",
      "crc":"6b1adf25",
      "sha":"93d7e851c9a1a65ed45b7c1bbe4368d3d941b32f"
    },
    "/flash/clktest.jar":{
      "length":2616,
      "date":1470249535,
      "md5":"345b4a9a22ec05bc89bb291b7b047e0e",
      "crc":"270f1d8b"
    },
    "/flash/timesearch.jar":{
      "length":4180,
      "date":1471371624,
      "md5":"bf719e65d8f4be9d7348a621ac69bc2b",
      "crc":"25075aa7"
    },
    "/flash/www/config/relays.js":{
      "length":4189,
      "date":1484587793,
      "md5":"803af5c2431b8f58c110260b3f317838",
      "crc":"ee9ab3af",
      "sha":"21ec766fe220bd0618b43050851f9cd67dd1bf54"
    },
    "/flash/www/config/temperature.js":{
      "length":2870,
      "date":1475245816,
      "md5":"262c339513007cd746ee01da9a4a843f",
      "crc":"d062a444"
    },
    "/flash/www/config/dimmer.js":{
      "length":8255,
      "date":1475265861,
      "md5":"e7213c6fb8c263ac71acb766e62dc4ce",
      "crc":"b9edf051"
    },
    "/flash/www/config/range.css":{
      "length":2212,
      "date":1475499110,
      "md5":"6932c76ab79879ea4c5d826d9cb60db9",
      "crc":"3334dfd1"
    },
    "/flash/www/config/analog.js":{
      "length":7267,
      "date":1484587793,
      "md5":"87abcaf68dea5e2e203326a55bc2bca5",
      "crc":"9766b532",
      "sha":"dd788111904d41826164ea151f78dd4b3e3b84e6"
    },
    "/flash/www/config/ledon.png":{
      "length":626,
      "date":1475506220,
      "md5":"6018d69896fcba49da54c39d8ee19803",
      "crc":"32a65f15"
    },
    "/flash/www/config/panel.js":{
      "length":2038,
      "date":1475509052,
      "md5":"e0631cb06777f63f0a071f7aa5d198d0",
      "crc":"a38a7db3"
    },
    "/flash/www/config/ledoff.png":{
      "length":757,
      "date":1475509575,
      "md5":"4bb71e412a20ae6f098a29b195b10e13",
      "crc":"3fd16f7a"
    },
    "/flash/jpanel.jar":{
      "length":3142,
      "date":1358430294,
      "md5":"39825ccddf7b61c1ad41d261d84f4950",
      "crc":"446bee7f"
    },
    "/flash/www/config/syslog.js":{
      "length":1929,
      "date":1496773328,
      "md5":"4e8ecca50284c2aeae8e8b90db27ded8",
      "crc":"ac2a2541",
      "sha":"e413d70cc2bb6717448bc84c2980abc764bc3dd6"
    },
    "/flash/www/config/peers.js":{
      "length":5885,
      "date":1505835290,
      "md5":"2536fc521f916341b98183f6ce0b2453",
      "crc":"f2a44392",
      "sha":"5d949b8daa8e5081f19c88e42af968b24955e02c"
    },
    "/flash/www/index.php":{
      "length":356,
      "date":1477657721,
      "md5":"3ba20cf61f44f9ace09104261acf2711",
      "crc":"7f8eaed3"
    },
    "/flash/www/www.zip":{
      "length":85751,
      "date":1477663620,
      "md5":"296baa71d70bf40c1ad6ee0c71066c49",
      "crc":"69922bd1"
    },
    "/flash/www/download1.php":{
      "length":465,
      "date":1480616431,
      "md5":"1f69c84031dbdbe9aeecd634c0ab9607",
      "sha":"9770a8f6534f17f86eeb332309b7cbe07441022e",
      "crc":"c7b59619"
    },
    "/flash/www/short.php":{
      "length":273,
      "date":1516028524,
      "md5":"14687d4240d58955736ac2f6b31614a0",
      "sha":"2291bacbbd7aac09c488436efbe5c2be1f3936b6",
      "crc":"3cf41987"
    },
    "/flash/ctrlc.jar":{
      "length":1510,
      "date":1482421756,
      "md5":"b7ce2da5b761674e626ae62c4b9edbcc",
      "sha":"51a17a3f092333a0a48aa8e6dcebe0ce99cef3de",
      "crc":"bd2a0810"
    },
    "/flash/www.zip":{
      "length":87642,
      "date":1515681899,
      "md5":"c3cfda778bf0334684669fedb36180f7",
      "sha":"1aef18b365347aa0f13f38f315a04edbf7eb37d2",
      "crc":"1da88b8e"
    },
    "/flash/www/config/favicon.ico":{
      "length":766,
      "date":1486410493,
      "md5":"07cb90c7f3573eff80222269625ed1dd",
      "sha":"284add71fe3d3ba48fba059b88ff5143d3964b1d",
      "crc":"7e367afa"
    },
    "/flash/www/map.html":{
      "length":1170,
      "date":1485380108,
      "md5":"901c9971c3c591b3d736cd91516960de",
      "sha":"5ded94156ca71884af1afae0fcaf1e78d3bac23d",
      "crc":"71f8c837"
    },
    "/flash/jmanifest.jar":{
      "length":5651,
      "date":1485192866,
      "md5":"dfb84226c647a42295d9f671cfb99fa5",
      "sha":"a7331cca377c1f96e400ddd5044c01a175ee230f",
      "crc":"1a64c6d6"
    },
    "/flash/jping.jar":{
      "length":2174,
      "date":1485201152,
      "md5":"0d533008847888e0dfcf497c0cff1a96",
      "sha":"75fbff5a973b8dac3408fdda46e47e708b585e58",
      "crc":"f1203f43"
    },
    "/flash/jaccess.jar":{
      "length":4820,
      "date":1485805203,
      "md5":"29ce866873686dd133a724e4db29c690",
      "sha":"239bf75c1597a25fdbbbb78798fe72971ca15f63",
      "crc":"e5ae0d1c"
    },
    "/flash/somepath/path2/testx.php":{
      "length":5282,
      "date":1486397961,
      "md5":"ce1a071b258c936c65679d6bb67db198",
      "sha":"30342828ebaeb69cd8ecefd75f2dd01e80c6388b",
      "crc":"ecd9251a"
    },
    "/flash/bruce_dev.cer":{
      "length":902,
      "date":1487172768,
      "md5":"e9917f27384ddee36817c04c8cde9199",
      "sha":"4b2b82a042a0019679c1b071956278f6ddd1f27b",
      "crc":"115ed2ae"
    },
    "/flash/www/config/registrydoc.css":{
      "length":21460,
      "date":1504201641,
      "md5":"15423ca727b03e6b1581910c6ca2eab5",
      "sha":"f521b53a4518e7490768d2a8ae0e707c1dfb943b",
      "crc":"0d5fd8c9"
    },
    "/flash/www/config/registrydoc.html":{
      "length":169108,
      "date":1515600577,
      "md5":"f4b896b0cd0ead740985e4d8e8c20be4",
      "sha":"893b119002295f37afaa71c2f7f6d13fda14ea7c",
      "crc":"3b5a3493"
    },
    "/flash/www/panel/comm.js":{
      "length":4715,
      "date":1498074333,
      "md5":"44aa80868230fbfeee0a3c48c390896d",
      "sha":"37b479f65e7e8221d6fd9349439a8193cc645ba7",
      "crc":"0d5e92bd"
    },
    "/flash/www/panel/index.php":{
      "length":2648,
      "date":1501526934,
      "md5":"923ce6739971521191f9000662f38323",
      "sha":"a35d1d5f24da487be376595b46598e162e0f5310",
      "crc":"ffd86d7b"
    },
    "/flash/www/panel/panel.js":{
      "length":993,
      "date":1501527049,
      "md5":"9d9a2cbb435ffe8af5bd9d8c0598dccd",
      "sha":"2ef881dc8d90b4b0fb80a59d717c7125ca23fb04",
      "crc":"4fcd0f37"
    },
    "/flash/www/panel/panel.css":{
      "length":2586,
      "date":1501527291,
      "md5":"2a3a66d14d7bc6d4b01dfbd745205c7d",
      "sha":"886770297a07a594b88430d5db4ae9e23738d118",
      "crc":"2dd8a81d"
    },
    "/flash/www/graphr.zip":{
      "length":556637,
      "date":1506536442,
      "md5":"891b1dfa8d774b85aefcbd8791abe11f",
      "sha":"e5d204333658bd5c2f7c5b5ff682911124a10766",
      "crc":"62d153fb"
    },
    "/flash/public/dcp.zip":{
      "length":181914,
      "date":1504795829,
      "md5":"655e8587293f35f11c5c24fc38201d2f",
      "sha":"5fcfd8e38826e648f98f8d50f3613deb0d6312b6",
      "crc":"da99b7d0"
    },
    "/flash/test.txt":{
      "length":304,
      "date":1495131459,
      "md5":"fc9f1f5e67928ccb9be3aeaa66cd9e52",
      "sha":"6100d999f484f98ab476408c801dd000e579a62c",
      "crc":"765047c5"
    },
    "/flash/dmx.jar":{
      "length":4476,
      "date":1500567859,
      "md5":"3fd35bbe6bbf53a32aecf273275d1839",
      "sha":"4f702a87adb060294b553e6bd212672727d5d25f",
      "crc":"e81db9aa"
    },
    "/flash/juptime.jar":{
      "length":3201,
      "date":1506713589,
      "md5":"d4c2482fae18482727c1b2afabcf94b4",
      "sha":"86268b720b99760a4ebdb803db53f3f7fd18fd18",
      "crc":"44b0878c"
    },
    "/flash/jscan.jar":{
      "length":2189,
      "date":1507141493,
      "md5":"a0a42e17f003cedcac9c8e662ada6b36",
      "sha":"f1cafb56fdae33b66fff9b20cd2ff2705d96da9e",
      "crc":"60f00fe2"
    },
    "/flash/hmi.jar":{
      "length":8329,
      "date":1511283865,
      "md5":"1a1b247ccb5e3eb9623d12578c1ba833",
      "sha":"7a1f5868817e8a3e60fe8fb2c4d9ed168e53d141",
      "crc":"fb2a0367"
    },
    "/web.log":{
      "length":4735,
      "date":1516889801,
      "md5":"03febfe88d35e995a0d8a15f05e37f70",
      "sha":"4da80a3fb423a2e1ad8b05b6384326ef974a45f3",
      "crc":"d393e4a3"
    },
    "/flash/cinema.jar":{
      "length":313835,
      "date":1512413064,
      "md5":"45b29edcb85af51f58eda0f693b6c13e",
      "sha":"ba7f0da988e351b329e1c8af1929ab36dad99dec",
      "crc":"6e688a54"
    },
    "/flash/cinekey.jar":{
      "length":20266,
      "date":1512570698,
      "md5":"4b8adacc107abc577fae3c73db11d56a",
      "sha":"dde36076fe9a0613a40ccf78d9895bdfd92d93a2",
      "crc":"69db880f"
    },
    "/flash/key.pem":{
      "length":1041,
      "date":1488297708,
      "md5":"f643172f1cceb3703ce126df1f9293b9",
      "sha":"2cea702929e9cc04f6b4c003d2fb3ee507d5240e",
      "crc":"2e1cc611"
    },
    "/flash/key.pub":{
      "length":272,
      "date":1512584838,
      "md5":"344622d414a797bb9d992582c4d129b5",
      "sha":"1a45f21b80ee1ec8509d62fbfd5c71a96e400154",
      "crc":"4c1ce46a"
    },
    "/flash/honeypot.cer":{
      "length":1092,
      "date":1512755338,
      "md5":"51f65aaabc1f1f8d20c27dbe21389e8a",
      "sha":"d218400c2d82bb3766917e9139d0a21a54c56e4e",
      "crc":"ec194c40"
    },
    "/flash/pubkey.pem":{
      "length":278,
      "date":1513103302,
      "md5":"8077da7d24beedf7d0c56bd1d42bd062",
      "sha":"06631dbc5226ea3d3c3e6695c573877e351a7b72",
      "crc":"ce425129"
    },
    "/flash/jtest2.jar":{
      "length":3043,
      "date":1515165671,
      "md5":"c4b4ba07a459dd644abac99bbccbd31e",
      "sha":"35256db54659e900ffc9112bc0e769683ab8e818",
      "crc":"7beaf8b1"
    },
    "/flash/gogo.dat":{
      "length":13,
      "date":1515701808,
      "md5":"32201ddab35c4461b4cc8a555cc52125",
      "sha":"3a10b47bd880c61ab49b8d9c20a357ffb9905424",
      "crc":"c3d317fe"
    },
    "/flash/manifest.zip":{
      "length":8589,
      "date":1516717968,
      "md5":"cc9525181bd63a36f7a7c9bbdd263d52",
      "sha":"a9d9aa3d9f9e43bb77e00861cf1cae8c75307794",
      "crc":"aa1d1871"
    },
    "/flash/www/test.zip":{
      "length":183358,
      "date":1516103573,
      "md5":"c3cfda778bf0334684669fedb36180f7",
      "sha":"1aef18b365347aa0f13f38f315a04edbf7eb37d2",
      "crc":"1da88b8e"
    },
    "/flash/public/logs/file_list.php":{
      "length":1324,
      "date":1516026614,
      "md5":"dc00d3ff6e0dbde0d518cb031adb2ffc",
      "sha":"084e23a1c3920288fc77f5077af9e426d15a7070",
      "crc":"1619a010"
    },
    "/flash/logs/file_list.php":{
      "length":1324,
      "date":1516026614,
      "md5":"dc00d3ff6e0dbde0d518cb031adb2ffc",
      "sha":"084e23a1c3920288fc77f5077af9e426d15a7070",
      "crc":"1619a010"
    },
    "/flash/cinema_backup/macro_cineasia.csv":{
      "length":912,
      "date":1512576908,
      "md5":"3a9c04ed302b116828c6b1e34d90eee8",
      "sha":"0ba6c912592b8fcc94f325088bbf6e5e915b8095",
      "crc":"c08feb1b"
    },
    "/test.zip":{
      "length":28304,
      "date":1516908949,
      "md5":"2a8a593cc66fa62117497c28bf565d20",
      "sha":"d62543f024dfa510450d7be40ff5685269c042c9",
      "crc":"9c9d97ef"
    }
  }
}
bruce_dev /temp> 

We see here how the CAT command can format JSON for us.

Hmm… Perhaps before we release v1.6.4 JANOS I’ll have this command list the files it extracts. Seems like it should have at least indicated that it did what we wanted.

So in the past I have designed products that were capable of plotting collected data as a graphics file for display in the browser. This isn’t currently possible in the JNIOR and it is a feature that could be added to JANOS. It can be useful. I should also mention that JANOS executes Java and serves files out of JAR and ZIP files which generally utilize DEFLATE compression. So we already perform DEFLATE decompression. With a compressor we can add the ability to create/modify JAR and ZIP archives as well as to generate PNG graphics for plots.

The issue with the specifications is that they describe the compression and file formats but do not give you the algorithms. They don’t tell you how to do it. Just what you must do.

Sure often there is reference code or open source projects that you can find. Those often are complete projects and it is difficult to find the precise code in it that you need to understand if you are to implement the core algorithms. Then that code has been optimized and complicated with options over the years that end up obfuscating the structure. And, no, we don’t just absorb 3rd party code into JANOS. That is not the way to maintain a stable embedded environment. Not in my book.

So far this is the case for DEFLATE. I am going to develop the algorithm from scratch so that I fully understand it, know precisely what every statement does, and how the code will perform in the JANOS platform. Maybe more importantly that it will not misbehave and drag JANOS down.

Well I was thinking that I would openly do that here so you can play along.  Some approaches to LZ77 have been patented. I have no clue and am not going to spend weeks trying to understand the patent prose. Supposedly the LZ77 implementation associated with DEFLATE is unencumbered. Maybe so. Still I might get creative and cross some line. I am not overly concerned about it but would like to know, at least before getting some sort of legal complaint.

Yeah so… let’s reinvent the wheel. That’s usually how I roll…

If you look into DEFLATE you will quickly be distracted by data structures and Huffman coding. There are some confusing but ingenious ways of efficiently conveying the Huffman tables, and the Huffman coding even seems to be applied recursively. All of that will boggle your mind even before you get to the LZ77 compression at the heart of it all. So, ignore all of that. We will get to it. I am going to start at the heart and build outwards.

LZ77 compression

The compression works by identifying sequences of bytes that occur in the data stream that are repeated later. The repeated sequence is replaced by a short and efficient reference back to the earlier data thereby reducing the size of the overall stream. A 20-year old document by Antaeus Feldspar describes it well by example (my link has since gone bad).

As everything has a limitation, DEFLATE defines a 32KB sliding window which may contain any referenced prior sequence. It just is not feasible to randomly access the entire data set and allow you to reference sequences all the way back to the beginning. This also keeps distances under control allowing only smaller integers to appear in the stream.

It all sounds great, but then you realize that in compression you have to search that entire 32KB window for matches to the current sequence of bytes each time a new byte is added. Lots of processor cycles are involved and the whole process could take forever. Of course, while the decompressor needs to be ready to reference the whole 32KB window, a compressor might use a smaller window. That would reduce the effort involved at the cost of compression efficiency. The specifications suggest window and sequence length factors that might be controlled in balancing speed and space efficiency.

It can all get more complex in that the prior sequence can actually overlap the current sequence (as in the example in that document). A further complication comes in if you consider that lazy matching might lead to better compression. A short sequence match might mask a potentially longer match which could have been more beneficial.
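
To make the back-reference idea concrete, here is a minimal sketch of the decompression side, which is what any output of a compressor must satisfy. The class and method names (Lz77Expand, literal, copy) are invented for illustration and are not part of JANOS. It replays the Blah blah example from that document, where the reference overlaps the data it produces.

```java
// Illustrative only: expands literals and [distance, length] references.
class Lz77Expand {
    private final StringBuilder out = new StringBuilder();

    // Append literal characters unchanged.
    void literal(String s) {
        out.append(s);
    }

    // Copy 'length' characters starting 'distance' characters back.
    // Copying one character at a time lets the source overlap the output.
    void copy(int distance, int length) {
        if (distance < 1 || distance > out.length())
            throw new IllegalArgumentException("distance outside of data seen so far");
        for (int i = 0; i < length; i++)
            out.append(out.charAt(out.length() - distance));
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        Lz77Expand x = new Lz77Expand();
        x.literal("Blah b");
        x.copy(5, 18);
        x.literal("!");
        System.out.println(x.result()); // Blah blah blah blah blah!
    }
}
```

Because copy() appends each character before reading the next one, a length greater than the distance simply repeats the last few characters over and over.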

So how do I want to proceed here? Some of this reminds me of the fun in creating JANOS’ Regex engine. Hmm…

The Regex engine ends up compiling the expression into one large state machine through which multiple pointers can progress simultaneously. Pointers are advanced as each character is obtained from the data stream. The first pointer (or last depending on the mode) to make it through the whole expression signals a match. If it sounds complicated, it sort of is, but at the same time it’s pretty cool. As far as Regex goes I’ve been able to apply this to almost all of the standard Regex functionality, although JANOS hasn’t implemented every Regex syntax.

For DEFLATE there is somewhat of a similar situation where we want to examine bytes from the data stream one at a time and have the compression algorithm raise its hand when a sequence can be optimally encoded by a [length, distance] pointer. But we want to consider all possibilities and to try to do what leads to the most efficient compression.

I will start by implementing the sliding window as a queue using my standard approach to such things. The size of the queue will be 32K entries or less. In fact, to start out I’ll probably keep it very small. We can enlarge it later when we benchmark the compression algorithm.

Two index pointers will bound the data in the queue. Each new byte will be inserted at the INPTR and that pointer will be incremented and wrapped as required. The oldest byte will be located at the OUTPTR. Once the queue fills we will have to advance the OUTPTR to make room for the next entry. Older data will be dropped and the queue will run at maximum capacity. This is the sliding window.

The sliding window caches the uncompressed data stream. The compressed data stream will be generated by the algorithm separately. We need one more pointer in the sliding window indicating the start of the current sequence being evaluated. Call that CURPTR. If the current sequence cannot be matched we would output the byte from CURPTR, then advance and wrap the pointer as necessary. If the sequence is matched we output the [length, distance] code and advance CURPTR to skip the entire matched sequence.

CURPTR then will lag behind INPTR. It will not get in the way of OUTPTR as DEFLATE specifies a maximum length for the sequence match of 258 bytes and our sliding window will be much larger.

Now let’s think about the sequence matching procedure…

I am going to prototype my algorithm in Java on the JNIOR just to make it easier to test and debug. Later, once I have the structure, I can recast it in C and embed it into JANOS.

We’re going to focus on the LZ77 part of the compression first. Our main goal is simply to be compatible with decompression. Our LZ77 algorithm then basically doesn’t have to do anything. We wouldn’t need to find a single repeated sequence nor replace anything with pointers back to any sliding window. Of course our compression ratio wouldn’t be all that impressive. We would still gain through the Huffman coding stages, which I am leaving for later. But in the end we would still be able to create file collections and PNG graphics files that are universally usable.
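
As a point of comparison, desktop Java can already produce a stream of this do-nothing kind. The sketch below assumes the standard JDK java.util.zip classes on a desktop JVM; nothing here implies they are available under JANOS. It uses the Deflater.HUFFMAN_ONLY strategy, which skips string matching and applies only the Huffman stage, and then inflates the result to confirm that an ordinary decompressor accepts it. The class and method names are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class HuffmanOnlyDemo {

    // Compress with the Huffman stage only (no LZ77 string matching).
    static byte[] deflateNoMatching(byte[] input) {
        Deflater d = new Deflater();
        d.setStrategy(Deflater.HUFFMAN_ONLY);
        d.setInput(input);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!d.finished()) {
            int n = d.deflate(buf);
            out.write(buf, 0, n);
        }
        d.end();
        return out.toByteArray();
    }

    // Decompress with the stock inflater.
    static byte[] inflate(byte[] packed) {
        Inflater inf = new Inflater();
        inf.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        try {
            while (!inf.finished()) {
                int n = inf.inflate(buf);
                if (n == 0 && inf.needsInput())
                    break; // truncated input: stop rather than loop forever
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not a valid stream", e);
        } finally {
            inf.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] src = "Blah blah blah blah blah!".getBytes();
        byte[] packed = deflateNoMatching(src);
        byte[] back = inflate(packed);
        System.out.println("round trip ok: " + Arrays.equals(src, back));
        System.out.println(src.length + " bytes in, " + packed.length + " bytes out");
    }
}
```

On an input this short the wrapper overhead may well outweigh any gain, so the sizes printed are only informative. The point is compatibility, not ratio.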

But really, there’s no fun in a kludge. Let’s see if we can achieve an LZ77 implementation that we can be proud of. Well, at least one that works.

So for development I am going to create a program that will read a selected file and compress it using LZ77 into another. I’ll isolate all of the compression effort into one routine and have the outer program report some statistics upon completion.

Here’s our program for testing compression. All of the compression work will be done in do_compress() and this will report the results. At this point we just copy the source file. Yeah, this will be slow. But it will let us examine what we are trying to do more closely than if I went straight to C and used the debugger. In that case I couldn’t really share it with you.

package jtest;
 
import com.integpg.system.JANOS;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
 
public class Main {
    
    public static void main(String[] args) throws Throwable {
        
        // Requires a test file (use log as default)
        String filename = "/jniorsys.log";
        if (args.length > 0)
            filename = args[0];
        
        // Open the selected file for reading
        File src = new File(filename);
        long srclen = src.length();
        BufferedReader infile = new BufferedReader(new FileReader(src));
        if (!infile.ready())
            System.exit(1);
        
        // Create an output file
        BufferedWriter outfile = new BufferedWriter(new FileWriter("/outfile.dat"));
        
        // perform compression
        long timer = JANOS.uptimeMillis();
        do_compress(outfile, infile);
        timer = JANOS.uptimeMillis() - timer;
        
        // Close files
        outfile.close();
        infile.close();
        
        // Output statistics
        File dest = new File("/outfile.dat");
        long destlen = dest.length();
        
        System.out.printf("Processing %.3f seconds.\n", timer/1000.);
        System.out.printf("Source %lld bytes.\n", srclen);
        System.out.printf("Result %lld bytes.\n", destlen);
        System.out.printf("Ratio %.2f%%\n", 100. - (100. * destlen)/srclen);
    }
        
}

This uses my throws Throwable trick to avoid having to worry about try-catch for the time being.

    
    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        // simply copy at first
        while (infile.ready()) {
            int ch = infile.read();
            outfile.write(ch);
        }
        
    }

bruce_dev /> jtest
Processing 32.360 seconds.
Source 36737 bytes.
Result 36737 bytes.
Ratio 0.00%

bruce_dev /> echo Blah blah blah blah > blah.dat

bruce_dev /> cat blah.dat
Blah blah blah blah

bruce_dev /> jtest blah.dat
Processing 0.023 seconds.
Source 21 bytes.
Result 21 bytes.
Ratio 0.00%

bruce_dev />

Now back to thinking about the actual compression and matching sequences from a sliding window.

To start let’s implement our sliding window. Recall that I would use a queue for that. You can see here that depending on the window size we will retain the previous so many bytes. New bytes are queued at the INPTR position and when the queue fills we will push the OUTPTR discarding older data.

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        // create queue (sliding window)
        int window = 1024;
        byte[] data = new byte[window];
        int inptr = 0;
        int outptr = 0;
        
        // simply copy at first
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            
            // matching (cannot yet so just output byte)
            outfile.write(ch);
            
            // queue uncompressed data
            data[inptr++] = (byte)ch;
            if (inptr == window)
                inptr = 0;
            if (inptr == outptr) {
                outptr++;
                if (outptr == window)
                    outptr = 0;
            }       
            
        }
        
    }
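
The wrap-around behavior is easier to see with a tiny window. This is a standalone sketch of the same pointer logic; the class name and the snapshot() helper are hypothetical additions for inspecting the queue and are not part of the program above.

```java
// Illustrative only: the queue logic from do_compress() with a tiny window.
class SlidingWindowDemo {
    final int window;
    final byte[] data;
    int inptr = 0;   // where the next byte will be stored
    int outptr = 0;  // oldest byte still retained

    SlidingWindowDemo(int window) {
        this.window = window;
        this.data = new byte[window];
    }

    // Queue one byte, dropping the oldest byte once the queue is full.
    void put(int ch) {
        data[inptr++] = (byte) ch;
        if (inptr == window)
            inptr = 0;
        if (inptr == outptr) {
            outptr++;
            if (outptr == window)
                outptr = 0;
        }
    }

    // Retained bytes from oldest to newest (hypothetical helper).
    String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (int p = outptr; p != inptr; p = (p + 1) % window)
            sb.append((char) (data[p] & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) {
        SlidingWindowDemo w = new SlidingWindowDemo(8);
        for (char c : "ABCDEFGHIJ".toCharArray())
            w.put(c);
        System.out.println(w.snapshot()); // DEFGHIJ
    }
}
```

Note that a window of 8 settles at 7 retained bytes. One slot is given up so that INPTR == OUTPTR never has to mean both empty and full.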

Now for a particular position in the input stream (CURPTR) we want to scan the queue for sequence matches. We could do that by brute force but for a large sliding window that would be very slow. Also there is the concept of a lazy match which if implemented might lead to better compression ratios. So how to approach the matching process?
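
For reference, the brute force search mentioned above might look like the following sketch. It is a baseline for comparison only. It works on a plain array rather than the circular queue, and the names (BruteForceMatch, longestMatch) are assumptions made for the example. The limits of 3 and 258 bytes are the DEFLATE minimum and maximum match lengths.

```java
// Illustrative baseline: scan every earlier position for the longest match.
class BruteForceMatch {
    static final int MIN_MATCH = 3;
    static final int MAX_MATCH = 258;

    // Returns {distance, length} for the longest match to data[pos...]
    // found within the previous windowSize bytes, or null if none reaches 3.
    static int[] longestMatch(byte[] data, int pos, int windowSize) {
        int bestLen = 0;
        int bestDist = 0;
        int first = Math.max(0, pos - windowSize);

        // Nearest candidates are tried first, so ties keep the shorter distance.
        for (int start = pos - 1; start >= first; start--) {
            int len = 0;
            while (pos + len < data.length && len < MAX_MATCH
                    && data[start + len] == data[pos + len])
                len++;
            if (len > bestLen) {
                bestLen = len;
                bestDist = pos - start;
            }
        }
        return bestLen >= MIN_MATCH ? new int[] { bestDist, bestLen } : null;
    }

    public static void main(String[] args) {
        byte[] data = "Blah blah blah blah blah!".getBytes();
        int[] m = longestMatch(data, 6, 1024);
        System.out.println("D=" + m[0] + ", L=" + m[1]); // D=5, L=18
    }
}
```

Every call walks the whole window and may compare up to 258 bytes at each candidate, which is why this gets slow as the window grows.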

So for some position in the input stream we will be searching prior data for a sequence match (3 or more bytes). So we create the CURPTR. If we cannot find a match (and right now we cannot because we haven’t implemented matching) we will just output the data byte and bump the current position. Searching will continue for a match starting at the new position.

Right now CURPTR will track INPTR. Later while we are watching for matches it will lag behind. Here we have created CURPTR. There is otherwise no major functional change.

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        // create queue (sliding window)
        int window = 1024;
        byte[] data = new byte[window];
        int inptr = 0;
        int outptr = 0;
        int curptr = 0;
        
        // simply copy at first
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            
            // matching (cannot yet so just output byte)
            outfile.write(ch);
            curptr++;
            if (curptr == window)
                curptr = 0;
            
            // queue uncompressed data
            data[inptr++] = (byte)ch;
            if (inptr == window)
                inptr = 0;
            if (inptr == outptr) {
                outptr++;
                if (outptr == window)
                    outptr = 0;
            }       
            
        }
        
    }

Now how are we going to do this “watching for matches” thing?

Let’s create the concept of an active match. At any given time there will be from 0 to some number of active matches. Each will represent a match to the sequence of bytes appearing at an input stream position. As new bytes are retrieved from the uncompressed input stream we will check any active matches and advance them. If a match is no longer valid it will be removed and forgotten. At that point, though, maybe the match warrants replacement in the stream with a pointer. We will see.

I made the data queue and related pointers static members of the program and created the following class representing an active match.

    // An active match. At any given position in the sliding WINDOW we compare and track
    //  matches to the incoming DATA stream.
    static class match {
        public int start;
        public int ptr;
        public int len;
        
        match(int pos) {
            start = pos;
            ptr = pos + 1;
            if (ptr == WINDOW)
                ptr = 0;
            len = 1;
        }
        
        public boolean check(int ch) {
            if ((DATA[ptr] & 0xff) != ch)
                return (false);
            
            ptr++;
            if (ptr == WINDOW)
                ptr = 0;
            len++;
            
            return (true);
        }
    }

When a new data byte is entered into the queue we will want to create these active match objects for every matching byte that previously exists. Every one of those represents a potential sequence.

Okay, when we enter a new byte in the queue it becomes a candidate for replacement by a reference to a sequence starting with the same byte somewhere earlier. So we want to start a new active match for each of those earlier bytes. We will process these matches as additional bytes are received from the input stream. To do this, though, we don’t want to search the entire window for prior occurrences of each character. So to make things efficient I am going to maintain linked lists through the queue for each character.

Since a byte can have 256 values we create a HEAD pointer array with 256 entries. This is referenced using the byte value. Each queue position then will have both a forward FWD and backwards BACK pointer forming a bi-directional linked list. Yeah, this quintuples our memory requirement but with the benefit of processing speed.

The list has to be bi-directional because once the queue fills we are going to drop bytes. It is then necessary to trim the linked lists to remove pointers for data no longer in the queue. That can only be done efficiently if we can reference a previous entry in the linked list. So we need both directions.

Here are our static members so far. This is the memory usage.

    // create queue (sliding window)
    static final int WINDOW = 1024;
    static final byte[] DATA = new byte[WINDOW];
    static int INPTR = 0;
    static int OUTPTR = 0;
    static int CURPTR = 0;
    
    // data linked list arrays
    static final short[] HEAD = new short[256];
    static final short[] FWD = new short[WINDOW];
    static final short[] BACK = new short[WINDOW];

Now we maintain the linked lists as we add and remove data from the queue. We also can efficiently create new active match objects. Note that we store pointers in the links as +1 so as to keep 0 as a terminator.

            // queue uncompressed DATA
            DATA[INPTR] = (byte)ch;
            
            // Add byte to the head of the appropriate linked list. Note pointers are stored +1 so
            //  as to use 0 as an end of list marker. Lists are bi-directional so we can trim the 
            //  tail when data is dropped from the queue.
            short ptr = HEAD[ch];
            HEAD[ch] = (short)(INPTR + 1);
            FWD[INPTR] = ptr;
            BACK[INPTR] = 0;
            if (ptr != 0)
                BACK[ptr - 1] = (short)(INPTR + 1);
            
            // advance entry pointer
            INPTR++;
            if (INPTR == WINDOW)
                INPTR = 0;
            
            // drop data from queue when full
            if (INPTR == OUTPTR) {
                
                // trim linked list as byte is being dropped
                if (BACK[OUTPTR] == 0)
                    HEAD[DATA[OUTPTR] & 0xff] = 0;
                else
                    FWD[BACK[OUTPTR] - 1] = 0;
 
                // push end of queue
                OUTPTR++;
                if (OUTPTR == WINDOW)
                    OUTPTR = 0;
            }
 
            // create new active match for all CH in the queue (except last)
            while (ptr != 0) {
                
                // new match started (not doing anything with it yet)
                match m = new match(ptr - 1);
                
                ptr = FWD[ptr - 1];
            }

I adjusted the program to dump non-zero HEAD entries and each occupied queue position including the links as a check. Remember that links are stored in here +1.

bruce_dev /> jtest blah.dat

HEAD
 0x0a 21
 0x0d 20
 0x20 15
 0x42 1
 0x61 18
 0x62 16
 0x68 19
 0x6c 17

QUEUE
 0 0x42 0 0
 1 0x6c 0 7
 2 0x61 0 8
 3 0x68 0 9
 4 0x20 0 10
 5 0x62 0 11
 6 0x6c 2 12
 7 0x61 3 13
 8 0x68 4 14
 9 0x20 5 15
 10 0x62 6 16
 11 0x6c 7 17
 12 0x61 8 18
 13 0x68 9 19
 14 0x20 10 0
 15 0x62 11 0
 16 0x6c 12 0
 17 0x61 13 0
 18 0x68 14 0
 19 0x0d 0 0
 20 0x0a 0 0

STATS
Processing 0.078 seconds.
Source 21 bytes.
Result 21 bytes.
Ratio 0.00%

bruce_dev />
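
The dump can be cross-checked with a standalone sketch of the same list maintenance. The class ByteChains and its build() and positions() helpers are hypothetical. The sketch leaves out wrapping and trimming, so it assumes the window is larger than the data.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: per-byte linked lists threaded through the queue.
class ByteChains {
    final short[] head = new short[256]; // newest position of each byte, +1
    final short[] fwd;                   // link to the next older position, +1
    final short[] back;                  // link to the next newer position, +1
    int inptr = 0;

    ByteChains(int window) {
        fwd = new short[window];
        back = new short[window];
    }

    // Build the lists for a short piece of text (no wrapping needed).
    static ByteChains build(String text) {
        ByteChains c = new ByteChains(text.length() + 1);
        for (int i = 0; i < text.length(); i++)
            c.add(text.charAt(i));
        return c;
    }

    // Link the byte at the next queue position onto the head of its list.
    void add(int ch) {
        int b = ch & 0xff;
        short prev = head[b];
        head[b] = (short) (inptr + 1);
        fwd[inptr] = prev;
        back[inptr] = 0;
        if (prev != 0)
            back[prev - 1] = (short) (inptr + 1);
        inptr++;
    }

    // Queue positions holding the given byte, newest first.
    List<Integer> positions(int ch) {
        List<Integer> found = new ArrayList<>();
        for (int p = head[ch & 0xff]; p != 0; p = fwd[p - 1])
            found.add(p - 1);
        return found;
    }

    public static void main(String[] args) {
        ByteChains c = build("Blah blah blah blah\r\n");
        System.out.println(c.positions('l')); // [16, 11, 6, 1]
        System.out.println(c.head['b']);      // 16
    }
}
```

Walking a list visits only the positions that hold the byte in question, which is the whole point of the extra memory. One caution follows from this layout: because links are stored +1 in a short, the largest usable window is 32767 entries rather than a full 32768.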

Now as we create new active matches we are going to collect them in an ArrayList object.

    // active matching
    static ArrayList<match> SEQ = new ArrayList<match>();

            // create new active matches for all CH in the queue (except last)
            while (ptr != 0) {
                SEQ.add(new match(ptr - 1));
                ptr = FWD[ptr - 1];
            }

So as each new data byte is retrieved from the uncompressed input stream we will process all active matches. Those that continue to match will be retained and others dropped. That code looks like this:

        // process uncompressed stream
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            
            // process active match objects
            System.out.printf("New byte[%d]: 0x%02x\n", INPTR, ch & 0xff);
            for (int n = SEQ.size() - 1; 0 <= n; n--) {
                match m = SEQ.get(n);
                if (!m.check(ch))
                    SEQ.remove(n);
            }

If, following this, I dump the remaining active matches, we can watch them proceed. At this point, though, we have not interpreted the matching status so as to decide whether or not the stream can be altered.

            // dump remaining active matches
            Iterator i = SEQ.iterator();
            while (i.hasNext()) {
                match m = (match) i.next();
                System.out.printf(" Start: %d Ptr: %d Len: %d\n", m.start, m.ptr, m.len);
            }

So, reviewing the paper An Explanation of the DEFLATE Algorithm from 1997:

Antaeus Feldspar wrote:

LZ77 compression

LZ77 compression works by finding sequences of data that are repeated. The term “sliding window” is used; all it really means is that at any given point in the data, there is a record of what characters went before. A 32K sliding window means that the compressor (and decompressor) have a record of what the last 32768 (32 * 1024) characters were. When the next sequence of characters to be compressed is identical to one that can be found within the sliding window, the sequence of characters is replaced by two numbers: a distance, representing how far back into the window the sequence starts, and a length, representing the number of characters for which the sequence is identical.

I realize this is a lot easier to see than to just be told. Let’s look at some highly compressible data:

        Blah blah blah blah blah!

Our datastream starts by receiving the following characters: “B,” “l,” “a,” “h,” ” ,” and “b.” However, look at the next five characters:

         vvvvv
        Blah blah blah blah blah!
              ^^^^^

There is an exact match for those five characters in the characters that have already gone into the datastream, and it starts exactly five characters behind the point where we are now. This being the case, we can output special characters to the stream that represent a number for length, and a number for distance.

The data so far:

	Blah blah b

The compressed form of the data so far:

	Blah b[D=5,L=5]

The compression can still be increased, though to take full advantage of it requires a bit of cleverness on the part of the compressor. Look at the two strings that we decided were identical. Compare the character that follows each of them. In both cases, it’s “l” — so we can make the length 6, and not just five. But if we continue checking, we find the next characters, and the next characters, and the next characters, are still identical — even if the so-called ‘previous’ string is overlapping the string we’re trying to represent in the compressed data!

It turns out that the 18 characters that start at the second character are identical to the 18 characters that start at the seventh character. It’s true that when we’re decompressing, and read the length, distance pair that describes this relationship, we don’t know what all those 18 characters will be yet — but if we put in place the ones that we know, we will know more, which will allow us to put down more… or, knowing that any length-and-distance pair where length > distance is going to be repeating (distance) characters again and again, we can set up the decompressor to do just that.

It turns out our highly compressible data can be compressed down to just this:

	Blah b[D=5, L=18]!

So if I feed this exact stream to what we have so far we can observe the sequencing:

bruce_dev /> echo Blah blah blah blah blah! > blah.dat

bruce_dev /> cat blah.dat
Blah blah blah blah blah!

bruce_dev /> jtest blah.dat                           
New byte[0]: 0x42
New byte[1]: 0x6c
New byte[2]: 0x61
New byte[3]: 0x68
New byte[4]: 0x20
New byte[5]: 0x62
New byte[6]: 0x6c
New byte[7]: 0x61
 Start: 1 Ptr: 3 Len: 2
New byte[8]: 0x68
 Start: 1 Ptr: 4 Len: 3
 Start: 2 Ptr: 4 Len: 2
New byte[9]: 0x20
 Start: 1 Ptr: 5 Len: 4
 Start: 2 Ptr: 5 Len: 3
 Start: 3 Ptr: 5 Len: 2
New byte[10]: 0x62
 Start: 1 Ptr: 6 Len: 5
 Start: 2 Ptr: 6 Len: 4
 Start: 3 Ptr: 6 Len: 3
 Start: 4 Ptr: 6 Len: 2
New byte[11]: 0x6c
 Start: 1 Ptr: 7 Len: 6
 Start: 2 Ptr: 7 Len: 5
 Start: 3 Ptr: 7 Len: 4
 Start: 4 Ptr: 7 Len: 3
 Start: 5 Ptr: 7 Len: 2
New byte[12]: 0x61
 Start: 1 Ptr: 8 Len: 7
 Start: 2 Ptr: 8 Len: 6
 Start: 3 Ptr: 8 Len: 5
 Start: 4 Ptr: 8 Len: 4
 Start: 5 Ptr: 8 Len: 3
 Start: 6 Ptr: 8 Len: 2
 Start: 1 Ptr: 3 Len: 2
New byte[13]: 0x68
 Start: 1 Ptr: 9 Len: 8
 Start: 2 Ptr: 9 Len: 7
 Start: 3 Ptr: 9 Len: 6
 Start: 4 Ptr: 9 Len: 5
 Start: 5 Ptr: 9 Len: 4
 Start: 6 Ptr: 9 Len: 3
 Start: 1 Ptr: 4 Len: 3
 Start: 7 Ptr: 9 Len: 2
 Start: 2 Ptr: 4 Len: 2
New byte[14]: 0x20
 Start: 1 Ptr: 10 Len: 9
 Start: 2 Ptr: 10 Len: 8
 Start: 3 Ptr: 10 Len: 7
 Start: 4 Ptr: 10 Len: 6
 Start: 5 Ptr: 10 Len: 5
 Start: 6 Ptr: 10 Len: 4
 Start: 1 Ptr: 5 Len: 4
 Start: 7 Ptr: 10 Len: 3
 Start: 2 Ptr: 5 Len: 3
 Start: 8 Ptr: 10 Len: 2
 Start: 3 Ptr: 5 Len: 2
New byte[15]: 0x62
 Start: 1 Ptr: 11 Len: 10
 Start: 2 Ptr: 11 Len: 9
 Start: 3 Ptr: 11 Len: 8
 Start: 4 Ptr: 11 Len: 7
 Start: 5 Ptr: 11 Len: 6
 Start: 6 Ptr: 11 Len: 5
 Start: 1 Ptr: 6 Len: 5
 Start: 7 Ptr: 11 Len: 4
 Start: 2 Ptr: 6 Len: 4
 Start: 8 Ptr: 11 Len: 3
 Start: 3 Ptr: 6 Len: 3
 Start: 9 Ptr: 11 Len: 2
 Start: 4 Ptr: 6 Len: 2
New byte[16]: 0x6c
 Start: 1 Ptr: 12 Len: 11
 Start: 2 Ptr: 12 Len: 10
 Start: 3 Ptr: 12 Len: 9
 Start: 4 Ptr: 12 Len: 8
 Start: 5 Ptr: 12 Len: 7
 Start: 6 Ptr: 12 Len: 6
 Start: 1 Ptr: 7 Len: 6
 Start: 7 Ptr: 12 Len: 5
 Start: 2 Ptr: 7 Len: 5
 Start: 8 Ptr: 12 Len: 4
 Start: 3 Ptr: 7 Len: 4
 Start: 9 Ptr: 12 Len: 3
 Start: 4 Ptr: 7 Len: 3
 Start: 10 Ptr: 12 Len: 2
 Start: 5 Ptr: 7 Len: 2
New byte[17]: 0x61
 Start: 1 Ptr: 13 Len: 12
 Start: 2 Ptr: 13 Len: 11
 Start: 3 Ptr: 13 Len: 10
 Start: 4 Ptr: 13 Len: 9
 Start: 5 Ptr: 13 Len: 8
 Start: 6 Ptr: 13 Len: 7
 Start: 1 Ptr: 8 Len: 7
 Start: 7 Ptr: 13 Len: 6
 Start: 2 Ptr: 8 Len: 6
 Start: 8 Ptr: 13 Len: 5
 Start: 3 Ptr: 8 Len: 5
 Start: 9 Ptr: 13 Len: 4
 Start: 4 Ptr: 8 Len: 4
 Start: 10 Ptr: 13 Len: 3
 Start: 5 Ptr: 8 Len: 3
 Start: 11 Ptr: 13 Len: 2
 Start: 6 Ptr: 8 Len: 2
 Start: 1 Ptr: 3 Len: 2
New byte[18]: 0x68
 Start: 1 Ptr: 14 Len: 13
 Start: 2 Ptr: 14 Len: 12
 Start: 3 Ptr: 14 Len: 11
 Start: 4 Ptr: 14 Len: 10
 Start: 5 Ptr: 14 Len: 9
 Start: 6 Ptr: 14 Len: 8
 Start: 1 Ptr: 9 Len: 8
 Start: 7 Ptr: 14 Len: 7
 Start: 2 Ptr: 9 Len: 7
 Start: 8 Ptr: 14 Len: 6
 Start: 3 Ptr: 9 Len: 6
 Start: 9 Ptr: 14 Len: 5
 Start: 4 Ptr: 9 Len: 5
 Start: 10 Ptr: 14 Len: 4
 Start: 5 Ptr: 9 Len: 4
 Start: 11 Ptr: 14 Len: 3
 Start: 6 Ptr: 9 Len: 3
 Start: 1 Ptr: 4 Len: 3
 Start: 12 Ptr: 14 Len: 2
 Start: 7 Ptr: 9 Len: 2
 Start: 2 Ptr: 4 Len: 2
New byte[19]: 0x20
 Start: 1 Ptr: 15 Len: 14
 Start: 2 Ptr: 15 Len: 13
 Start: 3 Ptr: 15 Len: 12
 Start: 4 Ptr: 15 Len: 11
 Start: 5 Ptr: 15 Len: 10
 Start: 6 Ptr: 15 Len: 9
 Start: 1 Ptr: 10 Len: 9
 Start: 7 Ptr: 15 Len: 8
 Start: 2 Ptr: 10 Len: 8
 Start: 8 Ptr: 15 Len: 7
 Start: 3 Ptr: 10 Len: 7
 Start: 9 Ptr: 15 Len: 6
 Start: 4 Ptr: 10 Len: 6
 Start: 10 Ptr: 15 Len: 5
 Start: 5 Ptr: 10 Len: 5
 Start: 11 Ptr: 15 Len: 4
 Start: 6 Ptr: 10 Len: 4
 Start: 1 Ptr: 5 Len: 4
 Start: 12 Ptr: 15 Len: 3
 Start: 7 Ptr: 10 Len: 3
 Start: 2 Ptr: 5 Len: 3
 Start: 13 Ptr: 15 Len: 2
 Start: 8 Ptr: 10 Len: 2
 Start: 3 Ptr: 5 Len: 2
New byte[20]: 0x62
 Start: 1 Ptr: 16 Len: 15
 Start: 2 Ptr: 16 Len: 14
 Start: 3 Ptr: 16 Len: 13
 Start: 4 Ptr: 16 Len: 12
 Start: 5 Ptr: 16 Len: 11
 Start: 6 Ptr: 16 Len: 10
 Start: 1 Ptr: 11 Len: 10
 Start: 7 Ptr: 16 Len: 9
 Start: 2 Ptr: 11 Len: 9
 Start: 8 Ptr: 16 Len: 8
 Start: 3 Ptr: 11 Len: 8
 Start: 9 Ptr: 16 Len: 7
 Start: 4 Ptr: 11 Len: 7
 Start: 10 Ptr: 16 Len: 6
 Start: 5 Ptr: 11 Len: 6
 Start: 11 Ptr: 16 Len: 5
 Start: 6 Ptr: 11 Len: 5
 Start: 1 Ptr: 6 Len: 5
 Start: 12 Ptr: 16 Len: 4
 Start: 7 Ptr: 11 Len: 4
 Start: 2 Ptr: 6 Len: 4
 Start: 13 Ptr: 16 Len: 3
 Start: 8 Ptr: 11 Len: 3
 Start: 3 Ptr: 6 Len: 3
 Start: 14 Ptr: 16 Len: 2
 Start: 9 Ptr: 11 Len: 2
 Start: 4 Ptr: 6 Len: 2
New byte[21]: 0x6c
 Start: 1 Ptr: 17 Len: 16
 Start: 2 Ptr: 17 Len: 15
 Start: 3 Ptr: 17 Len: 14
 Start: 4 Ptr: 17 Len: 13
 Start: 5 Ptr: 17 Len: 12
 Start: 6 Ptr: 17 Len: 11
 Start: 1 Ptr: 12 Len: 11
 Start: 7 Ptr: 17 Len: 10
 Start: 2 Ptr: 12 Len: 10
 Start: 8 Ptr: 17 Len: 9
 Start: 3 Ptr: 12 Len: 9
 Start: 9 Ptr: 17 Len: 8
 Start: 4 Ptr: 12 Len: 8
 Start: 10 Ptr: 17 Len: 7
 Start: 5 Ptr: 12 Len: 7
 Start: 11 Ptr: 17 Len: 6
 Start: 6 Ptr: 12 Len: 6
 Start: 1 Ptr: 7 Len: 6
 Start: 12 Ptr: 17 Len: 5
 Start: 7 Ptr: 12 Len: 5
 Start: 2 Ptr: 7 Len: 5
 Start: 13 Ptr: 17 Len: 4
 Start: 8 Ptr: 12 Len: 4
 Start: 3 Ptr: 7 Len: 4
 Start: 14 Ptr: 17 Len: 3
 Start: 9 Ptr: 12 Len: 3
 Start: 4 Ptr: 7 Len: 3
 Start: 15 Ptr: 17 Len: 2
 Start: 10 Ptr: 12 Len: 2
 Start: 5 Ptr: 7 Len: 2
New byte[22]: 0x61
 Start: 1 Ptr: 18 Len: 17
 Start: 2 Ptr: 18 Len: 16
 Start: 3 Ptr: 18 Len: 15
 Start: 4 Ptr: 18 Len: 14
 Start: 5 Ptr: 18 Len: 13
 Start: 6 Ptr: 18 Len: 12
 Start: 1 Ptr: 13 Len: 12
 Start: 7 Ptr: 18 Len: 11
 Start: 2 Ptr: 13 Len: 11
 Start: 8 Ptr: 18 Len: 10
 Start: 3 Ptr: 13 Len: 10
 Start: 9 Ptr: 18 Len: 9
 Start: 4 Ptr: 13 Len: 9
 Start: 10 Ptr: 18 Len: 8
 Start: 5 Ptr: 13 Len: 8
 Start: 11 Ptr: 18 Len: 7
 Start: 6 Ptr: 13 Len: 7
 Start: 1 Ptr: 8 Len: 7
 Start: 12 Ptr: 18 Len: 6
 Start: 7 Ptr: 13 Len: 6
 Start: 2 Ptr: 8 Len: 6
 Start: 13 Ptr: 18 Len: 5
 Start: 8 Ptr: 13 Len: 5
 Start: 3 Ptr: 8 Len: 5
 Start: 14 Ptr: 18 Len: 4
 Start: 9 Ptr: 13 Len: 4
 Start: 4 Ptr: 8 Len: 4
 Start: 15 Ptr: 18 Len: 3
 Start: 10 Ptr: 13 Len: 3
 Start: 5 Ptr: 8 Len: 3
 Start: 16 Ptr: 18 Len: 2
 Start: 11 Ptr: 13 Len: 2
 Start: 6 Ptr: 8 Len: 2
 Start: 1 Ptr: 3 Len: 2
New byte[23]: 0x68
 Start: 1 Ptr: 19 Len: 18
 Start: 2 Ptr: 19 Len: 17
 Start: 3 Ptr: 19 Len: 16
 Start: 4 Ptr: 19 Len: 15
 Start: 5 Ptr: 19 Len: 14
 Start: 6 Ptr: 19 Len: 13
 Start: 1 Ptr: 14 Len: 13
 Start: 7 Ptr: 19 Len: 12
 Start: 2 Ptr: 14 Len: 12
 Start: 8 Ptr: 19 Len: 11
 Start: 3 Ptr: 14 Len: 11
 Start: 9 Ptr: 19 Len: 10
 Start: 4 Ptr: 14 Len: 10
 Start: 10 Ptr: 19 Len: 9
 Start: 5 Ptr: 14 Len: 9
 Start: 11 Ptr: 19 Len: 8
 Start: 6 Ptr: 14 Len: 8
 Start: 1 Ptr: 9 Len: 8
 Start: 12 Ptr: 19 Len: 7
 Start: 7 Ptr: 14 Len: 7
 Start: 2 Ptr: 9 Len: 7
 Start: 13 Ptr: 19 Len: 6
 Start: 8 Ptr: 14 Len: 6
 Start: 3 Ptr: 9 Len: 6
 Start: 14 Ptr: 19 Len: 5
 Start: 9 Ptr: 14 Len: 5
 Start: 4 Ptr: 9 Len: 5
 Start: 15 Ptr: 19 Len: 4
 Start: 10 Ptr: 14 Len: 4
 Start: 5 Ptr: 9 Len: 4
 Start: 16 Ptr: 19 Len: 3
 Start: 11 Ptr: 14 Len: 3
 Start: 6 Ptr: 9 Len: 3
 Start: 1 Ptr: 4 Len: 3
 Start: 17 Ptr: 19 Len: 2
 Start: 12 Ptr: 14 Len: 2
 Start: 7 Ptr: 9 Len: 2
 Start: 2 Ptr: 4 Len: 2
New byte[24]: 0x21
New byte[25]: 0x0d
New byte[26]: 0x0a
Processing 2.494 seconds.
Source 27 bytes.
Result 27 bytes.
Ratio 0.00%

bruce_dev /> 

Now the ‘D’ in the paper refers to the distance between our CURPTR at the point when a match is started and the matched sequence position. If you persevere through the above you can verify that D=5 would be correct and that our best match ran through the length of 18.
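
Rather than reading the whole trace, the same conclusion can be checked with a standalone sketch of the active match bookkeeping. This is a simplified stand-in and not the program above: it works on a string without a circular queue, finds earlier bytes with a plain scan instead of the linked lists, and the names (ActiveMatchDemo, Match, born) are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: track active matches byte by byte and report the longest.
class ActiveMatchDemo {

    static final class Match {
        final int start; // where the earlier copy begins
        final int born;  // position of the byte that opened this match
        int ptr;         // next earlier byte to compare against
        int len;         // bytes matched so far

        Match(int start, int born) {
            this.start = start;
            this.born = born;
            this.ptr = start + 1;
            this.len = 1;
        }
    }

    // Returns {born, distance, length} of the longest match, or {-1, 0, 0}.
    static int[] longest(String text) {
        List<Match> active = new ArrayList<>();
        int[] best = { -1, 0, 0 };

        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);

            // Advance matches that still agree with the new byte, drop the rest.
            for (int n = active.size() - 1; n >= 0; n--) {
                Match m = active.get(n);
                if (text.charAt(m.ptr) == ch) {
                    m.ptr++;
                    m.len++;
                    if (m.len > best[2])
                        best = new int[] { m.born, m.born - m.start, m.len };
                } else {
                    active.remove(n);
                }
            }

            // Open a new match at every earlier occurrence of this byte.
            for (int p = i - 1; p >= 0; p--)
                if (text.charAt(p) == ch)
                    active.add(new Match(p, i));
        }
        return best;
    }

    public static void main(String[] args) {
        int[] r = longest("Blah blah blah blah blah!");
        System.out.println("born=" + r[0] + " D=" + r[1] + " L=" + r[2]);
        // born=6 D=5 L=18
    }
}
```

The match opened by the byte at position 6 against the earlier copy at position 1 is the one that survives all the way to the exclamation mark.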

So we now need the logic controlling the advancement of CURPTR and what then to output to the compressed stream.

So the strategy at this point is to process all active matches. Each match object now carries a curptr field (added to the match class since the earlier listing) holding the queue position where the sequence being matched begins. We don’t move the CURPTR when there is at least one potential match still in the works. When a data byte is received that does not extend a match we remove that match from the array. We keep track of the best of the matches that terminate, as that would be a candidate for using a reference pointer if none remain active.

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        // process uncompressed stream
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            System.out.print((char)ch);
            
            // process active match objects
            boolean bActive = false;
            match best = null;
            for (int n = SEQ.size() - 1; 0 <= n; n--) {
                match m = SEQ.get(n);
                if (!m.check(ch)) {
                    if (m.curptr == CURPTR && best == null && m.len >= 3)
                        best = m;
                    SEQ.remove(n);
                }
                else if (m.curptr == CURPTR)
                    bActive = true;
            }
            

From the above bActive will be true if there remain potentially longer sequences. In that case we will not do anything with CURPTR or output anything. We just eventually move on to process the next byte from the uncompressed input stream.

If bActive is false and best remains null then the byte at CURPTR will be output and CURPTR advanced.

Otherwise we have a match to something in the sliding window and that can replace the uncompressed sequence.

            // If there is no active sequence then we need to generate some output
            if (!bActive) {
                
                // If there's been no match then we output data as is
                if (best == null) {
                    while (CURPTR != INPTR) {
                        outfile.write(DATA[CURPTR] & 0xff);
                        CURPTR++;
                        if (CURPTR == WINDOW)
                            CURPTR = 0;
 
                        int n;
                        for (n = SEQ.size() - 1; 0 <= n; n--) {
                            match m = SEQ.get(n);
                            if (m.curptr == CURPTR)
                                break;
                        }
                        if (0 <= n)
                            break;
                    }
                }
                
                // otherwise we can substitute
                else {
                    int distance = best.curptr - best.start;
                    if (distance < 0)
                        distance += WINDOW;
                    String msg = String.format("[D=%d, L=%d]", distance, best.len);
                    outfile.write(msg);
                    
                    // flush active matches
                    int len = best.len;
                    while (len-- > 0) {
                        CURPTR++;
                        if (CURPTR == WINDOW)
                            CURPTR = 0;
 
                        // remove overlapped active sequences
                        for (int n = SEQ.size() - 1; 0 <= n; n--) {
                            match m = SEQ.get(n);
                            if (m.curptr == CURPTR)
                                SEQ.remove(n);
                        }
                    }
                    
                }
            }  

In the above, when we are outputting data uncompressed from CURPTR we continue until we encounter an active match. When we replace a sequence we flush any active matches involving the data replaced. Those matches (which potentially could have been more beneficial) are no longer valid. Note that for debugging I am outputting the [D= L=] format so I can see more clearly what is being replaced and how.

Later the best match would also be the one with the lowest distance. That saves bits and takes advantage of the Huffman Coding which we have yet to implement.

The output using the above logic and debugging looks like this for the blah blah test data.

bruce_dev /> jtest blah.dat
Blah blah blah blah blah!
Processing 0.856 seconds.
Source 27 bytes.
Result 19 bytes.
Ratio 29.63%

bruce_dev /> cat outfile.dat
Blah b[D=5, L=18]!
bruce_dev />

This agrees nicely with the example from the paper. This works with more elaborate data as well. I know that this is a Java prototype but it seems a bit too slow for larger files, even considering the platform. I get the feeling that I am creating far too many new active match objects. Perhaps there is some logic as to what is worth creating and what isn’t.

It turns out that I am likely creating 3X as many active match objects as necessary. However, the overhead of detecting the unnecessary ones slows the process way too much. It seems better to create the unnecessary match objects and optimize them out later.
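Going back to the `[D=5, L=18]` output above, it is easy to check that such a reference really expands to the original text. Here is a minimal sketch of a decoder for the debugging format only. The class and method names are made up for this illustration, and it assumes the literal data never contains text that looks like a `[D=..., L=...]` reference.

```java
// Minimal sketch (hypothetical helper, not part of the original program):
// expands the debugging format in which a back-reference is written as the
// literal text "[D=<distance>, L=<length>]".
class RefDecoderSketch {

    static String expand(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            if (encoded.startsWith("[D=", i)) {
                // Assumption: a well-formed reference follows
                int comma = encoded.indexOf(", L=", i);
                int close = encoded.indexOf(']', i);
                int distance = Integer.parseInt(encoded.substring(i + 3, comma));
                int length = Integer.parseInt(encoded.substring(comma + 4, close));

                // Copy one character at a time so that the copy may run into
                // the characters it has just produced (distance < length).
                int from = out.length() - distance;
                for (int k = 0; k < length; k++)
                    out.append(out.charAt(from + k));

                i = close + 1;
            } else {
                // literal data passes straight through
                out.append(encoded.charAt(i++));
            }
        }
        return out.toString();
    }
}
```

Calling `RefDecoderSketch.expand("Blah b[D=5, L=18]!")` returns `Blah blah blah blah blah!`. The copy has to go character by character because the 18 bytes being produced overlap the 5-byte distance they are copied from.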

I decided to take a step back and not try to generate the compressed output stream just yet. Instead I want to look at the sequence matches that I detect to see what logic I would need to achieve an optimal compression. There is this lazy match optimization to consider.

You might recall that the approach is that when I retrieve a byte from the uncompressed input stream I process all of the active sequence match objects. If the new character extends a match then it remains active. Otherwise the sequence is complete and we decide whether it is useful or not before removing it from the active match list. A useful match is simply one that is 3 or more bytes in length.

The uncompressed data byte is queued (enters the sliding window) and we create new active sequence matches for every matching byte still in the window. We use a linked list to efficiently locate those characters.
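The linked-list idea can be sketched on its own. The following is a simplified, hypothetical version: it keeps only the head and forward pointers, uses an assumed window size of 1024, and leaves out the backward links and the trimming of bytes that fall out of the window. As in the full code, pointers are stored +1 so that 0 can mark the end of a list.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the per-byte-value chains (names are illustrative).
class ByteChainSketch {
    static final int WINDOW = 1024;              // assumed window size

    private final short[] head = new short[256]; // newest position (+1) for each byte value
    private final short[] fwd = new short[WINDOW]; // next older position (+1), 0 = end of list

    // Record that byte value ch (0..255) was stored at window position pos
    void add(int ch, int pos) {
        fwd[pos] = head[ch];                     // previous newest becomes next in chain
        head[ch] = (short) (pos + 1);
    }

    // All recorded positions holding ch, newest first
    List<Integer> positionsOf(int ch) {
        List<Integer> result = new ArrayList<>();
        for (short ptr = head[ch]; ptr != 0; ptr = fwd[ptr - 1])
            result.add(ptr - 1);
        return result;
    }
}
```

After adding the bytes of `Blah blah` at positions 0 through 8, `positionsOf('l')` returns `[6, 1]`. Walking the chain touches only positions that actually hold the byte of interest, so there is no need to scan the whole window.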

If we simply list those potentially useful matches the code looks like what follows.

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        // process uncompressed stream
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            //System.out.print((char)ch);
            
            // process active Match objects
            for (int n = SEQ.size() - 1; 0 <= n; n--) {
                Match m = SEQ.get(n);
                if (!m.check(ch)) {
                    if (m.len >= 3) {
                        System.out.printf("I=%04x C=%04x P=%04x, D=%d, L=%d\n", 
                                INPTR, m.curptr, m.start, m.distance, m.len);
                    }
                            
                    SEQ.remove(n);
                }
            }
            
            // queue uncompressed DATA
            int inp = INPTR;
            DATA[INPTR] = (byte)ch;
            
            // Add byte to the head of the appropriate linked list. Note pointers are stored +1 so
            //  as to use 0 as an end of list marker. Lists are bi-directional so we can trim the 
            //  tail when data is dropped from the queue.
            short ptr = HEAD[ch];
            HEAD[ch] = (short)(INPTR + 1);
            FWD[INPTR] = ptr;
            BACK[INPTR] = 0;
            if (ptr != 0)
                BACK[ptr - 1] = (short)(INPTR + 1);
            
            // advance entry pointer
            INPTR++;
            if (INPTR == WINDOW)
                INPTR = 0;
            
            // drop data from queue when full
            if (INPTR == OUTPTR) {
                
                // trim linked list as byte is being dropped
                if (BACK[OUTPTR] == 0)
                    HEAD[DATA[OUTPTR]] = 0;
                else
                    FWD[BACK[OUTPTR] - 1] = 0;
 
                // push end of queue
                OUTPTR++;
                if (OUTPTR == WINDOW)
                    OUTPTR = 0;
            }
 
            // create new active matches for all CH in the queue (except last)
            while (ptr != 0) {
                SEQ.add(new Match(inp, ptr - 1));
                ptr = FWD[ptr - 1];
            }
        }
    }

Now we display the match results with pointers in hex so we can locate them in a hex dump. ‘I’ gives the uncompressed input position; ‘C’ shows the input position when the match was created; ‘P’ is the position in the sliding window or queue; ‘D’ is the distance (basically C minus P); and ‘L’ is the length of the match. We get the following for the blah blah data.

bruce_dev /> cat blah.dat -h
00000000  42 6c 61 68 20 62 6c 61  68 20 62 6c 61 68 20 62  Blah.bla h.blah.b
00000010  6c 61 68 20 62 6c 61 68  21 0d 0a                 lah.blah !..

bruce_dev /> jtest blah.dat
I=0018 C=0015 P=0001, D=20, L=3
I=0018 C=0015 P=0006, D=15, L=3
I=0018 C=0015 P=000b, D=10, L=3
I=0018 C=0015 P=0010, D=5, L=3
I=0018 C=0014 P=0005, D=15, L=4
I=0018 C=0014 P=000a, D=10, L=4
I=0018 C=0014 P=000f, D=5, L=4
I=0018 C=0013 P=0004, D=15, L=5
I=0018 C=0013 P=0009, D=10, L=5
I=0018 C=0013 P=000e, D=5, L=5
I=0018 C=0012 P=0003, D=15, L=6
I=0018 C=0012 P=0008, D=10, L=6
I=0018 C=0012 P=000d, D=5, L=6
I=0018 C=0011 P=0002, D=15, L=7
I=0018 C=0011 P=0007, D=10, L=7
I=0018 C=0011 P=000c, D=5, L=7
I=0018 C=0010 P=0001, D=15, L=8
I=0018 C=0010 P=0006, D=10, L=8
I=0018 C=0010 P=000b, D=5, L=8
I=0018 C=000f P=0005, D=10, L=9
I=0018 C=000f P=000a, D=5, L=9
I=0018 C=000e P=0004, D=10, L=10
I=0018 C=000e P=0009, D=5, L=10
I=0018 C=000d P=0003, D=10, L=11
I=0018 C=000d P=0008, D=5, L=11
I=0018 C=000c P=0002, D=10, L=12
I=0018 C=000c P=0007, D=5, L=12
I=0018 C=000b P=0001, D=10, L=13
I=0018 C=000b P=0006, D=5, L=13
I=0018 C=000a P=0005, D=5, L=14
I=0018 C=0009 P=0004, D=5, L=15
I=0018 C=0008 P=0003, D=5, L=16
I=0018 C=0007 P=0002, D=5, L=17
I=0018 C=0006 P=0001, D=5, L=18
Processing 1.063 seconds.
Source 27 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev />

You can see from this that we are processing a lot of matches for what we know will be just one replacement. You can see that for each new byte received (at C) we create potential matches for all matching bytes (1 or more P) each with a fixed D. We then have advanced the length L as additional characters extend the match.
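The listing can be reproduced with a self-contained sketch. This is not the code above: it holds the whole input in memory instead of a circular window, finds earlier occurrences with a plain backward scan rather than the linked lists, and the `Match` class here is only a guess at what the real one tracks.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of the "track every active match" approach.
class MatchListSketch {

    // The sequence starting at curptr repeats the earlier one starting at start
    static class Match {
        final int curptr, start, distance;
        int len = 1;                       // the byte that created it already matches

        Match(int curptr, int start) {
            this.curptr = curptr;
            this.start = start;
            this.distance = curptr - start;
        }
    }

    // Returns every completed match of 3 or more bytes in the order completed.
    // Matches still active at the end of the input are not reported.
    static List<Match> list(byte[] data) {
        List<Match> done = new ArrayList<>();
        List<Match> active = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            byte ch = data[i];

            // extend or retire each active match
            for (int n = active.size() - 1; 0 <= n; n--) {
                Match m = active.get(n);
                if (data[m.start + m.len] == ch) {
                    m.len++;
                } else {
                    if (m.len >= 3)
                        done.add(m);
                    active.remove(n);
                }
            }

            // start a new match at every earlier occurrence of this byte, newest first
            for (int p = i - 1; 0 <= p; p--)
                if (data[p] == ch)
                    active.add(new Match(i, p));
        }
        return done;
    }

    // Longest match wins; for equal lengths the lowest distance wins
    static Match best(List<Match> matches) {
        Match best = null;
        for (Match m : matches)
            if (best == null || m.len > best.len
                    || (m.len == best.len && m.distance < best.distance))
                best = m;
        return best;
    }
}
```

For the 27 bytes of the blah blah data, `list` returns 34 matches, the same count as the listing above, and `best` picks the one with C=6, P=1, D=5 and L=18.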

Of all these matches completed there will be one best match. That would be the longest match (largest L) and if there is a choice between multiple matches of the same length (L) we would take the one with lowest distance (D).

Now let me add logic to select the best match and only display that one. I am also going to display a marker (“----”) whenever we have found at least one match and reach a point in the input stream where no active matches remain in process. That point would be a good time to generate the compressed data based on the matching sequences we’ve found.

So that section of the code now looks like this.

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        boolean bFound = false;
 
        // process uncompressed stream
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            //System.out.print((char)ch);
            
            // process active Match objects
            Match best = null;
            for (int n = SEQ.size() - 1; 0 <= n; n--) {
                Match m = SEQ.get(n);
                if (!m.check(ch)) {
                    if (m.len >= 3) {
                        if (best == null)
                            best = m;
                        else if (m.len > best.len)
                            best = m;
                        else if (m.len == best.len && m.distance < best.distance)
                            best = m;
                    }
                            
                    SEQ.remove(n);
                }
            }
            if (best != null) {
                System.out.printf("I=%04x C=%04x P=%04x, D=%d, L=%d\n", 
                        INPTR, best.curptr, best.start, best.distance, best.len);
                bFound = true;
            }
            
            if (bFound && SEQ.size() == 0) {
                System.out.println("----");
                bFound = false;
            }
            
            // queue uncompressed DATA

Processing the blah blah data yields the following, which we have seen is the replacement that we are hoping for.

bruce_dev /> cat blah.dat -h
00000000  42 6c 61 68 20 62 6c 61  68 20 62 6c 61 68 20 62  Blah.bla h.blah.b
00000010  6c 61 68 20 62 6c 61 68  21 0d 0a                 lah.blah !..

bruce_dev /> jtest blah.dat 
I=0018 C=0006 P=0001, D=5, L=18
----
Processing 0.819 seconds.
Source 27 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev /> 

What about some more involved input data?

bruce_dev /> cat jniorboot.log -h
00000000  30 31 2f 30 34 2f 31 38  20 30 38 3a 31 39 3a 31  01/04/18 .08:19:1
00000010  38 2e 31 31 31 2c 20 2a  2a 20 4f 53 20 43 52 43  8.111,.* *.OS.CRC
00000020  20 64 65 74 61 69 6c 20  75 70 64 61 74 65 64 0d  .detail. updated.
00000030  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
00000040  31 38 2e 31 35 38 2c 20  2d 2d 20 4d 6f 64 65 6c  18.158,. --.Model
00000050  20 34 31 30 20 76 31 2e  36 2e 33 20 2d 20 4a 41  .410.v1. 6.3.-.JA
00000060  4e 4f 53 20 53 65 72 69  65 73 20 34 0d 0a 30 31  NOS.Seri es.4..01
00000070  2f 30 34 2f 31 38 20 30  38 3a 31 39 3a 31 38 2e  /04/18.0 8:19:18.
00000080  31 37 38 2c 20 43 6f 70  79 72 69 67 68 74 20 28  178,.Cop yright.(
00000090  63 29 20 32 30 31 32 2d  32 30 31 38 20 49 4e 54  c).2012- 2018.INT
000000A0  45 47 20 50 72 6f 63 65  73 73 20 47 72 6f 75 70  EG.Proce ss.Group
000000B0  2c 20 49 6e 63 2e 2c 20  47 69 62 73 6f 6e 69 61  ,.Inc.,. Gibsonia
000000C0  20 50 41 20 55 53 41 0d  0a 30 31 2f 30 34 2f 31  .PA.USA. .01/04/1
000000D0  38 20 30 38 3a 31 39 3a  31 38 2e 31 39 37 2c 20  8.08:19: 18.197,.
000000E0  4a 41 4e 4f 53 20 77 72  69 74 74 65 6e 20 61 6e  JANOS.wr itten.an
000000F0  64 20 64 65 76 65 6c 6f  70 65 64 20 62 79 20 42  d.develo ped.by.B
00000100  72 75 63 65 20 43 6c 6f  75 74 69 65 72 0d 0a 30  ruce.Clo utier..0
00000110  31 2f 30 34 2f 31 38 20  30 38 3a 31 39 3a 31 38  1/04/18. 08:19:18
00000120  2e 32 31 36 2c 20 53 65  72 69 61 6c 20 4e 75 6d  .216,.Se rial.Num
00000130  62 65 72 3a 20 36 31 34  30 37 30 35 30 30 0d 0a  ber:.614 070500..
00000140  30 31 2f 30 34 2f 31 38  20 30 38 3a 31 39 3a 31  01/04/18 .08:19:1
00000150  38 2e 32 33 36 2c 20 46  69 6c 65 20 53 79 73 74  8.236,.F ile.Syst
00000160  65 6d 20 6d 6f 75 6e 74  65 64 0d 0a 30 31 2f 30  em.mount ed..01/0
00000170  34 2f 31 38 20 30 38 3a  31 39 3a 31 38 2e 32 35  4/18.08: 19:18.25
00000180  37 2c 20 52 65 67 69 73  74 72 79 20 6d 6f 75 6e  7,.Regis try.moun
00000190  74 65 64 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  ted..01/ 04/18.08
000001A0  3a 31 39 3a 31 38 2e 33  30 36 2c 20 4e 65 74 77  :19:18.3 06,.Netw
000001B0  6f 72 6b 20 49 6e 69 74  69 61 6c 69 7a 65 64 0d  ork.Init ialized.
000001C0  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
000001D0  31 38 2e 33 32 36 2c 20  45 74 68 65 72 6e 65 74  18.326,. Ethernet
000001E0  20 41 64 64 72 65 73 73  3a 20 39 63 3a 38 64 3a  .Address :.9c:8d:
000001F0  31 61 3a 30 30 3a 30 37  3a 65 65 0d 0a 30 31 2f  1a:00:07 :ee..01/
00000200  30 34 2f 31 38 20 30 38  3a 31 39 3a 31 38 2e 34  04/18.08 :19:18.4
00000210  34 37 2c 20 53 65 6e 73  6f 72 20 50 6f 72 74 20  47,.Sens or.Port.
00000220  69 6e 69 74 69 61 6c 69  7a 65 64 0d 0a 30 31 2f  initiali zed..01/
00000230  30 34 2f 31 38 20 30 38  3a 31 39 3a 31 38 2e 35  04/18.08 :19:18.5
00000240  30 32 2c 20 49 2f 4f 20  73 65 72 76 69 63 65 73  02,.I/O. services
00000250  20 69 6e 69 74 69 61 6c  69 7a 65 64 0d 0a 30 31  .initial ized..01
00000260  2f 30 34 2f 31 38 20 30  38 3a 31 39 3a 31 38 2e  /04/18.0 8:19:18.
00000270  35 33 35 2c 20 46 54 50  20 73 65 72 76 65 72 20  535,.FTP .server.
00000280  65 6e 61 62 6c 65 64 20  66 6f 72 20 70 6f 72 74  enabled. for.port
00000290  20 32 31 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  .21..01/ 04/18.08
000002A0  3a 31 39 3a 31 38 2e 35  35 36 2c 20 50 72 6f 74  :19:18.5 56,.Prot
000002B0  6f 63 6f 6c 20 73 65 72  76 65 72 20 65 6e 61 62  ocol.ser ver.enab
000002C0  6c 65 64 20 66 6f 72 20  70 6f 72 74 20 39 32 30  led.for. port.920
000002D0  30 0d 0a 30 31 2f 30 34  2f 31 38 20 30 38 3a 31  0..01/04 /18.08:1
000002E0  39 3a 31 38 2e 35 38 36  2c 20 57 65 62 53 65 72  9:18.586 ,.WebSer
000002F0  76 65 72 20 65 6e 61 62  6c 65 64 20 66 6f 72 20  ver.enab led.for.
00000300  70 6f 72 74 20 38 30 0d  0a 30 31 2f 30 34 2f 31  port.80. .01/04/1
00000310  38 20 30 38 3a 31 39 3a  31 38 2e 36 30 38 2c 20  8.08:19: 18.608,.
00000320  54 65 6c 6e 65 74 20 73  65 72 76 65 72 20 65 6e  Telnet.s erver.en
00000330  61 62 6c 65 64 20 66 6f  72 20 70 6f 72 74 20 32  abled.fo r.port.2
00000340  33 0d 0a 30 31 2f 30 34  2f 31 38 20 30 38 3a 31  3..01/04 /18.08:1
00000350  39 3a 31 38 2e 36 33 32  2c 20 50 4f 52 3a 20 35  9:18.632 ,.POR:.5
00000360  39 32 36 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  926..01/ 04/18.08
00000370  3a 31 39 3a 31 38 2e 36  35 33 2c 20 43 75 6d 75  :19:18.6 53,.Cumu
00000380  6c 61 74 69 76 65 20 52  75 6e 74 69 6d 65 3a 20  lative.R untime:.
00000390  38 20 57 65 65 6b 73 20  35 20 44 61 79 73 20 31  8.Weeks. 5.Days.1
000003A0  20 48 6f 75 72 20 32 34  3a 33 32 2e 32 38 31 0d  .Hour.24 :32.281.
000003B0  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
000003C0  31 38 2e 36 37 38 2c 20  42 6f 6f 74 20 43 6f 6d  18.678,. Boot.Com
000003D0  70 6c 65 74 65 64 20 5b  32 2e 33 20 73 65 63 6f  pleted.[ 2.3.seco
000003E0  6e 64 73 5d 0d 0a                                 nds]..
bruce_dev /> jtest jniorboot.log
I=0044 C=0031 P=0000, D=49, L=19
----
I=0064 C=0061 P=001a, D=71, L=3
----
I=0081 C=006c P=002f, D=61, L=21
----
I=0085 C=0082 P=0045, D=61, L=3
----
I=009b C=0098 P=0093, D=5, L=3
I=009d C=009a P=0074, D=38, L=3
----
I=00d2 C=00cf P=009a, D=53, L=3
I=00dc C=00c7 P=006c, D=91, L=21
----
I=00e6 C=00df P=005d, D=130, L=7
----
I=00f4 C=00f1 P=0020, D=209, L=3
----
I=0118 C=0115 P=009a, D=123, L=3
I=0121 C=010d P=00c7, D=70, L=20
----
I=012a C=0125 P=0063, D=194, L=5
----
I=0149 C=0146 P=009a, D=172, L=3
I=0152 C=013e P=00c7, D=119, L=20
I=0153 C=013e P=010d, D=49, L=21
----
I=0157 C=0154 P=0123, D=49, L=3
----
I=0175 C=0172 P=009a, D=216, L=3
I=017e C=0167 P=002c, D=315, L=23
I=017f C=016a P=013e, D=44, L=21
----
I=0183 C=0180 P=00dd, D=163, L=3
----
I=019e C=019b P=009a, D=257, L=3
I=01a7 C=018b P=0162, D=41, L=28
----
I=01ac C=01a9 P=0154, D=85, L=3
----
I=01b6 C=01b3 P=00b1, D=258, L=3
I=01bb C=01b8 P=0129, D=143, L=3
----
I=01ca C=01c7 P=009a, D=301, L=3
I=01d3 C=01bd P=0168, D=85, L=22
I=01d4 C=01bd P=0191, D=44, L=23
----
I=01d8 C=01d5 P=01a9, D=44, L=3
----
I=01e8 C=01e5 P=00a7, D=318, L=3
----
I=0206 C=0203 P=009a, D=361, L=3
I=020f C=01fb P=01bf, D=60, L=20
----
I=0214 C=0211 P=0180, D=145, L=3
I=0216 C=0212 P=0124, D=238, L=4
----
I=0227 C=0224 P=0129, D=251, L=3
I=0236 C=0233 P=009a, D=409, L=3
I=023f C=0221 P=01b5, D=108, L=30
----
I=0245 C=0242 P=00b0, D=402, L=3
----
I=0250 C=024d P=00a6, D=423, L=3
I=0251 C=024e P=0068, D=486, L=3
I=0258 C=0255 P=0129, D=300, L=3
I=0267 C=0264 P=009a, D=458, L=3
I=0270 C=0252 P=01b5, D=157, L=30
I=0271 C=0250 P=021f, D=49, L=33
----
I=0276 C=0273 P=0155, D=286, L=3
----
I=027d C=0278 P=0247, D=49, L=5
----
I=0288 C=0285 P=00f9, D=396, L=3
----
I=028c C=0289 P=0218, D=113, L=3
----
I=0291 C=028d P=021c, D=113, L=4
----
I=029e C=029b P=009a, D=513, L=3
I=02a7 C=0293 P=01fb, D=152, L=20
I=02a8 C=0293 P=025c, D=55, L=21
----
I=02ac C=02a9 P=01d5, D=212, L=3
I=02af C=02ab P=00a2, D=521, L=4
----
I=02b9 C=02b4 P=0247, D=109, L=5
I=02c4 C=02c1 P=00f9, D=456, L=3
I=02c8 C=02c5 P=0218, D=173, L=3
I=02cd C=02b4 P=0278, D=60, L=25
----
I=02dc C=02d9 P=009a, D=575, L=3
I=02e5 C=02cf P=013c, D=403, L=22
I=02e6 C=02d1 P=0293, D=62, L=21
----
I=02ea C=02e7 P=02a9, D=62, L=3
----
I=02f0 C=02ed P=0126, D=455, L=3
I=02f1 C=02ee P=0249, D=165, L=3
I=02fc C=02f9 P=00f9, D=512, L=3
I=0300 C=02fd P=0218, D=229, L=3
I=0305 C=02ee P=02b6, D=56, L=23
----
I=0312 C=030f P=009a, D=629, L=3
I=031b C=0306 P=02d0, D=54, L=21
----
I=0320 C=031d P=0082, D=667, L=3
----
I=0327 C=0323 P=01dd, D=326, L=4
I=032b C=0326 P=0247, D=223, L=5
I=0336 C=0333 P=00f9, D=570, L=3
I=033a C=0337 P=0218, D=287, L=3
I=033f C=0326 P=02b4, D=114, L=25
I=0340 C=0326 P=0278, D=174, L=26
----
I=034c C=0349 P=009a, D=687, L=3
I=0355 C=0341 P=02d1, D=112, L=20
I=0356 C=0341 P=0307, D=58, L=21
----
I=035a C=0357 P=0241, D=278, L=3
I=035b C=0358 P=02aa, D=174, L=3
----
I=036e C=036b P=009a, D=721, L=3
I=0377 C=0363 P=02d1, D=146, L=20
I=0378 C=0363 P=0341, D=34, L=21
----
I=037d C=037a P=0083, D=759, L=3
----
I=038b C=0388 P=018e, D=506, L=3
----
I=0394 C=0391 P=02e9, D=168, L=3
----
I=03ba C=03b7 P=009a, D=797, L=3
I=03c3 C=03ae P=0292, D=284, L=21
I=03c4 C=03af P=0363, D=76, L=21
----
I=03c8 C=03c4 P=0081, D=835, L=4
----
I=03cf C=03cc P=0084, D=840, L=3
----
I=03d6 C=03d3 P=0190, D=579, L=3
I=03d7 C=03d4 P=0333, D=161, L=3
----
I=03dc C=03d9 P=0059, D=896, L=3
I=03de C=03db P=0326, D=181, L=3
----
Processing 77.371 seconds.
Source 998 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev /> 

Don’t let the execution times scare you. Remember we are just bread-boarding this in Java and this is running on the JNIOR after all.

So at those points where we would generate compressed output, our task is obvious when there is only one match. But what about when there is more than one?

Look closely: these matches overlap! That means that if we had acted on the first one we may have missed the opportunity to employ one that follows, which sometimes would be a serious improvement. This is the benefit of performing the lazy matches.

There are some cases where there are several usable sequences for a block. We need logic now to select one or more of those sequences so as to end up with the absolute minimum length of generated compressed data. It is not just a matter of taking the longest one, as there could be two or more that do not overlap and that together would give the optimum outcome. This requires a little examination…

For example in this case:

I=009b C=0098 P=0093, D=5, L=3
I=009d C=009a P=0074, D=38, L=3
----

If we replace 3 bytes at position 0x98 with [D=5, L=3] that will conflict with data for 0x9a and that second match would be unusable. But in this case since both matches are L=3 we really have no choice but to use only one. Here we need to select the one with the shorter distance (D=5) as that would likely benefit the most from the Huffman coding yet to come.

The following case is a little more interesting:

I=00d2 C=00cf P=009a, D=53, L=3
I=00dc C=00c7 P=006c, D=91, L=21
----

Here we see that we can replace 21 bytes at 0xc7 with [D=91, L=21] and since this sequence completely contains the other it isn’t needed at all. In this case going with the longest match happens to provide the best compression. But that cannot always be the rule. Here we need to be careful that our algorithm doesn’t just blindly go for the first replacement.

A little further into the file we have this case:

I=0149 C=0146 P=009a, D=172, L=3
I=0152 C=013e P=00c7, D=119, L=20
I=0153 C=013e P=010d, D=49, L=21
----

Here the longest is again the most beneficial as it completely overlaps the other two.

How about this one?

I=0175 C=0172 P=009a, D=216, L=3
I=017e C=0167 P=002c, D=315, L=23
I=017f C=016a P=013e, D=44, L=21
----

The 2nd and 3rd matches both eliminate the usefulness of the 1st. The 2nd replaces 23 bytes from addresses 0x167 thru 0x17d inclusive. The 3rd replaces 21 bytes from 0x16a thru 0x17e inclusive. There is not a complete overlap but since we still have to choose one over the other the longer one is the most beneficial.
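The address arithmetic used in these comparisons is simple enough to capture in a few helpers. This is only a sketch with made-up names: a match is described by its C value and its length L, and the bytes it replaces run from C thru C + L - 1 inclusive.

```java
// Sketch of the range checks used when comparing matches (illustrative names).
class SpanSketch {

    // Last address replaced by a match starting at c with length len
    static int last(int c, int len) {
        return c + len - 1;
    }

    // True if the two matches replace at least one byte in common
    static boolean overlaps(int c1, int len1, int c2, int len2) {
        return c1 <= last(c2, len2) && c2 <= last(c1, len1);
    }

    // True if the first match completely covers the second
    static boolean contains(int c1, int len1, int c2, int len2) {
        return c1 <= c2 && last(c2, len2) <= last(c1, len1);
    }
}
```

For the case above, `last(0x167, 23)` is 0x17d and `last(0x16a, 21)` is 0x17e, so `overlaps(0x167, 23, 0x16a, 21)` is true while `contains(0x167, 23, 0x16a, 21)` is false. In the earlier case `contains(0xc7, 21, 0xcf, 3)` is true, which is why the shorter match was not needed at all.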

You can see how we might benefit from some careful implementation here. We do have some flexibility to partially implement one or both of the sequences.

Look at this case that occurs further into the file:

I=0250 C=024d P=00a6, D=423, L=3
I=0251 C=024e P=0068, D=486, L=3
I=0258 C=0255 P=0129, D=300, L=3
I=0267 C=0264 P=009a, D=458, L=3
I=0270 C=0252 P=01b5, D=157, L=30
I=0271 C=0250 P=021f, D=49, L=33
----

We should elect to use the first to replace bytes from 0x24d thru 0x24f inclusive with [D=423, L=3]. We then should use the 6th to replace bytes from 0x250 thru 0x270 inclusive with [D=49, L=33]. Thus we replace a total of 36 bytes with two references.
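One way to make that choice mechanically, offered purely as an illustration and not as the logic this project ends up with, is the classic weighted interval scheduling approach. The sketch below assumes the goal is simply to maximize the number of bytes replaced by non-overlapping matches. It ignores distance, the cost of each reference, and the possibility of truncating a match. Each match is given as a hypothetical `{C, L}` pair.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Sketch: choose non-overlapping matches that replace the most bytes.
class MatchSelectSketch {

    // Each entry of matches is {c, len}; the result is in address order
    static List<int[]> choose(int[][] matches) {
        int[][] m = matches.clone();
        // sort by the address just past the end of each match
        Arrays.sort(m, (a, b) -> Integer.compare(a[0] + a[1], b[0] + b[1]));

        int n = m.length;
        int[] best = new int[n + 1];   // best[k]: most bytes covered using the first k matches
        int[] prev = new int[n + 1];   // prev[k]: how many earlier matches end before match k starts
        for (int k = 1; k <= n; k++) {
            int start = m[k - 1][0];
            int j = k - 1;
            while (j > 0 && m[j - 1][0] + m[j - 1][1] > start)
                j--;
            prev[k] = j;
            best[k] = Math.max(best[k - 1], best[j] + m[k - 1][1]);
        }

        // walk back through the table to recover which matches were used
        LinkedList<int[]> chosen = new LinkedList<>();
        int k = n;
        while (k > 0) {
            if (best[k] == best[k - 1]) {
                k--;                   // this match was not needed
            } else {
                chosen.addFirst(m[k - 1]);
                k = prev[k];
            }
        }
        return chosen;
    }
}
```

Given the six matches above as `{0x24d,3}, {0x24e,3}, {0x255,3}, {0x264,3}, {0x252,30}, {0x250,33}`, this returns the match at 0x24d followed by the match at 0x250, for the same 36 bytes worked out by hand.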

Okay, time to create some logic.

To help visualize I added a little plotting. I found this case as an example of partially applying a match.

I=0327 C=0323 P=01dd, D=326, L=4
I=032b C=0326 P=0247, D=223, L=5
I=0336 C=0333 P=00f9, D=570, L=3
I=033a C=0337 P=0218, D=287, L=3
I=033f C=0326 P=02b4, D=114, L=25
I=0340 C=0326 P=0278, D=174, L=26
0323 - 0340
|--|                          
   |---|                      
                |-|           
                    |-|       
   |-----------------------|  
   |------------------------| 

We can see from this that we can replace the entire range from 0x323 to 0x340 using two sequences.

One option is to truncate the 1st using only 3 of its 4 bytes along with the 6th sequence in its entirety. Remember that the minimum sequence is 3 bytes so we can do this. The replacement being [D=326, L=3][D=174, L=26].

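The other option, which we would have to use if the 1st were only 3 bytes, is to use the 1st match and then skip the first byte of the 6th. Skipping a leading byte moves both the source and the destination forward by one so the distance does not change. The replacement then being [D=326, L=4][D=174, L=25].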

Is one preferable over the other? I am not sure. This might come down to how the logic is implemented. This is fun though.
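The two options can be worked out with a small sketch. The class below is hypothetical and models a match by its C and P values and its length. Trimming the tail only shortens the length. Trimming the head moves both C and P forward, which means the distance (C minus P) stays the same.

```java
// Sketch of truncating a match at either end (illustrative names only).
class MatchTrimSketch {
    final int curptr, start, len;

    MatchTrimSketch(int curptr, int start, int len) {
        this.curptr = curptr;
        this.start = start;
        this.len = len;
    }

    int distance() { return curptr - start; }

    // First address past the bytes this match replaces
    int end() { return curptr + len; }

    // A reference needs at least 3 bytes to be worth using
    boolean usable() { return len >= 3; }

    // Drop n bytes from the end of the match
    MatchTrimSketch trimTail(int n) {
        return new MatchTrimSketch(curptr, start, len - n);
    }

    // Skip the first n bytes of the match
    MatchTrimSketch trimHead(int n) {
        return new MatchTrimSketch(curptr + n, start + n, len - n);
    }

    @Override
    public String toString() {
        return String.format("[D=%d, L=%d]", distance(), len);
    }
}
```

With the 1st match as `new MatchTrimSketch(0x323, 0x1dd, 4)` and the 6th as `new MatchTrimSketch(0x326, 0x278, 26)`, the first option is `first.trimTail(1)` followed by the 6th, giving `[D=326, L=3]` and `[D=174, L=26]`. The second option is the 1st followed by `sixth.trimHead(1)`, giving `[D=326, L=4]` and `[D=174, L=25]`. In both cases the first reference ends exactly where the second begins.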

Here’s the jniorboot.log run with the sequences plotted if you are interested.

bruce_dev /> cat jniorboot.log -h 
00000000  30 31 2f 30 34 2f 31 38  20 30 38 3a 31 39 3a 31  01/04/18 .08:19:1
00000010  38 2e 31 31 31 2c 20 2a  2a 20 4f 53 20 43 52 43  8.111,.* *.OS.CRC
00000020  20 64 65 74 61 69 6c 20  75 70 64 61 74 65 64 0d  .detail. updated.
00000030  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
00000040  31 38 2e 31 35 38 2c 20  2d 2d 20 4d 6f 64 65 6c  18.158,. --.Model
00000050  20 34 31 30 20 76 31 2e  36 2e 33 20 2d 20 4a 41  .410.v1. 6.3.-.JA
00000060  4e 4f 53 20 53 65 72 69  65 73 20 34 0d 0a 30 31  NOS.Seri es.4..01
00000070  2f 30 34 2f 31 38 20 30  38 3a 31 39 3a 31 38 2e  /04/18.0 8:19:18.
00000080  31 37 38 2c 20 43 6f 70  79 72 69 67 68 74 20 28  178,.Cop yright.(
00000090  63 29 20 32 30 31 32 2d  32 30 31 38 20 49 4e 54  c).2012- 2018.INT
000000A0  45 47 20 50 72 6f 63 65  73 73 20 47 72 6f 75 70  EG.Proce ss.Group
000000B0  2c 20 49 6e 63 2e 2c 20  47 69 62 73 6f 6e 69 61  ,.Inc.,. Gibsonia
000000C0  20 50 41 20 55 53 41 0d  0a 30 31 2f 30 34 2f 31  .PA.USA. .01/04/1
000000D0  38 20 30 38 3a 31 39 3a  31 38 2e 31 39 37 2c 20  8.08:19: 18.197,.
000000E0  4a 41 4e 4f 53 20 77 72  69 74 74 65 6e 20 61 6e  JANOS.wr itten.an
000000F0  64 20 64 65 76 65 6c 6f  70 65 64 20 62 79 20 42  d.develo ped.by.B
00000100  72 75 63 65 20 43 6c 6f  75 74 69 65 72 0d 0a 30  ruce.Clo utier..0
00000110  31 2f 30 34 2f 31 38 20  30 38 3a 31 39 3a 31 38  1/04/18. 08:19:18
00000120  2e 32 31 36 2c 20 53 65  72 69 61 6c 20 4e 75 6d  .216,.Se rial.Num
00000130  62 65 72 3a 20 36 31 34  30 37 30 35 30 30 0d 0a  ber:.614 070500..
00000140  30 31 2f 30 34 2f 31 38  20 30 38 3a 31 39 3a 31  01/04/18 .08:19:1
00000150  38 2e 32 33 36 2c 20 46  69 6c 65 20 53 79 73 74  8.236,.F ile.Syst
00000160  65 6d 20 6d 6f 75 6e 74  65 64 0d 0a 30 31 2f 30  em.mount ed..01/0
00000170  34 2f 31 38 20 30 38 3a  31 39 3a 31 38 2e 32 35  4/18.08: 19:18.25
00000180  37 2c 20 52 65 67 69 73  74 72 79 20 6d 6f 75 6e  7,.Regis try.moun
00000190  74 65 64 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  ted..01/ 04/18.08
000001A0  3a 31 39 3a 31 38 2e 33  30 36 2c 20 4e 65 74 77  :19:18.3 06,.Netw
000001B0  6f 72 6b 20 49 6e 69 74  69 61 6c 69 7a 65 64 0d  ork.Init ialized.
000001C0  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
000001D0  31 38 2e 33 32 36 2c 20  45 74 68 65 72 6e 65 74  18.326,. Ethernet
000001E0  20 41 64 64 72 65 73 73  3a 20 39 63 3a 38 64 3a  .Address :.9c:8d:
000001F0  31 61 3a 30 30 3a 30 37  3a 65 65 0d 0a 30 31 2f  1a:00:07 :ee..01/
00000200  30 34 2f 31 38 20 30 38  3a 31 39 3a 31 38 2e 34  04/18.08 :19:18.4
00000210  34 37 2c 20 53 65 6e 73  6f 72 20 50 6f 72 74 20  47,.Sens or.Port.
00000220  69 6e 69 74 69 61 6c 69  7a 65 64 0d 0a 30 31 2f  initiali zed..01/
00000230  30 34 2f 31 38 20 30 38  3a 31 39 3a 31 38 2e 35  04/18.08 :19:18.5
00000240  30 32 2c 20 49 2f 4f 20  73 65 72 76 69 63 65 73  02,.I/O. services
00000250  20 69 6e 69 74 69 61 6c  69 7a 65 64 0d 0a 30 31  .initial ized..01
00000260  2f 30 34 2f 31 38 20 30  38 3a 31 39 3a 31 38 2e  /04/18.0 8:19:18.
00000270  35 33 35 2c 20 46 54 50  20 73 65 72 76 65 72 20  535,.FTP .server.
00000280  65 6e 61 62 6c 65 64 20  66 6f 72 20 70 6f 72 74  enabled. for.port
00000290  20 32 31 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  .21..01/ 04/18.08
000002A0  3a 31 39 3a 31 38 2e 35  35 36 2c 20 50 72 6f 74  :19:18.5 56,.Prot
000002B0  6f 63 6f 6c 20 73 65 72  76 65 72 20 65 6e 61 62  ocol.ser ver.enab
000002C0  6c 65 64 20 66 6f 72 20  70 6f 72 74 20 39 32 30  led.for. port.920
000002D0  30 0d 0a 30 31 2f 30 34  2f 31 38 20 30 38 3a 31  0..01/04 /18.08:1
000002E0  39 3a 31 38 2e 35 38 36  2c 20 57 65 62 53 65 72  9:18.586 ,.WebSer
000002F0  76 65 72 20 65 6e 61 62  6c 65 64 20 66 6f 72 20  ver.enab led.for.
00000300  70 6f 72 74 20 38 30 0d  0a 30 31 2f 30 34 2f 31  port.80. .01/04/1
00000310  38 20 30 38 3a 31 39 3a  31 38 2e 36 30 38 2c 20  8.08:19: 18.608,.
00000320  54 65 6c 6e 65 74 20 73  65 72 76 65 72 20 65 6e  Telnet.s erver.en
00000330  61 62 6c 65 64 20 66 6f  72 20 70 6f 72 74 20 32  abled.fo r.port.2
00000340  33 0d 0a 30 31 2f 30 34  2f 31 38 20 30 38 3a 31  3..01/04 /18.08:1
00000350  39 3a 31 38 2e 36 33 32  2c 20 50 4f 52 3a 20 35  9:18.632 ,.POR:.5
00000360  39 32 36 0d 0a 30 31 2f  30 34 2f 31 38 20 30 38  926..01/ 04/18.08
00000370  3a 31 39 3a 31 38 2e 36  35 33 2c 20 43 75 6d 75  :19:18.6 53,.Cumu
00000380  6c 61 74 69 76 65 20 52  75 6e 74 69 6d 65 3a 20  lative.R untime:.
00000390  38 20 57 65 65 6b 73 20  35 20 44 61 79 73 20 31  8.Weeks. 5.Days.1
000003A0  20 48 6f 75 72 20 32 34  3a 33 32 2e 32 38 31 0d  .Hour.24 :32.281.
000003B0  0a 30 31 2f 30 34 2f 31  38 20 30 38 3a 31 39 3a  .01/04/1 8.08:19:
000003C0  31 38 2e 36 37 38 2c 20  42 6f 6f 74 20 43 6f 6d  18.678,. Boot.Com
000003D0  70 6c 65 74 65 64 20 5b  32 2e 33 20 73 65 63 6f  pleted.[ 2.3.seco
000003E0  6e 64 73 5d 0d 0a                                 nds]..

bruce_dev />
bruce_dev /> jtest jniorboot.log
I=0044 C=0031 P=0000, D=49, L=19
0031 - 0044
|-----------------| 

I=0064 C=0061 P=001a, D=71, L=3
0061 - 0064
|-| 

I=0081 C=006c P=002f, D=61, L=21
006c - 0081
|-------------------| 

I=0086 C=0082 P=0045, D=61, L=3
0082 - 0085
|-| 

I=009d C=0098 P=0093, D=5, L=3
I=009d C=009a P=0074, D=38, L=3
0098 - 009d
|-|   
  |-| 

I=00dd C=00cf P=009a, D=53, L=3
I=00dd C=00c7 P=006c, D=91, L=21
00c7 - 00dc
        |-|           
|-------------------| 

I=00e6 C=00df P=005d, D=130, L=7
00df - 00e6
|-----| 

I=00f4 C=00f1 P=0020, D=209, L=3
00f1 - 00f4
|-| 

I=0121 C=0115 P=009a, D=123, L=3
I=0121 C=010d P=00c7, D=70, L=20
010d - 0121
        |-|          
|------------------| 

I=012b C=0125 P=0063, D=194, L=5
0125 - 012a
|---| 

I=0153 C=0146 P=009a, D=172, L=3
I=0153 C=013e P=00c7, D=119, L=20
I=0153 C=013e P=010d, D=49, L=21
013e - 0153
        |-|           
|------------------|  
|-------------------| 

I=0157 C=0154 P=0123, D=49, L=3
0154 - 0157
|-| 

I=017f C=0172 P=009a, D=216, L=3
I=017f C=0167 P=002c, D=315, L=23
I=017f C=016a P=013e, D=44, L=21
0167 - 017f
           |-|           
|---------------------|  
   |-------------------| 

I=0183 C=0180 P=00dd, D=163, L=3
0180 - 0183
|-| 

I=01a8 C=019b P=009a, D=257, L=3
I=01a8 C=018b P=0162, D=41, L=28
018b - 01a7
                |-|          
|--------------------------| 

I=01ad C=01a9 P=0154, D=85, L=3
01a9 - 01ac
|-| 

I=01bb C=01b3 P=00b1, D=258, L=3
I=01bb C=01b8 P=0129, D=143, L=3
01b3 - 01bb
|-|      
     |-| 

I=01d4 C=01c7 P=009a, D=301, L=3
I=01d4 C=01bd P=0168, D=85, L=22
I=01d4 C=01bd P=0191, D=44, L=23
01bd - 01d4
          |-|           
|--------------------|  
|---------------------| 

I=01d8 C=01d5 P=01a9, D=44, L=3
01d5 - 01d8
|-| 

I=01e8 C=01e5 P=00a7, D=318, L=3
01e5 - 01e8
|-| 

I=020f C=0203 P=009a, D=361, L=3
I=020f C=01fb P=01bf, D=60, L=20
01fb - 020f
        |-|          
|------------------| 

I=0217 C=0211 P=0180, D=145, L=3
I=0217 C=0212 P=0124, D=238, L=4
0211 - 0216
|-|   
 |--| 

I=023f C=0224 P=0129, D=251, L=3
I=023f C=0233 P=009a, D=409, L=3
I=023f C=0221 P=01b5, D=108, L=30
0221 - 023f
   |-|                         
                  |-|          
|----------------------------| 

I=0245 C=0242 P=00b0, D=402, L=3
0242 - 0245
|-| 

I=0271 C=024d P=00a6, D=423, L=3
I=0271 C=024e P=0068, D=486, L=3
I=0271 C=0255 P=0129, D=300, L=3
I=0271 C=0264 P=009a, D=458, L=3
I=0271 C=0252 P=01b5, D=157, L=30
I=0271 C=0250 P=021f, D=49, L=33
024d - 0271
|-|                                  
 |-|                                 
        |-|                          
                       |-|           
     |----------------------------|  
   |-------------------------------| 

I=0276 C=0273 P=0155, D=286, L=3
0273 - 0276
|-| 

I=0280 C=0278 P=0247, D=49, L=5
0278 - 027d
|---| 

I=0288 C=0285 P=00f9, D=396, L=3
0285 - 0288
|-| 

I=028c C=0289 P=0218, D=113, L=3
0289 - 028c
|-| 

I=0293 C=028d P=021c, D=113, L=4
028d - 0291
|--| 

I=02a8 C=029b P=009a, D=513, L=3
I=02a8 C=0293 P=01fb, D=152, L=20
I=02a8 C=0293 P=025c, D=55, L=21
0293 - 02a8
        |-|           
|------------------|  
|-------------------| 

I=02af C=02a9 P=01d5, D=212, L=3
I=02af C=02ab P=00a2, D=521, L=4
02a9 - 02af
|-|    
  |--| 

I=02ce C=02b4 P=0247, D=109, L=5
I=02ce C=02c1 P=00f9, D=456, L=3
I=02ce C=02c5 P=0218, D=173, L=3
I=02ce C=02b4 P=0278, D=60, L=25
02b4 - 02cd
|---|                     
             |-|          
                 |-|      
|-----------------------| 

I=02e7 C=02d9 P=009a, D=575, L=3
I=02e7 C=02cf P=013c, D=403, L=22
I=02e7 C=02d1 P=0293, D=62, L=21
02cf - 02e6
          |-|           
|--------------------|  
  |-------------------| 

I=02ea C=02e7 P=02a9, D=62, L=3
02e7 - 02ea
|-| 

I=0305 C=02ed P=0126, D=455, L=3
I=0305 C=02ee P=0249, D=165, L=3
I=0305 C=02f9 P=00f9, D=512, L=3
I=0305 C=02fd P=0218, D=229, L=3
I=0305 C=02ee P=02b6, D=56, L=23
02ed - 0305
|-|                      
 |-|                     
            |-|          
                |-|      
 |---------------------| 

I=031c C=030f P=009a, D=629, L=3
I=031c C=0306 P=02d0, D=54, L=21
0306 - 031b
         |-|          
|-------------------| 

I=0320 C=031d P=0082, D=667, L=3
031d - 0320
|-| 

I=0341 C=0323 P=01dd, D=326, L=4
I=0341 C=0326 P=0247, D=223, L=5
I=0341 C=0333 P=00f9, D=570, L=3
I=0341 C=0337 P=0218, D=287, L=3
I=0341 C=0326 P=02b4, D=114, L=25
I=0341 C=0326 P=0278, D=174, L=26
0323 - 0340
|--|                          
   |---|                      
                |-|           
                    |-|       
   |-----------------------|  
   |------------------------| 

I=0356 C=0349 P=009a, D=687, L=3
I=0356 C=0341 P=02d1, D=112, L=20
I=0356 C=0341 P=0307, D=58, L=21
0341 - 0356
        |-|           
|------------------|  
|-------------------| 

I=035b C=0357 P=0241, D=278, L=3
I=035b C=0358 P=02aa, D=174, L=3
0357 - 035b
|-|  
 |-| 

I=0378 C=036b P=009a, D=721, L=3
I=0378 C=0363 P=02d1, D=146, L=20
I=0378 C=0363 P=0341, D=34, L=21
0363 - 0378
        |-|           
|------------------|  
|-------------------| 

I=037d C=037a P=0083, D=759, L=3
037a - 037d
|-| 

I=038c C=0388 P=018e, D=506, L=3
0388 - 038b
|-| 

I=0395 C=0391 P=02e9, D=168, L=3
0391 - 0394
|-| 

I=03c4 C=03b7 P=009a, D=797, L=3
I=03c4 C=03ae P=0292, D=284, L=21
I=03c4 C=03af P=0363, D=76, L=21
03ae - 03c4
         |-|           
|-------------------|  
 |-------------------| 

I=03c9 C=03c4 P=0081, D=835, L=4
03c4 - 03c8
|--| 

I=03cf C=03cc P=0084, D=840, L=3
03cc - 03cf
|-| 

I=03d7 C=03d3 P=0190, D=579, L=3
I=03d7 C=03d4 P=0333, D=161, L=3
03d3 - 03d7
|-|  
 |-| 

I=03de C=03d9 P=0059, D=896, L=3
I=03de C=03db P=0326, D=181, L=3
03d9 - 03de
|-|   
  |-| 

Processing 80.766 seconds.
Source 998 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev />

I think the first step is to filter out sequences that are completely covered by another. That seems to happen a lot.

I have filtered matched sequences that are eclipsed by another in the block. Since we know what to do when only one sequence remains (just use it) I filtered those from the output. So we are left with overlapping situations that remain to be studied.

bruce_dev /> jtest jniorboot.log
I=009d C=0098 P=0093, D=5, L=3
I=009d C=009a P=0074, D=38, L=3
0098 - 009d
|-|   
  |-| 

I=017f C=0167 P=002c, D=315, L=23
I=017f C=016a P=013e, D=44, L=21
0167 - 017f
|---------------------|  
   |-------------------| 

I=01bb C=01b3 P=00b1, D=258, L=3
I=01bb C=01b8 P=0129, D=143, L=3
01b3 - 01bb
|-|      
     |-| 

I=0217 C=0211 P=0180, D=145, L=3
I=0217 C=0212 P=0124, D=238, L=4
0211 - 0216
|-|   
 |--| 

I=0271 C=024d P=00a6, D=423, L=3
I=0271 C=024e P=0068, D=486, L=3
I=0271 C=0250 P=021f, D=49, L=33
024d - 0271
|-|                                  
 |-|                                 
   |-------------------------------| 

I=02af C=02a9 P=01d5, D=212, L=3
I=02af C=02ab P=00a2, D=521, L=4
02a9 - 02af
|-|    
  |--| 

I=02e7 C=02cf P=013c, D=403, L=22
I=02e7 C=02d1 P=0293, D=62, L=21
02cf - 02e6
|--------------------|  
  |-------------------| 

I=0305 C=02ed P=0126, D=455, L=3
I=0305 C=02ee P=02b6, D=56, L=23
02ed - 0305
|-|                      
 |---------------------| 

I=0341 C=0323 P=01dd, D=326, L=4
I=0341 C=0326 P=0278, D=174, L=26
0323 - 0340
|--|                          
   |------------------------| 

I=035b C=0357 P=0241, D=278, L=3
I=035b C=0358 P=02aa, D=174, L=3
0357 - 035b
|-|  
 |-| 

I=03c4 C=03ae P=0292, D=284, L=21
I=03c4 C=03af P=0363, D=76, L=21
03ae - 03c4
|-------------------|  
 |-------------------| 

I=03d7 C=03d3 P=0190, D=579, L=3
I=03d7 C=03d4 P=0333, D=161, L=3
03d3 - 03d7
|-|  
 |-| 

I=03de C=03d9 P=0059, D=896, L=3
I=03de C=03db P=0326, D=181, L=3
03d9 - 03de
|-|   
  |-| 

Processing 78.368 seconds.
Source 998 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev />

There is one block where the remaining matches are mutually exclusive. It is obvious what to do in that case, but I would still need to identify it. Maybe the goal is to reduce the sequence set to a single match and, where that is not possible, to reduce it to a set of mutually exclusive matches.

Added a step to force mutual exclusivity. The logic now appears as follows:

    // Our LZ77 compression engine
    static void do_compress(BufferedWriter outfile, BufferedReader infile) throws Throwable {
        
        boolean bFound = false;
 
        // process uncompressed stream
        while (infile.ready()) {
            
            // obtain next byte
            int ch = infile.read();
            //System.out.print((char)ch);
            
            // process active Match objects
            Match best = null;
            for (int n = SEQ.size() - 1; 0 <= n; n--) {
                Match m = SEQ.get(n);
                if (!m.check(ch)) {
                    if (m.len >= 3) {
                        if (best == null)
                            best = m;
                        else if (m.len > best.len)
                            best = m;
                        else if (m.len == best.len && m.distance < best.distance)
                            best = m;
                    }
                            
                    SEQ.remove(n);
                }
            }
            if (best != null) {
                REPL.add(best);
                bFound = true;
            }
            
            if (bFound && SEQ.size() == 0) {
                
                // filter out sequences eclipsed by another
                for (int n = REPL.size() - 1; 0 <= n; n--) {
                    Match mn = REPL.get(n);
                    
                    int k;
                    for (k = REPL.size() - 1; 0 <= k; k--) {
                        if (k == n)
                            continue;
                        Match mk = REPL.get(k);
                        if (mn.curptr >= mk.curptr &&
                                mn.curptr + mn.len <= mk.curptr + mk.len)
                            break;
                    }
                    if (0 <= k)
                        REPL.remove(n);
                }
                
                // Force mutual exclusivity. Note that REPL at this point holds the matched
                //  sequences in order of increasing CURPTR.
                for (int n = 0; n < REPL.size() - 1; n++) {
                    Match n1 = REPL.get(n);
                    Match n2 = REPL.get(n + 1);
                    if (n2.curptr < n1.curptr + n1.len) {
                        int adj = n1.curptr + n1.len - n2.curptr;
                        if (n2.len - adj < 3)
                            REPL.remove(n2);
                        else {
                            n2.curptr += adj;
                            if (n2.curptr >= WINDOW)
                                n2.curptr -= WINDOW;
                            n2.ptr += adj;
                            if (n2.ptr >= WINDOW)
                                n2.ptr -= WINDOW;
                            n2.distance -= adj;
                            n2.len -= adj;
                        }
                    }
                }
                
                // $ - temporary only display when there are still choices
                if (REPL.size() > 1) {
 
                    // determine the overall affected range
                    int start = 0;
                    int end = 0;
                    for (int n = 0; n < REPL.size(); n++) {
                        Match m = REPL.get(n);
                        System.out.printf("I=%04x C=%04x P=%04x, D=%d, L=%d\n", 
                                INPTR, m.curptr, m.start, m.distance, m.len);
 
                        if (n == 0 || m.curptr < start)
                            start = m.curptr;
                        if (n == 0 || m.curptr + m.len > end)
                            end = m.curptr + m.len;
                    }
 
                    // plot
                    System.out.printf("%04x - %04x\n", start, end);
 
                    for (int n = 0; n < REPL.size(); n++) {
                        Match m = REPL.get(n);
                        for (int i = start; i <= end; i++) {
                            if (i < m.curptr || i >= m.curptr + m.len)
                                System.out.print(" ");
                            else if (i == m.curptr || i == m.curptr + m.len - 1)
                                System.out.print("|");
                            else
                                System.out.print("-");
                        }
                        System.out.println("");
                    }
                    System.out.println("");
                }
 
                REPL.clear();
                bFound = false;
            }
            
            // queue uncompressed DATA

As matches are located we select the best (lines 14 thru 29) and add them to a list (lines 30 thru 33). Then later when we reach that point where there are no more active matches and we can generate compressed output (line 35) we filter those eclipsed sequences (lines 38 thru 52). Next we force the remaining sequences to be mutually exclusive (lines 56 thru 74).

Since these sequences appear in the REPL list in increasing CURPTR order (at least they appear to be) we take pairs of sequences and shift the starting point of the next one so it no longer overlaps. If this shrinks the match to fewer than 3 bytes it is removed.

After that there is code to display and plot the remaining sequences if there is more than one. Here we see that in every case we have created a mutually exclusive set.

bruce_dev /> jtest jniorboot.log
I=01bb C=01b3 P=00b1, D=258, L=3
I=01bb C=01b8 P=0129, D=143, L=3
01b3 - 01bb
|-|      
     |-| 

I=0271 C=024d P=00a6, D=423, L=3
I=0271 C=0250 P=021f, D=49, L=33
024d - 0271
|-|                                  
   |-------------------------------| 

I=02af C=02a9 P=01d5, D=212, L=3
I=02af C=02ac P=00a2, D=520, L=3
02a9 - 02af
|-|    
   |-| 

I=0305 C=02ed P=0126, D=455, L=3
I=0305 C=02f0 P=02b6, D=54, L=21
02ed - 0305
|-|                      
   |-------------------| 

I=0341 C=0323 P=01dd, D=326, L=4
I=0341 C=0327 P=0278, D=173, L=25
0323 - 0340
|--|                          
    |-----------------------| 

Processing 77.415 seconds.
Source 998 bytes.
Result 0 bytes.
Ratio 100.00%

bruce_dev /> 

So it would appear now that I have what I need to generate the compressed output stream.

Generating the compressed output using the [D=,L=] format just to make things visible actually enlarges the file of course. But here it is (some line breaks at the right margin manually inserted).

bruce_dev /> cat outfile.dat
01/04/18 08:19:18.111, ** OS CRC detail updated
[D=49,L=19]58, -- Model 410 v1.6.3 - JAN[D=71,L=3]Series 4[D=61,L=21]7[D=61,L=3]Copyright (c) 2012-[D=5,L=3]8
 INTEG Process Group, Inc., Gibsonia PA USA[D=91,L=21]97,[D=130,L=7]written and[D=209,L=3]veloped by Bruce Cl
outier[D=70,L=20]216,[D=194,L=5]al Number: 614070500[D=49,L=21]3[D=49,L=3]File System moun[D=315,L=23]25[D=16
3,L=3]Registry[D=41,L=28]30[D=85,L=3]Network[D=258,L=3]it[D=143,L=3]iz[D=44,L=23]2[D=44,L=3]Ethernet Addr[D=3
18,L=3]: 9c:8d:1a:00:07:ee[D=60,L=20]44[D=145,L=3]Sensor Port i[D=108,L=30]502[D=402,L=3]/O servi[D=423,L=3][
D=49,L=33]35[D=286,L=3]TP[D=49,L=5]er enabl[D=396,L=3]f[D=113,L=3]p[D=113,L=4]21[D=55,L=21]5[D=212,L=3][D=520
,L=3]tocol[D=60,L=25]92[D=403,L=22]58[D=62,L=3]Web[D=455,L=3][D=54,L=21]8[D=54,L=21]60[D=667,L=3]Tel[D=326,L=
4][D=173,L=25]3[D=58,L=21]3[D=278,L=3]POR: 5926[D=34,L=21]53[D=759,L=3]umulative R[D=506,L=3]ime: 8[D=168,L=3
]eks 5 Days 1 Hour 24:32.28[D=284,L=21]6[D=835,L=4]Boot[D=840,L=3]mple[D=579,L=3] [2[D=896,L=3]seconds]
bruce_dev />

So it appears to be time to convert this into a bit stream with the proper length and distance codes in preparation for Huffman coding.

With a couple of placeholder bytes stuck in for the length and distance codes, this is representative of the pre-Huffman compression ratio for this file.

bruce_dev /> jtest jniorboot.log
Processing 79.124 seconds.
Source 998 bytes.
Result 533 bytes.
Ratio 46.59%

bruce_dev />

Oh it’ll be fast in C and in the JANOS kernel. Okay… Huffman.

So before getting too deep into generating the Huffman coding with dynamic tables I figured that it would make sense to write a quick decompressor for my interim LZ77 compression as a check. I modified the compressor to output length and distance codes as shorts using a 0xff prefix which I then escaped. This stream I will later be able to digest in performing the Huffman coding. The decompressor would take outfile.dat and generate the decompressed newfile.dat.
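
For reference, a decoder for an interim stream like this can be very small. The following is only a sketch and not the JTest2 program itself. It assumes each pointer is stored as 0xff followed by a 2-byte length and a 2-byte distance (high byte first), and it assumes that a literal 0xff is escaped by doubling it. The class name InterimDecoder and that escape convention are assumptions made for illustration.

```java
import java.util.Arrays;

// Sketch only. Assumed stream layout (not confirmed by the post):
//   literal byte        -> the byte itself
//   literal 0xff        -> 0xff 0xff
//   pointer             -> 0xff, length (2 bytes, high first), distance (2 bytes, high first)
final class InterimDecoder {

    static byte[] decode(byte[] in) {
        byte[] out = new byte[Math.max(16, in.length * 2)];
        int n = 0;   // bytes produced so far
        int i = 0;   // read position in the compressed stream

        while (i < in.length) {
            int b = in[i++] & 0xff;
            int length = 1;     // a literal produces one byte
            int distance = 0;   // 0 marks a literal

            if (b == 0xff) {
                int hi = in[i++] & 0xff;
                if (hi != 0xff) {
                    // a real pointer: finish reading length, then distance
                    length = (hi << 8) | (in[i++] & 0xff);
                    distance = ((in[i] & 0xff) << 8) | (in[i + 1] & 0xff);
                    i += 2;
                }
            }

            // grow the output buffer when needed
            if (n + length > out.length)
                out = Arrays.copyOf(out, Math.max(out.length * 2, n + length));

            if (distance == 0) {
                out[n++] = (byte) b;
            } else {
                // copy one byte at a time so overlapping copies work
                for (int k = 0; k < length; k++, n++)
                    out[n] = out[n - distance];
            }
        }
        return Arrays.copyOf(out, n);
    }
}
```

Because the copy runs one byte at a time, a pointer whose length exceeds its distance simply repeats the most recent bytes. A malformed stream (a distance reaching back before the start of the output) would throw, which is acceptable for a check like this.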

Well after compressing and then decompressing the content of newfile.dat resembled jniorboot.log very closely but there were a few variances. First, I found the glitch in the step that eliminates any overlap in matched sequences (shouldn’t have modified distance when also modifying the CURPTR). Then I had to address the boundary conditions at the end of the file in order to properly process the entire file (I ended up a couple of bytes short initially). With that we achieved success.
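
The overlap fix can be sketched as follows. When the first adj bytes of a match are dropped, both the replace position and the source position advance by the same amount, so the distance between them does not change. Match here is a hypothetical stand-in for the breadboard's class and the WINDOW value is assumed.

```java
final class TrimSketch {
    static final int WINDOW = 32768;   // assumed sliding window size

    // Hypothetical stand-in for the breadboard's Match object.
    static final class Match {
        int curptr;    // where the replacement starts in the queue
        int ptr;       // where the earlier copy of the data starts
        int distance;  // how far back the earlier copy is
        int len;       // number of matching bytes

        Match(int curptr, int ptr, int len) {
            this.curptr = curptr;
            this.ptr = ptr;
            this.len = len;
            this.distance = (curptr - ptr + WINDOW) % WINDOW;
        }
    }

    // Drop the first adj bytes of a match. Returns false, leaving the match
    // untouched, if fewer than 3 bytes would remain.
    static boolean trimFront(Match m, int adj) {
        if (m.len - adj < 3)
            return false;
        m.curptr = (m.curptr + adj) % WINDOW;
        m.ptr = (m.ptr + adj) % WINDOW;
        m.len -= adj;
        // m.distance is deliberately left alone: both ends moved together
        return true;
    }
}
```

Taking the 02a9 block from the earlier output as an example, the second match (C=02ab, D=521, L=4) overlaps the first by one byte. Trimmed this way it becomes C=02ac, L=3 with D still 521, where the earlier run printed D=520.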

You can see here how we can use MANIFEST to verify file size and content. Note that the MD5 hashes are identical.

bruce_dev /> jtest jniorboot.log
Processing 69.323 seconds.
Source 950 bytes.
Result 671 bytes.
Ratio 1.42:1

bruce_dev /> jtest2

bruce_dev /> manifest jniorboot.log
JNIOR Manifest      Fri Jan 05 11:02:57 EST 2018
  Size                  MD5                  File Specification
 950      dc425a0283e22944b463eeab9e625adb  [Modified] /jniorboot.log
End of Manifest (1 files listed)

bruce_dev /> manifest newfile.dat  
JNIOR Manifest      Fri Jan 05 11:02:59 EST 2018
  Size                  MD5                  File Specification
 950      dc425a0283e22944b463eeab9e625adb  [New] /newfile.dat
End of Manifest (1 files listed)

bruce_dev />

Here is the original content and the resulting compressed format that I have used in bread-boarding this.

bruce_dev /> cat jniorboot.log
01/05/18 07:39:52.913, -- Model 410 v1.6.3 - JANOS Series 4
01/05/18 07:39:52.960, Copyright (c) 2012-2018 INTEG Process Group, Inc., Gibsonia PA USA
01/05/18 07:39:52.980, JANOS written and developed by Bruce Cloutier
01/05/18 07:39:52.999, Serial Number: 614070500
01/05/18 07:39:53.018, File System mounted
01/05/18 07:39:53.039, Registry mounted
01/05/18 07:39:53.089, Network Initialized
01/05/18 07:39:53.109, Ethernet Address: 9c:8d:1a:00:07:ee
01/05/18 07:39:53.229, Sensor Port initialized
01/05/18 07:39:53.284, I/O services initialized
01/05/18 07:39:53.327, FTP server enabled for port 21
01/05/18 07:39:53.347, Protocol server enabled for port 9200
01/05/18 07:39:53.368, WebServer enabled for port 80
01/05/18 07:39:53.390, Telnet server enabled for port 23
01/05/18 07:39:53.414, POR: 5927
01/05/18 07:39:53.439, Cumulative Runtime: 8 Weeks 5 Days 9 Hours 32:22.102
01/05/18 07:39:53.460, Boot Completed [2.3 seconds]

bruce_dev />
bruce_dev /> cat outfile.dat -h
00000000  30 31 2f 30 35 2f 31 38  20 30 37 3a 33 39 3a 35  01/05/18 .07:39:5
00000010  32 2e 39 31 33 2c 20 2d  2d 20 4d 6f 64 65 6c 20  2.913,.- -.Model.
00000020  34 31 30 20 76 31 2e 36  2e 33 20 2d 20 4a 41 4e  410.v1.6 .3.-.JAN
00000030  4f 53 20 53 65 72 69 65  73 20 34 0d 0a ff 00 13  OS.Serie s.4.....
00000040  00 3d 36 30 2c 20 43 6f  70 79 72 69 67 68 74 20  .=60,.Co pyright.
00000050  28 63 29 20 32 30 31 32  2d ff 00 03 00 05 38 20  (c).2012 -.....8.
00000060  49 4e 54 45 47 20 50 72  6f 63 65 73 73 20 47 72  INTEG.Pr ocess.Gr
00000070  6f 75 70 2c 20 49 6e 63  2e 2c 20 47 69 62 73 6f  oup,.Inc .,.Gibso
00000080  6e 69 61 20 50 41 20 55  53 41 ff 00 15 00 5b 38  nia.PA.U SA....[8
00000090  ff 00 03 00 5b ff 00 06  00 82 77 72 69 74 74 65  ....[... ..writte
000000A0  6e 20 61 6e 64 20 64 65  76 65 6c 6f 70 65 64 20  n.and.de veloped.
000000B0  62 79 20 42 72 75 63 65  20 43 6c 6f 75 74 69 65  by.Bruce .Cloutie
000000C0  72 ff 00 15 00 46 39 39  2c ff 00 05 00 c2 61 6c  r....F99 ,.....al
000000D0  20 4e 75 6d 62 65 72 3a  20 36 31 34 30 37 30 35  .Number: .6140705
000000E0  30 30 ff 00 12 00 31 33  2e ff 00 03 00 b9 2c 20  00....13 ......,.
000000F0  46 69 6c 65 20 53 79 73  74 65 6d 20 6d 6f 75 6e  File.Sys tem.moun
00000100  74 65 64 ff 00 15 00 2c  33 ff 00 03 00 5d 52 65  ted...., 3....]Re
00000110  67 69 73 74 72 79 ff 00  1d 00 29 38 ff 00 03 00  gistry.. ..)8....
00000120  29 4e 65 74 77 6f 72 6b  ff 00 03 01 02 69 74 ff  )Network .....it.
00000130  00 03 00 8f 69 7a ff 00  16 00 2c 31 30 ff 00 03  ....iz.. ..,10...
00000140  00 2c 45 74 68 65 72 6e  65 74 20 41 64 64 72 ff  .,Ethern et.Addr.
00000150  00 03 01 3e 3a 20 39 63  3a 38 64 3a 31 61 3a 30  ...>:.9c :8d:1a:0
00000160  30 3a ff 00 03 00 2c 65  65 ff 00 14 00 3c 32 32  0:....,e e....<22
00000170  ff 00 05 00 ee 6e 73 6f  72 20 50 6f 72 74 20 69  .....nso r.Port.i
00000180  ff 00 1e 00 6c 32 38 34  ff 00 03 01 92 2f 4f 20  ....l284 ...../O.
00000190  73 65 72 76 69 ff 00 03  01 a7 ff 00 20 00 31 33  servi... ......13
000001A0  32 37 ff 00 03 01 1e 54  50 ff 00 05 00 31 65 72  27.....T P....1er
000001B0  20 65 6e 61 62 6c ff 00  03 01 8c 66 ff 00 03 00  .enabl.. ...f....
000001C0  71 70 ff 00 04 00 71 32  31 ff 00 15 00 37 34 ff  qp....q2 1....74.
000001D0  00 03 00 37 ff 00 03 02  09 74 6f 63 6f 6c ff 00  ...7.... .tocol..
000001E0  19 00 3c 39 32 ff 00 16  01 93 33 36 ff 00 03 01  ..<92... ..36....
000001F0  93 57 65 62 ff 00 03 01  c7 ff 00 15 00 38 38 ff  .Web.... .....88.
00000200  00 16 00 36 39 ff 00 03  02 40 54 65 6c ff 00 04  ...69... .@Tel...
00000210  01 46 ff 00 19 00 ae 33  ff 00 14 00 3a 34 31 ff  .F.....3 ....:41.
00000220  00 03 01 16 50 4f 52 3a  20 35 39 32 37 ff 00 15  ....POR: .5927...
00000230  00 22 ff 00 04 01 f9 43  75 6d 75 6c 61 74 69 76  .".....C umulativ
00000240  65 20 52 ff 00 03 01 fa  69 6d 65 3a 20 38 ff 00  e.R..... ime:.8..
00000250  03 00 a8 65 6b 73 20 35  20 44 61 79 73 20 39 20  ...eks.5 .Days.9.
00000260  48 6f 75 72 73 20 33 32  3a 32 32 ff 00 03 01 da  Hours.32 :22.....
00000270  32 ff 00 15 00 4d ff 00  04 03 44 42 6f 6f 74 ff  2....M.. ..DBoot.
00000280  00 03 03 49 6d 70 6c 65  ff 00 03 02 44 20 5b 32  ...Imple ....D.[2
00000290  ff 00 03 03 81 73 65 63  6f 6e 64 73 5d 0d 0a     .....sec onds]..

bruce_dev />
Attachments: JTest2.java (3.04 KiB), JTest.java (9.42 KiB)

So at this point I feel like this algorithm generates the optimum LZ77 compression for the data. This should even take lazy matches into account, however the industry defines them. When I cast it into C I will work on optimizing the execution.

The only question might be in optimizing distance codes to minimize extra bits. I didn’t consider that in pruning the matched sequence list for a block. When those situations occur there might be a bit or two to save if I were to retain the closer match. I am not going to worry about that. My feeling is that we won’t save anything noticeable, if anything at all.

Now to handle the Huffman coding.

Well there are a couple of bugs in my coding which were discovered while testing the approach on much larger files. With those issues fixed I see that I need to focus on optimizing because this all-encompassing matching is much too slow (even when considering the Java breadboard).

The approach would find all of the sequence matches to data in the previous 32KB of the stream (the sliding window) for a section of the input stream bounded by non-matching data. Once they were collected I would end up trashing the vast majority of them. That wastes processing time. It was a logical approach when I was not sure whether a better compression ratio could be obtained through careful selection of sequences. There is this suggestion that better compression is possible if lazy matches are allowed. Without really knowing what those are, the shotgun all-encompassing approach at least guaranteed that I had all of the information needed to reach the optimum. Let’s actually look at this more closely.

Matching

We start a match upon receipt of a data byte. I’m keeping a bidirectional linked list for the occurrences of each byte value. This allows the routine to rapidly create an active match object for each earlier occurrence. Subsequently, as each new byte is received, we check each active match to see which may be extended and which are no longer useful. When we reach a point where none of the active matches have been extended we select the longest completed match as the best. For matches of equal length we pick the closest one (lowest distance).
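
To illustrate the per-byte occurrence lists, here is a stripped-down sketch. It keeps only a forward chain (newest to oldest) and never drops old data, so it is simpler than the bidirectional lists described above. Positions are stored +1 so that 0 can mark the end of a chain. OccurrenceChains and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class OccurrenceChains {
    private final int[] head = new int[256]; // newest position + 1 for each byte value, 0 = none
    private final int[] fwd;                 // next older position + 1, 0 = end of chain
    private int count;                       // number of bytes added so far

    OccurrenceChains(int capacity) {
        fwd = new int[capacity];
    }

    // Record the next byte of the stream at the head of its chain.
    void add(int value) {
        int v = value & 0xff;
        fwd[count] = head[v];
        head[v] = count + 1;
        count++;
    }

    // Positions at which this byte value has been seen, newest first.
    // Each one is a candidate starting point for a match.
    List<Integer> occurrences(int value) {
        List<Integer> found = new ArrayList<>();
        for (int p = head[value & 0xff]; p != 0; p = fwd[p - 1])
            found.add(p - 1);
        return found;
    }
}
```

After adding the bytes of "abcab", asking for the occurrences of 'a' walks the chain from position 3 back to position 0 without touching any other data.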

The DEFLATE specification recognizes matches of 3 or more bytes (maximum 258). Why 3? That is because the compression is achieved by replacing a matched sequence with a pointer back to the same data found in the previous 32KB of data (the sliding window). That pointer consists of a distance and a length, and in the worst case it requires about 3 bytes. So replacing shorter sequences on average won’t buy you anything. That’s for DEFLATE. I am actually using about 5 bytes for this breadboard but eventually we will be strictly DEFLATE. Obviously the longer the match the greater the savings. Therefore the best match is the longest and closest (uses a minimum pointer size).

So for a point in the incoming stream we seek the longest match. If there is no 3 byte match then we output that first byte as is and search using the next one. The results can be impressive especially for text and log files. It’s not LZW but it works. It turns out to be good enough for the JNIOR.
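
As a point of comparison, the match-as-you-go strategy fits in a few lines if speed is ignored. This brute-force sketch is not the breadboard code: it rescans the whole window at every position instead of using linked lists, and it writes pointers in the readable [D=,L=] form used earlier. GreedySketch and tokenize are hypothetical names.

```java
final class GreedySketch {
    static final int WINDOW = 32768;  // how far back a match may start
    static final int MIN = 3;         // shortest match worth a pointer
    static final int MAX = 258;       // longest match DEFLATE allows

    static String tokenize(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int bestLen = 0;
            int bestDist = 0;

            // Scan from the closest candidate outwards. Only a strictly longer
            // match replaces the best, so ties go to the smaller distance.
            for (int j = i - 1; j >= 0 && i - j <= WINDOW; j--) {
                int len = 0;
                while (len < MAX && i + len < data.length
                        && data[j + len] == data[i + len])
                    len++;
                if (len > bestLen) {
                    bestLen = len;
                    bestDist = i - j;
                }
            }

            if (bestLen >= MIN) {
                // replace the sequence with a pointer
                sb.append("[D=").append(bestDist).append(",L=").append(bestLen).append(']');
                i += bestLen;
            } else {
                // no usable match: output the raw byte and move on
                sb.append((char) (data[i] & 0xff));
                i++;
            }
        }
        return sb.toString();
    }
}
```

Note that a match is allowed to run past the current position, so repeated text such as "abcabcabc" collapses to three raw bytes and a single pointer.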

Lazy Matching

So what is with this lazy matching? Well imagine a sequence of 3 or more matching bytes located someplace in the sliding window. Then consider that if we ignored that match and searched for matches starting with the next byte we might find a much longer match from someplace else in the sliding window. Do we miss an opportunity for better compression?

I can graphically show the overlap of the best matches. Say the first is 5 in length and the other some 15.

|---|
 |-------------|

Here vertical bars denote the first and last matching byte and the dashes bytes in between. The first would replace 5 data bytes and the second 15 starting a byte later.

If we were to strictly process matches as they are found we would encode the 5-byte match. Then we would still find the latter 11 bytes of the second sequence (or maybe even another better sequence). This would encode as two sequences, one right after another, and require two pointers for a total of maybe 6 bytes.

|---||---------|
2 pointers = 6 bytes

Note that we can always prune a match. We can ignore some of its leading bytes by incrementing the replace position (CURPTR) and decrementing the length. We can even ignore trailing bytes in a match merely by shortening its length. So here we drop the first 4 bytes of the 15-byte sequence that were eclipsed by the initial 5-byte sequence. We don’t have to actually do this manipulation as supposedly our search algorithm would find it directly for us.

Now those who get excited by such things would point out that if we absolutely ignored the first 5-byte sequence completely and outputted that first raw byte then we would use just one pointer.

.|-------------|
1 raw byte plus 1 pointer = 4 bytes

And, yes, there is a savings that depending on how often such a thing occurs will in fact lead to a better result. This is a lazy match. This is even true when the second sequence is further offset as seen here.

|---|
  |-------------|

..|-------------|
2 raw bytes plus 1 pointer = 5 bytes

But there is no benefit beyond that. If the two sequences were offset by 3 bytes then you might as well encode the first 3 as a sequence and you end up using 6 bytes (or less) anyway with 2 pointers.
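
The arithmetic above can be written out as a tiny sketch. It assumes the worst-case pointer cost of 3 bytes used in this discussion and 1 byte for each raw byte. LazyCost and its methods are hypothetical names.

```java
final class LazyCost {
    static final int POINTER_COST = 3;  // assumed worst-case size of one pointer, in bytes

    // Greedy: encode the short match, then a second pointer for what is
    // left of the long match.
    static int greedyCost() {
        return 2 * POINTER_COST;
    }

    // Lazy: skip the short match, emit the bytes before the long match raw,
    // then encode the long match with a single pointer.
    static int lazyCost(int offset) {
        return offset + POINTER_COST;
    }

    // Lazy matching only pays off when it is strictly cheaper.
    static boolean lazyWins(int offset) {
        return lazyCost(offset) < greedyCost();
    }
}
```

With these assumptions an offset of 1 costs 4 bytes against 6 and an offset of 2 costs 5 against 6, while at an offset of 3 both come to 6 and the lazy match no longer wins.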

OK so

Alright for the JNIOR it is likely that these lazy matches aren’t necessary. After all we just want to create a file collection or a graphics file. We aren’t really worried about saving every byte. In fact, we are probably more concerned about it getting done quickly.

So using matches as they come works. But… if we can find a way to efficiently accommodate the lazy matching it would be cool.

Optimized Program Code

Now that I have a little better understanding as to the lazy matching we can take what we have and move on to the next step in DEFLATE. Later after I cast this into C we can decide if it is worth handling the lazy matches. It represents a trade off between an optimum compression ratio and processing time. For the JNIOR we really are more concerned about the processing time as the brute force compression ratios appear more than acceptable. Note that I bet that our original processing of all possible matches between unmatched raw data would lead to an even better compression ratio than that including just the lazy matches but that would be slow.

Speaking of slow I thought to take a little time to distill our algorithm down and to code it so it would execute faster. I know that it is still a breadboard but I would like to not waste as much time with iterations in debugging. So I have eliminated the Match object and automatic growing lists. And since we are identifying the best match for a single position I have eliminated the list of completed matches (RSEQ). I also made an adjustment so as to be able to replay bytes into the matcher should we need to output raw data.

The following is the LZ77 code. Hopefully the comments are sufficient for you to follow the algorithm.

    // Our LZ77 compression engine
    static void do_compress(BufferedOutputStream outfile, BufferedInputStream infile, int filesize) 
            throws Throwable {
        
        int ch;
        
        // process uncompressed stream byte-by-byte
        while (filesize > 0) {
            
            // Make sure that there are bytes in the queue to work with. We process bytes from 
            //  the queue using SEQPTR. When SEQPTR reaches the INPTR then we add bytes from the input
            //  stream. The linked lists are updated. 
            if (SEQPTR == INPTR) {
                
                // obtain byte from uncompressed stream
                ch = infile.read();
                filesize--;
                
                // queue data and manage associated linked list
                DATA[INPTR] = (byte)ch;
 
                // Add byte to the head of the appropriate linked list. Note pointers are stored +1 so
                //  as to use 0 as an end of list marker. Lists are bi-directional so we can trim the 
                //  tail when data is dropped from the queue.
                int ptr = HEAD[ch];
                HEAD[ch] = INPTR + 1;
                FWD[INPTR] = ptr;
                BACK[INPTR] = 0;
                if (ptr != 0)
                    BACK[ptr - 1] = INPTR + 1;
 
                // advance entry pointer
                INPTR++;
                if (INPTR == WINDOW)
                    INPTR = 0;
 
                // drop old data from queue when the sliding window is full
                if (INPTR == OUTPTR) {
 
                    // trim linked list as byte is being dropped
                    if (BACK[OUTPTR] == 0)
                        HEAD[DATA[OUTPTR]] = 0;
                    else
                        FWD[BACK[OUTPTR] - 1] = 0;
 
                    // push end of queue
                    OUTPTR++;
                    if (OUTPTR == WINDOW)
                        OUTPTR = 0;
                }
            }
            
            // Obtain the next character to process. We are assured of a byte at SEQPTR now.
            //  SEQPTR allows us to replay bytes into the sequence matching.
            ch = DATA[SEQPTR++];
            if (SEQPTR == WINDOW)
                SEQPTR = 0;
            
            // Reset match state. These will define the best match should one be found for 
            //  the current CURPTR.
            int best_distance = 0;
            int best_length = 0;
            
            // If there are no active sequences we create a new set. This uses the linked list
            //  for the byte at CURPTR to initialize a series of potential sequence sites.
            if (MSIZE == 0) {
 
                // create new active matches for all CH in the queue (except last)
                int ptr = HEAD[ch];
                while (ptr != 0) {
                    if (ptr - 1 != CURPTR) {
                        int distance = CURPTR - ptr + 1;
                        if (distance < 0)
                            distance += WINDOW;
 
                        DISTANCE[MSIZE] = distance;
                        LENGTH[MSIZE] = 1;
                        MSIZE++;
                    }
 
                    ptr = FWD[ptr - 1];
                }
                
            }
                
            // Otherwise process the active sequence matches. Here we advance sequences as each
            //  new byte is processed. Of those matches that cannot be extended we keep the
            //  best (longest and closest to CURPTR). We will use the best match if all of the
            //  potential matches end.
            else {
                
                // each active match
                for (int n = MSIZE - 1; 0 <= n; n--) {
                    
                    int p = CURPTR - DISTANCE[n];
                    if (p < 0)
                        p += WINDOW;
                    p += LENGTH[n];
                    if (p >= WINDOW)
                        p -= WINDOW;
 
                    // Can we extend this match? If so we bump its length and move on to
                    //  the next match.
                    if (DATA[p] == ch && LENGTH[n] < 258) {
                        LENGTH[n]++;
 
                        if (DISTANCE[n] + LENGTH[n] < WINDOW && filesize > 0)
                            continue;
                    }
 
                    // Sequence did not get extended. See if it is the best found so far.
                    if (LENGTH[n] >= 3) {
                        
                        // first 
                        if (best_length == 0) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                        
                        // longer
                        else if (LENGTH[n] > best_length) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                        
                        // closer
                        else if (LENGTH[n] == best_length && DISTANCE[n] < best_distance) {
                            best_distance = DISTANCE[n];
                            best_length = LENGTH[n];
                        }
                    }

                    // Completed matches are eliminated from the active list. To be quick we
                    //  replace it with the last in the list and reduce the count.
                    MSIZE--;
                    DISTANCE[n] = DISTANCE[MSIZE];
                    LENGTH[n] = LENGTH[MSIZE];
                }
            }

            // If there are no active sequence matches at this point we can generate output.
            if (MSIZE == 0) {

                // If we have a completed sequence we can output a pointer. These are escaped
                //  into the output buffer for later processing into encoded length-distance
                //  pairs.
                if (best_length != 0) {
                    bufwrite(0xff, outfile);
                    bufint(best_length, outfile);
                    bufint(best_distance, outfile);

                    // Move CURPTR to the next byte after the replaced sequence
                    CURPTR += best_length;
                    if (CURPTR >= WINDOW)
                        CURPTR -= WINDOW;
                }                
 
                // Otherwise output a raw uncompressed byte. The unmatched byte is sent to
                //  the output stream and we move CURPTR to the next. 
                else {
                    bufbyte(DATA[CURPTR], outfile);
                    CURPTR++;
                    if (CURPTR == WINDOW)
                        CURPTR = 0;
                }
 
                // Here we reset SEQPTR to process from the next CURPTR location. In the case that
                //  we could not match this replays bytes previously processed so as to not miss
                //  an opportunity.
                SEQPTR = CURPTR;
            }
        }
        
        // If we are done and there are unprocessed bytes left we push them to the output stream.
        while (CURPTR != INPTR) {
            bufbyte(DATA[CURPTR], outfile);
            CURPTR++;
            if (CURPTR == WINDOW)
                CURPTR = 0;
        }
    }

With this in place we are going to move on to see what needs to be done next with our output stream.

Our LZ77 compression routine loads an output buffer with processed and hopefully compressed data. When this output buffer fills (say to 64KB) we must process it further. That data is then compressed again using a form of Huffman coding.

Huffman coding for DEFLATE

If you search you can find lots of useful descriptions of Huffman coding. Not all of those will provide the detail for constructing the required tree from the data. Of those that do, most do not lead you to creating a Huffman table compatible with DEFLATE. This is because the Huffman table in DEFLATE is eventually stored using a form of shorthand. That is only possible if the Huffman encoding follows some additional rules. To meet those requirements we need to be careful in constructing our initial dynamic tree.

The DEFLATE specification cryptically defines it:

The Huffman codes used for each alphabet in "deflate" format have two additional rules:

   * All codes of a given bit length have lexicographically consecutive values, in the same order as the
     symbols they represent;

   * Shorter codes lexicographically precede longer codes.

It would be nice if they would avoid words like lexicographically but you can’t have everything. You can also get confused over the term codes versus the binary values of the bytes in the alphabet. And of course shorter refers to bit count. That is perhaps a little more obvious, but here again these must “lexicographically precede” others.

Alphabet

This refers to the set of values that we intend to compress. Obviously this needs to include byte values (0..255) since we are not constraining our input to ASCII or something. We include all of the possible values in the alphabet (in increasing value) even if some do not appear in the data. That seems obvious but DEFLATE also defines an end-of-block code (like an EOF) of 256 as well as special codes from 257..285 used to represent length codes (in the length-distance pointers we created).

So we will need to encode bytes from 0 thru 285. Okay, that set requires 9 bits and makes life in a world of bytes difficult. Remember how I had to escape my length-distance pointers in the buffer? Anyway, we can handle it in building our trees as we can define the value of a node as an integer. So for DEFLATE our “alphabet” consists of the numbers 0..285.

Don’t be confused if you notice that length codes and also distance codes generally include some “extra” bits. They do and those are simply slipped into the bit stream and are not subjected to Huffman coding. We’ll get into that later.

Length codes

These lie outside of the normal byte values 0..255 simply because in decompression we need to recognize them. These are flagged just as I have escaped the same in the output buffer. There are 29 of the length codes which are used with extra bits in some cases to encode lengths of 3 to 258. You may recall that we did not create matching sequences of less than 3 bytes and there is a maximum of a 258 byte length. The 258 maximum I bet results from storing the length-3 as a byte (0..255) someplace. But I would be very curious as to the thought process that breaks these 256 possible lengths into 29 codes. That is likely based upon some probability distribution or some such thing. It is what it is.

Distance codes

Unlike the length codes the distance codes do not need to be flagged. We expect a distance code after a length code and so those use normal byte values already represented in the alphabet (0..29). Here there are 30 distance codes, some also requiring extra bits, encoding distances from 1..32768. This allows the matched sequence to sit anywhere in that 32KB sliding window.

Huffman coding compresses data by representing frequent values with a small number of bits. If a space ' ' (0x20) appears in the data a tremendous number of times it might get encoded by just 2 bits. That saves 6 bits for every occurrence of a space, which can be a huge savings. The downside is that a rare byte that occurs only a few times might be encoded by 10 bits. That actually increases the storage from the original 8-bit byte but happens only a few times.

This implies then that we know the frequencies of each member of our alphabet. That is the first step. We need to proceed to count each occurrence of each member in our alphabet that appears in the data.

Here we modify my bufflush() routine that is responsible for emptying the buffer. First we will add a routine to count. There are 286 members in the alphabet (256 byte values, the end-of-block code and 29 length codes). We create an integer array where we use the value as an index to count occurrences. There is one complication in that I need to convert my length-distance escaping into the DEFLATE encoding. That entails tables of length and distance ranges so we can decide which of the length and distance codes we need to use.

    // length code range maximums
    static int[] blen = { 
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35, 
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259 
    };

Here for each of the 29 length ranges we specify the largest size plus 1 that it can encode. For a given length we will loop through these ranges to determine the proper alphabet value to use. The DEFLATE lengths are encoded as follows (from RFC 1951):

                 Extra               Extra               Extra
            Code Bits Length(s) Code Bits Lengths   Code Bits Length(s)
            ---- ---- ------     ---- ---- -------   ---- ---- -------
             257   0     3       267   1   15,16     277   4   67-82
             258   0     4       268   1   17,18     278   4   83-98
             259   0     5       269   2   19-22     279   4   99-114
             260   0     6       270   2   23-26     280   4  115-130
             261   0     7       271   2   27-30     281   5  131-162
             262   0     8       272   2   31-34     282   5  163-194
             263   0     9       273   3   35-42     283   5  195-226
             264   0    10       274   3   43-50     284   5  227-257
             265   1  11,12      275   3   51-58     285   0    258
             266   1  13,14      276   3   59-66

Similarly we create an array for the 30 distance codes.

    static int[] bdist = { 
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145, 
        8193, 12289, 16385, 24577, 32769 
    };

The DEFLATE specification encodes distances as follows:

                  Extra           Extra               Extra
             Code Bits Dist  Code Bits   Dist     Code Bits Distance
             ---- ---- ----  ---- ----  ------    ---- ---- --------
               0   0    1     10   4     33-48    20    9   1025-1536
               1   0    2     11   4     49-64    21    9   1537-2048
               2   0    3     12   5     65-96    22   10   2049-3072
               3   0    4     13   5     97-128   23   10   3073-4096
               4   1   5,6    14   6    129-192   24   11   4097-6144
               5   1   7,8    15   6    193-256   25   11   6145-8192
               6   2   9-12   16   7    257-384   26   12  8193-12288
               7   2  13-16   17   7    385-512   27   12 12289-16384
               8   3  17-24   18   8    513-768   28   13 16385-24576
               9   3  25-32   19   8   769-1024   29   13 24577-32768
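
As a quick check of the two range tables, here is a minimal sketch that maps a match length or distance to its DEFLATE code using the same "first range maximum that exceeds the value" test the counting loop relies on. The class and method names (`CodeLookup`, `lengthCode`, `distanceCode`) are made up for illustration and the arrays are copied from the listings above. It is written for a standard desktop Java runtime, which is an assumption.

```java
public class CodeLookup {

    // length code range maximums (largest length + 1 for codes 257..285)
    static final int[] blen = {
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35,
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259
    };

    // distance code range maximums (largest distance + 1 for codes 0..29)
    static final int[] bdist = {
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
        8193, 12289, 16385, 24577, 32769
    };

    // Returns the length code (257..285) for a length of 3..258, or -1 if out of range.
    static int lengthCode(int len) {
        if (len < 3)
            return -1;
        for (int k = 0; k < blen.length; k++)
            if (len < blen[k])
                return 257 + k;
        return -1;
    }

    // Returns the distance code (0..29) for a distance of 1..32768, or -1 if out of range.
    static int distanceCode(int dist) {
        if (dist < 1)
            return -1;
        for (int k = 0; k < bdist.length; k++)
            if (dist < bdist[k])
                return k;
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(lengthCode(3));        // 257
        System.out.println(lengthCode(258));      // 285
        System.out.println(distanceCode(32768));  // 29
    }
}
```

The results can be compared by eye against the two RFC 1951 tables: a length of 11 should land on code 265 and a distance of 1025 on code 20.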

So this buffer flush routine looks as follows. Note that we are not encoding the output in any way yet. This merely determines the counts.

    // length code range maximums
    static int[] blen = { 
        4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31, 35, 
        43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 259 
    };
    
    static int[] bdist = { 
        2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193, 
        257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145, 
        8193, 12289, 16385, 24577, 32769 
    };
 
    static void bufflush(BufferedOutputStream outfile) throws Throwable {
        
        // Determine frequencies by counting each occurrence of a byte value
        int[] freq = new int[286];
        for (int n = 0; n < BUFPTR; n++) {
            
            // Get the byte value. This may be escaped with 0xff.
            int ch = BUFR[n] & 0xff;
            
            // not escaped
            if (ch != 0xff)
                freq[ch]++;
            
            // escaped
            else {
                
                // Get next byte.
                ch = BUFR[++n] & 0xff;
                
                // may just be 0xff itself
                if (ch == 0xff)
                    freq[0xff]++;
                
                // length-distance pair
                else {
                    
                    // obtain balance of length and distance values
                    int len = (ch << 8) + (BUFR[++n] & 0xff);
                    int dist = ((BUFR[++n] & 0xff) << 8) + (BUFR[++n] & 0xff);
                    
                    // determine length code to use (257..285)
                    for (int k = 0; k < blen.length; k++) {
                        if (len < blen[k]) {
                            freq[257 + k]++;
                            break;
                        }
                    }
                    
                    // determine distance code to use (0..29)
                    for (int k = 0; k < bdist.length; k++) {
                        if (dist < bdist[k]) {
                            freq[k]++;
                            break;
                        }
                    }
                    
                }
            }
        }
        
        // dump the results
        for (int n = 0; n < 286; n++) {
            if (freq[n] > 0)
                System.out.printf("0x%03x %d\n", n, freq[n]);
        }

And the unsorted results. Here we list only those values that appear in the data.

jtest jniorboot.log
0x004 1
0x00a 9
0x00b 13
0x00c 4
0x00d 5
0x00e 8
0x00f 3
0x010 4
0x011 8
0x012 2
0x013 2
0x020 41
0x028 1
0x029 1
0x02c 6
0x02d 5
0x02e 7
0x02f 3
0x030 16
0x031 12
0x032 12
0x033 6
0x034 10
0x035 7
0x036 6
0x037 4
0x038 9
0x039 9
0x03a 10
0x041 4
0x042 2
0x043 3
0x044 1
0x045 2
0x046 1
0x047 3
0x048 1
0x049 3
0x04a 1
0x04d 1
0x04e 4
0x04f 3
0x050 4
0x052 3
0x053 4
0x054 3
0x055 1
0x057 1
0x05b 1
0x05d 1
0x061 8
0x062 5
0x063 7
0x064 9
0x065 29
0x066 1
0x067 2
0x068 2
0x069 14
0x06b 2
0x06c 10
0x06d 6
0x06e 9
0x06f 17
0x070 5
0x072 17
0x073 11
0x074 15
0x075 8
0x076 4
0x077 2
0x079 5
0x07a 1
0x101 30
0x102 2
0x103 3
0x105 1
0x10c 1
0x10d 13
0x10e 2
0x10f 2
0x110 1

Processing 6.129 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

There is one omission. We will in fact use one end-of-block code (0x100) and so we will need to force it into the table.

I will need to build a tree. So we need to create some kind of node which will have left and right references as well as a potential value and a weight (frequency). Here I will define a class.

    static ArrayList<Node> nodes = new ArrayList<Node>(512);
    
    static class Node {
        int left;
        int value;
        int weight;
        int right;
        int length;
        int code;
        
        Node() {
        }
        
        // instantiate leaf
        Node(int val, int w) {
            value = val;
            weight = w;
        }
        
        // instantiate node
        Node(int l, int r, int w) {
            left = l;
            right = r;
            weight = w;
        }                
    }

I know that later I will assign individual leaves a code length and a code so those are included in the class as well. Now I will take our array of value frequencies and create leaves for each of those that appear in the data. An ArrayList will store these nodes and grow as we define a full tree.

I will also create an ordering array that we will use to properly construct the tree. Each entry in this array will reference a node. Initially we will sort this by decreasing frequency. Also in keeping with the lexicographical requirement those nodes of the same frequency will be ordered by increasing alphabet value. This is where it gets tricky but we start here.

We replace the frequency dump loop with the following.

        // create node set (0 not used)
        int[] order = new int[286];
        int cnt = 0;
        nodes.add(new Node());  // not used (index 0 is terminator)
        for (int n = 0; n < 286; n++) {
            if (freq[n] > 0) {
                nodes.add(new Node(n, freq[n]));
                order[cnt++] = nodes.size() - 1;
            }
        }
        
        // sort
        for (int n = 0; n < cnt - 1; n++) {
            Node nd1 = nodes.get(order[n]);
            Node nd2 = nodes.get(order[n + 1]);
 
            if (nd1.weight < nd2.weight || nd1.weight == nd2.weight && nd1.value > nd2.value) {
                int k = order[n];
                order[n] = order[n + 1];
                order[n + 1] = k;
                n -= 2;
                if (n < -1)
                    n = -1;
            }
        }
        
        //dump
        System.out.println("");
        for (int n = 0; n < cnt; n++) {
            Node nd = nodes.get(order[n]);
            System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
        }

And the results of the sort are displayed.

bruce_dev /> jtest jniorboot.log

12 0x020 41
75 0x101 30
55 0x065 29
64 0x06f 17
66 0x072 17
19 0x030 16
68 0x074 15
59 0x069 14
3 0x00b 13
80 0x10d 13
20 0x031 12
21 0x032 12
67 0x073 11
23 0x034 10
29 0x03a 10
61 0x06c 10
2 0x00a 9
27 0x038 9
28 0x039 9
54 0x064 9
63 0x06e 9
6 0x00e 8
9 0x011 8
51 0x061 8
69 0x075 8
17 0x02e 7
24 0x035 7
53 0x063 7
15 0x02c 6
22 0x033 6
25 0x036 6
62 0x06d 6
5 0x00d 5
16 0x02d 5
52 0x062 5
65 0x070 5
72 0x079 5
4 0x00c 4
8 0x010 4
26 0x037 4
30 0x041 4
41 0x04e 4
43 0x050 4
45 0x053 4
70 0x076 4
7 0x00f 3
18 0x02f 3
32 0x043 3
36 0x047 3
38 0x049 3
42 0x04f 3
44 0x052 3
46 0x054 3
77 0x103 3
10 0x012 2
11 0x013 2
31 0x042 2
34 0x045 2
57 0x067 2
58 0x068 2
60 0x06b 2
71 0x077 2
76 0x102 2
81 0x10e 2
82 0x10f 2
1 0x004 1
13 0x028 1
14 0x029 1
33 0x044 1
35 0x046 1
37 0x048 1
39 0x04a 1
40 0x04d 1
47 0x055 1
48 0x057 1
49 0x05b 1
50 0x05d 1
56 0x066 1
73 0x07a 1
74 0x100 1
78 0x105 1
79 0x10c 1
83 0x110 1

Processing 8.873 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

The first number is the node index. The next is the value and the last the count of occurrences. Here we see that the space (0x20) is the most frequent in this file. You can see that I included the end-of-block code (0x100) once and it is as infrequent as a few others.

The next step is to construct a Huffman tree. So to simplify things at this point we are going to use a simple phrase as the data and disable the LZ77 aspect. There is a Duke University page describing Huffman coding that uses the phrase “go go gophers”. For lack of anything better we will use the same.

With LZ77 disabled and running only that phrase our frequency sort yields the following.

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

Processing 0.649 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev /> 

bruce_dev /> cat flash/gogo.dat
go go gophers
bruce_dev />

We also ignore any end-of-block code.

The Duke page demonstrates that this phrase can be encoded with just 37 bits. It also demonstrates that there are multiple possible Huffman trees that can be created to yield that result. Interestingly, none of the trees shown there meets the DEFLATE requirement. So I am going to determine the procedure that creates the right kind of table.

The game next is to combine pairs of leaves into nodes, and then pairs of leaves and nodes into other nodes. This procedure is repeated until there is only one at the head of the tree. Generally one is directed to combine the lowest two weighted leaves or nodes into a single node of combined weight.

Each time a node is constructed it defines a bit 0/1 for left/right for the one or two members the node contains. By working from the lowest weighted or least frequent leaves and then nodes, one ends up building longer codes for those than for the higher frequency values, which are not touched right away.

The following procedure generates a tree (but not the kind we want just yet) using just such a procedure. The lowest two are combined and the list is resorted. We dump and repeat.

        // generate tree
        while (cnt > 1) {
            
            // take lowest weight nodes and create new
            int left = order[cnt - 2];
            int right = order[cnt - 1];
            Node nd1 = nodes.get(left);
            Node nd2 = nodes.get(right);
            nodes.add(new Node(left, right, nd1.weight + nd2.weight));
            order[cnt - 2] = nodes.size() - 1;
            cnt--;
 
            // sort
            for (int n = 0; n < cnt - 1; n++) {
                nd1 = nodes.get(order[n]);
                nd2 = nodes.get(order[n + 1]);
 
                if (nd1.weight < nd2.weight) {
                    int k = order[n];
                    order[n] = order[n + 1];
                    order[n + 1] = k;
                    n -= 2;
                    if (n < -1)
                        n = -1;
                }
            }
            
            //dump
            System.out.println("");
            for (int n = 0; n < cnt; n++) {
                Node nd = nodes.get(order[n]);
                if (nd.left == 0 && nd.right == 0)
                    System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
                else
                    System.out.printf("%d %d-%d %d\n", order[n], nd.left, nd.right, nd.weight);
            }
        }

You can follow the procedure in the output although the resulting tree is not displayed. I might create a way to display the tree but I haven’t gone that far yet.

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

3 0x067 3
5 0x06f 3
1 0x020 2
9 7-8 2
2 0x065 1
4 0x068 1
6 0x070 1

3 0x067 3
5 0x06f 3
1 0x020 2
9 7-8 2
10 4-6 2
2 0x065 1

3 0x067 3
5 0x06f 3
11 10-2 3
1 0x020 2
9 7-8 2

12 1-9 4
3 0x067 3
5 0x06f 3
11 10-2 3

13 5-11 6
12 1-9 4
3 0x067 3

14 12-3 7
13 5-11 6

15 14-13 13

Processing 0.865 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />

Here the new nodes display node reference indexes as left-right instead of a value.

Now with a tree built I create a recursive routine to walk the tree and assign leaves a code length and binary code. I then collect these leaves into a node array that would allow me to efficiently translate the raw data. Here that array is dumped. You will see the new code later. Here’s the table from the prior procedure.

0x020 ' ' count=2 length=3 code 000
0x065 'e' count=1 length=3 code 111
0x067 'g' count=3 length=2 code 01
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 10
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 0010
0x073 's' count=1 length=4 code 0011
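
A recursive walk of the kind described can be sketched as follows. This is not the routine used to produce the table. It is a minimal stand-in that rebuilds the tree from the node dump above (indexes 1 through 15) and assumes a left branch appends a 0 bit and a right branch a 1 bit. The class `TreeWalk` and its helpers are hypothetical names, and the `Node` here is a trimmed-down version of the class defined earlier.

```java
import java.util.ArrayList;
import java.util.List;

public class TreeWalk {

    static class Node {
        int left, right;   // child indexes into the node list (0 = none)
        int value;         // alphabet value when this is a leaf
        int length;        // assigned code length
        int code;          // assigned binary code
        Node() {}
        Node(int val) { value = val; }
        Node(int l, int r) { left = l; right = r; }
    }

    static List<Node> nodes = new ArrayList<Node>();

    // Descend from node index ndx, extending the code by one bit per level.
    static void walk(int ndx, int length, int code) {
        Node nd = nodes.get(ndx);
        if (nd.left == 0 && nd.right == 0) {   // leaf: record what we accumulated
            nd.length = length;
            nd.code = code;
            return;
        }
        walk(nd.left, length + 1, code << 1);          // left adds a 0
        walk(nd.right, length + 1, (code << 1) | 1);   // right adds a 1
    }

    // Format a leaf's code as a binary string padded to its length.
    static String bits(Node nd) {
        StringBuilder sb = new StringBuilder();
        for (int b = nd.length - 1; b >= 0; b--)
            sb.append((nd.code >> b) & 1);
        return sb.toString();
    }

    // Rebuild the tree from the dump: leaves 1..8, combined nodes 9..15.
    static void build() {
        nodes.clear();
        nodes.add(new Node());   // index 0 is not used
        int[] leaves = { 0x20, 0x65, 0x67, 0x68, 0x6f, 0x70, 0x72, 0x73 };
        for (int v : leaves)
            nodes.add(new Node(v));
        int[][] pairs = { {7, 8}, {4, 6}, {10, 2}, {1, 9}, {5, 11}, {12, 3}, {14, 13} };
        for (int[] p : pairs)
            nodes.add(new Node(p[0], p[1]));
        walk(nodes.size() - 1, 0, 0);   // the last node added is the head
    }

    public static void main(String[] args) {
        build();
        for (int n = 1; n <= 8; n++) {
            Node nd = nodes.get(n);
            System.out.printf("0x%03x length=%d code %s%n", nd.value, nd.length, bits(nd));
        }
    }
}
```

Run against this tree the walk gives 'g' the code 01 and the space the code 000, the same as the table.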

If we manually encode the phrase we see that it does in fact compress to just 37 bits.

g  o  ' ' g  o  ' ' g  o  p    h    e   r    s
01 10 000 01 10 000 01 10 1101 1100 111 0010 0011
total 37 bits
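
The 37-bit total can also be cross-checked without building any tree at all. The sketch below uses the textbook approach of repeatedly merging the two smallest weights with a `java.util.PriorityQueue`; each merge adds one bit to every symbol beneath it, so the sum of the merged weights is the encoded size. This is a different technique from the sorted-array procedure in this article, and `HuffmanCost`, `totalBits` and `count` are hypothetical names.

```java
import java.util.PriorityQueue;

public class HuffmanCost {

    // Total encoded size in bits for the given symbol counts.
    static int totalBits(int[] counts) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>();
        for (int c : counts)
            if (c > 0)
                pq.add(c);

        int bits = 0;
        while (pq.size() > 1) {
            int w = pq.poll() + pq.poll();   // combine the two lowest weights
            bits += w;                       // every symbol below gains one bit
            pq.add(w);
        }
        return bits;
    }

    // Count occurrences of each 8-bit character in a string.
    static int[] count(String s) {
        int[] freq = new int[256];
        for (int i = 0; i < s.length(); i++)
            freq[s.charAt(i) & 0xff]++;
        return freq;
    }

    public static void main(String[] args) {
        System.out.println(totalBits(count("go go gophers")));   // 37
    }
}
```

Any valid Huffman tree for these counts gives the same total no matter how ties are broken, which is why several differently shaped trees can all reach 37 bits.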

But this table does not meet the DEFLATE requirements and cannot fit the shorthand.

By the way here are two other tables from the Duke University page. Both of these are different yet again from our table but each gets the job done. None of these meet the DEFLATE requirement.

Now that third tree is close. But here is the tree that we need to learn to generate from this data.

Why? Because the two additional requirements are met.

First the code lengths (depth of the tree) increase from left to right.

Second, for the same code length (depth) the values increase from left to right (lexicographically).

And with this tree we can apply the required shorthand to properly fit the DEFLATE format. The details of that shorthand I will get into.

Let’s revise the procedure to consider pairs of nodes from right to left.

Here we first determine the combined weight of the rightmost pair. We will combine that pair and any prior pair whose combined weight is less than or equal to that. We repeat this until we have only one node that being the head of the tree.

        // generate tree
        while (cnt > 1) {
            
            // determine the combined weight of the lowest two nodes
            int left = order[cnt - 2];
            int right = order[cnt - 1];
            Node nd1 = nodes.get(left);
            Node nd2 = nodes.get(right);
            int weight = nd1.weight + nd2.weight;
            
            // Now combine node pairs equal to or less than this weight from
            //  right to left.
            int pos = cnt;
            while (pos >= 2) {
                
                // Get the combined weight of the pair preceding the pointer. We will
                //  combine the pair if its weight is less than or equal to that of 
                //  the rightmost (least) pair. We stop if not.
                left = order[pos - 2];
                right = order[pos - 1];
                nd1 = nodes.get(left);
                nd2 = nodes.get(right);
                int w = nd1.weight + nd2.weight;
                if (w > weight)
                    break;
                
                // Combine the pair and reduce the order array. The entries to the
                //  right are each shifted down one slot to close the gap.
                nodes.add(new Node(left, right, w));
                order[pos - 2] = nodes.size() - 1;
                for (int n = pos; n < cnt; n++)
                    order[n - 1] = order[n];
                cnt--;
                
                // on to the next prior pair
                pos -= 2;
            }
            
            //dump
            System.out.println("");
            for (int n = 0; n < cnt; n++) {
                Node nd = nodes.get(order[n]);
                if (nd.left == 0 && nd.right == 0)
                    System.out.printf("%d 0x%03x %d\n", order[n], nd.value, nd.weight);
                else
                    System.out.printf("%d %d-%d %d\n", order[n], nd.left, nd.right, nd.weight);
            }
        }

Now when this is executed we obtain a different tree. This one actually is the one we seek.

0x020 ' ' count=2 length=3 code 100
0x065 'e' count=1 length=3 code 101
0x067 'g' count=3 length=2 code 00
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 01
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 1110
0x073 's' count=1 length=4 code 1111

bruce_dev /> jtest flash/gogo.dat

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
4 0x068 1
6 0x070 1
7 0x072 1
8 0x073 1

3 0x067 3
5 0x06f 3
1 0x020 2
2 0x065 1
10 4-6 2
9 7-8 2

3 0x067 3
5 0x06f 3
12 1-2 3
11 10-9 4

14 3-5 6
13 12-11 7

15 14-13 13
0x020 ' ' count=2 length=3 code 100
0x065 'e' count=1 length=3 code 101
0x067 'g' count=3 length=2 code 00
0x068 'h' count=1 length=4 code 1100
0x06f 'o' count=3 length=2 code 01
0x070 'p' count=1 length=4 code 1101
0x072 'r' count=1 length=4 code 1110
0x073 's' count=1 length=4 code 1111

Processing 0.840 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev /> 

I may not be ready to claim victory here but this appears to be very promising. Perhaps we should return to a more complicated situation.

Alright so we re-enable the LZ77 compression and remove much of the dump output. When we run this on jniorboot.log we get the following table.

bruce_dev /> jtest jniorboot.log
0x004 '.' count=1 length=11 code 10011110110
0x00a '.' count=9 length=6 code 110000
0x00b '.' count=13 length=6 code 110100
0x00c '.' count=4 length=8 code 10110011
0x00d '.' count=5 length=6 code 110010
0x00e '.' count=8 length=7 code 1010001
0x00f '.' count=3 length=8 code 10011101
0x010 '.' count=4 length=8 code 10111100
0x011 '.' count=8 length=7 code 1001100
0x012 '.' count=2 length=10 code 1001111000
0x013 '.' count=2 length=10 code 1001111001
0x020 ' ' count=41 length=2 code 00
0x028 '(' count=1 length=11 code 10011110111
0x029 ')' count=1 length=11 code 10110111110
0x02c ',' count=6 length=6 code 110110
0x02d '-' count=5 length=6 code 110011
0x02e '.' count=7 length=8 code 10110001
0x02f '/' count=3 length=9 code 100111110
0x030 '0' count=16 length=5 code 11111
0x031 '1' count=12 length=7 code 1011100
0x032 '2' count=12 length=7 code 1011101
0x033 '3' count=6 length=6 code 110111
0x034 '4' count=10 length=7 code 1001001
0x035 '5' count=7 length=8 code 10110100
0x036 '6' count=6 length=7 code 1001010
0x037 '7' count=4 length=8 code 10111101
0x038 '8' count=9 length=6 code 110001
0x039 '9' count=9 length=6 code 100000
0x03a ':' count=10 length=6 code 111000
0x041 'A' count=4 length=7 code 1000100
0x042 'B' count=2 length=9 code 111010000
0x043 'C' count=3 length=9 code 100111111
0x044 'D' count=1 length=11 code 10110111111
0x045 'E' count=2 length=9 code 111010001
0x046 'F' count=1 length=10 code 1110110110
0x047 'G' count=3 length=8 code 11101110
0x048 'H' count=1 length=10 code 1110110111
0x049 'I' count=3 length=8 code 11101111
0x04a 'J' count=1 length=10 code 1011111110
0x04d 'M' count=1 length=10 code 1011111111
0x04e 'N' count=4 length=7 code 1000101
0x04f 'O' count=3 length=8 code 11101010
0x050 'P' count=4 length=9 code 101101100
0x052 'R' count=3 length=8 code 11101011
0x053 'S' count=4 length=9 code 101101101
0x054 'T' count=3 length=7 code 1000110
0x055 'U' count=1 length=10 code 1110100110
0x057 'W' count=1 length=10 code 1110100111
0x05b '[' count=1 length=10 code 1110110100
0x05d ']' count=1 length=10 code 1110110101
0x061 'a' count=8 length=7 code 1001101
0x062 'b' count=5 length=7 code 1010010
0x063 'c' count=7 length=8 code 10110101
0x064 'd' count=9 length=6 code 100001
0x065 'e' count=29 length=4 code 0110
0x066 'f' count=1 length=10 code 1110100100
0x067 'g' count=2 length=10 code 1011011100
0x068 'h' count=2 length=10 code 1011011101
0x069 'i' count=14 length=6 code 101011
0x06b 'k' count=2 length=9 code 101111100
0x06c 'l' count=10 length=6 code 111001
0x06d 'm' count=6 length=7 code 1001011
0x06e 'n' count=9 length=7 code 1010000
0x06f 'o' count=17 length=4 code 0111
0x070 'p' count=5 length=7 code 1010011
0x072 'r' count=17 length=5 code 11110
0x073 's' count=11 length=7 code 1001000
0x074 't' count=15 length=6 code 101010
0x075 'u' count=8 length=8 code 10110000
0x076 'v' count=4 length=8 code 10011100
0x077 'w' count=2 length=9 code 101111101
0x079 'y' count=5 length=8 code 10110010
0x07a 'z' count=1 length=10 code 1110100101
0x100 '.' count=1 length=10 code 1011111100
0x101 '.' count=30 length=3 code 010

Processing 10.284 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev />

Well okay. Just about all you can say is that the stuff with the higher counts (frequency) does appear to use the shortest code lengths. Another good clue is the fact that the first alphabet entry that uses the smallest code is represented by a sequence of all zeroes.

I suppose now we get into what I have been calling the shorthand storage format for the Huffman table. If this table can be so represented and the table reconstructed from that then we are good to go.

Huffman table “Shorthand”

While “shorthand” is my term and no one else’s that I’ve seen, it still refers to efficiently conveying the table. To start I have defined an entry for 286 possible codes each with a count and a binary code. Even with some cute integer packing this is still a lot of bytes. Having to pass the table with the compressed file can painfully cut into the benefit of the compression.

It turns out that if the Huffman table conforms to the two special rules it can be reconstructed from only knowing the code length for each of the alphabet members. So we don’t need to include the actual code.
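
To make that concrete, here is a minimal sketch of the reconstruction following the three steps given in RFC 1951 section 3.2.2: count the codes of each length, compute the first code for each length, then hand out consecutive codes in alphabet order. The class and method names (`CanonicalCodes`, `assign`, `gogoLengths`) are hypothetical. The code lengths fed in are the ones from the "go go gophers" table above.

```java
public class CanonicalCodes {

    static final int MAX_BITS = 15;   // DEFLATE limits code lengths to 15 bits

    // Given a code length per alphabet value (0 = not used), return the codes.
    static int[] assign(int[] lengths) {

        // 1) count the number of codes of each length
        int[] blCount = new int[MAX_BITS + 1];
        for (int len : lengths)
            if (len > 0)
                blCount[len]++;

        // 2) find the numerically smallest code for each length
        int[] nextCode = new int[MAX_BITS + 1];
        int code = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + blCount[bits - 1]) << 1;
            nextCode[bits] = code;
        }

        // 3) assign consecutive codes in increasing alphabet order
        int[] codes = new int[lengths.length];
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] != 0)
                codes[n] = nextCode[lengths[n]]++;
        return codes;
    }

    // Format a code as a binary string padded to the given length.
    static String bits(int code, int length) {
        StringBuilder sb = new StringBuilder();
        for (int b = length - 1; b >= 0; b--)
            sb.append((code >> b) & 1);
        return sb.toString();
    }

    // Code lengths taken from the "go go gophers" table.
    static int[] gogoLengths() {
        int[] lengths = new int[286];
        lengths[0x20] = 3; lengths[0x65] = 3; lengths[0x67] = 2; lengths[0x68] = 4;
        lengths[0x6f] = 2; lengths[0x70] = 4; lengths[0x72] = 4; lengths[0x73] = 4;
        return lengths;
    }

    public static void main(String[] args) {
        int[] lengths = gogoLengths();
        int[] codes = assign(lengths);
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] != 0)
                System.out.printf("0x%03x length=%d code %s%n",
                        n, lengths[n], bits(codes[n], lengths[n]));
    }
}
```

For those eight lengths the codes come out as 100, 101, 00, 1100, 01, 1101, 1110 and 1111, which matches the tree-derived table exactly. That match is the sense in which a table "conforms".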

Once we have the code lengths for each alphabet that gets packed further. It’s a bit crazy. The array of 286 code lengths contains a lot of repetition. This is packed using a form of run-length encoding where sequences of the same length are defined by the count (run length). Then that data is again (ugh) run through a Huffman encoding which results in just 19 bit lengths. Those are stored in a weird order which is intended to keep those alphabet members whose bit lengths are likely to be 0 near the end as trailing zeroes need not be included. So the entire Huffman table ends up being conveyed in just a handful of bytes. I guess people were really creative back then.
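
The run-length step can be sketched on its own. RFC 1951 section 3.2.7 defines a 19-member alphabet for this: 0 through 15 are literal code lengths, 16 repeats the previous length 3 to 6 times, 17 repeats a zero length 3 to 10 times, and 18 repeats a zero length 11 to 138 times. The encoder below is one simple greedy way to produce those symbols, not necessarily the most compact. It leaves out the second Huffman pass and the trimming of trailing zeroes, and the names `LengthRle`, `encode` and `decode` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class LengthRle {

    // Each entry is { symbol, extra } where extra is the value carried in the extra bits.
    static List<int[]> encode(int[] lens) {
        List<int[]> out = new ArrayList<int[]>();
        int i = 0;
        while (i < lens.length) {

            // measure the run of identical lengths starting at i
            int v = lens[i];
            int run = 1;
            while (i + run < lens.length && lens[i + run] == v)
                run++;
            i += run;

            if (v == 0) {
                while (run >= 11) {                      // long runs of zero
                    int r = Math.min(run, 138);
                    out.add(new int[] { 18, r - 11 });
                    run -= r;
                }
                if (run >= 3) {                          // short runs of zero
                    out.add(new int[] { 17, run - 3 });
                    run = 0;
                }
            } else {
                out.add(new int[] { v, 0 });             // first of the run is literal
                run--;
                while (run >= 3) {                       // then repeat the previous length
                    int r = Math.min(run, 6);
                    out.add(new int[] { 16, r - 3 });
                    run -= r;
                }
            }

            // whatever is too short for a repeat symbol goes out literally
            while (run-- > 0)
                out.add(new int[] { v, 0 });
        }
        return out;
    }

    // Inverse of encode(), used here only to check the round trip.
    static int[] decode(List<int[]> syms, int size) {
        int[] lens = new int[size];
        int n = 0, prev = 0;
        for (int[] s : syms) {
            if (s[0] < 16) {
                lens[n++] = prev = s[0];
            } else if (s[0] == 16) {
                for (int k = 0; k < s[1] + 3; k++)
                    lens[n++] = prev;
            } else {
                n += s[1] + (s[0] == 17 ? 3 : 11);       // zeroes are already in place
                prev = 0;
            }
        }
        return lens;
    }

    public static void main(String[] args) {
        int[] lens = new int[286];
        lens[0x20] = 3; lens[0x65] = 3; lens[0x67] = 2; lens[0x68] = 4;
        lens[0x6f] = 2; lens[0x70] = 4; lens[0x72] = 4; lens[0x73] = 4;
        List<int[]> syms = encode(lens);
        System.out.println(syms.size() + " symbols for " + lens.length + " lengths");
    }
}
```

Most of the saving comes from the long stretches of unused alphabet members collapsing into a few 18 symbols.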

The procedure for reconstructing the Huffman table from the code lengths first requires a count of codes for each length. Let me add that to our table output. Here is the additional output from the execution.

Length=2 Count 1
Length=3 Count 1
Length=4 Count 2
Length=5 Count 2
Length=6 Count 13
Length=7 Count 15
Length=8 Count 14
Length=9 Count 8
Length=10 Count 15
Length=11 Count 4

From this table we can calculate the first binary code assigned to that code length group. Each alphabet member using that code length is then assigned an incremental binary value from that. I can add that calculation to this table.

Length=2 Count 1 Start Code 00
Length=3 Count 1 Start Code 010
Length=4 Count 2 Start Code 0110
Length=5 Count 2 Start Code 10000
Length=6 Count 13 Start Code 100100
Length=7 Count 15 Start Code 1100010
Length=8 Count 14 Start Code 11100010
Length=9 Count 8 Start Code 111100000
Length=10 Count 15 Start Code 1111010000
Length=11 Count 4 Start Code 11110111110
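
The start codes in this table follow from the counts alone: the start code for a length is the previous start code plus the previous count, shifted left one bit. Here is a small sketch of just that calculation, assuming the counts listed above. The names `StartCodes` and `startCodes` are hypothetical.

```java
public class StartCodes {

    // counts[len] is the number of alphabet members using that code length
    // (counts[0] must be 0). Returns the first code for each length.
    static int[] startCodes(int[] counts) {
        int[] start = new int[counts.length];
        int code = 0;
        for (int bits = 1; bits < counts.length; bits++) {
            code = (code + counts[bits - 1]) << 1;
            start[bits] = code;
        }
        return start;
    }

    // Format a code as a binary string padded to the given length.
    static String bits(int code, int length) {
        StringBuilder sb = new StringBuilder();
        for (int b = length - 1; b >= 0; b--)
            sb.append((code >> b) & 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        // index = code length; values from the Length/Count table
        int[] counts = { 0, 0, 1, 1, 2, 2, 13, 15, 14, 8, 15, 4 };
        int[] start = startCodes(counts);
        for (int len = 1; len < counts.length; len++)
            if (counts[len] > 0)
                System.out.printf("Length=%d Count %d Start Code %s%n",
                        len, counts[len], bits(start[len], len));
    }
}
```

For a conforming table, the first alphabet member of each length must carry exactly that start code, with the rest of the group counting up from it.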

Okay so I can see that the Huffman table generated DOES NOT conform. As an exercise you can see for yourself. So I wonder where the error might be. Hmm…

bruce_dev /> jtest jniorboot.log
0x004 '.' count=1 length=11 code 10011110110
0x00a '.' count=9 length=6 code 110000
0x00b '.' count=13 length=6 code 110100
0x00c '.' count=4 length=8 code 10110011
0x00d '.' count=5 length=6 code 110010
0x00e '.' count=8 length=7 code 1010001
0x00f '.' count=3 length=8 code 10011101
0x010 '.' count=4 length=8 code 10111100
0x011 '.' count=8 length=7 code 1001100
0x012 '.' count=2 length=10 code 1001111000
0x013 '.' count=2 length=10 code 1001111001
0x020 ' ' count=41 length=2 code 00
0x028 '(' count=1 length=11 code 10011110111
0x029 ')' count=1 length=11 code 10110111110
0x02c ',' count=6 length=6 code 110110
0x02d '-' count=5 length=6 code 110011
0x02e '.' count=7 length=8 code 10110001
0x02f '/' count=3 length=9 code 100111110
0x030 '0' count=16 length=5 code 11111
0x031 '1' count=12 length=7 code 1011100
0x032 '2' count=12 length=7 code 1011101
0x033 '3' count=6 length=6 code 110111
0x034 '4' count=10 length=7 code 1001001
0x035 '5' count=7 length=8 code 10110100
0x036 '6' count=6 length=7 code 1001010
0x037 '7' count=4 length=8 code 10111101
0x038 '8' count=9 length=6 code 110001
0x039 '9' count=9 length=6 code 100000
0x03a ':' count=10 length=6 code 111000
0x041 'A' count=4 length=7 code 1000100
0x042 'B' count=2 length=9 code 111010000
0x043 'C' count=3 length=9 code 100111111
0x044 'D' count=1 length=11 code 10110111111
0x045 'E' count=2 length=9 code 111010001
0x046 'F' count=1 length=10 code 1110110110
0x047 'G' count=3 length=8 code 11101110
0x048 'H' count=1 length=10 code 1110110111
0x049 'I' count=3 length=8 code 11101111
0x04a 'J' count=1 length=10 code 1011111110
0x04d 'M' count=1 length=10 code 1011111111
0x04e 'N' count=4 length=7 code 1000101
0x04f 'O' count=3 length=8 code 11101010
0x050 'P' count=4 length=9 code 101101100
0x052 'R' count=3 length=8 code 11101011
0x053 'S' count=4 length=9 code 101101101
0x054 'T' count=3 length=7 code 1000110
0x055 'U' count=1 length=10 code 1110100110
0x057 'W' count=1 length=10 code 1110100111
0x05b '[' count=1 length=10 code 1110110100
0x05d ']' count=1 length=10 code 1110110101
0x061 'a' count=8 length=7 code 1001101
0x062 'b' count=5 length=7 code 1010010
0x063 'c' count=7 length=8 code 10110101
0x064 'd' count=9 length=6 code 100001
0x065 'e' count=29 length=4 code 0110
0x066 'f' count=1 length=10 code 1110100100
0x067 'g' count=2 length=10 code 1011011100
0x068 'h' count=2 length=10 code 1011011101
0x069 'i' count=14 length=6 code 101011
0x06b 'k' count=2 length=9 code 101111100
0x06c 'l' count=10 length=6 code 111001
0x06d 'm' count=6 length=7 code 1001011
0x06e 'n' count=9 length=7 code 1010000
0x06f 'o' count=17 length=4 code 0111
0x070 'p' count=5 length=7 code 1010011
0x072 'r' count=17 length=5 code 11110
0x073 's' count=11 length=7 code 1001000
0x074 't' count=15 length=6 code 101010
0x075 'u' count=8 length=8 code 10110000
0x076 'v' count=4 length=8 code 10011100
0x077 'w' count=2 length=9 code 101111101
0x079 'y' count=5 length=8 code 10110010
0x07a 'z' count=1 length=10 code 1110100101
0x100 '.' count=1 length=10 code 1011111100
0x101 '.' count=30 length=3 code 010

Length=2 Count 1 Start Code 00
Length=3 Count 1 Start Code 010
Length=4 Count 2 Start Code 0110
Length=5 Count 2 Start Code 10000
Length=6 Count 13 Start Code 100100
Length=7 Count 15 Start Code 1100010
Length=8 Count 14 Start Code 11100010
Length=9 Count 8 Start Code 111100000
Length=10 Count 15 Start Code 1111010000
Length=11 Count 4 Start Code 11110111110

Processing 10.395 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

Yeah I had at least one glitch but trying to generate the precise tree appropriate for the DEFLATE “shorthand” still eludes me. The search engines these days are much less effective for locating useful technical information than for finding ways to separate me from my money. It seems easier to reinvent the wheel and devise my own algorithm even though I know that a simple procedure is likely documented in numerous pages on the net.

It strikes me that all we need to do is determine the optimum bit length for each of the used alphabet members. It is almost irrelevant as to where in a tree a particular member ends up. Once we have the proper bit length a tree meeting the DEFLATE requirements can be directly created.

Perhaps the simple procedure for generating a valid Huffman tree, ignoring the DEFLATE requirements, can be employed without actually building a tree structure. Note that when two leaves are combined you are simply assigning another bit to them regardless of which gets ‘0’ and which gets ‘1’. The bit length is incremented for the two leaves as they are combined into a node. In fact when you combine two nodes you need only increment the bit length (depth) for all of the members below them. So in creating a node I need only keep track of all of the leaves below it. A simple linked list suffices.

Such an implementation need not even retain intermediate nodes. You just need to maintain the node membership list. You need that so you can advance the bit count for all of the leaves below as the node is combined.

Maybe you follow me to this point or maybe not. I’ll go ahead and try an implementation.

Okay this new approach is golden! And it’s fast! Oh and I don’t need to build any damn tree!

    static void bufflush(BufferedOutputStream outfile) throws Throwable {
        
        // Determine frequencies by counting each occurrence of a byte value. 
        //  Here we force the end-of-block code that we know we will use.
        int[] sym_cnt = new int[286];
//        sym_cnt[0x100] = 1;
        
        for (int n = 0; n < BUFPTR; n++) {
            
            // Get the byte value. This may be escaped with 0xff.
            int ch = BUFR[n] & 0xff;
            
            // not escaped
            if (ch != 0xff)
                sym_cnt[ch]++;
            
            // escaped
            else {
                
                // Get next byte.
                ch = BUFR[++n] & 0xff;
                
                // may just be 0xff itself
                if (ch == 0xff)
                    sym_cnt[0xff]++;
                
                // length-distance pair
                else {
                    
                    // obtain balance of length and distance values
                    int len = (ch << 8) + (BUFR[++n] & 0xff);
                    int dist = ((BUFR[++n] & 0xff) << 8) + (BUFR[++n] & 0xff);
                    
                    // determine length code to use (257..285)
                    for (int k = 0; k < blen.length; k++) {
                        if (len < blen[k]) {
                            sym_cnt[257 + k]++;
                            break;
                        }
                    }
                    
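                    // determine distance code to use (0..29)
                    //  NOTE: this tallies distance codes into the literal/length
                    //  counts. DEFLATE codes distances with a table of their own.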
                    for (int k = 0; k < bdist.length; k++) {
                        if (dist < bdist[k]) {
                            sym_cnt[k]++;
                            break;
                        }
                    }
                    
                }
            }
        }
        
        // Create node list containing symbols in our alphabet that are found in the
        //  data. This will be sorted and used to assign bit lengths. Note list pointers
        //  are stored +1 to reserve 0 as a list terminator.
        int[] nodes = new int[286];
        int[] cnts = new int[286];
        int nodecnt = 0;
        for (int n = 0; n < 286; n++) {
            if (sym_cnt[n] > 0) {
                nodes[nodecnt] = n + 1;
                cnts[nodecnt] = sym_cnt[n];
                nodecnt++;
            }
        }
        
        // Determine optimal bit lengths. Here we initialize a bit length array and a
        //  node membership list pointer array. These will be used as we generate
        //  the detail required for Huffman coding.
        int[] sym_len = new int[286];
        int[] sym_ptr = new int[286];
        
        // Perform Huffman optimization. This loops until we've folded all the leaves
        //  into a single head node.
        while (nodecnt > 1) {
            
            // The leaves are sorted by decreasing frequency (counts).
            for (int n = 0; n < nodecnt - 1; n++) {
                if (cnts[n] < cnts[n + 1]) {
                    int k = nodes[n];
                    nodes[n] = nodes[n + 1];
                    nodes[n + 1] = k;
                    k = cnts[n];
                    cnts[n] = cnts[n + 1];
                    cnts[n + 1] = k;
                    // step back so the previous pair is checked again
                    if (n > 0)
                        n -= 2;
                }
            }
 
            // The last two leaves/nodes have the lowest frequencies and are to 
            //  be combined. Here we increment the bit lengths for each and 
            //  merge leaves into a single list of node members.
            int ptr = nodes[nodecnt - 2];
            int add_ptr = nodes[nodecnt - 1];
            while (ptr > 0) {
                sym_len[ptr - 1]++;
                int p = sym_ptr[ptr - 1];
                if (p == 0 && add_ptr > 0) {
                    sym_ptr[ptr - 1] = add_ptr;
                    p = add_ptr;
                    add_ptr = 0;
                }
                ptr = p;
            }
            
            // Combine the last two nodes by adding their frequencies and dropping 
            //  the last.
            cnts[nodecnt - 2] += cnts[nodecnt - 1];
            nodecnt--;
            
        }
        
        // dump nonzero bit lengths
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0)
                System.out.printf("0x%03x '%c' count=%d optimal bits %d\n", n,
                        n >= 0x20 && n < 0x7f ? n : '.', sym_cnt[n], sym_len[n]);
        }
        
        outfile.write(BUFR, 0, BUFPTR);
        BUFPTR = 0;        
    }

Running this on the “go go gophers ” test string again with LZ77 disabled yields the desired results.

bruce_dev /> jtest flash/gogo.dat
0x020 ' ' count=2 optimal bits 3
0x065 'e' count=1 optimal bits 3
0x067 'g' count=3 optimal bits 2
0x068 'h' count=1 optimal bits 4
0x06f 'o' count=3 optimal bits 2
0x070 'p' count=1 optimal bits 4
0x072 'r' count=1 optimal bits 4
0x073 's' count=1 optimal bits 4

Processing 0.540 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />

Now I can use this to generate the DEFLATE compatible Huffman table.

If you multiply the frequency (count) times the bit length and add them for this example you get 37 bits which we know is the optimal for this example.
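That sum is easy to check with a few lines of Java. This is only a sketch: the class and method names are mine, and the counts and bit lengths are copied from the run above.

```java
// Sketch: encoded size in bits = sum of (count * bit length) over the used symbols.
class HuffmanCost {

    // Returns the number of bits needed to encode the data with the given bit lengths.
    static int totalBits(int[] counts, int[] lengths) {
        int total = 0;
        for (int n = 0; n < counts.length; n++)
            total += counts[n] * lengths[n];
        return total;
    }

    public static void main(String[] args) {
        // ' ', e, g, h, o, p, r, s from the "go go gophers" output above
        int[] counts  = {2, 1, 3, 1, 3, 1, 1, 1};
        int[] lengths = {3, 3, 2, 4, 2, 4, 4, 4};
        System.out.println(totalBits(counts, lengths) + " bits");   // prints "37 bits"
    }
}
```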

The code with comments above should be reasonably understandable. If there are any questions I can describe what is going on. But basically the routine that counts occurrences of each alphabet symbol is as it was before. Unfortunately that is complicated a bit as I have to process the length-distance pointers to determine the encoding for the tally.

Next we create a list of leaves so we can combine the two with the lowest frequency of occurrence. Here’s where we simply increment the bit lengths for node members. The leaf list distills down as it would for the traditional Huffman process.

Next I can reinstate the LZ77 compressor and run this on real data. Then we can take the process further.

We can now use the procedure detailed in the DEFLATE specification to convert assigned bit lengths into binary codes for compression. We don’t need to build a tree.

First we tally the number of symbols in our alphabet that use each bit length.

        // count the occurrence of each bit length
        int[] bits = new int[19];
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0)
                bits[sym_len[n]]++;
        }

With this we can calculate the starting binary code for each bit length. Basically for each bit length we reserve N codes and use the next as a prefix for subsequent bit lengths.

        // determine starting bit code for each bit length
        int[] start = new int[19];
        int c = 0;
        for (int n = 0; n < 19; n++) {
            start[n] = c;
            c = (c + bits[n]) << 1;
        }

This gives us the correct first codes as we see here.

bruce_dev /> jtest flash/gogo.dat
bit length 2 count 2 first code 00
bit length 3 count 2 first code 100
bit length 4 count 4 first code 1100

Now we use these starting codes in assigning the binary codes to each symbol.

        // assign codes to used alphabet symbols
        int[] code = new int[286];
        for (int n = 0; n < 286; n++) {
            if (sym_len[n] > 0)
                code[n] = start[sym_len[n]]++;
        }

The results are displayed by the attached program.

bruce_dev /> jtest flash/gogo.dat
bit length 2 count 2 first code 00
bit length 3 count 2 first code 100
bit length 4 count 4 first code 1100

0x020 ' ' count=2 optimal bits 3 100
0x065 'e' count=1 optimal bits 3 101
0x067 'g' count=3 optimal bits 2 00
0x068 'h' count=1 optimal bits 4 1100
0x06f 'o' count=3 optimal bits 2 01
0x070 'p' count=1 optimal bits 4 1101
0x072 'r' count=1 optimal bits 4 1110
0x073 's' count=1 optimal bits 4 1111

Processing 0.654 seconds.
Source 13 bytes.
Result 13 bytes.
Ratio 1.00:1

bruce_dev />

This gives us all of the codes that we need to compress our data. In this case it is for the example string “go go gophers”. Happily we did not need to build any tree structure. And, this Huffman coding is compatible with the DEFLATE specification. We can move forward with the shorthand.
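For anyone who wants to try the three steps outside of the breadboard, here they are pulled together into one small class. Treat it as a sketch: the class name, the MAX_BITS constant and the toBinary helper are my own additions, and the bit lengths are simply typed in from the run above rather than computed.

```java
// Sketch: assign DEFLATE-compatible codes from bit lengths alone (no tree).
class CanonicalCodes {

    static final int MAX_BITS = 15;   // DEFLATE limit on code length

    // lengths[n] is the bit length for symbol n (0 = unused). Returns the code per symbol.
    static int[] assign(int[] lengths) {
        // count the occurrence of each bit length
        int[] bits = new int[MAX_BITS + 1];
        for (int len : lengths)
            if (len > 0)
                bits[len]++;

        // determine starting code for each bit length
        int[] start = new int[MAX_BITS + 1];
        int c = 0;
        for (int n = 0; n <= MAX_BITS; n++) {
            start[n] = c;
            c = (c + bits[n]) << 1;
        }

        // assign sequential codes within each bit length group
        int[] code = new int[lengths.length];
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] > 0)
                code[n] = start[lengths[n]]++;
        return code;
    }

    // Formats a code as a binary string padded to its bit length.
    static String toBinary(int code, int len) {
        StringBuilder sb = new StringBuilder();
        for (int b = len - 1; b >= 0; b--)
            sb.append((code >> b) & 1);
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] lengths = new int[286];
        lengths[' '] = 3; lengths['e'] = 3; lengths['g'] = 2; lengths['h'] = 4;
        lengths['o'] = 2; lengths['p'] = 4; lengths['r'] = 4; lengths['s'] = 4;
        int[] code = assign(lengths);
        for (int n = 0; n < lengths.length; n++)
            if (lengths[n] > 0)
                System.out.println((char) n + " " + toBinary(code[n], lengths[n]));
    }
}
```

The codes it prints should match the listing above, for example 00 for 'g' and 1111 for 's'.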

Curious? Here’s what I get using jniorboot.log. The content of that file has changed by the way as I have rebooted the JNIOR during the course of this topic. Here the LZ77 compression has also been re-enabled. The program however does not yet generate the bit stream compressed with these Huffman codes.

bruce_dev /> jtest jniorboot.log
bit length 4 count 3 first code 0000
bit length 5 count 8 first code 00110
bit length 6 count 20 first code 011100
bit length 7 count 22 first code 1100000
bit length 8 count 11 first code 11101100
bit length 9 count 18 first code 111101110

0x004 '.' count=1 optimal bits 9 111101110
0x00a '.' count=8 optimal bits 6 011100
0x00b '.' count=12 optimal bits 5 00110
0x00c '.' count=5 optimal bits 7 1100000
0x00d '.' count=5 optimal bits 7 1100001
0x00e '.' count=6 optimal bits 6 011101
0x00f '.' count=3 optimal bits 8 11101100
0x010 '.' count=5 optimal bits 7 1100010
0x011 '.' count=7 optimal bits 6 011110
0x012 '.' count=4 optimal bits 7 1100011
0x013 '.' count=4 optimal bits 7 1100100
0x020 ' ' count=41 optimal bits 4 0000
0x028 '(' count=1 optimal bits 9 111101111
0x029 ')' count=1 optimal bits 9 111110000
0x02c ',' count=6 optimal bits 6 011111
0x02d '-' count=5 optimal bits 7 1100101
0x02e '.' count=7 optimal bits 6 100000
0x02f '/' count=3 optimal bits 8 11101101
0x030 '0' count=18 optimal bits 5 00111
0x031 '1' count=15 optimal bits 5 01000
0x032 '2' count=9 optimal bits 6 100001
0x033 '3' count=8 optimal bits 6 100010
0x034 '4' count=9 optimal bits 6 100011
0x035 '5' count=5 optimal bits 7 1100110
0x036 '6' count=4 optimal bits 7 1100111
0x037 '7' count=5 optimal bits 7 1101000
0x038 '8' count=7 optimal bits 6 100100
0x039 '9' count=8 optimal bits 6 100101
0x03a ':' count=9 optimal bits 6 100110
0x041 'A' count=4 optimal bits 7 1101001
0x042 'B' count=2 optimal bits 8 11101110
0x043 'C' count=2 optimal bits 8 11101111
0x044 'D' count=1 optimal bits 9 111110001
0x045 'E' count=2 optimal bits 8 11110000
0x046 'F' count=1 optimal bits 9 111110010
0x047 'G' count=3 optimal bits 7 1101010
0x048 'H' count=1 optimal bits 9 111110011
0x049 'I' count=2 optimal bits 8 11110001
0x04a 'J' count=1 optimal bits 9 111110100
0x04d 'M' count=1 optimal bits 9 111110101
0x04e 'N' count=4 optimal bits 7 1101011
0x04f 'O' count=3 optimal bits 7 1101100
0x050 'P' count=5 optimal bits 7 1101101
0x052 'R' count=3 optimal bits 7 1101110
0x053 'S' count=4 optimal bits 7 1101111
0x054 'T' count=3 optimal bits 7 1110000
0x055 'U' count=1 optimal bits 9 111110110
0x057 'W' count=1 optimal bits 9 111110111
0x05b '[' count=1 optimal bits 9 111111000
0x05d ']' count=1 optimal bits 9 111111001
0x061 'a' count=8 optimal bits 6 100111
0x062 'b' count=5 optimal bits 7 1110001
0x063 'c' count=7 optimal bits 6 101000
0x064 'd' count=9 optimal bits 6 101001
0x065 'e' count=29 optimal bits 4 0001
0x066 'f' count=1 optimal bits 9 111111010
0x067 'g' count=2 optimal bits 8 11110010
0x068 'h' count=2 optimal bits 8 11110011
0x069 'i' count=14 optimal bits 5 01001
0x06b 'k' count=2 optimal bits 8 11110100
0x06c 'l' count=10 optimal bits 6 101010
0x06d 'm' count=6 optimal bits 6 101011
0x06e 'n' count=9 optimal bits 6 101100
0x06f 'o' count=17 optimal bits 5 01010
0x070 'p' count=5 optimal bits 7 1110010
0x072 'r' count=17 optimal bits 5 01011
0x073 's' count=11 optimal bits 6 101101
0x074 't' count=15 optimal bits 5 01100
0x075 'u' count=8 optimal bits 6 101110
0x076 'v' count=4 optimal bits 7 1110011
0x077 'w' count=2 optimal bits 8 11110101
0x079 'y' count=5 optimal bits 7 1110100
0x07a 'z' count=1 optimal bits 9 111111011
0x100 '.' count=1 optimal bits 9 111111100
0x101 '.' count=28 optimal bits 4 0010
0x102 '.' count=6 optimal bits 6 101111
0x103 '.' count=2 optimal bits 8 11110110
0x105 '.' count=1 optimal bits 9 111111101
0x10d '.' count=13 optimal bits 5 01101
0x10e '.' count=4 optimal bits 7 1110101
0x10f '.' count=1 optimal bits 9 111111110
0x110 '.' count=1 optimal bits 9 111111111

Processing 9.181 seconds.
Source 954 bytes.
Result 680 bytes.
Ratio 1.40:1

bruce_dev /> 

With the Huffman coding for DEFLATE there is one thing that we need to worry about. We need to limit the bit length to a maximum of 15. I suspect that exceeding it is not very likely. The limit exists because the bit length list for the alphabet is run-length encoded using a procedure that can only handle bit lengths of 0 to 15. Codes of 16, 17 and 18 are used to signal certain types of repetition. This is where I got the ’19’ I use in my breadboard code to dimension the bit length arrays. That need only be ’16’ as the 3 additional are repetition codes. If the Huffman coding results in a bit length exceeding 15 I will simply have to decrease the size of the block we are encoding until we are good to go.

So now that we have the ability to reasonably compress our data using LZ77 and then to compress it further with Huffman coding, we are ready to generate the DEFLATE formatted payload. It is time to tackle the “craziness” and “shorthand” that I have referred to. We can apply the run-length encoding and the second iteration of Huffman coding that are required. We are ready to generate the DEFLATE format compressed bit stream.

JANOS has been able to process JAR/ZIP files since early in its development. This was required to meet our goal of executing Java directly from the JAR files generated by the compiler. So we have been able to decipher the DEFLATE format. I just hadn’t needed to generate it. But the advantage to this is that there is already proven code parsing the DEFLATE structure. Referring to that helps to remove any question when trying to figure out how to generate such a structure.

Rather than drop this topic now that I have the LZ77 and Huffman procedures that I need, I’ll take a moment to review the final steps. Let me see if I can clarify some of it here. For our Java breadboard to produce something usable I would not only have to complete the DEFLATE programming but also encapsulate the result in the JAR/ZIP library format. That’s more effort than I need given that I will be doing just that at the C level within the OS and I need to get to that soon.

DEFLATE Formatted Data

When file data is compressed using DEFLATE and included in a JAR/ZIP library it is represented as a bit stream. That sounds simple enough but because our micro-controller works with bytes and retrieves bytes from a file data stream we need to be concerned with bit order and byte order.

Normally bits in a byte are ordered from right to left with the least significant bit (the first in the bit stream) on the right. Like this:

+--------+
|76543210|
+--------+

The 9th bit in the stream then comes from the next byte and bytes are in sequence in increasing memory addresses (or increasing file position). This order of bits seems only natural as it should.

So if we were to retrieve a 5-bit value from the stream we would obtain the right 5 bits from the first byte using the mask 0x1f. Placing that in a byte of its own would give us the numeric value of the 5-bit data element. The next 5-bit element would use the remaining 3 bits in the first byte and 2 from the right side of the next. We would likely be using a right shift before applying the mask to pull those together.
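A small reader shows the shift and mask at work. This is a sketch and not the JANOS code: the BitReader class and its field names are hypothetical, and it simply throws an array index exception if you read past the end of the data.

```java
// Sketch: pull n-bit data elements from a byte array, least-significant bit first.
class BitReader {
    private final byte[] data;
    private int pos;      // index of the next byte to load
    private int bitbuf;   // bits not yet consumed, next stream bit in position 0
    private int bitcnt;   // number of valid bits in bitbuf

    BitReader(byte[] data) {
        this.data = data;
    }

    // Returns the next n bits (n <= 16). The first bit read becomes the LSB of the value.
    int read(int n) {
        // load bytes above the bits we are still holding
        while (bitcnt < n) {
            bitbuf |= (data[pos++] & 0xff) << bitcnt;
            bitcnt += 8;
        }
        int value = bitbuf & ((1 << n) - 1);   // mask, e.g. 0x1f for 5 bits
        bitbuf >>>= n;                          // right shift past what we used
        bitcnt -= n;
        return value;
    }
}
```

With the two bytes 0xA3 and 0x02, the first read(5) returns 3 (the right 5 bits of 0xA3). The second read(5) returns 21, built from the remaining 3 bits of the first byte and the right 2 bits of the next.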

Huffman codes will seem to contradict this. These codes are packed starting with the most-significant bit of the code. In other words the most significant bit of the first Huffman code would be found in the rightmost bit of the first byte. Once you realize that you must process Huffman codes a bit at a time and that you are reading single bit data elements this order makes sense. Pointing out that the Huffman code appears in the byte in reverse order serves to confuse us. But you are reading it a single bit at a time using each bit to decide which direction to descend through the Huffman tree. That means that you need the code’s most-significant bit first. We also never know in advance how many bits we are going to need to reach the first leaf of the tree and thus our first encoded member of the alphabet.
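On the compression side the same rule means two write routines, one for ordinary data elements and one for Huffman codes. Here is a minimal sketch. The BitWriter class and its method names are my own invention for illustration, and it collects the output in memory rather than writing to a file.

```java
// Sketch: pack bits into bytes starting at the rightmost bit of each byte.
class BitWriter {
    private final java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    private int bitbuf;   // byte being assembled
    private int bitcnt;   // number of bits already placed in bitbuf

    // Appends an n-bit data element, least-significant bit first.
    void writeBits(int value, int n) {
        for (int b = 0; b < n; b++)
            writeBit((value >> b) & 1);
    }

    // Appends a Huffman code, most-significant bit first.
    void writeCode(int code, int len) {
        for (int b = len - 1; b >= 0; b--)
            writeBit((code >> b) & 1);
    }

    private void writeBit(int bit) {
        bitbuf |= bit << bitcnt;
        if (++bitcnt == 8) {
            out.write(bitbuf);
            bitbuf = 0;
            bitcnt = 0;
        }
    }

    // Pads any partial byte with zero bits and returns everything written so far.
    byte[] toByteArray() {
        if (bitcnt > 0) {
            out.write(bitbuf);
            bitbuf = 0;
            bitcnt = 0;
        }
        return out.toByteArray();
    }
}
```

Writing the 4-bit code 1100 followed by the 2-bit code 00 produces the single byte 0x03. The two leading 1 bits of the first code land in the two rightmost bits of the byte.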

Block Format

The DEFLATE bit stream is broken down into a series of 1 or more blocks of bits. In our case when we flush our 64KB LZ77 data buffer we are going to construct a single block. Since we are compressing it you would expect that it will contain much less than 512Kb (64KB x 8 bits). For large files we will likely need to flush our buffer multiple times creating a stream with multiple blocks.

Each block contains a 3-bit header. The first bit is called BFINAL and it is a 1 only for the last block in the stream. The next 2 bits are called BTYPE and these define the data encoding for this block. That means that we could use a different encoding for each block if we felt it to be beneficial. The 2 BTYPE bits gives us 4 options.

  00 - no compression
  01 - compressed with fixed Huffman codes
  10 - compressed with dynamic Huffman codes
  11 - reserved (error)

We have been working toward being able to generate blocks of BTYPE 10. We could have used the predefined Huffman tables in BTYPE 01 or not even bothered to compress using BTYPE 00. There may be times when our compression fails to reduce the size of our file data. In that case we could decide to include a block without compressing. But for now we will concern ourselves with BTYPE 10 that includes dynamic (not adaptive) Huffman tables (tables that change block to block).

So to start this defines our block so far. I’ll show the stream progressing from left to right with the number of bits for each element shown in parentheses.

 Bit 0         1         2         3         4         5         6         7
+---------+---------+---------+---------+---------+---------+---------+---------
| BFINAL  |     BTYPE (2)     |        Balance of Block . . .
+---------+---------+---------+---------+---------+---------+---------+---------

BFINAL set on the last block in the stream.

Now logically we know that somehow we need to get the Huffman table before we see compressed data so we can decompress that data. We also have seen that the Huffman table can be defined knowing the bit lengths for each of the symbols in our alphabet. In fact we expect that, since I made a big deal about having the right kind of Huffman table for DEFLATE that can be generated from just the bit lengths. So we are looking for that array of bit lengths. Here’s where the fun begins.

To start we know that not all of the 286 members of the alphabet will be used in the data. Some entries in that table will be assigned a 0 bit length. We have to assume that the majority of the literal bytes (0..255) will be represented in the data. We also know that the end-of-block code (256) will appear once. So we need an array of bit lengths at least 257 entries long. Beyond that we don’t know how many of the sequence length codes (257..285) will be used. But if some of the trailing codes aren’t used then the array doesn’t need to be 286 entries long. We just need the non-zero bit lengths. So this array will be 257 plus however many length codes we need to cover all of the non-zero ones.

The next element in the bit stream is HLIT. This is a 5-bit element that, when added to 257, defines the count of entries in the bit length array that will be provided. We will assume that the rest are 0. Since there are 29 length codes beyond the end-of-block code we need only know how many of those are included to know the size of the array. That can be passed in a 5-bit element.

       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----

HLIT + 257 tells us how many of the 286 bit lengths will be provided. But, don’t expect that those will be forthcoming. At least not right away.
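When compressing, HLIT falls out of the bit length array by trimming the trailing zeros. A minimal sketch follows, with a hypothetical class and method name. It assumes a 286-entry array like sym_len in the breadboard.

```java
// Sketch: derive HLIT from the literal/length bit length array.
class Hlit {

    // symLen has 286 entries; returns the 5-bit HLIT value (entries sent = HLIT + 257).
    static int hlit(int[] symLen) {
        int count = 286;
        // drop unused length codes from the end, but never go below 257 entries
        while (count > 257 && symLen[count - 1] == 0)
            count--;
        return count - 257;
    }
}
```

In the jniorboot.log run above the highest symbol used is 0x110 (272), so 273 bit lengths would be sent and HLIT would be 16.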

Next comes something that the breadboard program handled incorrectly. The distance codes for the length-distance pointers are Huffman coded using their own table. The length codes are Huffman coded using the same table as the literal data. This makes sense since when you retrieve the next code you don’t know if it is a literal or a length code. You do know that after the length code (and any extra bits) comes the distance code. So compressing those with their own table is a benefit. The DEFLATE specification (RFC 1951) states this but it isn’t all that obvious.

So wait! Now we need 2 Huffman tables and therefore 2 arrays of bit lengths. Yes we do.

Next we receive a 5-bit data element, HDIST, containing the number of distance codes minus 1. This defines the number of bit lengths to be supplied for the distance alphabet and the second Huffman table. The specification shows that there are 30 distance codes (0..29) but refers to 32 distance codes. It states that codes 30 and 31 will never occur in the data. These are perhaps reserved to allow for a larger sliding window in the future. The specification also notes that a single distance code with a bit length of 0 means that no distance codes are used at all and that the data is all literals. So at least one bit length is always supplied, and to pass a bit length array size of 1 to 32 in 5 bits the value is stored -1.

       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
           8         9        10        11        12        13        14        15
 -----+---------+---------+---------+---------+---------+---------+---------+---------+-----
      |                    HDIST (5)                    |
 -----+---------+---------+---------+---------+---------+---------+---------+---------+-----

Now we know the length of two arrays defining bit lengths for two alphabets. I had alluded to the fact that we would again use Huffman coding to compress the bit length array data. That is the case and so we need yet another Huffman table and therefore a third array of bit lengths. Will it never end??

The bit length alphabet includes 3 codes for repetition for a total of 19 codes. The size of the bit length array for this is conveyed in a 4-bit data element, HCLEN, containing the number of entries minus 4. Note though that this array defines bit lengths for the codes in a very specific order. This was devised to keep those codes that typically have 0 bit length at the end of the array so they can be omitted. The order of the codes is as follows:

16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

That means that we will have to arrange the bit lengths in this order before deciding how many to include in our stream. When reading these we would shuffle them into the proper location.
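Both directions of that shuffle are short. This is a sketch with names of my own choosing (CodeLengthOrder, toStreamOrder and so on). The ORDER table is the sequence listed above.

```java
// Sketch: reorder the 19 bit-length-alphabet lengths for the stream and back.
class CodeLengthOrder {

    static final int[] ORDER =
            {16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15};

    // clLen is indexed by code (0..18); the result is in stream order.
    static int[] toStreamOrder(int[] clLen) {
        int[] ordered = new int[19];
        for (int i = 0; i < 19; i++)
            ordered[i] = clLen[ORDER[i]];
        return ordered;
    }

    // Number of entries to send (4..19) after dropping trailing zeros. HCLEN is this minus 4.
    static int sendCount(int[] ordered) {
        int count = 19;
        while (count > 4 && ordered[count - 1] == 0)
            count--;
        return count;
    }

    // Reader side: place the received entries back at their natural index.
    static int[] fromStreamOrder(int[] received) {
        int[] clLen = new int[19];
        for (int i = 0; i < received.length; i++)
            clLen[ORDER[i]] = received[i];
        return clLen;
    }
}
```

If only codes 0, 8 and 18 have non-zero lengths, the last non-zero entry in stream order is the fifth one (code 8). Five entries are sent and HCLEN is 1.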

Our bit stream now looks like this.

       Bit 0         1         2         3         4         5         6         7
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
      | BFINAL  |     BTYPE (2)     |                     HLIT (5)                    |
      +---------+---------+---------+---------+---------+---------+---------+---------+-----
           8         9        10        11        12
 -----+---------+---------+---------+---------+---------+-----
      |                    HDIST (5)                    |
 -----+---------+---------+---------+---------+---------+-----
          13        14        15        16
 -----+---------+---------+---------+---------+-----
      |               HCLEN (4)               |
 -----+---------+---------+---------+---------+-----
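To make the layout concrete, here is a sketch that packs those five fields into an integer whose bit 0 is the first bit in the stream. The BlockHeader class and its methods are hypothetical. The parameters are the actual array sizes, and the stored fields are those sizes less 257, 1 and 4.

```java
// Sketch: the first 17 bits of a BTYPE 10 block, bit 0 first in the stream.
class BlockHeader {

    static int pack(boolean last, int litCount, int distCount, int clenCount) {
        int bits = last ? 1 : 0;            // BFINAL (1 bit)  at bit 0
        bits |= 2 << 1;                     // BTYPE 10 (2 bits) at bits 1..2
        bits |= (litCount - 257) << 3;      // HLIT  (5 bits)  at bits 3..7
        bits |= (distCount - 1) << 8;       // HDIST (5 bits)  at bits 8..12
        bits |= (clenCount - 4) << 13;      // HCLEN (4 bits)  at bits 13..16
        return bits;
    }

    static int hlit(int bits)  { return (bits >> 3) & 0x1f; }
    static int hdist(int bits) { return (bits >> 8) & 0x1f; }
    static int hclen(int bits) { return (bits >> 13) & 0x0f; }
}
```

For a final block with 273 literal/length entries, 30 distance entries and all 19 bit length codes, the first byte in the stream comes out as 0x85.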

Well there is a lot of information conveyed so far in just 17 bits. Next we will finally receive some bit length data. Following this we are supplied HCLEN + 4 data elements, each 3 bits (0..7), defining the code lengths for the Huffman table that we will use to generate bit length arrays for the other tables. Note that these are in the predefined order and must be sorted into the bit lengths for the 19 symbol alphabet. Now there are a variable number of these data elements and so I will no longer be able to number the bits nor can I show them all. HCLEN + 4 data elements, however many are required, follow in the stream.

          13        14        15        16        17        18        19   . . . 
 -----+---------+---------+---------+---------+---------+---------+---------+-----
      |               HCLEN (4)               |          CODE16 (3)         |
 -----+---------+---------+---------+---------+---------+---------+---------+-----

 -----+---------+---------+---------+---------+---------+---------+-----
      |          CODE17 (3)         |          CODE18 (3)         |
 -----+---------+---------+---------+---------+---------+---------+-----

 -----+---------+---------+---------+---------+-----
      |          CODE0 (3)          |       . . . 
 -----+---------+---------+---------+---------+-----

We now have a bit length array for our 19 symbol alphabet. At least we are familiar with this from the breadboard program. We can use the procedure to generate the Huffman code table. First we count the number of times each bit length is used. Then we calculate a starting code for each length. And then we assign sequential codes to the alphabet for each bit length group. So we can decipher Huffman codes. It’s a good thing too because now what follows in the bit stream are Huffman codes from this table.

Keeping up? We are talking about what is involved in decompressing a DEFLATE stream but really we are doing this so we know what to do when we compress our own data. So now is the time to consider dropping some bread crumbs because if you are going to create a DEFLATE stream you will need to find your way back through this.

Supposedly at this point we can read Huffman codes, locate the symbol they represent and retrieve the bit length array data we were originally looking for. Well almost. While some of the values we obtain will be bit lengths for the next entry in a bit length array there are 3 special codes each requiring a different action. The specification describes them as follows.

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

Here we see that when we encounter one of the codes 16, 17 or 18 we are required to pull 2, 3 or 7 additional bits respectively from the bit stream which define a repeat count.
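Expanding those codes back into a bit length array might look like the sketch below. To keep it self-contained it assumes the symbols have already been decoded and their extra bits already pulled from the stream into a parallel array. The class and parameter names are hypothetical.

```java
// Sketch: expand bit length symbols (0..18) into an array of bit lengths.
class CodeLengthRuns {

    // symbols[i] is a decoded code 0..18. extra[i] is the value of the extra bits
    // that followed it (ignored for 0..15). total is the number of lengths expected.
    static int[] expand(int[] symbols, int[] extra, int total) {
        int[] lengths = new int[total];
        int n = 0;
        for (int i = 0; i < symbols.length && n < total; i++) {
            int sym = symbols[i];
            if (sym < 16) {                 // a literal bit length
                lengths[n++] = sym;
                continue;
            }
            int value;
            int repeat;
            if (sym == 16) {                // copy previous length 3..6 times
                if (n == 0)
                    throw new IllegalArgumentException("code 16 with no previous length");
                value = lengths[n - 1];
                repeat = 3 + extra[i];
            } else if (sym == 17) {         // 3..10 zeros
                value = 0;
                repeat = 3 + extra[i];
            } else {                        // code 18: 11..138 zeros
                value = 0;
                repeat = 11 + extra[i];
            }
            if (n + repeat > total)
                throw new IllegalArgumentException("repeat runs past the end of the array");
            while (repeat-- > 0)
                lengths[n++] = value;
        }
        return lengths;
    }
}
```

Feeding it the example from the specification, codes 8, 16 (extra bits 11) and 16 (extra bits 10), gives 12 bit lengths of 8.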

First we will receive Huffman codes to define HLIT + 257 bit lengths for the literal Huffman table. Then we will receive codes defining HDIST + 1 bit lengths for the distance Huffman table. But don’t think the fun ends here.

HLIT and HDIST do not define the count of Huffman codes that follow. If you obtain a code that repeats a value that counts for that many bit lengths. That perhaps makes sense. But to just make things a little trickier, once you acquire the HLIT + 257 codes you immediately start defining the HDIST + 1 codes even if you are performing repetition. Yeah, a single repeat code can take you from one table to the next. If you are repeating some 0 bit lengths trailing in the HLIT bit length array you would just keep going to define any 0 bit lengths in the first part of the HDIST array. The specification says “code lengths form a single sequence of HLIT + HDIST + 258 values.”

When you are generating these Huffman codes of course you don’t have to force it to be a single sequence. You might just be wasting a few bits. Today that’s not a big deal but it certainly must have been 40 years ago.

So start pulling Huffman codes. Remember you process these 1 bit at a time so you are starting with the most-significant bit of some code. With each bit you are either descending through a tree looking for a leaf and the corresponding symbol or otherwise collecting the code looking for a match in a code list. The former is faster but the latter easier to structure in memory (no tree). You proceed to process each symbol to define a bit length or repetition code.
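The code list approach can be sketched in a few lines. This is an illustration only, with hypothetical names, and for simplicity the incoming bits are supplied as an array holding one bit per element in stream order. It throws an array index exception if the bits run out.

```java
// Sketch: decode Huffman codes one bit at a time by matching against a code list.
class CodeListDecoder {
    private final int[] code;   // Huffman code for each symbol
    private final int[] len;    // bit length for each symbol (0 = unused)
    private final int[] bits;   // incoming bits, one per element, in stream order
    private int pos;            // next bit to read

    CodeListDecoder(int[] code, int[] len, int[] bits) {
        this.code = code;
        this.len = len;
        this.bits = bits;
    }

    // Collects bits (most-significant first) until they match a code of that length.
    int next() {
        int value = 0;
        for (int n = 1; n <= 15; n++) {
            value = (value << 1) | bits[pos++];
            for (int s = 0; s < code.length; s++)
                if (len[s] == n && code[s] == value)
                    return s;
        }
        throw new IllegalStateException("no symbol matches the bits read");
    }
}
```

Using the “go go gophers” table with the symbols ' ', e, g, h, o, p, r, s numbered 0 to 7, the bits 00 01 1101 decode to 2, 4 and 5, which is “gop”.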

Now you have 2 tables of bit lengths and you know how to generate the Huffman codes for the associated alphabets. What follows next in our bit stream are the actual Huffman codes for the compressed data block. Each Huffman code will either define a literal value (byte) or a length code. The byte you just push into your uncompressed stream and the sliding window. In the case of a length code you would retrieve the additional bits defining the length of a matched sequence. You would then use the distance Huffman table for the next code that together with extra bits defines a distance back into the sliding window. Push the referenced string into your uncompressed stream and the sliding window. This is repeated until you encounter the end-of-block code (256). If BFINAL was set for the block you can save your now uncompressed data and you are done. Otherwise another block will follow.
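Pushing the referenced string has one subtlety worth showing: the distance may be smaller than the length, in which case the source overlaps the bytes being written. A minimal sketch, assuming the whole output lives in one flat buffer (copy_match() and its parameters are hypothetical names):

```c
#include <stddef.h>

/* Append len bytes that start dist bytes back from the current end of out[].
** Copying one byte at a time is deliberate: when dist is less than len the
** source overlaps the bytes being written, which is how DEFLATE expresses a
** repeating run. Returns the new output length, or 0 if the reference is
** invalid or the buffer (capacity cap) is too small. */
size_t copy_match(unsigned char *out, size_t outlen, size_t cap,
                  size_t dist, size_t len)
{
	size_t i;

	if (dist == 0 || dist > outlen || outlen + len > cap)
		return 0;

	for (i = 0; i < len; i++)
	{
		out[outlen] = out[outlen - dist];
		outlen++;
	}
	return outlen;
}
```

With "ab" already in the buffer, a reference of distance 2 and length 5 produces "abababa". A memcpy() would not be safe here because of the overlap.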

Now we follow this logic backwards to figure out for our own compression effort what we need to do to generate the proper DEFLATE format for our data.

Okay, feel free to post questions, comments, corrections or whatever. I would be curious to know if I have helped anyone. I have written these posts as a form of review and preparation for myself. I am now ready to generate some C code in JANOS to make perhaps broad use of DEFLATE.

  • This will allow the existing JAR/ZIP JANOS command to create or modify compressed file collections. That will be helpful in applications that generate massive amounts of log data.
  • It will let JANOS create PNG graphics using drawing and plotting commands. This will allow us to easily display the data acquired by monitoring applications.
  • The WebServer can utilize DEFLATE to more efficiently transfer content. It can really help here as we already serve files (the DCP for example) directly out of a ZIP library. Whereas we presently decompress those files and send uncompressed content, the WebServer could forward the DEFLATE formatted data directly providing a bandwidth benefit.

The JANOS WebServer uniquely can locate and serve content directly from ZIP file collections. Generally files in a ZIP collection are compressed in DEFLATE format already. The WebServer can detect a browser’s ability to accept content in DEFLATE format directly and transfer the compressed content directly. Why spend the time to decompress before transfer?

Since this does not involve a DEFLATE compressor it was quick to implement. Starting with JANOS v1.6.4 (now in Beta) the WebServer will utilize DEFLATE content encoding when transferring files already compressed in that format provided that the browser can accept it. It works nicely.

I’ve been busy extending the JAR/ZIP command in JANOS to allow you to create, update and freshen an archive file. The first step was to just store files uncompressed. Getting all of that command line logic straight is enough of a headache. Once that was behind me I was ready to implement DEFLATE.

One difference in the JANOS implementation from the approach taken earlier in this thread is that I am going to work with buffers as opposed to a byte stream. Since the JNIOR uses a 64MB heap and generally consumes only a few MB of it I can load an entire file’s content in a memory buffer. Yeah, files on the JNIOR aren’t very large. This eliminates the queue approach to the sliding window. That helps with matching as it eliminates any need to test pointers for wrap around.

Where I used a bidirectional linked list before tracking occurrences of each byte value in the sliding window, I have gravitated to a series of queues tracking the last 256 matching bytes (or fewer if that be the case) in the 32KB preceding window. There is also no need now to keep any array of active matches since we are not streaming. So a pass through the appropriate queue for the current byte generally delivers a usable match and limits the search times in interest of speed. I get results within 2% or so of what the PC generates for the same file. This is certainly acceptable for JANOS.

I use the code length generator algorithm that I devised previously so as to not have to generate any Huffman tree physically. This for certain test files tended to hit the DEFLATE maximum code length limits. So I will describe a variation that seems to avoid that problem.

I will go through the implementation details here over the next day or so and maybe share some of the code. Remember that I am writing in C so…

DEFLATE Compressor Implementation (C Language)

The goal is to compress using DEFLATE a byte buffer filled with the entire contents of a file or other data. I want a simple function call like this:

FZIP_deflate(fbuff, filesize, &cbuff, &csize)

This function needs to return two things, a buffer containing the DEFLATE formatted bit stream and the compressed length. There are other ways to return these parameters but for JANOS which has to remain thread-safe (cannot use static variables) this works. This routine returns TRUE if we can successfully compress the data. It might return FALSE if in compressing the results we decide it isn’t going to be worth it. This function is used as follows in the JAR/ZIP command in JANOS.

				// optionally compress
				compressed = FALSE;
				compsize = filesize;
				if (filesize >= 128)
				{
					char *cbuff = NULL;
					uint32_t csize;
 
					// compress the content
					if (FZIP_deflate(fbuff, filesize, &cbuff, &csize))
					{
						compressed = TRUE;
						compsize = csize;
						MM_free(fbuff);
						fbuff = cbuff;
					}
					else
						MM_free(cbuff);
				}

Here we see that successful compression replaces the uncompressed buffer and modifies the compressed data size. It sets the compressed flag so the file can be properly saved in the JAR or ZIP archive. Like magic we have a compressed file!

Preface

Why am I doing this? I mean there is code out there and people have been able to construct archives and compress files literally for multiple decades. Why reinvent the wheel?

Well there are multiple reasons. First is a design goal for JANOS. This operating system uses no third-party written code. That sounds crazy but what it means is that there is no bug or performance issue that we cannot correct, change, improve or alter. And this can be done quickly. Every bit of code is understood, clearly written and documented. It is written for a single target and not littered with conditional compilation blocks which obfuscate everything. If you support an operating system you might see how you could be envious of this.

Another reason is educational. Now that maybe is selfish but if I am going to be able to fully debug something I need to fully understand it. We cannot tolerate making what seems like a simple bug correction which later turns out to break some other part of the system. The only way to guarantee that this risk is minimized is for me to know everything that is going on and exactly what is going on. Yeah there is a JVM in here. Yeah it does TLS v1.2 encryption. It’s been fun.

The real problem though is that it is difficult to find good and complete technical information on the net. Yes there is RFC 1951 defining DEFLATE. It does not tell you how to do it; it just tells you what you need to do. And some aspects of it are not clear until you encounter it (or recreate it) in action. It describes LZ77 but you don’t realize how difficult it is to implement without having it take 5 minutes to compress 100 KB.

There are numerous web pages discussing DEFLATE and some by reputable universities. These usually include a good discussion on Huffman coding. Yet I have found none that creates a Huffman table that actually meets the additional requirements for DEFLATE. If you are going to describe Huffman in connection with DEFLATE, shouldn’t it be compatible? Would it have helped if you actually had implemented DEFLATE before describing it?

The procedure to create a compatible Huffman tree is not described anywhere that I have found. Most don’t even mention that you need 3 separate Huffman tables (one for literals, one for distance codes and one for code lengths) and that there is a limit of a 15 bit code length for 2 of the tables and 7 bit for the third. Then they say only to “adjust the Huffman table” accordingly. So there is no procedure for generating a less than optimum Huffman tree meeting DEFLATE restrictions. I had to get creative myself.

Enough of my rant. The result really is that I have had to reinvent the wheel. I am not the only one to have done so. I am going to try to document it here for your edification.

Overview

First let me greatly over-simplify the process and provide an outline for the compression procedure.

  1. Perform efficient LZ77 scanning the uncompressed data byte-by-byte filling a 64 KB interim buffer with raw unmatched literal bytes and escaped length-distance references.
  2. Scan the interim LZ77 data creating two DEFLATE compatible Huffman tables, one for literals and length codes combined and one for distance codes.
  3. Assign code lengths (15 bits max) to the used alphabet for both tables.
  4. Determine the size of the alphabet required for the length and distance Huffman tables.
  5. Combine code lengths into a single run-length compressed array.
  6. Determine the DEFLATE compatible Huffman table needed to code this compressed code length array.
  7. Assign code lengths (7 bits max) to the used alphabet for code lengths.
  8. Sort the resulting code lengths for this 3rd Huffman table into the unique order specified for DEFLATE and determine the length of the array (trim trailing zeroes).
  9. Output the block header marking only the last as BFINAL and output the alphabet sizes.
  10. Output the reordered code lengths.
  11. Output the Huffman codes compressing the run-length encoded combined code length array. Insert extra bits where required.
  12. Output the Huffman codes compressing the LZ77 data. Use the literal table for literals and sequence lengths. Use the distance table for distance codes. Insert extra bits where required.
  13. Output end-of-block code.
  14. If not the final block keep the bit stream going and continue LZ77 at step #1.
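Steps 9 through 13 all come down to pushing bits into the stream, and DEFLATE has two packing rules: header fields and extra bits go in least-significant bit first, while Huffman codes go in most-significant bit first. The sketch below shows both along with the dynamic block header from step 9. The bitwriter structure and function names are hypothetical and are not the JANOS bitstream_t.

```c
#include <stdint.h>

struct bitwriter {
	uint8_t *buf;       // output buffer, assumed zero-filled
	uint32_t bitpos;    // number of bits written so far
};

/* Write the low n bits of value, least-significant bit first. Used for
** header fields and extra bits. */
void put_bits(struct bitwriter *w, uint32_t value, int n)
{
	while (n-- > 0)
	{
		if (value & 1)
			w->buf[w->bitpos >> 3] |= (uint8_t)(1u << (w->bitpos & 7));
		value >>= 1;
		w->bitpos++;
	}
}

/* Write an n-bit Huffman code, most-significant bit first. */
void put_code(struct bitwriter *w, uint32_t code, int n)
{
	while (n-- > 0)
		put_bits(w, (code >> n) & 1, 1);
}

/* Write the header of a dynamic Huffman block (BTYPE = 2). nlit, ndist and
** nclen are the counts of code lengths that will follow for each table. */
void put_block_header(struct bitwriter *w, int final,
                      int nlit, int ndist, int nclen)
{
	put_bits(w, final ? 1 : 0, 1);    // BFINAL
	put_bits(w, 2, 2);                // BTYPE
	put_bits(w, nlit - 257, 5);       // HLIT
	put_bits(w, ndist - 1, 5);        // HDIST
	put_bits(w, nclen - 4, 4);        // HCLEN
}
```

Writing bit by bit is slow but easy to reason about. A production routine would more likely accumulate bits in a word and flush whole bytes.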

Uh. That about summarizes it. All we can do is to push through this step by step. It amounts to two phases. The first compresses the data using LZ77 and determines the Huffman coding requirements. The second outputs a bit stream encoding the Huffman tables and then the actual data. We first determine everything that we need for the bit stream and then generate the bit stream itself.

Time-Efficient LZ77 Compression

First off let me note that we end up trading off compression ratio for processing speed. As we had experimented earlier in Java we could discover every possible matching sequence for a block of data and then analyze those matches selecting the optimum set. This arguably would create the best compression possible. This is certainly doable if we are running on a multiple core GHz processor and it is coded carefully. Still it would be lengthy and possibly not appropriate even then for some applications. The gain in compression ratio is expected to be only slight and not worth the processing cycles. It is certainly not critical for JANOS. So we will not go down this path.

Another approach is what is called lazy matching. Here we are concerned that a matched sequence might prevent a longer matching sequence starting in the next byte or two from being used. In the analysis it appears that 1 or 2 bytes may be unnecessarily forced into the output stream in these situations. Those may be rare but for certain types of data it could be more of a concern. Again the gain in compression ratio if we were to take the time to perform lazy matching is assumed not to be necessary for JNIOR.

As a result we are just going to go ahead and perform straight up sequence match detection for each byte position in the data. Even with this the amount of processing involved prevents us from using any kind of brute force scanning. Imagine how much processing would be involved if for each byte in the file we have to compare it directly against 32 KB of bytes in the preceding sliding window. For a large file this starts to become a very large number of processor cycles. It gets even worse when bytes match and you need to check following bytes to determine the usability of the sequence.

I had implemented the brute force search with no trickery at first. Small files produced proper LZ77 output in a short time but a 20 KB JSON file took almost 30 seconds (JNIOR remember). A large binary file basically stalled the system. The JSON performed more poorly due to the high occurrence of certain characters such as curly braces, quotes and colons. That code has long been discarded or I would include it here for better understanding.

In the prior Java experimentation I used a linked list to track the occurrences of matching byte values. This eliminated the need to scan the entire 32 KB sliding window, saving time. For the JANOS implementation I decided to use a pointer queue for each byte value. Each queue holds a maximum of 256 pointers. So basically we would test only the last 256 matching byte positions for usable sequences. This might miss some good sequence matches deep in the sliding window for very frequently occurring byte values but not for those appearing less often. Again it’s a trade off. It is an interesting approach.

I had figured that I could adjust the depth of these pointer queues. Since there are 256 possible literal values, each with 256 pointers of 4 bytes, this results in a matrix of 65,536 pointers or 256 KB of memory. There is room for that in the JANOS heap. Increasing the depth increases the memory requirement as well as the time spent in sequence detection. I was pleased with the results at a depth of 256. Perhaps later I will conduct some experiments plotting the effects of this parameter on compression ratio and execution time.

I will present the resulting routine in the next post.

Here is the resulting routine. This handles only the LZ77 leaving all of the Huffman to the routine responsible for flushing the interim buffer. Note that I have made both the SLIDING_WINDOW and DEPTH parameters adjustable through the Registry. I will use this later to measure performance.

/* -- function ---------------------------------------------------------------
** FZIP_deflate()
**
** Compresses the supplied buffer.
**
** -------------------------------------------------------------------------*/
int FZIP_deflate(char *inb, uint32_t insize, char **outbuf, uint32_t *outsize)
{
	char *obuf;
	int optr;
	int err = FALSE;
	int curptr, len, seqlen;
	char *seqptr, *p1, *p2, *s1, *s2;
	struct bitstream_t stream;
	int *matrix, *mat, *track;
	int ch, trk;
	int window, depth;
 
	// obtain sliding window size
	window = REG_getRegistryInteger("Zip/Window", 16384);
	if (window < 2048)
		window = 2048;
	else if (window > 32768)
		window = 32768;
 
	// obtain tracking queue depth
	depth = REG_getRegistryInteger("Zip/Depth", 256);
	if (depth < 16)
		depth = 16;
	else if (depth > 1024)
		depth = 1024;
 
	// check call
	if (outbuf == NULL || outsize == NULL)
		return (FALSE);
 
	// initialize bit stream
	memset(&stream, 0, sizeof(struct bitstream_t));
	stream.buffer = MM_alloc(insize + 1024, &_bufflush);
 
	// create an output buffer
	obuf = MM_alloc(64 * 1024, &FZIP_deflate);
	optr = 0;
 
	// initialize matrix
	matrix = MM_alloc(256 * depth * sizeof(int), &FZIP_deflate);
	track = MM_alloc(256 * sizeof(int), &FZIP_deflate);
 
	// process uncompressed stream byte-by-byte
	curptr = 0;
	while (curptr < insize)
	{
		// get current byte value
		ch = inb[curptr];
 
		// Locate best match. This is the longest match located the closest to the curPtr. This
		//  is intended to be fast at the slight cost in compression ratio. We do not handle lazy
		//  matches or block optimization (selective matching). Only seqlen of 3 or more matter so
		//  we initialize seqlen to 2 to limit unnecessary best match updates. Try to limit cycles in
		//  this loop.
		mat = &matrix[depth * ch];
		trk = track[ch] - 1;
		if (trk < 0)
			trk = depth - 1;
		seqlen = 2;
		p2 = &inb[curptr];
		while (trk != track[ch])
		{
			if (mat[trk] == 0 || curptr - mat[trk] >= window)
				break;
 
			s1 = p1 = &inb[mat[trk] - 1];
			s2 = p2;
			while (*s1 == *s2)
				s1++, s2++;
 
			// check for improved match
			len = s1 - p1;
			if (len > seqlen)
			{
				seqptr = p1;
				seqlen = len;
			}
 
			trk--;
			if (trk < 0)
				trk = depth - 1;
		}
 
		// track the character
		mat[track[ch]] = curptr + 1;
		track[ch]++;
		if (track[ch] >= depth)
			track[ch] = 0;
 
		// check validity (match past end of buffer)
		if (curptr + seqlen > insize)
			seqlen = insize - curptr;
 
		// If we have a good sequence we output a pointer and advance curPtr
		if (seqlen >= 3)
		{
			// check maximum allowable sequence which is 258 bytes but we reserve one
			//  for 0xff escaping
			if (seqlen > 257)
				seqlen = 257;
 
			// escape length-distance pointer
			obuf[optr++] = 0xff;
			obuf[optr++] = seqlen - 3;
			*(short *)&obuf[optr] = p2 - seqptr;
			optr += 2;
 
			// advance curPtr
			curptr += seqlen;
			PROC_yield();
		}
 
		// otherwise we output the raw uncompressed byte and keep searching
		else
		{
			// escape 0xff
			if (*p2 == 0xff)
				obuf[optr++] = 0xff;
			obuf[optr++] = *p2;
			curptr++;
		}
 
		// flush output buffer as needed. Because we escape 0xff our length-pointer encoding
		//  requires 4 bytes. This at times replaces a 3-byte match and so the compression
		//  ratio in this buffer is compromised. That will be corrected as we move into
		//  Huffman coding. Blocks will then be slightly less than 64KB. We also want to
		//  flush this before over-running it.
		if (optr > 65530)
		{
			if (!_bufflush(obuf, optr, &stream, FALSE))
				err = TRUE;
 
			// if compression seems fruitless
			if (stream.length > curptr)
				err = TRUE;
 
			optr = 0;
		}
 
		if (err)
			break;
	}
 
	// Flush remaining data
	if (!err && !_bufflush(obuf, optr, &stream, TRUE))
		err = TRUE;
 
	// if compression was fruitless
	if (stream.length > insize)
		err = TRUE;
 
	// clean up (the interim buffer is also released here)
	MM_free(track);
	MM_free(matrix);
	MM_free(obuf);
 
	// return
	*outbuf = stream.buffer;
	*outsize = stream.length;
	return (!err);
}

So the preliminaries are over by line 50 in this code. You might notice that when JANOS allocates memory it retains a reference pointer. That is used to locate memory leaks among other things. I can see quickly where any block is allocated.

The main loop which processes byte-by-byte through the uncompressed data is done by line 150. After that we merely flush any partially compressed block and are done.

From lines 55 to 95 we search for the best match. From 95 to about 130 we either output the sequence length-distance reference or the raw unmatched data byte. After that we check to see if we need to flush the interim buffer.

The magic occurs in the search where we employ the pointer matrix. The track array keeps the current position in each byte value’s queue. This is where we add the pointer to the current byte once we’ve searched it. The search runs backward through the queue either until we’ve checked all of the pointers (wraps back to the starting position) or we run out of the sliding window (pointer reference too far back in the data). This checks for closer matches first. When a longer sequence of characters matches the current position we update our best match.
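Stripped of the matching itself, one of these queues behaves like the small ring below. This is a simplified sketch rather than the JANOS code: it handles a single byte value, uses a depth of 4 for illustration, and the names are hypothetical. It does keep the convention from the routine above of storing position + 1 so that a zero-filled slot means empty.

```c
#define DEPTH 4    // small for illustration; the routine above defaults to 256

struct posqueue {
	int slot[DEPTH];    // stored as position + 1 so that 0 means empty
	int next;           // index where the next position will be written
};

/* Record a new position, overwriting the oldest once the ring is full. */
void pq_push(struct posqueue *q, int pos)
{
	q->slot[q->next] = pos + 1;
	if (++q->next >= DEPTH)
		q->next = 0;
}

/* Collect stored positions newest first. Stops at the first empty slot or
** at the first position whose distance from curpos exceeds the window.
** Returns the number of positions placed in out[] (at most DEPTH). */
int pq_recent(const struct posqueue *q, int curpos, int window, int *out)
{
	int n = 0, i = q->next, k;

	for (k = 0; k < DEPTH; k++)
	{
		// step backward with wrap-around
		if (--i < 0)
			i = DEPTH - 1;
		if (q->slot[i] == 0 || curpos - (q->slot[i] - 1) > window)
			break;
		out[n++] = q->slot[i] - 1;
	}
	return n;
}
```

Because entries are visited newest first, the first out-of-window position ends the walk. Everything older in that queue is farther away still.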

After that if we have a sequence of 3 or more bytes we output the length-distance reference otherwise we output the current byte and move on to process the next.

I know that I can optimize this some more. I haven’t gone through that step of trying to minimize processor cycles. We do use the RX compiler optimization. But… this is my approach for LZ77. There is no hash table or confusing prefix-suffix tricks. It seems to run fast enough at least for our needs at this point. The LZ77 search is where all of the time is consumed.

Next we will look at what happens when the interim buffer is flushed.

The FZIP_deflate() routine involves 90+ percent of the processing time and handles only step #1 in the prior outline. The buffer flush routine handles all of the remaining items.

We will be creating three different Huffman tables. These amount to a structure for each alphabet symbol.

struct huff_t {
	uint16_t code;
	uint16_t len;
	uint16_t cnt;
	uint16_t link;
};

Here the cnt will represent the symbol’s frequency. It is the count of occurrences of the symbol’s value in the data. The len will eventually be the code length in bits assigned to the symbol. The code will hold the bit pattern to be used in coding. And the link will be used in the code length determination routine.

The _bufflush() routine is called with a buffer of LZ77 compressed data. This includes raw data bytes that could not be included in a sequence and length-distance references for matches. Since data can take on all byte values I use a form of escaping. The escape character is 0xFF and a 0xFF in the data is represented by repeating the escape (e.g. 0xff 0xff). Otherwise the next byte is the length of the sequence -3 and the following short value the distance. To make this work I disallow the sequence length of 258 since that would be confused with an escaped 0xFF. I don’t see this as having any significant impact. Later I could find a way around that but for now it works.
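The escaping can be summarized with a small reader that pulls one item at a time out of the interim buffer. This is a sketch with hypothetical names (struct token, next_token()). It assumes the distance is stored high byte first, which is how _bufflush() reads it back.

```c
#include <stdint.h>

struct token {
	int is_match;    // 0 = literal byte, 1 = length-distance reference
	int literal;     // byte value when is_match == 0, otherwise -1
	int length;      // 3..257 when is_match == 1
	int distance;    // 1..32768 when is_match == 1
};

/* Read one token from the escaped interim buffer starting at buf[*pos] and
** advance *pos past it. Returns 0 on success or -1 if the buffer ends
** before a complete token. */
int next_token(const uint8_t *buf, int size, int *pos, struct token *t)
{
	int p = *pos;

	if (p >= size)
		return -1;
	t->is_match = 0;
	t->length = 0;
	t->distance = 0;

	// an ordinary literal
	if (buf[p] != 0xff)
	{
		t->literal = buf[p];
		*pos = p + 1;
		return 0;
	}

	// a doubled escape is a literal 0xff
	if (p + 1 >= size)
		return -1;
	if (buf[p + 1] == 0xff)
	{
		t->literal = 0xff;
		*pos = p + 2;
		return 0;
	}

	// otherwise: length - 3, then a 16-bit distance (high byte first, assumed)
	if (p + 3 >= size)
		return -1;
	t->is_match = 1;
	t->literal = -1;
	t->length = buf[p + 1] + 3;
	t->distance = (buf[p + 2] << 8) | buf[p + 3];
	*pos = p + 4;
	return 0;
}
```

A length byte of 0xff would mean a sequence of 258, and that is exactly the value that collides with the doubled escape. This is why the matcher caps sequences at 257.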

// Routine applies DEFLATE style Huffman coding to the buffer content.
static int _bufflush(char *obuf, int osize, struct bitstream_t *stream, int final)
{
	struct huff_t *littbl;
	struct huff_t *dsttbl;
	struct huff_t *cnttbl;
	int n, len, c, code, dst, lastc, ncnt;
	int *cntlist, *litcnts, *dstcnts, *bitcnts, *bitlens;
	int hlit, hdist, hclen;
	int *startcd;
	int totdist = 0;
	int err = FALSE;
 
	// Now we need to construct two Huffman trees (although I am going to avoid
	//  actual trees). One for the literal data and one for the distance codes.
	//  Note that extra bits are just extra bits inserted in the stream.
	littbl = MM_alloc(286 * sizeof(struct huff_t), &_bufflush);
	dsttbl = MM_alloc(30 * sizeof(struct huff_t), &_bufflush);
 
	// Not a loop. This allows the use of break.
	for (;;)
	{
 
		// Now we analyze the data to determine frequencies. Note that this is complicated
		//  just a bit because of the escaping that I have had to use. I will have to
		//  temporarily decode length and distance encoding. We'll have to do that again
		//  later when we stream the coding. We will also use one end-of-block code so we
		//  virtually count it first.
		littbl[256].cnt = 1;
		for (n = 0; n < osize; n++)
		{
			// tally literal if not escaped
			if (obuf[n] != 0xff)
				littbl[obuf[n]].cnt++;
			else
			{
				// check and tally escaped 0xff
				if (obuf[++n] == 0xff)
					littbl[0xff].cnt++;
				else
				{
					totdist++;
 
					// table defined above
					//static const int lcode_ofs[29] = {
					//	3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
					//	35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
					//};
 
					// determine required length code for lengths (3..258). This code is
					//  coded in the literal table.
					len = (obuf[n++] & 0xff) + 3;
					for (c = 0; c < 29; c++)
						if (lcode_ofs[c] > len)
							break;
					code = 256 + c;
					littbl[code].cnt++;
 
					// table defined above
					//static const int dcode_ofs[30] = {
					//	1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
					//	257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
					//	8193, 12289, 16385, 24577
					//};
 
					// determine required distance code for distances (1..32768). This code is
					//  coded in the distance table.
					dst = (obuf[n++] & 0xff) << 8;
					dst |= (obuf[n] & 0xff);
					for (c = 0; c < 30; c++)
						if (dcode_ofs[c] > dst)
							break;
					code = c - 1;
					dsttbl[code].cnt++;
				}
			}
		}

So here we create two of the Huffman tables, one for the literal alphabet (0..285) and one for the distance alphabet (0..29). Note that when JANOS allocates memory it is zero filled.

Don’t be confused by my use of for(;;) { }. This is not an infinite loop. In fact it is not a loop at all. Rather it allows me to exit the procedure at any point using break; and I just have to remember to place a break; at the very end. There are other ways to achieve the same thing.

The first step is to determine the frequency of symbols from both alphabets. Here we scan the supplied data and count the literals. The escaped length-distance references are translated temporarily into their length codes and distance codes. Those are tallied in the appropriate table. The extra bits are ignored. Length codes are combined with literal data since when they are read you don’t know which it will be. The distance codes use their own alphabet. We will have to do this same translation again later when we encode the references for output. Then, of course, we will insert the required extra bits.
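The translation from a length or distance to its code can be pulled out into two small helpers, which also shows where the extra bits come from. This sketch reuses the lcode_ofs and dcode_ofs tables quoted in the listing above. The helper names and the extra-bit formulas are my own restatement of the tables in RFC 1951 and not taken from the JANOS source.

```c
static const int lcode_ofs[29] = {
	3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
	35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};

static const int dcode_ofs[30] = {
	1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
	257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
	8193, 12289, 16385, 24577
};

/* Map a match length (3..258) to its literal/length symbol (257..285).
** *xbits receives the number of extra bits and *xval their value. */
int length_code(int len, int *xbits, int *xval)
{
	int c;

	for (c = 0; c < 29; c++)
		if (lcode_ofs[c] > len)
			break;
	c--;    // index of the last base that does not exceed len

	// codes 257..264 and 285 carry no extra bits, then 1 more per group of 4
	*xbits = (c < 8 || c == 28) ? 0 : (c - 4) / 4;
	*xval = len - lcode_ofs[c];
	return 257 + c;
}

/* Map a match distance (1..32768) to its distance symbol (0..29). */
int distance_code(int dst, int *xbits, int *xval)
{
	int c;

	for (c = 0; c < 30; c++)
		if (dcode_ofs[c] > dst)
			break;
	c--;

	// codes 0..3 carry no extra bits, then 1 more per pair of codes
	*xbits = (c < 4) ? 0 : (c - 2) / 2;
	*xval = dst - dcode_ofs[c];
	return c;
}
```

For the frequency tally only the returned symbol matters. The extra bit count and value are needed later when the references are actually written to the bit stream.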

Note that this also tallies one end-of-block code (0x100) as we will be using that.

If I were to dump these two tables after this they may look something like this. These are just the symbols that occur in a particular set of data – the jniorsys.log file. This is the symbol value followed by its count.

 0x00a 2
 0x00d 2
 0x020 70
 0x027 4
 0x028 11
 0x029 2
 0x02a 2
 0x02b 7
 0x02c 7
 0x02d 20
 0x02e 133
 0x02f 16
 0x030 105
 0x031 153
 0x032 124
 0x033 124
 0x034 103
 0x035 110
 0x036 97
 0x037 90
 0x038 93
 0x039 84
 0x03a 74
 0x03c 1
 0x03d 3
 0x03e 3
 0x041 5
 0x043 4
 0x044 1
 0x046 1
 0x048 1
 0x04a 3
 0x04c 2
 0x04d 4
 0x04e 4
 0x04f 4
 0x050 8
 0x052 8
 0x053 4
 0x054 4
 0x055 1
 0x057 3
 0x05a 1
 0x05f 3
 0x061 34
 0x062 4
 0x063 15
 0x064 26
 0x065 37
 0x066 9
 0x067 9
 0x068 8
 0x069 36
 0x06a 6
 0x06b 3
 0x06c 21
 0x06d 13
 0x06e 36
 0x06f 41
 0x070 24
 0x071 1
 0x072 29
 0x073 20
 0x074 29
 0x075 11
 0x076 6
 0x077 7
 0x078 3
 0x079 6
 0x07a 4
 0x100 1
 0x101 962
 0x102 457
 0x103 136
 0x104 127
 0x105 55
 0x106 71
 0x107 47
 0x108 14
 0x109 67
 0x10a 208
 0x10b 103
 0x10c 40
 0x10d 52
 0x10e 44
 0x10f 15
 0x110 42
 0x111 180
 0x112 287
 0x113 30
 0x114 5
 0x115 12
 0x116 6

 0x002 1
 0x003 2
 0x008 2
 0x009 2
 0x00a 13
 0x00b 150
 0x00c 166
 0x00d 40
 0x00e 127
 0x00f 63
 0x010 157
 0x011 112
 0x012 155
 0x013 116
 0x014 198
 0x015 169
 0x016 240
 0x017 197
 0x018 288
 0x019 215
 0x01a 321
 0x01b 226

We need to assign code lengths to these alphabet symbols. Here there are two tables and later we will process a third. So the procedure is handled by a separate routine.

The _bitlength() routine assigns code lengths creating possibly a slightly less than optimal Huffman table that does not exceed a 15-bit maximum. The assignments must make the Huffman tables compatible with the DEFLATE requirements.

Huffman Tables for DEFLATE

The buffer full of data that we have and which has been compressed using LZ77 will be compressed further using Huffman coding. The DEFLATE format specifies two separate Huffman code sets. One to encode both the literal bytes in the data (0..255), the end-of-block code (256), and the sequence match length codes (257..285). The second Huffman code set will encode the distance codes (0..29). We can use a separate table for that because we know when we are reading a distance code as one always follows a length code. We never know whether we are reading a literal or a length code so those need to be decoded the same way and therefore from the same table.

Previously we scanned the data and counted the occurrences of each symbol. We now know which symbols occur in the data and how frequently. We have defined our alphabets for each Huffman table. Now we need to create the Huffman trees themselves. This is where things get tricky.

Creating a Huffman tree is not very difficult. But creating a Huffman tree that is compatible with the DEFLATE format is quite another thing altogether. The DEFLATE specification dictates that the Huffman trees must meet two additional rules. In fact they need to adhere to three rules. The third is mentioned later in the specification.

  1. All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent.
  2. Shorter codes lexicographically precede longer codes.
  3. The bit length cannot exceed 15 bits for the literal and distance code sets. It cannot exceed 7 bits for the code length set (comes into play much later).

The first trick is to not generate a tree at all. If you create a tree using the standard Huffman approach you are almost guaranteed to not have a tree that is usable for DEFLATE. All you need from that effort are the bit lengths that end up being assigned to each symbol. You can get those from the same procedure without dealing with right and left links and an actual tree structure. You then use the procedure defined in the DEFLATE specification to create the compatible tree.
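The procedure in the specification (RFC 1951, section 3.2.2) needs nothing but the bit lengths. A sketch of it in C follows. The function name and array layout are my own, but the steps are those of the RFC: count the codes of each length, compute the first code for each length, then hand out codes in symbol order.

```c
#include <stdint.h>

#define MAX_BITS 15

/* Assign DEFLATE-compatible codes from bit lengths. len[n] is the bit length
** of symbol n (0 = unused) and must not exceed MAX_BITS. code[n] receives
** the code, to be sent most-significant bit first. */
void assign_codes(const uint8_t *len, uint16_t *code, int nsym)
{
	int bl_count[MAX_BITS + 1] = {0};
	int next_code[MAX_BITS + 1] = {0};
	int bits, n, c = 0;

	// count the number of codes for each bit length
	for (n = 0; n < nsym; n++)
		bl_count[len[n]]++;
	bl_count[0] = 0;

	// find the numerically smallest code for each bit length
	for (bits = 1; bits <= MAX_BITS; bits++)
	{
		c = (c + bl_count[bits - 1]) << 1;
		next_code[bits] = c;
	}

	// assign consecutive codes to the symbols of each length in symbol order
	for (n = 0; n < nsym; n++)
	{
		if (len[n] != 0)
			code[n] = (uint16_t)next_code[len[n]]++;
		else
			code[n] = 0;
	}
}
```

Using the example from the RFC, the lengths (3, 3, 3, 3, 3, 2, 4, 4) for symbols A through H yield the codes 010, 011, 100, 101, 110, 00, 1110 and 1111.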

The standard Huffman approach is to take a list of all the symbols that occur in the data and sort it in descending frequency. Combine the rightmost two least frequent symbols into a node whose frequency is the total of the two symbols. Next re-sort the list so this new node repositions itself according to its combined frequency. Now repeat the process combining the next two rightmost entries which may be symbols (leaves) or previously created nodes. This continues until you have just one node which is the head of your tree.

We are not going to bother to build the tree structure. We are only going to keep a list of the symbols that fall beneath a node. We are also going to realize that the combination of the two rightmost entries in the sorted list merely increases by one the bit length of each of the new node’s member symbols. When we finally reach the point where there is only one node in the list we would have assigned a bit length to every symbol based on its frequency. That is all we need to then create the codes for Huffman coding in DEFLATE format using the procedure for that outlined in the specification.
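A minimal sketch of that idea is below. It is not the JANOS _bitlength() routine and it does nothing about the 15-bit limit. It keeps each node as a linked list of member symbols (my guess at what a link field could be used for), repeatedly merges the two lowest-frequency groups, and adds one bit to every member each time. It finds the two smallest groups by scanning instead of re-sorting, which gives the same lengths up to tie-breaking.

```c
#define MAXSYM 286

/* Assign Huffman code lengths without building a tree. cnt[] holds symbol
** frequencies and len[] receives the bit lengths (0 for unused symbols).
** nsym must not exceed MAXSYM. With only one used symbol its length stays
** 0, a case the caller would have to handle separately. */
void assign_lengths(const int *cnt, int *len, int nsym)
{
	int weight[MAXSYM], head[MAXSYM], link[MAXSYM];
	int ngroups = 0, i;

	// every used symbol starts as its own group
	for (i = 0; i < nsym; i++)
	{
		len[i] = 0;
		link[i] = -1;
		if (cnt[i] > 0)
		{
			weight[ngroups] = cnt[i];
			head[ngroups] = i;
			ngroups++;
		}
	}

	while (ngroups > 1)
	{
		int a = 0, b = 1, s, tail = -1;

		// locate the smallest (a) and second smallest (b) groups
		if (weight[b] < weight[a])
		{
			a = 1;
			b = 0;
		}
		for (i = 2; i < ngroups; i++)
		{
			if (weight[i] < weight[a])
			{
				b = a;
				a = i;
			}
			else if (weight[i] < weight[b])
				b = i;
		}

		// combining two groups moves every member one level deeper
		for (s = head[a]; s >= 0; s = link[s])
			len[s]++;
		for (s = head[b]; s >= 0; s = link[s])
		{
			len[s]++;
			tail = s;
		}

		// append group a's members to group b and total the frequency
		link[tail] = head[a];
		weight[b] += weight[a];

		// drop group a by moving the last group into its slot
		ngroups--;
		weight[a] = weight[ngroups];
		head[a] = head[ngroups];
	}
}
```

The resulting lengths can then be fed to the code assignment procedure from the specification. Nothing here prevents a length from exceeding 15 bits, which is the problem taken up next.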

Here is where things really get confusing. This process doesn’t always create a DEFLATE compatible Huffman code set. Sometimes the bit length will exceed 15 (or 7 for the table later). We need a procedure for dealing with that. It amounts to being able to create a less than optimal Huffman tree with bit lengths limited to a maximum. This was a puzzle but I have a way to get it done.

So next I’ll take us through examples.

Let’s use an example. Here we have an alphabet of 19 symbols (0..18). The data set consists of 150 of these and after counting the occurrences of each we have the following. To make discussion simpler I will assign these symbols uppercase names. On the right are the results of the tally for each.

 
A  0x000 4
B  0x001 0
C  0x002 0
D  0x003 2
E  0x004 6
F  0x005 3
G  0x006 2
H  0x007 2
J  0x008 53
K  0x009 26
L  0x00a 5
M  0x00b 4
N  0x00c 3
P  0x00d 1
Q  0x00e 1
R  0x00f 1
S  0x010 37
T  0x011 0
U  0x012 0

In the standard approach to creating a Huffman table we ignore the symbols that do not appear in the data and arrange the others in order of decreasing frequency.

Used symbols:
  A    D    E    F    G    H    J    K    L    M    N    P    Q    R    S 
  4    2    6    3    2    2   53   26    5    4    3    1    1    1   37

Sorted by decreasing frequency:
  J    S    K    E    L    A    M    F    N    D    G    H    P    Q    R
 53   37   26    6    5    4    4    3    3    2    2    2    1    1    1

Next we combine the lowest two frequency symbols into a new node with a combined total. I will name the nodes with lowercase characters just so you can track them. The shorter list is then re-sorted before proceeding to repeat. The process continues until there is only one node. Here I will show only the combining action for each step. We will get into more detail afterwards.

Starting set:
  J    S    K    E    L    A    M    F    N    D    G    H    P    Q    R
 53   37   26    6    5    4    4    3    3    2    2    2    1    1    1

Step #1 combines Q and R into (a) with new frequency of 2. The list is resorted.
  J    S    K    E    L    A    M    F    N    D    G    H   (a)   P
 53   37   26    6    5    4    4    3    3    2    2    2    2    1

Step #2 combines (a) and P into (b) with new frequency of 3.
  J    S    K    E    L    A    M    F    N   (b)   D    G    H
 53   37   26    6    5    4    4    3    3    3    2    2    2

Step #3 combines G and H into (c) with new frequency of 4.
  J    S    K    E    L    A    M   (c)   F    N   (b)   D
 53   37   26    6    5    4    4    4    3    3    3    2

Step #4 combines (b) and D into (d) with new frequency of 5.
  J    S    K    E    L   (d)   A    M   (c)   F    N
 53   37   26    6    5    5    4    4    4    3    3

Step #5 combines F and N into (e) with new frequency of 6.
  J    S    K    E   (e)   L   (d)   A    M   (c)
 53   37   26    6    6    5    5    4    4    4

Step #6 combines M and (c) into (f) with new frequency of 8.
  J    S    K   (f)   E   (e)   L   (d)   A
 53   37   26    8    6    6    5    5    4

Step #7 combines (d) and A into (g) with new frequency of 9.
  J    S    K   (g)  (f)   E   (e)   L
 53   37   26    9    8    6    6    5

Step #8 combines (e) and L into (h) with new frequency of 11.
  J    S    K   (h)  (g)  (f)   E
 53   37   26   11    9    8    6

Step #9 combines (f) and E into (i) with new frequency of 14.
  J    S    K   (i)  (h)  (g)
 53   37   26   14   11    9

Step #10 combines (h) and (g) into (j) with new frequency of 20.
  J    S    K   (j)  (i)
 53   37   26   20   14

Step #11 combines (j) and (i) into (k) with new frequency of 34.
  J    S   (k)   K
 53   37   34   26

Step #12 combines (k) and K into (m) with new frequency of 60.
 (m)   J    S
 60   53   37

Step #13 combines J and S into (n) with new frequency of 90.
 (n)  (m)
 90   60

Step #14 combines (n) and (m) into our final node (p) with new frequency of 150.
 (p)
150

We are done.

Did you notice how nodes that we created were often quickly reused in another combination? This is what leads to a tree structure exceeding the maximum code length. Imagine a tree with one long branch down its right edge. It is very common when there are a few symbols that appear with high frequency and the balance are relatively low frequency symbols.

This exercise in combining nodes was all well and good but something else needs to occur to make it useful. In the typical Huffman case, when creating a node you would assign one of the combining nodes to the left link (bit = 0) and the other to the right link (bit = 1). This would then develop a tree structure that would work for you, although most likely it would not be DEFLATE compatible.

I am going to avoid the tree but note that in the act of combination all of the symbols participating will have their code length increased by 1. Combining leaves into a node creates another level in the tree. That’s what we are doing. In the code I will use a linked list to simply collect the symbols that are a member of (or lie below) any given node. In combining I will concatenate the member lists for the two leaves/nodes being combined and then increment the code length for each member.

For each step I am going to show the node membership and the code length for our alphabet as we proceed through the process. Hopefully this table will make sense to you.

             Step   1    2    3    4    5    6    7    8    9    10   11   12   13   14  clen
A  0x000 4     0    0    0    0    0    0    0    1g   1g   1g   2j   3k   4m   4m   5p   5
D  0x003 2     0    0    0    0    1d   1d   1d   2g   2g   2g   3j   4k   5m   5m   6p   6
E  0x004 6     0    0    0    0    0    0    0    0    0    1i   1i   2k   3m   3m   4p   4
F  0x005 3     0    0    0    0    0    1e   1e   1e   2h   2h   3j   4k   5m   5m   6p   6
G  0x006 2     0    0    0    1c   1c   1c   2f   2f   2f   3i   3i   4k   5m   5m   6p   6
H  0x007 2     0    0    0    1c   1c   1c   2f   2f   2f   3i   3i   4k   5m   5m   6p   6
J  0x008 53    0    0    0    0    0    0    0    0    0    0    0    0    0    1n   2p   2
K  0x009 26    0    0    0    0    0    0    0    0    0    0    0    0    1m   1m   2p   2
L  0x00a 5     0    0    0    0    0    0    0    0    1h   1h   2j   3k   4m   4m   5p   5
M  0x00b 4     0    0    0    0    0    0    1f   1f   1f   2i   2i   3k   4m   4m   5p   5
N  0x00c 3     0    0    0    0    0    1e   1e   1e   2h   2h   3j   4k   5m   5m   6p   6
P  0x00d 1     0    0    1b   1b   2d   2d   2d   3g   3g   3g   4j   5k   6m   6m   7p   7
Q  0x00e 1     0    1a   2b   2b   3d   3d   3d   4g   4g   4g   5j   6k   7m   7m   8p   8
R  0x00f 1     0    1a   2b   2b   3d   3d   3d   4g   4g   4g   5j   6k   7m   7m   8p   8
S  0x010 37    0    0    0    0    0    0    0    0    0    0    0    0    0    1n   2p   2

In this table we follow the code length (clen) associated with each occurring symbol through the step by step combination process. Here as a symbol is combined into a new node (letter changes) we increment its bit depth or code length. The final column shows the resulting clen for this table.
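For anyone who wants to experiment with the table above, here is a minimal, self-contained sketch of the same bookkeeping in C. It is not the JANOS routine: it replaces the sorted list and linked lists with a simple group id per symbol, and the names `code_lengths` and `total_bits` are made up for illustration. It assumes every frequency passed in is greater than zero. Ties between equal frequencies may be broken differently than in the walkthrough, so individual code lengths can differ from the clen column, but the total number of bits does not change.

```c
#include <assert.h>

#define MAXSYM 32

/* Derive code lengths by repeatedly merging the two lightest groups.
   group[i] is the group that symbol i currently belongs to. Merging two
   groups pushes every member of both one level deeper in the (implied) tree. */
static void code_lengths(const int *freq, int *len, int nsym)
{
    int group[MAXSYM];    /* group id of each symbol */
    int weight[MAXSYM];   /* combined frequency of each group, 0 = retired */
    int live = nsym;
    int i;

    assert(nsym >= 2 && nsym <= MAXSYM);
    for (i = 0; i < nsym; i++) {
        assert(freq[i] > 0);          /* unused symbols must be left out */
        group[i] = i;
        weight[i] = freq[i];
        len[i] = 0;
    }

    while (live > 1) {
        int a = -1, b = -1;           /* a = lightest group, b = second lightest */
        for (i = 0; i < nsym; i++) {
            if (weight[i] == 0)
                continue;
            if (a < 0 || weight[i] < weight[a]) {
                b = a;
                a = i;
            } else if (b < 0 || weight[i] < weight[b]) {
                b = i;
            }
        }

        /* Merge group b into group a and lengthen every member's code. */
        for (i = 0; i < nsym; i++) {
            if (group[i] == a || group[i] == b) {
                group[i] = a;
                len[i]++;
            }
        }
        weight[a] += weight[b];
        weight[b] = 0;
        live--;
    }
}

/* Sum of frequency * code length over the whole alphabet. */
static long total_bits(const int *freq, const int *len, int nsym)
{
    long bits = 0;
    int i;
    for (i = 0; i < nsym; i++)
        bits += (long)freq[i] * len[i];
    return bits;
}
```

Called with the 15 non-zero frequencies from the example (4, 2, 6, 3, 2, 2, 53, 26, 5, 4, 3, 1, 1, 1, 37), `total_bits` comes to 416, the sum of the combined frequencies created in the 14 steps.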

So we mechanically have shown how to derive the code lengths for an alphabet symbol set with given frequencies. Next we create the DEFLATE Huffman code table for this.

My code to perform the node creation and code length incrementing looks something like this.

// Establish bit length for the Huffman table based upon frequencies
static int _bitlength(struct huff_t *tbl, int ncodes, int maxb)
{
	uint16_t *list = MM_alloc(ncodes * sizeof(uint16_t), &_bitlength);
	int *freq = MM_alloc(ncodes * sizeof(int), &_bitlength);
	int nlist = 0;
	int n, c, p;
	int ret = TRUE;
 
	// List all of the symbols used in the data along with their frequencies. Note that
	//  we store pointers +1 so as to keep 0 as a linked list terminator.
	for (n = 0; n < ncodes; n++) if (tbl[n].cnt > 0)
		{
			list[nlist] = n + 1;
			freq[nlist] = tbl[n].cnt;
			nlist++;
		}
 
	// Note that there is a special boundary case when only 1 code is used. In this case
	//  the single code is encoded using 1 bit and not 0.
	if (nlist == 1)
		tbl[list[0] - 1].len = 1;
 
	// process this list down to a single node
	while (nlist > 1)
	{
		// sort the list by decreasing frequency
		for (n = 0; n < nlist - 1; n++)
			if (freq[n] < freq[n + 1])
			{
				// swap order
				c = list[n];
				list[n] = list[n + 1];
				list[n + 1] = c;
				c = freq[n];
				freq[n] = freq[n + 1];
				freq[n + 1] = c;

				// need to sort back
				if (n > 0)
					n -= 2;
			}
 
		// Combine the member lists associated with the last two entries. We combine the
		//  linked lists for the two low frequency nodes.
		p = list[nlist - 2];
		while (tbl[p - 1].link)
			p = tbl[p - 1].link;
		tbl[p - 1].link = list[nlist - 1];
 
		// The new node has the combined frequency.
		freq[nlist - 2] += freq[nlist - 1];
		nlist--;
 
		// Increase the code length for members of this node.
		p = list[nlist - 1];
		while (p)
		{
			tbl[p - 1].len++;
			p = tbl[p - 1].link;
		}
 
	}
 
	MM_free(freq);
	MM_free(list);
	return (ret);
}

You might note the check for nlist == 1 handling a special case when there is only one used item in our alphabet. Here we need to use a code length of 1.

Now to be fair there is a lot more that I will be adding to this routine before we are done.

Now that we know the lengths of the code that we will be using to compress the data we can predict the compression ratio.

        freq  clen     freq*clen
A  0x000 4     5           20
D  0x003 2     6           12
E  0x004 6     4           24
F  0x005 3     6           18
G  0x006 2     6           12
H  0x007 2     6           12
J  0x008 53    2          106
K  0x009 26    2           52
L  0x00a 5     5           25
M  0x00b 4     5           20
N  0x00c 3     6           18
P  0x00d 1     7            7
Q  0x00e 1     8            8
R  0x00f 1     8            8
S  0x010 37    2           74
                      -----------
                          416 bits (52 bytes)

If we multiply the frequency of each symbol by its code length and total that for the set we get the total number of bits required to encode the original message. Originally we had 150 bytes or 1200 bits. When we are done we can store that same message in only 52 bytes. We’ve reduced the data to about one third of its original size.
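The arithmetic in that table is easy to automate. The following is a small sketch only; `encoded_bits` and `bits_to_bytes` are hypothetical helper names, not part of the compressor shown in this series.

```c
#include <assert.h>

/* Total encoded size in bits: the sum of frequency * code length. */
static long encoded_bits(const int *freq, const int *clen, int nsym)
{
    long bits = 0;
    int i;

    assert(nsym >= 0);
    for (i = 0; i < nsym; i++)
        bits += (long)freq[i] * clen[i];
    return bits;
}

/* Bytes needed to hold that many bits, rounded up to a whole byte. */
static long bits_to_bytes(long bits)
{
    return (bits + 7) / 8;
}
```

With the freq and clen columns above this gives 416 bits, which rounds up to 52 bytes.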

Let’s see how to derive the bit codes that we will use in encoding the data.

We want to derive the actual binary code patterns for encoding this symbol set. The DEFLATE specification tells us to first count the number of codes for each code length.

N  bl_count[N]
0      0
1      0
2      3
3      0
4      1
5      3
6      5
7      1
8      2

Next we find the numerical value of the smallest code for each code length. The following code is provided by the specification.

        code = 0;
        bl_count[0] = 0;
        for (bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + bl_count[bits-1]) << 1;
            next_code[bits] = code;
        }

In performing this procedure we get the following. Here I will also show the codes in binary form.

N  bl_count[N]  next_code[N]
0      0
1      0            0
2      3            0     00
3      0            6     110
4      1           12     1100
5      3           26     11010
6      5           58     111010
7      1          126     1111110
8      2          254     11111110 

Now we assign codes to each symbol based upon its length. The DEFLATE specification provides this code snippet. Basically the above defines the starting code which we increment after each use.

        for (n = 0; n <= max_code; n++) {
            len = tree[n].Len;
            if (len != 0) {
                tree[n].Code = next_code[len];
                next_code[len]++;
            }
        }

And this ends up giving us the following codes for encoding this data.

        freq  clen     code
A  0x000 4     5      11010
D  0x003 2     6      111010
E  0x004 6     4      1100
F  0x005 3     6      111011
G  0x006 2     6      111100
H  0x007 2     6      111101
J  0x008 53    2      00
K  0x009 26    2      01
L  0x00a 5     5      11011
M  0x00b 4     5      11100
N  0x00c 3     6      111110
P  0x00d 1     7      1111110
Q  0x00e 1     8      11111110
R  0x00f 1     8      11111111
S  0x010 37    2      10
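Putting the counting step and the two snippets from the specification together gives something like the following. This is a self-contained sketch for experimenting; the function name `assign_codes` and the plain `int` arrays are assumptions made for the example rather than the structures used in my code.

```c
#include <assert.h>

#define MAX_BITS 15

/* Assign canonical DEFLATE codes from code lengths (RFC 1951, section 3.2.2).
   len[i] == 0 marks an unused symbol; code[i] is set to 0 for those. */
static void assign_codes(const int *len, unsigned *code, int nsym)
{
    int bl_count[MAX_BITS + 1] = {0};
    unsigned next_code[MAX_BITS + 1] = {0};
    unsigned c = 0;
    int bits, n;

    /* Step 1: count the number of codes for each code length. */
    for (n = 0; n < nsym; n++) {
        assert(len[n] >= 0 && len[n] <= MAX_BITS);
        bl_count[len[n]]++;
    }
    bl_count[0] = 0;

    /* Step 2: find the numerically smallest code for each code length. */
    for (bits = 1; bits <= MAX_BITS; bits++) {
        c = (c + bl_count[bits - 1]) << 1;
        next_code[bits] = c;
    }

    /* Step 3: hand out consecutive codes in symbol order. */
    for (n = 0; n < nsym; n++) {
        code[n] = 0;
        if (len[n] != 0)
            code[n] = next_code[len[n]]++;
    }
}
```

Fed the 19 code lengths of the example alphabet (0 for the unused symbols B, C, T and U), it reproduces the table above: J gets 0 (00), A gets 26 (11010) and R gets 255 (11111111).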

Take a few moments to picture what is going on. Basically with just 2 bits you can encode no more than 4 symbols. Since we have more than 4 symbols to encode we cannot use all 4 combinations of two bits. We reserve 1 or more bit combinations as a prefix indicating that an additional bit or more will be needed to identify other symbols. The decompressor will be processing the bit stream 1 bit at a time as it doesn’t know in advance how many bits will be needed to identify the next symbol.
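To make the bit-at-a-time idea concrete, here is a rough sketch of how a decompressor can identify a symbol from canonical codes using only the per-length counts. It is an illustration, not code from JANOS: the name `decode_symbol` is hypothetical, and a string of '0' and '1' characters stands in for a real bit stream. It assumes `count[]` has MAX_BITS + 1 entries and that `symbol[]` lists the used symbols ordered by code length and then by symbol value.

```c
#include <assert.h>

#define MAX_BITS 15

/* Decode one symbol from a string of '0'/'1' characters, one bit at a time.
   count[n] is the number of codes of length n. Returns the symbol, or -1 if
   the input ends early or matches no code. *used receives the number of bits
   consumed (0 on failure). */
static int decode_symbol(const char *bits, const int *count,
                         const int *symbol, int *used)
{
    int code = 0;     /* the bits read so far, as a number */
    int first = 0;    /* first code of the current length */
    int index = 0;    /* position in symbol[] of that first code */
    int len;

    assert(bits && count && symbol && used);
    for (len = 1; len <= MAX_BITS; len++) {
        char b = bits[len - 1];
        if (b != '0' && b != '1')
            break;                        /* ran out of input */
        code |= (b == '1');
        if (code - count[len] < first) {  /* code falls within this length */
            *used = len;
            return symbol[index + (code - first)];
        }
        index += count[len];
        first = (first + count[len]) << 1;
        code <<= 1;
    }
    *used = 0;
    return -1;
}
```

For the table above, count is {0, 0, 3, 0, 1, 3, 5, 1, 2} followed by zeros and symbol is {8, 9, 16, 4, 0, 10, 11, 3, 5, 6, 7, 12, 13, 14, 15}. Decoding the bits 11010 then consumes 5 bits and yields symbol 0 (A).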

In this symbol set it turns out that we use all but the last combination of two bits and 11b becomes the prefix accessing the rest of the symbol set. Note how this coincides with the 3 high frequency codes. It then turns out to be most efficient not to use any 3 bit codes and to jump right to 4 bits for the next possible symbol encoding. In fact, if for any bit length you reserve only the last combination as a prefix then the next bit length has only two combinations (a node). For 3 bit codes here that would be 110b and 111b. If we save both as prefixes then we can encode more symbols. Here for the 4 bit codes there are 4 combinations: 1100b, 1101b, 1110b, and 1111b. Again this tree decided to use only one of the 4 bit combinations for a symbol.

Another thing to notice is that the final code for the largest bit length corresponds to the rightmost leaf in the tree. For DEFLATE that is lexicographically the last code, and it takes a series of 1 bits to reach. So for this 8 bit code the last symbol is identified by 11111111b. This last code should always be 2**N – 1 where N is the largest bit length (code length). Note that ‘**’ indicates exponentiation here. Two is raised to the power of N.

If you think about it, the set of code lengths has to be just right for the codes to come out this way and not overflow those available for any one bit length. This is assured by the procedure. If you try some random code lengths you will quickly see what I mean. In general you will be trying to create an impossible tree.
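One way to test whether a set of code lengths is "just right" is to add up 2 to the power of minus the code length over all used symbols. For a full tree like the ones built here that sum is exactly 1. This is the Kraft sum, which the text above does not name, so treat the sketch below as a side note; `lengths_are_complete` is a hypothetical helper. The sum is scaled by 2**15 so that everything stays in integers.

```c
#include <assert.h>

#define MAX_BITS 15

/* Returns 1 when the code lengths describe a complete prefix code, meaning
   the sum of 2^-len over all used symbols is exactly 1. Returns 0 when the
   set leaves codes unused or asks for more codes than exist.
   len[i] == 0 marks an unused symbol. */
static int lengths_are_complete(const int *len, int nsym)
{
    long sum = 0;
    int i;

    for (i = 0; i < nsym; i++) {
        assert(len[i] >= 0 && len[i] <= MAX_BITS);
        if (len[i] > 0)
            sum += 1L << (MAX_BITS - len[i]);
    }
    return sum == (1L << MAX_BITS);
}
```

The code lengths from the example pass this check. A made-up set such as {2, 2, 2, 2, 3} does not, because four 2 bit codes already use every combination and leave nothing for the fifth symbol.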

But wait!!

This table looks suspiciously like the code length encoding (third Huffman table that we have not discussed as yet). If it is, didn’t you mention that it would be limited to a bit depth or code length of 7? This one is 8 bits. Is that okay?

No. It is not okay. You are right. I purposely chose this real-world example which actually does violate that code length rule. So this is not a DEFLATE compatible table. At least not for that third Huffman coding. This table occurs in trying to compress the /etc/JanosClasses.jar file currently on my development JNIOR. And actually before this the literal table exceeds the 15 bit limit. The resulting compression fails.

Most of what you read now tells you to “adjust the Huffman tree” accordingly and plods on without a hint as to how you might do that. You can certainly detach and reattach nodes to get it done but how do you know that you haven’t significantly affected the compression ratio? You could punt and use the fixed Huffman tables afforded by another DEFLATE block type, BTYPE 01. You know that you can’t just fiddle with the code lengths because you will end up trying to create an impossible tree. So what now?

Well, I can show you how to get it done.

Adjust the Huffman Tree Accordingly

We have created the optimum Huffman table for coding our data. Unfortunately we find out that it is not compatible for use with DEFLATE. The DEFLATE specification dictates that this particular table have a code length maximum of 7 bits. That being forced by the fact that these code lengths are stored in the DEFLATE stream using 3 bits each. That limits code lengths for this table to the set containing 0 thru 7.

Our table is too deep. It requires 8 bits to encode our data. What do we do about that? It seems that anything we do will reduce the efficiency of the compression. We need to create a less than optimum Huffman coding. How do we do that and keep the impact at a minimum? How do we not seriously damage the compression ratio?

We are going to adjust our Huffman table so that it does not exceed the maximum bit depth. Ideally we want to stop incrementing the code length of any symbols that reach the maximum. How can we properly do that? Well, we have to go back to the math. Let’s understand what makes a certain set of code lengths valid while others are not.

In the prior post we assigned code lengths to symbols in our alphabet based upon a tree construction algorithm. The steps in this algorithm are those that create a valid tree. It is not surprising then that our set of code lengths represent a real tree. As a result when we calculate the starting codes for each bit length the process ends up with usable codes. And as we noticed the last code in the tree is in fact 2**N – 1. In our case this is 2**8 -1 or 255 and in binary that being 11111111b.

Let me expand the loop in the starting code generation so we can see what happens. Here we will generate starting codes (S0..S8) for our code lengths. We will use a shorthand for the code length counts (N0..N8). In our example those are N = {0 0 3 0 1 3 5 1 2}. Note too that a left shift of 1 is equivalent to multiplication by 2. The starting codes (S) are calculated as follows:

    S1 = (S0 + N0) << 1 = (0 + 0) << 1 = 2 * 0 = 0
    S2 = (S1 + N1) << 1 = (0 + 0) << 1 = 2 * 0 = 0         00
    S3 = (S2 + N2) << 1 = (0 + 3) << 1 = 2 * 3 = 6         110
    S4 = (S3 + N3) << 1 = (6 + 0) << 1 = 2 * 6 = 12        1100
    S5 = (S4 + N4) << 1 = (12 + 1) << 1 = 2 * 13 = 26      11010
    S6 = (S5 + N5) << 1 = (26 + 3) << 1 = 2 * 29 = 58      111010
    S7 = (S6 + N6) << 1 = (58 + 5) << 1 = 2 * 63 = 126     1111110
    S8 = (S7 + N7) << 1 = (126 + 1) << 1 = 2 * 127 = 254   11111110

There are two things to notice here other than the fact that this matches the table generated earlier. First, the count of 8 bit code lengths (N8) doesn’t come into play. Yet we know that there are 2 and the first will be assigned 11111110b and the second 11111111b. This being the 2**8 -1 that we now expect. The second thing is that all of the starting codes are even numbers. That being driven by the fact that they are the product of multiplication by 2. We will use this fact later.

Now I can reverse this procedure to generate starting codes back from the maximum bit length knowing that the last code must be 2**N – 1. So for this table we get the following:

    S8 = 2**8 - N8 = 2**8 - 2 = 254
    S7 = S8/2 - N7 = 254/2 - 1 = 127 - 1 = 126
    S6 = S7/2 - N6 = 126/2 - 5 = 63 - 5 = 58
    S5 = S6/2 - N5 = 58/2 - 3 = 29 - 3 = 26
    S4 = S5/2 - N4 = 26/2 - 1 = 13 - 1 = 12
    S3 = S4/2 - N3 = 12/2 - 0 = 6
    S2 = S3/2 - N2 = 6/2 - 3 = 3 - 3 = 0
    S1 = 0
    S0 = 0

This is just a matter of running the calculations backwards and knowing (or realizing) that the last code has to be 2**N – 1. You can see it generates the same results.
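The reversed calculation also works as a quick validity test for any array of code length counts. Below is a minimal sketch of that idea. The function name `reverse_start_codes` and its interface are assumptions for illustration and this is not the routine used in my compressor. It reports failure as soon as a starting code comes out odd or negative, and it also insists that the calculation ends with S1 = 0.

```c
#include <assert.h>

/* Run the starting-code calculation backwards from the longest code length.
   count[n] is the number of codes of length n and maxbits is the longest
   length in use. Fills start[1..maxbits] as far as it gets. Returns 1 if the
   counts describe a usable tree, otherwise 0. */
static int reverse_start_codes(const int *count, int maxbits, int *start)
{
    int n;
    int s;

    assert(maxbits >= 1 && maxbits <= 15);

    /* The last code must be 2**maxbits - 1, so the first one is this. */
    s = (1 << maxbits) - count[maxbits];

    for (n = maxbits; n >= 1; n--) {
        if (s < 0 || (s & 1))
            return 0;                 /* odd or negative: impossible tree */
        start[n] = s;
        if (n > 1)
            s = s / 2 - count[n - 1];
    }
    return start[1] == 0;
}
```

With N = {0 0 3 0 1 3 5 1 2} and a maximum of 8 bits it returns 1 and fills in the same starting codes as the list above (254, 126, 58, 26, 12, 6, 0, 0).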

Now what if we decide to not accept a bit depth exceeding the maximum? So we are going to force those two symbols wanting to be 8 bit codes to be two additional 7 bit codes. So our code length array will look like this: N = {0 0 3 0 1 3 5 3}. Here we combined N7 and N8 and eliminated 8 bit codes altogether. Legal? Of course not. You can’t visualize what that does to a tree. Let’s try the calculations back from the new maximum of 2**7 – 1.

    S7 = 2**7 - N7 = 128 - 3 = 125

Here this fails immediately. We know that starting codes (S) must be even numbers and 125 is odd! Not surprising as we are kind of floating a pair of leaves up in the air somehow. Can we make a home for them?

Clearly if we were to reattach those leaves somewhere else in the tree structure other nodes must be increased in bit depth. We need another 7 bit symbol to get S7 to be an even 124. To do that with minimum impact on compression ratio we increase the 6 bit coded symbol with the lowest frequency to 7 bits. Our array now being N = {0 0 3 0 1 3 4 4}. Try again:

    S7 = 2**7 - N7 = 128 - 4 = 124
    S6 = 124/2 - N6 = 62 - 4 = 58
    S5 = S6/2 - N5 = 58/2 - 3 = 29 - 3 = 26
    S4 = S5/2 - N4 = 26/2 - 1 = 13 - 1 = 12
    S3 = S4/2 - N3 = 12/2 - 0 = 6
    S2 = S3/2 - N2 = 6/2 - 3 = 3 - 3 = 0
    S1 = 0
    S0 = 0

Um. Everything seemed to fit right in. Is this tree valid? Let’s see.

N  bl_count[N]  next_code[N]
0      0
1      0            0
2      3            0     00
3      0            6     110
4      1           12     1100
5      3           26     11010
6      4           58     111010
7      4          124     1111100

        freq  clen     code
A  0x000 4     5      11010
D  0x003 2     6      111010
E  0x004 6     4      1100
F  0x005 3     6      111011
G  0x006 2     6      111100
H  0x007 2     6      111101
J  0x008 53    2      00
K  0x009 26    2      01
L  0x00a 5     5      11011
M  0x00b 4     5      11100
N  0x00c 3     7      1111100
P  0x00d 1     7      1111101
Q  0x00e 1     7      1111110
R  0x00f 1     7      1111111
S  0x010 37    2      10

This appears to have successfully generated a Huffman tree that would work with DEFLATE format! Let’s look at the compression ratio for this.

        freq  clen     freq*clen
A  0x000 4     5           20
D  0x003 2     6           12
E  0x004 6     4           24
F  0x005 3     6           18
G  0x006 2     6           12
H  0x007 2     6           12
J  0x008 53    2          106
K  0x009 26    2           52
L  0x00a 5     5           25
M  0x00b 4     5           20
N  0x00c 3     7           21
P  0x00d 1     7            7
Q  0x00e 1     7            7
R  0x00f 1     7            7
S  0x010 37    2           74
                      -----------
                          417 bits (53 bytes)

Wait! This only cost us 1 bit? Yes it did, but to store it we would need another whole byte. So this procedure is likely (though not proven) to have a minimal impact on compression ratio. Yet it corrects the table to ensure that it is compatible with DEFLATE.

To generalize the process, the reversed starting code calculation is repeated from 2**N – 1 where N is the maximum bit depth back to 0 for bit length 0. If at any point the calculated starting code is not even, you must raise the next least frequent symbol to this code length, which makes the starting code even.

In my next post I will show code for this.

The complete procedure for generating a DEFLATE format compatible Huffman table limited to a maximum bit depth is shown here. I know that this is not optimized code. There is some unnecessary execution but I had wanted to keep steps separate and clear. You can be sure that over time I will optimize the coding and obfuscate it suitably for all future generations.

This routine has the capacity to return FALSE if a table cannot be created. It was doing just that when the bit depth (code length) exceeded maximum. That has since been corrected. It will always return TRUE now.

// Establish bit length for the Huffman table based upon frequencies
static int _bitlength(struct huff_t *tbl, int ncodes, int maxb)
{
	uint16_t *list = MM_alloc(ncodes * sizeof(uint16_t), &_bitlength);
	int *freq = MM_alloc(ncodes * sizeof(int), &_bitlength);
	int nlist = 0;
	int n, c, p;
	int ret = TRUE;
	uint16_t *ptr;
 
	// List all of the symbols used in the data along with their frequencies. Note that
	//  we store pointers +1 so as to keep 0 as a linked list terminator.
	for (n = 0; n < ncodes; n++) if (tbl[n].cnt > 0)
		{
			list[nlist] = n + 1;
			freq[nlist] = tbl[n].cnt;
			nlist++;
		}
 
	// Note that there is a special boundary case when only 1 code is used. In this case
	//  the single code is encoded using 1 bit and not 0.
	if (nlist == 1)
		tbl[list[0] - 1].len = 1;
 
	// process this list down to a single node
	while (nlist > 1)
	{
		// sort the list by decreasing frequency
		for (n = 0; n < nlist - 1; n++)
			if (freq[n] < freq[n + 1])
			{
				// swap order
				c = list[n];
				list[n] = list[n + 1];
				list[n + 1] = c;
				c = freq[n];
				freq[n] = freq[n + 1];
				freq[n + 1] = c;

				// need to sort back
				if (n > 0)
					n -= 2;
			}
 
		// Combine the member lists associated with the last two entries. We combine the
		//  linked lists for the two low frequency nodes.
		p = list[nlist - 2];
		while (tbl[p - 1].link)
			p = tbl[p - 1].link;
		tbl[p - 1].link = list[nlist - 1];
 
		// The new node has the combined frequency.
		freq[nlist - 2] += freq[nlist - 1];
		nlist--;
 
		// Sort the members of this node by decreasing code length. Longer codes to the
		//  left. This will also sort the frequency of the symbols in increasing order
		//  when code lengths are equal. We need this arrangement for the next step should
		//  we be required to balance the tree and avoid exceeding the maximum code
		//  length (maxb).
		p = TRUE;
		while (p)
		{
			p = FALSE;
 
			ptr = &list[nlist - 1];
			while (*ptr && tbl[*ptr - 1].link)
			{
				c = tbl[*ptr - 1].link;
				if ((tbl[*ptr - 1].len < tbl[c - 1].len) || (tbl[*ptr - 1].len == tbl[c - 1].len && tbl[*ptr - 1].cnt > tbl[c - 1].cnt))
				{
					n = tbl[c - 1].link;
					tbl[*ptr - 1].link = n;
					tbl[c - 1].link = *ptr;
					*ptr = c;
					p = TRUE;
				}
				ptr = &tbl[*ptr - 1].link;
			}
		}
 
		// Increase the code length for members of this node. We cannot exceed the maximum
		//  code length (maxb).
		p = list[nlist - 1];
		while (p)
		{
			if (tbl[p - 1].len < maxb)
				tbl[p - 1].len++;
 
			p = tbl[p - 1].link;
		}
 
		// Now verify the structure. This should be absolutely proper up until the point when
		//  we prevent code lengths from exceeding the maximum. Once we do that we are likely
		//  creating an impossible tree. We will need to correct that.
		p = list[nlist - 1];
		c = tbl[p - 1].len;
		if (c == maxb)
		{
 
			n = 1 << c;
			while (p)
			{
				if (tbl[p - 1].len == c)
					n--;
				else
				{
					// n must be even at this point or we extend the length group
					if (n & 1)
					{
						tbl[p - 1].len = c;
						n--;
					}
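					else
					{
						// Drop to the next shorter code length and re-examine
						//  this same symbol so that it is counted at its own
						//  length.
						c--;
						n /= 2;
						continue;
					}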
				}
 
				p = tbl[p - 1].link;
			}
		}
	}
 
	MM_free(freq);
	MM_free(list);
	return (ret);
}

Here are the steps that it performs:

  1. Create an array for all used alphabet symbols (those with non-zero frequency).
  2. Check for the special case where there is only one symbol. In that case we use a 1 bit code where one is unused.
  3. Sort the list array by decreasing frequency. The least frequent symbols are then at the end of the list.
  4. Combine the rightmost two least frequent symbols or nodes into a single new node having the combined frequency. All members of the two combined nodes become members of the new node.
  5. Sort the member list for the new node by decreasing bit depth (current code length) and secondly by increasing frequency.
  6. Increase the bit depth for all members of the new node by 1. If a symbol will exceed the maximum bit depth do not increment it.
  7. If we have reached the maximum bit depth then confirm the tree structure using the reverse starting code length calculations. Elevate the next least frequent symbol to the current bit depth if the calculated starting code is not an even number. Check all code lengths.
  8. If there is more than one entry in the list array (not down to one node yet) then repeat at Step #3.

So to review….

First, we covered how to perform a reasonably fast version of LZ77. This fills a 64 KB buffer with compressed data which contains literals and length-distance codes.

When we need to we will flush the 64 KB buffer and generate a block of DEFLATE format. At that point we call a routine that first analyzes the data for code frequencies. There are to be two alphabets. One for literals and length codes. The other for distance codes.

Now we’ve developed a method for generating DEFLATE compatible Huffman tables for our two alphabets. The bulk of the complexity is now behind us but there is still some work to do before we can start generating our compressed bit stream.

Moving on…

Now we need the size of our literal alphabet (HLIT) and our distance alphabet (HDIST) since it is unlikely that we have utilized all literal symbols (0..285) or distance codes (0..29). So here we scan each to trim unused symbols.

		// Now we combine the bit length arrays into a single array for run-length-like repeat encoding.
		//  In DEFLATE this encoding can overlap for the literal table to the distance code table as
		//  if a single array. First determine the alphabet size for each table.
		for (hlit = 286; 0 < hlit; hlit--)
			if (littbl[hlit - 1].len > 0)
				break;

		for (hdist = 30; 0 < hdist; hdist--)
			if (dsttbl[hdist - 1].len > 0)
				break;

So we have the two alphabets. One for literals and length codes (0..HLIT-1) and one for distance codes (0..HDIST-1). We’ve seen that all we need are the code lengths for each symbol set to define the associated Huffman tree. We need to convey this information to the decompressor. The DEFLATE format now combines the code length arrays for the two alphabets into one long array of HLIT + HDIST code lengths.

		// Now create the bit length array
		cntlist = MM_alloc((hlit + hdist) * sizeof(int), &_bufflush);
		for (n = 0; n < hlit; n++)
			cntlist[n] = littbl[n].len;
		for (n = 0; n < hdist; n++)
			cntlist[hlit + n] = dsttbl[n].len;

Now we know that the end-of-block code (256) is used as well as at least a handful of length codes and distance codes. So this array of code lengths itself is fairly long. It will be somewhere between 260 and 315 entries. Each code length is in the set 0..15. So typically there is a lot of repetition. There can be a large number of 0’s in a row. Consider data that is constrained to 7-bit ASCII. In that case there are 128 literal codes that never occur and would have a code length of 0.

The DEFLATE specification defines a kind of run-length encoding for this array. This encodes sequences of code lengths using three additional codes.

         The Huffman codes for the two alphabets appear in the block
         immediately after the header bits and before the actual
         compressed data, first the literal/length code and then the
         distance code.  Each code is defined by a sequence of code
         lengths, as discussed in Paragraph 3.2.2, above.  For even
         greater compactness, the code length sequences themselves are
         compressed using a Huffman code.  The alphabet for code lengths
         is as follows:

               0 - 15: Represent code lengths of 0 - 15
                   16: Copy the previous code length 3 - 6 times.
                       The next 2 bits indicate repeat length
                             (0 = 3, ... , 3 = 6)
                          Example:  Codes 8, 16 (+2 bits 11),
                                    16 (+2 bits 10) will expand to
                                    12 code lengths of 8 (1 + 6 + 5)
                   17: Repeat a code length of 0 for 3 - 10 times.
                       (3 bits of length)
                   18: Repeat a code length of 0 for 11 - 138 times
                       (7 bits of length)

So as you can see we are heading towards our third Huffman table. This one with an alphabet of 19 codes (0..18). That’s why the symbol set I used for the example in handling maximum code length has 19 members. I used one of these tables as an example.
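To give a feel for how the three repeat codes apply, here is a simple greedy sketch. It is only one possible approach and should not be taken as the procedure used in my compressor. The names `rle_lengths` and `expand_lengths` are hypothetical, and the extra bits are returned as plain integers rather than being written to a bit stream. A smarter encoder might split runs differently to save a few more symbols.

```c
#include <assert.h>

/* Greedy run-length encoding of a code length array using the DEFLATE repeat
   codes 16, 17 and 18. Writes symbols to sym[] and the value of any extra
   bits to extra[] (0 when the symbol has none). Both arrays need room for n
   entries. Returns the number of symbols written. */
static int rle_lengths(const int *len, int n, int *sym, int *extra)
{
    int out = 0;
    int i = 0;

    while (i < n) {
        int run = 1;
        int left, take;

        while (i + run < n && len[i + run] == len[i])
            run++;

        if (len[i] == 0) {
            left = run;                     /* zeros need no leading literal */
            while (left >= 3) {
                take = left > 138 ? 138 : left;
                if (take >= 11) { sym[out] = 18; extra[out] = take - 11; }
                else            { sym[out] = 17; extra[out] = take - 3; }
                out++;
                left -= take;
            }
        } else {
            sym[out] = len[i]; extra[out] = 0; out++;   /* the length itself */
            left = run - 1;
            while (left >= 3) {
                take = left > 6 ? 6 : left;
                sym[out] = 16; extra[out] = take - 3; out++;
                left -= take;
            }
        }
        while (left-- > 0) {                /* leftovers too short to repeat */
            sym[out] = len[i]; extra[out] = 0; out++;
        }
        i += run;
    }
    return out;
}

/* Inverse of rle_lengths, included only to check the round trip. */
static int expand_lengths(const int *sym, const int *extra, int nsym, int *len)
{
    int n = 0, i, k;

    for (i = 0; i < nsym; i++) {
        assert(sym[i] >= 0 && sym[i] <= 18);
        if (sym[i] < 16) {
            len[n++] = sym[i];
        } else if (sym[i] == 16) {
            assert(n > 0);
            for (k = extra[i] + 3; k > 0; k--, n++)
                len[n] = len[n - 1];
        } else {
            for (k = extra[i] + (sym[i] == 17 ? 3 : 11); k > 0; k--)
                len[n++] = 0;
        }
    }
    return n;
}
```

For the example in the specification, twelve code lengths of 8 in a row come out as 8, 16 (extra bits 3) and 16 (extra bits 2).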

In my next post I will show the procedure I use for applying the repeat codes in this alphabet.

The current version of jniorsys.log on my development JNIOR compresses into a single block using the following literal and distance tables.

littbl
 0x00a 2 11 11111101110
 0x00d 2 12 111111111000
 0x020 93 6 101000
 0x027 4 11 11111101111
 0x028 14 9 111101010
 0x029 3 11 11111110000
 0x02a 2 12 111111111001
 0x02b 9 10 1111101000
 0x02c 9 10 1111101001
 0x02d 23 8 11101100
 0x02e 158 5 01100
 0x02f 18 9 111101011
 0x030 120 6 101001
 0x031 182 5 01101
 0x032 147 5 01110
 0x033 150 5 01111
 0x034 127 6 101010
 0x035 132 6 101011
 0x036 113 6 101100
 0x037 110 6 101101
 0x038 113 6 101110
 0x039 106 6 101111
 0x03a 91 6 110000
 0x03c 2 12 111111111010
 0x03d 3 11 11111110001
 0x03e 3 11 11111110010
 0x041 6 10 1111101010
 0x043 5 11 11111110011
 0x044 1 13 1111111111100
 0x046 2 12 111111111011
 0x048 1 13 1111111111101
 0x04a 3 11 11111110100
 0x04c 3 11 11111110101
 0x04d 5 11 11111110110
 0x04e 4 11 11111110111
 0x04f 5 10 1111101011
 0x050 11 9 111101100
 0x052 10 9 111101101
 0x053 6 10 1111101100
 0x054 5 10 1111101101
 0x055 2 12 111111111100
 0x057 3 11 11111111000
 0x05a 1 13 1111111111110
 0x05f 5 10 1111101110
 0x061 44 7 1101100
 0x062 7 10 1111101111
 0x063 19 9 111101110
 0x064 30 8 11101101
 0x065 52 7 1101101
 0x066 14 9 111101111
 0x067 13 9 111110000
 0x068 9 10 1111110000
 0x069 45 7 1101110
 0x06a 6 10 1111110001
 0x06b 3 11 11111111001
 0x06c 29 8 11101110
 0x06d 21 8 11101111
 0x06e 46 7 1101111
 0x06f 51 7 1110000
 0x070 31 8 11110000
 0x071 2 12 111111111101
 0x072 37 8 11110001
 0x073 27 8 11110010
 0x074 40 7 1110001
 0x075 16 9 111110001
 0x076 6 10 1111110010
 0x077 8 10 1111110011
 0x078 4 11 11111111010
 0x079 9 10 1111110100
 0x07a 4 11 11111111011
 0x100 1 13 1111111111111
 0x101 1248 2 00
 0x102 580 4 0100
 0x103 168 5 10000
 0x104 160 5 10001
 0x105 72 7 1110010
 0x106 119 6 110001
 0x107 82 6 110010
 0x108 17 9 111110010
 0x109 116 6 110011
 0x10a 253 5 10010
 0x10b 120 6 110100
 0x10c 52 7 1110011
 0x10d 69 7 1110100
 0x10e 79 6 110101
 0x10f 20 8 11110011
 0x110 54 7 1110101
 0x111 236 5 10011
 0x112 371 4 0101
 0x113 34 8 11110100
 0x114 6 10 1111110101
 0x115 17 9 111110011
 0x116 6 10 1111110110

dsttbl
 0x002 1 10 1111111100
 0x003 2 10 1111111101
 0x008 2 10 1111111110
 0x009 2 10 1111111111
 0x00a 14 8 11111110
 0x00b 204 4 0100
 0x00c 225 4 0101
 0x00d 46 7 1111110
 0x00e 168 5 11100
 0x00f 85 6 111110
 0x010 202 4 0110
 0x011 130 5 11101
 0x012 193 4 0111
 0x013 146 5 11110
 0x014 259 4 1000
 0x015 209 4 1001
 0x016 301 4 1010
 0x017 255 4 1011
 0x018 377 3 000
 0x019 293 4 1100
 0x01a 433 3 001
 0x01b 332 4 1101

We determine HLIT and HDIST for this and combine the code lengths into one HLIT + HDIST length array.

HLIT: 279
HDIST: 28
 0 0 0 0 0 0 0 0 0 0 11 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 11 9 11 12 10 10 8 5 9 6 5 5 5 6 6 6 6 6 6 6 0 12 11 11 0 0 10 0 11 13 0 12 0 13 0 11 0 11 11 11 10 9 0 9 10 10 12 0 11 0 0 13 0 0 0 0 10 0 7 10 9 8 7 9 9 10 7 10 11 8 8 7 7 8 12 8 8 7 9 10 10 11 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 2 4 5 5 7 6 6 9 6 5 6 7 7 6 8 7 5 4 8 10 9 10 0 0 10 10 0 0 0 0 10 10 8 4 4 7 5 6 4 5 4 5 4 4 4 4 3 4 3 4

So this array has 307 entries and you can see that there is a lot of repetition. We next apply the repeat codes as appropriate to shorten this array to 139 entries.

HLIT: 279
HDIST: 28
 0 0 0 0 0 0 0 0 0 0 11 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 11 9 11 12 10 10 8 5 9 6 5 5 5 6 6 6 6 6 6 6 0 12 11 11 0 0 10 0 11 13 0 12 0 13 0 11 0 11 11 11 10 9 0 9 10 10 12 0 11 0 0 13 0 0 0 0 10 0 7 10 9 8 7 9 9 10 7 10 11 8 8 7 7 8 12 8 8 7 9 10 10 11 10 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 2 4 5 5 7 6 6 9 6 5 6 7 7 6 8 7 5 4 8 10 9 10 0 0 10 10 0 0 0 0 10 10 8 4 4 7 5 6 4 5 4 5 4 4 4 4 3 4 3 4
NCNT: 139
 17/7 11 0 0 12 18/7 6 17/3 11 9 11 12 10 10 8 5 9 6 5 5 5 6 16/3 0 12 11 11 0 0 10 0 11 13 0 12 0 13 0 11 0 11 11 11 10 9 0 9 10 10 12 0 11 0 0 13 17/1 10 0 7 10 9 8 7 9 9 10 7 10 11 8 8 7 7 8 12 8 8 7 9 10 10 11 10 11 18/122 13 2 4 5 5 7 6 6 9 6 5 6 7 7 6 8 7 5 4 8 10 9 10 0 0 10 10 17/1 10 10 8 4 4 7 5 6 4 5 4 5 4 16/0 3 4 3 4

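Here, when one of the repeat codes is used, I show the value of the extra bits that we will insert following the Huffman code for the symbol. For example, 17/7 is repeat code 17 with an extra-bits value of 7, which stands for a run of 3 + 7 = 10 zeros.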
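The repeat-code pass can be sketched as follows. This is a minimal greedy version, not the JANOS routine: the name rle_lengths and the output layout (each repeat code immediately followed by one entry holding its extra-bits value, as in the listing above) are assumptions for illustration.

```c
/* Run-length encode an array of code lengths using the DEFLATE repeat codes:
 *   16 = copy the previous length 3..6 times   (2 extra bits, value = count - 3)
 *   17 = repeat a zero length 3..10 times      (3 extra bits, value = count - 3)
 *   18 = repeat a zero length 11..138 times    (7 extra bits, value = count - 11)
 * 'out' must have room for at least n entries. Returns the number of entries
 * written, counting the extra-bits entries. */
int rle_lengths(const unsigned char *lens, int n, unsigned char *out)
{
    int i = 0, o = 0;

    while (i < n) {
        /* measure the run of identical lengths starting at i */
        int run = 1;
        while (i + run < n && lens[i + run] == lens[i])
            run++;

        if (lens[i] == 0 && run >= 3) {
            /* a run of zeros: code 17 for short runs, 18 for long ones */
            if (run > 138)
                run = 138;
            if (run <= 10) {
                out[o++] = 17;
                out[o++] = (unsigned char)(run - 3);
            } else {
                out[o++] = 18;
                out[o++] = (unsigned char)(run - 11);
            }
            i += run;
        } else if (lens[i] != 0 && run >= 4) {
            /* emit the length once, then "repeat previous" 3..6 times */
            int rep = run - 1;
            if (rep > 6)
                rep = 6;
            out[o++] = lens[i];
            out[o++] = 16;
            out[o++] = (unsigned char)(rep - 3);
            i += 1 + rep;
        } else {
            /* too short to be worth a repeat code */
            out[o++] = lens[i++];
        }
    }
    return o;
}
```

Fed the first 40 code lengths from the array above, this produces 17/7 11 0 0 12 18/7 6 17/3 11, the same opening as the NCNT listing.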

Now we create the Huffman table for this array and determine frequencies. Note that extra bits are inserted later and are not part of the Huffman encoding. Here the maximum code length is 7 bits.

		// Ugh. Now we need yet another Huffman table for this run-length alphabet. First we establish
		//  the frequencies.  Note that we skip the byte defining the extra bits.
		cnttbl = MM_alloc(HUFF_HLEN * sizeof(struct huff_t), &_bufflush);
		for (n = 0; n < ncnt; n++)
		{
			cnttbl[cntlist[n]].cnt++;
			if (cntlist[n] >= 16)
				n++;
		}

		// We need to determine the bit lengths. 
		if (!_bitlength(cnttbl, HUFF_HLEN, 7))
		{
			err = TRUE;
			break;
		}

This results in the following third Huffman table that we will use to code the code lengths which in turn are used to generate the literal and distance tables for the data compression.

cnttbl
 0x000 17 3 000
 0x002 1 6 111100
 0x003 2 6 111101
 0x004 9 4 1000
 0x005 11 3 001
 0x006 9 4 1001
 0x007 11 4 1010
 0x008 10 4 1011
 0x009 10 4 1100
 0x00a 19 3 010
 0x00b 14 3 011
 0x00c 6 4 1101
 0x00d 4 5 11100
 0x010 2 6 111110
 0x011 4 5 11101
 0x012 2 6 111111

Are we there yet?

Well, we are getting close.

As before we will need to convey the code lengths in this ‘cnttbl’ to the decompressor. We have N = {3 0 6 6 4 3 4 4 4 4 3 3 4 5 0 0 6 5 6}. I mentioned earlier that these are stored each with 3 bits in the bit stream limiting the code length to 7 bits.

But DEFLATE wants to save every possible byte. Each of these code lengths has some probability of being used across typical kinds of data. The designers decided to sequence them into an array in a custom order so that the code lengths least likely to be used fall at the end, where they can be trimmed. So we get the order from the specification and sequence ours accordingly. Notice that our three 0-bit code lengths do in fact get trimmed.

		// Finally (haha) we establish a custom order for these bit lengths
		// array defined above
		//static const char hclen_order[HUFF_HLEN] = {
		//	16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2,
		//	14, 1, 15
		//};
		bitlens = MM_alloc(HUFF_HLEN * sizeof(int), &_bufflush);
		for (n = 0; n < HUFF_HLEN; n++)
			bitlens[n] = cnttbl[hclen_order[n]].len;

		// Now the end of this array should be 0's so we find a length for the array
		for (hclen = HUFF_HLEN; 0 < hclen; hclen--)
			if (bitlens[hclen - 1] > 0)
				break;

 6 5 6 3 4 4 4 4 3 3 3 4 4 6 5 6 0 0 0
HCLEN: 16

It is hard to believe but we now have everything that we need to actually generate the compressed bit stream. Well, all except a couple of routines to actually do the serialization. So that will be the next step.

We’re going to output to a bit stream. Each write will likely involve a different number of bits. Someplace we have to properly sequence these into bytes to append to the output buffer.

To do this we need one routine to write called _writeb() and one to use at the end to flush any remaining unwritten bits called _flushb().

// Routine to stream bits
static void _writeb(int num, int code, struct bitstream_t *stream)
{
	// if no bits then nothing to do
	if (num == 0)
		return;

	// insert bits into the stream
	code &= ((1 << num) - 1);
	code <<= stream->nbits;
	stream->reg |= code;
	stream->nbits += num;

	// stream completed bytes
	while (stream->nbits >= 8)
	{
		stream->buffer[stream->length++] = (stream->reg & 0xff);
		stream->reg >>= 8;
		stream->nbits -= 8;
	}
}


// Routine flushes the bitstream
static void _flushb(struct bitstream_t *stream)
{
	if (stream->nbits)
	{
		stream->buffer[stream->length++] = (stream->reg & 0xff);
		stream->reg = 0;
		stream->nbits = 0;
	}
}

So to tie this together we use a structure.

// structure to assist with bit streaming
struct bitstream_t {
	char *buffer;
	int length;
	uint32_t reg;
	int nbits;
};

This seems simple enough but there are a couple things to understand. The DEFLATE specification gets into it right off the bat.

The first bit of the bit stream is the least significant bit of the first byte in the buffer. Once 8 bits are retrieved the 9th is the least significant bit of the second byte and so on.

A value appears in the stream starting with its least significant bit. That means that the bit order does not need to be reversed to pack the value at the tail of the current bit stream.

Huffman codes are processed a bit at a time. When you are reading a Huffman code you do not know how many bits you will need to retrieve the code for a valid symbol in the alphabet. So in this case you must insert the Huffman code so the most-significant bit is read first. The order of Huffman bits needs to be reversed. Armed with that fact, I have stored these codes in the tables in reverse order. That will be apparent in the following code to generate the tables for coding.
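A small sketch may make the two orderings concrete. The helper names reverse_bits and append_bits are hypothetical; the register convention (new bits go above the ones already pending) is the one described above.

```c
#include <stdint.h>

/* Reverse the low 'len' bits of 'code' so that the most-significant bit of
 * the Huffman code ends up in bit 0, where the reader will see it first. */
uint32_t reverse_bits(uint32_t code, int len)
{
    uint32_t r = 0;

    while (len-- > 0) {
        r = (r << 1) | (code & 1u);
        code >>= 1;
    }
    return r;
}

/* Append 'num' bits of 'value' above the 'nbits' bits already held in 'reg'.
 * Plain integer fields go in as-is, least-significant bit first. */
uint32_t append_bits(uint32_t reg, int nbits, int num, uint32_t value)
{
    return reg | ((value & ((1u << num) - 1u)) << nbits);
}
```

For example, the 4-bit Huffman code 1000 is stored as reverse_bits(0x8, 4) = 0001. Appending it to an empty register and then appending the 3-bit integer 5 gives 0x51. Read from bit 0 upward, the decoder sees 1, 0, 0, 0 (the code, most-significant bit first) followed by 1, 0, 1 (the value 5, least-significant bit first).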

So the last thing we need to do is generate the actual Huffman codes for the three tables.

		// Now we need Huffman codes for these tables because eventually someday we will actually be
		//  generating a compressed bit stream.

		// Next we total the number of symbols using each bit length. These will be used to assign
		//  bit codes for each alphabet symbol.
		litcnts = MM_alloc(16 * sizeof(int), &_bufflush);
		for (n = 0; n < 286; n++)
			if (littbl[n].len > 0)
				litcnts[littbl[n].len]++;

		dstcnts = MM_alloc(16 * sizeof(int), &_bufflush);
		for (n = 0; n < 30; n++)
			if (dsttbl[n].len > 0)
				dstcnts[dsttbl[n].len]++;

		bitcnts = MM_alloc(16 * sizeof(int), &_bufflush);
		for (n = 0; n < HUFF_HLEN; n++)
			if (cnttbl[n].len)
				bitcnts[cnttbl[n].len]++;

		// Now we calculate starting codes for each bit length group. This procedure is defined in the
		//  DEFLATE specification. We can define the Huffman tables in a compressed format provided that
		//  the Huffman tables follow a couple of additional rules. Using these starting codes we
		//  can assign codes for each alphabet symbol. Note that Huffman codes are processed bit-by-bit
		//  and therefore must be generated here in reverse bit order.

		// Define codes for the literal Huffman table
		startcd = MM_alloc(16 * sizeof(int), &_bufflush);
		for (n = 0; n < 15; n++)
			startcd[n + 1] = (startcd[n] + litcnts[n]) << 1;
		for (n = 0; n < 286; n++)
		{
			len = littbl[n].len;
			if (len)
			{
				c = startcd[len]++;
				while (len--)
				{
					littbl[n].code <<= 1;
					littbl[n].code |= (c & 1);
					c >>= 1;
				}
			}
		}

		// Define codes for the distance Huffman table
		for (n = 0; n < 15; n++)
			startcd[n + 1] = (startcd[n] + dstcnts[n]) << 1;
		for (n = 0; n < 30; n++)
		{
			len = dsttbl[n].len;
			if (len)
			{
				c = startcd[len]++;
				while (len--)
				{
					dsttbl[n].code <<= 1;
					dsttbl[n].code |= (c & 1);
					c >>= 1;
				}
			}
		}

		// Define codes for the bit length Huffman table
		for (n = 0; n < 15; n++)
			startcd[n + 1] = (startcd[n] + bitcnts[n]) << 1;
		for (n = 0; n < HUFF_HLEN; n++)
		{
			len = cnttbl[n].len;
			if (len)
			{
				c = startcd[len]++;
				while (len--)
				{
					cnttbl[n].code <<= 1;
					cnttbl[n].code |= (c & 1);
					c >>= 1;
				}
			}
		}
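As a cross-check on the procedure, here is a minimal sketch that assigns canonical codes from nothing but a list of code lengths. It follows the algorithm in the DEFLATE specification but is not the JANOS code: the function name assign_codes is made up for this example, and the codes are left in normal (most-significant bit first) order rather than reversed.

```c
#define MAXBITS 15

/* Assign canonical Huffman codes. len[i] is the code length of symbol i
 * (0 = unused, at most MAXBITS); code[i] receives the code, MSB first. */
void assign_codes(const int *len, int n, int *code)
{
    int count[MAXBITS + 1] = {0};
    int next[MAXBITS + 1] = {0};
    int i, bits, c = 0;

    /* count how many symbols use each code length */
    for (i = 0; i < n; i++)
        count[len[i]]++;
    count[0] = 0;   /* unused symbols do not consume any codes */

    /* the first code of each length follows from the shorter lengths */
    for (bits = 1; bits <= MAXBITS; bits++) {
        c = (c + count[bits - 1]) << 1;
        next[bits] = c;
    }

    /* hand out consecutive codes within each length, in symbol order */
    for (i = 0; i < n; i++)
        code[i] = len[i] ? next[len[i]]++ : 0;
}
```

Given the 19 code lengths for ‘cnttbl’ listed earlier, {3 0 6 6 4 3 4 4 4 4 3 3 4 5 0 0 6 5 6}, it yields 000 for symbol 0, 1000 for symbol 4, 11100 for symbol 13 and 111111 for symbol 18, matching the table shown above.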

Coming up next: Actually Generating the Bit Stream

The DEFLATE specification greatly oversimplifies the whole process by defining the block format in about a single page.

         We can now define the format of the block:

               5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286)
               5 Bits: HDIST, # of Distance codes - 1        (1 - 32)
               4 Bits: HCLEN, # of Code Length codes - 4     (4 - 19)

               (HCLEN + 4) x 3 bits: code lengths for the code length
                  alphabet given just above, in the order: 16, 17, 18,
                  0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15

                  These code lengths are interpreted as 3-bit integers
                  (0-7); as above, a code length of 0 means the
                  corresponding symbol (literal/length or distance code
                  length) is not used.

               HLIT + 257 code lengths for the literal/length alphabet,
                  encoded using the code length Huffman code

               HDIST + 1 code lengths for the distance alphabet,
                  encoded using the code length Huffman code

               The actual compressed data of the block,
                  encoded using the literal/length and distance Huffman
                  codes

               The literal/length symbol 256 (end of data),
                  encoded using the literal/length Huffman code

         The code length repeat codes can cross from HLIT + 257 to the
         HDIST + 1 code lengths.  In other words, all code lengths form
         a single sequence of HLIT + HDIST + 258 values.

There are a couple of reasons why the suggestion to teach JANOS how to compress files has been in Redmine for 5 years. That is about when, during development, JANOS began to use JAR files directly for application programming. The obvious reason that it took 5 years to implement is that there really isn’t a huge need for compression in the JNIOR. Storage was limited in the previous series, and if you had wanted to keep log files around for any serious length of time it would have helped if we could compress them. The Series 4 has much more storage but still not a lot by today’s standards.

The real reason you may now realize: it is a bit involved. Given that JANOS uses no third-party code and that I had not been able to find usable reference materials (search engines having been damaged by marketing greed), there were some hurdles that left this suggestion sitting on the list at a low priority for practically ever. Well, we’ve got it done now.

Let’s generate the bit stream and be done with this tome.

First we stream the Block Header. The very first bit indicates whether or not the block is the last for the compression. For the last block this BFINAL bit will be a 1. It is 0 otherwise. We also are using the dynamic Huffman table type. The next two bits indicate the block type BTYPE. Here we use 10b as we will be providing our own Huffman tables. Yeah, we could have taken the easy path and used a fixed set of tables (BTYPE 01b) but that is no fun.

		// Now we have codes and everything that we need (Finally!) to generate the compressed bit
		//  stream.

		// set BFINAL and type BTYPE of 10
		_writeb(1, final ? 1 : 0, stream);
		_writeb(2, 0x02, stream);

We have already determined the sizes of our alphabets. We have HLIT defining the size of the literal alphabet, which includes the codes for sequence lengths. The complete alphabet space covers the range 0..285, but since we are not likely to use them all HLIT defines how many are actually used (0..HLIT-1). We also have HDIST playing a similar role for the distance alphabet, which could range 0..29.

Those two parameters are supplied next. Here each value is conveyed in 5 bits. We can do that since we know that HLIT has to be at least 257, because we do have to use the end-of-block code of 256 in our alphabet. So what we really supply is the count of length codes used, which is similar in magnitude to the count of distance codes. So HLIT is supplied as HLIT – 257 and HDIST as HDIST – 1.

		// Now we output HLIT, HDIST, and HCLEN
		_writeb(5, hlit - 257, stream);
		_writeb(5, hdist - 1, stream);
		_writeb(4, hclen - 4, stream);

		// Output the HCLEN counts from the bit length array. Note that we have already ordered it
		//  as required.
		for (n = 0; n < hclen; n++)
			_writeb(3, bitlens[n], stream);

Following the delivery of HLIT and HDIST we supply HCLEN as HCLEN – 4. Recall that HCLEN defines the size of the alphabet we are going to use to encode the array of code lengths for the main two Huffman tables. HCLEN defines the number of elements we are going to supply for that 19 element array that has the custom order. The balance beyond HCLEN elements is assumed to be 0. This is the array whose elements are supplied with 3 bits and thus limited to the range 0..7. We include HCLEN elements simply as shown above.
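To see what these fields look like on the wire, the following sketch packs the header for the example block (HLIT 279, HDIST 28, HCLEN 16). It assumes the block is the final one and uses a stripped-down bit writer with made-up names (struct bits, put_bits, write_header) so that it stands alone.

```c
#include <stdint.h>
#include <stddef.h>

/* minimal LSB-first bit writer, for illustration only */
struct bits {
    uint8_t *buf;
    size_t len;
    uint32_t reg;
    int nbits;
};

void put_bits(struct bits *s, int num, uint32_t value)
{
    /* new bits are placed above the bits still pending in the register */
    s->reg |= (value & ((1u << num) - 1u)) << s->nbits;
    s->nbits += num;
    while (s->nbits >= 8) {
        s->buf[s->len++] = (uint8_t)(s->reg & 0xff);
        s->reg >>= 8;
        s->nbits -= 8;
    }
}

/* Write BFINAL, BTYPE, HLIT, HDIST and HCLEN. Returns the bytes produced. */
size_t write_header(uint8_t *buf, int final, int hlit, int hdist, int hclen)
{
    struct bits s = { buf, 0, 0, 0 };

    put_bits(&s, 1, final ? 1u : 0u);       /* BFINAL */
    put_bits(&s, 2, 2u);                    /* BTYPE 10b: dynamic tables */
    put_bits(&s, 5, (uint32_t)(hlit - 257));
    put_bits(&s, 5, (uint32_t)(hdist - 1));
    put_bits(&s, 4, (uint32_t)(hclen - 4));

    /* pad the leftover bits into one more byte, for display only */
    if (s.nbits)
        buf[s.len++] = (uint8_t)(s.reg & 0xff);
    return s.len;
}
```

With those values the 17 header bits come out as the bytes B5 9B, plus one leftover bit. The sketch pads that bit into a third byte (01) only so it can be displayed; in the real stream the 3-bit code lengths continue in that same byte.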

Next we stream the concatenated array of code lengths. This array has HLIT + HDIST elements that have been compressed using repeat codes. Here we output the compressed array using the 19 symbol alphabet that we just defined for the decompressor by sending the array above. We encounter the extra bits usage for the first time. You can see in the following that after streaming the repeat codes 16, 17 or 18 we insert the field specifying the repeat count using the required number of extra bits.

		// Output the run-length compressed HLIT + HDIST code lengths using the code length
		//  Huffman codes. The two tables are blended together in the run-length encoding.
		//  Note that we need to insert extra bits where appropriate.
		for (n = 0; n < ncnt; n++)
		{
			c = cntlist[n];
			_writeb(cnttbl[c].len, cnttbl[c].code, stream);

			if (c >= 16)
			{
				switch (c)
				{
				case 16:
					_writeb(2, cntlist[++n], stream);
					break;
				case 17:
					_writeb(3, cntlist[++n], stream);
					break;
				case 18:
					_writeb(7, cntlist[++n], stream);
					break;
				}
			}
		}

If you are the decompressor at this point you have everything you need to develop the Huffman tables for literals and distance codes. You had to expand the repeat compression and separate the two code length sets. You then tallied code length counts and calculated starting codes for each bit length. Finally you assigned codes to each used alphabet symbol. So we are ready to deal with the actual compressed data.
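The expansion step on the decompressor side can be sketched as below. This is an illustration under the same assumption used for the listing earlier (each repeat code is followed by one entry carrying its extra-bits value). The name expand_lengths is hypothetical, and a real decoder would pull the symbols and extra bits straight from the bit stream instead.

```c
/* Expand a run-length coded list of code lengths. Symbols 0..15 are literal
 * lengths; 16, 17 and 18 are repeat codes, each followed by its extra-bits
 * value. Returns the number of lengths written, or -1 on malformed input. */
int expand_lengths(const unsigned char *in, int n, unsigned char *out, int max)
{
    int i, o = 0;

    for (i = 0; i < n; i++) {
        int sym = in[i];
        int rep, val;

        if (sym < 16) {
            if (o >= max)
                return -1;
            out[o++] = (unsigned char)sym;
            continue;
        }
        if (sym > 18 || i + 1 >= n)
            return -1;              /* unknown code or missing extra bits */

        if (sym == 16) {
            if (o == 0)
                return -1;          /* nothing to repeat yet */
            val = out[o - 1];
            rep = 3 + in[++i];
        } else if (sym == 17) {
            val = 0;
            rep = 3 + in[++i];
        } else {
            val = 0;
            rep = 11 + in[++i];
        }
        if (o + rep > max)
            return -1;
        while (rep-- > 0)
            out[o++] = (unsigned char)val;
    }
    return o;
}
```

The first HLIT entries of the result are the literal/length code lengths and the remaining HDIST entries are the distance code lengths.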

When we initially scanned the LZ77 compressed data to determine frequencies we had to temporarily expand the length-distance references to determine which codes from the length and distance alphabets we would be using. Well the following loop is practically identical because we again process the LZ77 compressed data expanding the length-distance references. This time we will output the codes to the bit stream. Here again we have the extra bits to insert. Where we ignored them before, now we insert them into the stream when necessary.

		// Unbelievable! We can actually now output the compressed data for the block! This is
		//  encoded using the literal and distance Huffman tables as required. Note we need to
		//  again process the escaping and length-distance codes.
		for (n = 0; n < osize; n++)
		{
			c = obuf[n];
			if (c != 0xff)
				_writeb(littbl[c].len, littbl[c].code, stream);
			else
			{
				// check for and output an escaped 0xff
				if (obuf[++n] == 0xff)
					_writeb(littbl[0xff].len, littbl[0xff].code, stream);
				else
				{
					// table defined above
					//static const int lcode_ofs[29] = {
					//	3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
					//	35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
					//};

					// determine required length code for lengths (3..258). This code is
					//  coded in the literal table.
					len = (obuf[n++] & 0xff) + 3;
					for (c = 0; c < 29; c++)
						if (lcode_ofs[c] > len)
							break;
					code = 256 + c;
					_writeb(littbl[code].len, littbl[code].code, stream);

					// insert extra bits as required by the code
					// table defined above
					//static const int lcode_bits[29] = {
					//	0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
					//	3, 4, 4, 4, 4, 5, 5, 5, 5, 0
					//};
					c = lcode_bits[code - 257];
					if (c)
						_writeb(c, len - lcode_ofs[code - 257], stream);

					// table defined above
					//static const int dcode_ofs[30] = {
					//	1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
					//	257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
					//	8193, 12289, 16385, 24577
					//};

					// determine required distance code for distances (1..32768). This code is
					//  coded in the distance table.
					dst = (obuf[n++] & 0xff) << 8;
					dst |= (obuf[n] & 0xff);
					for (c = 0; c < 30; c++)
						if (dcode_ofs[c] > dst)
							break;
					code = c - 1;
					_writeb(dsttbl[code].len, dsttbl[code].code, stream);

					// insert extra bits as required by the code
					// table defined above
					//static const int dcode_bits[30] = {
					//	0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8,
					//	9, 9, 10, 10, 11, 11, 12, 12, 13, 13
					//};
					c = dcode_bits[code];
					if (c)
						_writeb(c, dst - dcode_ofs[code], stream);
				}
			}
		}
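The two table searches in this loop are easy to try in isolation. The sketch below wraps them in two helper functions (length_code and distance_code, names invented here) that return the alphabet code, the number of extra bits, and the extra-bits value. The tables are the ones quoted in the comments above.

```c
static const int lcode_ofs[29] = {
    3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258
};
static const int lcode_bits[29] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3,
    3, 4, 4, 4, 4, 5, 5, 5, 5, 0
};
static const int dcode_ofs[30] = {
    1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
    257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
    8193, 12289, 16385, 24577
};
static const int dcode_bits[30] = {
    0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8,
    9, 9, 10, 10, 11, 11, 12, 12, 13, 13
};

struct lzcode {
    int code;    /* symbol in the literal/length or distance alphabet */
    int nbits;   /* number of extra bits that follow the symbol */
    int extra;   /* value carried in those extra bits */
};

/* Map a match length (3..258) to its length code (257..285). */
struct lzcode length_code(int len)
{
    struct lzcode r;
    int c;

    /* find the first table entry beyond len; the entry before it applies */
    for (c = 0; c < 29; c++)
        if (lcode_ofs[c] > len)
            break;
    r.code = 256 + c;
    r.nbits = lcode_bits[c - 1];
    r.extra = len - lcode_ofs[c - 1];
    return r;
}

/* Map a match distance (1..32768) to its distance code (0..29). */
struct lzcode distance_code(int dst)
{
    struct lzcode r;
    int c;

    for (c = 0; c < 30; c++)
        if (dcode_ofs[c] > dst)
            break;
    r.code = c - 1;
    r.nbits = dcode_bits[c - 1];
    r.extra = dst - dcode_ofs[c - 1];
    return r;
}
```

For instance, a length of 13 maps to code 266 with one extra bit of value 0, and a distance of 24 maps to distance code 8 with three extra bits of value 7.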

And finally we output that end-of-block code (256). If BFINAL were set and this were the last block in the compression (LZ77 flushing the last partial buffer of data) we would flush the serial buffer. This provides a final byte with the remaining bits for the stream.

			// And finally the end-of-block code and flush
			_writeb(littbl[256].len, littbl[256].code, stream);
			if (final)
				_flushb(stream);

At this point the buffer supplied by the LZ77 compression has been flushed to the bit stream. We would reset that buffer pointer and return. If there is more data the LZ77 compressor will then proceed with it. Remember that I had defined this buffer to be 64KB so there will likely be more in the case of larger files.

This DEFLATE compression capability will be part of JANOS v1.6.4 and later. I am likely to do some optimization before release. Here we see that it works. I’ll create an archive containing the files in my JNIOR’s root folder.

bruce_dev /> zip -c test.zip /
 6 files saved
bruce_dev /> 

bruce_dev /> zip test.zip
     Size   Packed          CRC32        Modified
    56096    10764   81%  0b779605  Jan 31 2018 08:42  jniorsys.log
    40302     6249   84%  b1dffc05  Jan 26 2018 07:38  web.log
    22434     9499   58%  059a09d9  Jan 25 2018 14:53  manifest.json
       89       89    0%  f97bbba2  Jan 28 2018 09:11  access.log
      990      460   54%  9699bbfe  Jan 30 2018 14:48  jniorboot.log.bak
      990      461   53%  8f3c0390  Jan 31 2018 08:42  jniorboot.log
 6 files listed
bruce_dev />

This archive verifies. This verification does decompress each file and confirm the CRC32.

bruce_dev /> zip -vl test.zip
  verifying: jniorsys.log (compressed)
  verifying: web.log (compressed)
  verifying: manifest.json (compressed)
  verifying: access.log
  verifying: jniorboot.log.bak (compressed)
  verifying: jniorboot.log (compressed)
 6 entries found - content verifies!
bruce_dev />
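For reference, the CRC32 column is the standard ZIP checksum (reflected polynomial 0xEDB88320). A compact bit-at-a-time sketch is shown here. It is not the JANOS routine, which may well be table-driven, and the name crc32_update is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Update a CRC-32 with 'len' bytes of data. Pass 0 as 'crc' for the first
 * chunk and the previous return value for any following chunks. */
uint32_t crc32_update(uint32_t crc, const unsigned char *data, size_t len)
{
    size_t i;
    int k;

    crc = ~crc;                         /* undo the final inversion */
    for (i = 0; i < len; i++) {
        crc ^= data[i];
        for (k = 0; k < 8; k++) {
            /* shift out one bit; apply the polynomial if that bit was set */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}
```

The conventional check value applies: the nine ASCII characters 123456789 give CBF43926.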

We can repeat the construction in verbose output mode. Here we see timing. Again, keep in mind that JANOS is running on a 100 MHz Renesas RX63N micro-controller.

bruce_dev /> zip -cl test.zip /
  deflate: /jniorsys.log (56096 bytes)
   saving: jniorsys.log (compressed 80.8%) 1.176 secs
  deflate: /web.log (40302 bytes)
   saving: web.log (compressed 84.5%) 0.671 secs
  deflate: /manifest.json (22434 bytes)
   saving: manifest.json (compressed 57.7%) 1.863 secs
   saving: access.log (stored) 0.011 secs
  deflate: /jniorboot.log.bak (990 bytes)
   saving: jniorboot.log.bak (compressed 53.5%) 0.054 secs
  deflate: /jniorboot.log (990 bytes)
   saving: jniorboot.log (compressed 53.4%) 0.054 secs
 6 files saved
bruce_dev />

The /etc/JanosClasses.jar file compresses as well. This file is where I first encountered Huffman tables whose bit depth (code length) exceeded the DEFLATE maximums (15 bits for the literal and distance tables, 7 bits for the code length encoding).

bruce_dev /> zip -cl test.zip /etc
  deflate: /etc/JanosClasses.jar (266601 bytes)
   saving: etc/JanosClasses.jar (compressed 11.2%) 24.215 secs
 1 files saved
bruce_dev /> 

bruce_dev /> zip test.zip
     Size   Packed          CRC32        Modified
   266601   236758   11%  20916587  Jan 11 2018 09:58  etc/JanosClasses.jar
 1 files listed
bruce_dev /> 

bruce_dev /> zip -vl test.zip
  verifying: etc/JanosClasses.jar (compressed)
 1 entries found - content verifies!
bruce_dev />

I know that 24 seconds for a 1/4 megabyte file is nothing to write home about. Now that things are fully functional I can go back and work on the LZ77 stage, where basically all of the time is consumed. I can certainly improve on this performance, but as it is, it isn’t that bad. The JNIOR is a controller after all and you likely wouldn’t need to compress other archives.

I noticed that the JANOS runtime library for applications did not support a means of data encryption and decryption. It isn’t a problem to expose a cipher algorithm for use by applications. I have added the Security.rc4cipher() method for this purpose. I know that RC4 has been rumored to have been broken. For our purposes it remains plenty secure.

Here’s a test program. This requires JANOS v1.6.3-rc4 or later.

package jtest;
 
import com.integpg.system.Security;
 
public class Main {
    
    public static void main(String[] args) throws Exception {
        
        // source text and cipher key
        String text = "Best thing since sliced bread.";
        byte[] key = "Piece of cake".getBytes();
        
        // encrypt
        byte[] coded = Security.rc4cipher(text.getBytes(), text.length(), key);
        
        // encrypted content
        for (int n = 0; n < coded.length; n++) {
            System.out.printf(" %02x", coded[n] & 0xff);
            if (n % 16 == 15 || n == coded.length - 1)
                System.out.println("");
        }
        
        // decrypt
        byte[] result = Security.rc4cipher(coded, coded.length, key);
        
        // received message
        String msg = new String(result);
        System.out.println(msg);      
        
    }
}

This program outputs the following when run.

bruce_dev /> jtest
 ae 87 ae 84 bc 3e c2 b6 92 0f 25 c0 30 42 03 ef
 96 39 c5 cd b3 99 6f aa 36 ba c8 58 5b fd
Best thing since sliced bread.

bruce_dev />

To be honest I have not confirmed that the encoded string is in fact RC4. But JANOS uses the underlying cipher in many places and it has proven to be accurate there.
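One way to check would be to run the same key and text through an independent RC4 implementation and compare the bytes. The textbook algorithm is short enough to sketch here. This is not the JANOS code, and the function name rc4 is just for illustration.

```c
#include <stddef.h>

/* Textbook RC4: key schedule followed by keystream generation. The same call
 * encrypts and decrypts. 'keylen' must be at least 1; 'out' may equal 'in'. */
void rc4(const unsigned char *key, size_t keylen,
         const unsigned char *in, unsigned char *out, size_t len)
{
    unsigned char S[256], t;
    unsigned int i, j;
    size_t n;

    /* key-scheduling: start with the identity permutation and mix in the key */
    for (i = 0; i < 256; i++)
        S[i] = (unsigned char)i;
    for (i = 0, j = 0; i < 256; i++) {
        j = (j + S[i] + key[i % keylen]) & 0xff;
        t = S[i]; S[i] = S[j]; S[j] = t;
    }

    /* generate one keystream byte per data byte and XOR it in */
    i = j = 0;
    for (n = 0; n < len; n++) {
        i = (i + 1) & 0xff;
        j = (j + S[i]) & 0xff;
        t = S[i]; S[i] = S[j]; S[j] = t;
        out[n] = in[n] ^ S[(S[i] + S[j]) & 0xff];
    }
}
```

Because RC4 simply XORs the data with a key-dependent byte stream, calling the routine a second time with the same key restores the original text, which is why the test program uses one method for both directions. A widely published test vector can be used as a sanity check: the key "Key" applied to the text "Plaintext" should give BB F3 16 E8 D9 40 AF 0A D3.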

Remember PGP? I think that stood for (or stands for) Pretty Good Privacy. This basically was a simple approach to encrypting data for transfer through the email system. It used the RSA Private/Public Key technology. Well, JANOS does RSA as part of my SSL/TLSv1.2 implementation. Why shouldn’t I expose that for use by applications? You may need to securely pass information.

Hypothetically the JNIOR could be monitoring doors and conveyors collecting numbers that might be directly related to sales or something that you might consider to be proprietary and quite sensitive. Each day you would like to forward the results to an email account. While the email transfer from the JNIOR is done over a secure connection the data is not stored at the other end in any encrypted format. Nor are you sure that the data is then transferred over any remaining connections securely.

The solution is to encrypt the data at the source and later decrypt. Well you can do that now with RC4 provided that you keep the key private. The same key is used to encrypt and then at the other location to decrypt. It is a risk.

Here the RSA key pair comes to the rescue. So I have exposed it in JANOS v1.6.3. Basically you can use a public key to encrypt data that can only be decrypted some time later by the corresponding private key.
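The asymmetry is easiest to see with toy numbers. The sketch below uses the classic textbook key (n = 3233, e = 17, d = 2753), which is far too small for real use, and omits the padding that any practical RSA scheme applies. The helper name modpow is an assumption; nothing here reflects how JANOS implements it.

```c
#include <stdint.h>

/* Square-and-multiply modular exponentiation: (base ^ exp) mod 'mod'.
 * Safe from overflow only while 'mod' fits in 32 bits, which is plenty for
 * a toy example but nowhere near a real RSA modulus. */
uint64_t modpow(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t result = 1 % mod;

    base %= mod;
    while (exp > 0) {
        if (exp & 1u)
            result = (result * base) % mod;   /* multiply step */
        base = (base * base) % mod;           /* square step */
        exp >>= 1;
    }
    return result;
}
```

Encrypting the value 65 with the public pair (e, n) gives modpow(65, 17, 3233) = 2790, and only the private exponent recovers it: modpow(2790, 2753, 3233) = 65.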

So you can use OpenSSL to generate an RSA key pair. Use a 1024-bit key as anything larger will tax the JNIOR a little too much. From that you can export the RSA Public Key in PEM format. It will look like this.

-----BEGIN PUBLIC KEY-----
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEwEHsRkk592MEFyZXvvfsDkaF
u169uXwKugo2J7JMh8fkruiKe7B2tbuA143JSYeI0o4mpqWwd06CbjDG2gVEMgbf
5SK7quMdflJ5mW7t3ZPQZdMdryttPq3C4pzTfuH6/MGMzaNdobXSOQ7+SkH7goRd
sUYx6flLXn1KnQjPCQIDAQAB
-----END PUBLIC KEY-----

I will show you how you can use this PEM formatted Public Key to encrypt data for transfer. Later you can use the corresponding Private Key that you have kept secret and sequestered to access the data.

The following program uses new extensions to the com.integpg.system.Security class.

Here we are demonstrating encryption using our internal Public Key and then successful decryption using the internal Private Key.

package jtest;
 
import com.integpg.system.Debug;
import com.integpg.system.Security;
 
public class Main {
    
    public static void main(String[] args) throws Exception {
 
        String msg = "The quick brown fox jumped over the lazy dog.";
        System.out.println(msg);
 
        byte[] data = Security.encrypt(msg.getBytes(), msg.length(), Security.PUBKEY);
        System.out.println("encrypted: ");
        Debug.dump(data);
        
        byte[] result = Security.decrypt(data, data.length, Security.PRIVKEY);
        System.out.println("decrypted: ");
        Debug.dump(result);
        
    }
        
}

bruce_dev /> jtest
The quick brown fox jumped over the lazy dog.
encrypted: 
 17 66 0a 66 d8 aa 67 7c-a6 41 81 69 b1 c9 d2 82    .f.f..g| .A.i....
 ab a6 9d ef fd 31 7b 67-2a 3a 23 82 05 55 3d dd    .....1{g *:#..U=.
 8a 33 36 2d 5c 61 ae 25-39 b6 40 28 5f 1f de d2    .36-\a.% 9.@(_...
 77 b4 47 9d 53 6c ee 7a-4b e2 29 8c e0 79 06 9f    w.G.Sl.z K.)..y..
 30 3c 2e 6e d0 41 cf 40-a2 2b e5 bd 03 dd d4 b4    0<.n.A.@ .+......
 a2 b4 d1 8b 33 31 f1 2e-84 e0 8d 01 b0 4d 7b 54    ....31.. .....M{T
 65 61 56 44 ee f4 45 fb-4a 39 96 c1 c9 0e 2a 2a    eaVD..E. J9....**
 3d 2b a6 71 a8 89 91 c0-cf 80 0b 3d e3 dc dc 8e    =+.q.... ...=....
decrypted: 
 54 68 65 20 71 75 69 63-6b 20 62 72 6f 77 6e 20    The.quic k.brown.
 66 6f 78 20 6a 75 6d 70-65 64 20 6f 76 65 72 20    fox.jump ed.over.
 74 68 65 20 6c 61 7a 79-20 64 6f 67 2e             the.lazy .dog.

bruce_dev />

By the way, the dump() method in the com.integpg.system.Debug class is also new. I got tired of coding a dump so it will be available now.

I will show you how to use an external Public Key for encryption next.

To show you how to encrypt using a supplied Public Key I will extract the internal public key and apply it as you would one obtained from a file let’s say. The following program uses a method in the class that supplies the Public Key.

package jtest;

import com.integpg.system.Debug;
import com.integpg.system.Security;

public class Main {
    
    public static void main(String[] args) throws Exception {
        
        // Let's see the Public Key
        byte[] pubkey = Security.pubkey();
        System.out.println(new String(pubkey));
 
        String msg = "The quick brown fox jumped over the lazy dog.";
        System.out.println(msg);
 
        byte[] data = Security.encrypt(msg.getBytes(), msg.length(), pubkey, 0);
        System.out.println("encrypted: ");
        Debug.dump(data);
        
        byte[] result = Security.decrypt(data, data.length, Security.PRIVKEY);
        System.out.println("decrypted: ");
        Debug.dump(result);
        
    }
        
}

bruce_dev /> jtest
-----BEGIN PUBLIC KEY-----
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEwEHsRkk592MEFyZXvvfsDkaF
u169uXwKugo2J7JMh8fkruiKe7B2tbuA143JSYeI0o4mpqWwd06CbjDG2gVEMgbf
5SK7quMdflJ5mW7t3ZPQZdMdryttPq3C4pzTfuH6/MGMzaNdobXSOQ7+SkH7goRd
sUYx6flLXn1KnQjPCQIDAQAB
-----END PUBLIC KEY-----

The quick brown fox jumped over the lazy dog.
encrypted: 
 a8 35 44 4a 15 4e 1f fe-b4 30 c3 e6 51 38 90 be    .5DJ.N.. .0..Q8..
 e4 4f 7c 5d fb e6 38 16-63 f1 93 ba a5 3f 24 00    .O|]..8. c....?$.
 eb 46 5d 27 25 f1 5a b1-bf 0e 46 f9 5b 1b e9 13    .F]'%.Z. ..F.[...
 ac 6c 77 db bd 1e 22 be-b5 32 6b 5c cc 0b 46 d7    .lw...". .2k\..F.
 3f 1b 30 4c 61 03 eb 2f-dd 84 54 d5 35 86 32 56    ?.0La../ ..T.5.2V
 16 56 7c 41 a3 ef 2f 70-2d 67 3f a5 97 fb 60 c2    .V|A../p -g?...`.
 df 61 5f 5a 76 90 56 db-21 66 6f f3 00 af aa a8    .a_Zv.V. !fo.....
 71 a2 a1 2e 31 7d 82 ab-34 e2 cc 3b 52 64 32 09    q...1}.. 4..;Rd2.
decrypted: 
 54 68 65 20 71 75 69 63-6b 20 62 72 6f 77 6e 20    The.quic k.brown.
 66 6f 78 20 6a 75 6d 70-65 64 20 6f 76 65 72 20    fox.jump ed.over.
 74 68 65 20 6c 61 7a 79-20 64 6f 67 2e             the.lazy .dog.

bruce_dev />

You can export the JNIOR’s Public Key now using the CERTMGR command.

bruce_dev /> help certmgr
CERTMGR

 -V             Verify installed keys and certificate
 -C [file]      Regenerate Certificate [Install file]
 -S file        Verify signature on certificate
 -K file        Install RSA Key Pair
 -D [file]      Decode and dump certificate [file]
 -E file        Export certificate to file
 -P file        Export public key to file
 -B             Export in binary
 -G [len]       Generate key pair [bit length]
 -R             Restore default credentials

SSL Certificate Management.

bruce_dev />

Here I will export the public key to a file. I’ll show you what is in the file and I’ll use CERTMGR to dump the encoded ASN.1 format for the key.

bruce_dev /> certmgr -p mykey.pub

bruce_dev /> cat mykey.pub
-----BEGIN PUBLIC KEY-----
MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEwEHsRkk592MEFyZXvvfsDkaF
u169uXwKugo2J7JMh8fkruiKe7B2tbuA143JSYeI0o4mpqWwd06CbjDG2gVEMgbf
5SK7quMdflJ5mW7t3ZPQZdMdryttPq3C4pzTfuH6/MGMzaNdobXSOQ7+SkH7goRd
sUYx6flLXn1KnQjPCQIDAQAB
-----END PUBLIC KEY-----

bruce_dev /> certmgr -d mykey.pub

0000  30 81 9F       SEQUENCE {  (159 bytes)
0003  30 0D          |  SEQUENCE {  (13 bytes)
0005  06 09          |  |  OBJECT IDENTIFIER 1.2.840.113549.1.1.1
0010  05 00          |  |  NULL 
                     |  }
0012  03 81 8D       |  BITSTRING[140] Encapsulates {
0000  30 81 89       |  |  SEQUENCE {  (137 bytes)
0003  02 81 81       |  |  |  INTEGER 
                     |  |  |     C4C041EC464939F76304172657BEF7EC0E4685BB5EBDB97C
                     |  |  |     0ABA0A3627B24C87C7E4AEE88A7BB076B5BB80D78DC94987
                     |  |  |     88D28E26A6A5B0774E826E30C6DA05443206DFE522BBAAE3
                     |  |  |     1D7E5279996EEDDD93D065D31DAF2B6D3EADC2E29CD37EE1
                     |  |  |     FAFCC18CCDA35DA1B5D2390EFE4A41FB82845DB14631E9F9
                     |  |  |     4B5E7D4A9D08CF09
0087  02 03          |  |  |  INTEGER 010001
                     |  |  }
                     |  }
                     }

bruce_dev />

As you can see, you can now take mykey.pub and send it to another JNIOR, which can load it as the pubkey for encryption as demonstrated.
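The exported file is an ordinary PEM-wrapped public key, so other platforms can read it as well. The following is a minimal sketch, assuming a desktop JVM (Java 8 or later) rather than JANOS. The class and method names are made up for illustration. It strips the PEM armor, Base64-decodes the body, and hands the resulting DER bytes to the standard KeyFactory.

```java
import java.security.GeneralSecurityException;
import java.security.KeyFactory;
import java.security.interfaces.RSAPublicKey;
import java.security.spec.X509EncodedKeySpec;
import java.util.Base64;

// Hypothetical helper for illustration; not part of the JANOS runtime.
class PemPublicKeyDemo {

    // The content of mykey.pub exactly as listed by the cat command above.
    static final String PEM =
        "-----BEGIN PUBLIC KEY-----\n" +
        "MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQDEwEHsRkk592MEFyZXvvfsDkaF\n" +
        "u169uXwKugo2J7JMh8fkruiKe7B2tbuA143JSYeI0o4mpqWwd06CbjDG2gVEMgbf\n" +
        "5SK7quMdflJ5mW7t3ZPQZdMdryttPq3C4pzTfuH6/MGMzaNdobXSOQ7+SkH7goRd\n" +
        "sUYx6flLXn1KnQjPCQIDAQAB\n" +
        "-----END PUBLIC KEY-----\n";

    // Remove the BEGIN/END lines and all whitespace, then Base64-decode to DER.
    static byte[] pemToDer(String pem) {
        String body = pem.replaceAll("-----[A-Z ]+-----", "").replaceAll("\\s", "");
        return Base64.getDecoder().decode(body);
    }

    // Interpret the DER bytes as an X.509 SubjectPublicKeyInfo holding an RSA key.
    static RSAPublicKey load(String pem) {
        try {
            X509EncodedKeySpec spec = new X509EncodedKeySpec(pemToDer(pem));
            return (RSAPublicKey) KeyFactory.getInstance("RSA").generatePublic(spec);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("not a usable RSA public key", e);
        }
    }

    public static void main(String[] args) {
        RSAPublicKey key = load(PEM);
        System.out.println("modulus bits: " + key.getModulus().bitLength());
        System.out.println("exponent:     " + key.getPublicExponent().toString(16));
    }
}
```

The modulus and exponent it reports should correspond to the two INTEGER values in the CERTMGR dump above.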

NO. THERE IS NO WAY TO EXPORT THE JNIOR’S PRIVATE KEY.

Also, encryption and decryption do not support the use of a private key in PEM format.

Why limit key size to 1024-bits on the JNIOR?

A 1024-bit Private Key operation (encrypting a single block of 128 bytes) on the JNIOR takes about 3.4 seconds. The same operation using a 2048-bit key takes almost 26 seconds. That will cause browsers to time out when trying to use HTTPS, among other things.

A 2048-bit key can be installed on the JNIOR. You need a 2048-bit key pair, which you can generate with OpenSSL.

OpenSSL> genpkey -out private.pem -des3 -algorithm rsa rsa_keygen_bits:2048
.................++++++
....................++++++
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
OpenSSL> genpkey -out private.pem -des3 -algorithm rsa -pkeyopt rsa_keygen_bits:
2048
................................................................................
..............................................................................++
+
................................+++
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:
OpenSSL>

Move the resulting private.pem file onto the JNIOR and run the CERTMGR -K command to load it.

bruce_dev /> certmgr -k private.pem
Passphrase: *****
keys installed

bruce_dev />

Now let’s validate that it works.

bruce_dev /> certmgr -v            
2048-bit key pair verifies
private key operation requires about 25.7 seconds
certificate verifies 
certificate not valid with current keys

bruce_dev />

Oh, and we can update the certificate. That would likely happen automatically at some point but we can force it.

bruce_dev /> certmgr -c
certificate updated

bruce_dev /> certmgr -v
2048-bit key pair verifies
private key operation requires about 25.7 seconds
certificate verifies 

bruce_dev />

Let’s run the program mentioned earlier to see if it succeeds.

bruce_dev /> jtest
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEArvnTH4JvTzzVW76iFOKf
akQ2EqbXVhEEoDZ0d1x2Q/8R8jvwZAdvvlcV63ixvTBSR+xInCfVAsjQDzeOVQq/
kKsQm7VNeqTHAZ4TobKYpcG2N3n4PGQRhT1H0bwqfopEWg/iqauCejKX6ivInZC6
kPD1rkCbr6HRSgnKKNbnmIL37nrx3XlhkrHeOV1/jEBGTFm1KIpNmVkaN83PuZSk
Xj2RJ9TGTdVwtrSNVe+VnQ4s+66BlPZrXBfi4P8lSyNd4J0eIVujfXor3Kxz2TAD
zmgyMylyJuHO/Ss/3PdGwnXIx7fbNKEK4OnMwRdz0DtvSMcJE6NhBiimD/Jm2F0H
awIDAQAB
-----END PUBLIC KEY-----

The quick brown fox jumped over the lazy dog.
encrypted: 
 41 2b 02 92 30 d5 50 7c-92 b6 95 eb 8c 8d f4 76    A+..0.P| .......v
 f1 22 0a c5 63 48 f7 1b-af 85 47 4e 1d b2 0d bc    ."..cH.. ..GN....
 5b 6a f9 6d c7 1a c5 90-69 f9 28 4c 93 e1 8c 2e    [j.m.... i.(L....
 3f 5b 95 26 9d c4 ae 15-15 84 74 1e c4 a5 21 29    ?[.&.... ..t...!)
 e2 e0 c8 f7 f0 3e 99 aa-ed a9 36 ab 18 4f f8 ca    .....>.. ..6..O..
 cc 23 b3 57 d2 5c d6 6f-fa 83 2b 44 82 a5 ab ef    .#.W.\.o ..+D....
 c7 44 98 14 6d 8e 58 a2-05 b9 e0 9c 87 fc 52 22    .D..m.X. ......R"
 ee 46 38 2e 32 4e 4d c1-92 cd fc 3d 80 1c 81 19    .F8.2NM. ...=....
 1b 95 56 93 ff 4a 06 e0-9e c2 30 0c 83 ee 01 08    ..V..J.. ..0.....
 8f 98 d7 f3 50 b5 2b 80-0c 9b 23 8b 45 df 56 85    ....P.+. ..#.E.V.
 60 06 30 e3 35 a1 3c 82-19 57 b6 7e cf a2 02 e4    `.0.5.<. .W.~....
 55 3f b4 3c a8 39 77 79-0f f0 d6 aa da 1d b4 73    U?.<.9wy .......s
 7e ef 13 54 a8 d7 b0 a1-d2 67 0a 66 08 b9 81 13    ~..T.... .g.f....
 11 17 c2 d4 be 98 b5 fe-50 34 49 ab da cf 75 d7    ........ P4I...u.
 c1 b5 18 4e 32 27 2f e4-81 35 51 4a 62 42 6e a1    ...N2'/. .5QJbBn.
 47 67 e5 e4 c4 2c 70 c2-9b ea d8 09 5a 52 fd cb    Gg...,p. ....ZR..
decrypted: 
 54 68 65 20 71 75 69 63-6b 20 62 72 6f 77 6e 20    The.quic k.brown.
 66 6f 78 20 6a 75 6d 70-65 64 20 6f 76 65 72 20    fox.jump ed.over.
 74 68 65 20 6c 61 7a 79-20 64 6f 67 2e             the.lazy .dog.

bruce_dev />

If you are *REAL* patient the 2048-bit key works with SSL/TLS.

This took a couple of minutes to come up and the browser did once tell me that the site was taking too long to respond. This was with Chrome.

The bottom line is that a 1024-bit key is really secure enough for a controller device like the JNIOR.

In the Programming Tips I show you how to make an outgoing secure connection using SSL/TLSv1.2. A secure connection encrypts data with a 128-bit key that is itself securely negotiated. So the information that you exchange cannot be read from the wire as it passes. But Windows still says “Not Secure” sometimes. Why is that?

By “Not Secure” Windows is really telling you that the connection is Not Trusted. The data is still encrypted but Windows doesn’t recognize the connected client. If you have connected to a JNIOR using the HTTPS URL before, you might have seen a Privacy Error page that you must bypass in order to access the unit. This occurs as the browser receives a certificate from the JNIOR which is not traceable through a Root Certificate Authority in its database. The JNIOR certificates are self-signed and not issued by such an authority. Ergo the privacy concern.

When your application makes an outgoing connection the destination sends a copy of its certificate. Can you verify that certificate as the browser does? Well, with limitations you can. I have added a method in the Socket class allowing you to retrieve that certificate.

Once you indicate that the connection should be secure there is a brief period in which the negotiations transpire. The following code waits for a certificate. The getCertificate() method returns an empty byte array until a certificate has been received.
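The wait is a plain polling loop; later in this document it appears as while ((cert = dataSocket.getCertificate()).length == 0). As a hedged sketch of the same idea with a deadline added, the helper below polls any source of bytes until it returns something or a timeout expires. It assumes a desktop JVM (Java 8 or later), and the class name, method name, and timeout values are hypothetical. A fake source stands in for the socket so the example runs without a network.

```java
import java.util.function.Supplier;

// Hypothetical helper for illustration; not part of the JANOS runtime.
class CertificateWait {

    // Poll 'source' until it yields a non-empty array or 'timeoutMs' elapses.
    // Returns the data, or an empty array if the deadline passes first.
    static byte[] waitFor(Supplier<byte[]> source, long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        byte[] data = source.get();
        while (data.length == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return new byte[0];
            }
            data = source.get();
        }
        return data;
    }

    public static void main(String[] args) {
        // Stand-in for a socket: nothing is available for the first two polls.
        int[] calls = {0};
        Supplier<byte[]> fake =
            () -> ++calls[0] < 3 ? new byte[0] : new byte[] {0x30, (byte) 0x82};
        byte[] cert = waitFor(fake, 2000, 10);
        System.out.println("received " + cert.length + " bytes after " + calls[0] + " polls");
    }
}
```

The deadline is a design choice of this sketch: without it, a peer that never sends a certificate would keep the loop spinning indefinitely.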


bruce_dev /> jtest
 30 82 02 ed 30 82 02 56-a0 03 02 01 02 02 04 24    0...0..V .......$
 99 a9 00 30 0d 06 09 2a-86 48 86 f7 0d 01 01 0b    ...0...* .H......
 05 00 30 81 81 31 20 30-1e 06 03 55 04 0a 0c 17    ..0..1.0 ...U....
 49 4e 54 45 47 20 50 72-6f 63 65 73 73 20 47 72    INTEG.Pr ocess.Gr
 6f 75 70 20 49 6e 63 31-17 30 15 06 03 55 04 0b    oup.Inc1 .0...U..
 0c 0e 4a 4e 49 4f 52 20-43 6f 6e 74 72 6f 6c 73    ..JNIOR. Controls
 31 1d 30 1b 06 03 55 04-03 0c 14 68 6f 6e 65 79    1.0...U. ...honey
 70 6f 74 2e 69 6e 74 65-67 70 67 2e 63 6f 6d 31    pot.inte gpg.com1
 25 30 23 06 09 2a 86 48-86 f7 0d 01 09 01 16 16    %0#..*.H ........
 62 63 6c 6f 75 74 69 65-72 32 40 63 6f 6d 63 61    bcloutie r2@comca
 73 74 2e 6e 65 74 30 1e-17 0d 31 37 30 33 32 32    st.net0. ..170322
 31 37 33 30 32 33 5a 17-0d 31 39 30 33 32 32 31    173023Z. .1903221
 37 33 30 32 33 5a 30 81-81 31 20 30 1e 06 03 55    73023Z0. .1.0...U
 04 0a 0c 17 49 4e 54 45-47 20 50 72 6f 63 65 73    ....INTE G.Proces
 73 20 47 72 6f 75 70 20-49 6e 63 31 17 30 15 06    s.Group. Inc1.0..
 03 55 04 0b 0c 0e 4a 4e-49 4f 52 20 43 6f 6e 74    .U....JN IOR.Cont
 72 6f 6c 73 31 1d 30 1b-06 03 55 04 03 0c 14 68    rols1.0. ..U....h
 6f 6e 65 79 70 6f 74 2e-69 6e 74 65 67 70 67 2e    oneypot. integpg.
 63 6f 6d 31 25 30 23 06-09 2a 86 48 86 f7 0d 01    com1%0#. .*.H....
 09 01 16 16 62 63 6c 6f-75 74 69 65 72 32 40 63    ....bclo utier2@c
 6f 6d 63 61 73 74 2e 6e-65 74 30 81 9f 30 0d 06    omcast.n et0..0..
 09 2a 86 48 86 f7 0d 01-01 01 05 00 03 81 8d 00    .*.H.... ........
 30 81 89 02 81 81 00 a9-94 83 17 4b 2e bc 85 78    0....... ...K...x
 ec ea 5b e9 f7 58 40 70-3b 06 ea 49 d9 33 3d 49    ..[..X@p ;..I.3=I
 3d 03 5a 8d 84 db 5a b7-e5 49 1d 33 4b af 1b 59    =.Z...Z. .I.3K..Y
 a3 a2 71 e2 5c 42 76 d4-10 f3 b3 c9 0e 80 1e 89    ..q.\Bv. ........
 a1 62 c6 a2 82 ec 51 ab-05 cf 97 31 56 1a 95 22    .b....Q. ...1V.."
 a0 b3 03 9d f7 2f a2 5b-a1 06 1e 6b bb 7a 1a a6    ...../.[ ...k.z..
 b2 87 a3 14 fd db b9 e1-03 4b 45 d5 e1 ff c1 5a    ........ .KE....Z
 59 c4 0d 77 2d 3c da d6-14 2a 70 76 50 f1 1e bc    Y..w-<.. .*pvP...
 d3 0c ff 75 e6 5e 91 02-03 01 00 01 a3 70 30 6e    ...u.^.. .....p0n
 30 1d 06 03 55 1d 0e 04-16 04 14 29 cb 03 57 bc    0...U... ...)..W.
 dd 26 e7 8a d5 e5 64 c1-d0 87 b0 3b 58 30 82 30    .&....d. ...;X0.0
 0c 06 03 55 1d 13 04 05-30 03 01 01 ff 30 3f 06    ...U.... 0....0?.
 03 55 1d 11 04 38 30 36-87 04 32 c5 22 4b 82 14    .U...806 ..2."K..
 68 6f 6e 65 79 70 6f 74-2e 69 6e 74 65 67 70 67    honeypot .integpg
 2e 63 6f 6d 82 08 68 6f-6e 65 79 70 6f 74 82 0e    .com..ho neypot..
 68 6f 6e 65 79 70 6f 74-5f 6a 6e 69 6f 72 30 0d    honeypot _jnior0.
 06 09 2a 86 48 86 f7 0d-01 01 0b 05 00 03 81 81    ..*.H... ........
 00 2b 42 e0 5e 33 1a ee-b2 65 f4 da c1 18 df 73    .+B.^3.. .e.....s
 e7 f5 55 d7 26 05 f6 ec-ab 67 d8 60 32 4a 7c 50    ..U.&... .g.`2J|P
 56 14 c5 20 33 37 a9 8c-21 57 d8 5c 57 a7 36 b8    V...37.. !W.\W.6.
 2d da 88 47 5e 93 a6 c9-fc 2c 59 83 67 8c 8d 46    -..G^... .,Y.g..F
 1a 9c e7 f5 3a 27 66 db-bd 26 c0 b9 9c e1 f4 51    ....:'f. .&.....Q
 4f 6b ac 3d 09 c3 30 00-bc 7e 5f 61 51 c0 ba 17    Ok.=..0. .~_aQ...
 5f 29 b6 e7 3b 8e 7f eb-ae 10 99 26 9a 9a fd 70    _)..;... ...&...p
 67 17 c6 7c f9 c7 f1 7e-bb 3f 8d b2 ed 43 53 c2    g..|...~ .?...CS.
 d1                                                 .


bruce_dev /> 

So you can see here that we receive something that looks to have the company name in it. This is the certificate and unfortunately it is in a binary ASN.1 format. That at this point is not very useful. You would have a lot of work to do if you were to parse information out of that.

So let’s see if I can help in that department.

Since in this example we attempt to connect to the HoneyPot I can separately pull its certificate in PEM format. In that form it looks like this.

bruce_dev /> cat flash/honeypot.cer
-----BEGIN CERTIFICATE-----
MIIC7TCCAlagAwIBAgIEJJmpADANBgkqhkiG9w0BAQsFADCBgTEgMB4GA1UECgwX
SU5URUcgUHJvY2VzcyBHcm91cCBJbmMxFzAVBgNVBAsMDkpOSU9SIENvbnRyb2xz
MR0wGwYDVQQDDBRob25leXBvdC5pbnRlZ3BnLmNvbTElMCMGCSqGSIb3DQEJARYW
YmNsb3V0aWVyMkBjb21jYXN0Lm5ldDAeFw0xNzAzMjIxNzMwMjNaFw0xOTAzMjIx
NzMwMjNaMIGBMSAwHgYDVQQKDBdJTlRFRyBQcm9jZXNzIEdyb3VwIEluYzEXMBUG
A1UECwwOSk5JT1IgQ29udHJvbHMxHTAbBgNVBAMMFGhvbmV5cG90LmludGVncGcu
Y29tMSUwIwYJKoZIhvcNAQkBFhZiY2xvdXRpZXIyQGNvbWNhc3QubmV0MIGfMA0G
CSqGSIb3DQEBAQUAA4GNADCBiQKBgQCplIMXSy68hXjs6lvp91hAcDsG6knZMz1J
PQNajYTbWrflSR0zS68bWaOiceJcQnbUEPOzyQ6AHomhYsaiguxRqwXPlzFWGpUi
oLMDnfcvoluhBh5ru3oaprKHoxT927nhA0tF1eH/wVpZxA13LTza1hQqcHZQ8R68
0wz/deZekQIDAQABo3AwbjAdBgNVHQ4EFgQUKcsDV7zdJueK1eVkwdCHsDtYMIIw
DAYDVR0TBAUwAwEB/zA/BgNVHREEODA2hwQyxSJLghRob25leXBvdC5pbnRlZ3Bn
LmNvbYIIaG9uZXlwb3SCDmhvbmV5cG90X2puaW9yMA0GCSqGSIb3DQEBCwUAA4GB
ACtC4F4zGu6yZfTawRjfc+f1VdcmBfbsq2fYYDJKfFBWFMUgMzepjCFX2FxXpza4
LdqIR16Tpsn8LFmDZ4yNRhqc5/U6J2bbvSbAuZzh9FFPa6w9CcMwALx+X2FRwLoX
Xym25zuOf+uuEJkmmpr9cGcXxnz5x/F+uz+Nsu1DU8LR
-----END CERTIFICATE-----

bruce_dev />

There is a nice option in the CERTMGR command to dump that in some meaningful form.


bruce_dev /> certmgr -d flash/honeypot.cer

0000  30 82 02 ED    SEQUENCE {  (749 bytes)
0004  30 82 02 56    |  SEQUENCE {  (598 bytes)
0008  A0 03          |  |  [0] EXPLICIT {  (3 bytes)
000A  02 01          |  |  |  INTEGER 02
                     |  |  }
000D  02 04          |  |  INTEGER 2499A900
0013  30 0D          |  |  SEQUENCE {  (13 bytes)
0015  06 09          |  |  |  OBJECT IDENTIFIER 1.2.840.113549.1.1.11
0020  05 00          |  |  |  NULL 
                     |  |  }
0022  30 81 81       |  |  SEQUENCE {  (129 bytes)
0025  31 20          |  |  |  SET {  (32 bytes)
0027  30 1E          |  |  |  |  SEQUENCE {  (30 bytes)
0029  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.10
002E  0C 17          |  |  |  |  |  UTF8STRING 'INTEG Process Group Inc'
                     |  |  |  |  }
                     |  |  |  }
0047  31 17          |  |  |  SET {  (23 bytes)
0049  30 15          |  |  |  |  SEQUENCE {  (21 bytes)
004B  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.11
0050  0C 0E          |  |  |  |  |  UTF8STRING 'JNIOR Controls'
                     |  |  |  |  }
                     |  |  |  }
0060  31 1D          |  |  |  SET {  (29 bytes)
0062  30 1B          |  |  |  |  SEQUENCE {  (27 bytes)
0064  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.3
0069  0C 14          |  |  |  |  |  UTF8STRING 'honeypot.integpg.com'
                     |  |  |  |  }
                     |  |  |  }
007F  31 25          |  |  |  SET {  (37 bytes)
0081  30 23          |  |  |  |  SEQUENCE {  (35 bytes)
0083  06 09          |  |  |  |  |  OBJECT IDENTIFIER 1.2.840.113549.1.9.1
008E  16 16          |  |  |  |  |  IA5STRING 'bcloutier2@comcast.net'
                     |  |  |  |  }
                     |  |  |  }
                     |  |  }
00A6  30 1E          |  |  SEQUENCE {  (30 bytes)
00A8  17 0D          |  |  |  UTCTIME[13] 170322173023Z
00B7  17 0D          |  |  |  UTCTIME[13] 190322173023Z
                     |  |  }
00C6  30 81 81       |  |  SEQUENCE {  (129 bytes)
00C9  31 20          |  |  |  SET {  (32 bytes)
00CB  30 1E          |  |  |  |  SEQUENCE {  (30 bytes)
00CD  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.10
00D2  0C 17          |  |  |  |  |  UTF8STRING 'INTEG Process Group Inc'
                     |  |  |  |  }
                     |  |  |  }
00EB  31 17          |  |  |  SET {  (23 bytes)
00ED  30 15          |  |  |  |  SEQUENCE {  (21 bytes)
00EF  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.11
00F4  0C 0E          |  |  |  |  |  UTF8STRING 'JNIOR Controls'
                     |  |  |  |  }
                     |  |  |  }
0104  31 1D          |  |  |  SET {  (29 bytes)
0106  30 1B          |  |  |  |  SEQUENCE {  (27 bytes)
0108  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.4.3
010D  0C 14          |  |  |  |  |  UTF8STRING 'honeypot.integpg.com'
                     |  |  |  |  }
                     |  |  |  }
0123  31 25          |  |  |  SET {  (37 bytes)
0125  30 23          |  |  |  |  SEQUENCE {  (35 bytes)
0127  06 09          |  |  |  |  |  OBJECT IDENTIFIER 1.2.840.113549.1.9.1
0132  16 16          |  |  |  |  |  IA5STRING 'bcloutier2@comcast.net'
                     |  |  |  |  }
                     |  |  |  }
                     |  |  }
014A  30 81 9F       |  |  SEQUENCE {  (159 bytes)
014D  30 0D          |  |  |  SEQUENCE {  (13 bytes)
014F  06 09          |  |  |  |  OBJECT IDENTIFIER 1.2.840.113549.1.1.1
015A  05 00          |  |  |  |  NULL 
                     |  |  |  }
015C  03 81 8D       |  |  |  BITSTRING[140] Encapsulates {
0000  30 81 89       |  |  |  |  SEQUENCE {  (137 bytes)
0003  02 81 81       |  |  |  |  |  INTEGER 
                     |  |  |  |  |     A99483174B2EBC8578ECEA5BE9F75840703B06EA49D9333D
                     |  |  |  |  |     493D035A8D84DB5AB7E5491D334BAF1B59A3A271E25C4276
                     |  |  |  |  |     D410F3B3C90E801E89A162C6A282EC51AB05CF9731561A95
                     |  |  |  |  |     22A0B3039DF72FA25BA1061E6BBB7A1AA6B287A314FDDBB9
                     |  |  |  |  |     E1034B45D5E1FFC15A59C40D772D3CDAD6142A707650F11E
                     |  |  |  |  |     BCD30CFF75E65E91
0087  02 03          |  |  |  |  |  INTEGER 010001
                     |  |  |  |  }
                     |  |  |  }
                     |  |  }
01EC  A3 70          |  |  [3] EXPLICIT {  (112 bytes)
01EE  30 6E          |  |  |  SEQUENCE {  (110 bytes)
01F0  30 1D          |  |  |  |  SEQUENCE {  (29 bytes)
01F2  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.29.14
01F7  04 16          |  |  |  |  |  OCTETSTRING[22] Encapsulates {
0000  04 14          |  |  |  |  |  |  OCTETSTRING[20] 
                     |  |  |  |  |  |     29CB0357BCDD26E78AD5E564C1D087B0  )..W..&....d....
                     |  |  |  |  |  |     3B583082                          ;X0.
                     |  |  |  |  |  }
                     |  |  |  |  }
020F  30 0C          |  |  |  |  SEQUENCE {  (12 bytes)
0211  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.29.19
0216  04 05          |  |  |  |  |  OCTETSTRING[5] Encapsulates {
0000  30 03          |  |  |  |  |  |  SEQUENCE {  (3 bytes)
0002  01 01          |  |  |  |  |  |  |  BOOLEAN TRUE(255)
                     |  |  |  |  |  |  }
                     |  |  |  |  |  }
                     |  |  |  |  }
021D  30 3F          |  |  |  |  SEQUENCE {  (63 bytes)
021F  06 03          |  |  |  |  |  OBJECT IDENTIFIER 2.5.29.17
0224  04 38          |  |  |  |  |  OCTETSTRING[56] Encapsulates {
0000  30 36          |  |  |  |  |  |  SEQUENCE {  (54 bytes)
0002  87 04          |  |  |  |  |  |  |  [7] 32C5224B  2."K
0008  82 14          |  |  |  |  |  |  |  [2] 
                     |  |  |  |  |  |  |     686F6E6579706F742E696E7465677067  honeypot.integpg
                     |  |  |  |  |  |  |     2E636F6D                          .com
001E  82 08          |  |  |  |  |  |  |  [2] 686F6E6579706F74  honeypot
0028  82 0E          |  |  |  |  |  |  |  [2] 686F6E6579706F745F6A6E696F72  honeypot_jnior
                     |  |  |  |  |  |  }
                     |  |  |  |  |  }
                     |  |  |  |  }
                     |  |  |  }
                     |  |  }
                     |  }
025E  30 0D          |  SEQUENCE {  (13 bytes)
0260  06 09          |  |  OBJECT IDENTIFIER 1.2.840.113549.1.1.11
026B  05 00          |  |  NULL 
                     |  }
026D  03 81 81       |  BITSTRING[128]  0 unused bits
                     |     2B42E05E331AEEB265F4DAC118DF73E7  +B.^3...e.....s.
                     |     F555D72605F6ECAB67D860324A7C5056  .U.&....g.`2J|PV
                     |     14C5203337A98C2157D85C57A736B82D  .. 37..!W.\W.6.-
                     |     DA88475E93A6C9FC2C5983678C8D461A  ..G^....,Y.g..F.
                     |     9CE7F53A2766DBBD26C0B99CE1F4514F  ...:'f..&.....QO
                     |     6BAC3D09C33000BC7E5F6151C0BA175F  k.=..0..~_aQ..._
                     |     29B6E73B8E7FEBAE1099269A9AFD7067  )..;......&...pg
                     |     17C67CF9C7F17EBB3F8DB2ED4353C2D1  ..|...~.?...CS..
                     }

bruce_dev />

Uh, this is still likely quite cryptic for your use. ASN.1 is fun. You might notice the hexadecimal in this dump follows that dumped by our application in the prior post. This demonstrates the inherent structure in the ASN.1 Certificate Format.
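The match between the two dumps is no accident: a PEM file is just the Base64 text of the same DER bytes. As a small sketch, assuming a desktop JVM (Java 8 or later) and using a made-up class name, the code below decodes only the first Base64 line of flash/honeypot.cer and prints the leading bytes in hex.

```java
import java.util.Base64;

// Hypothetical helper for illustration; not part of the JANOS runtime.
class PemFirstLine {

    // First Base64 line of flash/honeypot.cer as listed earlier.
    static final String FIRST_LINE =
        "MIIC7TCCAlagAwIBAgIEJJmpADANBgkqhkiG9w0BAQsFADCBgTEgMB4GA1UECgwX";

    // Format up to 'count' bytes as lowercase hex separated by spaces.
    static String toHex(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count && i < data.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 64 Base64 characters decode to 48 bytes of DER.
        byte[] der = Base64.getDecoder().decode(FIRST_LINE);
        System.out.println(toHex(der, 16));
    }
}
```

The bytes printed can be compared directly with the first row of the binary dump from the application and with the 30 82 02 ED header in the CERTMGR output.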

So if I am going to be of any help there is more work to be done.

After thinking about some kind of conversion from ASN.1 to JSON I have decided to stick with ASN.1 for this purpose. I’ll develop an ASN1 class that will help with parsing. The reason to hang with ASN.1 is that you will be able to confirm signatures.

That reminds me, too, that I should write something about Signing, since you now have access to RSA cryptography…

Alright. A couple of posts back we extracted the certificate from our secure connection. I dumped it in binary and also using CERTMGR to see the ASN.1 structure.

First of all, the certificate is delivered in DER format. This defines the binary encoding used to transfer the signed certificate, and it is what we see in the dump. A standard ASN.1 definition for a signed certificate is compiled into DER. The format defined for these certificates is X.509, which is specified in RFC 5280. You may also need information contained in RFC 5246, which is the latest for TLSv1.2.

Okay so that is a lot of work and if you have to read all of that then forget it, right? Let me try to gloss over it and drive to doing something meaningful with this binary certificate stuff.

I have started to pull together an Asn1 class which will help us work with the DER encoded binary data. It was apparent from the CERTMGR dump that there is some structure to it. I’ll try to vaguely describe that from the top down.

First notice that the whole signed certificate as obtained from the connection is enclosed in a SEQUENCE. That is an ASN.1 object which in DER has a tag (ASN_SEQUENCE), a length, and data or content. From the RFCs we expect the following structure.

Certificate  ::=  SEQUENCE  {
    tbsCertificate       TBSCertificate,
    signatureAlgorithm   AlgorithmIdentifier,
    signatureValue       BIT STRING  }

So the top SEQUENCE contains three objects. Here “TBS” stands for To Be Signed. So the tbsCertificate is the Certificate to be signed or that has been signed. It is information that by itself is a SEQUENCE of objects. The signatureAlgorithm defines the procedure used in the signing. That is a SEQUENCE too with some objects within. And, the signatureValue turns out to be some tacked on bit data in a BITSTRING. That we will see is the actual signature.

So let’s modify our little program that gets the target host’s certificate to use my prototype Asn1 class. We will first confirm that the initial SEQUENCE covers all of the signed certificate and then itemize its content.

package jtest;
 
import com.integpg.system.Debug;
import java.net.Socket;
 
public class Main {
    
    public static void main(String[] args) throws Exception {
 
        // Establish a socket and request a secure (TLS) connection
        Socket dataSocket = new Socket("50.197.34.75", 443);
        dataSocket.setSecure(true);
        
        // Obtain the certificate
        byte[] cert;
        while ((cert = dataSocket.getCertificate()).length == 0)
            System.sleep(100);
        dataSocket.close();
 
        // analyze
        Asn1 asn = new Asn1(cert);
 
        // details about the object
        System.out.println("Overall Signed Certificate Length: " + cert.length);
        System.out.println("ASN.1 Object tag: " + asn.getTag());
        System.out.printf("ASN.1 Object flags: 0x%02x\n", asn.getFlags());
        System.out.println("ASN.1 Object content size: " + asn.getLength());
        
        // skip the object and check for more data (should be only 1 object)
        asn.skip();
        if (!asn.hasMoreData())
            System.out.println("Signed Certificate is a single object as expected");
        else
            System.out.println("Something is wrong!");
    }

}
bruce_dev /> jtest
Overall Signed Certificate Length: 753
ASN.1 Object tag: 16
ASN.1 Object flags: 0x20
ASN.1 Object content size: 749
Signed Certificate is a single object as expected

bruce_dev />

This demonstrates that the SEQUENCE object contains the entire signed certificate. 753 bytes were delivered and aside from the 4-byte header (tag and length) the content covers the rest of the data. The tag of 16 tells us it is a SEQUENCE and the flag 0x20 tells us it is a CONSTRUCT.
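To make the header arithmetic concrete, here is a minimal sketch of how the identifier and length octets of a DER object can be decoded. It is not the Asn1 class described here; the class and field names are hypothetical, and it assumes a desktop JVM. It only handles single-octet tags and definite lengths of up to four length octets.

```java
// Hypothetical stand-in for illustration; not the author's Asn1 class.
class DerHeader {
    final int tag;          // low 5 bits of the identifier octet
    final int flags;        // high 3 bits: class bits plus the constructed bit (0x20)
    final int length;       // content length in bytes
    final int headerLength; // identifier octet plus length octets

    private DerHeader(int tag, int flags, int length, int headerLength) {
        this.tag = tag;
        this.flags = flags;
        this.length = length;
        this.headerLength = headerLength;
    }

    // Decode the header of the object starting at 'offset'.
    static DerHeader parse(byte[] der, int offset) {
        int id = der[offset] & 0xff;
        int tag = id & 0x1f;
        int flags = id & 0xe0;
        if (tag == 0x1f)
            throw new IllegalArgumentException("high-tag-number form not supported");

        int first = der[offset + 1] & 0xff;
        if (first < 0x80)                       // short form: the octet is the length
            return new DerHeader(tag, flags, first, 2);

        int count = first & 0x7f;               // long form: number of length octets
        if (count == 0 || count > 4)
            throw new IllegalArgumentException("unsupported length form");
        int length = 0;
        for (int i = 0; i < count; i++)
            length = (length << 8) | (der[offset + 2 + i] & 0xff);
        if (length < 0)
            throw new IllegalArgumentException("length too large");
        return new DerHeader(tag, flags, length, 2 + count);
    }

    public static void main(String[] args) {
        // First four bytes of the HoneyPot certificate: 30 82 02 ED
        byte[] start = {0x30, (byte) 0x82, 0x02, (byte) 0xed};
        DerHeader h = parse(start, 0);
        System.out.printf("tag=%d flags=0x%02x content=%d total=%d%n",
                h.tag, h.flags, h.length, h.headerLength + h.length);
    }
}
```

For the bytes 30 82 02 ED this yields tag 16, flags 0x20, a content length of 749 and a 4-byte header, which accounts for the 753 bytes delivered.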

Here are tags and flags that I have defined in the Asn1 class.


    static public final int ASN_BOOLEAN = 1;
    static public final int ASN_INTEGER = 2;
    static public final int ASN_BITSTRING = 3;
    static public final int ASN_OCTETSTRING = 4;
    static public final int ASN_NULL = 5;
    static public final int ASN_OBJECTID = 6;
    static public final int ASN_OBJECTDESC = 7;
    static public final int ASN_INSTANCEOF = 8;
    static public final int ASN_REAL = 9;
    static public final int ASN_ENUM = 10;
    static public final int ASN_EMBEDDED = 11;
    static public final int ASN_UTF8STRING = 12;
    static public final int ASN_RELATIVEOID = 13;
    static public final int ASN_SEQUENCE = 16;
    static public final int ASN_SET = 17;
    static public final int ASN_NUMERIC = 18;
    static public final int ASN_PRINTABLE = 19;
    static public final int ASN_T61 = 20;
    static public final int ASN_VIDEOTEX = 21;
    static public final int ASN_IA5STRING = 22;
    static public final int ASN_UTCTIME = 23;
    static public final int ASN_GENTIME = 24;
    static public final int ASN_GRAPHIC = 25;
    static public final int ASN_VISIBLESTR = 26;
    static public final int ASN_GENSTRING = 27;
    static public final int ASN_UNIVSTRING = 28;
    static public final int ASN_CHARSTR = 29;
    static public final int ASN_BMPSTR = 30;
    static public final int ASN_HIGHFORM = 31;

    static public final int ASN_CONSTRUCT = 0x20;
    static public final int ASN_APPLICATION = 0x40;
    static public final int ASN_CONTEXT = 0x80;
    static public final int ASN_PRIVATE = 0xC0;

So let’s look into the overall SEQUENCE and see that those three objects are to be found. We’ll just list the tags for the objects we find. Here are the changes to our test program.

        // analyze
        Asn1 asn = new Asn1(cert);
        
        // descend into the SEQUENCE object and itemize the objects it contains.
        asn.descend();
        
        while (asn.hasMoreData()) {
            System.out.println("ASN.1 Object tag: " + asn.getTag());
            System.out.println("ASN.1 Object length: " + asn.getLength());
            System.out.println("");
            asn.skip();
        }
bruce_dev /> jtest
ASN.1 Object tag: 16
ASN.1 Object length: 598

ASN.1 Object tag: 16
ASN.1 Object length: 13

ASN.1 Object tag: 3
ASN.1 Object length: 129


bruce_dev />

So there are three parts. Two SEQUENCEs and a BITSTRING. Those correspond to tbsCertificate, signatureAlgorithm and signatureValue, respectively, which is what is expected.

Certificate  ::=  SEQUENCE  {
    tbsCertificate       TBSCertificate,
    signatureAlgorithm   AlgorithmIdentifier,
    signatureValue       BIT STRING  }
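The descend-and-skip walk used above can be imitated with a few lines of standard Java. This is only a sketch under assumptions: it is not the Asn1 class, the names are hypothetical, and the input is a tiny hand-made structure with the same three-part shape as a signed certificate rather than a real one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration; not the author's Asn1 class.
class DerChildren {

    // Hand-made sample with the shape of a signed certificate:
    // SEQUENCE { SEQUENCE { INTEGER 1 }, SEQUENCE { NULL }, BIT STRING 00 AA BB }
    static final byte[] SAMPLE = {
        0x30, 0x0e,
        0x30, 0x03, 0x02, 0x01, 0x01,
        0x30, 0x02, 0x05, 0x00,
        0x03, 0x03, 0x00, (byte) 0xaa, (byte) 0xbb
    };

    // Returns {tag, contentLength, headerLength} for the object at 'pos'.
    // Assumes single-octet tags and definite lengths that fit in an int.
    static int[] header(byte[] der, int pos) {
        int tag = der[pos] & 0x1f;
        int first = der[pos + 1] & 0xff;
        if (first < 0x80)
            return new int[] {tag, first, 2};
        int count = first & 0x7f;
        int length = 0;
        for (int i = 0; i < count; i++)
            length = (length << 8) | (der[pos + 2 + i] & 0xff);
        return new int[] {tag, length, 2 + count};
    }

    // List "tag:length" for every object directly inside the outer object.
    static List<String> itemize(byte[] der) {
        int[] outer = header(der, 0);
        int pos = outer[2];                  // descend: step past the outer header
        int end = outer[2] + outer[1];       // end of the outer content
        List<String> items = new ArrayList<>();
        while (pos < end) {
            int[] h = header(der, pos);
            items.add(h[0] + ":" + h[1]);
            pos += h[2] + h[1];              // skip: header plus content
        }
        return items;
    }

    public static void main(String[] args) {
        System.out.println(itemize(SAMPLE));
    }
}
```

For the sample this lists two SEQUENCE entries (tag 16) followed by one BIT STRING (tag 3), the same pattern the test program reports for the real certificate.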

Let’s extract the key parts of this signed certificate and dump the signatureValue.

        // analyze
        Asn1 asn = new Asn1(cert);
        asn.descend();
        
        // obtain the certificate
        Asn1 tbsCertificate = new Asn1(asn.getData());
        asn.skip();
        Asn1 signatureAlgorithm = new Asn1(asn.getData());
        asn.skip();
        byte[] bitstring = asn.getData();
        
        // remove leading unused bit count supplied with BITSTRING
        byte[] signatureValue = new byte[bitstring.length - 1];
        ArrayUtils.arraycopy(bitstring, 1, signatureValue, 0, signatureValue.length);
        
        // dump the signature
        Debug.dump(signatureValue);
bruce_dev /> jtest
 2b 42 e0 5e 33 1a ee b2-65 f4 da c1 18 df 73 e7    +B.^3... e.....s.
 f5 55 d7 26 05 f6 ec ab-67 d8 60 32 4a 7c 50 56    .U.&.... g.`2J|PV
 14 c5 20 33 37 a9 8c 21-57 d8 5c 57 a7 36 b8 2d    ...37..! W.\W.6.-
 da 88 47 5e 93 a6 c9 fc-2c 59 83 67 8c 8d 46 1a    ..G^.... ,Y.g..F.
 9c e7 f5 3a 27 66 db bd-26 c0 b9 9c e1 f4 51 4f    ...:'f.. &.....QO
 6b ac 3d 09 c3 30 00 bc-7e 5f 61 51 c0 ba 17 5f    k.=..0.. ~_aQ..._
 29 b6 e7 3b 8e 7f eb ae-10 99 26 9a 9a fd 70 67    )..;.... ..&...pg
 17 c6 7c f9 c7 f1 7e bb-3f 8d b2 ed 43 53 c2 d1    ..|...~. ?...CS..

bruce_dev />

We see from the CERTMGR dump a few posts back that this is correct.

How can we check the signature?

To start since I know this is from our HoneyPot unit I will grab the public key directly from the JNIOR. I’ll save this in a pubkey.pem file. Since this is a self-signed certificate this public key is already in the tbsCertificate but to avoid the complexity of digging in to get it we’ll start with a handy copy of the key. We can also tell that this certificate’s signature was done with RSA encryption and the SHA256 or SHA2 hash. There are other signature algorithms. This is the one that the JNIOR used. So to keep it simple we’ll just work with that right now.

The Certificate Signing procedure is “simple”. When the certificate was signed the JNIOR

  1. computed the SHA256 hash over the ASN.1 DER encoded tbsCertificate object
  2. built a simple ASN.1 structure defining the algorithm with an OID and storing the hash as an OCTET STRING
  3. encrypted the DER encoded hash value using the JNIOR’s RSA Private Key
  4. appended the signatureAlgorithm information and the signatureValue to the tbsCertificate creating the signed certificate.
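Steps 1 through 3 can be tried out on a desktop JVM with the standard crypto classes. The sketch below is an assumption-laden illustration, not the JNIOR's code: the class and method names are hypothetical, the key pair is freshly generated, and the bytes being signed are arbitrary. It hashes the data, prepends the fixed DER prefix of a SHA-256 DigestInfo, and encrypts the result with the private key. The library's own SHA256withRSA verifier then serves as a cross-check.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.Signature;
import java.util.Arrays;
import javax.crypto.Cipher;

// Hypothetical helper for illustration; not part of the JANOS runtime.
class ManualSign {

    // DER prefix of a SHA-256 DigestInfo:
    // SEQUENCE { SEQUENCE { OID 2.16.840.1.101.3.4.2.1, NULL }, OCTET STRING[32] }
    static final byte[] SHA256_PREFIX = {
        0x30, 0x31, 0x30, 0x0d, 0x06, 0x09, 0x60, (byte) 0x86, 0x48, 0x01,
        0x65, 0x03, 0x04, 0x02, 0x01, 0x05, 0x00, 0x04, 0x20
    };

    static KeyPair newKeys() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Hash, wrap the hash in a DigestInfo, then encrypt with the private key.
    static byte[] sign(byte[] tbs, KeyPair keys) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(tbs);
            byte[] digestInfo = Arrays.copyOf(SHA256_PREFIX, SHA256_PREFIX.length + hash.length);
            System.arraycopy(hash, 0, digestInfo, SHA256_PREFIX.length, hash.length);
            Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
            rsa.init(Cipher.ENCRYPT_MODE, keys.getPrivate());
            return rsa.doFinal(digestInfo);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Cross-check the hand-built signature with the library's verifier.
    static boolean libraryAccepts(byte[] tbs, byte[] signature, KeyPair keys) {
        try {
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(keys.getPublic());
            verifier.update(tbs);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeys();
        byte[] tbs = "stand-in for the DER encoded tbsCertificate".getBytes();
        byte[] signature = sign(tbs, keys);
        System.out.println("library accepts hand-built signature: "
                + libraryAccepts(tbs, signature, keys));
    }
}
```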

So to verify the Signed Certificate we can reverse the process. So we will do the following:

  1. extract the tbsCertificate ASN.1 DER encoding from the signed certificate
  2. calculate the SHA256 over the tbsCertificate block
  3. obtain the BIT STRING appended to the signed certificate
  4. decrypt the BIT STRING using the JNIOR’s RSA Public Key
  5. look into the resulting ASN.1 structure for the stored copy of the hash
  6. if our calculated hash matches that stored then the certificate verifies
        // analyze
        Asn1 asn = new Asn1(cert);
        asn.descend();
        
        // obtain the certificate
        byte[] tbsCertificate = asn.getObject();
        asn.skip();
        byte[] signatureAlgorithm = asn.getData();
        asn.skip();
        byte[] bitstring = asn.getData();
        
        // remove leading unused bit count supplied with BITSTRING
        byte[] signatureValue = new byte[bitstring.length - 1];
        ArrayUtils.arraycopy(bitstring, 1, signatureValue, 0, signatureValue.length);

Here we parse the signed certificate to extract both the tbsCertificate and the signatureValue. Note that I used getObject() from the Asn1 class to not only get the certificate content but also the header for the ASN.1 SEQUENCE. The hash includes all of it.

Next we calculate the SHA256 for the tbsCertificate block. The SHA256 methods are exposed in JANOS v1.6.3 and later.

        // calculate SHA-256 over the tbsCertificate block (header included)
        byte[] hash = Security.hashMessage256(tbsCertificate);
        Debug.dump(hash);
        System.out.println("");
bruce_dev /> jtest
 db 67 e8 3b 8a 7e c1 ab-ef 76 16 0b 2b 45 e1 26    .g.;.~.. .v..+E.&
 c6 fa eb 31 4a 1c d0 5f-23 b0 a7 0f 7a 03 5b e6    ...1J.._ #...z.[.

Finally we read the HoneyPot’s public key from the file and perform the RSA decryption. This dumps the decrypted BIT STRING content.

        // fetch the HoneyPot public key
        File keyfile = new File("/flash/pubkey.pem");
        DataInputStream fin = new DataInputStream(new FileInputStream(keyfile));
        byte[] pubkey = new byte[fin.available()];
        fin.readFully(pubkey);
        fin.close();
        
        byte[] sig = Security.decrypt(signatureValue, 0, pubkey, 0);
        Debug.dump(sig);
bruce_dev /> jtest
 db 67 e8 3b 8a 7e c1 ab-ef 76 16 0b 2b 45 e1 26    .g.;.~.. .v..+E.&
 c6 fa eb 31 4a 1c d0 5f-23 b0 a7 0f 7a 03 5b e6    ...1J.._ #...z.[.

 30 31 30 0d 06 09 60 86-48 01 65 03 04 02 01 05    010...`. H.e.....
 00 04 20 db 67 e8 3b 8a-7e c1 ab ef 76 16 0b 2b    ....g.;. ~...v..+
 45 e1 26 c6 fa eb 31 4a-1c d0 5f 23 b0 a7 0f 7a    E.&...1J .._#...z
 03 5b e6                                           .[.

bruce_dev />

If the public key properly decrypts the signatureValue you will see a valid ASN.1 DER encoded structure. Manually we see that it starts with a SEQUENCE and the length is 49 bytes. In that SEQUENCE there is another of just 13 bytes. That contains the OID. After that there is a 32 byte OCTET STRING containing the hash.

So just by eye we see that the last 32 bytes of the decrypted signatureValue do match the calculated SHA256 hash. We have verified the signature!
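The same by-eye comparison can be automated. The following is a minimal sketch, assuming a desktop JVM with the standard crypto classes rather than the JANOS Security methods; the class and method names are hypothetical and the key pair and data are generated on the spot. It decrypts the signature with the public key, expects the 51-byte DigestInfo seen in the dump above (19 bytes of header plus the 32-byte hash), and compares the stored hash with a freshly calculated one.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Arrays;
import javax.crypto.Cipher;

// Hypothetical helper for illustration; not part of the JANOS runtime.
class ManualVerify {

    static final int DIGEST_INFO_HEADER = 19; // bytes preceding the SHA-256 hash

    // Decrypt the signature with the public key and compare the stored hash
    // with our own SHA-256 of the signed bytes. A stricter check would also
    // compare the 19 header bytes, or simply use Signature.verify().
    static boolean verify(byte[] tbs, byte[] signatureValue, PublicKey key) {
        try {
            Cipher rsa = Cipher.getInstance("RSA/ECB/PKCS1Padding");
            rsa.init(Cipher.DECRYPT_MODE, key);
            byte[] digestInfo = rsa.doFinal(signatureValue);

            byte[] hash = MessageDigest.getInstance("SHA-256").digest(tbs);
            if (digestInfo.length != DIGEST_INFO_HEADER + hash.length)
                return false;
            byte[] stored = Arrays.copyOfRange(digestInfo, DIGEST_INFO_HEADER, digestInfo.length);
            return MessageDigest.isEqual(stored, hash);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static KeyPair newKeys() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Produce a signature the ordinary way so there is something to verify.
    static byte[] librarySign(byte[] tbs, KeyPair keys) {
        try {
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(keys.getPrivate());
            signer.update(tbs);
            return signer.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair keys = newKeys();
        byte[] tbs = "stand-in for the DER encoded tbsCertificate".getBytes();
        byte[] signature = librarySign(tbs, keys);
        System.out.println("verifies: " + verify(tbs, signature, keys.getPublic()));
    }
}
```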

One of the parts of the tbsCertificate defines the Issuer and the other the Subject of the certificate. Since the JNIOR creates a self-signed certificate the Issuer and Subject are the same.

If you look back to the CERTMGR dump of the certificate you see that INTEG Process Group Inc appears twice. The first is for the Issuer and the second the Subject. There is a SEQUENCE following that which contains a BIT STRING that encapsulates two INTEGERs. That is the Subject’s RSA Public Key. That would match the HoneyPot’s Public Key. We could have gone into the certificate for that key. But that works ONLY for a self-signed certificate like this.

More generally the Issuer signs the Certificate using the Issuer’s RSA Private (and highly secret) Key and the Issuer is not the same as the Subject. In that case the Issuer’s RSA Public Key is NOT in the certificate. We would need to find an independent source for the key. Windows, for instance, looks to the Trusted Root Certificate Authorities store for another certificate, one for the Issuer where the public key can be found.

It can be even more complex as there might be a chain of trust. If the certificate is signed by an Issuer that is not likely to be found in the system's certificate store then one or more additional certificates might be transmitted during TLS negotiation. We would have to follow the chain, verifying each certificate, until we reached a trusted certificate from the system's store or elsewhere.

The JNIOR does not contain a specific trusted certificate store for this purpose. If we were to verify certificates in this way we would need to create one or otherwise rely on a remote system.
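To make the chain-following idea concrete, here is a rough sketch in plain Java. Everything in it is hypothetical: the `Cert` class carries only the Subject and Issuer names, the trusted store is just a set of names, and the signature check at each step (the decrypt-and-compare procedure done by hand earlier) is left as a comment. It shows the walk only and is not a working verifier.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ChainWalk {

    // Hypothetical minimal view of a certificate: names only
    static class Cert {
        final String subject;
        final String issuer;

        Cert(String subject, String issuer) {
            this.subject = subject;
            this.issuer = issuer;
        }

        boolean selfSigned() {
            return subject.equals(issuer);
        }
    }

    // Follows Issuer names from the leaf until a trusted Issuer is reached.
    // 'supplied' holds the extra certificates sent by the peer, keyed by Subject.
    static boolean reachesTrust(Cert leaf, Map<String, Cert> supplied, Set<String> trusted) {
        Cert current = leaf;
        for (int depth = 0; depth < 10; depth++) {      // guard against loops
            // A real implementation would verify current's signature here
            // using the Issuer's public key before moving on.
            if (trusted.contains(current.issuer))
                return true;
            if (current.selfSigned())
                return false;                           // untrusted root
            current = supplied.get(current.issuer);
            if (current == null)
                return false;                           // chain is broken
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Cert> supplied = new HashMap<String, Cert>();
        supplied.put("Intermediate CA", new Cert("Intermediate CA", "Root CA"));

        Set<String> trusted = new HashSet<String>();
        trusted.add("Root CA");

        Cert leaf = new Cert("www.example.com", "Intermediate CA");
        System.out.println(reachesTrust(leaf, supplied, trusted));   // prints true
    }
}
```

A self-signed certificate such as the JNIOR's would pass this walk only if its own name had been placed in the trusted set.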

To demonstrate an outgoing HTTP request I am going to use the IP Address Location service that our HoneyPot unit uses. This creates the JSON database used to generate the map at http://honeypot.integpg.com/map.php .

The JANOS Runtime Library does not provide classes to handle different web requests. Perhaps over time we will supply external libraries for that. But, you can easily do that directly. And, it is probably more educational to know how things work at the low level.

The procedure is straightforward.

  1. Establish an outgoing socket. (Lines 19-22)
  2. Issue a minimally formatted HTTP request. (Lines 25-27)
  3. Read the response. (Lines 45-50)
  4. Use the data. (Line 53)
package jtest;
 
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
 
public class Main {
    
    public static void main(String[] args) throws Exception {
 
        // IP Address query
        String ipaddr = "50.197.34.75";
        
        // Location services
        String serverHostname = "ip-api.com";
        int port = 80;
 
        // Establish a Socket, get streams, and set a timeout
        Socket dataSocket = new Socket(serverHostname, port);
        DataOutputStream sockout = new DataOutputStream(dataSocket.getOutputStream());
        DataInputStream sockin = new DataInputStream(dataSocket.getInputStream());
        dataSocket.setSoTimeout(5000);
 
        // Issue the HTTP request
        sockout.writeBytes("GET /json/" + ipaddr + " HTTP/1.1\r\n");
        sockout.writeBytes("Host: " + serverHostname + "\r\n");
        sockout.writeBytes("\r\n");
 
        // Process the response header
        int length = 0;
        String response;
        while ((response = sockin.readLine()) != null) {
            
            // Header ends with blank line
            if (response.length() == 0)
                break;
            
            System.out.println(response);
            if (response.startsWith("Content-Length: ")) 
                length = Integer.parseInt(response.substring(16));
        }
        System.out.println();
 
        // Obtain the entire response (if any)
        response = "";
        if (length > 2) {
            byte[] resp = new byte[length];
            sockin.readFully(resp);
            response = new String(resp, "UTF8");
        }
 
        // Data (should be JSON)
        System.out.println(response);
 
        // Close the Socket
        sockout.close();
        sockin.close();
        dataSocket.close();
    }
        
}
bruce_dev /> jtest
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json; charset=utf-8
Date: Fri, 08 Dec 2017 14:01:58 GMT
Content-Length: 321

{"as":"AS7922 Comcast Cable Communications, LLC","city":"Pittsburgh","country":"United States","countryCode":"US","isp":"Comcast Business","lat":40.4406,"lon":-79.9959,"org":"Comcast Business","query":"50.197.34.75","region":"PA","regionName":"Pennsylvania","status":"success","timezone":"America/New_York","zip":"15282"}

bruce_dev />

So you can see that the response is JSON and can be easily used.
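As one example of using the data, a single field can be pulled out of a flat response like this with simple string searching. This is only a sketch under stated assumptions: the class `JsonField` is made up for illustration, and it expects a flat object with no whitespace after the colon and no escaped quotes inside values, which happens to hold for the response shown. A proper JSON parser is the better choice for anything more involved.

```java
class JsonField {

    // Returns the value for a key in a flat JSON object, or null if absent.
    // String values are returned without their quotes; numbers as text.
    static String get(String json, String key) {
        String marker = "\"" + key + "\":";
        int start = json.indexOf(marker);
        if (start < 0)
            return null;
        start += marker.length();
        if (start >= json.length())
            return null;

        // Quoted string value: runs to the next double quote
        if (json.charAt(start) == '"') {
            int end = json.indexOf('"', start + 1);
            return end < 0 ? null : json.substring(start + 1, end);
        }

        // Bare value (number, true, false): runs to the next comma or brace
        int end = start;
        while (end < json.length() && json.charAt(end) != ',' && json.charAt(end) != '}')
            end++;
        return json.substring(start, end);
    }

    public static void main(String[] args) {
        // Trimmed copy of the response shown above
        String json = "{\"city\":\"Pittsburgh\",\"lat\":40.4406,"
                + "\"lon\":-79.9959,\"region\":\"PA\",\"zip\":\"15282\"}";

        double lat = Double.parseDouble(get(json, "lat"));
        double lon = Double.parseDouble(get(json, "lon"));
        System.out.println(get(json, "city") + " " + lat + " " + lon);
    }
}
```

Including the closing quote and colon in the search marker keeps a lookup for region from matching regionName.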

If you replace line 53 with Debug.dump(response.getBytes()); the data can be reviewed more easily. Debug.dump is the new dump method in the library.

bruce_dev /> jtest
HTTP/1.1 200 OK
Access-Control-Allow-Origin: *
Content-Type: application/json; charset=utf-8
Date: Fri, 08 Dec 2017 14:27:52 GMT
Content-Length: 321

 7b 22 61 73 22 3a 22 41-53 37 39 32 32 20 43 6f    {"as":"A S7922.Co
 6d 63 61 73 74 20 43 61-62 6c 65 20 43 6f 6d 6d    mcast.Ca ble.Comm
 75 6e 69 63 61 74 69 6f-6e 73 2c 20 4c 4c 43 22    unicatio ns,.LLC"
 2c 22 63 69 74 79 22 3a-22 50 69 74 74 73 62 75    ,"city": "Pittsbu
 72 67 68 22 2c 22 63 6f-75 6e 74 72 79 22 3a 22    rgh","co untry":"
 55 6e 69 74 65 64 20 53-74 61 74 65 73 22 2c 22    United.S tates","
 63 6f 75 6e 74 72 79 43-6f 64 65 22 3a 22 55 53    countryC ode":"US
 22 2c 22 69 73 70 22 3a-22 43 6f 6d 63 61 73 74    ","isp": "Comcast
 20 42 75 73 69 6e 65 73-73 22 2c 22 6c 61 74 22    .Busines s","lat"
 3a 34 30 2e 34 34 30 36-2c 22 6c 6f 6e 22 3a 2d    :40.4406 ,"lon":-
 37 39 2e 39 39 35 39 2c-22 6f 72 67 22 3a 22 43    79.9959, "org":"C
 6f 6d 63 61 73 74 20 42-75 73 69 6e 65 73 73 22    omcast.B usiness"
 2c 22 71 75 65 72 79 22-3a 22 35 30 2e 31 39 37    ,"query" :"50.197
 2e 33 34 2e 37 35 22 2c-22 72 65 67 69 6f 6e 22    .34.75", "region"
 3a 22 50 41 22 2c 22 72-65 67 69 6f 6e 4e 61 6d    :"PA","r egionNam
 65 22 3a 22 50 65 6e 6e-73 79 6c 76 61 6e 69 61    e":"Penn sylvania
 22 2c 22 73 74 61 74 75-73 22 3a 22 73 75 63 63    ","statu s":"succ
 65 73 73 22 2c 22 74 69-6d 65 7a 6f 6e 65 22 3a    ess","ti mezone":
 22 41 6d 65 72 69 63 61-2f 4e 65 77 5f 59 6f 72    "America /New_Yor
 6b 22 2c 22 7a 69 70 22-3a 22 31 35 32 38 32 22    k","zip" :"15282"
 7d                                                 }

bruce_dev />

By the way, the Lat and Lon returned by these sites vary in accuracy. We use the above as a free service. I believe that some services will provide more precise locations when used in a paid mode. The free data however is just fine when mapped on the globe (http://honeypot.integpg.com/map.php).