3.3.3 LZW. Lempel Ziv Welch

Cf.

From www.cs.sfu.ca/.../LZW.html:
LZW compression has its roots in the work of Jacob Ziv and Abraham Lempel. In 1977, they published a paper on "sliding-window" compression, and followed it with another paper in 1978 on "dictionary" based compression. These algorithms were named LZ77 and LZ78, respectively. Then in 1984, Terry Welch made a modification to LZ78 which became very popular and was dubbed LZW (guess why). The LZW algorithm is what we are going to talk about here.

The Concept:

Many files, especially text files, have certain strings that repeat very often, for example "the" in english text. With the spaces, the string takes 5 bytes, or 40 bits to encode. But what if we were to add the whole string to the list of characters after the last one, at 256. Then every time we came across "the", we could send the code 256 instead of 32,116,104,101,32. This would take 9 bits instead of 40 (since 256 does not fit into 8 bits).

This is exactly the approach that LZW compression takes. It starts with a "dictionary" of all the single character with indexes 0..255. It then starts to expand the dictionary as information gets sent through. Pretty soon, redundant strings will be coded as a single bit, and compression has occured.

The Algorithm:

Ok, so how is this done? Here is the basic algorithm:

    STRING = get input character
    WHILE there are still input characters DO
        CHARACTER = get input character
        IF STRING+CHARACTER is in the string table then
            STRING = STRING+character
        ELSE
            output the code for STRING
            add STRING+CHARACTER to the string table
            STRING = CHARACTER
        END of IF
    END of WHILE
    output the code for STRING

So what happens here? The program reads one character at a time. If the code is in the dictionary, then it adds the character to the current work string, and waits for the next one. This occurs on the first character as well. If the work string is not in the dictionary, (such as when the second character comes across), it adds the work string to the dictionary and outputs the work string without the new character. It then sets the work string to the new character.

How about decompression?

Note that when a new entry is added to the dictionary, then its root is given by the code that is being sent and the suffix character is the first chacter for the code which is send next.

So in order to decompress we have to build up the dictionary by adding a new entry every time a code is received. The entries root is just the translation of the code being received and its suffix character is the first character of the translation of the next code which gets received.

      Read OLD_CODE
      output OLD_CODE
      WHILE there are still input characters DO
          Read NEW_CODE
          STRING = get translation of NEW_CODE
          output STRING
          CHARACTER = first character in STRING
          add OLD_CODE + CHARACTER to the translation table
          OLD_CODE = NEW_CODE
      END of WHILE

There is one problem, namely if the next code being sent is identically to the one just being added to the dictionary. So this happends if "root" was already found, "root"+'suffix' appears first and 'suffix' + what follows equals "root"+'suffix'. So the input looks like:

$\displaystyle \underbrace{\text{suffix} + \text{string}}_{\text{root}} + \underbrace{\text{suffix} + \text{string}}_{\text{root}}$    

In this case the decoder doesn't know yet the translation of this code but since the its first character has to be equal to the first character of the current code, in can use this one instead.

	CHARACTER = read first code 
	output CHARACTER
	OLD_CODE = CHARACTER
	WHILE there are still input characters DO
	    NEW_CODE = read next code
	    IF NEW_CODE is not in the translation table THEN
	        STRING = get translation of OLD_CODE
	        STRING = STRING + CHARACTER
	    ELSE
	        STRING = get translation of NEW_CODE
	    END of IF
	    output STRING
	    CHARACTER = first character in STRING
	    add OLD_CODE + CHARACTER to the translation table
	    OLD_CODE = NEW_CODE
	END of WHILE

The nice thing is that the decompressor builds its own dictionary on its side, that matches exactly the compressor's, so that only the codes need to be sent.

For an explanation concerning the gif variant see:
www.dcs.ed.ac.uk/.../GIF-comp.txt

Andreas Kriegl 2003-07-23