CSci 115: Computing and the Internet
Home Syllabus Assignments Tests

Lempel-Ziv-Welch compression

Compression algorithm

1. Initialize table.
2. cur ← first character of file.
3. K ← next character of file.
4. Is cur + K in table?
    Yes: curcur + K.
    No: Add cur + K to table.
Output code for cur.
curK
6. If characters remain in file, repeat from step 3.
7. Output code for cur.

The easiest way to see how this works is to examine how it works on an example. We'll see what happens when we attempt to compress BANANANA, where the table is initialized with B mapped to 0, A mapped to 1, and N mapped to 2.

table cur K output
B:0, A:1, N:2 B A 0
B:0, A:1, N:2, BA:3 A N 1
B:0, A:1, N:2, BA:3, AN:4 N A 2
B:0, A:1, N:2, BA:3, AN:4, NA:5 A N
B:0, A:1, N:2, BA:3, AN:4, NA:5 AN A 4
B:0, A:1, N:2, BA:3, AN:4, NA:5, ANA:6 A N
B:0, A:1, N:2, BA:3, AN:4, NA:5, ANA:6 AN A
B:0, A:1, N:2, BA:3, AN:4, NA:5, ANA:6 ANA 6

Thus, the output for BANANANA is 0,1,2,4,6.

Uncompression algorithm

Basically, what Lempel-Ziv-Welch compression does is to build up a table of character sequences found earlier in the file; then, when a character sequence is later found that occurred earlier, it can simply refer back to the earlier sequence. Thus, if the file has any frequently occurring sequences, each will be compressed into just one code. This happens with English text, for example: For one thing, it has the three-letter sequence the quite a bit.

The trick to Lempel-Ziv-Welch compression in to build the table in such a way that the table can be reconstructed as the file is decompressed. The following algorithm does this.

1. Initialize table.
2. k ← first code in file.
3. Output string for k.
4. oldk.
5. k ← next code in file.
6. Does code k exist in table?
    Yes: Output string for k.
s ← string for old + first character of string for k.
Add s to table.
    No: s ← string for old + first character of string for old.
Output s.
Add s to table.
7. If codes remain in the file, repeat from step 4.

Again, we'll see how to uncompress this. Suppose our initial table again has B mapped to 0, A mapped to 1, and N mapped to 2, and we want to uncompress the code sequence 0,1,2,4,6.

table old k output
B:0, A:1, N:2 0 B
B:0, A:1, N:2 0 1 A
B:0, A:1, N:2, BA:3 1 2 N
B:0, A:1, N:2, BA:3, AN:4 2 4 AN
B:0, A:1, N:2, BA:3, AN:4, NA:5 4 6 ANA
B:0, A:1, N:2, BA:3, AN:4, NA:5, ANA:6 6

When we read past the last code in the file, we stop. In this case, the overall output is BANANANA, which is indeed the string that we started out with.