Lempel-Ziv data compression algorithm can be very unstable
Imagine the following (strange) scenario: you have a pretty big file, let us say 1 TB, you compress it using your favorite compression algorithm and you get a file of size 100MB. This is a very nice compression size. But at some point you realize there is one typo in your original file… you correct the typo and compress the file again and, this time, the compressed file is much larger, of size 0.9 TB… for just a one-letter difference! Most compression algorithms fortunately do not have this strange behaviour; but if your favorite compression algorithm is called LZ’78, one of the most famous and studied of them, then this surprising scenario might well happen… In rough terms, this is what we have shown here, a result that was presented last year at the SODA conference 2018 (Symposium on Discrete Algorithms).
Lempel-Ziv algorithms refer to several lossless data compression algorithms based on a notion of dictionary which is constructed during the compression process. LZ’78 is one of them, created in 1978 by Abraham Lempel and Jacob Ziv, right after their first one called LZ’77. These algorithms are largely used in practice. For instance, gzip relies on an algorithm called deflate which is a combination between Huffman encoding and LZ’77; as other examples the image format GIF or the unix command compress are based on LZ’78.
It is fair enough to ask for some stability from a compression algorithm. In particular, it is reasonable to expect from two similar files to have similar compressed sizes. Yet, although widely used, such a robustness remained unclear for LZ’78. The one-bit catastrophe question, introduced by Jack Lutz in the late 90s, asks whether an infinite word compressible by LZ’78 can become incompressible by adding a single bit in front of it. Here, a word
is nothing else than a sequence of 0 and 1 that one can consider as a file written in binary.
Our main result is to answer that question positively, by constructing a word which is highly compressible but the perturbation of this word (when one bit is added in front of it) becomes incompressible. In fact, the situation is worse since there are infinite many such words. So, is this lack of stability a reason for stopping the use of LZ’78? Not at all. First, Lempel-Ziv algorithms already proved useful in many ways. Then, the structures of the words used to get catastrophes are highly specific and it is very unlikely to get them in practice. Of course, we can see some variations in the compression sizes but this is not by any mean close to the behavior of a catastrophe.