Bibliothèque Babel (2013), Jean-François Rauzier

The Secret Behind Lempel-Ziv Compression Algorithm

Aviv Yaniv
Courisity is a Drug

--

Do you like exchanging funny GIFs with your friends? 😂 Zipping files? 🤐 Or have you just downloaded a PNG image from your last trip? 🌴 Even if none of the above... you must have enjoyed an EA game! 🎮

What all these data marvels have in common is that they owe part of their existence to two genius minds:

Abraham Lempel
Jacob Ziv

They invented a way to compress information without losing any part of it in the process (lossless compression, in fancy words), utilizing our beloved data structure from interviews (a dictionary, of course)! 📖

How it All Began

What makes this story even more amazing is how a bright insight, sparked while looking at an old prayer, became the basis of so many compression techniques and earned its inventors the IEEE Medal of Honor! 🎖

So let’s start from the beginning…

This is my second article that tells the story of an algorithm and traces the Eureka moment behind its invention 💡

When Prof. Ziv was asked about the inspiration for their compression algorithm, he mentioned an old Jewish prayer called “Avinu Malkeinu” (“Our Father, Our King”), which is recited during the Ten Days of Repentance, praying earnestly to God.

The prayer is very long so here are the first five sentences:

אָבִינוּ מַלְכֵּנוּ חָטָאנוּ לְפָנֶיךָ רַחֵם עָלֵינוּ:

אָבִינוּ מַלְכֵּנוּ אֵין לָנוּ מֶלֶךְ אֶלָּא אָתָּה:

אָבִינוּ מַלְכֵּנוּ עֲשֵׂה עִמָּנוּ לְמַעַן שְׁמֶךָ:

אָבִינוּ מַלְכֵּנוּ חַדֵּשׁ עָלֵינוּ שָׁנָה טוֹבָה:

אָבִינוּ מַלְכֵּנוּ בַּטֵּל מֵעָלֵינוּ כָּל־גְּזֵרוֹת קָשׁוֹת וְרָעוֹת:

As you can see, this is a very repetitive prayer, with the prefix “אָבִינוּ מַלְכֵּנוּ” (which is also the name of the prayer) starting each sentence.

Can we represent it in a less verbose manner?

Let’s write it out in full in the first sentence, and then refer to this prefix (אָבִינוּ מַלְכֵּנוּ) in the following sentences by a ditto mark (“):

אָבִינוּ מַלְכֵּנוּ חָטָאנוּ לְפָנֶיךָ רַחֵם עָלֵינוּ:

“ אֵין לָנוּ מֶלֶךְ אֶלָּא אָתָּה:

“ עֲשֵׂה עִמָּנוּ לְמַעַן שְׁמֶךָ:

“ חַדֵּשׁ עָלֵינוּ שָׁנָה טוֹבָה:

“ בַּטֵּל מֵעָלֵינוּ כָּל־גְּזֵרוֹת קָשׁוֹת וְרָעוֹת:

That’s a 26% reduction in characters, for just 5 sentences!
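The ditto-mark trick is easy to replicate in code. Here is a toy sketch in Python that performs the same prefix replacement on an English rendering of the prayer (the English lines are my own stand-ins; the 26% figure above refers to the original Hebrew):

```python
# A toy version of the ditto-mark trick: every line after the first that
# starts with the shared prefix gets that prefix replaced by a single ".
lines = [
    "Our Father, Our King, we have sinned before You",
    "Our Father, Our King, we have no King but You",
    "Our Father, Our King, act toward us for Your name's sake",
    "Our Father, Our King, renew for us a good year",
]
prefix = "Our Father, Our King,"

compressed = [lines[0]] + ['"' + line[len(prefix):] for line in lines[1:]]

before = sum(len(line) for line in lines)
after = sum(len(line) for line in compressed)
print(f"Saved {before - after} characters ({1 - after / before:.0%} reduction)")
```

Each replaced line trades the whole prefix for one character, so the savings grow with every additional repetitive sentence.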

This was a piece of cake! 🍰

Are we on the verge of the IEEE medal discovery?

Simplifying the Lempel-Ziv Algorithm

Let’s explore the Lempel-Ziv algorithm through a travelers’ phrasebook analogy:

  1. Travel Phrasebook Starter Kit: Imagine you’re traveling and want to learn the local language. You buy a small phrasebook with basic words (“hello,” “goodbye”). This is similar to the Lempel-Ziv algorithm, which starts with a dictionary of basic building blocks like individual letters, numbers, and symbols.
  2. Learning on the Go: As you travel further, you encounter new situations and need new phrases. You might ask a local vendor, “How much does this cost?” and jot that down in your phrasebook. The Lempel-Ziv algorithm works the same way. It scans the data it’s processing, like a phrase you hear, and looks for familiar parts. In this case, it might recognize “this cost” from previous encounters.
  3. Shorthand for Efficiency: Here’s the cool trick: you don’t write down every single word. The Lempel-Ziv algorithm keeps track of recently seen words and phrases (like a window) and checks for existing combinations. If it finds a match (like “the price is”), it outputs a shorter code for that whole phrase instead of writing each word separately, and records the slightly longer phrase (say, “the price is too”) for next time.

By constantly analyzing the data and building upon existing phrases, the Lempel-Ziv algorithm, like your growing phrasebook, becomes more efficient at understanding and communicating in the new language. This “shorthand” approach allows it to compress data by recognizing and reusing these patterns, making the data smaller without losing any essential information.
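The phrasebook analogy maps quite directly onto LZ78-style parsing. Below is a toy sketch (the name `build_phrasebook` and the word-level granularity are my own choices for readability; real implementations work on bytes, not words):

```python
def build_phrasebook(text: str):
    """LZ78-style parsing over words: repeatedly match the longest phrase
    already in the book, then learn that phrase extended by one new word."""
    phrasebook = {(): 0}          # the empty phrase starts the book
    output = []                   # (phrase_index, new_word) pairs
    current = ()
    for word in text.split():
        candidate = current + (word,)
        if candidate in phrasebook:
            current = candidate   # keep extending a known phrase
        else:
            output.append((phrasebook[current], word))   # emit reference + novelty
            phrasebook[candidate] = len(phrasebook)      # learn the new phrase
            current = ()
    if current:
        output.append((phrasebook[current], ""))         # flush a trailing match
    return phrasebook, output
```

On repetitive input like “the price is the price is too high the price is too low”, the output has fewer entries than the input has words, because whole phrases collapse into single references.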

Delving into the Algorithm

Lempel and Ziv published two papers detailing two versions of this algorithm: the first in 1977 (aka LZ77 or LZ1) and the second in 1978 (aka LZ78 or, surprisingly, LZ2). Ever since, many more bells and whistles have been added to season these original algorithms.

I chose to demonstrate Lempel–Ziv–Welch (LZW) below. Imagine you’re summarizing a long book: the basic LZ would be like rewriting every sentence that appears more than once, while LZW would be like creating a list of unique sentences and then referencing them by number throughout the summary. This saves space by avoiding repeated sentences.

This is because the basic LZ doesn’t maintain an explicit dictionary of repeating patterns; it finds matches on the fly within a window over the data already processed, which can be inefficient because it must search that window for every new pattern. LZW, as it processes the data, checks an explicit dictionary for existing patterns. If a pattern is found, it outputs a reference code for that pattern instead of the entire pattern again. This dictionary approach reduces redundant searches and makes compression more efficient.
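To make the dictionary idea concrete, here is a minimal LZW encoder and decoder in Python. This is a sketch for illustration, working on single characters with integer codes; production codecs add variable-width bit packing, dictionary resets, and other refinements:

```python
def lzw_compress(data: str) -> list[int]:
    # Start with a dictionary of basic building blocks (all 256 byte values),
    # like the phrasebook's starter kit of basic words.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate               # keep extending the match
        else:
            output.append(dictionary[current])  # emit code for longest match
            dictionary[candidate] = next_code   # learn the new phrase
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])      # flush the final match
    return output

def lzw_decompress(codes: list[int]) -> str:
    # Rebuild the same dictionary on the fly; no dictionary is transmitted.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    result = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:                        # edge case: code for the phrase being built
            entry = prev + prev[0]
        result.append(entry)
        dictionary[next_code] = prev + entry[0]  # mirror the encoder's learning
        next_code += 1
        prev = entry
    return "".join(result)
```

On repetitive input such as `"TOBEORNOTTOBEORTOBEORNOT"`, the encoder emits fewer codes than there are input characters, and the decoder recovers the original exactly, because both sides grow identical dictionaries from the data itself.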

Enjoyed this article? Feel free to long-press the 👏 button below 😀
