Understanding Compression

Aka Why the hell did I write a book?

As engineers, we always have that dirty little side project that steals our attention. Things that don’t really relate to our day job, but they get us excited, and keep us passionate and up late at night obsessing over things we don’t know.

For me, that obsession (addiction really) is Data Compression.

I’ve never found an area of computer science that was just so brilliantly complex, mis-understood, and absolutely critical to the operation of modern computing.

As such, I’ve written a book called Understanding Compression (available through O’reilly or on Amazon) which is my humble attempt to get other engineers as excited about data compression as I am.

As the final version of the book was sent off to the printers today (w00t!), I felt that it would be appropriate to give a little history on how it all got started.

We got lucky that our book about data compression includes a cover animal that makes itself smaller.

Hitting the books

My ability to learn data compression has been at the very generous time of all those experts who’ve created the algorithms, and then went on to document things for future engineers.

There’s a ton of already great, information dense, books out there on the topic Here’s a few of my favorites:

Data Compression : The complete reference

Handbook of data compression

Managing gigabytes

Burrows Wheeler Transform

Variable Length Codes for Data Compression

Fundamental data compression

Compression algorithms for real programmers

I’ve read all those books, and they are absolutely fantastic. Full of brilliant insight, algorithm descriptions, and history on where these algorithms came from. For almost 30 years now, it’s been texts like the ones above that’s been responsible for teaching engineers how to write compression algorithms. If you want to get into data compression, this is basically where you need to start.

I remember lugging a copy of Data Compression : The complete reference around in my backpack during trips to conferences; for almost a decade now, it’s been my go-to for in-flight reading.

Math is hard.

This is the fundamental algorithm for all data compression. Pretty clear to understand, right?

For all the research and reading, books on data compression tend to fall into two camps:

1) A language-specific discussion of the algorithms. Where you spend 2 pages describing how Arithmetic Compression works, and then 43 describing how to make it work in C++ where you don’t have arbitrary floating point division, and performance is a problem.

2) A math-specific discussion of the algorithms. Where you spend one paragraph discussing how things work, and then 2 pages of mathematical symbols which are used to describe the underpinnings of the theory.

If you stick with it, and re-read things like 40 times, it all starts to mesh together and make sense; but until then, both of these techniques create a large barrier to entry for engineers to understand and get excited about data compression.

This barrier was what has always bothered me about these types of books; They make it really hard for engineers with less than 5 years of CS experience AND a BS in Mathematics to get on board. They were technically accurate, from a lot of levels, but really difficult to approach for the day-to-day engineer.

For a long time now, I’ve wanted to try my hand at fixing this : How can we teach more engineers about Data Compression?

“Compressor Head”

In 2014 we were able to film the Compressor Head video series, which at it’s core, was always about two things:

1) Explaining data compression in the simplest way possible.

2) Emulating Alton Brown as much as possible.

As such, on the first day of scripting, I put a big sign above my desk that said:

Explain it with pictures

Make it entertaining

This was my attempt at trying to find a middleground to teach data compression : instead of using math, or code to describe the algorithm (which focused either too much on the math, or too much on the syntax) instead focus on how the data moved through the algorithm, and how the data structures were updated as a result. And describe it all with diagrams, sticky-notes, and physical props. I figured that if I could describe these really hard concepts using sticky-notes, pretty much anybody could learn it.

The Make it entertaining part meant trying to keep your attention. I mean, Data compression is a dry subject, and is typically presented in a very dry manner. Not to mention that video is a medium that competes with cats on the internet. So to get people to want to learn it, I had to resort to puppets, interviews, physical gags and a horrible attempt at acting.

Although the views were never as large as some of the Android Performance Patterns content I’ve done, Compressor Head has always had a great reception by the folks who’ve seen the content:

Every now and again, Reddit says something nice about me.

Fun fact : I’ve been approached @ conferences more times for the Compressor Head series, than any of my work on Android Performance Patterns.

But there was always one thing wrong about Compressor Head: it never told a cohesive story.

The topics were sort of hodgepodge sampling of names and algorithms which developers would generally search for, and find.

As it turned out, really understanding these topics required them to be presented in context of other algorithms, and where things fit into the bigger picture. I mean, how does LZ and Arithmetic fit together? Where does Huffman fit in? Should I apply BWT before all that? Is any of this used in JPG?

I wanted to start addressing a lot of these gaps in Season 3 of the series, but sadly, the view counts weren’t high enough to warrant another season :( As such I needed to try a different approach.

Fist Fights with Claude Shannon

When I originally approached O’Reilly about doing a book on data compression, the original title was “Fistfights with Claude Shannon”.

My hope was to present a text version of Compressor Head , where we explained algorithms with both story, history, diagrams, and again, no code. We could see how all the algorithms fit together in their grouping category, and see how they all played off each other to create large gains in compression. The intention was that you should understand the algorithm first, and then once that’s done, dive into all the cruft that needed to happen at an implementation level. All while keeping things light, funny, and entertaining (well.. as entertaining as a book could be..)

The title itself came from an observation I came to while writing the content : Most of the data transforms we go through (LZ, BWT, RLE etc) are about cheating the entropy value of a data stream. By transforming it into another form, we make the entropy value lower. So, in a way, data compression is all about sidestepping Claude Shannon’s work, while giving it a nod as we pass by.

In retrospect, the title is both infinitely awesome, and horrible for SEO. And thankfully, Aleks Haecky jumped on board to help rein in my obtuse ramblings, and turn the book into a cohesive story that made sense, and actually taught you something.

The result was the amazingly awesome tome we have today.

Understanding Compression

And that’s the whole point. I love data compression, and I want every computer engineer to love it too. It’s an amazing skillset that’s useful to have, no matter what your sub-industry or profession. So for about 8 years now, I’ve been trying to teach other people about data compression, and get them excited about it too.

Even if you don’t buy our amazing, awesome, fantastic book, you should still go learn about Data Compression. Read the other books. Watch the other videos. Download samples. Whatever. Just go do it.

Because, as I like to say : Every Bit Counts