Honest book review: The Data Compression Book
Ever since I got into computer science and started noticing how good computers were at compressing files, I always wondered about the algorithms behind them. This book takes you back in time to explain how everything works.
This book is older than me, and it shows. You start reading the introduction chapter and read a bunch of stuff warning you about portable C code, to choose if you wanted to target 16 or 32 bit machines (yes, no typo there), and somewhere in there they mention some memory complexities that they consider significant (65KB) and that should be avoided in benefit of adaptive solutions that would not take up so much memory.
I’m typing this on a 64-bit quad-core processor with hyperthreading, rocking 16GB of RAM and I take every single of of those bits as granted. My generation is spoiled :)
I know what you’re thinking. Really. This book is indeed more stale than that bag of chips under your couch.
I like history, and let me tell you, this book has plenty.
Often times reading on a particular topic from a historical point of view gives you a better understanding about why things developed the way they did. If you didn’t read about how 65KB of RAM were significant back then, you’d think every researcher in the area was one of those micro-optimisation freaks. It turns out that adaptive techniques were only thought out because of memory constraints. I find that interesting and not that obvious, because as far back as I can remember, both Disk and RAM on my systems were always in the order of GigaBytes.
It’s a whole different computer science area.
Unsurprisingly, there are maybe 10–20 other books I could read after this one. But this is an introductory title, and you get a very good idea of how things, work down to the data structures used to represent compressed data.
There are things you just don’t think about before reading a book like this if you haven’t been exposed to the topic. For instance, compression results always depend on patterns. Input random noise in any compression algorithm and you will likely get either an output of the same size or even a larger sized file. This makes sense: compression is about summing things up and removing redundancy. If there is no redundancy or patterns to be removed, then there is no possible compression.
Turns out that lossy compression isn’t that scary either: it’s basically a static analysis of symbol frequencies (that results in a distribution), and you can smooth out that distribution. The part that gets smoothed is the loss, which can and should be parameterizable if the algorithm wants any decent chunk of users.
I only wanted to get the gist of it, but there is plenty of interesting reading material there — with code examples too. And in C, which even today sounds like a reasonable language choice to make when the goal is to design compression algorithms.
Those code examples though…
I read the book mostly at night, time at which my brain’s ability to parse and compile C code is significantly reduced (cheap excuse, right?). The included code was not missing comments, but the programming style did not really seem like modern day C programming. I didn’t see any wizardry going on, but it sure wasn’t easy to read! In any case, things would calm down as soon as you left the code area to get to another algorithm specification.
Summing up (TL;DR)
This book contains an excellent history lesson as well as good explanations about how data compression works, how you can measure its effectiveness, and presents some of the algorithms that are going to be studied.
My only issue with it is its structure. It is a top-down approach, so there is a somewhat repetitive pattern of New Algorithm Implementation code throughout the book. All of the algorithms themselves are different and serve different purposes, like compressing text, audio or image files, with a few esoteric and unexpected chapters like fractal compression techniques. In this sense, I would much rather see a theoretical presentation on all of the algorithms, and them leaving the code as a separate piece of literature only for interested nerds. I had to skim through lots of code listings because they are literally the C version of the previously explained algorithm. Sure, “Process stream until end of file is reached” is vague enough to justify an example implementation, but for me as an introductory reader it was just getting in my way.
I really appreciate you reading this far. As a token of my appreciation, here’s the end of the post.