Unexplored areas in Data Compression

A good colleague of mine, Rich Geldreich put up a interesting article talking about his viewpoints in future directions in lossless compression technology. So I figured that I might toss out my own thoughts out too. (hey, it’s the internet, right?).

From my perspective, there’s two active areas of data compression progress right now:

The first one, is focused on finding a sweet-spot between runtime decoding performance and data compression ratio (LZHAM, Brotli, LZFSE). Which, from a pragmatic perspective, is the right thing to focus on. I mean after all, the viability of a compression system is tied directly to its helpfulness for the programs that can implement it, so these folks are fighting the good fight.

While the second one, is focused on applying Neural Networks to data compression. Sure ZPAQ, and other Context Mixing based codecs aren’t as applicable for your day-to-day app developer (due to high memory and long runtimes), but they are still the cutting-edge with respect to pure compression ratio. It’s a simple concept : Given the right amount of modeling information, Context Mixing algorithms can achieve the greatest level of compression.

Amazing to see that the top 15 data compressors for large text data are all built on Context Modeling style compressors. Props to BWT for staying in the top 20.

In a nutshell, both of these big movements are focusing on making improvements to things we already know. Basically combining various existing stages of data transforms, statistical encodings, and context mixing to produce better results.

But something here doesn’t sit right with me.

All of this work is fine, but it seems to be directing itself towards a point of diminishing returns. Now, some would say that reaching that point means compression is “solved” but I don’t buy that. I’ve been absorbing a lot of books on the topic lately, (perhaps too many..) and things are far from solved: we’ve got some large holes in our approach to data compression

Perhaps some of this is too theoretical to be valid. Perhaps it’s just crazy. But there’s gaps in our current approach to information theory & data compression that should be addressed.

#1 Permutations break compression

Modern data compression theory is designed around statistics. It’s been almost 200 years since the first concept of assigning the smallest codes to the most probable symbol was created. And in all that time, we haven’t done anything more than create new ways to apply the concept. Worse, we’ve been anchored by it.

See, according to modern data-compression theory, once all the statistical information has been extracted, we can’t compress things any further. Basically, once we’re done w/ statistics, we’re done. But this mentality may be one of our biggest faults.

Consider a classical permutation [5, 7, 1, 4, 6, 3, 2, 0]. According to Information theory, this is incompressible: each symbol is equally likely to occur, and there’s no duplicates. Statistically speaking, we’re done here.

A permutation, and a representation of its cycle ordering.

As such, statistical encoders (huffman, arithmetic, ANS) won’t help. Neither will higher-level transforms like LZ, BWT, and RLE. Delta compression might work, but in practice doesn’t reduce the ranges enough to warrant its use. Binary versions of Context Mixing might be able to give us some wins, as long as the bits fall into a particular patterns.

Basically, every part of our current data compression models are broken when it comes to permutations. There’s shocking little research in this area considering its difficulty.

I’m sure no one really cares about this topic. I mean the chances of your data ever being a perfect permutation is astoundingly low, so most researchers aren’t looking here. But to me, it seems odd that a concept as simple as a permutation can fundamentally break everything that our current data-compression empires are built on.

#2 Where’s BWT’s cousins?

It’s amazing how tied our current compression systems are to linearity of data. Literally every algorithm takes into account some form of adjacent context information in its process, and uses it for encoding, no one really messes with the order of symbols in our streams.

This makes sense though, encoding ordering is hard, and requires extra data. If I sorted the entirety of my dataset, I could apply any number of algorithms to help produce better compression. However putting it back in the original order would require a massive chunk of data that is basically a permutation (and we already talked about how hard that is to compress).

BWT appears to be the only algorithm which adjusts the ordering of data in any significant way. (Granted you don’t get perfect sorting, but the data gets clustered enough to be relevant)

Despite my shenanigans, BWT is still a top-algorithm for compressing human genomes.

What’s amazing about BWT, is that there’s nothing else like it. As far as lexicographical permutations go, BWT stands alone in the compression world. Sure, there’s a few modifications to BWT, but nothing like LZ, which has something like 90 variants. Where’s the rest of our sorting transforms?

To be fair, the lack of diversity here make some sense; even the original authors of BWT don’t know how they found the algorithm; so we can’t expect random engineers to find similar venues without blindly stumbling into it.

But BWT can’t be the only type of transform like this. As the old saying goes: “where there’s one, there’s many”. We just haven’t found the rest of them yet.

#3 Kolmogorov complexity

Entropy is a broken measurement. The fact that [0,1,2,3] and [0,3,1,2] have the same entropy value is annoying, since we can explicitly view the dataset and realize that one of them can be represented with a smaller form.

Kolmogorov complexity is a really interesting measurement; Basically, what is the size of your dataset, including the size of the program that generates the dataset. Which is a new concept wrt data compression: what can we move from the stream, and into the algorithm?

Kolmogorov Complexity really starts to shine when you start talking about using logic synthesis or program synthesis for compression, which in essence take the bit stream of your data set, and reverse generate a program that will uniquely generate it. Then the problem becomes producing the smallest code that represents the dataset uniquely. It’s kinda like code-golf, but for data compression.

This above image was created entirely using Fractal Algorithms. The size of the total algorithm footprint is significantly smaller than the output image is, even when ran through top image compressors.

This is a completely unexplored space, and perhaps the ultimate frontier. Being able to take a block of data, and extract some portion of it into an algorithm, where the sum of the algorithm size, and the resulting data is smaller than the source information. But I fear this might be more sci-fi than reality. Given everything in our current understanding of computers, I’m not convinced we even have the vocabulary to describe such an undertaking. In any case, it’s exciting to consider.

We ain’t done yet.

Looking at these three problems tells me that data compression is far from being solved; We’ve just been focusing on the things we can influence, rather than the things that are possible. The truth is that there’s lots of exploration out there left to be done. Information Theory, is still a theory after all, and hasn’t graduated to law just yet.

And as they say, theories are meant to be broken.