Beating the best compressor on the planet, by 1 bit.

You read that right: I can beat the best compressor on the planet, by 1 bit, every time.

Doesn’t matter the compressor. Doesn’t matter the source data. It doesn’t matter how much statistical redundancy has been removed from the stream.

I win. Here’s how.

Compressing 4 bits, to 3 bits.

Let me first introduce you to a completely stupid, arbitrary, and useless encoding that I came up with one night, when I should have been doing something important instead.

Let’s take a random 4 bit number: 1011.

Divide it into a pair of two 2 bit numbers: [10,11] (A and B, respectively).

Sort the pair, such that the smaller value comes first: [10,11]
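In code, that split-and-sort step looks something like this (a throwaway Python sketch, nothing official):

```python
value = 0b1011                            # our random 4 bit number
a, b = value >> 2, value & 0b11           # split into the two 2 bit halves: 10 and 11
pair = sorted((a, b))                     # smaller half first
print([format(x, "02b") for x in pair])   # ['10', '11']
```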

Once we have this pair, we can then start eliminating pair-combinations in our 2-bit, pairwise space (yes, confusing… stick with me here).

Now, math tells us that there are 16 possible combinations of these two numbers (shown below as 2 bit pairs):

[00,00], [00,01], [00,10], [00,11], [01,00], [01,01], [01,10], [01,11],

[10,00], [10,01], [10,10], [10,11], [11,00], [11,01], [11,10], [11,11].

This count is known: there are 4 total bits, and thus 2⁴ = 16 possible pairs. (yay math!)

However, note that some of the pairs contain the same values, but in different orders. For example [10,00] and [00,10] have the same values, but the order of the elements is reversed in each pair.

It’s pretty easy to see at this point that, given 16 order-dependent pairs, there are 10 order-independent pairs:

[00,00], [00,01], [00,10], [00,11], [01,01], [01,10], [01,11], [10,10], [10,11], [11,11].

Of these 10, there are 4 uniform pairs: [00,00], [01,01], [10,10], [11,11]. For the sake of argument, we’re going to set those aside for now, and I’ll get back to them in a little bit.

This leaves us with 6 order-independent, non-uniform (OINU) pairs: [00,01], [00,10], [00,11], [01,10], [01,11], [10,11].
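If you want to sanity-check that count of 6, a couple of lines of Python (purely for illustration) will enumerate the OINU pairs:

```python
from itertools import combinations

# All 2 bit values: 00, 01, 10, 11
values = [format(v, "02b") for v in range(4)]

# combinations() yields each unordered pair exactly once, and never pairs a
# value with itself -- which is exactly the order-independent, non-uniform set.
oinu_pairs = list(combinations(values, 2))

print(oinu_pairs)       # [('00', '01'), ('00', '10'), ('00', '11'), ('01', '10'), ('01', '11'), ('10', '11')]
print(len(oinu_pairs))  # 6
```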

The encoding portion of this transform involves finding which of the OINU pairs is equal to our sorted source pair ([10,11]). In this case, it’s the pair at index 5, and since there are only 6 possible OINU pairs, an index requires just ceil(log2(6)) = 3 bits to encode. As such, we write out the binary value 101 to our output stream.

See what we did there? Our source set was 4 bits, but our encoded set is 3 bits.

Decoding is straightforward. Given 3 bits, we know that we’re decoding a 4 bit number (since we know we’re only saving 1 bit), and we’re indexing into pairs of order-independent, non-uniform 2 bit values. We just generate the OINU pairs in the same canonical order, and use the 3 bits as an index.
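Here’s a minimal sketch of that encode/decode round trip for the 4 bit case. It’s my own toy Python, and the encode/decode names are just placeholders:

```python
from itertools import combinations

# The OINU pairs of 2 bit values, in a fixed canonical order that the
# encoder and decoder both agree on.
OINU = list(combinations(range(4), 2))   # [(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)]

def encode(value4):
    """Map a 4 bit value to a 3 bit index. Note: uniform pairs (e.g. 0101) aren't handled."""
    a, b = value4 >> 2, value4 & 0b11    # split into two 2 bit halves
    pair = tuple(sorted((a, b)))         # smaller half first
    return OINU.index(pair)              # an index in 0..5 fits in 3 bits

def decode(index3):
    """Map a 3 bit index back to *a* 4 bit value."""
    a, b = OINU[index3]
    return (a << 2) | b

print(format(encode(0b1011), "03b"))     # '101' (index 5, as in the walkthrough)
print(format(decode(0b101), "04b"))      # '1011' (but 1110 would also have encoded to 5)
```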

But, we’re ignoring something: Ordering.

In this encoding, 1011 and 1110 would both encode to the same output value of 101 (since their sorted pairs are the same). We’ve effectively lost the order of the pair through this encoding (thus, it’s a lossy encoding). But, this is a good thing. Let’s look at why this works.

It’s just math.

There’s actually nothing fancy going on here. This is a simple matter of probability space reduction.

Consider any non-uniform pair of 4 bit values where order doesn’t matter: you end up with 120 unique combinations (15+14+13+…+1), and ceil(log2(120)) = 7. As such, any such 4 bit pair can be encoded with 7 bits, as long as you don’t care about ordering.

And thus, this works for all values of N. A 32 bit number can be encoded in 31 bits, and a 10,000,000 bit number can be encoded in 9,999,999 bits; just follow the above flow.
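If you want to check that arithmetic for a few sizes, here’s a quick sketch (again, illustrative Python only; encoded_bits is a name I made up):

```python
from math import ceil, comb, log2

def encoded_bits(total_bits):
    """Bits needed to index an order-independent, non-uniform pair of
    (total_bits / 2)-bit halves."""
    half = total_bits // 2
    pair_count = comb(2 ** half, 2)      # C(2^half, 2) unordered, non-uniform pairs
    return ceil(log2(pair_count))

for n in (4, 8, 16, 32):
    print(n, "->", encoded_bits(n))      # 4 -> 3, 8 -> 7, 16 -> 15, 32 -> 31
```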

But, the catch is that we don’t retain the order of the pair during decoding, and we don’t allow for situations of uniform pairs. Thankfully, those world-class, cutting edge data compressors remove this problem for us.

Beating GZIP

I love harping on GZIP, so I’m biased to choose it as the victim of this experiment.

Assume we’ve taken some data and GZIP’d it. The resulting bit-stream is 1024 bits long, and has a bit-entropy of 1.0. Now, apply the above algorithm to it; we end up with a 1023 bit stream.

During the decoding phase, we now have a pair of 512 bit values, but we have to answer the question: which value goes first?

This is the magic part.

Most compression formats nowadays (gzip, lzham, zpaq) have the interesting feature that they are also file formats, and contain some sort of header information that identifies them as a valid compressed archive. For example, gzip has the 2 byte header of 0x1f8b. It always exists as the first part of the gzip file. If it’s not there, then it’s an invalid gzip file, and it can’t be decoded.

As such, figuring out which of my pair values goes first is really as simple as figuring out which one starts with the GZIP header value: that’s the right one. So, as long as the gzip header doesn’t magically exist at the exact split-point of the bit-stream (a false header), you can shorten any post-gzipped data by 1 bit.
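To make that “which half goes first” step concrete, here’s a rough Python sketch. It works at byte granularity rather than at the bit level, it assumes the magic bytes never accidentally land at the start of the wrong half, and order_halves is just a name I made up:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def order_halves(half_a, half_b):
    """Given the two unordered halves of a gzip'd stream, put the one that
    carries the gzip magic bytes first."""
    if half_a.startswith(GZIP_MAGIC):
        return half_a + half_b
    if half_b.startswith(GZIP_MAGIC):
        return half_b + half_a
    # The "false header" / no-header edge case: the scheme just punts here.
    raise ValueError("neither half starts with the gzip magic")

# Quick check: split a real gzip stream down the middle and recover the order.
data = gzip.compress(b"hello hello hello")
mid = len(data) // 2
front, back = data[:mid], data[mid:]
assert order_halves(back, front) == data
```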

This also gives us a chance to revisit that “uniform pairs” thing that I discarded earlier. For the standard encoding alone to work, those uniform pairs would force us to use more bits. But, when we’re compressing post-gzip’d data, there’s a vanishingly small probability that a post-compressed stream would ever produce a perfectly uniform set of bits across our pair values. If, by some cosmic occurrence, it does, then we can just say “sorry, can’t save the bit.”

With this in mind, I define the “Colt Bit”:

Colt Bit: A bit which can be removed from any stream, as long as the header of the stream can be identified by your decoder, and a perfectly uniform pair of halves is improbable.

That’s the stupidest thing I’ve heard today.

I completely agree. There’s no way this is practical in any significant way.

I mean, who cares about saving 1 bit of a 2.4GB file? And, if we know what type of file it is, why not eliminate the _entire_ header, and save 16 bits? Not to mention that there are a few caveats which sound pretty limiting. Like what happens when you’ve got the header at the start of the 2nd bit-string? But that isn’t new. Every data compression algorithm has some edge case that it can’t compress. (For example, run a 256 byte text file w/o repeated symbols through GZIP, and you’ll end up getting a bloated stream. Or re-size your PNG file by two pixels, and end up with a file 2x larger.)

The idea here is to highlight two points:

Firstly, I’ve mentioned how interesting Kolmogorov complexity is, with respect to data compression. In reality, this metric is about “knowing things” about your data stream, such that you can move that knowledge out of the stream and into the algorithm. The Colt Bit is a perfect example of that, in action. We know that we’re going to be dealing with an archived format, and we know it’s going to have a header, so we can save one bit as a result. The Facebook engineers did something similar with their JPEG thumbnails. They were able to completely remove the header from the data, since they knew exactly what filetype was being transferred.

Secondly, in my previous post, I talked about how statistically focused our current world-view is with respect to data compression. Once all the symbols become statistically even, our compression systems stop working. My stupid little encoding has nothing to do with symbol probability, yet it can still compress information that other compressors can’t, by taking advantage of knowledge about the stream. The point: there’s much more out there to be discovered that has nothing to do with removing statistical redundancy.

And while developers are scrambling to get their 1MB images compressed to 200 bytes so that their app isn’t too expensive for 1 billion users in the India market, maybe we should be thinking about new ways to approach the problem.

So, I know what you’re thinking: Can the Colt Bit be removed, recursively?

Yes it can… but that’s a different post.