Polar Opposites

Colt McAnlis
Published in Google Developers
Aug 12, 2015


Compression and encryption have much more in common than you might have guessed.

Sometimes, I stare at code too long. Sometimes, all the mathematical symbols, research papers, and memory algorithms blur together in some geosynchronous manner that resembles Gallifreyan. But in those moments of insanity, some things become clear: why is it so difficult to compress data that’s been encrypted?

For most folks, compression and encryption have nothing to do with each other. They are generally different types of algorithms with completely different end goals: one to reduce the amount of redundant information in a data set; the other, to secure a data set from unauthorized viewers.

And the reason that they don’t work together very well has everything to do with how each one approaches statistical manipulation.

Compression & Statistics

The earliest compression algorithms (and in modern cases, the most powerful) relied on the observation of redundancy in data sets. That is, if the words “bald guy” are repeated multiple times in a text file, chances are we can get rid of some of that redundancy. Variable Length Codes, Huffman Trees, and Arithmetic Compressors can all be described as statistical compressors. That is, these algorithms seek to take advantage of the skewed probability of a data set in order to achieve compression.
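To make that concrete, here’s a minimal Huffman-coding sketch in Python (an illustration of the idea, not production code): count symbol frequencies, repeatedly merge the two least probable nodes into a tree, and read the codes off the branches. More probable symbols end up with shorter codes.

```python
# Minimal Huffman-coding sketch: frequent symbols get shorter codes.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (frequency, tie_breaker, tree); a tree is either a
    # leaf symbol (str) or a (left, right) tuple of subtrees.
    freq = Counter(text)
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a single symbol
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

text = "bald guy bald guy bald guy went to the store"
for sym, code in sorted(huffman_codes(text).items(), key=lambda kv: len(kv[1])):
    print(repr(sym), code)
```

Run it and you’ll see the frequent symbols (spaces, “a”, “b”) land near the top of the list with the shortest codes, while the one-off letters get the longest ones.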

Take Morse Code, for example. This international code was invented back in 1836 to transmit basic data across a telegraph device.

In order to reduce the amount of work needed to communicate the data, Morse Code assigned each letter a different sequence of dashes and dots. More probable letters, like “e”, get the shortest codes. And nearly 180 years later, this is still how statistical compression works: assign the most probable symbols the fewest bits and, boom, compression. In fact, the more skewed the histogram, the better the compression you get.
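If you want to put a number on “more skewed histogram = better compression,” Shannon entropy is the usual yardstick: it’s the lower bound, in bits per symbol, for any statistical compressor that ignores context. A quick sketch, using two made-up strings just to show the contrast:

```python
# Shannon entropy in bits per symbol: lower entropy means a more skewed
# histogram, which means a statistical compressor can squeeze harder.
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

skewed  = "eeeeeeeeeetaoinshr"   # heavily skewed, a bit like English letter usage
uniform = "abcdefghijklmnopqr"   # every symbol equally likely

print(entropy_bits_per_symbol(skewed))   # ~2.3 bits/symbol
print(entropy_bits_per_symbol(uniform))  # ~4.2 bits/symbol
```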

A histogram of a dataset that statistical compressors would really like to take on a date and get to know better.

It’s worth pointing out that modern compression systems take this process even further. Statistical compressors are usually preceded by a contextual transform. Contextual transforms (like dictionary encoding, delta encoding, or Move-To-Front encoding) all seek to transform a given data set so that its probabilities become even more skewed, making it more efficient for a statistical compressor to encode.
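As an illustration, here’s a tiny Move-To-Front sketch; notice how runs of repeated bytes come out as runs of zeros, which is exactly the kind of skew a statistical compressor loves:

```python
# Move-To-Front sketch: recently seen bytes map to small indices, so
# repetitive input turns into output heavily skewed toward zero.
def move_to_front(data: bytes) -> list:
    alphabet = list(range(256))
    out = []
    for byte in data:
        idx = alphabet.index(byte)
        out.append(idx)
        alphabet.pop(idx)
        alphabet.insert(0, byte)
    return out

print(move_to_front(b"aaaabbbbaaaa"))
# -> [97, 0, 0, 0, 98, 0, 0, 0, 1, 0, 0, 0]
```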

The gist is this: compression works by exploiting probability skewing in a data set.

Encryption & Statistics

On the complete other side of this coin is encryption: the process of encoding messages or information in such a way that only authorized parties can read it. Some of the most basic and traditional ways to do this involve re-assigning letters. For example, the Caesar Shift works by replacing each letter with the letter N positions away from it in the alphabet.
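A minimal Caesar-shift sketch (shifting forward by N here, but either direction works the same way):

```python
# Caesar-shift sketch: replace each letter with the letter N positions away.
def caesar_shift(text: str, n: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)              # leave spaces and punctuation alone
    return "".join(out)

secret = caesar_shift("attack at dawn", 3)
print(secret)                    # -> "dwwdfn dw gdzq"
print(caesar_shift(secret, -3))  # -> "attack at dawn"
```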

While this keeps most decoder rings safe, the application of statistical modeling to code cracking meant that simple shifting systems no longer worked. Because “E” happens to be the most probable symbol in the English language, you can assume that the most probable symbol in the encrypted data is the letter “E” and then work backward to figure out the rest of the symbols. Effectively, the statistical skewing of the English language is leveraged to break the code.
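Here’s roughly what that statistical attack looks like in code. It reuses the caesar_shift sketch from above and assumes the most common ciphertext letter stands for “e”, which tends to hold for longer English samples:

```python
# Frequency-analysis sketch: assume the most common ciphertext letter is 'e',
# then undo the implied shift. Works best on longer English samples.
from collections import Counter

def crack_caesar(ciphertext: str) -> str:
    letters = [ch for ch in ciphertext.lower() if ch.isalpha()]
    most_common = Counter(letters).most_common(1)[0][0]
    guessed_shift = (ord(most_common) - ord('e')) % 26
    return caesar_shift(ciphertext, -guessed_shift)  # caesar_shift from the sketch above

sample = "the quick brown fox jumps over the lazy dog, three times over"
print(crack_caesar(caesar_shift(sample, 7)))  # recovers the original sentence
```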

Modern encryption algorithms fight against this type of cracking by moving beyond a simple 1:1 replacement of symbols. Instead, they transform the data with math designed to remove statistical skewing, so that no symbol in the output stands out.

The result is that encrypted data has a much flatter histogram. No symbol’s frequency lines up with the probability distribution of English text, so it’s much more difficult to crack these things.
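You can see the flattening for yourself. The sketch below fakes the encryption step by XORing the plaintext with a random keystream from os.urandom (a crude stand-in for a real cipher, not something to ship), then compares the byte counts:

```python
# Compare byte histograms: English-ish plaintext vs. the same bytes XORed
# with a random keystream (a crude stand-in for a modern cipher's output).
import os
from collections import Counter

plaintext = (b"compression and encryption have much more in common "
             b"than you could have guessed. " * 50)
keystream = os.urandom(len(plaintext))
scrambled = bytes(p ^ k for p, k in zip(plaintext, keystream))

print("plaintext :", Counter(plaintext).most_common(3))   # a few byte values dominate (space, 'o', 'e')
print("scrambled :", Counter(scrambled).most_common(3))   # counts are nearly flat across 256 values
```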

A histogram of letter usage in the English language, and a histogram of an encrypted block of text.

The gist is this: encryption seeks to remove all symbol skewing, to improve security.

Oil & Water

In effect, this is the primary reason it’s so hard to compress encrypted data. A good encryptor removes exactly the statistical skewing that a compressor needs in order to work reliably. Sure, there may be a little redundancy left around to play with, but in general, the compression savings you’ll get on post-encrypted data will be really small.
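A quick experiment with zlib makes the point, using the same random-keystream XOR as a stand-in for encryption:

```python
# Compression savings before vs. after the XOR "encryption" stand-in.
import os
import zlib

plaintext = b"bald guy " * 1000                     # 9,000 highly redundant bytes
keystream = os.urandom(len(plaintext))
ciphertext = bytes(p ^ k for p, k in zip(plaintext, keystream))

print(len(plaintext), len(zlib.compress(plaintext)))    # 9000 -> a few dozen bytes
print(len(ciphertext), len(zlib.compress(ciphertext)))  # 9000 -> roughly 9000 (or slightly more)
```

Compress first, then encrypt, and you get the best of both; do it the other way around and the compressor has nothing left to grab onto.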

Looking to learn more great party tricks and other tidbits on information theory? Then check out the Compressor Head video series on the Google Developers YouTube Channel.
