The art of compression
A noob’s take on making things smaller
Data compression is a niche topic in Computer Science. Simply put, it’s the art of making files smaller while preserving the information they hold. People usually create zip files when they want to make their files smaller. But have you ever wondered what happens behind the scenes? How is the data made smaller? Who is that pretty lady in the picture?
To answer the most important question first: the pretty lady in the picture is Lena, a Swedish model. Although she has nothing to do with compression, her image has been used as a sample in many compression papers to compare image compression efficiency across different algorithms.
This article assumes that you are familiar with programming and haven’t skipped your high school math classes. The goal of this article is to help you understand some of the noob principles behind data compression. If you are looking for more advanced material, feel free to skip this one.
It’s easier to understand if you have a working idea of Information Theory. Don’t know it? Nothing to worry about … We are going to demystify this lost art, step by step.
Reducing symbol space
Let’s have a look at how we can naturally compress text, without diving into the deep waters.
Let’s say we want to compress the text S = “abracadabra”, which has a length of 11 characters. First, let’s find which characters this text uses: ‘a’, ‘b’, ‘c’, ‘d’ and ‘r’ — five characters in total.
In a normal text file, each character takes up a byte, whose value ranges from 0 to 255. This means that S takes up 11 bytes of space. But think about it: we only have 5 distinct characters, while a byte can represent one of 256. This means we can represent each character of S in less than a byte!
We now know that we can make S smaller. But how small can we make it?
Time for some math!
1-bit can represent 2¹ = 2 characters.
2-bits can represent 2² = 4 characters.
3-bits can represent 2³ = 8 characters.
So, we can see that we need at least 3-bits to represent any character in S. Now S can be compressed to a size of 11 * 3-bits = 33 bits, while the original size of S is 11 * 8-bits = 88 bits. Great! We saved more than 60% of the space with some simple math. Eh… not really. One of the key things to remember while compressing data is to make sure that you can invert it back to the original — at least for lossless compression.
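If you want to play with this, here is a minimal Python sketch of the idea (the helper name is mine, not from any library):

import math

def min_bits_per_symbol(text):
    # Smallest fixed code-word width that can tell apart
    # every unique symbol in the text.
    unique = len(set(text))
    return max(1, math.ceil(math.log2(unique)))

s = "abracadabra"
width = min_bits_per_symbol(s)  # 5 unique symbols -> 3 bits
print(width * len(s))           # 3 * 11 = 33 bits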
Pause: Take some time off and try to figure out what went wrong with our little algorithm.
To get a better idea of what went wrong, let’s do a dry run. Let’s compress the first three characters of S with our method.
We create a map between characters and their compressed representations:
a=000
b=001
c=010
d=011
r=100
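In Python, building such a map is a one-liner. A quick sketch (the 3-bit width is hard-coded for this example, and sorting the symbols just makes the assignment deterministic):

s = "abracadabra"
codes = {c: format(i, "03b") for i, c in enumerate(sorted(set(s)))}
print(codes)  # {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'r': '100'}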
Note: From here on, we will call characters symbols and their compressed representations code words.
Now, we start compressing the first three symbols of S —
a b r = 000 001 100
Now, we try to read our compressed file —
000 001 100
Notice the problem: we forgot to write the symbol map we created into the file, so the reader has no way of knowing which symbol 000 represents.
Now, we need to figure out a way to aptly store the symbol map into the file.
Tip: Information like the symbol map is what we call metadata. While compressing a file, the metadata should always be written first so that, while decompressing, it’s easy to read it back without having to seek through the entire file.
First, we write the number of unique symbols we found in S. Since the number of unique symbols cannot exceed 256 in our case, we only need 1-byte of space to store this value (a byte holds 0–255, so in practice you would store the count minus one).
Compressed data : 5 …
Then, we write the symbols directly to the file.
Compressed data : 5, a b c d r …
Finally, after the symbol map is written, we write the code words to the file. Our compressed data will look like —
Compressed data : 5, a b c d r, a(000) b(001) r(100) a(000) c(010) a(000) d(011) a(000) b(001) r(100) a(000)
Now, let’s calculate the compressed size —
1-byte for symbol map length, 5-bytes for symbol map keys, 33-bits for compressed data — A total of 8 + 40 + 33 = 81 bits.
With the addition of metadata, our size reduction dropped from over 60% to about 8%. And if we consider the fact that we need to pad the last byte (since you cannot write individual bits to a file, only whole bytes), we end up with 81 + 7 = 88 bits of compressed data — a 0% reduction in size.
This is nothing to really worry about, because compression usually doesn’t perform well on small amounts of data. For example, if we tried to compress S + S = “abracadabraabracadabra”, we would get a compressed size of 8 + 40 + 2 * 33 = 114 bits, while the original data has a size of 88 * 2 = 176 bits. Congrats! You finally compressed it!
Pro tip: The compression ratio we get here is calculated as original-size / compressed-size = 176 / 114 ≈ 1.54.
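To replay the arithmetic above, here is a small Python sketch of the size calculation (my own helper; it ignores the final byte padding):

import math

def compressed_size_bits(text):
    # 1 byte for the symbol count, 1 byte per unique symbol,
    # then one fixed-width code word per character.
    symbols = set(text)
    width = max(1, math.ceil(math.log2(len(symbols))))
    return 8 + 8 * len(symbols) + width * len(text)

s = "abracadabra"
print(compressed_size_bits(s))                       # 8 + 40 + 33 = 81
print(compressed_size_bits(s + s))                   # 8 + 40 + 66 = 114
print(len(s + s) * 8 / compressed_size_bits(s + s))  # ~1.54

You can throw other strings at it to see where the ratio dips below one.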
Pause: Now take some time off to think of cases where our little algorithm will not yield any favorable results.
We are trying to make our symbol space compact with this approach. It only helps if our data does not “cover” all characters in the range 0–255.
For example, if S = [0,1,2,…,255] (where each value in the list denotes the value of the byte), every symbol still needs a full 8 bits, so we would just be encoding the data as-is, plus 8 + 256 * 8 = 2,056 bits of symbol-map overhead for nothing.
Aliasing duplicate sequences
This should be obvious. In the previous step, we just aliased symbols to, uhh… well, smaller symbols? Now, we alias a group of symbols instead.
Let’s take a small example:
S = “abracabradabra”
Notice that “abra” is repeated in S, so we can try replacing it with an alias symbol.
Let X = “abra”
Now S becomes: XcXdX
And don’t forget the metadata — X=abra
Pause: Try calculating how many bits this compressed representation might take.
First of all, we need to decide what the encoded representation of our alias X should be. We know that the values in [0, 255] are reserved for the byte values themselves, so we need to extend this range beyond 255 if we want to include aliases. But wait, then each symbol will take more than 8 bits.
That right there, folks, is the deal we make with the “Sith lord of compression”. One of the things about the art of compression is that you always trade something for something else. Like Anakin Skywalker traded his soul to use the dark side of the Force, we trade better compression in some scenarios for worse compression in others. In our previous approach, the algorithm works better when there are fewer kinds of symbols and starts to deteriorate when there are too many, at which point the compression ratio jumps off the cliff (drops below one).
Star Wars references aside, let’s come up with a metadata + code-word schema that includes aliases as well.
Codewords:
byte := '0' + byte
alias := '1' + alias
Metadata:
<number of aliases (8-bit)> then, for each alias: <alias length in bytes (8-bit)> <alias value>
We add a zero-bit in front of each plain byte and a one-bit in front of each alias.
So S becomes:
S = abracabradabra
S' = 1(00000001) 0(c) 1(00000001) 0(d) 1(00000001)
Metadata for S:
1 (number of aliases)
4 abra (alias 1: length = 4, value = abra)
While decoding, we first read the metadata to build a map from aliases to their actual values. Then we read 9-bits at a time: the first bit tells us whether the symbol is a plain old byte or an alias, and the next 8-bits tell us which byte or alias it is.
Let’s decode S’.
S' = 1(00000001) 0(c) 1(00000001) 0(d) 1(00000001)
^-alias->abra ^-byte->c ^-alias->abra ^-byte->d ^-alias->abra
S = abra c abra d abra
Compressed size = 8 + 8 + 32 (metadata) + 5 * 9 (encoding) = 93 bits
Original size = 14 * 8 = 112 bits
Compression ratio = 112 / 93 ≈ 1.2
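Here is a minimal Python sketch of this 9-bit scheme, assuming the layout described above. For readability it builds code words as strings of '0'/'1' characters instead of packing real bits, and it skips serializing the metadata:

def encode(tokens):
    # tokens holds single characters (plain bytes) or ints (alias numbers).
    # Each token becomes 9 bits: '0' + byte value for a plain byte,
    # '1' + alias number for an alias.
    bits = ""
    for t in tokens:
        if isinstance(t, int):
            bits += "1" + format(t, "08b")
        else:
            bits += "0" + format(ord(t), "08b")
    return bits

def decode(bits, aliases):
    # Read 9 bits at a time; the first bit says alias or plain byte.
    out = []
    for i in range(0, len(bits), 9):
        value = int(bits[i + 1:i + 9], 2)
        out.append(aliases[value] if bits[i] == "1" else chr(value))
    return "".join(out)

aliases = {1: "abra"}                  # recovered from the metadata
payload = encode([1, "c", 1, "d", 1])  # X c X d X
print(len(payload))                    # 45 bits; plus 48 bits of metadata = 93
print(decode(payload, aliases))        # abracabradabra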
Pause: Take some time to think about the tradeoff we made here. It’s important for you to nurture a sense of conviction as you progress in making your own compression algorithm.
The biggest tradeoff here is the extra bit we use to distinguish bytes from aliases: every plain byte now takes 9-bits instead of 8, so data without long repeated sequences actually gets bigger.
Conclusion
This article is about developing a fundamental understanding of how one would build a compression algorithm without any prior knowledge of the Shannon limit, entropy coding, probabilistic modelling and so on. I know this might be too basic for some readers (sorry for pissing you off, but in my defense, I did mention “noobs” in the subtitle), but I feel these techniques are important to cover so that you understand the different ways in which we can compress data.
The Sith lord of compression is merciless. The art of compression requires you to know how to compromise on scenarios that are not relevant to your use case in order to achieve better compression for the ones that are. For example, if you are compressing images, you can afford to be bad at compressing audio. In the next article, we will delve a little deeper into formal approaches and see how to harness this art for better results.