About a strange data compression method

Yet another user tries to write a new piece of data to his hard drive, but there is no space left for it. He also refuses to delete anything, since 'he will need it later'. What should we do about it?

Such a problem is hardly unique. We have terabytes of data lying around on our hard drives, and that pile is not going to shrink any time soon. But how unique is that data? After all, every file is just a bit sequence, and a new one is probably not all that different from the ones already stored.

Of course, you should not literally search the hard drive for existing pieces of new data: even if it would not fail outright, it is impractical at best. On the other hand, if the difference is not that great, maybe we can somehow fit one file into another…

[Image from rematelier.ru]

TL;DR: this post is about a very strange data optimization technique.

About bits and a difference

If we look at two totally random pieces of data, on average about 50% of their bits will match in the same positions. Indeed, of all the possible value combinations for each pair of bits ('0–0', '0–1', '1–0', '1–1'), exactly half have equal bits, simple as that.
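To convince yourself, here is a minimal C sketch (not part of the original tool) that fills two buffers with pseudo-random bytes and counts the matching bit positions; `__builtin_popcount` assumes GCC or Clang:

```c
#include <stdio.h>
#include <stdlib.h>

/* Fill two buffers with pseudo-random bytes and count the bit
 * positions where they agree; the share converges to ~50%. */
int main(void) {
    enum { N = 1 << 20 }; /* 1 MiB per buffer */
    unsigned char *a = malloc(N), *b = malloc(N);
    long equal = 0;

    srand(42);
    for (int i = 0; i < N; i++) {
        a[i] = rand() & 0xFF;
        b[i] = rand() & 0xFF;
        /* XOR marks differing bits with 1s; popcount counts them */
        equal += 8 - __builtin_popcount(a[i] ^ b[i]);
    }
    printf("equal bits: %.2f%%\n", 100.0 * equal / (8.0 * N));

    free(a);
    free(b);
    return 0;
}
```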

But of course, if we just take two files and fit one onto the other, we will lose one of them. Saving the changes instead would mean reinventing delta encoding, which has happily existed for a long time without us, although it is not normally used for tasks like this. We could try to fit a smaller bit sequence into a larger one, but even then we risk destroying some critical segments of data.

So between what and what can we shift the difference? I mean, every new file is just a bit sequence, and we can do nothing about that (think of it as already-compressed binary data). So we need to find bits on our hard drive that can be changed without having to store anything about the change, and whose loss we can survive without serious consequences. And there is no need to change the file as it is stored on the filesystem, only some less sensitive information inside it. But which information, and how?

Lossy-compressed files come to the rescue. All those JPEGs, MP3s, and the rest, although they have already lost part of their data, still contain many bits we can safely alter. There are also advanced techniques that can modify parts of them imperceptibly at different encoding steps. Wait. Advanced techniques… imperceptible modifications… one piece of data into another… are we talking about steganography?

Indeed, embedding one piece of information by altering another looks like nothing so much as steganography. The invisibility of the changes to human senses is also a nice property for our task. But we have no need for secrecy: we only want to store more on the same media, and hiding the data would only hurt the user (he would probably just lose it).

That's why, although such methods are usable, we need to adapt them first. How? I will answer that below, using one of the existing methods and a well-known file format.

Compress it more and I’ll kill you

If we are going to compress something, let it be the most compressed thing in the world. Of course, I am talking about JPEG files. Not only is there a metric ton of tools and existing embedding methods for it, it is also the most popular graphics format on the planet.

[Image: JPEG!!1]

Still, we need to limit our field of action within files of this format. No one likes the blocky artifacts of excessive compression, so we must avoid re-encoding the data and alter only the losslessly stored part. To be more specific: the integer coefficients left over after the lossy operations, DCT and quantization, as the full encoding scheme below shows:

[Image: the full JPEG encoding scheme]
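Since the tool relies on libjpeg, those coefficients can be accessed directly, with no inverse DCT and no re-quantization involved. Below is a rough sketch of the idea using libjpeg's `jpeg_read_coefficients` API; it merely counts the non-zero AC coefficients of the luminance component (the positions an F5-like method could use) and is not taken from the actual f5ar source:

```c
#include <stdio.h>
#include <jpeglib.h>

/* Walk the quantized DCT coefficients of a JPEG without decoding the
 * image itself: no inverse DCT, no re-quantization, no quality loss. */
long count_usable_coefficients(const char *path) {
    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;
    FILE *in = fopen(path, "rb");
    long usable = 0;

    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, in);
    jpeg_read_header(&cinfo, TRUE);

    /* Coefficient arrays for all components, entropy-decoded only */
    jvirt_barray_ptr *coefs = jpeg_read_coefficients(&cinfo);

    /* Component 0 is luminance (Y) in an ordinary YCbCr JPEG */
    jpeg_component_info *comp = &cinfo.comp_info[0];
    for (JDIMENSION row = 0; row < comp->height_in_blocks; row++) {
        JBLOCKARRAY buf = (*cinfo.mem->access_virt_barray)
            ((j_common_ptr)&cinfo, coefs[0], row, 1, FALSE);
        for (JDIMENSION col = 0; col < comp->width_in_blocks; col++) {
            JCOEFPTR block = buf[0][col]; /* 8x8 = 64 coefficients */
            for (int k = 1; k < DCTSIZE2; k++) /* skip DC at index 0 */
                if (block[k] != 0)
                    usable++;
        }
    }

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(in);
    return usable;
}
```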

There are a lot of existing JPEG optimization techniques, for example lossless ones (jpegtran) and "[kinda lossless](https://tinyjpg.com)"™©® ones (yeah, it is not), but we need not care about any of them. After all, if the user is ready to embed one piece of information into another to free up space on his hard drive, he has either already optimized his images beforehand or is not willing to lose any quality at all.

F5

Those criteria leave us with a whole family of algorithms. The most advanced among them is F5 by Andreas Westfeld, which processes the coefficients of the luminance component (the human eye is the least sensitive to its changes). Moreover, F5 uses an advanced embedding scheme based on matrix encoding, which makes it possible to embed the same amount of information with fewer changes, the larger the container.
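To illustrate the matrix encoding trick, here is a toy C sketch (my own simplification, not F5's actual code) with k = 3: seven container bits carry three message bits, and embedding changes at most one of them:

```c
#include <stdio.h>

/* Matrix encoding as used by F5: embed k message bits into
 * n = 2^k - 1 container bits while changing at most ONE of them.
 * Positions are 1-based, so XOR-ing the positions that hold a 1-bit
 * yields a syndrome directly comparable to the k-bit message. */
enum { K = 3, N = (1 << K) - 1 };

unsigned syndrome(const unsigned char bits[N]) {
    unsigned s = 0;
    for (unsigned i = 0; i < N; i++)
        if (bits[i])
            s ^= i + 1;
    return s;
}

/* Embed a k-bit message; flips at most one container bit. */
void embed(unsigned char bits[N], unsigned msg) {
    unsigned d = syndrome(bits) ^ msg;
    if (d != 0)
        bits[d - 1] ^= 1; /* in real F5 this is a coefficient change */
}

int main(void) {
    unsigned char bits[N] = {1, 0, 1, 1, 0, 0, 1};
    unsigned msg = 5; /* 0b101, the three bits to hide */
    embed(bits, msg);
    printf("extracted: %u (expected %u)\n", syndrome(bits), msg);
    return 0;
}
```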

The processing itself boils down to decreasing the absolute values of coefficients under certain conditions (not even 50% of them are ever changed), and that is what allows us to use F5 to optimize the data stored on a drive. The point is that after such a change, the coefficient will likely occupy fewer bits in the file, thanks to the statistical distribution of values in JPEG's Huffman coding and the RLE encoding of zeroes.
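The core per-coefficient operation can be sketched like this (a simplification of mine: the real F5 also reads the bit of negative coefficients inverted, and a coefficient shrunk to zero forces the bit to be re-embedded in the next one):

```c
#include <stdlib.h>

/* The single change F5 makes to a coefficient whose embedded bit
 * disagrees with the message bit: shrink its absolute value by one.
 * Smaller values are more frequent in JPEG's statistics, so they get
 * shorter Huffman codes, and a coefficient shrunk to zero disappears
 * into the run-length encoding of zeroes. */
short f5_adjust(short coef, int msg_bit) {
    if ((abs(coef) & 1) == msg_bit)
        return coef;                       /* already carries the bit */
    return coef > 0 ? coef - 1 : coef + 1; /* one step toward zero */
}
```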

The only things we need to change are the secrecy part (the password-based permutation), to save computing resources and time, and to add the ability to spread one piece of data across multiple files. These modifications are probably not that interesting to a reader, so I will skip their description (if you are curious, read the source code).

High tech

To prove the concept, I implemented everything in pure C and optimized the result for both execution speed and memory consumption (you cannot even imagine how big decompressed JPEGs are). It is made cross-platform using libjpeg, pcre, and tinydir. Thanks, guys. Everything is built with a simple `make`, so Windows users should use something like MSYS, Cygwin, or the Windows Subsystem for Linux.

The implementation is available both as a command-line tool and as a library. More on how to use the latter can be found in the GitHub repository linked at the end of the post.

Use it carefully. The images used for compression are chosen via a regular expression in the specified root directory. Once the process is done, you can move, rename, and copy any files within it, switch file systems and operating systems, etc. However, you must not change the actual content. Changing even a single bit can lead to the loss of all the packed data.

The utility produces a special archive file carrying all the data required to unpack the compressed data, including information about the images used. It occupies a few kilobytes at most and therefore has no real impact on your free space. But you should not modify, move, or touch it at all, since its path determines the root search folder.

You can analyze available capacity with the `-a` flag:

./f5ar -a [root directory] [Perl-compatible regex]

Pack your data with:

./f5ar -p [root directory] [Perl-compatible regex] [file to compress] [archive file name]

And unpack it with:

./f5ar -u [archive file] [unpacked file name]
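For example, a hypothetical session (the directory, regex, and file names below are made up for illustration) might look like this:

./f5ar -a ~/photos '.*\.jpe?g'
./f5ar -p ~/photos '.*\.jpe?g' book.pdf book.f5ar
./f5ar -u book.f5ar book-unpacked.pdf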

To show the effectiveness of the method, I downloaded 255 totally free dog photos from Unsplash and dug up a big (45 MB) PDF of the second volume of Knuth's The Art of Computer Programming.

The demonstration itself is pretty simple:

[Image: transparent terminal emulator is transparent]

The hashes are equal, and you can (and should) still read the book:

[Image: no, really, go ahead and start reading right now]

As you can see, we went from the initial 633 + 36 == 669 megabytes of data on my hard drive to a more pleasant 551. Such a difference is explained by the decreased coefficients affecting the subsequent lossless compression: decreasing a single value can shave a couple of bytes off the resulting file. It is still a loss of data, however, even if a tiny one, and we have to put up with it.

Luckily, the human eye has no way of noticing the changes. You can try it out yourself or check the computed luminance difference at these links: original, with data inside, difference (the dimmer the color, the smaller the difference within the block).

Instead of conclusion

Looking at all these difficulties, you are probably thinking that buying a bigger hard drive or using cloud storage would be a much simpler solution. If so, you are 100% correct. But there is no guarantee that you will be able to buy one, or that you will even have Internet access at all; not every country has it today. On the other hand, you can always count on the storage you already have at home.

Written by

Russian information security specialist (literally stated in my ISCED-7 diploma). Looking for a job. https://labunsky.info
