Is GZIP deterministic?

Given two files with the same content, you compress them with gzip, will they still have the same content and the same checksum?

In computer science, a deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states.

Checksums

For many good reasons, from understanding whether file X is the same as file Y or not to detecting any file corruption or MITM attack during a download, we rely on a fixed-length string called checksum.

The reason why checksums and hashing functions (the algorithms that produce checksums) work is because they are deterministic: if they weren’t so, our life would be much harder!

Hashes are omnipresent, in fact they help us:

  1. detecting data corruption during a data transfer
  2. making datacenters more efficient by deduplicating data
  3. securing our communications

and much more!

The Task

I’ve recently got the following task: check whether two gzipped files are the same.

Is a1.gz the same as a2.gz?

The actual use case and scenario are much more complicated but to shed some more light on it without going too much into the details, we have an executable which produces some output either in compressed or uncompressed format if a given flag is specified or not.

$ executable -o file1.gz -compress

The two files are just the result of two separate executions of the executable using the -compress flag.

The real question is whether the (uncompressed) data produced by the executable is the same or not, so one of the preliminary questions is understand what effect gzip has on the output and whether it’s safe to compare the compressed version or not: in fact, if we were to, while we would potentially save space and gain speed in the hashing process, we would add more variables and complexity to our experiment: e.g. gzip could lose data, or could simply produce two (slightly) different files.

For the sake of curiosity, let’s just pretend we don’t have a choice here and that the executable only spits out data in a compressed format, and let’s try to live with that.

Data Loss

For the sake of simplicity and because gzip has been around for 24 years, we can probably rule out the possibility of a bug in it that makes us lose data during the compression process.

Determinism

Does gzip play with dices?

Question: Given two (or more) identical files, will gzip produce the same output?

As we are not just programmers but scientists, let’s find that out!

#!/bin/bash
for i in $(seq 1 3);
do
echo "Hello world" > test$i.txt;
md5sum test$i.txt;
done

The above produces three files (test1.txt, test2.txt, test3.txt) with the same content. Let’s test that.

f0ef7081e1539ac00ef5b761b4fb01b3  test1.txt
f0ef7081e1539ac00ef5b761b4fb01b3 test2.txt
f0ef7081e1539ac00ef5b761b4fb01b3 test3.txt

Checksums confirm that. Let’s compress them and check the checksums!

#!/bin/bash
for i in $(seq 1 3);
do
gzip test$i.txt;
md5sum test$i.txt.gz;
done

and the result is…

c052e0c3ce29e8b54bc84be0e5c8f3b1  test1.txt.gz
3ec536a63dd755fd7354712642ba7674 test2.txt.gz
903475971589df8a8f5ff3d9d4c437d4 test3.txt.gz

Three different signatures. Which suggests gzip is not so deterministic afterall, but is this the real verdict? We have tested it three times and we got three different signatures, so the experiment succeeded. However, does this process show any faults? Have we made a bad premise?

The original question was: “Given two identical files, will gzip produce the same output?”.

Are we asking the right question? 
Shouldn’t determinism be about the same input?
(Technically, we’re not passing the file’s content, but the whole file and “same content” != “same file”. Three different files are three different inputs as, for once, they have different inode and different metadata associated to it.)

Let’s make an explorative test: let’s create a file, compress it, hash it, decompress it, compress it and hash it again, three times.
We have observed (what looks like) a lack of determinism for three times in a row, if gzip effectively lacks of determinism, we will have three different signatures even for the same exact file.

#!/bin/bash
echo "Hello world." > test.txt
for i in $(seq 1 3);
do
gzip test.txt
md5sum test.txt.gz
gzip -d test.txt.gz
done

The result is dichotomic.

f85bcb043046d97a2ca018242afd3ad3  test.txt.gz
f85bcb043046d97a2ca018242afd3ad3 test.txt.gz
f85bcb043046d97a2ca018242afd3ad3 test.txt.gz

After rectifying our experiment we can tell GZIP is deterministic.

So what is our first experiment saying?
We can hypotise that gzip is picking up the differences in inode number and the whole set of metadata associated to it: ownership, permissions, timestamp, file size, to name a few…

By reading RFC 1952 we can confirm our new hypotesis: GZIP does include some of the original file metadata in the new compressed file.

What can we do here? 
Does this mean we cannot reliably tell whether two gzip files have the same content?

Not all is lost:

  1. We can use the -n flag in gzip which will make gzip omit the timestamp and the file name from the file header;
  2. We can extract the CRC checksum from the file. 
    Although some may argue that CRC is not secure enough for a given use case (e.g. when a file may have been maliciously altered);
  3. If you feel adventurous and you know that the two files have been produced in the same way (e.g. same compression level) we can just skip the first N* bytes (file header) and do a checksum on the remaining body.
    * N may vary, as the file header contains the original file name.

Conclusion

  1. GZIP is deterministic.
  2. Thoughts are very powerful. 
    Thinking and making assumptions are the start of the scientific method, as a result “wrong assumptions” = “wrong conclusion”. 
    Think slowly and careful. Test everything at every step, even your assumptions. Multiple times.