Given two files with the same content, if you compress each of them with gzip, will the compressed files still have the same content and the same checksum?
In computer science, a deterministic algorithm is an algorithm which, given a particular input, will always produce the same output, with the underlying machine always passing through the same sequence of states.
For many good reasons, from telling whether file X is the same as file Y to detecting file corruption or a MITM attack during a download, we rely on a fixed-length string called a checksum.
Hashes are omnipresent; in fact, they help us:
- detect data corruption during a data transfer
- make datacenters more efficient by deduplicating data
- secure our communications
and much more!
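As a quick illustration, verifying a download boils down to comparing digests (the file name below is just a placeholder):
$ sha256sum some-download.iso
# compare the printed digest with the one published upstream;
# any corruption in transit produces a completely different digest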
I recently got the following task: check whether two gzipped files are the same.
Is a1.gz the same as a2.gz?
The actual use case and scenario are much more complicated, but to shed some light on it without going too deep into the details: we have an executable which produces its output in either compressed or uncompressed format, depending on whether a given flag is specified.
$ executable -o file1.gz -compress
The two files are just the result of two separate executions of the executable using the -compress flag.
The real question is whether the (uncompressed) data produced by the executable is the same or not, so one preliminary question is to understand what effect gzip has on the output and whether it’s safe to compare the compressed versions. If we were to, we would potentially save space and gain speed in the hashing process, but we would also add more variables and complexity to our experiment: e.g. gzip could lose data, or it could simply produce two (slightly) different files.
For the sake of curiosity, let’s just pretend we don’t have a choice here and that the executable only spits out data in a compressed format, and let’s try to live with that.
For the sake of simplicity and because gzip has been around for 24 years, we can probably rule out the possibility of a bug in it that makes us lose data during the compression process.
Question: Given two (or more) identical files, will gzip produce the same output?
As we are not just programmers but scientists, let’s find that out!
#!/bin/bash
for i in $(seq 1 3); do
  echo "Hello world" > test$i.txt
done
The above produces three files (test1.txt, test2.txt, test3.txt) with the same content. Let’s test that.
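A quick check with md5sum should print the same digest three times (actual digests omitted here):
$ md5sum test1.txt test2.txt test3.txt
# identical content, identical digest: the file name plays no role in the hash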
Checksums confirm that. Let’s compress them and check the checksums!
#!/bin/bash
for i in $(seq 1 3); do
  gzip test$i.txt
done
md5sum test*.txt.gz
and the result is…
Three different signatures, which suggests gzip is not so deterministic after all. But is this the real verdict? We ran the test three times and got three different signatures, so the experiment succeeded. However, does this process show any faults? Have we made a bad premise?
The original question was: “Given two identical files, will gzip produce the same output?”.
Are we asking the right question?
Shouldn’t determinism be about the same input?
(Technically, we’re not passing the files’ content, but the whole files, and “same content” != “same file”. Three different files are three different inputs: for one thing, they have different names, different inodes and different metadata associated with them.)
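To make this concrete, we could compare the files’ metadata with stat (GNU coreutils; this has to run before gzip replaces the originals):
$ stat -c '%i %Y %n' test1.txt test2.txt test3.txt
# inode, modification time (epoch seconds) and name:
# three rows, three distinct inputs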
Let’s run an exploratory test: create a file, compress it, hash it, decompress it, then compress it and hash it again, three times over.
We have observed (what looks like) a lack of determinism three times in a row; if gzip truly lacks determinism, we will get three different signatures even for the exact same file.
#!/bin/bash
echo "Hello world." > test.txt
for i in $(seq 1 3); do
  gzip test.txt
  md5sum test.txt.gz
  gzip -d test.txt.gz
done
The result is clear-cut: the same signature, three times in a row.
After rectifying our experiment, we can tell that GZIP is deterministic.
So what is our first experiment saying?
We can hypothesize that gzip is picking up differences in the files’ metadata. Indeed, the gzip header (RFC 1952) embeds the original file name and the modification timestamp, and our three files had different names to begin with.
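We can check the hypothesis by peeking at a header (assuming hexdump is available; the layout below comes from RFC 1952):
$ hexdump -C test1.txt.gz | head -n 2
# bytes 4-7 of the fixed 10-byte header hold the modification time,
# followed by the original file name ("test1.txt") as a NUL-terminated
# string; the names certainly differ across our three archives, and the
# timestamps may too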
What can we do here?
Does this mean we cannot reliably tell whether two gzip files have the same content?
Not all is lost; we have a few options, each sketched right after this list:
- We can use gzip’s -n flag, which omits the timestamp and the file name from the file header;
- We can extract the CRC checksum from the file, although some may argue that a CRC is not secure enough for a given use case (e.g. when a file may have been maliciously altered);
- If we feel adventurous and we know that the two files have been produced in the same way (e.g. same gzip version and compression level), we can just skip the first N* bytes (the file header) and checksum the remaining body.
* N may vary, as the file header contains the original file name.
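Here is a rough sketch of all three options; treat it as illustrative rather than definitive (it assumes GNU gzip and md5sum, archives created with default flags, and the NAME variable exists only for this example):
# Option 1: compress without the name/timestamp in the header
$ gzip -nc test1.txt | md5sum
$ gzip -nc test2.txt | md5sum
# identical content now yields identical digests

# Option 2: read the CRC-32 of the uncompressed data from the archive
$ gzip -lv test1.txt.gz
# the “crc” column can be compared across archives regardless of the header

# Option 3: skip the first N bytes and hash the rest
$ NAME=test1.txt
$ tail -c +$((12 + ${#NAME})) "$NAME.gz" | md5sum
# skips the 10-byte fixed header plus the NUL-terminated file name;
# assumes no optional fields (FEXTRA, FCOMMENT, FHCRC) in the header
To sum up: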
- GZIP is deterministic.
- Thoughts are very powerful.
Thinking and making assumptions is the start of the scientific method; as a result, “wrong assumptions” = “wrong conclusions”.
Think slowly and carefully. Test everything at every step, even your assumptions. Multiple times.