In one of our projects we have to compare PDFs and the current solution for testing generated PDFs works very well. Unfortunately, it takes too long for the tests to run because the PDFs are first converted into images using ImageMagick and then the images are compared to check for differences.
As part of continuous improvement, we wanted to see how we could speed up the tests without losing the benefits of our current solution. A couple of initial solutions we considered were:
- Generate hashes (e.g. SHA1) of the two PDF files and compare them.
- Compare the byte streams of the two PDF files (using Ruby).
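The two ideas above can be sketched with Ruby's standard library. The helper names `same_digest?` and `same_bytes?` are made up for illustration, and `FileUtils.compare_file` is only one possible way to compare byte streams; it is an assumption, not necessarily the call we used.

```ruby
require 'digest/sha1'
require 'fileutils'

# Hypothetical helper: true when both files have the same SHA1 digest.
def same_digest?(path_a, path_b)
  Digest::SHA1.file(path_a).hexdigest == Digest::SHA1.file(path_b).hexdigest
end

# Hypothetical helper: true when both files contain exactly the same bytes.
def same_bytes?(path_a, path_b)
  FileUtils.compare_file(path_a, path_b)
end
```

Both checks are cheap, but they only answer "are these files byte-for-byte the same?", which is a stricter question than "do these PDFs look the same?".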
However, neither solution worked. Even though two PDF files are “visually identical”, their byte streams are not. Some reasons for differences in PDF byte streams are:
- Each PDF contains a unique file identifier (/ID) in the trailer section of the file. According to the specification (ISO 32000–1) it is recommended that a PDF writer add this information: The ID entry is optional but should be used.
- The document information dictionary (e.g. ModDate) would be different.
- Different compression algorithms may be used to reduce the size of the PDF.
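The compression point is easy to reproduce without any PDF at all. The sketch below compresses the same made-up content stream with two zlib settings (PDF streams commonly use Flate, which is zlib-based) and shows that the compressed bytes differ while the decompressed content is identical. The sample content and variable names are illustrative only.

```ruby
require 'zlib'

# A stand-in for a page content stream; the text itself is made up.
content = "BT /F1 12 Tf 72 720 Td (Hello, PDF) Tj ET\n" * 50

# Same input, two different compression levels.
fast  = Zlib::Deflate.deflate(content, Zlib::BEST_SPEED)
small = Zlib::Deflate.deflate(content, Zlib::BEST_COMPRESSION)

# The raw bytes differ, but both inflate back to the same content.
bytes_equal   = fast == small
content_equal = Zlib::Inflate.inflate(fast) == Zlib::Inflate.inflate(small)

puts "compressed bytes equal:     #{bytes_equal}"
puts "decompressed content equal: #{content_equal}"
```

A hash or byte comparison sees only the compressed form, so two writers that make different compression choices produce "different" files for the same page.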
As it became clear that comparing the byte streams of two PDFs would not give an accurate comparison, we investigated the possibility of comparing the “contents” of the PDFs. After some experimentation we settled on using HexaPDF and PDF::Reader to compare the “contents” of two PDFs page by page. With this solution there was a significant improvement in the overall execution time of our tests. If a test fails, we fall back to generating images and comparing them using ImageMagick. The solution for comparing the contents of two PDFs was packaged into a Ruby gem called Identikal.
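To give a rough idea of the approach, here is a minimal sketch of a page-by-page comparison with a slow fallback. It is not Identikal's actual implementation: it compares extracted page text only (via PDF::Reader), and the names `page_texts`, `differing_pages`, `pdfs_match?` and the `image_compare` callable are all hypothetical.

```ruby
# Extracts the text of every page. PDF::Reader is required lazily so the
# comparison logic below can be exercised without the gem installed.
def page_texts(path)
  require 'pdf/reader'
  PDF::Reader.new(path).pages.map(&:text)
end

# Compares two lists of per-page content and returns the 1-based numbers
# of the pages that differ (a missing page counts as a difference).
def differing_pages(pages_a, pages_b)
  count = [pages_a.length, pages_b.length].max
  (0...count).reject { |i| pages_a[i] == pages_b[i] }.map { |i| i + 1 }
end

# Fast content check first; the slow image comparison (if one is supplied)
# runs only when the fast check reports a difference.
def pdfs_match?(path_a, path_b, extract: method(:page_texts), image_compare: nil)
  return true if differing_pages(extract.call(path_a), extract.call(path_b)).empty?

  image_compare ? image_compare.call(path_a, path_b) : false
end
```

Text alone does not capture images, fonts or layout, which is one reason to keep the image comparison around as a fallback rather than drop it.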
Identikal is a tiny Ruby gem that compares two given PDF files and returns true when their “contents” are identical and false otherwise.
$ identikal file_a.pdf file_b.pdf
true
$ identikal file_a.pdf file_c.pdf -t
require 'identikal'

# Resolve the sibling "pdfs" directory relative to this file.
base_path = File.expand_path('../pdfs', __dir__)
pdf_a = File.join(base_path, 'report_a.pdf')
pdf_b = File.join(base_path, 'report_b.pdf')
if Identikal.files_same?(pdf_a, pdf_b)
  # some action when files are identical
else
  # another action when files are different
end
Now that we have saved a few minutes with faster tests it’s time to get some coffee.