Git advanced (text) diff: .odt, .pdf, .doc, .xls, .ppt

Git is a fantastic choice for tracking code changes and collaborating with your team. But no tool is perfect, and Git is no exception. There are issues to fix and restrictions to work around.

One common problem when working with Git is binary files: the standard diff command cannot show a readable comparison for them.

Fortunately, there are several ways to deal with this issue in different environments.

Want to know more? Read on.

MS Office

If, as a developer, you are under company constraints and must use MS Office, you’ll encounter some issues when trying to diff MS Office files.

Maybe you’re asking yourself: what’s the problem with that?

Here it is: MS Office produces binary files, which Git cannot compare as text.

But there are great tools you can download that convert your files to text, so you get the clear diffs you need:

  • catdoc (for Word)
  • xls2csv (for Excel)
  • catppt (for PowerPoint)

Make sure the tool you pick works on your operating system (for example, some of them don’t work with Git Bash).
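
A quick way to check this is to ask the shell you actually run Git from whether each converter is reachable. The following is a minimal sketch, assuming a POSIX-style shell (Git Bash included); it only checks that the commands are on your PATH, not that they handle your files correctly.

```shell
# Report whether each converter from the list above is reachable
# from the current shell. "found" only means the command is on PATH.
report=""
for tool in catdoc xls2csv catppt; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="found"
  else
    status="missing"
  fi
  report="${report}${tool}: ${status}
"
done
printf '%s' "$report"
```

Any tool reported as missing needs to be installed, or added to PATH, before Git can call it.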


Now that you’ve selected and downloaded the tool, let’s get into how we can configure Git to best use it.

  1. Add the following lines to your $HOME/.config/git/attributes file. On Windows, $HOME is your user’s home directory (e.g. C:\Users\<your-user>).
*.doc diff=doc
*.xls diff=xls
*.ppt diff=ppt

If you want project-level rather than global configuration, put the same lines in one of these files instead:

  • in <your project dir>/.gitattributes
  • in <your project dir>/.git/info/attributes if you don’t want it to be committed with your project
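
To confirm that Git picks up the mapping, you can ask it which diff driver applies to a given path. Here is a minimal sketch using a throwaway repository; the file name report.doc is a made-up example and does not need to exist.

```shell
# Work in a temporary repository so no real project is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Project-level mapping, same syntax as the global attributes file.
printf '*.doc diff=doc\n*.xls diff=xls\n*.ppt diff=ppt\n' > .gitattributes

# Ask Git which diff driver it would use for this path.
# check-attr matches on the path only, so the file need not exist.
attr_report=$(git check-attr diff -- report.doc)
echo "$attr_report"
```

If the attribute is applied, the output names the doc driver for that path; a value of "unspecified" means no pattern matched.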

  2. Now, add the following to your global configuration file at $HOME/.gitconfig (or $HOME/.config/git/config):

[diff "doc"]
textconv = catdoc
binary = true
[diff "xls"]
textconv = xls2csv
binary = true
[diff "ppt"]
textconv = catppt
binary = true

You can also set the converters from the command line, without opening the files:

git config --global diff.doc.textconv catdoc
git config --global diff.xls.textconv xls2csv
git config --global diff.ppt.textconv catppt

Note that to make the change for a single project only, you should either edit the <project dir>/.git/config file, or drop the --global flag from the commands above.
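
As an illustration of the project-level variant, here is a minimal sketch that sets the doc driver in a throwaway repository and reads the value back. The temporary repository is an assumption for the example; in practice you would run the git config commands inside your own project.

```shell
# Work in a temporary repository so the global config stays untouched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Without --global, the settings are written to this repository's .git/config.
git config diff.doc.textconv catdoc
git config diff.doc.binary true

# Read the values back from the local config only.
doc_textconv=$(git config --local --get diff.doc.textconv)
doc_binary=$(git config --local --get diff.doc.binary)
echo "diff.doc.textconv = $doc_textconv"
echo "diff.doc.binary = $doc_binary"
```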

OpenOffice

If you are an OpenOffice user, you’d probably like to do the same. The process is pretty much the same as for MS Office. Here is a summary of what to do:

In your attributes file:

*.odt diff=odt

In your config file:

[diff "odt"]
textconv = odt2txt
binary = true

.odt files are compressed (zip) archives whose contents are XML.

In the French edition of the Git Book, the author writes his own Perl scripts, which didn’t work for me.
I recommend using odt2txt instead. You can find packages for Linux and macOS (brew install odt2txt).

PDF

There is a nice tool written in Python that extracts text from PDFs: PDFMiner, which you can download here: https://github.com/euske/pdfminer/

Configuration is as simple as in the previous cases. Just make the following changes in your attributes file:

*.pdf diff=pdf

In your config file:

[diff "pdf"]
textconv = pdf2txt.py
binary = true

Boost performance: a friendly tip

Converting binary files into text can take a while, so I’d recommend enabling the textconv cache to improve performance.

To do so, extend the diff driver definitions in your config file:

[diff "DIFF_DRIVER_NAME"]
textconv = …
cachetextconv = true

And if you need to expire a cache manually:

git update-ref -d refs/notes/textconv/DIFF_DRIVER_NAME
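
To try the cache without installing any converter, here is a minimal sketch in a throwaway repository. The driver name demo, the *.bin pattern, the sample file, and the use of od -c as a stand-in converter are all assumptions made for illustration; substitute your real driver and tool.

```shell
# Work in a temporary repository with its own identity so commits succeed.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hypothetical driver "demo": od -c stands in for a real converter.
printf '*.bin diff=demo\n' > .gitattributes
git config diff.demo.textconv "od -c"
git config diff.demo.cachetextconv true

# Two committed versions of a small binary file.
printf 'a\0b\n' > sample.bin
git add .
git commit -q -m "first version"
printf 'a\0c\n' > sample.bin
git commit -q -am "second version"

# Diffing committed versions runs the converter on each one.
diff_output=$(git diff HEAD~1 HEAD -- sample.bin)
echo "$diff_output"

# List any cache refs Git has created for textconv drivers.
git for-each-ref refs/notes/textconv/
```

The diff should show the converted text rather than a "Binary files differ" message. If the cache was written, the last command lists a ref under refs/notes/textconv/ named after the driver, which is the ref the update-ref command above deletes.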

Last word

In this article, we’ve seen an easy way to get readable diffs of binary files by converting them to text.

There are lots of tools that can give better results depending on the kind of binary you’re trying to diff. For instance, take a look at GitHub’s image diff.