From Horses and Zebras to Document Image Adaptation with CycleGAN

The deep learning document image generator

You might have heard of image-to-image translation and seen funny gifs like these:

Zhu et al. (2017)

One of the things we do at omni:us is exploring ways to better train document analysis systems. Specifically, we often deal with only one type (or domain) of document, such as pictures, scans, or PDFs containing handwritten or printed text:

Source: ImageNet

If we only have clean document images (such as PDFs) to train a model, that model will not perform as well on scanned document images. Such a domain mismatch occurs often for document datasets, since we unfortunately have to rely on small, uniform, and private datasets.

Why not train a model that translates images from one domain so that they look like images from another domain? Our document analysis model will then be properly matched to the target domain, and we can even make use of additional datasets from other domains. This is called Domain Adaptation, and there is already a significant body of scientific literature on the topic.

A visualisation of domain adaptation for documents

There are all sorts of adaptation techniques to use, but one that has received increasing attention over the past few years is the Generative Adversarial Network (GAN) by Goodfellow et al. (2014).

If you have never heard of GANs…

The basic idea is that two adversarial models, called the generator and the discriminator, play against each other in a minimax game. The generator creates images from random noise: G: z → x. The discriminator tries to classify images as fake (coming from the generator) or real (coming from the true data distribution): D: x → {real, fake}, by maximising log(D(x)) + log(1 − D(G(z))). The goal of the generator is to fool the discriminator by minimising log(1 − D(G(z))). This way, you end up with a generator that has learned to generate realistic images! More informative reading on GANs can be found in this blogpost.
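To make this concrete, here is a minimal sketch of that training loop in PyTorch. The toy fully connected networks and the dimensions are made up purely for illustration; real GANs use convolutional architectures.

```python
import torch
import torch.nn as nn

# Toy generator G: z -> x and discriminator D: x -> [0, 1]
# (fully connected here just for illustration).
z_dim, x_dim = 64, 784
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_x):
    batch = real_x.size(0)
    z = torch.randn(batch, z_dim)

    # Discriminator: maximise log D(x) + log(1 - D(G(z))).
    fake_x = G(z).detach()
    loss_d = bce(D(real_x), torch.ones(batch, 1)) + bce(D(fake_x), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator, i.e. push D(G(z)) towards 1.
    fake_x = G(z)
    loss_g = bce(D(fake_x), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```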

Now, let’s use the GAN to translate documents!

How would a GAN be able to translate a specific image when its only job is to generate a realistic-looking image? Two approaches I have experimented with, and will explain below, are:

  • Conditional GANs
  • Cycle-consistent GANs

Conditional GANs
Isola et al. (2016) introduced “Pix2Pix” [github], a method that conditions the generator G (and sometimes also the discriminator) on an input image x. This method, however, raises the following problem: the generated translation G(x) in the image below must be compared with a ground-truth translation y, otherwise the content of x might not be preserved. This implies that you need a ground-truth target document for every input document in order to train this model, which is usually not available!

“Pix2Pix” (Isola et al., 2016)
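For illustration, here is a rough sketch of a Pix2Pix-style objective in PyTorch, not the authors’ actual code. G and D stand for any suitable image-to-image generator and discriminator; the discriminator is assumed to look at (input, output) pairs concatenated along the channel dimension and to return raw logits, and the L1 weight follows the value reported in the paper.

```python
import torch
import torch.nn as nn

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
lambda_l1 = 100.0  # L1 weight as reported in the Pix2Pix paper

def pix2pix_losses(G, D, x, y):
    """x: input image, y: ground-truth translation (both N x C x H x W)."""
    fake_y = G(x)

    # Generator: fool D on the (x, G(x)) pair and stay close to y in L1.
    pred_fake = D(torch.cat([x, fake_y], dim=1))
    loss_g = bce(pred_fake, torch.ones_like(pred_fake)) + lambda_l1 * l1(fake_y, y)

    # Discriminator: real pairs (x, y) vs. fake pairs (x, G(x)).
    pred_real = D(torch.cat([x, y], dim=1))
    pred_fake = D(torch.cat([x, fake_y.detach()], dim=1))
    loss_d = 0.5 * (bce(pred_real, torch.ones_like(pred_real))
                    + bce(pred_fake, torch.zeros_like(pred_fake)))
    return loss_g, loss_d
```

Note that the L1 term is exactly where the ground-truth target y comes in, which is why paired data is required.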

Cycle-consistent GANs
To solve the issue of not having a ground-truth document for every input, Zhu et al. (2017) proposed “CycleGAN” [github], which incorporates a reverse mapping F:

“CycleGAN” (Zhu et al. 2017)

Why a reverse mapping?

If we want our generator G to translate a document from domain X to domain Y, we only want to change the look of the document while preserving its content. That means it should be possible to map the image back to domain X with another generator F and obtain a reconstruction of our initial image: x → G(x) → F(G(x)) ≈ x. At the same time, we train in the opposite direction: y → F(y) → G(F(y)) ≈ y. This way, we no longer need ground-truth target documents. Problem solved! The other nice thing is that we now have an adaptation model in both directions.
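In code, the cycle-consistency term could look roughly like this (a sketch in PyTorch; G, F and the weight are placeholders, and the two adversarial losses of the full CycleGAN objective are omitted):

```python
import torch.nn as nn

# Cycle-consistency idea: G maps X -> Y, F maps Y -> X.
# No paired ground truth is needed; instead we ask F(G(x)) ≈ x and G(F(y)) ≈ y.
l1 = nn.L1Loss()
lambda_cyc = 10.0  # a typical weight for the cycle term

def cycle_consistency_loss(G, F, x, y):
    """x: batch from domain X (e.g. clean PDFs), y: batch from domain Y (e.g. scans)."""
    forward_cycle = l1(F(G(x)), x)   # x -> G(x) -> F(G(x)) should reconstruct x
    backward_cycle = l1(G(F(y)), y)  # y -> F(y) -> G(F(y)) should reconstruct y
    return lambda_cyc * (forward_cycle + backward_cycle)
```

In the full objective this term is added to two adversarial losses, one for each generator–discriminator pair.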

Like every great thing in life, the original CycleGAN does have some downsides. Firstly, the forward generator learns to cheat and creates adversarial examples: inputs to a machine learning model that are crafted with the intent of making the model make a mistake. If you are interested in this topic, I recommend the paper by Chu et al. (2017), “CycleGAN, a Master of Steganography”, which explains the problem in detail. Secondly, CycleGAN is deterministic, which means it always produces the same translation of a given image. This could be a problem if we want to generate multiple translations from one document.

Proposed solutions to these problems are “CyCADA” (Hoffman et al., 2017), which incorporates additional loss terms, and “Augmented CycleGAN” (Almahairi et al., 2018), which adds latent-space sampling.
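The latent-sampling idea can be sketched as follows. Here G_aug is a hypothetical generator that takes an image and a noise code, so that the same document can be translated in several different styles; this is only an illustration of the principle, not the Augmented CycleGAN implementation.

```python
import torch

z_dim = 16  # assumed size of the latent style code

def sample_translations(G_aug, x, n=3):
    """Return n different translations of the same input batch x."""
    outputs = []
    for _ in range(n):
        z = torch.randn(x.size(0), z_dim)  # a fresh style code per sample
        outputs.append(G_aug(x, z))        # same content, different look each time
    return outputs
```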


So how does that look?

Below are the results of translating clean, printed-text documents into historical-looking ones. We use Augmented CycleGAN, which combines the cycle-consistency loss with a prior drawn from a latent space.

As you can see in the visualisation below, an input document patch (first column) containing clean printed text is translated into a historic-looking patch (third column), using an actual historic document patch (second column) as a prior. The output contains the content of the first column but the look of the second column.

Training the model gives us the following results over time:

from left to right: input, prior, generated output

When is this useful?

Translating full documents this way enables us, first of all, to enlarge our training set. An example task you could easily apply this to is Document Image Classification, where the goal is to predict the class of a page (e.g. form, letter, resume). It is also useful when our test data looks very different from our training data: we could translate our training data so that it looks like the data in our test set and thereby reach a higher performance at test time!
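As a hypothetical usage sketch, a trained generator G (assumed here to map clean pages to the scanned-looking target domain) could be used to enlarge a classification training set like this:

```python
import torch

@torch.no_grad()
def augment_dataset(G, images, labels):
    """Add translated copies of the training pages; labels carry over unchanged."""
    translated = G(images)                        # same content, target-domain look
    aug_images = torch.cat([images, translated])  # original + translated pages
    aug_labels = torch.cat([labels, labels])      # the page class does not change
    return aug_images, aug_labels
```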

On the other hand, the above output might not yet be useful for Optical Character Recognition, as the generated characters still look unclear and are therefore unreadable.


Still, this shows that we have already come a long way in manipulating documents in useful ways.

One fun and potentially useful tool to have would be a generator that is able to convert printed text to handwritten text.

Take a look at the output (second image) of our trained Pix2Pix model:

from left to right: printed text — fake handwritten text — real handwritten text

The fake handwritten text already looks quite like handwriting. However, as you can see, there is still some room for improvement in order to make this useful. So there are still some challenges ahead of us!


I presented this topic as a lightning talk at a meetup we hosted here at omni:us. This meetup was organised by Berlin PyLadies: “Building and Implementing a Deep Learning Speech Classifier in Python”.