Aged Document Binarization Using the U-Net Architecture

Edward Roe
12 min read · Aug 30, 2021

1. Introduction

Thresholding can be seen as a classification problem where, usually, there are two classes, which is why thresholding is also called binarization. For document images, we expect a thresholding algorithm to correctly classify the ink as black and the paper as white, resulting in a binarized image. The easiest way to achieve this with a digital grayscale image is to choose a threshold value, say th, assign white to the gray levels above this value, and black to the remaining levels. The problem is to find the value, if one exists, that correctly separates foreground from background (see Figure 1.1). For document images we know what the expected result should be, yet several issues, above all ageing degradation, make this domain challenging. Ageing artifacts include foxing (the brownish spots that appear on the paper surface), back-to-front ink interference, crumpled paper, adhesive tape marks, folding marks, etc. Figure 1.2 shows some of these problems.

Figure 1.1: From this animation, using the Threshold tool from Photoshop, it is possible to see that there isn’t a threshold value ideal for processing the image shown. — animation by the author.
Figure 1.2: Examples of problems caused by the ageing process, such as (top right and bottom right) foxing and (bottom left and middle right) back-to-front interference, as well as (top left) marks caused by human manipulation, such as adhesive tape, and (bottom right) crumpled paper [1]. — image by the author.

Several binarization techniques, like Otsu [2], Sauvola [3] and Niblack [4], are popular, but they do not work very well with documents like the ones shown in Figure 1.2, as Figure 1.3 illustrates.

Figure 1.3: Example of results obtained with well-known binarization algorithms. — image by the author.
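As a point of comparison, these classical methods are available in scikit-image, which the project already lists as a dependency. The snippet below is a minimal sketch of how results like those in Figure 1.3 could be produced; the file name and window sizes are illustrative, not the article's actual settings.

```python
import cv2
from skimage.filters import threshold_otsu, threshold_sauvola, threshold_niblack

# Load the document as a grayscale image (path is illustrative)
gray = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)

# Global threshold (Otsu) and local, window-based thresholds (Sauvola, Niblack)
binary_otsu = gray > threshold_otsu(gray)
binary_sauvola = gray > threshold_sauvola(gray, window_size=25)
binary_niblack = gray > threshold_niblack(gray, window_size=25, k=0.8)

# Save the boolean masks as black-and-white images
for name, mask in [("otsu", binary_otsu), ("sauvola", binary_sauvola), ("niblack", binary_niblack)]:
    cv2.imwrite(f"result_{name}.png", (mask * 255).astype("uint8"))
```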

In this article, we will see how documents with different types of problems can be binarized through pixel classification, using a trained model based on the U-Net architecture, a Convolutional Neural Network (CNN). The typical use of CNNs is in classification tasks, where the output for an image is a single class label. In many visual tasks, however, the desired outcome includes not only whether an object is present in the image but also where it is, i.e., each pixel is supposed to be assigned a class label.

In this project, we are going to use images from document databases that come with the respective ground truth (binarized reference images). Some of the databases are from competitions, such as DIBCO and ICFHR.

2. The Dataset

The dataset is composed of a total of 5,027 images and their respective ground truth, which are binary reference images. The images used are part of the following datasets:

  • Document Image Binarization (DIB) - The Nabuco Dataset: 15 images [5]
  • DIBCO and H-DIBCO (years: 2009, 2010, 2011, 2012, 2013, 2014, 2016, 2017): 116 images [6]
  • ICFHR 2016 Binarization of Palm Leaf Manuscript Images challenge: 99 images [7]
  • ICDAR2017 Competition on Historical Document Writer Identification: 4,782 images [8]
  • PHIBD 2012 Persian Heritage Image Binarization Dataset: 15 images [9]

To increase the number of samples, we will apply data augmentation to both the originals and the binary reference images.
As the model only accepts images of size 256×256, we will divide the images into tiles instead of resizing them; this way, we do not lose information and we further increase the number of samples for training.

2.1 Data augmentation

The data augmentation process begins with applying transformations to both the original image and its ground truth. I chose only flips (vertical and horizontal) and rotations (90°, 180° and 270°) as augmentation transformations. Other transformations are also possible, such as blurring, adding noise, changing brightness and contrast, etc.; just remember that for those photometric transformations, the respective ground truth must not receive them.
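As a minimal sketch of these geometric augmentations (the code in the repository may differ in details), the key point is that the same flip or rotation is always applied to the image and to its ground truth so that they stay aligned:

```python
import numpy as np

def augment_pair(image, ground_truth):
    """Yield flipped and rotated versions of an (image, ground truth) pair."""
    yield image, ground_truth                        # original
    yield np.fliplr(image), np.fliplr(ground_truth)  # horizontal flip
    yield np.flipud(image), np.flipud(ground_truth)  # vertical flip
    for k in (1, 2, 3):                              # 90°, 180° and 270° rotations
        yield np.rot90(image, k), np.rot90(ground_truth, k)
```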
After applying the transformations, both the original and the transformed images go through the cutting process, resulting in 256×256-pixel images. The augmentation code can also increase the number of resulting images by using a cutting step smaller than 256, which generates overlapping crops.
Figure 2.1 illustrates the cutting process through an animation. The grey lines show where the cutter will split the image. On the left image, the step is 256, while on the right, it is 128. Note that the final size of the cropped image is the same in both cases (256×256), as shown by the square highlighted in white.

With a dataset of 5,027 images, cutting with a step of 256 produces 27,630 images, while a step of 128 produces 97,146 images. Note that the latter is not quite four times larger because crops that fall too close to the image edges are discarded, which I am not dealing with here (see Figure 2.1).

Figure 2.1: Animation showing the cutting process for generating images for use in the model (256×256) with two steps, 256 (left) and 128 (right). — animation by the author.

Figure 2.2 shows the result of the cutting and the augmentation process.

Figure 2.2: Some images resulting from the data augmentation process, both original cut images (left) and the reference binarized cut images (right). — image by the author.

3. The U-Net Model Architecture

U-Net, which evolved from the traditional convolutional neural network, was first designed and applied in 2015 by Olaf Ronneberger et al. [10] for biomedical image segmentation. A general convolutional neural network focuses on image classification, where the input is an image and the output is one or more labels. In biomedical cases, distinguishing whether there is a disease may not be enough; it is also necessary to localize the area of abnormality [10].

3.1 Convolutions

Before describing the U-Net architecture, we will briefly look at an important building block: the convolution. If you are already familiar with convolutions, you can skip this section.

More formally, a convolution is an integral that expresses the amount of overlap of one function g as it is shifted over another function f [mathworld.wolfram]. In digital image processing and deep learning, however, convolution is a mathematical way of combining two images to form a third image. Generally, one of the two inputs is not really an image but a filter (or kernel), a matrix of values whose size and contents determine the effect of the convolution process. The main idea is to place the kernel over each pixel (across the entire image) and multiply and sum its values over the target pixel and its local neighbours.

Figure 3.1: It is an animation illustrating how the convolution works. In this case, with no padding added and stride 1. Note that the resulting image (right) is smaller than the original image (left). — animation by the author.
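To make the multiply-and-sum idea concrete, here is a naive NumPy sketch of the convolution in Figure 3.1 (no padding, stride 1); it is written for clarity, not speed, and, as in CNNs, the kernel is not flipped:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """2D convolution with no padding and stride 1: the output is smaller than the input."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Multiply the kernel by the local neighbourhood and sum the result
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.random.rand(8, 8)
box_blur = np.ones((3, 3)) / 9.0
print(convolve2d_valid(image, box_blur).shape)  # (6, 6): smaller than the 8x8 input
```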

The most common uses of convolutions in digital image processing are edge detection, blurring and noise removal. The effect of the convolutions in the middle layers of a CNN is harder to see directly, but it becomes more evident when we look at these familiar final applications.

Figure 3.2: Example of convolution process used in image processing. The results in the middle are blurring and Sobel edge detection, respectively. We usually first convert the images to grayscale before applying any filter to avoid the undesired result of the figure on the right. With CNNs, the red, green, and blue channels are used separately. — animation by the author.

Figure 3.2 shows convolutions applied to the left image with different kernels, first a blur and then Sobel edge detection. I applied these two convolutions to the grayscale version of the original image. With CNNs, the convolution is applied to each of the RGB channels, which is not common in classical image processing because recombining the filtered channels generates a strange result, like the one on the right of Figure 3.2.

Padding: as can be seen from the examples above, the resulting image is smaller than the original by an amount related to the kernel size; the larger the kernel, the farther its center is from the border of the image.
To produce an output of the same size as the input, we pad the edges with extra pixels. This way, when sliding, the kernel can place the original edge pixels at its center while extending into the extra pixels beyond the edge.

Figure 3.4 shows some padding methods using the copyMakeBorder function of OpenCV. The original image is the one from the kernel filter example, with four colored points, one in each corner, to help demonstrate the difference between the methods.

Figure 3.4: Four examples of padding using OpenCV. From left to right: the original image, with BORDER_CONSTANT (in yellow), with BORDER_REFLECT, with BORDER_REPLICATE and with BORDER_WRAP. — image by the author.
Figure 3.5: Same idea of Figure 3.1 but this time with padding of size one added (the zero border around the image). Note that the resulting image (right) has the same dimensions as the original image (left). — animation by the author.
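The padding variants of Figure 3.4 can be reproduced with OpenCV's copyMakeBorder; a small sketch (the file name and border width are illustrative):

```python
import cv2

image = cv2.imread("corners.png")  # illustrative: any small test image
pad = 32                           # border width in pixels

padded = {
    "constant": cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                   cv2.BORDER_CONSTANT, value=(0, 255, 255)),  # yellow (BGR)
    "reflect": cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_REFLECT),
    "replicate": cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_REPLICATE),
    "wrap": cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_WRAP),
}
for name, result in padded.items():
    cv2.imwrite(f"padded_{name}.png", result)
```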

Striding: the stride is the number of pixels the kernel window shifts over the input matrix at each step. A stride of one means the window moves one pixel at a time, so every position is visited, as in a standard convolution. A stride of two means the window moves two pixels at a time, skipping every other position and downsizing the output roughly by a factor of two. A stride of three skips two positions at each move, downsizing roughly by a factor of three, and so on.
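Kernel size, padding and stride together determine the output size; a quick sketch of the usual formula:

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one dimension for input size n, kernel k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(256, 3, p=0, s=1))  # 254: no padding shrinks the image
print(conv_output_size(256, 3, p=1, s=1))  # 256: 'same' padding keeps the size
print(conv_output_size(256, 3, p=1, s=2))  # 128: stride 2 roughly halves it
```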

3.2 How the U-Net works

The architecture contains two paths, as can be seen in Figure 3.6. The first path is the contraction path (also called the encoder), used to capture the context in the image. The encoder is just a traditional stack of convolutional and max-pooling layers. The second path is the symmetric expanding path (also called the decoder), used to enable precise localization by means of transposed convolutions that upsample the feature maps (you will find an excellent explanation of transposed convolutions here). It is an end-to-end, fully convolutional network without any dense layer.

Figure 3.6: U-Net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations [10].

Each encoder block receives an input, applies two 3×3 convolutions (without padding in the original paper), each followed by a rectified linear unit (ReLU), and then a 2×2 max pooling operation with stride 2 for downsampling, as detailed in Figure 3.7.

Figure 3.7: Encoder detail from the original U-Net diagram (from [10]) and the respective code in our implementation.
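Since the code in Figure 3.7 appears only as an image, here is a minimal Keras sketch of an encoder block along the lines described above; the function and argument names are illustrative, and padding='same' is used as in this project:

```python
from tensorflow.keras import layers

def encoder_block(inputs, num_filters):
    # Two 3x3 convolutions, each followed by a ReLU; padding='same' keeps the spatial size
    x = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)
    # 2x2 max pooling with stride 2 for downsampling
    pooled = layers.MaxPooling2D(pool_size=(2, 2))(x)
    return x, pooled  # x is kept as the skip connection for the decoder
```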

Each encoder block consists of two convolutional layers and, in this project, the shape of the input changes from 256×256×1 to 256×256×64 in the first convolution, increasing the number of feature channels to 64. Notice that in the code shown in Figure 3.7, padding='same' is used (an extra border is filled with zeros; you can have a non-padded convolution with padding='valid' in Keras); this way, the convolution process does not reduce the image dimensions. Ronneberger et al. used non-padded convolutions, which is why 'the cropping is necessary due to the loss of border pixels in every convolution' [10]. In the original implementation of the U-Net, the input is a 128×128×1 image, and the encoder outputs an 8×8×256 shape.

The decoder consists of expansion blocks. Each block upsamples the feature map with a 2×2 up-convolution (transposed convolution) that halves the number of feature channels, concatenates the result with the correspondingly cropped feature map from the encoder path, and applies two 3×3 convolutions, each followed by a ReLU (see Figure 3.8). The cropping is necessary due to the loss of border pixels in every convolution; as I adopted padded convolutions in the encoder (padding='same' in Keras), the cropping step is unnecessary here.

In the original implementation of the U-Net, the decoder increases the shape from 8×8×256 → 128×128×1.

Figure 3.8: Decoder detail from the original U-Net diagram (from [10]) and the respective code in our implementation.
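Likewise, a sketch of a decoder block matching the description above (the code shown in Figure 3.8 may differ in details):

```python
from tensorflow.keras import layers

def decoder_block(inputs, skip_features, num_filters):
    # 2x2 transposed convolution: doubles the spatial size, halves the feature channels
    x = layers.Conv2DTranspose(num_filters, (2, 2), strides=2, padding="same")(inputs)
    # Concatenate with the corresponding encoder feature map; no cropping is needed
    # because the encoder used padding='same'
    x = layers.Concatenate()([x, skip_features])
    # Two 3x3 convolutions, each followed by a ReLU
    x = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(num_filters, 3, padding="same", activation="relu")(x)
    return x
```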

Notice that the process finishes with a 1×1 convolution that maps each 64-component feature vector to the desired number of classes.
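Putting the pieces together, a hedged sketch of how a full model along these lines could be assembled, reusing the encoder_block and decoder_block sketches above and ending with the 1×1 convolution (a sigmoid output is assumed, since ink vs. paper is a two-class problem, and the filter counts follow the standard U-Net progression, not necessarily this project's exact values):

```python
from tensorflow.keras import layers, Model

def build_unet(input_shape=(256, 256, 1)):
    inputs = layers.Input(input_shape)

    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    s3, p3 = encoder_block(p2, 256)
    s4, p4 = encoder_block(p3, 512)

    # Bottleneck: two 3x3 convolutions at the lowest resolution
    b = layers.Conv2D(1024, 3, padding="same", activation="relu")(p4)
    b = layers.Conv2D(1024, 3, padding="same", activation="relu")(b)

    d1 = decoder_block(b, s4, 512)
    d2 = decoder_block(d1, s3, 256)
    d3 = decoder_block(d2, s2, 128)
    d4 = decoder_block(d3, s1, 64)

    # 1x1 convolution maps each 64-component feature vector to a single ink/paper score
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(d4)
    return Model(inputs, outputs, name="unet")
```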

4 Experiments

The code was implemented in Python using TensorFlow Keras, and the source can be found here. To run the project, you will need Python 3.x, TensorFlow, Keras, OpenCV, NumPy, scikit-image and scikit-learn installed. I think the easiest way to install TensorFlow is using Anaconda, as explained here. There is also a requirements.txt file to help with the installation.

The project is organized in such a way that, to be used for data augmentation, the dataset needs to have the structure shown in Figure 4.1, where GT holds the ground truth images and Originals holds the original document images:

Figure 4.1: File structure for the dataset to be used by the data augmentation process (can be used for training with minor modifications).

Figure 4.2 shows the folder structure for the files to be used in training (resulting from data augmentation, for example).

Figure 4.2: Data folders structure for files to be used for training. — image by the author.

In the model folder there is a pre-trained model (you have to unzip it).
Following is the code for image preparation to split the input image into 256×256 tiles:
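The embedded snippet is not reproduced here, so the following is a stand-in sketch of such a splitting step, assuming a grayscale NumPy array and padding with white so the dimensions become multiples of 256 (the repository code may differ):

```python
import numpy as np

TILE = 256

def split_into_tiles(image, tile=TILE):
    """Pad a grayscale image to multiples of `tile` and cut it into tile x tile pieces."""
    h, w = image.shape
    pad_h = (tile - h % tile) % tile
    pad_w = (tile - w % tile) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w)), mode="constant", constant_values=255)
    tiles = [padded[y:y + tile, x:x + tile]
             for y in range(0, padded.shape[0], tile)
             for x in range(0, padded.shape[1], tile)]
    return np.array(tiles), padded.shape
```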

Code for image restoration joining together 256×256 tiles to form the resulting image:
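Again as a stand-in sketch, mirroring the splitting function above:

```python
import numpy as np

def join_tiles(tiles, padded_shape, original_shape, tile=256):
    """Reassemble tiles produced by split_into_tiles and crop back to the original size."""
    out = np.zeros(padded_shape, dtype=tiles.dtype)
    i = 0
    for y in range(0, padded_shape[0], tile):
        for x in range(0, padded_shape[1], tile):
            out[y:y + tile, x:x + tile] = tiles[i]
            i += 1
    h, w = original_shape
    return out[:h, :w]
```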

If we want to binarize a document of any size after the training process, it must be cut into 256×256 tiles. The U-Net then binarizes each tile, and all the resulting binarized tiles are joined together to form the final image.
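Tying these steps together, a hedged sketch of the whole inference pass, reusing the split_into_tiles and join_tiles sketches above; the model file name, the input normalization and the 0.5 threshold on the sigmoid output are assumptions, not the repository's exact choices:

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("model/unet_binarization.h5")  # illustrative file name

gray = cv2.imread("old_document.png", cv2.IMREAD_GRAYSCALE)
tiles, padded_shape = split_into_tiles(gray)

# Normalize to [0, 1], add the channel axis and predict every tile at once
batch = tiles.astype("float32")[..., np.newaxis] / 255.0
pred = model.predict(batch)[..., 0]

# Threshold the sigmoid output and reassemble the full binarized page
binary_tiles = (pred > 0.5).astype("uint8") * 255
result = join_tiles(binary_tiles, padded_shape, gray.shape)
cv2.imwrite("binarized.png", result)
```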

4.1 Evaluating the results

It is easy to check whether the result is the expected one when the output is a single label. But what if the result is an image, as in binarization or semantic segmentation in general?
In some competitions, like the ones from which we got the images, the evaluation process uses measures suitable for document analysis and recognition. These measures are (i) F-Measure (FM), (ii) pseudo-F-Measure (Fps), (iii) PSNR and (iv) Distance Reciprocal Distortion (DRD) [11].
Here I will describe one technique commonly used in semantic segmentation: the Intersection over Union (IoU), also known as the Jaccard index (Eq. 4.1). It measures the percentage of overlap between the ground truth and the binarization result. In other words, it is the number of black pixels in common between the ground truth and the binarization result (the intersection) divided by the total number of black pixels present in either image (the union).

Eq. 4.1: Intersection over Union formula, IoU = |A ∩ B| / |A ∪ B|, where A and B are the sets of black pixels in the ground truth and in the binarization result.
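For binarized document images, the IoU of Eq. 4.1 can be computed directly over the ink pixels; a small sketch, assuming black ink is stored as 0 and paper as 255:

```python
import numpy as np

def iou_score(ground_truth, prediction):
    """IoU over the ink (black) pixels of two binarized images (0 = ink, 255 = paper)."""
    gt_ink = ground_truth == 0
    pred_ink = prediction == 0
    intersection = np.logical_and(gt_ink, pred_ink).sum()
    union = np.logical_or(gt_ink, pred_ink).sum()
    return intersection / union if union > 0 else 1.0
```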

Here is an example to help understand the IoU score calculated using the ground truth image and the result of the Otsu binarization method.

Figure 4.3: Original image: detail of a document image from the DIBCO dataset [6].
Figure 4.4: The ground truth and the Otsu result were colored in green and red, respectively, for better visualization of the common areas (in brown in the image at right). The IoU in this case scored 0.93124. — image by the author.

More on metrics to evaluate your model can be found here and here.

4.2 Results

Next, I will show some results of experiments using a model trained with the U-Net architecture and compare them with results obtained using three classical binarization algorithms: Otsu, Sauvola and Niblack. The document images present some of the problems described in the introduction (Figure 1.2), such as foxing, back-to-front ink interference, adhesive tape marks and folding marks.

Of course, these algorithms are not the best choice for this type of problem, but they serve to give an idea of how good the results from a model based on U-Net can be, even without much sophistication in dataset preparation and training.

Figure 4.5: Example of a document with strong folding marks and the respective IoU scores. Original image (top left), ground truth (top center), result with Otsu = 0.9397 (top right), result with Sauvola = 0.9665 (bottom left), result with Niblack = 0.8594 (bottom center) and with U-Net = 0.3171 (bottom right). Image from the DIBCO dataset [6].
Figure 4.6: Another example of a document with strong folding marks and the respective IoU scores. Original image (top left), ground truth (top center), result with Otsu = 0.9863 (top right), result with Sauvola = 0.9731 (bottom left), result with Niblack = 0.9020 (bottom center) and with U-Net = 0.3020 (bottom right).
Figure 4.7: Example of a document with a large foxing area, a small glitch at the top, and the respective IoU scores. Original image (top left), ground truth (top center), result with Otsu = 0.5826 (top right), result with Sauvola = 0.9661 (bottom left), result with Niblack = 0.8443 (bottom center) and with U-Net = 0.3656 (bottom right). Image from the DIBCO dataset [6].
Figure 4.8: Example of a document with strong back-to-front ink interference. Original image (top left), ground truth (top center), result with Otsu = 0.7939 (top right), result with Sauvola = 0.9599 (bottom left), result with Niblack = 0.8912 (bottom center) and with U-Net = 0.6754 (bottom right). Image from the DIBCO dataset [6].
Figure 4.9: Example of a document with adhesive tape marks and some crumpled areas. Original image (top left), ground truth (top center), result with Otsu = 0.9699 (top right), result with Sauvola = 0.9861 (bottom left), result with Niblack = 0.8369 (bottom center) and with U-Net = 0.6745 (bottom right). Image from the PHIBD 2012 dataset [9].
Figure 4.10: IoU score with images from the dataset (5,027 images). Note how the U-Net model scored much better (lower score) than the other methods.

The code presented in this article can be found on github.

5 References:

[1] Thresholding. In: Bezerra, B. L. D., Zanchettin, C., Toselli, A. H. and Pirlo, G. (Eds.), Handwriting: Recognition, Development and Analysis, 1st ed., New York: Nova Science Publishers, pp. 33–56, 2017.

[2] Otsu, N., A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, pp. 62–66, 1979.

[3] Sauvola, J. and Pietikäinen, M., Adaptive document image binarization. Pattern Recognition, vol. 33, pp. 225–236, 2000.

[4] Niblack, W., An introduction to Digital Image Processing, Prentice-Hall, 1986.

[5] Document Image Binarization (DIB)-The Nabuco Dataset. At https://u.pcloud.link/publink/show?code=kZzQIE7ZjztUduQxNvmf9P0hrBmOx8D3GJtk.

[6] Document Image Binarization Competition, DIBCO and H-DIBCO (years: 2009, 2010, 2011, 2012, 2013, 2014, 2016, 2017, 2018). Dataset at https://vc.ee.duth.gr/dibco2019/, 2019.

[7] The International Conference on Frontiers of Handwriting Recognition (ICFHR) Challenge 1. Binarization of Palm Leaf Manuscript Images, 2016. Dataset at http://amadi.univ-lr.fr/ICFHR2016_Contest/index.php/challenge-1, 2016.

[8] ICDAR2017 Competition on Historical Document Writer Identification. https://lme.tf.fau.de/dataset/scriptnet-icdar2017-competition-on-historical-document-writer-identification/, 2017.

[9] Persian Heritage Image Binarization Dataset (PHIBD 2012). Dataset at http://tc11.cvc.uab.es/datasets/PHIBD%202012_1/task_1_1, 2012.

[10] Ronneberger, O., Fischer, P. and Brox, T., U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.

[11] Pratikakis, I., Zagoris, K., Kaddas, P. and Gatos, B., ICFHR 2018 Competition on Handwritten Document Image Binarization (H-DIBCO 2018), 2018.
