Historical OCR with Self-Supervised Learning

Challenges and Achievements as a Google Summer of Code Contributor

Yukinori Yamamoto
7 min read · Sep 10, 2024

Introduction

For the past several months, I have been working on creating self-supervised models for historical OCR as a Google Summer of Code contributor with HumanAI. This blog post is the second report on my work, following my first blog post. You can check my code in this repo.

Task

The aim of this project is to develop a model that can recognize characters in Spanish printed documents from the Renaissance using self-supervised learning. The model is expected to achieve an accuracy rate of over 80%. This task is particularly challenging due to the unique characteristics of the data. Renaissance-era Spanish differs significantly from modern Spanish, including variations in spelling, which is a critical factor for OCR.

Example images from the target dataset

Additionally, some characters in the dataset are no longer in use today. Since these documents were printed hundreds of years ago, parts of them have deteriorated, making it even harder for the model to recognize text. Compounding the difficulty, there is only a minimal amount of annotated data for these historical documents — just a few thousand words. General OCR models typically require over 10,000 annotated word images to be trained from scratch, so creating a decent model with such a small dataset might seem daunting. To address this, self-supervised learning is utilized, as it allows training on data without the need for labels — exactly the approach I have been taking in this project.

SeqCLR

Contrastive learning is a self-supervised learning method that trains a deep learning model on an unlabeled dataset by generating its own supervision signal during training: the model learns to associate each data point with its augmented counterparts and to distinguish it from all other data points.

A diagram illustrating contrastive learning, from this paper

To apply contrastive learning in OCR, I implemented SeqCLR, a method proposed by Aberdam et al., designed to improve text recognition by leveraging contrastive learning in a sequence-to-sequence manner. During contrastive learning (pre-training), SeqCLR learns to maximize agreement between corresponding sequences of differently augmented images (positive pairs) and distinguish these from different sequences (negative pairs).
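
To make the sequence-level formulation concrete, here is a minimal sketch of how two augmented views of the same word image can be turned into aligned instance vectors for contrastive comparison. It follows the window-to-instance idea from the SeqCLR paper but is illustrative only; the function name and shapes are not taken from my repository.

```python
import torch
import torch.nn.functional as F

def window_to_instances(feats_a, feats_b, num_instances=5):
    """Pool two augmented views' frame sequences into a fixed number of
    aligned instance vectors. feats_a, feats_b: (T, D) encoder outputs of
    the same word image under two different augmentations."""
    inst_a = F.adaptive_avg_pool1d(feats_a.t().unsqueeze(0), num_instances).squeeze(0).t()
    inst_b = F.adaptive_avg_pool1d(feats_b.t().unsqueeze(0), num_instances).squeeze(0).t()
    # row i of inst_a and row i of inst_b form a positive pair;
    # all other rows serve as negatives
    return inst_a, inst_b
```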

Illustration of SeqCLR’s contrastive learning method

The SeqCLR architecture I implemented combines a ResNet50 and a 2-layer BiLSTM as the encoder with an attention-based LSTM decoder. Initially, I trained the model on images of text lines that were manually segmented from the original document images. However, since this segmentation is labor-intensive and impractical in real-world settings, I developed a pipeline that automatically extracts word bounding boxes from document images using CRAFT-pytorch, and then adapted the model from line-level to word-level recognition using these extracted images.
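
As a rough sketch of the encoder side of this layout (illustrative only; the exact configuration, layer sizes, and the attention decoder live in the repository):

```python
import torch
import torch.nn as nn
import torchvision

class SeqCLREncoder(nn.Module):
    """ResNet50 backbone followed by a 2-layer BiLSTM, producing a feature
    sequence along the horizontal axis of a word image."""
    def __init__(self, hidden=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # pretrained=False on older torchvision
        # drop the average pooling and classification head, keep the spatial map
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        self.rnn = nn.LSTM(2048, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)

    def forward(self, images):              # (B, 3, H, W)
        fmap = self.backbone(images)        # (B, 2048, h, w)
        fmap = fmap.mean(dim=2)             # collapse height -> (B, 2048, w)
        seq = fmap.permute(0, 2, 1)         # (B, w, 2048)
        feats, _ = self.rnn(seq)            # (B, w, 2 * hidden)
        return feats
```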

The training workflow follows three steps: (1) self-supervised learning (pre-training), (2) fine-tuning on an automatically generated dataset, and (3) fine-tuning on real data. In the pre-training phase, I used over 800,000 word images extracted from unlabeled Renaissance-era Spanish documents. Next, for the first fine-tuning stage, I used a tool called “TRDG” to generate 700,000 synthetic word images with fonts resembling the target documents. Finally, the model was fine-tuned on 5,000 labeled word images from real Renaissance-period documents. Despite the small amount of labeled data, I hypothesized that the model could still perform well thanks to the first two phases.
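
For reference, synthetic data generation with TRDG looks roughly like this. This is a minimal sketch assuming the trdg Python generator API; the vocabulary, font path, and output layout are placeholders, and argument names may differ between trdg versions.

```python
# pip install trdg
import os
from trdg.generators import GeneratorFromStrings

words = ["hazer", "dixo", "quales", "tiempo"]      # placeholder vocabulary
generator = GeneratorFromStrings(
    words,
    count=1000,                                    # scale up toward ~700,000 in practice
    fonts=["fonts/renaissance_like.ttf"],          # fonts resembling the target documents
    size=64,
)

os.makedirs("synthetic", exist_ok=True)
with open("synthetic/labels.txt", "w", encoding="utf-8") as f:
    for i, (image, label) in enumerate(generator):
        image.save(f"synthetic/{i:06d}.png")
        f.write(f"{i:06d}.png\t{label}\n")
```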

Previous Problems with SeqCLR

As I mentioned in my previous blog post, during my initial attempt, SeqCLR was trained on 800,000 unlabeled word images using contrastive learning, and fine-tuned on 700,000 synthetic word images and about 5,000 labeled word images. I set the initial learning rate at 0.01 for both contrastive learning and fine-tuning, and used 3 epochs for contrastive learning and 10 for fine-tuning.

However, I observed that the loss value, which measures how well the model learns during training, remained constant during contrastive learning — indicating that the model wasn’t learning anything.

The result of SeqCLR’s contrastive learning. As you can see, the loss stopped moving after a few hundred steps.

I found that SeqCLR performed better without contrastive learning, contrary to the original paper’s findings. The model with contrastive learning had a Character Error Rate (CER) of over 10%, whereas the model without it achieved a CER of 0.1%. (Note: the CER values reported earlier were inaccurate due to inappropriate metric settings. I later corrected this by removing PAD tokens from the inference outputs and labels, which leads to the higher CER values reported in the next section.)
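
The corrected metric roughly corresponds to the following sketch, in which the PAD token name is illustrative: padding is stripped from both predictions and labels before computing the edit distance, so matching padding can no longer deflate the error.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(predictions, references, pad_token="[PAD]"):
    """Character Error Rate with PAD tokens removed from both sides."""
    edits = chars = 0
    for pred, ref in zip(predictions, references):
        pred, ref = pred.replace(pad_token, ""), ref.replace(pad_token, "")
        edits += levenshtein(pred, ref)
        chars += len(ref)
    return edits / max(chars, 1)
```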

I hypothesized that the implementation of contrastive learning might have been flawed, potentially causing the model to change its parameters randomly or learn unrelated information from the data.

Causes and Solutions

After analyzing the model’s architecture, hyperparameter settings, and dataset size, I identified two major issues: the temperature in the contrastive loss function and the learning rate during fine-tuning. The temperature scales the similarity scores inside the contrastive loss and therefore determines how sensitive the loss is to differences between sequences.

The NCE loss function, a component of the contrastive loss described in the original paper. The letter indicated by the red arrows is the temperature.
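
In its standard form, this loss looks roughly as follows, where u and v⁺ are a positive pair of instance vectors, N is the set of candidates including the negatives, sim is cosine similarity, and τ is the temperature; the exact notation in the paper may differ slightly.

```latex
\mathcal{L}_{\mathrm{NCE}}(u, v^{+}) =
  -\log \frac{\exp\!\big(\mathrm{sim}(u, v^{+}) / \tau\big)}
             {\sum_{v \in \mathcal{N}} \exp\!\big(\mathrm{sim}(u, v) / \tau\big)}
```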

Initially, I set the temperature to 256, but it should have been below 1.0. With such a large value, the scaled similarity scores are squashed into a tiny range, so the softmax inside the loss becomes nearly uniform and the gradients all but vanish. Once I reduced the temperature to 0.5, the loss began to decrease gradually during contrastive learning.
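
A simplified PyTorch version of the loss with the temperature made explicit (a sketch, not my repository’s implementation; it assumes aligned instance vectors such as those produced in the earlier pairing sketch):

```python
import torch
import torch.nn.functional as F

def nce_loss(inst_a, inst_b, temperature=0.5):
    """Contrastive loss over aligned instance vectors: inst_a[i] and inst_b[i]
    are a positive pair, every other row of inst_b is a negative.
    With temperature = 256 the scaled similarities become nearly uniform,
    which is consistent with the flat loss curve observed earlier."""
    a = F.normalize(inst_a, dim=1)
    b = F.normalize(inst_b, dim=1)
    logits = a @ b.t() / temperature                      # cosine similarity / tau
    targets = torch.arange(a.size(0), device=a.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```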

A chart of loss value during contrastive learning after the adjustment of Temperature.

The second issue was the learning rate. I had used the same learning rate for both contrastive learning and fine-tuning. However, since the model’s parameters had already been adjusted during contrastive learning, they only needed small updates during fine-tuning. When I reduced the fine-tuning learning rate to 0.0001, the model with contrastive learning achieved nearly the same CER as the model without it.
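
In code this simply means building the fine-tuning optimizer with a smaller learning rate than the pre-training one (the optimizer choice here is illustrative; only the learning-rate split reflects my setup):

```python
import torch

# `model` refers to the encoder-decoder network described earlier

# contrastive pre-training: larger steps while representations are still forming
pretrain_optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# fine-tuning: much smaller steps so the pre-trained weights are only nudged
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```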

Validation CER of SeqCLR models with and without contrastive learning (self-supervised learning) during the fine-tuning phase
Inference results of the model pre-trained with contrastive learning and fine-tuned only on the real dataset

I also found that when fine-tuning used only real data (that is, 800,000 unlabeled images for pre-training and 5,000 labeled ones for fine-tuning, skipping the synthetic dataset), the model with contrastive learning converged faster than the model without it. Although both models ultimately reached a CER of about 5% after 30 epochs, the contrastive learning model got there much sooner.

The final CER value of my best model (combining contrastive learning, synthetic data fine-tuning, and real data fine-tuning) was 4.1%.

I concluded that while SeqCLR can achieve reasonable accuracy without contrastive learning if enough labeled data is available, contrastive learning helps reduce the cost of preparing large datasets, such as synthetic data generation.

PerSec

As mentioned in my first blog post, I have also been exploring PerSec, a successor to SeqCLR proposed by Liu et al. PerSec uses hierarchical contrastive learning with two levels of perceivers: the Stroke Context Perceiver (STCP) and the Semantic Context Perceiver (SECP). The STCP extracts partial patterns of characters, while the SECP captures the overall shapes of characters.
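
Conceptually, the hierarchical objective can be pictured as two contrastive terms applied at different feature levels and summed. This is only a schematic sketch, not the architecture from the paper or my implementation; it reuses the nce_loss function sketched above, and the weights are made up.

```python
import torch
import torch.nn as nn

class HierarchicalContrastiveLoss(nn.Module):
    """Schematic PerSec-style objective: one contrastive term on low-level
    (stroke) features and one on high-level (semantic) features, combined
    with illustrative weights."""
    def __init__(self, stroke_weight=1.0, semantic_weight=1.0):
        super().__init__()
        self.stroke_weight = stroke_weight
        self.semantic_weight = semantic_weight

    def forward(self, stroke_a, stroke_b, sem_a, sem_b):
        # stroke_* come from an early (low-level) feature map of two augmented
        # views; sem_* come from a late (high-level) one
        return (self.stroke_weight * nce_loss(stroke_a, stroke_b)
                + self.semantic_weight * nce_loss(sem_a, sem_b))
```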

Illustration of PerSec’s architecture

I implemented the Vision Transformer (ViT)-based encoder version of PerSec. Unlike my initial SeqCLR runs, PerSec’s loss decreased gradually during contrastive learning right from the start.

PerSec’s contrastive learning loss over training steps. Although the loss occasionally spiked, it decreased gradually overall.

However, it performed poorly during fine-tuning, generating the same outputs regardless of the input. Even switching to the CNN-based Encoder version did not resolve this issue.

The result of fine-tuned ViT-based PerSec

I suspect that the contrastive learning components (STCP and SECP) overfit to the contrastive learning task, confusing the overall model. I am continuing to investigate the cause and will work on improving PerSec’s performance.

Pipeline and UI

I have also implemented an end-to-end pipeline that generates transcriptions from input images, complete with a graphical user interface (GUI). Although there is still plenty of room for improvement, users can easily process documents and obtain transcriptions with it. You can try out the pipeline via this GitHub repository.
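
At a high level, the pipeline chains word detection and word recognition; the sketch below outlines that flow. Here, detect_words and recognize_word are hypothetical stand-ins for the CRAFT-based detector and the fine-tuned recognizer, not names from the repository.

```python
from PIL import Image

def transcribe(document_path, detect_words, recognize_word):
    """Detect word boxes on a page, crop them, and recognize each crop
    in rough reading order, returning the joined transcription."""
    page = Image.open(document_path).convert("RGB")
    boxes = detect_words(page)                     # [(x0, y0, x1, y1), ...]
    boxes.sort(key=lambda b: (b[1], b[0]))         # top-to-bottom, then left-to-right
    words = [recognize_word(page.crop(box)) for box in boxes]
    return " ".join(words)
```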

Summary

In this project, I worked on developing a self-supervised OCR model for Renaissance-era Spanish documents using contrastive learning. I implemented and experimented with SeqCLR and PerSec models. SeqCLR initially faced challenges with contrastive learning due to inappropriate hyperparameters, but after adjustments, it achieved a final CER of 4.1%. The contrastive learning approach proved useful in reducing the cost of dataset preparation by minimizing the need for large annotated datasets.

While SeqCLR showed promise, PerSec, despite being a successor model, struggled with performance, especially during fine-tuning. My ongoing efforts involve improving PerSec’s functionality and identifying the root causes of its current issues.

In addition to the model development, I created an end-to-end pipeline and a user-friendly interface to enable easy transcription of historical documents. I look forward to continuing to refine these models and the pipeline to further enhance OCR for historical texts.
