GSoC 2024 with HumanAI | Text Recognition with Transformer Models
Abstract
Optical Character Recognition (OCR) technology has revolutionised document digitization, converting printed and handwritten text into machine-readable formats. However, recognizing text from centuries-old works remains challenging due to the complexity of early non-standard print forms, handwritten texts, and manuscripts. Existing OCR tools such as Adobe Acrobat and Tesseract (typically used through Pytesseract) often struggle with historical documents because of variations in font styles, image quality, degradation, and layouts.
This project, under the Google Summer of Code (GSoC) initiative, aims to develop a hybrid end-to-end Transformer model capable of accurately recognizing text from non-standard Spanish printed sources from the 16th and 17th centuries.
Introduction
In recent years, the landscape of neural network architectures has undergone significant transformations, particularly with the rise of transformers.
The historical texts targeted here present unique challenges with their non-standardized orthography, varied typographical conventions, and centuries of wear and tear. By leveraging the strengths of transformers in both natural language processing and image processing, this project seeks to create a robust model that can accurately transcribe these invaluable documents, aiding in their preservation and accessibility for modern scholars and researchers.
To evaluate our model’s performance, we will use metrics such as Character Error Rate (CER), Word Error Rate (WER), and BLEU Score. These metrics will help assess the model’s accuracy and effectiveness in transcribing historical documents.
Printing Irregularities
Historical texts from the 16th and 17th centuries exhibit a range of printing irregularities that pose significant challenges for Optical Character Recognition (OCR) technology. These irregularities result from the limited resources and printing techniques of the time, such as the scarcity of type molds and the need to save space or reuse type molds.
The key irregularities expected in these texts include:
1. Interchangeable Characters:
u and v: These letters are used interchangeably. Typically, ‘u’ is assumed at the beginning of a word and ‘v’ inside a word.
f and s: The long 's' (ſ) closely resembles 'f', so the two can appear interchangeably at the beginning or inside of words. 's' is generally assumed at the beginning or end of a word, and 'f' within a word.
2. Tildes and Accents:
Tildes act as horizontal caps marking omitted letters. When a 'q' is capped, 'ue' usually follows; if a vowel is capped, 'n' follows. A capped 'n' is always interpreted as 'ñ'.
3. Spelling Conventions:
The old spelling ‘ç’ should be interpreted as the modern ‘z’.
4. Hyphenation:
In some instances, line-end hyphens are missing, resulting in split words. These splits should be retained initially but checked against a dictionary for accuracy.
The researchers utilized early modern handbooks, such as Alonso Víctor de Paredes’s 1680 Institucion, y origen del arte de la Imprenta, to guide the machine learning models in predicting orthographic patterns and standardizing outputs for more consistent transcription, accounting for the diverse printing practices and orthographical idiosyncrasies of the period.
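To make these conventions concrete, here is a minimal sketch of how a small subset of them could be expressed as post-processing rules. It is only an illustration: the project's actual normalization also involves dictionary checks and context-dependent u/v and f/s handling, and the function name below is hypothetical.

```python
import re
import unicodedata

def normalize_early_modern(text: str) -> str:
    """Illustrative subset of the period conventions described above (not the full rule set)."""
    # Decompose accents so a macron "cap" appears as base letter + U+0304 and ç as c + U+0327.
    text = unicodedata.normalize("NFD", text)
    text = re.sub("c\u0327", "z", text)              # old spelling 'ç' -> modern 'z'
    text = re.sub("C\u0327", "Z", text)
    text = re.sub("q\u0304", "que", text)            # a capped 'q' is usually followed by 'ue'
    text = re.sub("n\u0304", "n\u0303", text)        # a capped 'n' is always read as 'ñ'
    text = re.sub("([aeiou])\u0304", r"\1n", text)   # a capped vowel expands to vowel + 'n'
    # Line-end hyphen splits are retained at this stage and only flagged for a later dictionary check.
    return unicodedata.normalize("NFC", text)

print(normalize_early_modern("coraçon q̄ duerme"))   # -> "corazon que duerme"
```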
Data Preprocessing
Data preprocessing is a critical and time-consuming stage in historical document text recognition, directly affecting the performance of deep learning models used for Optical Character Recognition (OCR). Layout analysis and text line segmentation are crucial for precise transcription, especially in historical documents where text regions and line breaks are often unclear. Segmentation algorithms are key for detecting text regions, isolating individual lines, and separating characters to improve transcription accuracy. Various algorithms have been developed to optimize this process.
The preprocessing of historical documents began with converting PDFs into high-resolution images, followed by binarization to enhance text-background separation and by denoising, since pages contained various imperfections including text deterioration, stains, blurring, and other distortions. Morphological operations with vertical and horizontal structuring elements were used to remove marginalia while balancing the preservation of critical content against the elimination of extraneous artifacts. Further refinement involved repairing gaps in the text and removing small artifacts, thereby improving the integrity of text regions.
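A rough sketch of these steps with OpenCV is shown below. The kernel sizes, threshold parameters, and function name are illustrative rather than the project's tuned values.

```python
import cv2
import numpy as np

def preprocess_page(path: str) -> np.ndarray:
    """Illustrative OpenCV version of the preprocessing steps described above."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Denoise before thresholding to suppress stains, blur, and other imperfections.
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Adaptive binarization separates ink from a degraded, unevenly lit background.
    binary = cv2.adaptiveThreshold(
        denoised, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, blockSize=35, C=15,
    )

    # Morphological opening with thin vertical/horizontal structuring elements picks up
    # rule lines and marginal strokes, which are then subtracted from the text mask.
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 25))
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 1))
    lines = (cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
             | cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel))
    cleaned = cv2.bitwise_and(binary, cv2.bitwise_not(lines))

    # Closing repairs small gaps inside characters; a light opening removes speckle.
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, np.ones((2, 2), np.uint8))
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
    return cleaned
```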
A custom-trained U-Net segmentation model was employed to generate binary masks of text regions, enabling the extraction of bounding boxes that automate cropping to the areas of interest and thus improve efficiency. Post-segmentation, deskewing corrected angular distortions, image sharpness was improved, and DPI enhancement ensured adequate resolution for subsequent OCR processing.
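Assuming the U-Net outputs a binary text mask, the bounding-box cropping step can be sketched roughly as follows; the padding and minimum-area values are placeholders.

```python
import cv2
import numpy as np

def crop_text_regions(page: np.ndarray, mask: np.ndarray, pad: int = 10, min_area: int = 500):
    """Turn a binary U-Net text mask into padded crops of the page (illustrative)."""
    mask = (mask > 0).astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    crops = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w * h < min_area:                      # skip tiny spurious regions
            continue
        y0, y1 = max(0, y - pad), min(page.shape[0], y + h + pad)
        x0, x1 = max(0, x - pad), min(page.shape[1], x + w + pad)
        crops.append(page[y0:y1, x0:x1])
    return crops
```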
A* Path Planning Line Segmentation Algorithm
The A* algorithm is designed to find an optimal path between a start point and an end point by considering both the actual distance travelled so far and a heuristic estimate of the remaining distance. This ensures that the path found is both accurate and efficient to compute.
The method builds on projection-profile analysis: the horizontal projection profile is computed by summing the pixel values along each row of the binarized page. Because the background is white, rows of blank space between lines produce peaks in the profile, while rows containing text produce valleys. A threshold on the profile picks out the probable white regions, and the remaining dark regions, where text lies and neighbouring lines may touch, indicate where the path-planning algorithm is run to trace the separating paths.
The algorithm operates by identifying a threshold from the horizontal projection graph, where regions above this threshold are considered peaks. These peaks are then used to delineate line segments. To fine-tune the segmentation, the parameter min_peak_group_size is introduced, which filters out small regions that are likely errors due to an incorrect threshold value. This parameter is adjustable, allowing users to refine the segmentation when line segments are missed.
In some cases, the algorithm may incorrectly group multiple lines into a single segment. To address this issue, the segmentation process is re-run on images where the line height exceeds the 90th percentile of all line segment heights on the corresponding page. This ensures that over-segmented regions are correctly split into individual line segments, improving the accuracy of the algorithm.
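The projection-profile and peak-grouping step described above can be sketched as follows. The function name, threshold ratio, and default min_peak_group_size are illustrative, and the A* path tracing itself is omitted.

```python
import numpy as np

def find_line_bands(binary: np.ndarray, min_peak_group_size: int = 5, thresh_ratio: float = 0.95):
    """Illustrative projection-profile step behind the A* line segmentation.

    `binary` is a binarized page with background pixels = 255 and ink = 0, so rows of
    pure white space produce the highest sums (peaks) in the horizontal profile.
    """
    profile = binary.sum(axis=1).astype(float)
    is_gap = profile > thresh_ratio * profile.max()   # rows that are probably white space

    # Group consecutive white rows; drop groups smaller than min_peak_group_size,
    # which are usually artifacts of an imperfect threshold.
    gaps, start = [], None
    for y, white in enumerate(is_gap):
        if white and start is None:
            start = y
        elif not white and start is not None:
            if y - start >= min_peak_group_size:
                gaps.append((start, y))
            start = None
    if start is not None and len(is_gap) - start >= min_peak_group_size:
        gaps.append((start, len(is_gap)))

    # Text lines lie between consecutive white-space gaps; the A* search is then run
    # inside each band to trace a clean separating path when neighbouring lines touch.
    return [(top_end, bottom_start)
            for (_, top_end), (bottom_start, _) in zip(gaps[:-1], gaps[1:])]
```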
After line segmentation, we align the segmented line images with their corresponding ground truth text labels for training.
Data Augmentation
Augmentation helps the model generalize better by exposing it to varied versions of the same images during training. Different augmentations (such as Gaussian noise, optical distortion, CLAHE, affine, perspective, and elastic transforms) are applied on the fly, each with a probability 'p', which effectively increases the size and diversity of the training data.
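One way to implement such an on-the-fly pipeline is with the albumentations library, as sketched below. The transforms mirror the list above, but the parameters and probabilities are placeholders, not the project's tuned values.

```python
import albumentations as A

# Each transform fires independently with probability p, so every epoch sees
# a slightly different version of the same line images.
train_transform = A.Compose([
    A.GaussNoise(p=0.3),
    A.OpticalDistortion(p=0.3),
    A.CLAHE(p=0.3),
    A.Affine(scale=(0.95, 1.05), rotate=(-2, 2), p=0.3),
    A.Perspective(scale=(0.02, 0.05), p=0.3),
    A.ElasticTransform(alpha=1, sigma=25, p=0.3),
])

# augmented = train_transform(image=line_image)["image"]
```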
Vision Transformers (ViT)
Transformers represent a paradigm shift. Unlike specialized architectures such as LSTMs and CNNs, transformers are designed as highly general computation frameworks. They offer great flexibility, computing connections between inputs dynamically and on the fly. This generality allows transformers to learn inductive biases from data, potentially discovering more optimal patterns than those pre-defined by human designers. With the availability of massive datasets, transformers can now train effectively, often outperforming architectures with built-in inductive biases.
Transformers, through their attention mechanism, excel at handling sets of tokens. However, attention is a quadratic operation that computes pairwise inner products between tokens, resulting in a high computational cost, especially for long sequences. This issue is more pronounced in images, where transformers must process a large number of pixels, leading to an infeasible number of connections.
Vision Transformers (ViT) address this by operating on image patches instead of individual pixels, reducing the computational load. Images are split into 16x16 patches, which are unrolled into vectors and processed by the transformer along with positional embeddings. These embeddings provide spatial information to the model, compensating for the transformer's permutation invariance and ensuring it remains aware of where each patch sits within the image.
Here’s a breakdown of the ViT architecture:
1. Patch Extraction and Embedding:
The image is divided into fixed-size patches (e.g., 16x16 pixels). Each patch is flattened into a vector. A linear projection is applied to these vectors to reduce dimensionality and prepare them for the transformer.
2. Positional Embeddings:
Learnable positional embeddings are added to the patch embeddings to provide spatial information. Unlike some positional encoding schemes that use sophisticated methods, ViT uses a simpler approach where each patch position is associated with a learnable vector.
3. Transformer Encoder:
The sequence of patch embeddings, combined with their positional embeddings, is fed into a standard transformer encoder. The encoder consists of multiple layers of multi-head self-attention and feed-forward neural networks. The transformer can attend to patches globally, allowing it to capture long-range dependencies within the image.
4. Transformer Decoder:
For text recognition, a decoder is added to the plain ViT encoder (as in TrOCR): once the encoder has processed the image patches, the decoder transforms the encoded image features into textual output. It uses masked self-attention to preserve the autoregressive order by attending only to previous tokens, and cross-attention to focus on the relevant encoded image features. This ensures the generation of accurate and contextually coherent text from the visual input.
5. Training and Transfer Learning:
ViT models are pre-trained on large datasets like JFT-300M and then fine-tuned on smaller datasets such as ImageNet. Pre-training on vast amounts of data helps the model learn robust features that transfer well to various downstream tasks.
Vision Transformer (ViT) models end up learning in a manner quite similar to Convolutional Neural Networks (CNNs), despite not being explicitly designed to do so. One significant advantage is evident from the very beginning of the network: some attention heads immediately attend over most of the image, allowing the model to relate elements that are far apart even at low network depths and giving superior performance in capturing long-range dependencies in the data.
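As a minimal usage sketch, a ViT-encoder/text-decoder model of this kind can be instantiated with the Hugging Face transformers library as shown below. The checkpoint name and image path are examples, not necessarily those used in this project.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# TrOCR-style setup: a ViT image encoder paired with an autoregressive text decoder.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

line_image = Image.open("line_0001.png").convert("RGB")
pixel_values = processor(images=line_image, return_tensors="pt").pixel_values  # patchified image

generated_ids = model.generate(pixel_values, max_new_tokens=128, num_beams=4)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```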
Fine-Tuning Strategies
The training strategy focuses on efficiently utilizing the datasets provided to us and the available computational resources to build robust models. It involves several components that optimize performance and ensure effective learning of features from the image data; a combined training-loop sketch follows this list.
1. Gradient Accumulation:
This involves accumulating gradients over multiple mini-batches before performing an update to the model weights. It allows effective training with small per-step batch sizes without exceeding memory constraints while still producing stable gradient estimates, improving model convergence and stability.
2. Optimizers:
AdamW is a variant of the Adam optimizer that treats weight decay as a separate term, decoupling it from the gradient-based update. This helps improve generalization and prevents overfitting by effectively regularizing the model weights, leading to better performance on unseen data.
3. Schedulers (Cosine Annealing Scheduler / Cosine Scheduler):
The Cosine Scheduler increases the learning rate linearly from 0 to the initial value during the warm-up phase, then applies a cosine decay, reducing the learning rate toward 0 over the remaining steps. In contrast, Cosine Annealing periodically decays and resets the learning rate in cycles, promoting better exploration during training.
This helps the model avoid poor local minima and better explore the loss landscape, prevents premature convergence, and improves the model's ability to handle complex patterns, leading to more accurate recognition and better generalisation.
4. Regularization:
Regularization techniques such as dropout (randomly deactivating neurons) and weight decay (penalizing large weights) are used to prevent overfitting and improve generalization.
Label Smoothing redistributes a small portion of the probability mass (ϵ) from the true class to the other classes, instead of using hard one-hot encoded labels. It makes the logits less extreme, reducing the gap between them, which makes the model less confident in its predictions.
Temperature Scaling involves scaling the logits by dividing them by a scalar parameter T (temperature). A higher value of T reduces the confidence of the model’s predictions by flattening the softmax output, while lower values increase it. The parameter T is learned on a validation set to improve the alignment between predicted confidence scores and true probabilities, addressing model overconfidence without changing the learned weights.
While both methods aim to reduce overconfidence and improve generalization, label smoothing directly modifies the training process by altering the target distribution, whereas temperature scaling adjusts the output logits post-training for better calibration.
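The sketch below wires these pieces together: AdamW with decoupled weight decay, a cosine schedule with linear warm-up, gradient accumulation, and a label-smoothed cross-entropy loss. It assumes `model` (e.g. the TrOCR model above) and `train_loader` are defined elsewhere, and all hyperparameter values are placeholders.

```python
import torch
from torch.nn import CrossEntropyLoss
from transformers import get_cosine_schedule_with_warmup

num_epochs = 10        # placeholder
accum_steps = 4        # gradient accumulation: update weights every 4 mini-batches
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=(len(train_loader) * num_epochs) // accum_steps,
)
criterion = CrossEntropyLoss(label_smoothing=0.1, ignore_index=-100)

model.train()
for epoch in range(num_epochs):
    for step, batch in enumerate(train_loader):
        outputs = model(pixel_values=batch["pixel_values"], labels=batch["labels"])
        logits = outputs.logits
        # Label-smoothed loss computed from the logits (the model's built-in loss is ignored).
        loss = criterion(logits.reshape(-1, logits.size(-1)), batch["labels"].reshape(-1))
        (loss / accum_steps).backward()            # accumulate gradients

        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()                       # AdamW with decoupled weight decay
            scheduler.step()                       # cosine decay after linear warm-up
            optimizer.zero_grad()
```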
Loss Function
Various loss functions were experimented with and implemented to help the model fine-tune across different data distributions and to enhance both its accuracy and generalizability.
1. Beam Search:
Beam search is a decoding strategy integrated into sequence generation models that maintains a fixed number of top candidate sequences at each step (the beam width) to optimize output quality. The beam objective was refined with techniques such as length normalization and a normalized log-likelihood scoring function. These refinements reduce numeric underflow, which occurs when multiplying small probabilities across long sequences, and reduce the bias towards short outputs, yielding more coherent and contextually relevant sequences during generation.
2. Focal Loss:
Focal Loss is a modification of cross-entropy loss designed to address class imbalance by focusing more on hard-to-classify examples. It introduces a scaling factor (1 − p_t)^γ, where p_t is the predicted probability of the true class and γ is a tunable focusing parameter. This factor down-weights the contribution of easily classified examples, allowing the model to focus more on difficult or misclassified instances.
In cases of imbalanced datasets, such as historic text recognition where rare characters or distorted patterns occur, Focal Loss helps the model prioritize challenging examples, improving performance on less frequent or difficult patterns without being overwhelmed by easy cases. This is particularly useful in addressing inconsistencies, like irregular characters or uncommon symbols in historical documents.
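A minimal focal-loss sketch over flattened decoder logits, assuming padded positions are marked with -100, could look like this; γ = 2 is just the commonly used default, not necessarily the value tuned for this project.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0,
               ignore_index: int = -100) -> torch.Tensor:
    """Illustrative focal loss: FL(p_t) = -(1 - p_t)^gamma * log(p_t)."""
    log_probs = F.log_softmax(logits, dim=-1)                 # (N, vocab)
    mask = targets != ignore_index
    safe_targets = targets.clamp(min=0)                       # keep gather() valid on padding
    log_pt = log_probs.gather(-1, safe_targets.unsqueeze(-1)).squeeze(-1)
    pt = log_pt.exp()
    loss = -((1.0 - pt) ** gamma) * log_pt                    # down-weight easy, confident tokens
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```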
Model Calibration
Conditional language generation models typically rely on datasets that provide only a single target sequence per context. The model therefore assigns high probability to plausible sequences without any direct supervision that compares different potential outputs, relying heavily on its generalisation capabilities. The result is un-calibrated sequence likelihood: model probabilities do not accurately reflect the true likelihood of generated sequences, causing a poor correlation between sequence probability and quality, referred to as the deterministic target distribution issue. The problem is further aggravated by exposure bias, where models are trained solely on ground-truth sequences and receive no feedback on generated alternatives, making calibration necessary to align model probabilities with true sequence likelihoods and ensure more reliable predictions.
Several heuristics, such as label smoothing, beam search decoding, length normalization, and trigram blocking, have been implemented. However, these methods fall short: they rely on indirect supervision that influences predictions without explicitly optimising the model for accurate probability distributions, leaving the problem of uncalibrated sequence likelihood unresolved.
Sequence Likelihood Calibration (SLiC)
SLiC represents a natural extension of the current pretraining and fine-tuning paradigm, offering a more calibrated and robust approach to sequence generation, by calibrating sequence likelihoods directly in the model’s latent space.
Key Components
1. Decoding Candidates:
SLiC starts by decoding multiple candidate sequences from a fine-tuned model using standard decoding techniques like beam search or nucleus sampling. These candidate sequences are then used as the foundation for further calibration.
2. Similarity Function:
The similarity function measures the alignment between a candidate output and the target sequence by comparing their decoder output hidden-state representations. It computes cosine similarities over token spans and aggregates them using a modified F-measure, focusing on spans of varying lengths. Unlike external metrics such as ROUGE or BERTScore, this method is efficient, context-aware, and avoids overfitting to imperfect evaluation metrics by using the model's own decoder outputs.
3. Positive and Negative Candidates:
These are sequences generated by the model that are more or less similar to the target sequence according to the similarity function above. A high similarity score indicates that a candidate is close to the true sequence in content or structure (positive candidates), while a lower score indicates that the candidate deviates more from the true sequence (negative candidates).
4. Calibration Loss:
The calibration loss is designed to align the sequence likelihood of the model's decoded candidates with their similarity to the target sequence. Given the context, the target sequence, and a set of candidate sequences, four distinct loss types are considered:
Rank Loss: Ensures positive candidates rank higher than negative ones.
Margin Loss: Increases the probability gap between positive and negative candidates.
List-wise Rank Loss: Optimizes the ranking of a list of candidates.
Expected Reward Loss: Maximizes expected similarity across a list of candidates.
For the implementation, Rank Loss and Margin Loss are selected.
5. Regularization Loss:
Regularization losses are applied to prevent the model from deviating significantly from the original MLE objective.
KL Divergence is used here to measure the divergence between the predicted probability distribution and the target distribution, helping to refine the model’s predictions.
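Below is a rough, deliberately simplified sketch of how the selected rank and margin losses could be combined with a KL regularizer. It assumes per-candidate sequence log-likelihoods and similarity scores have already been computed, and it does not reproduce the exact formulation of the SLiC paper.

```python
import torch
import torch.nn.functional as F

def calibration_loss(cand_logprobs, cand_sims, student_logits, teacher_logits,
                     margin: float = 1.0, kl_weight: float = 0.5):
    """Simplified SLiC-style objective: rank loss + margin loss + KL regularization.

    cand_logprobs: (num_candidates,) sequence log-likelihoods under the model
    cand_sims:     (num_candidates,) similarity of each candidate to the target
    """
    # Split candidates into positives / negatives by similarity to the target.
    order = torch.argsort(cand_sims, descending=True)
    half = len(order) // 2
    pos, neg = cand_logprobs[order[:half]], cand_logprobs[order[half:]]

    # Rank loss: every positive candidate should score higher than every negative one.
    rank_loss = F.relu(neg.unsqueeze(0) - pos.unsqueeze(1)).mean()
    # Margin loss: enforce a gap between positive and negative likelihoods.
    margin_loss = F.relu(margin - (pos.mean() - neg.mean()))

    # KL regularization keeps the calibrated model close to the original MLE-trained model.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.log_softmax(teacher_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    return rank_loss + margin_loss + kl_weight * kl
```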
Evaluation Metrics
The model’s performance was thoroughly assessed using CER, WER, and BLEU metrics for benchmarking.
- Character Error Rate (CER): Assesses the rate of erroneous characters in the predicted text compared to the ground-truth text. Generally, a CER below 10% is considered good OCR performance.
- Word Error Rate (WER): Measures the rate of erroneous words in the predicted text compared to the ground-truth text. A lower WER indicates better performance in recognizing words accurately. WER is typically at least twice the CER.
- BLEU (Bilingual Evaluation Understudy): Used for evaluating machine-generated text by comparing n-grams in the generated text to those in reference texts. It measures n-gram precision and applies a brevity penalty to avoid favoring overly short outputs. Scores range from 0 to 1, with 1 indicating a perfect match. It is widely used for its effectiveness in capturing the adequacy and fluency of generated text.
- Levenshtein Distance: Quantifies the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another, providing insights into dissimilarity between predicted and ground truth text. The higher the distance number, the more different the two strings are.
These metrics were continuously monitored during the training process, guiding adjustments to hyperparameters and training strategies to maximize the model’s accuracy.
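These metrics can be computed with off-the-shelf libraries; the snippet below uses jiwer, sacrebleu, and python-Levenshtein as one possible choice, with made-up example strings.

```python
import jiwer
import sacrebleu
import Levenshtein   # pip install python-Levenshtein

prediction = "en el principio crio dios los cielos"
reference  = "en el principio crió dios los cielos"

cer = jiwer.cer(reference, prediction)            # character error rate
wer = jiwer.wer(reference, prediction)            # word error rate
bleu = sacrebleu.corpus_bleu([prediction], [[reference]]).score / 100   # scaled to 0..1
distance = Levenshtein.distance(prediction, reference)

print(f"CER {cer:.3f} | WER {wer:.3f} | BLEU {bleu:.3f} | edit distance {distance}")
```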
Results
The TrOCR model achieved notable improvements in OCR performance, especially for historical Spanish texts. The current performance metrics of the model are:
Character Error Rate (CER) ~ 0.03 (97% accuracy)
Word Error Rate (WER) ~ 0.07 (93% accuracy)
Evaluating the model on an unseen page from the Paredes book
Conclusion
The TrOCR model represents a significant advancement in Optical Character Recognition, with transformers offering remarkable flexibility, especially in challenging contexts like historical document recognition. These models effectively manage degraded text, diverse fonts, irregular layouts, and multilingual scripts. By dynamically focusing on relevant parts of an image, regardless of distortions or non-standard formats, transformers excel at generalizing across varied structures, learning from limited or noisy data, and supporting multilingual and multi-script recognition. Their capability to efficiently handle large-scale datasets makes them a powerful tool for digitizing and preserving vast historical archives, ensuring both accuracy and adaptability across a wide range of document conditions.
Overall, the advancements realized through this project not only enhance the accessibility and preservation of historical texts but also contribute to the broader field of OCR technology. The successful application of Transformer models in this domain highlights their potential to address the unique challenges posed by historical documents, paving the way for future innovations in text recognition and digitization. The framework can be extended to other historical languages by creating domain-specific hybrid datasets and fine-tuning models to accommodate diverse script structures and typographic conventions.
For anyone interested in exploring the implementation or contributing further, the code is available on GitHub: Arsh Khan Transformer OCR.
You can also find more details about this project on the GSoC platform: GSoC 2024 Project.
References
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).
- Surinta, O., Holtkamp, M., Karabaa, F., Van Oosten, J.-P., Schomaker, L., & Wiering, M. (2014). A path planning for line segmentation of handwritten documents. In 2014 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), Hersonissos, Greece (pp. 175–180). doi: 10.1109/ICFHR.2014.37.
- Zhao, Y., Khalman, M., Joshi, R., Narayan, S., Saleh, M., & Liu, P. J. (2022). Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations.