Document Translation Using Attention, EAST, and Tesseract
Translating documents from Italian to English using Attention, EAST, and Tesseract.
Table of Contents
- Introduction
- Problem statement
- My approach and Model pipeline
- Text Translation
- Text Detection
- Text Recognition
- Structuring Translated Text
- Results
- Summary
- Deployment
- Future Work
- References
Introduction
In the current era of machine learning and deep learning, many case studies have been solved and implemented with state-of-the-art Artificial Intelligence techniques. One well-known problem is language translation of a given document. Looking up each word in Google Translate is a tedious process, and we may also lose the exact context of the text. Many companies have invested millions to solve this.
Problem Statement
Document translation is a common deep learning task. Documents such as invoices, letters, certificates, and handbooks in a foreign language are hard to understand and can cause a lot of trouble. Here we implement a solution that translates a given document from Italian to English. The translated text is laid out the same way as in the original image, which helps the reader verify it against the original document.
My Approach and Model pipeline
We will use four steps to translate a document.
1. Text Detection
2. Text Recognition
3. Text Translation
4. Structuring Translated Text
Let's start with text translation, since it is a separate module in this case study.
Text Translation
Data Overview
We collected the Italian-English translation dataset from http://www.manythings.org/anki/. It is a text file that also includes a few extra fields we do not need.
After cleaning, we have a total of 340,432 Italian-to-English translation pairs.
Exploratory Data Analysis
Let's first look at the word counts of the Italian sentences.
The bar plot above shows that nearly 70,000 Italian sentences have a word count of five. Very few sentences have very high word counts.
Now let's plot histograms of the word counts for both the Italian and English text.
The histograms of Italian and English sentence word counts almost overlap, which suggests the word counts of the two languages are linearly related. The number of sentences with 0 to 15 words is very high for both languages.
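For reference, these histograms can be reproduced with matplotlib. This is a minimal sketch; ita_sentences and eng_sentences are illustrative names for the cleaned sentence lists.

```python
import matplotlib.pyplot as plt

# Illustrative names: lists of cleaned Italian / English sentences.
ita_counts = [len(s.split()) for s in ita_sentences]
eng_counts = [len(s.split()) for s in eng_sentences]

plt.hist(ita_counts, bins=50, alpha=0.5, label='Italian')
plt.hist(eng_counts, bins=50, alpha=0.5, label='English')
plt.xlabel('Words per sentence')
plt.ylabel('Number of sentences')
plt.legend()
plt.show()
```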
Data Preprocessing
Below are the steps we followed to preprocess both the Italian and English text.
a. Convert all characters to lowercase.
b. Convert all input text from Unicode to ASCII.
c. Remove all characters except (a-z, A-Z, 0–9, “.”, “?”, “!”, “,”)
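A minimal sketch of these three steps in Python (the exact cleaning code in the project may differ slightly; the function name is illustrative):

```python
import re
import unicodedata

def preprocess(sentence):
    """Lowercase, convert Unicode to ASCII, and keep only the allowed characters."""
    sentence = sentence.lower().strip()
    # Decompose accented characters and drop the combining marks, e.g. "perché" -> "perche".
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence)
                       if unicodedata.category(c) != 'Mn')
    # Separate punctuation from words so it becomes its own token.
    sentence = re.sub(r'([.?!,])', r' \1 ', sentence)
    # Remove everything except a-z, 0-9 and the four punctuation marks
    # (the text is already lowercased at this point).
    sentence = re.sub(r'[^a-z0-9.?!,]+', ' ', sentence)
    return re.sub(r'\s+', ' ', sentence).strip()

print(preprocess('Perché no?'))  # -> 'perche no ?'
```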
Split the preprocessed dataset into train, test, and validation sets. Now let's check the maximum word count in our train dataset (Italian and English sentences).
The maximum word count of an Italian sentence is 104, but such lengths are very rare (only 2 data points). Hence we selected 55 as the maximum length for Italian text when performing tokenization and padding. Similarly, for English we chose 51 as the maximum length.
We then perform tokenization and padding on the train and test datasets.
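A sketch of this step with the Keras tokenizer, assuming train_ita holds the preprocessed Italian training sentences (the variable names are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN_ITA, MAX_LEN_ENG = 55, 51      # chosen from the word-count analysis above

ita_tokenizer = Tokenizer(filters='')  # text is already cleaned, keep punctuation tokens
ita_tokenizer.fit_on_texts(train_ita)  # fit the vocabulary on the training split only

train_ita_seq = ita_tokenizer.texts_to_sequences(train_ita)
train_ita_pad = pad_sequences(train_ita_seq, maxlen=MAX_LEN_ITA, padding='post')
# The same is repeated for English and for the test/validation splits,
# reusing the tokenizers fitted on the training data.
```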
Deep learning Models
Here we used both an Attention and a Transformer architecture and compared their BLEU scores on the test dataset.
BLEU (Bilingual Evaluation Understudy) measures how close an automatic translation is to one or more human-created reference translations of the same source sentence. For a quick note refer here.
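As an illustration, a sentence-level BLEU score can be computed with NLTK (this is one common way to score it; whether the original experiments used this exact smoothing is an assumption):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['i', 'am', 'very', 'happy']]   # tokenized human reference translation(s)
candidate = ['i', 'am', 'so', 'happy']       # tokenized model output

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 2))
```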
Attention Model and architecture
Below is the attention mechanism we used for translation. For the detailed architecture, refer to this excellent blog.
We used custom models (encoder, onestepdecoder, and decoder) and a custom attention layer to train our attention model. We also used a custom loss function that considers only the real tokens and ignores the padded ones. Once the attention model was trained, we used it to predict on the test dataset through a predict function that takes an Italian sentence along with the encoder and onestepdecoder objects from the trained model and returns the translated English sentence.
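A minimal sketch of such a masked loss in TensorFlow (the project's exact loss may differ; this version assumes the padding token id is 0):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def masked_loss(y_true, y_pred):
    """Cross-entropy averaged over real tokens only; padded positions are ignored."""
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)  # 1 for words, 0 for padding
    per_token_loss = loss_object(y_true, y_pred) * mask
    return tf.reduce_sum(per_token_loss) / tf.reduce_sum(mask)
```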
This model has a total of 14,646,519 trainable parameters. After training, we translated 1,000 random test Italian sentences and achieved an average BLEU score of 0.66.
Here are a few results from this model. We plotted the attention weights for each input sentence.
For the first sentence, the model's translation matches the reference exactly, with a BLEU score of 1.0. In the second example the translated and reference texts are also the same except for 'to', which is missing from our translation. This has no real impact, as both sentences mean the same thing; here we got a BLEU score of 0.61.
Transformer Model and architecture
We also built a Transformer model for translation. The Transformer architecture has five important blocks, listed below.
a. Encoder & decoder Stack
b. Self Attention
c. Feed-Forward Network
d. Embedding and Softmax
e. Positional Encoding
Please refer to this for the detailed architecture.
The key to the Transformer architecture is scaled dot-product attention, which is used in the self-attention layers.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Here Q is the query matrix, K is the key matrix, and V is the value matrix; dₖ is the dimension of the keys.
We used 6 layers in the encoder and decoder stacks and 8 attention heads. Similar to our previous attention model, we used custom layers (Encoder, MultiheadAttention, Decoder, and transformer) to train the model, along with the same custom loss as in the attention model.
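A sketch of scaled dot-product attention exactly as in the formula above (the optional padding mask argument is an assumption; masked positions are pushed towards zero attention weight):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)      # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(dk)              # scale by sqrt(d_k)
    if mask is not None:
        logits += (mask * -1e9)                        # masked positions get ~0 weight
    weights = tf.nn.softmax(logits, axis=-1)           # attention weights over the keys
    return tf.matmul(weights, v), weights
```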
We trained this model for 30 epochs and saved the model weights. It has a total of 9,441,855 trainable parameters (fewer than our attention model). We evaluated it on the same 1,000 test data points used for the attention model and achieved an average BLEU score of 0.605.
Here are a few results from this Transformer model.
As we used 8 attention heads, we have eight sets of attention weights. We plotted the weights from the 6th decoder layer (we can choose any decoder layer to inspect the translation process).
In both of the examples above the model translated quite well, but the BLEU score is lower because the words in the reference and predicted text do not match exactly.
Model Comparison
With the architectures we used, the attention model outperformed the Transformer. We re-evaluated both models on a different set of 1,000 test data points and the attention model performed better there too. Hence we chose the attention model for text translation in our problem.
Also, since this is a custom model implementation, we saved the model using checkpoints (tf.train.Checkpoint).
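A sketch of how this checkpointing typically looks (the tracked object names are illustrative):

```python
import tensorflow as tf

# `encoder`, `decoder` and `optimizer` stand in for the trained custom objects.
checkpoint = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, './checkpoints', max_to_keep=3)

manager.save()                                 # called after training / every few epochs
checkpoint.restore(manager.latest_checkpoint)  # called once at inference time
```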
Text Detection
Text detection is the first step when a new document is given for translation. Incorrect text detection can miss words or sentences and lead to a failed translation. For text detection we used the pre-trained EAST model (An Efficient and Accurate Scene Text Detector).
The EAST model has three stages: a feature extractor (stem), a feature-merging branch, and an output layer.
The feature extractor is a convolutional network trained on ImageNet. It has four blocks that output feature maps at four different scales of the input image (1/32, 1/16, 1/8, and 1/4).
The feature-merging branch first unpools the last output of the feature extractor, concatenates it with the current feature map, and passes the result through a 1×1 convolution layer followed by a 3×3 convolution layer.
In the output layer, several 1×1 convolution operations project 32 channels of feature maps into a one-channel score map and a multi-channel geometry map.
We used pretrained EAST model from https://github.com/oyyd/frozen_east_text_detection.pb.
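A minimal sketch of running this frozen graph with OpenCV's dnn module (the two output layer names and the 320×320 input size follow the common OpenCV EAST recipe; the thresholds and sizes used in the project may differ):

```python
import cv2

net = cv2.dnn.readNet('frozen_east_text_detection.pb')

image = cv2.imread('document.jpg')
# EAST expects an input width/height that are multiples of 32.
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])
# `scores` holds the per-location text confidence and `geometry` the box
# distances/angle; these are decoded and filtered with non-maximum suppression.
```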
The EAST model was originally implemented in C++. As we used Python, we were not able to generate rotated bounding boxes when the detected text is not aligned with the X axis. Hence, for rotated text we used the same rectangular bounding boxes with start and end X and Y geometries.
Also, the EAST model gives us a bounding box for each word, but for translation we need the whole sentence at once before passing it to our text translator.
Here is the pseudocode we used to handle the above constraints.
The same pseudocode can be referred to here.
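The idea behind merging word boxes into lines can be sketched as follows (a simplified version with illustrative thresholds; the real heuristics live in the linked code and assume boxes given as (startX, startY, endX, endY)):

```python
def merge_boxes_into_lines(boxes, y_tol=10, x_gap=20):
    """Group word boxes into text lines, left to right, top to bottom."""
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))  # sort top-to-bottom, then left-to-right
    lines = []
    for box in boxes:
        for line in lines:
            last = line[-1]
            same_row = abs(box[1] - last[1]) <= y_tol  # tops roughly aligned -> same text row
            close_enough = box[0] - last[2] <= x_gap   # small horizontal gap to previous word
            if same_row and close_enough:
                line.append(box)
                break
        else:
            lines.append([box])                        # no matching line: start a new one
    return lines
```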
Text Recognition
For text recognition we used pytesseract, an optical character recognition (OCR) tool for Python. That is, it recognizes and “reads” the text embedded in images.
Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine.
Please refer to this documentation for more details on pytesseract (installation, functions, and parameters).
Here is the pseudocode we used for recognizing horizontal and rotated text.
The same pseudocode can be referred to here.
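A sketch of the recognition call for one detected line (rotated crops would first be deskewed, for example with an affine warp; lang='ita' assumes the Italian language pack is installed for Tesseract):

```python
import pytesseract

def recognize_line(image, box, lang='ita'):
    """OCR one detected line; `box` is (startX, startY, endX, endY)."""
    startX, startY, endX, endY = box
    crop = image[startY:endY, startX:endX]
    # --psm 7 tells Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(crop, lang=lang, config='--psm 7').strip()
```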
Once we have all the bounding boxes (BBs) and their recognized text, we merge the texts and store each BB's start and end geometries, angle, and word position in the merged sentence. We then pass these merged Italian sentences to our text translation model and get the translated English text.
Structuring Translated Text
In this stage we structure the translated text so that it can be presented to the reader in the same layout as the original document.
If the original document contains multiple lines of text, the translated text should also span multiple lines, which helps the reader understand and compare the original and translated documents.
The steps below are followed to structure the translated text.
- Break the translated text so that it fits across multiple lines.
- Replace the original text area with a mask so that the original text is covered by a blank patch, and then write the translated text onto this mask. The mask is generated using color detection on the selected BB: the most used/average BGR color found inside the BB is used, so the mask looks the same as the original image. For more on color detection, refer to this blog. A simplified sketch follows this list.
- The original and translated texts may not be the same length, so we adjust the translated text size to fit the original text space. Also, if the input text is rotated, we write the translated text at the same rotation angle.
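Here is that sketch of the masking and re-writing step with OpenCV (font scaling and rotation handling are omitted; the mask color is simply the mean BGR value inside the box, as described above):

```python
import cv2

def write_translation(image, box, text):
    """Cover the original text with its background color, then draw the translation."""
    startX, startY, endX, endY = box
    roi = image[startY:endY, startX:endX]
    bg = roi.reshape(-1, 3).mean(axis=0)                    # average BGR color inside the box
    cv2.rectangle(image, (startX, startY), (endX, endY),
                  tuple(int(c) for c in bg), thickness=-1)  # filled mask over the old text
    cv2.putText(image, text, (startX, endY - 4),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1, cv2.LINE_AA)
    return image
```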
Results
Let's translate a few Italian documents. For easy understanding, we created our own texts and documents with different text alignments.
Observation
From the above results we can see that the model performs decently. It can handle multi-paragraph Italian text as well as rotated single-line text, but it fails when there are multiple rotated text lines. As discussed in Text Detection, since we use pre-trained EAST in Python we cannot produce rotated bounding boxes, so we had to use large horizontal rectangles for rotated text, which makes those boxes overlap with others.
Here are the times the model took to process the above documents.
The above screenshot is from Spyder. For the first two documents the model took less than 1.5 seconds, as they contain very little text. Documents 3, 4, 5, and 6 contain a lot of Italian text in multiple paragraphs, so they took more time to translate.
Summary
- Given any Italian document, we are able to translate it into English. The pipeline used for translation is: Text Detection -> Text Recognition -> Text Translation -> Restructuring Text.
- For text detection we used pre-trained EAST, pytesseract for text recognition, and the attention model for text translation.
- We used a few thresholds based on our experiments (bounding box confidence, rotation angle, etc.). These can be altered as required.
- Constraint: multi-line rotated text fails to translate with this model.
Deployment
This model is deployed on an AWS EC2 (t2.large) instance using the Streamlit framework.
As we require TensorFlow, Tesseract, OpenCV, and other packages, we chose a t2.large instance for better performance. It provides 8 GB of RAM and 2 vCPUs.
Here is a video clip of the model performance on EC2 instance.
Future Work
- Instead of using EAST and Tesseract, we will try FOTS (Fast Oriented Text Spotting with a Unified Network), which can speed up our task, since it treats text detection and recognition as a single task.
- We need to come up with different techniques for detecting bounding boxes, so that we can build differently shaped boxes around text at any angle and drawing these boxes will not overlap other boxes.
- This project can be extended further to translate live text using FOTS with the different bounding box shapes mentioned above.
References
- https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
- https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
- http://www.manythings.org/anki/ita-eng.zip
- https://arxiv.org/pdf/1706.03762.pdf
- http://jalammar.github.io/illustrated-transformer/
- https://arxiv.org/pdf/1704.03155.pdf
- https://medium.com/generalist-dev/background-colour-detection-using-opencv-and-python-22ed8655b243
- https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
- https://pypi.org/project/pytesseract/
- https://www.youtube.com/watch?v=jJTa625q85o&t=531s
- https://www.youtube.com/watch?v=WJpL-krgmqs&t=1612s
GitHub repository
My whole case study can be accessed through the following GitHub repository.