Document Translation Using Attention, EAST, and Tesseract
Translating documents from Italian to English using Attention, EAST, and Tesseract.
Table of Contents
- Introduction
- Problem statement
- My approach and Model pipeline
- Text Translation
- Text Detection
- Text Recognition
- Structuring Translated Text
- Results
- Summary
- Deployment
- Future Work
- References
Introduction
In the current era of machine learning and deep learning, many case studies have been solved and implemented with state-of-the-art Artificial Intelligence techniques. One well-known problem is language translation of a given document. Looking up each word in Google Translate is a tedious process, and we may also lose the exact context of the text. Many companies have invested millions to solve this.
Problem Statement
Document translation is a common deep learning task. Documents such as invoices, letters, certificates, and handbooks in a foreign language are hard to understand and can cause a lot of trouble. Here we implement a solution that translates a given document from Italian to English. The translated text is laid out the same way as in the original image, which helps the reader verify it against the original document.
My Approach and Model pipeline
We will use four steps to translate a document.
1. Text Detection
2. Text Recognition
3. Text Translation
4. Structuring Translated Text
Let's start with text translation, since it is a separate module in this case study.
Text Translation
Data Overview
We collected the Italian-English translation dataset from http://www.manythings.org/anki/. It is a text file that also includes a few extra fields we do not need.
After cleaning, we have a total of 340,432 Italian-to-English translation pairs.
Exploratory Data Analysis
Let's first look at the word counts of the Italian sentences.
The bar plot above shows that nearly 70,000 Italian sentences have a word count of five. Very few sentences have very high word counts.
Now let's plot histograms of the word counts for both the Italian and English text.
The histograms of Italian and English sentence word counts almost overlap, which suggests the word counts of the two languages are linearly related. The number of sentences with 0 to 15 words is very high for both languages.
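For reference, these histograms can be reproduced with matplotlib. This is a minimal sketch; ita_sentences and eng_sentences are illustrative names for the cleaned sentence lists.

```python
import matplotlib.pyplot as plt

# Illustrative names: lists of cleaned Italian / English sentences.
ita_counts = [len(s.split()) for s in ita_sentences]
eng_counts = [len(s.split()) for s in eng_sentences]

plt.hist(ita_counts, bins=50, alpha=0.5, label='Italian')
plt.hist(eng_counts, bins=50, alpha=0.5, label='English')
plt.xlabel('Words per sentence')
plt.ylabel('Number of sentences')
plt.legend()
plt.show()
```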
Data Preprocessing
Below are the steps we followed to preprocess both the Italian and English text.
a. Convert all characters to lowercase.
b. Convert all input text from Unicode to ASCII.
c. Remove all characters except (a-z, A-Z, 0–9, “.”, “?”, “!”, “,”)
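A minimal sketch of these three steps in Python (the exact cleaning code in the project may differ slightly; the function name is illustrative):

```python
import re
import unicodedata

def preprocess(sentence):
    """Lowercase, convert Unicode to ASCII, and keep only the allowed characters."""
    sentence = sentence.lower().strip()
    # Decompose accented characters and drop the combining marks, e.g. "perché" -> "perche".
    sentence = ''.join(c for c in unicodedata.normalize('NFD', sentence)
                       if unicodedata.category(c) != 'Mn')
    # Separate punctuation from words so it becomes its own token.
    sentence = re.sub(r'([.?!,])', r' \1 ', sentence)
    # Remove everything except a-z, 0-9 and the four punctuation marks
    # (the text is already lowercased at this point).
    sentence = re.sub(r'[^a-z0-9.?!,]+', ' ', sentence)
    return re.sub(r'\s+', ' ', sentence).strip()

print(preprocess('Perché no?'))  # -> 'perche no ?'
```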
Split the preprocessed dataset into train, test, and validation sets. Now let's check the maximum word count in our train dataset (Italian and English sentences).
The maximum word count of an Italian sentence is 104, but such lengths are very rare (only 2 data points). Hence we selected 55 as the maximum length for Italian text when performing tokenization and padding. Similarly, for English we chose 51 as the maximum length.
We then perform tokenization and padding on the train and test datasets.
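A sketch of this step with the Keras tokenizer, assuming train_ita holds the preprocessed Italian training sentences (the variable names are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN_ITA, MAX_LEN_ENG = 55, 51      # chosen from the word-count analysis above

ita_tokenizer = Tokenizer(filters='')  # text is already cleaned, keep punctuation tokens
ita_tokenizer.fit_on_texts(train_ita)  # fit the vocabulary on the training split only

train_ita_seq = ita_tokenizer.texts_to_sequences(train_ita)
train_ita_pad = pad_sequences(train_ita_seq, maxlen=MAX_LEN_ITA, padding='post')
# The same is repeated for English and for the test/validation splits,
# reusing the tokenizers fitted on the training data.
```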
Deep learning Models
Here we used both an Attention and a Transformer architecture and compared their BLEU scores on the test dataset.
BLEU (Bilingual Evaluation Understudy) measures how close an automatic translation is to one or more human-created reference translations of the same source sentence. For a quick note refer here.
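As an illustration, a sentence-level BLEU score can be computed with NLTK (this is one common way to score it; whether the original experiments used this exact smoothing is an assumption):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [['i', 'am', 'very', 'happy']]   # tokenized human reference translation(s)
candidate = ['i', 'am', 'so', 'happy']       # tokenized model output

# Smoothing avoids a zero score when a higher-order n-gram has no match.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 2))
```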
Attention Model and architecture
Below is the attention mechanism we used for translation. For the detailed architecture, refer to this excellent blog.
We used custom models (encoder, onestepdecoder, and decoder) and a custom attention layer to train our attention model. We also used a custom loss function that considers only the real tokens and ignores the padded ones. Once the attention model was trained, we used it to predict on the test dataset through a predict function that takes an Italian sentence along with the encoder and onestepdecoder objects from the trained model and returns the translated English sentence.
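A minimal sketch of such a masked loss in TensorFlow (the project's exact loss may differ; this version assumes the padding token id is 0):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def masked_loss(y_true, y_pred):
    """Cross-entropy averaged over real tokens only; padded positions are ignored."""
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)  # 1 for words, 0 for padding
    per_token_loss = loss_object(y_true, y_pred) * mask
    return tf.reduce_sum(per_token_loss) / tf.reduce_sum(mask)
```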
This model has a total of 14,646,519 trainable parameters. After training, we translated 1,000 random test Italian sentences and achieved an average BLEU score of 0.66.
Here are a few results from this model. We plotted the attention weights for each input sentence.
For the first sentence, the model's translation matches the reference exactly, with a BLEU score of 1.0. In the second example the translated and reference texts are also the same except for 'to', which is missing from our translation. This has no real impact, as both sentences mean the same thing; here we got a BLEU score of 0.61.
Transformer Model and architecture
We also built a Transformer model for translation. The Transformer architecture has five important blocks, listed below.
a. Encoder & decoder Stack
b. Self Attention
c. Feed-Forward Network
d. Embedding and Softmax
e. Positional Encoding
Please refer to this for the detailed architecture.
The key to the Transformer architecture is scaled dot-product attention, which is used in the self-attention layers.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Here Q is the query matrix, K is the key matrix, and V is the value matrix; dₖ is the dimension of the keys.
We used 6 layers in the encoder and decoder stacks and 8 attention heads. Similar to our previous attention model, we used custom layers (Encoder, MultiheadAttention, Decoder, and transformer) to train the model, along with the same custom loss as in the attention model.
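A sketch of scaled dot-product attention exactly as in the formula above (the optional padding mask argument is an assumption; masked positions are pushed towards zero attention weight):

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)      # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = matmul_qk / tf.math.sqrt(dk)              # scale by sqrt(d_k)
    if mask is not None:
        logits += (mask * -1e9)                        # masked positions get ~0 weight
    weights = tf.nn.softmax(logits, axis=-1)           # attention weights over the keys
    return tf.matmul(weights, v), weights
```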
We trained this model for 30 epochs and saved the model weights. It has a total of 9,441,855 trainable parameters (fewer than our attention model). We evaluated it on the same 1,000 test data points used for the attention model and achieved an average BLEU score of 0.605.
Here are a few results from this Transformer model.
As we used 8 attention heads, we have eight sets of attention weights. We plotted the weights from the 6th decoder layer (we can choose any decoder layer to inspect the translation process).
In both of the examples above the model translated quite well, but the BLEU score is lower because the words in the reference and predicted text do not match exactly.
Model Comparison
With the architectures we used, the attention model outperformed the Transformer. We re-evaluated both models on a different set of 1,000 test data points and the attention model performed better there too. Hence we chose the attention model for text translation in our problem.
Also, since this is a custom model implementation, we saved the model using checkpoints (tf.train.Checkpoint).
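A sketch of how this checkpointing typically looks (the tracked object names are illustrative):

```python
import tensorflow as tf

# `encoder`, `decoder` and `optimizer` stand in for the trained custom objects.
checkpoint = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, './checkpoints', max_to_keep=3)

manager.save()                                 # called after training / every few epochs
checkpoint.restore(manager.latest_checkpoint)  # called once at inference time
```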
Text Detection
Text detection is the first step when a new document is given for translation. Incorrect text detection can miss words or sentences and lead to a failed translation. For text detection we used the pre-trained EAST model (An Efficient and Accurate Scene Text Detector).
The EAST model has three stages: a feature extractor (stem), a feature-merging branch, and an output layer.
The feature extractor is a convolutional network trained on ImageNet. It has four blocks that output feature maps at four different scales of the input image (1/32, 1/16, 1/8, and 1/4).
The feature-merging branch first unpools the last output of the feature extractor, concatenates it with the current feature map, and passes the result through a 1×1 convolution layer followed by a 3×3 convolution layer.
In the output layer, several 1×1 convolution operations project 32 channels of feature maps into a one-channel score map and a multi-channel geometry map.
We used pretrained EAST model from https://github.com/oyyd/frozen_east_text_detection.pb.
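A minimal sketch of running this frozen graph with OpenCV's dnn module (the two output layer names and the 320×320 input size follow the common OpenCV EAST recipe; the thresholds and sizes used in the project may differ):

```python
import cv2

net = cv2.dnn.readNet('frozen_east_text_detection.pb')

image = cv2.imread('document.jpg')
# EAST expects an input width/height that are multiples of 32.
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94), swapRB=True, crop=False)
net.setInput(blob)
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])
# `scores` holds the per-location text confidence and `geometry` the box
# distances/angle; these are decoded and filtered with non-maximum suppression.
```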
The EAST model was originally implemented in C++. As we used Python, we were not able to generate rotated bounding boxes when the detected text is not aligned with the X axis. Hence, for rotated text we used the same rectangular bounding boxes with start and end X and Y geometries.
Also, the EAST model gives us a bounding box for each word, but for translation we need the whole sentence at once before passing it to our text translator.
Here is the pseudocode we used to handle the above constraints.
The same pseudocode can be referred to here.
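The idea behind merging word boxes into lines can be sketched as follows (a simplified version with illustrative thresholds; the real heuristics live in the linked code and assume boxes given as (startX, startY, endX, endY)):

```python
def merge_boxes_into_lines(boxes, y_tol=10, x_gap=20):
    """Group word boxes into text lines, left to right, top to bottom."""
    boxes = sorted(boxes, key=lambda b: (b[1], b[0]))  # sort top-to-bottom, then left-to-right
    lines = []
    for box in boxes:
        for line in lines:
            last = line[-1]
            same_row = abs(box[1] - last[1]) <= y_tol  # tops roughly aligned -> same text row
            close_enough = box[0] - last[2] <= x_gap   # small horizontal gap to previous word
            if same_row and close_enough:
                line.append(box)
                break
        else:
            lines.append([box])                        # no matching line: start a new one
    return lines
```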
Text Recognition
For text recognition we used pytesseract, an optical character recognition (OCR) tool for Python. That is, it recognizes and “reads” the text embedded in images.
Python-tesseract is a wrapper for Google’s Tesseract-OCR Engine.
Please refer to this documentation for more details on pytesseract (installation, functions, and parameters).
Here is the pseudocode we used for recognizing horizontal and rotated text.
The same pseudocode can be referred to here.
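A sketch of the recognition call for one detected line (rotated crops would first be deskewed, for example with an affine warp; lang='ita' assumes the Italian language pack is installed for Tesseract):

```python
import pytesseract

def recognize_line(image, box, lang='ita'):
    """OCR one detected line; `box` is (startX, startY, endX, endY)."""
    startX, startY, endX, endY = box
    crop = image[startY:endY, startX:endX]
    # --psm 7 tells Tesseract to treat the crop as a single line of text.
    return pytesseract.image_to_string(crop, lang=lang, config='--psm 7').strip()
```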
Once we have all the bounding boxes (BBs) and their recognized text, we merge the texts and store each BB's start and end geometries, angle, and word position in the merged sentence. We then pass these merged Italian sentences to our text translation model and get the translated English text.
Structuring Translated Text
In this stage we structure the translated text so that it can be presented to the reader in the same layout as the original document.
If the original document contains multiple lines of text, the translated text should also span multiple lines, which helps the reader understand and compare the original and translated documents.
The steps below are followed to structure the translated text.
- Break the translated text so that it fits across multiple lines.
- Replace the original text area with a mask so that the original text is covered by a blank patch, and then write the translated text onto this mask. The mask is generated using color detection on the selected BB: the most used/average BGR color found inside the BB is used, so the mask looks the same as the original image. For more on color detection, refer to this blog. A simplified sketch follows this list.
- The original and translated texts may not be the same length, so we adjust the translated text size to fit the original text space. Also, if the input text is rotated, we write the translated text at the same rotation angle.
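Here is that sketch of the masking and re-writing step with OpenCV (font scaling and rotation handling are omitted; the mask color is simply the mean BGR value inside the box, as described above):

```python
import cv2

def write_translation(image, box, text):
    """Cover the original text with its background color, then draw the translation."""
    startX, startY, endX, endY = box
    roi = image[startY:endY, startX:endX]
    bg = roi.reshape(-1, 3).mean(axis=0)                    # average BGR color inside the box
    cv2.rectangle(image, (startX, startY), (endX, endY),
                  tuple(int(c) for c in bg), thickness=-1)  # filled mask over the old text
    cv2.putText(image, text, (startX, endY - 4),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1, cv2.LINE_AA)
    return image
```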
Results
Let's translate a few Italian documents. For easy understanding, we created our own texts and documents with different text alignments.
Observation
From the above results we can see that the model performs decently. It can handle multi-paragraph Italian text as well as rotated single-line text, but it fails when there are multiple rotated text lines. As discussed in Text Detection, since we use pre-trained EAST in Python we cannot produce rotated bounding boxes, so we had to use large horizontal rectangles for rotated text, which makes those boxes overlap with others.
Here are the times the model took to process the above documents.
The above screenshot is from Spyder. For the first two documents the model took less than 1.5 seconds, as they contain very little text. Documents 3, 4, 5, and 6 contain a lot of Italian text in multiple paragraphs, so they took more time to translate.
Summary
- Given any Italian document, we are able to translate it into English. The pipeline used for translation is: Text Detection -> Text Recognition -> Text Translation -> Restructuring Text.
- For text detection we used pre-trained EAST, pytesseract for text recognition, and the attention model for text translation.
- We used a few thresholds based on our experiments (bounding box confidence, rotation angle, etc.). These can be altered as required.
- Constraint: multi-line rotated text fails to translate with this model.
Deployment
This model is deployed on an AWS EC2 (t2.large) instance using the Streamlit framework.
As we require TensorFlow, Tesseract, OpenCV, and other packages, we chose a t2.large instance for better performance. It provides 8 GB of RAM and 2 vCPUs.
Here is a video clip of the model performance on EC2 instance.
Future Work
- Instead of using EAST and Tesseract, we will try FOTS (Fast Oriented Text Spotting with a Unified Network), which can speed up our task, since it treats text detection and recognition as a single task.
- We need to come up with different techniques for detecting bounding boxes, so that we can build differently shaped boxes around text at any angle and drawing these boxes will not overlap other boxes.
- This project can be extended further to translate live text using FOTS with the different bounding box shapes mentioned above.
References
- https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
- https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
- http://www.manythings.org/anki/ita-eng.zip
- https://arxiv.org/pdf/1706.03762.pdf
- http://jalammar.github.io/illustrated-transformer/
- https://arxiv.org/pdf/1704.03155.pdf
- https://medium.com/generalist-dev/background-colour-detection-using-opencv-and-python-22ed8655b243
- https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
- https://pypi.org/project/pytesseract/
- https://www.youtube.com/watch?v=jJTa625q85o&t=531s
- https://www.youtube.com/watch?v=WJpL-krgmqs&t=1612s
GitHub repository
My whole case study can be accessed through the following GitHub repository.