Text Detection on Natural Scenes with Tensorflow Object Detection API

Armand Olivares
Jun 15 · 7 min read

Evaluating the ability of models to localize and identify text in natural scene images.


I am interested in NLP, so I have been playing with some related exercises and projects. In recent days I saw several projects that use object detection, so I decided to try the TensorFlow API. The main objective of this article is to show the construction and evaluation of deep learning models for detecting text in natural images. The model will be able to identify in which images typed text appears and to extract the part of the image that contains it.


Given an image or a video stream, an object detection model can identify which of a known set of objects might be present and provide information about their positions within the image, for example:

In this article I am going to apply object detection to detect text in images. Here you can find a complete guide about object detection.

The dataset used is the well-known MS COCO (Common Objects in Context) Text dataset, COCO-Text, in its 2017 version, which contains 115K images for training and 5K for validation. The COCO-Text API assists in loading and parsing the annotations.

Using the COCO Text API we can see that the data has 4 categories:

  1. Legible, machine printed: images containing readable machine-printed text. Examples of these images:

  2. Illegible, machine printed: images containing illegible machine-printed text:

  3. Illegible, handwritten: images containing illegible handwritten text.

  4. Legible, handwritten: images containing legible handwritten text.

The distribution of each class:

We will select only one class: machine-printed legible text in English.

The first step is to use Python and the COCO-Text API to filter the images whose annotations contain only legible, machine-printed text in English. We choose this class because it predominates, and also to narrow the problem and limit the complexity of the recognition.
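The filtering rule can be sketched in plain Python. This is only an illustration: it does not call the COCO-Text API itself, and it assumes that each annotation is a dict with the keys `legibility`, `class` and `language`. The function name and the example data are made up for this sketch.

```python
def select_machine_legible_english(anns_by_image):
    """Return the IDs of images whose text annotations are ALL legible,
    machine printed and in English.

    `anns_by_image` maps an image ID to a list of annotation dicts.
    The keys and values used below are assumed to follow the COCO-Text
    annotation format; adjust them if your annotation file differs.
    """
    wanted = {
        "legibility": "legible",
        "class": "machine printed",
        "language": "english",
    }
    selected = []
    for image_id, anns in anns_by_image.items():
        if not anns:
            continue  # skip images without any text annotation
        if all(all(ann.get(k) == v for k, v in wanted.items()) for ann in anns):
            selected.append(image_id)
    return sorted(selected)


# Tiny hand-made example (not real COCO-Text data).
example = {
    1: [{"legibility": "legible", "class": "machine printed", "language": "english"}],
    2: [{"legibility": "legible", "class": "handwritten", "language": "english"}],
    3: [
        {"legibility": "legible", "class": "machine printed", "language": "english"},
        {"legibility": "illegible", "class": "machine printed", "language": "na"},
    ],
    4: [],
}
print(select_machine_legible_english(example))
```

Only image 1 passes, because every one of its annotations matches the three conditions.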

For each image in the dataset we need to create an XML file with the following structure:

The generated XML follows the PASCAL VOC format and is necessary for training, validation and testing.
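As a rough sketch of how such a file could be produced with the Python standard library: the fields below follow the commonly used PASCAL VOC layout (`filename`, `size`, `object`, `bndbox`), which is an assumption here, since the exact structure used in this article is not reproduced. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def coco_to_voc_box(bbox):
    """COCO boxes are [x, y, width, height]; VOC uses corner coordinates."""
    x, y, w, h = bbox
    return x, y, x + w, y + h


def make_voc_xml(filename, width, height, boxes, depth=3):
    """Build a PASCAL VOC style annotation as a string.

    `boxes` is a list of (label, xmin, ymin, xmax, ymax) tuples in pixels.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = str(depth)
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in (("xmin", xmin), ("ymin", ymin),
                           ("xmax", xmax), ("ymax", ymax)):
            # VOC coordinates are stored as integer pixel values
            ET.SubElement(bndbox, tag).text = str(int(round(value)))
    return ET.tostring(root, encoding="unicode")


# One machine_legible box given in COCO format [x, y, width, height].
box = coco_to_voc_box([10.0, 20.0, 100.0, 30.0])
xml_text = make_voc_xml("example.jpg", 640, 480, [("machine_legible",) + box])
print(xml_text)
```

The conversion step matters because COCO stores a box as a corner plus width and height, while VOC stores two corners.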

To recap, each TensorFlow model needs the following inputs:

Each dataset is required to have a label map associated with it. This label map defines a mapping from string class names to integer class IDs. In our case the label map is a file called text_label_map.pbtxt, which contains only the class and its associated label:

item {
  id: 1
  name: 'machine_legible'
}
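If more classes were added later, the label map text could be generated instead of written by hand. The helper below is a hypothetical sketch; it only relies on the fact that label map IDs start at 1, because 0 is reserved for the background class.

```python
def build_label_map(class_names):
    """Return label map text with one item per class, IDs starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(items)


label_map_text = build_label_map(["machine_legible"])
print(label_map_text)
```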

For every example in your dataset, you should have the following information:

  • An RGB image encoded as JPEG or PNG.
  • The XML annotation file.
  • The class label map file.

TensorFlow’s Object Detection API is a very powerful tool that can quickly enable anyone to build and deploy powerful image recognition software. The guide for learning and getting help with the API is at this link.

A pre-trained model is a model that was trained on a large benchmark dataset to solve a problem similar to the one we want to solve. The TensorFlow API provides several pre-trained deep learning models (models previously trained on multiple huge datasets for other purposes) to choose from. The selected models are shown below; they were picked for their execution speed and accuracy (mAP).

We selected the above pretrained models to train our custom text detector.

The speed metric (execution speed) is measured in milliseconds (ms), and the COCO mAP metric is the mean of the average precision values (mAP is the standard metric for object detection; like accuracy, higher is better). As the data is annotated with bounding boxes, we will evaluate only models that output boxes (the green square) instead of masks.
In the table above you can see that the fastest model is ssd_mobilenet_v2_coco, but it is also the one with the lowest mAP, according to the official report of the TensorFlow developers.


The entire process to follow is:

  1. Install the TensorFlow Object Detection API by following the instructions here.
  2. Clone the repository: git clone https://github.com/tensorflow/models.git
  3. Download the pretrained models from this link. For example, to download the pretrained Faster R-CNN model:

wget http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_inception_v2_coco_2018_01_28.tar.gz
tar -xvf faster_rcnn_inception_v2_coco_2018_01_28.tar.gz

  4. Create the following structure in the object_detection directory (inside the repository cloned in step 2):

| +--- train_text.sh
| +--- eval_text.sh
| +--- export_model__text.sh
| +object_detection/
| +--- text_label_map.pbtxt
| +---faster_rcnn_inception_v2_coco.config

Then copy the scripts, the label map and the model configuration file into it.

  5. Modify the configuration file of the model to be trained (for example faster_rcnn_inception_v2_coco.config) to set the number of classes (labels) present in the data, the path to the checkpoint files, and the paths to the training and test data.

In this file you can also configure the learning rate, the batch size and other hyperparameters. An example of the file format:

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_inception_v2'
      first_stage_features_stride: 16
    }
    ....

Please review this tutorial for the complete set of properties for this file.

  6. Convert the dataset to TFRecords with this script.
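The conversion script itself is not reproduced here. As a hedged sketch of its core step, the function below reads one PASCAL VOC annotation and builds the box fields that the Object Detection API expects in a TFRecord example, with coordinates normalized to the range 0 to 1. It returns a plain dict and uses only the standard library. A real record would additionally need the encoded image bytes and would be wrapped in a `tf.train.Example`, both of which are left out. The function name and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def voc_to_example_fields(xml_text, label_to_id):
    """Parse a VOC annotation into the per-image fields of a TFRecord example."""
    root = ET.fromstring(xml_text)
    width = int(root.find("size/width").text)
    height = int(root.find("size/height").text)
    fields = {
        "image/filename": root.find("filename").text,
        "image/width": width,
        "image/height": height,
        "image/object/bbox/xmin": [],
        "image/object/bbox/xmax": [],
        "image/object/bbox/ymin": [],
        "image/object/bbox/ymax": [],
        "image/object/class/text": [],
        "image/object/class/label": [],
    }
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        # Pixel coordinates are divided by the image size to normalize them.
        fields["image/object/bbox/xmin"].append(float(box.find("xmin").text) / width)
        fields["image/object/bbox/xmax"].append(float(box.find("xmax").text) / width)
        fields["image/object/bbox/ymin"].append(float(box.find("ymin").text) / height)
        fields["image/object/bbox/ymax"].append(float(box.find("ymax").text) / height)
        fields["image/object/class/text"].append(name)
        fields["image/object/class/label"].append(label_to_id[name])
    return fields


sample_xml = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>machine_legible</name>
    <bndbox><xmin>160</xmin><ymin>120</ymin><xmax>320</xmax><ymax>240</ymax></bndbox>
  </object>
</annotation>
"""
fields = voc_to_example_fields(sample_xml, {"machine_legible": 1})
print(fields["image/object/bbox/xmin"], fields["image/object/class/label"])
```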

  7. In a terminal, execute train_text.sh to train the model. This file is a simple script containing:

# Execute from models/research/
python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=./object_detection/models/text/faster_rcnn_inception_v2_coco.config \
    --train_dir=./object_detection/models/text/train

The script sets the path to the configuration file and the directory where the model checkpoints will be saved.

  8. Simultaneously, in another terminal, execute eval_text.sh to start the evaluation process. This file contains:

export PYTHONPATH=$PYTHONPATH:.:./slim
python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=./object_detection/models/text/faster_rcnn_inception_v2_coco.config \
    --checkpoint_dir=./object_detection/models/text/train \
    --eval_dir=./object_detection/models/text/eval

  9. In another terminal, start TensorBoard to follow the training and evaluation processes. You should see something like this:

  • training after 2000 steps:
  • training after 200K steps:

You can see from the green square that the model has started to correctly identify where the text in the image is located.
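To extract the part of the image that contains text, the predicted boxes first have to be turned into pixel coordinates. The Object Detection API returns boxes as normalized `[ymin, xmin, ymax, xmax]` values, each with a score. The sketch below converts them, and the 0.5 score threshold and the function name are assumptions chosen for illustration.

```python
def boxes_to_pixels(boxes, scores, image_width, image_height, min_score=0.5):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes to pixel
    (left, top, right, bottom) tuples, dropping low-score detections."""
    regions = []
    for (ymin, xmin, ymax, xmax), score in zip(boxes, scores):
        if score < min_score:
            continue  # ignore detections the model is not confident about
        regions.append((
            int(xmin * image_width),
            int(ymin * image_height),
            int(xmax * image_width),
            int(ymax * image_height),
        ))
    return regions


detected = boxes_to_pixels(
    boxes=[[0.25, 0.5, 0.5, 0.75], [0.0, 0.0, 0.1, 0.1]],
    scores=[0.9, 0.2],
    image_width=640,
    image_height=480,
)
print(detected)
```

The (left, top, right, bottom) order was chosen because it is the order that image libraries such as Pillow expect when cropping.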

The TensorFlow Object Detection API uses the “PASCAL VOC 2007 metrics”, where a predicted instance counts as correct when its Intersection over Union (IoU) with the ground truth exceeds 50%. The IoU is calculated from:

  1. The ground-truth bounding boxes (the hand-labeled bounding boxes from the dataset that specify where in the image our object is), shown as the green square in the image below.
  2. The predicted bounding boxes from our model, shown as the red square in the image below.

Dividing the area of overlap by the area of union yields the IoU score. An IoU score > 0.5 is normally considered a “good” prediction.
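A minimal sketch of this calculation for two axis-aligned boxes, assuming each box is given as `(xmin, ymin, xmax, ymax)` in pixels. This is not the API's own implementation, only the formula written out.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    # Corners of the overlapping rectangle, if there is one
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print(iou(ground_truth, prediction))
```

Here the overlap is 50 square pixels and the union is 150, so the IoU is one third and this prediction would not count as correct under the 50% rule.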

Using the IoU, the mean average precision (mAP) is calculated for our models on the validation data. The mAP values are:

For this article, the best model is rfcn_resnet101_coco, with a mAP of 0.31.
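To make the metric more concrete, here is a simplified sketch of average precision for a single class. It assumes each detection has already been marked as a true or false positive with the IoU rule, and it computes the area under the precision-recall curve after making precision non-increasing. It is an illustration only and does not claim to reproduce the evaluator used by the API. With a single class, as in this article, the mAP equals this average precision.

```python
def average_precision(detections, num_ground_truth):
    """Average precision for one class.

    `detections` is a list of (score, is_true_positive) pairs.
    `num_ground_truth` is the number of labeled boxes for that class.
    """
    if num_ground_truth == 0:
        return 0.0
    # Highest scores first, as in a ranked retrieval list
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Replace each precision by the best precision at any later rank
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the precision-recall curve
    ap = 0.0
    previous_recall = 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
print(round(ap, 4))
```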


I took the best model (rfcn_resnet101_coco) and applied it to a random video to see how well it performs. This is the result:

Final thoughts

  • Fine-tuning from a pre-trained model worked reasonably well, reaching average levels on the state-of-the-art scale.
  • Starting from this text detection model, training can be continued by varying parameters or simply running more iterations (I trained it for 200K steps) in order to improve the metric.
  • The rfcn_resnet101_coco network was the one that best fit our data. However, it is not the fastest, and the generated graph cannot be integrated into a mobile environment.

Thanks for reading! I can also be reached on my site


Engineer, Data Science — NLP Practitioner https://armandds.github.io/#projects