
Tutorial: Building a custom OCR using YOLO and Tesseract

In this article, you will learn how to make your own custom OCR with the help of deep learning, to read text from an image. I will walk you through the example of PAN Card images, for text detection and text recognition. But first, let’s get familiar with the process involved in Optical Character Recognition.

What is OCR?

OCR stands for Optical Character Recognition. It is used to read text from images such as a scanned document or a picture. This technology converts virtually any kind of image containing written text (typed, handwritten or printed) into machine-readable text data.

Here, we are going to build an OCR which only reads the information you want it to read from a given document.

OCR has two major building blocks:

  • Text detection
  • Text recognition

1. Text detection

Our first task is to detect the required text in images or documents. Often you don’t want to read the entire document, just a piece of information such as a credit card number, an Aadhaar/PAN card number, a name, or the amount and date on a bill. Detecting only the required text is a tough task, but thanks to deep learning, we’ll be able to selectively read text from an image.

Text detection, or more generally object detection, has been an area of intensive research, accelerated by deep learning. Today, object detection, and in our case text detection, can be achieved through two approaches.

  • Region-Based detectors
  • Single Shot detectors

In Region-Based methods, the first objective is to find all the regions which have the objects and then pass those regions to a classifier, which gives us the locations of the required objects. So, it is a two-step process.

First it finds the bounding box, and afterwards its class. This approach is considered more accurate but is slower than the Single Shot approach. Algorithms like Faster R-CNN and R-FCN take this approach.

Single Shot detectors, however, predict both the bounding box and the class at the same time. Being single-step, they are much faster. However, it must be noted that Single Shot detectors tend to perform worse when detecting smaller objects. SSD and YOLO are Single Shot detectors.

Often, there is a tradeoff between speed and accuracy when choosing an object detector. For example, Faster R-CNN is generally among the most accurate detectors, while YOLO is among the fastest. Here is a great article which compares different detectors and provides comprehensive insights on how they work.

Which one to use depends entirely on your application. Here, we are using YOLOv3 mainly because:

  • It is one of the fastest detectors available.
  • It has good enough accuracy for our application.
  • YOLOv3 predicts at three scales, using a concept similar to a Feature Pyramid Network (FPN), which helps it detect small objects better.

Enough said, let’s dive into YOLO.

Using YOLO (You Only Look Once) for Text Detection

YOLO is a state-of-the-art, real-time object detection network. There are many versions of it. At the time of writing, YOLOv3 is the most recent version.

YOLOv3 uses Darknet-53 as its feature extractor. It has 53 convolutional layers overall, hence the name ‘Darknet-53’. It has successive 3 × 3 and 1 × 1 convolutional layers and some shortcut connections.

For the purpose of classification, independent logistic classifiers are used with the binary cross-entropy loss function.
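To make the classification step concrete, here is a minimal sketch in plain Python of independent logistic (sigmoid) classifiers scored with binary cross-entropy. The function names, the example scores, and the three-class setup are illustrative assumptions, not code taken from Darknet.

```python
import math

def sigmoid(x):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent per-class predictions.

    Each class gets its own sigmoid, so the probabilities do not have
    to sum to 1 (unlike a softmax classifier).
    """
    total = 0.0
    for logit, target in zip(logits, targets):
        # Clamp the probability to avoid log(0)
        p = min(max(sigmoid(logit), eps), 1.0 - eps)
        total += -(target * math.log(p) + (1 - target) * math.log(1.0 - p))
    return total / len(logits)

# Hypothetical raw scores for three text-field classes of one predicted box
logits = [2.0, -1.5, 0.3]
targets = [1, 0, 0]
loss = binary_cross_entropy(logits, targets)
```

Because every class is scored on its own, a box could in principle receive more than one label, which is the practical difference from a softmax over classes.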

Training YOLO using the Darknet framework

We will use the Darknet neural network framework for training and testing. The framework uses multi-scale training, lots of data augmentation and batch normalization. It is an open source neural network framework written in C and CUDA. It is fast, easy to install, and supports CPU and GPU computation.

You can find the source on GitHub.

Here is how easy it is to install the Darknet framework. Just 3 lines! (If you are going to use a GPU, set GPU=1 and CUDNN=1 in the Makefile first.)

git clone
cd darknet
make

Let’s start building

Get your data first…

Data is the first and most important thing in any machine learning project. So, whatever your application is, make sure you have around 100 images for it. If you have fewer images, use image augmentation to increase the size of your dataset. In image augmentation, we alter images by changing their size, orientation, lighting, color, etc.

There are many methods available for augmentation, and you can pick whichever you like. I would like to mention an image augmentation library called Albumentations, built by Kaggle Masters and Grandmasters.

I collected 50 PAN card images from the internet and, using image augmentation, created a dataset of 100 PAN card images.

Data Annotation

Once we have collected the data, let’s move to the next step, which is to label it. There are many free data annotation tools available. I used VoTT v1 because it is a simple tool and works like a charm. Follow this link to understand the process of data annotation.

Note that it is important to tag all the text fields that we want to read from the images. The annotation tool also generates the data folders which will be required during training.

Make sure to set export format to YOLO after tagging. After annotation, copy all the generated files to the data folder of the cloned repository.
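If you want to sanity-check the exported label files, it helps to know what a YOLO label line looks like: one line per object, holding the class id followed by the box center, width and height, all normalized by the image size. The sketch below reproduces that conversion from a pixel box. The function name and the example box are hypothetical; the annotation tool already does this for you on export.

```python
def to_yolo_line(class_id, box, image_size):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.

    YOLO labels store the box center, width and height as fractions of
    the image width and height.
    """
    x_min, y_min, x_max, y_max = box
    img_w, img_h = image_size
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Hypothetical text field on a 1000 x 500 pixel card image
line = to_yolo_line(0, (100, 200, 300, 250), (1000, 500))
print(line)  # 0 0.200000 0.450000 0.200000 0.100000
```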


To clear up any confusion: Darknet has two repositories, one by the original author and one forked from it. We use the forked repository, as it has great documentation.

To start training our OCR, we first need to modify the config file. You will find the required config file, named ‘yolov3.cfg’, in the ‘cfg’ folder. Here, you need to change the batch size, the subdivisions, the number of classes and the filters parameter. Make the changes to the config file as described in the documentation.
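The filters value is the one that is easiest to get wrong. For YOLOv3, the convolutional layer right before each [yolo] layer needs (classes + 5) × 3 filters: 4 box coordinates plus 1 objectness score plus the class scores, for each of the 3 anchors used at that scale. Here is a small helper to compute it; the function name and the class count in the example are assumptions for illustration.

```python
def yolov3_filters(num_classes, anchors_per_scale=3):
    """Filters for the convolutional layer before each [yolo] layer.

    Each anchor predicts 4 box coordinates, 1 objectness score and
    one score per class.
    """
    return (num_classes + 5) * anchors_per_scale

# The stock yolov3.cfg is set up for 80 classes
print(yolov3_filters(80))  # 255

# Hypothetical custom detector with 3 text-field classes
print(yolov3_filters(3))   # 24
```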

We will start training from the pre-trained darknet-53 weights. This helps our model converge faster.

To start the training, run this command:

./darknet detector train data/ yolo-obj.cfg darknet53.conv.74

The best thing is that it has multi-GPU support. When you see that the average loss ‘0.xxxxxx avg’ no longer decreases after a certain number of iterations, you should stop training. As you can see in the chart below, I stopped at 14200 iterations because the loss had become constant.

Loss curve
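If you log the average loss values yourself, the "no longer decreases" rule can be written down as a simple check. The sketch below compares the mean of the most recent losses with the mean of the window before it. The function name, the window size and the 1% threshold are arbitrary assumptions, not something Darknet provides.

```python
def has_plateaued(avg_losses, window=5, min_improvement=0.01):
    """Return True when the loss has stopped improving.

    Compares the mean of the last `window` values with the mean of the
    `window` values before them; an improvement smaller than
    `min_improvement` (as a fraction of the earlier mean) counts as a plateau.
    """
    if len(avg_losses) < 2 * window:
        return False  # not enough history to judge
    previous = sum(avg_losses[-2 * window:-window]) / window
    recent = sum(avg_losses[-window:]) / window
    return (previous - recent) < min_improvement * previous

# Hypothetical loss histories
still_improving = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
flat = [0.5] * 10
print(has_plateaued(still_improving))  # False
print(has_plateaued(flat))             # True
```

Treat this only as a hint for when to look at the chart; the final choice of weights should still be based on evaluation, not on the loss alone.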

It’s not always the case that the last weights file gives the best results. I got the best results at the 8000th iteration. You need to evaluate the saved weights files on the basis of their mAP (Mean Average Precision) score and choose the weights file with the highest mAP score. Here, they have provided the details of how to use the mAP score.

So now, when you run this detector on a sample image, you will get the bounding box of each detected text field, from which you can easily crop that region.

Ta daaaa!

Text detection on dummy PAN card
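Cropping a detected field comes down to turning the detection into pixel coordinates. Here is a minimal sketch that assumes the detection is given as a normalized (x_center, y_center, width, height) tuple, the same layout as the YOLO labels. Depending on how you call the detector you may already receive pixel values, so treat the input format, the function name and the padding parameter as assumptions.

```python
def detection_to_crop_box(detection, image_size, padding=0):
    """Turn a normalized (x_center, y_center, width, height) detection
    into a pixel box (left, top, right, bottom), clamped to the image.

    `padding` adds a margin in pixels on every side of the box.
    """
    x_c, y_c, w, h = detection
    img_w, img_h = image_size
    left = int(round((x_c - w / 2.0) * img_w)) - padding
    top = int(round((y_c - h / 2.0) * img_h)) - padding
    right = int(round((x_c + w / 2.0) * img_w)) + padding
    bottom = int(round((y_c + h / 2.0) * img_h)) + padding
    # Keep the box inside the image bounds
    return (max(0, left), max(0, top), min(img_w, right), min(img_h, bottom))

# Hypothetical detection on a 1000 x 500 pixel image
box = detection_to_crop_box((0.2, 0.45, 0.2, 0.1), (1000, 500))
print(box)  # (100, 200, 300, 250)
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's Image.crop expects, so it can be passed to it directly. A few pixels of padding often help the recognizer, since tight boxes can cut off the edges of characters.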

2. Text recognition

Now that we have our custom text detector implemented for text detection, we move onto the subsequent process of Text Recognition. You can either build your own text recognizer or use an open-sourced one.

Although implementing your own text recognizer is great practice, it is challenging to get labelled data for it. However, if you already have a lot of labelled data, a custom text recognizer will most likely improve the accuracy.

In this article, however, we are going to use the Tesseract OCR engine for text recognition. With only a few tweaks, the Tesseract OCR engine works wonders for our application. We are going to use Tesseract 4, which is the latest version at the time of writing. Thankfully, it also supports many languages.

Installing Tesseract OCR Engine

The commands below are for Ubuntu 14.04, 16.04, 17.04 and 17.10. On 18.04, skip the first 2 commands.

sudo add-apt-repository ppa:alex-p/tesseract-ocr 
sudo apt-get update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
sudo pip install pytesseract

To get familiar with tesseract here is a nice example.
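As a starting point, here is a minimal sketch of reading one cropped region with pytesseract. It assumes the region is a Pillow image containing a single line of text, which is why it passes `--psm 7` (treat the image as a single text line). The helper names and the default language are assumptions; adjust them to your fields.

```python
import re

def clean_text(raw):
    """Collapse whitespace runs and strip the trailing newline or
    form feed that Tesseract appends to its output."""
    return re.sub(r"\s+", " ", raw).strip()

def read_region(image, lang="eng", psm=7):
    """Run Tesseract on one cropped region (a Pillow image).

    psm 7 tells Tesseract to treat the image as a single text line.
    """
    import pytesseract  # imported lazily so clean_text works without it
    raw = pytesseract.image_to_string(image, lang=lang, config=f"--psm {psm}")
    return clean_text(raw)

# Cleaning a hypothetical raw Tesseract result
print(clean_text("ABCDE1234F\n\x0c"))  # ABCDE1234F
```

To use it, open a crop with Pillow, for example `read_region(Image.open("crop.png"))`, where the file name is a placeholder.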

3. Putting things together

Once we’ve implemented the process of text detection and text recognition, it is time to combine them to achieve the following flow:

  • Detect the required region from the image
  • Pass the detected regions to Tesseract
  • Store the results from Tesseract in your required format

From the above diagram, you can see that the image of the PAN card is first passed into YOLO. YOLO then detects the required text regions, and they are cropped out of the image. Next, we pass those regions one by one to Tesseract. Tesseract reads them, and we store that information.

You can present the results in any form you choose. Here, I used an Excel sheet to show the results.
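One simple way to store the results is a CSV file, which opens directly in Excel. The sketch below writes one row per image using Python's csv module. The column names and the sample values are made up for illustration and are not the format used by any particular repository.

```python
import csv
import io

def write_results(rows, fieldnames, stream):
    """Write one CSV row per image; each row maps a field name to the
    text recognized for that field."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Hypothetical recognized fields for one image
rows = [{"image": "card_01.jpg", "name": "JOHN DOE", "pan_number": "ABCDE1234F"}]

# Written to an in-memory buffer here; pass an open file to save to disk
buffer = io.StringIO()
write_results(rows, ["image", "name", "pan_number"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends.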

I have open sourced this entire pipeline. Clone the repository, then move your data folder and the weights file generated after training into the repository directory. Install Darknet there with the following command.

bash ./

Now run your OCR with this command -d -t

Congratulations! Now you can see the OCR results in the output folder in the form of a CSV file. While testing your custom OCR, you may need to change the size of the image. For this, tweak the basewidth parameter in the file.
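A base width is commonly used to resize an image to a fixed width while keeping its aspect ratio. The sketch below shows that calculation. It is an assumption about what such a parameter typically does, not a copy of the repository's code, and the function name is hypothetical.

```python
def scaled_size(original_size, basewidth):
    """Return (width, height) for an image resized to `basewidth`
    pixels wide while keeping its aspect ratio."""
    width, height = original_size
    scale = basewidth / float(width)
    return basewidth, max(1, int(round(height * scale)))

# Hypothetical 1200 x 800 pixel input resized to 600 pixels wide
print(scaled_size((1200, 800), 600))  # (600, 400)
```

With Pillow, the resulting tuple can be passed to `image.resize(...)`. A larger width usually gives Tesseract more pixels per character, at the cost of speed.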


With this article, I hope that you were able to gain a comprehensive understanding of the various steps involved in Optical Character Recognition, and to implement your own OCR program while reading it. I encourage you to try this method on different sets of images, to use different detectors for your application, and to see what gives the best results.

Karan Purohit


Deep Learning Engineer | From Computer Vision to NLP, now in Speech!
