A comparison of cloud solutions for optical character recognition (OCR)

Vladislav Klekovkin
Deelvin Machine Learning
Apr 13, 2021

Optical character recognition (OCR) is a mechanical or electronic conversion of images of handwritten, typed, or printed text into text data used to represent characters in a computer (for example, in a text editor).

In this article we compare the accuracy of the OCR algorithms offered by three cloud services: Google Cloud Platform, Amazon Web Services, and Microsoft Azure, which are among the most popular OCR providers.

A cloud platform is a set of services and permissions offered by a provider. It gives users (both individuals and large companies) access to computing resources and analytical tools, as well as data storage, servers, software, etc.

Currently, the following services are the most popular ones:

1. Amazon Web Services (AWS) was founded in 2006 and at present provides IaaS, PaaS, SaaS services, among others. It also offers more than 70 resources with extended coverage in fourteen regions of the world.

2. Azure is a Microsoft product released in 2010. Today the platform offers a wide variety of supporting tools, programming languages, and frameworks. It works on Microsoft Windows and Linux. Currently, about 60 services and data centers are available on the platform in more than 38 locations around the world. Azure’s customers include well-known names such as Johnson Controls, Fujifilm, HP, and Apple, as well as several other large companies.

3. Google Cloud Platform is the youngest of the three platforms. It was launched in 2011 and offers many services, including IaaS, PaaS, and serverless computing, and supports big data and IoT. The platform uses more than 50 resources and has 6 global data processing centers at its disposal.

Google Cloud

In our comparison we used the Cloud Vision API. Complete information on registering and using the Cloud Vision API for OCR is available here.

In this and the following examples, the code is written in Python. We processed an image stored on the local computer. The response variable contains the detected text, bounding box coordinates, and meta information.

The following is an example of working with Cloud Vision API:

import os
from google.cloud import vision

# Path to the service account key file (JSON) used for authentication
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "YOUR CREDENTIALS"

image_path = "PATH TO IMAGE"

client = vision.ImageAnnotatorClient()

# Read the local image as raw bytes
with open(image_path, 'rb') as image_file:
    content = image_file.read()

image = vision.Image(content=content)
response = client.text_detection(image=image)
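To compare a result with a ground truth label, the detected text has to be pulled out of the response. The following is a minimal sketch, assuming the usual layout of a text_detection response, in which the first entry of text_annotations holds the whole detected text. The helper name extract_text_gcp and the stand-in object are hypothetical and only illustrate the idea; they are not taken from the original experiments.

```python
from types import SimpleNamespace


def extract_text_gcp(response):
    """Return the full detected text from a text_detection response, or ''."""
    # Request-level failures are reported in response.error.message.
    if response.error.message:
        raise RuntimeError(response.error.message)
    annotations = response.text_annotations
    # The first annotation holds the whole detected text; the remaining
    # ones describe individual words with their bounding polygons.
    return annotations[0].description.strip() if annotations else ""


# Stand-in object that mimics only the fields used above (not a real API response).
fake_response = SimpleNamespace(
    error=SimpleNamespace(message=""),
    text_annotations=[
        SimpleNamespace(description="HELLO\n"),
        SimpleNamespace(description="HELLO"),
    ],
)
print(extract_text_gcp(fake_response))
```

With a real response, the same function can be called directly on the response variable from the example above.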

Amazon Web Services

We used Amazon Rekognition for the OCR task. More information on it is available here.

The example below demonstrates how to work with Amazon Rekognition:

import boto3

image_path = "PATH TO IMAGE"

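# Credentials and region are taken from the standard AWS configuration
client = boto3.client('rekognition')

# Read the local image as raw bytes
with open(image_path, 'rb') as image_file:
    content = image_file.read()

response = client.detect_text(Image={'Bytes': content})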
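The detect_text response is a dictionary whose TextDetections list contains both whole lines and individual words. Below is a minimal sketch of how the recognized text could be collected from it; the helper name extract_text_aws and the hand-written sample dictionary are hypothetical, and whether words or lines are the better unit depends on how the ground truth is annotated.

```python
def extract_text_aws(response, level="WORD"):
    """Join the detected text items of the given type ('WORD' or 'LINE')."""
    return " ".join(
        item["DetectedText"]
        for item in response.get("TextDetections", [])
        if item.get("Type") == level
    )


# Hand-written dictionary that mimics the shape of a detect_text response.
fake_response = {
    "TextDetections": [
        {"DetectedText": "OPEN 24", "Type": "LINE", "Confidence": 99.1},
        {"DetectedText": "OPEN", "Type": "WORD", "Confidence": 99.3},
        {"DetectedText": "24", "Type": "WORD", "Confidence": 98.9},
    ]
}
print(extract_text_aws(fake_response))
print(extract_text_aws(fake_response, level="LINE"))
```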

Microsoft Azure

We used the Computer Vision service from Azure Cognitive Services (S1 pricing tier). You can learn more about this service here.

The resize() function upscales the image so that its height and width are at least the minimum allowed size (50 pixels each). The get_image_file_object() function creates a file object from the image. Both functions are required to work correctly with the Computer Vision service.

Here is an example of working with Azure Cognitive Services:

import time
from PIL import Image
from io import BytesIO

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from msrest.authentication import CognitiveServicesCredentials

subscription_key = "YOUR KEY"
endpoint = "YOUR ENDPOINT"
image_path = "PATH TO IMAGE"

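def resize(img):
    # Upscale so that each side is at least 50 pixels (the minimum allowed size)
    _format = img.format
    size = tuple(max(val, 50) for val in img.size)
    img = img.resize(size)
    # resize() returns a new image without format information, so restore it
    img.format = _format

    return img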

def get_image_file_object(image_path):
    # Load the image, resize it if needed, and return it as an in-memory file object
    byte_arr = BytesIO()
    img = Image.open(image_path)
    img = resize(img)
    img.save(byte_arr, img.format)
    byte_arr.seek(0)

    return byte_arr


computervision_client = ComputerVisionClient(
    endpoint, CognitiveServicesCredentials(subscription_key))

image = get_image_file_object(image_path)
# Submit the image; the Read operation runs asynchronously
job = computervision_client.read_in_stream(image=image, raw=True)

# The operation ID is the last segment of the Operation-Location URL
operation_id = job.headers['Operation-Location'].split('/')[-1]
image_analysis = computervision_client.get_read_result(operation_id)

# Poll once per second until the operation finishes
while image_analysis.status in ['notStarted', 'running']:
    time.sleep(1)
    image_analysis = computervision_client.get_read_result(
        operation_id=operation_id)

response = image_analysis.as_dict()
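The dictionary produced by as_dict() nests the recognized lines inside pages. The sketch below shows one way to flatten it into a single string. It assumes the snake_case keys analyze_result, read_results, lines, and text; these key names are an assumption about the SDK version, so it is worth printing the response once to confirm them. The helper name extract_text_azure and the sample dictionary are hypothetical.

```python
def extract_text_azure(response):
    """Collect the text of every recognized line from a Read result dictionary."""
    # Failed or unfinished operations carry no usable text.
    if response.get("status") != "succeeded":
        return ""
    analyze_result = response.get("analyze_result") or {}
    lines = []
    for page in analyze_result.get("read_results", []):
        for line in page.get("lines", []):
            lines.append(line["text"])
    return " ".join(lines)


# Hand-written dictionary that mimics the assumed shape of the response.
fake_response = {
    "status": "succeeded",
    "analyze_result": {
        "read_results": [
            {"page": 1, "lines": [{"text": "MAIN"}, {"text": "STREET"}]}
        ]
    },
}
print(extract_text_azure(fake_response))
```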

In our testing, we used the following 3 datasets containing pictures and annotations:

1. IIIT5K:

The IIIT5K dataset consists of 5,000 images. The images show individual words cut from photographs of billboards, signboards, house numbers, and movie posters.
Download.

Sample images from IIIT5K set

2. IC13:

The IC13 dataset contains 1,015 cropped images with text. Words with non-alphanumeric characters are removed from the set.
Download.

Sample images from IC13 set

3. SVHN:

The SVHN dataset consists of 600,000 house number images captured using Google Street View.
Download.

Sample images from SVHN set

To compare OCR accuracy, we selected 500 images from each dataset. The results returned by Cloud Vision API, Amazon Rekognition, and Azure Cognitive Services for each image were compared with the ground truth values. A result was counted as correct only if it matched the ground truth exactly, ignoring case. The accuracy was calculated as follows: Accuracy = 100 * correct samples / (correct samples + incorrect samples).
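The accuracy measure described above can be sketched in a few lines of Python. This is an illustration of the formula rather than the exact evaluation script; the function names is_correct and accuracy and the sample strings are hypothetical.

```python
def is_correct(prediction, ground_truth):
    """A prediction counts only if it matches the label exactly, ignoring case."""
    return prediction.lower() == ground_truth.lower()


def accuracy(predictions, ground_truths):
    """Accuracy = 100 * correct samples / (correct samples + incorrect samples)."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    correct = sum(is_correct(p, g) for p, g in zip(predictions, ground_truths))
    # Every sample is either correct or incorrect, so the denominator
    # equals the total number of samples.
    return 100 * correct / len(predictions)


# Made-up example: two of the four predictions match their labels.
predictions = ["Hello", "w0rld", "42", "stop sign"]
ground_truths = ["HELLO", "world", "42", "stop"]
print(accuracy(predictions, ground_truths))
```

Because the match is exact, a prediction that contains the correct word plus extra text (the last sample here) is counted as incorrect.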

The table summarizes the results of the accuracy comparison:

Test results. Accuracy measure

As the results presented in the table demonstrate, on the selected datasets Google Cloud performs better than Microsoft Azure, and Microsoft Azure performs better than AWS.

Let us now demonstrate with several examples how the services in question function and what results they produce. We used the same picture for text recognition in each service. The results are presented in the tables below.

Examples of IIIT5K data processing
Examples of IC13 data processing
Examples of SVHN data processing

In terms of pricing, on average, text recognition (OCR) using Amazon Rekognition and Azure Cognitive Services is more economical than using Cloud Vision API.

Selected prices for these services are presented in the table below:

Pricing in US$ for OCR services depending on the number of images processed per month

In all these services, the price per picture depends on the number of pictures processed per month. These services also have free offers, so it is worth checking the current prices on the official websites.
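To see how volume-dependent pricing affects the monthly bill, a small calculator can help. The sketch below assumes graduated pricing, where each price applies only to the images that fall inside its volume tier. The function name monthly_cost and the tier values are hypothetical placeholders, not the actual prices of any of the three services; substitute the figures from the official pricing pages.

```python
def monthly_cost(num_images, tiers):
    """Cost in US$ under graduated pricing.

    tiers is a list of (upper_limit, price_per_1000_images) pairs in ascending
    order; upper_limit is None for the last, unbounded tier.
    """
    cost = 0.0
    lower = 0
    for upper, price_per_1000 in tiers:
        if num_images <= lower:
            break
        # Number of images billed in this tier is capped by the tier's upper limit.
        billable_up_to = num_images if upper is None else min(num_images, upper)
        cost += (billable_up_to - lower) * price_per_1000 / 1000
        if upper is None:
            break
        lower = upper
    return cost


# Hypothetical tiers for illustration only: the first 1,000 images are free,
# then 1.00 US$ per 1,000 images up to 1,000,000, then 0.50 US$ per 1,000.
example_tiers = [(1_000, 0.0), (1_000_000, 1.0), (None, 0.5)]
print(monthly_cost(5_000, example_tiers))
print(monthly_cost(2_000_000, example_tiers))
```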

The following are links to the pricing on the respective websites:

When comparing prices, it is also important to consider the cost of other services, such as data storage.

Comparing the three cloud services led me to the following conclusions. If you decide to use models from cloud services for text recognition, Google Cloud performs best in terms of accuracy. If you want a cheaper solution, Azure Cognitive Services is the option, but its OCR accuracy is lower. Amazon Rekognition costs the same as Azure, but its accuracy is much worse.

All the code for this article can be found on GitHub here.

I would also like to thank Anna Gladkova and Ildar Idrisov for their help in writing this article.
