Extracting Text from Images (OCR) using OpenCV & Pytesseract

siromer · 6 min read · Mar 17, 2024


  • Text Extraction from Pages & Online Documentation.
    OpenCV, Python, Pytesseract, OCR (Optical Character Recognition)

Recently, I read an article about mobile phone cameras in which the author discussed the total number of images taken every day. The exact figure surprised me: it was around 5 billion. Twenty years ago, our elders probably took ten photos a week, and most of them were of family or some landscape.
What about today? We take photos of everything. We go to school and photograph a friend's notes to look at later, or we photograph a page written in a foreign language and translate it into our native language with tools like Google Lens. Under the hood, tools like Google Lens use OCR technology.

  • In this article, I will discuss how to extract text from images with OCR.
  • I also have a YouTube video (link) about this article; you may prefer watching it rather than reading.
Extracting Text from a Book Page
  • Over the years we have started to extract useful information from images, and extracting text from images has become a crucial part of our lives.

What is OCR?

Optical Character Recognition (OCR) is a foundational technology behind the conversion of typed, handwritten, or printed text from images into machine-encoded text. (Google Cloud)

OCR is a field of research in pattern recognition, artificial intelligence, and computer vision.

What can you do with OCR?

By using OCR, you can develop numerous useful projects. After extracting text from images, it can be combined with different kinds of deep learning models. For instance, you can create a text translator, a document summarizer, a document fraud detector, and more.

What I am trying to say is that after extracting text from an image, if you use that text in various deep learning models, there are limitless possibilities for what you can create.

OCR APIs

There are many options, and choosing the right OCR API or engine depends on your purpose. Here are some of the most well-known:

  1. Google Cloud Vision API
  2. Microsoft Azure Computer Vision API
  3. Amazon Textract
  4. Tesseract OCR

CODE / Extracting Text with Pytesseract & OpenCV

Purpose: I took a number of photos of book pages with my phone's camera; I will extract text from these photos and write it into a text (.txt) file.

I will explain all the steps one by one. If you want to see the entire code at once, skip to the full code at the end of the page.

  • Import required Libraries
# Import required libraries
import cv2
import pytesseract
import matplotlib.pyplot as plt

# Mention the installed location of Tesseract-OCR in your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
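The path above is the default Windows install location. On Linux or macOS, Tesseract usually ends up on the PATH after installation (e.g. via apt or Homebrew), so instead of hard-coding a path you can look the binary up with the standard library; a small sketch:

```python
import shutil

# Look up the tesseract binary on the PATH (returns None if it is not installed)
tesseract_path = shutil.which("tesseract")

# If found, point pytesseract at it instead of hard-coding a Windows path:
# pytesseract.pytesseract.tesseract_cmd = tesseract_path
print(tesseract_path)
```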
  • Read the image and convert it to grayscale colorspace
# Read the image from which text needs to be extracted
img = cv2.imread("resources/text_images/paragraph4.jpeg")

# Convert the image to the grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# visualize the grayscale image
plt.imshow(gray,cmap="gray")
Grayscale Image
  • Create a binary image with the cv2.threshold() function

In global thresholding, the user decides the threshold value by trying different values to find the best one.
In contrast, Otsu’s method automatically determines threshold values.
(OpenCV documentation)

# Performing OTSU threshold 
ret, thresh1 = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

# visualize the thresholded image
plt.imshow(thresh1)
  • Extracting text

These steps may seem a little complex, but if you check the image below, you will grasp them for sure.

  • Apply dilation and use the cv2.findContours() function to find rectangles in the dilated image; these rectangles are simply paragraphs. Crop each rectangle, pass it to the pytesseract.image_to_string() function to extract its text, and then append the extracted text and rectangle coordinates to cnt_list.
  • Adding to the list is necessary because sometimes the image_to_string() function doesn't return paragraphs in sequence. Therefore, I am going to use the Y coordinates to sort the paragraphs.
# dilation kernel size: a bigger kernel merges more text, giving fewer rectangles
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 25))

# Applying dilation on the threshold image
dilation = cv2.dilate(thresh1, rect_kernel, iterations = 1)

# Finding contours
contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

# Creating a copy of the image, you can use a binary image as well
im2 = gray.copy()

cnt_list = []
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)

    # Drawing a rectangle and marking its top-left corner on the copied image
    rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.circle(im2, (x, y), 8, (255, 255, 0), 8)

    # Cropping the text block from the original grayscale image
    # (cropping from gray avoids the rectangle borders drawn on im2)
    cropped = gray[y:y + h, x:x + w]

    # Apply OCR on the cropped image
    text = pytesseract.image_to_string(cropped)

    cnt_list.append([x, y, text])
dilation | im2: finding paragraphs from the dilated image
  • Sort Paragraphs
# Sort the text blocks by their Y coordinate so the paragraphs run from top to bottom
sorted_list = sorted(cnt_list, key=lambda x: x[1])
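A tiny illustration of the sort with made-up coordinates: each entry is [x, y, text], and key=lambda item: item[1] orders the entries by their Y coordinate, i.e. from top to bottom on the page.

```python
# Hypothetical [x, y, text] entries, out of page order
cnt_list = [[12, 300, "third paragraph"],
            [10, 40, "first paragraph"],
            [11, 150, "second paragraph"]]

sorted_list = sorted(cnt_list, key=lambda item: item[1])
print([text for _, _, text in sorted_list])
# ['first paragraph', 'second paragraph', 'third paragraph']
```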
  • Write the extracted text to a text (.txt) file
# Write the sorted paragraphs into the file, one after another
with open("recognized2.txt", "w") as file:
    for x, y, text in sorted_list:
        file.write(text)
        file.write("\n")
Result / recognized2.txt

Full Code

# Import required packages
import cv2
import pytesseract
import matplotlib.pyplot as plt

# Mention the installed location of Tesseract-OCR in your system
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Read the image from which text needs to be extracted
img = cv2.imread("resources/text_images/paragraph4.jpeg")

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)


# Performing OTSU threshold
ret, thresh1 = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)

# dilation kernel size: a bigger kernel merges more text, giving fewer rectangles
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 25))

# Applying dilation on the threshold image
dilation = cv2.dilate(thresh1, rect_kernel, iterations = 1)

# Finding contours
contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL,cv2.CHAIN_APPROX_NONE)

# Creating a copy of image
im2 = gray.copy()


cnt_list = []
for cnt in contours:
    x, y, w, h = cv2.boundingRect(cnt)

    # Drawing a rectangle and marking its top-left corner on the copied image
    rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 5)
    cv2.circle(im2, (x, y), 8, (255, 255, 0), 8)

    # Cropping the text block from the original grayscale image
    # (cropping from gray avoids the rectangle borders drawn on im2)
    cropped = gray[y:y + h, x:x + w]

    # Apply OCR on the cropped image
    text = pytesseract.image_to_string(cropped)

    cnt_list.append([x, y, text])


# Sort the text blocks by their Y coordinate so the paragraphs run from top to bottom
sorted_list = sorted(cnt_list, key=lambda x: x[1])

# Write the sorted paragraphs into a text file
with open("recognized.txt", "w") as file:
    for x, y, text in sorted_list:
        file.write(text)
        file.write("\n")


# Resize the images for display
rgb_image = cv2.resize(im2, (0, 0), fx = 0.4, fy = 0.4)
dilation = cv2.resize(dilation, (0, 0), fx = 0.4, fy = 0.4)
#thresh1 = cv2.resize(thresh1, (0, 0), fx = 0.4, fy = 0.4)

# show the image, provide the window name first
#cv2.imshow('thresh1', thresh1)
cv2.imshow('dilation', dilation)
cv2.imshow('gray', gray)

# add wait key. window waits until the user presses a key
cv2.waitKey(0)
# and finally destroy/close all open windows
cv2.destroyAllWindows()
