Extracting Text from PDF Files Using OCR: A Step-by-Step Guide with Python Code

Dr Booma
7 min read · Jul 26, 2023


Optical Character Recognition (OCR) is a technology that enables the extraction of text from images or scanned documents. It plays a crucial role in various applications, including Natural Language Processing (NLP) and text summarization. In this article, we will explore OCR, NLP, and text summarization, and understand how OCR comes into the picture for these tasks.

1. Optical Character Recognition (OCR):

OCR is a technology that converts images containing text into machine-readable text data. It utilizes advanced algorithms and machine learning models to recognize characters, words, and sentences in images and convert them into editable and searchable text. OCR has a wide range of applications, including digitization of printed documents, data extraction from invoices, forms, and receipts, and making scanned books accessible to visually impaired individuals.

2. Natural Language Processing (NLP):

NLP is a subfield of artificial intelligence that focuses on the interaction between computers and human language. It involves the analysis, understanding, and generation of natural language text and speech. NLP algorithms enable machines to comprehend and interpret human language, facilitating tasks like sentiment analysis, machine translation, and chatbots.

3. Text Summarization:

Text summarization is the process of condensing a lengthy piece of text into a shorter, concise version while retaining its main ideas and key information. There are two main types of text summarization: extractive and abstractive. Extractive summarization involves selecting and extracting important sentences directly from the original text, while abstractive summarization involves generating new sentences to summarize the content.

Role of OCR in NLP and Text Summarization:

1. NLP Applications:

OCR plays a vital role in NLP by enabling the extraction of text from images, which can then be processed and analyzed using NLP techniques. For example, in sentiment analysis, OCR can be used to extract text from social media images or memes, allowing sentiment analysis algorithms to understand people’s emotions and opinions expressed in images.

2. Data Collection and Preprocessing:

In many NLP tasks, data is collected from various sources, including scanned documents, images, and handwritten notes. OCR comes into the picture by extracting text from these sources, converting them into machine-readable formats, and facilitating further analysis.

3. Accessibility and Inclusivity:

OCR contributes to making digital content more accessible and inclusive. By extracting text from scanned books or images, OCR allows visually impaired individuals to access and interact with the content using text-to-speech technologies.

4. Text Summarization:

OCR is an essential component in the text summarization process, particularly in extractive summarization. It enables the extraction of text from images, making it possible to summarize content present in scanned documents, presentation slides, or newspaper clippings. OCR helps in identifying the relevant sentences and information for generating concise summaries.

Now, to help you understand OCR better, I will walk you through a detailed workflow:

  • Read PDF files
  • Convert them into images
  • Perform image preprocessing to handle orientation and deskew issues
  • Finally, extract text from these images using OCR

We will accomplish all these tasks using Python and various libraries, making the process both straightforward and effective.

Requirements:

Before we start, make sure you have the following libraries installed:

1. pdf2image: To convert PDF files into images.

2. pytesseract: A Python wrapper for Google’s Tesseract OCR engine.

3. OpenCV: For image preprocessing tasks like deskewing and grayscale conversion.

4. pandas: For storing extracted text data in a structured manner.
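
If any of these are missing, one way to set them up (assuming pip, plus the system-level Tesseract and Poppler binaries that pytesseract and pdf2image depend on) is:

# Python packages
pip install pdf2image pytesseract opencv-python pandas

# System dependencies (adjust for your platform)
sudo apt-get install tesseract-ocr poppler-utils   # Debian/Ubuntu
brew install tesseract poppler                     # macOS (Homebrew)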

Step 1: Reading PDF Files

To start our workflow, we need to read the PDF files. For this purpose, we will use the pdf2image library, which converts PDF pages into images.

from pdf2image import convert_from_path

# Replace 'input_file.pdf' with the path to your PDF file
pdf_file = 'input_file.pdf'
pages = convert_from_path(pdf_file)

Here, we import the convert_from_path function from the pdf2image library. This function is used to convert PDF pages into images. We specify the path to the input PDF file in the pdf_file variable, and then we call convert_from_path(pdf_file) to obtain a list of image objects corresponding to each page of the PDF.
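
If Poppler is not on your system PATH, or if recognition accuracy suffers at the default resolution, convert_from_path also accepts optional dpi and poppler_path arguments; the path below is only a placeholder:

# Optional: render at a higher resolution and point to a local Poppler installation
pages = convert_from_path(pdf_file, dpi=300, poppler_path='/path/to/poppler/bin')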

Step 2: Image Preprocessing

Once we have the images from the PDF pages, we may encounter some issues like skewed or rotated pages. To handle these issues, we can perform image preprocessing using OpenCV. We will implement the deskew function to correct the orientation of the images.

import cv2
import numpy as np

def deskew(image):
    # Convert to grayscale only if the image still has colour channels
    if len(image.shape) == 3:
        gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    else:
        gray = image
    gray = cv2.bitwise_not(gray)

    # Fit a minimum-area rectangle around the foreground (text) pixels to estimate the skew angle
    coords = np.column_stack(np.where(gray > 0))
    angle = cv2.minAreaRect(coords)[-1]

    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate the image around its centre to correct the skew
    (h, w) = image.shape[:2]
    center = (w // 2, h // 2)
    M = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    return rotated

Above, we import OpenCV as cv2 and NumPy as np. The deskew function estimates the skew angle of the page by fitting a minimum-area rectangle around the foreground (text) pixels with cv2.minAreaRect, then rotates the image about its centre with cv2.warpAffine to straighten it. It takes the input image and returns the deskewed version.

Step 3: Running OCR using pytesseract

Now that our images are preprocessed, we can run the OCR process using pytesseract. This library acts as a wrapper around Google’s Tesseract OCR engine.

import pytesseract

def extract_text_from_image(image):
    text = pytesseract.image_to_string(image)
    return text

We import the pytesseract library. The extract_text_from_image function is defined to perform OCR on the input image and return the extracted text as a string. It uses the image_to_string function from pytesseract, which takes an image as input and runs the OCR process on it to extract text.
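
Note that pytesseract only wraps the Tesseract binary, which must be installed separately. If the binary is not on your PATH (common on Windows), you can point pytesseract at it explicitly; the path below is only an example, and the lang/config arguments are optional extras rather than part of the original workflow:

import pytesseract

# Example path only: needed if the Tesseract binary is not on the PATH (e.g. on Windows)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Optional extras when calling image_to_string: OCR language and page segmentation mode
# text = pytesseract.image_to_string(image, lang='eng', config='--psm 3')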

Step 4: Text Extraction

With the OCR process completed, we can now extract the text from the images.

# Create a list to store extracted text from all pages
extracted_text = []

for page in pages:
    # Step 2: Preprocess the image (deskew)
    preprocessed_image = deskew(np.array(page))

    # Step 3: Extract text using OCR
    text = extract_text_from_image(preprocessed_image)
    extracted_text.append(text)

Above, we loop over every page image returned by pdf2image, convert it to a NumPy array, correct its orientation with the deskew function from Step 2, and then pass the preprocessed image to extract_text_from_image from Step 3. The text of each page is appended to the extracted_text list.
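
If all you need is the raw text, you can join the per-page strings and write them to a file (output.txt here is just an arbitrary name):

# Combine the per-page text and save it for later analysis
full_text = '\n'.join(extracted_text)
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(full_text)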

Step 5: Text Extraction with Additional Preprocessing

The process below adds further preprocessing to exclude the header and footer regions from the OCR output.


import pandas as pd

def process_page(page):
    try:
        # Transfer the image of the PDF page into an array
        page_arr = np.array(page)
        # Convert into grayscale
        page_arr_gray = cv2.cvtColor(page_arr, cv2.COLOR_BGR2GRAY)
        # Deskew the page
        page_deskew = deskew(page_arr_gray)
        # Calculate a confidence value for the page
        page_conf = get_conf(page_deskew)
        # Extract word-level OCR data into a DataFrame
        d = pytesseract.image_to_data(page_deskew, output_type=pytesseract.Output.DICT)
        d_df = pd.DataFrame.from_dict(d)
        # Get the number of the last block on the page
        block_num = int(d_df.loc[d_df['level'] == 2, 'block_num'].max())
        # Identify header and footer rows by block number
        header_index = d_df[d_df['block_num'] == 1].index.values
        footer_index = d_df[d_df['block_num'] == block_num].index.values
        # Combine the word-level text, excluding header and footer regions
        text = ' '.join(d_df.loc[(d_df['level'] == 5) & (~d_df.index.isin(header_index) & ~d_df.index.isin(footer_index)), 'text'].values)
        return page_conf, text
    except Exception as e:
        # If extraction fails, return -1 and the error message instead
        if hasattr(e, 'message'):
            return -1, e.message
        else:
            return -1, str(e)

In the code above, we import pandas and define the process_page function, which performs the OCR process on each page. It takes an image of a page as input and performs the following steps:

  1. Transfers the image to an array and converts it into grayscale.
  2. Deskews the page using the deskew function.
  3. Calculates the confidence value (page_conf) for the extracted text using the get_conf function (a possible implementation is sketched after this list).
  4. Extracts the text using pytesseract.image_to_data and stores it in a pandas DataFrame (d_df).
  5. Identifies the block number of the last block in the OCR result to determine the footer’s position.
  6. Excludes the header and footer regions by their index in the DataFrame and combines the text using the join function.
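
The get_conf helper is not shown in the article; below is a minimal sketch of what it might look like, assuming the page-level confidence is defined as the mean of Tesseract's word-level confidences:

def get_conf(page_gray):
    # Rough page-level confidence: mean of Tesseract's word-level confidences
    df = pd.DataFrame(pytesseract.image_to_data(page_gray, output_type=pytesseract.Output.DICT))
    df['conf'] = df['conf'].astype(float)
    words = df[df['conf'] > 0]  # rows with conf == -1 are layout blocks, not words
    return words['conf'].mean() if not words.empty else 0.0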

In the main part of the code, we loop through each PDF file in file_list. For each file, we convert it into images, preprocess each page using the process_page function, and extract the text. We store the extracted text in a DataFrame pages_df, and then concatenate all the page DataFrames into a single DataFrame. Finally, we store the combined DataFrame for each PDF file in the dictionary OCR_dic with the filename as the key. The process is repeated for all the PDF files in the file_list, and the extracted text is saved in OCR_dic.
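
The driver loop itself is not shown above; here is a minimal sketch of what it might look like, assuming file_list is a list of PDF file paths and using the pages_df and OCR_dic names described in the paragraph:

from pdf2image import convert_from_bytes
import pandas as pd

file_list = ['input_file.pdf']  # replace with your own PDF paths
OCR_dic = {}

for file in file_list:
    # Convert every page of the PDF into an image
    with open(file, 'rb') as f:
        images = convert_from_bytes(f.read())  # add poppler_path='...' if needed

    # Run OCR on each page and collect confidence and text
    page_frames = []
    for page in images:
        page_conf, text = process_page(page)
        page_frames.append(pd.DataFrame({'conf': [page_conf], 'text': [text]}))

    # One combined DataFrame per PDF, keyed by filename
    pages_df = pd.concat(page_frames, ignore_index=True)
    OCR_dic[file] = pages_df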

Note: If Poppler is not on your system PATH, pass the correct poppler_path to convert_from_bytes (or convert_from_path); otherwise the PDF-to-image conversion will fail.

Conclusion

[Image: all the text blocks recognized by Tesseract on a sample page]

In this article, I have walked you through a detailed workflow to extract text from PDF files using OCR. We started by reading the PDF files and converting them into images using pdf2image. Next, we performed image preprocessing tasks like deskewing using OpenCV to improve text recognition accuracy. Finally, we used pytesseract to run the OCR process and extract the text from the images.

Refer to the GitHub link for the full code.

With this workflow, you can now efficiently extract text from PDF documents, making them accessible for further analysis and processing in various applications. OCR technology has proven to be an indispensable tool in modern data processing pipelines and can greatly enhance efficiency and productivity when dealing with scanned documents or images with embedded text.

Happy coding!
