How to convert a scanned PDF into a searchable PDF using Python

Rakesh M
Oct 10, 2020


One of the major benefits of a searchable PDF is that you can search quickly in a document instead of manually looking up information. We will see how this can be done in 3 simple steps.

To make a searchable PDF, you first need to install Tesseract v5, an open-source OCR engine whose text recognizer is built on an LSTM neural network. You can read more about Tesseract in this paper.

Step 1:

Follow these steps to install Tesseract if you are a Windows user.

  1. Download Tesseract from this link.

  2. Download and install Python 3.5 from this link. If you use the Spyder IDE from the Anaconda distribution, follow this link instead (make sure you select the Windows installer). We use the Spyder IDE throughout this article.

  3. Launch the Spyder IDE from the Anaconda distribution.
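Once Spyder is running, it can be handy to check whether Python can see the Tesseract executable. The snippet below is a small sketch using only the standard library; the function names find_tesseract and tesseract_version are my own, not part of Tesseract. On Windows the installer may not add Tesseract to PATH, in which case the lookup returns None and you simply note the install folder for Step 3.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(executable_path):
    """Run `tesseract --version` and return the first line of its output."""
    result = subprocess.run(
        [executable_path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Some builds print the banner to stderr instead of stdout, so check both.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else ""
```

For example, path = find_tesseract() followed by tesseract_version(path) prints the installed version when path is not None.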

Step 2:

Convert the scanned PDF to an image.

Installing dependencies

  1. pdf2image

pdf2image: pdf2image is a Python module that converts the pages of a PDF document to images in a format such as JPEG or PNG. To install it, run this command in the Anaconda terminal or in the Spyder IPython console: pip install pdf2image

Before writing the conversion function, create two folders: inpath and image_path.
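You can create the folders by hand, or from the Spyder console. Here is a minimal sketch; make_folders is a hypothetical helper name, and it assumes you want the folders inside the current working directory unless you pass base_dir.

```python
import os


def make_folders(names, base_dir="."):
    """Create each folder under base_dir (if missing) and return the full paths."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

Calling make_folders(["inpath", "image_path"]) from your project folder creates both folders in one go.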

Put the PDF you want to convert in the inpath folder, then create a new file named pdf_convert.py, write the following program in Spyder, and execute it.
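The original listing is not reproduced here, so below is a minimal sketch of what pdf_convert.py could look like; it is not necessarily the exact program that the line numbers in this article refer to. It assumes that inpath is a folder holding one or more PDFs and that pdf2image can find Poppler (on Windows you may need to pass Poppler's bin folder through the poppler_path argument of convert_from_path). The helper name page_image_name and the 300 dpi default are my own choices for illustration.

```python
import os


def page_image_name(pdf_file, page_number):
    """Build an output name such as 'scan_page_1.jpg' for page 1 of 'scan.pdf'."""
    stem = os.path.splitext(os.path.basename(pdf_file))[0]
    return "{}_page_{}.jpg".format(stem, page_number)


def image_conversion(inpath, image_path, dpi=300):
    """Convert every PDF in `inpath` to one JPG per page inside `image_path`."""
    # Imported here so the naming helper above works even without pdf2image.
    from pdf2image import convert_from_path

    saved = []
    for file_name in sorted(os.listdir(inpath)):
        if not file_name.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        # convert_from_path returns one PIL image per page of the PDF.
        pages = convert_from_path(os.path.join(inpath, file_name), dpi=dpi)
        for number, page in enumerate(pages, start=1):
            target = os.path.join(image_path, page_image_name(file_name, number))
            page.save(target, "JPEG")
            saved.append(target)
    return saved
```

With the two folders in place, calling image_conversion("inpath", "image_path") writes one JPG per page and returns the list of saved files.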

Here is what the function does: image_conversion takes two arguments. The first, inpath, is the path of your input PDF document; the second, image_path, is where the output image is stored.

Line 1 imports the pdf2image library, and line 13 uses it to convert the PDF at the input path into an image (I chose the JPG format).

Output: the converted image appears in the image_path folder.

That’s it, you have successfully created an image from a PDF file.

Step 3:

Convert the image to a searchable PDF.

Installing dependencies

  1. Pytesseract
  2. Numpy
  3. OpenCV

Pytesseract: Pytesseract (Python-Tesseract) is a wrapper for the Tesseract-OCR engine. To install it, run this command in the Anaconda terminal or in the Spyder IPython console: pip install pytesseract

Numpy: NumPy is a package for scientific computing in Python. To install it, run: pip install numpy

OpenCV: OpenCV (Open Source Computer Vision Library) was built to provide a common infrastructure for computer vision applications. To install it, run: pip install opencv-python

Before executing the program below, create a folder named output_pdf.

Create a new file named conversion.py, write the following program, and execute it.
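As with Step 2, the original listing is not reproduced here, so the following is a hedged sketch of conversion.py rather than the exact program the line numbers in this article refer to. The two Windows paths are assumed default install locations and must be changed to match your machine. The names output_pdf_name and make_searchable_pdf are my own, and pointing Tesseract at the language data through the TESSDATA_PREFIX environment variable is one possible way to pass the tessdata path.

```python
import os

# Assumed default install locations on Windows; change these to match your machine.
TESSERACT_EXE = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
TESSDATA_DIR = r"C:\Program Files\Tesseract-OCR\tessdata"


def output_pdf_name(image_file):
    """Map 'image_path/scan_page_1.jpg' to 'scan_page_1.pdf'."""
    stem = os.path.splitext(os.path.basename(image_file))[0]
    return stem + ".pdf"


def make_searchable_pdf(image_file, output_dir, lang="eng"):
    """OCR one page image and write a searchable PDF into `output_dir`."""
    # Imported here so the helper above can be used without the OCR stack installed.
    import cv2
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = TESSERACT_EXE
    os.environ["TESSDATA_PREFIX"] = TESSDATA_DIR  # where the language data lives

    image = cv2.imread(image_file)  # NumPy array in BGR channel order
    if image is None:
        raise FileNotFoundError(image_file)  # cv2.imread returns None on failure
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # pytesseract expects RGB

    # Returns the PDF as bytes: the page image with an invisible text layer.
    pdf_bytes = pytesseract.image_to_pdf_or_hocr(rgb, lang=lang, extension="pdf")
    target = os.path.join(output_dir, output_pdf_name(image_file))
    with open(target, "wb") as handle:
        handle.write(pdf_bytes)
    return target
```

Calling make_searchable_pdf on an image from the image_path folder, with "output_pdf" as the output directory, produces one searchable PDF per page image. A multi-page scan therefore yields one PDF per page in this sketch.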

Lines 3 to 4 import the necessary libraries. Lines 6 to 8 set the paths of the Tesseract-OCR installation: give the paths of tesseract.exe and the tessdata folder.

Line 10 is the path of your converted image. In line 12 the image is read as a NumPy array, and lines 14 to 18 convert the image into a searchable PDF and store it in the output_pdf folder.

Really, that’s it: you have turned a scanned PDF into a searchable PDF.

Rakesh M

Engineer - Data Scientist at Kanini Software Solutions