Published in Analytics Vidhya

Restoring Text From A PDF Image

The pythonic way…

Photo by Colin Nixon at www.freeimages.com

Have you ever scanned a document into a PDF as an image and later realized that you actually needed to edit it? Adobe Acrobat has built-in optical character recognition (OCR) that makes for an easy fix, if you have the Pro version. If you don’t have this luxury but have a few minutes, keep reading.

What you need…

  1. Python 3
  2. Tesseract OCR: sudo apt-get install tesseract-ocr
  3. These Python libraries: wand, Pillow, pyocr, PySimpleGUI
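
Once everything is installed, a quick check can confirm that Python actually sees each item in the list. The sketch below is only a sanity check and not part of the program itself. The helper name `missing_requirements` is made up for illustration, and it assumes the import names `wand`, `PIL` (for Pillow), `pyocr` and `PySimpleGUI`, plus a `tesseract` executable on your PATH.

```python
import importlib.util
import shutil


def missing_requirements(modules, executables):
    """Return the names from both lists that cannot be found on this machine."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in executables:
        # shutil.which returns None when the program is not on the PATH
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Pillow is imported as PIL; the other libraries import under their own names
    problems = missing_requirements(
        ["wand", "PIL", "pyocr", "PySimpleGUI"],
        ["tesseract"],
    )
    print("Missing:", problems if problems else "nothing")
```

If anything shows up as missing, install it before running the main script.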

Set up a virtual environment with your Python version of choice, install the libraries and run the code:

import io

import PySimpleGUI as sg
import pyocr
import pyocr.builders
from PIL import Image as PI
from wand.image import Image

sg.theme('DarkAmber')

# Use the first OCR tool pyocr finds (Tesseract, if it is installed)
tool = pyocr.get_available_tools()[0]
# Prefer English; its position in the language list varies between installs
languages = tool.get_available_languages()
lang = 'eng' if 'eng' in languages else languages[0]

# Set GUI layout
layout = [
    [sg.Text("Choose PDF:")],
    [sg.FileBrowse(key='user_file_conv'), sg.Text('', size=(32, 1))],
    [sg.Text("Name Your File:"), sg.InputText(key='user_file_save')],
    [sg.Button('Convert'), sg.Cancel()],
]
window = sg.Window('PDF_Converter', layout)


def produce_text(cmd_input, save_name):
    req_image = []
    final_text = []

    # Render each page at 300 dpi and keep it as a JPEG blob
    image_pdf = Image(filename=cmd_input, resolution=300)
    image_jpeg = image_pdf.convert('jpeg')
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))

    # Run OCR on every page
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)

    # Write the recognized text to <save_name>.txt
    with open('{}.txt'.format(save_name), 'w') as file:
        for txt in final_text:
            file.write(txt)


# ---------Event Loop-------------------------------
while True:
    event, values = window.read()
    if event in (None, 'Cancel'):
        break
    if event == 'Convert':
        produce_text(cmd_input=values['user_file_conv'],
                     save_name=values['user_file_save'])
window.close()
Resulting PDF_Converter GUI

If you don’t like the “DarkAmber” theme, feel free to change it up.
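
One thing to be aware of in the script: the recognized pages are written to the output file back to back, so page boundaries disappear in a multi-page PDF. The sketch below shows one possible way to keep them. `join_pages` and its marker format are hypothetical, purely for illustration, and the helper works on plain strings, so it can be tried without Tesseract.

```python
def join_pages(pages, separator="\n\n--- page {} ---\n\n"):
    """Combine per-page OCR text into one string with a marker before each page."""
    if not pages:
        return ""
    parts = []
    for number, text in enumerate(pages, start=1):
        parts.append(separator.format(number))
        parts.append(text.strip())
    # Drop the blank lines in front of the first marker and end with a newline
    return "".join(parts).lstrip() + "\n"
```

Inside produce_text, the final loop over final_text could then be swapped for a single file.write(join_pages(final_text)).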

One problem you might run into is permissions in your /etc/ImageMagick-6/policy.xml file. If the program fails to read the PDF, edit the file as superuser and change the PDF rights from “none” to “read”. Save the file and you should be good to go.
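
To see what the policy currently allows before editing anything, a small read-only check can help. On many Debian and Ubuntu installs the relevant entry looks like `<policy domain="coder" rights="none" pattern="PDF" />`, though your file may differ. The sketch below assumes that layout. `pdf_policy_rights` is a hypothetical helper name, and the script only reads the file and never changes it.

```python
import xml.etree.ElementTree as ET


def pdf_policy_rights(policy_xml):
    """Return the rights strings of every coder policy whose pattern covers PDF."""
    root = ET.fromstring(policy_xml)
    rights = []
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # Patterns look like "PDF" or "{PS,EPS,PDF,XPS}"
        pattern = policy.get("pattern", "")
        names = [p.strip().upper() for p in pattern.strip("{}").split(",")]
        if "PDF" in names:
            rights.append(policy.get("rights", ""))
    return rights


if __name__ == "__main__":
    # Path taken from the article; it may differ on your system
    path = "/etc/ImageMagick-6/policy.xml"
    try:
        with open(path, "rb") as handle:
            print(pdf_policy_rights(handle.read()))
    except (OSError, ET.ParseError) as error:
        print("Could not check", path, "-", error)
```

A result of ['none'] would suggest that the PDF entry is what blocks the read. An empty list means no PDF-specific coder policy was found.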

Now let’s test. Here is an example of a PDF image that is not editable:

And here is the output after running it through our pdf_OCR:

IDRH
Non-text-searchable PDF
This is an example of a non-text-searchable PDF. Because it was created from an image rather than a text document, it cannot be rendered as plain text by the PDF reader. Thus, attempting to select the text on the page as though it were a text document or website will not work, regardless of how neatly it 1s organized.

The great part of the program is that it will read and convert text from other images as well as PDF images. Here is a picture I took of a menu from some junk mail.

Junk mail picture stored as .jpg

Here is the resulting text after running the image through the program.

MENU
First Course
Shrimp Gumbo | Pappadeaux Salad
Second Course
Filet Mignon
Salmon Alexander
Big Bay Platter
Giant Shrimp & Creamy Grits
Third Course
Vanilla Cheesecake
With fresh strawberries
Turtle Fudge Brownie
With pecans, chocolate and caramel sauce

The program doesn’t work as well on handwritten notes, but maybe that’s a project for another time.

Husband, Father, Pediatrician & Informaticist writing about whatever is on my mind for today.