OCR with Python — $50K return
2 min read · Jul 8, 2022
I have a project with a client who found me through my articles on medium.com. The contract requires me to extract data from image-based PDFs. Here, I will walk you through the process of extracting text from these PDFs.
The OCR script below has generated over $50K in the last few years.
First, the imports. There are a few extra imports here, but they will not affect your script.
import pandas as pd
import os, sys, gc, io, pytesseract, urllib, requests, re, smtplib
from datetime import datetime, date, timedelta
from dateparser.search import search_dates
from bs4 import BeautifulSoup
from wand.image import Image as wi
from pdf2image import convert_from_path, convert_from_bytes
from PIL import Image, ImageFilter
from pytesseract import Output
import urllib.request
from pathlib import Path
import glob
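Before running anything, it can help to confirm that the pieces the OCR step relies on are installed. The following is a minimal sketch, not part of my original script: the helper name `missing_dependencies` and its default lists are assumptions chosen for illustration. It only checks whether the Python packages can be found and whether the `tesseract` executable is on the PATH, which pytesseract needs in order to work.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("wand", "pytesseract", "PIL"),
                         binaries=("tesseract",)):
    """Return the names of Python modules and executables that cannot be found.

    Hypothetical helper: the default lists are an assumption about what
    the OCR function needs, not an exhaustive requirements file.
    """
    # find_spec returns None when a top-level module is not installed.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on the PATH.
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies())
```

An empty list means everything in the two lists was found. Wand also needs ImageMagick, and Ghostscript for reading PDFs, so you may want to add those executables to `binaries` under whatever names they have on your system.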
Next, you will need a function that extracts every page of the PDF, not just the first.
def Get_text_from_image(pdf_path):
    try:
        pdf=wi(filename=pdf_path,resolution=300)
        pdfImg=pdf.convert('jpeg')
        imgBlobs=[]
        extracted_text=[]
        for img in pdfImg.sequence:
            page=wi(image=img)
            imgBlobs.append(page.make_blob('jpeg'))
        for imgBlob in imgBlobs…
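The listing above is cut off at the loop over `imgBlobs`. As a rough sketch of how that loop could continue, assuming each JPEG blob is opened with PIL and passed to `pytesseract.image_to_string`, something like the following would collect one string per page. The names `tesseract_ocr` and `text_from_blobs` are hypothetical and not taken from the original script, and the `lang="eng"` setting is an assumption.

```python
import io


def tesseract_ocr(blob):
    """OCR one encoded image (for example JPEG bytes) with Tesseract.

    Assumption: English text; change lang for other languages.
    """
    # Imported lazily so this module still loads where Tesseract is absent.
    import pytesseract
    from PIL import Image

    image = Image.open(io.BytesIO(blob))
    return pytesseract.image_to_string(image, lang="eng")


def text_from_blobs(img_blobs, ocr=tesseract_ocr):
    """Run the ocr callable over every page blob; return one string per page."""
    extracted_text = []
    for blob in img_blobs:
        extracted_text.append(ocr(blob))
    return extracted_text
```

Inside `Get_text_from_image`, this would be called as `text_from_blobs(imgBlobs)` once the blobs have been collected. Passing the OCR step in as a parameter is a design choice made here so the loop can be tried with a stand-in function on a machine without Tesseract.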