Build a Web App to convert Image document to scanned PDF document using Python

Lakshmi Narayana Santha
Published in Analytics Vidhya
7 min read · Sep 27, 2019

This article is about converting multiple images of text documents into a single scanned PDF document and deploying this model as a web app using Flask in Python.

Set up a virtual environment for our project

$ pip install virtualenv

Create a virtual environment and name it “imagetopdf”

$ virtualenv imagetopdf

Activate the virtual environment

$ source imagetopdf/bin/activate

Move to our project folder “imagetopdf” and create a new folder “app”.
All our files are stored in this directory, with the path
“~/../imagetopdf/app/”

$ cd imagetopdf && mkdir app && cd app

Now install the required dependencies (PIL is installed from the “Pillow” package)

$ pip install opencv-python
$ pip install Pillow
$ pip install scikit-image

Open a text editor and create a Python file named “imagetopdf.py”

Import packages

Load images using cv2 and convert to binary images

Load the images in the current working directory that have the file extensions .jpeg, .png or .jpg
as cv2 arrays, which are in “BGR” format by default. Insert all the images into a list as they are loaded from the directory.

NOTE: PDF pages are built in the same order in which the images are read from the directory, so naming the images ‘img_1’, ‘img_2’, ….. gives the desired page order.
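The ordering in the note can be made explicit instead of relying on how the directory happens to be listed. Below is a minimal sketch, assuming the images sit in one folder; the helper names natural_key and list_image_files are made up for this illustration and are not part of the original script. It sorts names so that ‘img_2’ comes before ‘img_10’.

```python
import os
import re

# The three extensions the article works with
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png"}


def natural_key(name):
    """Split 'img_10.png' into ['img_', 10, '.png'] so numbers compare as numbers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def list_image_files(directory="."):
    """Return the image file names of a directory in natural (page) order."""
    names = [name for name in os.listdir(directory)
             if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS]
    return sorted(names, key=natural_key)
```

Each returned name can then be loaded with cv2.imread(os.path.join(directory, name)) and appended to the list of images.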

For better image operations, convert the “BGR” images to binary images.

img_gray=cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)

converts the image from “BGR” format to gray scale.

clahe=cv2.createCLAHE(clipLimit=4.0,tileGridSize=(16,16))
img_gray=clahe.apply(img_gray)

The above lines improve image contrast by applying CLAHE (Contrast Limited Adaptive Histogram Equalization).

ret,th=cv2.threshold(img_gray,130,255,cv2.THRESH_BINARY)
thsh_images.append(th)

This applies a threshold to the gray-scale image with threshold_value=130 and stores the binary images in a list.

Find contours in image and fit Maximum area contour to image

There may be cases where the photo does not contain only the area of interest: it also includes background and other unwanted content, and it should be reduced to the single page only. The image below, for example, contains background and part of the next page.

To get only the important content from the image, we extract our area of interest using contours in OpenCV.

Since our area of interest is the largest region of the image, it can be identified as the contour with the maximum area, which should be close to a rectangle. For each image, we find the maximum-area contour, which is the single document page, and fit that contour to the image, which is the same as drawing something on the image.

contours,_=cv2.findContours(img.copy(),cv2.RETR_TREE,cv2.CHAIN_APPROX_SIMPLE)

cv2.RETR_TREE retrieves all the contours together with their hierarchy, and cv2.CHAIN_APPROX_SIMPLE stores only the end points that describe the outline of each contour instead of every co-ordinate on it, which would take more space.

area=cv2.contourArea(cnt)
if area > max_area:
    max_area=area
    max_ind=ind

This retrieves the area of each contour and keeps the index of the contour with the maximum area, which is then fitted like below

Draw closest rectangle shape to maximum contour which usually describes a page of PDF

Fitted contours can have any shape, but we are looking for a rectangle, which is the shape of a page and of the usual desktop and smartphone screens.

epsilon=0.02*cv2.arcLength(image_conts[ind][max_area_conts[ind]],True)
approx=cv2.approxPolyDP(image_conts[ind][max_area_conts[ind]],epsilon,True)

The above lines find the closest rectangle points for our fitted maximum-area contour, like below

Using perspective transformation transform four sides of rectangle to full co-ordinates of an image

We end up with a rectangular contour, which is our area of interest. Now we extract and enlarge only this rectangle, like from img1 to img2

img1: original image with unwanted background
img2: the background is removed to some extent

Improve gray scale image contrast using scikit-image and save images

th=threshold_local(img_gray.copy(),101,offset=10,method="gaussian")

The above scikit-image threshold_local() function computes a local (adaptive) threshold, which turns a low-contrast image into a very high-contrast one, like from img1 to img2 below

img1: low contrast image
img2: improved image contrast

Save the high-contrast images under the name “digitised_” + original image name, which helps in finding the images to convert to PDF.
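threshold_local() returns a threshold value for every pixel, so the image still has to be compared against it. A minimal sketch of that comparison follows; the function name digitise and the synthetic image are assumptions, and the block size 101 and offset 10 are the values used above.

```python
import numpy as np
from skimage.filters import threshold_local


def digitise(img_gray, block_size=101, offset=10):
    """Pixels brighter than their local threshold become white, the rest black."""
    th = threshold_local(img_gray, block_size, offset=offset, method="gaussian")
    return (img_gray > th).astype(np.uint8) * 255


# Synthetic page: light background with a dark square standing in for text
demo = np.full((120, 120), 200, dtype=np.uint8)
demo[50:70, 50:70] = 30
scanned = digitise(demo)
```

The result can then be written with cv2.imwrite("digitised_" + name, scanned), where name is the original file name.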

Load saved high contrast images and save as PDF using PIL

Load the saved images that have the prefix “digitised_” using PIL's Image module and save them as a PDF document.
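The code for this step is not shown in the article, so the following is a sketch of how it could look with Pillow. The function names are assumptions; the output name “digitised_images.pdf” is the one the Flask app later sends back. Pillow writes a multi-page PDF when save_all=True and the remaining pages are passed in append_images.

```python
import os
from PIL import Image

IMAGE_EXTENSIONS = (".jpeg", ".jpg", ".png")


def find_digitised_images(directory="."):
    """Return the paths of the saved 'digitised_' images, sorted by name."""
    names = sorted(name for name in os.listdir(directory)
                   if name.startswith("digitised_")
                   and name.lower().endswith(IMAGE_EXTENSIONS))
    return [os.path.join(directory, name) for name in names]


def images_to_pdf(image_paths, pdf_path="digitised_images.pdf"):
    """Write the images into one PDF, one image per page."""
    pages = [Image.open(path).convert("RGB") for path in image_paths]
    if not pages:
        raise ValueError("no images to convert")
    pages[0].save(pdf_path, "PDF", save_all=True, append_images=pages[1:])
    return pdf_path
```

Calling images_to_pdf(find_digitised_images()) in the project folder would produce the PDF next to the images. The conversion to “RGB” is a precaution, since Pillow cannot write images with an alpha channel to PDF.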

Make Web Interface using Flask

Install Flask (Werkzeug is installed along with it as a dependency)

$ pip install Flask

Flask is a lightweight framework for building simple back-end apps in Python.

A Flask app typically has the following skeleton.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

if __name__ == '__main__':
    app.run(debug=True)

The import statement brings in the Flask class from the flask package.
The next statement creates the app object from the Flask class; “__name__” is the name of the current module, which Flask uses to locate the app's files.
The @app.route("/") decorator maps the home URL with “/”, like http://home/, to the function defined right below it.
The hello() function is what gets executed when that URL is requested.
The last two lines start our app with ‘debug=True’, meaning debug mode is on, when the script is run directly.
If we run this script, saved as “app.py”, and there are no errors, we see a screen like this

To test our app, open a browser and go to “localhost:5000”
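The same check can be done without a browser by using Flask's built-in test client. This is an optional sketch that repeats the skeleton so it can run on its own.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def hello():
    return "Hello World!"


# The test client sends a request to the app without starting a server
with app.test_client() as client:
    response = client.get("/")
    print(response.status_code, response.get_data(as_text=True))
    # prints: 200 Hello World!
```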

Now, to deploy our model, we change the “app.py” file and add an HTML page that gives users an interface for uploading files, like

"index.html"

<html>
<head>
</head>
<body>
<!-- File Uploading-->
<form action='/uploadimages' method="POST"
enctype="multipart/form-data">
<input type="file" name="files" multiple />
<input type="submit" value="Upload" />
</form>
</body>
</html>

In the above snippet, the “/uploadimages” part of the form's action attribute states where the app should route the request when this form is submitted. It matches the app.route("/uploadimages") decorator, under which we define the function that handles this file request.

To serve the HTML file, we must save it in a separate folder named “templates”, which is where Flask looks for the HTML pages it renders. You can add CSS, JS, jQuery, Bootstrap and media content such as images and videos, but all of these files must be saved in another folder named “static”.

NOTE: Change “imagetopdf.py” by adding a class named “Imagetopdf” whose method “convert()” contains all our previous code for converting images to PDF.

from flask import Flask, request, render_template, send_from_directory, abort
from imagetopdf import Imagetopdf
import os

app = Flask(__name__)

@app.route('/uploadimages', methods=['POST', 'GET'])
def uploadimages():
    curr_path = os.getcwd()
    # Remove the images and PDFs left over from a previous request
    files_in_dir = os.listdir()
    for file in files_in_dir:
        if file.split('.')[-1] in ['jpeg', 'png', 'jpg', 'pdf']:
            os.remove(file)
    # Save the newly uploaded images in the working directory
    uploaded_files = request.files.getlist("files")
    for file in uploaded_files:
        if file.filename.split('.')[-1] in ['jpeg', 'png', 'jpg']:
            file.save(file.filename)
    # Convert the saved images to 'digitised_images.pdf'
    imagetopdf_obj = Imagetopdf()
    imagetopdf_obj.convert()
    try:
        return send_from_directory(curr_path, 'digitised_images.pdf',
                                   as_attachment=True)
    except Exception:
        abort(404)

@app.route("/")
def index():
    return render_template("index.html")

if __name__ == "__main__":
    app.run(debug=True)

In the above snippet, we change our “app.py” by adding a routine that handles the file requests. It first removes any previous image or PDF files, then saves the uploaded images, and finally uses our “imagetopdf.py” to convert them to PDF.
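One thing worth hardening in the upload routine is the file name, because it comes straight from the user. The helper below is a suggestion, not part of the original app: the name safe_image_name is made up, while secure_filename is the Werkzeug utility for cleaning uploaded file names. It also accepts upper-case extensions such as “.JPG”, which the plain split('.') check would reject.

```python
import os
from werkzeug.utils import secure_filename

ALLOWED_EXTENSIONS = {"jpeg", "jpg", "png"}


def safe_image_name(filename):
    """Return a cleaned file name for allowed image types, otherwise None."""
    name = secure_filename(filename or "")
    extension = os.path.splitext(name)[1].lstrip(".").lower()
    return name if extension in ALLOWED_EXTENSIONS else None


print(safe_image_name("scan.JPG"))   # scan.JPG
print(safe_image_name("notes.txt"))  # None
```

Inside uploadimages() the loop could then call safe_image_name(file.filename) and save the file only when the result is not None.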

Deploy to web with Heroku free hosting

Install Heroku CLI with command

$ sudo snap install --classic heroku

Before that, create an account on Heroku

Log in to your Heroku account

$ heroku login -i

Now create an app in Heroku

$ heroku create {name of your app}

Go to the project directory “app” of our virtual environment, where our files and sub-directories are structured as

app
--templates
--static
-app.py
-imagetopdf.py

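Create a “Procfile” containing the text “web: gunicorn -w 4 -b 0.0.0.0:$PORT app:app”. To do so, first install ‘gunicorn’. The single quotes in the echo command keep the shell from expanding $PORT while the file is written.

$ pip install gunicorn
$ echo 'web: gunicorn -w 4 -b 0.0.0.0:$PORT app:app' > Procfile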

As our app requires dependencies, Heroku installs them if they are listed in a separate file called “requirements.txt”. These dependencies can be easily collected with the command “pip freeze” and saved to the file “requirements.txt” as

$ pip freeze > requirements.txt

IMPORTANT NOTE: OpenCV errors like
“can’t import shared library libSM.so.6 …..”
may occur if the app is deployed directly to Heroku. This can be handled by changing
“opencv-python==4.1.1.26” to “opencv-python-headless==4.1.1.26” in the “requirements.txt” file. The headless package is built without OpenCV's GUI components, so it does not need the desktop libraries (such as libSM) that are missing on the server.
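If you regenerate “requirements.txt” often, the swap can be scripted. The sketch below is one possible way to do it; the function name use_headless_opencv is made up, and the Flask and Pillow pins in the sample text are only placeholders, not versions taken from the article.

```python
import re


def use_headless_opencv(requirements_text):
    """Replace the opencv-python requirement with opencv-python-headless."""
    return re.sub(r"(?m)^opencv-python(?=[=<>~!\s]|$)",
                  "opencv-python-headless", requirements_text)


# Placeholder pins for illustration; only the OpenCV pin comes from the article
sample = "Flask==1.1.1\nopencv-python==4.1.1.26\nPillow==6.1.0\n"
print(use_headless_opencv(sample))
```

A line that already names opencv-python-headless is left unchanged, so running the function twice is harmless.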

Now our files structure looks like

app
--templates
--static
-app.py
-imagetopdf.py
-Procfile
-requirements.txt

We are good to deploy our app to Heroku through ‘git’.

$ git init
$ heroku git:remote -a {name of your app}
$ git add .
$ git commit -m "some message"
$ git push heroku master

Check the logs in the Heroku dashboard to confirm a successful build. If everything goes right, our app is now deployed to the web and anyone can access it using the URL
“https://{nameofourapp}.herokuapp.com/”

Here is the link to the site where I deployed my model: https://imagetopts.herokuapp.com/imagetopdf (only the imagetopdf feature works)

Note: Upload small images, as free accounts may run into timeout errors, and
the app takes about 20 seconds to restart from its sleep state as we haven’t subscribed.

You can also open the deployed app in your browser using the command

$ heroku open

If the app does not start or shows application errors, check the logs using the command

$ heroku logs -t

Check my GitHub repository, where I added two other features: Imagetotext and Imagetospeech.

Feel free to connect
LinkedIn GitHub
