Filling Editable PDF in Python

Using Python to automate the editable PDF filling process.

Vivsvaan Sharma
6 min read · Aug 17, 2020

Introduction

Unlike traditional paper forms, which are often a headache to complete by hand, fillable PDF forms are convenient and offer a broad range of user interactivity. They are fast and easy to fill, less error-prone, easy to archive and access, and completely reusable.

Many companies these days have started using editable PDF forms instead of paper forms to collect information from their clients. In most organizations (like banks and universities), users fill in a form on a website, and that data is later fetched from the database to fill a PDF with a pre-defined template and fields, which is then stored for later use.

In this article, I'm gonna use Python to automate this process of filling data into a PDF. Let's get started.

Making our Class

So first, I'm gonna make a class that will hold all the methods for working with the PDF. I'm naming it ProcessPdf. In its __init__() function, I'll set a temporary folder path where all our temporary files (PDFs) will be stored, and an output file path where the final PDF will be stored.

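import os


class ProcessPdf:
    def __init__(self, temp_directory, output_file):
        # Make sure the folder for intermediate PDFs exists
        os.makedirs(temp_directory, exist_ok=True)
        self.temp_directory = temp_directory
        self.output_file = output_file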

Next, I'll make a function for writing data into the PDF. I'll pass it the path to my PDF template file and the data object containing all the data to be filled. Then I'll read the PDF using the pdfrw library (imported with import pdfrw).

Note: The keys in the data object must match the field names in the PDF.

def add_data_to_pdf(self, template_path, data):
    template = pdfrw.PdfReader(template_path)

Loop over all the pages in the PDF, and for each page, loop over all the editable fields (called annotations, or annots). For each page, the annotations are stored under the '/Annots' key, and the name of each field is stored under its '/T' key.

    for page in template.pages:
        annotations = page['/Annots']

        for annotation in annotations:
            key = annotation['/T'][1:-1]
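
The [1:-1] slice is there because the raw field name arrives wrapped in parentheses, for example '(name)'. A minimal sketch of that step in plain Python, assuming the name is a literal (parenthesised) string; field_name is a hypothetical helper, not part of pdfrw:

```python
def field_name(raw_t):
    """Strip the surrounding parentheses from a raw name such as '(name)'."""
    # Hypothetical helper: only handles literal strings wrapped in ( and ).
    if raw_t.startswith("(") and raw_t.endswith(")"):
        return raw_t[1:-1]
    # Anything else is returned unchanged rather than sliced blindly.
    return raw_t


print(field_name("(email)"))
```

Checking for the parentheses first avoids chopping characters off a name that is stored in some other form.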

Now that we have our fields, we can add values to them. We can't simply assign a plain Python value to a field; it has to be encoded as an object that the PDF understands. For that, I've made another function, encode_pdf_string().

I'll pass the value and the type of the key (string, checkbox, etc.) to the function, and it will return the encoded value.

def encode_pdf_string(self, value, type):
    # 'type' is the kind of field: 'string' or 'checkbox'
    if type == 'string':
        if value:
            # Text values are written in upper case
            return pdfrw.objects.pdfstring.PdfString.encode(value.upper())
        else:
            return pdfrw.objects.pdfstring.PdfString.encode('')
    elif type == 'checkbox':
        if value == 'True' or value == True:
            return pdfrw.objects.pdfname.BasePdfName('/Yes')
        else:
            return pdfrw.objects.pdfname.BasePdfName('/No')
    return ''

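Note that here I've also added a case for the checkbox. BasePdfName('/Yes') marks the checkbox as checked and shows a tick, while BasePdfName('/No') leaves it unchecked. The name of the checked state is defined by the template and is not always /Yes, so check your own PDF if the tick does not appear.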

We can use this function to set the value of a field like this:

annotation.update(pdfrw.PdfDict(V=self.encode_pdf_string(data[key], FORM_KEYS[key])))
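
FORM_KEYS is not defined in the snippets here; presumably it is one of the constants you supply yourself. A minimal sketch, assuming it is a plain dict that maps each field name to 'string' or 'checkbox' (the field names and the missing_types helper are made up for illustration):

```python
# Hypothetical constants: the field names are illustrative and must match
# the /T names in your own PDF template.
FORM_KEYS = {
    "name": "string",
    "email": "string",
    "is_indian_citizen": "checkbox",
}

data = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "is_indian_citizen": True,
}


def missing_types(data, form_keys):
    """Return the data keys that have no declared field type."""
    return sorted(key for key in data if key not in form_keys)


# An empty list means every key in data has a type in FORM_KEYS.
print(missing_types(data, FORM_KEYS))
```

Running a check like this before filling the PDF turns a KeyError deep inside the loop into an early, readable list of problem keys.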

After adding all the values we can write our Pdf and save it.

pdfrw.PdfWriter().write(os.path.join(self.temp_directory, "data.pdf"), template)

But before that, there are two more things. Optionally, we can make the filled fields non-editable so that the user cannot change them afterwards. We do this by setting each annotation's field flags entry (/Ff) to 1, which turns on the read-only bit.

annotation.update(pdfrw.PdfDict(Ff=1))
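
One thing to keep in mind: /Ff is a bit field, so writing Ff=1 replaces whatever flags the template had set (for example the multiline bit on a text box). A hedged sketch of setting only the read-only bit; the constant and function names are my own, and the bit positions follow the field-flag tables in the PDF specification:

```python
READ_ONLY = 1 << 0    # bit position 1: the user may not change the value
REQUIRED = 1 << 1     # bit position 2: the field must have a value on export
NO_EXPORT = 1 << 2    # bit position 3: the field is not exported on submit
MULTILINE = 1 << 12   # bit position 13: text field may span several lines


def with_read_only(existing_flags):
    """Return the flags with the read-only bit set, keeping all other bits."""
    # A missing /Ff entry is treated as 0 (no flags set).
    return (existing_flags or 0) | READ_ONLY


print(with_read_only(MULTILINE))
```

With pdfrw, the existing /Ff entry would first have to be converted to an int; treat that conversion as an assumption to verify against your own template.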

Sometimes, after filling the PDF, the values are not visible until you click on the fields. To fix that, we set another flag, NeedAppearances, to true, which asks the viewer to regenerate the appearance of the fields.

template.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))

So now our function will look like this:

def add_data_to_pdf(self, template_path, data):
    # Needs os, re and pdfrw imported at the top of the file
    print('\nAdding data to pdf...')
    template = pdfrw.PdfReader(template_path)
    for page in template.pages:
        annotations = page['/Annots']
        if annotations is None:
            continue
        for annotation in annotations:
            # Only named widget annotations are form fields
            if annotation['/Subtype'] == '/Widget' and annotation['/T']:
                key = annotation['/T'][1:-1]
                # Drop a trailing "-<number>" so name-1, name-2 map to "name"
                key = re.sub(r'-[0-9]+$', '', key)
                if key in data:
                    annotation.update(pdfrw.PdfDict(
                        V=self.encode_pdf_string(data[key], FORM_KEYS[key])))
                # Mark the field read-only
                annotation.update(pdfrw.PdfDict(Ff=1))
    template.Root.AcroForm.update(
        pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
    output_path = os.path.join(self.temp_directory, 'data.pdf')
    pdfrw.PdfWriter().write(output_path, template)
    print('Pdf saved')
    return output_path

Note: As far as I can tell, two independent fields in a PDF can't share the same name. So if the same value (like name) appears in several fields, we can name them name-1, name-2, name-3, and so on, and strip the number from the key while filling the data (see the regular-expression step on the key above).
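
One way to strip the numeric suffix regardless of how many digits it has is to anchor the pattern at the end of the key. A small sketch; base_key is a hypothetical helper, not part of the class above:

```python
import re

# A hyphen followed by digits, only at the very end of the key.
_NUMERIC_SUFFIX = re.compile(r"-[0-9]+$")


def base_key(field_key):
    """Map 'name-1', 'name-2', 'name-12' back to 'name'."""
    return _NUMERIC_SUFFIX.sub("", field_key)


for example in ("name-1", "name-12", "date-of-birth", "name"):
    print(example, "->", base_key(example))
```

Because the pattern is anchored with $, hyphens in the middle of a key (as in date-of-birth) are left alone, and a two-digit suffix such as -12 is removed in full.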

Some Utility Functions

Now that we've made a function for filling data into the PDF, there are some other utility functions that might be helpful.

First is adding images to the PDF. This can be helpful if the user uploads KYC data like an Aadhar card image or a signature and you want to add it to the PDF.

For this, I'm gonna make a function add_image_to_pdf(). I'll pass it the PDF path, a dict of images, and a list of positions where the images will be added.

I'm using the fitz library (the import name of PyMuPDF) for adding images. First I'll open the PDF and loop over the list of positions; for each position, I'll use the insertImage() function (named insert_image() in newer PyMuPDF releases) to add the image.

insertImage() takes two arguments: a rectangle where the image will be placed, and the path of the image file, which comes from the images dict. I'll build the rectangle with fitz.Rect(), passing the coordinates of its upper-left (x0, y0) and bottom-right (x1, y1) corners.
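
The rectangle coordinates are in PDF points (72 per inch), measured from the top-left corner of the page. A rough sketch for working out the four numbers from physical measurements; rect_from_inches is a hypothetical helper and returns a plain tuple that could be unpacked into fitz.Rect():

```python
POINTS_PER_INCH = 72


def rect_from_inches(left, top, width, height):
    """Convert a box given in inches into (x0, y0, x1, y1) in points."""
    x0 = left * POINTS_PER_INCH
    y0 = top * POINTS_PER_INCH
    x1 = x0 + width * POINTS_PER_INCH
    y1 = y0 + height * POINTS_PER_INCH
    return (x0, y0, x1, y1)


# A box 1 inch from the left, 2 inches from the top, 0.5 x 0.25 inches in size.
print(rect_from_inches(1, 2, 0.5, 0.25))
```

Measuring the target box on a printout of the template and converting it this way is usually quicker than guessing point values by trial and error.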

So our function will look like this:

def add_image_to_pdf(self, pdf_path, images, positions):
    # Needs os and fitz (PyMuPDF) imported at the top of the file
    print('\nAdding images to Pdf...')
    file_handle = fitz.open(pdf_path)
    for position in positions:
        # Pages are 1-based in positions but 0-based in fitz
        page = file_handle[int(position['page']) - 1]
        if position['image'] not in images:
            continue
        image = images[position['image']]
        # insert_image() was called insertImage() in older PyMuPDF releases
        page.insert_image(
            fitz.Rect(position['x0'], position['y0'],
                      position['x1'], position['y1']),
            filename=image
        )
    output_path = os.path.join(self.temp_directory, 'data_image.pdf')
    file_handle.save(output_path)
    print('images added')
    return output_path

Examples of the positions list and the images dict are:

positions = [
    {'page': 1, 'x0': 320, 'y0': 790, 'x1': 440, 'y1': 810,
     'image': 'signature'},
    {'page': 1, 'x0': 470, 'y0': 75, 'x1': 565, 'y1': 175,
     'image': 'aadhar-card'}
]
images = {'signature': 'path_to_signature',
          'aadhar-card': 'path_to_aadhar_card'}
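
Since a misspelled image name is silently skipped by add_image_to_pdf(), it can help to validate the two structures before calling it. A minimal sketch; check_positions is a hypothetical helper and the sample values mirror the example above:

```python
def check_positions(positions, images):
    """Return a list of problems; an empty list means every entry looks usable."""
    problems = []
    for index, pos in enumerate(positions):
        if pos["image"] not in images:
            problems.append(f"entry {index}: no image named {pos['image']!r}")
        if pos["x1"] <= pos["x0"] or pos["y1"] <= pos["y0"]:
            problems.append(f"entry {index}: rectangle has no area")
        if int(pos["page"]) < 1:
            problems.append(f"entry {index}: page numbers start at 1")
    return problems


images = {"signature": "path_to_signature",
          "aadhar-card": "path_to_aadhar_card"}
positions = [
    {"page": 1, "x0": 320, "y0": 790, "x1": 440, "y1": 810, "image": "signature"},
    {"page": 1, "x0": 470, "y0": 75, "x1": 565, "y1": 175, "image": "aadhar-card"},
]

print(check_positions(positions, images))
```

Returning the problems as a list, rather than raising on the first one, lets the caller report every bad entry in one pass.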

The second function compresses the PDF. It uses the gs command, which invokes Ghostscript, an interpreter for Adobe's PostScript and PDF formats. Ghostscript must be installed and available on your PATH.

So the function to compress the PDF using Ghostscript is:

def compress_pdf(self, input_file_path, power=3):
    """Function to compress PDF via Ghostscript command line interface"""
    # Needs os, sys and subprocess imported at the top of the file
    quality = {
        0: '/default',
        1: '/prepress',
        2: '/printer',
        3: '/ebook',
        4: '/screen'
    }
    output_file_path = self.output_file
    if not os.path.isfile(input_file_path):
        print("\nError: invalid path for input PDF file")
        sys.exit(1)
    if input_file_path.split('.')[-1].lower() != 'pdf':
        print("\nError: input file is not a PDF")
        sys.exit(1)
    print("\nCompressing PDF...")
    initial_size = os.path.getsize(input_file_path)
    subprocess.call(['gs', '-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4',
                     '-dPDFSETTINGS={}'.format(quality[power]),
                     '-dNOPAUSE', '-dQUIET', '-dBATCH',
                     '-sOutputFile={}'.format(output_file_path),
                     input_file_path])
    final_size = os.path.getsize(output_file_path)
    ratio = 1 - (final_size / initial_size)
    print("\nCompression by {0:.0%}.".format(ratio))
    print("Final file size is {0:.1f}MB".format(final_size / 1000000))
    return output_file_path
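
To see what is actually passed to Ghostscript without running it, the command list and the ratio calculation can be pulled out into small functions. This is only a sketch: build_gs_command and compression_ratio are hypothetical helpers that mirror the flags and arithmetic used in compress_pdf().

```python
def build_gs_command(input_path, output_path, setting="/ebook"):
    """Build the argument list for the gs call without executing it."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        f"-dPDFSETTINGS={setting}",
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        f"-sOutputFile={output_path}",
        input_path,
    ]


def compression_ratio(initial_size, final_size):
    """Fraction of the original size that was saved (0.75 means 75% smaller)."""
    return 1 - (final_size / initial_size)


print(" ".join(build_gs_command("data_image.pdf", "final_pdf.pdf")))
print("{0:.0%}".format(compression_ratio(1000, 250)))
```

Passing the arguments as a list, as the original call does, avoids shell quoting problems when a path contains spaces.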

Our work is not done yet. Note that all these functions return the path of the PDF they wrote. That is because those PDFs are temporary; the final PDF is the one returned by the compression function. So we need another function for deleting the temporary PDFs.

I'll use a simple os.remove call for this:

def delete_temp_files(self, pdf_list):
    print('\nDeleting Temporary Files...')
    for path in pdf_list:
        try:
            os.remove(path)
        except OSError:
            # The file may already be gone; there is nothing left to clean up
            pass
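
If you would rather know which files could not be removed instead of ignoring the failure, a variant can collect them. A minimal sketch using only the standard library; delete_files is a hypothetical stand-alone function and the file names are placeholders:

```python
import os
import tempfile


def delete_files(paths):
    """Delete each path and return the ones that could not be removed."""
    failed = []
    for path in paths:
        try:
            os.remove(path)
        except OSError:
            failed.append(path)
    return failed


# Try it on one file that exists and one that was never created.
with tempfile.TemporaryDirectory() as folder:
    existing = os.path.join(folder, "data.pdf")
    with open(existing, "w") as handle:
        handle.write("placeholder")
    missing = os.path.join(folder, "never_created.pdf")
    leftovers = delete_files([existing, missing])

print(leftovers)
```

Catching OSError rather than using a bare except keeps unrelated errors, such as a KeyboardInterrupt, from being swallowed.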

Now our ProcessPdf class is ready, and we can save it in a separate file (pdf_processing.py). Next, we need to make our driver function.

Making our Driver Function

In the main function, assuming that we've made our data object and a variable output_file containing the final PDF path and name, I can instantiate the ProcessPdf class.

pdf = ProcessPdf("pdf_temp", "final_pdf.pdf")

Now we’ll call our add_data_to_pdf() function.

temp_files = []
data_pdf = pdf.add_data_to_pdf(pdf_template_path, data)
temp_files.append(data_pdf)

where pdf_template_path is the path to our PDF template.

This returns the path of the PDF containing the data. I add that path to the temp_files list so the file can be deleted later, as it is not our final PDF.

Now I can call the add_image_to_pdf() function by passing data_pdf, the images dict, and the positions list.

data_image_pdf = pdf.add_image_to_pdf(data_pdf, image_list, image_positions)
temp_files.append(data_image_pdf)

The last step is to compress this pdf.

compressed_pdf = pdf.compress_pdf(data_image_pdf)

Now, as we have our final Pdf saved, we can delete all the temporary files/pdfs.

pdf.delete_temp_files(temp_files)

Some Image Processing

We can also make another class (in image_processing.py) for image operations, like removing the background from the signature that the user uploaded or rotating the image.

That class is part of the full code linked in the conclusion.

Conclusion

Now we've made our class for PDF processing, a class for image processing, and a driver function. The whole code is uploaded to my GitHub.

Note: Please replace all the constants with your own values/lists.
