Designing and building a micro-service: form-based PDF files

Lehar Oha

Published in

Swedbank AI

8 min read · Jan 15, 2021


Introduction

Among all the electronic document formats in use today, PDF is one of the most common; it has become a standard way to communicate business and academic information. It is compatible with most systems and convenient for a human reader, since it emphasizes readability. From an automation perspective, though, things are different: many PDF files are essentially images of documents (image-based PDFs), which makes them difficult to process. This is probably why people working with automation have a saying:

PDF is a format where good data goes to die.

To extract the information from such documents we need to deploy Optical Character Recognition (OCR) tools.

In this post we cover a sub-category of PDFs that is easier to work with: text-based form PDFs, particularly those that have fillable, interactive fields and whose text can be selected, searched, and copied.

These days, forms are mostly in (HTML) web formats, but PDF-based forms are also rather common, be it:

Transaction forms for UCITS funds purchase/redemption/switch
Insurance claim forms
Legacy bank application forms
Medical forms
Government-related application forms (today’s example)
…or your own area/industry-specific form

In this post, we will look at how to:

Generate fake data for the forms
Automate form filling (with fake data)
Extract/parse form data
Employ form data extraction and processing as a microservice.

To make things more concrete, let’s imagine the following scenario: a foreign student from Finland wants to start studying at the University of Tartu. Her first step is to apply for a temporary residence permit issued by the Police and Border Guard Board (read more here if you are interested). To complete the process, the prospective student must fill in the “Application For Temporary Residence Permit” PDF form; this is the form we’ll use as our example.

So let’s begin impersonating and liberating the data …

Fake data generation

First, we’ll create some artificial data about our student. We could use Faker or Mimesis among other tools. Let’s choose the first one and for simplicity generate 10 examples of:

Finnish female first name
Finnish (female) last name
Father name (patronym) (additionally we assume that the student’s father is Russian)
Estonian social security number (SSN), issued by EE Gov.
Birth date (between 20 and 30 years before today)
Birth place (in Finnish)
Email (with Finnish flavor)
Motivation text (not provided by Faker for ‘fi-locale’, constant text here, in Finnish)

Also note that, for simplicity, we ignore the fact that the SSN and the birth date are related, and potentially also the email and the person’s names.

We generate this data with a short Python code snippet.
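Faker can supply most of the fields listed above. As a complement, here is a minimal standard-library sketch of the one relationship we chose to ignore: deriving an Estonian personal ID code from a birth date. The helper names (`years_before`, `random_birth_date`, `estonian_id_code`) are our own, and the three-digit serial number is picked arbitrarily, so treat this as an illustration rather than a faithful generator.

```python
import random
from datetime import date, timedelta


def years_before(day, years):
    """Return the same calendar day `years` years earlier."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:  # 29 February mapped onto a non-leap year
        return day.replace(year=day.year - years, day=28)


def random_birth_date(rng, today, min_age=20, max_age=30):
    """Pick a birth date between `max_age` and `min_age` years before `today`."""
    earliest = years_before(today, max_age)
    latest = years_before(today, min_age)
    return earliest + timedelta(days=rng.randint(0, (latest - earliest).days))


def check_digit(first_ten):
    """Two-stage modulo-11 check digit of an Estonian personal ID code."""
    digits = [int(c) for c in first_ten]
    for weights in ((1, 2, 3, 4, 5, 6, 7, 8, 9, 1), (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)):
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        if remainder < 10:
            return remainder
    return 0  # both stages gave 10


def estonian_id_code(birth_date, female, serial):
    """Build an 11-digit code: sex/century digit, YYMMDD, serial, check digit."""
    century_base = {18: 1, 19: 3, 20: 5}[birth_date.year // 100]
    first = century_base + (1 if female else 0)
    body = f"{first}{birth_date:%y%m%d}{serial:03d}"
    return body + str(check_digit(body))


rng = random.Random(42)  # fixed seed so the fake data is reproducible
today = date.today()
personas = []
for _ in range(10):
    born = random_birth_date(rng, today)
    code = estonian_id_code(born, female=True, serial=rng.randint(0, 999))
    personas.append({"birth_date": born, "id_code": code})
```

With this approach the first seven digits of every generated code agree with the persona's sex and birth date, which the simplified data set does not guarantee.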

Automatic form filling

From the fake personas generated with the code above, we select one by eyeballing the data-frame (student Margareta is the lucky one) and auto-fill the form. Note that we can also use Python to download the form from the web; see the code for the implementation details of both.

Python provides many tools for working with PDF files (like PyPDF2). Here, we use MuPDF, a fast PDF parsing and rendering engine written in C, or more specifically its Python wrapper PyMuPDF, which the ‘fill_form_fields’ function uses to auto-fill the form. Note that PyMuPDF refers to PDF form fields as “widgets”.
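To give a feeling for what widget-based filling involves, here is a minimal sketch. It is not the ‘fill_form_fields’ implementation itself. The function name `fill_widgets` and the field names in the usage comment are hypothetical, and the function only assumes an object that behaves like a PyMuPDF document (iterating it yields pages, `page.widgets()` yields widgets with `field_name`, `field_value` and `update()`).

```python
def fill_widgets(doc, values):
    """Set the value of every widget whose field name is a key of `values`.

    Returns the number of widgets that were changed.
    """
    filled = 0
    for page in doc:
        for widget in page.widgets():
            if widget.field_name in values:
                widget.field_value = values[widget.field_name]
                widget.update()  # write the new value back into the PDF
                filled += 1
    return filled


# Possible usage with PyMuPDF (file and field names are made up):
#   import fitz  # PyMuPDF
#   doc = fitz.open("application_form.pdf")
#   fill_widgets(doc, {"first_name": "Margareta"})
#   doc.save("application_form_filled.pdf")
```

Because widgets are matched by name only, every widget sharing a name receives the same value.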

If we now look at the actual auto-filled form, we obtain the following (partial screenshot):

As we can see, the form has modified some of our values, for example “Patronym” (now in uppercase). This is because PDF fields can have specific triggers and actions (like JavaScript code) attached to them, be it enforcing a “field required” condition, limiting a field value to valid choices only (some dropdown fields), forcing some date-time formats and so on. Some forms apply many such restrictions and validations, some fewer, and some none at all. From an information extraction perspective, the latter put a heavier burden on developers, but good form design is a topic of its own and we won’t delve further into it in this post.

Extract/parse form data

Let’s now imagine that student Margareta has submitted the filled form to the Estonian Government server, and one of their data science-savvy officers Marta decides to automate the form processing with Python.

Since Marta is not familiar with this type of form, she initially writes a general PDF information extraction script:

Based on that, she gets an understanding of the whole PDF:

number of pages
is it a form PDF
the file metadata
its textual content
embedded images
etc.
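A general overview like the one listed above could be collected roughly as follows. This is a sketch under assumptions: `summarize_pdf` is our own name, and `doc` is expected to behave like a PyMuPDF document (attributes `page_count`, `is_form_pdf`, `metadata`; pages offering `get_text()` and `get_images()`, as in recent PyMuPDF versions).

```python
def summarize_pdf(doc):
    """Collect a few high-level facts about a PDF document."""
    texts = []
    image_count = 0
    for page in doc:
        texts.append(page.get_text())           # plain text of the page
        image_count += len(page.get_images())   # images referenced by the page
    return {
        "page_count": doc.page_count,
        "is_form_pdf": bool(doc.is_form_pdf),
        "metadata": dict(doc.metadata or {}),
        "text": "\n".join(texts),
        "image_count": image_count,
    }


# Possible usage with PyMuPDF (file name is made up):
#   import fitz  # PyMuPDF
#   summary = summarize_pdf(fitz.open("application_form_filled.pdf"))
```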

For example, when looking at the forms component (“widgets”) as a data-frame…

…and doing some basic exploratory data analysis, it is clear that field names are not unique:

Identifying all of them and extracting their values requires some additional attributes, for example “field_type_string” (is it a radio button, a text field, etc.), “field_rect” (field position), extraction order and so on.
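A minimal sketch of this exploration step could look as follows. The function names are ours, and `doc` is again assumed to behave like a PyMuPDF document whose widgets expose `field_name`, `field_type_string`, `rect` and `field_value`. The resulting list of dictionaries can be passed straight to `pandas.DataFrame` if a data-frame view is preferred.

```python
from collections import Counter


def widget_records(doc):
    """Flatten all form widgets into dictionaries, one per widget."""
    records = []
    for page_number, page in enumerate(doc):
        for order, widget in enumerate(page.widgets()):
            records.append({
                "page": page_number,
                "order": order,  # extraction order within the page
                "field_name": widget.field_name,
                "field_type_string": widget.field_type_string,
                "field_rect": tuple(widget.rect),  # field position on the page
                "field_value": widget.field_value,
            })
    return records


def duplicated_field_names(records):
    """Return {field_name: count} for names that occur more than once."""
    counts = Counter(record["field_name"] for record in records)
    return {name: n for name, n in counts.items() if n > 1}
```

A combination such as (page, order) or (field_name, field_rect) can then serve as a unique key where the name alone is ambiguous.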

Here, we assume for simplicity that in the first iteration, Marta will cover only fields from the following classes (let’s assume that only these are needed by the downstream application that consumes the form data):

first name
father’s name
ID code (social security number)
date of birth

Since Marta wants a more general solution (we all know that forms are subject to change, and she thinks the solution should be reusable for similar form types), she aims to solve the problem with the following procedure.

Extract all form field names and concatenate them into a string.
Detect the language of this string (the source language of the fields).
Translate form field names to English (target language) as the base language.
Make a mapping table in English for form field class to form field names (considering the abovementioned four form classes), by using common English field names.
Use a mapping table to get form field classes for all form fields, taking into account some variations in language.

There are other ways to detect form field classes almost automatically (for example with the help of word/sentence embeddings and different similarity metrics), but we’ll leave them aside for now.

Also note that if the form field names do not carry semantic meaning, this solution will not work. This would be the case if, for example, instead of the field name “person first name” we had a field called “text_box_1”. In such cases it could help to have many examples of a certain type of form; we could then infer field types from field values (for example by applying text classification to the field values and/or searching for patterns with regular expressions).
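As a rough illustration of the regular-expression route, the sketch below guesses a field class from sample values collected across several filled forms. The patterns, the `dd.mm.yyyy` date format and the function name are assumptions made for this example only.

```python
import re

# Hypothetical value patterns; a real form may use other formats.
VALUE_PATTERNS = {
    "id_code": re.compile(r"[1-6]\d{10}"),               # 11 digits, first digit 1-6
    "date_of_birth": re.compile(r"\d{2}\.\d{2}\.\d{4}"),  # dd.mm.yyyy (assumed)
}


def infer_class_from_values(values):
    """Return the class whose pattern matches every non-empty sample, else None."""
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return None
    for field_class, pattern in VALUE_PATTERNS.items():
        if all(pattern.fullmatch(sample) for sample in samples):
            return field_class
    return None
```

Free-text fields such as names would not be caught by patterns like these and are better candidates for the text classification approach.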

For our simplified case below, our code shows some of those five procedural points for solving the problem:

Note that here we also use fuzzy string matching, which takes care of some language variation. Suppose that, after automatic translation from the local language, we have a form field name like “first name or names”. When a match is searched for in Marta’s mapping table, the string “first name” should give a hit (because the pair is similar enough under some similarity metric and threshold), and we have detected the form field class: “first_name”.
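The matching idea can be sketched with the standard library alone. Here `difflib.SequenceMatcher` stands in for whichever fuzzy-matching library is actually used, and the alias table, the threshold of 0.6 and the function name are illustrative assumptions rather than Marta’s real mapping.

```python
from difflib import SequenceMatcher

# Hypothetical mapping table: field class -> common English field names.
FIELD_CLASS_ALIASES = {
    "first_name": ["first name", "given name"],
    "father_name": ["father's name", "patronym"],
    "id_code": ["personal identification code", "id code"],
    "date_of_birth": ["date of birth", "birth date"],
}


def detect_field_class(field_name, threshold=0.6):
    """Return the best-matching field class, or None if nothing is similar enough."""
    name = field_name.strip().lower()
    best_class, best_score = None, 0.0
    for field_class, aliases in FIELD_CLASS_ALIASES.items():
        for alias in aliases:
            score = SequenceMatcher(None, name, alias).ratio()
            if score > best_score:
                best_class, best_score = field_class, score
    return best_class if best_score >= threshold else None


print(detect_field_class("first name or names"))  # best alias: "first name"
print(detect_field_class("text_box_1"))           # no semantic name, no match
```

The threshold trades false hits against misses, so it is worth tuning on real translated field names.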

As seen above, an internal translation service is used for the translations (due to data sensitivity): an on-prem solution that Marta’s colleague Mart exposed as an API. So, essentially, Marta sends a POST request to translate the form field values she requires.

… and gets back the following response:

In the example above, we show the interactive exploration view for the API, but Marta likes to call it programmatically with Python (also, for demonstration purposes, the service is served locally here).
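A programmatic call might look roughly like the sketch below. The endpoint URL, the payload keys (`texts`, `source`, `target`) and the shape of the JSON response are all assumptions, since the internal API's contract is not spelled out in this post. Only the standard library is used.

```python
import json
from urllib import request

TRANSLATE_URL = "http://localhost:8000/translate"  # hypothetical local endpoint


def build_translation_request(texts, source_lang, target_lang="en", url=TRANSLATE_URL):
    """Prepare a JSON POST request for the (assumed) translation API."""
    payload = {"texts": list(texts), "source": source_lang, "target": target_lang}
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(texts, source_lang, target_lang="en", url=TRANSLATE_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    req = build_translation_request(texts, source_lang, target_lang, url)
    with request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Possible usage once the service is running locally:
#   translate(["Eesnimi"], source_lang="et")
```

Separating request construction from sending keeps the payload easy to inspect and test without a running service.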

PDF form data extraction as a service

When the core logic development is done, Marta decides to use Mart’s example and expose her form parsing service as an API. They gather in a meeting room for a discussion.

Eventually, she decides to use FastAPI, an excellent production-grade, API-first web framework that under the hood uses Starlette (web parts) and Pydantic (data parts; a very useful gem in its own right). FastAPI and friends deserve a separate blog post, so here we just show a screenshot from their website about the high-level capabilities:

If a user now looked at Marta’s exposed API through the interactive Swagger/OpenAPI UI, they would see the following:

Below, we showcase the implementation of some selected data parts, particularly some response model details:

If the developed service now receives a request from a user, it returns the following response (for the filled student application form seen above; again the UI output is shown for visual reasons).

As we can see, the pre-processing functions and validators worked well: for example, the father’s name was transliterated to the Latin alphabet and the SSN passed the most basic validation. There were also no unsupported field types.
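To illustrate the kind of pre-processing and validation involved, here is a small sketch that uses a standard-library dataclass as a stand-in for a Pydantic response model. The class name, the field names, the simplified transliteration table and the sample patronym are all assumptions made for this example; a real service would more likely rely on Pydantic validators and a proper transliteration standard.

```python
import re
from dataclasses import dataclass

# Simplified, illustrative Cyrillic-to-Latin table (not a formal standard).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "iu", "я": "ia",
}


def transliterate(text):
    """Replace Cyrillic letters with Latin ones, roughly preserving case."""
    all_upper = text.isupper()
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)  # leave non-Cyrillic characters untouched
        elif ch.isupper():
            out.append(latin.upper() if all_upper else latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


@dataclass
class ParsedForm:
    first_name: str
    father_name: str
    id_code: str
    date_of_birth: str

    def __post_init__(self):
        # Pre-processing: trim whitespace and transliterate the patronym.
        self.first_name = self.first_name.strip()
        self.father_name = transliterate(self.father_name.strip())
        self.id_code = self.id_code.strip()
        # Most basic validation only: 11 digits, first digit between 1 and 6.
        if not re.fullmatch(r"[1-6]\d{10}", self.id_code):
            raise ValueError(f"invalid ID code: {self.id_code!r}")


form = ParsedForm("Margareta", "ИВАНОВНА", "37605030299", "03.05.1976")
print(form.father_name)
```

A stricter validator could additionally verify the ID code's check digit and its consistency with the date of birth.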

After Marta has published the API inside the organization platform (or even in a secure cloud), Mart wants to use this API and may write his own custom code for convenience.

And we see that the response for the same form is identical to the one above:

In addition, for more business-oriented users and for ad-hoc processing and testing, Marta could also expose a UI based on Streamlit, a Python library for building and deploying web apps quickly. In the example below, a user can upload a batch of PDFs and start processing manually (under the hood, Streamlit may call the FastAPI service and write results to the database).

End notes

That’s it, folks! Marta and Mart have optimized another internal process inside their organization, and Margareta is also happy, benefiting from fast application processing and getting into the University of Tartu.