Among the electronic document formats in use today, PDF is one of the most common; it has become a standard way to communicate business and academic information. It is compatible with most systems and convenient for human readers, since it emphasizes readability. From an automation perspective, though, things are different: many PDFs are essentially images of documents (image-based PDFs), which makes them difficult to process. This is probably why people working in automation have a saying:
PDF is a format where good data goes to die.
To extract the information from such documents we need to deploy Optical Character Recognition (OCR) tools.
In this post we cover a sub-category of PDFs that is easier to work with, namely text-based form PDFs, particularly those with fillable, interactive fields, where text can be selected, searched, and copied.
These days, forms are mostly in (HTML) web formats, but PDF-based forms are also rather common, be it:
- Transaction forms for UCITS funds purchase/redemption/switch
- Insurance claim forms
- Legacy bank application forms
- Medical forms
- Government related application forms (today’s example)
- …or your own area/industry-specific form
In this post, we will look at how to:
- Generate fake data for the forms
- Automate form filling (with fake data)
- Extract/parse form data
- Employ form data extraction and processing as a microservice.
To make things more concrete, let’s imagine the following scenario: a foreign student from Finland wants to start studying at the University of Tartu. Her first step is to apply for a temporary residence permit issued by the Police and Border Guard Board (read more here if you are interested). To complete the process, the prospective student must fill in the “Application For Temporary Residence Permit” PDF form; this is the form we’ll be using as an example.
So let’s begin impersonating and liberating the data …
Fake data generation
For each fake student, we need to generate the following fields (mostly with the Faker library):
- Finnish female first name
- Finnish (female) last name
- Father name (patronym) (additionally we assume that the student’s father is Russian)
- Estonian social security number (SSN), issued by EE Gov.
- Birth date (between 20 and 30 years before today)
- Birth place (in Finnish)
- Email (with Finnish flavor)
- Motivation text (not provided by Faker for ‘fi-locale’, constant text here, in Finnish)
Also note that, for simplicity, we ignore the fact that the SSN and the birth date are related, and that the email and the person’s names may be related as well.
We generate this data with the Python code snippet below.
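As a rough illustration of the field list above, here is a minimal sketch that uses only the Python standard library in place of Faker. The name pools, the email domain, the motivation sentence and all function names are hypothetical placeholders. Unlike the simplification described above, this sketch derives the Estonian ID code from the birth date (gender/century digit, YYMMDD, a three-digit serial, and a modulo-11 check digit).

```python
import random
from datetime import date, timedelta

# Hypothetical name pools; a real run would draw from a much larger source.
FIRST_NAMES = ["Margareta", "Aino", "Helmi", "Sanna"]
LAST_NAMES = ["Virtanen", "Korhonen", "Nieminen", "Laine"]
PATRONYMS = ["Ивановна", "Петровна", "Сергеевна"]  # Russian father, as assumed
BIRTH_PLACES = ["Helsinki", "Tampere", "Turku", "Oulu"]


def years_before(day: date, years: int) -> date:
    """Same calendar day `years` earlier (29 Feb falls back to 28 Feb)."""
    try:
        return day.replace(year=day.year - years)
    except ValueError:
        return day.replace(year=day.year - years, day=28)


def estonian_id_code(birth: date, female: bool, serial: int) -> str:
    """Build an 11-digit Estonian personal ID code, including its check digit."""
    century_digit = {18: 1, 19: 3, 20: 5}[birth.year // 100] + (1 if female else 0)
    body = f"{century_digit}{birth:%y%m%d}{serial:03d}"
    digits = [int(ch) for ch in body]
    # Two weight rows: the second is used only if the first pass yields 10.
    for weights in ((1, 2, 3, 4, 5, 6, 7, 8, 9, 1), (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)):
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        if remainder < 10:
            return body + str(remainder)
    return body + "0"  # both passes gave 10


def fake_student(rng: random.Random, today: date) -> dict:
    """Generate one fake female student aged between 20 and 30."""
    earliest, latest = years_before(today, 30), years_before(today, 20)
    birth = earliest + timedelta(days=rng.randint(0, (latest - earliest).days))
    first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
    return {
        "first_name": first,
        "last_name": last,
        "fathers_name": rng.choice(PATRONYMS),
        "id_code": estonian_id_code(birth, female=True, serial=rng.randint(0, 999)),
        "birth_date": birth.isoformat(),
        "birth_place": rng.choice(BIRTH_PLACES),
        "email": f"{first}.{last}@example.com".lower(),
        "motivation": "Haluan opiskella Tarton yliopistossa.",  # constant placeholder
    }


# One seeded generator per persona keeps the output reproducible.
personas = [fake_student(random.Random(seed), date.today()) for seed in range(5)]
```

Passing the random generator and the reference date in as arguments keeps the function deterministic, which makes it easier to regenerate the same personas later.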
Automatic form filling
From the fake personas generated with the code above, we select one by eyeballing the data-frame (student Margareta is the lucky one) and auto-fill the form. Note that we can also use Python to download the form from the web; see the code below for the implementation details of both steps.
Python provides many tools for working with PDF files (like PyPDF2). Here, we use MuPDF, a fast PDF parsing and rendering engine written in C. More specifically, we use its Python wrapper PyMuPDF, which does the auto-filling in the ‘fill_form_fields’ function. Note that PDF form fields are referred to as “widgets” there.
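The auto-filling step could look roughly like the sketch below. It is not the original ‘fill_form_fields’ implementation, only an assumption about its shape: it walks over the pages of a PyMuPDF document, sets `field_value` on widgets whose `field_name` matches, and calls `update()` on each changed widget. So that the sketch runs without PyMuPDF or a PDF at hand, the demo uses small stand-in objects; the field names "First name" and "Surname" and the file names in the comments are hypothetical.

```python
from types import SimpleNamespace


def fill_form_fields(doc, values: dict) -> int:
    """Write `values` (field name -> text) into matching widgets; return the count."""
    filled = 0
    for page in doc:  # a PyMuPDF Document iterates over its pages
        for widget in page.widgets():
            if widget.field_name in values:
                widget.field_value = values[widget.field_name]
                widget.update()  # PyMuPDF needs this call to store the new value
                filled += 1
    return filled


# Stand-in objects that mimic the few attributes used above.
def make_widget(name: str) -> SimpleNamespace:
    return SimpleNamespace(field_name=name, field_value="", update=lambda: None)


widgets = [make_widget("First name"), make_widget("Surname")]
demo_doc = [SimpleNamespace(widgets=lambda: iter(widgets))]

filled = fill_form_fields(demo_doc, {"First name": "Margareta"})

# With the real library this would be along the lines of:
#   import fitz  # PyMuPDF
#   doc = fitz.open("application.pdf")
#   fill_form_fields(doc, {"First name": "Margareta"})
#   doc.save("application_filled.pdf")
```

Matching on the field name alone fills every widget that shares that name, which matters for forms where names repeat.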
If we now look at the actual auto-filled form, we obtain the following (partial screenshot):
Extract/parse form data
Let’s now imagine that student Margareta has submitted the filled form to the Estonian Government server, and one of their data science-savvy officers Marta decides to automate the form processing with Python.
Since Marta is not familiar with this type of form, she initially writes a general PDF information extraction script:
Based on that, she gets an understanding of the whole PDF:
- number of pages
- is it a form PDF
- the file metadata
- its textual content
- embedded images
For example, when looking at the forms component (“widgets”) as a data-frame…
…and doing some basic exploratory data analysis, it is clear that field names are not unique:
To identify all of them and extract their values requires some additional attributes, for example “field_type_string” (is it a radio-button, text field etc.), “field_rect” (field position), extraction order and so on.
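To illustrate the point about non-unique names, here is a small sketch on hypothetical widget records (plain dictionaries shaped like the rows of the data-frame described above; the names, types and rectangles are made up). It first counts repeated field names and then builds a composite key from extraction order, name, type and position.

```python
from collections import Counter

# Hypothetical widget records; real ones would come from the parsed PDF.
widgets = [
    {"field_name": "Name", "field_type_string": "Text", "field_rect": (50, 100, 250, 120)},
    {"field_name": "Name", "field_type_string": "Text", "field_rect": (50, 300, 250, 320)},
    {"field_name": "Sex", "field_type_string": "RadioButton", "field_rect": (50, 140, 60, 150)},
    {"field_name": "Sex", "field_type_string": "RadioButton", "field_rect": (90, 140, 100, 150)},
    {"field_name": "Date of birth", "field_type_string": "Text", "field_rect": (50, 180, 250, 200)},
]

# Which field names occur more than once?
name_counts = Counter(w["field_name"] for w in widgets)
duplicated = sorted(name for name, count in name_counts.items() if count > 1)


def widget_key(order: int, widget: dict) -> tuple:
    """Combine extraction order, name, type and position into one identifier."""
    return (order, widget["field_name"], widget["field_type_string"], widget["field_rect"])


keys = [widget_key(i, w) for i, w in enumerate(widgets)]
```

Here `duplicated` lists the ambiguous names, while every entry in `keys` is distinct, so each widget can be addressed individually.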
Here, we assume for simplicity that in the first iteration Marta will cover only fields from the following classes (let’s assume that only these are needed by the downstream application that consumes the form data):
- first name
- father’s name
- ID code (social security number)
- date of birth
Since Marta wants to build a more general solution (we all know that forms are subject to change, and she thinks the solution should be reusable for similar form types), she aims to solve it with the following procedure.
- Extract all form field names and concatenate them into a string.
- Detect the language of this string (the source language of the fields).
- Translate form field names to English (target language) as the base language.
- Make a mapping table in English for form field class to form field names (considering the abovementioned four form classes), by using common English field names.
- Use a mapping table to get form field classes for all form fields, taking into account some variations in language.
There are other ways to detect form field classes almost automatically (for example with help of word/sentence embeddings and different similarity metrics), but we’ll leave it aside for now.
Also note that this solution will not work if the form field names carry no semantic meaning, for example if a field is called “text_box_1” instead of “person first name”. In such cases it could help to have many examples of a certain type of form: we could then infer field types from field values (for example by applying text classification to the values and/or searching for patterns with regular expressions).
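A minimal sketch of the value-based fallback mentioned above, using regular expressions and date parsing from the standard library. The patterns, the accepted date formats and the class labels are assumptions for illustration; with several filled forms of the same type, a simple majority vote over the values of one field gives a more robust guess.

```python
import re
from collections import Counter
from datetime import datetime

ID_CODE_RE = re.compile(r"[1-6]\d{10}")  # 11 digits, first digit 1..6
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
DATE_FORMATS = ("%d.%m.%Y", "%Y-%m-%d")  # assumed formats


def is_date(value: str) -> bool:
    """True if the value parses as a real calendar date in an assumed format."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_field_class(value: str):
    """Guess a field class from a single value; None if nothing matches."""
    value = value.strip()
    if ID_CODE_RE.fullmatch(value):
        return "id_code"
    if is_date(value):
        return "date"
    if EMAIL_RE.fullmatch(value):
        return "email"
    return None


def infer_from_samples(values):
    """Majority vote over the values one field takes across several forms."""
    votes = Counter(c for c in map(infer_field_class, values) if c)
    return votes.most_common(1)[0][0] if votes else None
```

Free-text fields such as names will not match any of these patterns, which is where text classification would have to take over.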
For our simplified case, the code below implements some of those five steps:
Note that here we also use fuzzy string matching, which takes care of some language variation. Suppose that, after the automatic translation from the local language, we have a form field name like “first name or names”. When a match is searched for in Marta’s mapping table, the string “first name” should give a hit (because the pair is similar enough according to some similarity metric and threshold), and we have detected the form field class: “first_name”.
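The matching step can be sketched with `difflib.SequenceMatcher` from the standard library. This is a stand-in for whatever fuzzy matching library the original code uses; the mapping table, its aliases and the 0.6 threshold are assumptions chosen for this example.

```python
from difflib import SequenceMatcher

# Hypothetical mapping table: form field class -> common English field names.
FIELD_CLASSES = {
    "first_name": ["first name", "given name"],
    "fathers_name": ["father's name", "patronymic"],
    "id_code": ["id code", "personal identification code"],
    "date_of_birth": ["date of birth", "birth date"],
}


def classify_field(name: str, threshold: float = 0.6):
    """Return the best-matching field class, or None if nothing is close enough."""
    cleaned = name.strip().lower()
    best_class, best_score = None, 0.0
    for field_class, aliases in FIELD_CLASSES.items():
        for alias in aliases:
            score = SequenceMatcher(None, cleaned, alias).ratio()
            if score > best_score:
                best_class, best_score = field_class, score
    return best_class if best_score >= threshold else None


print(classify_field("first name or names"))
print(classify_field("text_box_1"))
```

The threshold is a trade-off: a lower value tolerates more wording variation but raises the risk of assigning unrelated fields to a class, so it is worth tuning on real field names.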
As seen above, an internal translation service is used to do the translations (due to data sensitivity). It is an on-prem solution that Marta’s colleague Mart has exposed as an API, so essentially Marta sends a POST request to translate the form field values she requires.
… and gets back the following response:
In the example above, we show the interactive exploration view for the API, but Marta likes to call it programmatically with Python (also for example purposes we show serving locally here).
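A programmatic call could be sketched as below with the standard library. The endpoint URL, the payload keys and the helper name are all assumptions, since the actual schema of Mart’s service is not shown here; the sketch only builds the POST request and leaves the sending step as a comment, because it needs the service to be running.

```python
import json
import urllib.request

TRANSLATE_URL = "http://localhost:8000/translate"  # hypothetical local endpoint


def build_translation_request(texts, source_lang: str, target_lang: str = "en"):
    """Build a JSON POST request for the (assumed) translation endpoint."""
    payload = {"texts": list(texts), "source": source_lang, "target": target_lang}
    return urllib.request.Request(
        TRANSLATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative Estonian field names ("first name", "date of birth").
request = build_translation_request(["Eesnimi", "Sünniaeg"], source_lang="et")

# To actually send it:
# with urllib.request.urlopen(request, timeout=10) as response:
#     translations = json.load(response)
```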
PDF form data extraction as a service
When the core logic development is done, Marta decides to use Mart’s example and expose her form parsing service as an API. They gather in a meeting room for a discussion.
Eventually, she decides to use FastAPI, an excellent production-grade, API-first web framework that under the hood uses Starlette (web parts) and Pydantic (data parts; a very useful pearl of its own). FastAPI and friends would need a separate blog post, so here we just show a screenshot from their website about the high-level capabilities:
If a user now looked at Marta’s exposed API in the Swagger/OpenAPI interactive UI, they would see the following:
Below, we showcase the implementation of some selected data parts, particularly some response model details:
If now the developed service should receive a request from users, we would get the following response (for the filled student application form seen above; again the UI output is shown for visual reasons).
As we can see, the pre-processing functions and validators worked well: for example, the father’s name was transliterated to the Latin alphabet, the SSN passed the most basic validation, and there were no unsupported field types.
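To give an idea of what such pre-processing and validation might involve, here is a sketch that uses a standard-library dataclass as a stand-in for the Pydantic response model. The class name, the field names, the simplified transliteration table and the "exactly 11 digits" rule are assumptions, not the service’s actual implementation.

```python
import re
from dataclasses import dataclass, field

# Simplified Cyrillic-to-Latin table (lower case); not an official standard.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text: str) -> str:
    """Replace Cyrillic letters with Latin ones; leave other characters as is."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    return "".join(out)


@dataclass
class ParsedForm:
    first_name: str
    fathers_name: str
    id_code: str
    date_of_birth: str
    unsupported_field_types: list = field(default_factory=list)

    def __post_init__(self):
        # Pre-processing: store the father's name in the Latin alphabet.
        self.fathers_name = transliterate(self.fathers_name)
        # Most basic validation: the ID code must be exactly 11 digits.
        if not re.fullmatch(r"\d{11}", self.id_code):
            raise ValueError("ID code must consist of exactly 11 digits")


form = ParsedForm("Margareta", "Ивановна", "49001010003", "1990-01-01")
print(form.fathers_name)
```

In Pydantic the same checks would typically live in field validators on the response model, so that invalid data is rejected before the response is returned.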
After Marta has published the API inside the organization platform (or even in a secure cloud), Mart wants to use this API and may write his own custom code for convenience.
And we see that the response for the same form is identical to the one above:
In addition, for more business-oriented users and for ad-hoc processing and testing, Marta could also expose a UI based on Streamlit, a Python library for building and deploying web apps quickly. In the example below, a user can upload a batch of PDFs and start the processing manually (under the hood, Streamlit may call the FastAPI service and write the results to a database).
That’s it folks! Marta and Mart have optimized another internal process inside their organization, and Margareta is also happy, benefiting from fast application processing and getting into the University of Tartu.