Designing and building a micro-service: form-based PDF files

Lehar Oha
Jan 15 · 8 min read

Introduction

Among all the electronic document formats in use today, PDF is one of the most common; it has become the standard way to communicate business and academic information. It is compatible with most systems and, with its emphasis on readability, convenient for human readers. From an automation perspective, though, things are different: many PDFs are essentially images of documents (image-based PDFs), which makes them difficult to process. This is probably why people working with automation have a saying that goes

To extract the information from such documents we need to deploy Optical Character Recognition (OCR) tools.

In this post we cover a sub-category of PDFs that is easier to work with: text-based form PDFs, particularly those with fillable, interactive fields whose text can be selected, searched, and copied.

These days, forms are mostly in (HTML) web formats, but PDF-based forms are also rather common, be it:

  • Transaction forms for UCITS funds purchase/redemption/switch
  • Insurance claim forms
  • Legacy bank application forms
  • Medical forms
  • Government-related application forms (today’s example)
  • …or your own area/industry-specific form

In this post, we will look at how to:

  • Generate fake data for the forms
  • Automate form filling (with fake data)
  • Extract/parse form data
  • Employ form data extraction and processing as a microservice.

To make things more concrete, let’s imagine the following scenario: a foreign student from Finland wants to start studying at the University of Tartu. Her first step is to apply for a temporary residence permit issued by the Police and Border Guard Board (read more here if you are interested). To complete the process, the prospective student must fill in the “Application For Temporary Residence Permit” PDF form; this is the form we’ll use as our example.

So let’s begin impersonating and liberating the data …

Fake data generation

First, we’ll create some artificial data about our student. We could use Faker or Mimesis among other tools. Let’s choose the first one and for simplicity generate 10 examples of:

  • Finnish female first name
  • Finnish (female) last name
  • Father’s name (patronym) (additionally, we assume that the student’s father is Russian)
  • Estonian social security number (SSN), issued by EE Gov.
  • Birth date (between 20 and 30 years before today)
  • Birth place (in Finnish)
  • Email (with Finnish flavor)
  • Motivation text (not provided by Faker for ‘fi-locale’, constant text here, in Finnish)

Also note that for simplicity we ignore the fact that the SSN and the birth date are related, as potentially are the email and the person’s names.

We generate this data with the Python code snippet below.

Automatic form filling

From the fake personas generated with the code above, we select one by eyeballing the data-frame (student Margareta is the lucky one) and auto-fill the form. Note that we can also use Python to download the form from the web; see the code below for both implementation details.

Python provides many tools for working with PDF files (like PyPDF2). Here, we use MuPDF, a fast PDF parsing and rendering engine written in C, or more specifically its Python wrapper, PyMuPDF. As seen above, this tool is used for auto-filling the form (the ‘fill_form_fields’ function). Note that PyMuPDF refers to PDF form fields as “widgets”.

If we now look at the actual auto-filled form, we obtain the following (partial screenshot):

[Screenshot: Auto-filled example form]

As we can see, the form has modified some of our values, for example “Patronym” (now in uppercase). This is because PDF fields can have specific triggers and actions (like JavaScript code) attached to them, be it enforcing a “field required” condition, limiting the field value to valid choices only (some dropdown fields), forcing certain date-time formats and so on. Some forms apply many such restrictions and validations, some apply fewer, and some none at all. From an information extraction perspective, the less-validated forms put a heavier burden on developers, but good form design is a topic of its own and we won’t delve further into it in this post.

Extract/parse form data

Let’s now imagine that student Margareta has submitted the filled form to the Estonian Government server, and one of their data science-savvy officers, Marta, decides to automate the form processing with Python.

Since Marta is not familiar with this type of form, she initially writes a general PDF information extraction script:

Based on that, she gets an understanding of the whole PDF:

  • number of pages
  • is it a form PDF
  • the file metadata
  • its textual content
  • embedded images
  • etc.

For example, when looking at the forms component (“widgets”) as a data-frame…

[Screenshot: Sample of the widgets data-frame (cut from bottom and left)]

…and doing some basic exploratory data analysis, it is clear that field names are not unique:

[Screenshot: Form field name]

Identifying all of them and extracting their values requires some additional attributes, for example “field_type_string” (is it a radio button, a text field, etc.), “field_rect” (the field position), the extraction order and so on.
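One way to combine those attributes into a unique key is sketched below. It assumes each widget has already been extracted into a dictionary with the keys `field_name`, `field_type_string` and `field_rect` (as an `(x0, y0, x1, y1)` tuple with y growing downwards); the key format and the function name `add_unique_keys` are made up for this illustration.

```python
from collections import defaultdict


def add_unique_keys(widgets):
    """Attach a unique key to each widget record.

    Widgets that share a name and type are numbered by their position on
    the page (top to bottom, then left to right); extraction order breaks ties.
    """
    groups = defaultdict(list)
    for order, widget in enumerate(widgets):
        group_key = (widget["field_name"], widget["field_type_string"])
        groups[group_key].append((order, widget))

    keyed = []
    for (name, field_type), members in groups.items():
        members.sort(key=lambda m: (m[1]["field_rect"][1], m[1]["field_rect"][0], m[0]))
        for index, (order, widget) in enumerate(members, start=1):
            keyed.append({
                **widget,
                "extraction_order": order,
                "unique_key": f"{name}|{field_type}|{index}",
            })

    # Return the records in their original extraction order.
    keyed.sort(key=lambda w: w["extraction_order"])
    return keyed
```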

Here, we assume for simplicity that in the first iteration Marta will cover only fields from the following classes (let’s assume that only these are needed by the downstream application that consumes the form data):

  • first name
  • father’s name
  • ID code (social security number)
  • date of birth

Since Marta wants to build a more general solution (we all know that forms are subject to change, and she thinks the solution should be reusable for similar form types), she aims to solve the problem with the following procedure.

  1. Extract all form field names and concatenate them into a string.
  2. Detect the language of this string (the source language of the fields).
  3. Translate form field names to English (target language) as the base language.
  4. Make a mapping table in English for form field class to form field names (considering the abovementioned four form classes), by using common English field names.
  5. Use a mapping table to get form field classes for all form fields, taking into account some variations in language.

There are other ways to detect form field classes almost automatically (for example with the help of word/sentence embeddings and different similarity metrics), but we’ll leave them aside for now.

Also note that this solution will not work if the form field names carry no semantic meaning, for example if the field is called “text_box_1” instead of “person first name”. In such cases it helps to have many examples of a given type of form: we could then infer field types from field values (for example by applying text classification to the field values and/or searching for patterns with regular expressions).
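The regular-expression variant of that fallback could be sketched as follows. The patterns and class names are illustrative assumptions (an 11-digit ID code, a day-month-year date, a simple email shape), not rules taken from the actual form.

```python
import re

# Hypothetical value patterns; checked in this order.
VALUE_PATTERNS = {
    "id_code": re.compile(r"\d{11}"),
    "date_of_birth": re.compile(r"\d{2}[./-]\d{2}[./-]\d{4}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def infer_field_class(value):
    """Guess a field class from a field value, or return None if nothing matches."""
    text = value.strip()
    for field_class, pattern in VALUE_PATTERNS.items():
        if pattern.fullmatch(text):
            return field_class
    return None
```

Free-text values such as names match none of these patterns, which is where text classification over many example forms would come in.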

For our simplified case below, our code shows some of those five procedural points for solving the problem:

Note that here we also use fuzzy string matching, which takes care of some language variation. Suppose that, after automatic translation from the local language, we have a form field name like “first name or names”. When a match is searched for in Marta’s mapping table, the string “first name” should give a hit (because the pair is similar enough according to some similarity metric and threshold), and we have detected the form field class: “first_name”.
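The mapping table and fuzzy lookup can be illustrated with the standard library’s `difflib`. This is only a sketch: the article does not say which similarity metric or threshold is used, so the `SequenceMatcher` ratio, the 0.6 threshold, the class names other than “first_name”, and the example field names in the table are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical mapping table: field class -> common English field names.
FIELD_CLASS_NAMES = {
    "first_name": ["first name", "given name"],
    "fathers_name": ["father's name", "patronym"],
    "id_code": ["id code", "personal identification code", "social security number"],
    "date_of_birth": ["date of birth", "birth date"],
}


def detect_field_class(field_name, threshold=0.6):
    """Return the best-matching field class, or None if no name is similar enough."""
    query = field_name.strip().lower()
    best_class, best_score = None, 0.0
    for field_class, names in FIELD_CLASS_NAMES.items():
        for name in names:
            score = SequenceMatcher(None, query, name).ratio()
            if score > best_score:
                best_class, best_score = field_class, score
    return best_class if best_score >= threshold else None
```

With these settings, “first name or names” resolves to “first_name”, while a meaningless name such as “text_box_1” stays below the threshold and yields no class.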

As seen above, an internal translation service is used for the translations (due to data sensitivity). It is an on-prem solution which Marta’s colleague Mart has exposed as an API, so essentially Marta sends a POST request to translate the form field names she requires.

[Screenshot: Translation microservice pre-request]

… and gets back the following response:

[Screenshot: Translation microservice response]

In the example above, we show the interactive exploration view for the API, but Marta prefers to call it programmatically with Python (for example purposes, the service is run locally here).
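Such a programmatic call might look like the sketch below, using only the standard library. The URL, the endpoint path and the JSON keys are hypothetical, since the request and response schemas of Mart’s service are only visible in the screenshots.

```python
import json
import urllib.request

# Hypothetical address of the locally served translation service.
TRANSLATE_URL = "http://localhost:8000/translate"


def build_translation_request(texts, source_lang, target_lang="en", url=TRANSLATE_URL):
    """Build (but do not send) a JSON POST request for the translation service."""
    payload = {"texts": texts, "source_lang": source_lang, "target_lang": target_lang}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(texts, source_lang, target_lang="en"):
    """Send the request and return the decoded JSON response."""
    request = build_translation_request(texts, source_lang, target_lang)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect and test without a running service.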

PDF form data extraction as a service

When the core logic development is done, Marta decides to use Mart’s example and expose her form parsing service as an API. They gather in a meeting room for a discussion.

Eventually, she decides to use FastAPI, an excellent production-grade, API-first web framework that under the hood uses Starlette (web parts) and Pydantic (data parts; a very useful pearl of its own). FastAPI and friends would need a separate blog post, so here we just show a screenshot from their website about the high-level capabilities:

[Screenshot: FastAPI high-level capabilities, from the FastAPI website]

If a user now opened Marta’s exposed API in the interactive Swagger/OpenAPI UI, they would see the following:

[Screenshot: Marta’s API in the interactive Swagger/OpenAPI UI]

Below, we showcase the implementation of some selected data parts, particularly some response model details:
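A response model of this kind could be declared with Pydantic roughly as follows. The class and field names are assumptions for illustration only; the real model is richer and also includes the pre-processing functions and validators mentioned in this post.

```python
from typing import List, Optional

from pydantic import BaseModel


class ExtractedField(BaseModel):
    """One detected form field (hypothetical shape)."""
    field_class: str              # e.g. "first_name"
    field_name: str               # original field name in the PDF
    value: Optional[str] = None   # extracted (and pre-processed) value


class FormParseResponse(BaseModel):
    """Response body of the form parsing endpoint (hypothetical shape)."""
    file_name: str
    is_form_pdf: bool
    form_fields: List[ExtractedField]
    unsupported_field_types: List[str] = []
```

In FastAPI, a model like this is typically passed as the `response_model` argument of the route decorator, so that the response is validated and documented in the OpenAPI UI.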

If the service now receives a request from a user, it returns the following response (for the filled student application form seen above; again, the UI output is shown for visual reasons).

[Screenshot: Form parsing service response for the filled application form]

As we can see, the pre-processing functions and validators worked well: for example, the father’s name was transliterated to the Latin alphabet and the SSN passed the most basic validation. There were also no unsupported field types.
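To give an idea of what such pre-processing and validation could involve, here is a small standard-library sketch. The transliteration table is a simplified Cyrillic-to-Latin scheme chosen for this example, and the validator only checks the format and check digit of an Estonian ID code; neither is taken from Marta’s actual service.

```python
# Simplified Russian Cyrillic -> Latin table (an assumption for this sketch).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}


def transliterate(text):
    """Replace Cyrillic letters with Latin ones; other characters pass through."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)
        elif ch.isupper():
            out.append(latin.capitalize())
        else:
            out.append(latin)
    result = "".join(out)
    # Keep fully uppercase input fully uppercase (e.g. form-enforced casing).
    return result.upper() if text.isupper() else result


WEIGHTS_1 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 1)
WEIGHTS_2 = (3, 4, 5, 6, 7, 8, 9, 1, 2, 3)


def is_valid_estonian_id_code(code):
    """Check the length, digits and check digit of an Estonian personal ID code."""
    if len(code) != 11 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    for weights in (WEIGHTS_1, WEIGHTS_2):
        remainder = sum(d * w for d, w in zip(digits[:10], weights)) % 11
        if remainder < 10:
            return remainder == digits[10]
    # Both weighted sums gave remainder 10: the check digit must be 0.
    return digits[10] == 0
```

A stricter validator would also verify that the embedded birth date is a real calendar date and agrees with the date of birth given in the form.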

After Marta has published the API inside the organization platform (or even in a secure cloud), Mart wants to use this API and may write his own custom code for convenience.

And we see that the response for the same form is identical to the one above:

[Screenshot: Response received by Mart’s client code]

In addition, for more business-oriented users and for ad-hoc processing and testing, Marta could also expose a UI based on Streamlit, a Python library for building and deploying web apps quickly. In the example below, a user can upload a batch of PDFs and start processing manually (under the hood, Streamlit may call the FastAPI service and write results to the database).

[Screenshot: Streamlit frontend]

End notes

That’s it, folks! Marta and Mart have optimized another internal process inside their organization, and Margareta is happy too, benefiting from fast application processing and getting into the University of Tartu.

Swedbank AI
