Designing and building a micro-service: form-based PDF files

Lehar Oha
Jan 15 · 8 min read
[Figure: Top & bottom image]

Introduction

Among the electronic document formats in use today, PDF is one of the most common; it has become a standard way to communicate business and academic information. It is compatible with most systems and convenient for human readers, since it emphasizes readability. From an automation perspective, though, things are different: many PDF files are essentially images of documents (image-based PDFs), which makes them difficult to process. This is probably why people working with automation have a saying that goes

  • Insurance claim forms
  • Legacy bank application forms
  • Medical forms
  • Government related application forms (today’s example)
  • …or your own area/industry-specific form

  • Automate form filling (with fake data)
  • Extract/parse form data
  • Employ form data extraction and processing as a microservice

Fake data generation

First, we’ll create some artificial data about our student. We could use Faker or Mimesis among other tools. Let’s choose the first one and for simplicity generate 10 examples of:

  • Finnish (female) last name
  • Father’s name (patronymic; we additionally assume that the student’s father is Russian)
  • Estonian social security number (SSN), issued by the Estonian government
  • Birth date (between 20 and 30 years before today)
  • Birth place (in Finnish)
  • Email (with Finnish flavor)
  • Motivation text (Faker does not provide this for the Finnish locale, so a constant text in Finnish is used here)
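
The list above could be turned into code roughly as follows. This is a minimal sketch, not the original code: the names `build_generators`, `make_personas`, and the `MOTIVATION` placeholder are assumptions, and the Faker methods chosen are only one plausible way to cover each item. It assumes the Faker package is installed and that its `fi_FI`, `ru_RU`, and `et_EE` locales stand in for the Finnish, Russian, and Estonian parts.

```python
# Placeholder only: the article uses a constant motivation text in Finnish.
MOTIVATION = "<constant motivation text in Finnish>"


def build_generators(seed=0):
    """Create one Faker instance per locale (third-party: pip install Faker)."""
    from faker import Faker  # imported lazily, only needed for real data

    Faker.seed(seed)  # make the fake data reproducible
    return Faker("fi_FI"), Faker("ru_RU"), Faker("et_EE")


def make_personas(fi, ru, ee, n=10):
    """Return n fake student records as a list of dicts."""
    personas = []
    for _ in range(n):
        personas.append({
            "last_name": fi.last_name_female(),
            "father_name": ru.first_name_male(),
            "ssn": ee.ssn(),
            # 20 to 30 years old, counted from today
            "birth_date": fi.date_of_birth(minimum_age=20, maximum_age=30),
            "birth_place": fi.city(),
            "email": fi.email(),
            "motivation": MOTIVATION,
        })
    return personas
```

Calling `make_personas(*build_generators())` would give ten records that can be loaded into a data frame. Note that in this sketch the birth date encoded in the generated SSN is not tied to the separately generated `birth_date`.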

Automatic form filling

From the generated fake personas, we select one by eyeballing the data frame (student Margareta is the lucky one) and auto-fill the form with her data. Note that Python can also be used to fetch the blank form from the web.
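
Both steps, fetching the blank form and filling it, could look roughly like this. It is a sketch under assumptions rather than the original implementation: it relies on the third-party pypdf library, and the field names in `to_form_values` and the date format are hypothetical, since the real field names have to be read from the actual form.

```python
import urllib.request


def download_form(url, path):
    """Save the blank PDF form found at `url` to `path`."""
    urllib.request.urlretrieve(url, path)
    return path


def to_form_values(persona):
    """Map a persona dict to form field values (field names are hypothetical)."""
    return {
        "last_name": persona["last_name"],
        "father_name": persona["father_name"],
        "id_code": persona["ssn"],
        # Dates are written as text, here as dd.mm.yyyy (an assumption).
        "birth_date": persona["birth_date"].strftime("%d.%m.%Y"),
    }


def fill_form(src_path, dst_path, values):
    """Write a copy of the form at src_path with `values` filled in."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    writer.append(PdfReader(src_path))  # copy the pages together with the form
    for page in writer.pages:
        if "/Annots" in page:  # only pages that carry form widgets
            writer.update_page_form_field_values(
                page, values, auto_regenerate=False
            )
    writer.write(dst_path)
```

Keeping the persona-to-field mapping in its own function means the same filling code can be reused for a different form by swapping only `to_form_values`.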

[Figure: Auto-filled example form]

Extract/parse form data

Let’s now imagine that student Margareta has submitted the filled-in form to the Estonian government server, and one of its data-science-savvy officers, Marta, decides to automate the form processing with Python.

  • is it a form PDF
  • the file metadata
  • its textual content
  • embedded images
  • etc.
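
The items above could be collected in one pass. The following is a minimal sketch that assumes the third-party pypdf library; `load_reader` and `inspect_pdf` are illustrative names, not the article's code. `inspect_pdf` only relies on a reader object offering `get_fields()`, `metadata`, and `pages`, which pypdf's `PdfReader` provides.

```python
def load_reader(path):
    """Open a PDF with pypdf (third-party: pip install pypdf)."""
    from pypdf import PdfReader

    return PdfReader(path)


def inspect_pdf(reader):
    """Summarize a PDF: form check, metadata, text, image count, field values."""
    fields = reader.get_fields() or {}  # None when the PDF has no form
    meta = reader.metadata or {}
    pages = list(reader.pages)
    return {
        "is_form": bool(fields),
        "metadata": {key: str(meta[key]) for key in meta},
        "text": "\n".join(page.extract_text() or "" for page in pages),
        "n_images": sum(1 for page in pages for _ in page.images),
        # "/V" holds the value entered into a form field, if any.
        "fields": {name: field.get("/V") for name, field in fields.items()},
    }
```

A typical call would be `inspect_pdf(load_reader("form.pdf"))`, where the file name is a placeholder.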
[Figure: Sample (cut from bottom and left)]
[Figure: Form field name]

  • father’s name
  • ID code (social security number)
  • date of birth

  1. Detect the language of the form field name string (the source language of the fields).
  2. Translate the form field names to English (the target language), which serves as the base language.
  3. Build a mapping table in English from form field class to form field names (considering the abovementioned four form classes), using common English field names.
  4. Use the mapping table to get the form field class for every form field, allowing for some variation in wording.
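
Steps 2 to 4 could be sketched as below, using only the Python standard library. This is an illustration, not the article's implementation: the class labels, the English name variants in `FIELD_CLASSES`, and the `translate` callable (a stand-in for a call to a translation service, which would also cover the language detection of step 1) are assumptions. Only the three field classes listed above are included.

```python
import difflib

# Step 3: mapping table from form field class to common English field names.
FIELD_CLASSES = {
    "father_name": ["father's name", "name of father", "patronymic"],
    "id_code": ["id code", "personal identification code",
                "social security number"],
    "birth_date": ["date of birth", "birth date"],
}

# Reverse lookup: English field name -> form field class.
NAME_TO_CLASS = {
    name: cls for cls, names in FIELD_CLASSES.items() for name in names
}


def classify_field(field_name, translate, cutoff=0.8):
    """Return the form field class for a raw field name, or None if unknown."""
    english = translate(field_name).strip().lower()  # step 2
    # Step 4: tolerate small wording variations with fuzzy string matching.
    match = difflib.get_close_matches(
        english, list(NAME_TO_CLASS), n=1, cutoff=cutoff
    )
    return NAME_TO_CLASS[match[0]] if match else None
```

Fuzzy matching is used here so that a translation such as "Father name" still lands in the same class as "father's name"; the `cutoff` value controls how much variation is accepted and would need tuning on real forms.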
[Figure: Translation microservice pre-request]
[Figure: Translation microservice response]

PDF form data extraction as a service

When the core logic is developed, Marta decides to follow Mart’s example and expose her form parsing service as an API. They gather in a meeting room for a discussion.
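
One way to picture such a service is a single HTTP endpoint that accepts a PDF and returns the extracted fields as JSON. The sketch below uses only Python's standard library to stay self-contained; the article's own stack is not shown here, so the `/extract` path, the port, and the `parse_pdf` callable (standing in for Marta's parsing logic) are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def build_response(pdf_bytes, parse_pdf):
    """Return (status_code, json_bytes) for one uploaded PDF."""
    if not pdf_bytes.startswith(b"%PDF"):  # well-formed PDFs begin with this
        return 400, json.dumps({"error": "not a PDF file"}).encode()
    return 200, json.dumps({"fields": parse_pdf(pdf_bytes)}).encode()


def make_handler(parse_pdf):
    """Create a request handler class bound to a parsing function."""

    class ExtractHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/extract":
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            status, body = build_response(self.rfile.read(length), parse_pdf)
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return ExtractHandler


def serve(parse_pdf, port=8000):
    """Run the service until interrupted."""
    HTTPServer(("", port), make_handler(parse_pdf)).serve_forever()
```

Separating `build_response` from the HTTP handler keeps the request logic testable without starting a server, and a frontend could post uploaded files to the same endpoint.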

[Figure: Streamlit frontend]

End notes

That’s it, folks! Marta and Mart have optimized another internal process in their organization, and Margareta is happy too: she benefits from fast application processing and gets into the University of Tartu.

Swedbank AI
