Parsing semi-structured Excel files with refinery, our open-source library

Anton Le
VorTECHsa
5 min read · Apr 20, 2022

In this series of articles, we’ll explore the problem of semi-structured data at Vortexa: the extraction and maintainability challenges we face and how we are tackling them. In this article, we focus on extracting data from Excel files and walk through refinery, the library we built and open-sourced.

Intro

At Vortexa, we receive data in many formats over many protocols. These data can be grouped into three categories:

  • Structured data — a good example would be the JSON/XML body of an HTTP response
  • Semi-structured data — Excel files arriving via email attachments or SFTP, or Google Sheets
  • Unstructured data — free-form emails or news feeds with analytics data, as well as PDF/Word documents with free-form text received as email attachments.

In this article, we focus on semi-structured data and the challenges of extracting information from it.

The Challenge

First of all, let’s outline the challenges we have with semi-structured data. We will focus on Excel files, although Google Sheets presents the same challenges. At Vortexa, we receive Excel files from various port agents, charterers and other companies in the energy and freight sectors. These files provide very important data points that serve both as evidence of past vessel events and as training input for the models that predict a vessel’s future events. Extracting and parsing as much data as possible from these spreadsheets is therefore crucial for our business.

The first challenge we face when extracting data from Excel files is non-standard layout. While CSV is straightforward, Excel files are usually made for humans to read, not machines, so authors format the data to look appealing rather than to be parseable. Typical examples are:

  • Merging cells, instead of duplicating the data across cells
  • Putting metadata in the first few rows of the spreadsheet (e.g. report time), instead of into a separate column of the table
  • Creating visual breaks by merging multiple cells into one in the middle of the table, instead of recording that data in a separate column
  • Placing multiple “tables” within one spreadsheet, instead of normalising the data into a single table

The images below show the same data in two different layouts: one favourable for humans, the other for machines.

Example of a spreadsheet we receive at Vortexa from port agents. Please note, the data shown is not real and is for layout illustration purposes only.
The same data converted to a layout a machine can parse with ease: “Report date”, “port”, “status” and “cargo” become separate columns.
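To make the transformation between the two layouts concrete, here is a toy, hand-rolled sketch (deliberately not refinery — all names are invented for illustration) of hoisting sheet-level metadata such as the report date and port into columns, so that every record becomes a self-contained row:

```kotlin
// Toy illustration of the transformation shown by the two images above:
// sheet-level metadata is replicated into every row of the table so the
// result is a flat, machine-friendly record set.
fun flatten(
    metadata: Map<String, String>,   // e.g. "Report date" -> "2022-04-20"
    header: List<String>,            // the table's header row
    rows: List<List<String>>,        // the table's body rows
): List<Map<String, String>> =
    rows.map { row -> metadata + header.zip(row).toMap() }

fun main() {
    val records = flatten(
        metadata = mapOf("Report date" to "2022-04-20", "Port" to "Rotterdam"),
        header = listOf("Vessel", "Cargo"),
        rows = listOf(listOf("Aurora", "crude")),
    )
    println(records)
    // [{Report date=2022-04-20, Port=Rotterdam, Vessel=Aurora, Cargo=crude}]
}
```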

Since we receive this data from many counterparties, a generic tool that handles extraction from different layouts is a great benefit to us. Ideally, we want a tool that lets us describe the expected layout of the data and then “magically” extracts all the data points that match it.

The Solution

First things first, we searched for existing solutions on the web. Surprisingly, we did not find any open-source library that would let us extract data points by describing the layout of a spreadsheet. For us, that meant there was room for improvement, so we decided to build such a library ourselves.

Our goal was to build a library that allows describing the layout of a spreadsheet declaratively. Describing what to extract (declarative) rather than how to extract it (imperative) lets us focus on the business domain and the Excel layout, and helps us avoid duplicating the code that reads and extracts the data.
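The declarative/imperative contrast can be sketched in a few lines of plain Kotlin. This is a minimal toy, not refinery’s API — every name here is invented: the imperative version hard-codes cell coordinates and must be rewritten for each new layout, while the declarative version describes columns once and lets a generic engine locate them by header.

```kotlin
// Imperative: hard-coded coordinates, rewritten for every new layout.
fun extractImperative(grid: List<List<String>>): List<Pair<String, String>> =
    (1 until grid.size).map { r -> grid[r][0] to grid[r][1] }

// Declarative: describe the columns once; a generic engine finds them by
// header name (this toy assumes the header row is the first row).
data class ColumnDef(val header: String)

fun extract(grid: List<List<String>>, columns: List<ColumnDef>): List<Map<String, String>> {
    val headerRow = grid.first()
    val indices = columns.associate { it.header to headerRow.indexOf(it.header) }
    return grid.drop(1).map { row -> indices.mapValues { (_, i) -> row[i] } }
}

fun main() {
    val grid = listOf(
        listOf("vessel", "cargo"),
        listOf("Aurora", "crude"),
    )
    // Columns are located by header, not by position, so the same
    // definition survives column reordering in the source file.
    println(extract(grid, listOf(ColumnDef("cargo"), ColumnDef("vessel"))))
}
```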

Once the layout is described (we call this description a definition), the library tries to locate and match the described layout against the actual one and then extracts all the data. Additionally, you can convert the extracted data to a domain model with some extra validations, skipping invalid data points if you wish.
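The conversion-with-validation step can be illustrated with a small hypothetical sketch (again, not refinery’s real API — `PortCall` and `toDomain` are invented names): extracted rows are mapped to a typed domain model, and rows that fail validation are silently dropped.

```kotlin
// Hypothetical domain model for a port agent report row.
data class PortCall(val vessel: String, val tonnes: Double)

// Convert raw extracted rows to the domain model, skipping invalid rows:
// a blank vessel name or a non-numeric tonnage disqualifies the row.
fun toDomain(rows: List<Map<String, String>>): List<PortCall> =
    rows.mapNotNull { row ->
        val vessel = row["vessel"]?.takeIf { it.isNotBlank() } ?: return@mapNotNull null
        val tonnes = row["tonnes"]?.toDoubleOrNull() ?: return@mapNotNull null
        PortCall(vessel, tonnes)
    }

fun main() {
    val rows = listOf(
        mapOf("vessel" to "Aurora", "tonnes" to "35000"),
        mapOf("vessel" to "", "tonnes" to "n/a"),  // invalid, skipped
    )
    println(toDomain(rows))  // [PortCall(vessel=Aurora, tonnes=35000.0)]
}
```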

We built the library, called it refinery, and open-sourced it. It is built on top of Apache POI, a library that provides a convenient Java API for interacting with Excel files. Thanks to the Apache Foundation, we only needed to focus on the design of the data extractor and on features covering different Excel layouts. To keep the definitions concise and human-readable, we chose the Kotlin language, which interacts easily with Apache POI thanks to its full interoperability with Java.

To give you an idea of how we describe a layout, here is the definition for the Excel spreadsheet above:

Definition for a spreadsheet with the layout we might receive from port agents
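Since the definition above is shown as an image, here is a simplified, hypothetical sketch of what such a declarative definition could look like. Every class and property name below is invented for illustration; refinery’s actual DSL may differ, so consult the project’s documentation for the real syntax.

```kotlin
// Invented, simplified model of a declarative layout definition.
data class Metadata(val key: String)
data class Column(val name: String, val required: Boolean = true)
data class TableDef(val anchor: String, val columns: List<Column>)
data class SheetDef(
    val sheetName: String,
    val metadata: List<Metadata>,
    val tables: List<TableDef>,
)

// A definition mirroring the port agent report shown above:
// report date and port live in the top rows as key-value metadata,
// while the table itself carries the vessel-level columns.
val portAgentReport = SheetDef(
    sheetName = "Port report",
    metadata = listOf(Metadata("Report date"), Metadata("Port")),
    tables = listOf(
        TableDef(
            anchor = "Vessel",  // header cell used to locate the table
            columns = listOf(
                Column("Vessel"),
                Column("Status"),
                Column("Cargo"),
                Column("ETA", required = false),  // hypothetical optional column
            ),
        )
    ),
)
```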

You can see that the definition contains sub-definitions. Since an Excel workbook can contain multiple spreadsheets, and every spreadsheet can contain multiple “tables”, the definitions follow the same hierarchy.

Refinery layout definitions

Within the spreadsheet definition, you can define which spreadsheets to extract data from and which metadata to extract as key-value pairs.

The richest set of features lives at the table-definition level. It includes the definitions of all required and optional columns of the table, and the presence of an anchor and a divider.

Attributes of the table definition
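To make the anchor and divider concepts concrete, here is a toy illustration in plain Kotlin (invented names, not refinery’s implementation): an anchor is a known header cell used to locate where a table starts inside a sheet, and a divider is a predicate, such as “the row is blank”, used to split a sheet containing several stacked tables.

```kotlin
// Locate the row that contains the anchor cell (the table's header row);
// returns -1 if the anchor is not found anywhere in the grid.
fun findAnchor(grid: List<List<String>>, anchor: String): Int =
    grid.indexOfFirst { row -> anchor in row }

// Split a sheet into separate tables wherever the divider predicate matches.
fun splitTables(
    grid: List<List<String>>,
    divider: (List<String>) -> Boolean,
): List<List<List<String>>> {
    val tables = mutableListOf<MutableList<List<String>>>(mutableListOf())
    for (row in grid) {
        if (divider(row)) tables.add(mutableListOf()) else tables.last().add(row)
    }
    return tables.filter { it.isNotEmpty() }
}

fun main() {
    val grid = listOf(
        listOf("Report date:", "2022-04-20"),   // metadata above the table
        listOf("Vessel", "Cargo"),              // anchor row of table 1
        listOf("Aurora", "crude"),
        listOf("", ""),                         // blank row acting as divider
        listOf("Vessel", "Cargo"),              // anchor row of table 2
        listOf("Borealis", "diesel"),
    )
    println(findAnchor(grid, "Vessel"))                                    // 1
    println(splitTables(grid) { row -> row.all { it.isBlank() } }.size)    // 2
}
```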

Check out more examples in our documentation, which includes examples of different layouts and definitions.

What’s next

In future articles, we will cover the use of refinery in our production pipeline, how data validation is embedded as a refinery feature, and how it helps us deliver clean data to downstream services. We will also discuss maintainability, backward compatibility of definitions and how we monitor data quality. Stay tuned!

Anton Le, VorTECHsa
Pragmatic software engineer with a radiophysics degree