Loading Excel files into databases is hard — our data pipeline fixes that

Denny Asarias Palinggi
Published in ZebraX · 4 min read · Feb 19, 2021


Overview

One example of a digital transformation project at ZebraX is building a Control Tower (CT) system that gives users end-to-end supply chain visibility and helps them make data-driven decisions. A reliable CT needs an integrated database of all value chain data. Unfortunately, most companies don’t have a good data ecosystem yet, since they still collect most of their data manually.

From our experience, Excel is still a very popular tool for recording data in many companies. Unfortunately, the Excel files in those companies are usually poorly structured; in many cases, the data are scattered across the worksheet.

We realize that we can’t suddenly ask those companies to ditch Excel for a more digital approach to collecting data, since their budget and/or human capital are limited. Therefore, we are building a data pipeline system to tackle this issue: all users need to do is put an Excel file in a particular folder, and the system will automatically convert the scattered data into a tabular format so it can be exported into a table in a relational database. Finally, the data in the database are displayed on a dashboard that stakeholders use to make data-driven decisions.

Data Pipeline Architecture

Each department has its own folder, and users can only see and access their own folder (e.g., the marketing department can only see and access the ‘marketing’ folder). This separation is necessary to reduce the chance of somebody putting an Excel file into the wrong folder. Inside each department folder there is also a revision folder, which is used to update existing data in the database.

Image 1. An illustration of folders used for the data pipeline
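For concreteness, here is a minimal sketch of how such a folder layout could be created with Python’s pathlib. The root path, the department names, and the ‘Revision’ subfolder name are illustrative assumptions, not the exact names used in the project.

```python
from pathlib import Path

# Hypothetical root of the shared drive; the real mount point differs.
ROOT = Path("/shared/data-pipeline")

# Illustrative department names; each department only sees its own folder.
DEPARTMENTS = ["marketing", "sales", "production"]

for dept in DEPARTMENTS:
    # Every department folder gets a 'Revision' subfolder for files
    # that update rows already in the database.
    (ROOT / dept / "Revision").mkdir(parents=True, exist_ok=True)
```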

This is how it works: to add new data, users put an Excel file in their department folder; on a schedule, the file is moved to a ‘Temporary’ folder. There, our script reads the Excel file to capture all the necessary data and inserts it into the corresponding table. If the load is successful, the Excel file is moved to the ‘Success’ folder; if not, it is moved to the ‘Failed’ folder.

Image 3. A flowchart to illustrate how the data pipeline works
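A simplified sketch of that move-process-move loop is shown below. The folder names mirror the flowchart; the scheduling itself (e.g., cron) is omitted, and load_excel_to_db is the worksheet loader sketched in the next section.

```python
import shutil
from pathlib import Path

ROOT = Path("/shared/data-pipeline")  # hypothetical shared-drive root
TEMP = ROOT / "Temporary"
SUCCESS = ROOT / "Success"
FAILED = ROOT / "Failed"

def process_department(dept: str) -> None:
    """Stage a department's new Excel files in 'Temporary', try to load
    each one, and move it to 'Success' or 'Failed' accordingly."""
    for excel_file in (ROOT / dept).glob("*.xlsx"):
        staged = TEMP / excel_file.name
        shutil.move(excel_file, staged)  # stage the file first
        try:
            load_excel_to_db(staged)     # parse + insert (see next section)
            shutil.move(staged, SUCCESS / staged.name)
        except Exception:
            shutil.move(staged, FAILED / staged.name)
```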

Programming language

At ZebraX we use Python to write the scripts for this particular data pipeline. We rely on several packages, such as pandas to convert the data into a tabular format and SQLAlchemy to insert the data into the database. The script is customized separately for each worksheet, since every worksheet has a unique template format.
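As an illustration of that combination, below is a minimal sketch of what one worksheet-specific loader might look like. The sheet name, cell range, destination table, and connection string are all assumptions made for the example; each real template needs its own parsing logic.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; the real project would read this from config.
engine = create_engine("postgresql://user:password@localhost:5432/ct_db")

def load_excel_to_db(path) -> None:
    """Read scattered data from one known template into a tabular
    DataFrame, then append it to a table in the database."""
    # Example template: the data block starts below two header rows
    # on the 'Production' sheet and spans columns B through F.
    df = pd.read_excel(path, sheet_name="Production", skiprows=4, usecols="B:F")
    df = df.dropna(how="all")  # drop the empty spacer rows between blocks
    df.to_sql("production_daily", engine, if_exists="append", index=False)
```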

Error handling

If a file ends up in the ‘Failed’ folder, we also need to know why. In a previous project we wrote all messages to a log file, so to find an error message we had to open the log file and scroll until we found it. In the current project, we decided to save both success and error messages into a database table instead. Personally, I prefer the second approach since it makes tracking error messages much easier.
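A sketch of what the second approach might look like with SQLAlchemy is below; the pipeline_log table and its columns are assumptions for illustration.

```python
from datetime import datetime
from sqlalchemy import text

def log_result(engine, filename: str, status: str, message: str) -> None:
    """Insert one success or failure row into a hypothetical log table,
    so results can be queried instead of scrolled for in a log file."""
    with engine.begin() as conn:  # commits on success, rolls back on error
        conn.execute(
            text(
                "INSERT INTO pipeline_log (filename, status, message, logged_at) "
                "VALUES (:filename, :status, :message, :logged_at)"
            ),
            {
                "filename": filename,
                "status": status,
                "message": message,
                "logged_at": datetime.now(),
            },
        )
```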

Summary

Below are some advantages of using this data pipeline:

  1. The client’s habit of collecting and sharing data barely changes, since they are still using Excel files.
  2. Anybody with limited computer knowledge can use the data pipeline easily. All they need to do is put their Excel file in the folder, wait a few minutes, and then check whether the file went to the ‘Success’ or ‘Failed’ folder.
  3. Data transparency improves, since users from various departments can check work progress in far more detail with the CT system than with Excel files.

This pipeline is definitely not perfect; these are some of its disadvantages:

  1. Since the script is designed to read data according to the existing template, even minimal changes to the template format can cause the script to fail to capture the data. A lightweight pre-check, sketched after this list, can at least make such failures explicit.
  2. There is still a possibility of users entering incorrect data, since the data collection process is still done manually.
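One way to soften the first disadvantage is to validate the template before parsing, so a changed worksheet fails fast with a clear message instead of producing a confusing mid-load error. The expected header row below is an assumption for the example.

```python
import pandas as pd

# Hypothetical header row the parser was written against.
EXPECTED_HEADERS = ["Date", "Product", "Quantity", "Unit", "Remarks"]

def template_matches(path) -> bool:
    """Return False when the worksheet no longer has the header row
    the parser expects, instead of failing midway through a load."""
    header = pd.read_excel(path, sheet_name="Production", skiprows=3,
                           nrows=1, header=None)
    found = header.iloc[0].tolist()[: len(EXPECTED_HEADERS)]
    return found == EXPECTED_HEADERS
```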

In the end, this pipeline is not perfect, but I can say it is the best approach for a company that is still in the early stages of digitalization. Later, when the company is ready to upgrade its data collection process, we can move to a more advanced data pipeline.
