Staging, Ingesting, and Transforming Excel Files

How to unlock the value in your Excel files

--

by Shekhar Parnerkar and Gaurav Kholiya

It’s said that the best insights are derived when you combine data from multiple disparate sources. We would add that the more unconventional the data source, the higher its value. OLTP systems were the most important data sources for many years, followed by weblog analytics, social media mining, and, more recently, Industrial IoT.

In this blog post, we make the case for going beyond these sources and unlocking the immense value stored in Excel files — arguably the most popular end-user application! Over the years, users have entered crucial enterprise data into Excel files. Unfortunately, this data is not easy to query or work with, for reasons we will get into below.

So Why Excel?

Our journey into the world of Excel began a while ago when the Vendor Service Procurement Manager from one of our Fortune 500 clients asked a simple question — “How do I know if the payments made to our vendors are commensurate with the quality of service provided by them?”

If you are a practicing data engineer, you will recognize that there are two parts to this question — Payments and Quality of Service. To answer it, you need measures for each and a common key (such as the vendor) to join them on. It is not hard to guess that the Payments data will be available in the Accounts Payable module of their ERP. If you think that the problem will be getting the data related to Quality of Service, you are right.

The service quality data was buried in thousands of Excel files in the form of Inspection Reports. Worse, these files were scattered across numerous network drives, shared folders, and cloud file storage such as Amazon S3. We learned that payments to vendors are triggered by the Inspection Supervisors, who prepare the Inspection Report in Excel and send it to the Accounts Payable department by e-mail. We also discovered that these reports do not follow a standard tabular structure. On the contrary, they were designed to be easy to read rather than easy to load into a database!

In this blog post, we discuss the problems (and solutions) you will encounter while working with Excel files. For this discussion, we divide them into three parts:

  1. Staging files for load
  2. Ingesting Excel files
  3. Transformations

1) Staging Files for Load

The simplest way to get Excel data from users is to ask them to upload their files to specific shared folders. In practice, users will upload files to different folders and often to different storage types — cloud, local, shared, and so on — so the files end up scattered across multiple locations. Therefore, you need to poll all these locations from time to time and bring the desired files to a standard staging location from which they can be loaded. Most ETL tools provide a scheduler and a COPY function to achieve this. In practice, it boils down to managing the URIs of these locations and storing their access credentials securely.
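
A rough sketch of such a poller, in Python: it sweeps one local shared folder and one S3 prefix for new .xlsx files and copies them into a single staging folder. The folder paths, bucket, and prefix below are hypothetical placeholders, not a definitive implementation.

import shutil
from pathlib import Path

import boto3  # pip install boto3

STAGING_DIR = Path("/data/staging/excel")            # standard staging location
SHARED_DIR = Path("/mnt/shared/inspection_reports")  # one of the polled locations
S3_BUCKET, S3_PREFIX = "company-uploads", "inspection-reports/"

def poll_local() -> None:
    for src in SHARED_DIR.glob("*.xlsx"):
        dest = STAGING_DIR / src.name
        if not dest.exists():              # copy only files we have not staged yet
            shutil.copy2(src, dest)

def poll_s3() -> None:
    s3 = boto3.client("s3")                # credentials resolved from the environment
    resp = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=S3_PREFIX)
    for obj in resp.get("Contents", []):   # first page only; real code would paginate
        key = obj["Key"]
        if not key.endswith(".xlsx"):
            continue
        dest = STAGING_DIR / Path(key).name
        if not dest.exists():
            s3.download_file(S3_BUCKET, key, str(dest))

if __name__ == "__main__":
    STAGING_DIR.mkdir(parents=True, exist_ok=True)
    poll_local()
    poll_s3()

A scheduler (cron, Airflow, or the ETL tool’s own) would run this sweep on a fixed cadence.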

Most ETL tools also provide a connector to load Excel files — but you will soon see why these connectors are not enough.

Instead of configuring many locations, we found that asking users to e-mail a service account works better. Rather than uploading files, users send them as attachments, and the filename and e-mail subject line determine which staging location each file is routed to. This significantly reduces the overhead of maintaining multiple folders.

To extract Excel files from the mail server, you need Scala or Python code that connects to the mailbox (over IMAP or POP3 — SMTP only sends mail), reads the e-mails, extracts the attachments, and stages them in different folders based on routing logic. We have built a fairly configurable tool to achieve this, which is available upon request.
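
As a minimal sketch of that extraction step in Python, assuming a plain IMAP mailbox and subject-based routing (the host, credentials, and folder names are placeholders; this is not the configurable tool mentioned above):

import email
import imaplib
from pathlib import Path

IMAP_HOST = "imap.example.com"
SERVICE_ACCOUNT, PASSWORD = "excel-loader@example.com", "app-password"
STAGING_ROOT = Path("/data/staging")

def route_folder(subject: str) -> Path:
    # Trivial routing rule: a keyword in the subject decides the staging sub-folder.
    return STAGING_ROOT / ("inspection" if "inspection" in subject.lower() else "misc")

def fetch_attachments() -> None:
    mail = imaplib.IMAP4_SSL(IMAP_HOST)
    mail.login(SERVICE_ACCOUNT, PASSWORD)
    mail.select("INBOX")
    _, data = mail.search(None, "UNSEEN")          # only messages not yet processed
    for num in data[0].split():
        _, msg_data = mail.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        target = route_folder(msg.get("Subject", ""))
        target.mkdir(parents=True, exist_ok=True)
        for part in msg.walk():                    # attachments are MIME parts with a filename
            name = part.get_filename()
            if name and name.endswith(".xlsx"):
                (target / name).write_bytes(part.get_payload(decode=True))
    mail.logout()

if __name__ == "__main__":
    fetch_attachments()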

2) Ingesting Excel Files

From an ingestion point of view, Excel files can be divided into two categories:

  1. Simple Excel sheets — these have a defined tabular structure, usually with a header row and one tab holding the data for one table.
  2. Complex Excel sheets — any Excel other than a simple Excel.

Even when ingesting simple Excel files, you are faced with the following problems (a short sketch addressing several of them follows the list):

  1. Missing or malformed headers.
  2. Most data warehouses do not support spaces or special characters in column names.
  3. Inconsistent or frequently changing header names (e.g., “Product Qty” vs. “product_qty”).
  4. Blank rows — if a user so much as touches a blank row, most Excel readers will treat it as user input, so these rows need to be filtered out in code.
  5. Inconsistent data types — unlike a database column, an Excel column can have multiple data types and formats.
  6. Multiple tabs — containing the same table data or different table data.
  7. Tab names or file names to be added as a column to the tab's data — similar to a partition key.
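
As an illustration, a small pandas routine along the following lines handles several of these issues for a simple file — it normalizes header names, drops blank rows, and adds the tab and file names as extra columns. The file name is hypothetical, and the sketch assumes every tab holds the same table.

import re

import pandas as pd  # pip install pandas openpyxl

def clean_header(name: str) -> str:
    # Lower-case and replace spaces/special characters so the warehouse accepts the column name.
    return re.sub(r"[^0-9a-zA-Z]+", "_", str(name)).strip("_").lower()

def read_simple_excel(path: str) -> pd.DataFrame:
    sheets = pd.read_excel(path, sheet_name=None)   # dict of {tab name: DataFrame}
    frames = []
    for tab, df in sheets.items():
        df = df.rename(columns=clean_header)        # fix spaces, case, special characters
        df = df.dropna(how="all")                   # drop blank rows a user merely "touched"
        df["tab_name"] = tab                        # partition-key style columns
        df["file_name"] = path
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

inspection = read_simple_excel("inspection_report_2021_03.xlsx")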

Despite these issues, it is still possible to ingest data from a simple Excel file without implementing complex logic. Many Excel files, however, need custom logic to be ingested — we refer to these as Complex Excel sheets. The custom logic defines specific parts of the workbook to be read as tables and then stitches those parts together into a single table (see the sketch after this list). This is required in many scenarios:

  1. Not all columns are required.
  2. Columns of the same table are split across multiple tabs.
  3. Merging of cells.
  4. Formulas — macros and cell references.
  5. Embedded objects — images.
  6. External links to other workbooks.
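
To make the “read parts, then stitch” idea concrete, here is a small pandas sketch covering two of these scenarios — columns of one table split across tabs, and merged cells that come back as blanks. The sheet names, cell ranges, and join key are hypothetical.

import pandas as pd

path = "complex_inspection_report.xlsx"

# Line items start a few rows down on the "Report" tab; merged vendor_id cells
# arrive as blanks, so forward-fill them down each group.
items = pd.read_excel(path, sheet_name="Report", skiprows=4, usecols="A:D")
items["vendor_id"] = items["vendor_id"].ffill()

# The remaining columns of the same table live on a second tab.
vendors = pd.read_excel(path, sheet_name="Vendor Details",
                        usecols=["vendor_id", "vendor_name"])

# Stitch the two parts into a single flat table.
flat = items.merge(vendors, on="vendor_id", how="left")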

3) Transformations

Once ingested, data from simple Excel files can be processed like raw data from any other source. Ingesting a Complex file, however, requires some in-line transformations to load the data into flat raw tables. The transformation logic changes from situation to situation, so it is virtually impossible to write one piece of common code that covers every case; it is possible, though, to identify common patterns and develop configurable code to handle them.

To achieve this, we developed Scala code that reads the required configurations from a JSON file and then reads the Excel file repeatedly as per those configurations: each configured section of the file is read as a raw table, and those raw tables are then transformed into a single target table. This is done by creating DataFrames and performing in-memory transformations with Apache Spark. (The same can also be achieved using Python and Pandas.)
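
A simplified Python/pandas sketch of the same configuration-driven idea follows; the JSON schema it expects — a list of “sections” with sheet, skiprows, nrows, usecols, and rename settings — is an assumption made for illustration, not our production Scala implementation.

import json

import pandas as pd

def load_complex_excel(excel_path: str, config_path: str) -> pd.DataFrame:
    with open(config_path) as f:
        config = json.load(f)
    parts = []
    for section in config["sections"]:
        # Each section describes one rectangular block of cells to read as a raw table.
        df = pd.read_excel(
            excel_path,
            sheet_name=section["sheet"],
            skiprows=section.get("skiprows", 0),
            nrows=section.get("nrows"),
            usecols=section.get("usecols"),
        )
        df = df.rename(columns=section.get("rename", {}))
        parts.append(df)
    # Stitch the raw tables into a single target table (here, a simple side-by-side concat).
    return pd.concat(parts, axis=1)

target = load_complex_excel("inspection_report.xlsx", "inspection_config.json")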

A high-level data flow that governs the processing of a complex file is shown below:

[Figure: Architecture — high-level data flow for processing a complex Excel file]

The following is an example of the kind of configuration file used to extract data from a complex Excel file:
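
(An illustrative example in the shape expected by the sketch above — the sheet names, cell ranges, and column mappings are placeholders.)

{
  "sections": [
    {
      "sheet": "Report",
      "skiprows": 4,
      "nrows": 46,
      "usecols": "A:D",
      "rename": { "Product Qty": "product_qty" }
    },
    {
      "sheet": "Vendor Details",
      "usecols": "B:C"
    }
  ]
}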

Final Thoughts

Data locked away in Excel files is extremely valuable. Make sure that data is easily accessible and ready to use. We hope this blog post was helpful and provided some insight into the transformation process. If you have further questions, reach out to us. We have the experience and tools to unlock the immense value stored in Excel files.

Ready to accelerate your digital transformation?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages, and we would be glad to work through your specifics.

How does Snowflake compare to other data warehouses? Our technical experts have implemented over 250 cloud/data projects in the last 3 years and conducted unbiased, detailed analyses across 34 business and technical dimensions, ranking each cloud data warehouse.

To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast Hashmap on Tap, available on Spotify, Apple, Google, and other popular streaming apps.


Shekhar Parnerkar is a Solution Architect and Delivery Manager at Hashmap Pune. He specializes in building modern cloud data platforms for Hashmap’s global clients.

Gaurav Kholiya is a Data Engineer at Hashmap Pune.
