How I finally learned to deal with very large .xlsx files

Sean Massot
Sep 8, 2018 · 5 min read

Validating column and row maxima using metadata (in Python)

Introduction

I work at a company that is, in many ways, on the cutting edge of software development and deployment technology. Docker, Kubernetes, Istio, gRPC, and Kafka are all front of mind in our day-to-day. But I’m not going to talk about that. I’m going to talk about something much less sexy, much less cutting-edge, and much less fun: .xlsx files.

Problem

My team builds internal tools at a company that ingests heaps of data from all sorts of places. Our users are fellow employees who get ugly data and try to make it less ugly. They know how to use MS Excel well enough, but they don’t necessarily understand that a Python application is going to interpret their file differently than Excel does. This leads to confusion and headaches around date formats, embedded formulas, trailing and leading white-space, special characters, and more. These problems usually have relatively simple solutions that are a quick Google search away, but one had persisted: our users periodically upload files that contain, say, 16,384 columns and 200,000 rows. Such a file is generally devoid of meaningful data, and is usually the product of user error.

But how?

It turns out it’s pretty easy to make this happen without realizing it. All you have to do is press Ctrl+Down or Ctrl+Right, then type something. A single space in cell XFD1 or A1048576 is all it takes to generate a needlessly large file. It also turns out that these sorts of files compress very efficiently: although a file may have 2,000,000,000 (that’s 2 billion) cells, if they’re mostly empty, the file can compress to well below a single MB. What this means is that validating the file size in the front end doesn’t help us. We have to ingest the file and figure it out from there.
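In fact, you can reproduce the situation in a few lines of openpyxl (the filename here is hypothetical):

```python
from openpyxl import Workbook
import os

wb = Workbook()
ws = wb.active
ws["XFD1"] = " "        # one stray space in the last column...
ws["A1048576"] = " "    # ...and one in the last row
wb.save("accidental.xlsx")

# The used range is now A1:XFD1048576, yet the file on disk is only a few KB.
print(os.path.getsize("accidental.xlsx"), "bytes")
```

A file like that sails straight through any naive size check.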

So what’s the problem?

The problem is that when we then go ahead and try to create a 2-D array containing 2 billion elements, it doesn’t go well.

Here’s what happens with 1 million elements (0.05% of 2 billion):

1 million elements (1,000 x 1,000) — takes 372 ms

This gets worse as one dimension grows, even if you hold the total number of elements constant…

Again 1 million elements, this time (100,000 x 10) — takes 549 ms
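Roughly the kind of thing those measurements capture; a minimal sketch, with timings that will of course vary by machine:

```python
import timeit

def build(rows, cols):
    # Naively materialize a rows x cols grid, one cell at a time,
    # the way a simple reader might.
    return [[None for _ in range(cols)] for _ in range(rows)]

# Same element count, very different shapes.
print(timeit.timeit(lambda: build(1_000, 1_000), number=10))
print(timeit.timeit(lambda: build(100_000, 10), number=10))
```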

Add in the cost of streaming bytes from the file, parsing them into strings, and checking the type of each element, and it starts to get out of hand.

Streaming data into the function, checking types, and adding the data to the arrays… 1468 ms for 1 million elements. For 2 billion?
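A simplified version of that ingestion step might look like this (the stream here is faked; in our case it comes from parsing the upload):

```python
def ingest(row_stream):
    """Build a 2-D array from a stream of rows, checking each cell on the way in."""
    table = []
    for raw_row in row_stream:
        row = []
        for cell in raw_row:
            # Type-check / normalize each value before storing it.
            if isinstance(cell, (int, float)):
                row.append(cell)
            else:
                row.append(str(cell).strip())
        table.append(row)
    return table

# A fake stream standing in for the parsed file: 100,000 rows x 10 columns.
rows = (("x",) * 10 for _ in range(100_000))
table = ingest(rows)
```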

Not to mention the physical size of the object created:

For 1 million elements, 824 KB — for 2 billion?
Using get_size, which is more accurate for nested data structures — 20 MB!!!

Even a simplified version of this operation for an array containing 1 million elements takes about 1.5 seconds and weighs in at about 20 MB (see this post for the get_size function I used). Multiply these figures by 2,000 and you get the idea. Our application just can’t handle a file with those dimensions. This brings us to the crux of the issue:
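The linked post has the exact get_size I used; a recursive sizer along these general lines gives far more realistic numbers than sys.getsizeof alone:

```python
import sys

def get_size(obj, seen=None):
    """Recursively approximate the in-memory size of an object, in bytes."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0                      # don't double-count shared objects
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(get_size(item, seen) for item in obj)
    return size
```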

How do we verify that we’re not about to load a file with 10,000 columns and 1 million rows into memory?

Simple front-end file-size validations no longer work. We need to ingest the file and validate it before we read it into some data structure. This set us off on a wild goose chase for a solution that just didn’t exist.

Exploration

openpyxl — allows you to get sheet names without loading the file (see the short sketch after this list), but not much else

xlrd — besides being deprecated, it didn’t seem to offer any peeking into metadata without loading file data

pyexcel-xlsxr — seemed promising, but the parser was finicky, and basically wouldn’t work on half the files we tried

pyexcel-xlsx — offers an interface that makes it seem like it’s lazily loading data, but it’s using openpyxl under the hood, and peeking into large files was similarly problematic
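For the record, the openpyxl peek looks roughly like this; sheet names come back without loading cell data, but that was about the extent of it (the filename is hypothetical):

```python
from openpyxl import load_workbook

# read_only mode streams the workbook instead of loading it outright,
# which is enough to list sheet names cheaply.
wb = load_workbook("upload.xlsx", read_only=True)
print(wb.sheetnames)
wb.close()
```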

As you can see, nothing did quite what we wanted it to. It was at this point that we began to consider pleading with our users to stop using xlsx files. Unfortunately, this wasn’t an option. In a word, we were hopeless.

Discovery

In order to understand whether what we were trying to achieve was even possible, I took a few steps back, and asked a simple question:

What is an xlsx file?

It turns out the answer is about as simple as the question. When Excel saves a file with that .xlsx extension, all it is really saving is a collection of XML files, zipped together in a predictable structure.

You can extract an xlsx file and look into a sheet…
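For example, using nothing but the standard library (the filename is hypothetical; the internal paths like xl/worksheets/sheet1.xml are part of the format):

```python
import zipfile

with zipfile.ZipFile("upload.xlsx") as zf:
    # The archive has a predictable layout: xl/workbook.xml,
    # xl/worksheets/sheet1.xml, and so on.
    for name in zf.namelist():
        print(name)

    # Peek at the start of the first sheet's XML.
    with zf.open("xl/worksheets/sheet1.xml") as sheet:
        print(sheet.read(500))
```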

Once I realized this, it was just a matter of finding the specific metadata that I had so desperately sought.

bingo.

At long last, I had found the dimensions of my xlsx file. Then I wrote a simple library to parse that data: it’s fast, it’s about 60 lines of code, and it only uses the standard library. It works pretty effortlessly whether you’re reading from the file system or from the files property of a web form.

Solution — do it yourself (or use my library)

this is the thinky stuff

The code itself is so simple that it made me angry writing it. All I do is load the file into a ZipFile object (standard library), open the particular sheet.xml file I want to peek into, and use a simple regex to grab the dimensions. If for some reason that tag isn’t in the first 1,000 bytes, we move the read head back 500 bytes and try the next 1,000.
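Here’s a condensed sketch of that approach (not the library verbatim; the helper name and default sheet path are mine):

```python
import re
import zipfile

DIMENSION_RE = re.compile(rb'<dimension ref="([^"]+)"')

def sheet_dimension(source, sheet="xl/worksheets/sheet1.xml", chunk=1000):
    """Return a sheet's dimension ref (e.g. 'A1:XFD1048576') without
    reading more than a few KB of the archive."""
    with zipfile.ZipFile(source) as zf:
        with zf.open(sheet) as xml:
            data = xml.read(chunk)
            while data:
                match = DIMENSION_RE.search(data)
                if match:
                    return match.group(1).decode()
                more = xml.read(chunk)
                if not more:
                    return None
                # Keep the last 500 bytes in case the tag straddles a chunk
                # boundary (the same idea as stepping the read head back 500).
                data = data[-500:] + more
    return None

# sheet_dimension("upload.xlsx")  ->  something like "A1:XFD1048576"
```

From there it’s a one-liner to reject anything whose last column or row falls outside whatever limits make sense for your application.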

The moral of the story

As a developer, one of the most oft-repeated bits of advice you hear is “don’t reinvent the wheel.” Generally, this is excellent advice. However, occasionally a problem comes along where you will spend weeks searching for an existing library when you could have built the answer yourself in two hours on a Sunday night. Be cognizant of when a problem might fall into this category. It will usually involve you doing a lot of searching to little avail.
