How I finally learned to deal with very large .xlsx files
Validating column and row maxima using metadata (in Python)
Introduction
I work at a company that is, in many ways, on the cutting edge of software development and deployment technology. Docker, Kubernetes, Istio, gRPC, and Kafka are all front of mind in our day-to-day. But I'm not going to talk about that. I'm going to talk about something much less sexy, much less cutting-edge, and much less fun: .xlsx files.
Problem
My team builds internal tools at a company that ingests heaps of data from all sorts of places. Our users are fellow employees who receive ugly data and try to make it less ugly. They know how to use MS Excel well enough, but they don't necessarily understand that a Python application is going to interpret their file differently than Excel does. This leads to confusion and headaches around date formats, embedded formulas, trailing and leading whitespace, special characters, and more. These problems usually have relatively simple solutions that are a quick Google search away, but one had persisted: our users periodically upload files that contain, say, 16,384 columns and 200,000 rows. Such a file is generally devoid of meaningful data, and is usually the product of user error.
But how?
It turns out, it's pretty easy to make this happen without realizing it. All you have to do is press Ctrl+Down or Ctrl+Right, then type something. A single space in cell XFD1 or A1048576 is all it takes to generate a needlessly large file. It also turns out that these sorts of files compress very efficiently. Although a file may have 2,000,000,000 (that's 2 billion) cells, if they're mostly empty, the file can compress to well below a single MB. What this means is that validating the file size in the front end doesn't help us. So we have to ingest the file and figure it out from there.
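To see how little it takes, here is a rough sketch that reproduces the accident with openpyxl. This is purely illustrative (the filename is a placeholder and this isn't part of the original workflow):

```python
# Reproduce the accidental "giant" file: one stray space in the last column.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["XFD1"] = " "                  # XFD is column 16,384
wb.save("mostly_empty.xlsx")      # placeholder filename

print(ws.max_column)              # 16384: Excel's entire width is now "in use"
```

Despite nominally spanning 16,384 columns, the saved file is only a few kilobytes on disk, which is exactly why a front-end size check never catches it.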
So what’s the problem?
The problem is that when we then go ahead and try to create a 2-D array containing 2 billion elements, it doesn't go well.
Here’s what happens with 1 million elements (0.05% of 2 billion):
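A sketch of the kind of measurement involved, assuming a plain list-of-lists of empty strings (not the exact code from the original test):

```python
import time

rows, cols = 1_000, 1_000          # 1,000,000 cells, i.e. 0.05% of 2 billion
start = time.perf_counter()
grid = [["" for _ in range(cols)] for _ in range(rows)]
print(f"built {rows * cols:,} cells in {time.perf_counter() - start:.2f}s")
```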

This gets worse as one dimension grows, even if you hold the total number of elements constant…

Add in the cost of streaming bytes from the file, parsing them into strings, and checking the type of each element, and it starts to get out of hand.
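A crude stand-in for that per-cell work, again just to illustrate the shape of the cost rather than any real parser:

```python
import time

rows, cols = 1_000, 1_000
# Stand-in for rows streamed out of a file and decoded into strings.
raw_rows = [",".join("42" for _ in range(cols)) for _ in range(rows)]

start = time.perf_counter()
parsed = []
for line in raw_rows:
    row = []
    for token in line.split(","):
        # A minimal type check per cell; real parsers do considerably more.
        row.append(int(token) if token.isdigit() else token)
    parsed.append(row)
print(f"parsed and type-checked {rows * cols:,} cells "
      f"in {time.perf_counter() - start:.2f}s")
```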

Not to mention the physical size of the object created:
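Here is a stand-in for that measurement, using a simple recursive size estimate of my own (an approximation, not the exact helper referenced below, so the figures will differ somewhat):

```python
import sys

def get_size(obj, seen=None):
    """Roughly estimate an object's memory footprint, following containers."""
    seen = seen if seen is not None else set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(get_size(k, seen) + get_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(get_size(item, seen) for item in obj)
    return size

# The exact figure depends on what the cells hold; empty strings are the floor.
grid = [["" for _ in range(1_000)] for _ in range(1_000)]
print(f"~{get_size(grid) / 1_000_000:.1f} MB for 1,000,000 cells")
```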


Even a simplified version of this operation for an array containing 1 million elements takes about 1.5 seconds and weighs in at about 20 MB (see this post for the get_size function I used). Multiply these figures by 2,000, and you get the idea. Our application just can't handle a file with those dimensions. This brings us to the crux of the issue:
How do we verify that we’re not about to load a file with 10,000 columns and 1 million rows into memory?
Simple front-end file-size validations no longer work. We need to ingest the file and validate it before we read it into some data structure. This set us off on a wild-goose chase for a solution that just didn't exist.
Exploration
openpyxl — allows you to get sheet names without loading the file, but not much else
xlrd — besides being deprecated, it didn’t seem to offer any peeking into metadata without loading file data
pyexcel-xlsxr — seemed promising, but the parser was finicky, and basically wouldn't work on half the files we tried
pyexcel-xlsx — offers an interface that makes it seem like it’s lazily loading data, but it’s using openpyxl under the hood, and peeking into large files was similarly problematic
As you can see, nothing does quite what we want it to. It was at this point that we began to consider pleading with our users to stop using xlsx files. Unfortunately, this wasn't an option. In a word, we were hopeless.
Discovery
In order to understand whether what we were trying to achieve was even possible, I took a few steps back, and asked a simple question:
What is an xlsx file?
It turns out, the answer is about as simple as the question. When Excel saves a file with that .xlsx extension, all it is saving is a collection of XML files, zipped together in a predictable structure.
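You can verify this with nothing but the standard library; the filename below is just a placeholder:

```python
from zipfile import ZipFile

with ZipFile("example.xlsx") as archive:
    for name in archive.namelist():
        print(name)

# Typical entries include:
#   [Content_Types].xml
#   xl/workbook.xml
#   xl/worksheets/sheet1.xml
#   xl/sharedStrings.xml
```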


Once I realized this, it was just a matter of finding the specific metadata that I had so desperately sought.
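That metadata lives near the top of each worksheet's XML. A quick peek (again with a placeholder filename, and noting that Excel usually, though not always, writes this tag):

```python
from zipfile import ZipFile

with ZipFile("example.xlsx") as archive:
    with archive.open("xl/worksheets/sheet1.xml") as sheet:
        print(sheet.read(1_000).decode("utf-8", errors="ignore"))

# Among the first bytes you will typically see something like:
#   <dimension ref="A1:XFD1048576"/>
# which is the sheet's used range, i.e. exactly the metadata we were after.
```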

At long last, I had found the dimensions of my xlsx file. Then I wrote a simple library to parse that data: it's fast, it's about 60 lines of code, and it only uses the standard library. It works pretty effortlessly when reading from the file system, or from the files property of a web form.
Solution — do it yourself (or use my library)

The code itself is so simple that it made me angry to write it. All I do is load the file into a ZipFile object (standard library), open the particular sheet.xml file I want to peek into, and use a simple regex to grab the dimensions. If for some reason that tag isn't in the first 1,000 bytes, we back up 500 bytes and try the next 1,000.
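Here is a condensed sketch of that approach, not the library itself: it keeps a 500-byte overlap in memory rather than literally rewinding the read head, and it assumes the default sheet path xl/worksheets/sheet1.xml.

```python
import re
from zipfile import ZipFile

DIMENSION_RE = re.compile(rb'<dimension ref="([A-Z]+\d+(?::[A-Z]+\d+)?)"')

def sheet_dimensions(path, sheet="xl/worksheets/sheet1.xml", window=1_000):
    """Scan a worksheet's XML in overlapping chunks for its <dimension> tag."""
    with ZipFile(path) as archive, archive.open(sheet) as xml:
        tail = b""
        while True:
            data = xml.read(window)
            if not data:
                return None          # no dimension tag found
            chunk = tail + data
            match = DIMENSION_RE.search(chunk)
            if match:
                return match.group(1).decode()
            tail = chunk[-500:]      # overlap so a tag split across reads isn't missed

print(sheet_dimensions("example.xlsx"))   # e.g. "A1:XFD1048576"
```

From the returned range it is trivial to reject a file whose row or column count exceeds your limits, before any real parsing begins.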
The moral of the story
One of the most oft-repeated bits of advice you hear as a developer is "don't reinvent the wheel." Generally, this is excellent advice. However, occasionally a problem comes along where you will spend weeks searching for an existing library when you could have built the answer yourself in two hours on a Sunday night. Be cognizant of when a problem might fall into this category. It will usually involve a lot of searching to little avail.