When files are too big for Excel: Filtering on 7.2 million rows with MIObdt vs Excel

MIO - the data experts
7 min read · Apr 26, 2022


One of the most obnoxious ways in which Excel can be a capital-P Problem is when you try to open a file that Excel has decided is too big.

At least the current version of Excel has freed us from the 65,536-row limit. Still: the updated boundary of 1,048,576 rows is not that much in the grand scheme of things.

Now, technically speaking, it is possible to connect Excel to a data source that has more than 1 million (and a bit) rows, as explained in this very informative article about working with millions of rows in Excel.

But also, you can do all of these things in MIObdt and it’s a much better experience.

[Image: a screenshot of MIObdt set diagonally from a screenshot of Excel with “vs” in the center between them. Image by author.]

To prove that, here’s a real scenario using NPI numbers. As part of a project, one of our clients needed a list of all MDs in Columbus, OH, with the following information about each MD:

NPI
Name Prefix
First Name
Middle Name
Last Name
Name Suffix
Credentials (comma separated is fine)
Address Line 1
Address Line 2
City
State
ZIP Code
Phone
Fax

Getting this means:

  • Taking the list of NPIs, a CSV file with 330 columns and about 7.2 million rows (https://download.cms.gov/nppes/NPI_Files.html)
  • Filtering to all records where the city is Columbus and the state is Ohio
  • Filtering to all records where the provider has an MD credential

I will go ahead and tell you now that the third step will take a bit of finagling: some people have more than one credential in their NPI record, the credential separators are not all the same, and, as is inevitable, the same value has been entered in different ways.
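To make the credential problem concrete, here’s a minimal Python sketch of the kind of normalization that step would need. The separator list and the sample strings are my assumptions about what a free-text credential field contains, not values pulled from the file.

```python
import re

def has_md_credential(cred_text: str) -> bool:
    """Return True if a raw credential string includes an MD credential.

    Splits on separators that plausibly appear in a free-text field
    (commas, semicolons, slashes, whitespace) and drops periods, so
    'M.D.', 'MD', and 'MD, PHD' would all match.
    """
    if not cred_text:
        return False
    tokens = re.split(r"[,;/\s]+", cred_text)
    return any(tok.replace(".", "").upper() == "MD" for tok in tokens)

# Illustrative (made-up) variants of the same credential:
for raw in ["MD", "M.D.", "MD, PhD", "md", "RN"]:
    print(f"{raw!r} -> {has_md_credential(raw)}")
```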

The test scenario

So we are going to skip the whole credentials thing for now.

My goal for this is just to:

  • Take the list of NPIs
  • Filter to all records where the city is Columbus (or East Columbus) and the state is Ohio

I’m not going to worry about real data quality situations, like misspelled values, for now.
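For the curious, here’s roughly what that filter looks like as a script. This is a minimal pandas sketch, and the column names are my guesses at the NPPES headers, so treat both as assumptions to verify against the real file.

```python
import pandas as pd

# Assumed NPPES column names; verify against the file's header row.
CITY = "Provider Business Practice Location Address City Name"
STATE = "Provider Business Practice Location Address State Name"

matched = []
for chunk in pd.read_csv(
    "npidata_pfile.csv",    # the ~7.2M-row NPPES file (name assumed)
    usecols=[CITY, STATE],  # in practice, add the 14 output columns too
    dtype=str,              # keep NPIs and ZIPs as text (leading zeros)
    chunksize=500_000,      # stream the file instead of loading it whole
):
    mask = chunk[STATE].eq("OH") & chunk[CITY].isin(
        ["COLUMBUS", "EAST COLUMBUS"]
    )
    matched.append(chunk[mask])

pd.concat(matched).to_csv("columbus_oh.csv", index=False)
```

Note that this exact-match version has the same blind spots the rest of this article runs into: it silently misses records where the state is spelled out or the city is cased differently.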

To add real-working-conditions authenticity, I’m going to actually do and record these things on my own computer, a Surface Pro 6 with a 4-core 1.6GHz Intel i5 processor.

In the background, it will be running all my normal programs that I use for daily work (Slack, Chrome, Outlook, Snagit, OneNote, Spotify, Teams, Notepad++, Firefox), plus Camtasia to do the recording.

A note about Camtasia being involved

Running Camtasia chews up a ton of resources and makes everything take twice as long as it does when I’m not recording. I will provide the non-Camtasia numbers first and then the recorded ones.

In MIObdt

Here’s what I did:

  • Opened the file and let a decent-sized batch of records (around 35K) load while I got the MIObdt window into the area being recorded.
  • Created a View multi-column transformation so I only had to look at the 14 columns I wanted, not the other 316 I didn’t care about.
  • Let it scan all 7.2 million records.
  • Back in the View, marked all records with Columbus in the city, then filtered.
  • Let that run over all the records.
  • Repeated that for OH, which was much faster because by then I was down to around 35K records.
  • Did some quick additional cleanup: removed cities that merely have Columbus in the name but aren’t actual Columbus (keeping East Columbus), and, after a false start with the wrong transformation, standardized the two records where Ohio is spelled out for some reason.
  • Exported the result as a CSV file.

Start to finish: 16 minutes.
(34 minutes while recording.)

In Excel

Through a combination of the article I linked before and this help post, plus the time-honored strategy of “clicking around,” I learned how to produce an equivalent effect in Excel.

So I:

  • Connected to the CSV file.
  • In the Power Query preview, chose to keep only the 14 columns I was interested in.
  • Then filtered to states that are OH. (“Ohio” wasn’t offered as an option.)
  • Then filtered to cities that are Columbus. (“East Columbus” also wasn’t an option.)
  • Loaded the data.

Start to finish: 8 minutes.
(12 minutes while recording.)

EXCEPT

This only produced 28,283 records, instead of the 28,286 we know are there from using MIObdt.

[Image: a screenshot of Excel showing 28,283 rows loaded as the result of a query. Screenshot by author.]

We know, in this case, that’s because the initial filtering didn’t catch the two “Ohio” records or the “East Columbus” record.
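In script form, the fix for both misses is to normalize values before comparing them. A sketch, reusing the assumed column names from the earlier snippet; the demo rows are illustrative, not records from the file:

```python
import pandas as pd

CITY = "Provider Business Practice Location Address City Name"    # assumed
STATE = "Provider Business Practice Location Address State Name"  # assumed
STATE_ALIASES = {"OHIO": "OH"}  # catches the spelled-out records

def normalize(value) -> str:
    return str(value).strip().upper() if pd.notna(value) else ""

def columbus_mask(df: pd.DataFrame) -> pd.Series:
    state = df[STATE].map(normalize).replace(STATE_ALIASES)
    city = df[CITY].map(normalize)
    return state.eq("OH") & city.isin({"COLUMBUS", "EAST COLUMBUS"})

# The kinds of records a strict equality filter would miss:
demo = pd.DataFrame({CITY: ["Columbus", "East Columbus", "COLUMBUS"],
                     STATE: ["Ohio", "OH", "ohio"]})
print(columbus_mask(demo))  # all three rows match
```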

In Excel, again

So I tried again, using the same steps as before.

Except this time, I:

  • Changed Excel’s column profiling to use the entire data set.
  • Got all the values for the state and city name filters.

This is closer to what MIObdt does than my original Excel attempt was.

This had two effects:

  • One, it added 14 minutes to the total execution time.
  • Two, it is how I learned that Excel will only show you a maximum of 1000 values* for filtering. East Columbus was not one of those 1000 values, even though I gave Excel a softball here by filtering on Ohio first, so it didn’t have every city in the US to deal with.

Start to finish: 22 minutes.
(53 minutes while recording.)

And we still didn’t even get East Columbus.

* There are 1488 OH/Ohio city values in this file, according to MIObdt.

[Image: a screenshot of the MIObdt Analysis window showing 1488 unique values in the City column of the filtered NPI view. Screenshot by author.]
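If you run into a cap like that, one workaround is to enumerate the distinct values yourself instead of trusting a filter dropdown. A pandas sketch, again with assumed file and column names:

```python
import pandas as pd

CITY = "Provider Business Practice Location Address City Name"    # assumed
STATE = "Provider Business Practice Location Address State Name"  # assumed

cities = set()
for chunk in pd.read_csv("npidata_pfile.csv", usecols=[CITY, STATE],
                         dtype=str, chunksize=500_000):
    ohio = chunk[chunk[STATE].isin(["OH", "Ohio"])]
    cities.update(ohio[CITY].dropna().str.upper().unique())

print(len(cities))  # should land around 1488, per the MIObdt count above
print(sorted(c for c in cities if "COLUMBUS" in c))  # East Columbus included
```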

Why this 8-minute difference matters

Why make a big deal over 8 minutes? Isn’t that just some time to get a smoothie built into the Excel route?

Here’s the thing though: MIObdt still allows you access to all the other columns and data. Excel, once you have done the “Close & load,” does not.

So if I’m getting ready to make the CSV and then realize my mouse slipped and I selected Country instead of ZIP Code when I was winnowing down the columns?

In MIObdt, it would take me around 15 seconds to re-map the correct column to the View and then carry on.

In Excel, I would have to fire up the Power Query editor, delete the step where I removed the columns, add a NEW step where I pick the correct set of columns (good thing there are only 14), and then run the whole query… again.
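For comparison, the script version of this mistake is a one-line fix: the column selection lives in a single list, so you edit the list and re-run. (The column names and the intermediate file here are assumptions for illustration.)

```python
import pandas as pd

# Assumed column names, trimmed to three of the fourteen for brevity.
COLUMNS = [
    "NPI",
    "Provider First Name",
    "Provider Business Practice Location Address Postal Code",  # the fix
]

# One edited line above, then one re-run over the already-filtered file.
df = pd.read_csv("columbus_oh_raw.csv", usecols=COLUMNS, dtype=str)
df.to_csv("columbus_oh.csv", index=False)
```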

But what if I did exactly what the article describes and loaded the whole CSV, all 7.2 million records and 330 columns, into the Excel data model? Perhaps that would be faster?

I tried that, then stopped recording 15 minutes in, because it didn’t seem to be going well.

[Image: a screenshot of Excel showing a query in progress with 773,734 rows loaded and 10,488 errors. Screenshot by author.]

I followed the directions in the article, but I didn’t change anything on the “Transformations” page, because Excel did OK on the original attempt (and because, since MIObdt automatically parses the file correctly, I felt like I shouldn’t have to).

In summary

So in short, I took a 7.2-million-row, 330-column file and selected a view of only 14 columns. Then I applied two filter criteria and exported the result as a CSV file.

In MIObdt…

  • I found 28,286 records matching the criteria in 16 minutes.
  • If I wanted to select criteria visually, I could see and browse all values in a column.
  • Once I was done filtering, I could still access any of the 316 columns I hadn’t originally selected.
  • I could see all 7.2 million records in the main MIObdt page.

In Excel…

  • I found 28,283 records matching the criteria in 8 minutes.
  • I found 28,285 records matching the criteria in 22 minutes.
  • If I wanted to select the filter criteria visually, I was limited to 1000 options. (This is why there was no way to find the 28,286th record without knowing it was there.)
  • Once I had executed the query, I lost access to the 316 columns I hadn’t originally selected.
  • I could only see a preview of the data in the Power Query editor.

Results

  • MIObdt found records Excel couldn’t, when using standard filtering.
  • A complete analysis in Excel was slower than a complete analysis in MIObdt, and found fewer records.
  • A sample-based analysis in Excel was faster than a complete analysis in MIObdt, but found fewer records than the Excel complete analysis.

The video

Here’s a side-by-side of the experience. It’s in real time, and Camtasia made everything take twice as long, so I don’t actually suggest watching it the whole way through as-is.

Use the timestamps in the description to skip to the part you want to see. For an experience more like when Camtasia isn’t running, set the playback speed to 2x.

About the author

Erin Kapp is a writer at MIO.
