How to Correct 32,000 Incorrect CSV Files in Fewer Than 32,000 Steps
Or how impossible problems can become possible when given no other choice.
CSV feels like the simplest of file formats, where it might seem that there’s not much to know after mentally expanding the acronym — Comma Separated Values. But tell me this: if it’s so simple, then why are there so many CSV parsing libraries, alternative CSV parsing libraries, and CSV parsing libraries that claim to be better or smarter, and a mountain of mangled CSVs in existence?
CSV isn’t so much a file format as it is a loose set of guidelines for converting tabular data into text. The closest thing to a spec for it is this, which deals with vital and often overlooked questions such as:
1. “What happens if a value has a comma in it?” — oh, you quote it
2. “What happens if a value has a quote in it?” — oh, you put another quote before it
One question the spec definitely does not cover is one I needed answering: “What do you do with 32,000 files claiming to be valid CSVs but of the 750,000 some lines an unknown number of them have extra unquoted commas hidden in the values, basically making the data untrustworthy?”
This is not such a simple problem, but it’s an interesting problem.
Background
I’ve been building a platform for public and community radio stations the last year and have been working with the fantastic 91.7 KOOP in Austin, TX to pilot it. Part of this project involved exporting the data from their old system (some 750,000 tracks) into the system I built for them. And noticing some tracks from an artist named “10” doing a song called “000 maniacs”, I realized the old system’s crusty CSV exporter did not do the right thing when it came to commas in values, leaving me with a real mess.
How many track titles, artists, albums, or record labels have commas in their names? Many.
I initially thought this was an impossible problem to solve (at least in a way that would not drive a person insane) so I tried reaching out to the developer of the old system to fix their ancient CSV exporter, or to send me data in another format that I could sort through myself. This felt like the most straightforward option and didn’t seem like too much of a lift.
But after weeks and weeks of back and forth, this path dead ended leaving me with no other option but to fix it myself.
The Problem
Here’s an abbreviated and simplified example of a messed up CSV I was dealing with:
The way a CSV parser would parse that data currently is like this:
This is incorrect, and would only be parsed correctly if the artist value were properly quoted, like so:
Which would be parsed like this:
But how do we get there without manually looking at 750,000 lines of comma separated text?
Computers, man
First, we can determine if a line is incorrect by parsing it individually as a CSV, seeing how many values we end up with, and comparing that number with the number of headers.
For this example, the header row has 4 values, the first line has 5 values, and the second line has 4 values. The first line is incorrect, meaning there’s an extra comma that belongs within a value.
Knowing that, we need to figure out all the different possibilities of what it could be. One of those possibilities will be the correct one, and the rest will be wrong. With five values and four fields, we need to join two values together into one (and quote it), making four values.
I think best on paper, so I started drawing out how this might work in a brute force sort of way, to try to get a handle on the algorithm I needed.
Eliminate 1 position, Fit 5 into 4
[1, 2, 3, 4, 5] -> [x, x, x, x] =>
[[1, 2], 3, 4, 5]
[1, [2, 3], 4, 5]
[1, 2, [3, 4], 5]
[1, 2, 3, [4, 5]]
What if we had two commas in there? Then we’d have six values needing to fit into four slots.
Eliminate 2 positions, Fit 6 into 4
[1, 2, 3, 4, 5, 6] -> [x, x, x, x] =>
[[1, 2, 3], 4, 5, 6]
[1, [2, 3, 4], 5, 6]
[1, 2, [3, 4, 5], 6]
[1, 2, 3, [4, 5, 6]]
[[1, 2], [3, 4], 5, 6]
[[1, 2], 3, [4, 5], 6]
[[1, 2], 3, 4, [5, 6]]
[1, [2, 3], [4, 5], 6]
[1, [2, 3], 4, [5, 6]]
[1, 2, [3, 4], [5, 6]]
This was pretty easy to visualize and draw out on paper manually, but figuring out how to do this programmatically for any situation was a challenge. This sort of combination/permutation type math felt like the type of thing I would have learned in a Probability and Statistics class that I took in college, but you know what I remember from that class?
- Having a huge crush on a lil’ cutie and zero game to do anything about it
- Drawing this funny comic
That’s what I remember. Certainly not this practical math.
Anyway, turns out the question we’re trying to answer is: if v is the number of values we have (5), and h is the number of headers we have (4), what are all the unique permutations of h numbers that add up to v?
The following looks similar to the written out paper breakdowns from before, but now instead of each number representing a position, each number represents how many values will be joined together.
What are all the permutations of 4 numbers that add up to 5?
[2, 1, 1, 1] # join the first two values together
[1, 2, 1, 1] # join the second and third values together
[1, 1, 2, 1] # join the third and fourth values together
[1, 1, 1, 2] # join the fourth and fifth values together
To calculate this programmatically, we can first calculate the unique combinations using the method below that I nicked from a stack overflow post that has since been lost in my browser history. There were a number of them, and this one seemed most performant.
Given these combinations we can now calculate all the different permutations of those sets, which is basically just getting every ordering of those numbers and then eliminating duplicates.
Here’s a class that does all this:
Now given all the permutation options, we can loop through them and generate all the value possibilities.
Here are the value possibilities. Five values made into four.
At this point, we could match these up with the headers and prompt the user with the option, making the task a little less tedious, like so:
And that’s a pretty good worst case scenario! But there’s still something we can do to determine which one of these is most likely correct without dying from boredom after doing this thousands of times.
Aha
You might have already have caught this, and if so — nice work, you’re sharp — but this crucial fact only hit me when I was a mile deep into this problem: when people type commas out they generally put a space after it. So if a parsed value starts with a space…it’s probably not the correct choice.
So in this example we can obviously see that option number 4 is the correct one, since every other one has a value starting with a space. So most of the time, we can reject any choice where any value starts with a space.
This doesn’t always work! With 750,000 manually entered tracks, you can bet people made typos, and you can bet there are tracks that have commas without spaces after them, like this beauty:
Deftones,U,U,D,D,L,R,L,R,Select,Start,Saturday Night Wrist,Maverick
(which, by the way, generates 220 possible options with no great way to determine which one is correct without a human and/or an internet connection. Oy vey.)
But this method works well enough to turn what I initially thought to be an impossible task into a very very doable task, which I’m happy to report I completed successfully.
I packaged all this code into a ruby gem (with a command line option) on the off chance that some other poor soul has found themselves in a similar CSV parsing dilemma. If that’s you: the chances of you getting out of that situation alive? Better than average.
Jeff Keen is an indy UX/dev consultant. Looking for Ember.js, Ruby/Rails, or UX help? Let’s talk.