How to Correct 32,000 Incorrect CSV Files in Fewer Than 32,000 Steps
Or how impossible problems can become possible when given no other choice.
--
CSV feels like the simplest of file formats, where it might seem that there’s not much to know after mentally expanding the acronym — Comma Separated Values. But tell me this: if it’s so simple, then why are there so many CSV parsing libraries, alternative CSV parsing libraries, and CSV parsing libraries that claim to be better or smarter, and a mountain of mangled CSVs in existence?
CSV isn’t so much a file format as it is a loose set of guidelines for converting tabular data into text. The closest thing to a spec for it is this, which deals with vital and often overlooked questions such as:
1. “What happens if a value has a comma in it?” — oh, you quote it
2. “What happens if a value has a quote in it?” — oh, you put another quote before it
One question the spec definitely does not cover is one I needed answering: “What do you do with 32,000 files claiming to be valid CSVs but of the 750,000 some lines an unknown number of them have extra unquoted commas hidden in the values, basically making the data untrustworthy?”
This is not such a simple problem, but it’s an interesting problem.
Background
I’ve been building a platform for public and community radio stations the last year and have been working with the fantastic 91.7 KOOP in Austin, TX to pilot it. Part of this project involved exporting the data from their old system (some 750,000 tracks) into the system I built for them. And noticing some tracks from an artist named “10” doing a song called “000 maniacs”, I realized the old system’s crusty CSV exporter did not do the right thing when it came to commas in values, leaving me with a real mess.
How many track titles, artists, albums, or record labels have commas in their names? Many.
I initially thought this was an impossible problem to solve (at least in a way that would not drive a person insane) so I tried reaching out to the developer of the old system to fix their ancient CSV exporter, or to send me data in another format that I could sort…