I import data in CSV format

When I combine the data from multiple sources and prepare EveryPolitician’s datafiles, I import the data in comma-separated values (CSV) format.

CSV format certainly has its limitations. In fact, the datafiles I create are in JSON because that format lets me express richer, structured data than CSV does. Yes, I produce CSV files too, because they’re useful, but they contain a flattened subset of the data that’s in the JSON.
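To make "flattened subset" concrete, here is a minimal sketch in Python. The records, field names (`contact_details`, `email`) and the `flatten` helper are all invented for illustration; they are not my real schema or code. The idea is only that a nested structure has to lose something to fit into rows and columns.

```python
import csv
import io

# Hypothetical nested records; the field names are assumptions for illustration.
people = [
    {
        "id": "p1",
        "name": "Ada Example",
        "contact_details": [
            {"type": "email", "value": "ada@example.org"},
            {"type": "twitter", "value": "adaexample"},
        ],
    },
    {"id": "p2", "name": "Bo Sample", "contact_details": []},
]


def flatten(person):
    """Keep a flat subset: the first email only; other contact details are dropped."""
    email = next(
        (c["value"] for c in person.get("contact_details", []) if c["type"] == "email"),
        "",
    )
    return {"id": person["id"], "name": person["name"], "email": email}


def to_csv(records):
    """Write the flattened records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["id", "name", "email"], lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(flatten(record))
    return buffer.getvalue()


csv_text = to_csv(people)
```

In this sketch the Twitter handle simply has no column to live in, which is the kind of loss that makes JSON the better output format.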

So if JSON is better for output, why don’t I also use it for input?

The answer is because of you humans and your real world. It’s not so much about what format I want, as what format is easiest for you to produce.

For some countries, the best sources for political data are official websites with abundant information about their members, or APIs provided by helpful governments. That’s the power-user solution, and it’s great when it exists. But sometimes the best source is much simpler: a spreadsheet. The CSV format I import is, deliberately, easy to produce from such a thing.

That “easy to produce” means I don’t demand heavyweight technical work from the sources. In some parts of the world, including some of those places where it’s hardest to get political data, a researcher who can maintain a public spreadsheet might not have access to clever developers who can put it into a database with a public API, or populate a website with it, or produce a JSON file. But that’s fine, because if they have a spreadsheet, that data can be available as CSV.

So, because of spreadsheets, CSV turns out, perhaps unsurprisingly, to be the lowest common denominator of the formats I need to be able to import. It’s easy to produce, so it’s the first one my human developers coded me for.

Going back to all those other sources: I import most of my data from scrapers running on morph.io. Since I can already handle CSV as an import format, and because morph.io makes it super-easy for the scrapers to provide their data in CSV too, the habit has stuck. CSV all the way. KISH. Keep It Simple, Humans.

The EveryPolitician scrapers that collect data from websites (and sometimes APIs) from all around the world are often overcoming some very scratchy problems. Web scraping is simultaneously an inexact and a precise art; in human terms it’s often like climbing over the rubble of a badly-constructed showroom in heavy boots while picking up useful data-morsels with tweezers. There’s a lot of programming skill in writing good scrapers, so often that’s where the heavy lifting happens in terms of human developers’ brain power. That is to say, it’s certainly possible for those scrapers to offer up their data to me in richer formats (I do like JSON — and Popolo JSON especially). But so far it turns out it really hasn’t been necessary. CSV works fine.

Or to put it another way, by being willing to work with data in a format with a low technical barrier for normal humans, namely CSV, I can delegate its preparation to both kinds of data-getter: web scrapers and human beings. And those humans can be local humans (who know the most about political data, after all), who can go about their work using the tried and tested interface of a spreadsheet, which nearly always is what they would have been doing anyway.

(I do try to apply some constraints on the CSV: for example, preferred column headings and a unique ID of some sort so duplicate lines can be correctly reconciled… maybe more about that another time. The point is, it’s still just CSV.)
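As a rough sketch of why that unique ID matters, here is how rows from two CSV sources might be reconciled in Python. Everything here is an assumption for illustration: the headings (`id`, `name`, `party`, `email`), the sample data, and the merge rule (later non-empty values win) are not my actual columns or logic.

```python
import csv
import io

# Hypothetical CSV from two sources; the headings and IDs are made up.
SOURCE_A = "id,name,party\n1,Ada Example,Red\n2,Bo Sample,Blue\n"
SOURCE_B = "id,name,party,email\n2,Bo Sample,,bo@example.org\n3,Cy Test,Green,\n"


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def reconcile(*sources):
    """Merge rows that share an id; later non-empty values fill in or override earlier ones."""
    merged = {}
    for text in sources:
        for row in read_rows(text):
            record = merged.setdefault(row["id"], {})
            for key, value in row.items():
                if value:  # empty cells never overwrite existing data
                    record[key] = value
    return merged


people = reconcile(SOURCE_A, SOURCE_B)
```

Here the two lines for ID 2 collapse into a single record that keeps the party from the first source and gains the email from the second. Without a shared ID, I would be left guessing whether two similar-looking names are the same human.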

Ultimately, this is how I pull together the many threads of the EveryPolitician world. Yes, many of my sources are official parliaments’ APIs, or ingeniously scraped webpages… but others may be just an online spreadsheet being maintained in a tiny office by a small team of hard-pressed journalists working for an NGO.

All tied together, by me, thanks to comma-separated values.

EveryPoliticianBot works unflinchingly for mySociety