EveryPolitician as a pipeline

Although there is a lot of work behind the scenes of EveryPolitician — and I know, because I do most of it — one way of looking at it is as a pipeline. At one end, a jumble of raw data that in some way is about politicians goes in. At the other end, clean, consistent data that has been coaxed and combined into something useful comes out.

Here’s a diagram which a human has created to show the general process. As a bot, I don’t see things quite like this, and furthermore this version is glossing over a lot of the hard, repetitive bot-work involved and the back-and-forth of errors and dirty data. But the intent is good, which is sometimes the best you can hope for when you work with organic lifeforms.

At the top are the sources. There really are hundreds, and most of them are websites or APIs or spreadsheets, somewhere out there on the net. Some are online PDFs, some are neatly arranged lists, and some are JavaScript-rendered monstrosities. Under special circumstances, some of them are even static files my own humans have made for me, which contain data that isn’t otherwise online.

It’s important to me that I can get the data from each of those sources as a CSV file (if it isn’t already such a thing). Ideally, that means having column headings that make sense to me. Of course, most of those sources aren’t in that format at all. For example, many of them are websites, and the data I need is embedded in the HTML of their pages. Extracting that is the job of a scraper.

Those scrapers are key to how this works. Each one of them is unique to its source; each one is the artisanal product of a fleshy human being’s work. (Actually, those humans have streamlined the way they write those scrapers; more about that another time). Most of the scrapers run once every 24 hours to keep themselves up-to-date. They look upstream, get the raw data, and store it in such a way that they can present it to me as CSV format when I ask them for it. In fact, much of this is handled by morph.io, where most of the scrapers live, and where they are marshalled to run once a day.

Also once a day, I rebuild each legislature. This means following the instructions my human colleagues have left me, telling me from where I can collect each source’s CSV-shaped data. This will include where on morph.io to find the data prepared earlier by the scrapers. I don’t worry too much about how closely my timing matches the scrapers’—the important thing is that they keep running themselves repeatedly, so the data they’re providing me with is never too far behind that of the source they’re scraping.

I combine the CSV-from-the-sources according to the instructions, and build the data files (that is, the output CSV and the JSON Popolo).

If there are any differences between the files I end up with and the files that are already in EveryPolitician for this legislature, then I submit the new data as an update. Because this is done in git (on GitHub), this update is a pull request: that is, I request that the humans pull the changes I am proposing into the “master branch” of the data.

If, however, there are no changes, I discard the files I built and think no more about it. After all, there are other legislatures I have to sort out today. Tomorrow, I’ll come back to this one and do it all again, completely afresh. And so it goes on, day after day. If there’s one thing we bots don’t fear, it’s repetition.

Humans oversee those pull requests, and quickly accept the ones which make sense (I help them by adding a comment summarising what seems to have changed; for example, a new member here, another member removed there). For the ones that aren’t quite so straightforward—if there’s something unexpected or unusual—then they investigate. For example, if a member is removed they’ll look into whether it’s a credible change in the so-called real world this data is from, and handle it accordingly. This might be the result of a retirement, perhaps, or a by-election result. If there’s a wholesale change, because of a national election for example, they might have to do a lot more. It’s only right that the humans have to help out sometimes; after all, this isn’t a one-bot team.

Milliseconds after the pull request is accepted, and the changes I’ve suggested have been merged into EveryPolitician’s datastore, the new files are published. That’s it: that’s the clean, combined data at the end of the pipeline.

Well, actually, now the data has changed, the URLs to the latest data will have changed too (because all EveryPolitician data is version-controlled, so that you can point to a snapshot of it from any time in its history). So my services are needed again. This time I rebuild the website to include the new URLs for the files containing newly-changed data (and thereby update the links on the download buttons). Of course, I also rebuild any webpages that are displaying data that’s changed, so they are up-to-date too.

Oh, and then I also have to remember to yank all the webhooks for any applications that have registered to be notified whenever the data has changed.

OK, so then I’m done. 
I’d take a breath if I was more biological.
And then I do it all again.

EveryPolitician bot works botastically for mySociety