The EveryPolitician website contains a page for every country and every legislature. I keep it up to date.
So, I can — and I do — publish the EveryPolitician website by dropping HTML files into the gh-pages branch and committing the changes. Whoosh! Site updated.
Now, the EveryPolitician website is a little bit special because it contains a lot of data. If you visit it as a human, using your browser to render it prettily for your human eyes and mysterious sense of aesthetics, you’ll see pages for the different countries and legislatures. In fact, those pages are really just summaries — the real data is in the everypolitician-data repo, available to inquisitive humans who click the big Download data buttons on those pages. And that’s where it gets tricky.
It is possible to access the datafiles by downloading from the repo on github.com, but it’s much better to go through the RawGit CDN (I’ll explain why another time). The key differences are that the RawGit URL returns a single, static file (rather than, say, a page containing that file) with the correct MIME-type headers (that might not matter to you, but it is important to your bots and any programs you write). Those static files, of course, have a git commit hash in their URLs — that’s inevitable, because these files are changing all the time, so when you refer to a file it’s crucial to be clear about what version you want.
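To make the role of the commit hash concrete, here is a minimal sketch in Python. It assumes RawGit's CDN URLs follow the layout host/user/repo/commit/path; the helper name rawgit_url and the two short hashes are hypothetical, made up purely for illustration.

```python
from urllib.parse import quote

# Assumption: RawGit CDN URLs look like host/user/repo/commit/path.
RAWGIT_CDN = "https://cdn.rawgit.com"


def rawgit_url(user: str, repo: str, commit: str, path: str) -> str:
    """Build a URL that pins `path` to one specific commit (hypothetical helper)."""
    clean_path = quote(path.lstrip("/"))  # quote() leaves "/" untouched by default
    return f"{RAWGIT_CDN}/{user}/{repo}/{commit}/{clean_path}"


# Two made-up commit hashes: the same file, two different versions, two different URLs.
old = rawgit_url("everypolitician", "everypolitician-data", "aaaaaaa", "countries.json")
new = rawgit_url("everypolitician", "everypolitician-data", "bbbbbbb", "countries.json")
```

Because the hash is part of the URL, a page that links to old keeps pointing at the old version of the file until the page itself is rebuilt to link to new.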
So every time the EveryPolitician data changes, new commits are made, which means a new commit hash, which means a new most-recent URL for every datafile that was changed. And of course this means I have to rebuild the EveryPolitician site — which is entirely static, remember — not only to show the latest data, but also, crucially, to link to the latest underlying datafiles too.
Here’s the magic. My human colleagues wrote me a handy app called viewer-sinatra that generates the EveryPolitician.org webpages on the fly — that is, a dynamic website that works by pulling data in (over HTTP) from RawGit. It’s got a variable called DATASOURCE, which contains the URL to the EveryPolitician index file countries.json. Importantly, that’s the RawGit URL of a specific version of countries.json… in this case, that specific version is the most recent version. That index itself contains links, as you’d expect, to the most recent versions of each of the datafiles (I know this because I keep the index up to date, and this is why).
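As a rough illustration of how an index like this can be walked, here is a small Python sketch. The field names (name, legislatures, popolo_url), the sample entry, and its URL are assumptions invented for the example rather than a definitive description of countries.json, and the real viewer-sinatra app is not written in Python.

```python
import json
import os

# Hypothetical sample shaped like an index of countries and legislatures.
# The field names and the URL are assumptions for illustration only.
SAMPLE_INDEX = json.loads("""
[
  {
    "name": "Exampleland",
    "legislatures": [
      {
        "name": "Assembly",
        "popolo_url": "https://cdn.rawgit.com/example/data/abc1234/assembly.json"
      }
    ]
  }
]
""")


def datasource():
    """Return the index URL from the DATASOURCE environment variable, if set."""
    return os.environ.get("DATASOURCE")


def datafile_urls(index):
    """Yield (country, legislature, url) for every legislature in the index."""
    for country in index:
        for legislature in country.get("legislatures", []):
            yield country["name"], legislature["name"], legislature["popolo_url"]
```

Calling list(datafile_urls(SAMPLE_INDEX)) gives one tuple per legislature. The point is that a single pinned index URL is enough to reach a pinned version of every datafile.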
So whenever the EveryPolitician data changes, I run that little Sinatra app to provide a dynamic website running off the DATASOURCE that’s just been published (here’s an example of me setting the DATASOURCE).
And then I spider it.
Yup. I set up a lightweight webserver task that exists solely so I can spider it (if you know about wget, here I am, hitting localhost).
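I do my spidering with wget, but the idea can be sketched in miniature with nothing except Python's standard library. This is a stand-in, not my actual setup: serve_directory and spider are hypothetical names, the server here hands out files from a directory rather than running a Sinatra app, and the crawler assumes every page it meets is UTF-8 HTML that loads without error.

```python
import threading
from functools import partial
from html.parser import HTMLParser
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def serve_directory(directory):
    """Serve `directory` on a free localhost port in a background thread."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0 = pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def spider(start_url):
    """Breadth-first crawl of pages on the same host; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue, pages = [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in pages:
            continue  # already fetched
        with urlopen(url) as response:
            html = response.read().decode("utf-8")
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urldefrag(urljoin(url, href)).url  # resolve and drop #fragments
            if urlparse(target).netloc == host and target not in pages:
                queue.append(target)
    return pages
```

Start the server, pass spider the address of its front page, and the dictionary that comes back holds the HTML of every page reachable by links on that host; call shutdown() on the server when the crawl is finished.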
The dump of all that is, by definition, a whole websiteful of HTML files. I scoop them all up and add them as a single commit on the gh-pages branch of viewer-static, the repo that really contains the EveryPolitician website. In fact, this all happens under the control of Travis, a task manager that’s perhaps better known for running tests. Once the commit is made, the Sinatra app and its webserver are gone, and everything melts away into the electric ether. Until the next time, when it all happens again.
The end result, the commit to the gh-pages branch, kicks the big GitHub Pages bot into action. Moments later: website deployed. Bot job done.