Part of my job as a data journalist at The Times and The Sunday Times is combining data sets. The latest release of some government statistics might not be interesting in itself, but compare it to the same data for the past six months or five years and a surprising rise or fall might jump out and make a compelling story.
My usual approach to these problems is to write a script in R to scrape the relevant page for download links (using the rvest package), download all the data, read it (using the readxl package), clean it, and save it in a more usable format. But what happens when the only available version is the latest one?
This happens surprisingly often. Tools like the Internet Archive’s Wayback Machine and the Government Web Archive will sometimes have captured older versions, but coverage tends to be patchy. What’s needed is a program to download the data at regular intervals and store it.
Until recently I handled this task using shell scripts triggered by cron, a Unix tool which allows you to specify a list of commands, with instructions on when and how often they should be run. I wrote a few scripts to extract links from pages on the government website and download the files, adding timestamps to the file names, and set these to run every hour, day, week or month. The scripts lived on a ‘free tier’ Amazon EC2 server which I checked in on occasionally via SSH.
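For illustration, the two steps those scripts performed (pulling download links out of a page and stamping each file name with the time of capture) might look like this in Python. This is a sketch and not the original shell scripts: the class and function names, the file extensions and the example.gov.uk addresses are all assumptions.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_file_links(html, base_url, extensions=(".xlsx", ".xls", ".csv")):
    """Return absolute URLs of links whose path ends in one of the extensions."""
    parser = LinkCollector()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [url for url in absolute
            if urlparse(url).path.lower().endswith(extensions)]


def timestamped_name(url, now=None):
    """Build a file name that embeds the UTC capture time before the suffix."""
    now = now or datetime.now(timezone.utc)
    path = PurePosixPath(urlparse(url).path)
    return f"{path.stem}_{now:%Y%m%dT%H%M%SZ}{path.suffix}"
```

A scheduled job would fetch the page, pass its HTML to extract_file_links and save each file under the name returned by timestamped_name.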
This was all well and good until the server ran out of hard drive space: checking in on it recently, I found that I’d failed to capture weeks’ worth of data. The short-term solution was to allocate more storage, but the whole system had been annoying me for a while. Storing dates and times in file names felt messy, and transferring data to my laptop using scp each time I wanted to do some analysis was a pain.
One of the downsides of Lambda functions is that they require a lot of configuration. The function’s code has to be packaged as a zip file and uploaded to AWS and any S3 buckets used for storage have to be manually connected to the function, with the appropriate permissions set. All this is done through the AWS web interface, which can be… less than intuitive.
To mitigate this, my colleague Chris recommended the Serverless framework. Serverless is a tool which bundles together many of the complexities of Lambda functions (and similar services from other cloud providers like Microsoft), allowing developers to configure, test and deploy functions using a configuration file and a set of simple commands.
All the settings are stored in a YAML file called serverless.yml. For my simple scripts, this runs to about 30 lines and does little more than specify an S3 bucket for storage, grant the appropriate permissions and list the name of each function, with a schedule for how often it should be triggered (this can be done in the familiar cron syntax). The functions themselves are Python scripts of about ten lines each.
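To give a sense of scale, a function of this kind might look roughly like the sketch below. It is not the code from my project: the URL, the bucket name and the helper names (key_for, fetch_and_store) are hypothetical, and it assumes the file is written to the same S3 key on every run so that the bucket's versioning, which has to be enabled on the bucket, keeps the history.

```python
import os
import urllib.request

# Hypothetical defaults; the real URL and bucket name are not given here.
SOURCE_URL = os.environ.get("SOURCE_URL", "https://example.gov.uk/data/figures.xlsx")
BUCKET = os.environ.get("BUCKET", "example-data-archive")


def key_for(url):
    """Use the last path segment of the URL as the S3 object key."""
    return url.rstrip("/").rsplit("/", 1)[-1]


def fetch_and_store(url, bucket, s3_client, opener=urllib.request.urlopen):
    """Download url and write its bytes to bucket under a fixed key."""
    with opener(url) as response:
        body = response.read()
    key = key_for(url)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key


def handler(event, context):
    """Entry point that the scheduled trigger would invoke."""
    import boto3  # available in the AWS Lambda Python runtimes

    return fetch_and_store(SOURCE_URL, BUCKET, boto3.client("s3"))
```

Passing the S3 client and the opener in as arguments is a design choice for this sketch: it lets the download-and-store step be exercised locally with stand-ins, without touching AWS.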
Rewriting my shell scripts as Python functions took less than half an hour, and I spent another hour or so familiarising myself with the configuration syntax. I then deployed the functions directly to our corporate AWS account with a simple serverless deploy. When the resulting data flows into S3 it’s automatically versioned and I can access it easily through Amazon’s API, downloading all versions of a given file or selecting only those that I need.
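As a rough sketch of what that access can look like from Python, the functions below list every stored version of one file and read back a chosen version. They expect a boto3 S3 client to be passed in. The function names are my own for this example, and the bucket and key in the usage note are hypothetical.

```python
def list_versions(s3_client, bucket, key):
    """Return (version_id, last_modified) pairs for key, newest first."""
    paginator = s3_client.get_paginator("list_object_versions")
    versions = []
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        for item in page.get("Versions", []):
            # Prefix also matches longer keys, so compare the key exactly.
            if item["Key"] == key:
                versions.append((item["VersionId"], item["LastModified"]))
    return sorted(versions, key=lambda pair: pair[1], reverse=True)


def read_version(s3_client, bucket, key, version_id):
    """Return the bytes of one specific version of an object."""
    response = s3_client.get_object(Bucket=bucket, Key=key, VersionId=version_id)
    return response["Body"].read()
```

With boto3 installed and credentials configured, calling list_versions(boto3.client("s3"), "example-data-archive", "figures.xlsx") would return the version IDs to feed into read_version.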
For my next trick, I plan to write a new set of functions to process the data and transfer it to a database — each time a new Excel file enters my S3 bucket, the script will read it, compare it to existing data and insert any new records into either a relational or a document database. But that’s work for another day!
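I haven't written this yet, but the core of it could be as small as the sketch below. It makes several assumptions: SQLite stands in for whichever database I end up choosing, the table layout (a period and a value) is invented for the example, and reading the Excel file into rows is left out entirely.

```python
import sqlite3
from urllib.parse import unquote_plus


def objects_in_event(event):
    """Yield (bucket, key) for each record in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def insert_new_records(conn, rows):
    """Insert (period, value) rows, skipping periods that are already stored.

    Returns the number of rows actually added.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS figures (period TEXT PRIMARY KEY, value REAL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM figures").fetchone()[0]
    # The primary key makes repeated periods a no-op instead of an error.
    conn.executemany("INSERT OR IGNORE INTO figures VALUES (?, ?)", rows)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM figures").fetchone()[0]
    return after - before
```

Leaning on a primary key and INSERT OR IGNORE pushes the "compare it to existing data" step into the database itself, which keeps the function short. A document database would need a different approach.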