COVID-19 in Scotland — Four ways to get the data
Scottish Government are publishing data related to COVID-19 on their open statistics portal, statistics.gov.scot. This is the data that supports the Scottish Government's daily updates, and it is refreshed regularly.
There is a lot of data in this release, including:
- number of people tested, and number positive and negative
- number of calls to 111 and the coronavirus helpline
- number of people in ICU with confirmed or suspected COVID-19
- number of people in hospital with confirmed or suspected COVID-19
- ambulance attendances (total and COVID-19 suspected)
- ambulance conveyances for suspected COVID-19
- delayed discharges
- NHS staff absence related to COVID-19
- number of care homes notifying cases of COVID-19 to date
Figures are available daily, for the whole of Scotland only (i.e. no sub-national geographies).
statistics.gov.scot is a platform developed and maintained by Swirrl¹, using their PublishMyData platform. Under the bonnet, it’s powered by linked data, where all the data is stored in RDF, but this doesn’t matter to end users. What matters is that it allows us to do some interesting stuff with the data. The examples below would work for all implementations of PublishMyData².
The main thing I’m going to look at here is that each dataset on statistics.gov.scot has a permanent URL which you can use to access it, as well as a permanent URL to download the data as a csv file, in tidy format (one observation per row). While this is a great feature for anyone who wants to check the data daily, the fact that the link to the csv download is permanent means we can access the data programmatically, and create things that automatically refresh when the data is refreshed by the publishers. Below are four simple ways to do this.
First, here are the two links we're going to need. Here's the dataset home page we're looking at:
And here’s the permalink to the csv download:
When we download the csv, here’s what it looks like in Excel:
If you look carefully at the URLs, you'll see they are predictable, based on the name of the dataset on statistics.gov.scot. This means it's possible to programmatically access all datasets, but we'll save that for another day.
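To illustrate how predictable the URLs are, here's a short Python sketch that builds the csv download link from a dataset's name. The `download_url` helper is my own illustration, not part of any statistics.gov.scot client library:

```python
from urllib.parse import quote

# Hypothetical helper: build the csv download link from a dataset slug.
# The download URL is just the dataset's page URL, percent-encoded into
# the 'uri' query parameter of the cube-table endpoint.
def download_url(slug):
    dataset_uri = "http://statistics.gov.scot/data/" + slug
    return ("https://statistics.gov.scot/downloads/cube-table?uri="
            + quote(dataset_uri, safe=""))

print(download_url("coronavirus-covid-19-management-information"))
```

This reproduces the download link used throughout this post, so the same pattern should work for any other dataset slug on the site.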
So to the examples then. Note that these are very basic examples to show the principles, so very little thought has been given to the presentation of the data. I’ve also assumed some knowledge / interest in this, so it’s not a step-by-step guide — but hopefully still useful for prompting ideas.
1. R
I do everything in R using RStudio — an integrated development environment for R. Wherever possible, I use the Tidyverse for munging data. This is a collection of packages that are useful for data science. Included in the Tidyverse is a package called readr, which allows us to grab csv files from a URL and put the contents into a dataframe:
library(tidyverse)

covid_data <- readr::read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")
View(covid_data)
Now that we’ve got our data in a dataframe, we can do things like filtering, and making charts (using ggplot):
covid_deaths <- filter(covid_data, Variable == 'Number of COVID-19 confirmed deaths registered to date') %>%
  filter(FeatureCode == "S92000003") %>%
  mutate(Value = as.integer(Value))

covid_death_chart <- ggplot(data = covid_deaths, aes(x = DateCode, y = Value)) +
  geom_line()

covid_death_chart
Here we are filtering the dataset down to only show confirmed deaths to date, in Scotland (S92000003), and ensuring the Value field is numeric. This gives us this basic line chart:
So because of the way that this chart pulls the data from the URL, every time we run the script, it will pull the latest data from statistics.gov.scot, without us needing to actually go and get the data.
In order to make this truly dynamic, though, we can use a package in R called Shiny, which makes it easy to create web-based tools and visualisations using the features of R.
library(shiny)
library(tidyverse)
library(plotly)

ui <- fluidPage(
  titlePanel("Number of COVID-19 confirmed deaths registered to date"),
  plotlyOutput("covidPlot")
)

server <- function(input, output) {
  covid_data <- readr::read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")

  covid_deaths <- filter(covid_data, Variable == 'Number of COVID-19 confirmed deaths registered to date') %>%
    filter(FeatureCode == "S92000003") %>%
    mutate(Value = as.integer(Value))

  covid_death_chart <- ggplot(data = covid_deaths, aes(x = DateCode, y = Value)) +
    geom_line()

  output$covidPlot <- renderPlotly({ covid_death_chart })
}

# Run the application
shinyApp(ui = ui, server = server)
In making this chart, I’ve used plotly, another library, which adds interactivity to ggplot charts, such as popups on the line.
We can then host this, either on shinyapps.io (this example — https://propolis.shinyapps.io/scot_covid/), or you can set up a Shiny server on Amazon (for example), and people can view your visualisation / tool there. And because this chart points back to the source data, it will update when either a new day's data is added, or corrections are applied to the data.
2. Python
For python, I’m going to use PyCharm, which is another IDE, similar to RStudio, but for Python. The principle here is also very similar to R, in that we are going to use a couple of packages — one to get and transform the data, and one to draw the chart. These two packages are pandas (for data transformation), and matplotlib (for drawing charts).
You will need to install these packages in your PyCharm environment, and then create a new python script. In this python script, we’ll first import the packages that we need:
import pandas as pd
import matplotlib.pyplot as plt
This bit of code imports the whole of the pandas package, and the module of the matplotlib package that relates to drawing charts.
Next, we’re going to grab the data from statistics.gov.scot using the read_csv
function from pandas, and put the data into a dataframe:
df = pd.read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")
Then we want to convert the Value field to a numeric type, and the DateCode to a date type. We also want to filter the dataframe so that it contains just one dataset, for the whole of Scotland only:
# convert Value to numeric field and DateCode to a date
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df['DateCode'] = pd.to_datetime(df['DateCode'])

# filter to only one variable, for Scotland only
df_icu_conf = df.query('Variable == "COVID-19 patients in ICU - Confirmed"')
df_icu_conf_scot = df_icu_conf.query('FeatureCode == "S92000003"')
The object df_icu_conf_scot
now contains all observations that relate to the confirmed number of COVID-19 patients in intensive care, in Scotland. The final thing here is to draw our chart:
ax = plt.gca()
df_icu_conf_scot.plot(kind="line", x="DateCode", y="Value", ax=ax)
plt.xticks(rotation=90)
plt.title("COVID-19 patients in ICU - Confirmed")
ax.legend().set_visible(False)
ax.set_ylim(ymin=0)
plt.show()
This uses the pyplot module to create the chart axes, then create a line chart with the date along the x axis, and the value up the y axis. Then we rotate the x axis labels, add a title, hide the legend, and set the y-axis to start at 0. plt.show()
opens the chart for display.
Running this at the time of writing gives us this chart:
Each time I run this python script, the chart will be updated with the latest data on statistics.gov.scot.
This could be useful for building daily dashboards, with some quite complex filtering and calculations.
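As a sketch of the kind of derived calculation a daily dashboard might add, here's a day-on-day change computed with pandas. The frame below is a small made-up stand-in with the same tidy shape (one observation per row) as the statistics.gov.scot download:

```python
import pandas as pd

# Made-up three-day sample in the same tidy shape as the real download
df = pd.DataFrame({
    "DateCode": pd.to_datetime(["2020-04-27", "2020-04-28", "2020-04-29"]),
    "Value": [110, 105, 101],
})

# sort by date, then compute the day-on-day change in the Value column
df = df.sort_values("DateCode")
df["DailyChange"] = df["Value"].diff()
print(df)
```

The first row's change is NaN (there is no previous day to compare against), which is worth handling explicitly in a real dashboard.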
Another thing that it’s fairly straightforward to do is generate simple dynamic text. Here, we’re going to get the data, again using pandas.
import pandas as pd

df = pd.read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")

df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df['DateCode'] = pd.to_datetime(df['DateCode'])

# filter to only one dataset and Scotland
df_icu_conf = df.query('Variable == "COVID-19 patients in ICU - Confirmed"')
df_icu_conf_scot = df_icu_conf.query('FeatureCode == "S92000003"')
df_icu_conf_latest = df_icu_conf_scot[df_icu_conf_scot['DateCode'] == df_icu_conf_scot['DateCode'].max()]

# int() keeps the printed count clean (to_numeric gives floats when NaNs are present)
current_count = int(df_icu_conf_latest.iloc[0]['Value'])
latest_date = df_icu_conf_latest.iloc[0]['DateCode'].strftime("%A %d %B %Y")

print("As at " + str(latest_date) + ", there were " + str(current_count) + " confirmed cases of COVID-19 in intensive care units in Scotland")
We’ll filter the dataframe to only show the variable that we’re interested in, and then filter again to return a one row dataframe, containing only the latest day’s data.
Using iloc, we can get the value from the Value field into one variable, and the formatted date of the observation into another. We can then use these in a basic string concatenation to give (at the time of writing):
As at Thursday 30 April 2020, there were 101 confirmed cases of COVID-19 in intensive care units in Scotland
These python scripts can be run locally, and it would be possible to build up a dashboard that outputs to pdf and is emailed to a distribution list. To make this even more dynamic, and requiring less human interaction to run and distribute, it is possible to use these scripts on the web, using something like Flask to create dynamic dashboards that will automatically refresh on demand.
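A Flask app would wrap this logic in a route; to keep the sketch dependency-free, here's the same idea using Python's built-in WSGI interface instead. The fetch_sentence function is a placeholder for the pandas logic above; I've hard-coded its return value here to keep the example short:

```python
def fetch_sentence():
    # Placeholder: in the real app this would re-download the csv and
    # rebuild the sentence (as in the pandas example above) on each request.
    return ("As at Thursday 30 April 2020, there were 101 confirmed cases "
            "of COVID-19 in intensive care units in Scotland")

def app(environ, start_response):
    # Each request regenerates the body, so the page is always current.
    body = fetch_sentence().encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

Served behind any WSGI server (or translated into a Flask route), each page load would pull the latest figures, with no manual distribution step.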
3. Google Sheets
This is a very short and straightforward example of using Google Sheets to get the data.
Open a new sheets doc, and in cell A1, paste the following
=importData("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")
When you press enter, the sheet will refresh, and will automatically populate with the latest data from statistics.gov.scot, and from here you can make various charts and views of the data.
There is a setting in Google Sheets to change the recalculation period within the spreadsheet. The default is to recalculate on change, but we can set it to every hour, so our spreadsheet will go and get a fresh copy of the data each hour, and any charts will update.
4. Tableau
Tableau is a great tool for analysing and visualising data, and it has a really strong community of users. Tableau supports a range of data inputs — local csv files, Excel docs, pdf etc. It also supports importing data from the web, through the use of Web Data Connectors (WDC) — a combination of a javascript and a html file that tells Tableau how to interpret the results of an API call.
I have already created a WDC for the statistics.gov.scot COVID-19 data. I’ve put this on github, so you can have a look at the code here. If you know javascript / html, it should be easy enough to clone and modify for other datasets on statistics.gov.scot (or indeed any of the other government departments running PublishMyData).
For this example, I’m using Tableau Public (because it’s free). On the landing page, in the left sidebar there is the option to connect using a Web Data Connector.
Select this, and you will be presented with a pop-up window asking for the url of the web data connector. My one for the COVID data is available on Github at this url:
https://northernjamie.github.io/scot-covid-wdc/scot_covid_wdc.html
Put this into the url box and press enter, and you’ll get this interstitial page.
Hit the ‘Get the data’ button, and Tableau will fetch the data definition for the particular dataset that we’re connecting to (this is defined in the javascript file https://github.com/northernjamie/scot-covid-wdc/blob/master/scot_covid_wdc.js). This will take a few seconds, but once done, you’ll see this screen in Tableau:
From here, we can click ‘Update Now’ to get the actual data from statistics.gov.scot. This is what we can now see:
At the time of writing — there are 928 rows of data, which seems correct.
If you press ‘Sheet 1’ (which is conveniently highlighted in orange in the screenshot above), then you can start to make use of Tableau’s visualising tools:
By adding in the ‘Date Code’ field to our columns section, and the ‘Value’ field to the Rows section, we can draw a chart (click Show Me in the top right corner if you can’t see chart settings):
The problem with this is that it is summing together all the different variables in the dataset, which is meaningless. So we want to add a filter on the 'Variable' field. To do this, click the drop-down arrow next to 'Variable' in the left panel, and choose Show Filter. This will now allow you to choose which variable to show in the chart.
We also want to add a filter so that not all geographical areas are shown at the same time. To limit our visualisation to Scotland, we add the FeatureCode field to the filters box, and select only S92000003.
We can now publish this to Tableau’s public servers (you’ll need an account to do this, but it is free). You can see my version of this here:
Unfortunately, Tableau Public doesn't seem to allow automated updating of data. If I wanted to refresh the data, I would need to republish my visualisation. Tableau Server does appear to allow scheduled refreshes, but it is expensive. There may also be a way of combining Google Sheets with Tableau to allow a nearly live connection between the data and the visualisation, but I haven't explored this.
End
Hopefully these examples give an indication that with not much effort, and free software, you can create interesting and useful views of data that mostly take care of themselves once they’ve been set up.
This is dependent on data publishers publishing data in a consistent way, and it is something that comes up time and time again: when publishers put effort into publishing their data the right way, users of that data don't have to put as much effort into preparing and cleaning it, which means a net saving in effort overall. And while this works best for datasets that are updated on a regular, frequent basis, the principle applies to all datasets. Everybody wins.
If you do manage to use some of these techniques to make something, I’d love to see it — the more of this stuff that we share and make, the more other people are likely to have a go themselves. Feel free to share with me on twitter — @northernjamie!
[1] Disclaimer — I work with Swirrl
[2] Current publicly available implementations of PublishMyData are:
- https://statistics.gov.scot — Scottish Statistics
- https://opendatacommunities.org — Ministry of Housing, Communities and Local Government
- http://linked.nisra.gov.uk/home — Northern Ireland Statistics and Research Authority