COVID-19 in Scotland — Four ways to get the data

Jamie Whyte
11 min readMay 1, 2020

--

Scottish Government are publishing data related to COVID-19 on their open statistics portal — statistics.gov.scot. This data is the data that supports Scottish Government’s daily updates, and is updated regularly.

There is a lot of data in this release, including:

  • number of people tested, and number positive and negative
  • numbers of calls to 111 and coronavirus helpline
  • number of people in ICU with confirmed or suspected COVID-19
  • number of people in hospital with confirmed or suspected COVID-19
  • ambulance attendances (total and COVID-19 suspected)
  • ambulance conveyances for suspected COVID-19
  • delayed discharges
  • NHS staff absence related to COVID-19
  • number of care homes notifying cases of COVID-19 to date

Figures are available daily, and are available for the whole of Scotland only (ie no sub-national geographies).

statistics.gov.scot is a platform developed and maintained by Swirrl¹, using their PublishMyData platform. Under the bonnet, it’s powered by linked data, where all the data is stored in RDF, but this doesn’t matter to end users. What matters is that it allows us to do some interesting stuff with the data. The examples below would work for all implementations of PublishMyData².

The main thing I’m going to look at here is that each dataset on statistics.gov.scot has a permanent URL which you can use to access it, as well as a permanent URL to download the data as a csv file, in tidy format (one observation per row). While this is a great feature for anyone who wants to check the data daily, the fact that the link to the csv download is permanent means we can access the data programmatically, and create things that automatically refresh when the data is refreshed by the publishers. Below are four simple ways to do this.

First, here’s the two links we’re going to need. Here’s the dataset home page we’re looking on:

https://statistics.gov.scot/resource?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information

And here’s the permalink to the csv download:

https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information

When we download the csv, here’s what it looks like in Excel:

screenshot of COVID-19 data in excel
COVID-19 data in Excel

If you look a carefully at the URLs, they are predictable based on the name of the dataset in on statistics.gov.scot. This means it’s possible to programmatically access all datasets, but we’ll save that for another day.

So to the examples then. Note that these are very basic examples to show the principles, so very little thought has been given to the presentation of the data. I’ve also assumed some knowledge / interest in this, so it’s not a step-by-step guide — but hopefully still useful for prompting ideas.

1. R

I do everything in R using RStudio — an integrated development environment for R. Wherever possible, I use the Tidyverse for munging data. This is a collection of packages that are useful for data science. Included in the Tidyverse is a package called readr and this allows us to grab csv files from a URL, and put the contents into a dataframe:

libraries(tidyverse)covid_data <- readr::read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")View(covid_data)
Screenshot of COVID-19 data in R
COVID-19 data in R

Now that we’ve got our data in a dataframe, we can do things like filtering, and making charts (using ggplot):

covid_deaths <- filter(covid_data, Variable == 'Number of COVID-19 confirmed deaths registered to date') %>%
filter(FeatureCode == "S92000003") %>%
mutate(Value = as.integer(Value))
covid_death_chart <- ggplot(data = covid_deaths, aes(x = DateCode, y = Value)) +
geom_line()
View(covid_death_chart)

Here we are filtering the dataset down to only show confirmed deaths to date, in Scotland (S92000003), and ensuring the Value field is numeric. This gives us this basic line chart:

A basic line chart in R
A basic line chart in R

So because of the way that this chart pulls the data from the URL, every time we run the script, it will pull the latest data from statistics.gov.scot, without us needing to actually go and get the data.

In order to make this truly dynamic, though, we can use a package in R called Shiny, which makes it easy to create web-based tools and visualisations using the features of R.

library(shiny)
library(tidyverse)
library(plotly)
ui <- fluidPage(titlePanel("Number of COVID-19 confirmed deaths registered to date"),
plotlyOutput("covidPlot")

)
server <- function(input, output) {covid_data <- readr::read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")

covid_deaths <- filter(covid_data, Variable == 'Number of COVID-19 confirmed deaths registered to date')

covid_deaths <- filter(covid_data, Variable == 'Number of COVID-19 confirmed deaths registered to date') %>%
filter(FeatureCode == "S92000003") %>%
mutate(Value = as.integer(Value))

output$covidPlot <- renderPlotly({covid_death_chart })
}
# Run the application
shinyApp(ui = ui, server = server)

In making this chart, I’ve used plotly, another library, which adds interactivity to ggplot charts, such as popups on the line.

Screenshot of a basic line chart on the web using R Shiny
Basic line chart on the web using R Shiny

We can then host this, either on shinyapps.io (this example — https://propolis.shinyapps.io/scot_covid/), or you can setup a Shiny server on Amazon (for example), and people can view your visualisation / tool there — and because this chart points back to the source data, this will update when either new days’ data is added, or corrections are applied to the data.

2. Python

For python, I’m going to use PyCharm, which is another IDE, similar to RStudio, but for Python. The principle here is also very similar to R, in that we are going to use a couple of packages — one to get and transform the data, and one to draw the chart. These two packages are pandas (for data transformation), and matplotlib (for drawing charts).

You will need to install these packages in your PyCharm environment, and then create a new python script. In this python script, we’ll first import the packages that we need:

import pandas as pd
import matplotlib.pyplot as plt

This bit of code imports the whole of the pandas package, and the module of the matplotlib package that relates to drawing charts.

Next, we’re going to grab the data from statistics.gov.scot using the read_csv function from pandas, and put the data into a dataframe:

df = pd.read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")

Then we want to convert the value field to a numeric type, and the DateCode to a date type. We also want to filter the dataframe so that it just one dataset, for the whole of Scotland only:

# convert Value to numeric field and DateCode to a datedf['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df['DateCode'] = pd.to_datetime(df['DateCode'])
df_icu_conf = df.query('Variable == "COVID-19 patients in ICU - Confirmed"')
df_icu_conf_scot = df_icu_conf.query('FeatureCode == "S92000003"')
df['DateCode'] = pd.to_datetime(df['DateCode'])

The object df_icu_conf_scot now contains all observations that relate to the confirmed number of COVID-19 patients in intensive care, in Scotland. The final thing here is to draw our chart:

ax = plt.gca()
df_icu_conf_scot.plot(kind="line", x="DateCode", y="Value", ax=ax)
plt.xticks(rotation=90)
plt.title("COVID-19 patients in ICU - Confirmed")
ax.legend().set_visible(False)
ax.set_ylim(ymin=0)
plt.show()

This uses the pyplot module to create the chart axes, then create a line chart with the date along the x axis, and the value up the y axis. Then we rotate the x axis labels, add a title, hide the legend, and set the y-axis to start at 0. plt.show() opens the chart for display.

Running this at the time of writing gives us this chart:

Line chart drawn using Python / matplotlib
Line chart drawn using Python / matplotlib

Each time I run this python script, the chart will be updated with the latest data on statistics.gov.scot.

This could be useful for building daily dashboards, with some quite complex filtering and calculations.

Another thing that it’s fairly straightforward to do is generate simple dynamic text. Here, we’re going to get the data, again using pandas.

import pandas as pd

df = pd.read_csv("https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")
df['Value'] = pd.to_numeric(df['Value'], errors='coerce')
df['DateCode'] = pd.to_datetime(df['DateCode'])

# filter to only one dataset and Scotland
df_icu_conf = df.query('Variable == "COVID-19 patients in ICU - Confirmed"')
df_icu_conf_scot = df_icu_conf.query('FeatureCode == "S92000003"')df_icu_conf_latest = df_icu_conf_scot[df_icu_conf_scot['DateCode'] == df_icu_conf_scot['DateCode'].max()]

current_count = df_icu_conf_latest.iloc[0]['Value']
latest_date = df_icu_conf_latest.iloc[0]['DateCode'].strftime("%A %d %B %Y")
print("As at " + str(latest_date) + ", there were " + str(current_count) + " confirmed cases of COVID-19 in intensive care units in Scotland")

We’ll filter the dataframe to only show the variable that we’re interested in, and then filter again to return a one row dataframe, containing only the latest day’s data.

Using iloc, we can get the value from the Value field and pass that into a variable, and the formatted date of the observation which we pass into another variable. We can then use in a basic string concatenation to give (at the time of writing):

As at Thursday 30 April 2020, there were 101 confirmed cases of COVID-19 in intensive care units in Scotland

These python scripts can be run locally, and it would be possible to build up a dashboard that outputs to pdf and is emailed to a distribution list. To make this even more dynamic, and requiring less human interaction to run and distribute, it is possible to use these scripts on the web, using something like Flask to create dynamic dashboards that will automatically refresh on demand.

3. Google Sheets

This is a very short and straightforward example of using Google Sheets to get the data.

Open a new sheets doc, and in cell A1, paste the following

=importData(“https://statistics.gov.scot/downloads/cube-table?uri=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Fcoronavirus-covid-19-management-information")

When you press enter, the sheet will refresh, and will automatically populate with the latest data from statistics.gov.scot, and from here you can make various charts and views of the data.

Screenshot of COVID-19 data in Google Sheets
COVID-19 data in Google Sheets

There is a setting in google sheets to change the recalculation period within the spreadsheet. The default is to recalculate on change, but we can set it to every hour, so our spreadsheet will go and get a fresh copy of the data each hour, and any charts will update.

Screenshot of Google Sheets setting showing recalculation options
Google Sheets setting showing recalculation options

4. Tableau

Tableau is a great tool for analysing and visualising data, and it has a really strong community of users. Tableau supports a range of data inputs — local csv files, Excel docs, pdf etc. It also supports importing data from the web, through the use of Web Data Connectors (WDC) — a combination of a javascript and a html file that tells Tableau how to interpret the results of an API call.

I have already created a WDC for the statistics.gov.scot COVID-19 data. I’ve put this on github, so you can have a look at the code here. If you know javascript / html, it should be easy enough to clone and modify for other datasets on statistics.gov.scot (or indeed any of the other government departments running PublishMyData).

For this example, I’m using Tableau Public (because it’s free). On the landing page, in the left sidebar there is the option to connect using a Web Data Connector.

Screenshot of Tableau Public start screen
Tableau Public start screen

Select this, and you will be presented with a pop-up window asking for the url of the web data connector. My one for the COVID data is available on Github at this url:

https://northernjamie.github.io/scot-covid-wdc/scot_covid_wdc.html

Screenshot of Tableau web data connector selector
Tableau web data connector selector

Put this into the url box and press enter, and you’ll get this interstitial page.

Screenshot of Tableau web data connector interstitial page
Tableau web data connector interstitial page

Hit the ‘Get the data’ button, and Tableau will fetch the data definition for the particular dataset that we’re connecting to (this is defined in the javascript file https://github.com/northernjamie/scot-covid-wdc/blob/master/scot_covid_wdc.js). This will take a few seconds, but once done, you’ll see this screen in Tableau:

Screenshot of tableau connection information screen
Connection information screen

From here, we can click ‘Update Now’ to get the actual data from statistics.gov.scot. This is what we can now see:

Screenshot of preview of COVID-19 data in Tableau
Preview of COVID-19 data in Tableau

At the time of writing — there are 928 rows of data, which seems correct.

If you press ‘Sheet 1’ (which is conveniently highlighted in orange in the screenshot above), then you can start to make use of Tableau’s visualising tools:

Screenshot of Tableau visualisation builder
Tableau visualisation builder

By adding in the ‘Date Code’ field to our columns section, and the ‘Value’ field to the Rows section, we can draw a chart (click Show Me in the top right corner if you can’t see chart settings):

Screenshot of Tableau line chart
Tableau line chart

The problem with this is that it summing together all the different variables in the dataset, which is meaningless. So we want to add a filter on the ‘Variable’ field . To do this, click the drop down arrow next to ‘Variable’ in the left panel, and choose Show Filter. This will now allow you to choose which variable to show in the chart.

Screenshot of Tableau line chart with filters
Tableau line chart with filters

We also want to add a filter so that not all geographical areas are shown at the same time. To limit out visualisation to Scotland, we add the FeatureCode field to the filters box, and select only S92000003.

Screenshot of filtering to only include whole-Scotland data
Filtering to only include whole-Scotland data

We can now publish this to Tableau’s public servers (you’ll need an account to do this, but it is free). You can see my version of this here:

https://public.tableau.com/profile/jamie.whyte#!/vizhome/COVID-19-statistics_gov_scot/Sheet1?publish=yes

Screenshot of our COVID-19 ‘Visualisation’ on Tableau Public Server
Our COVID-19 ‘Visualisation’ on Tableau Public Server

Unfortunately — Tableau Public doesn’t seem to allow automated updating of data. If I wanted to refresh the data, I would need to republish my visualisation. Tableau server does appear to allow scheduling, but this is expensive. There may also be a way of combining Google Sheets with Tableau to allow a nearly live connection between the data and the visualisation, but I haven’t explored this.

End

Hopefully these examples give an indication that with not much effort, and free software, you can create interesting and useful views of data that mostly take care of themselves once they’ve been set up.

This is dependent on data publishers publishing data in a consistent way, and is something that comes up time and time again — when data publishers put effort into publishing their data the right way, then it means that the users of that data don’t have to put as much effort into preparing and cleaning the data, which in turn means there is a net saving in terms of effort. And while this principle works best for those datasets that are updated on a regular, frequent basis, the principle applies to all datasets. Everybody wins.

If you do manage to use some of these techniques to make something, I’d love to see it — the more of this stuff that we share and make, the more other people are likely to have a go themselves. Feel free to share with me on twitter — @northernjamie!

[1] Disclaimer — I work with Swirrl

[2] Current publicly available implementations of PublishMyData are:

--

--

Jamie Whyte

family; (open/linked)data; civic tech; northern; Propolis; @opendatamcr; #rstats; @odcamp; @InspireLearnMAT & @TTSAtrafford; @swirrl; @YouthAreUnited