From static to animated time series: the tidyverse way

Giulia Ruggeri
EPFL Extension School
Apr 17, 2020

Reproducing and animating COVID-19 time series using R

In recent months, time series plots of confirmed COVID-19 cases have appeared all over the web, and you might feel overloaded by the number of data visualisations shared each day on every possible platform. Still, it is difficult for anyone passionate about data visualisation not to want to plot the data themselves. For this reason, in this article we will go through the steps necessary to create a time series plot with R, inspired by the many examples that can be found on the web, from the Financial Times, the New York Times, the Economist and so on. In this chart, the number of cases since the 150th case is plotted for each country. As a second step we will animate the chart, still in R, using the package {gganimate}.

We will import the data from the data repository of Johns Hopkins University, which collects data from local governments and the World Health Organization.

As a very first step in our R script or R notebook, we load all the libraries that we use in this little exercise. We start with the omnipresent {tidyverse}, which loads {readr}, {tidyr}, {dplyr} and {ggplot2}, the packages we will use to import, manipulate and plot the data. {lubridate} is our package of choice for working with date variables, and {janitor} is used to clean the column names. {rcartocolor} contains some very nice color scales, useful for cartography and more, and finally, we use {gganimate} to animate the chart.
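A minimal sketch of the corresponding setup chunk, assuming all of these packages are already installed (for example from CRAN):

```r
# Core tidyverse: attaches readr, tidyr, dplyr and ggplot2 (and forcats, used later)
library(tidyverse)
library(lubridate)    # date parsing, e.g. mdy()
library(janitor)      # clean_names()
library(rcartocolor)  # CARTO colour scales for ggplot2
library(gganimate)    # animation layer on top of ggplot2
```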

We can now import the data directly from the dataset shared on GitHub, which is continuously updated by Johns Hopkins University.
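The import step might look like the sketch below. The URL is an assumption about where the confirmed-cases file lives and may be out of date, so the example reads a tiny made-up sample with the same wide layout instead, which also lets it run offline.

```r
library(readr)

# Assumed location of the confirmed-cases file; check the repository, it may have moved.
jhu_url <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/",
  "csse_covid_19_data/csse_covid_19_time_series/",
  "time_series_covid19_confirmed_global.csv"
)

# Made-up two-row sample in the same layout: one column per date.
sample_csv <- paste(
  "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20",
  ",Italy,41.87,12.56,0,0",
  "Hubei,China,30.97,112.27,444,444",
  sep = "\n"
)

# I() marks the string as literal data; cols() lets readr guess types quietly.
cases_raw <- read_csv(I(sample_csv), col_types = cols())

# With a network connection, the real file would be read with:
# cases_raw <- read_csv(jhu_url, col_types = cols())
```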

In order to recreate the chart, inspired by John Burn-Murdoch’s work, we need to tidy the data first, and then do some basic manipulation of it. Since each date is stored in its own column, the first thing to do is to transform the data from wide format to long format: for this we use the pivot_longer() function from the {tidyr} package.

We will also try to be smart here, and select for our pivoting only the columns whose names look like a date in the format “digits/digits/digits”. The new date column holds text, so it needs to be converted into a proper date column, and to do this we can use the mdy() function from the {lubridate} package, where m stands for month, d for day and y for year.
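As a sketch of this step, the example below pivots a small made-up table in the same wide layout. The object names cases_wide and cases_long are hypothetical, and cleaning the column names after the pivot (so that the date-like names can still be matched) is an assumption about the order of operations.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(lubridate)
library(janitor)

# Made-up sample in the wide layout: one column per date.
cases_wide <- tibble(
  `Province/State` = c(NA, "Hubei"),
  `Country/Region` = c("Italy", "China"),
  Lat = c(41.87, 30.97),
  Long = c(12.56, 112.27),
  `1/22/20` = c(0, 444),
  `1/23/20` = c(0, 444)
)

cases_long <- cases_wide %>%
  # Pivot only the columns named like "digits/digits/digits".
  pivot_longer(
    cols = matches("^\\d+/\\d+/\\d+$"),
    names_to = "date",
    values_to = "cases"
  ) %>%
  # "Country/Region" becomes country_region, and so on.
  clean_names() %>%
  # Convert the month/day/year text into a Date.
  mutate(date = mdy(date))
```

Each country and date pair now has its own row, with the count in a single cases column.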

In the rest of the script we will keep only the values of country_region that we want to plot. We will also filter out the China data because it is a bit off scale compared to the other countries, but this may no longer be valid in a few weeks' time, given the unfortunately fast pace at which the pandemic has spread.

In the country_region column, some countries are reported by subregion, but we want only one value per day per country. We will therefore use group_by() and summarise() to make sure that this is the case, and that the value is the sum of cases across all the regions of each country.
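A minimal sketch of the filtering and aggregation, using made-up numbers (they are not real case counts) and assuming the long table has the columns country_region, date and cases:

```r
library(dplyr)
library(tibble)

# Made-up long-format sample: Canada is split into two subregions.
cases_long <- tibble(
  province_state = c("Ontario", "Quebec", NA, "Hubei"),
  country_region = c("Canada", "Canada", "Italy", "China"),
  date = as.Date("2020-03-20"),
  cases = c(300, 200, 47000, 67800)
)

cases_data <- cases_long %>%
  # Drop China, which is off scale compared to the other countries.
  filter(country_region != "China") %>%
  # One row per country per day: sum over the subregions.
  group_by(country_region, date) %>%
  summarise(cases = sum(cases), .groups = "drop")
```

Here the two Canadian rows collapse into a single row with 500 cases.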

Now cases_data is beginning to look tidy (one row per country per date and just one column storing the cases). What a joy for the eyes!

If we just wanted to plot the raw time series, this would be enough. However, we want to go a bit further and plot the number of days since the 150th case in each country. We also want to keep only the countries that reached the 150th case more than 9 days ago, so as not to have an overcrowded chart. We will again work by group (one group per country) to calculate the days since the 150th case, and as good practice we will then ungroup() the data.
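One way to sketch this calculation is shown below, on synthetic data for two hypothetical countries. The exact filtering rules (keeping rows with at least 150 cases and counting days from the first such date) are an assumption about how "days since the 150th case" is defined.

```r
library(dplyr)
library(tibble)

# Synthetic data: both countries grow 30% a day from different starting points.
dates <- seq(as.Date("2020-03-01"), by = "day", length.out = 15)
cases_data <- bind_rows(
  tibble(country_region = "Country A", date = dates, cases = 100 * 1.3^(0:14)),
  tibble(country_region = "Country B", date = dates, cases = 10 * 1.3^(0:14))
)

cases_since_150 <- cases_data %>%
  # Keep only the days on which a country had at least 150 cases.
  filter(cases >= 150) %>%
  group_by(country_region) %>%
  # Day 0 is the first day with 150 or more cases in that country.
  mutate(days_since_150 = as.numeric(date - min(date))) %>%
  # Keep countries that passed the threshold more than 9 days ago.
  filter(max(days_since_150) > 9) %>%
  ungroup()
```

Country B crosses 150 cases only four days before the end of the series, so it is dropped, while Country A is kept.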

Many visualisations that can be found on the web or in newspapers add to the country trend lines a dashed line that corresponds to a 33% daily rise. We will add it to our original tibble (which is just another way to say table in tidyverse language). We will also create a new variable, line_type, that we will use in our plot to draw the 33% daily rise as a dashed line.
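A 33% daily rise starting from 150 cases follows 150 × 1.33^d on day d. The sketch below builds that reference series and appends it to a small made-up tibble; the names reference_line and cases_with_ref are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up country series, already expressed in days since the 150th case.
cases_since_150 <- tibble(
  country_region = "Country A",
  days_since_150 = 0:5,
  cases = c(150, 180, 230, 300, 380, 470)
)

# Reference series: 150 cases on day 0, growing by 33% each day.
reference_line <- tibble(
  country_region = "33% daily rise",
  days_since_150 = 0:max(cases_since_150$days_since_150),
  cases = 150 * 1.33^days_since_150,
  line_type = "dashed"
)

cases_with_ref <- cases_since_150 %>%
  mutate(line_type = "solid") %>%
  bind_rows(reference_line)
```

Storing literal linetype names in line_type means the column can later be mapped to the linetype aesthetic as is.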

Now we can make our first attempt at plotting, using the {ggplot2} library. To do this, we will map the color aesthetic to the country_region column. This will apply a different color to each country (in the chosen color scale). Beware that most categorical color scales do not have more than 12 colors, so the last color of the scale will be repeated as many times as necessary to cover all the extra countries. We will also transform the y-axis to a logarithmic scale, using scale_y_log10(). The rest of the code sets a color scale, a theme and all the annotations.
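A first version of the plot could be sketched as follows, on synthetic data. The "Bold" palette, the theme and the labels are illustrative choices, not necessarily the ones used for the original chart.

```r
library(ggplot2)
library(dplyr)
library(tibble)
library(rcartocolor)

# Synthetic series plus the 33% reference line.
days <- 0:10
plot_data <- bind_rows(
  tibble(country_region = "Country A", days_since_150 = days,
         cases = 150 * 1.25^days, line_type = "solid"),
  tibble(country_region = "Country B", days_since_150 = days,
         cases = 150 * 1.15^days, line_type = "solid"),
  tibble(country_region = "33% daily rise", days_since_150 = days,
         cases = 150 * 1.33^days, line_type = "dashed")
)

p <- ggplot(
  plot_data,
  aes(days_since_150, cases, color = country_region, linetype = line_type)
) +
  geom_line() +
  scale_y_log10() +                       # logarithmic y-axis
  scale_linetype_identity() +             # use "solid"/"dashed" literally
  scale_color_carto_d(palette = "Bold") + # categorical CARTO palette
  theme_minimal() +
  labs(
    x = "Days since the 150th case",
    y = "Confirmed cases (log scale)",
    color = NULL
  )
```

Printing p draws the chart. On a log scale, constant daily growth appears as a straight line, which is why the reference line is easy to compare against.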

This was a first attempt, but we can already see one problem with this visualisation: the colours. The number of countries plotted is too large for many categorical colour scales, and if we plotted each line in a different colour we would not really be able to tell the countries apart. If we still want to use different colours for the countries that we want to highlight, we need to give an order to the countries, based on the data (i.e. based on the maximum value of days_since_150 for each country). To do so we can use fct_infreq() from the {forcats} package.

fct_infreq() orders the levels of a factor by their frequency. The countries will therefore be ordered by the number of observations (rows) each one has: countries that have more rows (more days since the 150th case) will come first. We can now plot the data again, using the same code as before.
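A small illustration of this ordering on a made-up vector of country names:

```r
library(forcats)

# Italy appears three times, Spain twice, France once.
countries <- c("France", "Spain", "Italy", "Spain", "Italy", "Italy")

ordered_countries <- fct_infreq(countries)
levels(ordered_countries)
# → "Italy" "Spain" "France"
```

Inside a pipeline the same call would typically sit in a mutate(), for example mutate(country_region = fct_infreq(country_region)).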

Annotating the plot

If we want to annotate the plot with the name of each country, we can use geom_text() in our ggplot call. We can place the annotation next to the last point of each line by using a different tibble for the annotation. In annotations we will keep only the observation corresponding to the maximum value of days_since_150 for each country. If we want to annotate only the first 12 countries (12 because the colour scale we have chosen has 12 colours), we can create a new variable named label_country that is empty for all countries apart from the first 12 in the order set earlier with fct_infreq().
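The annotation tibble could be built along these lines. The data are synthetic, and the hypothetical n_labelled is set to 2 rather than 12 so that the effect is visible with only three countries.

```r
library(dplyr)
library(tibble)
library(forcats)

# Synthetic data: Country A has the most rows, Country C the fewest.
cases_ordered <- tibble(
  country_region = rep(c("Country A", "Country B", "Country C"), times = c(5, 4, 3)),
  days_since_150 = c(0:4, 0:3, 0:2),
  cases = 150 * 1.2^days_since_150
) %>%
  mutate(country_region = fct_infreq(country_region))

n_labelled <- 2  # the article labels the first 12 countries

annotations <- cases_ordered %>%
  group_by(country_region) %>%
  # Keep the last point of each line.
  filter(days_since_150 == max(days_since_150)) %>%
  ungroup() %>%
  # Label only the first n_labelled factor levels, leave the rest empty.
  mutate(label_country = if_else(
    as.integer(country_region) <= n_labelled,
    as.character(country_region),
    ""
  ))
```

as.integer() on a factor returns the position of each level, so the comparison picks out the countries that come first in the fct_infreq() order.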

Now we are ready to annotate the plot. To increase the visibility of the labels we can use geom_shadowtext() from the {shadowtext} package, which adds a subtle 3D look that makes the labels stand out a little. As we all know, the devil is in the details!
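A sketch of the labelled plot, assuming the {shadowtext} package is installed. The data are synthetic and the white halo and text alignment are illustrative choices.

```r
library(ggplot2)
library(dplyr)
library(tibble)
library(shadowtext)

# Synthetic series for two hypothetical countries.
days <- 0:10
cases <- bind_rows(
  tibble(country_region = "Country A", days_since_150 = days, cases = 150 * 1.25^days),
  tibble(country_region = "Country B", days_since_150 = days, cases = 150 * 1.15^days)
)

# One row per country: the last point of each line.
annotations <- cases %>%
  group_by(country_region) %>%
  filter(days_since_150 == max(days_since_150)) %>%
  ungroup()

p <- ggplot(cases, aes(days_since_150, cases, colour = country_region)) +
  geom_line() +
  # Labels drawn with a white outline so they stay readable over the lines.
  geom_shadowtext(
    data = annotations,
    aes(label = country_region),
    hjust = 0,
    bg.colour = "white",
    show.legend = FALSE
  ) +
  scale_y_log10() +
  theme_minimal()
```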

Now, how do we plot the coloured lines on top of the grey lines? To do this, we would need to reorder the country_region factor levels again, but this would affect the way the lines are coloured. How can we cope with this?

The way to do it is to manually set the color of the countries, and then order the countries in the reverse order from the one we used before. A little hacky trick, indeed. The hex values corresponding to the colors were taken directly from the carto.com webpage.
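The trick can be sketched as below. The data are synthetic and the hex values are placeholders, not necessarily the ones used in the original chart. The sketch relies on ggplot2 drawing groups in factor-level order, so levels that come last end up on top.

```r
library(ggplot2)
library(dplyr)
library(tibble)
library(forcats)

# Synthetic data: Country A has the most rows, Country C the fewest.
cases <- tibble(
  country_region = rep(c("Country A", "Country B", "Country C"), times = c(5, 4, 3)),
  days_since_150 = c(0:4, 0:3, 0:2),
  cases = 150 * 1.2^days_since_150
)

plot_data <- cases %>%
  # Reverse the frequency order so highlighted countries are drawn last.
  mutate(country_region = fct_rev(fct_infreq(country_region)))

# Placeholder colours, fixed by name so reordering the levels cannot change them.
country_colours <- c(
  "Country A" = "#7F3C8D",
  "Country B" = "#11A579",
  "Country C" = "grey70"
)

p <- ggplot(
  plot_data,
  aes(days_since_150, cases, colour = country_region, group = country_region)
) +
  geom_line() +
  scale_colour_manual(values = country_colours, guide = "none") +
  scale_y_log10() +
  theme_minimal()
```

Because the colours are matched by name, the level order now only controls which lines are drawn on top.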

Animating the plot

How do we animate the plot? Luckily, the developers of the {gganimate} package thought about us and made animation very easy to add to any chart created using {ggplot2}. We only need to add one function: transition_reveal(). This will reveal the lines along the x-axis, one day at a time. For the animation, we will no longer use the annotation dataset, as we want the annotation to appear at each point of the chart.

We can also save the animation to a .gif file, using the anim_save() function from the {gganimate} package.
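Putting the animation and the export together, a minimal sketch on synthetic data might look like this. The frame count, frame rate and output path are arbitrary choices, and rendering to .gif is assumed to go through the {gifski} package, so it is skipped when that package is missing.

```r
library(ggplot2)
library(gganimate)
library(dplyr)
library(tibble)

# Synthetic series for two hypothetical countries.
days <- 0:10
cases <- bind_rows(
  tibble(country_region = "Country A", days_since_150 = days, cases = 150 * 1.25^days),
  tibble(country_region = "Country B", days_since_150 = days, cases = 150 * 1.15^days)
)

anim <- ggplot(
  cases,
  aes(days_since_150, cases, colour = country_region, group = country_region)
) +
  geom_line() +
  # Labels use the full data, so they follow the tip of each line.
  geom_text(aes(label = country_region), hjust = 0, show.legend = FALSE) +
  scale_y_log10() +
  theme_minimal() +
  # Reveal the data progressively along the x-axis variable.
  transition_reveal(days_since_150)

# Render and save only if the gif renderer is available.
if (requireNamespace("gifski", quietly = TRUE)) {
  rendered <- animate(anim, nframes = 20, fps = 10, renderer = gifski_renderer())
  anim_save(file.path(tempdir(), "covid_cases.gif"), animation = rendered)
}
```

The file is written to a temporary directory here. In a real script you would pass your own file name to anim_save().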

We have now created an animation from a simple line chart in R. This type of workflow applies whenever we want to plot, annotate, and animate a line chart. Hopefully now, when you see a COVID-19 related chart, you will also be interested in how to replicate it and (because why not?) animate it.

Beware that these data are updated continuously and the input file format may change (file name or even column names). I suggest checking the GitHub page of Johns Hopkins University from time to time, to see both what data is available and whether the file names have changed.

Originally published at https://github.com.

Giulia Ruggeri is a Senior Data Science Educator at the EPFL Extension School, with a background in air pollution and public health.