Visualising the Spread of Covid-19
Obtaining insights by visualising the pandemic spread
There is no doubt that Covid-19 has caused tremendous panic among the public, with social media constantly bombarding us with updates on the situation. Controversial as it may sound, one of the good outcomes was the rapid and ongoing cooperation between countries and organisations to share as much data as possible early on, allowing a timely response by countries not yet affected by the virus.
This post was therefore inspired by the availability and ease of obtaining the necessary data. It is an attempt to understand, among other things, how quickly the virus has spread. In this post, I will explain how I got the data and how it was pre-processed, and I will also show several visualisations one can produce with it.
Follow the link to this repository where you will find the starting code.
To pre-process the data, I followed the instructions from this video, but to make things easier I’ll explain the process here. I first read the three dataset files in the data directory (which can be downloaded from here; we only use the time-series data) and load them into a dictionary of pandas DataFrames. I then pre-process each one as follows.
First, I fill all the NaN values in Province/State with the country name from Country/Region. Then I unpivot the data so that every day becomes a row instead of a column. Finally, I convert the Date column to a date-time type.
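The three pre-processing steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the repository code: the tiny `wide` frame, its values, the `preprocess` helper and the `%m/%d/%y` date format are all assumptions made for the example.

```python
import pandas as pd

# Tiny made-up frame in the wide layout described above: one column per day.
# The values are illustrative only, not real case counts.
wide = pd.DataFrame({
    "Province/State": ["Hubei", None],
    "Country/Region": ["China", "Cyprus"],
    "3/9/20": [10, 0],
    "3/10/20": [15, 2],
})


def preprocess(df: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Hypothetical helper: fill provinces, unpivot the days, parse the dates."""
    df = df.copy()
    # Use the country name wherever no province/state is given.
    df["Province/State"] = df["Province/State"].fillna(df["Country/Region"])
    # Unpivot: every day column becomes a (Date, value) row.
    id_cols = ["Province/State", "Country/Region"]
    long = df.melt(id_vars=id_cols, var_name="Date", value_name=value_name)
    # Convert the date strings into datetime values (format is an assumption).
    long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")
    return long


confirmed = preprocess(wide, "confirmed")
print(confirmed)
```

The same helper could be applied to each of the three files, passing "confirmed", "deaths" or "recovered" as the value name.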
Let us now move on to combining the three files into a single DataFrame.
The code first merges the data from all three files in the data directory by performing an outer merge on the dataframes’ common columns, namely “Province/State”, “Country/Region” and “Date”. We then create a copy of the whole dataset. In the copy, we add a new column named “Date-1”, which holds the date of the following day for each row, and rename the columns “confirmed”, “deaths” and “recovered” by appending “-1” to their names.
In simple terms, a left merge between the original and the newly created dataframe matches Date with Date-1, bringing the previous day’s values along. In effect, every row now holds each case category’s value for the current day alongside its value for the previous day in a separate column. I then subtract the previous day’s value from the current one for every case category to get the new cases per day. Finally, since the first recorded day has no previous day to compare against, the new columns simply start with each category’s value on that day.
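A rough sketch of the merge-and-offset idea described above is given here. The numbers are made up, and the `make_frame` helper and the `new_<category>` column names are hypothetical choices for this example rather than the names used in the repository.

```python
from functools import reduce

import pandas as pd

keys = ["Province/State", "Country/Region", "Date"]
categories = ["confirmed", "deaths", "recovered"]
dates = pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"])


def make_frame(name, values):
    """Hypothetical helper building one already pre-processed frame."""
    return pd.DataFrame({
        "Province/State": "Cyprus",
        "Country/Region": "Cyprus",
        "Date": dates,
        name: values,  # made-up cumulative counts
    })


frames = {
    "confirmed": make_frame("confirmed", [1, 3, 6]),
    "deaths": make_frame("deaths", [0, 0, 1]),
    "recovered": make_frame("recovered", [0, 1, 1]),
}

# Outer merge of the three frames on their common columns.
full = reduce(lambda left, right: left.merge(right, how="outer", on=keys),
              frames.values())

# Copy with the case columns renamed to "<name>-1" and a "Date-1" column
# holding the following day's date.
previous = full.rename(columns={c: f"{c}-1" for c in categories})
previous["Date-1"] = previous["Date"] + pd.Timedelta(days=1)
previous = previous.drop(columns="Date")

# Left merge matching Date with Date-1 brings the previous day's values along.
merged = full.merge(
    previous,
    how="left",
    left_on=keys,
    right_on=["Province/State", "Country/Region", "Date-1"],
)

for c in categories:
    daily = merged[c] - merged[f"{c}-1"]
    # Rows without a previous day (the first recorded day) fall back to the
    # cumulative value itself.
    merged[f"new_{c}"] = daily.fillna(merged[c])

print(merged[["Date"] + [f"new_{c}" for c in categories]])
```

A grouped `diff()` per province would likely give the same daily values with less code; the merge version is shown because it mirrors the description above.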
Visualisation
Let us first get an overview of how the virus spread around the world. Initially I wanted to visualise how the total cumulative cases and the total new cases changed.
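One way such an overview could be produced is sketched below, assuming a long-format dataframe like the one built earlier. The data is made up, `new_confirmed` is a hypothetical name for the new-cases-per-day column, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up long-format data (not real counts).
data = pd.DataFrame({
    "Country/Region": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "confirmed": [1, 3, 6, 0, 2, 5],
    "new_confirmed": [1, 2, 3, 0, 2, 3],
})

# Sum over all regions to get one worldwide value per day.
world = data.groupby("Date")[["confirmed", "new_confirmed"]].sum()

fig, (ax_total, ax_new) = plt.subplots(2, 1, sharex=True, figsize=(8, 6))
ax_total.plot(world.index, world["confirmed"])
ax_total.set_title("Total cumulative confirmed cases")
ax_new.plot(world.index, world["new_confirmed"])
ax_new.set_title("New confirmed cases per day")
fig.autofmt_xdate()
fig.savefig("world_overview.png")
```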
In the first plot, things look rather bad given the growth in the number of cases. However, since each point includes all the cases that came before it, we also need to look at how the number of newly identified cases changes per day, which is the second plot. Things don’t look as bad there, but we can clearly identify two spikes, close to the 11th of February and the 12th of March respectively. According to CNN, many cases were reported on the 13th of February, which might suggest that the virus had also spread to other countries.
However, this resource shows otherwise, with most of the newly found cases located in East Asia. The second spike, on the other hand, seems to fall around the time when several countries identified many cases, which was also when strict measures were being taken.
In terms of total cases, I also wanted to plot a pie chart to see the percentage each case category accounts for. This is shown below.
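A pie chart of this kind could be sketched as follows. The split into active, deaths and recovered (with active taken as confirmed minus deaths minus recovered) is an assumption about which categories are meant, and the numbers are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up cumulative counts for two regions on two days (not real data).
data = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "B"],
    "Date": pd.to_datetime(["2020-03-15", "2020-03-16"] * 2),
    "confirmed": [80, 100, 40, 100],
    "deaths": [2, 4, 1, 6],
    "recovered": [30, 50, 10, 40],
})

# The counts are cumulative, so the latest day already holds the totals.
latest = data[data["Date"] == data["Date"].max()]
totals = latest[["confirmed", "deaths", "recovered"]].sum()

# Assumption: "active" cases are confirmed cases that neither died nor recovered.
shares = pd.Series({
    "active": totals["confirmed"] - totals["deaths"] - totals["recovered"],
    "deaths": totals["deaths"],
    "recovered": totals["recovered"],
})

fig, ax = plt.subplots()
ax.pie(shares.values, labels=list(shares.index), autopct="%1.1f%%")
ax.set_title("Share of case categories on the latest day")
fig.savefig("case_shares.png")
```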
Finally, since China managed to contain the spread of the virus, I wanted to see what the curve looks like in that case.
Given that the virus was somewhat contained, the curve has a logistic (sigmoid) shape, flattening out as few new cases are identified.
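To make the link between the sigmoid shape and the drop in new cases concrete, here is a small sketch of a logistic curve and its day-to-day differences. The parameters (`capacity`, `rate`, `midpoint`) are arbitrary illustrative values, not fitted to any real data.

```python
import math


def logistic(t, capacity, rate, midpoint):
    """Logistic curve: capacity / (1 + exp(-rate * (t - midpoint)))."""
    return capacity / (1.0 + math.exp(-rate * (t - midpoint)))


# Arbitrary illustrative parameters (assumptions, not fitted values).
days = range(0, 61)
cumulative = [logistic(t, capacity=80000, rate=0.25, midpoint=30) for t in days]

# New cases per day are the differences between consecutive cumulative values.
new_per_day = [later - earlier
               for earlier, later in zip(cumulative, cumulative[1:])]

peak_day = max(range(len(new_per_day)), key=new_per_day.__getitem__)
print("day with most new cases:", peak_day)
print("new cases on the last day:", round(new_per_day[-1], 1))
```

New cases peak around the midpoint of the curve and then shrink towards zero, which is what makes the cumulative curve level off.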
Finally, I wanted to see how my home country, Cyprus, and my country of residence, the UK, are doing. Below I plotted the curves of cumulative cases per day.
Fortunately, not many cases have been identified; however, this does not mean there won’t be any. It could just be that Cyprus is still far behind in the spread of the pandemic, but given the government measures the situation will hopefully not worsen.
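Per-country curves like these could be drawn along the following lines. The counts and the output file name are made up, and the country labels "Cyprus" and "United Kingdom" are assumed to match the values in the Country/Region column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

dates = pd.to_datetime(["2020-03-14", "2020-03-15", "2020-03-16"])

# Made-up cumulative counts; a country may span several Province/State rows.
data = pd.DataFrame({
    "Province/State": ["Cyprus"] * 3 + ["United Kingdom"] * 3 + ["Gibraltar"] * 3,
    "Country/Region": ["Cyprus"] * 3 + ["United Kingdom"] * 6,
    "Date": list(dates) * 3,
    "confirmed": [1, 2, 3, 10, 20, 30, 0, 1, 1],
})

countries = ["Cyprus", "United Kingdom"]
selected = data[data["Country/Region"].isin(countries)]

# One column per country, one row per day, summing over provinces.
by_country = (
    selected.groupby(["Date", "Country/Region"])["confirmed"]
    .sum()
    .unstack("Country/Region")
)

fig, ax = plt.subplots(figsize=(8, 4))
by_country.plot(ax=ax)
ax.set_ylabel("Cumulative confirmed cases")
fig.savefig("cyprus_uk.png")
```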
Conclusion
The data plotted above runs until the 16th of March. I’ll try to keep the data updated every so often, hopefully not for long. It would be nice to have some other visualisations, such as an animated map of the world showing where and when the virus spread, or a line plot showing the change in the number of cases over time for every country.
These are just some basic graphs to get an idea of how a virus can spread, so if you have any other visualisations in mind, just add them to this repository.
Stay safe and contribute freely to the repository!