Oh no! Not another coronavirus post! Yes, I know, we are bombarded by pandemic content these days. My apologies for creating even more. However, it is not my purpose to bore you with more of the same, or to confuse you with pointless details. Being a passionate information designer I decided to have a look at good and bad practices in COVID-19 related content from a data visualization point of view. I hope this will be a useful and inspiring overview.
This post is a work in progress. The first four chapters are finished; the remaining chapters will be added below as soon as they are released.
We are living in remarkable times. The novel coronavirus is causing an epidemic spreading with a velocity we have never experienced before. Busy long-distance air and rail traffic have made it impossible to contain the virus after its first outbreak in China. For the first time our modern world is confronted with a pandemic of this scale and magnitude, and our healthcare systems are being put to the test.
But in fighting these challenges, the world has never been as united as today. Research teams across the globe are working together to develop cures, social media are used extensively to keep everyone informed, and innovative companies are coming up with solutions to keep people at home and the virus at bay. Technology plays a crucial role in this fight.
As an information designer, I am specifically fascinated by the efforts of the data science and visualization communities. The newest developments in these fields are put to use to turn a complex and rapidly changing topic into easy-to-communicate visuals. Within only a matter of days, nearly everyone has become familiar with the ‘flatten the curve’ visuals, or the Washington Post’s animations on the impact of social distancing.
In this post, we will explore some of the marvelous ways people around the world are using data visualization in the fight against the novel coronavirus.
- Chapter 1: Finding reliable data
- Chapter 2: Visualizing exponential growth
- Chapter 3: Mapping the virus
- Chapter 4: We need to talk about flattening the curve
Chapter 1: Finding reliable data
As noted by Edward Tufte, excellent graphics consist of complex ideas communicated with clarity, precision, and efficiency. At the core of a good data visual, therefore, lies accurate data. So before we start diving into coronavirus graphs, we will first take a brief stop at trustworthy data sources.
Sources of reliable data
There are currently four important places where one can obtain reliable and relatively complete aggregate data about the coronavirus epidemic:
- World Health Organization
The World Health Organization publishes daily Situation reports detailing the number of confirmed cases and deaths per country. They also provide a Situation dashboard which is updated three times per day.
- Johns Hopkins University
Researchers at Johns Hopkins University also maintain a dashboard providing an overview of the current number of cases, deaths and recoveries on a per-country basis. The underlying data is made freely available through GitHub.
- European Centre for Disease Prevention and Control
The ECDC publishes daily statistics on the pandemic for the entire world (despite its name!). Data is published daily at 1 p.m. CET and is presented on a situation update page.
- Our World in Data
The team of Max Roser collects and combines all available data and information about the epidemic on a single page. This excellent summary provides interactive charts on many different topics ranging from the number of cases to symptoms, incubation period and fatality rate. Each chart comes with a downloadable data set.
Accuracy of data
Collecting and aggregating global data in a rapidly changing environment, such as during a pandemic, is obviously very tricky. None of the above datasets should therefore be considered an ‘absolute truth’, as minor errors are bound to happen. Such errors can be related to reporting difficulties or contradicting sources, or differences and shifts in methodology, but can also be due to minor errors such as typos.
As an example, let us compare the three datasets above for the total number of confirmed cases in Belgium (between March 1 and March 19) with the official numbers communicated by the Belgian government (which can be found here).
Immediately we can note some discrepancies. The Johns Hopkins University data follows the government data most closely, with an exception on March 12 where for some reason the number was not updated.
The two other datasets (WHO and Our World in Data) appear to lag behind by one day up until March 16, possibly because WHO Situation reports are published at fixed times which don’t line up with government reporting times. Also, these datasets miss the same update as the Johns Hopkins numbers (from 314 to 399 cases), they were not updated on March 17, and they appear to contain a typing error (1.085 cases on March 16, while the official government number was 1.058).
Finally, Our World in Data temporarily stopped updating beyond March 17 because WHO shifted their reporting window: up until Situation report 57 the observed 24-hour time window ended at 10 a.m. CET, since then it ends at midnight. This causes a small overlap making it difficult to accurately compare data and analyze trends.
- Update March 23: Note that Our World in Data stopped relying on WHO data as they found too many errors in the daily Situation reports. Instead, they switched to data provided by the ECDC.
In summary, Johns Hopkins University data most closely matches official government numbers (for Belgium).
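A comparison like this can be partly automated. The snippet below is a minimal sketch, not the method used for the chart: it estimates by how many days one series lags behind another and lists the days on which the two still disagree. The function names are hypothetical and the numbers are made-up toy values, not the Belgian figures.

```python
def count_matches(official, other, lag):
    """Count days on which `other` repeats the official value from `lag` days earlier."""
    return sum(1 for i in range(lag, len(other)) if other[i] == official[i - lag])


def estimate_lag(official, other, max_lag=3):
    """Return the lag (in days) under which the two series agree most often."""
    return max(range(max_lag + 1), key=lambda lag: count_matches(official, other, lag))


def mismatches(official, other, lag=0):
    """List (day, official value, other value) for every remaining disagreement."""
    return [
        (i, official[i - lag], other[i])
        for i in range(lag, len(other))
        if other[i] != official[i - lag]
    ]


# Toy cumulative case counts (illustrative only, NOT real data).
official = [2, 8, 13, 23, 50, 109]
lagging = [1, 2, 8, 13, 23, 50]      # same numbers, reported one day late
with_typo = [2, 8, 13, 32, 50, 109]  # digits swapped on day 3

print(estimate_lag(official, lagging))
print(mismatches(official, with_typo))
```

A check like this only flags where sources differ. Deciding which source is right still requires going back to the original government reports.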
Finding more data sources
If you are looking for alternative data sources, direct reports by governments, or data on specific regions or cities, I highly recommend the data section of the Coronavirus Tech Handbook, a crowdsourced document bringing together all the tools, datasets and visualizations on this topic.
The sheer amount of available data can make it a bit overwhelming, especially taking into account that new numbers are being announced almost constantly. When in doubt, I would advise sticking to the four most complete data sources listed above.
Chapter 2: Visualizing exponential growth
Let me warn you in advance: this will probably be the most theoretical and mathematical chapter of this entire blog post. We’ll have a short look at the underlying scientific principles of a pandemic and analyze how this translates to visualizing data. If that’s not really your thing, it’s totally okay to just look at the pictures and then skip to the next chapter! 😉
A mathematical approach to pandemics
A pandemic disease is a complex thing to model. We live in a world of nearly 8 billion people in over 230 countries, connected with each other through 100.000 daily plane flights and an equally mind-boggling number of train and bus rides. Nevertheless, many experts have attempted to model the spread of a disease in a closely-connected world.
For example, professor of biostatistics Kurt Barbé models the pandemic spread through a first-order differential equation, resulting in the number of active infections following a Gaussian curve, with its typical bell-shaped profile:
This is a simplified but not unrealistic model. For example, if we look at the number of active COVID-19 cases in China, where the peak has nearly passed, we can see that the Gaussian profile is a pretty good approximation (the sudden increase in the number of cases on February 13 is the result of a change in reporting methodology).
We can assume that if the number of active infections follows a Gaussian-shaped profile, the number of new infections and the number of deaths will also follow Gaussian profiles. If we plot cumulative data, such as the total number of confirmed cases or the total number of deaths, this will follow an S-shaped cumulative function profile — the integral of the Gaussian function. For example, the total number of deaths by COVID-19 in China looks like this:
Crucially, at the start of such peaks and S-curves, the shape will follow an approximately exponential profile, with the number of infections or the number of deaths doubling every few days at a constant rate. That’s the reason why we are seeing so many graphs appearing which show the number of infections on a logarithmic scale, such as the one below by John Burn-Murdoch for the Financial Times. On such a log scale, exponential growth appears as a straight line. The steeper the line, the faster the growth and the shorter the doubling time.
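To make the link between the slope on a log scale and the doubling time concrete, here is a minimal sketch in plain Python. It fits a straight line through the base-10 logarithm of the counts and converts the slope into a doubling time. The function names are my own, and the input is synthetic data constructed to double every three days, not real case numbers.

```python
import math


def log10_slope(counts):
    """Least-squares slope of log10(count) against the day index."""
    days = range(len(counts))
    logs = [math.log10(c) for c in counts]
    mean_day = sum(days) / len(counts)
    mean_log = sum(logs) / len(logs)
    numerator = sum((d - mean_day) * (y - mean_log) for d, y in zip(days, logs))
    denominator = sum((d - mean_day) ** 2 for d in days)
    return numerator / denominator


def doubling_time(counts):
    """Days needed to double, assuming the counts grow exponentially."""
    return math.log10(2) / log10_slope(counts)


# Synthetic series: starts at 100 and doubles every 3 days by construction.
synthetic = [100 * 2 ** (day / 3) for day in range(10)]
print(f"Estimated doubling time: {doubling_time(synthetic):.1f} days")
```

On real data the fit is only meaningful over the stretch where the line on the log plot is actually straight.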
Which brings us to the following point of discussion: is using a logarithmic scale a good idea?
Logarithmic scales: yes or no?
Using logarithmic scales in a data visualization is sometimes frowned upon, as it has some obvious drawbacks:
- Requires additional explanation when communicating towards a general public, which might not be familiar with this type of plot.
- Not a good way at all to compare values with each other.
However, there are specific cases where the use of logarithmic scales might be justified: when the underlying mechanism behind the data is multiplicative in nature, leading to (more or less) exponential growth. This is exactly the case in the early outbreak stage of a contagious disease (when only a minor fraction of the population is infected).
For example, a patient infected with smallpox will infect on average 5 other people. This is the basic reproduction number of the infection. These 5 people will — on average — each infect 5 new people, or 25 in total. These 25 will infect 5 x 25 = 125, and so on, and so on. For the novel coronavirus, early estimates of the basic reproduction number range between 1.4 and 3.9, which is higher than a seasonal flu (0.9–2.1), but much lower than for example measles (12–18).
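The arithmetic above is easy to reproduce. The sketch below assumes, as a deliberate simplification, that every patient infects exactly the average number of people and that nobody is immune. The function name is hypothetical.

```python
def cases_per_generation(reproduction_number, generations):
    """New infections in each generation, starting from a single patient."""
    return [reproduction_number ** g for g in range(generations + 1)]


# The smallpox example from the text: 1 -> 5 -> 25 -> 125.
print(cases_per_generation(5, 3))

# Both ends of the early estimate range for the novel coronavirus, after 10 generations.
low = cases_per_generation(1.4, 10)[-1]
high = cases_per_generation(3.9, 10)[-1]
print(f"R0 = 1.4: about {low:,.0f} new infections in generation 10")
print(f"R0 = 3.9: about {high:,.0f} new infections in generation 10")
```

After ten generations the low estimate gives roughly 29 new infections and the high estimate over 800,000, which shows how sensitive the outcome is to this single number.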
The basic reproduction number is influenced by many different factors which cannot be controlled, such as the incubation time and the infectiousness of the disease. However, it also depends on the number of susceptible people that affected patients are in contact with. This is the main reasoning behind social distancing measures to ‘flatten the curve’ (more about flattening the curve visuals in a later chapter): the lower the number of people an infected patient comes in contact with, the lower the reproduction number and the slower the disease will spread through the population.
As many countries are currently in the early exponential growth phase of the epidemic, and taking measures to reduce the reproduction number, this is an appropriate opportunity to plot the number of infections (and the number of deaths) on a logarithmic scale. It enables us to quickly evaluate in which countries measures are more effective and the disease is spreading less rapidly, such as Japan or Singapore:
As an additional benefit, linear-logarithmic plots like these can educate the general public about the exponential nature of the disease in its early stage. This avoids more sensation-oriented headlines such as ‘More new infections today than ever before!’. While this is true, it doesn’t have to mean things are getting out of control. Everything might be entirely as expected, or even improving. As Hans Rosling notes in his bestseller Factfulness (if you haven’t read it, do it immediately!): things can be both better and bad.
A final note on logarithmic scales before I shut up about it and we can move on to more exciting things. As the infection continues to spread, a growing fraction of the population will be either currently infected, or immune because they were infected in the past and survived. Also, vaccines can be developed, or increasingly strict measures of social distancing and quarantine can be enforced. In practice, this means the effective reproduction number will drop, the exponential growth will start to decelerate and the number of infections will reach a peak and start dropping again. When this happens, the usefulness of logarithmic scales has reached its end.
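A common textbook simplification, assumed here rather than taken from any of the sources above, is that the effective reproduction number equals the basic reproduction number multiplied by the fraction of the population that is still susceptible. Under that assumption the number of infections stops growing once the effective number drops to 1. The value 2.5 below is just an illustrative pick from within the 1.4–3.9 range mentioned earlier.

```python
def effective_reproduction_number(basic_reproduction_number, susceptible_fraction):
    """Average number of people one patient infects when only part of the population is susceptible."""
    return basic_reproduction_number * susceptible_fraction


def immune_fraction_at_peak(basic_reproduction_number):
    """Fraction that must be immune for the effective reproduction number to reach 1."""
    return 1 - 1 / basic_reproduction_number


r0 = 2.5  # illustrative value only
for susceptible in (1.0, 0.8, 0.6, 0.4):
    r_eff = effective_reproduction_number(r0, susceptible)
    print(f"{susceptible:.0%} susceptible -> effective reproduction number {r_eff:.2f}")

print(f"Growth stops when {immune_fraction_at_peak(r0):.0%} of the population is immune")
```

Real populations do not mix uniformly, so this should be read as a rough illustration of the mechanism and not as a forecast.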
But, aren’t absolute numbers meaningless?
By now, I have created many different visuals related to the novel coronavirus, and one of the most pervasive comments concerns the use of absolute numbers. Many people argue that we cannot directly compare numbers between countries without correcting for the population count. This sounds relatively convincing — how on earth can we compare the number of deaths between a tiny country such as Belgium, and a massive nation like China?
However, this is a multifaceted question without a definitive answer. Let me list the major arguments for both approaches:
We must use relative numbers, because:
- The number of infections and deaths in a country depends on the size of that country.
- We want to evaluate the stress the pandemic will put on a country’s healthcare system, which can usually only support a certain fraction of the population being infected.
We must use absolute numbers, because:
- The rate at which a disease spreads depends on the population density and the level of social distancing, but has nothing to do with the population number. In this regard, country borders are pretty arbitrary ways of grouping people, anyway.
- Relative numbers will create pretty meaningless ‘outliers’ for some very small countries.
- Should we plot the relative or the absolute number of infections?
In my impression, most data visualizers follow the approach of using absolute numbers when plotting the total number of infections. The above-mentioned John Burn-Murdoch also agrees:
Nevertheless, using relative numbers can be very useful for other use cases such as:
- Showing the number of tests performed per capita.
- Showing the available number of doctors, hospital beds, intensive care beds,… per capita.
- Comparing how hard different countries are currently hit by the crisis.
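Switching between absolute and relative numbers is simple arithmetic, which the sketch below illustrates with two invented countries. The names, populations and case counts are entirely hypothetical; they are chosen only to show how a very small country can turn into an outlier once numbers are expressed per capita.

```python
def per_million(count, population):
    """Express an absolute count as a rate per million inhabitants."""
    return count / population * 1_000_000


# Hypothetical countries with made-up numbers (NOT real data).
countries = {
    "Smallland": {"population": 35_000, "cases": 70},
    "Bigland": {"population": 1_400_000_000, "cases": 80_000},
}

for name, data in countries.items():
    rate = per_million(data["cases"], data["population"])
    print(f"{name}: {data['cases']:,} cases, {rate:,.1f} per million inhabitants")
```

In absolute terms Bigland has more than a thousand times as many cases, while per capita Smallland looks far worse. Neither view is wrong; they answer different questions.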
For example, comparing the number of COVID-19 tests performed by country in absolute and relative numbers reveals some interesting insights:
Chapter 3: Mapping the virus
A pandemic has a strong geographical factor attached to it, so obviously we are drawn to using maps to visualize how the virus is spreading. Both data visualizers and their audience simply love maps, and I personally do too. As a child, my (geographical, historical, biological, even biblical) atlases were my favourite books and I could browse through them for days. However, pretty as they may be, maps have their own pitfalls and caveats. So be prepared!
Beauty in times of despair
Let’s start with some of the most well-designed examples of maps I have encountered during my research for this chapter. The absolute winners, in my opinion, are these clean but very effective maps by the Washington Post:
To further clarify things, these maps are complemented by a simple table detailing the exact number of confirmed infections or deaths. This gives the reader the choice to look at the broader picture, dive into the detailed numbers, or both.
It should be noted that the BBC uses very similar, equally beautiful maps. These are examples of proportional symbol maps, or what most normal people simply call bubble maps. But why exactly do these bubble maps work so well?
One of the most common issues encountered when creating data maps is the impact of population and population density. If we simply color a map according to the presence of a certain parameter, we can easily mask the fact that we are actually looking at a map of an underlying different parameter, such as population density. This may sound a bit abstract, but the excellent xkcd made an — as always — amazing cartoon (chartoon?) about this which explains it much better than I can:
Now while this may sound funny, it is something that unfortunately happens quite regularly in real life. Election maps in particular are very vulnerable to this kind of problem. For example, this is a (rather famous) map showing the election result (per county) of the US Presidential Election in 2016:
While this map has been used several times to claim a ‘landslide’ victory for the Republican candidate, it is actually rather useless, as it completely ignores the fact that the overwhelming majority of Americans live in the cities near the Upper East Coast (New York, Washington, Boston,…), near the Great Lakes (Chicago, Detroit), the West Coast (Los Angeles, San Francisco, Seattle), or in Florida. In short: land area does not equal population. While a map like this one is not strictly lying, it is (intentionally or not) hiding the fact that the split between Republican and Democratic votes was nearly 50–50.
Many of the maps published during this coronavirus crisis suffer from a similar problem. Take for example this map by ABC News showing the countries where COVID-19 cases have been confirmed:
Although somewhat helpful, such a map may say more about how connected a country is to the rest of the world, rather than showing how the virus has spread. In any case, it does not provide information about the number of cases. From this map on March 10, we cannot deduce that there was only one confirmed case in Burkina Faso, but over 10.000 in Italy, and over 80.000 in China.
A typical approach to avoid this issue is to use choropleth maps, a complicated name for something very simple. Blame the Greeks: choros means ‘region’ and plethos means ‘multitude’, hence the name. My scientific brain always tricks me into saying ‘chloropleth maps’, probably because it thinks about chloroplasts in plant cells. But don’t get your hopes up, there’s no connection at all, just my stupid brain. The ‘chloro-’ in chloroplasts also comes from Greek, but from chloros, ‘green’. The same origin, it turns out, as chlorine (because of its pale green color) or chloroform (which contains chlorine). But my apologies, I digress… I might have been reading too much Stephen Fry lately, who would probably love this kind of exploratory etymological rambling.
So, a choropleth map. In such a map, regions are again colored, but the value of the color (lightness or darkness) depends on the underlying parameter, for example the number of infections in a country. In its most basic form it looks like this example by CNBC:
Choropleth maps can be particularly helpful in comparing different counties or regions within a country, such as this map by CNN:
However, choropleth maps have their own unfortunate downsides and pitfalls. I will not go into much detail here, as everything was already written down excellently by ‘cartonerd’ Kenneth Field. Let me just summarize:
- choose your colors or color scheme responsibly,
- choose your categories responsibly, and
- use relative numbers to avoid population density distortion.
Or, just maybe, a bar chart might be a better choice:
Another limitation of choropleth maps is that small nations or regions are nearly impossible to see. Again, area size is messing up our ability to interpret the map correctly. For example, try to find the number of cases in Singapore, Luxembourg or Barbados on the maps of Our World in Data:
Bubble maps, such as the ones by the Washington Post shown above, avoid this trap because each nation gets its own bubble, independent of area, population, or population density. This is what makes this kind of chart so successful to map a wide range of values in a wide range of countries around the globe.
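One practical detail when building such a map is how to size the bubbles. A widely used convention for proportional symbol maps, assumed here and not confirmed for the Washington Post or BBC maps, is to make the bubble area (and not the radius) proportional to the value, which means the radius scales with the square root. The function name and the 40-pixel maximum are arbitrary choices for this sketch.

```python
import math


def bubble_radius(value, max_value, max_radius=40.0):
    """Radius (in pixels) such that the bubble AREA is proportional to `value`."""
    if value < 0 or max_value <= 0:
        raise ValueError("value must be >= 0 and max_value must be > 0")
    return max_radius * math.sqrt(value / max_value)


# Round illustrative values spanning several orders of magnitude.
largest = 80_000
for cases in (1, 10_000, 20_000, 80_000):
    print(f"{cases:>6} cases -> radius {bubble_radius(cases, largest):.1f} px")
```

Scaling the radius linearly instead would make large values look far bigger than they are, because the eye compares areas: a bubble with twice the radius covers four times the surface.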
There is only one minor downside: bubbles can start overlapping each other when two neighbouring regions have very large values (or one of them has a large value while the other only a small one). Then your bubble chart might start looking like this:
Not particularly helpful, I’m afraid… There is no perfect solution to avoid this, but play around with the opacity of your bubbles, be clever with bubble outlines, and the result might be both beautiful and effective:
The return of the table
I already hinted earlier that in some cases, a simple bar chart might be a better option than a complicated map. As Leonardo Da Vinci said: “Simplicity is the ultimate sophistication” (except that he never said that). Another simple but effective alternative might just be… a table.
Many great examples can be found, including the Washington Post ones at the beginning of this chapter, but I was particularly charmed by the Datawrapper ones by Lisa Charlotte Rost, with a clever use of color to bring a touch of optimism to this heavy subject matter:
Okay okay, not everything is perfect when using tables. For example, they can also be used to misinform, or at least to distort information or present data in a way that suits you best. For a while, the following table was popular on social media in the Netherlands, showing how the country was following the exact same pace as Italy, with only a few weeks of delay. Panic!
However, it was intelligently shown by RTL Nieuws (in Dutch) that the situation looks completely different when you choose different dates to start the comparison, such as the date of the first death:
Also, differences in age distribution among the population have an impact on the death rate, so it’s rarely a good idea to blindly start comparing different columns or rows with each other, without thinking things through. Remember: if creating panic is your goal, you will always find some data somewhere presented in such a way that you can do so.
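The effect of choosing a different starting point is easy to try out yourself. The sketch below re-aligns two series so that each starts on the first day it reaches a given threshold, for example the first death. The function name is hypothetical and the numbers are toy values, not the Dutch or Italian figures.

```python
def align_from_threshold(series, threshold):
    """Return the part of the series starting on the first day it reaches `threshold`."""
    for day, value in enumerate(series):
        if value >= threshold:
            return series[day:]
    return []  # the threshold was never reached


# Toy cumulative death counts by calendar day (illustrative only, NOT real data).
country_a = [0, 0, 1, 2, 5, 10]
country_b = [0, 0, 0, 0, 1, 3]

aligned_a = align_from_threshold(country_a, threshold=1)
aligned_b = align_from_threshold(country_b, threshold=1)

for day, (a, b) in enumerate(zip(aligned_a, aligned_b)):
    print(f"Day {day} since first death: A = {a}, B = {b}")
```

Changing the threshold (first case, first death, hundredth case) shifts the curves relative to each other, which is exactly why the choice deserves an explicit mention next to any such comparison.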
There are many, many more amazing things you can do with tables, also in coronavirus times, but that will be something for another chapter!
To end this chapter on a lighter note, let’s just have a look at some garbage maps from around the web. Starting with the Daily Mail committing some serious data visualization crimes, showcasing what Andy Kirk aptly calls ‘the Staircase from Hell’:
Metro is in the same boat, but at least they blow up the map of Europe to ensure that poor San Marino isn’t left out:
BBC, by the way, has written an interesting story on an old map showing air travel routes going viral (pun not intended) and causing panic because of poor journalism, such as this badly chosen tweet by the Sun:
Finally, if you would ever think about using a pie chart as an alternative to a map… just don’t:
Chapter 4: We need to talk about flattening the curve
Data visualizations rarely become hugely popular among the general public. But recently, a visual appeared which quickly spread on social media, in newspapers and magazines, and even in press conferences and on television. Hundreds of millions have probably seen it by now. Yes, the Flattening the Curve visual is truly a remarkable success story in the dataviz world. So let’s dive into it!
What curve are we flattening?
Let’s start by looking at a prime example of a ‘flatten the curve’ visual, the one in the COVID-19 #Coronavirus Data Pack at Information is Beautiful, as it combines a simple and clean curve design with a lot of additional information:
We are looking at how the expected number of daily infections will evolve over time since the date of the first case. As described in chapter 2, the shape of these curves is expected to be more or less Gaussian in nature, with an exponential growth phase at the start of the outbreak and reaching a peak at a certain point in time, typically several weeks since the date of the first case. The crux of the graph is the dotted line — the capacity of our healthcare system. There is only a certain number of patients we can treat simultaneously in an effective way, be it due to a lack of equipment (intensive care beds, ventilators,…) or a lack of personnel.
The color choice is very smart here. Orange is related to the curve we want to avoid: an outbreak without protective measures being taken. In this case the number of daily infections will quickly outpace the capacity of our healthcare system, leading to a high fatality rate because patients cannot all be treated effectively.
Blue indicates the scenario we aim for when protective measures are taken (the protective measures are also listed in the figure above). In this case, the spread of the disease will be slowed down. This implies that the outbreak might take longer to die down, but more importantly, the peak in the number of daily infections will not (or not by much) exceed the capacity of our healthcare system. This enables us to treat all patients as effectively as possible and lower the fatality rate.
In one sentence: we need to take protective measures to flatten the curve.
Many visualizers have come up with their own version of the ‘flatten the curve’ graphic. One of the more popular examples was the following animation by illustrator Toby Morris and microbiologist Siouxsie Wiles for The Spinoff, an online magazine from New Zealand:
Before we dive into the many different variations of this curve, we need to briefly discuss a more technical point: should both graphs have the same area, or not?
When looking at both visuals above, this is not very clear. In the Information is Beautiful version, it might seem that the blue, flat curve has a somewhat smaller surface area compared to the orange version, but it is not very extreme. In the animated version, both appear to have a similar area. In some visuals, the difference is much more extreme, such as in this one by Vox:
I have seen some debate amongst data visualizers and epidemiologists around this topic, and it appears that by taking protective measures we lower the basic reproduction number (see chapter 2), which leads to both slower growth and a reduction in the total number of cases (i.e. the area under the curve). This is confirmed by this report by the CDC in which the Flattening the Curve visual first appeared (in 2017!). But again, I am neither a doctor nor a microbiologist, so please don’t just take my word for it.
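For readers who want to experiment with this, the sketch below uses the textbook SIR (susceptible, infected, recovered) model with a simple one-day time step. This is my own choice of model for illustration, not the one behind the CDC report or any of the visuals above, and every parameter value is an assumption. It compares a scenario without measures to one where the contact rate is halved.

```python
def simulate_sir(contact_rate, recovery_rate=0.1, population=1_000_000,
                 initial_infected=10, days=730):
    """Run a basic SIR model in one-day steps.

    Returns the number of active infections per day and the total number
    of people infected by the end of the simulation.
    """
    susceptible = float(population - initial_infected)
    infected = float(initial_infected)
    active = []
    for _ in range(days):
        new_infections = contact_rate * susceptible * infected / population
        new_recoveries = recovery_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
        active.append(infected)
    total_infected = population - susceptible
    return active, total_infected


# Assumed parameters: reproduction number 3.0 without and 1.5 with measures.
no_measures, total_no_measures = simulate_sir(contact_rate=0.30)
with_measures, total_with_measures = simulate_sir(contact_rate=0.15)

for label, active, total in [
    ("No measures", no_measures, total_no_measures),
    ("With measures", with_measures, total_with_measures),
]:
    peak = max(active)
    print(f"{label}: peak of {peak:,.0f} active cases on day "
          f"{active.index(peak)}, {total:,.0f} infected in total")
```

In this simplified model the flatter curve peaks later and lower, and it also ends up with fewer infections overall, in line with the point above that the two curves need not have the same area.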
And what happens after the curve? When strict protective measures, such as a lockdown, are released, will we not simply see a ‘restart’ of the outbreak as people start having more social contacts again? In that case, we will have only delayed the outbreak (at a very high economic, and possibly psychological, cost) and still end up with millions of deaths. Not really, as we will have gained one unbelievably major advantage: time. Time to do more research, bring us closer to effective drugs or even a vaccine, time to build up capacity, time to educate the public.
Tomas Pueyo describes this very convincingly in his article ‘The Hammer and the Dance’, so I will not go into further detail here. Just one word of advice: proceed with caution. Tomas Pueyo is a great and convincing writer, but he is not a doctor or microbiologist. He is an engineer, ex-consultant and entrepreneur (hmm, that sounds a lot like my own story). He is also the guy who wrote an entire book about the amazing Star Wars Ring Theory without even crediting Mike Klimo, the person who originally came up with it. That’s a pretty douche move, Tomas. But hey, let’s get back to visualizing data, shall we?
When flattening the curve got out of hand
This is one of those rare times in history when a data visualization goes, well, viral (ugh!). I can hardly come up with data visuals which have reached the same level of fame as Flattening the Curve. It has popped up in press conferences, in newspapers and on television in nearly every country on earth. Millions of people have seen it. That’s quite an achievement!
If we try to trace this visual back to its origins, probably the first time it appeared in the media was in The Economist at the end of February, recreated from the CDC original by visual-data journalist Rosamund Pearce:
Since then, the spread of the visual has picked up speed several times, for example when population health analyst Drew Harris drew (pun intended) his own version and shared it on Twitter:
The New York Times explains it very well in this article.
By now, Flattening the Curve adaptations have exploded. Maybe we should start talking about flattening the curve of Flattening the Curve visuals? (Not my joke, sorry, credits go to Andy Kirk.)
Hey, don’t get me wrong, I love a good data visualization anytime, but seriously?
Ah, the curve. It just works so well in educating the people! It convinces us in a single glance why staying at home serves the greater good. Never before was chilling on my couch such a great way to save millions of lives, right?
Nevertheless, there are some great curve alternatives (alcurvatives?) to the by now overused visual. The one I must highlight comes from (again, these guys are simply great visualizers) the Washington Post. I like it so much that maybe ‘One Visual to Rule Them All’ is a more appropriate way to describe it. Rather than showing the same old curves, Harry Stevens decided to simulate the effects of protective measures and social distancing, and show the impact through a series of animations. This is the pandemic spreading with an enforced quarantine, which is starting to ‘leak’:
Using these simulations (by the way, these are not pre-recorded, a new simulation is run on every page reload!) the team convincingly shows the effect of doing nothing, enforcing quarantine, or enforcing moderate or extensive social distancing:
It quickly became Washington Post’s most popular story ever.
To wrap up this chapter on curves, there is one final visual I would like to share with you. Don’t worry, it’s an alternative to the Flattening the Curve overload, so no curves this time. In fact, it’s another animated illustration by Toby Morris for The Spinoff, the same guy we started our chapter with. It shows how only one person avoiding contact with others can have a dramatic impact on the total number of infections:
I love it.
Possible upcoming topics for this blog post:
- Visualizing symptoms, mortality and reproductive ratio
- Coronavirus dashboard design
- Best practices in visualizing pandemic data (good & bad examples, available tools and toolkits)
- Visualizing predictive models (?)
- Coronavirus storytelling and scrollytelling
Disclaimer: I am not a medical doctor or a virologist. I am a physicist running my own business (Baryon) focused on information design.