## A practical guide for the aspiring DIY data scientist

Living in Italy is intense right now.

It took only 17 days to go from first noticing the COVID-19 outbreak, to having to refuse people care because hospitals are overloaded.

Human brains do not intuitively grasp exponentials. We need prosthetics. That’s why I have been learning a bunch of math and amateur data science this week, to help me get my head around the spread of the pandemic.

I’m not an expert at all, so my epidemiology opinions are not relevant. I’m pretty handy with a spreadsheet though, so read on if you want to learn how I used Google Sheets to create a COVID-19 dashboard that’s giving me the charts I need to understand the spread of the virus.

## 🐣 My first COVID-19 dashboard

To start, let me explain the end result, then I will unpack the spreadsheet one formula at a time. Every day I open up my dashboard and it automatically produces me an up to date set of charts like this:

I want to compare what happened in China and Italy, with what is happening in other countries.

Instead of having a timeline that starts on a date like January 1st, I’ve shifted every country so that they are all aligned on the day they first exceeded 100 confirmed cases. The idea is to give a rough estimate for my friends in other countries: if you have 100 cases now, what can you expect in two or three weeks from now?

These charts have a logarithmic vertical scale, meaning each step up is a 10X jump. You can see confirmed cases in China and Italy increasing roughly 10X per week. As the exponential growth subsides, the steep gradient of those lines should flatten off, as it has in South Korea.

If you want to play along at home, make a copy of my sheet here

## 1. Import data

People at Johns Hopkins University are maintaining a global COVID-19 data repository on Github. Aside from some bugs the past 2 days, this has been an excellent resource enabling an open source community of thousands of professional and amateur researchers.

It’s super easy to get the data into Google Sheets, just use the `=importdata()` function and point it at one of the .csv files on Github.

## 2. Sum by country

Notice there are multiple entries for “Australia”, because the dataset is organised by provinces, not countries. My dashboard is only interested in confirmed cases per country, so I need to combine all those Australia rows into one.

I made a new sheet called `SumByCountry!`.

First I need a list of all the countries in the dataset: `=sort(unique(DATASET!B2:B))`

Then for each country on that list, I want to add up all the cases in all the provinces in that country on that day: `sum(filter(DATASET!E2:E, DATASET!B2:B=A10))`

`filter()` takes a range and a condition. I’m telling it to search all the data for a given day (`DATASET!E2:E`), and only return the rows where the Country name is Australia (`DATASET!B2:B=A10`).

## 3. Sort by cases

Next I want to see the data sorted by which country has the most confirmed cases today. So I make a new sheet called `SortByCases!`.

This is the formula for the `SortByCases!` sheet: `=sort(indirect("SumByCountry!A2:"&ADDRESS_OF_MAX_DATA), DAYS_OF_DATA, FALSE)`

`DAYS_OF_DATA` and `ADDRESS_OF_MAX_DATA` are two variables I defined in another sheet called `Model!`. This is how it looks today:

As the dataset increases in size, these values will be automatically updated. Today there are 52 days of data and 125 countries being tracked, so that means I’m interested in all cells from A2 to BA126.

So that is what’s going on with the `SortByCases!` sheet. I extract all the rows and columns I want, and no more, with:`indirect("SumByCountry!A2:"&ADDRESS_OF_MAX_DATA)`. The current output of that formula is `SumByCountry!A2:BA126`.

So I have the right data, now I sort it. Look at the formula again: `=sort(indirect("SumByCountry!A2:"&ADDRESS_OF_MAX_DATA), DAYS_OF_DATA, FALSE)`. It means “take the data from the `SumByCountry!` table, sort it by the values in the 125th column (the most recently updated totals), and present it in descending order (highest caseload first)”.

## 4. Transpose

I want to see the timeline vertically, not horizontally. So I made a sheet called `Transpose!` , populated by this formula: `=transpose(indirect("SortByCases!1:"&NUM_COUNTRIES+1))`.

Note `transpose()` is a function that simply flips rows and columns (not to be confused with my sheet called `Transpose!`).

I use `indirect()` again so that it will automatically expand the data range as the database grows.

## 5. Calculate thresholds

I want to offset the data for each country so I can align all the trend lines around the 100th confirmed case. So please meet my friend the `Thresholds!` sheet.

So first I need to find out: how many days after the dataset started did this country exceed 100 cases? I get this with `=match(TRUE,index(Transpose!C2:C>A5),0)`

That’s a kinda hacky way to use the `index()` function: it takes the column of data for Italy (`Transpose!C2:C`) and converts it into a column full of TRUE or FALSE values depending on if each cell is larger than 100 (`>A5`). Then the `match()` function returns the row number of the first cell that exceeds 100.

So you can see Italy reached 100 confirmed cases on day 33, Iran on day 36, and so on.

## 6. Offset

Finally I use those thresholds to create a new table, where each column is offset by a different number of days, so they are aligned on the 100th case.

After some experimentation, I decided I want to see up to 10 days prior to the 100th case. So Row 6 of this sheet tells me which data I need to extract.

The formula in `C6` is `=address(C4,C5,4) &":"& left(address(C4,C5,4),1) `. It takes the row and column numbers and produces the cell reference `C23:C`

Finally, that cell reference is pulled into the row starting at `C11` with `={indirect("Transpose!" & C6)}`

## 7. Woohoo! Pretty charts!

Now I have all the data organised in the table that I want, the final step is to make charts of the columns I’m most interested in.

That’s it! I hope I explained each step well enough for you to remix and come up with your own visualisations.

If this tutorial is useful for you, I’d love to hear what you come up with — let me know on Twitter.

p.s. if you like this sorta thing, check out the Coronavirus Tech Handbook, an amazing crowdsourced resource for technologists building things related to the outbreak

p.p.s. this work is released with no rights reserved, use it however you like. You can support my writing on Patreon.

by the author.

--

--

## More from How To Be Awesome With Spreadsheets

Practical tutorials to learn spreadsheet wizardry

## Richard D. Bartlett

I write about working together. Me: richdecibels.com My work: thehum.org