Mapping clusters of coronavirus cases in Australia

Nhung Nguyen
Published in Analytics Vidhya
Mar 15, 2020

While people were busy fighting over toilet paper or pasta this weekend, I decided to put all my energy into visualising the network of confirmed coronavirus cases in Australia.

Disclaimer: This is purely a personal quest as a data nerd and has nothing to do with my work. Opinions expressed are solely my own and do not express the views or opinions of my employer.

Based on publicly available data, I was able to group 198 cases from 25/01/2020 to 13/03/2020 into eleven clusters and to see some transmission links between cases.

Motivation

When one of my colleagues shared an article on the different Covid-19 dashboards, I had a surge of data envy seeing this excellent Coronavirus dashboard on current cases in Singapore by Upshot.

What was most interesting to me was the network visualisation of the different infection clusters, the details revealed for each case and how the cases are linked together. I then found out that through serological testing and contact tracing, Singapore authorities were able to trace and find the infection source for one of their coronavirus clusters 😮.

Infection source tracing graphic from Singapore (The Straits Times)
https://www.straitstimes.com/singapore/grace-assembly-coronavirus-mystery-solved-mega-cluster-linked-to-2-wuhan-tourists-via-a

Questions

Could we have done the same for cases in Australia? I was determined to find out, using publicly available data to seek answers to the following questions:

  • What are the different clusters of Coronavirus for Australia?
  • How are the cases linked to each other?
  • Can we trace infection route between clusters?
  • How do those clusters grow over time?

Data collection

I was hoping for a data set for confirmed coronavirus cases in Australia that is well structured, updated hourly and has all the information I need (like this one for Korea) but alas, there was none 😿!

The most up-to-date detailed information I could find was on this Wikipedia page, where contributors have pulled in data from public health sources each day and kept a tally of the total cases in Australia. I was excited to find contact tracing information 🕵 on some of the cases, as reported by state health departments 🙌. The problem? It’s just dense text and very difficult to follow after a while 😭. Don’t believe me? Try to follow this:

https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Australia#Total_cases

What if I could map the data out of the text into something more structured, thought the stubborn and ignorant data nerd in me. There were only a hundred rows or so at the time I started on this assignment.

So I began a spreadsheet with one row for each confirmed case in Australia. (⚠️Don’t ever try this with a pandemic!) For example, an entry like this…

On 25 January, the first confirmed case was announced in Victoria, a Chinese national in his 50s who had travelled from Guangzhou in Guangdong province to Melbourne via China Southern Airlines flight CZ321 on 19 January.

is translated into:

Case_id: 1;
Confirmed_date: 25/01/2020;
State: VIC;
Gender: Male;
Age: 50-59;
Infection-source: China;
Infection-type: Overseas;
Associated_flight: CZ321;
URL: https://www2.health.vic.gov.au/about/media-centre/MediaReleases/first-novel-coronavirus-case-in-victoria;
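
For readers who would rather keep such rows in code than in a spreadsheet, here is a minimal sketch of how the record above could be parsed. The field names follow the example; the `Case` class and the `parse_record` helper are hypothetical names introduced for illustration and are not part of the original workflow, which was done by hand.

```python
from dataclasses import dataclass
from datetime import date, datetime

# The example record from the text, copied as-is.
RECORD = """
Case_id: 1;
Confirmed_date: 25/01/2020;
State:VIC;
Gender: Male;
Age: 50-59;
Infection-source:China;
Infection-type: Overseas;
Associated_flight: CZ321;
URL: https://www2.health.vic.gov.au/about/media-centre/MediaReleases/first-novel-coronavirus-case-in-victoria;
"""


@dataclass
class Case:
    """One confirmed case, mirroring the columns in the example record."""
    case_id: int
    confirmed_date: date
    state: str
    gender: str
    age: str
    infection_source: str
    infection_type: str
    associated_flight: str
    url: str


def parse_record(text: str) -> Case:
    """Parse 'Key: value;' lines into a Case (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        # Split on the first colon only, so the URL keeps its 'https:' part.
        key, _, value = line.partition(":")
        # Normalise 'Infection-source' / 'Case_id' to 'infection_source' / 'case_id'.
        name = key.strip().lower().replace("-", "_")
        fields[name] = value.strip().rstrip(";").strip()
    fields["case_id"] = int(fields["case_id"])
    # Dates in the record are written day/month/year.
    fields["confirmed_date"] = datetime.strptime(
        fields["confirmed_date"], "%d/%m/%Y"
    ).date()
    return Case(**fields)


case = parse_record(RECORD)
```

Keeping the age as a band ("50-59") rather than a number matches how the case details were reported.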

Whenever there is a mention of links between cases, I try to map them together if possible. There seem to be two main link types: (1) a family member of a confirmed patient, and (2) a close contact of a confirmed patient.

For example, this description mentions a link between a newly found case and a previous case:

The 31st case in the country and the seventh case in NSW, a 41-year-old woman,[37] was a close relative of the fifth confirmed NSW case, a man who recently returned from Iran

  • NB: the fifth confirmed NSW case is the 27th confirmed case in Australia

is encoded as:

edge_id: 12;
type: family_member;
source: 31;
destination: 27;
additional_info: The 31st case in the country and the seventh case in NSW, a 41-year-old woman,[37] was a close relative of the fifth confirmed NSW case, a man who recently returned from Iran;
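
Once links are recorded as edges like this, cases that are connected through any chain of links can be grouped together. The sketch below is one possible way to do that with a plain adjacency list and a depth-first walk; it is an illustration, not the method used for the visualisations in this post. Only the first edge comes from the example above. The case ids 101 to 103 are made-up placeholders, and `linked_groups` is a hypothetical helper name.

```python
from collections import defaultdict

# (source, destination, type). Only the first edge comes from the example;
# case ids 101-103 are made-up placeholders, not real cases.
EDGES = [
    (31, 27, "family_member"),
    (102, 101, "close_contact"),
    (103, 102, "close_contact"),
]


def linked_groups(edges):
    """Return sets of case ids connected through any chain of links."""
    neighbours = defaultdict(set)
    for source, destination, _link_type in edges:
        # Treat links as undirected when grouping cases together.
        neighbours[source].add(destination)
        neighbours[destination].add(source)

    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting every case reachable from `start`.
        group, stack = set(), [start]
        while stack:
            case_id = stack.pop()
            if case_id in group:
                continue
            group.add(case_id)
            stack.extend(neighbours[case_id] - group)
        seen |= group
        groups.append(group)
    return groups


groups = linked_groups(EDGES)
```

The edge direction (source is the newer case, destination the earlier one) is kept in `EDGES`, so it is still available for drawing arrows even though the grouping ignores it.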

Challenges & limitations:

  • Mapping the links accurately is quite painful. My case_id counts cases across the whole of Australia, but the links usually come from state health authorities, which number cases within their own state (e.g. the 31st case nationwide is the seventh case in NSW). I ended up adding a state_case_id column to make sure I got the linkages right.
  • Information is incomplete and inconsistent between states and between days. Early cases were reported in a lot of detail, but later cases were not, which leaves gaps in the data.
  • Sometimes, linkages between cases come from my own deductive reasoning when combining different sources (authority reports and media coverage) and might be plain wrong.
  • Doing this manually takes a huge amount of time and effort. There might be plenty of errors in the data entry itself, and I haven’t had the chance to go through and fact-check every single entry. The case count jumps every day, and it’s getting to the point where it’s impossible to keep track of every record.
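
The state numbering problem in the first point is essentially a lookup from (state, state_case_id) to the national case_id. Here is a minimal sketch of that idea, assuming the mapping is kept as a dictionary; `NATIONAL_ID` and `resolve` are hypothetical names, and both entries come from the NSW example quoted earlier.

```python
# Lookup from (state, state_case_id) to the national case_id.
# Both entries come from the NSW example quoted earlier in the post.
NATIONAL_ID = {
    ("NSW", 5): 27,
    ("NSW", 7): 31,
}


def resolve(state: str, state_case_id: int) -> int:
    """Translate a state-level case number into the national case_id."""
    try:
        return NATIONAL_ID[(state, state_case_id)]
    except KeyError:
        # Fail loudly rather than guess: a wrong id silently creates a wrong link.
        raise KeyError(
            f"No national id recorded for {state} case {state_case_id}"
        ) from None


# "The seventh case in NSW was a close relative of the fifth confirmed NSW case."
edge = {
    "type": "family_member",
    "source": resolve("NSW", 7),
    "destination": resolve("NSW", 5),
}
```

Raising on an unknown state case number is a deliberate choice here, since a missing mapping is exactly the kind of data-entry gap described in the list above.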

Outcomes

What can I find out from the data? Let’s look back at my original questions:

  • What are the different clusters of Coronavirus for Australia?
  • How are the cases linked to each other?
  • Can we trace infection route between clusters?
  • How do those clusters grow over time?

Clusters

From the spreadsheet, I managed to group cases from 25/1 to 13/3 into 11 clusters:

  • 🇺🇸 United States: 18 cases
  • 🇮🇷 Iran: 17 cases
  • 🗺 Overseas: 17 cases
  • 🌏 South East Asia: 15 cases
  • 🇨🇳 China: 14 cases
  • 🇮🇹 Italy: 14 cases
  • 👴 Dorothy Henderson Lodge aged care: 12 cases
  • 🇬🇧 United Kingdom: 9 cases
  • 🛳 Diamond Princess cruise ship: 9 cases
  • 🇪🇺 Europe: 7 cases
  • 🏥 Ryde hospital: 6 cases
  • 🎖 Defence: 4 cases

along with 16 human transmission cases (with known links to a previously confirmed case in Australia) and a big bucket of 39 unknown cases, where the source of infection is under investigation or there is no recent travel history and no contact with previous cases.

Connections between cases

There were a few connections between cases where you can see how an infected patient from overseas can spread the disease to someone they live with or have close contact with. But the data is limited.

Infection routes

Sadly, I wasn’t able to trace any infection route between clusters as I had initially hoped, except for some known linkages between the Dorothy Henderson Lodge aged care facility and Ryde hospital. They’re located not far from each other, and there were either doctors visiting the aged care facility or residents going to the hospital.

What would have been interesting is to see how the virus first spread from a returned traveller to the local community, but there is no obvious link that I can see from the data alone.

Clusters growth over time

I thought it’d be interesting to see if there are any changes over time for each cluster, which is represented in the visualisation below. It’s a bit hard to know what’s going on as you go further down the chart, but looking at it as a whole really shows how things have ramped up in Australia from the beginning of March.

This is the timeline from 1/3/2020 to 13/3/2020 with the China cluster out of the picture…
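
A growth-over-time view like this boils down to a running total of confirmed cases per cluster, ordered by date. Below is a minimal sketch of that calculation, assuming each row carries a confirmed date and a cluster label. The first row is case 1 from the example record; every "Cluster A" row is a made-up placeholder and not real case data, and `cumulative_by_cluster` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from datetime import date

# (confirmed_date, cluster) pairs. The first row is case 1 from the example
# record; the "Cluster A" rows are made-up placeholders, not real case data.
ROWS = [
    (date(2020, 1, 25), "China"),
    (date(2020, 3, 1), "Cluster A"),
    (date(2020, 3, 1), "Cluster A"),
    (date(2020, 3, 2), "Cluster A"),
]


def cumulative_by_cluster(rows):
    """Return {cluster: [(date, running_total), ...]} in date order."""
    # First count how many cases each cluster gained on each day.
    daily = defaultdict(Counter)
    for confirmed, cluster in rows:
        daily[cluster][confirmed] += 1

    # Then turn the daily counts into a running total per cluster.
    growth = {}
    for cluster, counts in daily.items():
        total = 0
        series = []
        for day in sorted(counts):
            total += counts[day]
            series.append((day, total))
        growth[cluster] = series
    return growth


growth = cumulative_by_cluster(ROWS)
```

Leaving a cluster out of the chart, as with the China cluster in the second timeline, then amounts to skipping that key when plotting.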

To conclude

While I didn’t get to trace any infection route, I was still somewhat satisfied 😊 to be able to make sense of the data, see the different clusters and get a full picture of all the cases and how they evolved over time.

The visualisations could definitely be improved if I had more time (I could find so many things to fix with them!) but this is already a rather time-consuming exercise.

It’d be interesting to see if there’s more public interest in gathering useful structured data like this in time, to better understand how the virus spreads, how fast it spreads and which interventions work.

Nhung Nguyen
Analytics Vidhya

UX designer, fond of observing and chatting to users, slightly obsessed with data vis, often get caught talking to other people’s cats.