A Graph on COVID-19 cases in Singapore

SingTat
Published in The Startup
7 min read · Mar 28, 2020

It started when I was trying to keep up with the daily COVID-19 updates from the Ministry of Health (MOH). At some point, it became pretty confusing to track “who is linked to who and where”, so I created a simple graph to help me visualise the connections. Over time, I added filter and search functions to make it easier and quicker to find information about the cases and clusters.

You can explore the graph at https://www.maventechnologies.com.sg/covid-19/.

Here’s a simple video to illustrate some of the functions in the web page.

Behind the Scenes

I feel I have to say this. I am not targeting any case or cluster. I bring them up only to help explain some of the concepts.

These are the processes and tools I have used to build the page.

Processes and tools for the project

The landscape has changed a lot since I started the project in early Feb. Looking back, I can actually categorise the work done under 3 main tasks:

  1. processing the daily source data; affects statistics, data integrity, search capability
  2. implementing the graph; route nodes, reduce load time, maintain User Experience (UX)
  3. miscellaneous; resolve bugs, add features, analyse graph for insights

Visually, it looks like this.

Timeline and scope of project

Let’s look at the evolution of the first two items in more detail.

Processing Source Data

The source data is easily the most critical component of the project. It affects the graph, statistics, filter, search, everything! There have been 3 revisions so far.

1.0

MOH has usually been the primary source of information, but in the early days I also analysed content from reputable news agencies like Channel News Asia (CNA) and the Straits Times (ST). Occasionally, as shown in the following example, ST had slightly more details (i.e. specific relationships).

Extracting nodes and relationships from the best source

This might sound trivial but every bit of extra content helps to indirectly boost the search capability. Hence for each case, I would try to select the source with the most information to build the most comprehensive graph and data.

Over time, however, updates from CNA and ST became more consistent with MOH.

2.0

Having focused solely on MOH updates, I started analysing the source data and adding minor tweaks. It could be as simple as changing this:

  • Case X is a 40 year-old male Singapore Citizen…

to this:

  • Case X is a 40-year-old Singaporean man… (notice the hyphen after 40)

This allows me to check for some form of data consistency before I publish the latest updates for the day.
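As a rough sketch, a tweak like the one above could be automated with a few rewrite rules. The rules and the `normalise` function below are assumptions for illustration; the actual tweaks applied to the MOH text are not shown here.

```python
import re

# Hypothetical rewrite rules (pattern, replacement), applied in order.
RULES = [
    (re.compile(r"\b(\d+) year-old\b"), r"\1-year-old"),
    (re.compile(r"\bmale Singapore Citizen\b"), "Singaporean man"),
    (re.compile(r"\bfemale Singapore Citizen\b"), "Singaporean woman"),
]

def normalise(description: str) -> str:
    """Apply every rewrite rule so that all descriptions share one wording."""
    for pattern, replacement in RULES:
        description = pattern.sub(replacement, description)
    return description

print(normalise("Case X is a 40 year-old male Singapore Citizen"))
```

Once every description uses the same wording, a plain text search for “singaporean” or “-year-old” can double as a count that is comparable with the summary figures.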

An example is verifying the total number of Singaporeans across summary, filter and search.

Verifying the total for “singaporean” across summary, filter and search

For example, if you search for “-year-old”, you get one record fewer than the total because case #28 is a 6-month-old baby.

Search for “-year-old” returns 1 record fewer than total

Another example is searching for a cluster name. The number of cases should tally with the number shown in the cluster menu.

Search is consistent with total in cluster menu
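Checks of this kind can be sketched as a small script that compares search counts against the totals shown elsewhere on the page. The records, field names and functions below are made up for illustration and are not the real data or code behind the page.

```python
# Made-up records standing in for one day's data; the field names are assumptions.
CASES = [
    {"id": 1, "cluster": "Cluster A", "text": "Case 1 is a 40-year-old Singaporean man"},
    {"id": 2, "cluster": "Cluster A", "text": "Case 2 is a 6-month-old Singaporean boy"},
    {"id": 3, "cluster": "Cluster B", "text": "Case 3 is a 35-year-old Chinese national"},
]

def search(term, cases=CASES):
    """Case-insensitive substring search over description and cluster name."""
    needle = term.lower()
    return [c for c in cases
            if needle in c["text"].lower() or needle in c["cluster"].lower()]

def find_mismatches(expected_counts, cases=CASES):
    """Return {term: (expected, found)} for every term whose search count is off."""
    mismatches = {}
    for term, expected in expected_counts.items():
        found = len(search(term, cases))
        if found != expected:
            mismatches[term] = (expected, found)
    return mismatches

# Totals as they would appear in the summary and the cluster menu.
expected = {"singaporean": 2, "-year-old": 2, "Cluster A": 2, "Cluster B": 1}
print(find_mismatches(expected))  # an empty dict means everything tallies
```

Running something like this before publishing would flag any term whose search count no longer matches the summary or the cluster menu.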

I believe some form of checking is necessary because users will start to doubt the page once they detect inconsistencies. These checks are especially critical when there is a change in the source data, which brings me to 3.0.

3.0

Since the surge in cases from 18 March onwards, MOH has, understandably, been sharing a new format with considerably less information. Here is a sample of the update and the corresponding description generated using a macro.

Extracting nodes and relationships from one random case on 18 March 2020

As seen, the locations where a case stays, works or has travelled are no longer shared; only the cluster location is provided. This resulted in an 80% drop in Place-Nodes.

While having fewer nodes reduces the complexity of the graph and speeds up page load time, less information also means fewer discoveries. In fact, I was doing a bit of investigation into inter-cluster nodes, but I have had to put that on hold for now.

Using Cypher (Graph Query Language) to identify inter-cluster nodes
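The Cypher query itself appears only in the image, so here is the same idea as a plain Python sketch: a node counts as inter-cluster if it is linked to more than one cluster. The link list and names are hypothetical, and the real graph schema may differ.

```python
from collections import defaultdict

# Hypothetical (node, cluster) links; not real case data.
LINKS = [
    ("Case 8", "Cluster A"),
    ("Case 9", "Cluster A"),
    ("Case 9", "Cluster B"),
    ("Place P", "Cluster B"),
    ("Place P", "Cluster C"),
]

def inter_cluster_nodes(links):
    """Return {node: sorted clusters} for nodes linked to two or more clusters."""
    clusters_by_node = defaultdict(set)
    for node, cluster in links:
        clusters_by_node[node].add(cluster)
    return {node: sorted(clusters)
            for node, clusters in clusters_by_node.items()
            if len(clusters) > 1}

print(inter_cluster_nodes(LINKS))
```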

In addition, the checks I was doing in 2.0 have expanded with the new format. It seems that, given the time, I can always find something to improve on, for example:

  • formatting date to ISO 8601 format
  • replacing nationality with country name
  • checking for different styles of country name (e.g. USA vs US)
  • verifying lookup of country name to ISO 3166 format
  • handling “pending” data
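A few of the checks in this list can be sketched as follows. The input date format (“18 March 2020”), the tiny lookup tables and the function names are all assumptions for illustration; a real lookup would need the full ISO 3166 list.

```python
from datetime import datetime

# Hand-written, deliberately tiny lookup tables for illustration only.
COUNTRY_NAMES = {
    "singaporean": "Singapore", "singapore": "Singapore",
    "american": "United States", "us": "United States", "usa": "United States",
    "british": "United Kingdom", "uk": "United Kingdom",
}
ISO_3166_ALPHA2 = {"Singapore": "SG", "United States": "US", "United Kingdom": "GB"}

def clean_date(raw):
    """'18 March 2020' -> '2020-03-18' (ISO 8601); 'pending' -> None."""
    raw = raw.strip()
    if raw.lower() == "pending":
        return None
    return datetime.strptime(raw, "%d %B %Y").date().isoformat()

def clean_country(raw):
    """Map a nationality or country spelling to (name, ISO 3166-1 alpha-2 code)."""
    key = raw.strip().lower()
    if key == "pending":
        return None
    name = COUNTRY_NAMES[key]  # a KeyError flags a spelling that needs a new entry
    return name, ISO_3166_ALPHA2[name]

print(clean_date("18 March 2020"), clean_country("USA"), clean_country("pending"))
```

Letting an unknown spelling raise an error, rather than passing it through, is one way to make sure new variants get noticed before the day's update is published.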

Implementing The Graph

Since the beginning, I have been using the Compound Spring Embedder (cose) layout algorithm (in Cytoscape) to route the nodes. Every day, new cases are reported and gradually, the number of nodes has grown into the hundreds. I have made 3 major revisions in order to keep the graph loading time short and maintain the UX.

1.0

Early on, the positions of the nodes were generated at run-time using CPU resources in users’ browsers. This means every time a user loads the page, the graph layout would be different. At one point, it took 60s just to load the graph with only about 100 nodes and I knew I had to improve.

It took 60s to load the graph on an old piece of hardware

2.0

In 2.0, I generated the positions of the nodes ahead of time using the CPU of my development laptop. Once done, I uploaded the positions of those nodes to the server. Every user now loads the same graph layout no matter how many times they refresh the page. Because their CPU no longer has to do any intensive calculations, users can now load the graph quickly.

Graph loads faster when positions of nodes are pre-determined
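The idea can be sketched as: compute positions once, store them, and serve them with Cytoscape's `preset` layout so the browser does no layout work. In this sketch, `circle_layout` is only a stand-in for the offline cose run, and the node ids and function names are hypothetical.

```python
import json
import math

def circle_layout(node_ids, radius=200.0):
    """Stand-in for the offline layout step (the real positions came from cose)."""
    count = len(node_ids)
    return {
        node_id: {
            "x": round(radius * math.cos(2 * math.pi * i / count), 2),
            "y": round(radius * math.sin(2 * math.pi * i / count), 2),
        }
        for i, node_id in enumerate(node_ids)
    }

def to_cytoscape_elements(positions):
    """Attach a stored position to each node in Cytoscape's elements JSON format."""
    return [{"group": "nodes", "data": {"id": node_id}, "position": pos}
            for node_id, pos in positions.items()]

positions = circle_layout(["case-1", "case-2", "place-1"])
# The "preset" layout tells Cytoscape to use the given positions as they are.
payload = json.dumps({"elements": to_cytoscape_elements(positions),
                      "layout": {"name": "preset"}})
print(payload)
```

Because the positions are fixed in the payload, every page load renders the same picture.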

This went well for a couple of weeks until routing the ever-increasing number of nodes became an issue. I was not able to tweak the cose algorithm further, so I had to manually drag the nodes to declutter them. That was really time-consuming and, frankly, unsustainable.

3.0

In 3.0, I removed Place-Nodes associated with single cases to simplify routing. Should these nodes be associated with another case in future, they will be reinstated (e.g. see Changi Airport, which is connected to two cases). Locations not shown in the graph can still be found under details or search.

Removing Place-Nodes that are connected to only single cases
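A minimal sketch of this pruning rule, assuming the graph is available as (case, place) pairs. The case ids and “Place Q” are made up; only Changi Airport being linked to two cases comes from the example above.

```python
from collections import defaultdict

# Hypothetical (case, place) edges.
EDGES = [
    ("Case 1", "Changi Airport"),
    ("Case 2", "Changi Airport"),
    ("Case 3", "Place Q"),
]

def visible_places(edges, min_cases=2):
    """Keep only the places that are linked to at least `min_cases` distinct cases."""
    cases_by_place = defaultdict(set)
    for case, place in edges:
        cases_by_place[place].add(case)
    return {place for place, cases in cases_by_place.items()
            if len(cases) >= min_cases}

print(visible_places(EDGES))
```

Since the rule is recomputed from the edges each time, a hidden place comes back on its own as soon as a second case is linked to it.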

I also started grouping nodes together (called compound nodes in Cytoscape) in my development version to help with the routing. With the ability to drag many nodes at once, it speeds up the process slightly.

Using compound nodes (rectangular boxes) to aid manual routing
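In Cytoscape's elements JSON, a node becomes a child of a compound node by naming that node in its `data.parent` field. The sketch below builds such elements; the group and node ids are hypothetical.

```python
import json

def group_into_compounds(groups):
    """Build Cytoscape elements where each group id becomes a compound (parent) node."""
    elements = []
    for parent_id, child_ids in groups.items():
        elements.append({"group": "nodes", "data": {"id": parent_id}})
        for child_id in child_ids:
            # "parent" is what places a node inside the compound node.
            elements.append({"group": "nodes",
                             "data": {"id": child_id, "parent": parent_id}})
    return elements

elements = group_into_compounds({"box-1": ["case-1", "case-2"], "box-2": ["case-3"]})
print(json.dumps(elements, indent=2))
```

Dragging a parent box in the editor then moves all of its children together.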

As you know, since 18 March, there have been considerably fewer Place-Nodes to route but the Person-Nodes have increased significantly. One approach I am exploring to optimise routing further is a combination of static and dynamic positioning.
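One possible reading of “static and dynamic”, sketched below purely as an assumption: nodes that already have a stored position keep it, and only new nodes get a computed position near the node they are linked to. The function, its parameters and the eight-slot ring are illustrative choices, not a description of the approach actually being explored.

```python
import math

def place_new_nodes(saved, new_nodes, radius=40.0, slots=8):
    """saved: id -> (x, y) of already routed nodes (static part).
    new_nodes: list of (new_id, anchor_id); each new node is placed on a ring
    around its anchor (dynamic part). More than `slots` nodes on one anchor
    would overlap, so this is only a starting point for manual tidying."""
    positions = dict(saved)
    used_slots = {}
    for new_id, anchor_id in new_nodes:
        slot = used_slots.get(anchor_id, 0)
        used_slots[anchor_id] = slot + 1
        anchor_x, anchor_y = saved[anchor_id]
        angle = slot * (2 * math.pi / slots)
        positions[new_id] = (round(anchor_x + radius * math.cos(angle), 2),
                             round(anchor_y + radius * math.sin(angle), 2))
    return positions

saved = {"cluster-1": (0.0, 0.0)}
print(place_new_nodes(saved, [("case-1", "cluster-1"), ("case-2", "cluster-1")]))
```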

Hopefully I have given you a little insight into what goes into building and maintaining the graph. If you are keen to obtain the daily source data for your reference or research, visit https://www.maventechnologies.com.sg/covid-19/source/api.php.

Going forward, the situation is likely to remain fluid. Furthermore, it will be just a matter of time before the addition of nodes starts affecting the UX again. However, I will try to keep the information in the page as updated as possible and adapt to new developments along the way.

Let us stay vigilant in this battle against the virus. Stay safe and well.

Update (21 Apr 2020)

There was a gigantic surge in the number of new cases in mid-April. Here’s a comparison.

Graph on 30 Mar 2020 (approx. 1000 nodes; image size 1.5MB)
Graph on 15 Apr 2020 (approx. 4000 nodes; image size 6.3MB)

Unfortunately, there are many problems with the graph:

  • huge number of nodes leading to high page loading time; it again takes 60s to load this graph on my iPad Mini
  • it takes me half a day to complete the MOH updates in spreadsheets (including adding relevant search phrases), route the nodes and publish the changes for 400+ new cases; going forward, it is not sustainable (new cases hit 1,426 on 20 Apr)
  • it is cluttered; looking for information is cumbersome
  • the details about each case remain sparse and generic; imagine taking 60s to load the graph, tapping on a node and seeing “Unknown Gender, Unknown Age, Unknown Nationality”

After much consideration, I made the tough decision to rework my processes and adapt the graph to focus on the clusters instead of cases.

New Graph on 19 Apr 2020 (approx. 100 nodes; image size 0.4MB)

The new graph looks clearer now. You can see the number of cases linking clusters instantly. Hopefully, the number of clusters and cases will start coming down in the next couple of days.
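The move from a case-level to a cluster-level graph can be sketched as an aggregation: each cluster becomes one node with a case count, and two clusters are linked by the number of cases they share. The mapping below is made up, and the real data model may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mapping of each case to the clusters it belongs to.
CASE_CLUSTERS = {
    "Case 1": ["Cluster A"],
    "Case 2": ["Cluster A", "Cluster B"],
    "Case 3": ["Cluster A", "Cluster B"],
    "Case 4": ["Cluster C"],
}

def cluster_graph(case_clusters):
    """Return (cases per cluster, number of shared cases per cluster pair)."""
    sizes = Counter()
    links = Counter()
    for clusters in case_clusters.values():
        unique = sorted(set(clusters))
        sizes.update(unique)
        links.update(combinations(unique, 2))  # one edge per pair of clusters
    return sizes, links

sizes, links = cluster_graph(CASE_CLUSTERS)
print(dict(sizes), dict(links))
```

With this shape, the node count grows with the number of clusters rather than the number of cases, which is what keeps the graph small.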

Be sure to “give a medal” to the healthcare warriors when you visit the page.

Stay safe and well.

SingTat (The Startup): IT geek who gets inspiration from everyday life and surroundings