How Network Analysis Can Help Us Win over the Pandemic

Manik Garg
Published in Analytics Vidhya
7 min read · May 24, 2020
Photo by Clarisse Croset on Unsplash

Although vaccines offer the best means of preventing COVID-19 infection, strain selection and optimal implementation remain difficult due to antigenic drift and a lack of understanding of the global spread. Google and Apple have recently launched an exposure notification (contact tracing) API, which health services can use to build apps that notify users of possible exposure and help health authorities contain the spread of the virus using network analysis.

But what is network analysis and how does it work?

In this tutorial, I will outline the basics of how to create a visually appealing network graph using Python and how to identify patterns in the data in order to answer the following questions:

  1. How can we trace all the people in a risk group if a certain person gets infected?
  2. What is the level and priority of containment required for different infected persons or groups?

Note: If you are only interested in the Python functions and the visualisation, you can skip the theory of network analysis and jump directly to section 1 (Prerequisites).

Theory

Some of the technical terms and Python packages used in this analysis are:

A network is simply a set of similar objects, called nodes, connected to each other by a collection of edges. Examples include social networks (networks among people) and transportation systems (the layout of roads or railways).

In mathematics, networks are often referred to as graphs, and the area of mathematics concerning the study of graphs is called graph theory.

A node is any type of agent or element that we are trying to connect. In this case, the nodes are the people under consideration.

An edge is the connection between two nodes. In this analysis, an edge is physical contact between two people.

Nodes and edges can also carry metadata. For a node, this might be age, gender or previous respiratory health problems; for an edge, the date and time two people met, the duration of the meeting and so on. This metadata can be used to enrich the analysis.

The degree of a node is the number of edges connected to it. So, if a person met 6 different people, that person’s degree is 6.

An undirected graph is a type of network where all the edges are bidirectional. In this analysis, we use an undirected graph. In contrast, a network where the edges point in a direction is called a directed graph. Bank transactions and social media likes are examples of directed graphs.
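To make these terms concrete, here is a minimal sketch in NetworkX. The people, ages, dates and amounts are made up purely for illustration and are not part of the data set used later.

```python
import networkx as nx

# Undirected contact graph: an edge means two people met (made-up sample).
contacts = nx.Graph()
contacts.add_node("Ana", age=34)   # node metadata
contacts.add_node("Ben", age=71)
contacts.add_edge("Ana", "Ben", date="2020-05-01", minutes=15)  # edge metadata
contacts.add_edge("Ana", "Caro", date="2020-05-02", minutes=5)  # "Caro" is added automatically

# Degree: the number of edges attached to a node.
ana_degree = contacts.degree("Ana")
print("Degree of Ana:", ana_degree)

# In an undirected graph an edge can be read in either direction.
print("Ben-Ana edge exists:", contacts.has_edge("Ben", "Ana"))

# Directed graph: a payment from Ana to Ben is not a payment from Ben to Ana.
payments = nx.DiGraph()
payments.add_edge("Ana", "Ben", amount=20)
print(payments.has_edge("Ana", "Ben"), payments.has_edge("Ben", "Ana"))
```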

NetworkX: For this analysis, we will use the NetworkX package for Python, as it is one of the easiest packages for manipulating, analysing and modelling graph data.

Closeness: the clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Here, the smaller the number the better, as it shows that people meet fewer other people, which helps control the spread.
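As a small illustration, the clustering coefficient can be computed with NetworkX's `nx.clustering` and `nx.average_clustering`. The four-node graph below is a made-up example: three people who all met each other, plus one person who only met one of them.

```python
import networkx as nx

# Made-up example: A, B and C all met each other; D only met C.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])

# Local clustering coefficient: the share of a node's contacts
# that are also in contact with each other.
local = nx.clustering(G)
for person, value in local.items():
    print(person, round(value, 3))

# Average over all nodes gives one number for the whole network.
average = nx.average_clustering(G)
print("Average clustering:", round(average, 3))
```

In this sketch A and B score 1.0 because their two contacts also met each other, while D scores 0 because it has a single contact.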

Note that real-life data will be much larger than what we are using for this analysis; to keep things simple, we work with a small sample. Large-scale data can easily be filtered using the subgraph function from NetworkX.
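A minimal sketch of that kind of filtering, assuming we only want one person and their direct contacts. The chain of five people here is hypothetical.

```python
import networkx as nx

# Hypothetical chain of contacts: A - B - C - D - E
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])

# Keep only person "B" and the people B met directly.
keep = ["B"] + list(G.neighbors("B"))
sub = G.subgraph(keep)

print(sorted(sub.nodes))  # nodes that survive the filter
print(sorted(sub.edges))  # only edges between kept nodes remain
```

`G.subgraph` returns a view of the original graph, so call `.copy()` on the result if you plan to modify it independently.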

1. Prerequisites

In this section, we are going to set up the environment needed to visualise and analyse the data. If you have already prepared any of the parts listed below, you can skip them.

  • Download or create raw data for your network analysis. I downloaded my data from here.
  • For coding, use Jupyter Notebook, an open-source web application used to create and share documents containing live code, equations, visualisations and narrative text. You can either install it from here or use Google Colab, the free Jupyter notebooks hosted by Google, and save yourself the installation.
  • Install NetworkX, a lightweight Python package for the creation, manipulation, and study of the structure, dynamics, and function of complex networks. You can use either a CLI (Command Line Interface) or a Jupyter notebook to run the following commands:
  • Import the necessary libraries. NetworkX is typically imported as nx:
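A minimal sketch of the install and import steps. The use of pandas (for the dataframe) and matplotlib (for drawing) is an assumption based on the steps that follow; only NetworkX is named explicitly above.

```python
# Install once from a terminal:
#   pip install networkx pandas matplotlib
# Inside a Jupyter or Colab cell, prefix the command with "!":
#   !pip install networkx pandas matplotlib

import matplotlib.pyplot as plt  # plotting backend used by NetworkX drawing
import networkx as nx            # conventional alias for NetworkX
import pandas as pd              # dataframe handling (assumed here)

print("NetworkX version:", nx.__version__)
```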

2. Import the data and create a base network

The next step is to create an empty graph named G and import the data set into a dataframe. Once we have imported the dataframe, we can check its dimensions using the following:
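A sketch of this step, assuming the data is a two-column CSV read with pandas. The column names `Source` and `Target` and the four rows are hypothetical stand-ins for the downloaded file, embedded here so the snippet runs on its own.

```python
import io

import networkx as nx
import pandas as pd

# Stand-in for the downloaded file; with a real file you would pass its path
# to pd.read_csv instead. Column names and rows are made up.
raw_csv = io.StringIO(
    "Source,Target\n"
    "Ana,Ben\n"
    "Ana,Caro\n"
    "Ben,Caro\n"
    "Caro,Dev\n"
)

G = nx.Graph()             # empty undirected graph
df = pd.read_csv(raw_csv)  # one row per contact between two people

print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few contacts
```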

The data looks like this:

Raw data frame

We can see that our dataset has two columns, which denote contact between two persons. Note that a person can have more than one contact. The next step is to populate the graph (G) with the dataframe (df) using the following code:
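One way to do this, sketched on a made-up dataframe with hypothetical `Source` and `Target` columns, is to add one edge per row. This is not necessarily the exact code used in the original notebook.

```python
import networkx as nx
import pandas as pd

# Made-up contact list; column names are assumptions.
df = pd.DataFrame(
    {
        "Source": ["Ana", "Ana", "Ben", "Caro"],
        "Target": ["Ben", "Caro", "Caro", "Dev"],
    }
)

G = nx.Graph()
# Each row becomes one edge; nodes are created automatically.
G.add_edges_from(zip(df["Source"], df["Target"]))

print(G.number_of_nodes(), "people,", G.number_of_edges(), "contacts")

# Alternative that builds a new graph in one call:
# G = nx.from_pandas_edgelist(df, source="Source", target="Target")
```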

The command nx.info(G) will show the following output:
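Note that `nx.info` was removed in NetworkX 3.0. On newer versions, a comparable summary can be put together by hand, as in this sketch on a made-up four-person edge list:

```python
import networkx as nx

# Made-up contacts, standing in for the real data set.
G = nx.Graph([("Ana", "Ben"), ("Ana", "Caro"), ("Ben", "Caro"), ("Caro", "Dev")])

degrees = dict(G.degree())
average_degree = sum(degrees.values()) / len(degrees)

summary = (
    f"Nodes: {G.number_of_nodes()}, "
    f"Edges: {G.number_of_edges()}, "
    f"Average degree: {average_degree:.2f}"
)
print(summary)
```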

3. Visualise the network

After all this theory and network preparation, it is finally time to visualise the network and see the patterns. To do so, we will run the following code:
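A minimal drawing sketch, assuming matplotlib is installed. The graph is a made-up sample and the layout, sizes and colours are arbitrary choices, not the settings behind the figures in this article.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Made-up contacts, standing in for the real data set.
G = nx.Graph([("Ana", "Ben"), ("Ana", "Caro"), ("Ben", "Caro"), ("Caro", "Dev")])

# Fix the seed so the layout is the same on every run.
pos = nx.spring_layout(G, seed=42)

fig, ax = plt.subplots(figsize=(6, 4))
nx.draw(G, pos=pos, ax=ax, node_size=300, node_color="skyblue", edge_color="gray")
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() or fig.savefig(...) here.
```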

The bottom snippet creates the network and the top one adds names to the edges.
Network G

But since it is hard to read the network without any labels, we will add labels to the graph, and it will look like this:
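Labels can be switched on with the `with_labels` argument of `nx.draw`. This is again a sketch on made-up data with arbitrary styling.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Made-up contacts, standing in for the real data set.
G = nx.Graph([("Ana", "Ben"), ("Ana", "Caro"), ("Ben", "Caro"), ("Caro", "Dev")])
pos = nx.spring_layout(G, seed=42)

fig, ax = plt.subplots(figsize=(6, 4))
nx.draw(
    G,
    pos=pos,
    ax=ax,
    with_labels=True,   # write each node's name next to it
    font_size=9,
    node_size=300,
    node_color="skyblue",
    edge_color="gray",
)
```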

Network G with Labels

Discover the communities in this network

Tightly knit groups which are weakly connected to each other are called communities. I have written the following code to import the community module and build a variable called c_values, which is used to give the nodes of each community the same colour so that the communities can be told apart.
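The sketch below shows the same idea with a swapped-in technique: instead of the separate community module, it uses NetworkX's built-in `greedy_modularity_communities`, so the detected groups may differ from those in the figures. The graph is a made-up example of two tight groups joined by a single contact.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Made-up example: two groups of three, linked only through C and D.
G = nx.Graph()
G.add_edges_from(
    [("A", "B"), ("B", "C"), ("A", "C"),   # first group
     ("D", "E"), ("E", "F"), ("D", "F"),   # second group
     ("C", "D")]                           # the only link between them
)

communities = greedy_modularity_communities(G)

# Map every node to the index of its community.
membership = {}
for index, members in enumerate(communities):
    for node in members:
        membership[node] = index

# One value per node, in the order of G.nodes, ready to pass as
# node_color=c_values to nx.draw.
c_values = [membership[node] for node in G.nodes]
print(c_values)
```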

The code will reproduce the graph like this:

Network G with label and community assigned

Detect the most exposed people

The next aspect we are interested in is finding the people who are in contact with the most people, as they have a higher probability of getting the virus and can also be vital in breaking the chain if health officials want to control the spread.

I have employed two methods to investigate the important nodes:

  1. Betweenness Centrality

The Betweenness Centrality algorithm calculates the shortest (weighted) path between every pair of nodes in a connected graph, using the breadth-first search algorithm. Each node receives a score, based on the number of these shortest paths that pass through the node. Nodes that most frequently lie on these shortest paths will have a higher betweenness centrality score.
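A minimal sketch using `nx.betweenness_centrality` on a made-up graph of two groups joined by one contact. Scaling the scores into node sizes is one possible way to show them in a plot and is an assumption, not necessarily what the original notebook does.

```python
import networkx as nx

# Made-up example: two groups of three, linked only through C and D.
G = nx.Graph()
G.add_edges_from(
    [("A", "B"), ("B", "C"), ("A", "C"),
     ("D", "E"), ("E", "F"), ("D", "F"),
     ("C", "D")]
)

# Normalised by default, so scores lie between 0 and 1.
betweenness = nx.betweenness_centrality(G)

# People sorted from most to least "in between" others.
ranked = sorted(betweenness, key=betweenness.get, reverse=True)
for person in ranked:
    print(person, round(betweenness[person], 2))

# Possible node sizes for nx.draw(..., node_size=node_sizes).
node_sizes = [200 + 2000 * betweenness[node] for node in G.nodes]
```

C and D come out on top because every shortest path between the two groups has to pass through them.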

The above code will create the graph like this:

Network G with label, community, and centrality assigned

  2. PageRank

Another method we can use is an algorithm called PageRank, well known in SEO, which was developed at Google and named after its co-founder Larry Page. It is used to rank web pages in Google's search results. NetworkX provides it natively and computes a ranking of the nodes in the graph G based on the structure of the incoming links.
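A short sketch with `nx.pagerank` on the same kind of made-up graph. Recent NetworkX versions rely on SciPy for this call, so SciPy is assumed to be installed.

```python
import networkx as nx

# Made-up example: two groups of three, linked only through C and D.
G = nx.Graph()
G.add_edges_from(
    [("A", "B"), ("B", "C"), ("A", "C"),
     ("D", "E"), ("E", "F"), ("D", "F"),
     ("C", "D")]
)

# Returns a dict of node -> score; the scores sum to 1.
# For an undirected graph every edge counts as a link in both directions.
pagerank = nx.pagerank(G)

ranked = sorted(pagerank, key=pagerank.get, reverse=True)
for person in ranked:
    print(person, round(pagerank[person], 3))
```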

This will produce the following graph:

Network G with label, community, and centrality assigned

Attributes

In addition, as discussed at the beginning, if age or health-related data is available, it can be used to enrich the nodes, as illustrated below:
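A sketch of how such attributes could be attached with `nx.set_node_attributes`. The names, ages, health flags and the age threshold of 65 are all made up for illustration.

```python
import networkx as nx

# Made-up contacts and made-up personal data.
G = nx.Graph([("Ana", "Ben"), ("Ana", "Caro"), ("Ben", "Caro"), ("Caro", "Dev")])

ages = {"Ana": 34, "Ben": 71, "Caro": 45, "Dev": 29}
respiratory_issue = {"Ana": False, "Ben": True, "Caro": False, "Dev": False}

# Attach each dictionary to the nodes under the given attribute name.
nx.set_node_attributes(G, ages, name="age")
nx.set_node_attributes(G, respiratory_issue, name="respiratory_issue")

print(G.nodes["Ben"])

# Example use: list people above a hypothetical age threshold.
high_risk = [person for person, data in G.nodes(data=True) if data["age"] >= 65]
print(high_risk)
```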

Once age or previous health issues are part of the graph, they can be used to distinguish the nodes by colour or size, which will ease the monitoring of people at higher risk. I didn’t dig into that to keep this article short, but it could be an interesting project for anyone interested in this topic.

4. Insights from the graphs

Now we have our final network (graph). These are the insights that can be drawn from the graphs and can answer the questions stated earlier:

  • Overview: The graph provides an overview of the whole network, i.e. who has met whom, which would be very hard to grasp from the tabular data.
  • Clustering: Finding communities of people connected to each other means that if someone in a community falls sick, the others in that community can be contacted in a timely manner, thus controlling the spread.
  • Pathfinding: If resources are not enough to target everyone in the risk community, the health authorities can focus on the nodes lying between two infected nodes, which have a higher probability of catching the virus.
  • Key Entities: If the health authorities plan to start random testing, they can begin with the bigger nodes like William Penn and Margaret Fell, which have a higher degree of human contact, thus filtering the communities.
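The pathfinding idea, and the first question from the introduction, can be sketched as follows. The graph and the choice of infected people are made up, and using the single shortest path is a simplification, since several equally short paths may exist in real data.

```python
import networkx as nx

# Made-up example: two groups of three, linked only through C and D.
G = nx.Graph()
G.add_edges_from(
    [("A", "B"), ("B", "C"), ("A", "C"),
     ("D", "E"), ("E", "F"), ("D", "F"),
     ("C", "D")]
)

# Question 1: who was in direct contact with an infected person?
infected = "A"
direct_contacts = sorted(G.neighbors(infected))
print("Direct contacts of", infected, ":", direct_contacts)

# Pathfinding: who lies between two infected people?
first_case, second_case = "A", "F"
path = nx.shortest_path(G, source=first_case, target=second_case)
in_between = path[1:-1]  # drop the two infected endpoints
print("People between", first_case, "and", second_case, ":", in_between)
```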

Thank you for reading and have a nice week! The notebook is prepared for immediate running; you can also clone the repo and run the code locally. If you have any questions or thoughts, feel free to leave a comment below.
