Can Graph Analytics Help Control the Spread of Viruses?

A brief on how commonly used graph analytics methods from data mining can help in epidemiology.

Fatemeh Rahimian
Swedbank AI
Jun 11, 2020 · 6 min read



These days you hear a lot about the novel coronavirus and the disease it causes, COVID-19. People are worried about how fast it is spreading, and governments are looking for ways to limit the spread and isolate those who are already infected. The impact of this virus goes well beyond individual welfare.

In fact, one of the best ways to model a virus outbreak is with a graph. Every person is modeled as a vertex (a node), and every encounter between two people is a link. Together, these form a huge social network. Then a virus pops up somewhere in this network and starts spreading along the social links. In our real-world case, someone in Wuhan, China, got infected. Then people who had an encounter with this individual got infected, and that is how the journey of the virus started. It traveled to the neighboring cities, flew to other countries, and now we are witnessing a global pandemic.
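To make this concrete, here is a minimal sketch, using the Python networkx library, of how such a contact network could be built. The encounter records are made up purely for illustration:

```python
import networkx as nx

# Hypothetical encounter records: each pair means two people met.
encounters = [
    ("Ana", "Bo"), ("Bo", "Chen"), ("Chen", "Dia"),
    ("Dia", "Ana"), ("Chen", "Eli"), ("Eli", "Fay"),
]

# Every person becomes a vertex, every encounter becomes a link.
G = nx.Graph()
G.add_edges_from(encounters)

print(G.number_of_nodes(), "people and", G.number_of_edges(), "encounters")
```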

Why so fast, though? You might have heard about the six degrees of separation phenomenon. In a small-world network, like the world we are living in, people are much closer than they appear. Any two randomly chosen people are on average only about six steps away from each other. That is why a virus can spread so quickly through the entire network.

Digital viruses and spam that infect computer networks behave similarly. They travel across continents and are transmitted via network links to your organization's or home computers. It is therefore crucial to take control of the situation before it is too late.

But how can we limit the spread of these viruses? Let's assume we have a limited number of copies of an effective antivirus or vaccine. Who should get a copy in order to limit the spread most effectively? Or perhaps we want to know the minimum number of people who must stay home, or computers that must be disconnected from the network, to confine the dissemination process. Who are the most influential nodes in the network: people or computers with lots of connections, or those with fewer connections? Is it enough to cancel big events when an epidemic is a concern, or should we make a more targeted selection of events and venues? This brings us to the topic of node importance and node influence in networks. We are particularly interested in a very useful class of metrics known in graph nomenclature as centrality.

Centrality comes in various forms, depending on how we define it. We could say a person is central in a social network because he or she is well connected within the network. Degree centrality, eigenvector centrality, and PageRank are examples of this sort. Degree centrality gives a higher score to nodes with more links, also known as hubs. Eigenvector centrality and PageRank take this measure to the next level. The mathematics here is slightly involved, but the principle is important and intuitive: links from important nodes are worth more than links from unimportant nodes, where importance is itself defined recursively in the same way. All nodes start off equal, but as the computation progresses, nodes with more edges start gaining importance, and their importance propagates out to the nodes they are connected to. After a few iterations of re-computation, the values stabilize, yielding the final eigenvector centrality scores.
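These scores are all available in off-the-shelf tooling. As a rough illustration, here is a minimal sketch using networkx on its built-in karate-club graph, a small stand-in for a real contact network:

```python
import networkx as nx

# A small, well-known social network that ships with networkx.
G = nx.karate_club_graph()

# Degree centrality: fraction of other nodes each node is linked to.
degree = nx.degree_centrality(G)

# Eigenvector centrality: importance flows in from important neighbors,
# iterated until the scores stabilize.
eigen = nx.eigenvector_centrality(G, max_iter=1000)

# PageRank: a damped variant of the same recursive idea.
pr = nx.pagerank(G, alpha=0.85)

top = lambda scores: sorted(scores, key=scores.get, reverse=True)[:5]
print("top by degree:     ", top(degree))
print("top by eigenvector:", top(eigen))
print("top by PageRank:   ", top(pr))
```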

A totally different perspective on centrality suggests that a node or individual is central if it stands between others on their communication paths. Such nodes can facilitate or inhibit the communication of others and are therefore in a position to mediate others' access to information, viruses, spam, advertisements, or other forms of influence. This measure is referred to as betweenness centrality.
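To see why such bridge-like nodes score high, consider a small sketch with a synthetic "barbell" graph: two dense clusters joined by a short path. The path nodes have very few links, yet every shortest path between the clusters runs through them:

```python
import networkx as nx

# Two complete 10-node clusters connected by a 3-node path.
G = nx.barbell_graph(10, 3)

betweenness = nx.betweenness_centrality(G)

# The bridge nodes have only two links each, yet they sit on every
# shortest path between the two clusters, so they rank first.
best = max(betweenness, key=betweenness.get)
print("highest betweenness:", best, "with degree", G.degree(best))
```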

The figures below depict a sample subset of a financial transaction network over one month. The sampled network includes nearly 2,000 nodes (accounts) and just over 3,000 transactions. Nodes are sized and colored based on how they are ranked by the selected metric: the darker and bigger a node is, the higher it is ranked.

Figure 1. Degree centrality
Figure 2. Eigenvector centrality
Figure 3. PageRank
Figure 4. Betweenness centrality

As can be seen, degree centrality (Figure 1) highlights all the nodes with high degrees, while eigenvector centrality (Figure 2) is more focused on the big hubs in the network. PageRank (Figure 3) creates a less steep slope in the rankings and covers a wider spectrum of importance values. Finally, betweenness centrality (Figure 4) targets not only the hubs but also nodes that lie between different clusters, just like bridges. These nodes can have a low degree and yet be marked as important for holding the topology together. Interestingly, the other metrics do not identify these bridge nodes at all.
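For reference, renderings of this kind can be produced along the following lines. This is a sketch with networkx and matplotlib; the karate-club graph stands in for the transaction network, which is not public:

```python
import matplotlib.pyplot as plt
import networkx as nx

G = nx.karate_club_graph()  # stand-in for the transaction network
scores = nx.betweenness_centrality(G)

# Size and color nodes by their score: darker and bigger = higher rank.
values = [scores[n] for n in G.nodes()]
sizes = [50 + 3000 * v for v in values]

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, node_size=sizes, node_color=values,
        cmap=plt.cm.Blues, edge_color="lightgray")
plt.show()
```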

Going back to the case of virus dissemination and the problem of finding a small subset of nodes that could control the spread, I believe betweenness centrality is the better metric for the purpose. It is not blinded by the high-degree nodes, which obviously get a lot of attention, and it remains aware of what is going on in the rest of the network. There are, however, arguments that this metric is only accurate for the top-ranked nodes and falls short in ranking the less important ones. The counterargument is that if the objective is finding the most influential nodes, then the metric is serving its purpose; if you want a precise rank for every single node, you might need to look at other metrics. So my verdict, so far, stays with betweenness centrality.

But does this mean we have found a perfect solution? I'm afraid not. Why? The problem is that most of these rankings change as soon as an action is taken. Assume we find the most influential person in our network and ask her or him to stay home while contagion is still a concern. Naturally, this changes the topology, and we must compute the rankings once again to find the next highest-ranked node, and so on. This means going through a series of computations, one after every single action. Moreover, computing betweenness centrality is expensive: exact computation with Brandes' algorithm takes O(nm) time on an unweighted graph with n nodes and m edges. To add to the complexity, note that I have been describing a greedy approach that always selects the currently highest-ranked node. Does a greedy selection always yield the best result? Unfortunately, no. We are in fact dealing with an NP-hard problem, so any tractable solution is best-effort and not necessarily optimal. I will not dive deeper into the selection process here, but a complete solution also requires an appropriate process and the right heuristics for using this metric in a way that takes the dynamics of the topology into account.
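A minimal sketch of that greedy loop, recomputing betweenness after every removal (networkx again, on a synthetic small-world graph, with the budget of five chosen arbitrarily):

```python
import networkx as nx

# Synthetic stand-in network: a small-world graph of 200 nodes.
G = nx.connected_watts_strogatz_graph(200, 6, 0.1, seed=1)

budget = 5  # how many nodes we can isolate (vaccinate / disconnect)
removed = []

for _ in range(budget):
    # Rankings change after every action, so recompute each round.
    scores = nx.betweenness_centrality(G)
    target = max(scores, key=scores.get)
    G.remove_node(target)
    removed.append(target)

largest = max(nx.connected_components(G), key=len)
print("isolated:", removed)
print("largest remaining component:", len(largest), "of 200 nodes")
```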

All in all, graph analytics provides a great tool for controlling virus outbreaks and applying preventive measures, regardless of whether we are dealing with a biological virus, a digital one, spam, or misinformation. While first intuition might lead us to target the high-degree nodes in the network, betweenness centrality can provide a more effective metric. If you are part of a huge community with a lot of internal interactions, you are of course potentially playing a role in spreading the virus. But if you have only a few connections spanning several communities with little or no overlap, you are a much more influential node in the dissemination process.
