Taking Control of Information Spread in a Network Graph

Brad Chattergoon
The Renaissance Economist
10 min read · Aug 20, 2018

In 2015, as part of the Networks: Structure and Economics course taught by Prof. Adam Wierman at Caltech, I worked with Michael Malek, Sharjeel Aziz, and Ronnel Boettcher on the Pandemaniac Challenge for the course. We won first place and are listed on the course webpage here. This is a reproduction of the report for the challenge and may provide some insight into one way to think about spreading information across a network graph.

Note: For Caltech students taking the course, use of this information would be a violation of the Honor Code.

What is a Pandemaniac?

Introduction

As global internet usage has exploded in the past decade, so too has digitized social interaction. This increased use of technology to facilitate the sharing of information has given researchers unprecedented amounts of data on "cascades": the rapid spreading of information through a network. Given the natural financial incentives also at play (e.g., advertising potential), cascading is a hot topic in computer science.

As realistic networks can be intractably complex, researchers have instead formulated a simpler problem to study: influence maximization. Formally, the idea is that in a graph modeling a social network, a node is "active" if it has adopted a certain target idea, and "inactive" if it has not. Given a probabilistic function for how a node becomes activated based upon the state of its neighbors, how does one choose a set of seed nodes so as to maximize the final number of active nodes? Though countless papers have been written on the subject with strong theoretical results, consistently instigating cascades at will in real social networks has proven to be beyond the reach of any private organization currently in existence.
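Stated compactly (a standard formulation, not one taken from the course): writing $\sigma(S)$ for the expected number of active nodes once a cascade seeded from the set $S$ has finished spreading, the problem is to find

$$S^{*} = \arg\max_{S \subseteq V,\ |S| = k} \sigma(S),$$

where $V$ is the node set of the graph and $k$ is the seed budget.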

The Game

In the Pandemaniac project, we were tasked with a simplified version of the influence maximization problem: rather than the pandemic spreading probabilistically, it spread deterministically. The twist, however, was that we had to compete against other players. More precisely, we were presented with a graph in which all nodes start off uncolored. Players were then able to choose a small set of nodes with which to seed their team's color; if two or more teams chose the same seed node, that node would be left uncolored. After seeding, a series of "rounds", in the form of discrete time evolution, is played out until the pandemics converge to a stable state: at each round, a node is converted to a new color if a plurality of its neighbors are of that color. If the node is already colored, then it votes for its own color, but with one and a half votes rather than one. Once the pandemics have converged, the team with the most nodes bearing its color is declared the winner.
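To make the update rule concrete, here is a minimal sketch of how a single round of this deterministic cascade might be simulated in Python. It assumes an adjacency-list representation and that a tie (or a node with no colored neighbors) leaves the node's color unchanged; those tie-breaking details are my assumption, not something spelled out in the rules above.

```python
from collections import Counter

def run_round(adjacency, colors):
    """One synchronous round of the deterministic cascade.

    adjacency: dict mapping each node to a list of its neighbors
    colors:    dict mapping each node to a team color, or None if uncolored
    Returns the coloring after one round of updates.
    """
    new_colors = {}
    for node, neighbors in adjacency.items():
        votes = Counter()
        for nbr in neighbors:
            if colors[nbr] is not None:
                votes[colors[nbr]] += 1.0
        # A colored node also votes for its own color, with one and a half votes.
        if colors[node] is not None:
            votes[colors[node]] += 1.5
        if not votes:
            new_colors[node] = colors[node]  # no colored neighbors: nothing changes
            continue
        # Take the plurality color; on a tie, keep the current color (an assumption).
        top_score = max(votes.values())
        winners = [c for c, v in votes.items() if v == top_score]
        new_colors[node] = winners[0] if len(winners) == 1 else colors[node]
    return new_colors

def run_to_convergence(adjacency, colors, max_rounds=200):
    """Iterate rounds until the coloring stops changing (or a round cap is hit)."""
    for _ in range(max_rounds):
        updated = run_round(adjacency, colors)
        if updated == colors:
            break
        colors = updated
    return colors
```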

One detail, which was to become key in our victory, was the nature of scoring for a given day: on each day there were a variety of different graphs, and on each graph many iterations were run. For each iteration, scoring was not "winner take all"; a considerable number of points were awarded to the second-, third-, and fourth-place finishers as well.

A Class Four Brainstorm

Initially, two algorithms were proposed by the team. Neither algorithm was used in our final implementation, for reasons detailed below.

Algorithm One

Overview

From the standpoint of a virus spreading throughout the graph, the centrality measure which is most relevant is closeness centrality. Recall that this measures the average distance of a node to every other node in the graph: the closer it is on average, the higher its centrality. Therefore, unlike a seed node chosen for its high degree, which is only guaranteed to spread the pandemic quickly to neighboring nodes, a seed node chosen for its high closeness centrality is guaranteed to spread the pandemic to all nodes in the graph quickly, on average. The algorithm chooses the best node by this measure as a seed node, removes it from the graph, and repeats so long as we still need seed nodes. The reason for removing the node and recomputing is that we want our seed nodes to "spread" the pandemic over the entire graph: that is to say, we don't want them to overlap and therefore spread inefficiently.

Pseudocode
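The original pseudocode isn't reproduced in this post, but a sketch of the loop in Python (using networkx, which the report references elsewhere; the function name is mine) might look like this:

```python
import networkx as nx

def closeness_seeds(G, num_seeds):
    """Greedy seed selection by closeness centrality (Algorithm One).

    Repeatedly pick the node with the highest closeness centrality,
    then remove it so that later picks cover the rest of the graph.
    """
    H = G.copy()
    seeds = []
    for _ in range(num_seeds):
        centrality = nx.closeness_centrality(H)  # expensive: shortest paths from every node
        best = max(centrality, key=centrality.get)
        seeds.append(best)
        H.remove_node(best)
    return seeds
```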

Review

Very simply, the time complexity was too great. To compute this measure for a single node, the shortest paths from that node to every other node need to be computed: essentially a single-source shortest-path run (a BFS on these unweighted graphs, or Bellman-Ford more generally), and repeating that computation for every candidate node pushes the total work to quadratic or worse. For a graph with 10k nodes, this was simply infeasible, even with C++ graph libraries.

Algorithm Two

Overview

Network graphs often have one large connected component, but the graphs designed for the Pandemaniac game may not have been constructed to replicate this characteristic. Even within the largest component, the degree of connectedness, strong or weak, and the relative sizes of the remaining components leave some uncertainty about the graph's structure. This algorithm was designed with these uncertainties in mind.

The algorithm placed the graph into one of three categories depending on the structure and size of its largest component. The graph was labelled disconnected if the largest component contained less than a majority of the nodes. In a disconnected graph, the optimal strategy would be to identify connected subcomponents that together made up a majority of the graph and seed nodes into these subgraphs using a centrality measure.

If the graph was connected in this sense, with the largest component comprising more than three quarters of the nodes, then that component was classified as either a "continent" or "islands".

A continent was defined by dense connectivity. In a continent, nodes would be expected to form several connections with other nodes, and there should be a relatively high degree of clustering. Here the optimal strategy would be to seed the continent according to a centrality measure, but for the purpose of diversification some seed nodes would be used to stake a claim to any islands, as defined below.

An island-style connectivity was defined by smaller clusters of nodes connected to other clusters by a small number of edges. This type of construction meant that there would be lots of clusters which, even when colored, would have little impact on the other clusters, by nature of the structure of the graph and the plurality voting rule for neighbors. The optimal strategy for an islands network would be to grab the "choke points".

Pseudocode
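The original pseudocode is likewise not included here; the following Python sketch reconstructs the decision logic from the description above. The thresholds, the clustering cutoff, and the degree-centrality stand-in for the disconnected case are illustrative choices of mine, not the original ones (the original relied on load_centrality).

```python
import networkx as nx

def classify_and_seed(G, num_seeds):
    """Rough reconstruction of Algorithm Two's decision logic."""
    components = sorted(nx.connected_components(G), key=len, reverse=True)
    frac = len(components[0]) / G.number_of_nodes()

    if frac < 0.5:
        # "Disconnected": drop one seed into each of the largest subcomponents.
        seeds = []
        for comp in components[:num_seeds]:
            sub = G.subgraph(comp)
            centrality = nx.degree_centrality(sub)  # stand-in for load_centrality
            seeds.append(max(centrality, key=centrality.get))
        return seeds

    largest = G.subgraph(components[0])
    if nx.average_clustering(largest) > 0.1:
        # "Continent": dense and highly clustered; seed by load centrality.
        centrality = nx.load_centrality(largest)
        return sorted(centrality, key=centrality.get, reverse=True)[:num_seeds]

    # "Islands": sparse clusters joined by few edges; grab the choke points.
    return list(nx.articulation_points(largest))[:num_seeds]
```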

Review

The exhaustiveness of this algorithm proved to be its downfall given the three-minute runtime constraint. Like algorithm one, its runtime was simply too high. In particular, relying on load_centrality as the centrality measure was ill-suited to the time budget. While load centrality is an excellent measure for a pandemic spread, since it ranks nodes by how many shortest paths run through them, computing it requires examining a huge number of paths in the graph, which is exceptionally computationally heavy.

However, one benefit of this algorithm was that it set the stage for the final, and winning, algorithm by introducing a focus on articulation points.

Uncomplicated Algorithm, Simple Life

A new algorithm was needed, and quickly. The regular season was almost done and all of our ideas had essentially gone up in smoke. Re-examining the structure of the competition, we realized several key points. Firstly, everyone was constrained, as we were, in terms of centrality computations. This meant that they would all either be choosing high-degree nodes or nodes based upon approximate centrality computations. Secondly, the competition structure was such that many teams would be competing at once. This meant that going for nodes with high centrality would likely fail, given that all teams would probably be running a variation of this method. Thirdly, our algorithm would need to operate in essentially linear time, due to the enormous graph sizes we were expected to handle.

Proposed Pseudocode
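Again, the original pseudocode is not reproduced in this post. A Python sketch of the strategy as described in the review below, picking articulation points in order of degree, with a fallback to high-degree nodes that I've assumed for graphs with too few articulation points, could look like this:

```python
import networkx as nx

def articulation_seeds(G, num_seeds):
    """Final strategy, roughly: take articulation points, highest degree first.

    Articulation points come from a single DFS-based pass, so this runs in
    time linear in the size of the graph, even for very large inputs.
    """
    points = sorted(nx.articulation_points(G), key=G.degree, reverse=True)
    seeds = points[:num_seeds]
    # Assumed fallback: top up with the highest-degree nodes overall.
    if len(seeds) < num_seeds:
        for node in sorted(G.nodes, key=G.degree, reverse=True):
            if node not in seeds:
                seeds.append(node)
            if len(seeds) == num_seeds:
                break
    return seeds
```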

Review

First and foremost, it's important to note that determining the articulation points essentially comes down to a depth-first search, which runs in time linear in the number of nodes and edges of the graph, and is therefore much faster than our previous ideas.

Initially, we were hoping for our strategy to capture central “choke points” of the graph, since by definition articulation points are points which, when removed, disconnect the graph. The idea was that by controlling important pathways, we could hold off the spread of other viruses whilst maximizing the spread of our own (choosing articulation points of maximal degree). This is indeed what ended up happening in our victory, but not exactly as we had planned out. Rather than being concentrated near the center of the graph, most of the points chosen were near the edge. In retrospect this makes sense, since large graphs tend to consist of one very large highly connected component, which of course cannot be disconnected easily. Therefore the articulation points were usually points connected to outlying nodes, which only disconnected a very small portion of the graph.

So given that our nodes were usually on the outer edges of the graph, how did our strategy manage to work so well? Since we could only see the final results of every run, it is impossible to say for sure. One possible explanation is that, similar to one army surrounding another, we were able to control the outer "circle" of the graph and therefore attack inwards with greater surface area. This explanation is somewhat suspect, however, since the number of outlying nodes and the surface area in the graphical representation are two entirely different things. What is more likely is that we simply chose the best possible "non-obvious" nodes. Since the articulation points disconnect the graph, they are not completely on the fringes, but not at the center either. While everyone else was fighting for highly clustered nodes in the center of the graph, thereby reducing their effective number of seed nodes and coming into conflict with one another immediately, we simply hung around on the edges, expanding until we no longer could. Oftentimes, in fact, we would not win first place: another team would win the battle for the center, and we would only hang on to our edge nodes. But because the scores were averaged over many runs, and all the other teams essentially seemed to be using the same strategy, we managed to rack up a great deal of points by consistently attaining second or third place.

“If it ain’t broke don’t fix it.”

On the first day we won the first two stages, but came in 3rd place in the final stage. A natural question arises: How could we lose to teams we had previously beaten?

The answer is clearer than one would expect. When the number of teams competing was high, we were winning with ease. But as the numbers dwindled, so did our edge. A great part of our strategy involved allowing other teams to weaken each other to the point where they could no longer compete with us: if only a few teams were competing, the chances of this happening were reduced.

Thus, we decided to alter our implementation for the final round only. Rather than uploading all articulation points in this four-person contest, we would instead upload 50% articulation points and 50% nodes chosen uniformly at random from the set of the num_seeds highest-degree nodes in the graph. The idea was that, rather than putting all of our eggs in one basket, we would also compete for high-degree nodes and therefore launch a two-pronged assault on the graph.
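A minimal sketch of this blended selection, reusing the articulation-point helper idea from above (the function name, pool construction, and tie-handling are mine, not the original code):

```python
import random
import networkx as nx

def blended_seeds(G, num_seeds):
    """Final-round variant: half articulation points, half random high-degree picks."""
    half = num_seeds // 2
    points = sorted(nx.articulation_points(G), key=G.degree, reverse=True)
    seeds = points[:half]
    # Draw the rest uniformly at random from the num_seeds highest-degree nodes.
    top_degree = sorted(G.nodes, key=G.degree, reverse=True)[:num_seeds]
    pool = [n for n in top_degree if n not in seeds]
    seeds += random.sample(pool, min(num_seeds - len(seeds), len(pool)))
    return seeds
```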

This strategy did not turn out in our favor, as we got fourth place in the final round. The reason for this, we conjecture, is simply that we were spreading ourselves too thin and competing in areas in which we were hopelessly outsmarted: not only did we quickly lose the central nodes for which we competed, but our articulation point strategy became less effective as well, because we had fewer seed nodes to solidify our outer stronghold.

A Winning Algorithm

Given the poor performance of our modified algorithm in the previous final round, we decided to revert to our original, simple, deterministic algorithm and go down fighting. We had no chance to compete for the center points: the other top teams were using far more sophisticated algorithms than we could muster. And, as it turned out, our reversion worked. Because everyone else optimized their algorithms as highly as possible to go for the win, they optimized their way right into each other, and to their dooms. We simply won the battle for nodes nobody else wanted, every time.

Conclusion

As it turns out, less was more. Although our algorithm was undoubtedly not the most sophisticated, it was the most effective because it played to the rules of the game and, more importantly, to other players' strategies. There is no doubt amongst the team that had we played even some of the last-place finishers 1v1, we would have lost. But in large groups of players using the same strategy, ours came out on top.
