Link Prediction on Social Network in Field of Breast Cancer on Twitter

--

In the last article “Social Network Visualization in Field of Breast Cancer on Twitter”, we talked about how to generate social network from tweets extracted by applying Twitter API. At first we found out the nodes of Twitter users and edges among them, and then calculated PageRank for each node. The nodes with high PageRank were plotted on network graph. In this article, we will continue to research on what can be inferred, in another word, the attributes of this network.

When look at the network graph below, you may want to ask “is it possible to predict that if Adam and David will build connection with Sophia?”. As we can see, both Maya and Maria have connection with Sophia. Adam linked with Maria and David linked with Maya. We could say that at a certain time (>t) connections may build through either Maya or Maria. This situation in general is Link Prediction, which is used to predict future possible links in the network.

Why we need to do link prediction? Considering an E-commerce company advertises its product online, link prediction can provide potential valuable customers to target. On the other hand, link prediction gives researchers a novel angle to see how social networks evolve in the future.

Data Processing

The numbers of nodes and edges are 1,991 and 2,066 respectively from network in last article. We can draw a basic graph to show the original network. Blue points are nodes and black lines are edges.

Basic Network

In order to applying models on our data, we need to figure out what dependent variable we can extract from the network above. The graph below is a simple example from above network. If we only regard connection as directly connected, we can easily know that AC is connected but AB is not connected. Our objective here is to predict if there is a connection between A and B. At time t, AB, AD and CB are not connected.

Graph at time t

At time t+n (n>0), from the graph below we know that A somehow connected with B. Therefore, we can assign value 1 to AB, and assign value 0 to AD and CB.

Graph at time t+n(n>0)

What if we only have a graph at time t? How could we know which edges can be assigned value 1? The trick is that if we go back at time t-n (n>0), we could find out the links that will generate at time t. The graph below is at time t-n. Link between B and D is an edge that generated in graph at time t above. By applying this strategy, we can get rid of the edges that won’t cause radical change of a network, and assign them as value 1.

Graph at time t-n(n>0)

What is radical change of a network? Basically, after removing a link, the network has no isolated node or sub-network. For example, CD cannot be removed because that will cause two sub-networks. Similarly, AC cannot be removed as node A will be isolated.

two sub-networks

Now we know the concept to assign our target variable. We need to find out and assign those unconnected nodes to value 0. It is used to represent whether pairs of vertices are adjacent in the graph. The graph below is an example of adjacency matrix. If AB is connected, the value of AB in adjacency matrix is value 1. Otherwise the value is 0. To calculate the unconnected nodes, we can easily calculate the nodes that do not connect to node A, in other words, the value 0 in first column. Then we can remove node A (first column and row), and record value 0 in the first column of the new matrix, which are the unconnected nodes to node B.

After going through the whole matrix, we get 13,023 pairs of unconnected nodes in our network. They are the value 0 for our target variable. Next, we need to find out target value 1. As what we discussed above, the number of connected components and number of nodes must keep the same to identify if an edge is removable. Our initial number of components is 410 and number of nodes is 1,991. By applying these two conditions, we found 485 edges that can be removed, and they are all assigned to value 1. So far we have node pairs that represent edges and value 1 or 0 that present if they are removable.

Feature extraction

Have target variable is not enough. We still need features to complete our dataset for model testing. Why we have to do this? Because the nodes and edges we have are graph type data, which, by nature, are discrete. Feature extraction process is used to transfer discrete data to continuous data. The algorithm that gains most popularity is DeepWalk by Perozzi et al. It is one of graph embedding techniques that uses walks, which enable the traversal of a graph by moving from one node to another if they are connected to a common edge. DeepWalk has two stages. In first stage it discovers neighborhood in the network with random walks. In second stage, it applies SkpGram algorithm to update the node embeddings by gradient descent to maximize the probability of the neighbors of a node. More details are in this post.

In this article, we will use Node2vec algorithm. It is very similar to DeepWalk but not transductive. Node2vec assigns a bias variable to each walk to prioritize the breadth and depth, which are called breadth-first-search (BFS) and depth-first-search (DFS) procedures. The graph below is an example of BFS and DFS. Hence, the walk is no longer random, it is influenced by probabilities.

After training node2vec model on the graph data, we obtained features data that can be used for the final model training.

Model Building and Validating

This section involves pure machine learning techniques. Multiple models can be applied on our dataset. Dataset is split into two parts — 75% train data and 25% test data. In order to validate the performance of our models, we can calculate the model accuracies and plot ROC curves.

Random Forest Model:

accuracy = 0.9869

Logistic Regression Model:

accuracy = 0.9766

Neural Network Model:

accuracy = 0.9769

Limitation

There are bias in our dataset because it is highly unbalanced. There are only 485 observations of class 1 but 13,023 observations of class 0. Besides, considering there are only 2,066 edges in 1,991 nodes, our network is pretty sparse. Connections of this network are not tight, which results to 410 isolated communities.

Conclusion

Link prediction is a burgeoning field that many data scientists and network analysts invest effort on. From current network connections we are able to predict what future connections will generate. This prediction gives us a chance to shift our concern on network in advance. However, link prediction also brings ethical issue to researchers. Private connections or potential connections can be exposed by link prediction algorithms, a situation that breaks private information security in online society. More researches should be done to rich the online network while hide private information.

--

--