Can we predict who is going to join the global warming community on Twitter?

Social networks are ever-evolving. On Twitter, people make new connections and drop old ones, new users join, and existing users quit almost every day. Like all other social networks, the networks on Twitter never stay stable: their nodes and edges are constantly changing. Therefore, the decisions we make today based on the network we see may not be viable tomorrow. If we can predict the changes happening in the network, we gain time to adjust our decisions and plans accordingly.

When organizing a campaign related to global warming, we would like to find the communities in the network that can help move the campaign forward, or the influencers who can spread information faster. If we can predict which links will form in the future, we can detect new network structures early and plan our communication accordingly to spread information about upcoming events more effectively. Hence, in this study, I have built a machine learning model that predicts relationships between nodes, which lets us infer new edges that are likely to form in the future.

The first step in the study was to collect tweets related to global warming using the Twitter API. Tweets with the hashtags #globalwarming, #GlobalWarming, #ClimateChange, and #climatechange were collected. After this, the relationships between Twitter users were extracted from their interactions on Twitter. Four main types of interaction between users were used: retweets, mentions, quotes, and replies. The code for extracting these relationships from the tweet data is provided in my previous post. The connected nodes were then extracted using the following code:

# Extract the interactions between users in a single tweet
def interactions(tweet):
    # Get the tweeter (the user who posted the tweet)
    tweeter = users(tweet)
    if tweeter[0] is None:
        return (None, None), []
    # Collect every user this tweet interacts with (reply, retweet, quote, mention)
    interactions = set()
    interactions.add(replied(tweet))
    interactions.add(retweet(tweet))
    interactions.add(quoted(tweet))
    interactions.update(mentions(tweet))
    # Remove self-interactions and empty entries
    interactions.discard(tweeter)
    interactions.discard((None, None))
    return tweeter, list(interactions)

# Extract pairs of nodes that are connected by an interaction
def connections():
    connected_nodes = []
    for tweet in tweets:
        tweeter, interactors = interactions(tweet)
        tweeter_id, tweeter_name = tweeter
        for interact in interactors:
            interact_id, interact_name = interact
            connected_nodes.append([tweeter_id, interact_id])
    return connected_nodes
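
The helper functions users, replied, retweet, quoted, and mentions referenced above come from the previous post. For readers who do not have it at hand, here is a minimal sketch of what they might look like, assuming each tweet is a dictionary in the standard Twitter API v1.1 format; the field access shown is illustrative, not the exact code from that post:

# Hypothetical versions of the helpers from the previous post,
# assuming each tweet is a Twitter API v1.1 tweet dictionary.
def users(tweet):
    # (id, screen_name) of the user who posted the tweet
    user = tweet.get('user', {})
    return (user.get('id'), user.get('screen_name'))

def replied(tweet):
    # (id, screen_name) of the user this tweet replies to, if any
    return (tweet.get('in_reply_to_user_id'), tweet.get('in_reply_to_screen_name'))

def retweet(tweet):
    # Original author of a retweeted status, if this tweet is a retweet
    if 'retweeted_status' in tweet:
        return users(tweet['retweeted_status'])
    return (None, None)

def quoted(tweet):
    # Original author of a quoted status, if this tweet quotes another tweet
    if 'quoted_status' in tweet:
        return users(tweet['quoted_status'])
    return (None, None)

def mentions(tweet):
    # All users mentioned in the tweet
    return [(m['id'], m['screen_name'])
            for m in tweet.get('entities', {}).get('user_mentions', [])]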

After the connected nodes were extracted, the list was converted to a data frame for further processing. Each row of the data frame consists of the user IDs, in two columns, of a pair of users who interacted on Twitter about global warming through any of the four interaction types mentioned above.

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

# Extract the connected nodes and convert the data into a dataframe
connected_nodes = connections()
df = pd.DataFrame(connected_nodes, columns=['node1', 'node2'])

# Replace the user IDs with different numbers (smallest ID -> 0, next -> 1, ...)
node_list = [node for sublist in connected_nodes for node in sublist]
node_list = list(dict.fromkeys(node_list))
node_list.sort()
result = {key: index for index, key in enumerate(node_list)}
df = df.replace({"node1": result, "node2": result})

# Build the graph from the nodes and edges and plot it
Graph = nx.from_pandas_edgelist(df, "node1", "node2", create_using=nx.Graph())
sp = nx.spring_layout(Graph)
plt.figure(figsize=(8, 8))
nx.draw_networkx(Graph, pos=sp, with_labels=True, node_size=30, width=1,
                 font_size=5, font_color='yellow', node_color='steelblue',
                 edge_color='red')
plt.show()

The undirected graph formed using the user IDs as nodes and their interactions as edges shows that the links in this graph are sparse: most of the nodes are not connected to each other. In this sort of graph, if we can predict which links will be created in the coming days, we get a better idea of how information will flow through the network in the future.
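
As a quick sanity check on this sparsity claim, the graph density and component structure can be inspected directly with networkx; a minimal sketch using the Graph object built above:

# Quantify how sparse the interaction graph is
print("Nodes:", Graph.number_of_nodes())
print("Edges:", Graph.number_of_edges())
print("Density:", nx.density(Graph))  # close to 0 for a sparse graph
print("Connected components:", nx.number_connected_components(Graph))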

Therefore, the next step was to build a machine learning model that can predict whether a link exists between any two nodes. Since we plan to use supervised learning, we need samples of node pairs both with and without links.

The node pairs without links were selected, using the following code, such that a path exists between the two nodes and the shortest path length between them is no more than 2. It was found that 7,968 pairs satisfied this condition.

# Adjacency matrix of the interaction graph, rows/columns ordered by node label
adj_matrix = nx.to_numpy_array(Graph, nodelist=sorted(Graph.nodes()))

# Extract node pairs that do not have links but are at most two hops apart
unconnected_nodes = []
offset = 0
for i in range(adj_matrix.shape[0]):
    for j in range(offset, adj_matrix.shape[1]):
        if i != j:
            if nx.has_path(Graph, source=i, target=j):
                if nx.shortest_path_length(Graph, source=i, target=j) <= 2:
                    if adj_matrix[i, j] == 0:
                        # i and j are already the relabeled (privacy-preserving) node IDs
                        unconnected_nodes.append([i, j])
    offset = offset + 1

# Create a data frame with the unconnected pairs and set the target value to 0
unconnected_data = pd.DataFrame(unconnected_nodes, columns=['node1', 'node2'])
unconnected_data['Link'] = 0

The target variable ‘Link’ for the unconnected pairs was set to 0 because there are no links between them. Note that the user IDs had already been replaced with different numbers to preserve privacy: the smallest ID was replaced with 0, the second smallest with 1, and so on.

The Dataframe with unconnected pairs of nodes

The node pairs that had interactions between them are the linked pairs. Their target variable ‘Link’ was set to 1.

# Connected pairs live in df; set their target value to 1
df['Link'] = 1
# Combine the connected and unconnected pairs into one dataset
data = pd.concat([unconnected_data, df[['node1', 'node2', 'Link']]], ignore_index=True)

The DataFrame with connected pairs of nodes

The two types of data were then combined. Upon investigation, it was found that we have 7,968 unconnected pairs and 1,208 connected pairs, so the dataset is imbalanced.
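
One quick way to confirm this imbalance, assuming the combined data frame built above:

# Count samples per class and the share of positive (linked) pairs
print(data['Link'].value_counts())
print("Positive share:", data['Link'].mean())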

Once we have the dataset, the next step is to extract features from the graph. The node2vec algorithm was used for this purpose. node2vec is an algorithmic framework for learning low-dimensional feature representations of the nodes in a graph; these features can then be used for various machine learning tasks.

from node2vec import Node2Vec

Graph_data = nx.from_pandas_edgelist(data, "node1", "node2", create_using=nx.Graph())
node2vec = Node2Vec(Graph_data, dimensions=20, walk_length=16, num_walks=100)
model = node2vec.fit(window=10, min_count=1)
# Feature vector for each pair: the sum of the two node embeddings
data_with_features = [(model.wv[str(i)] + model.wv[str(j)]) for i, j in zip(data['node1'], data['node2'])]
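
Here the two node embeddings are simply added to obtain an edge feature. Other binary operators from the node2vec paper, such as the element-wise (Hadamard) product, are also commonly used; a minimal sketch of that variant, reusing the same model object:

import numpy as np

# Hadamard (element-wise) product as an alternative edge feature
data_with_features_hadamard = [np.multiply(model.wv[str(i)], model.wv[str(j)])
                               for i, j in zip(data['node1'], data['node2'])]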

This created 20 features for each pair of nodes in the dataset. After that, the dataset was split into training and test sets for validation: 20% of the data was kept in the test set and 80% in the training set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
xtrain, xtest, ytrain, ytest = train_test_split(
    np.array(data_with_features), data['Link'], test_size=0.2, random_state=35)
svc = SVC(random_state=11)
svc.fit(xtrain, ytrain)
ypred = svc.predict(xtest)
roc_auc_score(ytest, ypred)

The support vector machine classifier was then fit on the training set and evaluated on the test set. Since the dataset is imbalanced, the AUROC score and the average precision score were used for evaluation. Both are standard metrics for assessing the performance of a classification model and range from 0 to 1, with 0 being the lowest performance and 1 the highest. The AUROC score for the SVM classifier was 0.72 and the average precision score was 0.75.
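
The average precision score and the confusion matrix discussed below can be computed from the same predictions; a minimal sketch using scikit-learn:

from sklearn.metrics import average_precision_score, confusion_matrix

# Average precision from the hard predictions (as with the AUROC above)
print("Average precision:", average_precision_score(ytest, ypred))
# Rows = actual class (0 = no link, 1 = link), columns = predicted class
print(confusion_matrix(ytest, ypred))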

The confusion matrix for the classifier’s result

From the confusion matrix of the classifier shown above, we can see that the model infers the absence of a link between two nodes more accurately than the presence of one. The likely reasons are the limited data and the sparse network: we had very few samples of linked pairs.

Ethical issues and limitations

As we can see from the confusion matrix above, the true positive rate is higher for the ‘0’ class than for the ‘1’ class. This likely occurred because we had a highly imbalanced dataset, with about 87% unconnected pairs. The dataset itself was also very small, with only 9,176 samples. If more related data were collected from Twitter, the performance could be improved.

Though Twitter’s protocol for collecting tweets was followed, the users did not give direct consent to having the social relationships between them studied. Therefore, all user IDs were replaced with different numbers to protect their privacy.

People who want to maintain their privacy may not like others inferring their social behavior online. Such inference can also be exploited for unsolicited advertising and product recommendations. However, if the inference is used to spread positive information, such as raising awareness about global warming, then it adds value to society. In my opinion, this kind of inference from social networks is ethical if the reason is justifiable.
