Social Network Visualization in the Field of Breast Cancer on Twitter

Social Network Analysis (SNA) has existed for decades, but only in recent years have companies and organizations begun to appreciate the insights it offers. The rising popularity of SNA is closely associated with the explosion of online social media. Using mathematical tools to systematically understand networks, SNA identifies the most important influencers in a network, subgroups of people who are tightly connected, and the overall structure of the network. Network data is represented as nodes (people, accounts, etc.) and edges (friendship, partnership, etc.). Many interesting insights can be found by visualizing a social network.

The objective of this article is to visualize the interactions of Twitter users who share opinions in the field of breast cancer. As October 2020 is Breast Cancer Awareness Month, the topic has been gaining popularity recently. Other than skin cancer, breast cancer is the most common cancer among American women. Spreading awareness of breast cancer can raise concern among the demographic groups at risk of the disease.

Data Collection

To download Twitter data related to breast cancer, we can use tweepy to connect to the Twitter API. The tweepy documentation explains how to stream tweets. Basically, we need to obtain API authorization first and then create a stream listener.

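import tweepy

# creating a stream listener (tweepy 3.x API)
class MyStreamListener(tweepy.StreamListener):

    def on_status(self, status):
        # print the text of each incoming tweet
        print(status.text)

# creating a stream; `api` is the authorized tweepy.API object obtained earlier
myStreamListener = MyStreamListener()
myStream = tweepy.Stream(auth=api.auth, listener=myStreamListener)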

We can search for a specific topic using this stream. For the topic of breast cancer, hashtags such as ‘#BreastCancer’, ‘#bcsm’, ‘#MetastaticBreastCancer’ and ‘#BreastCancerAwarenessMonth’ are set as keywords.

# mySaveStream is assumed to be a stream created like myStream above (its definition is not shown)
mySaveStream.filter(track=['#BreastCancer','#MetastaticBreastCancer','#bcsm','#BreastCancerAwarenessMonth'])
mySaveStream.disconnect()

In total, 2,000 tweets and their related features were extracted through tweepy and written to a text file. Because the file contains raw data in JSON format, it is very hard to read and use. We must parse the raw data into a usable format.

An example of the raw JSON file

Data Processing

Python’s json module can easily load a JSON file into a Python dictionary. There are several features of a tweet that we need for building the network later:

id and name of user who posted this tweet

id and name of user who posted the original tweet if this tweet is a retweet

id and name of user who posted the original tweet if this tweet is a reply

id and name of user who posted the original tweet if this tweet is a quote

id and name of users who are mentioned in this tweet

Let’s check what attributes are included in each tweet:

tweet.keys()
dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])

Among the attributes above, ‘user’, ‘retweeted_status’ and ‘entities’ are themselves dictionaries. To save space, we don’t expand each dictionary here. To sum up, the user id and name of a tweet come from the ‘user’ dictionary, the retweet source user id and name come from the ‘retweeted_status’ dictionary, the reply source user id and name come from ‘in_reply_to_user_id’ and ‘in_reply_to_screen_name’, and the mentioned user ids and names come from the ‘entities’ dictionary.
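The mapping described above can be sketched as a small function. This is a minimal sketch rather than the code used for this article: the helper name `extract_interactions` and the sample tweet are hypothetical, `screen_name` is assumed to be the "name" that is kept, and the `quoted_status` and `user_mentions` keys are taken from Twitter's v1.1 tweet object rather than from the key listing shown earlier.

```python
import json


def extract_interactions(tweet):
    """Return (user, interactions) for one tweet dictionary.

    Each user is an (id, screen_name) tuple. interactions is a set, so a
    user who is both retweeted and mentioned appears only once.
    """
    user = (tweet["user"]["id"], tweet["user"]["screen_name"])
    interactions = set()

    # retweet source lives in the nested 'retweeted_status' dictionary
    if "retweeted_status" in tweet:
        src = tweet["retweeted_status"]["user"]
        interactions.add((src["id"], src["screen_name"]))

    # quote source (assumed key: 'quoted_status')
    if "quoted_status" in tweet:
        src = tweet["quoted_status"]["user"]
        interactions.add((src["id"], src["screen_name"]))

    # reply source is stored directly on the tweet
    if tweet.get("in_reply_to_user_id") is not None:
        interactions.add((tweet["in_reply_to_user_id"],
                          tweet["in_reply_to_screen_name"]))

    # mentioned users (assumed key: entities['user_mentions'])
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        interactions.add((mention["id"], mention["screen_name"]))

    return user, interactions


# hypothetical raw line, shaped like one streamed tweet
raw = json.dumps({
    "user": {"id": 1, "screen_name": "alice"},
    "in_reply_to_user_id": None,
    "in_reply_to_screen_name": None,
    "retweeted_status": {"user": {"id": 2, "screen_name": "bob"}},
    "entities": {"user_mentions": [{"id": 2, "screen_name": "bob"},
                                   {"id": 3, "screen_name": "carol"}]},
})
user, interactions = extract_interactions(json.loads(raw))
print(user, sorted(interactions))
```

In this sample, "bob" is both retweeted and mentioned, yet ends up in the result only once.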

A network is made up of two components: nodes and edges. In this network, the nodes are Twitter users who posted tweets on the topic of breast cancer, and the edges are interactions between any two users. Interactions are defined by Twitter functions such as retweet, reply, quote and mention. After extracting the interacted user ids and names for each tweet, we can build a list that pairs each user id with its corresponding interacted user ids; these pairs are the edges mentioned above.

Because the interacted user ids and names extracted from a single tweet may contain duplicates, Python’s set() can be used to remove them. After adding all the interacted user ids and names for a specific tweet, we have a list of interactions and the user those interactions belong to. In the example below, both ‘morgossweets’ and ‘drinkelement_’ have interactions with user ‘Kaity_Janee’, but this does not necessarily mean that ‘morgossweets’ has an interaction with ‘drinkelement_’.

user
(2960305339, 'Kaity_Janee')
interactions
[(927499828799655936, 'morgossweets'), (1260208505665003522, 'drinkelement_')]
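Turning such a record into edges is then a matter of pairing the user id with each interacted user id. The following is a small sketch using the example values above; the variable names are illustrative, and one duplicate entry has been added to show the effect of set().

```python
# the example record shown above
user = (2960305339, "Kaity_Janee")
interactions = [
    (927499828799655936, "morgossweets"),
    (1260208505665003522, "drinkelement_"),
    (927499828799655936, "morgossweets"),  # duplicate added for illustration
]

# set() drops the duplicate; sorted() only makes the order reproducible
unique_interactions = sorted(set(interactions))

# one directed edge per interaction: (user id, interacted user id)
edges = [(user[0], other_id) for other_id, _name in unique_interactions]
print(edges)
```

Note that no edge is created between the two interacted users themselves.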

Network Building

NetworkX, one of many popular network analysis packages, is used here to build and draw complex networks. It provides graph features such as nodes, edges, weighted edges and degree. The job here is to convert the user and interaction information into NetworkX nodes and edges for drawing the graph.

import networkx as nx

# directed graph: each edge points from the posting user to the interacted user
G = nx.DiGraph()
G.add_edge(user_id, interactions_id)

NetworkX automatically creates new nodes if they are not already in the network. After generating nodes and edges for every tweet, we have 1,991 nodes and 2,066 edges. At this point we can also introduce the concept of degree. The degree of a node is the number of edges adjacent to it. Similarly, in_degree and out_degree are the numbers of edges pointing into and out of the node, respectively. We can find the user with the highest degree in this network by sorting the degree results.

(26308916, 'BreastCancerNow', 110)

User ‘BreastCancerNow’ has a degree of 110, which is the highest among all users.
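To make the degree definitions concrete, here is a plain-Python sketch that counts them from a directed edge list. The toy edges are hypothetical and stand in for the edges of G; with NetworkX itself, the same ranking can be obtained by sorting `G.degree`.

```python
from collections import Counter

# hypothetical directed edges (source, target), standing in for the edges of G
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d"), ("a", "b")]

out_degree = Counter(src for src, _dst in edges)  # edges pointing out of a node
in_degree = Counter(dst for _src, dst in edges)   # edges pointing into a node
degree = in_degree + out_degree                   # total degree in a directed graph

# sort nodes from highest to lowest degree
ranking = sorted(degree.items(), key=lambda item: item[1], reverse=True)
print(ranking[0])
```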

Now that we have nodes, edges and degrees, how do we define the most influential users in this network? Google’s founders developed PageRank, a system for ranking web pages. The algorithm measures the importance of a webpage by analyzing the quantity and quality of the links that point to it. A score is assigned to each webpage so that pages can be ranked from high to low. Here, the webpages correspond to the nodes of the network, in other words, the Twitter users. NetworkX provides a pagerank function that returns the PageRank of every node in the graph.

nx.pagerank(G)
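To illustrate the idea behind the score, here is a simplified power-iteration sketch of PageRank in plain Python. It is not NetworkX's implementation: the function name `pagerank_sketch` and the toy graph are hypothetical, edge weights are ignored, and a fixed number of iterations replaces a convergence test. The damping factor of 0.85 matches the default `alpha` of `nx.pagerank`.

```python
def pagerank_sketch(edges, alpha=0.85, iterations=100):
    """Simplified PageRank by power iteration on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # rank held by nodes without outgoing edges is spread over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - alpha) / n + alpha * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # each node passes an equal share of its rank along its edges
                new_rank[dst] += alpha * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# hypothetical toy graph: three users all point to 'hub'
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a")]
scores = pagerank_sketch(edges)
top_user = max(scores, key=scores.get)
print(top_user, scores)
```

A node scores highly when many nodes point to it, and more so when those nodes themselves score highly, which is why 'a' (pointed to by 'hub') outranks 'b' and 'c' in the toy graph.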

If we plot the PageRank values from high to low, we get the graph below:

Nodes ranked below the top 750 have very little influence on this network. Therefore, we draw only the 750 nodes with the highest PageRank values and highlight the top 20, because they are the most important users in this network. The graph below shows the strong interactions among users.
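The selection step can be sketched as follows. The scores, the helper name `top_nodes` and the colours are hypothetical; the small cut-offs stand in for the 750 and 20 used in this article.

```python
# hypothetical PageRank scores, shaped like the dict returned by nx.pagerank(G)
pagerank = {"a": 0.05, "b": 0.40, "c": 0.10, "d": 0.30, "e": 0.15}


def top_nodes(scores, k):
    """Return the k nodes with the highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


keep = top_nodes(pagerank, 4)            # the article keeps 750
highlight = set(top_nodes(pagerank, 2))  # the article highlights 20

# a different colour marks the highlighted nodes when drawing
colors = ["red" if node in highlight else "lightblue" for node in keep]
print(keep, colors)
```

With NetworkX, `G.subgraph(keep)` would then give the reduced graph to draw, and `colors` could be passed as `node_color`.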

Community Detection

Identifying meaningful communities in a network is a hard problem, especially when the network is large. Community detection is interesting because the subgroups it finds may reveal the underlying structure of the social network. In addition, community detection gives us a dynamic overview of people’s opinions in the social network. The python-louvain package provides a community module that can be used to generate a community graph.

import community as community_louvain  # from the python-louvain package
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import networkx as nx

# convert G to an undirected graph
G = G.to_undirected()
# compute the best partition
partition = community_louvain.best_partition(G)

# draw the graph
pos = nx.spring_layout(G)
# color the nodes according to their partition
cmap = cm.get_cmap('viridis', max(partition.values()) + 1)
nx.draw_networkx_nodes(G, pos, partition.keys(), node_size=20,
                       cmap=cmap, node_color=list(partition.values()))
nx.draw_networkx_edges(G, pos, alpha=0.5)
plt.show()
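The best_partition function applies the Louvain method, which searches for a partition with high modularity. As a rough illustration of what is being optimized, the sketch below computes the modularity of a given partition for an undirected, unweighted graph. The function and the toy graph are hypothetical, and this is only the score, not the Louvain search itself.

```python
def modularity(edges, partition):
    """Modularity of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; partition maps node -> community id.
    """
    m = len(edges)
    internal = {}    # edges with both ends inside a community
    degree_sum = {}  # total degree of the nodes in a community
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # per community: share of internal edges minus the share expected by chance
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# hypothetical toy graph: two triangles joined by a single edge
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
two_groups = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}
one_group = {node: 0 for node in two_groups}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than putting every node in one community. python-louvain exposes the same quantity as `community_louvain.modularity(partition, G)`.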

Limitations

In this study, several limitations affect the results. First of all, the four interaction types have different effects on the social network. A ‘quote’ is a stronger interaction than a ‘retweet’, and a ‘mention’ is weaker than a ‘reply’. These strong and weak relationships can be represented by giving weights to the edges. In addition, the sample for this specific topic may not be large enough to generate meaningful communities. Finally, there are many other network visualization tools, such as Gephi and NodeXL, which can build networks that are more attractive and intuitive.
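One way the edge weighting could look is sketched below. Only the ordering (quote above retweet, reply above mention) comes from the text; the numeric weights, the sample interactions and the choice to sum the weights per pair of users are placeholder assumptions.

```python
from collections import defaultdict

# hypothetical weights: only the ordering (quote > retweet, reply > mention)
# comes from the text; the numbers themselves are placeholders
INTERACTION_WEIGHT = {"quote": 3.0, "retweet": 2.0, "reply": 2.0, "mention": 1.0}

# hypothetical interactions: (user id, interacted user id, interaction type)
interactions = [
    (1, 2, "retweet"),
    (1, 2, "mention"),
    (3, 2, "quote"),
]

# sum the weights of all interactions between the same ordered pair of users
edge_weight = defaultdict(float)
for src, dst, kind in interactions:
    edge_weight[(src, dst)] += INTERACTION_WEIGHT[kind]

print(dict(edge_weight))
```

Each entry could then be added with `G.add_edge(src, dst, weight=w)`; `nx.pagerank` reads the `weight` edge attribute by default.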
