Graph Analysis, Using PageRank and NetworkX for Twitter Account

Published in Web Mining [IS688, Spring 2021], Feb 26, 2021, 8 min read

In my last Medium article, I analyzed Elon Musk’s Twitter information.

(https://medium.com/web-mining-is688-spring-2021/using-tweepy-to-retrieve-elon-musks-tweets-and-analysis-d2b06e8cb780)

In this article, I will analyze and visualize my own Twitter account (https://twitter.com/YTshining) using PageRank and the NetworkX package in Python. Since I follow 131 accounts, I also want to find out how those accounts are related to each other and how PageRank can be applied to them.

Before going further into NetworkX and PageRank, let me review what the nodes and edges of a graph are. In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.

What are graphs and nodes in social media networks?

*Components of a graph (undirected in this one) are nodes and edges*

A graph in this context is made up of vertices that are connected by edges. Point-and-line diagrams are used to represent graphs, which in turn represent the real social network. In these diagrams, nodes are drawn as circles. In social network analysis, nodes (actors) are often either individuals or organizations, but in wider applications of the network imagery in the physical and biological sciences, nodes can represent anything that links up to other similar entities in a larger system. These could include power generation stations and homes, servers and computers, animals in an ecosystem, or towns: really anything of substance on which we can define some kind of relation, or between which some type of content can be said to be exchanged.

*Visualization of a paper citation network*

Edges represent the presence of a connection or relationship between two nodes. In social network analysis, these are usually some type of social tie. In other words, the connections are relationships between nodes, and edges in a graph are meant to represent them.

In graph theory, edges are best thought of as a collection of pairs of nodes, where the two members of the pair are the nodes involved in the focal relationship. So if node A is related to node B via some relationship R, then AB is an edge in the relevant graph.
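As a small illustration of this definition, the sketch below stores a graph as a plain Python set of node pairs. The node names A, B, and C and the helper function neighbors are made up for the example; they are not part of any library.

```python
# A tiny undirected graph stored as a set of node pairs (edges).
# The node names are hypothetical, chosen only for illustration.
nodes = {"A", "B", "C"}
edges = {("A", "B"), ("B", "C")}


def neighbors(node, edges):
    """Return every node that shares an edge with `node`."""
    result = set()
    for u, v in edges:
        if u == node:
            result.add(v)
        elif v == node:
            result.add(u)
    return result


print(sorted(neighbors("B", edges)))  # ['A', 'C']
```

Here the pair ("A", "B") plays the role of the edge AB from the text: A and B are the two members of the relationship.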

What are NetworkX and PageRank? How can they be used on a social media platform?

NetworkX is a graph theory and complex network modeling tool written in Python. It has commonly used graph and complex network analysis algorithms built in, so it can be used for complex network data analysis and simulation modeling. NetworkX supports the creation of simple undirected graphs, directed graphs, and multigraphs. Many standard graph theory algorithms are built in, and the nodes can be arbitrary data. NetworkX uses graphs as its basic data structure. Graphs can be generated by programs, loaded from online data sources, or read from files and databases.
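The three graph types can be sketched in a few lines. The account names below are invented for illustration, and the example only uses basic NetworkX calls (Graph, DiGraph, MultiGraph, add_edge, add_node, has_edge, number_of_edges).

```python
import networkx as nx

# Undirected graph: an edge has no direction.
G = nx.Graph()
G.add_edge("alice", "bob")

# Directed graph: the edge goes from the first node to the second.
D = nx.DiGraph()
D.add_edge("alice", "bob")

# Multigraph: the same pair of nodes can be joined by several edges.
M = nx.MultiGraph()
M.add_edge("alice", "bob")
M.add_edge("alice", "bob")

# Nodes can be any hashable Python object and can carry attributes.
G.add_node(42, label="an integer node")

print(G.has_edge("bob", "alice"))         # True: direction does not matter
print(D.has_edge("bob", "alice"))         # False: only alice -> bob exists
print(M.number_of_edges("alice", "bob"))  # 2
```

For a follow relationship, which has a direction, DiGraph is the closest match; Graph is enough when only the existence of a connection matters.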

So what does PageRank mean in a social network? Twitter, a widely used micro-blogging and social media platform, provides a billion tweets from many users. Each tweet carries its own topic, and the tweet itself can be retweeted by other users. Social network analysis is needed to trace a topic back to its original issuer. One way to do this is to represent a topic-specific Twitter network as a graph and apply a graph-based ranking algorithm. One such algorithm is PageRank, which scores each node by the links pointing to it: a node's score grows with its in-degree, and each incoming link counts in inverse proportion to the out-degree of the node it comes from. In one proposed methodology, a network graph is built from Twitter in which each user is a node and each tweet-retweet relation is a directed edge, so that the user who retweets points to the user who wrote the original tweet. From this graph, each node’s PageRank is calculated along with other node properties such as centrality, degree, number of followers, and average retweet time. The reported result is that a node's PageRank score is directly proportional to its closeness centrality and in-degree. However, ranking by PageRank, by closeness centrality, and by in-degree yields different orderings.
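For reference, the commonly cited form of the PageRank score makes the dependence on in-degree and out-degree explicit. The symbols are notation introduced here, not taken from the article's code: N is the number of nodes, d is the damping factor (NetworkX calls it alpha and defaults to 0.85), B(u) is the set of nodes that point to u, and L(v) is the out-degree of v.

```latex
PR(u) = \frac{1 - d}{N} + d \sum_{v \in B(u)} \frac{PR(v)}{L(v)}
```

Each node v that points to u passes on its own score divided by its out-degree, so a link from a node with few outgoing links counts for more than a link from a node that points to many others.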

I will use my own Twitter account as an example and extract all the data I need using the Twitter API and Tweepy.

Data preparation and cleaning

To get my network, I use Tweepy, a Python library that taps into Twitter’s API. The ‘import tweepy’ part is the same as what I discussed in the Elon Musk article.

import tweepy
import csv
from tqdm import tqdm
import os

Since some of the accounts I follow on Twitter themselves follow a very large number of accounts, I represent each account by its user ID instead of its screen name. For example, instead of storing @elonmusk, I retrieve and store his user_id.
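One practical point when storing user IDs: recent Twitter IDs are very long integers, and any tool that shows them in scientific notation rounds them. Values such as '8.1672E+17' in the center list later in this article look like IDs that went through this kind of rounding, although that is a guess. A minimal sketch with a made-up ID:

```python
# A made-up 18-digit ID, similar in size to recent Twitter user IDs.
original_id = "816723456789012345"

# What scientific notation with five significant digits would display.
mangled = f"{float(original_id):.4E}"
print(mangled)  # 8.1672E+17

# Converting back does not recover the original ID.
recovered = str(int(float(mangled)))
print(recovered == original_id)  # False
```

Keeping the IDs as strings from the API call through to the CSV file avoids this loss.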

At the beginning, I could not get the full set of relationships among the accounts I follow. After some searching, I found that I needed the following lines to fetch each account's friend list, so that the full relationship data could be downloaded to a CSV file.

friends = api.friends_ids(user_id=user_id)

self.user_id = user_id

self.friends = friends

I also want to check when each account I follow sent its last tweet, so I use this call to download its latest tweet:

last_tweet = api.user_timeline(user_id=user_id, count=1)[0]

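def analyse_user(user_id=None):
    """Return (friend IDs, date of the last tweet) for one account."""
    # Tweepy 3.x names; Tweepy 4 renamed them to get_friend_ids and TweepyException.
    try:
        friends = api.friends_ids(user_id=user_id)
        try:
            last_tweet = api.user_timeline(user_id=user_id, count=1)[0]
            return friends, last_tweet.created_at
        except IndexError:
            # The account has no tweets.
            return friends, False
    except tweepy.error.TweepError:
        # The API call failed, for example for a protected account.
        return [], False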


class Friend(object):
    def __init__(self, user_id, friends, last_tweet_date):
        self.user_id = user_id
        self.friends = friends
        self.last_tweet_date = last_tweet_date

    def write_to_csv(self, file="friends222.csv"):
        # Append one row: user_id, last_tweet_date, friend count, then the friend IDs.
        with open(file, "a", newline="") as f:
            writer = csv.writer(f)
            row = [self.user_id, self.last_tweet_date,
                   len(self.friends)] + self.friends
            writer.writerow(row)

This creates a .csv file, which is read back in the next step.
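To make the row layout concrete, here is a minimal sketch that writes one row in the same shape and reads it back. It uses an in-memory buffer instead of friends222.csv, and the IDs and date are made-up values.

```python
import csv
import io

# Hypothetical values, only to show the row layout used by write_to_csv.
user_id = "111"
last_tweet_date = "2021-02-20 10:00:00"
friends = ["222", "333", "444"]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([user_id, last_tweet_date, len(friends)] + friends)

# Reading the row back: every field comes back as a string.
buffer.seek(0)
row = next(csv.reader(buffer))
read_user_id, read_date, read_count = row[0], row[1], int(row[2])
read_friends = row[3:]

print(read_user_id, read_date, read_count, read_friends)
```

Note that the friend IDs start at index 3, after the friend count in index 2.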

raw data screenshot of my Twitter account

Data processing and cleaning

At this stage, I kept getting an error when I tried to use the NetworkX package to visualize the dataset I had prepared. It turned out that I needed to exclude myself from the dataset, so I skip the first row (my own account) with [1:]. Each row holds the user ID, the last tweet date, the friend count, and then the friend IDs:

network_nodes = [Friend(row[0], row[3:], row[1]) for row in csv.reader(f)][1:]

Then I can use the dataset to do the analysis.

With the following code, we can look at the distribution of “friends”:

import matplotlib.pyplot as plt

with open("friends222.csv", "r") as f:
    # Row layout: user_id, last_tweet_date, friend count, friend IDs...
    network_nodes = [Friend(row[0], row[3:], row[1]) for row in csv.reader(f)][1:]
network_ids = {n.user_id for n in network_nodes}
plt.hist([len(f.friends) for f in network_nodes], bins=40)
for friend in network_nodes:
    friend.friend_ids_in_network = [i for i in friend.friends if i in network_ids]
plt.hist([len(f.friend_ids_in_network) for f in network_nodes], bins=40)
plt.title("Friend numbers of people I follow in network")

Friend numbers of people I follow in Network

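From the picture, we can see a peak at 5,000. This is because every Twitter account can follow up to 5,000 accounts. Once you reach that number, you may need to wait until your account has more followers before you can follow additional accounts. Beyond 5,000, the limit is different for each account and is automatically calculated based on your unique ratio of followers to following. Part of the peak may also come from the data collection itself: a single friends_ids call returns at most 5,000 IDs, so any account that follows more than 5,000 accounts is recorded with exactly 5,000 friends.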
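The friend_ids_in_network values used above keep only those friends that are themselves accounts in my network. As a standalone sketch of that filtering step (the function body and the sample IDs are illustrative assumptions, not necessarily the exact code used for the figure):

```python
def friend_ids_in_network(friend_ids, network_ids):
    """Keep only the friend IDs that are also accounts in the network."""
    network_ids = set(network_ids)  # set lookup is fast for long lists
    return [fid for fid in friend_ids if fid in network_ids]


# Hypothetical IDs: the network contains three accounts.
network_ids = ["10", "20", "30"]
friends_of_10 = ["20", "30", "999", "888"]

print(friend_ids_in_network(friends_of_10, network_ids))  # ['20', '30']
```

After this step, every remaining friend ID is also a node of the network, which is what the graph construction in the next section relies on.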

Build the Network

To achieve this we use the NetworkX package.

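import random
import networkx as nx
import matplotlib.pyplot as plt

# Map each account to the friends it follows inside my network.
network_map = {f.user_id: f.friend_ids_in_network for f in network_nodes}

random.seed(0)
number_of_nodes = 120
# random.sample needs a sequence, so convert the dict keys to a list.
sampled_nodes = random.sample(list(network_map), number_of_nodes)
sampled_set = set(sampled_nodes)
sampled_sub_network_map = {
    key: [node for node in network_map[key] if node in sampled_set]
    for key in sampled_nodes
}
G = nx.Graph(sampled_sub_network_map)
plt.figure(figsize=(18, 18))
nx.draw(G)
# Save before plt.show(); saving afterwards can produce a blank image.
plt.savefig("plot.png", dpi=1000)
plt.show()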

My own Twitter Network by using NetworkX to show the graph

In the graph above, each node is a user, and an edge connects two users based on a follow relationship. Conceptually, if User A follows User B, the edge goes out from Node A and into Node B; the drawing itself uses an undirected nx.Graph, so it only shows that a connection exists. In the center of the graph, we can see many connections among the nodes, because a user like me tends to follow accounts in a few areas of interest. For me, these are politics, movies, and sports. Accounts in those areas are heavily connected to each other, so the center is much denser than the outer part of the graph, which holds accounts that have less to do with my main interests.
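Because following is a one-way relationship, a directed graph keeps the information about who follows whom. A minimal sketch with made-up account names, assuming the same dict-of-lists shape as network_map:

```python
import networkx as nx

# Hypothetical follow map: each key follows the accounts in its list.
follow_map = {
    "me": ["alice", "bob"],
    "alice": ["bob"],
    "bob": [],
}

# DiGraph reads a dict of lists as edges from each key to each listed node.
D = nx.DiGraph(follow_map)

print(sorted(D.edges()))
# In-degree counts followers inside this graph; out-degree counts followings.
print(D.in_degree("bob"), D.out_degree("bob"))  # 2 0
print(D.has_edge("alice", "bob"), D.has_edge("bob", "alice"))  # True False
```

With a directed graph, in-degree and out-degree become separate quantities, which matters for PageRank because it distinguishes incoming links from outgoing ones.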

Analyzing the center of the network:

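components = sorted(nx.connected_components(G), key=len, reverse=True)  # assumed definition: largest component first
G = nx.Graph({key: value for key, value in network_map.items() if key in components[0]})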

Calling nx.center(G) on this graph returns its center, which is the set of nodes whose eccentricity equals the radius. In my case, it returns the following user IDs:

['8.1672E+17',
 '',
 '970207298',
 '109579534',
 '1.08122E+18',
 '150078976',
 '2836421',
 '92186819',
 '22771961',
 '8.30908E+17',
 '13850422',
 '29501253',
 '218975278',
 '180107694',
 '19682187',
 '467823431',
 '357606935',
 '2569743630',
 '242555999',
 '1.33899E+18',
 '2417586104',
 '574795929',
 '36711022',
 '1.32373E+18',
 '1.34917E+18',
 '29442313',
 '32871086',
 '50769180',
 '41634520',
 '1917731',
 '432895323',
 '1.34915E+18',
 '1.35128E+18',
 '19739126',
 '39344374',
 '1058807868',
 '19568591',
 '2813466810',
 '58579942',
 '521747968',
 '4091551984',
 '8.03694E+17',
 '15764644',
 '216881337',
 '2467791',
 '1339835893',
 '742143',
 '30354991',
 '1330457336',
 '14361155',
 '16573941',
 '17137628',
 '15745368',
 '29873662',
 '216776631',
 '783792992',
 '1652541',
 '939091',
 '2455740283',
 '409486555',
 '138203134',
 '44196397',
 '90484508',
 '46678824',
 '14159148',
 '11348282',
 '50940456',
 '50393960',
 '15492359',
 '2097571',
 '5988062',
 '15012486',
 '28785486',
 '759251',
 '807095',
 '428333',
 '23083404']
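To make the definition of the center (the nodes whose eccentricity equals the radius) concrete, here is a small plain-Python sketch that computes eccentricity, radius, and center with breadth-first search on a made-up path graph. It illustrates the definition and is not NetworkX's implementation; NetworkX exposes the same quantities as nx.eccentricity, nx.radius, and nx.center. The sketch assumes the graph is connected.

```python
from collections import deque

# Hypothetical undirected path graph: a - b - c - d - e
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}


def eccentricity(adjacency, source):
    """Largest shortest-path distance from `source` to any other node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return max(distance.values())


ecc = {node: eccentricity(adjacency, node) for node in adjacency}
radius = min(ecc.values())
center = [node for node, value in ecc.items() if value == radius]

print(ecc)     # {'a': 4, 'b': 3, 'c': 2, 'd': 3, 'e': 4}
print(radius)  # 2
print(center)  # ['c']
```

The middle node c can reach every other node in at most two steps, so it is the only node whose eccentricity equals the radius.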

PageRank is an algorithm used by Google Search to rank web pages in their search engine results. PageRank is a way of measuring the importance of website pages. According to Google: PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is.
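To show what counting the number and quality of links means in practice, below is a minimal plain-Python sketch of the PageRank power iteration on a made-up directed graph. It is a simplified illustration, not the code behind nx.pagerank: it assumes every node has at least one outgoing link, and the damping factor of 0.85 and the fixed 50 iterations are illustrative choices.

```python
# Hypothetical directed graph: each key links to the accounts in its list.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; assumes every node has at least one out-link."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the same small base score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for source, targets in links.items():
            # A node splits its damped score evenly among its out-links.
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


scores = pagerank(links)
for node, score in sorted(scores.items(), key=lambda pair: pair[1], reverse=True):
    print(node, round(score, 3))
```

Node c receives links from three other nodes and ends up with the highest score, while d, which nobody links to, keeps only the base score.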

We can use NetworkX's pagerank function to see which accounts rank highest.

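pr = nx.pagerank(G)

# sorted() is ascending by default, so reverse=True puts the highest scores first.
sorted_nodes = sorted(pr.items(), key=lambda pair: pair[1], reverse=True)

# Look up the screen names of the ten highest-ranked accounts.
users = api.lookup_users(user_ids=[pair[0] for pair in sorted_nodes[:10]])
for u in users:
    print(u.screen_name)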

In my run, it printed the following:

bigbangtheory
jing_lyu
NY1weather
cultofmac
SportsOnPrime
RutgersSA
FOXTV
Adele
BBCBusiness
nikestore

Conclusion

In this article, I used my own Twitter account as an example to show the nodes and edges in a social media graph. It gives only a general description of how NetworkX and PageRank work in this setting. With a more complicated dataset involving more features, the analysis would become much more complicated; handling that is a direction for further improvement.

Another limitation of this analysis is my understanding of the Twitter API. Since I am not an expert in it, I did not fully account for some of its limits, such as the 5,000-account follow limit, so the graph does not capture accurate weights. As a next step, I need to make the dataset more robust and do a deeper analysis using NetworkX and PageRank. Moreover, I could take additional features into account to see how the weights change and how those changes affect the nodes and edges.

Finally, on social media, two Twitter accounts can be treated as nodes of a network. When one retweets the other, that relationship is an edge. If they retweet each other multiple times, the weight of their relationship is higher. I used my own Twitter dataset to illustrate how nodes and edges are related on social media platforms. NetworkX makes it easy to visualize such graphs, and PageRank shows which nodes carry more weight among the accounts you follow.
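The idea of edge weights can be sketched with a few lines of plain Python. The retweet records below are invented for illustration; each pair means that the first account retweeted the second.

```python
from collections import Counter

# Hypothetical retweet log: (retweeter, original author).
retweets = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "bob"),
]

# Directed edge weight = how many times one account retweeted the other.
edge_weights = Counter(retweets)

print(edge_weights[("alice", "bob")])  # 2
print(edge_weights[("carol", "bob")])  # 1
```

Weighted edges like these can be loaded into a directed NetworkX graph with add_weighted_edges_from, and nx.pagerank takes them into account through its weight parameter.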

Reference

NetworkX, network analysis in Python. https://networkx.github.io/
Graph Data Structure And Algorithms. https://www.geeksforgeeks.org/graph-data-structure-and-algorithms/
Social network analysis with NetworkX, Manojit Nandi on July 14, 2015. https://blog.dominodatalab.com/social-network-analysis-with-networkx/
About following on Twitter — Twitter Help Center. https://help.twitter.com/en/using-twitter/twitter-follow-limit

Written by Tao Yao