Analysis Of Twitter Social Network

--

Twitter is a major platform where people share their opinions, letting out their thoughts on a topic in no more than 280 characters. One of its key features is that you can join and add to a conversation using hashtags, emojis, and so on. But what is most interesting about social media, and about Twitter in particular for this post, is that it creates connections: networks that can be studied to understand how people interact and how news and opinions spread. Twitter holds a lot of data that can be used for many purposes, but before the data can be used, it has to be extracted.

Formula 1 racing is the most-watched motorsports event in the world. Marques such as Ferrari, McLaren, Mercedes, and Honda have competed in the championship, which was inaugurated in 1950.

I got interested in extracting tweets with the hashtag “f1”, which relates to Formula 1. Formula 1 has been one of the premier forms of racing around the world since its inaugural season in 1950. The word “formula” in the name refers to the set of rules to which all participants’ cars must conform. A Formula One season consists of a series of races, known as Grands Prix (French for “grand prizes” or “great prizes”), which take place worldwide on purpose-built circuits and on public roads. Formula One cars are the fastest regulated road-course racing cars in the world, owing to very high cornering speeds achieved through the generation of large amounts of aerodynamic downforce.

Tools

  • Python — a programming language
  • Tweepy — a Python library for accessing the Twitter API
  • NetworkX — a Python library for studying graphs and networks.
  • Pandas — data manipulation and analysis library
  • Matplotlib — plotting library
  • JSON — the data format in which tweets are delivered
  • Gephi — an open-source network analysis and visualization software package

In this article, I’m going to explain the steps I went through to extract data from Twitter. First, you need Twitter API credentials from the Twitter Developer website: an API key, an API secret key, an access token, and an access token secret.

https://developer.twitter.com/en

After receiving approval, we create a new app, fill out its details, and lastly generate the access tokens, keeping them in a safe place.

Twitter authentication

The following are the keys and tokens I obtained.
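One way to keep the four credentials out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the variable names and the `load_credentials` helper are hypothetical, not something the Twitter Developer site prescribes.

```python
import os

# Hypothetical environment variable names; use whatever names you export.
CREDENTIAL_VARS = (
    "TWITTER_API_KEY",
    "TWITTER_API_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}
```

With Tweepy, the API key and secret would typically be passed to `tweepy.OAuthHandler(...)` and the two tokens to its `set_access_token(...)` method. Check this against the Tweepy version you have installed, since the authentication classes have been renamed over time.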

I used Tweepy to extract the Twitter data. In a Jupyter notebook, we can use the Tweepy Python library to connect with our Twitter credentials, stream real-time tweets related to a term of interest, and save them to a “.txt” file.

I used a simple Twitter stream listener to collect 1000 tweets containing the hashtag “f1”, and had the stream save directly to the “.txt” file.
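The collecting logic of such a listener can be sketched without any network access. The class below is a plain-Python stand-in, not Tweepy code: `TweetFileCollector` and its file layout (one JSON object per line) are assumptions for illustration. Its `on_data` method mimics the callback a stream listener receives for each incoming message. In older Tweepy releases, returning `False` from that callback stops the stream.

```python
import json


class TweetFileCollector:
    """Append raw tweets to a text file, one JSON object per line."""

    def __init__(self, path, limit=1000):
        self.path = path
        self.limit = limit
        self.count = 0

    def on_data(self, raw):
        # Ignore keep-alive blanks and anything that is not valid JSON.
        try:
            message = json.loads(raw)
        except (TypeError, ValueError):
            return True
        # Ignore non-tweet messages such as rate-limit notices.
        if not isinstance(message, dict) or "text" not in message:
            return True
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(message) + "\n")
        self.count += 1
        # Returning False signals "stop" once the limit is reached.
        return self.count < self.limit
```

Wiring this into a real stream (subclassing the library's listener class and filtering on the tracked term) is left out on purpose, because the streaming interface differs between Tweepy versions.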

Now we can read all the data we gathered, which is stored in the “.txt” file, into a pandas DataFrame.
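A minimal sketch of that loading step, assuming the file holds one JSON tweet per line (the original file layout is not shown, so this is an assumption, and `load_tweets` is a hypothetical helper name):

```python
import json

import pandas as pd


def load_tweets(path):
    """Read one JSON tweet per line, skipping blank or malformed lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except ValueError:
                # A partially written line should not abort the whole load.
                continue
    return pd.DataFrame(records)
```

If every line is known to be valid JSON, `pd.read_json(path, lines=True)` is a shorter alternative. The explicit loop is used here only because streamed files sometimes contain blank or truncated lines.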

We will use this information to graph how the people who tweet about F1 interact with each other. There are three types of interactions between two Twitter users that we are interested in: retweets, replies, and mentions. The JSON representing each Tweet object includes, among others, a user object that describes the author of the Tweet and an entities object that holds arrays of hashtags and user mentions.

The dataset for this social network analysis is then stored in a DataFrame. Once you’ve collected your data, you set up a pandas DataFrame containing the information of interest: who tweeted, how many followers they have, whether it was a retweet and of whom, whether it was a reply and to whom, and so on.

You can see the information we collected after stream listening.

From the displayed columns, we are interested in:

  • Author of the Tweet: name (screen_name) and ID (id) inside the user object.
  • Twitter users mentioned in the text of the Tweet: name and ID, found as screen_name and id in user_mentions.
  • Account being retweeted: screen_name and id inside the user object of retweeted_status.
  • User to whom the Tweet replies: in_reply_to_screen_name and in_reply_to_user_id.
  • Tweet to which the Tweet replies: in_reply_to_status_id.
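The fields listed above can be pulled out of a single tweet dictionary roughly as follows. This is a sketch under assumptions: `extract_interactions` is a hypothetical helper, and the rule that a retweet or reply target should not also count as a mention is my own simplification (such targets typically also appear in user_mentions).

```python
def extract_interactions(tweet):
    """Return (author, [(kind, other_user), ...]) for one tweet dict."""
    author = tweet["user"]["screen_name"]
    interactions = []

    # A retweet carries the original tweet, including its author.
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        interactions.append(("retweet", retweeted["user"]["screen_name"]))

    # A reply names the user being replied to.
    reply_to = tweet.get("in_reply_to_screen_name")
    if reply_to:
        interactions.append(("reply", reply_to))

    # Remaining mentions: skip users already counted above.
    seen = {other for _, other in interactions}
    for mention in tweet.get("entities", {}).get("user_mentions", []):
        name = mention["screen_name"]
        if name not in seen:
            interactions.append(("mention", name))
            seen.add(name)

    return author, interactions
```

Applying this to each stored tweet gives the author plus a list of typed interactions, which is the raw material for the edges of the graph.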

After setting up and organizing the DataFrame, we can see how it looks.

I first displayed the number of tweets. In the resulting DataFrame, note that rows 0–3 are retweets and row 4 is a reply; “age” is the number of days since the Twitter ID was created:

We can easily find out the most-retweeted IDs in the DataFrame.
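As a sketch of that lookup, assuming the DataFrame has a column naming the account each tweet retweets (the column names `screen_name` and `retweet_of`, and the toy rows with made-up fan handles, are hypothetical):

```python
import pandas as pd

# Toy data only; real rows would come from the collected tweets.
df = pd.DataFrame({
    "screen_name": ["fan_a", "fan_b", "fan_c", "fan_d"],
    "retweet_of": ["F1", "F1", "F1Gate", None],
})

# value_counts() ignores missing values, so non-retweets drop out.
top_retweeted = df["retweet_of"].value_counts().head(10)
print(top_retweeted)
```

`value_counts()` sorts in descending order, so the first entry is the most retweeted account.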

Not surprisingly, Formula 1’s official Twitter account, F1, is the most retweeted.

Now we’ll extract all of the information we need so that NetworkX can create a directed or undirected graph that we can visualize in Gephi. We will use this information to create a graph, or network, of who’s retweeting whom, keeping track of the age in days and the number of followers of each user so we can filter on those factors if we like.

A graph has two main elements: nodes, and edges, the lines that connect two nodes. The ability to reach one node from another by following edges, or paths, is what makes graphs so powerful for representing different networks. A graph can also be classified as directed or undirected. It is directed when the edges have a specific orientation, normally drawn as an arrow, and undirected when the edges have no orientation.

In my analysis, users are the nodes. If there is any sort of interaction between two users (a retweet, reply, or mention), an edge is created to connect their nodes. We can work with a directed graph if we are interested in which user retweets which other user. If we only care that an interaction exists, regardless of orientation, an undirected graph is enough. I decided to go with a directed graph to see which users are retweeting whom.
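The difference between the two choices shows up as soon as interactions are counted. A small standard-library illustration with made-up user names:

```python
from collections import Counter

# Toy interactions as (source_user, target_user); names are made up.
interactions = [("ann", "bob"), ("bob", "ann"), ("ann", "bob"), ("cat", "ann")]

# Directed: (ann, bob) and (bob, ann) are different edges.
directed = Counter(interactions)

# Undirected: orientation is dropped, so both collapse into one edge.
undirected = Counter(frozenset(pair) for pair in interactions)

print(len(directed), len(undirected))
```

In the directed count, who acted on whom is preserved. In the undirected count, only the fact that two users interacted remains, along with how often.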

We will use NetworkX, a Python library for creating and studying the structure of complex networks such as social networks. We initialize the graph by calling NetworkX’s DiGraph() constructor.

So I wrote code to iterate, row by row, through the data we had pulled into the DataFrame earlier and construct a directed graph of who’s retweeting whom. Each directed edge represents the relationship “is retweeted by”: the higher the weight of an edge, the more often person B is retweeted by person A. Each node represents an individual ID on Twitter and has attributes that track the number of followers and the age of the ID in days.
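A minimal sketch of such a loop is shown below. The row layout and field names (`user`, `followers`, `age`, `retweet_of`) are hypothetical and the values are toy data. The edge direction follows my reading of “is retweeted by”, that is, from the retweeted account to the retweeter.

```python
import networkx as nx

# Toy rows standing in for the DataFrame; every value is made up.
rows = [
    {"user": "fan_a", "followers": 120, "age": 900, "retweet_of": "F1"},
    {"user": "fan_a", "followers": 120, "age": 900, "retweet_of": "F1"},
    {"user": "fan_b", "followers": 45, "age": 30, "retweet_of": "F1"},
    {"user": "fan_c", "followers": 10, "age": 400, "retweet_of": None},
]

G = nx.DiGraph()
for row in rows:
    # Node attributes let us filter by followers or account age later.
    G.add_node(row["user"], followers=row["followers"], age=row["age"])
    source = row["retweet_of"]
    if source is None:
        continue  # not a retweet, so no edge
    # Edge reads "source is retweeted by user"; weight counts the retweets.
    if G.has_edge(source, row["user"]):
        G[source][row["user"]]["weight"] += 1
    else:
        G.add_edge(source, row["user"], weight=1)

print(G.number_of_nodes(), G.number_of_edges())
```

Accounts that are retweeted but never tweet themselves in the sample (here `F1`) get a node without follower or age attributes. With a real DataFrame the same loop could run over `df.to_dict("records")`.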

We can now check the number of nodes and edges of the Graph created.

The last thing I did was save a GraphML file that we can then read into Gephi. Start Gephi and open the file.
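The export itself is a single NetworkX call, `nx.write_graphml`. The wrapper below is a hedged sketch: `export_for_gephi` is a hypothetical helper, and dropping `None` attributes is an assumption about how missing values should be handled (GraphML has no null type).

```python
import networkx as nx


def export_for_gephi(graph, path):
    """Write the graph as GraphML, a format Gephi can open directly."""
    clean = graph.copy()
    for _, attrs in clean.nodes(data=True):
        for key, value in list(attrs.items()):
            if value is None:
                # GraphML has no null type, so drop missing values.
                del attrs[key]
    nx.write_graphml(clean, path)
```

A call such as `export_for_gephi(G, "f1_retweets.graphml")` (file name chosen for illustration) produces a file that can be opened from Gephi's File menu.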

When importing the GraphML file into Gephi, choose “Directed” as the graph type. After a successful import, this is how it looks.

You can now look at the information it contains by clicking on the “Data Laboratory” button at the top.

Then we can click on the “Overview” button. Initially the data looks messy, so we run a layout on it. From the “Layout” section I chose “ForceAtlas”, as it’s fast and good at showing relationships in a network.

There are several clusters in the network, and we can see the nodes and edges in each group. However, we still cannot see which nodes are at the center of the bigger clusters or with whom they interact. The next step is to show the name of each node and color the nodes so the structure is clearer.

Now I look at network centrality, which captures the importance of a node’s position in the network. I also compute Betweenness Centrality, which measures how often a node lies on the shortest paths between other nodes and is often read as the node’s “strength” or “influence” in a social network. We can see that the nodes now have different sizes and the edges have different levels of thickness.
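Gephi computes this measure from its Statistics panel, but the same idea can be checked in Python with NetworkX's `betweenness_centrality`. The toy graph below is made up for illustration and is not the F1 data:

```python
import networkx as nx

# Toy directed graph: "hub" sits on every path that leads to "b".
G = nx.DiGraph([("a", "hub"), ("hub", "b"), ("c", "hub")])

# Score per node: the share of shortest paths passing through it.
scores = nx.betweenness_centrality(G)
top = max(scores, key=scores.get)
print(top, scores)
```

Only `hub` lies between other nodes here, so it is the only node with a non-zero score. In the retweet network, accounts that bridge otherwise separate groups play the same role.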

The big nodes have high betweenness: they control the flow between otherwise disparate clusters in the network. We still cannot see which user each node represents or how strongly the users interact with each other, so we run a Modularity report on the graph. The colors indicate the different communities detected by this algorithm; users in the same community are more densely connected to each other than to the rest of the network. I also added labels so we can see which users they are.

In the graph above, we see that, not surprisingly, F1, Formula 1’s official Twitter page, is the center of one of the major node clusters/communities (in pink).

We can see that there are 424 communities, and the most retweeted users are F1 and F1Gate, a Formula 1 news site.

The pink group makes up 20.1% of the total data. Its members are people who frequently retweet Formula 1’s Twitter account. The green group makes up 12.08% of the total data. Its members are people who frequently retweet F1-Gate’s Twitter account. The other groups, measured by how often their accounts are retweeted under the hashtag “f1”, are much smaller than these two.
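Community shares like these can also be approximated outside Gephi. The sketch below swaps in NetworkX's greedy modularity algorithm, which is a different method from the Louvain algorithm behind Gephi's Modularity report, so the communities it finds on real data may differ. The graph is a toy example, not the F1 network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph: two triangles joined by a single edge.
G = nx.Graph([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

# Each community is a set of nodes, largest community first.
communities = greedy_modularity_communities(G)

# Share of all nodes that falls into each community, in percent.
total = G.number_of_nodes()
shares = [round(100 * len(c) / total, 2) for c in communities]
print(shares)
```

For a directed retweet graph, calling `to_undirected()` first keeps the example within what the algorithm is commonly used for.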

In the end, social network analysis lets us learn a great deal about what goes on in social media: how users interact and what sorts of interactions take place. I find it a very powerful tool for analysing social media data, and the results can be used to shape how users interact around certain topics and with certain users. This can be especially beneficial when promoting something or building a brand.

--
