Whose Tweet Was It Originally?

Using Tweepy (Twitter API v2) to Analyze Original Tweets and Retweets and Visualizing Them Using Squarify

Abraham Setiawan
CodeX
5 min read · Mar 15, 2022


Photo by Souvik Banerjee on Unsplash

Twitter is a name that should be familiar to you. With more than 300 million total users and more than 200 million daily active users, it is one of the biggest social media platforms in the world. One of its features is the Retweet (RT): the ability for a user to share someone else’s tweet on their feed while preserving credit to the original source.

But what is the actual proportion of original tweets to retweets on a Twitter account? My initial guess was that an account should have more original tweets than retweets. To be honest, I had never really thought about it before. But someone dear to me came to me one day saying that the Media and Communications Department of Uppsala University (@uu_media_comms) might have more retweets than original tweets. That conversation led me to this Twitter analysis with Tweepy.

According to its documentation, Tweepy is an easy-to-use Python library for accessing the Twitter API. To get started with Tweepy, we first have to create a Twitter developer account to get access to the API. Fortunately, this is easily done on the Twitter developer page, and there is a free option. The free option gives 1 environment per project and 500K tweets per month per project, which is more than enough in this case.

Once we have access to the Twitter API, let’s start by installing Tweepy.

pip install tweepy

Once we have it installed, we can import it and create a client object. All the parameters to create the object can be obtained from your Twitter developer account.

import tweepy

# create Client object
client = tweepy.Client(bearer_token, consumer_key, consumer_secret, access_token, access_token_secret)
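One way to keep the keys out of the script is to read them from environment variables. This is only a minimal sketch: the environment variable names and the helpers `load_credentials` and `make_client` are hypothetical, and the only Tweepy call used is `tweepy.Client` with the same five parameters as above, passed by keyword.

```python
import os

# Hypothetical environment variable names; use whatever names you prefer.
ENV_NAMES = {
    "bearer_token": "TWITTER_BEARER_TOKEN",
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=os.environ):
    """Map tweepy.Client keyword names to the values found in the environment."""
    return {key: environ[name] for key, name in ENV_NAMES.items() if name in environ}


def make_client(environ=os.environ):
    """Create the Client object from whatever credentials are available."""
    # Imported here so load_credentials can be used without Tweepy installed.
    import tweepy

    return tweepy.Client(**load_credentials(environ))
```

With the variables exported in your shell, `client = make_client()` gives the same kind of object as the line above.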

Then we use this function to fetch all tweets from a Twitter account.

I’ll explain the main operations of this function. First, we need the numerical user ID as the input to fetch the tweets. Since we know the username, we can fetch the user ID with this line of code.

user_id = client.get_user(username=username)[0].id

Then, we can fetch the tweets with this line of code.

tweets = client.get_users_tweets(user_id, max_results=100, pagination_token=tweets.meta.get('next_token'))

Note that the Twitter API only allows us to fetch 100 tweets at a time. To fetch the following pages, it’s important to define the pagination_token parameter: inside the loop, we pass the next_token found in the meta of the previous response.

Afterwards, we put the results in a pandas dataframe, using the id attribute of each item in tweets.data as the tweet ID and the text attribute as the tweet content.
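As a rough sketch of the steps above (not necessarily identical to the full function on GitHub), the loop could look like this. The name `fetch_tweets` and the `max_pages` parameter are hypothetical additions; the Tweepy calls are `get_user` and `get_users_tweets`, the same ones used in the lines above.

```python
import pandas as pd


def fetch_tweets(client, username, max_pages=None):
    """Collect a user's tweets page by page into a dataframe."""
    # Look up the numerical user ID from the username.
    user_id = client.get_user(username=username).data.id

    rows = []
    next_token = None  # no token on the first request
    pages = 0
    while True:
        response = client.get_users_tweets(
            user_id, max_results=100, pagination_token=next_token
        )
        # response.data is None when a page has no tweets.
        for tweet in response.data or []:
            rows.append({"id": tweet.id, "text": tweet.text})

        pages += 1
        next_token = response.meta.get("next_token")
        # Stop on the last page, or at the optional page limit.
        if next_token is None or (max_pages is not None and pages >= max_pages):
            break

    return pd.DataFrame(rows, columns=["id", "text"])
```

Calling `fetch_tweets(client, "uu_media_comms")` would then return one row per tweet. Keep in mind that the API only goes back a limited number of tweets per account, so "all tweets" really means the most recent ones it is willing to return.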

Here’s the outcome of the function. We see some RTs already but we can’t be sure yet.

Fetching tweets from @uu_media_comms (Image by author)

To check whether each tweet is a retweet, we use this function.

In this function, we add 2 columns to our dataframe. The first column, isRT, shows whether the tweet is a retweet or not. The second column, RT_from, shows the original account that posted the tweet.

Here’s the outcome of the function. I also checked the retweet counts for this account, and now we can conclude that it has more RTs than original tweets.
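A minimal sketch of how these two columns could be built is shown below. It assumes that a retweet's text starts with "RT @username:", which is how the RTs appear in the fetched text; the function name `add_retweet_columns` and the regular expression are my assumptions, not necessarily what the full code does.

```python
import pandas as pd


def add_retweet_columns(df):
    """Return a copy of df with isRT and RT_from columns added."""
    out = df.copy()
    # Capture the username in a leading "RT @username:" prefix.
    # Tweets without that prefix get a missing value (NaN), which pandas
    # treats the same way as None in the later counting step.
    out["RT_from"] = out["text"].str.extract(r"^RT @(\w+):", expand=False)
    out["isRT"] = out["RT_from"].notna()
    return out
```

For example, a row with the text "RT @alice: hello" gets isRT = True and RT_from = "alice", while a row with "my own tweet" gets isRT = False and a missing RT_from.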

Showing RTs from @uu_media_comms (Image by author)

While we already know the answer to the initial question, there might be more interesting insights if we dig deeper. Let’s find out who this account is retweeting from using this function.

In this function, we count the tweets per original source using this line.

df_plot = df['RT_from'].value_counts(dropna=False).rename_axis('username').reset_index(name='counts')

We use dropna=False to also count the None values. This is important because a None value means that it’s an original tweet.
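To see what this line produces, here is a small example on made-up data (the usernames are invented for illustration). The last line, which computes each source's share of the total, is my addition.

```python
import pandas as pd

# Toy data: None marks an original tweet, a name marks a retweet source.
df = pd.DataFrame({"RT_from": ["alice", None, "alice", "bob", None, "alice"]})

df_plot = (
    df["RT_from"]
    .value_counts(dropna=False)   # keep the missing values as their own group
    .rename_axis("username")      # name the index so it becomes a column
    .reset_index(name="counts")   # turn the Series into a two-column dataframe
)

# Share of all tweets per source, as a fraction between 0 and 1.
df_plot["share"] = df_plot["counts"] / df_plot["counts"].sum()
print(df_plot)
```

Without dropna=False, the two original tweets would silently disappear from the counts and the shares would be computed over retweets only.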

Here is the outcome of the function.

Showing RT counts from @uu_media_comms (Image by author)

It looks like @mwkrzyzanowski accounts for a whopping 56% of all tweets. This person must be very important to @uu_media_comms, since more than half of its tweets are retweets from them. A quick peek on Twitter shows that this person is the Chair of the department. We also see that original tweets make up less than a quarter of all the tweets. I also picked up a small detail: this account retweeted itself 3 times.

We have found an interesting insight, but visualizing it might make it stand out more. This time, I’ll plot the data as a Treemap with Squarify. As usual, first we need to install Squarify.

pip install squarify

Then we use this function to plot the data.

I tested this function with many different Twitter accounts and fixed the various edge cases I found along the way. But here is the main part. Just like Seaborn, Squarify is built on Matplotlib, so we have the usual plt syntax. To plot using Squarify, we use the code below. Through the color parameter, I also gave accounts with more original tweets than retweets a different color palette.

ax = squarify.plot(sizes=df['counts'], label=ax_label, alpha=0.5, color=ax_color, text_kwargs={'size': 9})

Since I haven’t found a way to make the font size of the label proportional to the rectangle size, I decided to control which rectangles get a label instead. With the auto_legend parameter set to True, we only show labels on rectangles that make up more than 2% of the total. If auto_legend is set to False, we use the num_show parameter to manually decide how many labeled rectangles to show.
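The labeling rule can be sketched as follows. This is an illustration under my own assumptions rather than the full plotting function: the helper names `build_labels` and `plot_treemap`, the `min_share` parameter, and the "Original" label for original tweets are hypothetical, and the color handling is left out.

```python
import pandas as pd


def build_labels(df_plot, auto_legend=True, num_show=5, min_share=0.02):
    """Return one label per row; rows that stay unlabeled get an empty string."""
    total = df_plot["counts"].sum()
    labels = []
    for rank, row in enumerate(df_plot.itertuples(index=False)):
        # A missing username means the tweets are original ones.
        name = "Original" if pd.isna(row.username) else f"@{row.username}"
        if auto_legend:
            show = row.counts / total > min_share  # label only the larger shares
        else:
            show = rank < num_show                 # label only the first num_show rows
        labels.append(f"{name}\n{row.counts}" if show else "")
    return labels


def plot_treemap(df_plot, **label_options):
    """Draw the treemap with the selected labels and return the Matplotlib axes."""
    # Imported here so build_labels can be used without the plotting libraries.
    import squarify

    ax = squarify.plot(
        sizes=df_plot["counts"],
        label=build_labels(df_plot, **label_options),
        alpha=0.5,
        text_kwargs={"size": 9},
    )
    ax.axis("off")
    return ax
```

This relies on df_plot being sorted from largest to smallest count, which is what value_counts returns by default, so "the first num_show rows" are also the biggest rectangles.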

Without further ado, here is the Treemap of @uu_media_comms.

Treemap of @uu_media_comms (Image by author)

As you can see, the Treemap makes the pattern much easier to observe. I wondered how it would look for other accounts, so I compared it with a few other educational institutions (@HyperIsland, @KTHUniversity, @UppsalaUni).

Treemaps of educational institutions (Image by author)

Okay, it seems that other educational institutions have mostly original tweets. Let’s look at some unrelated accounts like the band @Coldplay, the actor @RobertDowneyJr, and the US President @POTUS.

Treemaps of assorted Twitter accounts (Image by author)

Alright, it looks like most accounts have more original tweets than retweets, as I initially suspected. There is, however, one group of Twitter accounts with more retweets than original tweets. Yes, that would be Twitter bots.

Treemaps of bot accounts (Image by author)

I like how Tweepy makes it easy to access Twitter data. This time, I only scratched the surface of what Tweepy can do. I hope to dig into Tweepy’s other functionality in the future. In the meantime, you can find the full code on my GitHub.

Cheers!


Abraham Setiawan, writer for CodeX

Data Analyst student at Hyper Island with experience in product and innovation. I write about my journey in the data world. Website: abrahamsetiawan.com