Analyzing #justiceforcarry Using Twitter API

Kaustubh Gupta
May 27, 2020


Setting the context

CarryMinati, the second-biggest individual YouTube creator in India at the time of writing (27–05–2020), became immensely popular, trended everywhere, and was covered in mainstream media when his video reacting to an Instagram video posted by a TikToker named Amir Siddique went viral. Carry was famous before this whole drama, but this video pushed his subscriber count from 10.9 million (8th May 2020) to 19.5 million (27th May 2020). He was strongly supported by the memer community on Instagram and other social media platforms, and big celebrities like Guru Randhawa came out in his support on Twitter. Everything was going perfectly until the midnight of 16th May, when his video, then at 75 million views, was removed from YouTube on the grounds of 'harassment and cyberbullying'. Lakhs of tweets were directed at YouTube, CarryMinati, and other influential accounts by his fans in the hope of bringing the video back (though that is unlikely to happen). The hashtag #justiceforcarry trended on Twitter for many days, and people had very different opinions about the controversy. Being a die-hard fan of CarryMinati, I decided to analyze these tweets and found many similarities in them. So let's take a look at how you can carry out this analysis.

Required libraries

We will be using Jupyter Notebook for this analysis. Let’s import the required dependencies for our task.

import pandas as pd
import numpy as np
import tweepy
import matplotlib.pyplot as plt
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud
import json
import nltk

NumPy and pandas go hand in hand; pandas is built on top of NumPy. Tweepy is the star of the evening, as it will fetch the tweets from Twitter. Matplotlib is used for plotting graphs, charts, and other interesting visualizations (I know seaborn looks prettier, but matplotlib gives more access to the artist layer). re is the regular-expressions module, used for searching patterns in textual data. NLTK is the Natural Language Toolkit, which we use to build the corpus and perform stemming. WordCloud renders a cluster of the most frequent words, sized by their frequency. Lastly, we need the json module, because tweepy returns tweets in JSON format and we will be dealing with a lot of JSON data.

Authorizing the API

To use tweepy or the Twitter API, you need a Twitter developer account. Read more about it here. After you gain access to a developer account, create a new app and generate the access tokens for it. These credentials should never be exposed to the public, so store them in a separate file. I stored them in a JSON file and then loaded them into a variable to be passed to the authentication step. You can use any other method or pass them directly; it is up to you.
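For reference, here is a minimal sketch of what login.json is assumed to look like, based on the keys the code below reads (the values are placeholders, not real credentials):

{
    "details": {
        "API key": "YOUR_API_KEY",
        "API secret key": "YOUR_API_SECRET_KEY",
        "Access Token": "YOUR_ACCESS_TOKEN",
        "Access Token Secret": "YOUR_ACCESS_TOKEN_SECRET"
    }
}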

with open('login.json', 'r') as f:
    parameter = json.load(f)["details"]

After loading the credentials, pass them to the authentication handler, and our API object is ready to use.

auth = tweepy.OAuthHandler(parameter['API key'], parameter['API secret key'])
auth.set_access_token(parameter['Access Token'], parameter['Access Token Secret'])
api = tweepy.API(auth, wait_on_rate_limit=True)
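As an optional sanity check (not part of the original snippet), tweepy can confirm the credentials are valid before we start firing off searches:

# Optional: verify_credentials() raises an error if the tokens are invalid.
user = api.verify_credentials()
print('Authenticated as:', user.screen_name)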

Collecting, downloading, and creating a data frame of tweets

After setting up the API, it's time to dig into Twitter and extract the relevant tweets. As mentioned earlier, Tweepy is the main component of our analysis, and now it comes into play. We do not need to write our own function to scrape the data, because tweepy provides built-in pagination through tweepy.Cursor(). It is used as shown below:

my_qu = 'justiceforcarry'
max_value = 2000
searched_tweets = [t for t in tweepy.Cursor(api.search, q=my_qu).items(max_value)]
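One optional tweak worth knowing about (not used in the original snippet): the standard search endpoint truncates longer tweets and retweets in the text field. If you need the untruncated text, tweepy can request extended mode, in which each status exposes full_text instead:

# Optional: fetch untruncated tweets via extended mode.
extended_tweets = [t for t in tweepy.Cursor(api.search, q=my_qu, tweet_mode='extended').items(max_value)]
print(extended_tweets[0].full_text)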

In the main snippet above, a list comprehension stores the returned tweets. The length of searched_tweets is 2000, and printing just the first element shows how unstructured the raw data is.

Therefore, to make it more readable, we extract the JSON portion of each tweet into a dictionary and then dump the list of dictionaries to a text file to simplify further read/write operations.

tweets = []
for i in searched_tweets:
    tweets.append(i._json)

with open('tweets.txt', 'w') as f:
    f.write(json.dumps(tweets, indent=4))

The text file looks like this:
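The original screenshot is not reproduced here, but each entry in the file is a standard Twitter v1.1 status object. Heavily abbreviated, with placeholder values and most fields omitted, one entry looks roughly like this:

{
    "created_at": "Sat May 16 18:30:00 +0000 2020",
    "id": 1261700000000000000,
    "id_str": "1261700000000000000",
    "text": "RT @...: #JusticeForCarry ...",
    "entities": {"hashtags": [], "user_mentions": [], "urls": []},
    "user": {"id": 1234, "screen_name": "...", "followers_count": 0},
    "retweet_count": 0,
    "favorite_count": 0,
    "lang": "en"
}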

It contains many fields that are not needed for our current analysis, so we will extract only the required ones. We are interested in the textual part of the tweet, so we extract the text field, and for identification purposes we also include the tweet id. The code below implements this:

final_tweets = []
with open('tweets.txt', encoding='utf-8') as f:
    data = json.load(f)
    for i in data:
        tweet_id = i['id']
        text_present = i['text']
        final_tweets.append({'tweet_id': str(tweet_id),
                             'text': str(text_present)})

dataset = pd.DataFrame(final_tweets, columns=['tweet_id', 'text'])
dataset.to_csv('data.csv', index=False)  # index=False avoids an extra unnamed index column on reload
df = pd.read_csv('data.csv')

We have successfully created a data frame df with each tweet id and its corresponding text.

Cleaning the tweets

Let’s look at the first 5 rows of the data frame.
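The screenshot of the preview is not included here, but you can reproduce it in the notebook; the columns are the tweet_id and text created above:

df.head()   # first 5 rows: tweet_id and the raw tweet text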

Every tweet contains the '@' symbol, which corresponds to the Twitter handles tagged in the tweet. We need to get rid of these mentions, since they play no major role in our analysis. We will accomplish this with regular expressions: for an input like '@Kaustubh1828', a pattern such as "@[\w]*" matches the whole handle.

def remove(txt, pattern):
    # Find every substring matching the pattern and strip it from the text.
    r = re.findall(pattern, txt)
    for c in r:
        txt = re.sub(c, '', txt)
    return txt

df['text'] = np.vectorize(remove)(df['text'], r"@[\w]*")
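To see what this does on a single string (the handle below is just the illustrative example from above), a quick check looks like this:

# Quick illustrative check of the remove() helper defined above.
print(remove('Great video @Kaustubh1828 #justiceforcarry', r"@[\w]*"))
# -> 'Great video  #justiceforcarry'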

Next, the text contains 'RT' markers and links starting with http or https, which we also remove.

df['tweets'] = ''  # new column for the cleaned text
for i in range(len(df)):
    tweet = df['text'][i].lower()
    tweet = re.sub(r'https?://\S+', ' ', tweet)   # drop links before punctuation is stripped
    tweet = re.sub('[^a-zA-Z0-9]', ' ', tweet)    # keep only letters and digits
    tweet = re.sub(r'\brt\b', ' ', tweet)         # drop the retweet marker as a whole word
    df.loc[i, 'tweets'] = tweet

Forming the corpus

All the text cleaning is done, and now it's time to extract features from the text. We will use NLTK to build a list of words that carry more weight than ordinary filler words. Words like 'in', 'is', and 'a' have no real significance and should be removed; technically they are called stopwords, and NLTK ships with a large list of them. Additionally, we do not want to distinguish between different forms of the same word, so we use PorterStemmer to reduce each word to its stem. Here is the code snippet for this:

# The stopword list and tokenizer models may need a one-time download:
# nltk.download('stopwords'); nltk.download('punkt')
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus_words = []
for i in range(len(df)):
    tweet = nltk.word_tokenize(df['tweets'].iloc[i])
    tweet = [ps.stem(word) for word in tweet if word not in stop_words]
    corpus_words.append(' '.join(tweet))
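To make the stemming concrete (a small illustrative check, not part of the original notebook): the Porter stemmer turns a trailing 'y' into 'i', which is why the hashtag shows up as 'justiceforcarri' in the word cloud below.

ps = PorterStemmer()
print(ps.stem('justiceforcarry'))                 # -> 'justiceforcarri'
print(ps.stem('carry'))                           # -> 'carri'
print('is' in set(stopwords.words('english')))    # -> True, so it gets filtered out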

The word cloud

Here we reach the final stage of the analysis, where we summarise the results as a cloud of words: each word's size indicates its frequency or importance in the data. We combine all the words stored in corpus_words into a single string and pass it to WordCloud, which returns the generated image. Using matplotlib, we can plot that image in the notebook itself or save it locally.

all_words = ' '.join(corpus_words)   # avoid shadowing the built-in all()
cloud = WordCloud(width=800, height=500, random_state=0, max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(cloud, interpolation="bilinear")
plt.axis('off')
plt.savefig('new.png')
plt.show()

The final image formed is this:

Learnings from this image

BAM! We have finally arrived at the end of this exciting journey. Looking at the image, it's pretty clear that the word 'carryminati' leads the chart. 'justiceforcarri', 'tik tok', 'ban TikTok', 'roast Karega', etc. are some of the other frequent words in the cloud (the stemming step is why 'justiceforcarry' appears as 'justiceforcarri').

Thank you so much for reading till the end, and don't forget to give this article some claps so that it reaches a wider audience.

This is Kaustubh signing off.

Links

Notebook Link: https://github.com/kaustubhgupta/CarryMinati-Tweets-Analysis

LinkedIn: https://www.linkedin.com/in/kaustubh-gupta-612767ab/

References

This post is inspired by this article: https://analyticsindiamag.com/hands-on-guide-to-download-analyze-and-visualize-twitter-data/
