Solving Twitter Sentiment Analysis problem on Analytics Vidhya

Ravi Teja Komma
Coinmonks
6 min read · Jun 29, 2018


I’ve been learning some data science skills in the past 6 months and I thought I should document and share my progress to…

  1. Keep me more accountable.
  2. Potentially help anyone else that is trying to learn the same things I am!

So here I am going to explain how I have solved the Twitter Sentiment Analysis problem on Analytics Vidhya.

You can refer to my GitHub repository to find the source code and the dataset for this post. Please give the repo a star if it helped you ✌️

Problem Statement:

The objective of this problem is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label ‘1’ denotes the tweet is racist/sexist and label ‘0’ denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

Solving the Problem:

First of all, let us import a few libraries which are required: Pandas, NumPy, and Scikit-Learn. If this is your first exposure to data analysis and machine learning, these libraries can be installed with the pip install command from the command prompt or within your IDE, like the following:
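
Something along these lines should do it (the package names below are as published on PyPI):

    pip install pandas numpy scikit-learn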

By the way, I am using Python 3.6 and Jupyter Notebook as my development environment.

Make sure you install all three! Now we are good to go. Let us import pandas and numpy like the following so we can use them.
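
A minimal version of those imports; the aliases pd and np are the usual conventions and are what the later snippets assume:

    import pandas as pd
    import numpy as np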

Obtaining the Data:

Our first step in solving this problem is to load our data. Download the dataset from either Analytics Vidhya or from my GitHub repo and make note of where it’s saved and what it’s called.

Then, use pandas and the read_csv command to load the data. While we’re at it, let’s use .head() to look at the first 5 rows and make sure everything looks okay.
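
Something like the following, assuming the training file is saved as train.csv in the working directory (adjust the path and file name to match your download):

    train = pd.read_csv('train.csv')
    train.head()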

The output shows the first five rows, with the three columns id, label and tweet.

Let’s do some exploratory analysis on the train data that we have obtained.
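
One way to get this overview is with pandas' info() method on the dataframe:

    train.info()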

This gives us a brief overview of the data: the number of rows and columns, the data type of each column and the non-null count per column, from which we can also tell which columns have missing values. In our data there are 3 columns (id, label, tweet), 31962 rows and no missing values.
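
For the class breakdown described next, value_counts() on the label column does the job:

    train['label'].value_counts()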

The above code gives us a breakdown of how many tweets are ‘0’s and how many tweets are ‘1’s.

Now let's look at our tweets.
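
For example, by printing a handful of the raw tweets:

    train['tweet'].head(10)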

If you look at the tweets carefully, there is a lot of junk that is not useful for predicting whether a tweet is positive or negative. So I have written a utility function to clean the tweet text by removing links, special characters, etc. using simple regex statements.
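
Here is one reasonable sketch of such a function using Python's re module; the exact patterns are a matter of taste, and yours may differ:

    import re

    def clean_tweet(tweet):
        """Remove @handles, hyperlinks and special characters, then lowercase."""
        tweet = re.sub(r'@[\w]+', ' ', tweet)          # user handles
        tweet = re.sub(r'https?://\S+', ' ', tweet)    # hyperlinks
        tweet = re.sub(r'[^a-zA-Z\s]', ' ', tweet)     # special characters and numbers
        tweet = re.sub(r'\s+', ' ', tweet).strip()     # collapse repeated whitespace
        return tweet.lower()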

Now let's apply this function to our train data and take a look at the result:
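
Assuming the train dataframe and the clean_tweet function from the earlier snippets:

    train['processed_tweets'] = train['tweet'].apply(clean_tweet)
    train.head()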

This adds one new column, “processed_tweets”, which contains the cleaned tweets, i.e. the text after removing special characters, hyperlinks, etc.

Now, to make predictions, we do not need the columns “id” and “tweet”, so let's drop them. Our final dataframe then contains only 2 columns (“label”, “processed_tweets”) and looks something like this:
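
Dropping the two unused columns is a one-liner in pandas:

    train = train.drop(['id', 'tweet'], axis=1)
    train.head()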

Things get a bit more complex here! In order to reduce overfitting problems we split our dataset into test and training sets. We will ‘train’ a model using the training set and then ‘test’ the model using the test set. We’re using sklearn for this, so go ahead and add the following code.
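
A sketch of that split; an 80/20 split with random_state=3 matches the shapes reported below (the variable names are mine):

    from sklearn.model_selection import train_test_split

    x = train['processed_tweets']
    y = train['label']

    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=3)

    print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)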

Not only are we splitting our data into training and testing datasets, we're also dividing it into X (the processed tweets) and y (the labels). This will make it simpler to train and test our model! Also, in order to get the same ‘split’ of the data each time we come back to this code, we fix a particular random state, 3. This number only seeds how the rows are shuffled before the split, so the result is reproducible.

We also print out the shapes of the new datasets to check the split. In this case the train sets have the shape (25569,) and the test sets (6393,).

To perform further analysis we need to transform our data into a format that can be processed by our machine learning models.
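
With sklearn, one way to do this is CountVectorizer followed by TfidfTransformer, fitted on the training tweets from the previous step:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    count_vect = CountVectorizer(stop_words='english')
    x_train_counts = count_vect.fit_transform(x_train)       # bag-of-words counts

    tfidf_transformer = TfidfTransformer()
    x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)  # tf-idf weights

    print(x_train_tfidf.shape)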

My understanding of what we’re doing here is the following:

  • CountVectorizer does text preprocessing, tokenizing and filtering of stopwords; it builds a dictionary of features and transforms documents to feature vectors.
  • TfidfTransformer transforms the above vector by dividing the number of occurrences of each word in a document by the total number of words in the document. These new features are called tf for Term Frequencies.

Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called tf-idf for “Term Frequency times Inverse Document Frequency”.
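
In textbook notation (sklearn's implementation adds smoothing on top of this), the weight of a term t in a document d is:

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)}

where N is the number of documents in the corpus and df(t) is the number of documents containing t.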

Most interesting step:

Now we can actually train our model. I’m going to use RandomForestClassifier. Using RandomForestClassifier is simple with sklearn. All we need to do is to import the model and fit it with our data.
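
A minimal version of that, with the hyperparameters left at sklearn's defaults:

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier()
    model.fit(x_train_tfidf, y_train)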

We have trained our model with the training set. Now it's time to test our model. The predictions made by the model will be 0s and 1s.
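
The held-out tweets have to go through the same vectorizer and transformer (transform only, without refitting) before predicting:

    x_test_counts = count_vect.transform(x_test)
    x_test_tfidf = tfidf_transformer.transform(x_test_counts)

    predictions = model.predict(x_test_tfidf)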

Now that we have our predictions, let's check our model's accuracy using the accuracy_score function in sklearn.
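
The accuracy check itself:

    from sklearn.metrics import accuracy_score

    print(accuracy_score(y_test, predictions))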

Running this gives:

0.957766306898

Wow, 95.7% is a great score. Out of 6393 texts, we got 270 predictions wrong.

Till now we have tested our model on 20% of the train data. Now let's load the real test data given in the Analytics Vidhya competition and make predictions.

Apply all the above data preprocessing steps to this test data as well, fit our model on the whole train data, and predict the results on the test data, as sketched below.
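
Put together, that might look like this, assuming the competition file is saved as test.csv (the file and variable names are mine):

    test = pd.read_csv('test.csv')
    test['processed_tweets'] = test['tweet'].apply(clean_tweet)

    # Refit the vectorizer, transformer and model on the whole train data
    full_counts = count_vect.fit_transform(train['processed_tweets'])
    full_tfidf = tfidf_transformer.fit_transform(full_counts)
    model.fit(full_tfidf, train['label'])

    # Transform the competition test set and predict
    test_counts = count_vect.transform(test['processed_tweets'])
    test_tfidf = tfidf_transformer.transform(test_counts)
    test['label'] = model.predict(test_tfidf)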

Write these predictions to a CSV file in the format specified in the sample_submission.csv file.
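
Assuming the submission format asks for the id and label columns, as in sample_submission.csv:

    submission = test[['id', 'label']]
    submission.to_csv('output.csv', index=False)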

Now submit your output.csv file to the Analytics Vidhya competition and get your score. See you on the leaderboard!

Going forward, you can improve the model's accuracy by tuning its hyperparameters with GridSearchCV.
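
A minimal sketch of such a search; the parameter grid below is only an example, not a recommendation:

    from sklearn.model_selection import GridSearchCV

    param_grid = {
        'n_estimators': [100, 200, 500],
        'max_depth': [None, 10, 30],
    }

    grid = GridSearchCV(RandomForestClassifier(), param_grid,
                        cv=3, scoring='accuracy')
    grid.fit(x_train_tfidf, y_train)
    print(grid.best_params_, grid.best_score_)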

Further Learning:

If you have any queries, comment here or you can reach me at kommaraviteja@gmail.com

Happy coding :)
