Twitter Sentiment Analysis

Saket Garodia
Published in Analytics Vidhya
10 min read · Dec 24, 2019

Using Text mining techniques and NLP to classify tweets as negative or positive.

Most real-world data is unstructured text, so it is important for a data science enthusiast to learn text mining and natural language processing techniques to extract useful insights from that text.

In this blog, I will explain how to conduct sentiment analysis on a labeled dataset. The problem is taken from one of the Analytics Vidhya contests. Here’s the link to the problem: https://datahack.analyticsvidhya.com/contest/linguipedia-codefest-natural-language-processing-1/

Problem Introduction

Sentiment analysis is the contextual mining of text to identify and extract subjective information from source material. It helps a business understand the social sentiment around its brand, product or service while monitoring online conversations. Brands can use this data to measure the success of their products in an objective manner. In this challenge, we have been provided with tweet data and must predict netizens’ sentiment towards electronic products. Given tweets from customers about various tech firms that manufacture and sell mobiles, computers, laptops, etc., the task is to identify whether a tweet expresses negative sentiment towards such companies or products.

Approach

First, we will import all the necessary libraries we will be using during our analysis.

import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB  # used in the Naive Bayes section below
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import (train_test_split, GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, KFold)
from sklearn.metrics import (f1_score, make_scorer, roc_curve, roc_auc_score,
                             precision_recall_curve, classification_report, confusion_matrix)

Now, let us import our given training and test data that contains different tweets by the customers.

train = pd.read_csv('train_2kmZucJ.csv').drop(columns=['id'])
test = pd.read_csv('test_oJQbWVk.csv').drop(columns=['id'])
train.head()
Training data ( head )

Text Pre-Processing and Visualizations

Clearly, the tweets contain a lot of noisy data that needs to be removed before we move forward with the analysis. The label ‘0’ corresponds to the tweets that have a positive sentiment and label ‘1’ corresponds to the tweets with negative sentiments. Now, let us check the percentage of positive and negative tweets in the training data.

# Printing percentage of tweets with +ve and -ve sentiments
print('Percentage of tweets labeled as a negative sentiment ', end='')
print(sum(train['label'] == 1) * 100 / train.shape[0], end='%\n')
print('Percentage of tweets labeled as a positive sentiment ', end='')
print(sum(train['label'] == 0) * 100 / train.shape[0], end='%\n')

ax = train['label'].value_counts().plot(kind='bar', figsize=(10, 6),
                                        title="Distribution of positive and negative sentiments in the data")
ax.set_xlabel("Sentiment ( 0 == positive, 1 == negative)")
ax.set_ylabel("Count")
Positive vs Negative labeled tweets

The bars and the percentages above show that approximately 75% of the tweets are positive whereas 25% are negative. Hence, we can infer that the data is imbalanced. We will use a weighted F1 score to evaluate our models.
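To see why plain accuracy is a weak yardstick on data like this, here is a minimal sketch. The labels are made up to mirror the roughly 75/25 split above, and the "model" simply predicts the majority class for every tweet.

```python
# Hypothetical labels mirroring the roughly 75/25 split described above:
# 0 = positive, 1 = negative.
y_true = [0] * 75 + [1] * 25

# A "model" that ignores the tweet and always predicts the majority class.
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the negative class: share of truly negative tweets that were found.
negatives_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
negative_recall = negatives_found / y_true.count(1)

print(f"accuracy: {accuracy:.2f}")                # 0.75
print(f"negative recall: {negative_recall:.2f}")  # 0.00
```

The baseline looks respectable at 75% accuracy yet never finds a single negative tweet, which is exactly the class this problem cares about.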

Now, the first step is to remove noisy data like punctuation, hashtag symbols, @ and other characters that are not alphanumeric. For our purpose, we will treat only the alphanumeric data as meaningful for identifying sentiment. To remove the noise, we will import RegexpTokenizer, which splits strings into substrings based on a regular expression. The regular expression we will use is ‘\w+’, which keeps runs of alphanumeric characters (and underscores) as tokens and drops everything else from the tweets. You can go through this link to learn about various tokenizers: https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3.

from nltk.tokenize import RegexpTokenizer

regexp = RegexpTokenizer(r"\w+")

# applying regexp.tokenize to both training and test sets
train['tweet'] = train['tweet'].apply(regexp.tokenize)
test['tweet'] = test['tweet'].apply(regexp.tokenize)
train.head()

The processing will be the same for both training and test sets.
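To get a feel for what the ‘\w+’ pattern keeps and drops, here is a small sketch using Python's built-in re module on a made-up tweet. As far as I understand, RegexpTokenizer with this pattern should give the same tokens, but treat that equivalence as an assumption.

```python
import re

# Hypothetical tweet, made up for illustration.
tweet = "Loving my new #iPhone!! @apple http://t.co/xyz :)"

# \w+ matches runs of letters, digits and underscores; everything else
# (spaces, '#', '@', punctuation) acts as a separator and is dropped.
tokens = re.findall(r"\w+", tweet)
print(tokens)
# ['Loving', 'my', 'new', 'iPhone', 'apple', 'http', 't', 'co', 'xyz']
```

Notice that the URL is broken into fragments such as ‘http’, ‘t’ and ‘co’, so some noise survives this step as very short tokens.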

Now that we have a tokenized version of the alphanumeric data, our next step is to remove the common words that aren’t useful for sentiment analysis. Words like about and above, conjunctions, etc. are used a lot in any text but aren’t useful for our purpose. These words are called stopwords. We will now remove the stopwords to make our tweets cleaner for analysis.

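import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# assumption: the English stopword list, since the tweets are in English
list_stop_words = set(stopwords.words('english'))

# remove stopwords from both training and test set
# (the comparison is case-sensitive; tokens are lowercased only in a later step)
train['tweet'] = train['tweet'].apply(lambda x: [item for item in x if item not in list_stop_words])
test['tweet'] = test['tweet'].apply(lambda x: [item for item in x if item not in list_stop_words])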

After removing the stopwords, we will remove all the words that have a length <= 2. In general, such short words rarely carry sentiment and are most probably noise in our analysis. Apart from removing short words, we will convert all the tokens to lowercase, because words like ‘apple’ and ‘Apple’ mean the same thing for sentiment purposes.

# drop tokens of length <= 2 and join the remaining tokens back into one string
train['tweet'] = train['tweet'].apply(lambda x: ' '.join([w for w in x if len(w) > 2]))
test['tweet'] = test['tweet'].apply(lambda x: ' '.join([w for w in x if len(w) > 2]))

# convert to lowercase
train['tweet'] = train['tweet'].str.lower()
test['tweet'] = test['tweet'].str.lower()
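The cleaning steps so far can be sketched end to end on plain Python lists. The stopword set and the helper name clean_tokens below are hypothetical stand-ins for illustration, not the NLTK list or anything from the contest code.

```python
# Hypothetical, tiny stopword list; the real NLTK list is much longer.
toy_stop_words = {"my", "the", "is", "about", "above"}

def clean_tokens(tokens, stop_words):
    """Drop stopwords, drop tokens of length <= 2, then join and lowercase."""
    kept = [tok for tok in tokens if tok not in stop_words]
    kept = [tok for tok in kept if len(tok) > 2]
    return " ".join(kept).lower()

tokens = ["Loving", "my", "new", "iPhone", "apple", "http", "t", "co", "xyz"]
print(clean_tokens(tokens, toy_stop_words))
# loving new iphone apple http xyz

# Stopword matching happens before lowercasing, so a capitalised "The" survives.
print(clean_tokens(["The", "screen", "is", "cracked"], toy_stop_words))
# the screen cracked
```

The second call shows a side effect of the ordering: because the lookup is case-sensitive, capitalised stopwords slip through. Lowercasing before the stopword filter would catch them, if that matters for your data.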

Now, it’s time to build a WordCloud and gain some insights into the most common words.

from wordcloud import WordCloud

# join with a space so words from adjacent tweets do not run together
all_words = ' '.join(train['tweet'])

# building a wordcloud on the data from all tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(all_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

The next step, stemming or lemmatization, is very important for any text mining problem. Stemming and lemmatization are text normalization (sometimes called word normalization) techniques in Natural Language Processing that are used to prepare text, words, and documents for further processing. They map words that share a root and context to a common form, which makes the text easier to model with classification algorithms. For example, play, playing and played all carry the same meaning but are different words. If we don’t normalize them all to the common word ‘play’, each classification model will treat these three words as different features, which inflates the vocabulary and can contribute to overfitting.

The main difference between stemming and lemmatization is that stemming works by cutting off the end or the beginning of a word, whereas lemmatization maps the word to a meaningful root word (its lemma) by taking into account the morphological analysis of words, here using WordNet.

For our purpose, we will use lemmatization as it brings in common words that are meaningful and thus will be better for sentiment analysis.
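The contrast can be sketched with two toy functions. Both are deliberately crude and hypothetical: toy_stem is not a real stemming algorithm such as Porter's, and TOY_LEMMAS is a three-word stand-in for a real dictionary like WordNet.

```python
def toy_stem(word):
    """Crude suffix stripping: chop the first matching ending, nothing more."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet.
TOY_LEMMAS = {"playing": "play", "played": "play", "studies": "study"}

def toy_lemmatize(word):
    """Dictionary lookup: return a known base form, else the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for w in ["playing", "played", "studies"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

For ‘studies’, the suffix stripper produces ‘studi’, which is not a word, while the lookup returns ‘study’. That difference is the reason for preferring lemmatization here.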

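nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Each tweet is a single space-separated string at this point, so lemmatize it
# word by word; passing the whole string to lemmatize() would treat it as one token.
def lemmatize_tweet(text):
    return ' '.join(lemmatizer.lemmatize(word) for word in text.split())

train['tweet'] = train['tweet'].apply(lemmatize_tweet)
test['tweet'] = test['tweet'].apply(lemmatize_tweet)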

Now, it’s time to have a look at the 20 most frequent words in the positive as well as the negative tweets.

pos = train[train['label'] == 0]
neg = train[train['label'] == 1]

# join with a space so words from adjacent tweets do not run together
pos_sentiment_words = ' '.join(pos['tweet'])  # words from the tweets that are positive
neg_sentiment_words = ' '.join(neg['tweet'])  # words from the tweets that are negative

# top 20 words in positive tweets
list_pos_words = pos_sentiment_words.split()  # list of positive sentiment words
freq_dis_pos = nltk.FreqDist(list_pos_words)  # number of occurrences of each word
freq_dataframe = pd.DataFrame({'Words': list(freq_dis_pos.keys()),
                               'Count': list(freq_dis_pos.values())})  # data frame of words and counts
# selecting the 20 most frequent words
freq_dataframe = freq_dataframe.nlargest(columns="Count", n=20)
plt.figure(figsize=(16, 5))
ax = sns.barplot(data=freq_dataframe, x="Words", y="Count")
ax.set(ylabel='Count')
ax.set(xlabel='Top 20 words used in positive context')
plt.title("Top 20 words in the tweets labeled as POSITIVE SENTIMENT")
plt.show()

# top 20 words in negative tweets
list_neg_words = neg_sentiment_words.split()  # list of negative sentiment words
freq_dis_neg = nltk.FreqDist(list_neg_words)  # number of occurrences of each word
freq_dataframe = pd.DataFrame({'Words': list(freq_dis_neg.keys()),
                               'Count': list(freq_dis_neg.values())})  # data frame of words and counts
# selecting the 20 most frequent words
freq_dataframe = freq_dataframe.nlargest(columns="Count", n=20)
plt.figure(figsize=(16, 5))
ax = sns.barplot(data=freq_dataframe, x="Words", y="Count")
ax.set(ylabel='Count')
ax.set(xlabel='Top 20 words used in negative context')
plt.title("Top 20 words in the tweets labeled as NEGATIVE SENTIMENT")
plt.show()
Top 20 words in positive tweets
Top 20 words in negative tweets
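The same word counting can be sketched without NLTK using collections.Counter. The tweets below are made up. The point of the sketch is the separator used when joining tweets, which affects the counts.

```python
from collections import Counter

# Hypothetical cleaned tweets, made up for illustration.
tweets = ["love new phone", "phone battery great", "love camera"]

# Join with a space so the last word of one tweet does not fuse with the
# first word of the next one.
all_text = " ".join(tweets)
word_counts = Counter(all_text.split())
print(word_counts.most_common(2))
# [('love', 2), ('phone', 2)]

# Joining with an empty string glues neighbouring tweets together instead.
fused = "".join(tweets)
print(fused.split())
# ['love', 'new', 'phonephone', 'battery', 'greatlove', 'camera']
```

With the empty separator, ‘phone’ and ‘love’ each lose a count to the fused tokens ‘phonephone’ and ‘greatlove’, so frequency charts built that way would be slightly off.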

Let us also look into the WordClouds for positive and negative tweets.

# WordCloud for positive tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(pos_sentiment_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Wordcloud for positive tweets')
plt.show()

# WordCloud for negative tweets
wordcloud = WordCloud(width=800, height=500, random_state=21, max_font_size=110).generate(neg_sentiment_words)
plt.figure(figsize=(10, 7))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.title('Wordcloud for negative tweets')
plt.show()
Wordcloud for positive tweets
Wordcloud for negative tweets

From the above bar graphs and WordClouds, we can see that some of the words in the negative tweets are fuck, fucking, ipod, time, etc whereas some of the words used in positive tweets are life, day, photography, sony, instagram, etc.

Modelling

Now that we have cleaned our data and taken a brief look at it through the WordClouds, we will prepare the data for modelling. Preparing text data involves converting it into a numeric format that machine learning models can understand, since these models operate only on numerical data.

Generally, there are various ways to convert text data into numeric form like CountVectorizer, TfIdf, etc. Count Vectorizer is based on the bag of words model. It works by counting the words’ frequencies in each document (each tweet in this case).
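As a rough illustration of the bag-of-words idea, the counts can be built by hand with collections.Counter. The two documents are hypothetical, and CountVectorizer itself does more than this (for example its own tokenization), so read it as a sketch of the concept only.

```python
from collections import Counter

# Hypothetical cleaned tweets.
docs = ["love love phone", "hate phone battery"]

# Vocabulary: every distinct word, in sorted order, gets one column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per tweet, one count per vocabulary word.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocab])

print(vocab)         # ['battery', 'hate', 'love', 'phone']
print(count_matrix)  # [[0, 0, 2, 1], [1, 1, 0, 1]]
```

Word order is discarded entirely; each tweet becomes a row of counts over the shared vocabulary.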

With TfIdf (Term Frequency–Inverse Document Frequency), the numeric value increases with the count of a word in a document but is offset by the number of documents in the corpus that contain the word. For example, if the word ‘apple’ occurs many times in a document but is also present in about 80% of the tweets, CountVectorizer will give a high value for apple, whereas TfIdf will give it a much lower weight because it is a common word occurring in many documents and therefore isn’t useful for classifying the documents (or tweets here).
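The ‘apple’ example can be worked through numerically. The sketch below uses the smoothed IDF form that scikit-learn documents as the TfidfVectorizer default, idf = ln((1 + n) / (1 + df)) + 1, on a hypothetical corpus of 10 tweets. It skips the row normalization the vectorizer also applies, so the numbers are illustrative rather than what the library would output.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 10                                    # hypothetical corpus of 10 tweets
idf_apple = smoothed_idf(n_docs, doc_freq=8)   # 'apple' appears in 8 of them
idf_awful = smoothed_idf(n_docs, doc_freq=1)   # 'awful' appears in only 1

# Same raw count (say 3) in one tweet, very different weights.
term_count = 3
print(round(term_count * idf_apple, 3))
print(round(term_count * idf_awful, 3))
```

With identical raw counts, the rare word ends up with more than twice the weight of the common one, which is the offsetting effect described above.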

In this case, we will use TfIdf.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(min_df=2, max_df=.9, max_features=1000, ngram_range=(1, 1))
tfidf_fit = tfidf_vectorizer.fit(train['tweet'])
tfidf = tfidf_fit.transform(train['tweet'])
tfidf_test = tfidf_fit.transform(test['tweet'])

We will now use train_test_split to split the labeled training data into a new training set and a held-out validation set, which we will use to pick the best model. We can then apply the best model to our original test set.

X_train, X_test, y_train, y_test = train_test_split(tfidf, train['label'], random_state=4, test_size=.3)

# converting sparse matrices to np.array
X_train = X_train.toarray()
X_test = X_test.toarray()

# keep the labels as 1-D arrays, the shape scikit-learn estimators expect for y
y_train = np.array(y_train)
y_test = np.array(y_test)

Now, we will try three models: Logistic Regression (with hyperparameter tuning), Support Vector Machine (SVM), and the Multinomial Naive Bayes classifier (MultinomialNB), and look for the best model to use on our test set.

Logistic Regression

# Logistic Regression with GridSearch
# the liblinear solver supports both the l1 and l2 penalties searched below
clf = LogisticRegression(solver='liblinear')

# use a full grid over all parameters
param_grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
f1 = make_scorer(f1_score, average='weighted')

# run grid search
grid = GridSearchCV(clf, cv=5, scoring=f1, param_grid=param_grid)
grid.fit(X_train, y_train)

print("Grid-Search with weighted F1")
print("Best parameters:", grid.best_params_)
print("Best cross-validation score (f1): {:.3f}".format(grid.best_score_))

y_predict = grid.predict(X_test)
print('The weighted F1 score with the best hyperparameters is ', end='')
print(f1_score(y_test, y_predict, average='weighted'))
print("Classification Report: ")
print(classification_report(y_test, y_predict))

The weighted F1 score from Logistic Regression with the best hyperparameters is 0.8900, which seems pretty good.
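To make the size of the grid search above concrete, here is a small sketch that enumerates the parameter combinations with itertools instead of scikit-learn. The list c_values reproduces the seven powers of ten that np.logspace(-3, 3, 7) generates.

```python
from itertools import product

# Same values as np.logspace(-3, 3, 7): seven powers of ten from 0.001 to 1000.
c_values = [10.0 ** k for k in range(-3, 4)]
penalties = ["l1", "l2"]

# A grid search tries every combination of the listed values.
candidates = list(product(c_values, penalties))
n_folds = 5

print(len(candidates))            # 14 parameter combinations
print(len(candidates) * n_folds)  # 70 cross-validation fits
```

With the default refit behaviour, GridSearchCV then trains the best combination once more on the whole training split, and that refitted model is what grid.predict uses.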

F1 score:-

The weighted F1 score is the average of the F1 scores for label ‘0’ and label ‘1’, each weighted by the number of true tweets with that label (its support), as can be seen in the classification report above. Now, let us check the SVM and Naive Bayes models too.
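The weighting can be worked through by hand. The sketch below computes per-label F1 and the support-weighted average in plain Python on eight hypothetical predictions. The helper names are made up, and it is meant to mirror what f1_score with average='weighted' reports rather than replace it.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one label, treating that label as the 'positive' class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighted by each label's support."""
    total = len(y_true)
    return sum(
        f1_for_label(y_true, y_pred, label) * y_true.count(label) / total
        for label in set(y_true)
    )

# Hypothetical predictions on 8 tweets (0 = positive, 1 = negative).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.75
```

Here label ‘0’ scores about 0.833 with a support of 6 and label ‘1’ scores 0.5 with a support of 2, giving (6 × 0.833 + 2 × 0.5) / 8 = 0.75. The majority class dominates the average, which is worth keeping in mind when reading a weighted score on imbalanced data.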

Support Vector Machine

# Classifier: Support Vector Classifier with a linear kernel
# (degree and gamma have no effect with the linear kernel)
SVM = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(X_train, y_train)

# predict the labels on the validation dataset
y_predict = SVM.predict(X_test)

# use f1_score to get the weighted F1 score
print('The weighted F1 score ', end='')
print(f1_score(y_test, y_predict, average='weighted'))
print("Classification Report: ")
print(classification_report(y_test, y_predict))

SVC gives us a weighted F1 score of 0.8916, which is marginally better than Logistic Regression.

Naive Bayes Classifier

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# use a full grid over all parameters (alpha is the additive smoothing parameter)
param_grid = {'alpha': [0, 1]}
f1 = make_scorer(f1_score, average='weighted')

# run grid search
grid = GridSearchCV(clf, cv=5, scoring=f1, param_grid=param_grid)
grid.fit(X_train, y_train)

print("Grid-Search with weighted F1")
print("Best parameters:", grid.best_params_)
print("Best cross-validation score (f1): {:.3f}".format(grid.best_score_))

y_predict = grid.predict(X_test)
print('The weighted F1 score with the best hyperparameters is ', end='')
print(f1_score(y_test, y_predict, average='weighted'))
print("Classification Report: ")
print(classification_report(y_test, y_predict))

The Naive Bayes model gives us the best weighted F1 score among the three, 0.8918, though only marginally ahead of SVC (0.8916) and Logistic Regression (0.8900). Apart from the best score, Naive Bayes classifiers in general work very fast and are thus computationally efficient algorithms.
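One way to see why Naive Bayes is cheap is that training amounts to a single counting pass. The sketch below is a from-scratch toy on raw word counts with made-up tweets and hypothetical helper names. It is not the MultinomialNB implementation, and it does not use the TfIdf features from this blog.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Training is a single counting pass over the data."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)   # label -> word -> count
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict_nb(doc, class_counts, word_counts, vocab, alpha=1.0):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_label / n_docs)
        for word in doc.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Additive smoothing keeps unseen word/label pairs above zero.
            score += math.log(
                (word_counts[label][word] + alpha)
                / (total_words + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data (0 = positive, 1 = negative).
docs = ["love phone", "great camera love", "hate battery", "awful screen hate"]
labels = [0, 0, 1, 1]
model = train_nb(docs, labels)
print(predict_nb("love camera", *model))  # 0
print(predict_nb("hate screen", *model))  # 1
```

There is no iterative optimisation anywhere: fitting is counting, and prediction is a handful of additions per word, which is why these classifiers scale well to large vocabularies.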

Now that we have the best model, we can use it on the preprocessed test data, which is stored as tfidf_test, to predict its sentiments.

sparse_test = tfidf_test

# storing preprocessed test data into an array
test_data = sparse_test.toarray()
print(test_data.shape)

# doing predictions using the best algorithm, which was Naive Bayes in this case
# (grid is the fitted Naive Bayes grid search from the previous step)
y_predict = grid.predict(test_data)

Thanks for reading and getting one step further in your data science journey. Keep learning and reading more blogs.

Please do post any feedback or suggested improvements on the model or blog.

Saket Garodia
Analytics Vidhya

Senior Data Scientist at 84.51(Kroger), AI/Data Science, Psychology, economics, books; Linkedin — https://www.linkedin.com/in/saket-garodia/