Twitter Sentiment Analysis Using ULMFiT

Rajas Sanjay Ubhare
Published in The Startup · 9 min read · Sep 17, 2020
Twitter Sentiment Analysis | Source: Analytics Vidhya

In today’s world of data overload, companies gather tonnes of data on customer feedback, shopping behavior, and more. By analyzing this data, companies can deftly adapt their digital profile, products, or services to best suit the evolving marketplace and their customers. However, it is still difficult for any human to interpret all of it manually without mistakes or bias.

Sentiment Analysis is a method to evaluate whether a piece of writing or text is positive, negative, or neutral. It lets data analysts at multinational corporations gauge public sentiment and measure how consumers perceive a product or brand.

Today, Deep Learning and Natural Language Processing (NLP) play a significant role in Sentiment Analysis. This blog focuses on applying sentiment analysis to Twitter data scraped from every major U.S. airline. Our goal is to classify customer tweets into three categories: positive, negative, and neutral. There are several Machine Learning algorithms for classification, one example being K-Nearest Neighbors, as explained in my blog Telecom Industry Customer Churn Prediction with K Nearest Neighbor. However, in this case we will use NLP since our data is unstructured (raw text). By the end of this blog, we will have built and trained a State-of-The-Art (SoTA) Machine Learning model to classify tweets based on sentiment.

This dataset is available on Kaggle: https://www.kaggle.com/crowdflower/twitter-airline-sentiment.

Transfer Learning: Leverage Insights from Big Data | Source: Datacamp.com

We will apply a supervised ULMFiT model to the Twitter data of major U.S. airlines, following the approach of Howard and Ruder presented in the paper Universal Language Model Fine-tuning for Text Classification. ULMFiT stands for Universal Language Model Fine-tuning. It is an effective Transfer Learning approach whose language-model fine-tuning techniques can be applied to any NLP task.

Quantum transfer learning | Source: Pennylane

We will follow a step-by-step procedure to build a ULMFiT model: Data Exploration, then Text Preprocessing, followed by building the Language Model, and finally building the Classifier Model. At the end, we will evaluate the accuracy of our ULMFiT model.

The complete Jupyter notebook for this can be found here: Twitter-Sentiment-Analysis-using-ULMFiT. So let’s begin.

Data Exploration and Processing

Exploratory Data Analysis (EDA) of the dataset revealed missing values in a few of its columns.

The columns with more than 90% missing values, namely tweet_coord, airline_sentiment_gold, and negativereason_gold, were removed.

It is better to delete these columns because they will not provide any useful information to our model.
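A minimal sketch of this check and cleanup, assuming the Kaggle CSV is named Tweets.csv and is loaded into a pandas data frame called dataset:

import pandas as pd

dataset = pd.read_csv('Tweets.csv')

# Fraction of missing values per column, highest first
print(dataset.isnull().mean().sort_values(ascending=False))

# Drop the columns that are more than 90% empty
dataset.drop(['tweet_coord', 'airline_sentiment_gold', 'negativereason_gold'],
             axis=1, inplace=True)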

The majority of the tweets are negative, which means people are generally dissatisfied with the airlines’ service.

From the bar graph of tweet counts per airline, it is evident that United Airlines is widely recognized on Twitter. Of course, the count alone does not tell us whether that popularity is positive or negative. Apart from that, the fact that there are very few tweets about Virgin America gives the impression that perhaps their standard is neither remarkably good nor bad.
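For reference, both bar charts can be reproduced with pandas’ built-in plotting; a minimal sketch, assuming the data frame is named dataset as above:

import matplotlib.pyplot as plt

# Distribution of sentiment labels across all tweets
dataset['airline_sentiment'].value_counts().plot(kind='bar', title='Tweet count per sentiment')
plt.show()

# Number of tweets mentioning each airline
dataset['airline'].value_counts().plot(kind='bar', title='Tweet count per airline')
plt.show()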

We will then define a new feature, ‘tweet_len,’ which gives the length of each tweet in the ‘text’ column of our dataset.
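A one-line sketch of this feature, assuming the same dataset data frame:

# Length of each tweet in characters
dataset['tweet_len'] = dataset['text'].str.len()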

There is not much connection between the number of positive/neutral tweets and tweet length. For negative tweets, however, the distribution is strongly skewed towards longer tweets. This is plausible: the angrier the person tweeting, the more he/she has to say.
After completing the EDA, we can say that sentiment varies significantly depending on the airline. Considering the overall sentiment, Virgin America is the most positive, while United is the most negative.
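This per-airline breakdown can be verified with a simple cross-tabulation; a sketch, again assuming the dataset data frame:

import pandas as pd

# Share of positive/neutral/negative tweets per airline
print(pd.crosstab(dataset['airline'], dataset['airline_sentiment'], normalize='index'))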

Text Preprocessing

Before building the model, we will process the column named ‘text,’ which contains the raw text of the tweets posted by customers.

We will perform the text preprocessing using the well-known nltk library. To do this, we will import the necessary libraries into our notebook and create a new data frame containing just two columns: airline_sentiment and text.

tweet_senti = dataset[['airline_sentiment','text']]
tweet_senti

We will clean the column text and create a list named corpus to store the results. We will do this by:

  1. Converting all the characters in Lowercase.
  2. Removing characters apart from A-Z and a-z.
  3. Removing the hashtags #.
  4. Replacing ‘https://’-style URLs with the simple text ‘link’.

import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

wordnet = WordNetLemmatizer()
ps = PorterStemmer()
corpus = []
for i in range(0, len(tweet_senti)):
    sntm = tweet_senti['text'][i]
    sntm = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'link', sntm)  # replace URLs with 'link'
    sntm = re.sub(r'#([^\s]+)', r'\1', sntm)                           # strip the '#' from hashtags
    sntm = re.sub('[^a-zA-Z]', ' ', sntm)                              # keep only A-Z and a-z
    sntm = sntm.lower()                                                # convert to lowercase
    #sntm = [ps.stem(word) for word in sntm.split() if not word in set(stopwords.words('english'))]
    #sntm = ' '.join(sntm)
    corpus.append(sntm)

Now we will replace the ‘text’ column of the data frame tweet_senti with the cleaned values from the list corpus. We do not need the rest of the columns in the dataset; thus, we will keep only the two relevant columns, ‘newtext’ and ‘airline_sentiment,’ as the new data frame.

tweet_senti['newtext']= corpus
tweet_senti.drop(["text"],axis=1, inplace = True)

We will then split the generated data frame into a Training set and Testing set, where 80% will be in the training set, and 20% will be in the testing set.
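A minimal sketch of this split using scikit-learn, assuming we stratify on the sentiment label and save the training portion as tweet_train.csv, the file name fastai reads below (the random_state value is an arbitrary choice):

from sklearn.model_selection import train_test_split

# 80/20 split, stratified so class proportions are preserved in both sets
train, test = train_test_split(tweet_senti, test_size=0.2, random_state=42,
                               stratify=tweet_senti['airline_sentiment'])
train.to_csv('tweet_train.csv', index=False)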

Building the ULMFiT Model

Universal Language Model Fine-tuning (ULMFiT) is a transfer learning technique that can assist with different NLP tasks. It has been a State-of-The-Art (SoTA) NLP technique for a long time. The paper demonstrates ULMFiT on an IMDB sentiment problem. ULMFiT consists of three stages:

  1. LM pre-training: The Language Model (LM) is trained on a general-domain corpus to capture the language’s general features in its different layers. Transfer Learning and the ULMFiT method aim to adapt this pre-trained model to our problem.
  2. LM fine-tuning: The language of tweets is different, so we need to fine-tune the language model on the target dataset (tweets). The full LM is fine-tuned using discriminative fine-tuning and slanted triangular learning rates (STLR) to learn task-specific features.
  3. Classifier fine-tuning: The classifier is fine-tuned on the target task using gradual unfreezing, discriminative fine-tuning, and STLR to preserve low-level representations and adapt high-level ones.

Build the Language Model

We will make heavy use of the fastai library to build this model; thus, we will import fastai to develop and train our ULMFiT model. The fastai text module contains all the functions required to prepare a dataset for different NLP (Natural Language Processing) tasks and to quickly build models from it. We will use the class TextDataBunch family, of which TextLMDataBunch is ideal for training the language model. We will create a TextLMDataBunch from tweet_train.csv and specify valid_pct=0.1 to set aside 10% of our training data as the validation set.

from fastai.text import *
tweet = TextLMDataBunch.from_csv(path='', csv_name='tweet_train.csv', valid_pct=0.1)

Now we will build a language model, train it, and finally save the encoder (the optimized weights of the trained model). To build our language model, we will use the function language_model_learner() from fastai. We will pass in our ‘tweet’ object to specify our Twitter dataset, along with AWD_LSTM to indicate that we are using this particular architecture for our language model.
The AWD-LSTM has long dominated state-of-the-art language modeling. AWD-LSTM stands for ASGD Weight-Dropped LSTM, and it employs a variety of well-known regularization techniques.

tweet_model = language_model_learner(tweet, AWD_LSTM, drop_mult=0.3)
tweet_model.model

We will use the learning rate finder to find the optimum learning rate. The learning rate is a hyperparameter that controls how much the neural network weights are adjusted at each update; it determines the step size of each iteration as the optimizer advances toward the minimum of the loss function.

tweet_model.lr_find()
tweet_model.recorder.plot(show_grid=True, suggestion=True)

Learning rate finder plot with the suggested best learning rate

As evident from the graph plotted above, we will take 3.98e-02 as the learning rate, since the loss is still decreasing steeply at that value before reaching its minimum. We will set the cycle length to 1 as we train for only one epoch. We will also use another hyperparameter, moms, which is a tuple of (max_momentum, min_momentum).

tweet_model.fit_one_cycle(cyc_len=1, max_lr=3.98e-02, moms=(0.85, 0.75))
tweet_model.unfreeze()
tweet_model.fit_one_cycle(cyc_len=5, max_lr=slice(3.98e-02/(2.6**4), 3.98e-02), moms=(0.85, 0.75))
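Having fine-tuned the language model, we save its encoder so the classifier can reuse it; ‘ft_enc’ is an arbitrary file name chosen here:

# Save the fine-tuned encoder weights for the classifier to load later
tweet_model.save_encoder('ft_enc')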

Building Classification Model

Once we have built a Language model, we will change the model accordingly in order to perform the classification task.

To do this, we will first create a new learner object using ‘text_classifier_learner.’ The idea behind this learner object is similar to ‘language_model_learner,’ as we will use the same AWD_LSTM architecture. It can likewise take callbacks that allow us to train our model with specific optimization techniques. After that, we will load the saved encoder into this learner object.
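A minimal sketch of these steps in fastai v1, assuming the classification data bunch is built from the same CSV with the language model’s vocabulary, and that the encoder was saved as ‘ft_enc’ above; we reuse the name tweet_model for the classifier learner so the prediction code further down runs unchanged:

# Classification data bunch that shares the language model's vocabulary
tweet_clas = TextClasDataBunch.from_csv(path='', csv_name='tweet_train.csv',
                                        vocab=tweet.vocab, valid_pct=0.1)

# Same AWD_LSTM backbone; load the fine-tuned encoder into the classifier
tweet_model = text_classifier_learner(tweet_clas, AWD_LSTM, drop_mult=0.3)
tweet_model.load_encoder('ft_enc')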

Next, we will perform Gradual Unfreezing. Rather than fine-tuning all layers at once, which risks catastrophic forgetting, we unfreeze the last layer first, since it contains the least general information, and fine-tune it for one epoch. We then unfreeze the next lower layer and repeat, until all layers are unfrozen and fine-tuned until convergence at the last iteration. In other words, we unfreeze and train the layers of our model one by one from top to bottom, from the last layer toward the inner layers. This is done to prevent the model from forgetting the general features it has already learned.

Similar to what we did before, we choose an optimized learning rate from the point just before the loss curve descends to its minimum. Along with it, we will use the one-cycle policy.

Now we will keep unfreezing the layers one by one.
As different layers capture different types of information, they should be fine-tuned to varying extents. Instead of using the same learning rate for every layer of the model, discriminative fine-tuning lets us apply a specific learning rate to each layer.
Thus, we will train with the next layer unfrozen and apply discriminative fine-tuning, as sketched below.
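A sketch of gradual unfreezing with discriminative fine-tuning in fastai v1; the 2.6**4 divisor follows the ULMFiT convention used earlier, while the learning rate of 2e-02 and the epoch counts are assumptions for illustration:

# Train only the last layer group first
tweet_model.freeze_to(-1)
tweet_model.fit_one_cycle(cyc_len=1, max_lr=2e-02, moms=(0.85, 0.75))

# Unfreeze one more layer group; discriminative fine-tuning gives lower
# layers smaller learning rates via slice(lr/2.6**4, lr)
tweet_model.freeze_to(-2)
tweet_model.fit_one_cycle(cyc_len=1, max_lr=slice(2e-02/(2.6**4), 2e-02), moms=(0.85, 0.75))

# Further layers can be unfrozen the same way (freeze_to(-3), then unfreeze()),
# but as noted in the conclusion, unfreezing more layers increased validation loss here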

Prediction and Evaluating Model

After applying Gradual Unfreezing and discriminative fine-tuning, we will evaluate the accuracy of the model on the testing set that we generated earlier by splitting the original dataset.

test['airline_senti_pred'] = test['newtext'].apply(lambda row: str(tweet_model.predict(row)[0]))

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

print("Accuracy of Model: {}".format(accuracy_score(test['airline_sentiment'], test['airline_senti_pred'])))

# Plotting the confusion matrix
cf_matrix = confusion_matrix(test['airline_sentiment'], test['airline_senti_pred'])
print(cf_matrix)
Accuracy of The Model

Conclusion

After analyzing all the different learning rates and methods we used, the accuracy we obtained was 0.825 (82.5%). Language modeling can be viewed as an ideal source task for NLP: it encompasses many aspects of language, such as long-term dependencies, hierarchical relationships, and sentiment, and it provides data in almost unlimited quantities for most domains and languages. As evident from gradual unfreezing, increasing the number of unfrozen layers per epoch increased the validation loss, resulting in overfitting. The best case we obtained was with two layers unfrozen, because there both the validation loss and the training loss were considerably lower.
