How to Do Twitter Sentiment Analysis with a Pre-Trained Language Model [Python]

Learn how to fine-tune a pre-trained language model in TensorFlow for a downstream task.

12 min read · Sep 15, 2022

Python sentiment analysis for a pre-trained language model

In this tutorial, we will work with a dataset of Spanish tweets about COVID-19 vaccines. The data were collected in the summer of 2021, just after the rollout of several of the COVID-19 vaccines.

The data can be downloaded from here.

The Current State of NLP

There have been two major forces driving the recent development in NLP:

First, the introduction of the self-attention mechanism, in particular through the Transformer architecture. Second, the leveraging of huge amounts of unlabelled data through unsupervised pre-training methods. Thus, the winning strategy has been to first pre-train a Transformer-based model on vast amounts of unlabelled data and subsequently fine-tune it to perform better at a specific task. This second step is usually accomplished with labeled data, though far fewer training examples are required than when training the model from scratch.

Natural Language Processing (NLP) covers a large variety of tasks and applications, including Machine Translation, Text Summarization, Text Generation, Text Classification, Question Answering, and Named Entity Recognition (NER). The ability to develop and improve these very different types of tasks has wide-reaching implications for the field.

Why Use Transformers and Not RNNs Or LSTMs?

Recurrent Neural Networks (RNNs) became very popular in sequence modeling for supervised NLP tasks like classification and regression. The Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) mitigated the vanishing gradient problem, which allowed the modeling of longer sequences. Yet, there are some problems associated with these models. First, they are very hard to parallelize, because each step depends on the result of the previous one. Second, even though LSTM and GRU reduce the vanishing gradient problem, RNNs still have difficulties modeling very long sequences. Transformers leave out the recurrent unit completely and rely only on the attention mechanism, which largely addresses both problems: the parallelization issue and the vanishing gradient problem for longer sequences.

What Are Transformers in Machine Learning?

While explaining the Transformer architecture in detail would go beyond the scope of this article, I want to outline some of its conceptual features, especially those that are relevant to using it.

Unsupervised Pre-Training

The Transformer is usually trained on a very large amount of data in an unsupervised, or, more precisely, self-supervised way: the training targets are derived from the raw text itself, for example by hiding words and asking the model to predict them. In a language modeling task, this gives the model a general statistical understanding of a language. After pre-training, it can be fine-tuned for a downstream task. That means it can be adapted, with a much smaller amount of data (labeled in this case), to a more specific task like sentiment analysis.

Attention Mechanism and Context-Free Embedding

Transformers leave the recurrent unit out completely and rely solely on the attention mechanism. Through this, the model is able to capture long-term dependencies in texts, something crucial to correctly understanding a language. Contrary to RNNs, the learning also happens without any specific direction, i.e. it is non-sequential. This is also the reason we can apply parallelization when training Transformers. Despite this non-sequential approach, the attention mechanism lets the architecture learn which words are related to each other.
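To make the idea more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the two-dimensional embeddings are made up, and the same vectors are reused as queries, keys, and values, whereas a real Transformer learns separate projections for each and runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a short sequence of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Each position is compared with every position, independent of order
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional embeddings for a three-token sequence
tokens = ["la", "vacuna", "funciona"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.9]]

outputs, weights = self_attention(embeddings, embeddings, embeddings)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of weights is computed without reference to any other row, which is why the positions can be processed in parallel. In this toy setup, "funciona" puts more weight on "vacuna" than on "la" simply because their vectors are more similar.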

Spanish Language Models

Even though NLP as a discipline is rapidly growing, most of the efforts are concentrated on the English language. However, most of the world’s population are not native English speakers and there is a huge number of Spanish speakers in the world and on the internet.

Spanish has about 580 million speakers in the world and is the third most common language on the internet, with about 7.9% of users speaking Spanish. In addition to that, Spanish is the second most-used language on social media, according to the Centro Virtual Cervantes.

This illustrates the importance of the active development of Spanish language models, not only from a commercial point of view but also for educational purposes.

How to Create Predictions With Transformers

Before fine-tuning our own model, we will download an existing state-of-the-art model called beto-sentiment-analysis. The model is also available as a software package called pysentimiento. We will apply the model to the data we'll be working with and see how it performs.

Instantiating the Model

We will see that our final classifier consists of two different components: the AutoTokenizer and the actual model.

The AutoTokenizer preprocesses the tweets. This means breaking them into smaller pieces and turning them into numbers to create tensors that can be fed into the model. The actual model receives the preprocessed data and produces our labels for classification. For the classification part, we will use the TFAutoModelForSequenceClassification class, which allows us to download the model we indicate. Remember, we cannot simply use any model, given that our text is in Spanish.

As mentioned earlier, the model we will be using is called beto-sentiment-analysis. It is based on the BETO model, which is essentially a Spanish version of the original BERT model but pre-trained exclusively on Spanish data.

from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline
# Instantiate Hugging Face models
model_name = 'finiteautomata/beto-sentiment-analysis'
# from_pt=True, because this model only exists in PyTorch
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True, num_labels=3)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Combining tokenizer and model into one classifier
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Loading the Data

In the following, we will first store the data in a data frame. After that, we will create the predictions/labels for the tweets with our classifier pipeline from above.

import pandas as pd
tweets_raw_df = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/scraped_tweets_ES_raw.csv')

Creating Labels with the Classifier Pipeline

We will only select the first 10 tweets in the following to see how it performs.

# Create labels in an unsupervised manner with the beto-sentiment-analysis classifier:
## Creating lists for a new dataframe that contains the assigned label and the corresponding probability score
sentiment_output = []
sentiment_proba = []
## Looping through the scraped tweets and adding the predictions to the previously created lists
for tweet in tweets_raw_df['content'][:10]: #select first 10 tweets only
  result = classifier(tweet) #save analyzer result
  sentiment_output.append(result[0]['label']) #select given label
  sentiment_proba.append(result[0]['score']) #select probability of given label
# Concat results with selected columns from original dataframe
sentiment_beto_df = pd.concat([tweets_raw_df[:10][['date', 'username', 'content']], pd.Series(sentiment_output), pd.Series(sentiment_proba)], axis=1)
# Rename new columns to 'sentiment_output' and 'sentiment_probability'
sentiment_beto_df.rename(columns={0: 'sentiment_output', 1: 'sentiment_probability'}, inplace=True)
print('-- This df still contains the tweets that have low probability scores! --')
display(sentiment_beto_df)

Our output should look something like this:

Even in these few examples, we note that some tweets don't have a high probability score. To be more certain that the label is predicted correctly, we will only keep those with a high probability (p > 0.9).

Filtering for High Probability Scores

In the following, only tweets that have a high probability are kept.

# Removing tweets that have low probability scores
df_high_prob = sentiment_beto_df[sentiment_beto_df['sentiment_probability'] > 0.9].reset_index(drop=True)
print('-- The following df only contains tweets that have high probability scores with p>0.9. --')
display(df_high_prob)

Now your output should look similar, except that the tweets with probability scores of 0.9 or lower have been excluded.
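A stricter threshold gives more reliable labels but leaves fewer tweets to analyze. The sketch below shows one way to check that trade-off before settling on a cutoff. The scores are made up for illustration and the helper name coverage_by_threshold is hypothetical; in practice the scores would come from the sentiment_probability column.

```python
def coverage_by_threshold(scores, thresholds):
    """Share of predictions whose score is strictly above each threshold."""
    return {t: sum(s > t for s in scores) / len(scores) for t in thresholds}

# Hypothetical probability scores for ten tweets
scores = [0.99, 0.95, 0.91, 0.88, 0.72, 0.64, 0.97, 0.55, 0.93, 0.81]

coverage = coverage_by_threshold(scores, [0.5, 0.7, 0.9])
for threshold, share in coverage.items():
    print(f"p > {threshold}: {share:.0%} of tweets kept")
```

With these example scores, only half of the tweets survive the p > 0.9 filter, so it is worth checking how much data remains before continuing with the analysis.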

AutoTokenizer

The pipeline function of the transformers library is made up of different parts:

The first one is the so-called AutoTokenizer. As mentioned earlier, the tokenizer is responsible for preprocessing our texts. It will basically create the inputs that our model can understand since it won’t be able to process the words directly.

It will first split a given text into smaller subparts, called Tokens. These tokens consist of words, parts of words, or punctuation symbols. NOTE: We have to make sure to instantiate `AutoTokenizer` using the same model as for classification!

In the second step, the tokenizer will transform the tokens into numbers, since our model won't understand words or punctuation as they are. With those numbers, we can then build the tensors that we feed to our model. To do this, the tokenizer has a vocabulary, which is the part we download when we instantiate it with the from_pretrained method. We need to use the same vocabulary as when the model was pre-trained.
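The two steps can be illustrated with a toy example. The sketch below splits words greedily into the longest known pieces, in the style of the WordPiece scheme used by BERT-like models, and then looks each piece up in a vocabulary. The vocabulary and the helper names tokenize and encode are hypothetical. A real tokenizer has tens of thousands of entries and also handles casing, accents, and special tokens.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known piece fits, so the word is unknown
        tokens.append(piece)
        start = end
    return tokens

def encode(text, vocab):
    """Step 1: split into tokens. Step 2: map each token to its id."""
    tokens = []
    for word in text.lower().replace("!", " !").split():
        tokens.extend(tokenize(word, vocab))
    return tokens, [vocab[t] for t in tokens]

# Hypothetical toy vocabulary
vocab = {"[PAD]": 0, "[UNK]": 1, "la": 2, "vacuna": 3,
         "##s": 4, "funciona": 5, "##n": 6, "!": 7}

tokens, ids = encode("Las vacunas funcionan!", vocab)
print(tokens)  # ['la', '##s', 'vacuna', '##s', 'funciona', '##n', '!']
print(ids)     # [2, 4, 3, 4, 5, 6, 7]
```

The ids only make sense together with the vocabulary that produced them, which is why the tokenizer and the model have to be loaded from the same checkpoint.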

Since our goal is to process multiple sentences at once, we need to perform this operation differently and treat it as a batch.

This means we need to bring them all to the same length: shorter sentences are padded to fill the empty positions, and very long sentences are truncated to the maximum length our model can accept. In our case, that maximum is 512 tokens. Because the max_length argument is left at None, the tokenizer falls back to the model's own limit.
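As a rough sketch of what padding and truncation do to a batch, the following plain-Python function cuts every sequence to a maximum length, pads the shorter ones on the right, and builds the matching attention mask. The token ids and the helper name pad_and_truncate are made up, and a small max_length of 5 is used so the effect is visible. A real tokenizer additionally inserts special tokens.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Bring every sequence in a batch to the same length."""
    # Truncate sequences that are longer than the allowed maximum
    truncated = [ids[:max_length] for ids in batch_ids]
    # Pad up to the longest remaining sequence (like padding='longest')
    target = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token ids for three sentences of different length
batch = [[2, 4, 3], [5, 6, 7, 3, 4, 2, 5], [3]]

encoded = pad_and_truncate(batch, max_length=5)
for ids, mask in zip(encoded["input_ids"], encoded["attention_mask"]):
    print(ids, mask)
```

The attention mask tells the model which positions hold real tokens and which are only filler, so the padding does not influence the prediction.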

Once we do the preprocessing with our tokenizer, we can use the preprocessed data to feed it to our actual model which will be able to classify between positive, negative, and neutral tweets.

The preprocessed data, tweets_preprocessed, behaves like a dictionary that maps strings to tensors of integers. It contains the ids of the tokens, as mentioned before, but also additional arguments that will be useful to the model. Here, for instance, we also have an attention mask that the model will use to have a better understanding of the sequence.

The padding is automatically applied on the side expected by the model (in this case, on the right), with the padding token the model was pre-trained with. The attention mask is also adapted to take the padding into account:

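# Assumption: we reuse the first 10 tweets from above as our batch
tweets = list(tweets_raw_df['content'][:10])
# Batch operations with the AutoTokenizer
tweets_preprocessed = tokenizer(
    tweets,
    padding='longest', # padding to match the longest sequence
    truncation=True,
    max_length=None,
    return_tensors="tf" # we are working with TensorFlow
)
# Feed the preprocessed batch to the model
model_outputs = model(tweets_preprocessed)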

As we can see, the outputs are objects that contain the model’s final activations along with other metadata. In order to understand them and be able to interpret them, we have to go through some additional steps, namely passing the outputs through a Softmax function and then applying labels to give the final output meaning.

import tensorflow as tf
tf_predictions = tf.nn.softmax(model_outputs.logits, axis=-1)
tf_predictions

The object has the logits attribute, which will allow us to access the model’s final activations and create the labels for our output.
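The step from logits to readable labels can be sketched without TensorFlow. In the example below, the logits are invented and the label order in id2label is an assumption; for a real model, the mapping should be read from model.config.id2label rather than hard-coded.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label order; check model.config.id2label for the real mapping
id2label = {0: "NEG", 1: "NEU", 2: "POS"}

# Hypothetical logits for a batch of two tweets
batch_logits = [[-1.2, 0.3, 2.9], [3.1, -0.4, -2.0]]

predictions = []
for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of top class
    predictions.append((id2label[best], probs[best]))

for label, score in predictions:
    print(label, round(score, 3))
```

The score printed next to each label is the softmax probability of the winning class, which is the same kind of value the pipeline reports as its score.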

Fine-Tuning the Baseline RoBERTa Model

TASS 2020 Task: Using Labeled Data for Fine-Tuning

We will use the recently developed Spanish RoBERTa base model and fine-tune it with the TASS2020 Task1 dataset: General polarity at three levels, Subtask 1 Monolingual.

The mentioned TASS2020 dataset for “evaluation of polarity classification systems of tweets written in Spanish” contains a total of 8,409 tweets written in Spanish. We have to keep in mind, though, that this number covers tweets from several countries. Although we're mainly interested in analyzing tweets from Spain, the code in this section combines the subsets from all countries to get more training data.

The dataset can be found here.

Preparing the Data

In the following, we will start by loading two different datasets: a train set and a validation set. Then we will have a look at them. Note that we're only loading the text and the labels.

The TASS2020 dataset consists of several training subsets. Each subset represents tweet data from a different country. We're going to concatenate the subsets to get more training data. This way, our model will have a more robust understanding of the Spanish language.

Note that we’re loading and concatenating the data in one step!

### TRAIN DATA ###
# Load the different train dataframes
es_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_train.tsv', sep='\t', header=None, usecols=[1,2])
cr_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/cr_train.tsv', sep='\t', header=None, usecols=[1,2])
mx_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/mx_train.tsv', sep='\t', header=None, usecols=[1,2])
pe_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/pe_train.tsv', sep='\t', header=None, usecols=[1,2])
uy_train = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/uy_train.tsv', sep='\t', header=None, usecols=[1,2])
# Concatenate the dataframes
train_df = pd.concat([es_train, cr_train, mx_train, pe_train, uy_train], axis=0, ignore_index=True)
# Removing neutral comments
train_df = train_df[train_df[2] != 'NEU']
# Display data
print('TRAINING DATA\n', train_df.shape)
display(train_df.head())
### VALIDATION DATA ###
# Load the different validation dataframes
es_val = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/es_val.tsv', sep='\t', header=None, usecols=[1,2])
cr_val = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/cr_val.tsv', sep='\t', header=None, usecols=[1,2])
mx_val = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/mx_val.tsv', sep='\t', header=None, usecols=[1,2])
pe_val = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/pe_val.tsv', sep='\t', header=None, usecols=[1,2])
uy_val = pd.read_csv('https://raw.githubusercontent.com/lucamarcelo/Vaccine-Tweets-Sentiment-Analysis/main/uy_val.tsv', sep='\t', header=None, usecols=[1,2])
# Concatenate the dataframes
val_df = pd.concat([es_val, cr_val, mx_val, pe_val, uy_val], axis=0, ignore_index=True)
# Removing neutral comments
val_df = val_df[val_df[2] != 'NEU']
# Display data
print('VALIDATION DATA\n', val_df.shape)
display(val_df.head())

Output: The training and Validation data with labels

# Show the distribution of labels in the training data
import seaborn as sns
sns.displot(train_df[2])

Output: The distribution of negative and positive labels

# Creating lists for text and labels for training and validation data
## Train data
train_texts = list(train_df[1])
train_labels = list(train_df[2])
## Val data
val_texts = list(val_df[1])
val_labels = list(val_df[2])

Preprocessing with the Auto Tokenizer

# Creating tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-bne")
# Preprocessing text
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)

Creating the Dataset Objects

Since we have to batch the data, we cannot simply forward the labels and encodings. Instead, we have to create the dataset objects with the tf.data.Dataset.from_tensor_slices() method in TensorFlow. This format allows for easy batching.

Before creating the dataset objects, though, we have to keep in mind that our model isn't able to process any non-numeric values. We have to convert our labels to numbers as well. We'll do this by mapping the strings in the list to integers.

# Convert the labels to ints
## Create a dictionary for the mapping
d = {'N':0, 'P':1}
## Map the values in the dictionary to the two lists of labels
train_labels = list(pd.Series(train_labels).map(d).astype(int))
val_labels = list(pd.Series(val_labels).map(d).astype(int))
# Create the tensorflow datasets from our encodings
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
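Conceptually, slicing pairs the i-th entry of every encoding with the i-th label, and batching then groups those pairs. The plain-Python sketch below mimics that behavior on made-up data. The helper names from_slices and batch are hypothetical stand-ins and not part of TensorFlow, which does the same work lazily and on tensors.

```python
def from_slices(encodings, labels):
    """Pair the i-th entry of every feature list with the i-th label."""
    return [({name: values[i] for name, values in encodings.items()}, labels[i])
            for i in range(len(labels))]

def batch(examples, batch_size):
    """Group consecutive examples into batches of features and labels."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        features = {name: [f[name] for f, _ in chunk] for name in chunk[0][0]}
        batch_labels = [label for _, label in chunk]
        yield features, batch_labels

# Hypothetical encodings for three already padded tweets
encodings = {
    "input_ids": [[2, 4, 0], [5, 6, 7], [3, 0, 0]],
    "attention_mask": [[1, 1, 0], [1, 1, 1], [1, 0, 0]],
}
labels = [0, 1, 1]

examples = from_slices(encodings, labels)
batches = list(batch(examples, batch_size=2))
for features, batch_labels in batches:
    print(features["input_ids"], batch_labels)
```

Because every example carries its own label, shuffling the examples before batching keeps texts and labels aligned.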

Model Training in Native TensorFlow

Now that we have our data prepared and created the dataset objects, we’re ready to fine-tune our model in native TensorFlow.

from transformers import TFAutoModelForSequenceClassification
## Model Definition
roberta_model = TFAutoModelForSequenceClassification.from_pretrained("BSC-TeMU/roberta-base-bne", from_pt=True, num_labels=2)
## Model Compilation
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.metrics.SparseCategoricalAccuracy()
roberta_model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
## Fitting the data
## The datasets are already batched, so batch_size must not also be passed to fit()
history = roberta_model.fit(train_dataset.shuffle(len(train_labels)).batch(64), validation_data=val_dataset.shuffle(len(val_labels)).batch(64), epochs=3)

Evaluation

Calculate the average accuracy for the above model.

import numpy as np
# Calculate the mean val_sparse_categorical_accuracy
print('The mean Sparse Categorical Acc for the val_set is: ', np.mean(history.history['val_sparse_categorical_accuracy']))
# And save the model
roberta_model.save_pretrained('Twitter_Roberta_Model')
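Averaging over epochs mixes the early, weaker epochs with the final one, so it can be useful to look at the last and the best epoch as well. The sketch below assumes a dictionary shaped like history.history; the accuracy values are invented for illustration and are not results from this model.

```python
# Hypothetical per-epoch values, shaped like history.history
history_dict = {"val_sparse_categorical_accuracy": [0.70, 0.80, 0.84]}

val_acc = history_dict["val_sparse_categorical_accuracy"]
mean_acc = sum(val_acc) / len(val_acc)  # what the mean reports
final_acc = val_acc[-1]                 # accuracy of the saved model
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print(f"mean: {mean_acc:.2f}, final: {final_acc:.2f}, best epoch: {best_epoch}")
```

In this made-up run, the mean understates the accuracy of the model that is actually saved after the last epoch.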

The model can also be found on my GitHub.

Using the Fine-Tuned Model to Create Predictions

Creating Predictions with the Pipeline

from transformers import pipeline
# Instantiate the classifier with the tokenizer we used before and the model we just trained
roberta_classifier = pipeline('sentiment-analysis', model=roberta_model, tokenizer=tokenizer)
# Let's see if it works on an example
roberta_classifier(['Hoy no me encuentro muy bien.', 'Hoy es un día bonito.'])

Our model has now been fine-tuned on a new task and is able to classify tweets into two categories: positive and negative. The neutral tweets were removed from the training data, so the model has no neutral class. We also see that each label comes with a probability score, which we can use to determine how sure the model is of the label it assigned.
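Since the model was loaded with num_labels=2 and without label names, the pipeline will most likely report generic names such as LABEL_0 and LABEL_1. A small sketch for translating them back follows. It assumes those generic names and uses invented scores, and the mapping mirrors the dictionary used to encode the training labels (N as 0, P as 1).

```python
# Assumed generic label names, mapped back via the training encoding N=0, P=1
label_names = {"LABEL_0": "negative", "LABEL_1": "positive"}

def rename_labels(results, label_names):
    """Replace generic pipeline labels with readable names."""
    return [
        {"label": label_names.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]

# Hypothetical pipeline output for the two example sentences
raw_results = [
    {"label": "LABEL_0", "score": 0.97},
    {"label": "LABEL_1", "score": 0.98},
]

readable = rename_labels(raw_results, label_names)
for r in readable:
    print(r["label"], r["score"])
```

Labels that are not in the mapping are passed through unchanged, so the function is safe to use even if the model already reports readable names.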

Creating Predictions for Our Dataset

## Creating lists for a new dataframe that contains the assigned label and the corresponding probability score
sentiment_output = []
sentiment_proba = []
## Looping through the scraped tweets and adding the predictions to the previously created lists
for tweet in tweets_raw_df['content']:
  result = roberta_classifier(tweet) #save classifier result
  sentiment_output.append(result[0]['label']) #select predicted label
  sentiment_proba.append(result[0]['score']) #select probability
## Concat results with selected columns from original dataframe
sentiment_roberta_df = pd.concat([tweets_raw_df[['date', 'username', 'verified', 'followersCount', 'location', 'content', 'replyCount', 'retweetCount', 'likeCount']], pd.Series(sentiment_output), pd.Series(sentiment_proba)], axis=1)
## Rename new columns to 'sentiment_output_roberta' and 'sentiment_probability_roberta'
sentiment_roberta_df.rename(columns={0: 'sentiment_output_roberta', 1: 'sentiment_probability_roberta'}, inplace=True)
print('-- This df still contains the tweets that have low probability scores! --')
display(sentiment_roberta_df.head(10))

Output with predictions and probability for our initial dataset

At this point, we can start making modifications to the data, like renaming the labels, modifying the date, or aggregating by location. The dataset is now also suitable for exploring the sentiment in more depth, for example by looking at distributions, mean values, and bi- and multivariate analyses.
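As one example of such a follow-up analysis, the sketch below computes the share of each sentiment per location in plain Python. The rows are invented and the helper name sentiment_share_by_location is hypothetical; with the real data, the same aggregation could be done on the location and sentiment columns of the dataframe.

```python
from collections import Counter, defaultdict

def sentiment_share_by_location(rows):
    """Share of each sentiment label among the tweets of every location."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["location"]][row["sentiment"]] += 1
    return {
        location: {label: n / sum(counter.values()) for label, n in counter.items()}
        for location, counter in counts.items()
    }

# Hypothetical labeled tweets
rows = [
    {"location": "Madrid", "sentiment": "positive"},
    {"location": "Madrid", "sentiment": "negative"},
    {"location": "Madrid", "sentiment": "positive"},
    {"location": "Madrid", "sentiment": "positive"},
    {"location": "Sevilla", "sentiment": "negative"},
    {"location": "Sevilla", "sentiment": "negative"},
]

shares = sentiment_share_by_location(rows)
for location, share in shares.items():
    print(location, share)
```

Locations with very few tweets produce extreme shares, as Sevilla does here, so it makes sense to report the tweet count alongside the percentages.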

Conclusions

At this point, you’re ready to take your own pre-trained language model and fine-tune it for the task you need!

We’ve seen that pre-training a model is simply not achievable for most individuals, since it requires huge amounts of data and computing resources. We can however take the pre-trained model, some labeled data for the downstream task, and fine-tune it for the task we need.

Together with the transformers library, TensorFlow makes it easy to tokenize and preprocess the data, train a model, and evaluate its performance. After fine-tuning the model, you can go ahead and create predictions for unseen data and analyze them afterward to derive insights.