Sentiment Analysis: Using RoBERTa to Train Your Own Model

A simple guide to help you navigate a machine learning sentiment analysis project without the use of Jupyter Notebooks

Aagnay Kariyal
9 min read · Jan 10, 2024

Sentiment analysis using machine learning has become an essential tool for understanding public opinion. In this guide, I will walk you through a step-by-step process of performing sentiment analysis using a pre-trained RoBERTa model, specifically the one available from Hugging Face.

The example uses data collected from Reddit with the help of the Reddit API and the praw package, although you can adapt the approach for other platforms like Twitter or YouTube.

Project Structure

To ensure modularity and ease of management, the project is organized into three distinct files:

  1. reddit.py: Manages the interaction with Reddit API to gather data.
  2. sentiment.py: Handles sentiment analysis, utilizing the RoBERTa model, and includes methods for data cleaning and model training.
  3. main.py: An interface to input data and execute the functionalities of the other two files.

This modular approach enhances code organization and allows for efficient handling of specific components.

Reddit API Integration

To collect data from Reddit, you need to register as a developer and obtain API credentials. Follow the instructions provided in the Reddit API documentation. Once you have your credentials, use the praw package in Python, as demonstrated in the reddit.py file below.

# reddit.py

class RedditClass:
    def __init__(self):
        self.text = "Reddit Initialization"

    @staticmethod
    def redditapi():
        # Importing necessary libraries
        import json
        import praw

        # --------- Reddit API Cred --------- #
        reddit = praw.Reddit(
            client_id="<Your id here>",
            client_secret="<Your secret token here>",
            user_agent="ua"
        )
        # ----------------------------------- #

        subreddits = ["turntables", "vinyl", "audiophile"]

        corpus = []  # List to store the data

        # ---------- Fetching the data ---------- #
        for sreddit in subreddits:
            # Change the value of limit to change the number of posts from the hot section
            for submissions in reddit.subreddit(sreddit).hot(limit=10):
                corpus.append(submissions.title)  # Adding post titles to the list
                post = reddit.submission(id=submissions.id)
                post.comments.replace_more(limit=None)
                for com in post.comments.list():
                    corpus.append(com.body)  # Adding comments to the list

        with open('data.json', "w", newline='') as json_file:
            json.dump({'Data': corpus}, json_file)

As you can see from the code block, I have created a class called RedditClass so that I can edit and add functionality in the future; you don't necessarily have to. Even if you just create a plain function inside this file, you can still import it into the file where it needs to run.

By using this code, the redditapi function saves the post titles and comments from the chosen subreddits as a JSON file. We will then use this corpus in the next file, sentiment.py, by loading the data back from the JSON file.
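For example, since redditapi is a static method, using it from another script takes nothing more than an import; main.py does essentially this later on:

from reddit import RedditClass

RedditClass.redditapi()  # fetches the posts and comments and writes data.json to the working directory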

Sentiment Analysis with RoBERTa

The sentiment analysis is performed using RoBERTa, a model originally developed by Facebook AI; specifically, we use the cardiffnlp/twitter-roberta-base-sentiment checkpoint available here. The sentiment.py file contains methods for sentiment analysis, data cleaning, and model training.

We run each sentence through a pre-trained model that we can find on Hugging Face.

This particular model has been trained on 124M tweets and returns three labels: 0 -> Negative, 1 -> Neutral, and 2 -> Positive.
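If you want to sanity-check the model on its own before wiring it into the project, a minimal standalone call (the example sentence below is just an illustration) mirrors the sentiment method defined later in sentiment.py:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

tokens = tokenizer.encode("This turntable sounds amazing", return_tensors='pt')
print(int(torch.argmax(model(tokens).logits)))  # expected to print 2, i.e. Positive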

# Importing all the necessary dependencies in sentiment.py
import json
import re
import pandas as pd
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
import torch

I have created multiple methods to execute specific functionalities like data cleansing, sentiment analysis, and model training. The methods sentiment, run_sentiments, data_cleaning, and run_model are all encapsulated within the sentiment.py file.

class Sentiment:
    def __init__(self):
        self.text = "Sentiment Analysis"
        self.lemmatizer = WordNetLemmatizer()

        self.tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
        self.model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')

self.tokenizer and self.model are used to load the RoBERTa model from Hugging Face. They allow the model to assign sentiment values to the individual sentences extracted from Reddit.

Gathering the sentiment

    def sentiment(self, sent):
        """
        This function is used to pass through a sentence and get the sentiment
        :param sent: Input Sentence
        :return: Returns the position of the highest probable value which indicates the sentiment.
        0 = Negative
        1 = Neutral
        2 = Positive
        """
        tokens = self.tokenizer.encode(sent, return_tensors='pt')
        result = self.model(tokens)
        return int(torch.argmax(result.logits))

The sentiment method extracts the sentiment of a given sentence using the RoBERTa model. The sentence, represented by the variable sent, is encoded so that it can be passed through the model. return_tensors is set to 'pt' so that we get the output as a PyTorch tensor.

Upon receiving the model’s results, argmax() is used to identify the position of the highest value in the logits tensor.

Invoking argmax() returns an index of 0, 1, or 2, whichever position holds the highest value in the tensor: 0 being negative, 1 being neutral, and 2 being positive.
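As a hypothetical illustration (the numbers below are made up):

import torch

logits = torch.tensor([[-1.3, 0.2, 2.1]])  # scores for [negative, neutral, positive]
print(int(torch.argmax(logits)))           # 2 -> the highest score sits at index 2, i.e. Positive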

Assigning sentiments

    def run_sentiments(self):
        """
        This function is used to get the sentiment of the entire corpus
        :return: Saves the sentences and their respective sentiments as a json file
        """
        # ------------ Loading Data --------------

        with open('data.json', 'r') as json_file:
            corpus = json.load(json_file)['Data']

        # ----------------------------------------

        sentences = [x for para in corpus for x in sent_tokenize(para)]
        # We drop sentences above 512 characters since the model accepts at most 512 tokens
        filtered_sentences = [x for x in sentences if len(x) < 512]

        results = {'Sentences': [], 'Sentiment': []}
        for s in filtered_sentences:
            results['Sentences'].append(s)
            results['Sentiment'].append(self.sentiment(s))

        # One JSON record per line so run_model can load it with pd.read_json(lines=True)
        pd.DataFrame(results).to_json('sentiment_data.json', orient='records', lines=True)

Initially, we load the JSON file that we saved during the data acquisition step with the Reddit API. We then tokenize the text into sentences and filter out any sentence longer than 512 characters, because the model has a hard limit on input length.

We then iterate over the filtered sentences, passing each one through the sentiment() method. The outcomes are stored in a dictionary named results, pairing each sentence with its associated sentiment.

After that, we create a pandas DataFrame from the results dictionary, holding each sentence and its corresponding sentiment, and save it as a JSON file called sentiment_data.json.
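Strictly speaking, that limit is 512 tokens rather than 512 characters, so the character filter is just a simple, conservative proxy. If you would rather keep long sentences instead of discarding them, one alternative (a sketch, not what this project does) is to let the tokenizer truncate the input to the model's maximum length:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('cardiffnlp/twitter-roberta-base-sentiment')


def sentiment_truncated(sent):
    # truncation=True caps the input at max_length tokens instead of dropping the sentence
    tokens = tokenizer.encode(sent, return_tensors='pt', truncation=True, max_length=512)
    return int(torch.argmax(model(tokens).logits))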

Data Cleaning

    def data_cleaning(self, sent):
        """
        This function cleans sentences with the help of regex. URLs, user mentions and hashtags are removed.
        :param sent: Sentence to be cleaned
        :return: Returns cleaned sentence after Lemmatization
        """
        texts = sent.lower()
        texts = re.sub(r'http\S+', '', texts)     # Remove URLs
        texts = re.sub(r'@[a-z0-9]+', '', texts)  # Remove user mentions
        texts = re.sub(r'#[a-z0-9]+', '', texts)  # Remove hashtags
        texts = re.sub(r'[^a-z\s]', ' ', texts)   # Remove non-alphabetic characters
        words = word_tokenize(texts)
        words = [word for word in words if word not in stopwords.words('english')]  # Remove stopwords
        words = [self.lemmatizer.lemmatize(word) for word in words]  # Lemmatization
        return ' '.join(words)

The data_cleaning method performs the data cleansing, eliminating URLs, user mentions, hashtags, special characters, and stopwords.

Each sentence is broken into words, every word is lemmatized to its root form, and the words are then joined back into a sentence. The rationale behind lemmatization is to let the model weigh each term consistently within its context, regardless of the word's inflected form.

Finally, the method returns the cleaned sentence.
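For instance, a quick call like the one below (which assumes the NLTK punkt, stopwords, and wordnet data have been downloaded, and whose exact output may vary with your NLTK version) shows the effect of the cleaning step:

from sentiment import Sentiment

s = Sentiment()  # note: constructing the object also downloads/loads the RoBERTa model
print(s.data_cleaning("Check out my new turntables at https://example.com #vinyl"))
# -> approximately: "check new turntable"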

Model Training

    def run_model(self):
        """
        This function is used to train a Naive Bayes model for sentiment analysis based on the saved sentiment data
        :return: Returns the model, vectorizer used, and the accuracy of the train and test sets.
        """
        model_data = pd.read_json('sentiment_data.json', lines=True)
        model_data['stemmed_data'] = model_data['Sentences'].apply(self.data_cleaning)

        x = model_data['stemmed_data'].values
        y = model_data['Sentiment'].values

        X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

        # Converting textual data into numerical data
        vectorizer = TfidfVectorizer()
        X_train = vectorizer.fit_transform(X_train)
        X_test = vectorizer.transform(X_test)

        # Creating the Naive Bayes model
        nb_model = MultinomialNB()
        nb_model.fit(X_train, y_train)

        # Getting model accuracy
        X_train_pred = nb_model.predict(X_train)
        accuracy = accuracy_score(y_train, X_train_pred)

        X_test_pred = nb_model.predict(X_test)
        accuracy_scr = accuracy_score(y_test, X_test_pred)

        return nb_model, vectorizer, accuracy, accuracy_scr

To execute the model, the initial step involves importing the data into a structured dataframe from the saved sentiment_data.json file. Subsequently, we apply the data cleansing procedure to each sentence in the dataframe.

Following data preparation, the variables x and y are created to gather the feature and target values. The stemmed_data column, where we have stored all our cleaned sentences, is assigned to the feature variable x. Similarly, the annotated sentiments from the Sentiment column are assigned to the target variable y. We then perform a train-test split, allocating 20% of the data to the test variables so that we can assess the accuracy of our model.

To facilitate a nuanced understanding of term relevance within each document/sentence, a TfidfVectorizer() is created. For an in-depth explanation of the Tf-Idf vectorization process, refer to the comprehensive explanation provided by Mukesh Chaudhary here.
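As a toy illustration of what the vectorizer produces (the two example sentences are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["great sounding turntable", "terrible needle terrible sound"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # the vocabulary learned from the corpus
print(X.shape)                             # (2 documents, number of unique terms)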

The training data is then fitted to the vectorizer, paving the way to create the Naive Bayes model, denoted nb_model. The model is trained using the X_train and y_train variables, which hold the feature and target values.

After training, we evaluate the model's accuracy on both sets to detect potential overfitting, so that we can apply regularization measures if necessary.

In conclusion, the method returns the resultant model, vectorizer and accuracy metrics for both test and training datasets.

Interface and main.py

The user interface is created using the customtkinter library. The main.py file orchestrates the functionalities, allowing the user to choose actions like fetching Reddit data, assigning sentiment, and training the model.

customtkinter interface

Although the interface was not designed with great aesthetics, it provides good functionality. :p

Each checkbox is wired to its corresponding functionality, which is executed when the Run button is pressed. The Get Sentiment button retrieves a sentiment from our trained model, eliminating the need to repeat the training and cleaning process.

Now for the main.py file:

# First we import all the dependencies

import pickle
import customtkinter
from reddit import RedditClass
from sentiment import Sentiment

sentiment = Sentiment() # Initializing an object from the Sentiment class

# Setting up Custom Tkinter

customtkinter.set_appearance_mode("System")  # The appearance switches between dark and light mode depending on the system settings
customtkinter.set_default_color_theme('green') # Color theme of buttons and checkboxes

root = customtkinter.CTk()
root.geometry("500x350")

frame = customtkinter.CTkFrame(master=root)
frame.pack(pady=20, padx=60, fill="both", expand=True)

I’ve created a frame using customtkinter, with a focus on clarity and organization, that defines the workspace.

def button_press():  # Defining the functionality for the run button
    customtkinter.CTkLabel(master=frame, text=var1.get()).pack(pady=12, padx=10)
    if var1.get() == 1:
        reddit = RedditClass()
        reddit.redditapi()
        customtkinter.CTkLabel(frame, text="Reddit data has been run").pack()
    if var2.get() == 1:
        sentiment.run_sentiments()
        customtkinter.CTkLabel(frame, text="Sentiment has been assigned to the data").pack()
    if var3.get() == 1:
        model, vectorizer, train_acc, test_acc = sentiment.run_model()

        def save_model():
            pickle.dump(model, open('SentimentModel.pkl', 'wb'))
            pickle.dump(vectorizer, open('Vectorizer.pkl', 'wb'))
            customtkinter.CTkLabel(root, text="Model has been saved").pack()
            custom_dialog.destroy()

        customtkinter.CTkLabel(frame, text="Model has been trained").pack()
        custom_dialog = customtkinter.CTkToplevel(root)
        custom_dialog.title('Save Model')
        customtkinter.CTkLabel(custom_dialog, text=f"Training data accuracy: {train_acc}").pack()
        customtkinter.CTkLabel(custom_dialog, text=f"Test data accuracy: {test_acc}").pack()
        customtkinter.CTkLabel(custom_dialog, text="Do you want to save the model?").pack()
        customtkinter.CTkButton(custom_dialog, text="Yes", command=save_model).pack()
        customtkinter.CTkButton(custom_dialog, text="No", command=custom_dialog.destroy).pack()


var1 = customtkinter.IntVar()
var2 = customtkinter.IntVar()
var3 = customtkinter.IntVar()

customtkinter.CTkCheckBox(frame, text="Get Reddit Data", variable=var1).pack(pady=5, padx=10)
customtkinter.CTkCheckBox(frame, text="Get Sentiment Data", variable=var2).pack(pady=5, padx=10)
customtkinter.CTkCheckBox(frame, text="Train the model", variable=var3).pack(pady=5, padx=10)

customtkinter.CTkButton(frame, text="Run", command=button_press).pack(pady=6, padx=10)

The button_press() function performs the retrieval of Reddit data, the sentiment assignment, and the model training, depending on which checkboxes are selected.

The checkbox states are represented by the values 0 or 1, which are stored in the variables var1, var2, and var3. If statements then check these values to govern which of the sentiment.py methods are executed.

A dialogue box also offers insights into the training and testing accuracy when the user opts to train the model. Additionally, users are prompted to save the model if needed, through “Yes” and “No” options, enhancing user engagement and data management practices.

The provision for retraining and saving the model is included deliberately, because its accuracy can vary: data sourced from Reddit is incredibly diverse, reflecting the unpredictable nature of user-generated content, so it is prudent to take a discerning approach. The capacity to retrain and save the model allows adaptive improvement in response to that inherent variability.

e = customtkinter.CTkEntry(frame, width=200)
e.pack()
e.insert(0, "Enter your sentence here:")


def button_press_2():
    print(e.get())
    model = pickle.load(open('SentimentModel.pkl', 'rb'))
    vectorizer = pickle.load(open('Vectorizer.pkl', "rb"))

    inp = e.get()
    print('Data Received')
    res = sentiment.data_cleaning(inp)
    print('Data Cleaned')
    document = vectorizer.transform([res])
    prediction = model.predict(document)
    print('Prediction Made')
    match int(prediction[0]):
        case 0:
            customtkinter.CTkLabel(frame, text="Negative Sentiment").pack()
        case 1:
            customtkinter.CTkLabel(frame, text="Neutral Sentiment").pack()
        case 2:
            customtkinter.CTkLabel(frame, text="Positive Sentiment").pack()
    print(int(prediction[0]))


customtkinter.CTkButton(frame, text="Get Sentiment", command=button_press_2).pack(pady=6, padx=10)

root.mainloop()

Finally, an input textbox has been created to receive sentences, and its functionality is activated by button_press_2(). This function loads the pre-trained model and the associated vectorizer. The provided sentence is then cleaned (tokenized and lemmatized) and transformed with the vectorizer. The model classifies the transformed data, and the outcome is represented by the values 0, 1, and 2.

A match-case structure has been established, producing a labeled output denoting the class to which the text belongs. This is triggered by pressing the Get Sentiment button.

This structured approach enhances the readability, maintainability, and scalability of the sentiment analysis project. The combination of the RoBERTa model, the Reddit API, and a modularized code structure makes for an efficient and effective sentiment analysis workflow.
