Basic Sentiment Analysis with Machine Learning using Twitter Data

Muhammad Ihsan
Jul 5, 2024 · 9 min read

One of the most popular applications of machine learning is sentiment analysis. With sentiment analysis, we can determine whether a piece of text, such as a tweet, product review, or comment, expresses a positive or negative sentiment. This is very useful for companies and government agencies that want to understand public opinion about the services or products they offer: brand monitoring becomes more effective, which ultimately leads to better customer satisfaction.

In this article, we will walk through the steps of performing a simple sentiment analysis using the Sentiment140 dataset. We will use three machine learning models: Naive Bayes, Support Vector Machine (SVM), and Logistic Regression. We won’t delve into the details of how each algorithm works. Instead, this article will focus on how to prepare the data, train the model, and evaluate the model for sentiment analysis.

Preparing the Environment

First, make sure to install the necessary packages to follow this article:

pip install pandas nltk scikit-learn joblib

Then, download the stopwords from NLTK:

import nltk
nltk.download('stopwords')

Loading and Preparing Data

import pandas as pd

# Load the dataset
data = pd.read_csv('data/sentiment140.csv', encoding='latin-1', header=None)
data.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
# Drop unnecessary columns
data = data.drop(columns=['ids', 'date', 'flag', 'user'])
# Convert target to binary (0: Negative, 1: Positive)
data['target'] = data['target'].apply(lambda x: 1 if x == 4 else 0)
# Create a balanced subset by sampling 5% of each class
positive_samples = data[data['target'] == 1].sample(frac=0.05, random_state=42)
negative_samples = data[data['target'] == 0].sample(frac=0.05, random_state=42)
five_percent_data = pd.concat([positive_samples, negative_samples])

The first step is to load the dataset. In this case, we use the Sentiment140 dataset, which contains tweets labeled with positive and negative sentiment. The dataset is loaded with the pandas library via pd.read_csv, where we set encoding='latin-1' so that non-ASCII characters are handled correctly. The file has no header row by default, so we assign column names manually to make subsequent processing easier. The columns 'target', 'ids', 'date', 'flag', 'user', 'text' follow the Sentiment140 documentation. We then drop the unnecessary columns 'ids', 'date', 'flag', and 'user', leaving only 'target' and 'text'. The 'target' column contains the sentiment label to be predicted, while the 'text' column contains the tweet text used for training.

The next step is to convert the target column. The dataset documentation states that tweets with positive sentiment are labeled 4, neutral tweets 2, and negative tweets 0. However, after exploring the data, we find that only the labels 0 and 4 actually appear (there are no neutral 2s).

Therefore, to simplify, we convert the 'target' column, mapping 4 to 1 and everything else to 0. To speed up the process, we work with only a 5% subset of the original data: we draw a 5% random sample from the positive tweets and a 5% random sample from the negative tweets using the sample function with frac=0.05 and random_state=42 to ensure reproducibility. Because both classes are sampled at the same rate, the subset stays balanced, and the smaller size makes it feasible to train the models on a machine with modest specifications.
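
As a quick sanity check, value_counts() makes both points easy to verify; a small sketch using the variables defined above (run the first line before the 0/1 conversion to see the raw labels):

print(data['target'].value_counts())               # the raw file contains only 0 and 4
print(five_percent_data['target'].value_counts())  # the subset should be roughly 50/50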

Preprocessing Text Data

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_text(text):
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and other non-letter characters
    text = text.lower()                      # Convert to lowercase
    words = text.split()                     # Split text into words
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # Build the stopword set once, not per word
    words = [ps.stem(word) for word in words if word not in stop_words]  # Remove stopwords and stem
    return ' '.join(words)

five_percent_data['text'] = five_percent_data['text'].apply(preprocess_text)

To make the text ready for use in a machine learning model, we need to perform several preprocessing steps. In this block of code, the preprocess_text function is used to clean and process the text. First, we remove URLs from the text using the re.sub(r'http\S+', '', text) function, which searches for and removes all URL patterns. The next step is to remove all punctuation and non-alphabet characters using re.sub(r'[^a-zA-Z\s]', '', text), leaving only letters and spaces. Then, the text is converted to lowercase with text.lower(), and split into individual words with text.split().

After splitting the text into words, we use the Porter stemmer from the NLTK library to perform stemming, the process of reducing words to their base form. Before stemming, we also remove stopwords (common words that appear frequently but contribute little to meaning, such as "and", "the", "is", etc.) using NLTK's list of English stopwords. Each word that is not a stopword is stemmed with ps.stem(word), and the stemmed words are joined back into a single string with ' '.join(words). The preprocess_text function is then applied to the 'text' column of five_percent_data using apply(preprocess_text), so every text in the dataset is cleaned before being fed to the machine learning models.
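
To see the function end to end, here is a small illustrative call (the input sentence is made up for demonstration; the output shown is what the Porter algorithm produces for these words):

sample = "I loved the new update! http://example.com #happy"
print(preprocess_text(sample))
# -> 'love new updat happi'  (URL and punctuation removed, stopwords dropped, words stemmed)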

Splitting the Dataset and Vectorizing Text Data

The next step is to split the data into training and testing sets, then vectorize the text using TF-IDF:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = five_percent_data['text']
y = five_percent_data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
vectorizer = TfidfVectorizer(max_features=5000)
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)
# Save the vectorizer for later reuse
import joblib
import os
os.makedirs('model', exist_ok=True)  # Make sure the output directory exists
joblib.dump(vectorizer, 'model/tfidf_vectorizer.pkl')

The next step is to convert raw text into a numerical form that can be processed by the algorithm. First, we need to separate the data into features (X) and targets (y) from the dataset. The features, which are the text, are stored in X, while the target or labels (positive or negative) are stored in y. Next, we split the dataset into training and testing data, with 20% of the data set aside for testing (test_size=0.2) and the rest for training. We also ensure that the data splitting is random but consistent with the seed random_state=42.
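
One optional refinement: since the subset was deliberately balanced, passing stratify=y preserves that balance in both splits; a variation on the call above (not what the original code does):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # keep the class ratio in both splits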

To convert the text into a numerical form, we use the TfidfVectorizer from scikit-learn. This vectorizer transforms the text into a vector representation based on the frequency of words and their importance in the document (TF-IDF). We limit the maximum number of features to 5000 to maintain computational efficiency. The TfidfVectorizer is first fitted to the training data X_train and transforms the training text into a vector matrix X_train_vect. Next, the trained vectorizer is used to transform the testing text X_test into a vector matrix X_test_vect. After that, we save the trained vectorizer model using joblib.dump so it can be reused later without retraining, and save it in the file model/tfidf_vectorizer.pkl.
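
If you want to peek at what the vectorizer learned, the fitted object exposes the matrix shape and its vocabulary; a quick sketch (get_feature_names_out is available in scikit-learn 1.0 and later):

print(X_train_vect.shape)                       # (number of training tweets, 5000)
print(vectorizer.get_feature_names_out()[:10])  # first few terms in the learned vocabulary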

Training and Saving Models

We will train three models: Naive Bayes, SVM, and Logistic Regression. After training the models, we will save each model for later use:

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Train and save Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train_vect, y_train)
joblib.dump(nb_model, 'model/naive_bayes_model.pkl')
# Train and save SVM model
svm_model = SVC(probability=True)
svm_model.fit(X_train_vect, y_train)
joblib.dump(svm_model, 'model/svm_model.pkl')
# Train and save Logistic Regression model
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vect, y_train)
joblib.dump(lr_model, 'model/logistic_regression_model.pkl')

After converting the text into numerical representations, the next step is to train several machine learning models for sentiment analysis. Here we use three different models: Naive Bayes, SVM (Support Vector Machine), and Logistic Regression.

First, we train the Naive Bayes model using MultinomialNB, which is a variant of Naive Bayes suitable for text data. This model is trained with the vectorized training data, and then the trained model is saved in the file naive_bayes_model.pkl using joblib.dump. Next, we train the SVM model using SVC, which also takes the parameter probability=True to allow probability estimation. After training, the trained SVM model is saved in the file svm_model.pkl. Finally, we train the Logistic Regression model using LogisticRegression with a maximum iteration limit of 1000 to ensure convergence. After training, the trained Logistic Regression model is saved in the file logistic_regression_model.pkl. By saving these models, we can easily reload and use them for prediction without retraining, saving time and computational resources.
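
As a side note, scikit-learn's Pipeline can bundle the vectorizer and a classifier into a single object, so only one file needs to be saved and loaded. A minimal sketch of this alternative (the filename nb_pipeline.pkl is just an example):

from sklearn.pipeline import Pipeline

nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=5000)),
    ('clf', MultinomialNB()),
])
nb_pipeline.fit(X_train, y_train)  # note: the pipeline takes raw text, not vectors
joblib.dump(nb_pipeline, 'model/nb_pipeline.pkl')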

Evaluating Models

from sklearn.metrics import accuracy_score, classification_report

y_pred_nb = nb_model.predict(X_test_vect)
y_pred_svm = svm_model.predict(X_test_vect)
y_pred_lr = lr_model.predict(X_test_vect)
print("Naive Bayes Model")
print("Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Classification Report:\n", classification_report(y_test, y_pred_nb))
print("SVM Model")
print("Accuracy:", accuracy_score(y_test, y_pred_svm))
print("Classification Report:\n", classification_report(y_test, y_pred_svm))
print("Logistic Regression Model")
print("Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Classification Report:\n", classification_report(y_test, y_pred_lr))

To evaluate the performance of the created models, we make predictions using each model: Naive Bayes, SVM, and Logistic Regression on the test data that has been vectorized (X_test_vect). These predictions are then compared with the actual labels (y_test) to calculate accuracy and generate a classification report.

For each model, this code prints the accuracy and the classification report, which contains important metrics such as precision, recall, and F1-score. Accuracy is calculated with the accuracy_score function from sklearn.metrics and gives the fraction of correct predictions out of all predictions. The classification report is generated with the classification_report function and breaks down the model's performance per class (positive and negative) in terms of precision (the share of predicted members of a class that are correct), recall (the share of actual members of a class that are found), and F1-score (the harmonic mean of precision and recall). By comparing these metrics, we get a more complete picture of each model's performance on the sentiment analysis task.

Classification report
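
Beyond accuracy, a confusion matrix shows where each model goes wrong, i.e., how many positive tweets are mislabeled as negative and vice versa; a short addition using the predictions above:

from sklearn.metrics import confusion_matrix

# Rows are the true labels (0, 1); columns are the predicted labels (0, 1)
print("Naive Bayes:\n", confusion_matrix(y_test, y_pred_nb))
print("SVM:\n", confusion_matrix(y_test, y_pred_svm))
print("Logistic Regression:\n", confusion_matrix(y_test, y_pred_lr))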

Making Predictions on New Data

To test the models on new data, such as text we write ourselves, we can create a new file (for example, predict.py, the name used in the commands below) and add the following code:

import joblib
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import argparse

nltk.download('stopwords')
# Preprocess text data
def preprocess_text(text):
    text = re.sub(r'http\S+', '', text)      # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and other non-letter characters
    text = text.lower()                      # Convert to lowercase
    words = text.split()                     # Split text into words
    ps = PorterStemmer()
    stop_words = set(stopwords.words('english'))  # Build the stopword set once, not per word
    words = [ps.stem(word) for word in words if word not in stop_words]  # Remove stopwords and stem
    return ' '.join(words)
# Load vectorizer and models
vectorizer = joblib.load('model/tfidf_vectorizer.pkl')
nb_model = joblib.load('model/naive_bayes_model.pkl')
svm_model = joblib.load('model/svm_model.pkl')
lr_model = joblib.load('model/logistic_regression_model.pkl')
# Argument parser
parser = argparse.ArgumentParser(description='Predict sentiment of input text.')
parser.add_argument('-t','--text', type=str, help='Text to analyze sentiment')
args = parser.parse_args()
# Process input text
input_text = args.text
input_text_processed = preprocess_text(input_text)
input_text_vect = vectorizer.transform([input_text_processed])
# Predict using Naive Bayes
nb_prediction = nb_model.predict(input_text_vect)[0]
nb_prob = nb_model.predict_proba(input_text_vect)[0]
# Predict using SVM
svm_prediction = svm_model.predict(input_text_vect)[0]
# Predict using Logistic Regression
lr_prediction = lr_model.predict(input_text_vect)[0]
lr_prob = lr_model.predict_proba(input_text_vect)[0]
# Display predictions
print(f"Input Text: {input_text}")
print(f"Naive Bayes: {'Positive' if nb_prediction == 1 else 'Negative'} (Confidence: {nb_prob[nb_prediction]:.2f})")
print(f"SVM: {'Positive' if svm_prediction == 1 else 'Negative'}")
print(f"Logistic Regression: {'Positive' if lr_prediction == 1 else 'Negative'} (Confidence: {lr_prob[lr_prediction]:.2f})")

Here we define the preprocess_text function again to clean and prepare the input text. Next, we load the previously saved TF-IDF vectorizer and the Naive Bayes, SVM, and Logistic Regression models with joblib. Using argparse, the script accepts input text from the command line, which makes it flexible to use. The text provided by the user is processed with preprocess_text, transformed into a vector with the trained TF-IDF vectorizer, and then each model makes its prediction, which is displayed to the user. With this, we can predict sentiment for any text straight from the terminal, for example python predict.py -t "Manchester United is a great club".

And here is an example with negative sentiment: python predict.py -t "Manchester United is suck".
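
If you need to score many texts at once rather than one per invocation, the same loaded objects work on a list; a hypothetical helper (predict_batch is not part of the original script):

def predict_batch(texts, model=lr_model):
    # Preprocess and vectorize all texts, then predict in one call
    processed = [preprocess_text(t) for t in texts]
    vects = vectorizer.transform(processed)
    return ['Positive' if p == 1 else 'Negative' for p in model.predict(vects)]

print(predict_batch(["What a great match!", "Worst service ever."]))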

Conclusion

In this article, we have gone through various steps to perform sentiment analysis using the Sentiment140 dataset. Starting from loading and preparing the data, we preprocess the text to clean it before using it in a machine learning model. Next, we split the data into training and testing sets, and vectorize the text using TF-IDF. Three different models, namely Naive Bayes, SVM, and Logistic Regression, are trained and evaluated to see their performance in predicting text sentiment.

To improve model performance and achieve more accurate sentiment analysis, some suggestions to consider include exploring other models such as neural networks, performing hyperparameter tuning for each model, training on more data, and applying more advanced preprocessing techniques such as lemmatization (sketched below) and word embeddings. Additionally, regularly monitoring and updating the model with new data helps keep it relevant and accurate over time. Continuous exploration and experimentation are key to achieving the best results in machine learning. Happy learning!
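
For instance, swapping the Porter stemmer for NLTK's WordNet lemmatizer is a small change to preprocess_text (a sketch; it needs one extra download):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
# Inside preprocess_text, replace ps.stem(word) with:
# lemmatizer.lemmatize(word)  # e.g. 'updates' -> 'update'; keeps real dictionary words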

Source Code: Github

