First GOP Presidential Debate Sentiment Analysis with Various Models

Dana Fatadilla Rabba
4 min read · Aug 17, 2022


Public sentiment during an election is very important. By analyzing the sentiment expressed in the community, political parties and candidates can see public trends and use them as a basis for deciding their next steps.

In this article, we will discuss sentiment classification based on people’s tweets on the internet. We will cover NLP, specifically text processing: pre-processing, stemming, lemmatization, tokenization, text vectorization with TF-IDF, and some deep learning models for text classification.

Understanding the Dataset

The data is about the first 2016 GOP presidential debate in Ohio and can be downloaded here. The dataset consists of 13,871 rows × 21 columns. In this discussion, only two columns are used: text and sentiment. The text column contains related tweets from people on Twitter, while the sentiment column is the label for the text column. There are three sentiment classes: positive, neutral, and negative. There are some duplicate rows, so it is better to delete them, leaving 10,567 rows.
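As a rough sketch, loading the data, keeping the two relevant columns, and dropping duplicates could look like this (the file name Sentiment.csv is an assumption and may differ depending on where you downloaded the dataset):

import pandas as pd

# Load the debate tweets; the file name is assumed and may differ for your download.
df = pd.read_csv("Sentiment.csv")

# Keep only the two columns used in this article.
df = df[["text", "sentiment"]]

# Remove duplicate rows so only unique tweets remain.
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)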

Text Pre-Processing

First of all, we have to process the text. The most common first steps are to lowercase each text and replace non-alphabetic characters with spaces.

import re

df.text = df.text.apply(lambda x: str(x).lower())

def strip_html(raw_text):
    # Replace every non-alphabetic character with a space
    clean_text = re.sub(pattern='[^a-zA-Z]', repl=' ', string=raw_text)
    return clean_text

df.text = df.text.apply(lambda x: strip_html(x))

The next step is tokenization. Tokenization breaks the raw text into small chunks called tokens, such as words or sentences. These tokens help in understanding the context and in developing an NLP model, since analyzing the sequence of words helps interpret the meaning of the text. Tokenization is easy to perform in Python using an available library.

from nltk.tokenize import WordPunctTokenizer

wpTokenizer = WordPunctTokenizer()
df["review_tokenized"] = [wpTokenizer.tokenize(text) for text in df["text"]]

The next step is to remove the stop words from each text. After that, we can perform stemming, lemmatization, or both. In this article, I only use lemmatization, as sketched below.
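The article does not show this step’s code; a minimal sketch with NLTK, assuming English stop words and the column names used later, could look like this:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_tokens(tokens):
    # Drop stop words, lemmatize the remaining tokens, and join them back
    # into a single string so TfidfVectorizer can consume the column later.
    kept = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
    return " ".join(kept)

df["review_tokenized_cleaned"] = df["review_tokenized"].apply(clean_tokens)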

Then we split the data into training and test sets of the specified sizes.

from sklearn import model_selection

train_X, test_X, train_y, test_y = model_selection.train_test_split(
    df['review_tokenized_cleaned'], df['sentiment'],
    test_size=0.2, random_state=1)

Before being fed into the model, the sentiment column is first converted to numbers using a label encoder.

from sklearn.preprocessing import LabelEncoder

label_enc = LabelEncoder()
train_y = label_enc.fit_transform(train_y)
test_y = label_enc.transform(test_y)

In addition, the tokenized text needs to be converted into vectors so that it can be processed by the model. One widely used method is TF-IDF vectorization, which can be performed with TfidfVectorizer from scikit-learn.

from sklearn.feature_extraction.text import TfidfVectorizer

# The vectorizer expects strings, so the cleaned column holds the tokens joined back together.
tfidf_vect = TfidfVectorizer(max_features=10000)
tfidf_vect.fit(df.review_tokenized_cleaned)
train_X_tfidf = tfidf_vect.transform(train_X)
test_X_tfidf = tfidf_vect.transform(test_X)

ML Model Building

After the text is processed, it is fed into machine learning models to be trained. In this article, I tried four different ML models: SVM, Logistic Regression, XGBoost, and Random Forest. When evaluated on the test data, these models obtained accuracies of 62%, 62%, 59%, and 61%, respectively.
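The training code is not shown above; as an illustration, fitting and evaluating one of these models (logistic regression here) on the TF-IDF features could look like the following sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression classifier on the TF-IDF training features.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(train_X_tfidf, train_y)

# Evaluate on the held-out test set.
pred_y = log_reg.predict(test_X_tfidf)
print("Accuracy:", accuracy_score(test_y, pred_y))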

Deep Learning Model Building

To create a deep learning model, you need to convert the labels produced by the label encoder into a one-hot encoding.

import tensorflow as tf

train_cat_y = tf.keras.utils.to_categorical(train_y)
test_cat_y = tf.keras.utils.to_categorical(test_y)

Then you can simply define a deep learning model to perform the classification task as usual. In this article, I tried a simple model as follows.

dl = tf.keras.Sequential([
    tf.keras.layers.Dense(32, input_dim=10000, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(3, activation='softmax')
])
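The compile-and-train step is not shown in the article; a plausible sketch (the optimizer, epochs, and batch size here are assumptions, not the author’s exact settings) is:

dl.compile(optimizer='adam',
           loss='categorical_crossentropy',
           metrics=['accuracy'])

# Dense layers expect dense input, so the sparse TF-IDF matrices are converted to arrays.
dl.fit(train_X_tfidf.toarray(), train_cat_y,
       validation_split=0.1, epochs=10, batch_size=64)

loss, acc = dl.evaluate(test_X_tfidf.toarray(), test_cat_y)
print("Test accuracy:", acc)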

The model is trained, then evaluated, and reaches an accuracy of 61%. Not too bad for a simple deep learning model.

To improve model performance, you can try other text processing methods, for example in the embedding step. You can also try other text classification methods such as LSTM and BERT, or use a pre-trained model. For other methods you can use, check out my Google Colab or GitHub.
