Decoding Indonesia’s Election Buzz: A Sentiment Analysis Journey through Indonesian Presidential Debate 2024 YouTube Comments

Join me on a journey through the eyes of Gen Z on YouTube, as we analyze the live sentiment of the debate using top-notch tools like IndoBert, BigQuery, and Metabase. Get ready to uncover insights and gain a fresh perspective on the future of Indonesian politics!

Mohamad Nur Syahril Kaharu
Data Science Indonesia
12 min read · Apr 6, 2024


While many political consultant firms focus on Twitter and TikTok for sentiment analysis, we’re turning our attention to an often-overlooked source: YouTube comments. And not just any comments — we’re zeroing in on the most-watched program during the live debates, Musyawarah Debat Capres by Narasi TV, hosted by renowned journalist Najwa Shihab on her YouTube channel.

Why Narasi TV? Because it’s where the young generation, especially Gen Z, the demographic widely seen as decisive in this election, is tuning in and sharing their thoughts. By analyzing the comments on this channel, we’ll capture the sentiment towards each candidate during the debate and gain insights into the opinions of Indonesia’s future decision-makers.

We’ll walk you through the entire process, from collecting the comments using the YouTube API, to pre-processing and exploring the data, to inferring the sentiment results using the IndoBert model. And to make it even more exciting, we’ll be using BigQuery for data warehousing and Metabase for data visualization to bring the results to life.

So, buckle up and get ready to decode the election buzz with us! Let’s discover the power of YouTube comments as a source of live sentiment and uncover the sentiments of the younger generation towards the presidential candidates.

Getting Started

To run the sentiment analysis, we first need to install the required packages. All the source code is available in my GitHub repository, where you can find the complete code. We need several Python packages, including packages for scraping (googleapiclient), processing (pandas and spacy), and modeling (tensorflow and transformers). In my GitHub repository, I have listed the required packages in the requirements_nlp.txt file, so you can install them all by running this command:

pip install -r requirements_nlp.txt

Scraping the Dataset

The dataset we want to process comes from YouTube comments. To scrape the data, we use the YouTube Data API v3, which is available through the Google Cloud Platform. First, open the Google Cloud Console and click the navigation menu in the top left corner. Go to ‘APIs & Services’, select ‘Enabled APIs and services’, and click ‘Enable APIs and services’. Search for ‘YouTube Data API v3’ and click ‘Enable’. Next, go back to ‘APIs & Services’ in the navigation menu, choose ‘Credentials’, click ‘Create Credentials’, and select ‘API Key’. The API key is created automatically; copy it and use it to connect to the YouTube Data API.

Import all the required libraries.

import googleapiclient.discovery
import pandas as pd
import re
import string
import spacy
import seaborn as sns

Connect to the YouTube API.

api_service_name = "youtube"
api_version = "v3"
DEVELOPER_KEY = "YOUR_DEVELOPER_KEY"

youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=DEVELOPER_KEY)

Make an API request. Let’s say we want to get the comments for the first presidential debate. The link to the video is https://www.youtube.com/watch?v=gUz_MgdwKg0; the characters after v= are the video_id. Simply put the video_id (gUz_MgdwKg0) in the API request. Note that the API returns at most 100 comments per page; we’ll fetch the rest through pagination below.
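If you only have the full watch URL, a small helper (hypothetical, not part of the original code) can pull the video_id out of it:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical helper: extract the video_id from a standard
# YouTube watch URL of the form .../watch?v=<id>.
def extract_video_id(url):
    query = parse_qs(urlparse(url).query)
    return query['v'][0]

print(extract_video_id("https://www.youtube.com/watch?v=gUz_MgdwKg0"))  # gUz_MgdwKg0
```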

request = youtube.commentThreads().list(
    part="snippet",
    videoId="gUz_MgdwKg0",
    maxResults=100  # the API allows at most 100 results per page
)

Get the response of the request:

response = request.execute()

You will get a JSON response like this. Each element represents one comment on the video. For example, this JSON response represents the comment ‘Saya kesini lagi’.
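For orientation, one element of `response['items']` has roughly the following shape (the field values here are illustrative, not actual API output, but the nesting matches what the code below navigates):

```python
# Illustrative sketch of a single commentThreads item; values are made up.
item = {
    "snippet": {
        "isPublic": True,
        "topLevelComment": {
            "snippet": {
                "authorDisplayName": "example user",
                "publishedAt": "2023-12-12T13:00:00Z",
                "likeCount": 3,
                "textOriginal": "Saya kesini lagi"
            }
        }
    }
}

# The comment body sits two 'snippet' levels deep.
snippet = item['snippet']['topLevelComment']['snippet']
print(snippet['textOriginal'])  # Saya kesini lagi
```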

After making sure you got the right response, you can collect all the comments by storing them in a list and transforming it into a pandas dataframe. Here is the Python code to do that.

# Initialize the list that will hold all comments.
comments = []

# Get the comments from the first response page.
for item in response['items']:
    comment = item['snippet']['topLevelComment']['snippet']
    public = item['snippet']['isPublic']
    comments.append([
        comment['authorDisplayName'],
        comment['publishedAt'],
        comment['likeCount'],
        comment['textOriginal'],
        public
    ])

# Load the remaining pages until there is no nextPageToken.
while True:
    try:
        nextPageToken = response['nextPageToken']
    except KeyError:
        break
    # Create a new request object with the next page token.
    nextRequest = youtube.commentThreads().list(
        part="snippet",
        videoId="gUz_MgdwKg0",
        maxResults=100,
        pageToken=nextPageToken
    )
    # Execute the next request.
    response = nextRequest.execute()
    # Get the comments from the next response page.
    for item in response['items']:
        comment = item['snippet']['topLevelComment']['snippet']
        public = item['snippet']['isPublic']
        comments.append([
            comment['authorDisplayName'],
            comment['publishedAt'],
            comment['likeCount'],
            comment['textOriginal'],
            public
        ])

df1 = pd.DataFrame(comments, columns=['author', 'published_at', 'like_count', 'text', 'public'])
df1
df1

You will get this kind of dataframe output.

Because we need comments from all five debates, run the same method on each debate video, store each result in its own dataframe, and then merge them into one dataframe. You will get one combined dataframe after merging. Remember that you will also need a column flagging which debate each comment came from (1st debate, 2nd, and so on).
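The merge step might look like this minimal sketch, where `df1` and `df2` stand in for the per-debate dataframes (the dataframe contents here are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-ins for the per-debate comment dataframes.
df1 = pd.DataFrame({'text': ['komentar a', 'komentar b']})
df2 = pd.DataFrame({'text': ['komentar c']})

# Flag each dataframe with its debate number before merging.
df1['debate'] = 1
df2['debate'] = 2

# Stack them into a single dataframe with a fresh index.
df = pd.concat([df1, df2], ignore_index=True)
print(df)
```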

Data Preprocessing

Once we have obtained our complete dataset, we are ready to preprocess the data before feeding it into a model. Preprocessing the data involves removing unnecessary characters, stop words, and slang words.

To remove unnecessary characters such as numbers and whitespaces, we will use basic string manipulation techniques in Python.

def text_preprocessing(text):
    out1 = text.lower()                                               # lowercase
    out2 = re.sub(r"\d+", "", out1)                                   # remove all numbers
    out3 = out2.translate(str.maketrans("", "", string.punctuation))  # strip punctuation
    out4 = out3.strip()                                               # trim leading/trailing whitespace
    out5 = re.sub(r'\s+', ' ', out4)                                  # collapse runs of whitespace
    return out5

To remove stop words, we will use the spacy library. First, we tokenize the text.

# Create a blank Indonesian spaCy pipeline.
nlp = spacy.blank('id')

# Tokenize the text and drop stop words.
def tokenize_text(text):
    doc = nlp(text)
    return [token.text for token in doc if not token.is_stop]

# Apply to the dataframe.
df['tokens'] = df['text'].apply(tokenize_text)

To handle slang words, we need a helper file, colloquial-indonesian-lexicon.csv, which contains Indonesian slang words and their formal forms. We will use this Python function to replace each slang word in the dataset with its formal form.

# Replace slang words with their formal forms.
def replace_slang_word(words):
    for index in range(len(words)):
        index_slang = slang_words.slang == words[index]
        formal = list(set(slang_words[index_slang].formal))
        if len(formal) == 1:
            words[index] = formal[0]
    return words
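In the article, `slang_words` is loaded from colloquial-indonesian-lexicon.csv (e.g. `slang_words = pd.read_csv('colloquial-indonesian-lexicon.csv')`). Here is a self-contained sketch using a tiny in-memory stand-in with the same `slang`/`formal` columns the function expects; the example entries are illustrative:

```python
import pandas as pd

# Tiny stand-in for the colloquial lexicon (illustrative entries only).
slang_words = pd.DataFrame({
    'slang':  ['gak', 'bgt'],
    'formal': ['tidak', 'banget']
})

# Replace slang words with their formal forms, in place.
def replace_slang_word(words):
    for index in range(len(words)):
        index_slang = slang_words.slang == words[index]
        formal = list(set(slang_words[index_slang].formal))
        if len(formal) == 1:
            words[index] = formal[0]
    return words

print(replace_slang_word(['gak', 'seru', 'bgt']))  # ['tidak', 'seru', 'banget']
```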

The final form of the dataset before we put it into a model should be like the following output.

Sentiment Analysis

We will perform sentiment analysis using a pre-built model called the Indonesian RoBERTa Base model. This model is claimed to be quite good for Indonesian language sentiment analysis, with an accuracy of 93.2% and an F1-macro score of 91.02%.

To conduct inference using this model, we will use the Transformers library (alternatively, you may use the PyTorch library directly). Then we create an object called nlp that wraps the model in a sentiment-analysis pipeline.

from transformers import pipeline

pretrained_name = "w11wo/indonesian-roberta-base-sentiment-classifier"

# Note: this reuses the name nlp, shadowing the spaCy pipeline created earlier.
nlp = pipeline(
    "sentiment-analysis",
    model=pretrained_name,
    tokenizer=pretrained_name
)

You can try the pipeline by inputting a text, for instance, ‘Anda Goblok!’. You will receive the results of the sentiment analysis along with its score, as shown below.

Now that we have preprocessed our dataset, we are ready to perform sentiment analysis on each comment using the model. However, keep in mind that the length of the model’s input is limited. Therefore, we may need to truncate some of the longer comments to a maximum of 500 characters.

try:
    df['sentiment'] = df['text_fin'].apply(
        lambda x: nlp(x[0:500], padding='longest')[0]['label'])
except Exception as e:
    print("Error:", e)

You should see the following output.

Once we have the complete output of the sentiments of each comment, store it in a CSV file.

df.to_csv('youtube_comments_debat_capres_sentiments.csv', index=False)

Uploading to Data Warehouse

This step may not be necessary for everyone, as we have already performed sentiment analysis and you can present the results in any form or tool that you prefer or are familiar with. However, in most industrial applications, this type of analysis typically requires a data warehouse to store the output table.

For this purpose, we will use BigQuery as our data warehouse, a common choice in data analytics thanks to its serverless architecture. To use BigQuery, open the Google Cloud Console and select BigQuery from the navigation menu. First create a dataset, then upload the CSV file into the dataset as a table. I won’t go into detail on operating BigQuery here; other sources explain it well. In brief, there are two ways to upload your dataset into BigQuery: manually through the UI console, or with a Python script. Both work perfectly, but I prefer the second as it is more reliable and customizable. I have provided the Python script in my GitHub repository for your reference.
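A script-based upload might look like the following sketch using the google-cloud-bigquery client (this is not the script from the repository; the project, dataset, and table names are hypothetical, and the final call is commented out since it requires GCP credentials):

```python
def make_table_id(project, dataset, table):
    # BigQuery fully-qualified table id: project.dataset.table
    return f"{project}.{dataset}.{table}"

def upload_csv_to_bigquery(csv_path, table_id):
    # Imported inside the function so the helper above works without the package.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # skip the CSV header row
        autodetect=True,      # let BigQuery infer the schema
    )
    with open(csv_path, "rb") as f:
        job = client.load_table_from_file(f, table_id, job_config=job_config)
    job.result()  # wait for the load job to finish

# Hypothetical names for illustration.
table_id = make_table_id("my-project", "debat_capres", "youtube_comments_sentiments")
# upload_csv_to_bigquery('youtube_comments_debat_capres_sentiments.csv', table_id)
```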

Connecting to Data Visualization Tool

Now that your dataset has been uploaded to BigQuery, you can connect it to your preferred data visualization tool. In this example, I will use Metabase. To establish the connection, you’ll need a service account JSON file. This file authenticates your connection to BigQuery. Refer to the Google Cloud Platform documentation for instructions on creating a service account and downloading the JSON key file.

Here’s an example of how to establish a connection between your local Metabase instance and BigQuery:

Once you have successfully established the connection, you can explore all of your datasets in BigQuery through Metabase, as shown in the example below.

Analyzing the Result

Now that we have set up all the necessary tools and data, we can move on to the exciting part: visualizing and analyzing the results. In line with our objective for this article, we will use the sentiment analysis results to determine which candidate is more attractive to millennials and Gen Z in the YouTube platform during the debate.

This is the dashboard outlook in Metabase.

We have applied filters for ‘nth debate’ and ‘candidate’ to the dashboard, which allow us to focus on a specific debate and set of candidates.

Before delving into the details, let’s take a look at the overall sentiment of the comments, regardless of which candidate they are addressing.

We can also identify the most frequently occurring words in the comments to gain insight into which topics are being discussed the most.
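In my setup the word counts are computed in the warehouse and charted in Metabase, but the same idea can be sketched in Python from the tokenized comments (the token lists below are a made-up stand-in for `df['tokens']`):

```python
from collections import Counter

# Hypothetical stand-in for the per-comment token lists built in preprocessing.
token_lists = [['prabowo', 'gibran'], ['anies', 'prabowo'], ['prabowo']]

# Tally word frequencies across all comments.
counts = Counter()
for tokens in token_lists:
    counts.update(tokens)

print(counts.most_common(2))  # [('prabowo', 3), ('gibran', 1)]
```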

As we can see from the chart, the words ‘Prabowo’ and ‘Gibran’ are the most frequently used by commenters. ‘Anies,’ ‘Muhaimin,’ and ‘Amin’ are also frequently mentioned. The word ‘Ganjar’ appears in the list, albeit in a lower position. This suggests which candidate is the central focus of the debate. If we calculate the aggregate mentions for each pair of candidates, Prabowo-Gibran is the most frequently mentioned, followed by Anies-Muhaimin, with Ganjar-Mahfud in last place.

We can also analyze the sentiment of the most liked comments. As we can see, the majority of the top-liked comments express positive sentiment.

Now, let’s examine the sentiment trend of each candidate during the debate.

As we can see from the dashboard, the Anies-Muhaimin pair consistently led in positive sentiment across all debates, except for the second debate where Prabowo-Gibran took the lead. In terms of negative sentiment, Anies-Muhaimin and Prabowo-Gibran alternated in receiving the most negative comments, with Anies-Muhaimin receiving the most negative comments in the first, third, and fifth debates, while Prabowo-Gibran received the most negative comments in the second and fourth debates. Ganjar-Mahfud remained at the bottom of both positive and negative sentiment rankings compared to the other candidates.

There are several factors that could explain the observed patterns in sentiment. One possible reason for Prabowo-Gibran’s high positive sentiment is Gibran Rakabuming Raka’s unexpectedly strong performance in the second debate. Many political analysts have described his performance as ‘impressive’ and ‘deadly’ to his opponent, which may have resonated with audiences and contributed to positive sentiment towards the candidate. As shown in the charts below, when we filter the sentiment for the second debate, we can see that the most frequently occurring words are related to the Prabowo-Gibran pair, with ‘Gibran’ being the most prominent.

Despite attracting positive comments, Prabowo-Gibran’s performance also attracted negative comments, particularly in the second and fourth debates, where the pair received the most negative sentiment. This suggests that Gibran’s gestures during the debates were controversial among audiences, with some viewing them positively and others negatively. By the fourth debate, the ‘shock effect’ of his gestures had worn off, and some viewers felt they were excessive and disrespectful, particularly towards Mahfud MD.

The Anies-Muhaimin pair consistently led in positive sentiment during the debates, likely due to their strong performances in each one. However, they fell to second place in the second debate due to Gibran’s unexpected performance and his attack on Muhaimin, which Muhaimin was unable to handle effectively. Despite consistently receiving the most positive sentiment, Anies-Muhaimin also received the most negative comments, particularly in the first, third, and fifth debates, where Anies took center stage. This suggests that their performances in those debates were somewhat controversial and divisive among audiences.

The Ganjar-Mahfud candidate was the least mentioned in the comments, indicating that they struggled to stand out and differentiate themselves from the other candidates. This may explain why their vote count was the lowest in the Indonesian General Election Commission (KPU RI) Presidential voting results for 2024 (this article was published after the election results had been announced).

Conclusions

The Indonesian Presidential Election of 2024 was the largest political event in the world in the first quarter of the year, with a significant portion of the participants being millennials and Gen Z, accounting for over 60% of the voters. One way to measure the electability of the candidates is through sentiment analysis, which many firms conducted using data from social media platforms like Twitter, Instagram, and TikTok. The results of these analyses, such as those conducted by Drone Emprit, consistently showed the Anies-Muhaimin candidate in the lead.

However, after the results were announced, the sentiment analysis did not match the actual outcome, with Prabowo-Gibran winning in a landslide victory. This discrepancy could be due to several reasons, such as the fact that not all Indonesians actively express their political opinions online, and the country still faces issues with low internet accessibility, which is below 80%. Sentiment analysis may be more effective in predicting election results in developed countries like the US and Canada, where people are more vocal, and internet access is more established.

In conclusion, I would like to congratulate Pak Prabowo Subianto and Mas Gibran Rakabuming Raka on their victory as the President and Vice President Elect of Indonesia for the 2024–2029 term. I hope that their administration can lead Indonesia towards a brighter future as a developed nation.

Notes

I would like to recommend that all readers try different methods of conducting sentiment analysis instead of following my approach with the IndoBert model. While the model is relatively accurate in predicting sentiment, it still has a lot of room for improvement, especially when dealing with sentences that mention two or more candidates as the subject. Consider trying AI APIs such as GPT-4, or fine-tuning the IndoBert model on a better dataset to improve its accuracy.

Additionally, I recommend trying different data warehouse and data visualization tools as an exercise; that is partly why I chose mine. Experimenting with different tools helps you understand their capabilities and limitations, and you may find one that better suits your needs.
