Tracking the Media’s Coverage of the 2024 Election Through Sentiment Analysis
This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.
Authors: Luke Stoner, Andrew Sullivan, Kane Norman
Website: Race for The White House
Introduction
As we approach the 2024 United States Presidential Election, every moment seems crucial, and each piece of news battles for attention. While many Americans’ attention has shifted to online and alternative news sources, television news media still plays a major role in U.S. election cycles and in shaping public opinion. However, in this sea of sensationalism and punditry, the true substance of candidates can become obscured, leaving potential voters grappling with biased narratives and a lack of in-depth coverage.
Over the past decade, bias and misinformation have become major points of contention in the American media landscape. From former president Donald Trump’s persistent claims of “fake news,” to “Russiagate” in 2017, to election denialism in 2020, voters in both parties take issue with how their preferred candidates are covered in the media, especially on TV. Furthermore, news outlets often focus on negative stories and major candidates that drive viewership, leaving lesser-known candidates without an equal platform to share their message.
In this project, we aimed to investigate how each candidate is covered and how that coverage varies by network. First, we simply track how much coverage each candidate receives: specifically, the number of times their full name is mentioned on the news. Additionally, we use a RoBERTa-based model to analyze the sentiment of candidate coverage, classifying each full-name mention as either positive or negative. Overall, our analyses confirmed many of our expectations going into the project: candidates with lower name recognition received far less coverage, while traditionally party-affiliated networks like Fox News and MSNBC showed stark differences in their coverage.
Data Collection
We had one major source for this project — the Internet Archive’s TV News database. The database stores clips from programming across many different networks, including mainstream American channels (Fox News, CNN, MSNBC), international sources (BBC, RT, DW), and select local news stations. In addition to the video clips, the archive stores accompanying closed captions, which is what we scrape to form our dataset, along with other metadata such as the date and network of the candidate mention. Lastly, this database is updated daily, with most clips available within 12 hours of the show airing.
To scrape the data, we used Selenium with ChromeDriver, which allowed us to interact with the Internet Archive in the same manner as a standard user. We specified a search of the archive by crafting a URL for each candidate, along with the start and end dates of the search. By the end of the project, this scraping process was automated via a scheduled Python script, with Selenium running in “headless” mode, meaning that no browser window was opened during scraping. The final unlabeled dataset contained the following for each mention: candidate first/last name, candidate party, date and network of the mention, and the cleaned mention text. The cleaning process for the scraped captions included removing unnecessary characters and spaces, as well as cropping the text to a maximum of 100 words to capture only the relevant textual context.
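To make the scraping step concrete, below is a minimal sketch of how a headless Selenium session can query the archive for a single candidate. The search URL format and the `build_search_url` helper are simplified assumptions for illustration, not the exact query structure used in the project.

```python
# Illustrative sketch of the scraping step (the search URL parameters below
# are assumptions, not the exact ones used in the project).
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def build_search_url(candidate: str, start: str, end: str) -> str:
    """Craft an Internet Archive TV News search URL for one candidate and date range."""
    query = quote_plus(f'"{candidate}"')
    return f"https://archive.org/details/tv?q={query}&time={start}/{end}"


def scrape_candidate(candidate: str, start: str, end: str) -> str:
    """Load the search results in a headless Chrome session and return the page source."""
    options = Options()
    options.add_argument("--headless=new")  # no browser window during scraping
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(build_search_url(candidate, start, end))
        return driver.page_source  # caption snippets and metadata are parsed from this HTML
    finally:
        driver.quit()
```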
Model Training
With our data secured, we moved on to our modeling phase. Advancements in machine learning over the past five years have produced state-of-the-art methods for measuring the sentiment of text. While simpler approaches such as rules-based classification were somewhat accurate, the transition to RNNs (recurrent neural networks) and now transformer models such as Google’s BERT greatly increased the accuracy of sentiment classification. For our project, we chose a fine-tuned version of RoBERTa-Large called SiEBERT. We chose this model for two main reasons:
- The model was trained on 15 different datasets, including social media data and reviews in various industries. This makes the model more robust and generalizable than other models.
- The model uses two output classes (negative, positive) rather than three (negative, neutral, positive); this provides two benefits. First, using only two classes leads to higher classification accuracy. Second, results are more interpretable (an average sentiment label of 0.45 means that a candidate’s mentions are 45% positive).
The figure below from the SiEBERT paper illustrates that the fine-tuned RoBERTa model performs better than other popular models on average, and additionally depicts the increase in accuracy when moving from three to two classes.
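For reference, applying SiEBERT takes only a few lines with the Hugging Face transformers pipeline; the sketch below assumes the publicly released siebert/sentiment-roberta-large-english checkpoint.

```python
# Minimal sketch: scoring a single mention with SiEBERT via the Hugging Face pipeline.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english",
)

mention = "Nikki Haley delivered a strong debate performance last night."
print(sentiment(mention))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]
```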
One issue for our project was that we initially did not have any labeled data to train on, so we first attempted a few self-learning techniques. In the first approach, we labeled the entire dataset with SiEBERT, then fine-tuned the model on a high-confidence subset of the data, meaning mentions whose negative or positive sentiment score reached a predefined threshold. However, we found that this method led to a very unbalanced dataset (most mentions were labeled positive). We also tried a form of weak supervision, creating “weak labels” via NLTK’s VADER and fine-tuning on those labels. This approach also ran into issues: even with very early stopping, the RoBERTa model simply matched the VADER label more than 95% of the time.
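For illustration, the weak-labeling step looked roughly like the sketch below, where VADER’s compound score is thresholded into the same two classes SiEBERT uses; the threshold of 0 is an assumption for this example.

```python
# Sketch of the weak-supervision attempt: deriving "weak" labels with NLTK's VADER,
# which were then used to fine-tune the RoBERTa model.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()


def weak_label(text: str) -> int:
    """Map VADER's compound score to two classes (1 = positive, 0 = negative)."""
    return int(sia.polarity_scores(text)["compound"] >= 0)
```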
Therefore, we decided to hand-label a small random sample of our data to assess initial accuracy and fine-tune the model. Out of the box, SiEBERT achieved a relatively low accuracy of ~65%. This is most likely because the sentiment of a mention on the news is not always straightforward. For example, imagine a newscaster stated, “Compared to Ron DeSantis, Nikki Haley’s debate performance was fantastic.” The model returns a positive label for that text, but if the candidate of interest is Ron DeSantis, the model is obviously incorrect. Nuances like these are still difficult to capture in sentiment classification. That said, after fine-tuning the model on our hand-labeled data, we were able to raise validation accuracy to ~79%.
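A condensed sketch of the fine-tuning step is shown below. The file names, hyperparameters, and train/validation split are illustrative placeholders rather than the exact configuration we used.

```python
# Condensed sketch of fine-tuning SiEBERT on hand-labeled mentions.
# File names and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

checkpoint = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Expects CSVs with "text" and "label" (0 = negative, 1 = positive) columns.
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})
data = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

args = TrainingArguments(
    output_dir="siebert-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["validation"],
)
trainer.train()
```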
In addition to sentiment classification, we also experimented with text summarization tasks. Once again utilizing fine-tuned transformer models, we randomly sampled mentions for each candidate to create a summary of how the candidate was being discussed in the news. We also employed a keyword extraction model to investigate if that would provide any further insight. While the summaries and keywords were not complete gibberish, ultimately we decided they did not provide enough value to our project to include them in our final implementation.
Model Deployment
Once our model training was complete, it was time to deploy the model. Our goal was to create an app that could take text as input and use our final model to return negative and positive sentiment scores. To accomplish this, we used Flask to create the app, which was then containerized via Docker and deployed to a Vertex AI endpoint.
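A stripped-down version of such a prediction service might look like the following. The route names and request format are simplified assumptions; a real Vertex AI custom container reads its port and routes from the AIP_* environment variables.

```python
# Minimal sketch of a Flask prediction service (routes and request shape simplified).
from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
sentiment = pipeline("sentiment-analysis", model="siebert/sentiment-roberta-large-english")


@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok"})


@app.route("/predict", methods=["POST"])
def predict():
    instances = request.get_json()["instances"]  # list of {"text": ...} objects
    preds = sentiment([item["text"] for item in instances])
    return jsonify({"predictions": preds})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```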
Once the app was deployed to Vertex AI, we could also utilize Google Cloud Platform monitoring tools. The endpoint dashboard provides useful performance metrics such as prediction speed, as well as request and response code monitoring to identify and diagnose potential errors. In the example below, we can see that the endpoint experienced an approximately five-minute period where it had trouble effectively returning responses.
API
Our API serves a simple yet vital role in our project. The service, which runs on a Uvicorn server, completes a three-step process every Sunday at 6 pm via a cron job. First, it scrapes the Internet Archive for the past week’s candidate mentions. Next, it sends those mentions to our Vertex endpoint in batches for labeling. Lastly, the newly labeled data is concatenated to the existing labeled data in our GCP data bucket, and a new version of the data is created and stored. Consistently updating our data is critical to keeping our frontend current and relevant.
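Conceptually, the labeling step of that weekly job looks something like the sketch below; the endpoint resource name, batch size, and response format are placeholders, and the job itself would be triggered by a cron entry along the lines of 0 18 * * 0 (every Sunday at 6 pm).

```python
# Sketch of the weekly labeling step. The endpoint path, batch size, and the
# structure of the returned predictions are placeholders, not the project's
# exact values. The input DataFrame holds one cleaned mention per row.
import pandas as pd
from google.cloud import aiplatform


def label_weekly_mentions(mentions: pd.DataFrame) -> pd.DataFrame:
    """Label one week of scraped mentions via the Vertex AI endpoint."""
    endpoint = aiplatform.Endpoint("projects/PROJECT/locations/REGION/endpoints/ENDPOINT_ID")
    labels = []
    for start in range(0, len(mentions), 100):  # send mentions in batches of 100
        batch = mentions["text"].iloc[start:start + 100].tolist()
        response = endpoint.predict(instances=[{"text": t} for t in batch])
        labels.extend(pred["label"] for pred in response.predictions)
    mentions["label"] = labels
    return mentions  # then appended to the existing labeled data in the GCS bucket
```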
Frontend
With our frontend, we wanted to focus on telling a clear story, while also giving the user as much opportunity for interaction with the data as possible. Our application begins with an introduction to the project and a clear explanation as to what our data is, where it comes from, and how we process it. We also introduce the user to the candidates, as we assume both Americans and international users alike may not be familiar with many of the candidates.
Once introductions are complete, we move on to providing insight into how coverage varies across candidates, both in volume and in sentiment. Our visualizations show that better-known candidates do receive much more coverage than lesser-known candidates, but their coverage tends to be more negative. While the average candidate hovered around 50% positive mentions, Joe Biden and Donald Trump averaged approximately 42% and 35%, respectively.
We then moved into network-focused analysis, investigating whether the biases we expected manifested themselves in the data. Unsurprisingly, there was clear evidence of differences in coverage across networks. For Republican candidates, reports on Fox News and Fox Business skewed much more positive than those on CNN and MSNBC, and vice versa for Democratic candidates. We specifically found that more partisan networks tended to focus on negative coverage of their primary adversary: Fox’s most covered candidate was Joe Biden, at just 26% positive reports, while the majority of MSNBC’s reports covered Donald Trump, at 30% positivity.
Next, we look at how major news events may impact media coverage; we include examples such as Donald Trump’s indictment and the Republican Debates. Lastly, our frontend concludes with an overview of what the data taught us, pointing out potential issues with the current media landscape and the nature of our election cycles.
Kubernetes Scaling and CI/CD
With our model, API service, and frontend established, we then had to ensure our web app could properly handle user traffic, so we turned to Google Kubernetes Engine. GKE’s Autopilot clusters made creating and maintaining a Kubernetes cluster simple and intuitive. At the same time, the Autopilot cluster was a powerful tool, autoscaling resources based on CPU and memory usage and providing nodes in multiple zones to prevent outages.
In addition to scaling, we also implemented continuous deployment to ensure changes to our code base were reflected in our model, API, and frontend. On new pushes to our model or app, a new Docker image was built and pushed to Google Artifact Registry; then, that newly pushed image was used to deploy a new version of the model to our Vertex AI endpoint. Additionally, on new pushes to our API or frontend, we once again built and pushed a new Docker image, this time using the new image to redeploy to our Kubernetes cluster.
Conclusion and Future Work
Unfortunately, we didn’t quite solve the United States’ media problem. However, we did learn a lot through this process, both about our media and about everything that goes into the full-stack development of a web app. As data scientists, we were generally proficient in collecting data and modeling it. However, the concepts of containerization, data pipelines, experiment tracking, model deployment, app design, and many others that we encountered during this project were either mostly or entirely new to us.
For future work on this project, we believe we could improve our product by making the following changes or additions:
- Revisit the modeling process to increase accuracy (more hand-labeled data, revisit self-learning)
- Make better use of tools such as distillation and compression to improve modeling efficiency
- Use a software tool such as Ansible to manage the automated deployment of our API and frontend to Kubernetes
We thank our professor Pavlos Protopapas and the teaching staff for providing a challenging, yet rewarding course that undoubtedly made us more well-rounded data scientists, programmers, and developers.