2021 Inauguration on Twitter

Published in Trustworthy Social Media, Mar 8, 2021

Introduction

We recently tested our dynamic data collection method using word embeddings and corpus frequencies to track Twitter discussions about the 2021 presidential inauguration. In this post, we cover the visualizations and metrics used throughout our mini study, a browser-based UI used for human-assisted data collection, and some qualitative findings.

Data Visualization Platform

While our dynamic method can theoretically be fully automated, we are currently testing a human-in-the-loop process for keyword selection.

To that end, we are building a data visualization platform that processes new Twitter data, runs machine learning models on it, and dynamically updates a browser-based frontend UI to reflect changes in live social media discussions and recommend new keywords for data collection. The figure below shows the full cycle of the backend and frontend of our platform.

Workflow of our Data Collection, Storage, Analysis, and Visualization Platform. Here we use the Twitter APIs as an example for data collection and Google Cloud Platform (GCP) as an example for data streaming and storage. The GCP Monitor runs sequentially, while the backend analyses can run in parallel.

Below are some key UI elements on this platform.

Forecast plot generated after training ARIMA on log-transformed #insurrection frequency data. The model predicts log frequencies 15 timesteps into the future and provides a 95% confidence interval.
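To make the forecast plot more concrete, here is a minimal sketch of a 15-step forecast with a 95% interval on log-transformed counts. It is not our platform's code: it swaps the fitted ARIMA model for the simplest member of that family, a random walk with drift (ARIMA(0,1,0) with a constant), and the function name, the use of log(1 + count) to tolerate zero counts, and the sample data are all assumptions made for illustration.

```python
import math


def forecast_log_counts(counts, steps=15, z=1.96):
    """Forecast log(1 + count) with a random walk plus drift.

    This is a stand-in for a fitted ARIMA model, not the model used on
    the platform. Returns a list of (mean, lower, upper) tuples, one per
    future timestep, all on the log scale.
    """
    if len(counts) < 3:
        raise ValueError("need at least 3 observations")

    # log1p is an assumption: it keeps zero counts finite.
    logs = [math.log1p(c) for c in counts]

    # First differences are the one-step changes the model treats as noise.
    diffs = [b - a for a, b in zip(logs, logs[1:])]
    drift = sum(diffs) / len(diffs)
    variance = sum((d - drift) ** 2 for d in diffs) / (len(diffs) - 1)
    sigma = math.sqrt(variance)

    forecast = []
    for h in range(1, steps + 1):
        mean = logs[-1] + h * drift
        # Forecast variance of a random walk grows linearly with the horizon.
        half_width = z * sigma * math.sqrt(h)
        forecast.append((mean, mean - half_width, mean + half_width))
    return forecast


# Hypothetical hourly counts for a tracked hashtag.
hourly_counts = [10, 12, 15, 14, 20, 26, 30]
for step, (mean, lower, upper) in enumerate(forecast_log_counts(hourly_counts), 1):
    print(f"t+{step}: {mean:.2f} [{lower:.2f}, {upper:.2f}]")
```

The interval widens with the square root of the horizon, which is the same qualitative shape you would expect from the fan in an ARIMA forecast plot. The interval here ignores uncertainty in the estimated drift, so it is slightly too narrow.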

Interactive t-SNE (t-distributed stochastic neighbor embedding) plot of the 30 closest neighbors to the tracked keyword #insurrection (shown in blue).

Table of the 30 closest neighbors to #insurrection, sorted by a linear combination of each keyword's cosine distance to #insurrection and its corpus frequency.
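The ranking behind a table like this can be sketched in a few lines. The weights, the normalization of frequency by the maximum count, and the toy two-dimensional vectors below are all assumptions for illustration; the post does not specify the exact coefficients we use.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def rank_neighbors(target, embeddings, freqs, alpha=0.5, beta=0.5, top_k=30):
    """Rank candidate keywords by beta * frequency - alpha * distance.

    alpha and beta are hypothetical weights. Frequencies are scaled to
    [0, 1] so that both terms live on comparable ranges.
    """
    max_freq = max(freqs.values())
    target_vec = embeddings[target]
    scored = []
    for word, vec in embeddings.items():
        if word == target:
            continue
        distance = cosine_distance(target_vec, vec)
        freq_score = freqs.get(word, 0) / max_freq
        scored.append((word, beta * freq_score - alpha * distance))
    # Higher score means closer to the target and more frequent.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for GloVe embeddings.
embeddings = {
    "#insurrection": [1.0, 0.0],
    "#capitolriot": [0.9, 0.1],
    "#coup": [0.8, 0.2],
    "#weather": [0.0, 1.0],
}
freqs = {"#capitolriot": 100, "#coup": 10, "#weather": 100}

for word, score in rank_neighbors("#insurrection", embeddings, freqs):
    print(f"{word}: {score:.3f}")
```

Mixing the two signals keeps a frequent but unrelated term (here `#weather`) from outranking a rarer term that sits close to the tracked keyword in embedding space.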

We used a prototype of this platform to make human-assisted keyword selections throughout the 2021 inauguration case study.

Experiment Details

We started our dynamic Twitter monitor on January 11, 2021 (at 10:40:44 Pacific Standard Time), with a single keyword, “inauguration.” From then until January 22, 2021, we used our dynamic keyword selection method to choose new keywords for data collection using Twitter’s streaming API.

The figure below shows the hourly volume of inauguration-related tweets detected by our dynamic monitor throughout the experiment.

Fig 1: Volume of tweets collected by a dynamic monitor based on word embeddings and frequencies, and by a static monitor using only one keyword (“inauguration”).
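An hourly volume series like the one in Fig 1 comes from bucketing tweet timestamps by hour. A minimal sketch, assuming each tweet's creation time has already been parsed into a `datetime` (the function name and sample timestamps are hypothetical):

```python
from collections import Counter
from datetime import datetime


def hourly_volume(timestamps):
    """Count tweets per hour, returning a dict ordered by hour."""
    buckets = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    return dict(sorted(buckets.items()))


# Hypothetical tweet creation times.
tweet_times = [
    datetime(2021, 1, 20, 9, 5),
    datetime(2021, 1, 20, 9, 59, 30),
    datetime(2021, 1, 20, 10, 0),
]
for hour, count in hourly_volume(tweet_times).items():
    print(hour.isoformat(), count)
```

Running the same function over the dynamic and static datasets gives two series that can be plotted on a shared time axis.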

Each day of the experiment, we used a prototype of our data visualization platform to visualize important metrics (corpus frequencies, GloVe embeddings, nearest neighbors) and select new keywords for data collection. Our browser-based UI allowed our research group to streamline a human-assisted keyword selection process.

Observations

From the word clouds below, we see that our dynamic monitor covers a wider range of topics and has a more uniform distribution in terms of the frequency of different keywords when compared to a static monitor (which uses only “inauguration” to pull data).

Word clouds from the **static monitor** (left) and the **dynamic monitor** (right) on day 9 of the experiment, showing the most frequently used hashtags in the statically and dynamically obtained Twitter datasets. The size of each word corresponds to its frequency within the corpus.
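The frequencies that drive word sizes in a cloud like this can be computed with a simple hashtag count. The sketch below assumes tweets are available as plain strings and that hashtags are matched case-insensitively with a basic regular expression; both the function name and the sample tweets are made up for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def hashtag_frequencies(tweets):
    """Count lowercase hashtags across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG.findall(text))
    return counts


# Hypothetical tweet texts.
sample_tweets = [
    "Watching the #Inauguration now",
    "#inauguration #BidenHarris",
    "no tags here",
]
print(hashtag_frequencies(sample_tweets).most_common(5))
```

A flatter distribution of these counts is what we mean by the dynamic monitor being more uniform than the static one.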

Our dynamic method revealed trending terms which cropped up from Jan 11 to Jan 22, like the high-frequency keywords shown below.

Further, our data visualization platform allowed us to clearly trace the evolution of our dynamic keyword set throughout the experiment:

Keyword updates (additions and removals) made by the dynamic monitor during the real-time case study of 2021 #inauguration discussions on Twitter.
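Tracing additions and removals like these amounts to a set difference between consecutive keyword sets. A minimal sketch, with a hypothetical function name and example sets that are not the actual keyword sets from the experiment:

```python
def keyword_changes(previous, current):
    """Return the keywords added and removed between two tracking sets."""
    prev, cur = set(previous), set(current)
    return {"added": sorted(cur - prev), "removed": sorted(prev - cur)}


# Illustrative keyword sets for two consecutive days.
day_1 = {"inauguration", "#insurrection"}
day_2 = {"inauguration", "#nationalguard"}
print(keyword_changes(day_1, day_2))
```

Logging this diff each day yields the timeline of keyword changes that the figure visualizes.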

Let’s provide some real-world context for the keywords in the figure above. The emergence of ‘#nationalguard’ on Jan 12 likely corresponds to the 2021 Capitol protests (as do ‘#capitolriot’ and ‘#capitolbuilding’ on Jan 16), and the emergence of ‘databreach’ on Jan 13 is likely linked to the hacking of Parler’s data. As time goes on, we see that the monitor begins to focus on politicians, detecting various forms of ‘#biden’ (‘#joebiden’, ‘#bidenharris’, ‘#bideninauguration’, ‘#bidenharrisinauguration’), ‘#trump’ (‘#donaldtrump’), and ‘#kamalaharris’. After the inauguration on Jan 20, all #inauguration discussions naturally lost traction (see Figure 10 for declining frequencies after the fact).

In general, we find that using our dynamic keyword selection method to generate keyword recommendations and graphics on our data visualization platform made for a more informative and thorough data collection process. For more details on case study findings, experiment design, and the data visualization platform, see our recent arXiv submission.

Next Steps

As an immediate next step, we plan to incorporate predictive time series modeling into our algorithm as another data point for keyword selection. Further, in the next few months, we aim to publicly release code for the data visualization platform so others can try this method out on their own Twitter data. Thanks for reading, and stay tuned for more work!