2021 Inauguration on Twitter
Introduction
We recently tested our dynamic data collection method using word embeddings and corpus frequencies to track Twitter discussions about the 2021 presidential inauguration. In this post, we cover the visualizations and metrics used throughout our mini study, a browser-based UI used for human-assisted data collection, and some qualitative findings.
Data Visualization Platform
While our dynamic method can theoretically be fully automated, we are currently testing a human-in-the-loop process for keyword selection.
To that end, we are building a data visualization platform that processes new Twitter data, runs machine learning models on it, and dynamically updates a browser-based frontend UI to reflect changes in live social media discussions and to recommend new keywords for data collection. The figure below shows the full backend and frontend cycle of our platform.
Below are some key UI elements on this platform.
We used a prototype of this platform to make human-assisted keyword selections throughout the 2021 inauguration case study.
Experiment Details
We started our dynamic Twitter monitor on January 11, 2021 (at 10:40:44 Pacific Standard Time) with a single keyword, “inauguration.” From then until January 22, 2021, we used our dynamic keyword selection method to choose new keywords for data collection through Twitter’s streaming API.
The figure below shows the hourly volume of inauguration-related tweets detected by our dynamic monitor throughout the experiment.
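For readers who want to build a similar hourly volume series from their own data, here is a minimal sketch. It assumes tweet timestamps are available as ISO 8601 strings; the function name `hourly_volume` and the sample timestamps are hypothetical and are not taken from our platform or our dataset.

```python
from collections import Counter
from datetime import datetime


def hourly_volume(timestamps):
    """Count tweets per hour. `timestamps` are ISO 8601 strings (an assumption)."""
    counts = Counter()
    for ts in timestamps:
        dt = datetime.fromisoformat(ts)
        # Truncate to the start of the hour so tweets in the same hour share a key.
        counts[dt.replace(minute=0, second=0, microsecond=0)] += 1
    # Return the buckets in chronological order.
    return dict(sorted(counts.items()))


# Hypothetical timestamps, for illustration only (not data from the study).
sample = ["2021-01-20T09:05:00", "2021-01-20T09:48:30", "2021-01-20T10:01:00"]
volume = hourly_volume(sample)
for hour, n in volume.items():
    print(hour.isoformat(), n)
```

Hours with no tweets are simply absent from the result, so a plotting step would need to fill those gaps with zeros.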
Each day of the experiment, we used a prototype of our data visualization platform to visualize important metrics (corpus frequencies, GloVe embeddings, nearest neighbors) and select new keywords for data collection. Our browser-based UI allowed our research group to streamline a human-assisted keyword selection process.
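To make the metrics above more concrete, here is a rough sketch of how nearest neighbors in an embedding space and corpus frequencies could be combined to rank candidate keywords. This is an illustration under stated assumptions, not the exact scoring used in the study: the scoring rule (cosine similarity times log frequency), the `min_freq` threshold, the function names, and the toy two-dimensional vectors are all hypothetical. In practice the vectors would come from GloVe embeddings trained on the collected tweets.

```python
import numpy as np


def nearest_neighbors(word, vectors, k=3):
    """Return the k words whose embeddings are most cosine-similar to `word`."""
    words = [w for w in vectors if w != word]
    query = vectors[word] / np.linalg.norm(vectors[word])
    sims = [float(query @ (vectors[w] / np.linalg.norm(vectors[w]))) for w in words]
    order = np.argsort(sims)[::-1][:k]  # indices of the highest similarities first
    return [(words[i], sims[i]) for i in order]


def recommend(current_keywords, vectors, freqs, k=3, min_freq=5):
    """Rank neighbors of the current keywords by similarity times log frequency.

    The scoring rule and `min_freq` cutoff are assumptions made for this sketch.
    """
    scores = {}
    for kw in current_keywords:
        for cand, sim in nearest_neighbors(kw, vectors, k):
            # Skip words already tracked and words that are too rare in the corpus.
            if cand in current_keywords or freqs.get(cand, 0) < min_freq:
                continue
            score = sim * np.log1p(freqs[cand])
            scores[cand] = max(scores.get(cand, float("-inf")), score)
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings and counts, for illustration only (not data from the study).
vectors = {
    "inauguration": np.array([1.0, 0.0]),
    "#biden": np.array([0.9, 0.1]),
    "#nationalguard": np.array([0.7, 0.7]),
    "weather": np.array([0.0, 1.0]),
}
freqs = {"#biden": 120, "#nationalguard": 40, "weather": 2}

recs = recommend({"inauguration"}, vectors, freqs)
print(recs)
```

In a human-in-the-loop setting, a ranked list like `recs` would be shown to a reviewer, who decides which candidates actually join the keyword set.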
Observations
From the word clouds below, we see that our dynamic monitor covers a wider range of topics and has a more uniform distribution of keyword frequencies than a static monitor (which uses only “inauguration” to pull data).
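The comparison above is visual, but the idea of a "more uniform" frequency distribution can also be quantified. One possible measure, which is an assumption of this sketch and not a metric reported in the study, is normalized Shannon entropy: it equals 1 when all keywords are equally frequent and approaches 0 when a single keyword dominates. The counts below are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(counts):
    """Shannon entropy of a frequency table divided by its maximum, log(n)."""
    total = sum(counts.values())
    n = len(counts)
    if n < 2 or total == 0:
        return 0.0  # entropy is undefined or trivially zero in these cases
    probs = [c / total for c in counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(n)


# Hypothetical keyword counts, for illustration only (not data from the study).
static_counts = Counter({"inauguration": 90, "#biden": 6, "#trump": 4})
dynamic_counts = Counter(
    {"inauguration": 30, "#biden": 25, "#trump": 25, "#nationalguard": 20}
)

static_score = normalized_entropy(static_counts)
dynamic_score = normalized_entropy(dynamic_counts)
print(f"static: {static_score:.3f}, dynamic: {dynamic_score:.3f}")
```

With these toy counts the flatter distribution scores higher, which matches the intuition behind comparing the two word clouds.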
Our dynamic method revealed trending terms which cropped up from Jan 11 to Jan 22, like the high-frequency keywords shown below.
Further, our data visualization platform allowed us to clearly trace the evolution of our dynamic keyword set throughout the experiment:
Let’s provide some real-world context for the keywords in the figure above. The emergence of ‘#nationalguard’ on Jan 12 likely corresponds with the 2021 Capitol protests (as do ‘#capitolriot’ and ‘#capitolbuilding’ on Jan 16), and the emergence of ‘databreach’ on Jan 13 is likely linked to the hacking of Parler’s data. As time goes on, we see that the monitor begins to focus on politicians, detecting various forms of ‘#biden’ (‘#joebiden’, ‘#bidenharris’, ‘#bideninauguration’, ‘#bidenharrisinauguration’), ‘#trump’ (‘#donaldtrump’), and ‘#kamalaharris’. After the inauguration on Jan 20, all #inauguration discussions naturally lost traction (see Figure 10 for declining frequencies after the fact).
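Tracing when each keyword first enters the set is a simple computation once daily snapshots of the keyword set are stored. Here is a minimal sketch, assuming snapshots are keyed by ISO date strings. The function name `first_appearances` is hypothetical, and the snapshots below are a partial reconstruction based only on the emergence dates mentioned in this post, not the full keyword set from the experiment.

```python
def first_appearances(daily_keywords):
    """Map each date to the keywords that first appear on that date."""
    seen = set()
    emerged = {}
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for day in sorted(daily_keywords):
        new = sorted(set(daily_keywords[day]) - seen)
        if new:
            emerged[day] = new
        seen.update(daily_keywords[day])
    return emerged


# Partial, illustrative snapshots built from the dates mentioned in the text.
daily_keywords = {
    "2021-01-11": {"inauguration"},
    "2021-01-12": {"inauguration", "#nationalguard"},
    "2021-01-13": {"inauguration", "#nationalguard", "databreach"},
    "2021-01-16": {
        "inauguration", "#nationalguard", "databreach",
        "#capitolriot", "#capitolbuilding",
    },
}

emerged = first_appearances(daily_keywords)
for day, words in emerged.items():
    print(day, words)
```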
In general, we find that using our dynamic keyword selection method to generate keyword recommendations and graphics on our data visualization platform made for a more informative and thorough data collection process. For more details on case study findings, experiment design, and the data visualization platform, see our recent arXiv submission.
Next Steps
As an immediate next step, we plan to incorporate predictive time series modeling into our algorithm as another data point for keyword selection. Further, in the next few months, we aim to publicly release code for the data visualization platform so others can try this method on their own Twitter data. Thanks for reading, and stay tuned for more work!