Dynamics of Political Polarization: Insights from Using Machine Learning and Natural Language Processing with Twitter Data

Published in

Trustworthy Social Media

Nov 4, 2022

The American public increasingly finds itself bitterly divided over political differences. Survey indicators, partisan media, and the public’s voting patterns inform this sense of division in our politics. We apply Machine Learning and Natural Language Processing (NLP) methods in a novel way to paint a more nuanced picture of the divisions in American political opinion.

It turns out that even very simple NLP methods that rely on word frequencies in politicians’ tweets can predict party affiliation remarkably well, reaching over 80% accuracy without any special tuning. These simple models are also robust: a model trained on tweets from the House of Representatives is equally predictive when tested on tweets from US Senators. Furthermore, even though politicians of both major parties have become increasingly partisan and homogeneous in their ideological commitments, these simple models can produce very credible rankings of politicians along the left-right spectrum. These rankings closely replicate similar rankings based on politicians’ voting records. If these results withstand more rigorous scrutiny, NLP methods applied to politicians’ tweets could become a very simple tool for measuring the degree of partisanship of any politician, even those with meager or nonexistent voting records. In these settings, even simple assumptions about the structure of the topics yield large gains in understanding political polarization among political elites.
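To make the word-frequency idea concrete, here is a minimal sketch of one such simple model: a multinomial naive Bayes classifier over bag-of-words counts. This is an illustration under stated assumptions, not the model used in this research. The tweets, the party labels, and every function name below are invented for the example. It trains on toy "House" tweets, tests on toy "Senate" tweets, and then uses the log odds as a crude left-right score.

```python
import math
from collections import Counter

# Toy, invented tweets. Real inputs would be legislators' actual tweets.
house_tweets = [
    ("D", "protect climate and expand health care"),
    ("D", "climate action and voting rights now"),
    ("R", "secure the border and cut taxes"),
    ("R", "cut taxes and defend the border"),
]
senate_tweets = [
    ("D", "health care and climate action"),
    ("R", "border security and lower taxes"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train(labeled_tweets):
    """Count word frequencies per party: the bag-of-words representation."""
    word_counts = {"D": Counter(), "R": Counter()}
    doc_counts = Counter()
    for party, text in labeled_tweets:
        word_counts[party].update(tokenize(text))
        doc_counts[party] += 1
    vocab = set(word_counts["D"]) | set(word_counts["R"])
    return word_counts, doc_counts, vocab


def log_odds_democrat(text, model):
    """Naive Bayes log odds with add-one smoothing.

    Positive values lean Democratic, negative values lean Republican.
    """
    word_counts, doc_counts, vocab = model
    totals = {p: sum(word_counts[p].values()) for p in ("D", "R")}
    score = math.log(doc_counts["D"] / doc_counts["R"])
    for word in tokenize(text):
        if word not in vocab:
            continue  # ignore words never seen in training
        p_d = (word_counts["D"][word] + 1) / (totals["D"] + len(vocab))
        p_r = (word_counts["R"][word] + 1) / (totals["R"] + len(vocab))
        score += math.log(p_d / p_r)
    return score


def predict(text, model):
    return "D" if log_odds_democrat(text, model) > 0 else "R"


# Train on the "House" sample, evaluate on the "Senate" sample.
model = train(house_tweets)
predictions = [predict(text, model) for _, text in senate_tweets]
accuracy = sum(
    pred == party for pred, (party, _) in zip(predictions, senate_tweets)
) / len(senate_tweets)

# Sorting by log odds gives a crude ranking from most D-leaning to most R-leaning.
ranking = sorted(
    senate_tweets,
    key=lambda item: log_odds_democrat(item[1], model),
    reverse=True,
)
```

With only a handful of toy tweets the accuracy here means nothing. The point is the mechanism: the same per-word log odds that drive the party prediction also give a continuous score that can be sorted into a ranking.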

Imposing some additional latent structure on this bag-of-words approach adds interpretability while still recovering the original rankings. To this end, we use the Principal Components of the outputs from a Joint Sentiment-Topic model to produce polarization scores. These scores describe the relative positions of federal legislators, state legislators, members of the executive branch, local politicians, and journalists in a shared issue space. This allows us to analyze polarization in online speech across a variety of domains and over time. Based on posts from their public Twitter accounts from 2018 to 2022, we first uncover the topics that politicians and journalists discussed during this period. At the same time, we estimate whether the sentiments underlying this discussion are positive, negative, or neutral. We then use these topic and sentiment scores to uncover online partisan positioning (using a more classical dimension reduction technique).
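The dimension reduction step can be sketched as follows, assuming the topic model has already produced one row of topic-sentiment proportions per account. The account names, the three feature columns, and all numbers are hypothetical, and the first principal component is found here by plain power iteration rather than by any particular library routine. This is a sketch of the general technique, not of the exact pipeline behind the figures.

```python
import math

# Hypothetical topic-sentiment features per account (invented numbers).
# Columns: [negative on environmental rollbacks,
#           negative on mask policies,
#           shares links to other tweets or articles]
features = {
    "dem_legislator_1": [0.8, 0.1, 0.1],
    "dem_legislator_2": [0.7, 0.2, 0.1],
    "rep_legislator_1": [0.1, 0.8, 0.1],
    "rep_legislator_2": [0.2, 0.7, 0.1],
}


def center(rows):
    """Subtract each column's mean so PCA captures variation, not level."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration."""
    vec = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        new = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec


names = list(features)
centered = center([features[name] for name in names])
component = first_component(covariance(centered))

# Project each account onto the first component to get a one-dimensional score.
scores = {
    name: sum(x * w for x, w in zip(row, component))
    for name, row in zip(names, centered)
}

# The sign of a principal component is arbitrary. As a convention for this
# example only, orient it so the first Democratic account scores positive.
if scores["dem_legislator_1"] < 0:
    component = [-w for w in component]
    scores = {name: -s for name, s in scores.items()}
```

The weights in `component` play the role of topic contributions: columns with large absolute weights are the topic-sentiment pairs that separate accounts most along that dimension, which is what makes the resulting scores interpretable.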

This method offers several advantages over existing methods in political science for measuring ideological positioning, such as DW-Nominate (based on how legislators vote on bills) and Bonica Scaling (based on campaign donations). First, it does not rely on binary data, so it is better able to identify moderates and extremists within each party. Second, the positions estimated by our scores are derived from continuous data based on salient topics, allowing for coherent analysis of the variation contributing to each dimension. Third, it allows comparisons across all Twitter or other social media users in the same derived space.

Figure 1: Discussion Space on Political Twitter

With our method, we show in Figure 1 that members of the public can be directly compared with their politicians, enabling new ways to measure political representation. In Figure 1, we identify two dimensions in the data from political Twitter: a partisan dimension on the Y-axis and a journalistic dimension on the X-axis. The further up you go on the Y-axis, the more liberal we find a speaker’s tweets. The further right you go on the X-axis, the more likely our method considers you to be a journalist. The topics contributing the most to political polarization in this period are, for Democrats, tweets conveying a negative view of Trump Environmental Regulations and, for Republicans, tweets conveying negative sentiment towards Mask Policies. Unsurprisingly, we find that topics related to highlighting another tweet or article contribute the most to being identified as a journalist. Finally, the scores can be estimated tractably over arbitrarily long time periods, so policy stances can be compared over time in the same space.

We note that, according to our method, journalists generally tend to be less ideologically cohesive than politicians in how they communicate on social media. This finding could be an artifact of how journalists were placed in their own cluster, but it might also suggest that journalists exist in their own information ecosystem, one separate from political elites.

Some of our other preliminary results are surprising, especially given the heated tone of political news media.

Figure 2: Dynamics in Polarization of Senate Twitter Accounts

First, as we see in Figures 2 and 3, levels of polarization in elite speech on Twitter are relatively stable during the 2020 to 2022 period. At the start of the Covid pandemic in March 2020, both the House and Senate depolarized, but they quickly re-polarized thereafter. We mark this period with the vertical line in both figures. We find strong statistical evidence that this convergence in political speech was unlikely to be mere random noise. This suggests something akin to a rally-around-the-flag effect in the initial phase of the pandemic.
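One way to check whether a drop in polarization is more than noise is a paired permutation test on the gap between party means. The sketch below is an assumption about how such a check could look, not a description of the test actually used for Figures 2 and 3. The scores are invented, each legislator is assumed to have one score before and one after the event, and the helper names are hypothetical.

```python
import random
from statistics import mean

# Invented polarization scores for the same legislators in two periods.
# Position i in "before" and "after" refers to the same legislator.
before = {"D": [0.52, 0.47, 0.50, 0.55], "R": [-0.49, -0.51, -0.46, -0.54]}
after = {"D": [0.21, 0.18, 0.25, 0.20], "R": [-0.19, -0.22, -0.17, -0.24]}


def party_gap(scores):
    """Polarization measured as the distance between party mean scores."""
    return mean(scores["D"]) - mean(scores["R"])


def gap_drop(first, second):
    """How much the party gap shrinks from the first period to the second."""
    return party_gap(first) - party_gap(second)


def permutation_p_value(before, after, n_permutations=2000, seed=0):
    """Paired permutation test for the drop in the party gap.

    Under the null hypothesis that the period does not matter, each
    legislator's two scores are exchangeable, so we randomly swap them
    and count how often the shuffled drop is at least as large as the
    observed one.
    """
    rng = random.Random(seed)
    observed = gap_drop(before, after)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        shuffled_before = {"D": [], "R": []}
        shuffled_after = {"D": [], "R": []}
        for party in ("D", "R"):
            for b, a in zip(before[party], after[party]):
                if rng.random() < 0.5:
                    b, a = a, b  # swap this legislator's period labels
                shuffled_before[party].append(b)
                shuffled_after[party].append(a)
        if gap_drop(shuffled_before, shuffled_after) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


p_value = permutation_p_value(before, after)
```

A small `p_value` says that a drop this large would rarely arise from relabeling the periods at random, which is the sense in which a convergence can be called unlikely to be mere noise.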

Figure 3: Dynamics in Polarization of House Twitter Accounts

Second, Figures 4 and 5, which show the discussion space in the California State Legislature, suggest that evidence for political division is stronger at the national level. In the state legislatures, legislators tend to share similar stances on the issues. There are a number of plausible hypotheses that might explain this result. State houses might be less influenced by national media, or state legislators might be less attentive to nationally polarizing social issues. Or perhaps these are simply characteristics of state legislative politics in these very blue or very red state legislatures. Clearly, more research is needed to determine why we see these differences in political divisions between state and national legislatures.

Figure 4: Dynamics in Polarization of CA State Senate and Assembly Twitter Accounts

We hope these measures will help researchers further study political polarization, political communication strategy, and the dynamics of political messaging.

About the author. Daniel Ebanks is a PhD student in social science and computational social science at Caltech. This research was made possible by an internship at Nvidia. I thank Bojan Tunguz (Nvidia), R. Michael Alvarez (Caltech), Betsy Sinclair (WUSTL), and Joon Park (Caltech) for their collaboration on this project.

Written by Danny Ebanks