Natural Language Processing of Conversations in Python with ConvoKit
Using conversational analysis to explore intergroup linguistic coordination on Reddit
Conversations are critical to our political and civic institutions, communities, and families. However, relatively little Natural Language Processing (NLP) work has been devoted to studying conversations; the field has focused instead on individual pieces of text.
There are a few unique challenges related to the NLP of conversations:
- conversations unfold over time,
- have multiple speakers and utterances (text units),
- and the order of speakers and utterances matters.
Dealing with these complexities on your own can quickly become a headache. Luckily, you don’t have to! A group of researchers at Cornell University have created a Python module for conversational analysis called ConvoKit.
ConvoKit is a tool for transforming raw conversational data into a format that is easier to manipulate, analyze, and share with others. It comes with multiple linguistic analyses already implemented, such as context-independent linguistic coordination and politeness strategies. Plus, it offers a selection of conversational corpora (collections of texts), such as corpora built from over 900K subreddits and from Wikipedia editors’ talk pages, that you can download and analyze.
This tutorial covers the basics of ConvoKit through an example project exploring conversational group dynamics in the polarized context of r/Conservative and r/democrats, two of the largest partisan subreddits.
Research Question
Current research suggests that affective polarization, i.e., negative feelings, dislike, and distrust toward the other party or group, has been growing in the US and other advanced democracies [1]. In the US, Republican and Democrat identities are becoming increasingly more important in everyday life, affecting not only voting decisions but also friendship, dating, and neighborhood choices [1,2]. For instance, in 2019, more than half of Americans said they would not date someone from the other party [3].
Mounting empirical evidence suggests that social media platforms like Twitter, Facebook, and Reddit play an important role in increasing polarization by allowing for partisan sorting, amplifying divisive content, and incentivizing intergroup conflict [4]. Reddit provides a place for partisans to participate in discussions with members of their ingroup and the outgroup by contributing to different subreddits. However, do partisans use the same language when communicating across party lines?
Here, I’m interested in investigating whether partisans unconsciously adapt to the linguistic style of their interlocutors more when the interlocutor is an ingroup member compared to an outgroup member. The degree of imitating the linguistic style, also called linguistic coordination or synchrony, could be considered an implicit measure of affect towards the interlocutor and thus is likely affected by polarization.
To test this, I will use speakers who contributed to both of the largest partisan subreddits — r/Conservative (Republican, conservative) and r/democrats (Democrat, liberal) — and investigate how their overall linguistic coordination differed when they spoke with ingroup or outgroup members.
Theory Behind ConvoKit
ConvoKit offers a framework for conversational analysis with two fundamental concepts: a corpus and a transformation. A corpus is a collection of conversations, and you can do things to the corpus using transformations.
Every corpus has three main elements: speakers, conversations, and utterances. Speakers are participants in conversations, and the things they say are called utterances. You can build a corpus from a collection of utterances associated with a speaker, conversation, and timestamp. Additionally, you can add speaker-level, conversation-level, or utterance-level metadata to keep track of variables you care about.
Transformers are functions that take in a corpus and return the same corpus with some changes. For instance, we will use a linguistic coordination transformer later in this tutorial to calculate how much speakers coordinate to their party members and people from the opposite party.
Practice
With the theory out of the way, let’s move on to the analysis! All of the code below is also on my GitHub. To start off, we need to install the ConvoKit module: pip install convokit.
We are going to use two Reddit corpora available through ConvoKit — r/Conservative and r/democrats. The data was built from Pushshift.io Reddit Corpus and includes posts and comments from subreddit creation to October 2018. These are pretty big subreddits so it will take a while to download the data.
```python
from convokit import Corpus, download

cons = Corpus(filename=download("subreddit-Conservative"))
dems = Corpus(filename=download("subreddit-democrats"))
```
We can see exactly how big the subreddits are by running cons.print_summary_stats() and dems.print_summary_stats().
Turns out that r/Conservative has over 140K speakers across 300K conversations and over 3M utterances, while r/democrats has only 37K speakers in 84K conversations and 370K utterances. So r/Conservative is roughly 3.5 to 8 times bigger than r/democrats, depending on how you want to define size.
Utterances
Let’s see what a random utterance from the conservative corpus looks like: utt = cons.random_utterance().
We can check the text of any utterance by calling utt.text. In this case, the utterance reads: “Abolish the Department of Education.” Intriguing suggestion.
Every utterance, just like every speaker and conversation, has a unique id attached to it: utt.id gives us the id “dk9awc5”. Every utterance is also part of a conversation, which we can check via utt.conversation_id. The person who said the utterance (utt.speaker.id) is “patri0t556”, and it was a reply to the utterance with id “6ng0g4”. We can retrieve that utterance by running cons.get_utterance('6ng0g4').
Every utterance in every corpus must have six primary attributes: text, id, conversation_id, speaker, timestamp, and reply_to.
In addition to the primary attributes, utterances can have metadata encoded as a dictionary, accessible through the meta attribute: utt.meta. For instance, utt.meta['score'] gives the score of the utterance, calculated as the difference between the upvotes and downvotes it received.
Conversations
Just like with utterances, we can get a random conversation with convo = cons.random_conversation().
We can check which utterances are part of the conversation by running convo.get_utterance_ids(). We can also print the reply-tree structure of the conversation with convo.print_conversation_structure().
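Under the hood, a conversation's structure is just a reply tree linked by reply_to. A stdlib-only sketch of reconstructing and printing such a tree, using made-up utterance ids, looks like this:

```python
from collections import defaultdict

# hypothetical (utterance id, reply_to) pairs from one conversation
replies = [("a", None), ("b", "a"), ("c", "a"), ("d", "b")]

children = defaultdict(list)
root = None
for utt_id, parent in replies:
    if parent is None:
        root = utt_id                  # the conversation's root post
    else:
        children[parent].append(utt_id)

def print_tree(utt_id, depth=0):
    # indent each reply under its parent, mimicking the printed structure
    print("    " * depth + utt_id)
    for child in children[utt_id]:
        print_tree(child, depth + 1)

print_tree(root)
```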
Conversations also have their own metadata. For instance, to check the title of the conversation, run convo.meta['title'].
Data Cleaning and Preprocessing
Now we need to clean the data a bit. I found a few conversations that included utterances that were later deleted, and these inconsistencies affect the downstream analysis. I will filter each corpus to only have conversations for which none of the utterances are corrupted. For that, I will check whether each utterance and the utterance it was a reply to still exist in the corpus.
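The intactness check itself is simple. Here is a stdlib-only sketch of the idea, using a hypothetical mapping from utterance ids to the ids they reply to (in the real analysis, these come from the corpus):

```python
# hypothetical conversation: maps each utterance id to the id it replies to
# (None marks the root post); "c" replies to an utterance that was deleted
conversation = {"a": None, "b": "a", "c": "zzz"}

def is_intact(convo_utts):
    # a conversation is intact if every reply points to an utterance
    # that still exists in the conversation
    ids = set(convo_utts)
    return all(parent is None or parent in ids
               for parent in convo_utts.values())

print(is_intact(conversation))  # False
```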
If we print the summary stats again, we see that there were only 26 corrupted conversations in the conservative corpus and 2 in the democrats corpus, overall less than 0.01%.
We want to find the speakers that have contributed to both subreddits.
both = [spkr for spkr in dems_filtered.speakers if spkr in cons_filtered.speakers]
Turns out there are 9134 such speakers in total. I will categorize them as conservative (or democrat) if they were a part of more conversations in the conservative (or democrat) corpus. Speakers who had equal numbers of conversations in the two subreddits are excluded. I’ll store this information in the speaker-level metadata for easy access.
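The categorization rule can be sketched with plain Python, using made-up per-speaker conversation counts:

```python
# made-up per-speaker conversation counts in each subreddit
cons_counts = {"spkr1": 10, "spkr2": 3, "spkr3": 5}
dems_counts = {"spkr1": 2, "spkr2": 8, "spkr3": 5}

main_group = {}
for spkr in cons_counts:
    if cons_counts[spkr] > dems_counts[spkr]:
        main_group[spkr] = "conservative"
    elif dems_counts[spkr] > cons_counts[spkr]:
        main_group[spkr] = "democrat"
    # speakers with equal counts (here, spkr3) are excluded

print(main_group)  # {'spkr1': 'conservative', 'spkr2': 'democrat'}
```

In the real analysis, this label would then be stored in each speaker's meta dictionary.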
Linguistic Coordination
To measure linguistic coordination, I will use a transformer implemented in ConvoKit based on the paper that first introduced the metric [5]. You can read more about the theory behind it and its implementation in the ConvoKit documentation.
Simply put, the linguistic coordination of a given speaker to a given target reflects how much a speaker increases their use of function words that the target used compared to the speaker’s baseline. Linguistic coordination ranges from -1 to 1, with higher scores indicating more synchrony and negative scores indicating the opposite of synchrony. For this analysis, I will average linguistic coordination over targets to obtain one score per speaker.
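Averaging over targets can be sketched like this (the speakers, targets, and scores below are made up for illustration):

```python
from statistics import mean

# hypothetical per-target coordination scores for two speakers
coord_scores = {
    "spkr1": {"t1": 0.12, "t2": -0.05, "t3": 0.08},
    "spkr2": {"t1": 0.02, "t2": 0.04},
}

# average over targets to get a single score per speaker
speaker_scores = {spkr: mean(targets.values())
                  for spkr, targets in coord_scores.items()}
print(speaker_scores)  # approximately spkr1: 0.05, spkr2: 0.03
```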
Linguistic Coordination Analysis
Because linguistic coordination is implemented as a transformer in ConvoKit, we need to fit a Coordination instance to each corpus and transform each corpus individually.
To calculate coordination we need to select speakers and targets of coordination appropriately. For this we need to have a few selector functions:
```python
everyone = lambda spkr: True
conservatives = lambda spkr: spkr.id in cons_main
democrats = lambda spkr: spkr.id in dems_main
```
everyone selects every speaker in a corpus, while conservatives and democrats select only speakers who are in both corpora but whose main corpus (i.e., partisan identity) is r/Conservative or r/democrats, respectively.
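As a toy illustration of how such selectors behave (the speaker ids and main-group sets are hypothetical, and a bare class stands in for a ConvoKit Speaker):

```python
# hypothetical main-group memberships
cons_main = {"redfan"}
dems_main = {"bluefan"}

everyone = lambda spkr: True
conservatives = lambda spkr: spkr.id in cons_main
democrats = lambda spkr: spkr.id in dems_main

class Spkr:
    # stand-in for a ConvoKit Speaker with just an id
    def __init__(self, id):
        self.id = id

speakers = [Spkr("redfan"), Spkr("bluefan"), Spkr("lurker")]
print([s.id for s in speakers if conservatives(s)])  # ['redfan']
print([s.id for s in speakers if everyone(s)])       # all three ids
```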
We can analyze the results by subreddit for more clarity.
This produces lists of scores of how much each group coordinates to ingroup and to outgroup members (four lists in total). We can run a one-sided paired-samples t-test to check whether the same people coordinate significantly more when talking with ingroup members than with outgroup members.
Doing this for conservatives, we get a p-value of around 0.005, which means that indeed conservatives coordinate significantly more to fellow conservatives than to democrats. Similarly, we find that democrats coordinate significantly more to fellow democrats with a p-value of around 0.0002.
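For reference, the paired t statistic is just the mean of the per-speaker differences divided by its standard error. A stdlib-only sketch with made-up coordination scores:

```python
from math import sqrt
from statistics import mean, stdev

# made-up ingroup/outgroup coordination scores for the same five speakers
ingroup = [0.10, 0.08, 0.12, 0.09, 0.11]
outgroup = [0.07, 0.06, 0.10, 0.08, 0.09]

# paired t statistic: mean difference over its standard error
diffs = [i - o for i, o in zip(ingroup, outgroup)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(t)  # a positive t favors the ingroup direction
```

In practice, scipy.stats.ttest_rel computes this statistic and its p-value directly.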
Finally, we can visually compare the histograms of coordination with ingroup versus outgroup for both subreddits.
We can see that there is a reasonable difference between intragroup and intergroup communication in both plots, and the effect seems larger for democrats.
I hope you enjoyed this tutorial and are excited about analyzing some conversations yourself! All the code for this analysis is on my GitHub, so feel free to check it out. If you want to create a ConvoKit corpus from your own data, I have a GitHub template specifically for that!
Please let me know if you have any questions or suggestions. You can leave a comment here or tweet at me @YaraKyrychenko anytime!
References
[1] Westwood, S. J., Iyengar, S., Walgrave, S., Leonisio, R., Miller, L., & Strijbis, O. (2018). The tie that divides: Cross-national evidence of the primacy of partyism. European Journal of Political Research, 57, 333–354.
[2] Iyengar, S., Lelkes, Y., Levendusky, M., Malhotra, N., & Westwood, S. J. (2019). The Origins and Consequences of Affective Polarization in the United States. Annual Review of Political Science, 22, 129–146.
[3] Ballard, J. (2019, October 24). Fewer than half of Americans would date across party lines. YouGov.
[4] Van Bavel, J. J., Rathje, S., Harris, E., Robertson, C., & Sternisko, A. (2021). How social media shapes polarization. Trends in Cognitive Sciences, 25(11), 913–916.
[5] Danescu-Niculescu-Mizil, C., Lee, L., Pang, B., & Kleinberg, J. (2012, April). Echoes of power: Language effects and power differences in social interaction. In Proceedings of the 21st international conference on World Wide Web (pp. 699–708).