Natural Language Processing of Conversations in Python with ConvoKit
Using conversational analysis to explore intergroup linguistic coordination on Reddit
Conversations are critical to our political and civic institutions, communities, and families. However, relatively little Natural Language Processing (NLP) work has been devoted to studying conversations; the field has focused instead on individual pieces of text.
There are a few unique challenges related to the NLP of conversations:
- conversations unfold over time,
- have multiple speakers and utterances (text units),
- and the order of speakers and utterances matters.
Dealing with these complexities on your own can quickly become a headache. Luckily, you don’t have to! A group of researchers at Cornell University have created a Python module for conversational analysis called ConvoKit.
ConvoKit is a tool for transforming raw conversational data into a format that is easier to manipulate, analyze, and share with others. It comes with multiple linguistic analyses already implemented, such as context-independent linguistic coordination and politeness strategies. Plus, it offers a selection of conversational corpora (collections of texts), such as corpora built from over 900K subreddits and from Wikipedia editors’ talk pages, that you can download and analyze.
This tutorial covers the basics of ConvoKit through an example project exploring conversational group dynamics in the polarized context of r/Conservative and r/democrats, two of the largest partisan subreddits.
Research Question
Current research suggests that affective polarization, i.e., negative feelings, dislike, and distrust toward the other party or group, has been growing in the US and other advanced democracies [1]. In the US, Republican and Democrat identities are becoming increasingly more important in everyday life, affecting not only voting decisions but also friendship, dating, and neighborhood choices [1,2]. For instance, in 2019, more than half of Americans said they would not date someone from the other party [3].
Mounting empirical evidence suggests that social media platforms like Twitter, Facebook, and Reddit play an important role in increasing polarization by allowing for partisan sorting, amplifying divisive content, and incentivizing intergroup conflict [4]. Reddit provides a place for partisans to participate in discussions with members of their ingroup and the outgroup by contributing to different subreddits. However, do partisans use the same language when communicating across party lines?
Here, I’m interested in investigating whether partisans unconsciously adapt to the linguistic style of their interlocutors more when the interlocutor is an ingroup member compared to an outgroup member. The degree of imitating the linguistic style, also called linguistic coordination or synchrony, could be considered an implicit measure of affect towards the interlocutor and thus is likely affected by polarization.
To test this, I will use speakers who contributed to both of the largest partisan subreddits — r/Conservative (Republican, conservative) and r/democrats (Democrat, liberal) — and investigate how their overall linguistic coordination differed when they spoke with ingroup or outgroup members.
Theory Behind ConvoKit
ConvoKit offers a framework for conversational analysis with two fundamental concepts: a corpus and a transformation. A corpus is a collection of conversations, and you can do things to the corpus using transformations.
Every corpus has three main elements: speakers, conversations, and utterances. Speakers are participants in conversations, and the things they say are called utterances. You can build a corpus from a collection of utterances associated with a speaker, conversation, and timestamp. Additionally, you can add speaker-level, conversation-level, or utterance-level metadata to keep track of variables you care about.
Transformers are functions that take in a corpus and return the same corpus with some changes. For instance, we will use a linguistic coordination transformer later in this tutorial to calculate how much speakers coordinate to their party members and people from the opposite party.
Practice
With the theory out of the way, let’s move on to the analysis! All of the code below is also on my GitHub. To start off, we need to install the ConvoKit module: pip install convokit.
We are going to use two Reddit corpora available through ConvoKit — r/Conservative and r/democrats. The data was built from Pushshift.io Reddit Corpus and includes posts and comments from subreddit creation to October 2018. These are pretty big subreddits so it will take a while to download the data.
```python
from convokit import Corpus, download

cons = Corpus(filename=download("subreddit-Conservative"))
dems = Corpus(filename=download("subreddit-democrats"))
```
We can see exactly how big the subreddits are by running cons.print_summary_stats() and dems.print_summary_stats().
Turns out that r/Conservative has over 140K speakers across 300K conversations and over 3M utterances, while r/democrats has only 37K speakers in 84K conversations and 370K utterances. So r/Conservative is roughly 3.5 to 8 times bigger than r/democrats, depending on how you want to define size.
Utterances
Let’s see what a random utterance from the conservative corpus looks like: utt = cons.random_utterance().
We can check the text of any utterance by calling utt.text. In this case, the utterance reads: “Abolish the Department of Education.” Intriguing suggestion.
Every utterance, just like every speaker and conversation, has a unique id attached to it: utt.id gives us the id “dk9awc5”. Every utterance is also part of a conversation, which we can check via utt.conversation_id. The person who said the utterance (utt.speaker.id) is “patri0t556”, and it was a reply to the utterance with id “6ng0g4”. We can retrieve that utterance by running cons.get_utterance('6ng0g4').
Every utterance in every corpus must have six primary attributes: text, id, conversation_id, speaker, timestamp, and reply_to.
In addition to the primary attributes, utterances can have metadata encoded as a dictionary, accessible through the meta attribute: utt.meta. For instance, utt.meta['score'] gives the score of the utterance, calculated as the difference between the upvotes and downvotes it received.
Conversations
Just like with utterances, we can get a random conversation with convo = cons.random_conversation().
We can check which utterances are part of the conversation by running convo.get_utterance_ids(). We can also print the reply-tree structure of the conversation with convo.print_conversation_structure().
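Under the hood, a conversation's structure is just a reply tree linked by reply_to. A stdlib-only sketch of reconstructing and printing such a tree, using made-up utterance ids, looks like this:

```python
from collections import defaultdict

# hypothetical (utterance id, reply_to) pairs from one conversation
replies = [("a", None), ("b", "a"), ("c", "a"), ("d", "b")]

children = defaultdict(list)
root = None
for utt_id, parent in replies:
    if parent is None:
        root = utt_id                  # the conversation's root post
    else:
        children[parent].append(utt_id)

def print_tree(utt_id, depth=0):
    # indent each reply under its parent, mimicking the printed structure
    print("    " * depth + utt_id)
    for child in children[utt_id]:
        print_tree(child, depth + 1)

print_tree(root)
```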
Conversations also have their own metadata. For instance, to check the title of the conversation, run convo.meta['title'].
Data Cleaning and Preprocessing
Now we need to clean the data a bit. I found a few conversations that included utterances that were later deleted, and these inconsistencies affect the downstream analysis. I will filter each corpus to only have conversations for which none of the utterances are corrupted. For that, I will check whether each utterance and the utterance it was a reply to still exist in the corpus.
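The intactness check itself is simple. Here is a stdlib-only sketch of the idea, using a hypothetical mapping from utterance ids to the ids they reply to (in the real analysis, these come from the corpus):

```python
# hypothetical conversation: maps each utterance id to the id it replies to
# (None marks the root post); "c" replies to an utterance that was deleted
conversation = {"a": None, "b": "a", "c": "zzz"}

def is_intact(convo_utts):
    # a conversation is intact if every reply points to an utterance
    # that still exists in the conversation
    ids = set(convo_utts)
    return all(parent is None or parent in ids
               for parent in convo_utts.values())

print(is_intact(conversation))  # False
```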
If we print the summary stats again, we see that there were only 26 corrupted conversations in the conservative corpus and 2 in the democrats corpus, overall less than 0.01%.
We want to find the speakers that have contributed to both subreddits.
both = [spkr for spkr in dems_filtered.speakers if spkr in cons_filtered.speakers]
Turns out there are 9134 such speakers in total. I will categorize them as conservative (or democrat) if they were a part of more conversations in the conservative (or democrat) corpus. Speakers who had equal numbers of conversations in the two subreddits are excluded. I’ll store this information in the speaker-level metadata for easy access.
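The categorization rule can be sketched with plain Python, using made-up per-speaker conversation counts:

```python
# made-up per-speaker conversation counts in each subreddit
cons_counts = {"spkr1": 10, "spkr2": 3, "spkr3": 5}
dems_counts = {"spkr1": 2, "spkr2": 8, "spkr3": 5}

main_group = {}
for spkr in cons_counts:
    if cons_counts[spkr] > dems_counts[spkr]:
        main_group[spkr] = "conservative"
    elif dems_counts[spkr] > cons_counts[spkr]:
        main_group[spkr] = "democrat"
    # speakers with equal counts (here, spkr3) are excluded

print(main_group)  # {'spkr1': 'conservative', 'spkr2': 'democrat'}
```

In the real analysis, this label would then be stored in each speaker's meta dictionary.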
Linguistic Coordination
To measure linguistic coordination, I will use a transformer implemented in ConvoKit based on the paper that first introduced the metric [5]. You can read more about the theory behind it and its implementation in the ConvoKit documentation.
Simply put, the linguistic coordination of a given speaker to a given target reflects how much a speaker increases their use of function words that the target used compared to the speaker’s baseline. Linguistic coordination ranges from -1 to 1, with higher scores indicating more synchrony and negative scores indicating the opposite of synchrony. For this analysis, I will average linguistic coordination over targets to obtain one score per speaker.
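Averaging over targets can be sketched like this (the speakers, targets, and scores below are made up for illustration):

```python
from statistics import mean

# hypothetical per-target coordination scores for two speakers
coord_scores = {
    "spkr1": {"t1": 0.12, "t2": -0.05, "t3": 0.08},
    "spkr2": {"t1": 0.02, "t2": 0.04},
}

# average over targets to get a single score per speaker
speaker_scores = {spkr: mean(targets.values())
                  for spkr, targets in coord_scores.items()}
print(speaker_scores)  # approximately spkr1: 0.05, spkr2: 0.03
```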
Linguistic Coordination Analysis
Because linguistic coordination is implemented as a transformer in ConvoKit, we need to fit a Coordination instance to each corpus and transform each corpus individually.
To calculate coordination we need to select speakers and targets of coordination appropriately. For this we need to have a few selector functions:
```python
everyone = lambda spkr: True
conservatives = lambda spkr: spkr.id in cons_main
democrats = lambda spkr: spkr.id in dems_main
```
everyone selects every speaker in a corpus, while conservatives and democrats select only speakers who are in both corpora but whose main corpus (i.e., partisan identity) is r/Conservative or r/democrats, respectively.
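As a toy illustration of how such selectors behave (the speaker ids and main-group sets are hypothetical, and a bare class stands in for a ConvoKit Speaker):

```python
# hypothetical main-group memberships
cons_main = {"redfan"}
dems_main = {"bluefan"}

everyone = lambda spkr: True
conservatives = lambda spkr: spkr.id in cons_main
democrats = lambda spkr: spkr.id in dems_main

class Spkr:
    # stand-in for a ConvoKit Speaker with just an id
    def __init__(self, id):
        self.id = id

speakers = [Spkr("redfan"), Spkr("bluefan"), Spkr("lurker")]
print([s.id for s in speakers if conservatives(s)])  # ['redfan']
print([s.id for s in speakers if everyone(s)])       # all three ids
```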
We can analyze the results by subreddit for more clarity.
This produces lists of scores of how much each group coordinates to ingroup and to outgroup members (four lists in total). We can run a one-sided paired-samples t-test to check whether the same people coordinate significantly more when talking with ingroup members than with outgroup members.
Doing this for conservatives, we get a p-value of around 0.005, which means that indeed conservatives coordinate significantly more to fellow conservatives than to democrats. Similarly, we find that democrats coordinate significantly more to fellow democrats with a p-value of around 0.0002.
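For reference, the paired t statistic is just the mean of the per-speaker differences divided by its standard error. A stdlib-only sketch with made-up coordination scores:

```python
from math import sqrt
from statistics import mean, stdev

# made-up ingroup/outgroup coordination scores for the same five speakers
ingroup = [0.10, 0.08, 0.12, 0.09, 0.11]
outgroup = [0.07, 0.06, 0.10, 0.08, 0.09]

# paired t statistic: mean difference over its standard error
diffs = [i - o for i, o in zip(ingroup, outgroup)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(t)  # a positive t favors the ingroup direction
```

In practice, scipy.stats.ttest_rel computes this statistic and its p-value directly.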
Finally, we can visually compare the histograms of coordination with ingroup versus outgroup for both subreddits.
We can see that there is a reasonable difference between intragroup and intergroup communication in both plots, and the effect seems larger for democrats.
I hope you enjoyed this tutorial and are excited about analyzing some conversations yourself! All the code for this analysis is on my GitHub, so feel free to check it out. If you want to create a ConvoKit corpus from your own data, I have a GitHub template specifically for that!
Please let me know if you have any questions or suggestions. You can leave a comment here or tweet at me @YaraKyrychenko anytime!
References
[1] Westwood, S. J., Iyengar, S., Walgrave, S., Leonisio, R., Miller, L., & Strijbis, O. (2018). The tie that divides: Cross-national evidence of the primacy of partyism. European Journal of Political Research, 57, 333–354.
[2] Iyengar, S., Lelkes, Y., Levendusky, M., Malhotra, N., & Westwood, S. J. (2019). The Origins and Consequences of Affective Polarization in the United States. Annual Review of Political Science, 22, 129–146.
[3] Ballard, J. (2019, October 24). Fewer than half of Americans would date across party lines. YouGov.
[4] Van Bavel, J. J., Rathje, S., Harris, E., Robertson, C., & Sternisko, A. (2021). How social media shapes polarization. Trends in Cognitive Sciences, 25(11), 913–916.
[5] Danescu-Niculescu-Mizil, C., Lee, L., Pang, B., & Kleinberg, J. (2012, April). Echoes of power: Language effects and power differences in social interaction. In Proceedings of the 21st international conference on World Wide Web (pp. 699–708).