Tutorial: Plotting Lexical Dispersion (Conspiracy Lies from the Left-of-Center)

bryce peake
Published in The Political Ear
Apr 22, 2018 · 5 min read

How often do people talk about specific topics — say, conspiracy — on Twitter? This tutorial will help you visualize an answer and look at the trend over time. We’ll combine some basic language processing with Seaborn to analyze social media data. Unlike the classic lexical dispersion plot, we’re going to create a better way to visualize the density of words over time.

The python3-updated version of the O’Reilly NLTK book has a great chapter on plotting dispersion in a corpus or across corpora. But plotting a dataframe of social media posts organized by date (rather than simply by position in the word count) is something radically different.

The set-up

Continuing from my own work on conspiracy, we’re going to focus on a key question: how often does Donald Trump tweet about conspiracy? Here are the general steps to creating a lexical dispersion plot:

1. Get your data and pull it into Pandas. I’ll assume you know how to do this; if you don’t, check out this tutorial (although there are a million ways to do it). There’s a minimal sketch right after this list.

2. Clean the text and feature engineer a few different columns for plotting. We’ll do a boolean, a list of conspiracy topics, and a single conspiracy word.

3. Plot those words against their dates in a Seaborn strip plot (or what we NLP nerds call a lexical dispersion plot)
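For step 1, here’s a minimal sketch; the filename and column names are placeholders, so swap in whatever your own archive uses:

```python
import pandas as pd

# "trump_tweets.csv" is a placeholder filename for your own tweet archive
df = pd.read_csv("trump_tweets.csv")

# We only need the tweet text and its timestamp for this tutorial
df = df[["text", "created_at"]]
```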

If all goes according to plan, we’ll end up with something that looks like this (with a little more flair than the intro image):

A Lexical Dispersion Plot that shows the density of Trump Tweets about Conspiracy Keywords

Conspiracy Talk: True or False?

In one smooth function, we can both clean up the text for searching AND search for matches to our conspiracy words.

We start by creating a list of words that would indicate conspiracy is afoot. I’m choosing indicator words, although this could be done with hashtags as well! Then we’ll feature engineer a Boolean that tells us whether a tweet contains ANY of the conspiracy indicators in our list. This is done with a simple .str.contains() method paired with some regex voodoo to help us search through the series.
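The original gist isn’t reproduced here, so here’s a minimal sketch of that function; the keyword list and column names are my own assumptions:

```python
import re

# Indicator words that suggest conspiracy is afoot; swap in your own list
conspiracy_words = ["conspiracy", "deep state", "fake news", "fakenews",
                    "witch hunt", "rigged"]

# One regex that matches any indicator as a whole word
pattern = r"\b(?:" + "|".join(re.escape(w) for w in conspiracy_words) + r")\b"

# case=False drops case; .str.contains() runs the regex down the whole series
df["conspiracy"] = df["text"].str.contains(pattern, case=False, regex=True)
```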

This returns a dataframe with a new Boolean feature that tells us whether it’s True that a tweet contains a conspiracy indicator. Here we can see the value_counts():
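A minimal version, assuming the column from the sketch above:

```python
# How many tweets trip at least one conspiracy indicator?
print(df["conspiracy"].value_counts())
```

Hmm… not much tweeting here about conspiracy…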

Conspiracies per tweet

Next we’ll create a feature that lists any conspiracy words that occur in a text instance. We can clean our text up AND create our column in one clear function:
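Again, a minimal sketch of what that function might look like; the column name is an assumption:

```python
import string

def find_conspiracy_words(text):
    """Clean a tweet, then collect the conspiracy indicators it contains."""
    # Drop case and strip punctuation so "Conspiracy!" still matches "conspiracy"
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # A tuple (rather than a list) stays hashable, so value_counts() works later
    return tuple(word for word in conspiracy_words if word in text)

df["conspiracy_topics"] = df["text"].apply(find_conspiracy_words)
```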

The reason we want to clean is that I’m matching indicator words here. If we were doing hashtags, we’d get rid of the punctuation stripping and use a startswith(“#”) method instead. Similarly, we drop case on everything with the .lower() method so that strange capitalization habits aren’t messing us up. Doing another value_counts(), we can see some interesting things:

Interesting — more tweets!

This method is tricky, though: there are so many different combinations of words that it’s hard to plot them in a way that makes sense. So, on to the third method…

Conspiracy Objects

Finally, we can use a function that shows us an approximation of the direct object of a tweet sentence. A direct object is the noun (phrase) that receives the action of the verb, sitting in the predicate across from the subject of the sentence. In layman’s terms, this is the thing that the conspiracy is “against” or “using.” In the tweet “The #FakeNews is doing the bidding of the deep state,” our direct object is deep state. The benefit of this function is that it also catches tweets where the conspiracy term is the subject and there is no conspiracy object, like “#FakeNews is ruining America,” and tweets where there’s ONLY a conspiracy D.O., like “Liberal lamestream media is #FakeNews.”
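The original function isn’t reproduced here, but a rough equivalent can be sketched with spaCy’s dependency parser. The fallback order below (prepositional objects before direct objects, then predicate nominals, then subjects) is my own approximation tuned to the examples above, not the author’s exact logic:

```python
import spacy

# Small English model; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def approximate_object(text):
    """Return a rough approximation of a tweet's (conspiracy) object."""
    doc = nlp(text)
    # "deep state" in "doing the bidding of the deep state" parses as a
    # prepositional object, so we check pobj before dobj; attr catches
    # predicate nominals ("media is #FakeNews"); nsubj is the last resort
    for dep in ("pobj", "dobj", "attr", "nsubj"):
        for chunk in doc.noun_chunks:
            if chunk.root.dep_ == dep:
                return chunk.text
    return None

# Only parse the tweets already flagged as conspiracy talk
mask = df["conspiracy"]
df.loc[mask, "conspiracy_object"] = df.loc[mask, "text"].apply(approximate_object)
```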

It’s much more manageable for plotting.

Now we can create our Lexical Dispersion Plot!

Here, I’ll only give you one example plot, and you can play with the other data output approaches.

First, we want to convert our created_at object to datetime using pandas’ built-in parser.
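That’s a one-liner:

```python
# pd.to_datetime parses the string timestamps into proper datetime objects
df["created_at"] = pd.to_datetime(df["created_at"])
```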

If you get an error that says that neither of the objects is numeric, you’ve skipped this step.

Then we set up our strip plot for export. You can have it display directly in your notebook using plt.show() (and plt.tight_layout()), but I prefer to look at plots separately from the notebook. This way, I can place the df and the viz side by side and eyeball whether there are differences between my data and my visualization.
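A minimal sketch of that setup; the figure size, output filename, and tick-thinning are my own choices:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, since we're exporting rather than showing
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(14, 6))
plot_df = df.dropna(subset=["conspiracy_object"])
sns.stripplot(x="created_at", y="conspiracy_object", data=plot_df,
              ax=ax, jitter=True, size=4)

# Dates land on a categorical axis here, so thin the tick labels to stay legible
for i, label in enumerate(ax.get_xticklabels()):
    label.set_visible(i % 50 == 0)
    label.set_rotation(45)

ax.set_xlabel("Date")
ax.set_ylabel("Conspiracy object")
fig.tight_layout()
fig.savefig("lexical_dispersion.png", dpi=200)
```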

And we get the example distribution!

But why would we do this?!

This kind of visualization is great for academic pattern recognition, but probably not much else. Business intelligence would be better served by a time series.

An example: Yale psychiatry professor Bandy Lee made the news claiming that Trump’s reliance on conspiracy theory was a sign that he was “unraveling.”

Now, I’m not a Trump supporter. Far from it. I think he’s a bigoted sexual harasser with a lot of insecurity that gets transformed into over-performed machismo. BUT, his populist promotion of conspiracy has declined since the primaries — we see it right here. Lee’s casual (ab)use of mental health evaluation thus fans the flames of a form of stigma and shaming that does very little for democratic engagement and disproportionately harms political professionals who struggle with their mental health. We might add that Lee also continues a long history of people from the Left and the Right who label possible State Crimes Against Democracy as schizophrenic, paranoid, obsessive-compulsive, etc. — for better or worse, revolution or compliance.

In sum, this plot shows that Bandy Lee’s remarks are inappropriate. They play into the very stigma discourses that she, as a psychiatrist, should be working to neutralize. Confronting the harmful assumptions of a psychiatrist who confuses media saturation with frequency seems like a good enough reason to make a Lexical Dispersion plot to me.


bryce peake
The Political Ear

I like to read, to think, to explore, and to experiment. In that order. Asst. Professor of Media & Comm Studies, Gender + Women’s Studies.