Time to check your candidate’s “chat-scan”

When Iowans gather in their caucuses tonight, how important will the actual campaign issues be in determining the winners? And which issues in particular?

Here at the Laboratory for Social Machines, part of the MIT Media Lab, we have an ongoing project called The Electome that tracks what we call the horse race of ideas. The Electome’s algorithms (along with a few of us humans) analyze data from Twitter and from a set of media sources to gauge which issues are most likely to influence the Presidential campaign. [For more on how The Electome works, please see the FAQs at the end of this post.]

When I last wrote about this on Medium in December, national security/foreign policy (read: terrorism) and immigration dominated the conversation on social media and in the news. Those topics are still top of mind for many on Twitter. But in the candidate debates and forums since, as well as on social media, other issues like gun safety, health care, and particularly the economy have bubbled back up as well. Here are the election-related issues that the twitterati talked about most this January. This visualization by my lab colleague Raphael Schaad shows each topic’s share of the total conversation about the election over the course of the month.

Election-related topic share of total election-related conversation, by day, Jan 2016

The relative shares change according to what’s happening in the campaign, sometimes significantly. For example, note the spike in conversations about the economy and health care just after the Democratic candidates’ debate on January 17th, which featured sharp exchanges on those issues.

From our vantage point in the grandstand, we decided to look at the horse race of ideas through a new pair of binoculars, trying to see which of these issues are most closely associated with each of the candidates individually and therefore most likely to affect the “real” horse race — the one the media is following.

A natural way to measure the most overt form of association is by counting tweets that mention both a candidate and an issue — the computer scientists with whom I work call that co-occurrence. So a tweet that refers to Donald Trump and the Mexican wall associates Trump with the issue of immigration; a tweet that refers to Bernie Sanders and income inequality associates him with the economy. You get the idea. We can’t yet assess stance — what position a tweet takes on a given issue — with a sufficient degree of confidence, so for now The Electome just identifies the subject matter. But that does indicate which issues are driving the overall conversation as well as conversations about specific candidates.
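For readers who like to see the mechanics, here is a minimal sketch of co-occurrence counting. It is not The Electome's code: the keyword lists and the simple substring matching are assumptions made for illustration, standing in for the language analysis the real system uses.

```python
from collections import Counter

# Hypothetical keyword lists, for illustration only. The Electome's actual
# classifiers are not described in this post.
CANDIDATE_TERMS = {
    "Trump": ["trump"],
    "Sanders": ["sanders", "bernie"],
}
ISSUE_TERMS = {
    "immigration": ["immigration", "border", "wall"],
    "economy": ["economy", "income inequality", "jobs"],
}


def labels_in(text, term_map):
    """Return the set of labels whose terms appear in the text."""
    lowered = text.lower()
    return {
        label
        for label, terms in term_map.items()
        if any(term in lowered for term in terms)
    }


def count_cooccurrence(tweets):
    """Count tweets that mention both a candidate and an issue."""
    counts = Counter()
    for text in tweets:
        for candidate in labels_in(text, CANDIDATE_TERMS):
            for issue in labels_in(text, ISSUE_TERMS):
                counts[(candidate, issue)] += 1
    return counts


sample_tweets = [
    "Trump says Mexico will pay for the wall",
    "Bernie Sanders on income inequality tonight",
    "Who is ahead in the Iowa polls?",
]
cooccurrence = count_cooccurrence(sample_tweets)
```

The third sample tweet names no candidate, so it adds nothing to the counts. Plain substring matching is crude (the term "wall" would also match "Wall Street"), which is one reason a real system needs something smarter than keywords.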

The Electome, supported in part by The Knight Foundation, has been capturing all election-related tweets for exactly a year. In that time, we identified about 3.5 million people who have tweeted about the election, generating about 70 million tweets.

Based on that data, we first identified the six top election-related issues on Twitter in January, which are highlighted in the chart below.

We then mapped the degree of co-occurrence between the leading candidates and those six issues to create a distinctive association profile for each candidate — let’s call it a chat-scan. Here’s what the January chat-scans (through January 29th, so including reaction to the last GOP debate on the 28th) look like for the leading candidates going into the Iowa caucuses.

You can study the chat-scans for yourself, of course, but here are some things I noticed:

  • Despite his business background, Trump is not closely associated with economic issues.
  • Trump is more highly associated with race issues than his rivals — not a surprise given the popular reaction to some of his comments.
  • Even though foreign policy/national security is by far the most tweeted-about issue, the closest association for Cruz and Rubio is with the issue of immigration.

Again, no surprise there: immigration was a major focus for both candidates in January, culminating in their sharp exchanges during the Trump-less debate on the 28th.

By contrast, neither Sanders nor Clinton cross-indexes highly with the issue of immigration.

  • Clinton’s association is fairly evenly distributed among national security/foreign policy, health care, and the economy.
  • Sanders is more highly associated with conversations about the economy and health care, in keeping with his populist appeal. He is considerably less tied to national security and foreign policy than Clinton is.

That said, for now the scans differ largely along party lines, reflecting the issues that are dominating the respective primary races. Trump and Sanders are almost a perfect inverse of one another, for example.

It will be interesting to see how the candidates’ chat-scans change over the course of the campaign, especially as the field narrows.

The Electome’s unique access to all election-related tweets also enables us to track overlapping interests among groups of tweeters: hidden, as opposed to overt, connections. The tool can identify how many people who talk about Candidate A also talk separately about topics X, Y, or Z, and that can be any topic: genetically modified food, online dating, Taylor Swift, the Super Bowl, Game of Thrones. It’s a different and potentially revealing way to infer demographic information about the people associated with each candidate. We call this hidden overlap latent association, as opposed to what you might call blatant association. We’ll have more to say about latent association and what it tells us about the people talking about the candidates in the weeks to come.
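One way to picture latent association is as an overlap between two groups of users. The sketch below assumes each tweet has already been labeled with the candidates and topics it mentions; the tuple layout, the function name, and the sample users are all hypothetical, and the real tool may define the overlap differently.

```python
from collections import defaultdict


def latent_association(tweets, candidate):
    """Map each topic to the number of distinct users who tweet about
    `candidate` and, in a different tweet, about that topic.

    `tweets` is an iterable of (user, candidates, topics) tuples, where
    `candidates` and `topics` are sets of labels (an assumed layout).
    """
    candidate_users = set()
    topic_users = defaultdict(set)
    for user, candidates, topics in tweets:
        if candidate in candidates:
            candidate_users.add(user)
        else:
            # Only tweets that do not mention the candidate count here,
            # so the two kinds of mention are genuinely separate.
            for topic in topics:
                topic_users[topic].add(user)
    return {
        topic: len(users & candidate_users)
        for topic, users in topic_users.items()
    }


sample_tweets = [
    ("ann", {"Candidate A"}, set()),
    ("ann", set(), {"Super Bowl"}),
    ("bob", {"Candidate A"}, set()),
    ("bob", set(), {"Game of Thrones"}),
    ("cat", set(), {"Super Bowl"}),
    ("dan", {"Candidate A"}, {"Super Bowl"}),
]
overlap = latent_association(sample_tweets, "Candidate A")
```

In this toy data, only one Candidate A tweeter also talks separately about the Super Bowl. The last tweet mentions both in the same message, which is blatant rather than latent association, so it is left out of the topic counts.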

Finally, the Electome’s data on association enables us to examine an interesting question: how large a role do issues really play in the conversations about the candidates? Turns out it depends on the candidate.

We compared the percentage of tweets that refer to specific candidates and also refer to specific issues (co-occurrence) to the percentage that involve what we label “other” — the candidates’ personalities, say, or their poll numbers, or the political process. Are the conversations around some candidates relatively more issue-driven, and others less? Based on those 70 million election-related tweets over the past year, the answer is yes: some candidates are more closely tied to the horse race of ideas than others.
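As a rough illustration of that comparison, the sketch below computes, for each candidate, the fraction of tweets mentioning the candidate that also mention at least one issue. The remainder is the “other” bucket. The input layout and the placeholder candidate names are assumptions, not The Electome's actual data format.

```python
def issue_share_by_candidate(tweets):
    """Return, per candidate, the fraction of that candidate's tweets
    that also refer to at least one issue.

    `tweets` is an iterable of (candidates, issues) pairs of label sets
    (an assumed layout for this sketch).
    """
    totals = {}
    with_issue = {}
    for candidates, issues in tweets:
        for candidate in candidates:
            totals[candidate] = totals.get(candidate, 0) + 1
            if issues:
                with_issue[candidate] = with_issue.get(candidate, 0) + 1
    return {
        candidate: with_issue.get(candidate, 0) / total
        for candidate, total in totals.items()
    }


# Made-up labels for illustration; these are not Electome results.
sample_tweets = [
    ({"Candidate A"}, {"economy"}),
    ({"Candidate A"}, {"health care"}),
    ({"Candidate A"}, {"immigration"}),
    ({"Candidate A"}, set()),
    ({"Candidate B"}, {"economy"}),
    ({"Candidate B"}, set()),
    ({"Candidate B"}, set()),
    ({"Candidate B"}, set()),
]
issue_share = issue_share_by_candidate(sample_tweets)
```

Here the conversation about Candidate A is three-quarters issue-driven, while the conversation about Candidate B is only one-quarter issue-driven.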

As the chart shows, the conversation about Clinton has been the most focused on specific issues. The conversations that touch least on issues and ideas: the ones about Donald Trump.

We’ll have our Electome binoculars trained on the horse race of ideas to see what happens now that actual voting is about to begin.

aheyward@media.mit.edu/@andrewheyward

NOTE: Raphael Schaad/@raphaelschaad, Soroush Vosoughi and Prashanth Vijayaraghavan, researchers at the Laboratory for Social Machines, developed the analytics and visualizations for this post.

FAQs

How does The Electome identify tweets that refer to the election?

Twitter has given the Laboratory for Social Machines access to its entire database, which is growing by an estimated 500 million tweets per day. We have been tracking election-related tweets since February of 2015. A computer program uses language analysis to identify the tweets that refer to the American election — approximately 250,000 a day at this point. Then the algorithm classifies each tweet by issue and/or candidate. Data analysts check random selections of tweets to confirm their relevance, and the computer program uses their assessments to keep improving its performance over time.

Do you count only original tweets, or are re-tweets part of the mix?

The data includes re-tweets.

Is the program counting only election-related tweets originating in the United States?

Only 1% of tweets are geo-tagged by Twitter, so in order to capture tweets from the United States, our program filters for English-language tweets coming from U.S. time zones. That does mean that English-language tweets from Canada relevant to the U.S. election are included, since Canada shares those time zones.
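A filter along those lines might look like the sketch below. The field names (`lang`, `time_zone`) and the time-zone labels are assumptions chosen for illustration; they are not taken from The Electome or from any particular Twitter data format.

```python
# Assumed time-zone labels. Note that several cover both the U.S. and
# Canada, which is why Canadian tweets slip through this kind of filter.
US_TIME_ZONES = {
    "Eastern Time (US & Canada)",
    "Central Time (US & Canada)",
    "Mountain Time (US & Canada)",
    "Pacific Time (US & Canada)",
    "Alaska",
    "Hawaii",
}


def looks_like_us_english(tweet):
    """Return True for English tweets whose time zone is in the U.S. set.

    `tweet` is a dict with hypothetical "lang" and "time_zone" keys.
    """
    return tweet.get("lang") == "en" and tweet.get("time_zone") in US_TIME_ZONES


sample_tweets = [
    {"text": "Caucus night!", "lang": "en", "time_zone": "Central Time (US & Canada)"},
    {"text": "Noche de caucus", "lang": "es", "time_zone": "Central Time (US & Canada)"},
    {"text": "Watching from abroad", "lang": "en", "time_zone": "London"},
    {"text": "No profile info", "lang": "en"},
]
kept = [tweet["text"] for tweet in sample_tweets if looks_like_us_english(tweet)]
```

Only the first sample tweet passes. A tweet with no time zone at all is dropped too, which is one of the trade-offs of filtering this way.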

What about tweets in other languages, such as Spanish?

For now, we are only analyzing tweets in English.

How does The Electome determine “share” of conversation on Twitter?

The share of conversation is the number of tweets about a specific topic or candidate divided by the total number of election-related tweets within a given time period. So, for example, if there were 100,000 election-related tweets in a given time span and 20,000 were about Donald Trump, his “share” would be 0.2 out of 1.0, or 20%.
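The arithmetic is simple enough to write out. This small sketch just restates the definition above using the numbers from the example; the function name is ours, not The Electome's.

```python
def conversation_share(topic_count, total_count):
    """Share of conversation: tweets about a topic or candidate divided
    by all election-related tweets in the same time period."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return topic_count / total_count


# The example from the text: 20,000 of 100,000 election-related tweets.
share = conversation_share(20_000, 100_000)
share_as_percent = f"{share:.0%}"
```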

Does The Electome “know” what position a given tweet takes on an issue?

No. We’re not yet capable of measuring stance, namely where the tweeter stands on an issue — just engagement with it.

How does the population using Twitter compare to the U.S. population overall?

The conversation among Twitter users is not representative of the public at large. Twenty percent of Americans use the platform, and their demographic makeup and levels of political interest differ from the public overall. The analysis should be viewed as a readout on the views of the platform’s users.

What about share of media coverage — how is that determined?

The Electome monitors the websites of a collection of influential media sources, currently 14 in number. That number will increase as we build out the system. The computer program identifies the articles that are about the election — roughly 200 per day out of 2,000. Then it classifies each article according to candidates and topics.

You used the plural there: how does The Electome count stories that refer to more than one topic, or more than one candidate?

The “share” is divided equally among the topics or candidates. If an article mentions five candidates, each of them gets credit for a ⅕ or 20% share. By the way, the same goes for tweets that mention more than one candidate. (Generally, tweets are too short to mention more than one topic.)
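Here is a minimal sketch of that equal split, using exact fractions so the ⅕ in the example stays a ⅕. One assumption to flag: the sketch divides by the number of items that mention at least one label, whereas the real system may divide by all election-related items.

```python
from collections import defaultdict
from fractions import Fraction


def fractional_shares(items):
    """Split each item's credit equally among the labels it mentions,
    then return each label's share of the total credit.

    `items` is an iterable of label collections, one per article or tweet.
    """
    credit = defaultdict(Fraction)
    for labels in items:
        unique_labels = set(labels)
        if not unique_labels:
            continue  # items with no labels contribute nothing
        for label in unique_labels:
            credit[label] += Fraction(1, len(unique_labels))
    # Each labeled item contributes exactly 1 credit in total.
    total = sum(credit.values())
    return {label: value / total for label, value in credit.items()}


# Placeholder labels: one article mentions five candidates, one mentions
# only the first of them.
articles = [
    ["A", "B", "C", "D", "E"],
    ["A"],
]
shares = fractional_shares(articles)
```

Candidate A earns ⅕ from the first article plus a full credit from the second, for 6/5 of the 2 credits available, or a 60% share. Each of the other four ends up with 10%.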

Does The Electome have a way to account for headlines, photos, videos, word count, prominence of placement, and other factors that might give some stories a greater “presence” than others?

Not yet — the team is working on ways to account for most of those factors. Videos and photos are a separate data challenge.