A First Glimpse through the Data Window onto the Internet Research Agency’s Twitter Operations

Kate Starbird
Oct 17, 2018


by Tom Wilson, Ahmer Arif, Melinda McClure Haughey, Leo Stewart, and Kate Starbird

Today (October 17, 2018), Twitter released two data sets related to online information operations that have targeted U.S. citizens in recent years. In this blog, we describe some preliminary analysis of one of these sets — a dataset containing tweets shared by accounts determined (by Twitter) to be associated with the Internet Research Agency (IRA) in St. Petersburg, Russia. These accounts had all been suspended and their content had been removed from public view. Twitter has now released the full tweet record for these accounts, going back to their first tweets. This dataset, therefore, represents a new window for researchers into the historical activities of these accounts — one that may provide insight into the strategies and goals of the broader information operations of which these activities were part. Here, we offer an initial glimpse through that window, providing a rough idea of what researchers might find as we comb through these (millions of) tweets.

A little context. Our research lab has previously completed several studies that have identified and attempted to explain the role of information operations in online conversations. Initially, some of these discoveries were accidental — i.e. we had been examining other online phenomena (e.g. rumors during crisis events, framing contests in #BlackLivesMatter discourse) and began to see evidence of intentional manipulation that we can now attribute to IRA activities. At a high level, these new data confirm many of our initial observations, for example about IRA agents “sowing division” on “both sides” of the #BlackLivesMatter conversation. These new data also help us see how the IRA activities that we saw in our previous work fit into broader patterns of information operations executed by IRA-affiliated accounts. In this write-up, we repeatedly reflect on how these new data provide new insight into that previous body of work.

One caveat here is that we do not know how comprehensive or representative these accounts are of the larger operations on Twitter and other platforms. For this reason, we will need to be careful to moderate our interpretations of these data while keeping these limitations in mind. In particular, while we can describe what we now know to be specific aspects of these information operations, we cannot draw strong comparisons in overall weight of one aspect of the operations versus another or make strong claims about what we do not see in this data (there may be some aspects of this campaign that are — at least partially — missing from view).

Rough Dimensions of IRA Tweet Dataset

Accounts, Tweets, and Volume Over Time

This IRA Tweet Dataset contains 9,041,308 tweets posted by 3667 accounts determined to have been used as part of the IRA’s operations. The data set includes tweets dating back to 2009. The final tweet in the set was sent on June 21, 2018, suggesting that the operations continued even after Twitter began to publicly acknowledge and then suspend IRA accounts in 2017. Figure 1 shows the temporal graph (tweets over time) of this activity. The activity of the accounts is initially very low volume, growing from just a few tweets (if any) a day in 2009 to a few dozen a day in 2010 and 2011. There’s a slight surge between mid-2012 and mid-2013, followed by a relatively quiet period, and then a drastic increase in activity in mid-2014, around the time of the downing of Malaysia Airlines Flight 17 (MH17).

Figure 1: Temporal graph of tweets over time for the IRA accounts, per week, for the duration of the dataset
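For readers who want to reproduce a weekly volume series like the one in Figure 1, here is a minimal sketch of the binning step. It assumes each tweet carries a timestamp string in `YYYY-MM-DD HH:MM` form; that format, the helper names (`week_start`, `tweets_per_week`), and the toy timestamps are illustrative assumptions, not a description of how the figure was actually produced.

```python
from collections import Counter
from datetime import datetime, timedelta


def week_start(timestamp: str) -> str:
    """Return the Monday (YYYY-MM-DD) of the week containing the timestamp."""
    dt = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
    monday = dt.date() - timedelta(days=dt.weekday())
    return monday.isoformat()


def tweets_per_week(timestamps):
    """Count tweets in each calendar week, sorted chronologically."""
    counts = Counter(week_start(ts) for ts in timestamps)
    return dict(sorted(counts.items()))


# Toy timestamps, invented for illustration (not taken from the dataset).
sample = [
    "2014-07-14 09:00",  # Monday
    "2014-07-17 13:30",  # Thursday, same week
    "2014-07-20 23:59",  # Sunday, same week
    "2014-07-21 00:01",  # Monday of the following week
]
weekly = tweets_per_week(sample)
print(weekly)
```

Weeks with no tweets simply do not appear in the result, so a plotting step would need to fill those gaps with zeros.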

Domestic (Russian) and International (English) Operations

The data contain information about the language of accounts (self-declared by the account holder) and tweets (automatically detected by Twitter based on the tweet text). Neither of these is a perfect measure, but together they help us get a rough idea of the breakdown by language. The data reveal two large, distinct operations: one targeted at Russian-language audiences and the other at English-language audiences.

The vast majority of IRA accounts had their language set to either English or Russian. Of the 3667 distinct accounts, 2286 were set to English and 1013 were set to Russian. Interestingly, though there are fewer accounts set to Russian language, there are more tweets, overall, in Russian (4,853,185) than in English (3,261,931). However, the temporal graph of volume over time by language (Figure 2) shows that English-language tweeting became a larger part of the activities over time. This language analysis demonstrates that the IRA’s use of Twitter had both domestic and international components.

Figure 2: Tweets over time by language (English or Russian, as detected by Twitter)

Taking a closer look at activities in the two languages over time shows Russian language tweet volume surging in mid-2014 (around the time of the downing of MH17) and then dropping off over time, with a few small spikes followed by increasingly lower waves (valleys and spikes). English language tweets pick up in early 2015, then fade a bit in early 2016 before surging back during the lead up to the U.S. Presidential election. They continue with relatively high volume into 2017 and then drop off as we reach 2018 (possibly due to Twitter’s efforts to remove accounts).

For much of the remainder of this initial round of analysis, we focus on English-language tweets, those that targeted English-speaking audiences in the U.S. and elsewhere. In future work, we plan to engage researchers fluent in Russian language and culture in analysis of the Russian language accounts and tweets.

Measuring Impact through Retweet Counts

Though the impact of these tweets will be hard to measure, we can get a sense of reach by examining retweet numbers. Of the more than 9 million (9,041,308) tweets from IRA accounts, about 3.3 million were retweets of other accounts (including non-IRA accounts). The other 5.7 million tweets were “original” tweets (not retweets of other accounts) and about 1 million of these received one or more retweets. Taken together, these IRA-original tweets were retweeted more than 31 million (31,250,535) times. More than two-thirds of this aggregate retweet count (21,913,266, or about 70%) was for English-language tweets, indicating significantly more engagement, per tweet, for English-language content.
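The aggregation described above can be sketched in a few lines. The record layout here (`tweet_language`, `is_retweet`, `retweet_count`) and the toy rows are assumptions made for illustration; only the two totals at the bottom come from the figures reported in this post.

```python
from collections import defaultdict


def retweet_reach(tweets):
    """Sum retweet counts of original (non-retweet) tweets, per language."""
    totals = defaultdict(int)
    for t in tweets:
        if not t["is_retweet"]:  # skip retweets of other accounts
            totals[t["tweet_language"]] += t["retweet_count"]
    return dict(totals)


# Toy records with hypothetical field names (not real rows from the dataset).
toy = [
    {"tweet_language": "en", "is_retweet": False, "retweet_count": 40},
    {"tweet_language": "en", "is_retweet": True, "retweet_count": 0},
    {"tweet_language": "ru", "is_retweet": False, "retweet_count": 10},
    {"tweet_language": "en", "is_retweet": False, "retweet_count": 0},
]
reach = retweet_reach(toy)

# Totals reported in the text above.
TOTAL_RETWEETS = 31_250_535
ENGLISH_RETWEETS = 21_913_266
english_share = ENGLISH_RETWEETS / TOTAL_RETWEETS
print(reach, f"{english_share:.1%}")
```

Dividing each language's total by its number of original tweets would give the per-tweet engagement mentioned above; those per-language tweet counts are not broken out in this post, so the sketch stops at the aggregate share.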

When we look across language and over time, the data tell an even more intriguing story. While the reach of the Russian language content, in terms of retweet counts for IRA-original tweets, remains fairly consistent over time (at or below 100,000 retweets per week for the entire period), the reach of the English language content rises steadily during 2016 and 2017. Engagements with the English language content go from almost negligible numbers at the beginning of 2016 to a peak of more than 350,000 retweets per week just before the U.S. election and then, after a short lull following the election, rise again over the course of 2017 to over 445,000 retweets per week in June 2017. We imagine that Twitter’s efforts to shut down IRA-affiliated accounts and automated amplifiers during the summer of 2017 caused the rapid decrease of retweet numbers near the end of this dataset.

Figure 3: Retweet counts over time by language

This view of the data suggests that the IRA’s English language efforts were particularly impactful, in terms of visibility of and engagement with their content, during the lead up to the 2016 U.S. election. This effect does not seem to be related purely to the output of their accounts (i.e. they were not tweeting that much more content during that time), but instead relates to either organic uptake (“real people” beginning to follow and retweet these accounts) or additional strategies of inorganic amplification (retweets from automated “bot” accounts that are not included in this dataset). However, from this limited view of the data, we are unable to distinguish between these two effects. Likely, these retweet patterns reflect some combination of engagements by “real” people and orchestrated amplification.

Salient Themes of IRA Influence Campaigns on Twitter

The Most Retweeted Hashtags

To get a sense of the different topics that were salient in the conversation, we did some basic analyses to identify the most prominent hashtags in the data and to examine how certain hashtags appear in the same tweets. The following table (Table 1) lists the top 20 hashtags in English-language tweets in the IRA Tweet Dataset. Please note that we used the hashtag listed in the metadata of the tweet records, which resulted in some hashtags occasionally being excluded due to system quirks. Using this method, 2,968,573 of the English-language tweets (or ~33% of the set) have a hashtag.

Table 1: Most tweeted hashtags in English-language tweets from IRA accounts
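A ranking like Table 1 can be approximated with a simple counter over the hashtag metadata. The sketch below assumes each tweet record exposes a `hashtags` list of strings and that tags are compared case-insensitively; both are assumptions for illustration, and the toy records are invented.

```python
from collections import Counter


def top_hashtags(tweets, n=20):
    """Return the n most tweeted hashtags as (tag, tweet_count) pairs."""
    counts = Counter()
    for t in tweets:
        # Use a set so a tag repeated inside one tweet is counted once.
        counts.update({tag.lower() for tag in t["hashtags"]})
    return counts.most_common(n)


def share_with_hashtag(tweets):
    """Fraction of tweets that carry at least one hashtag."""
    tagged = sum(1 for t in tweets if t["hashtags"])
    return tagged / len(tweets)


# Toy records with a hypothetical `hashtags` field.
toy = [
    {"hashtags": ["news", "politics"]},
    {"hashtags": ["News"]},
    {"hashtags": []},
    {"hashtags": ["MAGA", "maga", "news"]},
]
ranking = top_hashtags(toy, n=1)
share = share_with_hashtag(toy)
print(ranking, share)
```

Whether to fold case (so that `#MAGA` and `#maga` count as one tag) is a design choice; folding is used here because hashtag search on Twitter is case-insensitive.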

The first few prominent hashtags are generic tags (e.g. news, politics, sports) that were likely used to make these tweets discoverable for people searching for information on those topics. They also reflect a strategy of impersonating news sources, especially local news outlets in the U.S., that is prominent in the data (and has been reported on previously). We plan to delve further into this activity in a future post — or leave that to other researchers.

The first non-generic tag, appearing in 9th position with 21,821 tweets, is #MAGA, a shortening of Donald Trump’s “Make America Great Again” campaign slogan that was used by politically active tweeters to show support for candidate and eventual U.S. President Trump. Several other top hashtags in the set reflect conservative or pro-Trump political leanings. #Trump appears at 31st in the list. But perhaps the more interesting political tags here are #tcot and #PJNet. The first of these, #tcot, which stands for “top conservatives on Twitter”, is a slightly older tag (established in 2009) used by politically active Twitter users to mark and “channel” audiences into their conversations. #PJNet is a newer incarnation of the conservative political hashtag, used since 2014 (and perhaps earlier) to organize “grassroots” political activism on Twitter. The #PJNet hashtag, which appears in both tweets and account profiles (from several IRA accounts and thousands of other Twitter accounts), stands for “Patriot Journalists Network” and is part of a multi-layered messaging strategy that recruits Twitter users to join coordinated tweeting campaigns. The group has been written about previously (here and here) and was eventually blocked by Twitter from automatically spreading content on its platform. Our analysis does not demonstrate that #PJNet was purposefully coordinating with IRA agents, but it does suggest that the hashtag (and the group of users that coalesced around the hashtag) became a focal point for IRA activities.

It is important to call attention to the fact that not all of these popular IRA hashtags target people on the political “right”. IRA agents were also active in conversations targeting the political “left”. The 14th most tweeted hashtag in the data is #BlackLivesMatter. This hashtag was created and used by political activists in the U.S. who were protesting violence perpetrated by police against African Americans. During 2015 and 2016, IRA affiliated accounts appropriated and used this term as part of their influence campaigns. We recently released a paper detailing how IRA agents participated in “both sides” of a politically polarized #BlackLivesMatter conversation on Twitter in 2016. Their activity echoed and amplified political divisions between those communities. On the political right in that conversation, IRA activity converged to support Donald Trump. On the political left in that conversation, IRA activity functioned to amplify narratives that were critical of Hillary Clinton and encouraged community members not to vote. These new data align with and add context to those previous findings.

None of the most tweeted hashtags in this data set were invented by IRA agents. All were opportunistically used by IRA agents seeking to cultivate audiences for their content. Many of the more effective tags (in terms of engagement from those audiences) were tags that marked participation in politically charged conversations (e.g. #MAGA, #BlackLivesMatter). In these cases, IRA agents carefully crafted and maintained online personas (through their Twitter profiles, tweets, and other platforms as well) that positioned them as activists within these communities. They operated different accounts targeting these different communities: for example, @Crystal1Johnson and @gloed_up in the pro-#BlackLivesMatter conversations, and @TEN_GOP and @SouthLoneStar in the pro-Trump conversations. These analyses align with and confirm our previous studies indicating that Russian information operations target and infiltrate online political communities and then seek to shape those communities in ways that support Russia’s geopolitical goals. Using politicized tags marked these tweets (and their account owners) as members of particular political “tribes”, which helped them build credibility, reputation, and an audience for their political messaging.

Notably, the non-generic, politically-charged hashtags received significantly more retweets per tweet than the generic tags (see the “aggregate retweet” column in Table 1, where yellow highlighting indicates very high engagement rates and purple highlighting indicates fairly high engagement rates). This observation could be due to multiple factors (like the fact that strategies both evolved and gained audiences over time), so we cannot read too much into this yet without follow-up analysis. But if we can control for some of these other variables and still find this same trend, we might need to rethink some of our explanations for why IRA agents were using these politically charged tags. In other words, it may be more about us than them. IRA activities may have gravitated towards these politicized tags, not merely to “sow division”, but because those were conversations where they could most effectively cultivate their audiences.
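The “retweets per tweet” comparison can be made concrete as an engagement rate: aggregate retweets for a hashtag divided by the number of tweets carrying it. The following is a rough sketch under assumed field names (`hashtags`, `retweet_count`) with invented numbers; it does not control for any of the confounding factors mentioned above.

```python
from collections import defaultdict


def engagement_by_hashtag(tweets):
    """Map each hashtag to its average retweets per tweet."""
    tweet_counts = defaultdict(int)
    retweet_totals = defaultdict(int)
    for t in tweets:
        for tag in {h.lower() for h in t["hashtags"]}:
            tweet_counts[tag] += 1
            retweet_totals[tag] += t["retweet_count"]
    return {tag: retweet_totals[tag] / tweet_counts[tag] for tag in tweet_counts}


# Toy records, invented for illustration only.
toy = [
    {"hashtags": ["news"], "retweet_count": 1},
    {"hashtags": ["news"], "retweet_count": 3},
    {"hashtags": ["MAGA"], "retweet_count": 200},
    {"hashtags": ["maga"], "retweet_count": 100},
]
rates = engagement_by_hashtag(toy)
print(rates)
```

A fuller analysis would likely compute these rates within time windows, since both the accounts' strategies and their audiences changed over the life of the dataset.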

Co-Occurring Hashtag Network

Figures 4 and 5 provide another view into the different themes and conversation communities that were targeted by the IRA activities. The Figures represent two different views of the “co-occurring hashtag network graph.” In this graph, prominent hashtags are nodes (or circles). The nodes are sized relative to the number of IRA tweets that contain that hashtag. These nodes are connected by an edge (or a line) if two hashtags appeared in the same tweet. That edge grows stronger (thicker) if many tweets contain both hashtags. We then use a “force-directed” algorithm to distribute the nodes in a way that tries to pull connected nodes together and push unconnected nodes apart. Finally, we used a “community detection” algorithm (colors) to identify clusters of nodes that are more densely connected.
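The node and edge construction described above can be sketched as follows. This covers only the counting step (node sizes and edge weights); the force-directed layout and community detection would be handled by a graph tool or library, and this post does not specify which ones were used. The input format (one list of hashtags per tweet) and the toy data are assumptions.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(tweets):
    """Build node sizes and weighted edges from per-tweet hashtag lists.

    Node size   = number of tweets containing the hashtag.
    Edge weight = number of tweets containing both hashtags.
    """
    node_sizes = Counter()
    edge_weights = Counter()
    for tags in tweets:
        # Deduplicate and sort so each pair has one canonical ordering.
        unique = sorted({tag.lower() for tag in tags})
        node_sizes.update(unique)
        edge_weights.update(combinations(unique, 2))
    return node_sizes, edge_weights


# Toy hashtag lists, invented for illustration.
toy = [
    ["maga", "tcot"],
    ["MAGA", "tcot", "PJNet"],
    ["BlackLivesMatter", "BlackTwitter"],
    ["news"],
]
nodes, edges = cooccurrence_graph(toy)
print(nodes)
print(edges)
```

In this toy example the `maga`/`tcot` edge has weight 2 and would be drawn thicker, while `news` has no edges at all and would drift away from the rest of the graph under a force-directed layout.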

Figure 4. Co-Occurring Hashtag Network Graph (Full View)

Figure 4 shows the full co-occurring network graph, revealing the prominence of the generic news tags in comparison to the rest of the graph. Figure 5, a closer view of the hairball of hashtags on the right of the graph, provides a high level view of some of the prominent themes within the IRA activities (at least those that appear in hashtags). A cluster of pro-Trump hashtags appears in magenta, centered around #maga. In black (in the lower center of the graph) is a cluster of the slightly older hashtags that reflect conservative organizing, including #tcot, #PJNet, #2a, #p2, and some hashtags used to criticize Obama. The light blue cluster comes together around the #BlackLivesMatter hashtag and reveals some other tags used to target similar conversations (e.g. #BlackTwitter, #BlackToLive, and #PoliceBrutality). In light green is an array of pop culture tags that likely were used to target and gather attention (i.e. eyeballs, retweets, and follows) from a younger crowd. And in orange are several hashtags focused, to some extent, on international affairs, terrorism, and anti-Islam messages. Aligning with our above observations about the most-tweeted hashtags, this graph demonstrates that major focal areas for IRA activities on Twitter were: (1) generic “news” content; (2) pro-Trump conversational communities; (3) conservative political activism; and (4) #BlackLivesMatter activism.

Figure 5. Co-Occurring Hashtag Network Graph (Focus on Right Side)

Prominent Hashtags Over Time

Figure 6 provides temporal graphs (tweets per week) of several prominent hashtags that we thought might provide insight into this data. These graphs give a sense of when certain hashtags were used by IRA agents, how their activities evolved over time, and how they reacted to external events. These are rough graphs right now, based on the hashtag field in the metadata (which is imperfect). Future content analysis should reveal other prominent terms or groups of terms (not necessarily hashtags) that may provide additional insight into strategies over time.

Looking at patterns of activity over time indicates that most of these targeting strategies gain prominence starting in 2015. In early 2016, the accounts collectively seem to move away from generic news-type hashtags and begin to focus on using politicized hashtags such as #tcot, #PJNet, and #BlackLivesMatter. After the 2016 U.S. election, the activity targeting the #BlackLivesMatter conversation almost completely disappears and most of the activity coalesces around pro-Trump and conservative political themes — with #MAGA becoming, by far, the most tweeted hashtag in 2017.

Figure 6. Tweets over Time for Specific Hashtags

Two Trolls, Sitting in Adjacent Offices at the IRA, Tweeting at Each Other

Journalists like Peter Pomerantsev, Mike Mariani and Adam Curtis suggest that Russian information operations aim to destabilize public discourse, rather than pushing a particular viewpoint within it. They hypothesize that these operations maintain an ideological fluidity so as to tap into existing social fractures from different “sides”, spreading not just inaccurate information, but also division and mistrust.

Drawing upon our previous contextual knowledge of these activities, we did a cursory qualitative analysis of how two prominent IRA accounts (@TEN_GOP and @Crystal1Johnson) interacted with each other — to gain some insight into if and how sowing division may have been a component of IRA operations. These are the top two most retweeted accounts in the dataset (with over 6 and 3.7 million retweets respectively) but their content was designed to resonate with different audiences. @TEN_GOP masqueraded as a grassroots political organization being run by Republicans in Tennessee and attracted a primarily right-leaning audience, while @Crystal1Johnson cultivated a politically left-leaning audience by presenting itself as a female African American journalist and social activist.

In our prior research, we observed that these two accounts engaged their different audiences with divergent messages about the #BlackLivesMatter movement and police related shootings. @Crystal1Johnson tweeted to support #BlackLivesMatter and attack the legitimacy of law enforcement authorities, while @TEN_GOP framed #BlackLivesMatter as an anti “law and order” movement and supported counter-movements like #AllLivesMatter and #BlueLivesMatter. Due to limitations in how we collected our data for that study, we had previously been unable to observe how these accounts interacted with each other directly, if at all. With access to the Twitter IRA Accounts dataset, we can now build a fuller picture:

@TEN_GOP (2016-12-05 21:52): Whether #WalterScott was running towards the police or away from them had he followed their orders he would still be alive! <embedded visual content>

@Crystal1Johnson (2016-12-05 21:57): @TEN_GOP 🖕🏾

@TEN_GOP (2016-12-05 21:59): @Crystal1Johnson facts hurt?

@Crystal1Johnson (2016-12-05 22:02): @TEN_GOP Have you been called a dumbass before?

In the example exchange above, we can see two IRA troll accounts, on two opposing political “sides”, tweeting back and forth at each other.

@TEN_GOP tweeted out a message (which received 434 retweets) that frames Walter Scott’s actions as the reason for being killed by police officers. [The tweet contained embedded visual content that we did not yet have access to when we conducted this analysis earlier this week (though this data will be available soon).] Five minutes later, @Crystal1Johnson replied with an adversarial tweet containing an emoji of a vulgar hand gesture. Note how @Crystal1Johnson’s emoji hand intentionally used a dark skin tone, to project the account’s African American persona. The two accounts send another round of back-and-forth retorts, both reflecting a kind of “trolling” technique (an online behavior characterized by trying to get a negative emotional response out of another account or accounts).

This exchange perhaps looks like many other adversarial online exchanges. We now know that these two accounts were operated by collaborating agents sitting within the same organization — or possibly could have been operated by the same agent enacting two fictional characters like a virtual puppeteer. This exchange highlights how the IRA accounts were not just spreading inaccurate information, but also tapping into stereotypical thinking and modeling ways of political engagement for their audiences. Taking a view informed by structuration theory, we can hypothesize that by enacting these personas in this way, the IRA agents both echo and shape norms around online behavior in these politically charged environments.

@TEN_GOP and @Crystal1Johnson had a few other direct interactions during the course of their activities. The following one took place on the eve of election day in 2016:

@Crystal1Johnson (2016-11-07 18:20) So we’re screwed either way..😑
#ElectionFinalThoughts <embedded visual content>

@TEN_GOP (2016-11-07 19:31) @Crystal1Johnson Wake up! This will happen if Hillary wins. Stop being slave of Democrat plantation!”

This exchange provides a concise example of some of the larger trends in the data. In the lead up to the 2016 election, while IRA-affiliated accounts on the political “right” converged in support of Donald Trump (the Republican candidate), IRA-affiliated accounts within the left-leaning #BlackLivesMatter conversation argued that there were no good candidates (and often suggested that voters stay home).

Other Interesting Patterns, Anomalies, and Open Questions

We have barely scratched the surface on this data set, and we don’t even have room here (or your attention) to talk at length about all of the interesting things that we’re seeing. There’s the case of the 32 highly-active accounts that were clearly trying to impersonate local news sites, but never seemed to fully “activate” for any clear nefarious purpose. And the case of the chemical accident hoax that several IRA accounts tried to perpetrate in 2014. There are open questions about how IRA accounts evolved and even abruptly switched identities over time, and questions about how IRA activities were shaped by external events. We plan to follow up on these and other questions in the future, but will wrap this first article up (so we can get some sleep!).

Final Thoughts / Next Steps

This article represents a preliminary, mostly quantitative analysis of the Twitter IRA dataset. For our lab, this is a first step, an exploratory analysis that will likely lead to more in-depth qualitative analysis on specific aspects of the data. Our initial reaction to this data (informed by our analysis) is that there is not necessarily anything hugely new or shocking here, but that this more comprehensive view of the IRA data confirms previous observations and hypotheses (for example, around how IRA activities differentially micro-targeted pro-Trump and pro-BlackLivesMatter discourses) and provides additional context to help us understand the broader patterns within these activities. We definitely want to leave the door open to the possibility that there are some really novel insights yet to be gained from these data, but it will likely take some time for researchers to identify those. We’re excited to see what the community of researchers finds!

We want to commend Twitter for releasing this data to the public, as we firmly believe that this is an “all-hands on deck” type of moment, as we collectively aim to better understand what these strategies are and how they work. The public availability of this data set will allow diverse perspectives and methods of analysis that will likely shed new light onto these phenomena.


Kate Starbird

Associate Professor of Human Centered Design & Engineering at UW. Researcher of crisis informatics and online rumors. Aging athlete. Army brat.