Spot a Bot: Identifying Automation and Disinformation on Social Media

Kris Shaffer · Published in Data for Democracy · Jun 5, 2017

Photo by Josh Sorenson

By Bill Fitzgerald and Kris Shaffer

There are bots everywhere, or so it seems. Some of these bots can be fun. But all too often, automated and otherwise high-volume social media accounts exist to deceive. Not only do they try to make readers believe they are real people, but they also participate in the spread of disinformation and malware, as well as coordinated harassment campaigns.

The bad news is that they often succeed in their deceptive ventures. But the good news is that most bots ― and their close cousins, “sockpuppets” and “trolls” ― exhibit some clear tell-tale signs. What follows is a list of those signs, based on our research into bots, sockpuppets, and disinformation on Twitter. With these signs, anyone can spot a bot, and resist the spread of disinformation online.

Bots vs. sockpuppets vs. good old fashioned trolls

It’s easy to jump on the bots, but fully automated accounts are only part of the problem when it comes to disinformation and harassment. In addition to bots, which are accounts that are fully automated and controlled by code (or an app) set up by a human user, there are two other categories of high-volume accounts that we have seen in our research (and the research of others): sockpuppets and trolls.

A sockpuppet account is created by one individual to make it appear that the account is controlled by a second, distinct individual. That second individual may be a real person (this type of impersonation is sometimes referred to as catfishing) or an invented persona that simply masks the identity of the account creator. Bots also often take on made-up or stolen identities. What distinguishes a sockpuppet from a bot is that sockpuppets are at least partially controlled by a human, whereas bots are fully automated via code. Often a human controls multiple sockpuppet accounts, tweeting different content from each one, or (more often) tweeting or retweeting the same content from all of them. There are various tools available to someone who wants to coordinate content across multiple accounts, but it could be as simple as using Twitter’s own TweetDeck app.

Of course, some people don’t feel the need to use multiple accounts or hide behind a false persona to tweet disinformation or participate in a harassment campaign. These accounts are the trolls we’ve heard about for some time on Twitter. While it’s easy to think that tweeting under their own names would leave them more exposed than sockpuppets, Twitter’s and law enforcement’s history of (not) responding to evidence of abuse, harassment, libel, and disinformation has emboldened many users to tweet horrendous things from their own accounts, often in high volume.

It can also be difficult to cleanly parse the differences between a fully automated account, a partially automated account, and a manually controlled account that tweets at high volume from a pre-defined script. Likewise, it can take time to sort out the sockpuppets, the hacked accounts, and the trolls. However, the result is the same: a network of signal-boosters that can alter the public information landscape and overwhelm the information-seeking public ― or the target of an abuse campaign.

So how do we spot these accounts in the wild? Following are a number of traits we’ve found in our research. As you might expect, many accounts that are not bots or sockpuppets exhibit some of these traits. None of them are foolproof. But the more of these traits an account displays, the more likely it is to be a disinformation account. In our research, we’ve found it far more helpful to look for evidence of these traits in a large collection of tweets, rather than trying to come up with discrete lists of bots, sockpuppets, trolls, and regular users. It’s often these traits that are most dangerous, and it’s these traits that we can look out for when engaging information online ― and when sharing information ourselves. It is also worth highlighting that many of the traits exhibited by bots and sockpuppets are pulled directly from tactics used in online harassment.

Traits of bots and other high-volume disinformation accounts

The sleepless account

One of the primary ways to determine that an account is not manually controlled by a human is that it never sleeps. That is, if you download its tweets from a period of several days, there is no break in activity at times when a human would sleep. Such an account could be partially automated ― manually controlled during the day, but with pre-loaded content set to be released while the user sleeps. But continual round-the-clock activity is a sure-fire sign of some measure of automation.
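For readers who want to check this programmatically, here is a minimal sketch in Python. It assumes you have already pulled an account's tweets (for example, via the Twitter API) into a list of datetime timestamps; an account active in all 24 hours of the day across several days is unlikely to be fully manual.

```python
from datetime import datetime, timedelta

def looks_sleepless(timestamps, min_days=3):
    """Flag an account that tweets in every hour of the day.

    `timestamps` is assumed to be a list of datetime objects (ideally UTC)
    covering at least `min_days` days of activity.
    """
    if len({ts.date() for ts in timestamps}) < min_days:
        return False  # not enough data to judge
    # A human-run account normally leaves a multi-hour gap each day;
    # activity in all 24 hours suggests at least partial automation.
    return len({ts.hour for ts in timestamps}) == 24

# Toy example: a feed that posts every 45 minutes, around the clock
sample = [datetime(2017, 5, 1) + i * timedelta(minutes=45) for i in range(200)]
print(looks_sleepless(sample))  # True
```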

The retweet bot

One of the most common types of bots we’ve come across is the retweet bot. This account is programmed to automatically retweet content that comes from certain “catalyst” accounts, or content that contains certain keywords. Software developers can create advanced retweet bots using the same frameworks we use to collect and analyze tweets, or they can use a Google Spreadsheet to search for content and retweet in accordance with a pre-defined schedule. Accounts that are exclusively retweets (or pretty close) aren’t always bots, but if they exhibit some of the other traits here, there’s a good chance they are. And whether they are bots or sockpuppets, they are a core component of the mass influence campaigns we have observed. Through sheer volume, retweet bots can function to amplify, normalize, and mainstream disinformation.
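As a rough illustration (not the method of any particular study), the retweet share of an account's recent timeline is easy to compute. The sketch below assumes tweets in the dictionary shape returned by Twitter's REST API, where a retweet carries a `retweeted_status` field.

```python
def retweet_ratio(tweets):
    """Fraction of an account's tweets that are retweets.

    Assumes Twitter API-style dicts, where a retweet carries
    a 'retweeted_status' field.
    """
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if 'retweeted_status' in t) / len(tweets)

# Toy example: three retweets and one original tweet
example = [
    {'text': 'RT @catalyst: ...', 'retweeted_status': {}},
    {'text': 'RT @catalyst: ...', 'retweeted_status': {}},
    {'text': 'RT @catalyst: ...', 'retweeted_status': {}},
    {'text': 'an original post'},
]
print(retweet_ratio(example))  # 0.75
```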

The reply bot

Similar to the retweet bot, the reply bot is set to monitor Twitter for tweets from specific accounts or containing certain text, hashtags, or links. It then immediately replies with pre-loaded content. During the GamerGate campaign, I (Kris) was barraged with disinformation tweets from several of these accounts every time a tweet of mine contained the text “GamerGate”. Ben Starling, CE Carey, and I (Kris) also observed numerous reply bots in the lead-up to the French election, replying to mentions of Le Pen or Macron with an image containing disinformation text or a disparaging or unflattering (and often faked) image of a candidate. Like the retweet bot, these are simple to create.

We should note that both high-volume retweeting and replying can be undertaken manually as well. In fact, apps like TweetDeck make it easy for someone to track trending terms or accounts and then retweet from dozens of accounts simultaneously (something we saw in our French election Twitter archive), or to reply to someone with the same content from dozens of accounts simultaneously. You don’t need to be a robot, or even know how to code, to engage in this kind of high-volume, bot-like activity. But these are traits common to disinformation and abuse campaigns.
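The same kind of rough check works for replies. The sketch below again assumes Twitter API-style dictionaries, where a reply carries a non-null `in_reply_to_status_id` field; an account that posts the same canned reply text over and over is behaving like a reply bot whether or not code is involved.

```python
from collections import Counter

def reply_ratio(tweets):
    """Fraction of an account's tweets that are replies."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if t.get('in_reply_to_status_id')) / len(tweets)

def repeated_replies(tweets, threshold=5):
    """Reply texts this account has posted `threshold` or more times."""
    counts = Counter(t['text'] for t in tweets if t.get('in_reply_to_status_id'))
    return [text for text, n in counts.items() if n >= threshold]
```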

Stolen content

A common tactic we’ve seen among automated accounts seeking to spread malware and/or collect user data is stealing content from other accounts. As we’ve discussed elsewhere, accounts seeking to misdirect users to their own site sometimes monitor accounts known for “click-bait” content (content engineered to attract attention and convince a high number of users to click on the link to external content), and reproduce that content, replacing the link in the original tweet with a link to their own site. Since these bots are unlikely to be followed by many real people ― and are likely to be suspended after a short period of activity ― they will typically also add a hashtag related to a hot topic or an extremist community.
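One way to surface this behavior in a collection of tweets is to look for tweets whose wording matches but whose links differ. This is a simplified sketch, assuming tweets in Twitter's API format with an `entities` field listing expanded URLs; it is an illustration, not the exact method used in our earlier analysis.

```python
import re
from collections import defaultdict

URL_RE = re.compile(r'https?://\S+')

def find_link_swaps(tweets):
    """Group tweets whose wording matches but whose links differ.

    Assumes Twitter API-style dicts with 'text', 'user', and an
    'entities' field listing expanded URLs. One wording shared across
    several different links suggests copied click-bait text with a
    swapped-in destination.
    """
    by_wording = defaultdict(set)
    for t in tweets:
        wording = URL_RE.sub('', t['text']).strip().lower()
        for url in t.get('entities', {}).get('urls', []):
            by_wording[wording].add((t['user']['screen_name'], url['expanded_url']))
    return {w: links for w, links in by_wording.items()
            if len({u for _, u in links}) > 1}
```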

As we discussed in our detailed article about this kind of bot, it is particularly nefarious. The creators are not politically motivated; they’re simply trying to get ad revenue by tricking people into clicking their link. But because click-bait content is often also the most polarizing content, these bots end up amplifying polarizing material that is sometimes itself misinformation or disinformation. This subtly contributes to increasing polarization in our culture, and when it overlaps with another influence operation, it can end up amplifying that operation’s effects.

Stolen profile images

Many bots and sockpuppets are examples of catfishing, the use of a false, often stolen, persona for dishonorable purposes. Many of the disinformation accounts that we observed had a picture of a real person as their profile image. In every case, when we performed a reverse image search, we found the picture being used by other Twitter accounts exhibiting bot-like or sockpuppet-like features, and often other false accounts on other social media platforms. Our friends at the Digital Forensics Research Lab have written extensively on this trait. See “Portrait of a Botnet” and “The Many Faces of a Botnet”.
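Reverse image search is largely a manual step, but if you have already downloaded the avatar images for a collection of suspect accounts, a perceptual hash can flag reuse of the same photo. This sketch assumes local image files and the third-party Pillow and imagehash packages; it is an illustration, not a substitute for a proper reverse image search.

```python
from collections import defaultdict
from PIL import Image       # third-party: Pillow
import imagehash            # third-party: imagehash

def shared_avatars(avatar_paths):
    """Map a perceptual hash to the screen names using that image.

    `avatar_paths` is assumed to be a dict of screen_name -> local file
    path for profile images you have already downloaded.
    """
    by_hash = defaultdict(list)
    for name, path in avatar_paths.items():
        by_hash[str(imagehash.phash(Image.open(path)))].append(name)
    # Several unrelated accounts sharing one photo is a strong sockpuppet signal.
    return {h: names for h, names in by_hash.items() if len(names) > 1}
```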

Along with a stolen profile image, we have seen many overly patriotic profile pictures and banner images on bot and sockpuppet accounts. This is largely because we have been studying disinformation in American and French politics, but in our experience, it seems that American flags, pictures of Donald Trump, and the red “Make America Great Again” baseball cap photoshopped onto a celebrity are all far more common occurrences among false accounts than among real accounts.

Tell-tale account names

While many bots and sockpuppets (and certainly trolls) have “regular” user names, or “handles”, fake accounts often have tell-tale account names (link goes to a PDF download), frequently artifacts of the automated account-creation process. These names may include variations on a single “real” name, variations on a celebrity’s name, and long strings of alphanumeric garbage (often appended to a “real” name).
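A crude heuristic for the last pattern is a regular expression that matches a name-like prefix followed by a long run of digits. Plenty of legitimate users match this too, so as with everything else on this list, treat it as one weak signal among many.

```python
import re

# Name-like prefix followed by six or more digits, e.g. 'jane_doe87654321'.
GARBAGE_SUFFIX = re.compile(r'^[A-Za-z_]+\d{6,}$')

def telltale_handle(screen_name):
    return bool(GARBAGE_SUFFIX.match(screen_name))

print(telltale_handle('jane_doe87654321'))  # True
print(telltale_handle('kris_shaffer'))      # False
```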

Recent accounts

Many of the bots and sockpuppets we have seen have been created recently. Whether because they were created for a recent campaign, or because they are replacements for suspended accounts, “young” accounts are very common in disinformation and harassment campaigns. In some cases, though, accounts are purchased for the purpose of being employed in a botnet, precisely because they are not recent and have a non-bot history, and therefore appear less bot-like to Twitter. However, in those cases, it is common to delete the previous account holder’s tweets before engaging in a harassment campaign. Thus, “old” accounts often have a very recent first tweet, if they are employed in a botnet. Recent initial account activity need not be indicative of a false account, but when seen in conjunction with other traits ― or with an inordinately high follower count ― it tips the balance towards bot/sockpuppet.
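Both signals (a young account, or an old account with a suspiciously recent first tweet) are easy to compute once you have an account's creation date and the timestamp of its earliest surviving tweet. A minimal sketch, assuming those values have already been parsed into datetime objects:

```python
from datetime import datetime, timezone

def account_age_days(created_at, now=None):
    """Days since the account was created (from the user object's 'created_at')."""
    now = now or datetime.now(timezone.utc)
    return (now - created_at).days

def wiped_and_reused(created_at, first_visible_tweet_at, min_gap_days=180):
    """True if an old account's earliest surviving tweet is much newer than the account.

    An account created years ago whose first visible tweet is from last week
    may have been purchased and wiped before being put to work.
    """
    return (first_visible_tweet_at - created_at).days >= min_gap_days
```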

Activity gaps or filler content

Some users hold onto their false accounts for multiple campaigns. When they do so, there may be significant gaps in the account activity between campaigns (or after campaign activity has been deleted by the user), or they may insert filler content, most commonly inspirational quotes or pornography ― though we did run across a bot that had over 500,000 tweets, the most recent of which were in English and themed around professional wrestling, until it was operationalized in French as part of the #MacronGate campaign.

This filler content can fulfill multiple purposes. It can make the bot look harmless to users who spot it between campaigns, particularly if it is following users and favoriting their tweets in order to build up a non-bot-like activity profile and a list of “real” followers. More nefariously, by including things like inspirational quotes, accounts that participate in a harassment campaign may end up not being considered in violation of Twitter’s Terms of Service, which define abuse in part by whether “a primary purpose of the reported account is to harass or send abusive messages to others” (emphasis added). In other words, if most of the account’s tweets are harmless, the account is less likely to be found in violation of Twitter’s rules about harassment.
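Activity gaps, at least, are easy to measure: the longest silence between consecutive tweets stands out immediately. A minimal sketch, assuming a list of datetime timestamps for one account; a multi-month gap sitting between two bursts of campaign-related activity fits the reuse pattern described above.

```python
def longest_gap_days(timestamps):
    """Longest silence, in days, between consecutive tweets."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return 0.0
    return max((b - a).total_seconds()
               for a, b in zip(ordered, ordered[1:])) / 86400
```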

Coordination

Bots and sockpuppets often work in coordination with other bots and sockpuppets. This can come in a variety of forms. We’ve seen the same content shared by multiple accounts, simultaneously or over a period of time. As mentioned above, we’ve seen the same tweets retweeted or replied to simultaneously or in quick succession by a higher number of accounts than is typical on Twitter, and that coordination often comes from accounts that work together regularly. (Such a network of bots is called a botnet, and we’ve taken to occasionally using the term socknet to refer to a coordinated network of sockpuppet accounts.) There may also be significant overlap in the lists of accounts that a botnet (or socknet) is following and followed by, whether these are instances of following each other, or having purchased artificial followers from the same source.
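One way to see this kind of coordination in a tweet collection is to look for original tweets that get retweeted by many accounts within seconds of each other. The sketch below assumes Twitter API-style dicts whose `created_at` field has been parsed into a datetime; the thresholds are arbitrary starting points, not calibrated values.

```python
from collections import defaultdict
from datetime import timedelta

def co_retweet_bursts(tweets, window_seconds=30, min_accounts=5):
    """Find original tweets retweeted by many accounts in a short window."""
    by_original = defaultdict(list)
    for t in tweets:
        rt = t.get('retweeted_status')
        if rt:
            by_original[rt['id']].append((t['created_at'], t['user']['screen_name']))

    window = timedelta(seconds=window_seconds)
    bursts = {}
    for original_id, events in by_original.items():
        events.sort(key=lambda e: e[0])
        for i in range(len(events)):
            j = i
            while j < len(events) and events[j][0] - events[i][0] <= window:
                j += 1
            if j - i >= min_accounts:
                bursts[original_id] = [name for _, name in events[i:j]]
                break
    # Accounts that appear together in many bursts are candidates for a botnet/socknet.
    return bursts
```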

In the lead-up to the French election, we also noticed a more subtle form of coordination, one that is more readily seen at large scale via statistical analysis. The accounts with the highest overall volume of tweets related to the French election had a significantly different timing profile from Twitter as a whole.

In general, Twitter activity containing references to the French election increased in the days leading up to the election, peaking on election day (Sunday, May 7).

But the highest volume accounts peaked on Friday, May 5 ― the day that the MacronLeaks content was posted and reported on in the media.

In fact, the accounts that tweeted the most on Friday (before the leak was posted) dropped off precipitously over the next few days.

While some of the bots/sockpuppets followed different timing profiles, the overall pattern showed that the disinformation campaign was coordinated in a way that ran counter to the general trend of Twitter activity leading up to the election. Comparing the two trends helped us identify the coordinated disinformation activity.
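A rough version of that comparison can be done with a few lines of pandas, assuming a DataFrame with one row per tweet, a parsed `created_at` column, and a `screen_name` column (the column names are ours, not a fixed schema).

```python
import pandas as pd

def daily_volume_profiles(df, top_n=100):
    """Compare the daily timing profile of the highest-volume accounts to everyone else.

    Returns daily tweet counts, normalized within each group, so the two
    trends can be plotted or compared directly.
    """
    top = df['screen_name'].value_counts().head(top_n).index
    daily = (df.assign(day=df['created_at'].dt.date,
                       high_volume=df['screen_name'].isin(top))
               .groupby(['day', 'high_volume']).size()
               .unstack(fill_value=0))
    return daily / daily.sum()
```

A divergence like the one described above, with the busiest accounts peaking two days before the platform-wide peak, shows up as the two columns peaking on different days.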

Semantic similarity

Sometimes the coordination is obvious, with identical text, images, or URLs appearing across multiple tweets from multiple users. Other times it is more subtle. For example, rather than automating the content creation, the perpetrators may be coordinating efforts on another site (4chan, Reddit, Discord, etc.) and reproducing language from a shared script, but doing so non-uniformly. Fortunately, there are algorithms that help us analyze this, and in a previous study, our D4D colleagues (led by Jonathon Morgan) found evidence of a coordinated influence operation across multiple social media platforms by statistically comparing social media data to other online collections of texts. They found that networks which tended to have very different semantic properties all of a sudden converged (and later diverged) in the kinds of patterns contained in their posts. While not as obvious as verbatim repetition, this is nevertheless a clear sign of large-scale coordination.
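That study used more sophisticated statistical modeling than we can reproduce here, but even a simple bag-of-words comparison can surface unusually similar accounts. A minimal sketch using scikit-learn, assuming each account's tweets have been joined into a single string:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def account_similarity(corpora):
    """Pairwise TF-IDF cosine similarity between accounts.

    `corpora` is assumed to be a dict of screen_name -> all of that
    account's tweet text joined into one string. Unusually high similarity
    between supposedly unrelated accounts hints at a shared script.
    """
    names = list(corpora)
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(
        [corpora[n] for n in names])
    return names, cosine_similarity(tfidf)
```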

Metadata similarities

In addition to the data ― the actual content of the tweets ― similarities in account metadata can be indicative of a botnet (or socknet). For example, accounts employed in a botnet are often created at or around the same time. Twitter’s API (application programming interface) includes, with every tweet downloaded, down-to-the-second information about when the account was created. If we observe multiple accounts participating in a disinformation or harassment campaign and they were created within minutes of each other, there’s a strong chance they are coordinated by the same individual.
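Grouping accounts by creation time is straightforward once the `created_at` values from those user objects have been parsed. A minimal sketch (the ten-minute window and minimum cluster size are arbitrary illustrations):

```python
def creation_time_clusters(accounts, window_minutes=10, min_size=3):
    """Group accounts whose creation times fall within a short rolling window.

    `accounts` is assumed to be a list of (screen_name, created_at) pairs,
    with created_at a datetime from the user object Twitter attaches to
    every downloaded tweet.
    """
    ordered = sorted(accounts, key=lambda a: a[1])
    clusters, current = [], []
    for name, created in ordered:
        if current and (created - current[-1][1]).total_seconds() > window_minutes * 60:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append((name, created))
    if len(current) >= min_size:
        clusters.append(current)
    return clusters
```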

Other times, accounts are purchased in order to make them look older and more reputable than they are. In these cases, the creation dates differ, but the perpetrator may have deleted all of the old tweets. In that case, looking for similar first-tweet dates (a bit harder to come by via the Twitter API, but still possible for many accounts) can reveal evidence of coordination.

Other metadata similarities may be observed, such as similar screen names, similar profile pictures, similar slogans or hashtags in the user bio, etc. When any of these similarities appear among accounts that exhibit other bot-like or sockpuppet-like traits, there is significant likelihood of a botnet/socknet.

Twitter’s responsibility

Twitter is not a neutral platform. The specific affordances, limitations, and policies of the platform facilitate certain kinds of user behavior, and Twitter’s employees make choices about how to respond (or not) to that behavior, and how to tweak the platform in light of the ways it is used. They have made significant changes to the platform in response to user activity, including the incorporation of @-replies and hashtags, the introduction of sponsored content, an algorithmic feed, the removal of RSS feeds, and significant changes to their API that have transformed the business model (or existence) of third-party app developers. The company’s history tells us clearly that they are willing to make sometimes drastic changes to the platform when they see a need or an opportunity, and that the specific affordances of the platform are not set in stone.

That said, what is the responsibility of a company like Twitter, whose platform so readily facilitates the spread of disinformation and the coordination of online harassment and abuse campaigns? What could/should they do to address the problem? What have they already done?

To start, we need to highlight that Twitter isn’t alone here. While this example focuses on Twitter, it is equally applicable to other platforms.

Twitter knows a lot about us. In addition to the actual content of our tweets, the accounts we interact with, the accounts we choose to ignore, and the content we share and favorite, they collect a range of information about us automatically. Twitter’s privacy policy is pretty clear about what they collect.

A quick read through their policy shows that Twitter collects a username, password, and an email address for every account. They can collect a phone number. But let’s stop here and look at what just the simple act of account creation tells Twitter about the account. In addition to the information required to create the account, Twitter will also know, at minimum:

  • The IP address used to create the account
  • Basic information about the device and method used to create the account
  • The time the account was created

With just this information, Twitter could begin to create profiles that would help predict which accounts are used by actual humans, and which accounts are likely to be used as bots or sockpuppets.

On a technical note, because most automated accounts are created behind a VPN, an anonymizing proxy, or both, the IP address data could be used as a predictor. There are lots of legitimate reasons to use a proxy or a VPN, but over time, being able to cross-reference the content posted from an account against how the account was created would help identify bot or sockpuppet accounts with greater accuracy. If a bot or sockpuppet account that engages in harassment or disinformation was created from an unobscured IP address, that’s also very valuable information. We know that Twitter tracks how people connect in at least some ways, based on how they have treated people accessing Twitter via Tor.

Moving on, Twitter also collects a range of other information automatically. A small subset of this information includes location data derived from at least four sources: GPS information from a mobile device, wireless networks near a mobile device, cell towers near a mobile device, and/or location derived from an IP address.

Twitter also collects information about devices and behavior of users as they interact on Twitter. This data trail includes “IP address, browser type, operating system, the referring web page, pages visited, location, your mobile carrier, device information (including device and application IDs), search terms, or cookie information.”

And that’s to say nothing of the data collected from around the web as users visit pages that have social sharing icons. A good proportion of less sophisticated bots and sockpuppets would likely have very different web browsing habits than regular people.

From the vantage point of identifying bots, sockpuppets, or trolls, the presence or the absence of any of these data points is meaningful and useful. Disinformation accounts are almost certainly less likely to allow access to contacts. The direct message history of bots, sockpuppets, and trolls likely looks very different than that of regular people.

Just from looking at this small subset of data collected by Twitter, a “regular” account will almost certainly generate a very different data trail than a bot, sockpuppet, or troll account. We need to emphasize that both the presence and the absence of data — each of which Twitter can observe — are of interest.

If an account never has any referrer history, always accesses the site via proxied IP addresses, and tweets steadily throughout the day regardless of time zone, these three factors together increase the likelihood that we are looking at a bot, or a sockpuppet shared among multiple people.
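We can't see Twitter's internal data, of course, but the logic of combining weak signals like these is not exotic. A purely illustrative sketch, with made-up signal names:

```python
def bot_likelihood(signals):
    """Share of observed weak signals that point toward automation or coordination.

    `signals` is a dict of boolean traits (the names here are illustrative,
    not Twitter's). No single trait is proof; the value only summarizes
    how many point the same way.
    """
    return sum(bool(v) for v in signals.values()) / max(len(signals), 1)

print(bot_likelihood({
    'no_referrer_history': True,
    'proxied_ips_only': True,
    'round_the_clock_activity': True,
    'contacts_access_granted': False,
}))  # 0.75
```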

Twitter (and other social media platforms) has the data to perform this analysis.

We can see how Twitter currently uses this data when we head over to their pages on targeted advertising (and to emphasize, all social media platforms offer comparable services — Twitter is not unique here).

If we look at “Interest targeting”, Twitter lets us know that we can target humans based on any one of 350 sub-categories.

If we look at “Device targeting”, we can see that we can target humans based on their operating system, the specific device they use, or whether or not their device is new.

Want to target women? Twitter has you covered there too (provided you define gender as binary).

Want to target individual humans using arbitrarily defined criteria? Then “Tailored Audiences” are your tool of choice.

The suite of tools Twitter provides to advertisers is powered by the data it collects from people using and interacting on Twitter. Both Twitter’s terms of service and their advertising copy make it very clear that they can target with a high degree of accuracy.

This raises the question: if Twitter can identify individuals for their advertisers, why can’t they use the same tools to curtail harmful bots, sockpuppets, and trolls (or hate speech and harassers)?

Companies like Twitter, Google, and Facebook all have large data sets, going back years. They have built the analytical tools to use these data sets effectively. This post identifies multiple habits of bots, sockpuppets, and trolls — and all of these elements can be highlighted and surfaced via basic data analysis. However, the repeated failure of industry to address disinformation and misinformation cannot be ignored. One possible explanation is that their advertising tools aren’t as useful and powerful as these companies claim, and advertisers are wasting money when they purchase these services.

This leaves industry in an interesting place. If the mechanisms used to target people for ads are as powerful as claimed, those same mechanisms could be used to identify disinformation campaigns. If, however, the accuracy of targeting has been overstated, and industry really is powerless to address disinformation, then the power of targeted ads has been oversold.

Either way, it’s time for Twitter ― “the free speech wing of the free speech party” ― to be transparent about what they can’t do, and what they won’t do, in the fight against disinformation.
