Building a Bot-Free Twitter

Dhruv Gupta
9 min read · Oct 1, 2018


The 2016 US Presidential Election saw a proliferation of linked bot activity on social media, a ‘Bot Army’. The goals of each bot are not entirely known, but, clearly, the Kremlin made an attempt to fight in ‘A Battle of Information’ — to blend in with regular users and spread ‘Fake News,’ to confuse social media users, and to undermine American trust in media and the democratic process.¹

The following year, Twitter launched an investigation and found that a Kremlin-linked organization known as the Internet Research Agency had created at least 3,500 bot accounts on Twitter to spread disinformation through around 200,000 Tweets. A further 220,000 “low-quality” accounts, responsible for 1.2 billion Tweets and impacting 1.4 million people, were also deleted.²

I decided to see if I could do something about it. For Prof. Latanya Sweeney’s Gov 1430 class, I built a web application, tentatively known as BotFreeTwitter, that uses machine learning to identify whether Tweets are bot-like with a Naïve Bayes classifier and whether users are bots with a Random Forest classifier.

With BotFreeTwitter, users can browse their home timelines and user timelines as usual, but a small indicator shows the probability that each Tweet is from a bot and that each user is a bot. The Naïve Bayes classifier that judges whether a Tweet is from a bot has a validation accuracy of 0.96. The Random Forest classifier that judges whether a user is a bot has a validation accuracy of 0.79. Finally, user timelines load in 7.78 seconds and the home timeline loads in 8.53 seconds on a localhost server. More work is needed to make the web application more efficient and to improve the accuracy of the user classifier, but it’s a good start.

Background Research

I based much of my work on Varol et al.,³ who pioneered the development of Twitter bot-detection algorithms and created the Botometer website. I also drew on Phadte and Bhat’s website, which lets users input a Twitter username and quickly judges whether that user is a bot. Users can also see various bot trends on Twitter, such as the hashtags bots tweet most and the words bots use most often.

Both websites suffer from the same flaw, however: they require users to input usernames one at a time to check whether accounts are bots. The creators of botcheck.me attempted to remedy this by building a Chrome extension that lets the user click on any Tweet to see a quick analysis of whether its author is a bot. This is a significant improvement in user experience, allowing people to check accounts with a single click while still using Twitter freely. Another Chrome extension, Botson, accomplishes the same task using Botometer’s API; Tweets that seem to come from bots are blurred out.

Both Chrome extensions have mixed reviews, however: they are glitchy and fairly slow to load. They also demonstrate the two extremes of user choice. The first extension requires the user to actively check each Tweet, while the second blurs offending Tweets without asking. A balance must be struck between model efficiency and accuracy.

Methods

Data Sources

Creating a dataset of “ground truth” labels for whether users or Tweets are bots is difficult and scales poorly. Two such datasets were combined to train the models. The first comes from Gilani et al., who crowdsourced hand-annotations labeling a group of accounts as bots or humans in April 2016. The second comes from NBC News, which published a dataset of users deleted from Twitter after being identified as bots, along with their Tweets. After parsing and combining these datasets, 678,610 Tweets and 4,314 users were analyzed in total.

Tweet Classifier

I built the first classifier to judge whether a given Tweet is from a bot. A Multinomial Naïve Bayes, Bag-of-Words text classifier was implemented with Scikit-Learn in Python. First, Tweets are parsed by removing special characters, including emojis. Then, the text is converted into Term Frequency–Inverse Document Frequency (TF-IDF) vectors, which weight how often each term appears in a Tweet against how many Tweets contain that term.

These vectors are then compared with all of the other vectors to create probabilistic models based on Bayes Theorem that predict how likely a Tweet is from a bot based on the terms that the Tweet uses. As this is a Naïve Bayes model, it ignores the order of the words and focuses just on the appearance of the words in the Tweet.
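A minimal sketch of this pipeline, using Scikit-Learn with a few made-up toy Tweets in place of the real training data described above:

```python
# Sketch of the Tweet classifier: TF-IDF features feeding a Multinomial
# Naive Bayes model (toy data for illustration; labels 1 = bot, 0 = human).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = [
    "BREAKING crooked media lies click here",
    "Just had a great coffee with an old friend",
    "RT this NOW patriots wake up America",
    "Excited for the game tonight, go team!",
]
labels = [1, 0, 1, 0]  # 1 = bot-like, 0 = human

# TfidfVectorizer lowercases and strips punctuation by default,
# approximating the special-character cleanup described above.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(tweets, labels)

# Probability that a new Tweet is bot-like (classes are sorted, so
# column 1 corresponds to label 1 = bot)
prob_bot = model.predict_proba(["wake up America RT NOW"])[0][1]
print(round(prob_bot, 2))
```

Because Naïve Bayes treats each word independently, the pipeline scores a Tweet purely on which terms it contains, not their order.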

User Classifier

The second classifier judges whether a user is a Twitter bot based on attributes of the account itself: number of Tweets, number of followers, number of favorites, ratio of followers to friends, ratio of favorites to Tweets, and the age of the account. Clearly, these parameters are limited, and more work is required to expand them. A Random Forest Classifier with 100 decision trees was used, again with Scikit-Learn in Python. This builds a hundred decision trees over the various parameters and then, based on the votes of the hundred trees, decides whether the given user is a bot or a human.
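The user classifier can be sketched the same way; the feature rows below are synthetic stand-ins for the real account statistics:

```python
# Sketch of the user classifier: a 100-tree Random Forest over the six
# account-level features listed above (synthetic rows for illustration).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Columns: n_tweets, n_followers, n_favorites,
#          followers/friends ratio, favorites/tweets ratio, age (days)
X = np.array([
    [120000, 40, 10, 0.05, 0.0001, 90],    # high-volume, young account
    [3000, 500, 2000, 1.2, 0.7, 2000],     # typical human account
    [80000, 15, 5, 0.02, 0.0001, 120],
    [1500, 300, 900, 0.9, 0.6, 3000],
])
y = np.array([1, 0, 1, 0])  # 1 = bot, 0 = human

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# The forest averages the votes of its 100 trees into a probability
prob = clf.predict_proba([[100000, 20, 8, 0.03, 0.0001, 100]])[0][1]
print(round(prob, 2))
```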

Web Application

Finally, I built a companion web app that presents the models in a user-friendly way and is styled to look like Twitter. Users can log in and view their own home timelines as well as user pages and their associated timelines. For each Tweet, users can see how likely that Tweet is from a bot and the probability with which that user is a bot.

The app was built with Flask and linked to a backend that runs the Scikit-Learn models behind API endpoints. The front end was designed with Bootstrap for a clean, blue-and-white feel. The Tweepy Python library was used to gather and parse Tweets.
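A hypothetical sketch of the annotation step the backend performs before handing a timeline to the front end (the field names and helper functions here are assumptions, not the app's actual API):

```python
# Attach bot probabilities to each Tweet in a timeline before rendering.
# tweet_prob_fn and user_prob_fn stand in for the two trained models.
def annotate_timeline(tweets, tweet_prob_fn, user_prob_fn):
    """Return each Tweet dict with bot-probability fields added."""
    annotated = []
    for t in tweets:
        annotated.append({
            "text": t["text"],
            "user": t["user"],
            "tweet_bot_prob": round(tweet_prob_fn(t["text"]), 2),
            "user_bot_prob": round(user_prob_fn(t["user"]), 2),
        })
    return annotated

# Stub probability functions in place of the real classifiers
timeline = [{"text": "RT NOW patriots", "user": "@example_bot"}]
out = annotate_timeline(timeline,
                        tweet_prob_fn=lambda text: 0.91,
                        user_prob_fn=lambda user: 0.89)
print(out[0]["tweet_bot_prob"])  # → 0.91
```

Scoring every Tweet on each page load is what drives the multi-second load times reported below; caching per-user scores would be one obvious optimization.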

Screenshot of my Twitter feed on BotFreeTwitter.

Results

Model Accuracy

+-----------+-----------+--------+----------+------+
| | Precision | Recall | F1-Score | AUC |
+-----------+-----------+--------+----------+------+
| Humans | 0.98 | 0.62 | 0.76 | -- |
| Bots | 0.89 | 1.00 | 0.94 | -- |
| Avg/Total | 0.91 | 0.90 | 0.90 | 0.96 |
+-----------+-----------+--------+----------+------+

The Tweet-classification model has a total accuracy rate of about 90%.

+-----------+-----------+--------+----------+------+
| | Precision | Recall | F1-Score | AUC |
+-----------+-----------+--------+----------+------+
| Humans | 0.78 | 0.77 | 0.77 | -- |
| Bots | 0.80 | 0.81 | 0.80 | -- |
| Avg/Total | 0.79 | 0.79 | 0.79 | 0.89 |
+-----------+-----------+--------+----------+------+

The user-classification model has a total accuracy rate of about 79%.

+----------+-------+--------------+-----------+----------+
| # Tweets | # Fav | Friend Ratio | Fav Ratio | Acct Age |
+----------+-------+--------------+-----------+----------+
| 0.22 | 0.26 | 0.25 | 0.06 | 0.20 |
+----------+-------+--------------+-----------+----------+

The user-classification model relies fairly evenly on all of the features except the ratio of favorites to Tweets.

Based on the F1-Scores over the validation sets, which were random 20% subsets of the total dataset, the Tweet-classification model has an accuracy rate of about 90% and the user-classification model about 79%. Their Area Under the Curve scores, which trade the True Positive Rate off against the False Positive Rate, are 0.96 and 0.89 respectively, which means that both models keep False Positives and False Negatives fairly low, at least on the validation sets.
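The F1-Scores in the tables follow directly from the reported precision and recall values, since F1 is their harmonic mean; a quick check:

```python
# F1 is the harmonic mean of precision and recall; the table values
# above can be reproduced from the reported precision/recall pairs.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.98, 0.62), 2))  # Humans, Tweet classifier → 0.76
print(round(f1(0.89, 1.00), 2))  # Bots, Tweet classifier → 0.94
```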

Web Application Runtime

It takes an average of 7.78 seconds, over 10 trials, to load a user profile page, and an average of 8.53 seconds to load the home timeline page. These numbers were measured on a 2016 MacBook Pro with a 2.9 GHz Intel i5 processor and 16 GB of RAM, with the application running on a localhost server, meaning there is minimal latency between the server and the browser. Running the application on a platform such as Google App Engine or Heroku would probably improve performance, as the server hardware would be much stronger, although latency might increase.
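The averaging itself is straightforward; a sketch of how such timings could be collected, with a placeholder standing in for the actual page render:

```python
# Time a page-render function over 10 trials with a monotonic clock.
# render_page is a stand-in for fetching and scoring a timeline.
import time

def render_page():
    time.sleep(0.01)  # placeholder for the real work

trials = []
for _ in range(10):
    start = time.perf_counter()
    render_page()
    trials.append(time.perf_counter() - start)

avg = sum(trials) / len(trials)
print(f"average over {len(trials)} trials: {avg:.2f}s")
```

Using `time.perf_counter` rather than wall-clock time avoids distortion from system clock adjustments between trials.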

Case Study 1: User Bot Identification

I am a real human, I promise

The user-classification model seems to lean towards classifying users as bots aggressively. In this case, @PrincessDebate is almost certainly a bot: its description and name are mostly emojis, and most of its Tweets are pro-Trump replies to celebrity politicians. The user-classification model gives it a score of 0.89. For reference, the Botometer website gives it a score of 2.7/5, and Botcheck also claims the user is a bot.

Unfortunately, BotFreeTwitter thinks I’m a bot too… with a probability of 1.0. I would like to contend that I am, in fact, not a bot, and Botcheck and Botometer support that conjecture. But, interestingly, given my Twitter activity and the parameters the model tests, it’s not that surprising that I’m classified as a bot, since I mostly just retweet news sites.

There’s definitely overfitting to the training data, and false positives are a real problem. In the future, this can be tackled in two ways. First, a larger dataset is needed. Second, the model needs more features, such as combining the user and Tweet models, or features that take a more holistic view of the user, like Tweet frequency.

Case Study 2: News and Non-Profit Organizations

Another problem is that news groups and non-profits might also be picked up as bots.

+---------------+-------+------------------+--------+
| News | Score | Non-Profit | Score |
+---------------+-------+------------------+--------+
| @cnnbrk | 0.75 | @tedtalks | 0.6 |
| @nytimes | 0.74 | @unicef | 0.62 |
| @cnn | 0.94 | @redcross | 0.52 |
| @bbcbreaking | 0.88 | @museummodernart | 0.54 |
| @sportscenter | 0.81 | @wikileaks | 0.68 |
+---------------+-------+------------------+--------+

The five most-followed news organizations and the five most-followed non-profit organizations are all classified as bots, although the non-profits fare a bit better. This is likely because non-profits Tweet much less frequently than news organizations, which Tweet every time there’s some news. One quick fix is to simply treat verified accounts as not bots.
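That override is trivial to sketch; the function below is a possible mitigation, not something the current app implements:

```python
# Possible mitigation: treat verified accounts as human regardless of
# the model's score (an assumption, not implemented in BotFreeTwitter).
def adjusted_bot_score(model_score, is_verified):
    """Zero out the bot score for verified accounts."""
    return 0.0 if is_verified else model_score

print(adjusted_bot_score(0.94, is_verified=True))   # e.g. @cnn → 0.0
print(adjusted_bot_score(0.89, is_verified=False))  # unverified → 0.89
```

The trade-off is that any malicious but verified account would sail through, so a softer discount on the score might be safer than a hard zero.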

Implications

In its current, minimum-viable-product form, this project shows that it is possible to create a Twitter client that identifies whether Tweets and users are bots with basic machine-learning algorithms, fairly high accuracy, and a relatively simple design. The models use small feature sets and are simplistic, untuned models that can definitely be improved in the future. The model also picks up too many false positives, so more work is definitely required to make this a production application.

If a user is pre-informed that a Tweet they are looking at is from a bot, then they will hopefully view the Tweet with more skepticism than if they were convinced that the bot was in fact a real person. This might help stymie disinformation on platforms like Twitter.

Maintaining the integrity of Twitter means retaining it as a platform for the free exchange of ideas and information. Twitter is about individuals finding a way to both express themselves and to seek others’ views on the world. So, Twitter users need to be able to ignore bots and seek real people’s ideas and opinions. And, that’s what BotFreeTwitter offers users.

Special Thanks to: Prof. Latanya Sweeney, Jinyan Zang, Ji Su Yoo

