Twitter Political Bots Likely Exist for Every Country in the World (Case study: Indonesian 2019 Presidential Elections)
Introduction
News about Russian Twitter bots influencing American elections has been on the front page for over a year now. There have been many studies and investigations of the Russian Twitter botnets, aiming to identify, categorize and understand their networks.
However, we hear very little about Twitter bots engaging in propaganda activities in other countries. Elections are high stakes in whichever country they occur, so we would expect to see similar propaganda activities on social media. I set out to investigate whether this occurred in the recent presidential election in Indonesia, and indeed, I found propaganda bots accounting for up to about 57% of the accounts participating in a political topic.
In this article, I describe the methodology by which I identify Twitter bots, followed by an analysis of the results I obtained.
Methodology:
Early on in my investigation, I realized that identifying Twitter bots by hand would not scale. A scalable alternative is to use a machine learning model to identify bots, but this requires a certain number of true labels (bot accounts vs. human accounts) to train the model on.
In my first iteration, I decided to use the Cresci 2017 dataset to train a machine learning model to identify bots. I then built a Twitter scraper to collect data and used the model to identify bots in it. Unfortunately, this method quickly ran into a couple of problems.
Firstly, the Cresci 2017 dataset did not contain all the data that the Twitter API returns. This meant that if I created new features based on those data, the Cresci 2017 dataset could not supply them, and the machine learning model could not be trained to use them.
Secondly, the Cresci 2017 dataset appeared to contain some erroneous labels. In my testing, I found a few accounts that were labelled as bots in Cresci’s dataset but that, on examination, appeared completely human by my standards.
With these problems, I decided to drop the Cresci 2017 dataset. Instead, I proceeded with a bootstrapping method to build up my own labels and machine learning model in an efficient way.
To achieve that, I first built a platform that displays the scraped data for each Twitter account in an intuitive, easy-to-interpret fashion. The UI displays not only the basic data available on the Twitter page, but also derived features such as retweet ratio and posting activity by hour of day and day of week. This allowed me to quickly and accurately judge whether each account is a bot or a human, and optionally to assign a label for the type of bot.
Next, the platform automatically retrains the machine learning model (Random Forest) using the labels given, and recalculates the bot probability (0 to 1) for each Twitter account in the database. The model is trained on 21 features derived from account and tweet characteristics.
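As a rough illustration of the derived features mentioned above, here is a minimal sketch of how retweet ratio and posting activity by hour and weekday could be computed. The tweet field names (created_at, is_retweet) and the function name are hypothetical and are not the platform's actual schema.

```python
from collections import Counter
from datetime import datetime, timezone


def derive_features(tweets):
    """Compute simple per-account features from a list of tweets.

    Each tweet is a dict with hypothetical keys:
      'created_at' (datetime) and 'is_retweet' (bool).
    """
    total = len(tweets)
    if total == 0:
        return {"retweet_ratio": 0.0, "hour_hist": [0.0] * 24, "weekday_hist": [0.0] * 7}

    retweets = sum(1 for t in tweets if t["is_retweet"])
    hours = Counter(t["created_at"].hour for t in tweets)
    weekdays = Counter(t["created_at"].weekday() for t in tweets)  # Monday == 0

    return {
        "retweet_ratio": retweets / total,
        # Fraction of the account's tweets posted in each hour / on each weekday.
        "hour_hist": [hours[h] / total for h in range(24)],
        "weekday_hist": [weekdays[d] / total for d in range(7)],
    }


# Made-up sample data for one account.
sample = [
    {"created_at": datetime(2019, 4, 1, 3, 0, tzinfo=timezone.utc), "is_retweet": True},
    {"created_at": datetime(2019, 4, 1, 3, 30, tzinfo=timezone.utc), "is_retweet": True},
    {"created_at": datetime(2019, 4, 2, 14, 0, tzinfo=timezone.utc), "is_retweet": False},
    {"created_at": datetime(2019, 4, 3, 3, 15, tzinfo=timezone.utc), "is_retweet": True},
]
features = derive_features(sample)
```

An account that retweets almost exclusively and posts at the same hour every day would stand out in these histograms, which is the kind of pattern the UI is meant to make visible at a glance.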
To close the loop, I continued to evaluate accounts with bot probabilities close to 0.5, as these scores indicate that the machine learning model is unsure whether the accounts are bots or humans, and assigning a label to them would be very effective in training the model. This method allowed me to go about labeling accounts for training in the most efficient way.
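The selection step of this loop can be sketched as follows. This is only an illustration of picking the accounts whose scores are closest to 0.5; the function name, the dictionary layout and the batch size k are assumptions, and the probabilities would come from the retrained model.

```python
def pick_for_labeling(bot_probability, labeled_ids, k=10):
    """Return up to k unlabeled account ids whose bot probability is nearest 0.5.

    bot_probability: dict mapping account id -> model score in [0, 1].
    labeled_ids: set of account ids that already have a hand label.
    """
    candidates = [
        (abs(score - 0.5), account_id)
        for account_id, score in bot_probability.items()
        if account_id not in labeled_ids
    ]
    candidates.sort()  # smallest distance from 0.5 first
    return [account_id for _, account_id in candidates[:k]]


# Made-up scores: "b" and "d" are the accounts the model is least sure about.
scores = {"a": 0.97, "b": 0.52, "c": 0.08, "d": 0.45, "e": 0.49}
queue = pick_for_labeling(scores, labeled_ids={"e"}, k=2)
```

After the selected accounts are labelled by hand, the model is retrained and the scores are recomputed, so the next batch again targets whatever the model is currently least certain about.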
In addition, I ran a clustering algorithm (DBSCAN) on the accounts identified as bots, to divide them into clusters for further analysis. The accompanying diagram describes the entire platform and its services.
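To show what the clustering step does, here is a small from-scratch sketch of DBSCAN. The platform's actual implementation, features and parameters are not stated, so the two-dimensional points and the eps and min_samples values below are made up for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    A point is a core point if at least min_samples points (itself included)
    lie within distance eps of it.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point, keep growing
        cluster += 1
    return labels


# Hypothetical (retweet_ratio, scaled activity) vectors for seven bot accounts.
points = [
    (0.95, 0.90), (0.97, 0.88), (0.96, 0.92),
    (0.10, 0.20), (0.12, 0.22), (0.11, 0.18),
    (0.50, 0.99),
]
labels = dbscan(points, eps=0.1, min_samples=3)
```

DBSCAN does not need the number of clusters in advance and leaves outliers unassigned, which suits a setting where the number of distinct botnets is unknown.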
Results:
Over the course of the investigation, I hand-labelled about 250 accounts and continuously scraped 54,887 Twitter accounts and their tweets (~38 million). Each of these accounts tweeted at least one of the 5 hashtags relevant to the 2019 Indonesian presidential election (#jokowidodo, #jokowi, #prabowosubianto, #prabowo, #Pilpres2019) between 29 March and 25 April.
We can examine individual hashtags to derive insights about the activities relating to them.
#jokowidodo
This hashtag is an example of a large, dedicated botting effort. Bot accounts make up about 57% of the accounts that tweeted it.
Despite Twitter's efforts to suspend accounts, 25% of the bot accounts are still active.
The machine learning model does not use an account's suspended status when predicting its bot score, yet almost all of the suspended accounts are labelled as bots by the model. Assuming that accounts are suspended mainly for botting activities, this suggests that the model is relatively accurate at flagging bot accounts.
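The figures discussed for this hashtag can be derived with a few counts. The sketch below uses made-up accounts, hypothetical field names, and an assumed cut-off of 0.5 for calling an account a bot; the actual threshold used by the platform is not stated.

```python
def summarize(accounts, threshold=0.5):
    """Summarize bot share and suspension overlap for one hashtag.

    accounts: list of dicts with hypothetical keys
      'bot_probability' (float in [0, 1]) and 'suspended' (bool).
    threshold: assumed cut-off above which an account counts as a bot.
    """
    flagged = [a for a in accounts if a["bot_probability"] >= threshold]
    suspended = [a for a in accounts if a["suspended"]]
    suspended_flagged = [a for a in suspended if a["bot_probability"] >= threshold]
    active_bots = [a for a in flagged if not a["suspended"]]

    def share(part, whole):
        return len(part) / len(whole) if whole else None

    return {
        # Fraction of all accounts that the model flags as bots.
        "bot_share": share(flagged, accounts),
        # Fraction of suspended accounts that the model also flags as bots.
        "suspended_flagged_share": share(suspended_flagged, suspended),
        # Fraction of flagged bots that have not been suspended.
        "bots_still_active_share": share(active_bots, flagged),
    }


# Made-up accounts for illustration only.
accounts = [
    {"bot_probability": 0.95, "suspended": True},
    {"bot_probability": 0.90, "suspended": True},
    {"bot_probability": 0.85, "suspended": True},
    {"bot_probability": 0.80, "suspended": False},
    {"bot_probability": 0.30, "suspended": False},
    {"bot_probability": 0.10, "suspended": False},
    {"bot_probability": 0.05, "suspended": False},
    {"bot_probability": 0.20, "suspended": False},
]
summary = summarize(accounts)
```

Because suspension status is never fed to the model, a high suspended_flagged_share acts as an independent sanity check on the model's predictions, under the assumption that suspensions mostly reflect botting.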
Breaking down the bots by cluster, we can see that a large majority of the bot accounts in this topic belong to clusters that have been manually identified as “retweet bots” and “jokowi picture bot”.
Using the platform, we can also examine the individual clusters to gain further insight into their characteristics and activities.
Here are the human-bot percentages for the other hashtags:
Conclusion:
Propaganda has existed for as long as elections have been held, and political parties all over the world are likely already using Twitter bots to influence their elections.
In this investigation, I developed a platform that enables quick identification of bots and employs machine learning to identify bots accurately on a large scale. With it, I am able to examine not only individual bots, but also botnets as a whole, so as to understand their characteristics and behavior. This can also help in understanding the motives of the bot operators.
Using the platform, I found evidence of propaganda bot activities on Twitter during the recent 2019 Indonesian elections, and quantified the extent of those activities.
This investigative report serves as a precautionary warning to citizens all over the world to be on the lookout for sock puppet accounts and avoid being misled by false pretenses of support for political candidates on social media.
In the future, I will be working on adding more features to improve the accuracy of the machine learning model. I will add more analytics, especially in the area of social network analysis (SNA). Finally, I will also use this platform to investigate bot activities in the upcoming elections in my region (Philippines - May 2019, Taiwan - Jan 2020, Singapore - Sep 2020) and report the findings.
Sincerely,
Jin-E
Prophunt.net