Analyzing the Potential of Machine Learning in Political Science

Financial institutions use massive amounts of stock market data to find patterns in financial time series. The data is fed into a machine learning algorithm that classifies and predicts trends. The healthcare industry also uses machine learning algorithms to improve accuracy in detecting illnesses. Machine learning (ML) is a type of artificial intelligence that uses large amounts of data to spot patterns and make predictions without being explicitly programmed to do so. ML algorithms are now widely used, and this paper explains their advantages and potential applications in political science. Political scientists also use significant amounts of data, typically from polls, to predict election outcomes. However, polls have numerous limitations: they are not easily scalable, they are expensive to conduct, and they are not always representative of the population. ML can avoid several of these disadvantages and offers predictive models that may be more accurate and efficient. The goal of this paper is to explore the use of ML algorithms on Twitter data for predicting the outcomes of elections.

How political campaigns work

Campaigns use data in many ways, including gauging favorability and deciding which places need more mobilization. They also use these data to compile lists of citizens to contact for various campaign needs (Nickerson and Rogers, 2014). Predictive models rely on three types of scores: behavior scores, which estimate the probability that a certain behavior will occur; support scores, which estimate the political preferences of voters; and responsive scores, which predict how voters will respond to certain programs (Arceneaux and Nickerson, 2009). Responsive scores also predict which groups of individuals will respond most positively to direct communication. Campaign managers use these three types of scores to decide which citizens to target, which communication methods work best for different groups, and how best to persuade voters (Nickerson and Rogers, 2014). These predictions need to be as accurate as possible, which is a serious challenge because data collected from polls is not always representative. Project Narwhal is an example of successful data collection and analysis, and it played a crucial role in Obama’s 2012 re-election. The engineering team led by Josh Thayer used the program to track voters and volunteers around the country. The team appended two types of data from consumer databases: updated phone numbers and additional information from customer data vendors, including home ownership status, mortgage information, and education. However, additional information from vendors is usually expensive, so most campaigns focus on phone calls and door knocking (Nickerson and Rogers, 2014). Both provide polling results that are used in predictive models. Phone calls are faster than door knocking, but they are still time consuming and inefficient. In addition, not everyone owns a phone, making it hard for campaign data analysts to obtain representative data.
This lack of good ways to collect data and create predictive models creates a need for machine learning as a potentially more accurate way to process data.

To improve the current methods, one must first understand these systems and their limitations. The present standard for evaluating opinions and making predictions about elections is the poll, a short interview with randomly selected individuals. However, various factors limit the usefulness of polls. For one, most polling methods introduce some sampling bias. For example, a poll that reaches respondents only through landlines excludes people without a landline, which can produce inaccurate results (Pew Research Center, 2010). Additionally, many of the problems common to interview methods are also present in polls. Questions can easily be unclear or leading, producing inaccurate responses, and statistical noise from a small sample size can cause error. So can nonresponse bias, which makes polls more volatile because supporters of leading candidates become more likely to respond (Pew Research Center, 1998).

Finally, surveys can become inaccurate due to conformity. Respondents may not voice their support for a minority candidate in an attempt to fit in with the general population, especially if the candidate’s viewpoints are significantly removed from the social norm. A University of Southern California (USC) online poll showed that Trump supporters, especially women, were less likely to tell a phone pollster that they supported Trump than Clinton supporters were to say they supported Clinton, reinforcing the idea that conformity can sway how respondents answer (Emamdjomeh and Lauter, 2016). All these distortions severely limit the applications of polls and show the need for some other predictive model.

A partial solution to some of these limitations may be aggregating polls. One of the most well-known examples is FiveThirtyEight’s poll-based predictions (Silver, 2017). Like most poll aggregations, this method reduces the statistical error due to sample size, since the combined sample is larger. FiveThirtyEight’s method has other advantages as well. It weights polls based on the credibility of the polling organization, reducing some concerns about leading questions, and its simulations run with various types of intentional anomalies to account for statistical biases. However, many of the limitations remain: an aggregation of polls can only be as good as the polls themselves. In particular, the problems with nonresponse bias and conformity persist. Thus, although aggregating polls alleviates some problems, there is still a need for a better way to gauge public opinion and predict elections.

Data Collection

Because Twitter is a very popular web service, it has been shown to be a useful source of data and an ‘effective indicator of real-world performance’ (Asur and Huberman, 2010). For instance, Asur and Huberman used it to analyze how attention for different movies is created and how that attention changes over time. Using roughly 3 million tweets, they constructed a linear regression model that can predict a movie’s revenue before its release.

In general, data from Twitter is collected by querying the Twitter API (application programming interface), a set of tools for building software, and archiving the randomly sampled real-time stream. This method yields a roughly uniform sample of up to 7 million public messages per day. For example, the central figures in the 2012 US presidential election were Obama and Romney. Thus, the first step in using the API to collect tweets is to pick keywords, e.g. ‘Obama’, ‘Romney’, ‘Democrats’, ‘Republicans’. Afterwards, parsers are created to determine each tweet’s geolocation and to exclude tweets that do not specify a location or that come from outside the US. The location is needed so that the machine learning algorithm can make predictions for specific regions in the US. The result is a data set assumed to be connected to the election and tied to specific locations.
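The keyword and location filtering described above can be sketched as follows. This is only an illustration under stated assumptions: the tweet records, the `state` field, and the shortened list of states are hypothetical, and no real Twitter API call is made.

```python
# Hypothetical tweet records; a real collection would come from the Twitter API.
KEYWORDS = {"obama", "romney", "democrats", "republicans"}
US_STATES = {"CA", "NY", "TX", "OH", "FL"}  # shortened, illustrative list


def mentions_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole word in the text."""
    words = {w.strip(".,!?#@:;'\"").lower() for w in text.split()}
    return not words.isdisjoint(keywords)


def filter_tweets(tweets):
    """Keep tweets that mention a keyword and carry a recognised US state."""
    kept = []
    for tweet in tweets:
        # Tweets with no location (None) or a non-US location are dropped.
        if tweet.get("state") in US_STATES and mentions_keyword(tweet["text"]):
            kept.append(tweet)
    return kept


sample = [
    {"text": "Obama speaks in Ohio tonight", "state": "OH"},
    {"text": "Great pizza in Brooklyn", "state": "NY"},
    {"text": "Romney on the economy", "state": None},
    {"text": "Watching #Romney debate", "state": "TX"},
]
filtered = filter_tweets(sample)
```

In this toy sample, the pizza tweet is dropped for lacking a keyword and the third tweet for lacking a location, so only tweets that are both on topic and tied to a region remain.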

Using this data from Twitter raises an ethical conflict. Is it ethically acceptable to collect people’s opinions without their explicit agreement and use them to persuade those same people later? Two ethical framings are relevant here. The first is utilitarianism, which aims to do the greatest good for the greatest number, here implying that this method is ethical because both stakeholders benefit: people use Twitter to share their views, and political campaigns analyze those tweets. On the other hand, Kant’s theory of morality would argue for limiting access to the data, as people are used as means to an end, their opinions analyzed only for the sake of political campaigns. These two framings are in conflict, and one should consider the context of the situation. Still, the practice can be seen as ethical since, by deciding to use social media, one consents to the possibility of one’s posts being used for various purposes.

Natural language processing (for Sentiment Analysis)

After gathering the tweets, we need to analyze the language. For that, we use Natural Language Processing (NLP), which refers to the ability of a computer to process and interpret human language in its various forms. NLP researchers aim to understand and manipulate natural languages to perform desired tasks; in our case, the task is to classify data consisting of people’s opinions and analyze them statistically.

For the purposes discussed here, one needs to examine the sentiment expressed in the tweets. Sentiment analysis (SA) investigates people’s opinions towards different matters. The first step of SA is pre-processing. It begins with part-of-speech tagging, in which some nouns, verbs, and adjectives are tagged for later elimination. Then, words are replaced with their roots, so that, for example, ‘city’ and ‘cities’ are treated as the same word. Afterward, prepositions and articles are removed, but negations are not, as they significantly affect the attitude of the user. Thus, negations are kept together with the word they refer to. Clauses are also considered because they strengthen or weaken the intensity of the opinion. Finally, each tweet is assigned to one of three sentiment categories: -1 for a negative sentiment, 0 for a mixed one, and 1 for a positive one.
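A minimal sketch of these pre-processing and classification steps is given below. The stopword list, the crude suffix-stripping `stem` function, and the tiny lexicon are all assumptions made for illustration; part-of-speech tagging and clause handling are omitted, and a real system would use an established NLP library instead.

```python
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "is"}
NEGATIONS = {"not", "no", "never"}
# Hypothetical toy lexicon keyed by word root.
LEXICON = {"good": 1, "great": 1, "love": 1, "bad": -1, "terrible": -1, "hate": -1}


def stem(word):
    """Very crude root extraction, e.g. 'cities' -> 'city', 'votes' -> 'vote'."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def preprocess(text):
    """Lowercase, drop stopwords, stem, and bind each negation to the next word."""
    tokens = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    tokens = [stem(t) for t in tokens if t and t not in STOPWORDS]
    merged, negate = [], False
    for token in tokens:
        if token in NEGATIONS:
            negate = True
            continue
        merged.append("not_" + token if negate else token)
        negate = False
    return merged


def classify(text):
    """Return -1 (negative), 0 (mixed or no sentiment words), or 1 (positive)."""
    score = 0
    for token in preprocess(text):
        if token.startswith("not_"):
            score -= LEXICON.get(token[4:], 0)  # a negation flips the polarity
        else:
            score += LEXICON.get(token, 0)
    return (score > 0) - (score < 0)
```

Note that this sketch cannot tell a genuinely mixed tweet from one with no sentiment words at all; both receive 0.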

Machine Learning Model

Before discussing the machine learning model, we will define what characterizes an effective model, using four criteria created by Beauchamp, an assistant professor at Northeastern University who specializes in political science and machine learning. Firstly, the success of a model needs to be measured statistically. The most common measurement is Mean Absolute Error, which does not take into account the size of the sample. Alternatively, one could use statistical significance, which reduces the problem of an unrepresentative sample because it puts the sample into the context of the population by considering the sample size. Secondly, clear benchmarks need to be set before the model is created. These distinguish successful models from unsatisfactory ones and help avoid confirmation bias. In this case, existing polls can be used as a benchmark. Thirdly, the training set needs to be large enough for the model to be accurate. Finally, “out of sample” analysis needs to be performed to evaluate the model’s performance on future data. This analysis involves repeating the prediction multiple times. The model is trained on the data collected before a given day. Afterwards, the information from the given day is fed into the model to make the prediction. This procedure is repeated, and the errors for every following day are calculated to assess performance (Beauchamp, 2015).
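The out-of-sample procedure can be illustrated with a short sketch. It assumes a single predictor and a simple least-squares line, which is a simplification chosen here for brevity and not Beauchamp’s actual model; the daily numbers are invented, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def rolling_out_of_sample(xs, ys, min_train=3):
    """Train on days before t, predict day t, and collect the absolute errors."""
    errors = []
    for t in range(min_train, len(xs)):
        intercept, slope = fit_line(xs[:t], ys[:t])
        prediction = intercept + slope * xs[t]
        errors.append(abs(prediction - ys[t]))
    return errors


def mean_absolute_error(errors):
    """Average of the absolute prediction errors."""
    return sum(errors) / len(errors)


# Invented daily data: a candidate's share of tweets and poll share (%).
tweet_share = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52]
poll_share = [44.0, 45.2, 46.3, 47.6, 49.1, 49.8]
mae = mean_absolute_error(rolling_out_of_sample(tweet_share, poll_share))
```

Because each prediction uses only data from earlier days, the resulting error estimates how the model would have performed on data it had not yet seen. The same function could be run on a benchmark predictor, such as the polls alone, to compare the two errors.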

Having defined the features of a successful model, we will discuss the model itself. The goal of the model is to predict the percentage of voters who will vote for a certain candidate, which is the dependent variable. Because this is a continuous quantity, we are dealing with a regression problem, treated here as linear regression. The model needs to specify a set of independent variables from which the dependent variable is estimated. Until now, two main methods have been used for this prediction: the volume-based approach and sentiment analysis. The volume-based approach involves measuring the number of tweets that mention a candidate. Sentiment analysis, on the other hand, uses natural language processing to classify each tweet as positive, negative, or mixed. To achieve maximum accuracy, sentiment analysis needs to be added to the volume-based approach. To do this, the share of positive and negative volume across all tweets is calculated, in addition to the ratio of positive to negative tweets for each party (Bermingham and Smeaton, 2011). A regression is then fitted on these variables and used to make predictions. This results in a more accurate model because both volume and sentiment have predictive power. After implementing the model, it is important to validate it against the four criteria defined in the previous paragraph. The model’s accuracy will vary depending on the exact implementation. Chandrasekar et al. reached an accuracy of 80% doing the sentiment analysis by hand (Chandrasekar et al., 2012). Bermingham and Smeaton combined volume-based and sentiment analysis and achieved an accuracy of 95% (Bermingham and Smeaton, 2011). Beauchamp achieved an accuracy of 98% and also validated the model against his four criteria (Beauchamp, 2015). Taken together, this research suggests considerable potential for using machine learning to predict elections.
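The volume and sentiment variables can be computed as in the sketch below. It reflects one possible reading of the description above, not necessarily the exact definitions used by Bermingham and Smeaton: each party’s share of all tweets, its share of all positive and all negative tweets, and its own positive-to-negative ratio. The function name, the input format, and the add-one smoothing are assumptions.

```python
from collections import Counter


def party_features(tweets):
    """Compute per-party predictors from (party, sentiment) pairs.

    Sentiment is -1, 0, or 1, as produced by the sentiment analysis step.
    """
    total = len(tweets)
    volume = Counter(party for party, _ in tweets)
    pos = Counter(party for party, s in tweets if s == 1)
    neg = Counter(party for party, s in tweets if s == -1)
    total_pos, total_neg = sum(pos.values()), sum(neg.values())

    features = {}
    for party in volume:
        features[party] = {
            # Volume-based predictor: share of all tweets mentioning the party.
            "volume_share": volume[party] / total,
            # Sentiment predictors: share of all positive / negative tweets.
            "pos_share": pos[party] / total_pos if total_pos else 0.0,
            "neg_share": neg[party] / total_neg if total_neg else 0.0,
            # Add-one smoothing avoids division by zero (an assumption here).
            "pos_neg_ratio": (pos[party] + 1) / (neg[party] + 1),
        }
    return features


# Invented example: two parties, eight classified tweets.
example = [("A", 1), ("A", 1), ("A", -1), ("A", 0),
           ("B", 1), ("B", -1), ("B", -1), ("B", 0)]
feats = party_features(example)
```

Each party’s feature values, computed per day or per region, would then serve as the independent variables of the regression, with poll or vote share as the dependent variable.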

This application would have several advantages over predictive models generated through polling. Firstly, it would likely be less time consuming and less costly than polling. Once the algorithm is created, it only needs to be run on new data. Additionally, the systemic limitation around conformity that polls struggle with is largely mitigated, as the data source is one where people mostly interact with acquaintances instead of strangers, decreasing the gap caused by conformity (Emamdjomeh and Lauter, 2016). Furthermore, machine learning allows for “live updates” because new tweets are constantly posted on Twitter. The model also makes it possible to follow voter sentiment in specific regions of the country, which is very difficult to do with polls. Last but not least, Beauchamp’s research suggests that ML can be more accurate than traditional polls (Beauchamp, 2015).

Figure 1: Visualisation of the process our machine learning model goes through. First, the data is collected and taken through pre-model processing. Then, the data is analysed through both sentiment analysis and the volume-based approach. After that, both approaches are ‘weighted’ using a regression method that best fits the current data, with 0.5 and 0.8 used here as examples. Finally, all analyses are counted and added up for a final score.


There are currently some limitations to our model. Specifically, its reliance on Twitter data limits our ability to use it in other countries and at other times. Currently, Twitter is a very popular social media platform in the United States, with only four other countries having a higher ratio of Twitter users than the US (Kuwait, the Netherlands, Brunei, and the UK) (Baronchelli et al., 2013). However, the number of Twitter users may decrease over time, making the model less reliable. Similarly, Twitter is much less popular in many other countries, particularly non-Western countries or countries with lower GDPs (Baronchelli et al., 2013). This makes it significantly harder to use the model in those countries, because the samples may not be representative. Since we cannot control this aspect of the model, we have to find a way to circumvent the problem. One option is to add other social media platforms to the model to make it more accurate in different countries.

Non-representative samples can arise even in countries where Twitter is popular. If one presidential candidate has significantly more followers on Twitter, our sample can become non-representative. This difference in followers can have different causes, such as the supporters of one candidate using Twitter more. During the 2012 presidential election in the US, Barack Obama had twice as many followers on Twitter as Mitt Romney (Chandrasekar et al., 2012). This difference can create a biased sample and affect the accuracy of the model. Another problem with using social media as a data source is spammers. If people realize that predictions are made from Twitter data, they can purposefully spam tweets to affect the predictions. These spammers can make the model unreliable and can also bias the sample. These limitations need to be addressed before such a model is used in political campaigns.

Language barriers can also become a problem when the model is used for sentiment analysis in different countries: our current model only considers English when identifying positive and negative emotions. Thus, the model should be adjusted for each country to ensure the sentiment analysis matches the language(s) spoken there.

Finally, even when the language is known, its nuances cause problems, in particular slang and sarcasm. Slang can vary among groups speaking the same language, and it can change rapidly, with new words emerging and becoming popular. Additionally, political discussions often involve sarcasm. At this point in technological development, programs still struggle to analyze these edge cases reliably.

Other Predictive Methods

An alternative way to predict elections not yet discussed is the 13 Keys to the White House method, devised by American historian Allan Lichtman and Russian scientist Vladimir Keilis-Borok (Lichtman et al., 1981). The method consists of 13 statements, either about the state of the US (like ‘there is no significant unrest’) or about characteristics of the candidates (like ‘the candidate of the incumbent party is charismatic or a national hero’). Depending on how many statements are judged true or false, it predicts the outcome of the election. This method has a strong track record: it has been credited with correctly anticipating nearly every election since it was devised in 1981. This information can be useful for our model, too. Our current model could be combined with other predictive data, like FiveThirtyEight’s poll-based predictions or these 13 keys, to create an even stronger prediction.
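The counting rule behind the keys is simple enough to sketch. The threshold used here comes from Lichtman’s published description of the method and is not stated in this paper: the incumbent party is predicted to lose when six or more of the 13 statements are false. The function name is hypothetical.

```python
def incumbent_party_predicted_to_win(keys):
    """Apply the 13 Keys counting rule to a list of 13 true/false judgments.

    Each entry is True if the statement holds (favouring the incumbent party).
    Threshold assumed from Lichtman's description: six or more false keys
    predict a loss for the incumbent party.
    """
    if len(keys) != 13:
        raise ValueError("exactly 13 keys are required")
    false_keys = sum(1 for key in keys if not key)
    return false_keys < 6


# Invented example: five false keys still favour the incumbent party.
example_keys = [True] * 8 + [False] * 5
prediction = incumbent_party_predicted_to_win(example_keys)
```

The count of false keys could also be passed to the regression as one more independent variable, which is one way to combine this method with the Twitter-based model.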

What else we can get from the data

Besides predicting voting preferences, machine learning can also advise political campaigns on how to persuade voters. At Northeastern University, Nick Beauchamp is developing an algorithm “that could make it easier for politicians to know exactly what to say to make us love them and hate their enemies” (Lapowsky, 2015). He conducted an experiment using data available online on Obamacare to construct several paragraphs explaining the costs and benefits of the health care law. He then used the Amazon crowdsourcing community, Mechanical Turk, to ask different people to read the paragraphs and rate, on a scale of 1 to 9, how strongly they approved or disapproved of Obamacare. He concluded that some of the paragraphs constructed by his algorithm were much more persuasive than others. Using algorithms like these, campaigns can combine past speeches, the responses to them, and social media posts to figure out what to say and what to avoid in a political speech. These algorithms can also be used to determine why voters favor the opposition and what could be said to redirect their support. But all these potential manipulation tools call into question the very principle of free choice. Social media information is public, and politicians have a right to access that information too. Persuasive techniques have always been used in politics. But if our politicians continue to find more efficient ways of persuading us, what will happen to our ability to choose freely? Lapowsky (2015) claims that “if we’re more aware of how easily we can be manipulated, perhaps we will be more willing to question those who are trying to manipulate us,” and in doing so we may become more aware than ever of what we choose.

Future Possibilities

As discussed, the machine learning model addresses many of the problems of the survey method. However, it also creates new challenges, as described in the discussion of limitations above. Further research can be done on combining polls and machine learning, because the combination of the two may yield even more accurate results. To combine the two methods, polls can either be used as an additional input variable in the machine learning model or be used as the target on which the model is trained.

Even in their current state, machine learning algorithms show great potential for making accurate predictions about the outcomes of elections. As the field develops, the accuracy of the models is likely to increase, and the use of ML in political campaigns will be further justified. This paper aims to inspire further research in the field by showing the exciting possibilities that already exist.

Natural Language Processing Explanation

There are two models used for sentiment analysis. The first is called BoW (Bag of Words). Its goal is to categorize documents by analyzing the words that occur in a corpus. BoW usually relies on a large list of words that carry sentiment, each with an assigned value that is counted when the word is found in the text. Its flaw is that it always treats words as isolated objects and never attempts to understand the structure of the text beyond these predefined lexical units. The second model uses NLP and attempts to understand the text by tagging parts of speech and entities and by taking context into account. Regarding the algorithm, there are three main classification levels of sentiment analysis (SA): document level, sentence level, and aspect level. They aim to classify, as expressing a positive or negative opinion, the document as a whole, each sentence, and each entity, respectively (Medhat et al., 2014).
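The flaw of the Bag of Words model can be demonstrated with a small sketch. The two-word lexicon and the function names are assumptions made for illustration; real sentiment lexicons contain thousands of entries.

```python
from collections import Counter

# Hypothetical two-word sentiment lexicon.
LEXICON = {"good": 1, "bad": -1}


def bag_of_words(text):
    """Count word occurrences, ignoring order and punctuation."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return Counter(w for w in words if w)


def bow_score(text):
    """Sum the lexicon values of all words found in the text."""
    bag = bag_of_words(text)
    return sum(LEXICON.get(word, 0) * count for word, count in bag.items())


first = "The plan is good, not bad"
second = "The plan is bad, not good"
same_bag = bag_of_words(first) == bag_of_words(second)
```

The two sentences express opposite opinions, yet they contain exactly the same words, so their bags and scores are identical. This is the kind of structure that the second, NLP-based model tries to capture.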