Can I predict who you voted for from your web search history?

Marina Wyss · Published in Analytics Vidhya · May 7, 2020

For my Master’s thesis, I wanted to see if it would be possible to predict who someone voted for using only their web search history. I was lucky to have access to a really unique dataset that made this possible: six months of web browsing history — including the text that was written into search engines — for 708 Americans. Not only that, but all participants were interviewed in a series of surveys where they were asked about their demographic characteristics, whether they voted in the 2018 midterm elections, and if so, for which party. Everyone in the study was compensated, and actively opted in to web tracking by installing a browser extension on desktop and mobile.

There are many reasons to believe that web search history could be predictive of voting behavior. We’re all familiar with targeted advertising based on browsing history, which assumes that the way someone behaves online is associated with some group that the advertiser is interested in reaching. This makes sense, because prior research has shown that online behavior and social media activity can predict demographic characteristics such as age and gender, which are in turn associated with things like party preference. The language individuals use when writing social media posts or blogs has also been shown to be predictive of demographics and party identification. More specific to this research question, aggregate-level Google Trends data on web searches in geographic areas has been used to successfully forecast everything from changes in the stock market to the spread of disease or the level of voter turnout.

The cool thing about this project is that I was able to work with individual-level web search history, which (as far as I know) hasn’t been done before. I was also able to look at all search engines, not just Google. So, unlike prior research that focused on the volume of Google searches for a particular keyword in a geographic area, I was able to evaluate an individual’s entire search behavior across all platforms to consider things like vocabulary (even vocabulary that may not be obviously related to voting), query “sentiment” (i.e. the emotion behind the words), or the time of day an individual search was made. These more nuanced considerations are important, because people often search in a more personal way, phrasing queries like sentences or questions, sometimes treating the search engine as something of a confidant.

If search engine queries proved to be predictive of personal information like whether or not someone turned out to vote and which party they prefer, this could have a variety of consequences for polling, privacy, and democracy. Polling could be augmented through a cheaper, faster method than traditional surveys, and the lack of social censoring when making searches could perhaps overcome some of the issue of social desirability bias, which is basically when respondents aren’t truthful in polls in an effort to be seen more favorably by the pollster. For example, if someone felt shame about being a Trump voter in a liberal state, they may not be honest about who they intend to vote for, which can make forecasting difficult. Predicting political preferences on the basis of search engine data would also open doors for political marketers and campaigns, both domestically and abroad, to target voters. This is why it’s critical that we understand the limits of what can be done with personal digital data of this kind.

Patterns in the Data

My initial look into descriptive differences across the sample showed promising signs in terms of the predictive capacity of the data. Interestingly, no user in the data used multiple search engines to make queries: if a participant used Google for searches, for example, they never used Bing. Also, the user bases of different search engines varied dramatically in terms of ideology, age, gender, and other demographic factors. On average, Democrats tended to make more queries than Republicans, though Republicans wrote longer queries. Non-voters generally used search engines later in the day than voters.

Certain keywords were clearly associated with partisanship: Democrats and Republicans both had the terms “2018,” “day,” “new,” “trump,” and “us” in their top 10 query terms, but Republicans were much more likely to search for the words “photos,” “flowers,” and “American,” while the terms “best” and “news” were more associated with Democrats. Differences in vocabulary can be further analyzed based on keyness, which uses a chi-squared test to compare the relative frequency of terms between two groups of documents and identify the terms most strongly associated with each group. Among the words more associated with Republicans were neutral terms like “flowers” and “cat,” as noted in their top search terms, but also conservative news outlets like “drudge” and “daily caller.” In contrast, Democrats were more associated with the words “lesbian,” “flirt,” and “vegan.”
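To make the keyness idea concrete, here is a rough Python sketch of the chi-squared comparison. The thesis did this analysis in R, and the term counts below are invented purely to show the calculation:

```python
# Rough illustration of a keyness-style comparison: for each term, a 2x2
# chi-squared test of its frequency in Democrats' vs. Republicans' queries.
# The counts below are invented purely to show the calculation.
from scipy.stats import chi2_contingency

dem_counts = {"news": 420, "best": 380, "flowers": 40, "cat": 55}
rep_counts = {"news": 210, "best": 190, "flowers": 160, "cat": 150}
dem_total, rep_total = sum(dem_counts.values()), sum(rep_counts.values())

keyness = {}
for term in dem_counts:
    table = [
        [dem_counts[term], dem_total - dem_counts[term]],  # Democrat queries
        [rep_counts[term], rep_total - rep_counts[term]],  # Republican queries
    ]
    chi2, p_value, _, _ = chi2_contingency(table)
    keyness[term] = (chi2, p_value)

# Terms with the largest chi-squared statistics are most strongly associated
# with one group or the other.
for term, (chi2, p_value) in sorted(keyness.items(), key=lambda kv: -kv[1][0]):
    print(f"{term}: chi2={chi2:.1f}, p={p_value:.3f}")
```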

Beyond looking at searches generally, I also wanted to see if there were patterns when considering specific, theoretically-relevant searches. I made a few keyword lists, including voter-registration related keywords (like “absentee,” “voting,” and “ballot”), the names of candidates for the U.S. House of Representatives election in each state, well-known partisan political figures, and general non-partisan political terms (such as “filibuster” or “gerrymander”).
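As a minimal sketch of how this kind of flagging can work (the keyword lists here are abbreviated, illustrative stand-ins, not the full lists from the thesis):

```python
# Minimal sketch of flagging queries against predefined keyword lists.
# The lists here are abbreviated stand-ins, not the full lists from the thesis.
import re

KEYWORD_LISTS = {
    "voter_registration": {"absentee", "voting", "ballot"},
    "general_political": {"filibuster", "gerrymander"},
}

def flag_query(query: str) -> dict:
    """Return which keyword lists a single search query matches."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    return {name: bool(tokens & words) for name, words in KEYWORD_LISTS.items()}

print(flag_query("how to request an absentee ballot"))
# {'voter_registration': True, 'general_political': False}
```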

Unsurprisingly, voters made searches containing items from all of these lists more frequently than non-voters. The variation was less extreme when considering Democrats vs. Republicans.

Methodology

I wanted to see if it was possible to predict 1) whether or not someone reported having voted, and if so, 2) whether they voted for the Democrats or the Republicans.

The first step was to feature engineer five unique datasets:

  • Search Behavior: The first dataset looked at search behavior. This includes things like which search engine was used, the time of day, query sentiment, and whether the participant searched for any of the words relating to politics from the predefined lists.
  • Top 1000 Unigrams and Top 1000 Bigrams: For the second and third datasets I found the top 1000 most-searched-for individual words and the top 1000 two-word phrases (like “Waffle House” or names), and noted whether or not the participant searched for any of these top terms or phrases.
  • Entire Search Text and Entire Political Search Text: The last two datasets looked at the participants’ entire query text. Basically, I just squished all of a respondent’s queries together into one big paragraph. This block of text was turned into a numerical representation that captures the relationships between words, called word embeddings; I used BERT from the transformers library for this (see the sketch after this list). There was one dataset with all of their searches, and one where I only condensed the searches that contained one of the “political” words defined before.
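A rough sketch of that embedding step, assuming the Hugging Face transformers library and a standard pre-trained bert-base-uncased model (the thesis may have used a different configuration), might look like this:

```python
# Hypothetical sketch: turning a participant's concatenated queries into a
# fixed-length vector with a pre-trained BERT model via Hugging Face transformers.
# Note BERT's 512-token limit; long histories would need chunking in practice.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def embed_queries(queries):
    """Concatenate a participant's queries and return one embedding vector."""
    text = " ".join(queries)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings from the last hidden layer.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = embed_queries(["absentee ballot deadline", "best vegan restaurants near me"])
print(vector.shape)  # torch.Size([768])
```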

Both research questions — whether someone voted or not, and if so, for which party — were tested on the following models:

  • The Search Behavior dataset was implemented with logistic regression, k-nearest neighbors, XGBoost, and a basic neural network (MLP).
  • The Top 1000 Unigrams and Top 1000 Bigrams datasets were explored with regularized logistic regression, a support vector machine, and XGBoost.
  • Lastly, the Entire Search Text and Entire Political Search Text datasets were modeled with the neural network.

All models use supervised learning and were hyperparameter-tuned (where appropriate) using repeated cross-validation before final evaluation on the test set. To ensure balanced training data, I applied SMOTE before training any models, and feature scaling was applied where relevant for the particular model/data combination. All of the models were coded in R, with the exception of the neural network and BERT embeddings, which were done in Python.
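To make that setup concrete, here is an illustrative Python sketch using scikit-learn, imbalanced-learn, and XGBoost. The thesis implemented these models in R, so the classifier, grid, and parameter values here are assumptions for demonstration only:

```python
# Illustrative Python version of the training setup: SMOTE oversampling,
# feature scaling, and hyperparameter tuning with repeated cross-validation.
# (The thesis implemented these models in R; the grid below is an assumption.)
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

pipeline = Pipeline([
    ("smote", SMOTE(random_state=42)),    # oversample the minority class
    ("scale", StandardScaler()),          # scaling, where relevant
    ("model", XGBClassifier(eval_metric="logloss")),
])

param_grid = {
    "model__max_depth": [3, 5, 7],
    "model__n_estimators": [100, 300],
    "model__learning_rate": [0.05, 0.1],
}

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv, n_jobs=-1)

# X_train, y_train would be the engineered features and turnout/party labels:
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```

One nice property of imbalanced-learn’s pipeline is that SMOTE is applied inside each cross-validation fold, so oversampled examples never leak into the validation folds.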

In order to evaluate the models’ performance, I first created two baseline logistic regression models that used demographic characteristics like age and education as features. The metrics used for comparison are accuracy, the share of cases correctly classified, and F1, the harmonic mean of precision and recall, which also accounts for false positives and false negatives. The baseline metrics to beat were accuracy of 93% and F1 of 95% on the test dataset for the turnout model, and accuracy of 74% and F1 of 65% for party choice.
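As a quick, toy illustration of the two metrics (with made-up predictions, not the thesis data):

```python
# Toy illustration of the two evaluation metrics on made-up predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = voted, 0 = did not vote
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

print(accuracy_score(y_true, y_pred))  # 0.75: share of cases classified correctly
print(f1_score(y_true, y_pred))        # 0.80: balances false positives and negatives
```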

Results

Turnout

Overall, the results for the first research question, whether it is possible to predict if a participant reported having voted based solely on their search history, were modestly promising. Given the high bar set by the baseline model, it is not surprising that few models were able to compete, though a few did manage it. The best performance came from the word embeddings and neural network applied to each participant's history of queries containing a politically-relevant term (the Entire Political Search Text dataset), which achieved accuracy of 96% and F1 of 94%.

The tricky thing with this model, though, is interpretability. Not only are neural networks known for being difficult to interpret, but common methods also fail to produce meaningful results on the embeddings. While variable importance methods do exist (such as LIME or the VIP package in R), the results here would just be the most relevant word vectors, i.e. numerical representations of the words that matter most for prediction, which are essentially meaningless on their own. Unfortunately, I am not aware of any way to consistently translate these vectors back into written language.

Party Choice

Unlike the first research question, none of the models for party choice came anywhere close to the accuracy or F1 of the baseline socio-demographic model. For example, the best configuration achieved an accuracy of 63%, 11 percentage points below the baseline.

Overall, the best-performing model for party choice used the Search Behavior dataset with the neural network. I looked into these results a bit with LIME, and found that how often a participant searched, mean search length, and the search engine used were consistently flagged as highly important for prediction.
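For anyone curious what that kind of inspection looks like, here is a hypothetical LIME sketch on synthetic Search-Behavior-style features. The feature names, the data, and the scikit-learn MLP standing in for the actual model are all assumptions, not the thesis code:

```python
# Hypothetical sketch of inspecting feature importance with LIME on
# Search-Behavior-style features; the data and feature names are made up.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
feature_names = ["n_queries", "mean_query_length", "uses_google", "mean_hour_of_day"]
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labels for illustration

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["Republican", "Democrat"],
    mode="classification",
)
# Local explanation for one "participant": which features drove the prediction?
explanation = explainer.explain_instance(X[0], clf.predict_proba, num_features=4)
print(explanation.as_list())
```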

Conclusions

So, I guess the answer is: while yes, there are interesting insights we can gain about whether someone voted and which party they prefer, I actually wasn’t able to consistently beat the predictive power of a simple logistic regression with demographic variables. Which is probably for the best, for privacy’s sake. ;-)

It is quite possible that, given a larger sample size and more computational power, future research could achieve highly accurate results. Luckily, individual-level search query data is not typically available, so even then it seems unlikely that search queries will significantly add to the toolbox of those seeking to predict political preferences for the purposes of targeting or forecasting.

If you’re curious to learn more, all the code (and the full paper) are available here: https://github.com/MarinaWyss/search-engine-thesis
