Using web crawlers + ML to aggregate data on candidates running for local office

How we built a web crawler and dedicated machine learning server to aggregate data on candidates running for local office

Thomas Monfre
Dartmouth CS98
13 min read · Mar 16, 2021


MyBallot is a tool that allows users to get more information about local elections. For more information on our product and what it does, click here.

We built MyBallot with a technology-forward approach. We knew gathering information on candidates running for local office would be hard. Data online can be limited and the process of quantifying political stance can be complex.

In this article, we detail exactly how MyBallot works. We talk about the choices we made and the architecture of the entire system. We hope this serves as an interesting case study for analyzing sentiment of sources online.

How It Works

MyBallot sits upon a micro-service architecture that provides a fast and intuitive user experience while allowing us to embrace complex crawling and analysis processes behind the scenes.

Our main backend server acts as a gateway and connection between the frontend and our database. Users of the website connect to this server to grab data from our database and interact with it.

Main architecture of our system.

In the background, we have a separate web crawler that scrapes the Internet for information on candidates running for local office. We also have a separate machine learning server dedicated to parsing, cleaning, and analyzing data from the web crawler.

Our ML server takes scraped data from the web crawler and converts it into numerical estimations of candidates’ stances on issues. This involves extracting semantically relevant topics for individual races and computing scores for each candidate.

When this process is complete, the ML server updates our database. Users then immediately see those changes on the frontend.

We dig into more information on our web crawler and ML servers below.

Behind the Scenes: Web Crawler

In order to collect as much relevant information about candidates as possible, we designed and built our own internet crawler. The crawler is run on a standalone server and can crawl either a specific race in our database or all of them at once. The crawling process involves a number of steps:

  1. Get race information from our database. This includes information like the location of the race, the name of the position that the race will fill, and the candidates themselves, which we add to a queue for crawling.
  2. Spawn a number of candidate crawling threads. Currently, we crawl five candidates simultaneously, each in a separate thread. Once a thread has finished crawling, it grabs the next candidate from the queue (a simplified sketch of this queue-and-worker pattern follows this list).
  3. Within each candidate crawling thread, we extract keyword information based on the candidate’s name, location, and position name.
  4. For many candidates, the VoteSmart API and website provide access to the candidates’ homepages, which may be websites, Twitter profiles, or Facebook profiles. If these are available, we perform a crawl of those pages and make sure to follow any links to other pages on their website.
  5. Next, the crawler checks to see if there are any public statements by a candidate, which are often available on the VoteSmart website. We built a module to fetch a number of these public statements and filter them for relevancy to make sure we aren’t mixing different candidates’ positions. We decided to prefer VoteSmart data over web-crawl data as it is already vetted to ensure that it is what a candidate actually said. This way we can be sure we are getting a candidate’s opinion and not just a journalist or random person’s opinion. Further, since the website follows a specific HTML format, it is easy to scrape just the text we want, without gathering things like navigational or header text.
  6. Finally, if we have limited data from the homepages and public statements, we perform a web-crawl on a candidate. This process begins by collecting jumping-off article links for our crawl. In order to find links, we automate the act of searching using a search engine and the keywords collected in step 3. Once we have found some starting points, we visit these websites, extracting their content and scanning it for potentially relevant information. If we think a website might be useful, we save it and grab any links containing the candidate’s name from the page, which we check next.
  7. After gathering a bunch of content, we begin to filter out irrelevant information about a candidate. We do this by getting text from elements where our candidate’s name is mentioned, or by fetching the whole text if we deem the entire article page to be about our candidate.
  8. Finally, once we have a collection of relevant text, we remove duplicates and join the text into one string per webpage. We then update our information on this candidate by adding the new text to our database.
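
Steps 1 and 2 above form a simple producer/consumer pattern: a queue of candidates consumed by five worker threads. Below is a minimal sketch of how that might look in Python; crawl_candidate stands in for the per-candidate work in steps 3 through 8, and the function and variable names are illustrative rather than our exact implementation.

```python
import queue
import threading

NUM_WORKERS = 5  # we crawl five candidates at a time, one per thread

def crawl_candidate(candidate):
    """Placeholder for the per-candidate crawl described in steps 3-8."""
    ...

def worker(candidate_queue):
    # Each thread repeatedly grabs the next candidate until the queue is empty.
    while True:
        try:
            candidate = candidate_queue.get_nowait()
        except queue.Empty:
            return
        try:
            crawl_candidate(candidate)
        finally:
            candidate_queue.task_done()

def crawl_race(candidates):
    candidate_queue = queue.Queue()
    for candidate in candidates:
        candidate_queue.put(candidate)

    threads = [threading.Thread(target=worker, args=(candidate_queue,))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    candidate_queue.join()  # block until every candidate has been crawled
```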

In building the crawler, we encountered a few main challenges. The first involves how we programmatically load webpages. Our first approach was to use the Python requests library and receive a simple HTML document back. In many ways this approach is great: it is very fast, uses very few machine resources, and is simple to manage. The original crawler would fetch many webpages at a time simply by having a bunch of threads send their own request calls and aggregating the results into a list. However, we soon realized that much of the content we were getting with requests was incomplete; only a fraction of what we saw in our own browsers was appearing in our crawled text. This is because many websites use JavaScript to enrich a site’s appearance and often download much of their content after the initial request to the server is made. With the requests library, we were not getting any of that additional information.

Enter Selenium. Selenium is a browser automation library that lets you control browsers with code instead of a mouse and keyboard. This solution gives us the most complete data, as it allows us to view webpages just as any human web surfer would. It is, however, far more resource-intensive. To manage resources, we designed a Scripting Spider module, which manages a number of headless browsers (headless just means that the graphical interface does not show up, which helps save resources). The spider works by creating a pool of browsers and maintaining a queue of requests sent by the candidate crawler threads. Once initiated, the spider polls for new requests and assigns them to one of its browsers as they become available. Here’s a demo of the spider system with headless mode disabled so you can see what that looks like:

When the entire crawling process has finished, we can tell the spider to shut down its browsers. By using a project-wide settings module, we can easily tweak the number of candidates (and therefore browsers), so the crawler process scales up or down based on available hardware.
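
As a rough illustration of the Scripting Spider idea, here’s a minimal sketch of a pool of headless Chrome browsers served from a shared queue, including a shutdown step like the one described above. The class and method names are hypothetical; the real module also handles polling and dispatching requests from the candidate crawler threads.

```python
import queue

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

class ScriptingSpider:
    """Maintains a small pool of headless browsers that serve page-load requests."""

    def __init__(self, num_browsers=5):
        options = Options()
        options.add_argument("--headless")  # no visible window, saves resources
        self.browsers = queue.Queue()
        for _ in range(num_browsers):
            self.browsers.put(webdriver.Chrome(options=options))

    def fetch(self, url):
        # Borrow an idle browser, render the page (including JS content), return the HTML.
        browser = self.browsers.get()
        try:
            browser.get(url)
            return browser.page_source
        finally:
            self.browsers.put(browser)

    def shutdown(self):
        # Quit every browser once crawling is finished.
        while not self.browsers.empty():
            self.browsers.get().quit()
```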

Perhaps the biggest challenge in crawling is determining what’s relevant. We handle this with a number of filters for web pages based on what text is on the page and where that text occurs. For example, to decide whether to keep a page while crawling, we check that it contains a number of first- and last-name mentions matching our current candidate’s. We also check that it contains a number of keywords found in the name of the race the candidate is running in, say “congress”, “5”, and “district”. Earlier in development, we were often finding pages from phone-lookup sites containing people with the same names as our candidates, but by also filtering for race keywords, we have mostly avoided these false positives. To make sure we throw out as little good data as possible, we normalize the keywords we look for by stemming them with NLTK and a custom set of mappings. That way, a page that says “bob for congress” will match a “congressman” keyword pulled from the race name.
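
A relevance check along these lines might stem both the race keywords and the page tokens so that variants like “congressman” and “congress” match. This is a simplified sketch using NLTK’s Porter stemmer; the thresholds and helper names are illustrative, not our production values.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

def stem_set(text):
    """Lowercase, tokenize, and stem a string into a set of normalized tokens."""
    return {stemmer.stem(token) for token in word_tokenize(text.lower())}

def page_is_relevant(page_text, candidate_name, race_name,
                     min_name_mentions=2, min_race_keywords=1):
    # Count mentions of the candidate's first and last name in the page text.
    name_parts = candidate_name.lower().split()
    first, last = name_parts[0], name_parts[-1]
    text_lower = page_text.lower()
    name_mentions = min(text_lower.count(first), text_lower.count(last))

    # Count how many stemmed race keywords (e.g. "congress", "district", "5") appear.
    race_keywords = stem_set(race_name)
    page_tokens = stem_set(page_text)
    keyword_hits = len(race_keywords & page_tokens)

    return name_mentions >= min_name_mentions and keyword_hits >= min_race_keywords
```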

After we have a bunch of potentially relevant pages, we enter the filtering stage. Here we also check that a page is associated with the correct state, when a state applies to the candidate. This helps us avoid problems where, say, a district 5 race with a candidate named “Jim Brown” is happening in two different states. It is also in the filtering stage that we determine how much of a page’s text to scrape. For Facebook and Twitter pages, we check whether the candidate’s full name appears in the ‘about info’ of the page, along with the title of the race. If so, we grab a bunch of posts, since we can fairly safely assume this is the candidate’s social media; this is especially useful when VoteSmart doesn’t have data on a candidate’s homepage. For websites, we grab most of the text on a page if the title and/or URL contain the candidate’s name. We also pay attention to the number of keyword occurrences in headers on the page, as a large number of name occurrences in headers may mean the page is mostly about one candidate. If we cannot determine that a page is almost entirely about one candidate, we selectively scrape text from tags that do contain the candidate’s name. This lets us get relevant information on a candidate from the types of articles that give a broad overview of several candidates in a race.
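
The selective scrape at the end of this stage can be approximated with BeautifulSoup: if a page isn’t clearly dedicated to the candidate, keep only the elements whose text mentions them. A hedged sketch, with the tag list and heuristics simplified from what we actually do:

```python
from bs4 import BeautifulSoup

def extract_candidate_text(html, candidate_name):
    """Grab most of the page if it is clearly about the candidate,
    otherwise only the tags that mention them by name."""
    soup = BeautifulSoup(html, "html.parser")
    name = candidate_name.lower()

    title = (soup.title.string or "") if soup.title else ""
    if name in title.lower():
        # Page appears to be dedicated to this candidate: take most of the text.
        return soup.get_text(separator=" ", strip=True)

    # Otherwise, selectively pull paragraphs and headers that mention the candidate.
    chunks = []
    for tag in soup.find_all(["p", "h1", "h2", "h3", "li"]):
        text = tag.get_text(separator=" ", strip=True)
        if name in text.lower():
            chunks.append(text)
    return " ".join(chunks)
```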

The last main challenge in crawling was finding and removing duplicate text. We needed an approach that would work for text that is mostly the same, not just identical strings. To do this, we first make shingles of a document, meaning we generate a number of overlapping chunks of the text. For example, shingles for the sentence “I’m a candidate for the district 5 water commission board” might include “I’m a candidate”, “a candidate for”, “candidate for the”, and so on. Shingling makes comparing two texts more meaningful, as two texts may share a large number of shingles without being equal. For more information on shingling as it relates to text processing, check out this article. To find texts within a certain similarity threshold, we use the MinHash algorithm, which, very simply, estimates the Jaccard similarity between shingled texts from their hashes. To make our implementation efficient, we use the DataSketch library. Our implementation also provides a setting that lets us easily change the duplicate threshold. If two texts are within this threshold, we remove the one with less information before uploading to the database.
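
Here’s a rough sketch of that near-duplicate check using the DataSketch library’s MinHash and MinHashLSH; the shingle size and similarity threshold shown are illustrative defaults, not necessarily the values in our settings module.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text, k=3):
    """Generate overlapping k-word chunks of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text, num_perm=128):
    """Build a MinHash signature from a text's shingles."""
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf8"))
    return m

def deduplicate(texts, threshold=0.8):
    """Keep one representative of each near-duplicate group, preferring longer texts."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    # Check longer (more informative) texts first so they win ties.
    for i, text in enumerate(sorted(texts, key=len, reverse=True)):
        m = minhash(text)
        if not lsh.query(m):          # no sufficiently similar text already kept
            lsh.insert(str(i), m)
            kept.append(text)
    return kept
```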

While our approach doesn’t find the best data 100% of the time, we have found that when there is data on a candidate online, our crawler tends to pick up much of the best text available.

Behind the Scenes: Machine Learning Server

Once the crawler has collected data on each candidate in each race, the machine learning server takes over. There are three main steps to our machine learning process.

Step One: Compile and Clean Data

Crawled data is messy. The most important step in any NLP problem is cleaning the data. Don’t clean enough and you will have a lot of bad data in your analysis. Clean too much and you are removing semantically relevant pieces of information from the analysis. For our cleaning process we:

  • Tokenize each text object (split it into words)
  • Remove any numbers
  • Fix all contractions (e.g. don’t → do not)
  • Replace all instances of “covid”, “covid19”, “coronavirus”, and “covid-19” with “pandemic” because covid was not a recognized word in our dictionary and is frequently picked up by our crawler
  • Remove all “stop words”, which are common words that don’t carry much semantic weight (e.g. and, if, when, a, etc.)*
  • Remove punctuation*
  • Get rid of all words that are < 3 characters in length*

This cleaning process required a bit of trial and error. Some of the steps (those marked with an asterisk) are not applied to all of the text we clean; whether they are applied depends on where the text is used in the machine learning process.
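
A condensed sketch of that cleaning pipeline is below. It leans on NLTK and the contractions package, with keyword arguments toggling the asterisked steps; treat the exact libraries and flags as an approximation of our real module.

```python
import string

import contractions                      # pip package for expanding contractions
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
COVID_TERMS = {"covid", "covid19", "covid-19", "coronavirus"}

def clean_text(text, remove_stopwords=True, remove_punct=True, min_length=3):
    """Clean one piece of crawled text; the keyword flags toggle the asterisked steps."""
    text = contractions.fix(text)        # e.g. "don't" -> "do not"
    tokens = word_tokenize(text.lower())

    cleaned = []
    for token in tokens:
        if token.isdigit():
            continue                     # remove numbers
        if token in COVID_TERMS:
            token = "pandemic"           # normalize pandemic references
        if remove_punct and all(ch in string.punctuation for ch in token):
            continue                     # remove punctuation tokens
        if remove_stopwords and token in STOP_WORDS:
            continue                     # remove stop words
        if min_length and len(token) < min_length:
            continue                     # drop very short words
        cleaned.append(token)
    return cleaned
```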

Step Two: Extract Relevant Keywords

The next step of the machine learning process is to extract relevant keywords or “topics” from all of the data the crawler collected on a single race. We initially tried many topic-clustering algorithms, from simple k-means to LDA and NMF. The issue with unsupervised clustering is that once the sentences are clustered, these algorithms give us no way to extract a label for each cluster.

We ended up using the Rapid Automatic Keyword Extraction (RAKE) algorithm. This algorithm relies on the fact that oftentimes key-phrases are surrounded by stop words or punctuation (therefore when cleaning the main text for topic extraction, we chose not to remove stop words or punctuation). After extracting the words and phrases that are commonly surrounded by stop words and punctuation, RAKE looks for words that frequently appear together and words that have a high frequency in the text and ranks its extracted keywords and phrases accordingly.
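
For reference, the rake-nltk package provides a convenient RAKE implementation. A minimal sketch of pulling ranked key-phrases from one piece of crawled text might look like this (we’re not claiming this exact package is what our server uses):

```python
from rake_nltk import Rake

def extract_keyphrases(text, max_phrases=20):
    """Pull ranked key-phrases from one piece of crawled text using RAKE."""
    rake = Rake()                          # uses NLTK's English stop words by default
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases()[:max_phrases]

# Example:
# extract_keyphrases("The candidate spoke about affordable housing and school funding.")
# -> ["affordable housing", "school funding", ...]
```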

We run RAKE on each piece of text pulled by the crawler. This leaves us with a large pool of keywords, so we need to figure out which ones are most relevant across the race. To do this, we compiled a list of 177 national political issues from I Side With and ran it through the same text-cleaning process, so that each issue is boiled down to only its most relevant words. We then put the issues into a Term Frequency-Inverse Document Frequency (TF-IDF) model. TF-IDF is based on the idea that the most common words in a document only matter if they are not common across all of the documents.

Next, we create a sparse term similarity matrix from the TF-IDF model of the issues, the lexicon of words in the issues, and a word vector space. The vector space we use is GloVe, which lets us take English words and measure mathematically how semantically similar they are to one another. The matrix therefore tells us where all of the words in the political issues lie in a vector space based on their meanings, so “abortion” and “health care” might lie close to each other while “guns” lies very far from “peace”. We can then take each keyword extracted with RAKE and measure how “close” it is (using soft cosine similarity) to each political issue. Each key-phrase is sorted into a dictionary keyed by the political issue it is most semantically similar to (this way we don’t only pull key-phrases that relate to the same political issue); we keep only the most relevant key-phrase for each issue while tracking how many key-phrases related to it. From there, we return the most relevant key-phrases across the national issues.
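
Gensim provides the pieces this step describes: a TF-IDF model, a sparse term similarity matrix built from GloVe vectors, and soft cosine similarity. The sketch below shows one way to wire them together to score a RAKE key-phrase against each issue; the pretrained GloVe model name and the toy issue lists are assumptions, not our exact configuration.

```python
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import (SoftCosineSimilarity,
                                 SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

# Pre-cleaned issue texts, each a list of tokens (toy examples in place of the 177 issues).
issue_docs = [["abortion", "access"],
              ["gun", "control", "laws"],
              ["health", "care", "coverage"]]

glove = api.load("glove-wiki-gigaword-100")        # pretrained GloVe word vectors
dictionary = Dictionary(issue_docs)
tfidf = TfidfModel(dictionary=dictionary)

# Term-to-term similarities derived from the GloVe embedding space.
termsim_index = WordEmbeddingSimilarityIndex(glove)
termsim_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary, tfidf)

# Index the issues so key-phrases can be scored against them with soft cosine similarity.
issue_bows = [dictionary.doc2bow(doc) for doc in issue_docs]
issue_index = SoftCosineSimilarity(tfidf[issue_bows], termsim_matrix)

def closest_issue(keyphrase_tokens):
    """Return (issue position, similarity score) for the most similar political issue."""
    query = tfidf[dictionary.doc2bow(keyphrase_tokens)]
    scores = issue_index[query]
    best = int(scores.argmax())
    return best, float(scores[best])
```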

Along with the extracted key-phrases, we have a set of national topics that we run for every race. We tried to choose these in an unbiased way; however, they can be changed by an admin user on the front end. The national topics are then appended to each list of extracted topics.

Step Three: Analyze Candidate Stance

Finally, once we have extracted topics relevant to the race, we run a very similar process to find which sentences are relevant to each topic. We clean all of the text, place it into a TF-IDF model along with the political-issues text, map it into the GloVe vector space, and compute how semantically similar each sentence is to each key-phrase. If the similarity score passes a certain threshold (70%), we add the sentence to an object that organizes the text by candidate and topic. Since the crawler pulled text for each candidate, we know which candidate each sentence relates to. Once we have all of the relevant text for each topic for each candidate, we use IBM’s Watson NLU to get a sentiment score (from -1 to 1) on the text for each topic for each candidate. Although there is some noise in the text (some sentences might be irrelevant or incorrectly attributed to the candidate), since we only compute one score over all of the sentences, this noise should be canceled out by the majority of sentences that are correctly bucketed.
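
Calling Watson NLU for a document-level sentiment score looks roughly like this; the version date, service URL, and credentials below are placeholders rather than our real configuration.

```python
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, SentimentOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")              # placeholder credentials
nlu = NaturalLanguageUnderstandingV1(version="2021-03-25",    # placeholder version date
                                     authenticator=authenticator)
nlu.set_service_url("https://api.us-south.natural-language-understanding.watson.cloud.ibm.com")

def topic_sentiment(sentences):
    """Score the combined relevant sentences for one candidate/topic pair (-1 to 1)."""
    text = " ".join(sentences)
    response = nlu.analyze(text=text,
                           features=Features(sentiment=SentimentOptions())).get_result()
    return response["sentiment"]["document"]["score"]
```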

We chose to use Watson rather than build our own sentiment analysis for two main reasons. First, if we built our own, it would make the most sense to use some sort of unsupervised learning model to bucket the sentences into five groups (strongly against, against, neutral, in favor, strongly in favor), but we would run into the same problem we had in topic extraction: we wouldn’t know how to label those buckets. Second, Watson’s sentiment analysis is among the most accurate commercially available, and we favored accuracy over implementing our own model.

After the machine learning process is over, the race object is updated in the database to hold each candidate’s sentiment score for each topic, as well as some example sentences so we can display to the user what types of sentences were deemed relevant.

In summary, the machine learning server takes race data from the crawler, extracts relevant topics for each race, and finds each candidate’s sentiment for each topic. The structure is as follows.

Behind the Scenes: Fighting Misinformation

In an age of big data, how does MyBallot accurately quantify politicians’ stances and grab relevant information off the Internet while avoiding misinformation?

First, our web crawler blocks domains of websites known to spread misinformation. We gather this list from Snopes’ Field Guide to Fake News Sites and Hoax Purveyors and PolitiFact’s Pants-On-Fire accuracy ratings. Additionally, our web crawler has safeguards in place to only pull information from relevant and reliable sources that have multiple links.
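
The domain blocklist itself is straightforward: before a page is fetched, its hostname is checked against the compiled list. A minimal sketch (the blocked domains shown are placeholders, not entries from the real list):

```python
from urllib.parse import urlparse

# Compiled from Snopes' and PolitiFact's lists; the entries here are placeholders.
BLOCKED_DOMAINS = {"example-fake-news.com", "another-hoax-site.net"}

def is_blocked(url):
    """Return True if a URL's hostname (or any parent domain) is on the blocklist."""
    hostname = (urlparse(url).hostname or "").lower()
    parts = hostname.split(".")
    return any(".".join(parts[i:]) in BLOCKED_DOMAINS for i in range(len(parts)))
```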

Lastly, we are transparent in our machine learning process. Users of the website can see all links that we crawled for candidates and can flag any that look suspicious. Our team of moderators reviews all flags and can pull down and re-run analyses at any time.

About The Team

We’re the MyBallot team, a group of CS majors at Dartmouth College! We’re passionate about how technology can help make voting for local office less confusing, and wanted to create a centralized website to provide curious citizens critical voting information. For more information, read here.

Try it out here: https://my-ballot.netlify.app!
