Stabilizing Alloy’s Voter File Pipeline

Sam Szuflita · Published in Alloy · 4 min read · Feb 22, 2021

Introduction

At Alloy, we built and maintained a national voter file by combining 51 input files (50 states + DC). This file included voter registration status, demographic data, and contact information, and it powered Alloy Source, Verify, and Protect. We worked hard to keep this file as up to date as possible and ultimately rebuilt it more than 800 times.

When I arrived at Alloy, the challenges of maintaining and rebuilding an up-to-date voter file were readily apparent. The build process involved manually running five commands that each ran for hours and often failed. Time spent building the file was not accounted for in our sprint budgets and often bled into nights and weekends. On top of that, only one engineer knew how to build the file.

With the 2020 election cycle looming, we decided to fix this process: running the pipeline had to take less engineering time and any engineer had to be able to run it. We knew that we had a few more features to add to the pipeline, so we had to remain flexible. We also knew that if we did nothing, our flaky build process would limit our ability to provide voter data to the progressive ecosystem.

I worked with our engineering team to implement three process and engineering improvements to stabilize our voter file pipeline: I (1) mitigated common sources of failure, (2) invested in pipeline automation, and (3) designed a scheme for sharing pipeline duty. With these three improvements in place, we were able to make our pipeline reliable. As Alloy winds down its operations, we wanted to publicly share what we learned from making these improvements in the hopes that it may help others facing similar challenges.

Mitigating Common Sources of Pipeline Failure

When I started this process, Alloy did not have reliable data on which failure modes were most common. After asking around the team, two errors stood out as frequent: running out of disk space in MapReduce jobs and schema breaks in input datasets.

The first issue was straightforward enough: we used cloud-managed MapReduce clusters, so we could simply request nodes with larger disks. We had initially used 32GB disks to keep costs low. Given the engineering time spent rerunning failed jobs, I advocated increasing this to 128GB. The change barely made a dent in our cloud bill, and these errors disappeared.
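As an illustration, a larger disk can be requested at cluster launch time. The sketch below assumes AWS EMR via boto3; the post does not name the cloud provider, and the instance types, counts, and release label are placeholders rather than Alloy's actual configuration.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Minimal sketch: launch a managed MapReduce (EMR) cluster whose core nodes
# carry 128GB EBS volumes instead of the smaller default. All names and
# instance types here are illustrative placeholders.
response = emr.run_job_flow(
    Name="voter-file-build",
    ReleaseLabel="emr-6.2.0",
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "core",
                "InstanceRole": "CORE",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 4,
                "EbsConfiguration": {
                    "EbsBlockDeviceConfigs": [
                        {
                            "VolumeSpecification": {
                                "VolumeType": "gp2",
                                "SizeInGB": 128,  # was 32GB; disk-full errors disappeared
                            },
                            "VolumesPerInstance": 1,
                        }
                    ]
                },
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```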

Schema breaks posed a larger challenge. In some of our sprints, several states made format changes to their voter files without any notice. Fixing these schema breaks could be straightforward (changing date formats, adding or removing CSV header rows) or back-breaking (supporting a new ID scheme and mapping old IDs to the new ones).

Unfortunately, this was not something Alloy could control. However, we could plan for schema breaks. We started assigning one engineer to be the “schema fixer”; their top priority was to resolve any schema breaks that came up during the sprint. While schema breaks continued to occur, they no longer caused other deadlines to slip and we had a process to fix them swiftly.
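The post does not describe how breaks were detected, but one lightweight approach is to validate each state file's header against an expected column list before kicking off a multi-hour build. Everything in the sketch below, including the column names in the hypothetical `EXPECTED_COLUMNS` table, is an assumption for illustration only.

```python
import csv

# Hypothetical expected schemas per state; real column lists vary by state.
EXPECTED_COLUMNS = {
    "VT": ["voter_id", "last_name", "first_name", "registration_date", "status"],
    "NH": ["id", "name_last", "name_first", "reg_date", "voter_status"],
}

def check_header(state: str, path: str) -> list[str]:
    """Return a list of schema problems for one state's voter file, if any."""
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    expected = EXPECTED_COLUMNS[state]
    problems = []
    missing = set(expected) - set(header)
    extra = set(header) - set(expected)
    if missing:
        problems.append(f"{state}: missing columns {sorted(missing)}")
    if extra:
        problems.append(f"{state}: unexpected columns {sorted(extra)}")
    return problems
```

A check like this turns a schema break from a mid-pipeline failure into an error surfaced before any cluster time is spent.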

Investing in Pipeline Automation

Previously, each step in our pipeline had to be manually configured with several parameters (input URI, output URI, etc.). At best, configuring these commands was a pain; at worst, it could lead to stale data winding up in our voter file or to wasted engineering time. The person running the pipeline also had to watch for each step to finish, which was a serious distraction and limited their ability to get other work done. We decided that some lightweight pipeline automation could improve this process.

I built a new command that ran our entire pipeline from end to end. I used our cloud provider’s APIs to wait for each pipeline step to complete. If a step failed, I logged as much diagnostic information about the failure as possible and cancelled the pipeline. With this wrapper in place, the entire pipeline could be run with a single command. This also meant the pipeline runner didn’t have to pay attention to running jobs and could focus on other work.
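The sketch below shows the general shape of such a wrapper rather than Alloy's actual code; the step names and the `submit_step` / `get_step_state` helpers are hypothetical stand-ins for calls to a cloud provider's job APIs.

```python
import logging
import sys
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

# Hypothetical ordered pipeline steps; each maps to one formerly manual command.
STEPS = ["ingest_states", "normalize", "dedupe", "enrich", "publish"]

def submit_step(step: str) -> str:
    """Hypothetical helper: submit one step to the cluster, return its job ID."""
    raise NotImplementedError

def get_step_state(job_id: str) -> str:
    """Hypothetical helper: return 'RUNNING', 'COMPLETED', or 'FAILED'."""
    raise NotImplementedError

def run_pipeline() -> None:
    for step in STEPS:
        job_id = submit_step(step)
        log.info("submitted %s as %s", step, job_id)
        # Poll the cloud provider's API until the step finishes.
        while (state := get_step_state(job_id)) == "RUNNING":
            time.sleep(60)
        if state != "COMPLETED":
            # Log as much diagnostic context as possible, then stop the run
            # so later steps never consume a partial or stale output.
            log.error("step %s (job %s) ended in state %s; cancelling pipeline",
                      step, job_id, state)
            sys.exit(1)
        log.info("step %s completed", step)

if __name__ == "__main__":
    run_pipeline()
```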

Sharing Pipeline Duty

Now that we had a saner process for running the pipeline, we wanted to share this responsibility across our engineering team. Each sprint, we assigned a “pipeline runner” and accounted for this work in our engineering budget. When new input data landed, their top priority was to run the pipeline as soon as possible. This allowed us to identify schema breaks early and plan accordingly.

Closing Thoughts

After stabilizing the pipeline, we started collecting data about failures in our production system using Sentry. We also extended the pipeline wrapper to enable running the pipeline for multiple states in parallel. Previously, running the pipeline required too much babysitting for this to be feasible.
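As a rough illustration of the parallel extension (the state entry point and worker count below are assumptions, not Alloy's code), per-state runs can be fanned out with a thread pool, with failures reported to Sentry via `sentry_sdk.capture_exception`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

import sentry_sdk

sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")  # placeholder DSN

STATES = ["AK", "AL", "AR"]  # illustrative subset of the 50 states + DC

def run_state_pipeline(state: str) -> None:
    """Hypothetical per-state entry point into the pipeline wrapper."""
    raise NotImplementedError

# Fan out per-state runs; report any failure to Sentry without
# letting one state's schema break halt the others.
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(run_state_pipeline, s): s for s in STATES}
    for future in as_completed(futures):
        state = futures[future]
        try:
            future.result()
        except Exception as exc:
            sentry_sdk.capture_exception(exc)
            print(f"{state} failed: {exc}")
```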

The pipeline wrapper's task logs became a centralized way to observe the state of a running pipeline. They made it easy to communicate when we expected the pipeline to finish and to quickly diagnose failures.

Sharing pipeline duty made our engineering team stronger. Engineers gained an improved sense of what can go wrong in production and began logging proactively. Engineers also learned about parts of the pipeline that they did not develop and became better at reasoning about the pipeline as a whole.

The months leading up to the 2020 election were stressful, to say the least. For Alloy, the voter file pipeline was one thing we didn’t have to worry about.
