Identifying PAMBUNT Voters: Presidential And Mid-terms, But Unfortunately Not Town board
In a previous article, I wrote about how to setup a database and load it with the Board of Elections voter registration data. In this article I will describe how to clean and use one of the more useful fields of that data, the voter history, specifically to identify voters who vote in major elections, but skip local elections. To borrow a concept from the development world (LYBUNT/SYBUNT), I’ll call these voters PAMBUNTs: Presidential and Mid-terms, But Unfortunately Not Town board.
The voter history field describes which elections each registered voter has voted in, but not how they voted. This is extremely useful for identifying voters for canvasing based on the frequency of their voting (or non-voting). Identifying PAMBUNT voters is simply creating a list of people who voted in 2016 and 2018, but not in 2017. This work requires no fancy data science, machine learning or artifical intelligence!
In my small town in New York’s Hudson Valley, out of the 2,451 registered voters (all affiliations), 532 of them did not vote in the 2017 local election, but did vote in both the 2016 and 2018 elections. Meanwhile, our 2017 town board supervisor election was decided by a margin of only 41 votes. Even in a larger town, such as the Town of Poughkeepsie, local elections can be extremely close. The Town of Poughkeepsie has 26,510 total registered voters, with 5,893 that voted in 2016 and 2018, but not in the 2017 local election. Yet, the Poughkeepsie town board supervisor position was decided by a margin of 54 votes. Further, while it breaks my PAMBUNT acronym, the same trend holds for city councilmember elections. In the City of Poughkeepsie, the councilmember at large election was decided by a margin of 226 votes, while there are a total of 3006 PAMBUNT voters. For these elections, being able to identify voters that are active, but don’t usually participate in local elections is very important.
Before we can easily and accurately create lists like the above we need to examine the voter history field. Unfortunately, in New York State at least, this field is not very clean, and will need some processing. Lets start by exporting the voter history field, along with the county voter number field (so we can reassociate the cleaned data properly). Using SequelPro, here is the query I used:
Note that I also only export voters with a status of ‘ACTIVE’. The dataset includes all voters, even if they are currently inactive.
The primary problem with the voter history field is that a given election might be described by several different strings. There is no standardization. This makes it very difficult to do a search to capture, for example, everyone who voted in the 2016 general election.
This is shown below using a tool I wrote:
~/projects/nysboetool ] .build/debug/nysboetool --showuniques state_boe_active.csv | grep 20162016 General Election 2016 Presidential Primary 2016 Federal Primary Election 2016 Primary Election PRESIDENTIAL PRIMARY ELECTION 2016 Pres Primary Election 2016 2016 PRESIDENTIAL PRIMARY General Election 2016 2016 GENERAL ELECTION GENERAL 2016 GENERAL ELECTION 2016 PRESIDENTIAL PRIMARY 2016 General Election, 2016 20161108 GE 20160419 PP 20160628 PR Presidential Primary Election 2016 20160913 PR GE 20161108
The above list is only for elections in 2016, a similar problem exists for all years. You will notice that sometimes the year is first, sometimes last, sometimes the words are capitalized, etc. In order to fix this problem, we are going to standardize the election description, and then adjust each raw election string to match the corresponding standardized election string. Once adjusted, this data will be output to a new field, and we will load it into our database. This will allow the original data to be used (should we need it), but also allow us to do easy searches based on the cleaned up voter history data.
Cleaning up the data is an iterative process of examining the output and replacing strings such that an input string (left side of the arrow in my debug output) falls into the proper bucket (right side of the arrow). Each bucket is our standardized election. We want to standardize on GE for general elections (PP and PE for primary elections), followed by a two digit number for year (GE16, GE18 etc).
Looking only at 2016, let’s first capitalize each string, remove whitespace, remove punctuation, and make sure all letters come at the beginning:
~/projects/nysboetool ] .build/debug/nysboetool --debug -c state_boe_active.csv | grep 20162016 General Election --> GENERALELECTION2016 2016 Presidential Primary --> PRESIDENTIALPRIMARY2016 2016 Federal Primary Election --> FEDERALPRIMARYELECTION2016 2016 Primary Election --> PRIMARYELECTION2016 PRESIDENTIAL PRIMARY ELECTION 2016 --> PRESIDENTIALPRIMARYELECTION2016 Pres Primary Election 2016 --> PRESPRIMARYELECTION2016 GENERAL 2016 --> GENERAL2016 20161108 GE --> GE20161108 20160419 PP --> PP20160419 20160628 PR --> PR20160628 20160913 PR --> PR20160913
You’ll notice that already the number of unique elections (ie: buckets) is already shrinking. Lets replace ‘GENERALELECTION’ and ‘GENERAL’ with ‘GE’:
~/projects/nysboetool ] .build/debug/nysboetool --debug -c state_boe_active.csv | grep 20162016 General Election --> GE2016 2016 Presidential Primary --> PRESIDENTIALPRIMARY2016 2016 Federal Primary Election --> FEDERALPRIMARYELECTION2016 2016 Primary Election --> PRIMARYELECTION2016 PRESIDENTIAL PRIMARY ELECTION 2016 --> PRESIDENTIALPRIMARYELECTION2016 Pres Primary Election 2016 --> PRESPRIMARYELECTION2016 20161108 GE --> GE20161108 20160419 PP --> PP20160419 20160628 PR --> PR20160628 20160913 PR --> PR20160913
Replacing other strings using the same iterative process results in the correct number of election buckets:
~/projects/nysboetool ] .build/debug/nysboetool --debug -c state_boe_active.csv | grep 20162016 General Election --> GE2016 2016 Presidential Primary --> PP2016 2016 Federal Primary Election --> PE2016
I am not sure if I should bucket together the federal primary and the presidential primary, but most of my searches have been focused on the general elections, so it really doesn’t matter for our purposes. Lastly, we will remove the ‘20’ from the year:
~/projects/nysboetool ] .build/debug/nysboetool --debug -c state_boe_active.csv | grep 20162016 General Election --> GE16 2016 Presidential Primary --> PP16 2016 Federal Primary Election --> PE16
Now, lets remove the debug output option to generate the SQL code necessary to update our database. This will result in lines like this:
Redirecting this output to a file will allow it to be imported into SequelPro:
~/projects/nysboetool ] .build/debug/nysboetool -c state_boe_active.csv > update_voter_history.sql
Then, we duplicate the ‘VOTER_HISTORY’ field in SequelPro, naming the new field ‘CLEANED_VOTER_HISTORY’, and import the file we just created. With that, we can do much more accurate queries based on voter history. For example, this is the query for finding PAMBUNTs:
I have shared my code on Github. The entire process of cleaning the voter history field is very brittle, so modifications may be necessary for other datasets. The code I have written should be a good starting point. If anyone reading this has done similar work, I’d love to hear from you!