In this article I will build upon my previous articles and show how to use public property tax records to identify unregistered voters. In addition, New York State allows voters with second homes (or students in school) to register in either location. Your vote might not have a huge impact in New York City, but can mean much more in the rural counties of the state. Public tax records can help identify these second home owners as well. Outreach efforts can then use these lists to get more people registered in areas where they can make the biggest impact.
First, we need to get the public tax roll. For Dutchess County, the tax rolls are available online in PDF format. Initially, I tried converting these files to a machine readable format. It did not go well. Each file is laid out spatially in columns, with the most important information being the parcel address (the address being taxed), and the primary owner address.
The closest I got was by using Tabula. Tabula is an open source tool for “liberating data tables locked inside PDF files”. Unfortunately, due to the layout of our PDF, the resulting CSV was not ideal and needed quite a bit of processing. Some lines were formatted such that extracting the information was difficult or inaccurate. Through trial and error, I was able to extract the information I needed, but I knew there had to be a better way.
I contacted the Dutchess County Real Property Tax Service Agency via email and inquired about getting the data in a machine readable format. They responded that custom reports are available in Excel format, but for a fee.
After sending off my check (check?!), I received a zip file with two Excel files that included all the records for Dutchess County. Loading the data into our database was simply a matter of opening each file in Excel (or Numbers on a Mac) and exporting it as a CSV file. This file can then be imported into a new table using Sequel Pro, or whatever MySQL client you prefer.
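For readers who would rather script the import than use a GUI client, here is a minimal sketch in Python. It uses SQLite in place of MySQL so that it runs without a server. The table name `tax_roll`, the column names, and the two sample rows are all made up for illustration; the real county export will have its own headers.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; the real column names in the county files may differ.
csv_text = """parcel_address,owner_name,owner_address,property_class
12 MAIN ST,JANE DOE,12 MAIN ST,210
48 OAK RD,JOHN ROE,350 W 50TH ST,210
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tax_roll ("
    "parcel_address TEXT, owner_name TEXT, owner_address TEXT, property_class TEXT)"
)

# Read the CSV by header name so column order in the file does not matter.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    (r["parcel_address"], r["owner_name"], r["owner_address"], r["property_class"])
    for r in reader
]
conn.executemany("INSERT INTO tax_roll VALUES (?, ?, ?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM tax_roll").fetchone()[0]
print(row_count)
```

With a real export, `io.StringIO(csv_text)` would be replaced by an open file handle.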
The basic idea is to match the parcel address from the tax records to the address in the voter registration data. If there is no match, then that parcel address potentially has unregistered voters. Obviously, this is error prone, but can be a good starting point for identifying lists of voters for outreach purposes.
With our tax data imported, after much trial and error, I found the best approach was to make a custom address field in the voter registration data table that most closely resembles the address in the property tax table. This is done by taking the address number field and concatenating it with the street name field:
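As a rough sketch of that concatenation, again using Python with SQLite standing in for MySQL (where `CONCAT()` would do the same job): the table and column names (`voters`, `address_number`, `street_name`, `custom_address`) are assumptions, not the actual schema of the voter file. The trimming and upper-casing are extra normalization I am assuming would help the match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE voters ("
    "county_voter_number TEXT, address_number TEXT, street_name TEXT)"
)
# Made-up rows, including stray whitespace and mixed case.
conn.executemany(
    "INSERT INTO voters VALUES (?, ?, ?)",
    [("V001", "12", "Main St"), ("V002", " 7", "elm ave ")],
)

# Add the custom field and fill it with "<number> <street>", normalized.
conn.execute("ALTER TABLE voters ADD COLUMN custom_address TEXT")
conn.execute(
    "UPDATE voters SET custom_address = "
    "UPPER(TRIM(address_number) || ' ' || TRIM(street_name))"
)

# Index the new field so the later join does not scan the whole table.
conn.execute("CREATE INDEX idx_voters_custom_address ON voters (custom_address)")

custom_addresses = [
    row[0]
    for row in conn.execute(
        "SELECT custom_address FROM voters ORDER BY county_voter_number"
    )
]
print(custom_addresses)
```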
After creating this new custom field (and indexing on it!), we can run a query to join the two tables. I run a RIGHT JOIN looking only for rows without a county voter number, meaning non-matches. In addition, I include only residential property classes and filter out obvious cases of owners that are not individuals (LLCs etc).
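A minimal sketch of this kind of anti-join, on toy rows rather than the county data. It is written as a LEFT JOIN from the tax table, which is equivalent to a RIGHT JOIN with the tables swapped and also runs on older SQLite versions. The table names, the column names, the `'2%'` pattern for residential classes, and the `LLC` name filter are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE voters (county_voter_number TEXT, custom_address TEXT);
CREATE TABLE tax_roll (parcel_address TEXT, owner_name TEXT, property_class TEXT);
INSERT INTO voters VALUES ('V001', '12 MAIN ST');
INSERT INTO tax_roll VALUES
  ('12 MAIN ST', 'JANE DOE', '210'),
  ('48 OAK RD', 'JOHN ROE', '210'),
  ('9 MILL LN', 'MILL LANE HOLDINGS LLC', '210'),
  ('3 DEPOT SQ', 'SAM POE', '400');
""")

# Keep every parcel, attach a voter where the addresses match,
# then keep only the parcels where no voter was attached.
query = """
SELECT t.parcel_address, t.owner_name
FROM tax_roll AS t
LEFT JOIN voters AS v ON v.custom_address = t.parcel_address
WHERE v.county_voter_number IS NULL      -- no registered voter matched
  AND t.property_class LIKE '2%'         -- assumed residential classes
  AND t.owner_name NOT LIKE '%LLC%'      -- drop obvious non-individuals
ORDER BY t.parcel_address
"""
unmatched = conn.execute(query).fetchall()
print(unmatched)
```

On these toy rows only the Oak Rd parcel survives: Main St has a registered voter, Mill Ln is owned by an LLC, and Depot Sq is not in a residential class.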
With this simple query, I identified 16,903 potentially unregistered voters in Dutchess County. It is a rough approximation, but a good starting point for outreach efforts.
Second Home Owners
By slightly modifying the query above, we can identify second home owners. Here we want to use the tax records to find cases where the parcel address does not match the primary owner address. The simplest way to do this is to add one line to our query.
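Here is a sketch of what that one extra condition might look like, once more on toy rows with assumed table and column names (`tax_roll`, `parcel_address`, `owner_address`) and SQLite in place of MySQL. The only change from a plain non-match query is the final `<>` comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE voters (county_voter_number TEXT, custom_address TEXT);
CREATE TABLE tax_roll (parcel_address TEXT, owner_name TEXT, owner_address TEXT);
INSERT INTO voters VALUES ('V001', '12 MAIN ST');
INSERT INTO tax_roll VALUES
  ('12 MAIN ST', 'JANE DOE', '12 MAIN ST'),
  ('48 OAK RD', 'JOHN ROE', '48 OAK RD'),
  ('5 LAKE DR', 'ANN LEE', '350 W 50TH ST');
""")

query = """
SELECT t.parcel_address, t.owner_name
FROM tax_roll AS t
LEFT JOIN voters AS v ON v.custom_address = t.parcel_address
WHERE v.county_voter_number IS NULL
  AND t.parcel_address <> t.owner_address  -- owner's mail goes elsewhere
ORDER BY t.parcel_address
"""
second_homes = conn.execute(query).fetchall()
print(second_homes)
```

Oak Rd has no registered voter, but its owner lives there, so only the Lake Dr parcel is returned.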
The above query identifies 8,933 potential second home owners who are not registered to vote at their second home address. However, matching addresses using strings is inherently error prone. Typos, multi-address properties & other cases are common. We can help clean up some of the error cases by using the Levenshtein distance between the parcel address and the owner address. Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into the other.
First, I added a function to my database following the tutorial here. Then, with the function in place, I altered the above query to the following:
This gives 7,105 potential second home owners, unregistered in Dutchess County. I used a Levenshtein distance of greater than 4, which seemed to be a sweet spot for catching most cases without introducing more error. The query is much slower, but I didn’t spend much time optimizing it, and it could probably be improved with appropriate indexing.
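To make the threshold concrete, here is a small Python sketch of the standard dynamic-programming Levenshtein distance, together with a greater-than-4 check. This is not the database function from the tutorial. The helper name `looks_like_second_home` and the sample addresses are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


THRESHOLD = 4  # the "greater than 4" cutoff described above


def looks_like_second_home(parcel_address: str, owner_address: str) -> bool:
    # Small distances are treated as typos or formatting noise, not a second home.
    return levenshtein(parcel_address, owner_address) > THRESHOLD


typo_distance = levenshtein("12 MAIN ST", "12 MIAN ST")
print(typo_distance)
print(looks_like_second_home("12 MAIN ST", "12 MIAN ST"))
print(looks_like_second_home("5 LAKE DR", "350 W 50TH ST"))
```

A transposed pair of letters counts as two substitutions, so a simple typo stays well under the cutoff, while a genuinely different address clears it.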
It is important to note that this process is not exact. Using these techniques is merely a starting point.
The biggest issue with these techniques is the error inherent in matching street addresses from differing sources by comparing strings. The Levenshtein distance helped when comparing the parcel address to the owner address because we knew that those two addresses would be very different, if they were different at all. However, when matching an address from tax records to voter registration records, the difference might be only one character, yet be a valid difference.
One way to approach this problem would be to geocode all the addresses, then compare the resulting latitude and longitude coordinates. Essentially this would offload the street address resolution to the geocoding service. Perhaps in a future article I will explore the improvement we can gain by using this technique.
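As a sketch of the comparison step only (the geocoding call itself is left out, since it depends on the service chosen), two geocoded addresses could be treated as the same place when their coordinates fall within a small tolerance. The 25 meter tolerance and the coordinates below are made-up values for illustration, not real geocoder output.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two latitude/longitude points."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def same_location(p, q, tolerance_m: float = 25.0) -> bool:
    # p and q are (latitude, longitude) pairs; the tolerance is an assumed value.
    return haversine_m(*p, *q) <= tolerance_m


# Made-up coordinates standing in for geocoder results.
tax_point = (41.7000, -73.9000)
voter_point_near = (41.7001, -73.9000)  # roughly 11 m north
voter_point_far = (41.7100, -73.9000)   # roughly 1.1 km north

print(same_location(tax_point, voter_point_near))
print(same_location(tax_point, voter_point_far))
```

The tolerance would need tuning against real data, since geocoders can place a point at the parcel centroid, the rooftop, or the street frontage.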
For now, if anyone reading is involved with groups that might benefit from this data, please let me know! (Especially groups in Dutchess County, as I already have the data ready.) The data is the easy part; the difficult & time consuming part is the outreach effort that uses the data. That might be printing labels for existing mailers, using a service like lob.com to automate sending, or just old fashioned door to door outreach.
Whatever outreach work is done, having a good foundation of data is key, and while large campaigns already have databases, many smaller groups & campaigns can benefit from a grassroots approach.