Open Precinct Data: Schema 0.1 Proposal
Here’s a preliminary exploration of a new data format for storing U.S. electoral precincts, a follow-up to Open Precinct Data (April 2018).
Imagine if you could easily correlate detailed voting results from OpenElections.net or state boards of elections with mapped polygons and census geography over time.
Open precinct data supports a variety of needs. It provides missing geography to other data projects and makes community additions and conversation possible via a git-style mechanism. It makes precincts visible on a map and supports current and future elections by collecting data forward and backward in time. Throughout, it refers unambiguously to specific precincts at specific times, using stable, internally assigned, opaque, unique identifiers for every individual geometry.
The schema here is based on the ecosystem around GTFS, the data standard for transit schedule data successfully pioneered by Portland TriMet and Google Maps over ten years ago. Some design goals include:
- Ease of use by data scientists and political scientists, whose tools include Python or R notebooks, spreadsheet applications, or GIS software.
- Accessibility to parsing software similar to Partridge (by Danny Whalen and Remix).
- Compatibility with parallel electoral data projects such as Open Elections.
Instead of simply documenting a tabular schema, this post shows the data in use to answer basic questions about electoral geography. We can get to feature documentation later, but first we need to validate that the format addresses basic needs. The remainder of this post is excerpted from a Python notebook.
The sample data here covers a small part of North Carolina. shapes.shp includes geometries for Congressional districts with 2012, 2014, and 2016 borders, Johnston and Alamance counties, and all their voting precincts. U.S. House candidates for each included district over three general elections are included. The data is also available in Google Docs for easier browsing.
Loading Data
Open Precinct Data is stored in a zip file with six contained files:
- elections.csv – one or more elections
- districts.csv – electoral districts for each election
- candidates.csv – candidate details such as political party and incumbency for each district
- precincts.csv – voting precincts where candidate votes are tallied for each election
- shapes.shp – geographic areas for precincts and districts
- sources.csv – names and links for official sources of data
We start by loading data from each of these files into pandas DataFrames, using GeoPandas for shapes.
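The loading step might look roughly like the sketch below. It is not the notebook's original code: the helper names load_tables and make_demo_zip, the id and name columns, and the demo rows are all made up for illustration, and the zip is built in memory so the example runs on its own. Only the CSV file names come from the list above. shapes.shp is left out here; it would be read separately, for example with geopandas.read_file().

```python
import io
import zipfile

import pandas as pd

# CSV file names taken from the list above. shapes.shp is not a CSV and
# would be read separately, for example with geopandas.read_file().
CSV_NAMES = ["elections", "districts", "candidates", "precincts", "sources"]


def load_tables(zip_source):
    """Read each CSV inside the zip into a DataFrame keyed by table name."""
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in CSV_NAMES:
            with archive.open(f"{name}.csv") as handle:
                # dtype=str keeps identifiers as text instead of numbers
                tables[name] = pd.read_csv(handle, dtype=str)
    return tables


def make_demo_zip():
    """Build a tiny in-memory zip with made-up rows (hypothetical columns)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in CSV_NAMES:
            archive.writestr(f"{name}.csv", "id,name\nOPID:1,example\n")
    buffer.seek(0)
    return buffer


tables = load_tables(make_demo_zip())
print({name: len(frame) for name, frame in tables.items()})
```

Reading every column as text is a deliberate choice in this sketch: opaque identifiers should never be treated as numbers.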
One Precinct
Let’s look at a single precinct. The term precinct is used loosely in this project. A precinct is any geographic area where votes are counted. For example, absentee ballots for entire counties may be included in this list. Here, we select a single precinct covering Haw River in Alamance County.
- PSID:1158849879 is a unique, opaque identifier for geographic shapes in shapes.shp
- OPID:1360711279, OPID:1360711281, and OPID:1360711283 are unique, opaque identifiers for three elections in elections.csv
- OPID:1360711289, OPID:1360711291, and OPID:1360711297 are unique, opaque identifiers for three districts in districts.csv
- The precinct changes name over time, but it’s always the same geographic area
- For the first two elections the precinct covers two districts, OPID:1360711289 and OPID:1360711291
- In the final election held after North Carolina’s 2016 redistricting, the precinct covers just one district, OPID:1360711297
- Both OPID:1360711291 and OPID:1360711297 can be called “District 6” but each belongs to a different plan
- Identifiers like OPID:nnn+ and PSID:nnn+ are completely opaque and internal to Open Precinct Data
- Identifiers like FIPS:nnn+ are defined by the U.S. government and used by the Census
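Because every identifier carries a namespace prefix, code can tell internal IDs from government-defined ones without interpreting the digits. The sketch below assumes identifiers always have the form of an uppercase namespace, a colon, and one or more digits, as the nnn+ notation suggests. The function name id_namespace is hypothetical and not part of the proposal.

```python
import re

# Assumed identifier shape: uppercase namespace, colon, one or more digits.
ID_PATTERN = re.compile(r"^(?P<namespace>[A-Z]+):(?P<value>\d+)$")


def id_namespace(identifier):
    """Return the namespace prefix of an identifier such as 'OPID:123'."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a namespaced identifier: {identifier!r}")
    # The digits stay opaque: only the prefix is inspected.
    return match.group("namespace")


print(id_namespace("OPID:1360711289"))
print(id_namespace("PSID:1158849879"))
# 37 is North Carolina's state FIPS code
print(id_namespace("FIPS:37"))
```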
Connecting Precincts to Elections
Our sample data includes two counties and three elections. Let’s look at candidate incumbency for an election in one county: 2014 in Alamance County. We start by matching on elections.election_date and precincts.county_name, and using pandas.merge() to select a subset of precincts.
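A minimal sketch of that selection follows, using tiny made-up tables in place of the real files. The election_date and county_name columns are named in the text. The election_id join key, the precinct_name column, and all row values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; election_id and precinct_name are hypothetical column names.
elections = pd.DataFrame({
    "election_id": ["OPID:1", "OPID:2"],
    "election_date": ["2012-11-06", "2014-11-04"],
})
precincts = pd.DataFrame({
    "precinct_name": ["A", "B", "C", "D"],
    "county_name": ["Alamance", "Alamance", "Johnston", "Alamance"],
    "election_id": ["OPID:2", "OPID:2", "OPID:2", "OPID:1"],
})

# Narrow each table first, then join the two on the shared election key.
election_2014 = elections[elections["election_date"] == "2014-11-04"]
alamance = precincts[precincts["county_name"] == "Alamance"]
alamance_2014 = pd.merge(alamance, election_2014, on="election_id")

print(alamance_2014[["precinct_name", "county_name", "election_date"]])
```

The default inner join drops precinct D, which belongs to the 2012 election, and filtering by county drops precinct C.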
Party incumbency is useful when predicting election outcomes: candidates running for re-election often have a track record and name recognition which can help them in the polls. We perform another pandas.merge() to connect our Alamance 2014 precincts to incumbent candidates, this time with a left join.
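The left join matters because a district with no incumbent should still keep its precincts in the result. The sketch below shows this with made-up rows. The column names district_id, candidate_name, party, and incumbent are assumptions and not confirmed parts of the schema.

```python
import pandas as pd

# Made-up rows with hypothetical column names.
precincts = pd.DataFrame({
    "precinct_name": ["P1", "P2", "P3"],
    "district_id": ["OPID:10", "OPID:11", "OPID:12"],
})
candidates = pd.DataFrame({
    "district_id": ["OPID:10", "OPID:10", "OPID:11", "OPID:12"],
    "candidate_name": ["Candidate A", "Candidate B", "Candidate C", "Candidate D"],
    "party": ["DEM", "REP", "REP", "DEM"],
    "incumbent": [True, False, True, False],
})

# Keep only incumbents, then left-join so precincts in open-seat
# districts survive with missing values instead of disappearing.
incumbents = candidates[candidates["incumbent"]]
with_incumbents = pd.merge(precincts, incumbents, on="district_id", how="left")

print(with_incumbents[["precinct_name", "candidate_name", "party"]])
```

With an inner join, precinct P3 would vanish because its district has no incumbent. The left join keeps it with empty candidate fields, which is how an open seat shows up in the table.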
The complete list of 38 precincts above has a mix of Republican and Democratic incumbents along with an empty seat. In 2014, three U.S. House districts overlapped with Alamance County.
Output to GIS
Now we can link the table above to geographic areas to see how this county election looks on a map. Both precincts and districts include a shape_id foreign key that we can find in shapes.shp. After merging, we use geopandas.GeoSeries.intersection() to split multi-district precincts like Haw River (precinct 13) among their districts.
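Here is one way the split could work, sketched with rectangles standing in for real precinct and district shapes. The shape_id key comes from the text. The district_id and precinct_name columns, the one-row-per-precinct-per-district layout, and every geometry and ID value are assumptions for illustration.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Stand-in for shapes.shp: one square precinct and two districts that
# each cover half of it. All IDs and coordinates are made up.
shapes = gpd.GeoDataFrame(
    {"shape_id": ["PSID:1", "PSID:2", "PSID:3"]},
    geometry=[box(0, 0, 2, 2), box(0, 1, 4, 4), box(0, -2, 4, 1)],
)
# Assumed layout: one precinct row per district the precinct touches.
precincts = pd.DataFrame({
    "precinct_name": ["Sample", "Sample"],
    "district_id": ["OPID:20", "OPID:21"],
    "shape_id": ["PSID:1", "PSID:1"],
})
districts = pd.DataFrame({
    "district_id": ["OPID:20", "OPID:21"],
    "shape_id": ["PSID:2", "PSID:3"],
})

# Pair each precinct row with its district, keeping both shape keys.
joined = precincts.merge(
    districts, on="district_id", suffixes=("_precinct", "_district")
)

# Look up both geometries per row, then intersect them row by row.
geoms = shapes.set_index("shape_id").geometry
precinct_geoms = gpd.GeoSeries(geoms.loc[joined["shape_id_precinct"]].to_list())
district_geoms = gpd.GeoSeries(geoms.loc[joined["shape_id_district"]].to_list())
joined["geometry"] = precinct_geoms.intersection(district_geoms)

pieces = gpd.GeoDataFrame(joined, geometry="geometry")
print(pieces[["precinct_name", "district_id"]].assign(area=pieces.geometry.area))
```

Each output row is the part of the precinct that falls inside one district, so a precinct spanning two districts becomes two polygons that can be styled separately on a map.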
Finally, we can see precinct 13 in the map below, with its northern portion in David Price’s Democratic District 4 and its southern portion in the open-seat District 6. A small corner of the county falls in Renee Ellmers’s Republican District 2.
Conclusions
The example above is small and contrived.
Next steps might include putting the sample zip file through more scrutiny. Open Elections precinct-level results might be connected to the unique IDs used here. A larger sample file covering several elections in North Carolina would support a more consequential exercise. The data could be extended to cover elections for more chambers than the U.S. House of Representatives. Linking tables like candidates.csv might belong in another data project like Open Elections. Spatial data in shapes.shp could be stored in a different format such as GeoJSON or GeoPackage.
Get in touch if this sounds interesting to you.
Thanks to Brian, Danny, Derek, Katie, Nelson, and Stephen for their feedback on early drafts of this post.