High-Volume Offline Scoring of Plans

Alec Ramsay
Dave’s Redistricting
Nov 9, 2023

In this post I describe how and why I used code and data that Dave’s Redistricting (DRA) makes available to researchers (post) to “score” large ensembles of congressional redistricting plans.¹

Figure 1: Sample code using analyze_plan()

A friend and colleague, Todd Proebsting, and I are doing some independent research in which we’re generating large ensembles of congressional redistricting plans — 1 thousand, 10 thousand, maybe even 100 thousand randomly generated plans for a state. In order to better understand the trade-offs inherent in congressional redistricting for a state, we want DRA-style analytics for those plans.

This presented a problem, as we didn’t want to — couldn’t really — import each one manually into DRA, and then manually record some subset of the analytics DRA computes. So, I created a Python project (alecramsay/rdafn) that uses the Python port of the DRA analytics (dra2020/rdapy) and the precinct-level census & election data from DRA (dra2020/vtd_data) to “score” these plans offline.

Update 11/12/23: I pulled the data out of the rdafn repo into its own rdadata repo (alecramsay/rdadata). The two READMEs have updated installation instructions.

Update 12/11/23: I’ve archived the old repositories in my personal GitHub account. You can find them now (slightly renamed) in a project account: [rdatools/rdabase](https://github.com/rdatools/rdabase) and [rdatools/rdascore](https://github.com/rdatools/rdascore).

Data

As you can see in the sample code in Figure 1, the function analyze_plan() takes five arguments. The first is a plan to score; the rest describe the state and are the same for every plan.

The plan is represented as a list of precinct assignments — dicts with “GEOID” and “DISTRICT” keys. For a few reasons — including easy access to data — we use whole precincts, not more granular census blocks, in the plans in our ensembles.
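For illustration, a plan in this representation is just a Python list like the following sketch (the GEOIDs are made up):

```python
# A plan: one {"GEOID", "DISTRICT"} dict per precinct in the state.
# The GEOIDs below are made up for illustration.
plan = [
    {"GEOID": "37001000001", "DISTRICT": 1},
    {"GEOID": "37001000002", "DISTRICT": 1},
    {"GEOID": "37003000001", "DISTRICT": 2},
    # ... one assignment for every precinct
]
```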

The other four inputs are precinct-level data for the state (a sketch of a full scoring call follows the list):

  • Census and election data from dra2020/vtd_data, abstracted and combined using the extract_census.py, extract_elections.py, and join_data.py scripts. The last script pickles the result for fast loading.
  • Precinct² shape data abstracted using the extract_shape_data.py script. Again the results are pickled for fast loading. Much more on this below.
  • An adjacency graph for the precinct shapes. These were copied from the alecramsay/baseline project where I created them to support finding root maps.³
  • Some metadata for the state, such as the number of congressional districts and the number of counties.
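
Putting it together, a scoring call looks roughly like the sample code in Figure 1. This is only a sketch: the import path, variable names, and argument order are my shorthand for the post's description, not verified against the repo.

```python
from rdascore import analyze_plan  # hypothetical import path

# The four state-level inputs (data, shapes, graph, metadata) are
# loaded once from their pickled/preprocessed forms and reused for
# every plan in the ensemble; loading code is omitted here.
scorecard = analyze_plan(plan, data, shapes, graph, metadata)

# The result is a flat dict of plan-level metrics (see the example
# scorecard below).
print(scorecard["population_deviation"])
```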

Scorecard

With those inputs, I can “score” plans. Here’s a pretty-printed example for a North Carolina plan:

population_deviation: 0.0091
estimated_vote_pct: 0.4943
pr_deviation: 0.0447
pr_seats: 7
pr_pct: 0.5
estimated_seats: 6.3746
estimated_seat_pct: 0.4553
fptp_seats: 7
disproportionality: 0.039
efficiency_gap: 0.0332
gamma: 0.0191
seats_bias: 0.0188
votes_bias: 0.0041
geometric_seats_bias: 0.0185
global_symmetry: 0.0054
declination: 2.3489
mean_median_statewide: -0.007
mean_median_average_district: -0.0071
turnout_bias: 0.0
lopsided_outcomes: -0.0004
competitive_districts: 5.85
competitive_district_pct: 0.4179
responsiveness: 4.4657
responsive_districts: 5.85
responsive_district_pct: 0.4179
overall_responsiveness: 7.8116
avg_dem_win_pct: 0.5626
avg_rep_win_pct: 0.5741
opportunity_districts: 0.0
proportional_opportunities: 5
coalition_districts: 7.3259
proportional_coalitions: 5
reock: 0.4947
polsby_popper: 0.3361
county_splitting: 1.2739
district_splitting: 2.07
proportionality: 81
competitiveness: 56
minority: 50
compactness: 78
splitting: 14

For a variety of reasons, DRA’s production TypeScript package dra2020/dra-analytics does not calculate a few minor things that show up in the UI. Hence, the Python port (dra2020/rdapy) doesn’t either. My project uses the latter, so those few things also aren’t in the “scorecard” output above.

To keep the results simple, I also suppressed district-level results. The scorecard is a simple flat dictionary of plan-level metric key/value pairs.

Performance

More consequentially, to maximize throughput, I chose not to compute KIWYSI compactness (see Compactness). Here’s why.

The straightforward way to calculate compactness metrics for a plan is, of course, to first create district shapes based on the precinct assignments and then compute the metrics using those shapes. Unfortunately, the simple naive approach to creating district shapes — “dissolving” precinct shapes into district shapes — is a very expensive operation. Even with just precinct shapes (i.e., not blocks), that can take ~60 seconds for a North Carolina plan.
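For concreteness, here's roughly what that naive approach looks like; using geopandas, and the shapefile name, are assumptions for illustration, not the project's actual code:

```python
import geopandas as gpd

# Tag each precinct with its district, then dissolve the precinct
# polygons into district polygons: the expensive step.
precincts = gpd.read_file("NC_vtd.shp")  # hypothetical shapefile
assignment = {a["GEOID"]: a["DISTRICT"] for a in plan}
precincts["DISTRICT"] = precincts["GEOID"].map(assignment)
districts = precincts.dissolve(by="DISTRICT")
```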

As you might expect, DRA uses a much faster approach. For each state, it preprocesses the census-block shapefile into a topology using TopoJSON. Then, at runtime, DRA proper uses the TopoJSON merge() function to produce district shapes. That’s all done in Node and TypeScript and is very fast.

You can call JavaScript from Python, so I first tried to use that same function from within Python. It worked but was still far too slow for our purposes: ~5 seconds per plan. Virtually all of the time was spent calling TopoJSON merge() from Python and marshalling the result back across processes from JavaScript.

I could have implemented a native Python version of merge(). Instead, I chose to skip KIWYSI compactness — which requires the actual district shapes — and calculate just the two metrics that DRA uses to rate compactness: Reock and Polsby-Popper. They depend only on the area, perimeter, and diameter of each district. If you can infer those properties, you don’t actually have to construct the district shapes themselves!
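To make that concrete, here's a sketch of the two formulas. Note that approximating Reock's minimum bounding circle by a circle whose diameter is the district's diameter is my simplification, not necessarily DRA's exact computation:

```python
import math

def polsby_popper(area: float, perimeter: float) -> float:
    # 4 * pi * A / P^2: 1.0 for a circle, lower for less compact shapes.
    return 4 * math.pi * area / perimeter**2

def reock(area: float, diameter: float) -> float:
    # District area over the area of an enclosing circle, here taken
    # to be a circle whose diameter is the district's diameter.
    return area / (math.pi * (diameter / 2) ** 2)
```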

I wrote the extract_shape_data.py script to preprocess the precinct shapefile for a state and abstract the precinct shapes down to just the information I needed to impute the area, perimeter, and diameter of districts without actually creating the district shapes. As a result, running analyze_plan() on a congressional plan for North Carolina — i.e., calculating all the analytics shown above — takes a small fraction of one second 🤓. This gives us the kind of throughput we need to score large ensembles.
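The core of the trick is that a district's area and perimeter can be computed from per-precinct quantities alone. Here's a minimal sketch of the idea, assuming the preprocessing yields per-precinct areas and perimeters plus the lengths of borders shared between adjacent precincts (those data structures are my assumptions, not the script's actual format):

```python
def impute_area(district_geoids, area_of):
    # A district's area is the sum of its precincts' areas.
    return sum(area_of[g] for g in district_geoids)

def impute_perimeter(district_geoids, perimeter_of, shared_border):
    # Sum the precinct perimeters, then subtract each border shared by
    # two precincts in the same district twice (once per precinct).
    members = set(district_geoids)
    total = sum(perimeter_of[g] for g in district_geoids)
    for (a, b), length in shared_border.items():
        if a in members and b in members:
            total -= 2 * length
    return total
```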

Conclusion

The code and data that DRA makes available is letting us do some really interesting and important research!

Footnotes

  1. I wrote the analytics package for DRA. I also do independent research on various aspects of redistricting.
  2. I use “precinct” instead of the more formal “VTD” throughout.
  3. For more information on root maps, see Redistricting Trade-offs Illuminated.
