Custom Datasets in DRA 2020

Terry Crowley
Dave’s Redistricting
17 min read · Jul 17, 2024

DRA 2020 now supports taking existing datasets, available as shapefiles, GeoJSON files or CSV files, and importing them. These datasets can be added to a map where they will be aggregated and show up in details panels for blocks, precincts, districts, cities, counties and states. An imported election-type dataset can be specified as the primary election dataset and used for doing partisan analysis.

Datasets can also be used to color precincts and districts to aid in analyzing and building your map. Aggregates of custom datasets will be exported as part of the precinct and district data export features to aid in further analysis outside the app.

Datasets can also be published and made available to other users to incorporate into their maps, or shared using the existing Groups mechanism. In order to achieve this level of integration, we need to impose certain constraints on the kind of data that can be imported.

Datasets are State Specific

A custom dataset is imported for one state at a time. If you have data available for multiple states, you need to import it as multiple datasets. When adding a dataset to a map, you see a filtered list of only the datasets that have been imported for that state.

Data are Whole Numbers, Disaggregated to Blocks

The data will be represented as whole numbers (integers). Fractional numbers are rounded on import. The import process maps the shapes you provide to 2020 Census blocks and then allocates a whole-number share of the data associated with each shape to every block mapped to it, weighted by the total 2020 Census population of that block.

This is basically the same process we use to allocate precinct-level voting results to the blocks making up the precinct.

Aggregation above the block level then involves simply summing the block-level data for every geographic unit above.
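
As a minimal sketch of this roll-up (not DRA's actual code), the Python below sums block-level values up to tracts and counties. It relies on the standard nesting of Census GEOIDs, where the first 5 digits of a 15-digit block GEOID identify the county and the first 11 identify the tract. The function name aggregate_by_prefix and the sample values are hypothetical.

```python
from collections import defaultdict

def aggregate_by_prefix(block_values, prefix_len):
    """Sum block-level values for every unit identified by a GEOID prefix."""
    totals = defaultdict(int)
    for geoid, value in block_values.items():
        totals[geoid[:prefix_len]] += value
    return dict(totals)

# Hypothetical block-level values keyed by 15-digit block GEOID.
blocks = {
    "010010201001000": 12,
    "010010201001001": 7,
    "010010202001000": 5,
    "010030101001000": 9,
}

tracts = aggregate_by_prefix(blocks, 11)    # census tracts
counties = aggregate_by_prefix(blocks, 5)   # counties
```

Precincts, cities and districts do not nest by GEOID prefix, so summing for those units would need an explicit block-to-unit lookup instead of a prefix.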

CSV Import Uses Census GEOIDs

If data is provided in CSV (comma-separated values) format, the first line contains the field keys and the first column of each line contains a 2020 Census GEOID: either a census block (15 digits), block group (12 digits), census tract (11 digits) or county (5 digits) ID. Precinct (voting tabulation district) IDs are also supported, but only the identifiers used for the base 2020 precinct shapes in DRA.

For convenience, a CSV file may contain data for multiple states; only the data associated with a specific state is extracted during import.
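
The following Python sketch shows one way to recognize the GEOID level from its length and to keep only the rows for one state. It assumes the usual Census convention that a GEOID starts with the two-digit state FIPS code; precinct IDs are not handled, and the function names and sample rows are hypothetical rather than anything DRA exposes.

```python
# GEOID lengths as listed above.
GEOID_LEVELS = {15: "block", 12: "block group", 11: "tract", 5: "county"}

def geoid_level(geoid):
    """Return the Census geography level implied by the GEOID length, or None."""
    if not geoid.isdigit():
        return None
    return GEOID_LEVELS.get(len(geoid))

def rows_for_state(rows, state_fips):
    """Keep only rows whose GEOID (first column) starts with the state FIPS code."""
    return [row for row in rows if row[0].startswith(state_fips)]

# Hypothetical rows from a multi-state CSV: GEOID first, then a data value.
rows = [
    ["010010201001000", "12"],  # Alabama block
    ["01001", "340"],           # Alabama county
    ["060371011101", "55"],     # California block group
]
alabama = rows_for_state(rows, "01")
```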

Imported Files May Contain Data for Multiple Datasets

A shape or CSV file may contain the data for multiple datasets; the import process allows you to sequentially select different sets of fields and import them as separate datasets.

Imported Shapes are Mapped to 2020 Census Blocks

As mentioned above, shapes are mapped to 2020 Census blocks. If your shapes do not follow Census block boundaries, there will be some “fuzziness” to the disaggregation process.

To explore an example, the Census provides income information in a variety of different formats. One format specifies the number of households in each block group in each one of a range of income levels (e.g. number of households with income less than $10K, less than $20K, etc.). This disaggregates and aggregates nicely and in fact we have included income (and education) data as an initial Published custom dataset for all states.

An alternate Census dataset specifies median household income for each block group. This data does not disaggregate and aggregate in any straightforward way and therefore would not be appropriate to expose using this feature.
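
A small Python illustration of the difference, using made-up household incomes for two block groups: counts per income bracket can simply be added, while the median of the combined area cannot be recovered from the two medians.

```python
from statistics import median

# Hypothetical household incomes in two block groups.
group_a = [20_000] * 60 + [40_000] * 40   # 100 households
group_b = [90_000] * 10                   # 10 households

median_a = median(group_a)
median_b = median(group_b)
combined_median = median(group_a + group_b)   # needs the raw data
average_of_medians = (median_a + median_b) / 2

# Counts per bracket (under $50K, $50K and over) aggregate by summing.
counts_a = [100, 0]
counts_b = [0, 10]
counts_combined = [a + b for a, b in zip(counts_a, counts_b)]
```

Here the combined median is 20,000 while averaging the two medians gives 55,000, so no simple combination of the per-area medians gives the right answer.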

Map Properties

The most common way you start using the feature is to add an imported custom dataset to a map in the Data Selector section of the Map Properties panel.

Add Custom Datasets to a map from the map properties panel

The Choose Custom Datasets menu displays the datasets available for the current map’s state. Multiple datasets can be selected. This menu works just like the Choose Datasets menu for built-in datasets. After selecting the datasets you want, click away from the menu to dismiss it.

Choose Custom Datasets menu

Once you have selected the datasets you want to use, selecting Apply in the Map Properties panel applies these changes. These datasets now show up wherever dataset details are displayed.

Details panel displaying custom dataset

If you have imported custom election datasets, these datasets can be selected using the Election button under the Primary Datasets section.

Choose Primary Selected Datasets Dialog with Custom Datasets

The dialog displayed then allows you to choose a custom election dataset.

Once your dataset is selected as the primary election dataset, it will be used for all election features, including precinct and district partisan coloring, precinct labeling and partisan analytics. Use this with care, since statewide analytics can give misleading results when based on poorly chosen elections. Congressional election results, for example, are frequently misleading because races where a candidate runs unopposed can distort the analysis. The analysis is only as good as the data that feeds it.

Imported election datasets can also be selected in the election swing dialog (available from the Colors panel).

Any datasets that have been attached to a map are visible to any user who can access that map (through the publishing, sharing or group features). However, users can attach to their own maps only those datasets that they have direct access to.

Dataset Lists

There are three new list options under the My Maps drop-down.

List selector with Custom Dataset options

My Datasets shows all the datasets you have already imported for your own maps. It works like My Maps and allows you to view and set properties on these datasets. This also provides the entry point for importing new datasets (the main import button also now supports a dropdown that lets you import maps, custom layers or custom datasets from My Maps as well).

Import Datasets available from My Datasets list

Published Datasets shows a list of datasets that have been published by other users (including 2022 income and education Census data published by the official dra2020 account as well as additional official datasets we will add over time).

Like maps, the dataset lists can be filtered using the text box at the top of the list. Fields to filter include name, modified, owner, state, type, year, official, published, and show.

You can also edit metadata properties for your existing datasets, including name and description, as well as field ordering, captions and coloring options.

Other options available here include publishing (or unpublishing), deleting, and showing (or hiding). Deleted datasets will appear in the Trash list where they can be undeleted. Hidden datasets do not appear as choices in the places where datasets are selected in the UI but continue to be available to existing maps that already use them. Datasets published by other users are by default hidden so they don’t pollute your dataset selection UI. You can browse the published list of datasets and explicitly unhide them to make them available to add to your own maps.

Publishing a dataset makes it available to any DRA 2020 user. Make sure you have rights to any data you share. Also, take extra care to provide good dataset names and descriptions and field names and captions. Good practice is to include a URL or other information in the description that explains the provenance of the dataset.

My Group Datasets shows datasets that have been shared through the Groups feature. Datasets can be assigned to one or more groups and group members can access them from this list as well as mark them unhidden so they show up wherever datasets can be selected in the UI.

Importing a Dataset

Everything starts with importing a dataset using the Import Dataset button at the top of the My Datasets list. When you select this, the following dialog is displayed.

Import Dataset Dialog

Once the state has been specified and a file chosen, the file is analyzed and you can start to specify how you want the data imported.

Import New Datasets Dialog

If you are importing using shapes, we will start running a process in the background to map those shapes to blocks while you specify additional information about how to import the dataset. You cannot finish importing until that background process completes.

Providing a good short name for your dataset is important, since this is how it will show up within the UI for selecting the dataset to display and as a header in the details panels. The description is important if you share or publish the dataset (or just to remind yourself where the data came from). We also require a four-digit year (e.g. “2022”) to help sort collections of datasets.

I’ll come back to the “Type” field in a moment.

The main section of the dialog shows the fields that were found in the file and that are available for import. An icon next to the field indicates and controls its status: import, don’t import, not available for import (e.g. contains string values) or included as part of another field (more on that below). You can use the filter text box to filter to specific fields (this is especially useful if the dataset file you are reading from combines lots of different data fields).

You specify which fields to import or exclude by selecting the field (or fields) and then selecting the check button (include) or the exclude button.

Note that most datasets will include fields that look like numbers (e.g. a unique shape identifier) but that you should exclude from the import. When you select a field, the available metadata for that field displays below the list of fields. You can only specify the metadata for fields you are importing. Datasets often include a Total field which you can omit since we will compute that automatically. If you do include it, you must use “Tot” as the short caption so it is handled correctly.

Editing a Field’s Properties during Dataset Import

The “Short Caption” for the field displays as the label in the details panel. The “Long Caption” will display as a tooltip if a user hovers over the label. Note that in most cases in the wild, field names within a dataset file tend to be pretty obscure, so you will typically want to go through and make sure you provide good captions here (often by cross-referencing with some other document that describes the meaning of each field key).

For imported CSV files, we support a common format used by Census CSV files where the first line contains the field keys and the second line contains short descriptions. The short descriptions are pre-loaded as the suggested “Short Caption” (although Census descriptions are generally longer than you really want).
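
To make the two-header layout concrete, here is a minimal Python sketch that reads such a file and pairs each field key with its description. The field keys (B001, B002), the descriptions and the function name read_census_csv are invented for illustration; real Census files use their own keys.

```python
import csv
import io

def read_census_csv(text):
    """Parse a CSV whose first row holds field keys and second row holds descriptions."""
    reader = csv.reader(io.StringIO(text))
    keys = next(reader)
    descriptions = next(reader)
    # Skip the first column, which holds the GEOID rather than a data field.
    captions = dict(zip(keys[1:], descriptions[1:]))
    rows = [dict(zip(keys, row)) for row in reader]
    return captions, rows

# Hypothetical file contents: keys, descriptions, then one data row.
sample = (
    "GEOID,B001,B002\n"
    "Geography,Less than high school,High school diploma\n"
    "010010201001,14,32\n"
)
captions, rows = read_census_csv(sample)
```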

There are several other additional properties. The “Add Together” drop-down allows you to perform some simple aggregation during the import process. For example, with our education dataset example, we might want to add together the “no high school diploma” and “high school diploma” fields and import a single “high school or less” field that adds those numbers together. When you specify an additional field to add, the status icon of that field in the field list changes to a “+”.

Adding Fields during Import

Once a dataset is imported, only the field where the summing occurred appears in the dataset and displays in the details panel. This can simplify presentation in details panels (e.g. you could create a simpler “College/No College” dataset by adding together the various fields in the Census educational attainment data file).
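
The effect of “Add Together” on a single record can be sketched in Python as below. The field names and numbers are hypothetical, and add_together is an illustrative helper, not part of DRA.

```python
def add_together(row, target, extra_fields):
    """Return a copy of row where extra_fields are summed into target and removed."""
    merged = dict(row)
    for field in extra_fields:
        merged[target] += merged.pop(field)
    return merged

# Hypothetical educational attainment counts for one shape.
row = {"no_hs": 14, "hs": 32, "some_college": 21, "bachelors": 18}

# "High school or less" = no_hs + hs; the hs field no longer appears on its own.
simplified = add_together(row, "no_hs", ["hs"])
```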

The “Show in Color By” checkbox controls whether you can color precincts and districts by the value of this field using the Custom color dropdown in the Colors panel. Coloring will be by grayscale, with increasing darkness by the increasing percentage value of this field compared to the total of all the fields in the dataset.

This simple default coloring scheme typically only makes sense if each field represents a distinct category (for example as we use for race and ethnicity). If you specify custom coloring options (described below) the “Show in Color By” setting is ignored.

Election Datasets

You can import new election datasets for partisan analysis and coloring by selecting Election from the “Type” dropdown.

An election dataset has strict schema requirements in order to integrate with all the other features in DRA 2020. These include:

  • It must have three fields.
  • It must have a field with Short Caption “D” that represents the Democratic vote total.
  • It must have a field with Short Caption “R” that represents the Republican vote total.
  • It must have a field with Short Caption “Tot” that represents the total vote count.

Tot does not need to add up to R + D; any remainder will be treated as Other votes.

If your metadata does not meet these requirements, it will be imported as type “Other” rather than type “Election” and will not be available as a primary election dataset choice.
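
The schema rules above can be expressed as a short check. This Python sketch is only an illustration of the rules as stated; treating the captions as case-sensitive is an assumption, and the function names are hypothetical.

```python
def classify_dataset(short_captions):
    """Return "Election" if the captions satisfy the election schema, else "Other"."""
    # Exactly three fields, captioned D, R and Tot (assumed case-sensitive).
    if len(short_captions) == 3 and set(short_captions) == {"D", "R", "Tot"}:
        return "Election"
    return "Other"

def other_votes(d, r, tot):
    """Votes not cast for D or R; Tot need not equal D + R."""
    return tot - d - r
```

For example, a record with D = 480, R = 500 and Tot = 1000 would leave 20 votes counted as Other.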

After importing, the dataset is available from the My Datasets list and can be further customized (described below).

You can use this ability to create new custom election datasets to build a custom election composite. This is explored in detail in Building a Custom Election Composite.

Troubleshooting Import

The shape import process we use for datasets is identical to the process we use for importing or coloring maps from shapes. The process involves using a well-chosen interior point within each block and determining which shape contains that point. The block then gets assigned to that shape. If multiple shapes overlap a block, we still just pick a single shape/block mapping based on that interior point.
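
The idea can be sketched in Python with a standard ray-casting point-in-polygon test. This is a simplified illustration, not DRA's implementation: it takes the interior point of each block as given (how that point is chosen is not covered here), handles only simple polygons without holes, and assigns a block to the first shape that contains its point. All names and coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon (a list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point toward +x cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def assign_blocks(block_points, shapes):
    """Map each block ID to the first shape containing its interior point (or None)."""
    assignment = {}
    for block_id, point in block_points.items():
        assignment[block_id] = next(
            (name for name, poly in shapes.items() if point_in_polygon(point, poly)),
            None,
        )
    return assignment

# Two hypothetical rectangular shapes and three block interior points.
shapes = {
    "west": [(0, 0), (5, 0), (5, 10), (0, 10)],
    "east": [(5, 0), (10, 0), (10, 10), (5, 10)],
}
block_points = {"b1": (2, 3), "b2": (7.5, 8), "b3": (12, 1)}
assignment = assign_blocks(block_points, shapes)
```

Block b3 falls outside both shapes and is left unassigned in this sketch.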

If the shapes follow 2020 Census block boundaries, this mapping process is very precise. If they follow non-block boundaries, then the assignment is inherently approximate.

You can get insight into the mapping we used by importing your shapes as a map and examining the assignments. Also importing the shapes as a Custom Overlay (which does not need to map to block boundaries) can provide additional insight as to where approximations will occur.

Once the block/shape mapping has been determined, we allocate the values for each field in the shape to the blocks mapped to that shape, weighted by the total 2020 Census population for each block. We use the Hamilton method for allocating any remainder as whole integers.
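
The Hamilton (largest remainder) method can be sketched in a few lines of Python. Each block first receives the whole-number part of its population-weighted share, and the leftover units go one at a time to the blocks with the largest fractional remainders. The function name is hypothetical, and the equal split used when every block has zero population is an assumption made for this sketch, not documented DRA behavior.

```python
def allocate_hamilton(total, weights):
    """Split an integer total across weights using the largest-remainder method."""
    if not weights:
        return []
    if sum(weights) == 0:
        # Assumption for this sketch: with no population anywhere, split equally.
        weights = [1] * len(weights)
    weight_sum = sum(weights)
    # Integer arithmetic avoids floating-point rounding surprises.
    shares = [total * w // weight_sum for w in weights]
    remainders = [total * w % weight_sum for w in weights]
    leftover = total - sum(shares)
    # Hand out the leftover units to the largest remainders first.
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares

# A shape value of 10 spread over four blocks with populations 50, 30, 20 and 7.
shares = allocate_hamilton(10, [50, 30, 20, 7])
```

The allocated values always sum back to the original shape value, which is what lets block-level data aggregate cleanly afterwards.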

Finally, if you want more precise control over the block mapping, you can perform the mapping yourself using some other tool and then import using the CSV format with explicit Census GEOIDs, which guarantees an exact mapping to blocks. You can take complete control by performing the disaggregation yourself and importing a CSV keyed by Census block IDs.

If you do have the Census block or block group information, importing using CSV rather than shapes is much more efficient. The shape mapping process is computationally intensive, especially for large states.

Dotmaps

The dot map feature has been available as a background layer to help show both distribution of race and ethnicity and population density in one visualization. We have extended this feature so that both income and educational attainment can now also be visualized as a dot map. This control is accessible from the Overlays panel. When the dot map is enabled, the details panel entry for the selected dataset will have fields color coded with the dot color mapping.

Editing Datasets

From the My Datasets list, you can select a dataset and view or edit the metadata for that dataset. This includes the name and description, as well as the field ordering and the short and long captions.

Editing Metadata for a Custom Dataset

None of these metadata edits affect the underlying data. Once a dataset is imported, the data is immutable.

This dialog also provides access to advanced dialogs for overriding the default formatting for the details panels and customizing the coloring options this dataset adds to the Custom precinct and district coloring menus in the Colors panel.

Customizing Dataset Formats

Like the simple dataset metadata, each row in the custom format dialog specifies a short caption for the row label and a long caption for the label tooltip. But custom format rows also include an Expression field which allows you to specify the value of the field using a simple arithmetic/logical formula referencing fields from the dataset.

The default set of formatting options just duplicates the default formatting. The Reset button will restore to this default state. But the real power comes when you do more interesting customizations.

Let’s look at our “2020 Density” custom dataset. This dataset includes an aland field that specifies the land area of the block in square meters, an awater field that specifies the area of any water in the block in square meters and a pop field that specifies the total 2020 Census population of that block. This is all information that is directly available from US Census data files.

Custom Dataset Format with Density Expression

These values aggregate nicely and the default formatting would just display the sums for your precincts or districts.

Our custom formatting contains one row to specify the land area. The formula is simply “=aland/2590003”. This constant converts the units from square meters to square miles, which are easier to interpret. Each expression must begin with “=” and may contain an arithmetic or logical combination of the field values and constants. Similarly, there is a row for water area with expression “=awater/2590003”. The more interesting formula is density, which uses the expression “=pop/(aland/2590003)”. This gives the population density in people per square mile (2590003 is the approximate number of square meters in a square mile; the exact figure is about 2,589,988).

This example is interesting because although the Density value cannot aggregate by simple summing, we can compute and display it for each geographic unit by using the underlying fields that do aggregate by summing.
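
A short Python sketch of why this works, using two made-up blocks: the density of the combined unit comes from the summed pop and aland fields, whereas adding the per-block densities gives a meaningless number. The field names follow the dataset described above; the function name and the values are hypothetical.

```python
SQ_METERS_PER_SQ_MILE = 2590003  # constant used in the expressions above

def density(blocks):
    """People per square mile for a unit made up of the given blocks."""
    pop = sum(b["pop"] for b in blocks)
    aland = sum(b["aland"] for b in blocks)
    return pop / (aland / SQ_METERS_PER_SQ_MILE)

# Hypothetical blocks: 100 people on 1 sq mi and 900 people on 3 sq mi.
blocks = [
    {"pop": 100, "aland": 2590003},
    {"pop": 900, "aland": 2590003 * 3},
]

unit_density = density(blocks)                      # from summed fields
summed_density = sum(density([b]) for b in blocks)  # not a valid density
```

The combined unit has 1000 people on 4 square miles, or 250 per square mile, while the sum of the two block densities is 400. A unit with zero land area would need special handling, which this sketch leaves out.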

We can also specify the number of significant digits we want to display. In the Density dataset example, we use two significant digits for land and water area (since they can often be less than one square mile) and 0 significant digits for population density.

You may also specify “Display Percentages” which will add a “percentage of total” column and “Total” row to the details panel. That is only appropriate for combinations of dataset formats that reference distinct categories of a single composite (which is not the case for the Density dataset, since the rows use different units).

Customizing Color By

Custom colors also give you control over how the dataset can be used to color precincts and districts.

Customizing Dataset Colors

Like Custom Formats, Custom Color By rows contain a short caption and long caption. These will be used as the display value and tooltip respectively for an entry in the Custom coloring drop-downs in the Colors panel for both precincts and districts.

The Expression field works a little differently. This should be an expression that returns an intensity value between zero and one (return values outside that range will be clipped).

The Stops field contains a comma-separated list of numbers ranging in sorted order between zero and one. This defines start and end ranges for linearly interpolated color values.

The Colors field contains a comma-separated list of colors in CSS #rrggbb format (hexadecimal values such as #000000), one per stop. The value of the expression selects a color by interpolating between the colors at the neighboring stops.

The default greyscale demographic stops and colors we use are “0.0,0.4,0.5,1.0” and “#fafafa,#aaaaaa,#666666,#111111”. While the simplest gray scale specification might interpolate from zero to 1 and white to black (so “0,1” and “#ffffff,#000000”), this ends up appearing a bit stark, so we stay away from full white and full black at the ends of the range. For demographics, we also add stops just less than 50% in order to have more differentiation of color for values in that “close to majority” range. Exactly what color/stop specification you want to use is dependent on what data values you are interested in differentiating.
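
To show how stops and colors combine, here is a minimal Python sketch that clips the intensity to the 0 to 1 range and linearly interpolates each RGB channel between neighboring stops. Interpolating in plain RGB is an assumption of this sketch; the exact interpolation DRA performs may differ. The helper names are hypothetical.

```python
def hex_to_rgb(color):
    """Convert "#rrggbb" to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple (ints or floats) to "#rrggbb"."""
    return "#" + "".join(f"{round(c):02x}" for c in rgb)

def color_for(intensity, stops, colors):
    """Linearly interpolate a color for intensity using parallel stops and colors."""
    t = min(max(intensity, 0.0), 1.0)  # values outside 0..1 are clipped
    if t <= stops[0]:
        return colors[0]
    for i in range(1, len(stops)):
        if t <= stops[i]:
            lo, hi = stops[i - 1], stops[i]
            frac = (t - lo) / (hi - lo)
            a, b = hex_to_rgb(colors[i - 1]), hex_to_rgb(colors[i])
            return rgb_to_hex(tuple(x + (y - x) * frac for x, y in zip(a, b)))
    return colors[-1]

# The default grayscale demographic specification quoted above.
stops = [0.0, 0.4, 0.5, 1.0]
colors = ["#fafafa", "#aaaaaa", "#666666", "#111111"]

light = color_for(0.2, stops, colors)     # halfway between #fafafa and #aaaaaa
majority = color_for(0.5, stops, colors)  # exactly on a stop
```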

For example, in the case of the Density coloring above, density values vary widely across virtually all states, with a median value well below 1000 per square mile but a maximum value in the tens of thousands. The approach we use here is to return a value in the 0 to 1 range by dividing the density by 15000. Any population density greater than 15000 is clipped to 1, so it will not be differentiated in the coloring. We are using grayscale, so all the colors will be grays with each rgb value ranging from 0 to 255 (00 to ff in hex).

In this case, we place a stop at 0.13, so the overall stops are 0,0.13,1. The value 0.13 maps to around 2000 per square mile (0.13 × 15000 = 1950). We then use color stops of #fafafa,#858585,#111111. This allocates half of the color range to the interesting range of 0 to 2000 so we get more differentiation for those lower population areas. As with the regular demographic grayscale gradient, we stay away from the stark full white and black at the ends of the range.

For your datasets, you might want to define and make available different coloring options for precincts and districts since one is used while building a map and the other while viewing or analyzing a map. For example, the built-in partisan coloring adds color differentiation for districts around the 50–50 split because these are the values we are most interested in (and typically don’t care just how safe a district is once it is really safe). In contrast, when adding precincts, we are more interested in knowing which precincts are going to add a lot of votes in one direction or another so precinct partisan coloring maintains more differentiation throughout the whole range.

This color-stop mechanism can also be used for displaying discrete colors (like the DRA demographics “all” color scheme) by specifying the colors as a series of stops and having the color expression return exactly the stop value, in which case no interpolation happens. Making use of the conditional ternary operator (test ? true-value : false-value) in your expression is typically helpful in this use case. Another useful trick for generating discrete colors is to specify the same stop value twice and then specify both sides of that single stop value. So the combination “0,.5,.5,1” and “#ffffff,#ffffff,#000000,#000000” would display white for any intensity value less than or equal to 0.5 and black for larger values.
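
The repeated-stop trick can be checked with a small Python sketch. For brevity it works with a single gray level per stop (255 for #ffffff, 0 for #000000) rather than full hex colors, and it uses a binary search to find the neighboring stops. This is an illustration of the idea under those assumptions, not DRA's implementation; the Python conditional expression stands in for the ternary operator.

```python
from bisect import bisect_left

def gray_for(intensity, stops, grays):
    """Interpolate a gray level (0-255); a repeated stop acts as a hard step."""
    t = min(max(intensity, 0.0), 1.0)
    i = bisect_left(stops, t)  # index of the first stop >= t
    if i == 0:
        return grays[0]
    if i == len(stops):
        return grays[-1]
    lo, hi = stops[i - 1], stops[i]  # lo < t <= hi, so hi - lo is never zero
    frac = (t - lo) / (hi - lo)
    return round(grays[i - 1] + (grays[i] - grays[i - 1]) * frac)

# "0,.5,.5,1" with white, white, black, black.
stops = [0.0, 0.5, 0.5, 1.0]
grays = [255, 255, 0, 0]

# A ternary-style expression that returns exactly a stop value.
share = 0.62  # hypothetical field ratio
intensity = 1.0 if share > 0.5 else 0.0
step_color = gray_for(intensity, stops, grays)
```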

Load and Save

In many cases, you might have multiple datasets that share many of the custom format and custom color properties. These might be multiple datasets for the same state but for different years, or the same dataset (like the Education and Income official datasets we included with this feature release) that has the same properties across multiple states. To make it easier to specify the dataset properties in these scenarios, you can save the metadata for one dataset (not the underlying data, just the properties that describe the data and how it integrates into DRA 2020) to a local file on your computer and then load that metadata into another dataset.

The only constraint when applying custom dataset metadata to a different dataset is that the number of underlying data fields is the same in both datasets.

Load and Save are available from the Edit (or View) Dataset dialog.

Additionally, the file format is a relatively self-descriptive JSON file and can be opened and edited from your favorite text or JSON editor. This can sometimes be easier than using the UI to navigate if you have many fields.

Tips and Tricks

Duplicating a dataset (either one of your own or a Published dataset) does not change the underlying data, but allows you to customize the formatting and coloring and is an easy way to get custom results without having to track down a copy of the original data in the right format.

Additionally, exporting precinct data will export both built-in datasets and attached custom datasets in a CSV file format that is suitable for re-import as a dataset. Exporting the precinct data CSV file, editing it in a tool like Excel (be careful to import GEOIDs as text and not numbers!) and filtering and combining there before re-import is an easy way to mix and match datasets for your own purposes if custom formats and colors are not sufficient for your needs.
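
The warning about GEOIDs matters because identifiers for states with a leading zero in their FIPS code (Alabama is 01, California is 06) lose that digit when treated as numbers. The Python sketch below demonstrates the problem on a made-up two-column file; the column names are hypothetical and do not describe the actual export layout.

```python
import csv
import io

# Hypothetical export excerpt: a GEOID column and one data column.
exported = "GEOID,Tot\n010010201001,46\n060371011101,55\n"

rows = list(csv.DictReader(io.StringIO(exported)))

# Kept as text, the GEOIDs survive intact.
as_text = [row["GEOID"] for row in rows]

# Converted to numbers (as a spreadsheet may do), the leading zero is lost.
as_number = [str(int(row["GEOID"])) for row in rows]
```

A 12-digit block group GEOID that comes back as 11 digits will no longer match any Census geography on re-import.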

Feedback

We expect our users will find lots of interesting uses for this new feature. Let us know how it is working for you and if there are any additional changes we can make to make the feature more helpful.

Terry Crowley
Dave’s Redistricting

Programmer, Ex-Microsoft Technical Fellow, Sometime Tech Blogger, Passionate Ultimate Frisbee Player