Our Data Science Clearinghouse

How DNC Tech manages billions of model scores 😱 without breaking the bank (or losing our minds)

Drew McDonald
DNC Tech Team
5 min read · Aug 24, 2020

--

The DNC Data Science team builds and manages dozens of models that support a broad range of campaign activities. Campaigns rely on these model scores to optimize contactability, volunteer recruitment, get-out-the-vote, and many other pieces of modern campaigning. One of our responsibilities is to deliver the best available model scores in an accessible, actionable form.

As part of Phoenix, the DNC’s data warehouse, we developed infrastructure that keeps our focus on delivering products to win elections instead of on ever-growing technical complexity. In this post, we’ll walk through the infrastructure that manages over 70 billion (and counting!) model scores for the country’s 200+ million registered voters.

The Challenge

At this point in the 2020 cycle, we’re managing about 20 different models. These models come from a mix of sources (our internal modeling infrastructure and multiple vendor syncs) and span regression, binary classification, and multi-class classification model types. Across those 20 models, we have around 80 distinct model versions that comprise more than 70 billion point estimates.

So, how do we get from the complexity of mixing so many model sources, model types, and versions-per-model to the clean, accessible set of scores our users can seamlessly integrate into their campaign programs?

A Clearinghouse for Model Scores

Our solution is a pair of carefully designed tables for model versions and model scores, and an accompanying code base to cleanly manage model and score life-cycles.

Together, these tables are a clearinghouse for model score publication. Just as a financial clearinghouse ensures a clean exchange between parties in a transaction, our model score clearinghouse sits between a model’s source data and its downstream pipelines to ensure a clean hand-off from one to the other.

Model Version Ledger

The first table of our clearinghouse, model_versions, keeps metadata on model versions. Vendor-sourced model version metadata is merged into this table as part of our loading processes, and models we score in-house have their versions checked against and merged into it as part of every scoring job. With many models and versions spread across our small data science team, we are thrilled to have this bookkeeping maintained programmatically.

[Screenshot: model_versions, the first table of our clearinghouse.]
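To give a feel for the mechanics, here is a minimal sketch of what that check-and-merge might look like for an in-house scoring job, assuming a SQL warehouse reached through a standard database cursor. The model_versions table, model_run_id, and current_model_flag come from this post; every other column name and the statement itself are illustrative guesses rather than our exact schema.

# Minimal sketch of the version bookkeeping a scoring job performs.
# model_versions, model_run_id, and current_model_flag appear in this post;
# the remaining column names and the MERGE shape are illustrative.
REGISTER_VERSION_SQL = """
MERGE INTO model_versions AS mv
USING (SELECT :model_name AS model_name,
              :model_run_id AS model_run_id) AS incoming
ON mv.model_name = incoming.model_name
   AND mv.model_run_id = incoming.model_run_id
WHEN NOT MATCHED THEN
  INSERT (model_name, model_run_id, current_model_flag)
  VALUES (incoming.model_name, incoming.model_run_id, FALSE)  -- promoted to production separately
"""

def register_model_version(cursor, model_name, model_run_id):
    """Check a freshly scored model version against the ledger and merge it in."""
    # Placeholder syntax (":name") varies by database driver.
    cursor.execute(REGISTER_VERSION_SQL,
                   {"model_name": model_name, "model_run_id": model_run_id})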

An Authoritative Table of Model Predictions

The second table, scores, holds, well, scores. As we load incoming model scores, we tag both the model version and scoring job that generated them. For multi-class classification, we store scores in a normalized form and note the predicted class label in the score_name column.
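To make that concrete, a single multi-class prediction fans out into one row per class label, along the lines of the sketch below. The external_id, score_name, model_run_id, scoring_run_id, and current_score_flag fields are the ones described in this post; the exact row shape and the values are illustrative.

# Illustrative sketch: normalizing one multi-class prediction into scores rows.
# Field names follow this post; the row shape itself is a guess at the schema.
def normalize_multiclass_scores(external_id, model_run_id, scoring_run_id, class_probs):
    """Fan a {class label: probability} dict out into one scores row per class."""
    return [
        {
            "external_id": external_id,
            "model_run_id": model_run_id,
            "scoring_run_id": scoring_run_id,
            "score_name": label,           # the class label for this row
            "score": probability,
            "current_score_flag": False,   # flipped to TRUE only after checks pass
        }
        for label, probability in class_probs.items()
    ]

rows = normalize_multiclass_scores(
    external_id="person-000123",          # hypothetical person identifier
    model_run_id="my-model-2020-08-01",   # hypothetical model version identifier
    scoring_run_id="scoring-2020-08-20",  # hypothetical scoring job identifier
    class_probs={"class_a": 0.62, "class_b": 0.25, "class_c": 0.13},
)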

A key part of this table’s architecture is that it’s partitioned by the model version’s date. This keeps all of the scores for a given model version on the same partition, allowing us to query just the few gigabytes of data for the scores we need instead of scanning the entire multi-terabyte table.

[Screenshot: scores, the second table of our clearinghouse.]

Once the new scores have passed automated checks, their current_score_flag is flipped to TRUE and their datetime_approved field is set to the current timestamp. In the same step, previous scores of the same model_version that overlap with the new scores by external_id are flipped to FALSE and have their datetime_deprecated set to the current timestamp.
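In SQL terms, that approval step looks roughly like the sketch below, assuming the new batch is identified by its scoring_run_id. The column names are the ones above; the statements themselves illustrate the logic rather than reproduce our actual job.

# Rough sketch of the approval step; column names follow this post, SQL is illustrative.
DEPRECATE_OVERLAPPING_SQL = """
UPDATE scores
SET current_score_flag = FALSE,
    datetime_deprecated = CURRENT_TIMESTAMP
WHERE model_run_id = :model_run_id
  AND current_score_flag = TRUE
  AND external_id IN (SELECT external_id FROM scores
                      WHERE scoring_run_id = :new_scoring_run_id)
"""

APPROVE_NEW_SQL = """
UPDATE scores
SET current_score_flag = TRUE,
    datetime_approved = CURRENT_TIMESTAMP
WHERE scoring_run_id = :new_scoring_run_id
"""

def approve_scores(cursor, model_run_id, new_scoring_run_id):
    """Deprecate overlapping prior scores, then promote the new batch, ideally in one transaction."""
    cursor.execute(DEPRECATE_OVERLAPPING_SQL,
                   {"model_run_id": model_run_id,
                    "new_scoring_run_id": new_scoring_run_id})
    cursor.execute(APPROVE_NEW_SQL, {"new_scoring_run_id": new_scoring_run_id})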

This operation makes the current_score_flag an authoritative marker for which scores are current within the model_run_id, a key assumption for when it’s time to materialize the scores for downstream use.

Finally — Simplified Operations on Model Versions

All of the bookkeeping in the tables above pays huge dividends when it comes time to query our model scores. A short script like the one sketched below extracts the current scores for the current production model version of “My Model” with only the name of the model as an input!
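Here is a minimal sketch of such a script, assuming the two tables join on model_run_id; the model_name and score columns are guesses at the exact names.

# Minimal sketch: pull the current scores for the current production version
# of "My Model". The join key and the model_name/score column names are guesses.
CURRENT_SCORES_SQL = """
SELECT s.external_id, s.score_name, s.score
FROM scores AS s
JOIN model_versions AS mv
  ON s.model_run_id = mv.model_run_id
WHERE mv.model_name = :model_name     -- the only input: the model's name
  AND mv.current_model_flag = TRUE    -- current production model version
  AND s.current_score_flag = TRUE     -- current approved scores
"""

def fetch_current_scores(cursor, model_name="My Model"):
    cursor.execute(CURRENT_SCORES_SQL, {"model_name": model_name})
    return cursor.fetchall()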

How We Use It

Having a strong, consolidated schema for model versions and scores makes downstream use cases much cleaner and simpler.

Here are a few examples of how this infrastructure is used in our work in the 2020 cycle:

Sourcing Publishing Pipelines

The most mission-critical use of this infrastructure is to serve the most current scores of each model to downstream pipelines in a consistent location and format. Our scoring and loading pipelines run a “materialize model” task that writes the results of a query similar to the one above into a dataset holding one “current” table per model.

With our model and score version bookkeeping managed upstream, our downstream code can always find the model’s current scores in the same place. But more importantly, this approach insulates downstream processes from modeling issues: If a scoring job fails for some reason, or if newly loaded scores do not pass automated quality checks, the model version will not be re-materialized with the problematic scores.
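A minimal sketch of that “materialize model” task, assuming a warehouse that supports CREATE OR REPLACE TABLE; the “current_models” dataset name and the table naming scheme are purely illustrative.

# Rough sketch of the "materialize model" step. The dataset name
# ("current_models") and the table naming scheme are illustrative guesses.
def materialize_model(cursor, model_name):
    """Rewrite the model's 'current' table from the clearinghouse tables."""
    target_table = "current_models." + model_name.lower().replace(" ", "_")
    cursor.execute(f"""
        CREATE OR REPLACE TABLE {target_table} AS
        SELECT s.external_id, s.score_name, s.score
        FROM scores AS s
        JOIN model_versions AS mv
          ON s.model_run_id = mv.model_run_id
        WHERE mv.model_name = '{model_name}'
          AND mv.current_model_flag = TRUE
          AND s.current_score_flag = TRUE
    """)
    # If a scoring job fails or its scores do not pass automated checks,
    # this task simply is not run, so the previous "current" table stays put.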

[Flowchart: the score clearinghouse sitting between model score sources and downstream publishing pipelines.]

Breadcrumbs for a Model’s Life Cycle

With Election Day right around the corner, we have to respond quickly and confidently to potential problems with our models. We maintain a complete picture of a model and its scores by linking the model_run_id and scoring_run_id fields to the metadata and artifacts we store in an MLflow tracking instance.

This allows us to trace a published score back to its scoring job, its training job, and, through our other metadata systems, the exact training data that built the underlying model. This piece is critical for diagnosing issues and anomalies users encounter in the field.
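As a rough illustration of that tracing, assuming the IDs we store correspond to runs in our MLflow tracking server; the tracking URI and run ID below are placeholders, not real values.

# Illustrative sketch: follow a published score back to its MLflow run.
# Assumes model_run_id / scoring_run_id map to MLflow run IDs; the URI and
# run ID are placeholders.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="https://mlflow.example.internal")  # placeholder
run = client.get_run("model-run-id-from-the-scores-table")             # placeholder

print(run.data.params)        # training parameters logged for this run
print(run.data.metrics)       # evaluation metrics from the training job
print(run.info.artifact_uri)  # where this run's artifacts are stored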

It’s also a gift to our future selves: when it’s time to revisit our models and improve their future versions, we’ll be working with a complete understanding of their inputs and outputs.

Model Reversions

Sometimes, we detect a problem with a model after it’s been published. When this happens, we can adjust the current_model_flag in the model_versions table to toggle a model version out of production, or simply suppress scores from a problematic scoring job. After adjusting the metadata in the clearinghouse, we can then re-materialize the model for downstream pipelines and address the lingering issues.
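Both levers look roughly like the sketch below; the table and flag names are the ones described in this post, and the statements themselves are illustrative rather than our production tooling.

# Illustrative reversion levers; table and flag names follow this post.
DEMOTE_MODEL_VERSION_SQL = """
UPDATE model_versions
SET current_model_flag = FALSE
WHERE model_run_id = :bad_model_run_id
"""

SUPPRESS_SCORING_JOB_SQL = """
UPDATE scores
SET current_score_flag = FALSE,
    datetime_deprecated = CURRENT_TIMESTAMP
WHERE scoring_run_id = :bad_scoring_run_id
"""
# After either adjustment, the affected model is re-materialized so that
# downstream pipelines pick up the corrected state.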

Point-In-Time Snapshots

Political data folks are fanatics for historical analysis, and model estimates can be a key part of that. With the datetime_approved and datetime_deprecated fields, we can reconstruct a model version’s ‘current’ scores from any point in time just as we materialize ‘current’ tables. This can give us a quick view into a model’s predictions over time, or even a snapshot of predictions as of a past Election Day.
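A minimal sketch of such a point-in-time query, with the two timestamp columns as described above and the remaining names illustrative:

# Illustrative point-in-time snapshot: the scores that were 'current' for a
# model version as of a given timestamp. Names beyond the two datetime
# columns are guesses at the exact schema.
SNAPSHOT_SQL = """
SELECT s.external_id, s.score_name, s.score
FROM scores AS s
WHERE s.model_run_id = :model_run_id
  AND s.datetime_approved <= :as_of
  AND (s.datetime_deprecated IS NULL OR s.datetime_deprecated > :as_of)
"""
# e.g. with :as_of set to the evening of a past Election Day, this reconstructs
# that day's picture of the model's predictions.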

Conclusion

Our investment in this infrastructure has allowed us to develop and deliver data science products that are more transparent and sustainable than ever before, and we’ll continue building on this infrastructure for the 2022 election cycle and beyond.

Interested in making a difference? Join our team.
