A first step toward creating a digital planning laboratory is populating it
Introducing Doppelgänger, Model Lab’s open-source population synthesizer.
An underappreciated challenge of urban planning is that public servants must confront complicated trade-offs in which some members of the community win and others lose. A new light rail service may improve accessibility, reliability, or comfort for many people but create noise and vibrations for others and replace a more convenient bus service for others still. Transportation network companies like Uber and Lyft have dramatically improved the ride-hailing experience for most, but have likely made getting around more difficult for others. Retrofitting a very old subway service will benefit all travelers in the long run but cause widespread, short-term disruption.
At Model Lab, we believe robust simulation tools can illuminate the benefits and costs of transport-related service, policy, and infrastructure decisions. To understand how transportation interventions affect communities, the models we build need to adequately represent every person living in a community today and every person expected to be living there tomorrow. Our first step in creating a model system that achieves this goal is a toolkit we call Doppelgänger. What’s unique about Doppelgänger is that it combines two cutting-edge technical capabilities, convex optimization and Bayesian Networks, in a single open-source modeling tool, enabling the urban planning community to take a significant step forward with population analysis.
A planner’s first stop in describing the existing conditions of a community is usually the Census Bureau. To protect the privacy of respondents, Census data is delivered at different geographies and across different periods of time. For example, the best estimate of the number of households in a community may be available for each Census block from the Decennial Census (last conducted in 2010), and the best estimate of household income may be the five-year rolling data product from the American Community Survey for each Census tract. Combining these disparate data sets to create a coherent and complete representation of what is happening in a community at any point in time is difficult. It’s a bit like trying to completely understand a subject from photos that are taken from different angles, at different points in time, from different distances. Further complicating the problem, urban planners like to use non-Census data sets, such as school quality, that may introduce yet another set of geographies (e.g., school districts).
Doppelgänger is here to help. It belongs to a class of tools that urban modelers refer to as “population synthesizers.” As their name implies, population synthesizers create synthetic populations — virtual communities with detailed descriptions of the households and people that live in them. These tools attempt to consume all of the data sets created by the Census Bureau (as well as other sources) to create a complete and internally consistent virtual representation of a given community. More broadly, Doppelgänger enables planners to create a set of virtual households that accurately reflects real neighborhoods, cities, regions, or states, along any dimension relevant to the problem at hand. Doppelgänger can create households that have accurate numbers of children, seniors, teachers, persons with disabilities, electric-vehicle owners, swing-shift workers, and so on.
To further Model Lab’s goal of building transparent and accessible modeling tools, we have released Doppelgänger as an open-source library. We are only at the beginning of our work with Doppelgänger (it’s not yet plug and play, for instance), but we are excited enough by its technical features and capabilities that we want to share it with researchers and practitioners interested in population synthesizers. We have created a Jupyter Notebook to walk you through Doppelgänger’s functionality and features. So please download the repository, walk through the examples, inspect the code, and use GitHub’s tools to suggest or implement improvements. We’d love to hear what you think.
The first technical aspect of Doppelgänger that excites us is convex optimization. To introduce the idea, consider an urban planner tasked with describing the existing conditions of a community. The planner starts with the following data from the Census Bureau:
- Number of households, for which the best source is the 2010 Decennial Census and the data is available at block geographies;
- Age and income distributions, for which the best source is the 2011 to 2015 American Community Survey (ACS) and the data is available at tract geographies; and
- Household structures (i.e., relationships between parents and children), for which the best source is the 2011 to 2015 ACS and the data is available at Public Use Microdata Area (PUMA) geographies.
The planner also has data from administrative records on unemployment insurance claims at county geographies and on school enrollment at school district geographies.
Our planner would like to create a complete representation of the community, which allows her or him to quickly and easily summarize the community along any relevant dimension at any relevant geographic scale. In so doing, the planner may obtain a deeper understanding of the community’s characteristics and how people may be impacted by potential transportation or development interventions.
Addressing this challenge is why we built Doppelgänger. The tool starts with a list of sample households, which are typically drawn from a data set that describes individual households in great detail, such as the Census Bureau’s Public Use Microdata Sample (PUMS). Doppelgänger then allocates households to small geographies (e.g., parcels, blocks, or tracts) such that the aggregated households match the data sets we started with. For example, when you add up all the households in a Census tract, the statistics closely match the age and income distributions from the ACS data. The age and income distributions by tract, households by block, household structures by PUMA, unemployment insurance claims by county, and school enrollment by school district are referred to as “marginals” in population synthesizer jargon.
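To make the idea of a marginal concrete, here is a minimal Python sketch. It is not Doppelgänger’s code: the household records, tract names, and the `aggregate` helper are all hypothetical and exist only for illustration. It shows how an allocation of sample households to tracts can be added up into a tract-level distribution, which is the quantity that would be compared against the published ACS figures.

```python
from collections import Counter

# Hypothetical sample households (in practice these would be drawn from PUMS).
sample_households = [
    {"id": "h1", "age_group": "under_35", "income_group": "low"},
    {"id": "h2", "age_group": "35_to_64", "income_group": "high"},
    {"id": "h3", "age_group": "65_plus", "income_group": "low"},
]

# Hypothetical allocation: how many copies of each sample household
# are placed in each tract.
allocation = {
    "tract_A": {"h1": 40, "h2": 25, "h3": 10},
    "tract_B": {"h1": 5, "h2": 30, "h3": 20},
}


def aggregate(allocation, households, field):
    """Sum allocated household counts by tract and by the value of `field`."""
    by_id = {h["id"]: h for h in households}
    totals = {}
    for tract, counts in allocation.items():
        tally = Counter()
        for household_id, count in counts.items():
            tally[by_id[household_id][field]] += count
        totals[tract] = dict(tally)
    return totals


# The synthetic income distribution by tract, ready to be compared
# with the corresponding published marginal.
income_by_tract = aggregate(allocation, sample_households, "income_group")
print(income_by_tract)
```

The same helper can be pointed at `age_group` or any other attribute carried by the sample households, which is what allows a synthetic population to be summarized along whatever dimension a planner needs.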
If the marginals presented to Doppelgänger are consistent with each other and with the sample household data, Doppelgänger will create a synthetic population that matches each of the marginals perfectly. If the marginals are not consistent, which is almost always the case in practice, the user must tell Doppelgänger which of the marginals are more or less important. We formulate this problem as a convex optimization in which subjective weights allow the user to prioritize marginals (e.g., assert that matching the number of households is more important than matching the age distribution). The optimization returns a solution that matches every marginal when such a solution exists and otherwise returns the best compromise under the chosen weights.
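The weighting idea can be sketched in a few lines of Python. This is an illustrative toy, not Doppelgänger’s formulation or solver: it assumes a weighted least-squares objective with non-negative household weights and solves it with plain projected gradient descent, and the function name `fit_weights`, the two-household sample, and the target numbers are all made up for the example. A production tool would more likely hand the problem to a dedicated convex solver.

```python
def fit_weights(contributions, targets, importance, iterations=5000):
    """Minimize sum_k importance[k] * (contributions[k] . x - targets[k])**2
    subject to x >= 0, using projected gradient descent.

    contributions[k][j] is how much one copy of sample household j adds
    to marginal k; x[j] is the number of copies of household j.
    """
    n = len(contributions[0])
    # Conservative step size: the largest Hessian eigenvalue is at most
    # 2 * sum_k importance[k] * ||row_k||^2, so this step cannot overshoot.
    bound = sum(w * sum(a * a for a in row)
                for w, row in zip(importance, contributions))
    step = 1.0 / (2.0 * bound)
    x = [0.0] * n
    for _ in range(iterations):
        residuals = [sum(a * xj for a, xj in zip(row, x)) - t
                     for row, t in zip(contributions, targets)]
        gradient = [
            2.0 * sum(w * r * row[j]
                      for w, r, row in zip(importance, residuals, contributions))
            for j in range(n)
        ]
        # Gradient step, then project back onto the feasible set x >= 0.
        x = [max(0.0, xj - step * g) for xj, g in zip(x, gradient)]
    return x


# Hypothetical sample of two households. Rows are marginals
# (households, seniors, children); columns are sample households.
contributions = [
    [1, 1],  # every sample household counts as one household
    [1, 0],  # household 1 contains one senior
    [0, 2],  # household 2 contains two children
]
# Deliberately inconsistent targets: 30 seniors and 120 children imply
# 30 + 60 = 90 households, but the household marginal says 100.
targets = [100, 30, 120]

equal = fit_weights(contributions, targets, importance=[1, 1, 1])
prioritized = fit_weights(contributions, targets, importance=[100, 1, 1])

print("equal importance:      ", [round(x, 2) for x in equal])
print("households prioritized:", [round(x, 2) for x in prioritized])
```

With equal importance the fitted household total settles at roughly 95.6, splitting the disagreement across all three marginals. Raising the importance of the household count pulls the total to about 99.9 at the cost of a looser fit on seniors and children. If the targets are made consistent (for example, 140 children instead of 120), every marginal is matched regardless of the weights.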
We believe convex optimization gives analysts a more logical framework to make trade-offs among competing “truths” and is superior to more common iterative proportional fitting/updating approaches, in which each of the marginals is given equal importance. Our work is inspired by the efforts of Peter Vovsha and colleagues.
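For comparison, a bare-bones version of iterative proportional fitting on a two-way table looks like the sketch below. The function name, seed table, and targets are hypothetical, and real synthesizers apply the same idea to many more dimensions. Note that the procedure simply rescales to each marginal in turn; there is no place to say that one marginal matters more than another.

```python
def iterative_proportional_fitting(seed, row_targets, col_targets, iterations=100):
    """Rescale a two-way seed table until its row and column sums
    match the given targets (assumes the targets share the same total)."""
    table = [row[:] for row in seed]
    for _ in range(iterations):
        # Scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [value * target / total for value in table[i]]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
    return table


# Hypothetical seed: household counts by age group (rows) and
# income group (columns) taken from a small sample.
seed = [[1.0, 2.0],
        [3.0, 4.0]]
row_targets = [40.0, 60.0]  # published age marginal
col_targets = [30.0, 70.0]  # published income marginal

fitted = iterative_proportional_fitting(seed, row_targets, col_targets)
for row in fitted:
    print([round(value, 2) for value in row])
```

When the row and column targets disagree on the overall total, this loop never satisfies both at once and the result depends on which marginal was scaled last, which is one reason an explicit weighting scheme is attractive.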
The second technical feature of Doppelgänger that excites us is Bayesian Networks (often referred to as Bayes Nets). Bayes Nets can be used in population synthesizers in two important ways. First, Bayes Nets act as a means of extracting useful relationships from one data set that can then be applied to other data sets. For example, consider a data set that, for a relatively small sample of households, contains information on each household’s number of people, income, and number of vehicles. We can train a Bayes Net on this data to understand the relationship between these three variables. The outcome can be illustrated in a directed graph that looks like this:
The relationships labeled in the graph as A, B, and C are, in a Bayes Net, conditional probabilities: the probability of each outcome in the destination box given the outcome in the origin box. Now consider a much larger data set of households that describes only the number of people in each household and their household income; this data set is silent on household vehicles. If we believe that the Bayes Net trained on the smaller data set is relevant to the larger data set, we can use the Bayes Net to estimate household vehicle levels in the larger data set. In other words, we can use the Bayes Net to infer the number of vehicles each household owns.
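A stripped-down Python sketch of this inference step follows. It assumes, for illustration, that the vehicles node depends on both household size and income, and it learns only that one conditional probability table by counting. The records, function names, and income categories are hypothetical, and this is not Doppelgänger’s API.

```python
from collections import Counter, defaultdict

# Hypothetical small survey: (people, income, vehicles) for each household.
small_sample = [
    (1, "low", 0), (1, "low", 0), (1, "low", 1),
    (1, "high", 1),
    (2, "low", 1),
    (2, "high", 1), (2, "high", 2), (2, "high", 2),
]


def learn_vehicle_table(records):
    """Estimate P(vehicles | people, income) from observed frequencies."""
    counts = defaultdict(Counter)
    for people, income, vehicles in records:
        counts[(people, income)][vehicles] += 1
    return {
        parents: {v: c / sum(tally.values()) for v, c in tally.items()}
        for parents, tally in counts.items()
    }


def infer_vehicles(table, people, income):
    """Return the most probable vehicle count for the given parent values."""
    distribution = table[(people, income)]
    return max(distribution, key=distribution.get)


vehicle_table = learn_vehicle_table(small_sample)

# Hypothetical larger data set that is silent on vehicles.
large_sample = [(1, "low"), (2, "high"), (2, "low")]
inferred = [infer_vehicles(vehicle_table, people, income)
            for people, income in large_sample]
print(inferred)
```

Returning the single most probable value is the simplest choice. Drawing from the conditional distribution instead would preserve more of the variation seen in the small sample.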
We’ve set up Doppelgänger to easily create Bayesian Networks from Census data — specifically, to create PUMA-specific Bayes Nets from the PUMS — though it is generalizable to any data set.
This functionality can also be used to extract important relationships from proprietary data sets while not revealing any of the private aspects of these data sets. For example, a proprietary data set may reveal interesting relationships between household characteristics and electric vehicle ownership. With the data provider’s permission, we can extract these relationships from the proprietary data sets using Bayes Nets, then bring these Bayes Nets into Doppelgänger to apply them to non-proprietary data sets. We will develop and refine these techniques to create representative communities in which planners can test ideas. We need not compromise customer privacy to make better urban planning tools. But to do so we’ll need to change our thinking: we must get used to the concept of experimenting with our ideas in demonstrably representative, but virtual, worlds.
The second useful application of Bayes Nets is adding variability to a synthetic population. Data sets such as the Census Bureau’s PUMS include a limited set of observed households: PUMS is a 1 percent annual sample of the United States population. A typical approach to population synthesis is to replicate the households observed in the PUMS data over and over to create the right number of households for the subject neighborhood, city, region, or state. Doing this results in the same households with the same characteristics being replicated in the synthetic population. This outcome is undesirable because it understates the variability of characteristics in the community, and this variability may be important to certain urban planning problems. If we build a Bayes Net that captures all of the relevant characteristics of the synthetic population, we can traverse the Bayes Net to create populations with different sets of characteristics.
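The sketch below illustrates how traversing a network can yield households that never appear in the sample. It is a toy under stated assumptions, not Doppelgänger’s implementation: the four records and the helper names are hypothetical, and the network structure (household size influences income and vehicles, which are otherwise independent of each other) is chosen purely to keep the example small. Each synthetic household is built by sampling the nodes in order, parents first.

```python
import random
from collections import Counter, defaultdict

# Hypothetical observed households.
records = [
    {"people": 1, "income": "low", "vehicles": 0},
    {"people": 1, "income": "high", "vehicles": 1},
    {"people": 2, "income": "low", "vehicles": 1},
    {"people": 2, "income": "high", "vehicles": 2},
]


def conditional_counts(records, child, parents):
    """Count child values for every observed combination of parent values."""
    counts = defaultdict(Counter)
    for record in records:
        key = tuple(record[p] for p in parents)
        counts[key][record[child]] += 1
    return counts


def draw(counter, rng):
    """Sample one value in proportion to its observed count."""
    values = list(counter)
    return rng.choices(values, weights=[counter[v] for v in values], k=1)[0]


def synthesize(records, n, seed=0):
    """Generate n households by sampling each node given its parents."""
    rng = random.Random(seed)
    people_counts = conditional_counts(records, "people", [])
    income_counts = conditional_counts(records, "income", ["people"])
    vehicle_counts = conditional_counts(records, "vehicles", ["people"])
    households = []
    for _ in range(n):
        people = draw(people_counts[()], rng)
        income = draw(income_counts[(people,)], rng)
        vehicles = draw(vehicle_counts[(people,)], rng)
        households.append(
            {"people": people, "income": income, "vehicles": vehicles})
    return households


synthetic = synthesize(records, n=200)
observed = {(r["people"], r["income"], r["vehicles"]) for r in records}
generated = {(h["people"], h["income"], h["vehicles"]) for h in synthetic}
print("combinations not in the sample:", sorted(generated - observed))
```

Simple replication could only ever reproduce the four observed households. Sampling from the network also produces, for example, one-person high-income households with no vehicle. Whether such combinations are plausible depends entirely on whether the assumed network structure is a fair description of the real population.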
These features of Bayes Nets have the potential to increase the heterogeneity, representativeness, and usefulness of our synthetic populations, and when combined with helpful models they can improve the effectiveness and efficiency of policy interventions.
We are not the first to identify the replication of households in a population synthesizer as undesirable, nor are we the first to propose using Bayes Nets in population synthesizers. We are inspired by these previous efforts as well as the foundational work of Judea Pearl and the open-source ethos of the Urban Data Science Toolkit’s synthpop. By releasing Doppelgänger as an open-source library now, we hope to build a community that collaborates on and improves these ideas, technologies, and source code.
Before we can understand how transportation services, policies, or infrastructure impact a community, we must understand who lives in the community today, tomorrow, and 20 years from now. Once we understand that, we can begin to think about how and why community members move around to carry out activities in service to themselves and their families over a typical day or week — a next step we are working on now. At Model Lab, we strive to translate our understanding into simulations that help urban planners experiment and learn. Doppelgänger is a foundational step in our journey and we hope it’s the beginning of a strong relationship with the urban modeling community.
If you’re an urban planner or model developer interested in population synthesizers, we’d love to hear your thoughts on Doppelgänger, features that could improve your work, and opportunities for collaboration. Please reach out via GitHub. And if you’re a technologist interested in helping us advance our vision of better models, please reach out to learn more about joining our team.