Illustration by Alysheia Shaw-Dansby

New Tools Are Needed to Unlock Private Data for Better Policymaking

Data@Urban · Urban Institute · Feb 15, 2024

Government agencies hold a vast array and amount of data about individuals and households — information about health, earnings, income, and much more. These data — which largely originate from voluntary surveys, such as the decennial census, and administrative sources, such as tax records — can help researchers and policymakers better understand people and the challenges they face. But while these data can help craft good, effective policy, they can also be used for harm.

As anyone affected by a recent data breach knows, bad actors may seek to steal Social Security numbers, health information, and other data elements to harm people and families. And they may combine these breached data with publicly available data to compile increasingly detailed pictures of Americans’ private lives.

Government data stewards need new tools and technologies to improve lives while protecting the privacy of the individuals in the data. The Safe Data Technologies team at Urban is working to create tools that can address these issues and help the broader data and research community. Our work involves creating new data access options composed of synthetic data and a validation server. We believe the practical implementation of these tools can unleash a flood of high-quality research, generating valuable insights and better policymaking.

What are synthetic data?

Synthetic data (PDF) replace actual records in a confidential dataset with statistically representative pseudo-records. This process enables data curators and researchers to release data that would otherwise be too sensitive for public use. Synthetic data are typically generated by estimating probability distribution models or empirical distributions from the confidential data, often while accounting for real-world constraints. The synthetic data can then be evaluated by comparing distributions and other statistical measures between the confidential and synthetic datasets.
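As a rough, self-contained illustration (not the method behind any particular Urban synthetic file), the Python sketch below fits a simple parametric model to a stand-in “confidential” income column, draws pseudo-records from it, and compares quantiles between the two versions. Real synthesizers model joint relationships across many variables and enforce real-world constraints; this only shows the basic fit, sample, and evaluate loop described above.

```python
# A minimal, hypothetical sketch of column-wise synthesis: fit a simple
# parametric model to a confidential numeric column, sample pseudo-records,
# and compare distributions. Real synthesizers model joint relationships
# and real-world constraints; this only illustrates the basic idea.
import numpy as np

rng = np.random.default_rng(seed=20240215)

# Stand-in for a confidential income column (roughly log-normal, in dollars).
confidential_income = rng.lognormal(mean=10.8, sigma=0.7, size=5_000)

# Estimate a log-normal model from the confidential data...
log_income = np.log(confidential_income)
mu_hat, sigma_hat = log_income.mean(), log_income.std(ddof=1)

# ...and draw statistically representative pseudo-records from it.
synthetic_income = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=5_000)

# Evaluate utility by comparing distributional summaries.
for q in (0.25, 0.50, 0.75, 0.99):
    conf_q = np.quantile(confidential_income, q)
    synth_q = np.quantile(synthetic_income, q)
    print(f"p{int(q * 100):>2}: confidential={conf_q:>12,.0f}  synthetic={synth_q:>12,.0f}")
```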

Imagine a study in which survey data are matched to administrative data. The combined dataset holds private information about people in the data, so a researcher may prefer to work with the synthetic version of these data first to avoid privacy risks. They can test their code and explore their findings on the synthetic file.

In some cases, synthetic data must “smooth out” the data to protect the privacy of individuals with more-unique responses. While this smoothing protects their privacy, those individuals’ uniqueness could be very important to the analysis. For example, one of them may be a high-income tax filer whose tax rate has a large effect on the results. Without the raw, original data, researchers would be unable to see how that one tax rate, applied to the original income values of a few high earners, changes the analysis. To obtain the “true” results, researchers would need to run their analysis on the original, confidential data. This is where a validation server comes into play.
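As a toy illustration of that point (all incomes and the two-bracket tax schedule below are invented), smoothing a single extreme income can noticeably shift a statistic that depends on the tail of the distribution:

```python
# A toy, hypothetical illustration of why smoothing matters: if a synthesizer
# pulls an extreme income toward the center to protect that filer, statistics
# that depend on the tail (like the average tax rate under a progressive
# schedule) can shift. Numbers and tax rates here are invented.
import numpy as np

def tax_owed(income: float) -> float:
    """A made-up two-bracket schedule: 10% up to $100k, 35% above."""
    return 0.10 * min(income, 100_000) + 0.35 * max(income - 100_000, 0.0)

confidential = np.array([40_000, 55_000, 72_000, 88_000, 2_500_000])  # one outlier
synthetic = np.array([40_000, 55_000, 72_000, 88_000, 250_000])       # outlier smoothed

for label, incomes in (("confidential", confidential), ("synthetic", synthetic)):
    avg_rate = np.mean([tax_owed(x) / x for x in incomes])
    print(f"{label}: average tax rate = {avg_rate:.1%}")
```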

What is a validation server?

A validation server is a digital tool that creates an intermediate layer between a researcher and the original, confidential data. With this intermediate layer, a researcher can analyze the confidential data without actually seeing them. To use a validation server, a researcher would first develop analysis code using synthetic data, then apply for access to the server. Once accepted, the researcher would submit their analysis code to the validation server, which would return a table of results, showing either the tabulations or the regression results from the analysis, with noise added to protect privacy.
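To make the “noise added to protect privacy” step concrete, here is a sketch using one standard approach, the Laplace mechanism from differential privacy, applied to a simple tabulation. This is not necessarily the exact algorithm our validation server uses; it only shows how a server can return useful counts without releasing exact confidential values.

```python
# A minimal sketch of how a validation server might perturb a tabulation
# before releasing it. This uses the standard Laplace mechanism from
# differential privacy; the actual server's algorithms may differ.
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Return a count with Laplace noise of scale sensitivity / epsilon.

    Smaller epsilon (less privacy budget spent) means more noise. Adding or
    removing one person changes a count by at most 1, so sensitivity is 1.
    """
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Example: a confidential tabulation of filers by income bracket (made up).
true_tabulation = {"under_50k": 1_840, "50k_to_200k": 2_610, "over_200k": 550}

epsilon_per_cell = 0.5  # hypothetical budget spent on each cell
released = {cell: round(noisy_count(n, epsilon_per_cell), 1)
            for cell, n in true_tabulation.items()}
print(released)
```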

The validation server we are developing will be automated, unlike previous manual ones, such as those at the US Census Bureau, to reduce staff burden and the subjectivity of staff review. Our validation server is, to the team’s knowledge, the first of its kind to use a privacy budget to automate the process and limit how much information can be publicly released. Researchers “spend” this privacy budget to get more accurate results or produce more statistics, but when the budget is fully spent, no more data can be released.

With our automated validation server, the researcher can iterate once they’ve received the initial table of results. Using their allotted privacy budget (allocated by the government agency), the researcher can choose how much they want to spend to make selected statistics more accurate (meaning less noise is added to the true results) and to have the final result calculated on the confidential data for public release. Researchers can use the validation server to confirm that their results on the synthetic data were accurate or to discover true patterns obscured by the smoothing and noise inherent in the synthetic data.

By using the privacy budget to control the amount of noise added to key statistics of interest, researchers can obtain much more accurate results through the validation server than they could from the synthetic data alone.
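The sketch below illustrates that trade-off with hypothetical numbers: the researcher allocates more of their budget to the statistics that matter most, and the expected noise shrinks in proportion. The accounting and noise scale here (Laplace-style, sensitivity divided by epsilon) are stand-ins, not the server’s actual implementation.

```python
# A hypothetical sketch of privacy-budget accounting: the researcher decides
# how much epsilon to spend on each statistic, and spending more yields a
# smaller expected error (less noise). The arithmetic uses the Laplace
# mechanism's noise scale as a stand-in; the real server may differ.
total_budget = 2.0          # allotted by the data steward (hypothetical)
requests = {                # epsilon the researcher chooses to spend per statistic
    "median_income_coefficient": 1.0,   # key result: spend more for accuracy
    "sample_count": 0.25,               # rough context: a little noise is fine
}

spent = sum(requests.values())
if spent > total_budget:
    raise ValueError("Request exceeds the remaining privacy budget.")

sensitivity = 1.0  # illustrative; depends on the statistic being released
for statistic, epsilon in requests.items():
    expected_abs_error = sensitivity / epsilon  # Laplace noise scale
    print(f"{statistic}: epsilon={epsilon}, expected noise scale = {expected_abs_error:.2f}")

print(f"Remaining budget: {total_budget - spent:.2f}")
```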

What is our vision?

In an ideal future workflow, a researcher could download and use newly available synthetic data produced by a government agency to conduct an analysis or run a microsimulation model. If the researcher wanted to obtain a more accurate estimate by analyzing the underlying confidential data, they could submit a request to use the validation server, be quickly approved, run their code on the underlying data, and receive the most precise set of results while maintaining privacy. This entire process could run its course without the research team obtaining clearance, going through an overly drawn-out approval process, or undergoing a lengthy review to release results.

This streamlined process would leverage the government’s wealth of data and the existing ecosystem of university and nonprofit research partners to unlock better and more timely insights on the effects of policies and their alternatives.

We envision these new tools and technologies being built into a future “National Secure Data Service,” as recommended by the federal government’s Advisory Committee on Data for Evidence Building (PDF) and, more recently, funded at the National Science Foundation as a demonstration project under the CHIPS+ Act. We believe a future system can be rooted in modern data privacy methods and model design, enabling users to conduct a much broader set of analyses while easily understanding the trade-offs between privacy and accuracy in the final analysis.

In 2021, Urban built a first-of-its-kind prototype (PDF) validation server and made the code used to build it open source. We are now building a next-generation prototype based on extensive user feedback from privacy experts, economists, and government agency staff.

To build out this next-generation prototype, which we hope will eventually live in the future National Secure Data Service, we are working with the Statistics of Income Division of the Internal Revenue Service, the National Center for Science and Engineering Statistics at the National Science Foundation, and an advisory board consisting of experts in the fields of data analysis, economics, data privacy, data governance, information technology, and statistics.

What have been some of our biggest challenges and lessons learned thus far?

Our team consists of talented people with many distinct areas of expertise from across Urban. Because the challenges and lessons learned differ from one area to another, we asked a few team members building the validation server infrastructure — Josh Miller (web development), Deena Tamaroff (user experience design), Silke Taylor (software engineering), and Erika Tyagi (data engineering) — to share their thoughts.

· The learning curve on a project with this much cross-field complexity can be steep. But the idea of creating a tool that could be widely adopted and aid data scientists, social scientists, and other data users in uncovering patterns and running simulations against privacy-protected datasets is very exciting once you climb that learning curve.

· It’s challenging to distill an enormous amount of complex information into an elegant and user-friendly solution. The team was cognizant that many end users likely have little exposure to differential privacy or other data privacy concepts. Through the infrastructure build, we sought a balance between providing detailed instructions and streamlining the range of possible user behaviors so there were fewer opportunities to make a mistake. For example, particular elements of the interface, such as the mechanism for allowing the user to edit their privacy budget spend, were challenging to design because we felt that we had to choose between prioritizing efficiency and flexibility.

· We had to gain a deep understanding of how researchers would use this tool. It’s not always easy to see things from another person’s perspective or to think across fields to solve problems. However, getting to the point where we can release the tool for testing and see how researchers use this tool will allow us to make improvements more rapidly and bring the validation server to the next level.

· It was hard, yet exciting, to work on a project with so many unknowns and novel problems to solve. We’re dealing with data privacy methods that are still very much under development, and there are numerous unknowns (e.g., revealing information about errors in submitted programs). But it’s these imperfections that motivate us to keep pushing the boundaries and refining our work.

· When translating theory into practice, more decision points came up than we could have initially imagined. Both big (e.g., how do we prevent user-submitted programming errors from becoming attack vectors for malicious users?) and small (e.g., should the default privacy cost be defined per script or per analysis?) decisions cropped up, many unforeseen at the start of the project. Some of these decisions were specific to the privacy algorithm we chose, but the majority would be inherent to any system that accepts user-submitted input and automatically returns privacy-preserving results to the user. For the team, the most challenging (and rewarding) part of developing the validation server was the ongoing, iterative process of identifying these decision points, thinking through their implications, and choosing a path forward to balance complex trade-offs and ultimately get closer to a tool that is useful for both data stewards and researchers.

What’s next?

We believe that creating these new tools and technologies — along with many other related projects in the privacy protection and data space — and seeing them implemented in government is critical to improving our nation’s ability to create better, more informed policy that works for all. We will continue to develop these tools in partnership with government agencies, our funders, and partners across academia and fellow research organizations.

If you’re a government agency, data partner, or funder who is as excited about our vision as we are and interested in learning more, please reach out to our team at safedatatech@urban.org. We will follow up to help answer your questions.

-Graham MacDonald

Want to learn more? Sign up for the Data@Urban newsletter.
