Building the Smarter State: The Role of Data Labs

By connecting policy makers, government data owners and university data scientists, data and policy labs are helping government and the social sector get smarter and improve public policies and services through data science

The GovLab, published in Data Labs, Dec 13, 2017

by Anirudh Dinesh

Government at all levels (federal, state and local) collects and processes troves of data in order to administer public programs, fulfill regulatory mandates or conduct research¹. This government-held data, which often contains personally identifiable information about the individuals government serves, is known as “administrative data,” and it can be analyzed to evaluate and improve how government and the social sector deliver services.

For example, the Social Security Administration (SSA) collects and manages data on social, welfare and disability benefit payments for nearly the entire US population, as well as data such as individual lifetime records of wages and self-employment earnings. The SSA uses this administrative data for, among other things, analyzing policy interventions and developing models to project demographic and economic characteristics of the population. State governments collect computerized hospital discharge data for both government (Medicare and Medicaid) and commercial payers, while the Department of Justice (through the Bureau of Justice Statistics) collects prison admission and release data to monitor correctional populations and to address several policy questions, including those on recidivism and prisoner reentry.

Though they have long collected data, increasingly in digital form, government agencies have struggled to create the infrastructure and acquire the skills needed to make use of this administrative data to realize the promise of evidence-based policymaking.

The goal of this collection of eight case studies is to look at how governments are beginning to “get smarter” about using their own data. By comparing the ways in which they have chosen to collaborate with researchers and to make often sensitive data usable by government employees and researchers in ethical and responsible ways, we hope to increase our understanding of what is required to make better use of administrative data, including the governance structures, technology infrastructure and key personnel, and to enable other public institutions to do the same. What follows is a summary of the learnings from those case studies. We start with an articulation of the value proposition for greater use of administrative data, followed by the key learnings and the case studies themselves.

Evidence-based Decision-Making in the United States: Benefiting from Administrative Data Use

Over the years, administrative data has served as evidence for many academic research projects studying, for instance, the impact of simplifying the college financial aid process on access to education² or the influence of housing voucher programs on moves to lower-poverty neighborhoods³. Apart from its use in academic research, administrative data has also been used by government agencies themselves to evaluate program effectiveness and to target interventions more effectively. One of several examples in the United States is the Centers for Medicare & Medicaid Services (CMS), which uses hospital billing and payment data to improve service delivery and to reduce costs.⁴ In fact, CMS and the Department of Housing and Urban Development (HUD) are collaborating to combine their administrative data to study the impact of housing assistance on health care utilization, especially among low-income senior citizens.⁵

In 2016, following a bipartisan call to “improve the evidence available for making decisions about government programs and policies,” Congress established the Commission on Evidence-based Policymaking. Earlier this year, the Commission submitted its final report, detailing a number of recommendations to strengthen government’s ability to make use of its own data and highlighting the challenges government agencies face in trying to do so. The report calls for, among other things, a review of the relevant statutes and laws in order to allow statistical uses of administrative data while also ensuring stringent privacy protections for such data.

But even before the Commission started its work, several initiatives in the United States, both within government agencies and outside them, had been focused on bringing more evidence-based decision making into government. They provide several services that help government make use of its own data to improve programs and policies in the public interest. These services include improving the quality of the data by cleaning and integrating datasets from several agencies, as well as providing evaluation services to gain insights from the data that can eventually lead to action. To undertake this work of making data usable while protecting privacy, several governments have turned to so-called “data labs.”

What are Data Labs?

“Data Labs” or “Policy Labs” are institutions with small groups of data analysts working inside or in tandem with government agencies to make administrative data more usable for evaluation and research. And while organizations vary widely in their implementation, they have all developed models to tap into the skills of highly talented data analysts and to access valuable government datasets responsibly. Each model comprises all or some of the following elements:

  1. Infrastructure to share and/or house data securely;
  2. Full-time or part-time employees with the skills to integrate datasets from various sources and use the data for analysis;
  3. Governance rules to determine who has access to the data, and under what conditions;
  4. Contracts in place to institutionalize those governance rules;
  5. Statistical and research methods which are iteratively improved to generate more useful insights, including methods to define the problem as well as data science approaches;
  6. Well-defined data responsibility principles, processes and tools to share, analyze and use data responsibly and ethically and to share insights from the analysis without harming individuals or groups.

Methodology

In consultation with program staff at the Laura and John Arnold Foundation and the Alfred P. Sloan Foundation, both funders that support initiatives which promote evidence-based decision-making in the United States, we identified a number of domestic labs that are involved in data-driven policy making to complement earlier research we conducted for the NHS about the UK Ministry of Justice Data Lab created by New Philanthropy Capital. Although there are many more than we could document, we attempted to cover examples from different geographic regions and to include a range of models (such as those completely internal to government, others wholly external, some university-government partnerships and some which cater to the social sector rather than just to government).

Our case studies are based on written questionnaires as well as interviews with directors and senior staff of these labs and with their funders. The questions we asked are reflected in the structure of the case studies, each of which follows a common taxonomy. The initial interviews and drafts were prepared by Professor Beth Noveck’s students in Yale Law School’s Governance Innovation Clinic in Spring 2017. Their work was subsequently supplemented with a second round of interviews and completed by the Governance Lab in Summer 2017 with the support of a grant from Stanford’s Markets for Good and in collaboration with our partners at New Philanthropy Capital.

We know the work is far from complete, and while we have taken the utmost care to ensure accuracy in our profiles of these labs, we welcome any factual corrections or comments you may have regarding the case studies. Please contact Anirudh Dinesh.

The Case Studies

This white-paper contains a detailed analysis of six data lab models in the United States and one from the United Kingdom. It is meant to better our understanding of the existing landscape of data labs and to learn from their successes and drawbacks. In particular, we look at the governance processes that they employ to share or access data and how these labs solicit expertise to enable program evaluation. Our goal is to provide data owners, particularly those who have personally-identifiable data, with a blueprint to securely, responsibly and ethically share that information to the end of improving social programs in the public interest.

The case studies include:

  1. Actionable Intelligence for Social Policy (AISP): AISP brings together state and local governments and their university and non-profit partners in a professional learning network of institutions that maintain and run Integrated Data Systems (IDS). It also engages in federal advocacy to support data sharing that informs evidence-based policy, and conducts research using administrative data.
  2. California Policy Lab: A university-government partnership that aims to help cities, counties and the State of California improve public programs through empirical research, program evaluations and technical assistance provided by UCLA and UC Berkeley, with the aim of improving the lives of Californians.
  3. Ministry of Justice Data Lab: An analytical service run by the UK Ministry of Justice which uses UK reoffending administrative data to conduct impact evaluations for organizations that provide offender rehabilitation services.
  4. Rhode Island Innovative Policy Lab (RIIPL): In partnership with the State of Rhode Island, RIIPL works with state, local and federal government agencies to unlock the power of data, economics, and behavioral science to improve policies, alleviate poverty, and increase economic opportunity in Rhode Island and beyond.
  5. The Center for State Child Welfare Data (Chapin Hall, University of Chicago): A membership-based network which enables public sector agencies to access data securely and use it for analyses that support service improvements that target children placed away from their parents.
  6. University of Chicago Urban Labs: Building on the model of the Crime Lab, which partners with policymakers and practitioners to help cities design and test the most promising ways to reduce crime, the University of Chicago launched Urban Labs in 2015 to help cities tackle urban challenges in the crime, education, energy & environment, health, and poverty domains.
  7. Washington State Institute for Public Policy (WSIPP): WSIPP helps the Washington state legislature and other Washington state policy makers make evidence-based policy decisions by conducting public policy research and carrying out cost-benefit analyses of the state’s programs and policies.

The case studies endeavor to answer the following specific questions to accelerate the ability of others to create such labs:

  • What are the enabling conditions for the creation and successful operation of the data lab/policy lab?
  • How is data 1) accessed, 2) shared and aggregated, 3) analyzed?
  • How (and with whom) are the insights from the analysis shared?
  • What are the questions it seeks to answer and how are those questions determined?
  • What are the risks involved? How are they mitigated?
  • How is impact measured and what is the current evidence?

Key Learnings

Current Practice

In every case the goal of the data lab was identical: to improve public services using evidence-based decision-making which, in many cases, leveraged administrative data. But beyond the goal, the models differ, and we endeavor to evaluate the pros and cons of different approaches. Models in the case studies can be broadly grouped by 1) “Owners” of the lab; 2) “Customers” of the lab; 3) Services they offer; and 4) Source of their analytic talent.

  1. “Owners” of the data lab: Whether the data lab is positioned inside or outside government is important, since it informs several other design elements of the lab. For example, a data lab inside government, such as WSIPP or the Justice Data Lab, tends to have comparatively easier access to data by virtue of being part of the government agency. Labs positioned outside, such as the California Policy Lab, need to negotiate individual data sharing agreements with multiple agencies before they can access data. But the former requires more buy-in from government to set up and works primarily on projects mandated by its parent agency, while the latter model affords more freedom in the research questions the data lab might want to pursue.
  2. “Customers” of the lab: Some data labs cater exclusively to government agencies, while others cater to both government and academic researchers. Much rarer are labs that serve NGOs, in other words, labs that help the social sector analyze data to evaluate its own programs.
  3. Services offered: Since many elements go into making administrative data easier to use, these labs offer diverse services such as cleaning datasets, combining datasets from multiple sources, storing data in secure facilities, analyzing data and sharing the findings. While some data labs, like CPL and WSIPP, have the resources, capacity and talent to do all of these themselves, others, like the Justice Data Lab and the IDS sites in AISP’s network, are focused on providing more specialized services. The Justice Data Lab, during its initial pilot, used only one database, the Police National Computer, to provide NGOs with one service: impact evaluation in terms of re-offending rates. The IDS sites in AISP’s network are focused on integrating data from various agencies, across counties and states, in order to make it easy for researchers to generate insights.
  4. Source of analytic talent: The talent gap is one of the primary issues faced by government agencies and NGOs alike. Broadly, there are two ways to fill this gap: hiring full-time data analysts, as in the case of the Justice Data Lab, or partnering with local universities and leveraging the skills of their faculty and researchers, as is the case with the California Policy Lab. In either case, the solution is not simply to have a group of data analysts or researchers who can perform statistical operations on datasets. They should also be able to generate useful and relevant insights from the results.

Cutting across jurisdictions and implementation models, there was overwhelming agreement that the technology (both statistical tools and storage infrastructure) creates only the opportunity, not the motivation and capacity, to undertake evidence-based decision making. The importance of human capital in setting up these labs, according to Fred Wulczyn, Senior Research Fellow at Chapin Hall in Chicago, is generally understated. The aim of the data labs, beyond providing their services, is to lower the barrier to entry for individuals who want to use administrative data to improve public programs and tackle difficult problems; in other words, to fill the talent gap highlighted earlier.

A multiplicity of factors contributes to the ability to stand up a data lab. The availability of relevant administrative data-sets, buy-in from government agencies to share data and from analysts to use it, as well as clearly defined frameworks regarding the ethical and responsible use of these data-sets are important enablers of the creation of a data lab.

What are the Enabling Conditions for the Creation and Successful Operation of a Data/Policy Lab?

Government and Researcher Buy-in

“Often the people most interested in research, do not have the relevant data,” said Evan White, executive director at CPL-Berkeley, “and the people charged with stewarding the data do not have the resources to pursue research.”

This conundrum, in which the people interested in doing research lack access to the data while those with access to the data are not doing research, resonates across the labs we studied. According to New Philanthropy Capital, the think tank behind the UK Justice Data Lab model, political support is one of the key criteria required to set up a data lab. WSIPP was set up when the state legislature was convinced of the need for evidence-based policy making, while the Rhode Island Innovative Policy Lab (RIIPL) was set up for similar reasons (in the Rhode Island governor’s office). Since government is the data owner, it is of paramount importance that the relevant agency agrees to make the data available in a form suitable for analysis.

But that’s only one half of the equation. It is equally important that there are analysts who are available and willing to perform these evaluations. In some cases, the analytic talent comes from full-time data analysts, while in others, university faculty and researchers contribute their skills. Especially where the analytic talent comes from university faculty and researchers, there needs to be sufficient overlap between their field of research and the focus area of the data lab, failing which they will have little incentive to devote their time and skills to the lab.

How is Data Accessed, Shared, Aggregated and Analyzed?

Data Access

In the cases covered in this study, the data being used already exists within government. Administrative data captures information about the citizenry from birth through childhood and adulthood and is relatively inexpensive to obtain⁶ compared with other types of data, such as a survey of an equivalently large sample. But since this data was not collected for the purpose of policy research or evaluation, not every administrative dataset lends itself to useful analysis, so it is important to know ahead of time what information a dataset contains.

Organizations like New Philanthropy Capital in the UK and AISP in the United States have identified some key characteristics that an administrative dataset should have in order for it to be useful for evaluation.⁷ But this information about the dataset may not be easily available to anyone trying to access the data. In 2013, the Office of Survey Methods Research (OSMR) developed a tool to help users who are “unfamiliar with the structure, content and meaning of the records” gauge the “quality” of a dataset by posing a series of questions to the data owner, covering topics such as the scope of the records included in the dataset, legal and regulatory restrictions on its use, and a description of each variable in the dataset.

Data-set catalogs like J-PAL’s Catalog of Administrative Data Sets make this information more accessible by listing important information about some administrative data-sets, including cost, unit of observation (individual or groups), years available, identifiers available for linking and more. But data catalogs are only tools to help the process of data identification. Humans with good knowledge of government datasets and with the skills to identify relevant datasets for a particular evaluation are irreplaceable pieces of the puzzle.
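As a rough illustration of how catalog metadata of this kind might be put to work, the sketch below models a catalog entry with the fields mentioned above and filters for individual-level datasets that carry a given linking identifier over a given period. The `CatalogEntry` class, its field names, the `linkable` helper and the sample entries are all hypothetical; they are not J-PAL’s schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record; fields mirror those named in the text."""
    name: str
    cost_usd: int                   # access fee; 0 means free
    unit_of_observation: str        # "individual" or "group"
    years_available: range          # range(2005, 2016) covers 2005 to 2015
    linking_identifiers: frozenset  # identifiers available for linking


def linkable(entries, identifier, first_year, last_year):
    """Names of individual-level datasets that carry `identifier`
    and cover both the first and last year of the study period."""
    return [
        e.name
        for e in entries
        if e.unit_of_observation == "individual"
        and identifier in e.linking_identifiers
        and first_year in e.years_available
        and last_year in e.years_available
    ]


# Made-up entries for illustration only.
catalog = [
    CatalogEntry("hospital_discharge", 500, "individual",
                 range(2005, 2016), frozenset({"ssn_hash", "dob"})),
    CatalogEntry("school_enrollment", 0, "individual",
                 range(2010, 2016), frozenset({"student_id"})),
    CatalogEntry("county_benefits", 0, "group",
                 range(2000, 2016), frozenset({"county_fips"})),
]

print(linkable(catalog, "ssn_hash", 2008, 2012))
```

A filter like this only narrows the search. Judging whether a shortlisted dataset actually suits a particular evaluation still depends on people who know the data.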

Data Integration

Individual administrative datasets are seldom sufficient for performing useful evaluations. This is because administrative data are collected by agencies for non-statistical purposes and may not contain all the variables necessary to answer a specific research question. To address this, AISP has developed a network of organizations that focus on creating Integrated Data Systems (IDS) or are involved in longitudinal research projects that link multiple datasets. Due to the lack of technology standards, setting up an IDS from scratch can be an expensive process, costing anywhere between $2 million and $4 million per site.⁸ But there is an abundance of examples to illustrate the benefits of integrating data from multiple sources. Take, for example, the case of the Institute for Social Capital, whose database was used by the UNC Charlotte Urban Institute for a study⁹ to “describe veterans experiencing homelessness” by comparing data from two sources: Veterans Services and the Homeless Management Information System (HMIS). The study revealed a lack of coordination among agencies that provide services for veterans experiencing homelessness (only 6% of veterans who used HMIS systems were also connected to Veterans Services). Access to integrated data systems will not only significantly reduce the cost of performing evaluations but also increase the quality of the findings.
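The kind of cross-agency comparison described above comes down to linking records on a shared identifier and counting the overlap. The following is a minimal sketch under stated assumptions: the record layout, the `person_id` key, the `overlap_share` helper and the sample rows are invented for illustration and do not reflect the actual Institute for Social Capital database or the study’s method.

```python
def overlap_share(primary, secondary, key="person_id"):
    """Share of distinct people in `primary` who also appear in `secondary`."""
    primary_ids = {row[key] for row in primary}
    secondary_ids = {row[key] for row in secondary}
    if not primary_ids:
        return 0.0
    return len(primary_ids & secondary_ids) / len(primary_ids)


# Invented records standing in for extracts from two agencies.
hmis = [
    {"person_id": "a1", "shelter_nights": 12},
    {"person_id": "b2", "shelter_nights": 3},
    {"person_id": "c3", "shelter_nights": 40},
    {"person_id": "d4", "shelter_nights": 7},
]
veterans_services = [
    {"person_id": "c3", "benefit": "housing"},
    {"person_id": "z9", "benefit": "health"},
]

share = overlap_share(hmis, veterans_services)
print(f"{share:.0%} of HMIS clients also appear in Veterans Services records")
```

The sketch assumes both agencies already share a clean common identifier. In practice that is often the hard part, and an integrated data system may need to match on combinations of fields such as name and date of birth.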

Data Analytic Capacity

Even when data is sufficiently available, NGOs and some government agencies suffer from the lack of capacity to utilize the data effectively. This is not simply because of a shortage of data scientists but rather because a data analyst not only needs to have the skills to perform statistical analyses and tests on a given dataset, but also the capacity to extract useful insights from the results. In the long term, this deficit can be addressed, as suggested by Kevin Desouza and Kendra Smith, by opening up more avenues for data scientists and social scientists to equip themselves with the skills needed to identify patterns in datasets and think creatively about using data.¹⁰

This expertise does exist in educational institutions across the country and can be leveraged by giving academics and researchers access to the data, as evidenced by these cases. The California Policy Lab’s focus is to solve the most urgent public problems that affect the lives of Californians. It achieves this goal by connecting top researchers from the State’s universities with administrative data from government agencies. The academics help the agencies make more evidence-based decisions.

An alternative model involves the creation of small units, like WSIPP, within government to provide data analytic and policy research services. The success of WSIPP’s cost-benefit analysis tool has led to its adoption by the Pew-MacArthur Results First Initiative to help other states develop similar capabilities.

Another interesting model comes out of the University of Chicago where the Center for State Child Welfare Data has developed a suite of online analytic tools that help analyze “critical outcome related questions,” including time to permanency and re-entry into care. Member states can perform a set of simple to moderately complex statistical analyses on data sets that have been cleaned and integrated by researchers of the center.

The Justice Data Lab, which caters to NGOs, is staffed by just four employees and yet produced over 170 evaluations in two years.

What are the Risks Involved and How are They Mitigated?

Privacy concerns: The need for responsible data use

The use of personally identifiable data raises questions of privacy, both legal¹¹ (HIPAA¹² for healthcare information and FERPA¹³ for education records) and ethical. But as Evan White explained, laws like HIPAA and FERPA are important to safeguard privacy and while they place certain limitations on the use of the data sets, they’re not prohibitive. While some data owners will only provide de-identified data (and researchers opined that this was sufficient for the purposes of their research), it doesn’t solve the issues regarding privacy.

We found there was broad consensus that linking datasets¹⁴ from multiple sources and tracing individuals across datasets¹⁵ was an important part of deriving useful insights from evaluations. But with increased data linking, the risk of re-identifying individuals in the data also increases. To mitigate these risks, oversight mechanisms, including technical redundancies such as hierarchical access control, must be in place to ensure that no individual’s privacy is compromised even when de-identified data is being used. Organizations like those in AISP’s network are almost exclusively focused on building integrated data systems, while AISP has done several extensive studies on the technological¹⁶ and governance mechanisms¹⁷ to protect confidentiality in these systems. Meanwhile, the UK Ministry of Justice Data Lab mitigates this risk by allowing a requestor to query but not directly access any administrative data. When Justice Data Lab employees receive a request from a program provider to run an analysis, they identify a matched control group with similar characteristics to the cohort provided to them and perform the required analysis. The program provider receives a report with the results, which contains no personally identifiable data. The Center for State Child Welfare Data mitigates similar risks both with technical measures, like strict data transfer protocols, and with governance mechanisms, such as an advisory board that provides guidance on substantive issues including data access for those outside the network. The Center, for example, does not share any data directly with outsiders. Instead, the request is sent to the relevant state agency, which decides whether or not to share its data.
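To make the “query but not access” idea more concrete, here is a minimal sketch of a matched-comparison analysis that returns only aggregate figures. It uses simple nearest-neighbor matching on two numeric characteristics, which is a stand-in chosen for brevity and not a description of the Justice Data Lab’s actual statistical method. The function names, field names and records are all hypothetical.

```python
from statistics import mean


def nearest_match(person, pool, features):
    """Pick the pool record closest to `person` on the given numeric features."""
    return min(
        pool,
        key=lambda other: sum((person[f] - other[f]) ** 2 for f in features),
    )


def impact_report(cohort, pool, features, outcome="reoffended"):
    """Compare a cohort with a matched control group and return aggregates only.

    Assumes the pool holds at least as many records as the cohort.
    """
    remaining = list(pool)
    controls = []
    for person in cohort:
        match = nearest_match(person, remaining, features)
        remaining.remove(match)  # match without replacement
        controls.append(match)
    cohort_rate = mean(p[outcome] for p in cohort)
    control_rate = mean(c[outcome] for c in controls)
    # No individual-level rows leave this function.
    return {
        "cohort_size": len(cohort),
        "cohort_rate": cohort_rate,
        "control_rate": control_rate,
        "difference": cohort_rate - control_rate,
    }


# Invented records; 1 means the person reoffended during follow-up.
cohort = [
    {"age": 24, "prior_offences": 3, "reoffended": 0},
    {"age": 41, "prior_offences": 1, "reoffended": 1},
]
pool = [
    {"age": 25, "prior_offences": 3, "reoffended": 1},
    {"age": 40, "prior_offences": 1, "reoffended": 1},
    {"age": 60, "prior_offences": 0, "reoffended": 0},
]

report = impact_report(cohort, pool, ["age", "prior_offences"])
print(report)
```

The design point is the return value: the requester sees rates and a difference, never the matched records themselves. A real service would also need safeguards this sketch omits, such as refusing to report on very small cohorts.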

Conclusion: The Data Lab as Institutional Driver of Innovation

Just as the creation of innovation labs within the public sector has helped to train more public servants in entrepreneurial problem-solving skills and support citizen-centered service design, the evolution of the data lab is an important development toward more evidence-based government. There is no single “best” way to go about creating a data lab, and the approach is contingent on the particular needs of the owners and customers of the data labs as well as on the legal and technical constraints they need to operate under. But our analysis of these case studies has shown that it is possible to work around the constraints with some important enabling conditions including 1) availability of the relevant datasets and the ability to gauge the quality of those datasets; 2) buy-in from government agencies not only for sharing data but for doing so in a form that is suitable for analysis; 3) frameworks to enable the use of these datasets in a responsible and ethical manner; and 4) availability of the necessary human capital to enable analysis.

Despite the progress made so far, several challenges remain, especially the lack of technical and legal standards for data sharing as well as the talent gap within government to make use of its own data. There are many examples of government agencies around the world developing innovative methods to address the talent gap — the policy lab model being only one among them. As institutions, data labs complement these other efforts toward training public servants in data science.

Several policy labs have well-documented evidence to show their impact on government programs and policies. But it was harder to find examples of labs (with the exception of the UK Ministry of Justice Data Lab) that were focused on providing similar services to social sector organizations, including nonprofits and charities, which represent over 5% of GDP in the United States. This is a mistake that needs to be rectified. We hope that going forward, more and better resourced labs will make it easier for both the social sector and government to deliver more evidence-based

With comments or suggestions, please contact Anirudh Dinesh.

We’d like to thank Tris Lumley, Director of Innovation and Development, New Philanthropy Capital and Tracey Gyateng, Data Labs Project Manager, New Philanthropy Capital for their help with drafting this post.

FOOTNOTES:

¹ “Barriers to using administrative data for evidence-building”, White House Office of Management and Budget, July 2016

² “The lessons of administrative data”, J-PAL North America

³ Raj Chetty et al, “The effects of exposure to better neighborhoods on children: New evidence from the Moving to Opportunity experiment”, American Economic Review, vol 106 (4), April 2016

⁴ “Barriers to using administrative data for evidence-building”, White House Office of Management and Budget, July 2016

⁵ The Lewin Group, “Picture of housing and health: Medicare and medicaid use among older adults in HUD-assisted housing”, U.S. Department of Health and Human Services, March 2014

⁶ Lisa I. Iezzoni, “Assessing Quality Using Administrative Data”, Ann Intern Med. 1997;127:666–674

⁷ Aileen Rothbard, “Quality issues in the use of administrative data records”, AISP, 2013

⁸ Ken Steif, “AISP Innovation: New technology for integrated data”, Urban Spatial

⁹ “Service utilization of veterans experiencing homelessness: 2007–2012”, UNC Charlotte Urban Institute, November 2015

¹⁰ Kevin C. Desouza and Kendra L. Smith, “Big data for social innovation”, Stanford Social Innovation Review, 2014

¹¹ “Barriers to using administrative data for evidence-building”, White House Office of Management and Budget, July 2016

¹² Public Law 104–191, “Health Insurance Portability and Accountability Act of 1996”

¹³ 20 U.S.C. § 1232g; 34 CFR Part 99, “Family Educational Rights and Privacy Act”

¹⁴ Aileen Rothbard, “Quality issues in the use of administrative data records”, AISP, 2013

¹⁵ Robert F. Boruch, “Administrative record quality and integrated data systems”, AISP, 2011

¹⁶ David Patterson et al, “Towards state-of-the-art IDS technology and data security solutions”, AISP, March 2017

¹⁷ Linda Gibbs et al, “IDS Governance: Setting up for ethical and effective use”, AISP, March 2017

The Governance Lab: improving people’s lives by changing how we govern. http://www.thegovlab.org @thegovlab #opendata #peopleledinnovation #datacollab