When the Data Isn’t Perfect

Izzy Youngs
Georgetown Massive Data Institute
Jun 15, 2022

Why we need higher quality place-based indicators

By Izzy Youngs and Joseph Scariano

Legal records, social service programs, online transactions, and many other aspects of modern life routinely generate large amounts of data on people all across the country. This data can offer meaningful insights into what’s going on in our communities, helping us answer important questions such as “Has public safety increased in my city?” or “How have poverty rates in my town changed over time?”

Despite this substantial upsurge in new data, barriers to access and methodology prevent us from using it to understand the problems our communities face and how we can solve them. Instead, researchers must rely on data that is too old or too imprecise, or repurpose existing indices whose data or methodology does not quite fit. This occurs consistently in research measuring wealth and poverty, vulnerability and disadvantage, the social determinants of health, and much more.

To unlock more timely data streams and create reliable, easy-to-use measures of community impact, which we call “place-based indicators,” we must overcome several barriers:

  • Standardization: There is a lack of standardized definitions for key terms. Consider studies assessing “gentrification,” where a public health researcher defines gentrification in terms of displacement while an economist defines it in terms of mean household income. Because the definitions are not standardized, these two projects with similar goals may use wildly different indicators to determine whether, or to what extent, gentrification is occurring, or they may use the same indicator even though it is not fit for use in both studies. Non-standardized definitions also pose problems for the development and application of place-based indicators. Without clear definitions and descriptions of the inputs, it is difficult to compare or aggregate data across multiple datasets, and the results of similar studies or analyses of interventions can diverge widely.
  • Temporality: When users rely on data tied to a particular geography, they must be mindful of the timing of data collection. For example, demographic variables from the 5-year American Community Survey are often used as the denominator for place-based indicators, but demographics may shift over the course of a 60-month period, affecting the measurements or inferences produced by such indicators. Approximately 8.4 percent of U.S. residents change residence every year, and nearly half of those who do so move across county, state, or national boundaries. Because people move into and out of a geographic location over a long period of time, surveys must select a very large, representative sample to account for respondent falloff or movement away from the community, and data users must be cautious when using a multi-month or multiyear collection of survey responses. Furthermore, aggregating a constantly shifting demographic profile up to higher geographic levels decouples the relationship between a single individual and their exposure to a specific place or intervention. For example, an individual who has lived in a certain census tract for ten years has had more exposure to a specific phenomenon, such as blight or access to urgent care, than someone who has lived in the census tract for two months; however, determining the impact of this exposure would be nearly impossible in a multiyear survey sample. Research shows that the length of exposure to a neighborhood effect is important for measurement. For this reason, certain industries or domains, such as public health or social services, may consider longitudinal studies that compare individuals over time, rather than a statistical snapshot of a neighborhood's demographics at a single point in time, depending on the research questions being answered.
  • Geography: It is critical that linked data or time series data utilize equivalent geographic boundaries, or the integrity of the indicator may be compromised. For example, ZIP code data and census tract data are commonly used by different practitioners, but linking across them can be challenging and using them together may be inadvisable. Census tracts tend to avoid some of the pitfalls of other geographies, such as census blocks (which have a variety of data quality issues) and ZIP code tabulation areas (which often do not conform to political boundaries). Census geographies tend to remain more static than other geographic units, and the US Census Bureau helpfully provides crosswalks when boundary changes do occur. However, even census tracts can be insufficient or problematic for certain uses. Aggregating to higher levels of geography presents persistent statistical problems when conducting place-based analyses. When a neighborhood or area with a high level of poverty is split by census tract boundaries, the concentration is distributed across two or more geographies, masking or diluting isolated pockets of high poverty within a larger area. This is called the Modifiable Areal Unit Problem (MAUP), and it is pervasive in spatial analysis. In fact, when using tract-level data in medical research, 52 percent of high-risk patients are missed due to the issues of aggregation. This indicates that even though the field of place-based indicators uses a semi-standard level of geography, it still needs to consider the limits of that geography and the steps that can be taken to improve accuracy.
  • Access: The creation of high-quality place-based indicators may rely on data that is not easily or legally accessible. Researchers rely on consistent survey methodologies, standardized geographies, and repeated collection schedules. This often limits options to government databases from agencies like the US Census Bureau or the Bureau of Economic Analysis. The smallest unit of geography that maintains high quality tends to be the census tract, which makes it harder to track interventions that require more precise geography levels. For access to microdata, there may be serious legal and cost barriers. In some instances, the data may exist but be in poor condition for digitization, stuck in legacy databases or in scanned paper format. When the data does exist in local, state, or federal administrative records, officials may be fearful of privacy violations, face technical and logistical issues, be unsure of data ownership or provenance, and more. While many of these statistical and administrative data sources could be put to greater use, reducing barriers to this disparate and often unaligned data can be a huge challenge. That challenge increases reliance on trusted third-party data providers and dominant repositories of federal statistical data such as the Federal Statistical Research Data Centers, which are expensive and time-consuming to navigate.
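
To make the aggregation issue from the Geography barrier concrete, here is a minimal Python sketch of the Modifiable Areal Unit Problem. All of the numbers are hypothetical and chosen only for illustration: eight equally sized blocks along one street, a two-block pocket of concentrated poverty, and an assumed 40 percent cutoff for "high poverty." The function name `poverty_rates` and the two boundary layouts are likewise invented for this example and do not come from any Census product.

```python
# Hypothetical block-level counts along one street: (residents, residents in poverty).
# The blocks at index 3 and 4 form a small pocket of concentrated poverty.
BLOCKS = [
    (100, 5), (100, 5), (100, 5), (100, 60),
    (100, 60), (100, 5), (100, 5), (100, 5),
]

# Assumed cutoff for calling a tract "high poverty" (illustrative only).
HIGH_POVERTY = 0.40


def poverty_rates(blocks, tracts):
    """Return the poverty rate of each tract.

    `tracts` is a list of (start, end) index pairs; each tract covers
    blocks[start:end].
    """
    rates = []
    for start, end in tracts:
        residents = sum(pop for pop, _ in blocks[start:end])
        in_poverty = sum(poor for _, poor in blocks[start:end])
        rates.append(in_poverty / residents)
    return rates


# Layout 1: the tract boundary runs through the middle of the pocket.
split_pocket = [(0, 4), (4, 8)]
# Layout 2: the boundaries are drawn around the pocket instead.
whole_pocket = [(0, 3), (3, 5), (5, 8)]

for name, layout in [("split", split_pocket), ("whole", whole_pocket)]:
    rates = poverty_rates(BLOCKS, layout)
    flagged = [r for r in rates if r >= HIGH_POVERTY]
    print(name, rates, "high-poverty tracts:", len(flagged))
```

Under these assumptions, the same 800 residents produce two tracts at roughly 19 percent poverty when the boundary splits the pocket, and one tract at 60 percent (flanked by two at 5 percent) when it does not. Nothing about the underlying population changed; only the boundaries did.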

We Need Reliable Place-Based Indicators

Imagine a user-friendly, web-based interactive portal where researchers, urban planners, and evaluators could identify place-based indicators, explore options for obtaining access to them, and receive information that explains best practices for their use. The urban planner or researcher would only need to search the database with their specific project requirements. With easy, standard access to reliable and consistently produced indicators, individuals would be able to produce higher-quality and more impactful projects. There would be limitations on the use cases of these indicators, but they could be used for evaluation, certification, funding, monitoring, measurement, and more. The indicators may draw on administrative data, federal statistical data, private data, open data, and anything in between. Each indicator could come with its own list of considerations based on industry or use, and a profile of the data collection methodology. In 10 minutes, a researcher could have a list of potential, high-quality indicators to use in their research. Fortunately, much of the data required to produce the consortium of indicators we describe already exists, and many reliable, consistent indicators already exist as well. A world of standardized and accessible indicators is not as far away as it may seem.

Join us!

To maximize the potential of place-based indicators, the Massive Data Institute at Georgetown University has launched the Place-Based Indicators Project: Designing Standard Measures of Community Impact. Our goals are to 1) point people towards high-quality indicators that already exist and 2) outline processes for developing new, high-quality indicators where none exist, using already-established administrative data when possible. We aim to do this by supporting the improvement of existing indicators and providing frameworks for creating additional indicators.

Through one-on-one meetings and small-group convenings with stakeholders and users of place-based indicators across a diverse range of domain areas, we are identifying programs, frameworks, indicators, and metrics which are frequently represented in industries and literature; documenting quality issues with existing indicators; and pinpointing access and ethics barriers to new data sources. This information will inform the development of a series of tools and best practice guides.

We will also hold a series of data challenges in which we will engage public data users in designing new methods that enable administrative data to be used as place-based indicators.

We hope you will sign up for our newsletter to follow our progress and learn how you can get involved!

Note: The statistic and citation on the percent of household moves per year were updated on 7/29/22.
