Effective Use of Demographic Information

Mona Khalil
Published in In the weeds
Apr 6, 2020

A Recap from Speaking at the NYC School of Data

Demographic data plays a major part in a large proportion of research projects, machine learning models, and scientific findings. Ideas about people based on their race and gender permeate all areas of society and can inform our decisions in ways we aren’t even aware of. Many of these models and projects are later found to produce biased results, doing harm to already marginalized groups in society.

My colleague Devin Johnson and I recently presented on this topic at the annual NYC School of Data, introducing an aspect of data science ethics we’ve been investigating for years. We prepared for a casual session with 10 to 15 people…and instead packed the room to its limits with 50+ attendees! We generated incredible conversations as we shared ideas with our colleagues, and were met with wide enthusiasm for this topic everywhere we went. We’re excited today to share the top 3 problems we identified in our talk, and 3 recommendations for approaching demographic data analysis in your work.

Problems with Demographic Data Collection and Analysis

The importance of demographic information in a specific analysis is rarely questioned. Demographic questions assessing study participants’ race, gender, age, and income rarely receive pushback from research teams and ethics committees. Asking a research team why they need to collect demographic information returns nearly identical responses: “how else will I compare groups?” or “that’s how I identify group differences.” The question of why these categories are important is rarely asked. Their importance is simply assumed.

Commonly used demographic groups (e.g., race/ethnicity, income categories) differ widely throughout history and across countries. Race/ethnicity categories commonly used in research tend to be based on those used in the U.S. Census, which assume differences between people based on country of origin and/or observable characteristics. Yet categories differ even in countries that are otherwise culturally similar to the U.S. (e.g., Canada), and tend instead to reflect the relationship between a majority group and multiple minority groups.

Statistical tests comparing large numbers of subgroups in a dataset are prone to Type I (false positive) errors. The more groups you compare using methods such as a t-test or ANOVA, the more likely you are to get at least one false-positive result. Even with statistical tests that adjust for multiple comparisons, each additional comparison gives a fluke another chance to appear. Additionally, the further you subset your dataset, the smaller each subgroup’s sample becomes, which makes estimates noisier and results less reliable.
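To make the arithmetic concrete, here is a minimal sketch of how quickly the risk grows. It assumes every pair of groups is compared with its own independent test at a significance level of 0.05 and that no correction is applied; real tests on the same dataset are usually correlated, so treat the numbers as a rough guide rather than an exact rate. The function names are ours, chosen for illustration.

```python
from math import comb


def pairwise_comparisons(num_groups: int) -> int:
    """Number of distinct group-vs-group comparisons among num_groups groups."""
    return comb(num_groups, 2)


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across num_tests tests.

    Assumes the tests are independent and uncorrected (a simplification).
    """
    return 1 - (1 - alpha) ** num_tests


# Show how the risk grows as the number of demographic groups increases.
for k in (2, 4, 6, 8):
    m = pairwise_comparisons(k)
    rate = familywise_error_rate(m)
    print(f"{k} groups -> {m} comparisons, chance of a false positive ~ {rate:.2f}")
```

Under these assumptions, going from two groups to six takes you from one comparison to fifteen, and the chance of at least one spurious "difference" rises from 5% to more than half.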

What can I do instead?

We proposed a set of ideas involving the use of demographic information in designing research projects and analyzing data. In our previous work, we’ve found that these approaches have been effective at generating useful conversations with stakeholders, refocusing our analytics efforts, and making more concrete recommendations about the populations we were studying.

Don’t collect demographic information. Simply put, skip it unless your hypotheses directly involve the needs of a specific demographic group. Are you studying the impact of a health and wellbeing program on a specific immigrant population in New York City? Go for it. Are you conducting a study to understand how people learn at different ages? Maybe don’t collect data on race.

Have demographic information already? Ignore it. Before exploring demographic data, consider what hypotheses you would come up with if you did not have that data available. What other sources of data would you collect or find in order to add value to your work? What assumptions would you make? If you’re studying topics such as drug usage among teenagers, try looking at data about each student’s family income, relationship to parents, relationship to peers, and drug usage among adults in the region. Many studies looking at drug and alcohol usage focus on poor, minority students, even though research shows that affluent students use drugs and alcohol at similar if not higher rates[1]. The routine focus on demographic data in this area of research detracts from the larger scale of the problem and a comprehensive solution.

Can’t get out of analyzing demographic data? Use fewer categories. If you or your stakeholders expect results that compare demographic groups, consider reframing how you break down your categories. Which demographic group is the dominant group (i.e., the group you suspect has an advantage) in the context of your work? Which group(s) do you expect are at a disadvantage? For example, if you are examining pay equity by race and gender, you may want to compare male vs. non-male (women, non-binary people) employees and white vs. non-white employees. This allows very small groups (e.g., the two non-binary employees) to retain anonymity in your analysis and evaluates groups in accordance with power structures. It also reduces the number of group comparisons, lowering your chance of false-positive differences in your statistical tests.
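As a rough illustration of the pay equity example, the sketch below collapses detailed categories into dominant vs. non-dominant groups before any comparison is run. The employee records, the category labels, and the `collapse` helper are all invented for this example; which group counts as dominant depends on the context of your own work.

```python
from collections import Counter
from math import comb

# Hypothetical employee records, invented purely for illustration.
employees = [
    {"gender": "man", "race": "white"},
    {"gender": "man", "race": "white"},
    {"gender": "man", "race": "Black"},
    {"gender": "man", "race": "Asian"},
    {"gender": "woman", "race": "white"},
    {"gender": "woman", "race": "white"},
    {"gender": "woman", "race": "Black"},
    {"gender": "woman", "race": "Hispanic"},
    {"gender": "non-binary", "race": "white"},
    {"gender": "non-binary", "race": "Asian"},
]


def collapse(record: dict) -> dict:
    """Recode a record into dominant vs. non-dominant groups (our assumption)."""
    return {
        "gender": "male" if record["gender"] == "man" else "non-male",
        "race": "white" if record["race"] == "white" else "non-white",
    }


collapsed = [collapse(e) for e in employees]

for field in ("gender", "race"):
    raw_counts = Counter(e[field] for e in employees)
    new_counts = Counter(e[field] for e in collapsed)
    print(f"{field}: {len(raw_counts)} categories "
          f"({comb(len(raw_counts), 2)} pairwise comparisons), "
          f"smallest group = {min(raw_counts.values())}")
    print(f"  collapsed: {len(new_counts)} categories "
          f"({comb(len(new_counts), 2)} comparison), "
          f"smallest group = {min(new_counts.values())}")
```

In this toy dataset, the two non-binary employees and the single Hispanic employee are no longer reported as groups of their own, and each demographic field needs only one comparison.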

Concluding Thoughts

We expect that these recommendations are only the beginning of a longer conversation about the use of demographic data in analytics and data science. If you have thoughts on more sophisticated ways to make use of demographic information, I invite you to reach out. A concerted effort to evaluate our approaches will help us better understand populations and subpopulations, and develop more robust research methods using this information in the coming years.

Resources

[1] Luthar, S. S., & Latendresse, S. J. (2005). Children of the affluent: Challenges to well-being. Current Directions in Psychological Science, 14(1), 49–53.

Mona Khalil is a Data Scientist @ Greenhouse and co-host @ Bad Methods Podcast. Passionate about ethics in data science. Twitter @mona_kay_