Bias Takes a Swing at Data Science

Brandon Cosley · Published in Thinking Fast · Apr 27, 2022

Top Tips for Punching Back


Ethics hits hard for me because my background is in social psychology. I spent 5 years working closely with social psychologists and bio-psychologists to understand how social disadvantage affects both the brain and the body.

My dissertation was on understanding how stereotyping is used to further disadvantage people.

So yeah, ethics is an important topic, and I am glad to see it rising to the surface in data science. Unfortunately, most data scientists are not also psychologists.

Thus, they lack a clear understanding of how data science can affect the socially disadvantaged, which in turn also affects the accuracy of our models.

Probably the most visible area in data science and ethics right now is model bias. That is, the degree to which our models demonstrate bias against specific socially disadvantaged groups.

For example, computer vision models have been shown to perform poorly at identifying people of color in images. Predictive models have been shown to learn the bias inherent in training data and lead to favorability towards majority social groups.

The problem isn’t new. Models have always been biased. That’s because the data they are trained on are often biased in one way or another.

Take healthcare, for example. Training a model on healthcare data means it will learn the biases providers bring to their diagnoses. It is well documented that providers show bias in diagnosis patterns: women, for instance, are far more likely than men to be diagnosed with an emotional disorder, even though countless studies show that men suffer from emotional disorders just as much as, if not more than, women.

The reason this old problem is such a big one for data scientists today is the scale and ubiquity of data science and AI. Perhaps at no other time in history have we seen so many companies leveraging data science in their products. Bias in our models now has the potential to affect millions of people, depending on the application.

Dire topic, I know, but important nonetheless. So I want to challenge you, as you traverse your data science learning journey, to consider how you might work to overcome bias when building models with data. Here are a few tips to consider:

1. Make sure the variables used in your models are not tuned to exacerbate bias. For example, including ethnicity as a model variable may have this effect, so it is often better to leave it out.

2. Monitor your models in production by looking at how positive predictions are distributed across different social groups (see the sketch after this list). Start with gender and expand if the data are available to do so.

3. Keep in mind that just because your model predicts a greater likelihood for one social group than another does not mean it is biased. We must also consider what distribution we should expect to find by looking at other sources, such as primary research studies.

4. Always be looking for more sources of data to blend with your training data.

5. Always stay aware and keep asking questions of your models, especially as they get closer to the people in our communities they are meant to serve.
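To make tips 2 and 3 concrete, here is a minimal sketch of a production bias check. It assumes you have a table of scored cases with a binary prediction column and a self-reported demographic column (the column names, data, and threshold here are hypothetical, not a prescription): it computes the positive-prediction rate per group and the gap between the highest and lowest group rates.

```python
import pandas as pd

# Hypothetical production log: one row per scored case, with the model's
# binary prediction and a voluntarily reported demographic attribute.
predictions = pd.DataFrame({
    "gender":    ["female", "male", "female", "male", "female", "male", "male", "female"],
    "predicted": [1,        1,      0,        1,      0,        1,      0,      1],
})

# Positive-prediction rate per group.
rates = predictions.groupby("gender")["predicted"].mean()
print(rates)

# Gap between the highest and lowest group rates. A large gap is a flag
# to investigate, not proof of bias; compare it against the base rates
# you expect from outside sources such as primary research studies.
gap = rates.max() - rates.min()
print(f"Positive-rate gap across groups: {gap:.2f}")
```

Running a check like this on a schedule, and comparing the gap to an externally grounded expectation rather than to zero, keeps the monitoring honest without assuming every difference is a defect.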

Enjoy learning about data science, career growth, life, or poor business decisions? Sign up for my newsletter here and get a link to my free ebook.
