Dirty Data or Biased Data?

Ethical AI Basics for Non Data Scientists

Kathy Baxter
Oct 15, 2018
Header image source: pixabay.com

All of us in tech are in a privileged position. We are creating and implementing technology that impacts not just our users or customers, but also society as a whole. I grew up in rural Georgia to a single mother of two who had to work two jobs to make ends meet. Each day, I represent the people that are not in the room with us when we create and implement the technology that impacts them. This is what drives me as the Architect of Ethical AI Practice at Salesforce.

This post is designed for people who are responsible for implementing AI systems but are not data scientists and do not have a background in statistics or probability. The intention is to create an approachable introduction to the most important concepts for identifying potential bias (error), but it is in no way intended to replace actual courses on data science or statistics. Perhaps you will find this as fascinating as I do and decide to learn more!


AI: A Dual Use Technology

AI is a dual-use technology, and with the good comes the bad. Make no mistake: AI is not neutral. It is a mirror that reflects back to us the bias that already exists in our society. Until we remove bias from society, we must take active measures to remove it from the data that feeds and trains AI every day. If you haven’t already, I recommend pausing for a few minutes and reading my previous posts, “How to build ethics into AI,” part 1 and part 2.

“Hope is the belief that our tomorrows can be better than our todays. Hope is not magic; hope is work.”

—DeRay McKesson, author of “On the Other Side of Freedom: The Case for Hope”


Working to Mitigate Bias

Biased or Fair?

Since there will always be some amount of error, AI administrators must decide how to handle that error. One way to frame that decision is, “How much risk are you willing to accept, given the amount of error in the prediction?”

That question usually leads to another: what does it mean to be “fair”? Fairness is usually defined as a decision made free of self-interest, prejudice, or favoritism. But fairness is in the eye of the beholder. What seems fair to me may seem extremely unfair to you.

Let’s use the example of a bank trying to decide whether an applicant should be approved for a car loan. An AI system can make a risk assessment and recommend whether the applicant should be approved or rejected. Remember that the AI can never be perfect, so there is always a chance it will make the wrong prediction.

An AI may incorrectly predict that an applicant cannot repay a loan when, in actuality, he or she can. It could also predict that an applicant can repay a loan when, in actuality, he or she cannot. As the AI administrator, you have to decide what your comfort level is for each type of error. The risk of harm here will be greater either for the bank or for the applicant. Most companies feel a responsibility to minimize the risk of financial loss. However, companies also have a responsibility to ensure the data and algorithms they use are making decisions based on an applicant’s merits and not on the systematic bias that runs through our society.
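To make the two kinds of error concrete, here is a minimal Python sketch that counts them from past loan decisions. The column names (“predicted_can_repay”, “actually_repaid”) are hypothetical placeholders, not from any real system.

    # A minimal sketch of the two error types in the loan example.
    # Column names are hypothetical placeholders.
    import pandas as pd

    loans = pd.DataFrame({
        "predicted_can_repay": [True, True, False, False, True],
        "actually_repaid":     [True, False, False, True, True],
    })

    # Harm falls on the bank: we approved someone who did not repay.
    false_approvals = (loans["predicted_can_repay"] & ~loans["actually_repaid"]).sum()

    # Harm falls on the applicant: we rejected someone who would have repaid.
    false_rejections = (~loans["predicted_can_repay"] & loans["actually_repaid"]).sum()

    print(f"False approvals (bank bears the loss): {false_approvals}")
    print(f"False rejections (applicant bears the harm): {false_rejections}")

Deciding which of those two counts you are more willing to tolerate is exactly the comfort-level question above.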


Minimizing Harm by Maximizing Data Quality

Error in AI often results from bad or dirty data. In other words, dirty data = biased data. But how do you know if your data are biased? According to a Harvard Business Review study in 2017, only 3% of companies are making decisions based on data that meets basic quality standards. Do you think your organization is part of the 97% or the 3%? A more recent article predicted that most companies attempting to implement AI will fail, and one of the primary reasons is a lack of clean training data.

I am going to share a few ways that you can look at your data and understand if there are problems that could result in error or bias.


The 5 W’s of Data Quality and Accuracy

Salesforce Einstein can help spot potential bias or data quality issues, but ultimately YOU know the context of your business and your data, so you must be the ultimate decision maker.

WHERE is your data being sourced from?

For example, in many customer support centers, the default priority for a new support case is “medium” until a service agent reviews the case and determines whether the priority needs to be changed. The final assigned priority is typically used in reporting to help companies see trends in the severity of their issues. If agents aren’t actually updating that field, an AI system trained to predict case priority will “learn” to mark every case as “medium.” The result is that reports will erroneously show a sudden drop in the severity of cases. Systemic bias and inaccuracies can enter your data set from the very beginning and need to be removed.
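A quick sanity check here is to measure how often the field still holds its default value; if nearly every case is still “medium,” the field is probably not being maintained. A rough sketch, assuming a hypothetical export with a “priority” column:

    import pandas as pd

    # Hypothetical export of closed support cases; "priority" defaults to "medium".
    cases = pd.read_csv("support_cases.csv")

    share_default = (cases["priority"] == "medium").mean()
    print(f"{share_default:.0%} of cases are still marked 'medium'")

    # If that share is suspiciously high, agents are likely not updating the field,
    # and a model trained on it will simply learn to predict "medium" every time.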

WHEN are your data coming in (i.e., freshness)?

Let’s imagine that a company manufactures and supports complex industrial machinery and uses AI to predict when a machine needs maintenance. If the data the AI is using are days or even hours old, it may not predict and warn about maintenance needs until it is too late and the machine has already failed.

Can you increase efficiencies in the system to improve the freshness? This might mean upgrading your systems or decreasing the number of integration points between systems.
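If your records carry a timestamp, measuring freshness can start with something as simple as comparing that timestamp to the current time. A sketch, assuming a hypothetical “last_reading_at” column with timestamps in your local time zone:

    import pandas as pd

    # Hypothetical readings from the machines being monitored.
    readings = pd.read_csv("machine_readings.csv", parse_dates=["last_reading_at"])

    # How stale is each machine's most recent reading?
    lag = pd.Timestamp.now() - readings["last_reading_at"]
    print(lag.describe())

    # Flag machines whose data is more than an hour old -- too stale for a
    # maintenance warning to arrive before the failure does.
    stale = readings[lag > pd.Timedelta(hours=1)]
    print(f"{len(stale)} machines have data older than one hour")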

WHO are you getting data from or about?

Are you getting the same information for all of your users? If not, you will see null values or zeroes for some fields. This is a signal that your data are not representative of your users or customers.
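One way to spot those gaps is to measure the share of missing (or zero) values in every field. A sketch against a hypothetical customer export:

    import pandas as pd

    customers = pd.read_csv("customers.csv")  # hypothetical file and columns

    # Share of null values per field, worst offenders first.
    print(customers.isna().mean().sort_values(ascending=False))

    # Zeroes can hide the same problem in numeric fields.
    numeric = customers.select_dtypes("number")
    print((numeric == 0).mean().sort_values(ascending=False))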

Below is an example from Einstein Analytics Data Manager with data from a fictional pharmaceutical company.

Einstein Analytics Data Manager

We can see we have WAY more males represented than females. If you are marketing an asthma drug that could equally apply to men and women, this kind of skew should give you real concern. You either need to collect more data to even out the distribution or decide how to handle it statistically. For example, you might randomly sample a subset of data from the larger group, or you might weight the data to give the smaller group more influence.
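Both of those options are only a few lines of pandas. The sketch below assumes a hypothetical data set with a “gender” column; in practice you would make this choice together with your data science team.

    import pandas as pd

    patients = pd.read_csv("patients.csv")  # hypothetical data set

    counts = patients["gender"].value_counts()
    print(counts)  # e.g. far more "Male" rows than "Female"

    # Option 1: randomly downsample every group to the size of the smallest one.
    n_smallest = counts.min()
    balanced = pd.concat(
        group.sample(n=n_smallest, random_state=42)
        for _, group in patients.groupby("gender")
    )

    # Option 2: keep every row but weight the smaller group more heavily,
    # so each gender contributes equally to the model overall.
    weights = patients["gender"].map(len(patients) / (len(counts) * counts))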

You can also see that 35% of your data set contains null values. That’s quite a lot. In the screenshot below, we see Einstein offering to help you out by predicting your missing values.

Einstein Analytics Data Manager
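If you want an intuition for what filling in missing values involves, here is a deliberately simple stand-in: fill numbers with the median and categories with the most common value. This illustrates the general idea only; it is not how Einstein computes its predictions, and the column names are hypothetical.

    import pandas as pd

    patients = pd.read_csv("patients.csv")  # same hypothetical data set

    # Remember which rows were imputed so you can check later whether they
    # behave differently from rows with real values.
    patients["age_was_missing"] = patients["age"].isna()

    # Simplest possible imputations: median for numbers, mode for categories.
    patients["age"] = patients["age"].fillna(patients["age"].median())
    patients["gender"] = patients["gender"].fillna(patients["gender"].mode()[0])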

The final thing to look at when understanding the representativeness of your data is whether or not you have outliers. An outlier is a data point that lies an abnormal distance from the other values in the data set.

In the screenshot below, we can see Einstein has identified outliers in five categories and recommends removing them. An important question to ask is, “When should I remove outliers?” You’ll want to look at your data to better understand why those outliers exist and what they mean. A typical reason is that they are the result of incorrectly entered or measured data. Not only will you want to remove those outliers, you will want to understand what caused them to be entered or measured incorrectly so you can fix it.

Einstein Discovery
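If you want to look for outliers yourself, one common recipe is the interquartile-range rule: flag anything far outside the middle 50% of the values. This is a generic technique, not necessarily what Einstein uses, and the column names below are made up.

    import pandas as pd

    sales = pd.read_csv("sales.csv")  # hypothetical data set
    values = sales["order_amount"]

    # Interquartile-range rule: flag values far outside the middle 50%.
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outliers = sales[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

    print(f"{len(outliers)} potential outliers out of {len(sales)} rows")
    # Inspect them before deleting anything -- the interesting question is
    # *why* they were entered or measured that way in the first place.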

WHAT categories are being collected and used?

Special data categories
You may also hear this referred to as “protected classes of data” or “prohibited categories.” These are categories of data that you may not consider in certain kinds of decision making, such as hiring or housing eligibility. These include:

  • Race, ethnicity, or country of origin
  • Gender
  • Age
  • Sexual orientation
  • Religious or political affiliation

If any of these categories of data are used to train a model, you may be making a biased decision. Or you may not.

Let’s look back at our asthma example, with far more males than females. If the same pharmaceutical company manufacturing the asthma drug were instead manufacturing a prostate cancer drug, something applicable only to males, then using gender is not biased. Remember: bias means error. In this case, you want to market the drug to men, so using gender in your predictions makes sense.

Einstein Analytics Data Manager

Again, if it’s a drug for prostate cancer, having women in your data set is something you might want to dig into further. Is this a labeling error? Or are these transgender individuals who identify as female but are biologically male?
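A simple guardrail is to keep your own list of special categories and check every training column against it before a model is built. Whether a flagged column is acceptable (as gender would be for a prostate cancer drug) is still a judgment call for you to make; the sketch below, with a hypothetical file and column names, only surfaces the question.

    import pandas as pd

    training_data = pd.read_csv("training_data.csv")  # hypothetical file

    # The categories your organization treats as special or protected.
    SPECIAL_CATEGORIES = {
        "race", "ethnicity", "country_of_origin", "gender", "age",
        "sexual_orientation", "religion", "political_affiliation",
    }

    flagged = [col for col in training_data.columns if col.lower() in SPECIAL_CATEGORIES]
    if flagged:
        print("Review before training -- special categories present:", flagged)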

Proxy variables
A “proxy variable” is a variable that is not directly relevant by itself but serves in place of another variable. Even if you are not using a special data category in your predictions, you might be using other variables that are a proxy for it. For example, because of the historical practice of redlining in the US, zip code + income = race. That means that even if you are not making predictions using an explicit “race” variable, if you are using zip code and income, the predictions will be influenced by race. Below you can see Einstein Analytics calling out this potential proxy in the data.

Einstein Analytics Studio
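Detecting proxies rigorously is hard, but a rough first pass is to ask how well the supposedly neutral variables can “guess” the protected one. The sketch below is a low-tech stand-in for the kind of check Einstein Analytics is doing in the screenshot; the column names are hypothetical.

    import pandas as pd

    applicants = pd.read_csv("applicants.csv")  # hypothetical columns below

    # Guess each person's race from the most common race in their
    # zip code + income bracket, then see how often that guess is right.
    guess = applicants.groupby(["zip_code", "income_bracket"])["race"].transform(
        lambda group: group.mode()[0]
    )
    proxy_strength = (guess == applicants["race"]).mean()
    print(f"zip code + income 'predict' race {proxy_strength:.0%} of the time")

    # The closer that number is to 100%, the more those two fields act as a
    # proxy for race, even if "race" itself never enters the model.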

Data Leakage
Also known as “hindsight bias,” data leakage results from using variables in your prediction model that are known only after the event of interest. The definition can be confusing, so let’s look at an example. Imagine that you are selling outdoor gear. You want to know if it is going to rain so you can increase the marketing of your rainy-weather gear.

One observation you might have captured historically is whether or not someone has used an umbrella. In the graph below, we see a strong correlation between rainy weather and umbrella use (this is referred to as “collinearity”). Since umbrella use comes AFTER the thing you want to predict (rain), you should not use it in your predictive model. Data leakage variables can be so strong that they overwhelm the entire model, so you will want to remove them.
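A cheap screening step for leakage is to look for features that correlate almost perfectly with the outcome, then ask of each one: was this actually known before the moment I want to predict, or only after? A sketch, assuming hypothetical 0/1-encoded columns such as “it_rained” and “used_umbrella”:

    import pandas as pd

    history = pd.read_csv("weather_history.csv")  # hypothetical, 0/1-encoded columns

    # Correlation of every feature with the outcome we want to predict.
    correlations = (
        history.corr(numeric_only=True)["it_rained"]
        .drop("it_rained")
        .abs()
        .sort_values(ascending=False)
    )
    print(correlations.head())

    # Anything suspiciously close to 1.0 (like "used_umbrella") deserves the
    # question: was it known before the rain, or only after it?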

Below is what you might see in Einstein Discovery if data leakage is detected. It is highlighting the strongest predictors and warning you that these could be the result of data leakage. Since you know the context of your data, you need to make this determination yourself and decide whether or not to include them.

Einstein Discovery

“AI models hold a mirror up to us; they don’t understand when we really don’t want honesty. They will only tell us polite fictions if we tell them how to lie to us ahead of time.”

—Yonatan Zunger, Distinguished Engineer on Privacy at Google

AI is powerful, and with great power comes great responsibility. Until our society is free of bias, we must meticulously manage our data to remove it.


Thank you Richa Prajapati, Greg Bennett, Joe Veltkamp, Steve Fadden, Raymon Sutedjo-The, and Steven Tiell for all of your feedback!

Follow us at @SalesforceUX.

Want to work with us? Contact uxcareers@salesforce.com
