Paradoxically, data is both a company’s most important digital asset and its most problematic.
Consider the booming business of DNA testing. For consumers interested in their ancestry and health, such tests promise to uncover vital information. Yet data bias has raised major questions about the accuracy and credibility of these tests. Some customers have received very different findings from different companies. Other customers, particularly those from ethnic and racial minorities, have found the results lack the kinds of details they are looking for.
These issues are largely due to the fact that many DNA test companies rely heavily on artificial intelligence (AI) algorithms to speed up the research process and reduce reliance on human scientists. Although the algorithms are trained on historic data samples, these samples may not be sufficiently representative of a customer’s specific genetic background.
The consumer genetics sector is not an isolated case. Across industries, businesses relying on AI must deal with data bias and the associated ethical, legal, and financial risks.
This requires a solid understanding of the root causes of data bias. By knowing where bias originates, companies will be better able to detect and mitigate it and develop AI systems responsibly.
Bias and the Data Lifecycle
Data bias occurs when a data set used for building an AI system reflects human prejudice or discrimination toward a subgroup of the population. It’s often assumed that the bias is against a protected group of people, but in fact, bias pertaining to any group is problematic.
Bias can originate at each step of the dataset lifecycle: creation, design, sampling, collection, and processing. The point in the dataset lifecycle where bias emerges is closely tied to the root cause of the bias. (See Exhibit 1.)
Creation. Bias already exists in data because of historical or societal dynamics.
For example, creation bias can be found in recruiting practices. In 2015, women held 47% of all U.S. jobs but only 24% of science, technology, engineering, and mathematics (STEM) positions. If a company builds an AI-driven recruiting system for STEM positions using historical recruiting data without taking appropriate mitigating steps, the system is likely to have a bias against female candidates due to historical bias and a lack of diverse representation in the training data.
Design. The way an AI system or a machine-learning model is designed can introduce data bias. If the design of the AI system (including product design, experiment design, survey design, etc.) is inherently biased, the data collected will be biased as well.
Say, for example, a digital entertainment company develops an AI system for making customized movie recommendations. The algorithm asks various questions to establish different customers’ likes and dislikes. An unbiased question might be: “Do you believe children should be allowed to watch horror movies with their parents?” whereas a biased question would ask, “As a concerned parent, do you believe children should be allowed to watch horror movies with their parents?”
Sampling. This type of bias occurs when the population represented in the training data set does not represent the population that is the subject of the AI system.
In 1936, a Literary Digest poll predicted Alfred Landon would get 57% of the vote against the incumbent Franklin D. Roosevelt’s 43%. But Roosevelt ended up defeating Landon by a large margin, winning 62% of the vote. Although the Literary Digest poll’s sample size of 2.4 million was very large, the exercise was undermined by sampling bias: the sample was drawn from telephone directories, club membership rolls, and magazine subscription lists. In the middle of the Great Depression, that meant mostly middle- and upper-class voters, who were not representative of the voting population.
The same holds true of AI systems. If the input data is not representative, the outcome of the AI system cannot be accurate, regardless of the data set’s size.
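The Literary Digest effect can be reproduced in a few lines. The sketch below uses hypothetical, illustrative numbers (a population in which an affluent subgroup favors a candidate at a much lower rate than everyone else); the point is that a sample drawn only from one subgroup stays wrong no matter how large it is, while a far smaller random sample lands near the truth.

```python
import random

random.seed(0)

# Hypothetical population of 100,000 voters: 60% support candidate A overall,
# but among the affluent subgroup (30% of voters) only 40% do.
population = (
    [("affluent", 1)] * 12_000   # affluent supporters
    + [("affluent", 0)] * 18_000  # affluent non-supporters
    + [("other", 1)] * 48_000
    + [("other", 0)] * 22_000
)

true_support = sum(v for _, v in population) / len(population)

# Biased sample: every affluent voter (30,000 people) -- large but unrepresentative.
biased_sample = [v for g, v in population if g == "affluent"]
biased_estimate = sum(biased_sample) / len(biased_sample)

# A random sample 30x smaller tracks the true rate far better.
random_sample = random.sample(population, 1_000)
random_estimate = sum(v for _, v in random_sample) / len(random_sample)

print(true_support)     # 0.6
print(biased_estimate)  # 0.4
print(random_estimate)  # close to 0.6
```

The larger biased sample is off by 20 percentage points; sample size cannot compensate for a skewed sampling frame.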
Collection. During the collection process, bias can emerge from three sources.
First, the people who collect or label the dataset may have a personal bias. This is a common occurrence in the development of image recognition models, such as convolutional neural networks, because they require a huge labeled (tagged) training set. Many companies use online data-labeling services to minimize the cost of generating labeled data sets for machine learning. If the workers doing the labeling don’t receive unconscious-bias or cultural training, their biases can infiltrate the data-labeling process. For example, workers who have had limited exposure to the LGBT community or inadequate training might mislabel a photo of a same-sex couple.
Second, outliers (data points that differ significantly from others) or incorrect data are collected because of machine error. For example, a broken sensor on a machine may return abnormal values. Without scrutiny and mitigation, these outliers could compromise the training of the model and even the performance of the entire AI system.
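A standard way to scrutinize such machine-error values is the interquartile-range (IQR) rule. The sketch below uses hypothetical sensor readings in which a broken sensor has returned two implausible values; the 1.5 × IQR threshold is a common convention, not a universal rule, and the right cutoff depends on the context.

```python
import statistics

# Hypothetical temperature readings; a broken sensor returned 999.0 and -50.0.
readings = [21.3, 21.8, 22.1, 21.5, 22.0, 21.7, 999.0, 21.9, -50.0, 21.6]

# IQR rule: flag anything more than 1.5 * IQR outside the middle 50% of values.
q = statistics.quantiles(readings, n=4)
q1, q3 = q[0], q[2]
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = [x for x in readings if low <= x <= high]
outliers = [x for x in readings if x < low or x > high]

print(outliers)  # [999.0, -50.0]
```

Flagged points should be reviewed rather than silently dropped: an apparent outlier may be a machine error, but it may also be a rare, legitimate observation.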
Third, users are often reluctant to rate products. Netflix found, in fact, that when it replaced its five-star rating system with a thumbs-up/thumbs-down system, the number of ratings rose 200%. In addition, very few people write reviews, and people are much more likely to write reviews if they feel very strongly about the product in question. As a result, online purchase reviews are usually polarized, and customers in the middle are less represented.
Processing. Data bias can also be introduced when the data is processed in preparation for model training.
Various data engineering techniques can help improve model performance. These include filling in missing values, normalization (adjusting values on different scales to the same scale), and tokenization (separating text content into smaller pieces of tokens). But bias can be introduced to the data if such techniques are deployed without a full understanding of the specific context.
Take, for example, a demographic data set where 10% of the entities are missing data for the column representing height measurements. Filling in the holes with a median value of all the data in that column will introduce bias to the data set because differences in men’s and women’s heights have not been accounted for.
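The height example above can be made concrete. The sketch below uses a hypothetical eight-record data set; it contrasts the naive overall-median fill with a context-aware fill that uses each subgroup’s own median.

```python
import statistics

# Hypothetical records: height in cm, None where the value is missing.
records = [
    {"sex": "F", "height": 162.0}, {"sex": "F", "height": 158.0},
    {"sex": "F", "height": None},  {"sex": "F", "height": 165.0},
    {"sex": "M", "height": 178.0}, {"sex": "M", "height": 182.0},
    {"sex": "M", "height": None},  {"sex": "M", "height": 175.0},
]

# Naive approach: one overall median (170.0 here) overstates the missing
# woman's height and understates the missing man's.
overall = statistics.median(
    r["height"] for r in records if r["height"] is not None
)

# Context-aware approach: impute with the median of each subgroup.
def group_median(sex):
    return statistics.median(
        r["height"] for r in records
        if r["sex"] == sex and r["height"] is not None
    )

imputed = [
    {**r, "height": r["height"] if r["height"] is not None
     else group_median(r["sex"])}
    for r in records
]
```

With this data, the overall median fills both gaps with 170.0 cm, while the group-wise fill assigns 162.0 cm and 178.0 cm, respectively, preserving the difference the naive approach erases.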
Since data is the raw material of the system, every bias, regardless of when it first appears, will affect the rest of the lifecycle. The earlier data bias can be identified, the more effectively the development team, end-users, and the AI system’s owner can mitigate the issue — and the more time and cost savings there will be.
How to Manage Data Bias
While we can’t completely eradicate data bias, it’s possible to reduce it significantly. Four practices are especially important:
- Provide training. Give annual or semi-annual unconscious bias training to employees, including senior stakeholders, designers, and developers. A focus on hiring a diverse workforce is also important.
- Use “responsible” vendors. Make sure that any third-party provider of data, data labeling, or other AI-related offerings supports Responsible AI principles and gives regular and relevant unconscious-bias training to its employees.
- Proactively search for data bias and raise concerns. Be aware of protected groups and have open discussions about the potential risks of data bias during the product and experimental design phases. Adopt a rigorous exploratory data analysis (EDA) process. If the EDA uncovers any potential bias, data scientists should not hesitate to express their concerns.
- Mitigate data bias. After identifying any bias, companies should immediately take steps to mitigate its impact on the AI system.
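One simple check that an exploratory data analysis (EDA) process can include is comparing each subgroup’s share of the training data against a reference distribution for the population the system will serve. The sketch below is a minimal illustration with hypothetical records and an assumed 10% tolerance; real thresholds and reference figures must come from the specific domain.

```python
from collections import Counter

# Hypothetical sex labels from training records for a recruiting model.
records = ["F", "M", "M", "M", "F", "M", "M", "M", "M", "F"]

# Assumed reference shares for the population the system will serve.
reference = {"F": 0.47, "M": 0.53}

counts = Counter(records)
total = len(records)

# Flag any subgroup whose share of the data deviates from the
# reference by more than a chosen tolerance.
tolerance = 0.10
flags = {
    group: counts[group] / total
    for group in reference
    if abs(counts[group] / total - reference[group]) > tolerance
}
print(flags)  # {'F': 0.3, 'M': 0.7}
```

Here women make up 30% of the records against a 47% reference share, so the group is flagged for review before model training proceeds.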
AI practitioners should not underestimate the importance of identifying and mitigating data bias in the early stages of a project. Proactively addressing the upstream root causes of data bias is far easier than dealing with its consequences.
In our next article, we’ll provide specific recommendations on how to identify, mitigate, and avoid data bias.