Eliminating data bias from machine learning systems
Algorithms must follow human logic and values, while trying to avoid human bias, writes Mike Mullane
E.B. White said that bias was impossible to avoid. He is perhaps best known nowadays as the author of children’s books, including ‘Stuart Little’ and ‘Charlotte’s Web’, but he was also a regular contributor to ‘The New Yorker’ magazine and the co-author of ‘The Elements of Style’, one of America’s most influential writing guides, known to generations of high school and college students. White claimed there was no such thing as objectivity: “I have yet to see a piece of writing, political or non-political, that does not have a slant,” he said. “All writing slants the way a writer leans, and no man is born perpendicular.”
Whether or not White was right about writers, human bias is certainly a fact of life in machine learning. In data science, bias usually refers to a deviation from expectation, or an error in the data, but there is more to it than that. We are all conditioned by our environments and experiences ("no man is born perpendicular") and carry with us different kinds of social, political or values-based baggage. Sometimes our horizons are not as broad as we would like to think, and as a result the vast volumes of data used to train algorithms are not always sufficiently varied or diverse. More often than not, real human bias creeps into the data, and algorithms simply look for patterns in whatever we feed them: garbage in, garbage out.
The good news is that bias can be detected and mitigated fairly easily. The bad news is that it can be difficult to get to the bottom of how algorithms make their decisions, because they so often operate inside a “black box”.
Bias is one of the most important challenges we face as algorithms move to the centre of our daily lives, from search engines and online shopping to facial recognition systems and flight bookings. In our daily interactions with machine learning systems, we commonly encounter four kinds of bias.
Prejudicial bias
Algorithms are only as good as their developers. As ‘New Scientist’ reports, machine learning is prone to amplify sexist and racist bias from the real world. We see this, for example, in image recognition software that fails to identify non-white faces correctly. Similarly, biased data samples can teach machines that women shop and cook, while men work in offices and factories. This kind of problem usually occurs when the scientists who prepare the training data unwittingly introduce their own prejudices into their work.
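One basic defence is to audit a model’s performance broken down by group rather than in aggregate. The sketch below is a minimal, hypothetical illustration (the data and function name are invented, not taken from any cited system): it computes the misclassification rate per group, the kind of disparity reported for image recognition software.

```python
def error_rate_by_group(y_true, y_pred, groups):
    """Misclassification rate per group -- a basic audit for the
    kind of disparity seen in image recognition systems."""
    rates = {}
    for g in set(groups):
        pairs = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        rates[g] = sum(t != p for t, p in pairs) / len(pairs)
    return rates

# Hypothetical labels: the model errs far more often on group "b".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(error_rate_by_group(y_true, y_pred, groups))
```

A large gap between the per-group rates is a signal to re-examine the training data, even when the overall error rate looks acceptable.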
Sampling bias
Biases can also occur when a sample is collected in such a way that some members of the intended statistical population are less likely to be included than others. In other words, the data used to train a model does not accurately reflect the environment in which it will operate.
A sampling bias could be introduced, for instance, if an algorithm used for medical diagnosis is trained only on data derived from one population. Similarly, if an algorithm meant to operate self-driving vehicles all year round is trained only on data collected during the summer months, falling snowflakes might confuse the system.
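A check of this kind can be sketched in a few lines: compare the share of each group or condition in the training sample against the share expected in the deployment environment. The figures below are invented to mirror the self-driving example, and the function name is hypothetical.

```python
from collections import Counter

def representation_gap(train_labels, population_shares):
    """Compare group shares in a training sample against the
    shares expected in the deployment population."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total
        gaps[group] = observed - expected
    return gaps

# Hypothetical case: winter driving is barely present in training,
# although the vehicle must operate all year round.
train = ["summer"] * 900 + ["winter"] * 100
expected = {"summer": 0.5, "winter": 0.5}
print(representation_gap(train, expected))
```

A strongly negative gap for a group flags under-representation before the model ever reaches the road.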
Systematic value distortion
Systematic value distortion occurs when the true value of a measurement is systematically overstated or understated. This kind of error usually occurs when there is a problem with the device or process used to make the measurements.
On a relatively simple level, measurement errors might occur if training data is captured on a camera that filters out some colours. Often the problem is more complex.
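In the simple case, the distortion is a roughly constant offset that can be estimated against a known reference and subtracted out. The sketch below is a hypothetical illustration (the sensor readings are invented), not a description of any particular device.

```python
def calibrate(readings, reference_value):
    """Estimate a systematic offset from repeated readings of a
    known reference, and return a function that removes it."""
    offset = sum(readings) / len(readings) - reference_value
    return lambda x: x - offset

# Hypothetical thermometer that reads about two degrees high.
correct = calibrate([22.1, 21.9, 22.0], reference_value=20.0)
print(correct(25.0))  # roughly 23.0 once the offset is removed
```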
In health care, for instance, it is difficult to implement a uniform process for measuring patient data from electronic records, and even superficially similar records may be hard to compare. A diagnosis usually requires interpreting test results and making several judgements at different stages in the progression of a disease, and the timing of the first judgement depends on when the patient felt unwell enough to see a doctor. An algorithm must be able to take all of these variables into account in order to make an accurate prognosis.
Algorithmic bias
Algorithmic bias is what happens when a machine learning system reflects the values of the people who developed or trained it. Confirmation bias, for example, may be built into an algorithm if the aim, intentional or not, is to prove an assumption or opinion, whether in a business, journalistic or political setting.
There have been several high-profile cases of algorithmic bias related to social media and search engines, and even in corporate recruitment.
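One coarse check for this kind of bias in, say, a recruitment screening model is to compare positive-outcome rates across groups, a criterion often called demographic parity. The sketch below is hypothetical; the predictions and group labels are invented for illustration.

```python
def demographic_parity_gap(predictions, groups):
    """Difference in positive-outcome rates between groups --
    one simple, coarse check for algorithmic bias."""
    rates = {}
    for g in set(groups):
        selected = [p for p, gg in zip(predictions, groups) if gg == g]
        rates[g] = sum(selected) / len(selected)
    return max(rates.values()) - min(rates.values())

# Hypothetical screening output: 1 = candidate shortlisted.
preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(preds, groups))  # 0.5
```

A gap near zero does not prove a model is fair, but a large gap is a clear prompt to investigate how the model was trained.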
Wael Diab, a senior executive at Huawei who is leading international efforts to standardize artificial intelligence, has identified the mitigation of data bias as a priority for future standardization work. Diab recently told the IEC General Meeting in Busan, South Korea, that a broad standardization approach was necessary.
A little over six months ago, IEC and ISO set up the joint committee that Diab chairs. It has already formed a working group to look into a wide range of issues related to trustworthiness, including robustness, resiliency, reliability, accuracy, safety, security and privacy, within the context of AI.
Leading industry experts believe that ensuring trustworthiness from the outset is one of the essential aspects that will lead to the widespread adoption of AI. Connected products and services, whether in a vehicle, smartphone, medical device or building security system, must be safe and secure or no one will want to use them. The same goes for critical infrastructure like power plants or manufacturing sites.
“One of the unique things about what IEC and ISO are doing is that we are looking at the entire ecosystem and not just one technical aspect,” explained Diab. “Combined with the breadth of application areas covered in IEC and ISO technical committees (TCs), this will provide a comprehensive approach to AI standardization with IT and domain experts.
“The resulting standardization efforts will not only be fundamental to practitioners, but essential to all stakeholders interested in the deployment of AI,” he said.
At the meeting in Busan, the IEC officially launched a new White Paper on artificial intelligence. The aim of the authors is to help bring clarity to the current status of AI, as well as the outlook for its development in the next five to ten years. The paper describes the main systems, techniques and algorithms that are in use today and indicates what kinds of problems they typically help to solve. It provides a detailed overview of four areas that are likely to develop significantly as a result of deploying AI technologies: homes, manufacturing, transport and energy.
On the issue of data bias, the White Paper notes that even removing bias-prone attributes (such as race, gender, sexual orientation or religion) from training data may not be enough, as other variables may serve as proxies for them in the model. The authors call for further interdisciplinary work to develop more refined approaches to controlling bias.
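The proxy problem can be illustrated with a plain Pearson correlation between a removed protected attribute and a feature that remains in the training data. The data here is invented for illustration: a postcode that encodes group membership almost perfectly still carries the bias even after the attribute itself is dropped.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

group    = [0, 0, 0, 0, 1, 1, 1, 1]   # protected attribute, removed from training
postcode = [0, 0, 0, 1, 1, 1, 1, 1]   # proxy feature still present in the data
print(correlation(group, postcode))   # strongly positive: the proxy remains
```

A model trained on the postcode alone can still reconstruct most of the information the removed attribute contained, which is why the White Paper's authors argue that dropping sensitive columns is not a sufficient control.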
In addition to the joint committee with ISO on AI, IEC is a founding member of the Open Community for Ethics in Autonomous and Intelligent Systems (OCEANIS), which brings together standardization organizations from around the world to raise awareness about the role of standards in facilitating innovation and addressing issues related to ethics and values.
It is vital that machines continue to follow human logic and values, while avoiding human bias, as they play a growing part in decision-making processes. International standards offer an answer to many of these concerns. Creating consensus-based standards means opening the ‘black box’ to provide the transparency needed to ensure the quality of the data used.
The standardization process will also require understanding and taking steps to mitigate the impact of potential biases resulting from algorithms. Above all, standardization will increase knowledge about the way algorithms are built and operate, making it easier for the victims of bias to challenge data-supported decisions.
The views and opinions expressed in this article are those of the author and do not necessarily reflect the official policy or position of the IEC.