Data Ethics in Artificial Intelligence & Machine Learning

Saurabh Mishra
Published in Analytics Vidhya · Aug 27, 2020

Ethics is an important aspect of life, and unethical behavior in any domain is harmful. The same principle holds in the technical world. With the evolution of big data and high-performance computing, artificial intelligence (AI) has been making progress in leaps and bounds. We know that to build an efficient AI system we need well-curated data and an algorithm that performs well on unseen data. So DATA is the main fuel for an AI or ML (Machine Learning) algorithm, and the data collected for these purposes can carry biases and unethical elements that deviate an algorithm from its intended behavior, producing an unethical system that can be dangerous for society.

Targeted advertisement, societal bias, and fake news are some relevant examples, but there have been several other instances in the past where an ML algorithm was misused (sometimes unintentionally, because not all possible behaviors of the model were tested) and behaved in undesired ways. A few such cases are below:

  1. UK’s grading algorithm — Recently, the UK Department for Education discarded grades generated by an algorithm designed to predict performance in the annual A (Advanced) Level qualification. The initiative was taken due to the COVID pandemic, and the algorithm downgraded more than a third of A Level results in the UK. The model relied primarily on two features, the student’s past performance and the school’s historical performance, to predict students’ grades. Its predictions favored private schools, while secondary selective and sixth-form schools, where teacher assessments were typically good, were severely impacted.
  2. Unethical facial recognition — A Washington Post article reported how the US Immigration and Customs Enforcement unethically collected a large volume of data to analyze the day-to-day activities of immigrant communities. This is an example of AI being used to abuse the civil rights of targeted communities.
  3. Amazon’s AI recruiting tool — The hiring tool developed by Amazon turned out to be biased against female job applicants.
  4. Unemployment benefit fraud — MiDAS (Michigan Integrated Data Automation System), an unemployment system launched to replace an old COBOL-based legacy system, flagged many people for fraudulently claiming unemployment benefits. It wrongly accused at least 20,000 claimants of fraud, a shockingly high error rate of 93 percent. The problem was the alleged “robo-adjudication” process, which lacked human oversight: the application looked for discrepancies in claimants’ files, and whenever it found one, the individual automatically received a financial penalty and was flagged for fraud. Have a look at the Metro Times coverage for more details.
  5. Microsoft’s Tay — A Twitter bot launched with the idea that “the more you chat with Tay, the smarter it gets” got corrupted within 24 hours of its launch by the misogynistic and racist messages fed to it on Twitter.
  6. Google’s hate speech detector — Google’s AI tool developed to catch hate speech was found to disproportionately flag posts written by black people (a bias effect).

So these malfunctioning AI/ML tools, which were certainly developed by top developers and envisioned by capable business leaders, suddenly became threats to society. The real question then is: how can one create an ethical way of working and a sense of responsibility among all groups of collaborators (data collectors, developers, decision makers, sales, marketing, executives, etc.)? Several papers have been published in this direction, and while there are no golden rules to be followed religiously, a few important aspects of the problem can be summarized. I would like to highlight them as the purpose of this article.

1) The 5 Cs

a) Consent — An explicit agreement between the data provider and the data service about what data is collected.

b) Clarity — Clarity is directly related to consent: data providers must be told clearly what they are providing and how it will be used.

c) Consistency & Trust — An unpredictable actor cannot be trusted, so trust requires consistency. This should be part of data ethics, as we have seen many security incidents where that trust was broken explicitly or implicitly; the breaches at Yahoo, Target, Anthem, local hospitals, and government agencies are a few examples.

d) Control & Transparency — Once consent is given, it becomes important to understand how the data is being used and whether users retain any control over it. These questions matter because we know how big companies often use public data for targeted advertising and for shaping political and religious sentiment. Europe’s General Data Protection Regulation (GDPR) addresses this up to a certain extent: it gives users the right to have data they submitted earlier removed from the system.

e) Consequences — Risk can never be eliminated completely. Products using AI and ML get built, and sometimes, due to potential issues around the use of the data, unforeseen consequences arrive. Many regulations and guidelines have been formed to tackle these kinds of problems, e.g., the Children’s Online Privacy Protection Act (COPPA), which protects children and their data, and the Genetic Information Nondiscrimination Act (GINA), passed in response to rising fears that genetic testing could be used against a person or their family.

Implementing the 5 Cs is not an individual responsibility; it requires the entire team to embrace the idea of shared responsibility.

2) Biased Data or Biased Algorithm — Among data practitioners, the root cause of bias in real-world AI is often debated: is it the data or the algorithm? There are different views, but in most cases it is humans who develop these mathematical models and feed them datasets that humans created or collected. Ultimately, bias traces back to us, and we need to exercise responsibility and best practices while collecting the data and designing the model.

3) Context — Contextual awareness plays a significant role for anyone working in AI and ML. Understanding the data well enough to answer standard questions such as “what am I trying to achieve, and why?” helps in designing an algorithm whose decisions make sense in the context of the data.

4) Model Fairness & Explainability — Can a model’s results be trusted? Can we explain why particular features and their weights matter for the predicted values? These questions are relevant to deciding whether a developed model is fair to use and justifiable for the purpose it was developed for.
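One concrete fairness check is to compare error rates across user groups, since a model that is accurate overall can still fail badly for one group. Below is a minimal sketch in plain Python; the group labels and records are illustrative, not from any real system:

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Compute the error rate per group from (group, y_true, y_pred) triples."""
    errors = defaultdict(int)
    totals = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true != y_pred:
            errors[group] += 1
    return {g: errors[g] / totals[g] for g in totals}

# Illustrative data: a model that errs far more often on group "B"
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 1),
]
rates = error_rates_by_group(records)
print(rates)  # {'A': 0.25, 'B': 0.75}
```

A large gap between groups, as in this toy output, is exactly the kind of disparate error rate the checklist later in this article asks us to test for.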

5) Model Drift — Analytical models need to be revised over time; otherwise there is a high chance of instability and erroneous predictions. In ML/AI, this behavior is called model drift, and it falls into two broad categories.

i) Concept drift — The statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.

ii) Data drift — The statistical properties of the predictors (independent variables) change. These changes can cause the model to fail. The classic example of data drift is seasonality: the Black Friday period always records higher sales than the rest of the year.
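A simple way to monitor for data drift is to compare the statistics of incoming data against the statistics seen at training time. The sketch below, using only the Python standard library, flags drift when the live mean moves too many training standard deviations away from the training mean; the threshold and sample values are illustrative assumptions:

```python
import statistics

def detect_data_drift(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold`
    training standard deviations away from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    live_mu = statistics.mean(live_values)
    shift = abs(live_mu - mu) / sigma
    return shift > threshold, shift

# Training data centred near 10; a live window centred near 30
# (think of a Black Friday sales spike) triggers the drift alarm.
train = [9, 10, 11, 10, 9, 11, 10, 10]
live = [29, 31, 30, 28, 32]
drifted, shift = detect_data_drift(train, live)
print(drifted)  # True
```

In practice one would run such a check per feature on a rolling window, and a distribution-level test (e.g., a two-sample Kolmogorov–Smirnov test) catches shifts that a mean comparison misses, but the idea is the same: compare live data against a training-time baseline.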

6) Ethics and security training — Theory learned as part of an educational curriculum often lacks practical implementation, which is why ethics and security training is important for professionals: it enables them to apply these principles in their field.

So, if we collect these points into a checklist and follow it while making decisions, we can avoid common mistakes and become more responsible and sensitive towards our work. Mike Loukides, DJ Patil, and Hilary Mason compiled the checklist below in their book Ethics and Data Science, and it is worth including in our data product checklist.

Checklist —

Have we listed how this technology can be attacked or abused?

Have we tested our training data to ensure it is fair and representative?

Have we studied and understood possible sources of bias in our data?

Does our team reflect the diversity of opinions, backgrounds, and kinds of thought?

What kind of user consent do we need to collect to use the data?

Do we have a mechanism for gathering consent from users?

Have we explained clearly what users are consenting to?

Do we have a mechanism for redress if people are harmed by the results?

Can we shut down this software in production if it is behaving badly?

Have we tested for fairness with respect to different user groups?

Have we tested for disparate error rates among different user groups?

Do we test and monitor for model drift to ensure our software remains fair over time?

Do we have a plan to protect and secure user data?

In short, data ethics principles can help us leverage the full benefit of AI for the good of society without fear, and can create a sense of responsibility among all participants who develop data products to solve critical problems.

References —

https://hub.packtpub.com/machine-learning-ethics-what-you-need-to-know-and-what-you-can-do/

https://learning.oreilly.com/library/view/ethics-and-data/9781492043898/

https://www.bbc.com/news/explainers-53807730
