(Header image: Dilbert comic by Scott Adams)

A Practical Approach to Data Ethics

Aakash Gupta
Published in Analytics Vidhya · 6 min read · Oct 29, 2018

There is a Golden Rule in life; it’s a maxim which appears in various forms around the world:

One should never do that to another which one regards as injurious to one’s own self.

As a data scientist, I find this principle of reciprocity very appealing!

Treat others’ data as you would have others treat your data.

The recent spurt in fake news and the incidents of profiling & targeting of swing voters during elections have put a spotlight on data privacy. We look at the following three facets of data ethics:

  1. Failures & Checklists
  2. Diversity & Empowerment
  3. 5C’s of Data Ethics

Failures & Checklists

Catastrophic failures often occur because practitioners believe that they are doing no harm! Until the late 1800s, doctors didn’t believe that they should clean their hands before surgery, which led to countless deaths from infection. Even after hand-washing was scientifically proven to save lives, it took many decades for doctors to accept it as part of their standard operating procedures (SOPs).

A data scientist (or any tech entrepreneur) doesn’t want to do any harm. We sincerely believe that the (data) products that we are building will improve the lives of our users. However, failures do occur.

So how can a data scientist avoid such a scenario? Using a checklist is one such approach.

A checklist helps ensure that data is used ethically. The process can be reviewed (using the checklist) at three points: 1) during conceptualization of the project; 2) during project execution; and 3) once the project is completed.

A sample checklist is given below:

  1. What kind of user consent is required?
  2. Have we explained clearly what users are consenting to?
  3. Have we tested for disparate error rates among different user groups? (See the sketch after this list.)
  4. Do we have a plan to protect & secure user data?
  5. Have we tested our training data to ensure it is fair & representative?
  6. Does the team reflect a diversity of opinions, backgrounds, and kinds of thought?
  7. Does the algorithm look at the correct artifacts/features before making a prediction?

This list isn’t exhaustive; it is evolving in nature. But it forces us to ask difficult questions while we plan our projects.
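As an illustration of item 3, here is a minimal sketch of measuring error rates per user group. The column names and values are hypothetical stand-ins; in practice you would use your model’s real evaluation set.

```python
# A minimal sketch of checking for disparate error rates across user
# groups (item 3 of the checklist). The data here is hypothetical.
import pandas as pd

df = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "y_true": [1, 0, 1, 0, 1, 0, 0, 1],
    "y_pred": [1, 0, 0, 0, 0, 1, 0, 1],
})

for group, part in df.groupby("group"):
    negatives = (part["y_true"] == 0).sum()
    positives = (part["y_true"] == 1).sum()
    # False positive / false negative rates per group; large gaps
    # between groups are a red flag worth investigating.
    fpr = ((part["y_pred"] == 1) & (part["y_true"] == 0)).sum() / max(negatives, 1)
    fnr = ((part["y_pred"] == 0) & (part["y_true"] == 1)).sum() / max(positives, 1)
    print(f"group {group}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```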

As an example, to ensure that the algorithm is actually doing what it’s supposed to do, data science teams can use tools like SHAP & LIME. These tools help us identify the features which a machine learning algorithm uses to make a prediction.

This saves us from the embarrassing scenario of deploying an incorrectly trained algorithm. An infamous example is an algorithm that used the snow in the background of an image to predict the presence of a wolf in the image.

(Image: from “Why Should I Trust You?” by Ribeiro et al., the LIME paper)
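As a concrete illustration, here is a minimal sketch of generating feature attributions with SHAP, assuming the `shap` and `scikit-learn` packages are installed. The dataset and model are illustrative stand-ins; the same pattern applies to any tree-based model.

```python
# A minimal sketch of inspecting feature attributions with SHAP.
# The dataset and model here are illustrative stand-ins.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(data.data, data.target)

# TreeExplainer attributes each prediction to the input features, so
# we can verify the model relies on sensible features rather than
# spurious artifacts (like the snow in the wolf example above).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data)
shap.summary_plot(shap_values, data.data, feature_names=data.feature_names)
```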

Diversity & Empowerment

New-age companies live by the maxim of moving fast & breaking things.

Fail fast & cheap!

We want to build minimum viable products without understanding the consequences. Young engineers push their builds into production; however, if they have reservations about the product, we do not empower them to roll back products or features from the market.

In the book Ethics and Data Science (by Mike Loukides, Hilary Mason & DJ Patil) [1], the authors talk about creating a safe space for dissent in data science teams. A single team member may object to an approach, but if there is no support for ethical thinking within an organization, they can easily be sidelined. Hence, it’s important from an organizational perspective to empower even the youngest member of a data science team.

Another way to avoid blind spots is to build diversity into a team. Developers of different backgrounds & expertise add significant value to a team’s productivity.

I would highly recommend watching this talk by Joy Buolamwini. She was a graduate student when she realized that face detection algorithms couldn’t recognize her face! It was as if, for the algorithm, she didn’t exist!

She found that the algorithm couldn’t recognize her face because the training dataset didn’t have any samples with a darker skin tone. This was not due to any deliberate racist behavior, but simply because none of the algorithm’s developers, all of whom were light-skinned, realized that their training data was incomplete.

This underscores the fact that diversity and empowerment in data science teams help cover blind spots. They also enable meaningful dialog within a team.

The 5 Cs of Data Ethics

To ensure that there is a mechanism to foster this dialog, the following guidelines have been suggested for building data products:

  1. Consent
  2. Clarity
  3. Consistency
  4. Control (and Transparency)
  5. Consequences (and Harm)

Consent doesn’t mean anything…unless the user has clarity on the terms & conditions of the contract. Contracts are usually a series of negotiations, but in our online transactions the choice is always binary: the user either accepts the terms or rejects them. Developers of data products should not only ensure that they obtain consent from the user; the user should also have clarity on 1) what data they are providing; 2) how their data will be used; and 3) what the downstream consequences of using the data are.

(Image: cartoon from GDPRToons.com)

Remember: “I have read and agreed to the terms and conditions” is one of the biggest lies on the web. Terms-of-service agreements are often too long and too difficult for the layperson to understand. Hence it’s important that users know what they are consenting to, and this should be expressed in the simplest of terms.

Consistency is important to gain the trust of the user. Even people with the best intentions can interpret the terms of engagement in strange & unpredictable ways. Controls should be present so that if users change their mind, they can simply delete their data.

Google recently gave users more control over their search history data. Users can now review and delete their search activity within Google Search, and they can also disable ad personalisation.
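As a sketch of what such a control might look like, here is a hypothetical “delete my data” endpoint, assuming a simple Flask service and an in-memory store; the route and data structure are illustrative, and a real system would also need to purge backups, logs, and downstream copies of the data.

```python
# A hypothetical sketch of a "delete my data" control.
from flask import Flask, jsonify

app = Flask(__name__)

# Illustrative stand-in for a real datastore.
user_data = {"alice": {"search_history": ["pasta recipe", "weather"]}}

@app.delete("/users/<user_id>/data")
def delete_user_data(user_id):
    # Honour the user's change of mind by removing their stored data.
    removed = user_data.pop(user_id, None)
    status = "deleted" if removed is not None else "not_found"
    return jsonify({"user": user_id, "status": status})

if __name__ == "__main__":
    app.run()
```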

Users should also be aware of the consequences of sharing their data. A prime example: users know that their Twitter feeds are publicly available, yet few know that their tweets can be used by researchers or profiling firms. This may have unintended consequences.

The “Unknown Unknowns”

We often hear project managers talk about the “Unknown Unknowns”: the unforeseen consequences, the risks that cannot be eliminated. However, all too often these risks are unknown because we don’t want to know them. When machine learning models are trained on biased data, there is a danger that they institutionalize discriminatory behavior. A good example is Amazon’s recruitment algorithm, which discriminated against women [2].

Similarly, users have stopped trusting news agencies and consumer internet companies. This lack of trust arises because they feel abused (see fake news and Cambridge Analytica [3][4]).

Data science is an evolving field, built on ideas developed only in the last few decades. Humans built buildings & bridges before the principles of civil engineering were codified, and they are now building large-scale prediction systems that involve society. Just as early buildings & bridges collapsed in unforeseen ways, these predictive systems will fail & expose serious conceptual flaws. [5]

And it’s good to fail, since failure gives us the incentive to build more robust systems!

References:

  1. Ethics and Data Science, by Mike Loukides, Hilary Mason & DJ Patil — this book should be required reading for anyone who is serious about data science
  2. Amazon Created a Hiring Tool Using AI. It Immediately Started Discriminating Against Women — Jordan Weissmann, Slate
  3. Facebook–Cambridge Analytica Data Scandal — Wikipedia entry
  4. Facebook-Cambridge Analytica: A Timeline of the Data Hijacking Scandal — CNBC
  5. Artificial Intelligence: The Revolution Hasn’t Happened Yet — Michael I. Jordan (@mijordan3)
  6. The Dark Secret at the Heart of AI — Will Knight, MIT Technology Review

Disclaimer:

These are my own personal views and do not represent the views/strategies of my employer, Edelweiss.


AI/ML practitioner, cloud specialist & multiple hackathon winner. For consulting assignments, reach out to me — aakash@thinkevolveconsulting.com