A Beginner’s Guide to Data Ethics

Published in Big Data at Berkeley

9 min read · Aug 11, 2020

By Sophie Lou and Mark Yang

What is the connection between Data Science and Ethics?

At first glance, the two subjects don’t seem to have much in common. Data science is related to engineering and science, while ethics revolves around social science and philosophy. However, human contexts and ethics are inseparable parts of Data Science. For the rest of this article, we’re going to assume that you already have some basic knowledge of what Data Science is. If you don’t, no worries! Check out some other articles by Big Data at Berkeley, or read this article by the UC Berkeley School of Information, which will give you a more detailed introduction to the subject. Whether you are interested in Data Science or simply curious about human ethics and morals, this article will give you a great first look at the amazing world of data ethics.

What is Data Ethics? Why do we care about it?

Let’s start with a professional definition of data ethics. In their article, What is data ethics?, Oxford professors and philosophers Luciano Floridi and Mariarosaria Taddeo state that:

“Data ethics is a new branch of ethics that studies and evaluates moral problems related to data (including generation, recording, curation, processing, dissemination, sharing and use), algorithms (including artificial intelligence, artificial agents, machine learning and robots) and corresponding practices (including responsible innovation, programming, hacking and professional codes), in order to formulate and support morally good solutions (e.g. right conducts or right values).”

Simply put, data ethics is the study of the ethical problems that arise when we use data. In this era of rapid technological development, we are living in a “Data-fied World.” Data collection is a vital part of nearly every aspect of our lives, from the phones in our pockets to the cars we drive. Almost every human behavior, and every operation we perform with a tool like a computer, can be collected as data. Over the years, as technology progressed and we aimed for a better life, we began to use the data generated by day-to-day actions to conduct complex analysis, helped by greater computing power and new analytical tools. Advanced technologies related to data science, like Machine Learning and AI, have brought a lot of benefits to our lives. However, as humans step away from hands-on analysis and let automated machines do most of the work for us, issues such as fairness, privacy, and representation emerge. We will cover a case involving those issues in detail below, so keep reading!

Why do Data Scientists need to understand Data Ethics?

Ever since Data Science became a buzzword in the tech industry, colleges and universities have been scrambling to open Data Science programs to satisfy the world’s growing demand for data scientists, engineers, and analysts. In 2018, the University of California, Berkeley was among the first few colleges to introduce a dedicated Data Science major. The program wants to:

“produce graduates who not only have deep technical expertise, but who also know how to responsibly collect and manage data, and use it to inform decisions and advance innovation to benefit the rapidly evolving world they’re graduating into”.

Besides various technical requirements such as computing, probability, and modeling, the Berkeley Data Science curriculum has an additional human contexts and ethics requirement. This shows how academic institutions recognize data ethics as a crucial skill for any future Data Scientist to develop. As Data Scientists, we often deal with big sets of data that are driven by people, so it is our duty to keep private data secured and use it responsibly. To better incorporate human values like justice and equity in data-driven technologies, we need to also understand the underlying human and social structures.

How should we incorporate Data Ethics in our work as students?

When we are doing a data science project, we need to make sure that we understand the potential ethical consequences of our work. Here are some tips for being an ethical data scientist. First, be aware of privacy issues such as data breaches and find ways to adequately secure the data. If you are not familiar with the danger of a data breach, check out this news article about the Facebook Security Breach. Second, be transparent about your data usage. Get users’ consent before you use their data in any way. Read the report by the CDC about the infamous Tuskegee syphilis study to see how a study with no informed consent can go horribly wrong. Third, despite the difficulty of being completely objective, do your best to keep bias out of your model. To help employees follow data ethics principles, many companies and organizations have adopted codes of ethics and conduct. One code of conduct that a lot of professional data scientists follow is the Oxford-Munich Code of Conduct. It addresses common ethical dilemmas that data scientists in industry, academia, and the public sector may face. Feel free to take a look at it. Below, we also provide a checklist created by DJ Patil, Hilary Mason and Mike Loukides, which you can use to incorporate data ethics in all of your data science projects.

Here’s the Data Ethics Checklist:

❏ Have we listed how this technology can be attacked or abused? [SECURITY]

❏ Have we tested our training data to ensure it is fair and representative? [FAIRNESS]

❏ Have we studied and understood possible sources of bias in our data? [FAIRNESS]

❏ Does our team reflect diversity of opinions, backgrounds, and kinds of thought? [FAIRNESS]

❏ What kind of user consent do we need to collect to use the data? [PRIVACY/TRANSPARENCY]

❏ Do we have a mechanism for gathering consent from users? [TRANSPARENCY]

❏ Have we explained clearly what users are consenting to? [TRANSPARENCY]

❏ Do we have a mechanism for redress if people are harmed by the results? [TRANSPARENCY]

❏ Can we shut down this software in production if it is behaving badly?

❏ Have we tested for fairness with respect to different user groups? [FAIRNESS]

❏ Have we tested for disparate error rates among different user groups? [FAIRNESS]

❏ Do we test and monitor for model drift to ensure our software remains fair over time? [FAIRNESS]

❏ Do we have a plan to protect and secure user data? [SECURITY]

(Loukides, Mason, Patil)
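
To make one checklist item concrete, here is a minimal sketch of what testing for disparate error rates among different user groups could look like. It is only an illustration, not part of the original checklist: the function name `error_rates_by_group`, the group labels "A" and "B", and the toy data are all made up for this example, and a real project would use its own labels, predictions, and group definitions.

```python
from collections import defaultdict


def error_rates_by_group(y_true, y_pred, groups):
    """Compute false positive and false negative rates separately per group.

    y_true and y_pred hold binary labels (0 or 1); groups holds the
    group each example belongs to. All three must have the same length.
    """
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts[group]
        if truth == 1:
            c["pos"] += 1
            if pred == 0:
                c["fn"] += 1  # missed a true positive
        else:
            c["neg"] += 1
            if pred == 1:
                c["fp"] += 1  # flagged a true negative

    rates = {}
    for group, c in counts.items():
        rates[group] = {
            # None means the rate is undefined for this group
            "false_positive_rate": c["fp"] / c["neg"] if c["neg"] else None,
            "false_negative_rate": c["fn"] / c["pos"] if c["pos"] else None,
        }
    return rates


# Hypothetical toy data: the model is perfect on group A but not on group B.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = error_rates_by_group(y_true, y_pred, groups)
for group, group_rates in sorted(rates.items()):
    print(group, group_rates)
```

In this toy example, overall accuracy looks acceptable, yet every error falls on group B. That is exactly the kind of gap a per-group breakdown is meant to surface.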

Some real-world cases that might blow your mind.

Now that we have some ideas about how to be ethical data scientists, let’s examine the following case that addresses some of the issues we mentioned above.

As the world has become more technologically advanced, the use of data has brought efficiency to a variety of industries. For example, many tech companies have employed data scientists to track and understand the popularity of their products. However, Kwang-Mo Yang, a member of Samsung Medical Center, has written an article about the ethical concerns behind using real-world data. The problem arises when non-governmental organizations study health data after de-identifying personal information. Because a patient’s health data may contain highly personal information, pharmaceutical companies can analyze the gender, age, and race of patients and categorize a certain type of individual as vulnerable to a certain disease. This makes it an issue of both privacy and representation. Pharmaceutical companies have used this information to target advertisements for drugs. Groups of individuals who had been classified as “vulnerable to Disease A” were more likely to see advertisements for drugs targeting “Disease A” in their pharmacies¹. This categorization often disproportionately affected low-income communities and under-represented minorities, raising questions about whether the practice was truly ethical. Many individuals have also expressed concern about their personal information being used for commercial purposes. Ironically, many governmental organizations have used real-world data to develop new drugs that have helped a variety of patients, but they used lots of personal information to drive that marketing².

While it is legal in many countries, including South Korea, to study health data after de-identifying personal information, it raises the question: Would individuals be happy about their own information being used without their knowledge? Will people feel completely secure in a society where they can’t hide their personal information?
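
One way to see why de-identified data can still feel personal is to count how many people share the same combination of attributes such as gender, age, and race. The sketch below is our own illustration rather than anything from Yang’s article: it uses the idea behind k-anonymity, and the function name `smallest_group_size`, the field names, and the records are all hypothetical.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for the given quasi-identifier fields.

    A result of 1 means at least one person is unique on those fields,
    even though no name or ID appears in the data.
    """
    if not records:
        raise ValueError("records must not be empty")
    combos = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(combos.values())


# Hypothetical de-identified health records (no names, no ID numbers).
records = [
    {"gender": "F", "age_band": "30-39", "race": "R1", "diagnosis": "X"},
    {"gender": "F", "age_band": "30-39", "race": "R1", "diagnosis": "Y"},
    {"gender": "M", "age_band": "40-49", "race": "R2", "diagnosis": "X"},
    {"gender": "M", "age_band": "40-49", "race": "R2", "diagnosis": "Z"},
    {"gender": "F", "age_band": "60-69", "race": "R3", "diagnosis": "Y"},
]

k = smallest_group_size(records, ["gender", "age_band", "race"])
print(k)  # 1: the last record is unique on gender, age band, and race
```

When the smallest group has only one member, anyone who knows that person’s gender, age range, and race could pick out their record, which is part of why consent matters even for de-identified data.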

To make any study more ethical, companies should acquire informed consent from their patients before they begin to use private data from an individual.

The Evolution of Data Ethics

We’ve already talked a lot about the present state of data ethics in data science. Now let’s look ahead at the roles it may play in the industry. Writer Barbara Lawler has identified some potential global trends related to data ethics and data privacy. Here are five trends that Lawler expects to see:

Chief Privacy Officers can expect ethics to become an explicit part of their role.
Technology companies will lead the way for U.S. Federal Privacy legislation.
Sustainable ethics codes will evolve to better address the challenges of a digital world.
Product excellence and privacy by design will become synonymous.
Companies will drive to educate policy-makers and regulators about their technologies.

What does this mean for you?

Data Ethics is here to stay, and will likely become a key part of any responsible Data Scientist’s job, if it isn’t already.

While data has yielded a wide variety of benefits in our everyday lives, the purpose behind its use has become a vital topic. It all begins with considering the human impact of using data. It will be important for privacy officers to analyze the impact on people and society, and whether that impact is positive, negative, or neutral.

Since the need for data privacy in the United States has long been discussed, and technology companies are the most knowledgeable organizations in the area of data usage, Lawler believes that technology companies will lead the way for U.S. federal privacy legislation, following the General Data Protection Regulation (GDPR) implemented in the European Union.

The consensus on how to respect privacy has been shifting for a long time, driven by the emergence of personal computing and ever-larger networks. Given the globalization of the economy and the profound changes in citizens’ physical and digital lives, Lawler is convinced that companies will develop sustainable ethics codes to address the challenges of a digital world.

Privacy by design means embedding data privacy requirements into product design and development, embodying the “build it in, don’t bolt it on” mentality. This includes:

Privacy-savvy defaults
In-product transparency
Consideration and documentation of privacy risks and data flows
Assignment of data owners upfront and throughout the data lifecycle, including E2E security

With advancements in technology, knowing where data comes from and why it exists has never been more vital from a strategic, operational, and compliance perspective. Data needs to be stored in a clean and accessible form, which will allow companies to learn, analyze, and tackle business issues in real-time. PbD (Privacy by Design) will play a critical role in this and it is just as important as secure coding.
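
As a small illustration of “privacy-savvy defaults,” here is a sketch in which every data-sharing option starts switched off and the user has to opt in explicitly. It is our own hypothetical example, not something from Lawler’s article: the `PrivacySettings` class, its field names, and the `allowed_uses` helper are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivacySettings:
    """Hypothetical per-user settings where the defaults protect privacy."""

    share_usage_analytics: bool = False  # off unless the user opts in
    personalized_ads: bool = False       # off unless the user opts in
    location_tracking: bool = False      # off unless the user opts in


def allowed_uses(settings):
    """List the data uses the user has explicitly opted into."""
    uses = []
    if settings.share_usage_analytics:
        uses.append("analytics")
    if settings.personalized_ads:
        uses.append("ads")
    if settings.location_tracking:
        uses.append("location")
    return uses


# A brand-new user has shared nothing by default.
default_settings = PrivacySettings()
print(allowed_uses(default_settings))  # []

# Sharing only happens after an explicit choice.
opted_in = PrivacySettings(share_usage_analytics=True)
print(allowed_uses(opted_in))  # ['analytics']
```

Making the settings object immutable (`frozen=True`) is one possible design choice: any change to what a user has agreed to then has to go through creating a new, explicit settings value rather than a quiet in-place edit.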

Lawler writes that it is vital for policymakers around the world to develop a deeper understanding of what they wish to regulate. Given the profound shift in digital networks globally, policymakers must consider³:

What harms are they trying to protect people from?
What rights do they want to guarantee?
What problems are they trying to solve?
What are the privacy outcomes they hope to achieve for their citizens?

Therefore, the better policymakers understand newly created technologies, the easier it will be for them to decide whether to regulate them. As a result, the organizations that place the greatest emphasis on educating policymakers will have the greatest impact on the evolution of data science.

Conclusion

Thank you for reading! We hope you’re walking away with a better understanding of how to become an ethical data scientist. Feel free to come back and refer to the collection of resources we have provided on how to incorporate ethics in your data analysis work. Also, please don’t hesitate to comment and give us feedback! We would love to hear your thoughts on our article and data ethics in general.

Good luck in all of your data science endeavors!!

[1]: Tanner A. Our bodies, our data: how companies make billions selling our medical records. Boston (MA): Beacon Press; 2017.

[2]: Budin-Ljosne I, Teare HJ, Kaye J, Beck S, Bentzen HB, Caenazzo L, et al. Dynamic Consent: a potential solution to some of the challenges of modern biomedical research. BMC Med Ethics. 2017;18(1):4.

[3]: https://looker.com/blog/big-data-ethics-privacy