Ethics of the Explore vs Exploit Tradeoff

Vinesh Kannan
Illinois Tech ACM
Published Apr 9, 2019 · 7 min read

My statistics professor invited me to talk about ethical machine learning to her graduate statistical learning class. I picked a topic that would be technically appealing to the students and help introduce ethical concepts.

Materials: Activity and Slides

Time: 45–60 minutes

Learning Objectives:

  • Students will analyze how their classmates estimate the potential benefits and harms of an experiment.
  • Students will recognize the rights of users in an experiment.
  • Students will evaluate multi-armed bandits as a tool for designing ethical experiments.

Explore vs Exploit

I designed a session about the explore/exploit tradeoff, which describes the tension between wanting to choose the best action according to your data (exploit) and wanting to experiment with potentially suboptimal actions to find out if they might be better (explore).

Should we give some people a potentially inferior treatment in order to learn about its effects?

Academic researchers might use randomized control trials (RCTs) to determine if a medical treatment is effective. Technology companies might use A/B tests to improve the click-through rate of a website. In both types of experiments, users are split into random groups and assigned different variations of an experience.
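As a toy illustration of that random split (a hypothetical helper, not code from either kind of study), assignment for a two-variant experiment might look like:

```python
import random

def assign_group(user_id, variants=("control", "treatment")):
    """Assign a user to a random experiment group.

    Seeding with the user id keeps the assignment stable
    across repeat visits, so each user sees one variant.
    """
    rng = random.Random(user_id)
    return rng.choice(variants)

# Every user lands in exactly one group, and repeat calls agree.
groups = {uid: assign_group(uid) for uid in range(1000)}
assert all(assign_group(uid) == g for uid, g in groups.items())
```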

Example of an Amazon A/B test. Credit: Wolfgang Bremer, via Medium

Researchers who work with human subjects are bound by institutional review boards (IRBs) and The Federal Policy for the Protection of Human Subjects, known as “the Common Rule.” Users must give consent to participate in such experiments. Companies obtain consent when users agree to their terms of service, but users may not be fully aware of what they have agreed to.

Actions can be legal but not ethical. Studies can be statistically sound but not ethical. I wanted to introduce students to language that would help them describe to others the ethics of an explore/exploit situation.

Case Study: Newsfeed Experimentation

I chose a 2014 Facebook experiment as a case study:

“In an experiment with people who use Facebook, we test whether emotional contagion occurs outside of in-person interaction between individuals by reducing the amount of emotional content in the News Feed. When positive expressions were reduced, people produced fewer positive posts and more negative posts; when negative expressions were reduced, the opposite pattern occurred. These results indicate that emotions expressed by others on Facebook influence our own emotions, constituting experimental evidence for massive-scale contagion via social networks. This work also suggests that, in contrast to prevailing assumptions, in-person interaction and nonverbal cues are not strictly necessary for emotional contagion, and that the observation of others’ positive experiences constitutes a positive experience for people.”

- Kramer, Guillory, and Hancock. 2014. “Experimental evidence of massive-scale emotional contagion through social networks.” PNAS.

Headlines showing media outcry about Facebook’s newsfeed emotions experiment. Image: CBS News (2014).

I designed an activity on Desmos to walk students through the steps of the ethical decision making framework from the Markkula Center for Applied Ethics at Santa Clara University.

  1. Recognize an Ethical Issue
  2. Get the Facts
  3. Evaluate Alternative Actions
  4. Make a Decision and Test It
  5. Act and Reflect on the Outcome

The Markkula Center outlines five sources of ethical standards. For this activity, I chose to focus on two: utilitarian ethics and rights ethics.

Utilitarian Test: Bentham’s Criteria

We started with a simple utilitarian test:

What could be some benefits from this experiment? List at least two.

What could be some harms from this experiment? List at least two.

On the whole, is this experiment more beneficial or more harmful?

In my experience, engineers and data scientists commonly talk about controversial decisions in terms of potential benefits and harms. However, if a coworker or manager disagrees with their assessment of how impactful or how likely those outcomes are, the concerns may simply be dismissed.

I asked students to analyze their classmates’ responses and figure out which of Bentham’s Criteria they were using:

  • Intensity: How strong is the benefit or harm?
  • Duration: How long will the benefit or harm last?
  • Certainty: How likely is the benefit or harm to transpire?
  • Nearness: How soon might the benefit or harm be felt?

As I had hoped, the exercise generated disagreement. Students noticed that their classmates did not use all four dimensions. Students struggled to quantify and compare the benefits and harms that others mentioned.

Rights Test: Kant’s Mere Means Principle

Statements such as “this could cause irreversible psychological damage” and “all humans have a right to consent” did not fit cleanly into the criteria. To contrast utilitarian ethics with rights ethics, I introduced the class to Kant’s Mere Means test:

Are the subjects being used merely as a means to the company’s end (goal)?

(1) In principle, could the subjects consent to the course of action?

(2) Is the company contributing to their users’ end (goal)?

Kant argues that to be just, we ought to avoid using others as mere means. To be beneficent, we must sometimes also contribute to others’ goals. For more discussion of the mere means principle, read Onora O’Neill’s article or this Philosophy StackExchange post.

  • Students found it easy to assign values to Bentham’s criteria, but still disagreed with each other over their conclusions.
  • Most students concluded that users could not reasonably consent to this experiment, but were unsure whether or not it advanced their ends.

Using multiple sources of ethical standards can bring more clarity to a decision. Most students find it easy to start talking about utility, but the mere means test provides a framework for discussing rights.

Will Multi-Armed Bandits solve explore/exploit?

With some ethical discussion under their belts, I introduced the students to a class of algorithms that has been billed as an alternative to A/B tests.

The Multi-Armed Bandit (MAB) randomly chooses between exploration and exploitation, in hopes of minimizing the number of users who are exposed to an inferior treatment.

Each time the multi-armed bandit has to make a decision, it either explores, by trying a random action, or exploits, by trying the action it currently believes to be the best.
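A single decision of that kind can be sketched in a few lines. This is epsilon-greedy, one common MAB strategy; the function name and reward estimates below are illustrative, not from any particular system:

```python
import random

def epsilon_greedy_choice(estimates, epsilon=0.1, rng=random):
    """One decision of an epsilon-greedy bandit.

    estimates: dict mapping each action to its estimated reward.
    With probability epsilon we explore (random action);
    otherwise we exploit (the current best estimate).
    """
    if rng.random() < epsilon:
        return rng.choice(list(estimates))        # explore
    return max(estimates, key=estimates.get)      # exploit

estimates = {"A": 0.32, "B": 0.45, "C": 0.40}
# With epsilon=0, the bandit always exploits the best-known arm.
assert epsilon_greedy_choice(estimates, epsilon=0.0) == "B"
```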

In this blog article from 2014, the Khan Academy Computing team describes one of their A/B test experiments about motivational content for learners. A commenter wonders if MAB might improve the results.

A commenter on the Khan Academy Computing blog wonders if bandit algorithms can predict whether or not a video sneak peek will motivate a student.

Engineers, data scientists, marketers, and managers who want better, faster results may be tempted to engage in the technical debate. Paras Chopra from VWO and Steve Hanov presented opposing arguments on A/B testing versus multi-armed bandits for website optimization.

The simplest MAB strategies are “greedy” algorithms: they make decisions based on the current local optimum, so they are not guaranteed to find the “true” best option. MAB captures the explore/exploit tradeoff directly. If data scientists choose too low an exploration rate, MAB can lock in on whichever option initially gives it the most rewards; with too high an exploration rate, MAB will continue to test potentially suboptimal options. Students who program in Python can play with variations of MAB in this Kaggle kernel.
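The lock-in failure mode is easy to demonstrate in a small simulation. This is my own epsilon-greedy sketch on two made-up Bernoulli arms, not code from the Kaggle kernel:

```python
import random

def run_bandit(true_rates, epsilon, steps=5000, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    counts = {arm: 0 for arm in true_rates}
    totals = {arm: 0.0 for arm in true_rates}

    def estimate(arm):
        # Estimated mean reward; unpulled arms default to 0.
        return totals[arm] / counts[arm] if counts[arm] else 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.choice(list(true_rates))   # explore
        else:
            arm = max(true_rates, key=estimate)  # exploit
        counts[arm] += 1
        totals[arm] += 1.0 if rng.random() < true_rates[arm] else 0.0
    return counts

# With no exploration, ties in the initial zero estimates make the
# bandit lock in on the first arm, even though "B" is better.
counts = run_bandit({"A": 0.2, "B": 0.8}, epsilon=0.0)
assert counts["A"] == 5000
```

With a moderate exploration rate the better arm dominates instead, which is exactly the tradeoff the class debated.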

I have noticed some other variations of MAB in machine learning and reinforcement learning papers that should make data scientists reflect:

  • Upper Confidence Bound 1 (UCB1) chooses the option that has the highest upper confidence bound for reward, even if variation is high. As explained by Jeremy Kun, this algorithm minimizes “regret,” a measure of missed rewards. Could early users receive extreme options?
  • Contextualized Bandits (CMAB) use attributes about the user and the option to estimate reward. Researchers from Yahoo! and Princeton used contextualized bandits to personalize news articles on the Yahoo! front page. Could some of these attributes be inappropriate?
  • Behavior Constrained Contextual Bandits (BCCB) can learn rules about what options are permissible for certain users. IBM researchers trained a BCCB for movie recommendations to avoid showing mature content to young viewers. What could happen when bandits have to optimize both rewards and behavioral constraints?
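To make the UCB1 concern concrete, here is a minimal sketch of one UCB1 decision (my own illustration with made-up numbers, not code from the cited papers):

```python
import math

def ucb1_choice(counts, rewards, t):
    """One UCB1 decision: pick the arm with the highest upper
    confidence bound on its mean reward.

    counts: pulls per arm; rewards: summed reward per arm;
    t: total pulls so far. Unpulled arms are tried first.
    """
    for arm, n in counts.items():
        if n == 0:
            return arm  # every arm gets at least one trial

    def ucb(arm):
        mean = rewards[arm] / counts[arm]
        return mean + math.sqrt(2 * math.log(t) / counts[arm])

    return max(counts, key=ucb)

# An arm with few pulls gets a wide confidence bound, so it can beat
# an arm with a higher observed mean -- early users may be served
# options the algorithm is still very uncertain about.
counts = {"A": 100, "B": 2}
rewards = {"A": 60.0, "B": 1.0}
assert ucb1_choice(counts, rewards, t=102) == "B"
```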

I argued to the class that while bandit approaches may reduce the number of people exposed to suboptimal treatments, they do not change the rights of the users involved in the experiment.

  • How robust are these bandit algorithms to scenarios where users are allowed to opt-out of the experiment?
  • What is lost when minimizing regret and maximizing reward?
  • Are people who use bandits treating their users as mere means?

I am not convinced that better bandit algorithms alone will make experiments more ethical. Even if the students disagree with my argument about bandits, I hope this session gave them new perspectives for discussing the ethics of the explore/exploit tradeoff. Every team should be able to discuss whether or not their experiment is ethical.

Thank you to Dr. Lulu Kang for her support and encouragement.

Vinesh Kannan studies Computer Science at the Illinois Institute of Technology. He taught a high school enrichment class about ethical machine learning and contributed to open source ethical computer science activities for college courses.
