Tracking Regret, the Cost of Bad Decisions

Dimitri Tishchenko · Published in Making Change.org · Oct 28, 2021 · 6 min read

Finding the right share headline for a petition can mean more engagement, giving it a higher chance of becoming a victory and making a change in the world. At Change.org, we rely on the decision-making of hundreds of thousands of autonomous learning agents (a.k.a. multi-armed bandits) to select the right headline for each petition on each share channel. This article describes how we quantify and evaluate the efficiency with which these agents learn and perform.

For each petition, we create an autonomous learning agent per share channel. The goal of this agent is to select the best-converting (highest-engagement) headline from a list of headline variants. Each agent begins without any information about the variants and must discover on its own which one is best. It does this by trying the variants and observing the results. Over time, the agent should learn which variant is best and use it as much as possible.

High level data flow diagram of social sharing learning agents

After a period of observation, the conversion rates for the variants become apparent: the proportion of a variant's uses that produce a rewarding outcome. In our use case, the conversion rate is the number of times a user saw the variant and signed the petition, divided by the number of times the variant was used for a share. The variant with the highest conversion rate is known as the best variant, and we would have benefited if our agent had used it for every decision. While the best variant can only be determined in retrospect, identifying it is required in order to evaluate the decision-making ability of the agents.
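To make this concrete, here is a minimal sketch (not our production code) of how per-variant conversion rates can be computed and the best variant identified in retrospect. The counts and field names (uses, signatures) are illustrative assumptions.

```python
# Minimal sketch: per-variant conversion rates for one petition/channel agent.
# The counts and field names below are illustrative, not real data.
variant_stats = {
    "headline_a": {"uses": 1200, "signatures": 180},
    "headline_b": {"uses": 1100, "signatures": 121},
    "headline_c": {"uses": 900, "signatures": 27},
}

conversion_rates = {
    variant: stats["signatures"] / stats["uses"]
    for variant, stats in variant_stats.items()
}

# The best variant is the one with the highest observed conversion rate.
# It can only be identified in retrospect, once enough shares are observed.
best_variant = max(conversion_rates, key=conversion_rates.get)
print(conversion_rates)  # {'headline_a': 0.15, 'headline_b': 0.11, 'headline_c': 0.03}
print(best_variant)      # headline_a
```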

The decisions that agents make as they choose variants can be broken down into two types. Good decisions are ones in which the agent chose to use the best variant. Conversely, bad decisions are ones in which the agent chose a variant that is not the best one.

Each time an agent makes a good decision and selects the highest-converting variant, it is acting optimally. For every bad decision, the agent is not acting optimally and incurs regret. Regret is a term in reinforcement learning that quantifies how suboptimal a bad decision was. Consider a petition where we have three variants v1, v2 and v3. Variant v1 converts at a high rate, while v2 converts at a medium rate, and v3 has a poor conversion rate. If an agent chose v1 for every decision, it would be acting optimally. Since the agent doesn’t know what the conversion rates are for the variants, it must select suboptimal variants to learn about them. When the agent explores, picking v3 is worse than picking v2 because v2, while not being the best, is still better than v3. Given that conversion rates are numeric, we can quantify the amount of regret we accumulate with each bad decision.
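As a rough sketch, the regret for a single decision can be expressed as the gap between the best variant's conversion rate and the chosen variant's. The rates below for v1, v2 and v3 are assumed known for the sake of illustration; in practice they are only estimated in retrospect.

```python
# Illustrative only: the true conversion rates for v1, v2 and v3 are assumed
# known here; in practice they are estimated from observations in retrospect.
conversion_rates = {"v1": 0.15, "v2": 0.11, "v3": 0.03}
best_rate = max(conversion_rates.values())

def decision_regret(chosen: str) -> float:
    """Regret for one decision: how far the chosen variant's conversion
    rate falls short of the best variant's rate."""
    return best_rate - conversion_rates[chosen]

# Choosing v1 incurs no regret; choosing v3 (0.12) costs more than v2 (0.04).
decisions = ["v1", "v3", "v1", "v2", "v1"]
total_regret = sum(decision_regret(v) for v in decisions)
```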

Epsilon-Greedy vs UCB Exploration Behaviour

Epsilon-greedy and UCB are two methods an agent can use to balance exploitation (using the best variant as often as possible) against exploration (picking variants other than the current best in order to learn about them). When we switched from epsilon-greedy to UCB, we saw a decrease in regret. This is because of how the two methods differ in the way they explore variants. When epsilon-greedy explores, it selects a variant at random; no information about the variants is taken into account. When UCB selects a variant, it picks the one with the highest upper confidence bound, which accounts for both the observed conversion rate and the number of times the variant has been used. This means that very poorly performing variants are used much less frequently than the higher-converting ones. Using the example from above, when exploring, epsilon-greedy would use v2 and v3 equally even though it is known that v3 performs very poorly. UCB will use v2 more often than v3 because it takes into account the conversion rate as well as the number of times the variant has been used.
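The difference in exploration behaviour can be sketched as follows. This is a simplified illustration, not our production implementation; the usage counts and the UCB exploration constant are arbitrary.

```python
import math
import random

# Simplified illustration of the two selection rules; counts are made up.
stats = {
    "v1": {"uses": 100, "signatures": 15},
    "v2": {"uses": 80, "signatures": 9},
    "v3": {"uses": 60, "signatures": 2},
}

def conversion_rate(s):
    return s["signatures"] / s["uses"]

def epsilon_greedy_select(stats, epsilon=0.1):
    # Explore: pick uniformly at random, so v2 and v3 are equally likely,
    # no matter how poorly v3 has performed so far.
    if random.random() < epsilon:
        return random.choice(list(stats))
    # Exploit: pick the variant with the best observed conversion rate.
    return max(stats, key=lambda v: conversion_rate(stats[v]))

def ucb_select(stats, c=2.0):
    # Pick the variant with the highest upper confidence bound: observed
    # conversion rate plus a bonus that shrinks the more a variant is used.
    # Weak variants like v3 still get revisited, but far less often than v2.
    total_uses = sum(s["uses"] for s in stats.values())
    return max(
        stats,
        key=lambda v: conversion_rate(stats[v])
        + math.sqrt(c * math.log(total_uses) / stats[v]["uses"]),
    )
```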

Petition Lifecycle — Decision Numbering

The decisions an agent makes for a petition form a sequence that can be numbered. Across all petitions, we label each petition's first decision 1, its second decision 2, and so on. Since our agents are trying to balance exploration and exploitation as they learn about their variants, the first several decisions will include bad decisions, because the agents have to learn about potentially suboptimal variants. As the agents learn about the variants, they explore less often and use the best variant more, resulting in more good decisions.
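The aggregation behind the charts below can be sketched like this. The per-petition decision logs are illustrative; each boolean marks whether that decision used the retrospectively best variant.

```python
from collections import defaultdict

# Illustrative decision logs: for each petition, an ordered list of booleans
# marking whether each decision used the (retrospectively) best variant.
petitions = {
    "petition_a": [False, False, True, True, True],
    "petition_b": [False, True, False, True, True],
    "petition_c": [True, False, True, True, True],
}

good = defaultdict(int)
total = defaultdict(int)
for decisions in petitions.values():
    for n, is_good in enumerate(decisions, start=1):
        good[n] += is_good
        total[n] += 1

# Share of good decisions at decision number 1, 2, 3, ... across petitions.
pct_good = {n: good[n] / total[n] for n in sorted(total)}
```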

Example of percentage of good to bad decisions (first 3000 decisions)

With only a small number of decisions, the agents quickly determine which variant is best for a petition and begin to use that variant more than any other. After 3000 decisions the data becomes noisy as our pool of viral petitions shrinks. Let us zoom in on the first 300 decisions.

Example of percentage of good to bad decisions (first 300 decisions)

The first roughly 50 decisions are spent learning about the effectiveness of the variants. After that, we see a continuous, steady improvement. The saw-tooth pattern at the beginning of the process is noteworthy. It is likely caused by UCB entertaining inferior variants to build confidence about the best variant during learning, but further study is needed to confirm that.

System Optimality

Regret in this system can be interpreted as unrealized signatures for our petitions due to exploration of the variants. We can visualize these unrealized signatures over time and compare that to the number of realized signatures. The following chart gives us a visualization of system optimality.

Percentage Regret to Reward Over Time

The chart demonstrates that the system is quite efficient. The number of unrealized signatures hovers in the low single digits, percentage-wise, relative to the total number of signatures we receive.
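As a rough sketch of the ratio plotted above (the figures here are made up, and in practice the ratio would be computed per time bucket rather than once):

```python
# Made-up figures for illustration; in practice this would be computed
# per time bucket (e.g. per day) to produce the chart above.
realized_signatures = 250_000   # signatures actually collected
unrealized_signatures = 6_000   # expected signatures lost to exploration (regret)

pct_regret_to_reward = 100 * unrealized_signatures / realized_signatures  # 2.4%
```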

Regret by Variant

It is also useful to understand the source of regret and whether a particular variant is accruing a disproportionate share of it.

Regret by Variant Over Time

The chart above indicates that regret is fairly evenly distributed between the three variants. If one variant showed a high level of regret, it would be a good candidate for removal or replacement.
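One way to attribute regret to the variant chosen at each bad decision, and to flag a variant that accrues a disproportionate share, is sketched below. The decision log and the flagging threshold are illustrative assumptions, not our production logic.

```python
from collections import defaultdict

# Illustrative log of (chosen variant, regret incurred) per decision.
decision_log = [
    ("v2", 0.04), ("v3", 0.12), ("v1", 0.0),
    ("v2", 0.04), ("v3", 0.12), ("v2", 0.04),
]

regret_by_variant = defaultdict(float)
for variant, regret in decision_log:
    regret_by_variant[variant] += regret

total_regret = sum(regret_by_variant.values())
for variant in sorted(regret_by_variant):
    share = regret_by_variant[variant] / total_regret if total_regret else 0.0
    # Arbitrary threshold: a variant accruing most of the regret is a
    # candidate for removal or replacement.
    flag = "  <- candidate for removal/replacement" if share > 0.6 else ""
    print(f"{variant}: {regret_by_variant[variant]:.2f} ({share:.0%}){flag}")
```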

The monitoring described in this article demonstrates that the autonomous agents selecting variants are making good decisions. With this type of monitoring in place, we can have confidence to iterate on the variants that these agents choose from. New variants can be evaluated, and we can confirm that our modifications to the variants won't have adverse effects on the running system.
