Maximizing Social Change

Dimitri Tishchenko
Published in Making Change.org · Jun 7, 2020

Every day, thousands of petitions are created around the world on Change.org. The most effective way for a petition starter to reach an audience is social media sharing, and Change assists petition starters by picking the most engaging share headline. With more engagement comes more signatures, which leads to a greater chance of the petition becoming a victory. The virality cycle of most petitions is short-lived, so there isn’t time to employ traditional A/B testing methodologies. Instead, we use real-time, adaptive multi-armed bandit algorithms.

Share Headline Testing

When a user shares on social media, we can control the copy that appears in the post. For example, on Facebook it is possible to customize the share headline and description. In the following example, “Can you spare…” and the grey description below it are custom text we control when the post is created.

From experience, we have a set of headlines and descriptions that we have found very effective at engaging users to sign petitions.

Examples of Facebook share copy post variants

Consider a set of share headlines (h1, h2, h3, h4).

One thing we could learn is which share headline is the most effective across all petitions. While this can be an effective approach, it does not account for the fact that the overall winning variant may be suboptimal for many individual petitions. To illustrate this point, consider four petitions (p1, p2, p3, p4). The following table shows the conversion rate for each petition-headline pair; 0.9 means there is a 90% chance that a share using that headline will result in someone signing the petition (a conversion). The conversion rate varies wildly between petitions for a given headline. Further, the rates are deliberately balanced: in this contrived, worst-case scenario, the best headline for one petition is the worst headline for another.

      h1   h2   h3   h4
p1:  0.9  0.7  0.3  0.1
p2:  0.7  0.3  0.1  0.9
p3:  0.3  0.1  0.9  0.7
p4:  0.1  0.9  0.7  0.3

If we ran a typical A/B test, we would ignore which petition each share belongs to and only look at conversion in aggregate. Our test results would show:

     h1   h2   h3   h4
p:  0.5  0.5  0.5  0.5

It appears there is no winner: all of our headlines seem to convert at the same rate. In practice this is rarely the case and some imbalance exists. In this second example, the conversion rate for (p4, h4) is 0.5 instead of 0.3.

      h1   h2   h3   h4
p1:  0.9  0.7  0.3  0.1
p2:  0.7  0.3  0.1  0.9
p3:  0.3  0.1  0.9  0.7
p4:  0.1  0.9  0.7  0.5

If we ran the A/B test now we would get the following:

     h1   h2   h3   h4
p:  0.5  0.5  0.5  0.55

The imbalance slightly changes the aggregate results and we can now declare a winner: h4. However, notice that h4’s conversion rate for p1 is very poor (0.1).
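To make this concrete, here is a small illustrative sketch in Python, with the numbers taken straight from the second table. It reproduces the aggregate result and shows that the aggregate winner, h4, is the worst possible choice for p1:

```python
# Per-petition conversion rates from the second table (p4/h4 changed to 0.5).
rates = {
    "p1": {"h1": 0.9, "h2": 0.7, "h3": 0.3, "h4": 0.1},
    "p2": {"h1": 0.7, "h2": 0.3, "h3": 0.1, "h4": 0.9},
    "p3": {"h1": 0.3, "h2": 0.1, "h3": 0.9, "h4": 0.7},
    "p4": {"h1": 0.1, "h2": 0.9, "h3": 0.7, "h4": 0.5},
}

# Aggregate each headline across petitions, as a classic A/B test would:
# roughly 0.5 for h1, h2 and h3, and 0.55 for h4.
aggregate = {
    headline: sum(per_petition[headline] for per_petition in rates.values()) / len(rates)
    for headline in ["h1", "h2", "h3", "h4"]
}

# The aggregate winner is h4, yet h4 is the worst choice for p1 (0.1).
winner = max(aggregate, key=aggregate.get)
print(winner, aggregate[winner], rates["p1"][winner])
```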

This would typically be followed by keeping the winner, brainstorming a new set of share headlines, and iterating on the test. There is a limit to this approach: it can only identify aggregate imbalances and ignores the petition as a test context.

We need a system that can pick the right share headline for each petition: for p1 we want h1, for p2 we want h4. This amounts to running a separate test for each petition, and at our scale it is not feasible to have a human operator in the loop. We need this system to be automated.

Bandit Optimization

Let’s shift our mindset from one of testing to one of optimization. In testing, we want to learn how effective each share headline is so we know which option to keep and carry into future tests. In an optimization process, the variant selection is automated; the difficult task becomes stating the problem correctly.

How to Make Decisions

Throughout the optimization process we keep counters to record our observations. Each time we are asked to select a variant, we consult the recorded data and decide which variant to use. We need to balance two things: exploration and exploitation. Exploitation means using the best variant we know of so that we get the maximum reward. Exploration questions whether the variant we are exploiting really is the best; we have to spend some traffic verifying that assumption.

For every variant of every experiment we are going to maintain two counters.

First some terminology:

total number of pulls: Increment every time this variant is used

total number of rewards: Increment every time this variant experiences something good

In our share headline example above, pulls are shares and rewards are signatures generated when a user clicks the share and signs the petition.

The data for a typical experiment would then look like:

variant1: pulls: 10  rewards: 5
variant2: pulls: 15  rewards: 7
variant3: pulls: 20  rewards: 18
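As a rough illustration, the bookkeeping can be as simple as a pair of counters per variant, keyed by experiment (for share headlines, one experiment per petition). The sketch below uses illustrative names and is not our production bandit service:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class VariantStats:
    pulls: int = 0    # times this variant was used (a share, in our example)
    rewards: int = 0  # times it paid off (a signature from that share)


# One experiment per petition, one pair of counters per share headline.
experiments: dict[str, dict[str, VariantStats]] = defaultdict(
    lambda: defaultdict(VariantStats)
)


def record_pull(petition_id: str, headline: str) -> None:
    experiments[petition_id][headline].pulls += 1


def record_reward(petition_id: str, headline: str) -> None:
    experiments[petition_id][headline].rewards += 1
```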

Aside: Badly Incentivized Bandits (Reward Hacking)

An early lesson we learned is that it is very important to know exactly what you want to maximize when designing a bandit. One of our earliest bandits chose a petition image from a set of uploaded images. It was pulled on a share and rewarded on a visit. This incentivized the bandit to maximize clickthrough rate, which pushed it toward clickbait images that did not increase signature conversion. We changed the reward to trigger on the signature, because ultimately that is what we want to maximize.
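In bandit terms the fix comes down to which event triggers the reward. A hypothetical sketch of that wiring, with made-up hook names and a bandit object assumed to expose the pull/reward counters described above:

```python
# Hypothetical event hooks; `bandit` is assumed to expose the
# record_pull / record_reward counters sketched earlier.

def on_share(bandit, petition_id: str, image_variant: str) -> None:
    # The chosen image is used when the share is created: that is the pull.
    bandit.record_pull(petition_id, image_variant)


def on_petition_visit(bandit, petition_id: str, image_variant: str) -> None:
    # Rewarding here maximizes clickthrough and favours clickbait images.
    pass


def on_signature(bandit, petition_id: str, image_variant: str) -> None:
    # Rewarding on the signature maximizes what we actually care about.
    bandit.record_reward(petition_id, image_variant)
```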

E-greedy

The simplest decision heuristic available to us is the greedy selection policy: allocate some initial budget to exploration to gather data, then use the best variant found during that exploration phase. This is problematic because it is difficult to know how long the initial exploration should last. If we set it too short, we may end up exploiting a variant that is not actually the best; in those situations we would have wanted more time to explore and learn which variant is best.

To address this issue, we stop thinking of exploration as something that must be done up front and treat it as an ongoing process that keeps questioning our assumption about the best variant. The simplest such decision-making algorithm is epsilon-greedy (E-greedy). With this algorithm we allocate 10% of our time to exploration and 90% to exploitation. Consider the example above. Each time we have to choose a variant, we roll a ten-sided die (labelled 0–9); if it comes up 0, we pick a random variant. Otherwise we divide rewards by pulls for each variant and pick the best one, in this case variant 3.
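A minimal sketch of that decision rule, with the counters represented as simple (pulls, rewards) tuples and a 10% explore rate (illustrative code, not our production implementation):

```python
import random

# stats maps variant name -> (pulls, rewards), e.g. the counters above.
def epsilon_greedy(stats: dict[str, tuple[int, int]], epsilon: float = 0.1) -> str:
    """Explore with probability epsilon, otherwise exploit the best variant."""
    variants = list(stats)
    if random.random() < epsilon:
        # Explore: pick a variant uniformly at random.
        return random.choice(variants)

    # Exploit: pick the variant with the highest observed reward rate.
    # A never-pulled variant is treated as maximally promising so it gets tried.
    def reward_rate(variant: str) -> float:
        pulls, rewards = stats[variant]
        return float("inf") if pulls == 0 else rewards / pulls

    return max(variants, key=reward_rate)


# With the example counters above, this returns variant3 most of the time
# (exploiting 18 / 20 = 0.9) and a uniformly random variant 10% of the time.
print(epsilon_greedy({"variant1": (10, 5), "variant2": (15, 7), "variant3": (20, 18)}))
```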

Following this process converges on the optimal variant over time, which means the best variant will be used more than the others. We have automatic decision making!

Epsilon-greedy has some limitations. If we maintain the ten percent explore rate, we may learn which variant is best but will keep exploring at a constant rate anyway. We could decay (lessen over time) the explore rate, but it is difficult to know what decay schedule would be optimal. Another issue is that exploration is a uniformly random selection: the information we already have about our variants is not used to guide our exploration decisions.

Upper Confidence Bound (UCB)

There is a better (smarter?) algorithm that we use. The details of UCB are outside the scope of this article, but when the algorithm is asked to select a variant, it uses a formula that tries to pick the variant most likely to be the best. It might choose a lower-performing variant if its confidence in the information it has about that variant is low. Another advantage of this algorithm is that it is deterministic: it does not require any dice to run, which makes testing easier. The best explanation I have seen of UCB is by David Silver in his excellent course on Reinforcement Learning. UCB converges on the optimal variant of a test and explores suboptimal variants only rarely, so the explore decay that epsilon-greedy would need us to tune is taken care of automatically. At Change we started with epsilon-greedy bandits but now run all of our bandits using UCB.
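For reference, a textbook UCB1 selection rule looks roughly like this under the same (pulls, rewards) representation (again an illustrative sketch, not our production service):

```python
import math

# stats maps variant name -> (pulls, rewards), same counters as above.
def ucb1(stats: dict[str, tuple[int, int]]) -> str:
    """Pick the variant with the highest upper confidence bound (UCB1)."""
    # Any variant that has never been pulled is tried first.
    for variant, (pulls, _) in stats.items():
        if pulls == 0:
            return variant

    total_pulls = sum(pulls for pulls, _ in stats.values())

    def upper_bound(variant: str) -> float:
        pulls, rewards = stats[variant]
        mean = rewards / pulls
        # The exploration bonus shrinks as a variant accumulates pulls, so
        # uncertain variants are picked occasionally, confident ones rarely.
        return mean + math.sqrt(2 * math.log(total_pulls) / pulls)

    return max(stats, key=upper_bound)


# Deterministic: the same counters always produce the same choice (variant3 here).
print(ucb1({"variant1": (10, 5), "variant2": (15, 7), "variant3": (20, 18)}))
```

Because the exploration bonus is computed from the counters themselves, there is no explore rate to tune.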

Maximizing Social Change

At Change we use our internally developed bandit service to choose the best copy for each petition at scale across many parts of the product. This scale of optimization has not only made viral events more viral but has also realized gains from the long tail of petitions that get less attention than our most viral ones. This has led to more sharing, more signing, and eventually to more petitions becoming successful. Multi-armed bandit optimization is a very general problem formulation with many applications. If you would like to learn more about bandits, and more generally about reinforcement learning, check out the free textbook Reinforcement Learning: An Introduction.
