Target Customers, smartly!

An introduction to the Epsilon Greedy algorithm and Thompson Sampling

徐毅軒 · Published in Coinmonks · Aug 11, 2018

At the Taiwanese data professionals' July meetup, Gary Chen of Amex gave an excellent introduction to survivorship bias and to the algorithms that attack the challenge: the Epsilon Greedy algorithm and Thompson Sampling. In this article, I’m going to explain how they work from scratch. Some of the code is modified from the Bayesian Machine Learning class on Udemy.com, a thorough yet concise course that I highly recommend.

Alright, here we go!

Problem Definition

Imagine your company is running email marketing campaigns. Naturally, some customers are more likely to open the emails and some are not. But before running any campaign, you know nothing about your customers. The only way to learn how likely a customer is to open an email is to collect data: send them a whole bunch of emails and observe the response rate. The question is, how many emails do we have to send to every customer to establish statistical significance?

If we send 100 emails to each customer, we can probably get some idea of who our optimal targets are, but it is very inefficient! As the campaign goes on, we want to reach as many customers as possible. Meanwhile, we also want to target the customers with the highest response rates, since those optimal customers give the best reward. It is an explore-exploit tradeoff.

The exploration-exploitation trade-off is a dilemma we frequently face when choosing between options. Should you choose what you know and get something close to what you expect (‘exploit’), or choose something you aren’t sure about and possibly learn more (‘explore’)?

The challenge exists in many different contexts, such as marketing campaigns, website A/B testing, or gambling strategies. The core of the challenge is essentially the same.

Epsilon Greedy

To solve the explore-exploit problem, the epsilon greedy algorithm assigns a probability ε with which we randomly explore more customers. The rest of the time, we exploit the customer with the highest probability p based on what we’ve seen so far. The pseudo code can be written as follows:

while True:
    r = random.random()
    if r < eps:
        # explore
        # show ads to a random customer
        # update p
    else:
        # exploit
        # show ads to the customer with the highest p
        # update p

To set up the experiments, I wrote a Python class to store a customer’s intrinsic probability and the past record of reactions. Note that the intrinsic probability is assumed to follow a Gaussian distribution characterized by p_mean and p_sigma. Therefore, instead of a constant probability, each time we reach out to a customer, the probability of opening a link is a random number drawn from a Gaussian.

import numpy as np

class Target(object):
    def __init__(self, p_mean, p_sigma):
        self.p_mean = p_mean    # mean of the intrinsic open probability
        self.p_sigma = p_sigma  # spread of the intrinsic open probability
        self.a = 1              # successes + 1 (Beta parameter)
        self.b = 1              # failures + 1 (Beta parameter)
        self.n = 0              # number of times this target was tested

    def trigger(self):
        # Simulate one email: opened (True) or ignored (False).
        return np.random.random() < np.random.normal(loc=self.p_mean, scale=self.p_sigma)

    def sample(self):
        # Draw one value from the Beta(a, b) posterior (used by Thompson Sampling).
        return np.random.beta(self.a, self.b)

    def prob(self):
        # Empirical open probability observed so far.
        return self.a / (self.a + self.b)

    def update(self, x):
        # Record one outcome: x = 1 (opened) or 0 (ignored).
        self.a += x
        self.b += 1 - x
        self.n += 1

The class constructor __init__ takes in two parameters, p_mean and p_sigma. We also initialize self.a = 1 and self.b = 1, which track the total successes and total failures respectively; a and b are set to 1 at the beginning for convenience in the later experiments (a uniform Beta(1, 1) prior). The class method trigger() returns the test result by comparing a uniform random number to a sample drawn from the Gaussian distribution. The sample() method draws one value from the Beta(a, b) distribution, which we will need for Thompson Sampling. The prob() method returns the empirical probability based on the previous tests. Finally, the update() method records the testing history of a target.
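Before moving on, here is a quick sanity check of the class (a minimal sketch of my own; the 0.7 open rate is just an illustrative number):

target = Target(p_mean=0.7, p_sigma=0.1)

for _ in range(100):
    x = target.trigger()    # simulate one email: opened or ignored?
    target.update(x)        # record the outcome

print(target.prob())        # empirical open rate, roughly 0.7
print(target.sample())      # one draw from the Beta(a, b) posterior

With the class in place, we can run the epsilon greedy experiment: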

def experiment():
    # Create target objects for epsilon greedy
    Targets = [Target(p, s) for p, s in zip(p_means, p_sigmas)]
    n_pulling = [0, 0, 0]       # how many times each target gets chosen
    regret = 0
    reward = 0
    cumulative_regret = []
    cumulative_reward = []

    for i in range(num_trials):
        best_target = None
        max_prob = -1
        # with probability epsilon, we explore more data
        r = np.random.random()
        if r < epsilon:
            # explore: show ads to a random customer
            index = np.random.randint(0, len(Targets))
            best_target = Targets[index]
        else:
            # exploit: show ads to the customer with the highest empirical p
            for n, target in enumerate(Targets):
                prob = target.prob()
                if prob > max_prob:
                    max_prob = prob
                    best_target = target
                    index = n

        x = best_target.trigger()
        best_target.update(x)
        n_pulling[index] += 1
        regret += optimal - x
        reward += x
        cumulative_regret.append(regret)
        cumulative_reward.append(reward)

    # plot() visualizes the posteriors and the regret curve; it is defined in the full code on GitHub
    plot(Targets, cumulative_regret, cumulative_reward, num_trials)

num_trials = 1000
p_means = [0.2, 0.5, 0.7]
p_sigmas = [0.1, 0.1, 0.1]
optimal = max(p_means)
epsilon = 0.2
experiment()

This is a piece of Python code that implements the epsilon greedy algorithm. We simplified the experiment to only three options for convenience. We also introduced a new variable, regret, to evaluate the performance. Regret is defined as the loss compared to the optimal choice. In other words, if we always target the customer with the highest probability, then we have no regret. Therefore our objective is to minimize the cumulative regret over time.
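As the code above computes it, regret accumulates optimal − x on every trial: with p_means = [0.2, 0.5, 0.7] the optimal open rate is 0.7, so a trial that ends with an ignored email (x = 0) adds 0.7 to the cumulative regret, while an opened email (x = 1) adds −0.3.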

The left-hand side of the figure shows the probability distribution of p for each option after 1000 trials. The first thing we observe is that the green line is a much narrower distribution than the others, meaning we are pretty sure the green one is the better target; accordingly, we chose the green target 872 times out of 1000 trials.

Meanwhile, due to the random exploration, the other two targets are fairly explored. As a consequence of random exploration, the cumulative regret keeps increasing! One fix is the Upper Confidence Bound (UCB). The UCB algorithm decides when we can confidently stop random exploration given the data we have collected. Alternatively, we can think about a smarter way to explore the customers seamlessly without increasing regret.
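This post doesn’t implement UCB, but as a rough sketch of the idea, the common UCB1 rule picks the target with the highest empirical rate plus an exploration bonus that shrinks the more often that target has been tried. The snippet below reuses the Target class above and is my own illustration, not part of the original experiments:

import math

def ucb1_choice(targets, total_trials):
    # Pick the target with the highest upper confidence bound.
    best, best_score = None, -float("inf")
    for target in targets:
        if target.n == 0:
            return target    # make sure every target is tried at least once
        bonus = math.sqrt(2 * math.log(total_trials) / target.n)
        score = target.prob() + bonus    # empirical rate + exploration bonus
        if score > best_score:
            best, best_score = target, score
    return best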

Thompson Sampling

Bayes’ Theorem

Thompson sampling is an application of Bayes’ theorem. Bayes’ theorem provides a powerful tool to estimate the probability distribution of a parameter. The left-hand side of the equation is the posterior, which is often interpreted as the probability distribution of the parameter given the data. The right-hand side is the likelihood multiplied by the prior, divided by the marginal probability.
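Written out for a parameter θ and observed data D, the theorem reads:

P(θ | D) = P(D | θ) · P(θ) / P(D)

where P(θ | D) is the posterior, P(D | θ) the likelihood, P(θ) the prior, and P(D) the marginal probability of the data.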

In our case, the likelihood function is the Bernoulli distribution. The Bernoulli distribution has two possible outcomes (0, 1) and is parameterized by p. That is to say, each time we reach out to a customer, he/she has intrinsic probability p of opening the link (outcome = 1) and probability 1 − p of ignoring it (outcome = 0).
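In symbols, the likelihood of a single outcome x ∈ {0, 1} is P(x | p) = p^x · (1 − p)^(1 − x), so an open contributes a factor of p and an ignore contributes a factor of 1 − p.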

For the prior distribution, we use the Beta distribution. In Bayesian statistics, we often choose a conjugate prior for the likelihood function to ensure the posterior distribution belongs to the same family as the prior. As a result, we can reduce the computation.

(See Wikipedia: Conjugate prior.)

It sounds sophisticated, but it is actually very simple. The Beta distribution is characterized by two parameters, ⍺ and β. At the beginning, we draw a sample from Beta(⍺=1, β=1) and run a test. If the outcome is a success (x=1), we update the posterior to Beta(⍺’=⍺+1, β’=β); if it is a failure (x=0), to Beta(⍺’=⍺, β’=β+1). Then we run another test using the new Beta as the prior and update the hyper-parameters again. Repeated over and over, the posterior will eventually concentrate around the true value of p. In pseudo code:

⍺1 = β1 = ⍺2 = β2 = ... = ⍺k = βk = 1
while True:
    # draw one sample from each Beta_i(⍺i, βi), i = 1..k
    # show ads to the target whose sample is largest
    # receive response x = 0 or 1
    # update that target's posterior:
    ⍺' = ⍺ + x,  β' = β + (1 - x)
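To see how these updates behave on their own, here is a small standalone sketch (my own illustration; the true open rate of 0.7 is arbitrary):

import numpy as np

true_p = 0.7          # illustrative "intrinsic" open rate
a, b = 1, 1           # Beta(1, 1) prior, i.e. uniform over [0, 1]

for trial in range(1000):
    x = int(np.random.random() < true_p)    # one Bernoulli outcome
    a, b = a + x, b + (1 - x)               # conjugate posterior update

print(a / (a + b))                           # posterior mean, close to 0.7
print(a * b / ((a + b) ** 2 * (a + b + 1)))  # posterior variance, now tiny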

To implement Thompson Sampling, we just need to add a few lines of code to the previous experiment() function:

# Thompson Sampling: replace the exploit selection loop with posterior sampling
for n, target in enumerate(Targets):
    sample = target.sample()       # draw from each target's Beta posterior
    if sample > max_prob:
        max_prob = sample
        best_target = target
        index = n

Notice that the only difference is that we compare samples drawn from the Beta distributions instead of comparing the empirical probabilities directly.

The posterior distributions of p. The solid lines (EG) stand for the epsilon greedy algorithm; the dashed lines (TS) stand for Thompson Sampling. From left to right: the posteriors after 50, 200, and 500 trials.

The result is significantly improved! With Thompson Sampling, the algorithm quickly learns which target is the best and sticks to it, without wasting time randomly exploring more data.

From the cumulative regret plot, we clearly see that the cumulative regret is significantly lower and flattens out after about 300 trials, which means the algorithm has learned to always choose the optimal target.

With Thompson Sampling we benefit from fewer wasted trials and higher rewards.

Conclusion

In the real world, the probability of opening a link is commonly around 3–5% or even lower, and the customer base is much larger. Next week, I am going to run more experiments on the performance of both algorithms with a much larger number of targets and lower probabilities. I will also try other ideas to improve the algorithms, such as setting up a different initial prior or a different epsilon.

The complete Python code can be found on my GitHub. Hopefully this post helps you better understand the epsilon greedy algorithm and Thompson Sampling. If you have any questions, drop me a comment. Thanks!
