Reinforcement Learning — Beginners

Venkatesh Chandra · Published in Analytics Vidhya · Dec 25, 2019 · 4 min read
High CTR optimizes the PPC cost and traffic

A model that starts without any data

You may think that all machine learning models can be classified as supervised, semi-supervised, or unsupervised. However, this is not true: some algorithms do not fall into any of these categories. One of them is Reinforcement Learning.

A reinforcement learning algorithm is trained using rewards and punishments as inputs to maximize the total benefit of the business. Learning from mistakes is the perfect phrase to capture the essence of this algorithm.

In this project, we will learn about two commonly used algorithms in reinforcement learning: Upper Confidence Bound (UCB) and Thompson Sampling.

Applications of Reinforcement Learning

Reinforcement learning is used widely in digital marketing to drive traffic to webpages and/or to sell products online. Common use cases include creating recommendations on user accounts, optimizing ad display to maximize CTR (click-through rate), predicting customer behavior, and selecting the best content for an ad.

Problem Statement

In our problem, a fictitious automobile company, Vesla, which sells electric cars, wants to show ads to users who browse its website. The manager of the company’s digital marketing team has received 10 great ad images of the newly launched car, the Fybertruck, from the content team. She is unsure which ad image should be used on the website. As the pressure to drive more traffic to the checkout page is high, she prefers reinforcement learning over A/B testing.

Dataset

The dataset was downloaded from Kaggle (dataset name: Ads_CTR_Optimisation). Each of the 10,000 data points records which of the 10 ads a user clicked. For example, User 1 in the dataset clicked on Ads #1, #5, and #9.
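If you want to follow along, here is a minimal sketch of loading and inspecting the data with pandas. The file name Ads_CTR_Optimisation.csv is an assumption; the article only gives the Kaggle dataset name.

```python
import pandas as pd

# Assumed file name; the article only names the Kaggle dataset "Ads_CTR_Optimisation"
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

# Expected layout: 10,000 rows (users) x 10 columns (ads),
# where a 1 means that user would click that ad and a 0 means they would not
print(dataset.shape)
print(dataset.head())
```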

Upper Confidence Bound

UCB is a deterministic algorithm. We start by assuming that all 10 ads generate the same benefit (number of clicks in this case) at the start. The initial few rounds are exploratory: we show users any of the 10 ads and start collecting information on which ads generate clicks. For example, if a user clicks on Ad #5 and Ad #7, these two ads are rewarded and their upper confidence bounds rise (meaning there is a better chance they will be shown, and clicked, the next time). If the next user does not click on one of these ads, the confidence for Ad #5 and Ad #7 goes down. We continue this way for 10,000 iterations, always showing the ad with the maximum upper confidence bound.
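To make this concrete, below is a minimal sketch of how UCB could be run against the simulated clicks, using a standard UCB1-style confidence bound. The variable names, the exact form of the bound, and the file name are my own assumptions, not the article's code.

```python
import math
import pandas as pd

# Assumed file name based on the Kaggle dataset mentioned above
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

N = 10000   # number of simulated users (rounds)
d = 10      # number of ads

numbers_of_selections = [0] * d   # how many times each ad has been shown
sums_of_rewards = [0] * d         # total clicks collected by each ad
ads_selected = []
total_reward = 0

for n in range(N):
    best_ad = 0
    max_upper_bound = 0
    for i in range(d):
        if numbers_of_selections[i] > 0:
            average_reward = sums_of_rewards[i] / numbers_of_selections[i]
            # The confidence interval shrinks as an ad is shown more often
            delta_i = math.sqrt(3 / 2 * math.log(n + 1) / numbers_of_selections[i])
            upper_bound = average_reward + delta_i
        else:
            # Force every ad to be shown at least once in the early rounds
            upper_bound = float('inf')
        if upper_bound > max_upper_bound:
            max_upper_bound = upper_bound
            best_ad = i
    ads_selected.append(best_ad)
    numbers_of_selections[best_ad] += 1
    reward = dataset.values[n, best_ad]   # ground-truth click for this user
    sums_of_rewards[best_ad] += reward
    total_reward += reward

print('Total clicks collected by UCB:', total_reward)
```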

Thompson Sampling

Thompson Sampling is a probabilistic algorithm. Each ad has its own distribution of rewards (clicks). In this algorithm, we try to learn the reward distribution of each ad and then maximize the total reward. We start by assuming a certain type of distribution for each ad and, in every round, sample a random point from each distribution; the ad with the highest sampled value is the one we show. Next, we check whether the ad was rewarding, adjust its total reward, and in turn adjust its distribution, repeating this in every iteration. After 10,000 iterations, the law of large numbers comes into play and the total reward converges to the expected return of the best ad.
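Below is a minimal sketch of Thompson Sampling with a Beta prior on each ad's click probability, which is the usual choice for binary rewards. As before, the variable names and file name are my own assumptions rather than the article's code.

```python
import random
import pandas as pd

# Assumed file name based on the Kaggle dataset mentioned above
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

N = 10000   # number of simulated users (rounds)
d = 10      # number of ads

numbers_of_rewards_1 = [0] * d   # times each ad was clicked when shown
numbers_of_rewards_0 = [0] * d   # times each ad was ignored when shown
ads_selected = []
total_reward = 0

for n in range(N):
    best_ad = 0
    max_draw = 0
    for i in range(d):
        # Draw from each ad's Beta posterior; the highest draw wins this round
        draw = random.betavariate(numbers_of_rewards_1[i] + 1,
                                  numbers_of_rewards_0[i] + 1)
        if draw > max_draw:
            max_draw = draw
            best_ad = i
    ads_selected.append(best_ad)
    reward = dataset.values[n, best_ad]   # ground-truth click for this user
    if reward == 1:
        numbers_of_rewards_1[best_ad] += 1
    else:
        numbers_of_rewards_0[best_ad] += 1
    total_reward += reward

print('Total clicks collected by Thompson Sampling:', total_reward)
```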

The intuition behind the code

The dataset should be treated as the ground truth: it tells us, for each of the 10,000 simulated users, which of the 10 ads they would click. In each iteration we show one ad and match the choice against the ground truth to see whether it would have been clicked. The ad with the highest total reward after 10,000 iterations is the best one. Hence, we do not need to wait until 10,000 real users have interacted with the website to find the best ad; we can simulate the clicks on the go, as sketched below.
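The snippet below illustrates this simulation loop with a naive baseline that picks an ad at random for every user and sums the ground-truth clicks; UCB and Thompson Sampling simply replace that random choice with a smarter one. This baseline is my own illustration, not part of the article's code.

```python
import random
import pandas as pd

# Assumed file name based on the Kaggle dataset mentioned above
dataset = pd.read_csv('Ads_CTR_Optimisation.csv')

N = 10000   # number of simulated users (rounds)
d = 10      # number of ads
total_reward = 0

for n in range(N):
    ad = random.randrange(d)               # show one of the 10 ads at random
    total_reward += dataset.values[n, ad]  # ground-truth click from the dataset

print('Total clicks collected by random selection:', total_reward)
```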

Results

Following are the results obtained from the two algorithms used for the problem:

Results from UCB and Thompson Sampling

A/B testing results (obtained by taking the sum of the clicks after 10,000 iterations):

Results from A/B Testing

Thompson Sampling seems to be the best method to increase user engagement. A/B testing seems to be quite expensive.

Ad #4, which shows the Fybertruck, drives the highest traffic to the checkout page.

It has also been found that Thompson Sampling has stronger empirical evidence behind it. While UCB needs its results updated after every round, Thompson Sampling can be run in batches, which earns it an extra point for computational efficiency.

Codes

References

Machine Learning A-Z — SuperDataScience
