Reinforcement Learning with Python

Pratik Randad
Analytics Vidhya
Published in
5 min readMar 11, 2020

--

Update: This article is part of a series,check out the previous part here , to learn the basics of reinforcement learning.

To apply reinforcement learning ,different algorithms are present like :

  1. Upper Confidence Bound
  2. Thompson’s Sampling

Here we will implement the Upper Confidence Bound Algorithm using Python

Upper Confidence Bound Intuition:

Imagine you are in a casino with some money, you want to win more by playing it on a slot machine.But there are 5 different machines and each yields money a different chance unknown to us. How can this be solved?
Using Upper Confidence Bound (UCB) algorithm we have a good chance to find out the best machine.

Analyzing the problem

Each machine would have a different distribution of output.

These output distributions are unknown to us. Lets us plot the assumed average output of each machine on a single axis as we dont know the output we assume all machines have the same:

Actual vs assumed output

Here the dotted red line indicates the assumed output and the other lines the actual output.
What UCB does is that for each machine it will form a confidence band with high certainty that it will include the actual expected return as shown.

Confidence band

For the first rounds we will try each machine so that the confidence band can be formed.
Now we randomly a machine is chosen , lets say D3 is chosen. It returned no money for that round, so one value is added to its average and it decreases the level and hence confidence band becomes more stronger to cover the actual output as shown:

D3 chosen
D3 assumed value goes down
Confidence band reduces

This confidence band is going to converge to the actual value in the long run.
Now randomly machine D4 is chosen and same process is repeated.

D4 chosen

In the same manner for the first few rounds all machines are chosen to get an initial idea.
Now for next iteration the machine with the highest upper bound level is chosen for next iteration. Hence the name Upper confidence bound.

This process goes on and for each iteration only the machine with highest UCB level is chosen and it is exploited. This creates a balance b/w exploration vs exploitation.

Final result

This is the result in the long run we found and exploited the D5 machine as it was the best one.

Implementation in Python

Here we will create a generalized solution so that it can used with any no. of slot machines,ads,etc and can give output for best machine using UCB.

Problem Definition :

We have 10 different ads to be displayed to the user. For each add the user will click on it or let it go.Depending on this interaction we need to find out the best ads that can be used for marketing to boost sales.
Here is the glimpse of the data-set used :

This data-set shows the user reaction if each ad is shown to different users.
Example : for user with id=0 out of the 10 ads the user will click on ads 1,5,9.
We need to find a way that for the user who logs in next onto website the most optimum ad must be shown using UCB.

Steps :

UCB

Importing the required libraries and dataset

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv(‘Ads_CTR_Optimisation.csv’)

Step 1

#d is the no. of ads
d=10
# N is no. of times the ads is shown to different users
N=10000
# Ni(n)and Ri(n)vectors
number_of_selections=[0] * d
sum_of_rewards=[0] * d
#array of ads selected for each round
ads_selected=[]
#total reward generated
total_reward=0

Step 2 and 3

Now for 1st 10 rounds each ad will be selected so that some perception is created for creating confidence bands.Then for each next round the ads with the highest upper bound is selected and its reward is added to total_reward.
This algorithm will obtain the most optimum ads as the loops lends.
Here just by changing N and d it can be used for any no. of ads,slot machines ,etc.

import math
for n in range(0,N):
max_upper_bound=0
ad=0
for i in range(0,d):
if(number_of_selections[i]>0):
average_reward=sum_of_rewards[i]/ number_of_selections[i]
confidence_interval=math.sqrt((3/2 * math.log(n) / number_of_selections[i]))
upper_bound=confidence_interval + average_reward
else:
upper_bound=10e400
if(upper_bound>max_upper_bound):
max_upper_bound=upper_bound
ad=i
ads_selected.append(ad)
number_of_selections[ad]+=1
reward=dataset.values[n,ad]
sum_of_rewards[ad]+=reward
total_reward+=reward

1st 10 ads selected vs last 10 ads

UCB selected as the most optimum ad.
Ads selected histogram shows the most optimum ad.

Final Notes

This is my first article series and I hope you learned something! If there is anything that you guys would like to add to this article, feel free to leave a message and don’t hesitate! Any sort of feedback is truly appreciated. Don’t be afraid to share this! Thanks!

--

--