Contextual Bandits in reinforcement learning explained with example and codes

Starting deep reinforcement learning using TensorFlow

Mehul Gupta
Data Science in your pocket
5 min read · Feb 26, 2023


So, continuing my reinforcement learning blog series, which so far includes:

Reinforcement Learning basics

Formulating Multi-Armed Bandits (MABs)

Monte Carlo with example

Temporal Difference learning with SARSA and Q Learning

Game dev using reinforcement learning and pygame

My debut book “LangChain in your Pocket” is out now

In this post, I will be discussing Contextual Bandits and their implementation using TensorFlow for a dummy use case.

Before starting, we need to know what Multi-Armed Bandits are. Though I have already explained this a while back, it’s time for a revisit.

Multi-Armed Bandits provide a solution for a stateless environment where we get an immediate reward after taking an action (hence there is no sequence of actions leading to a terminal state). We average out the rewards gained for a particular action by running a simulation n times, where n is very large. So if you ran the simulation for 1,000 iterations, out of which you chose an action ‘A’ 200 times (how? an epsilon-greedy approach can be used), the estimated reward for action ‘A’ would be the average of those 200 reward values. Once training is done, we have an estimated reward for every action, and hence we know which action yields the best results !!
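To make the averaging idea concrete, here is a minimal sketch of an epsilon-greedy MAB in plain NumPy. The true_reward function and all the constants in it are purely illustrative and not part of the use case we build later.

import numpy as np

n_actions = 4
epsilon = 0.1                       # exploration probability
q = np.zeros(n_actions)             # estimated reward per action
counts = np.zeros(n_actions)        # how often each action was taken

def true_reward(action):
    # illustrative environment: noisy rewards, action 2 is the best on average
    return np.random.normal(loc=[0.2, 0.5, 0.8, 0.4][action], scale=0.1)

for _ in range(1000):
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)          # explore
    else:
        action = int(np.argmax(q))                     # exploit
    r = true_reward(action)
    counts[action] += 1
    q[action] += (r - q[action]) / counts[action]      # running average of rewards

print(q)   # q[2] should end up the largest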

Let’s complicate a few things

Assume you wish to show advertisements to a user on a particular website. Now, thinking out loud, you might not show every user the same set of ads. Right? For example, a user searching for shoes should see ads related to footwear or clothing, while a person searching for noodles should see ads around edibles and snacks. In other words, we now have an added layer of information about the user: his/her intentions, or the context of the visit to the website. We wish to use this information to show more relatable ads to that particular user. Right?

But a MAB can’t incorporate context; it considers only actions while estimating the reward. So how do we add context to these MABs?

Contextual Bandits

Contextual bandits help us add context before taking an action, hence making the whole system more personalized. How?

By introducing the concept of a State, which can be taken as an alias for context. In the case of MABs, we declare a 1-D array where each element represents an action and the value is its estimated reward. In the case of Contextual Bandits, we have an NxM matrix where N = total contexts we can have and M = unique actions that can be taken; the rest of the process remains the same as for MABs. Given the context, we choose an action and update its estimated reward in this NxM matrix, just as we do in the 1-D reward array for MABs. Easy !!

MAB vs Contextual Bandit

If you look at the above image, the 1st diagram represents a MAB, where we have accumulated estimated rewards irrespective of the state. In contrast, the 2nd diagram represents Contextual Bandits, where we have states/contexts (Shoes, Medicine, Chips, Diapers) and the respective estimated rewards per action. Hope the difference is clear !!
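Before moving to the neural-network version, here is a minimal tabular sketch of the NxM idea described above. The environment (true_reward) and the 4x3 sizes are stand-ins I picked for illustration, not part of the use case below.

import numpy as np

n_contexts, n_actions = 4, 3
epsilon = 0.1
q = np.zeros((n_contexts, n_actions))        # the NxM estimated-reward matrix
counts = np.zeros((n_contexts, n_actions))

def true_reward(context, action):
    # stand-in environment: the best action differs per context
    return 1.0 if action == context % n_actions else np.random.rand() * 0.5

for _ in range(5000):
    context = np.random.randint(n_contexts)          # the context is observed, not chosen
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)        # explore
    else:
        action = int(np.argmax(q[context]))          # exploit within this context
    r = true_reward(context, action)
    counts[context, action] += 1
    q[context, action] += (r - q[context, action]) / counts[context, action]

print(q.argmax(axis=1))   # best action per context, e.g. [0 1 2 0]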

Let’s code out a Contextual Bandit using a Neural Network. The flow would be as below:

Define states/context and possible action space

Define reward function

Train a Neural Network which takes One-Hot encoded states/contexts as input and outputs an estimated reward for each possible action

Finally, depending on the policy chosen (Greedy or epsilon-Greedy), choose an action per context

1. Import the required libraries

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import copy
from tensorflow.keras.callbacks import EarlyStopping  # optional; not used in this example

# run tf.functions eagerly (optional; makes debugging easier)
tf.config.run_functions_eagerly(True)

2. Function to convert states (integers) into One-Hot encodings to feed to the Neural Network

# states = random states generated for training
# total_states = count of possible states
def ohe_generator(states, total_states):
    ohe = np.zeros((len(states), total_states))
    for index, array in enumerate(ohe):
        ohe[index][states[index]] = 1
    return ohe
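As a quick sanity check on the encoder (assuming 5 possible states just for the sake of the example):

print(ohe_generator(np.array([0, 3]), 5))
# [[1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]]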

3. Define the Contextual Bandit class, which includes all the necessary utilities

class contextual_bandits:
    def __init__(self, states, actions):
        # total number of possible states (contexts) and actions
        self.states = states
        self.actions = actions

    def reward(self, state, action):
        # dummy reward function: returns a noisy value roughly in the 0-1 range,
        # with the pattern depending on the (state, action) pair
        if (state * action) % 2 == 1:
            return 0.5 + 0.05 * ((state + action) % 10) + np.random.rand() * 0.1
        else:
            return 0.9 - 0.1 * ((state + action) % 10) + np.random.rand() * 0.1

    def network(self):
        # shallow network: One-Hot encoded state in, one estimated reward per action out
        input_ = Input(shape=(self.states,))
        dense1 = Dense(128, activation='relu')(input_)
        dropout1 = Dropout(0.1)(dense1)
        dense2 = Dense(64, activation='relu')(dropout1)
        dropout2 = Dropout(0.1)(dense2)
        dense3 = Dense(self.actions, activation='sigmoid')(dropout2)
        model = Model(input_, dense3)

        opt = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999)
        model.compile(loss="mean_absolute_error", optimizer=opt, metrics=["mean_absolute_error"])
        return model

This requires some explanation

  • __init__(): Initializes basic variables like the total number of states & actions
  • reward(): Defines the dummy reward function
  • network(): Defines a shallow Neural Network which takes an OHE state as input and outputs an estimated reward per action (a quick sanity check follows below)
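To see the pieces fit together before training, you can instantiate the class and poke at it. Nothing here is trained yet; it just confirms the layer sizes and the rough reward scale (the 100 states and 4 actions match the training setup below).

cb = contextual_bandits(states=100, actions=4)
model = cb.network()
model.summary()               # (None, 100) -> 128 -> 64 -> (None, 4)

# rewards land roughly in the 0-1 range, which is why the output layer uses a sigmoid
print(cb.reward(0, 0), cb.reward(93, 3))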

4. Training

batch_size = 128
states = 100
actions = 4

def training():
    cb = contextual_bandits(states, actions)
    model = cb.network()
    # sample random states/contexts for training and One-Hot encode them
    sample_states = np.random.choice(range(states), size=batch_size*100)
    state_ohe = ohe_generator(sample_states, states)
    # ground truth: the reward of every action, for each sampled state
    actual_reward = [[cb.reward(x, y) for y in range(cb.actions)] for x in sample_states]
    actual_reward_matrix = np.zeros((len(state_ohe), cb.actions))
    for index, x in enumerate(actual_reward):
        actual_reward_matrix[index] = np.array(x)
    model.fit(state_ohe, actual_reward_matrix, batch_size=batch_size, epochs=20)
    return model

Let’s understand the training function

  • First of all, we create the contextual_bandits object and the neural network
  • Then we generate some random states & One-Hot encode them. This becomes the input to our Neural Network
  • Then, for each state, we calculate the reward of every action using the reward function. This becomes our ground truth
  • Finally, we train the network on this dataset
By the end of the 20th epoch, the training loss has reduced significantly (it started at around 0.25).
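Calling the function runs the 20 training epochs and hands back the fitted model, which we keep around for the checks below:

# train the network and keep the returned model for evaluation
model = training()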

Now, let’s check how good this neural network is at recognizing the best action, following a greedy policy.

# One-Hot encode all 100 states and predict the estimated reward per action
state_ohe = ohe_generator(np.array([x for x in range(100)]), states)
estimated_reward = model.predict(state_ohe)

# greedy policy: for each state, pick the action with the highest estimated reward
print({x: np.argmax(y) for x, y in enumerate(estimated_reward)})

The printed dictionary gives us the state → best action pairs (here key = state, value = best action).

Let’s cross-check for a few states

cb = contextual_bandits(100, 4)

print('\nreward for state {}\n'.format(0))
for x in range(4):
    print(cb.reward(0, x))

print('\nreward for state {}\n'.format(93))
for x in range(4):
    print(cb.reward(93, x))

Rewards for each action, given the state.

As you can see, the best action for state 0 is 0 and for state 93 it is 3, which is in sync with the reward function outputs as well.
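Step 4 of the flow also mentioned epsilon-greedy as an alternative to the purely greedy policy used above. Here is a minimal sketch of what that could look like at serving time; the helper name choose_action and the exploration rate of 0.1 are my own choices, not part of the training code above.

def choose_action(model, state, total_states=100, n_actions=4, epsilon=0.1):
    # with probability epsilon, explore a random action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # otherwise exploit: pick the action with the highest estimated reward
    state_ohe = ohe_generator(np.array([state]), total_states)
    return int(np.argmax(model.predict(state_ohe, verbose=0)))

print(choose_action(model, state=93))   # usually 3, occasionally a random exploratory action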

With this, it’s a wrap
