Contextual Bandits in reinforcement learning explained with example and codes

Starting deep reinforcement learning using TensorFlow

Mehul Gupta
Data Science in your pocket
5 min read · Feb 26, 2023


So, continuing my reinforcement learning blog series, which so far includes:

Reinforcement Learning basics

Formulating Multi-Armed Bandits (MABs)

Monte Carlo with example

Temporal Difference learning with SARSA and Q Learning

Game dev using reinforcement learning and pygame

My debut book “LangChain in your Pocket” is out now

In this post, I will be discussing Contextual Bandits and their implementation using TensorFlow for a dummy use case.

Before starting, we need to know what Multi-Armed Bandits are. Though I have already explained this a while back, it’s time for a revisit.

Multi-Armed Bandits provide a solution for a stateless environment where we get an immediate reward after taking an action (hence there is no sequence of actions leading to a terminal state). We average out the rewards gained for a particular action by running a simulation n times, where n is very large. So if you ran the simulation for 1,000 iterations, out of which you chose an action ‘A’ 200 times (how? an epsilon-greedy approach can be used), the estimated reward for action ‘A’ would be the average of those 200 reward values. Once training is done, we have an estimated reward for every action, and hence we know which action yields the best results !!
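To make the averaging idea concrete, here is a minimal sketch of an epsilon-greedy MAB in plain NumPy. The true_reward function and all the constants in it are purely illustrative and not part of the use case we build later.

import numpy as np

n_actions = 4
epsilon = 0.1                       # exploration probability
q = np.zeros(n_actions)             # estimated reward per action
counts = np.zeros(n_actions)        # how often each action was taken

def true_reward(action):
    # illustrative environment: noisy rewards, action 2 is the best on average
    return np.random.normal(loc=[0.2, 0.5, 0.8, 0.4][action], scale=0.1)

for _ in range(1000):
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)          # explore
    else:
        action = int(np.argmax(q))                     # exploit
    r = true_reward(action)
    counts[action] += 1
    q[action] += (r - q[action]) / counts[action]      # running average of rewards

print(q)   # q[2] should end up the largest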

Let’s complicate a few things

Assume you wish to show advertisements to a user on a particular website. Now, thinking out loud, you might not show every user the same set of ads. Right? For example, a user searching for shoes should see ads related to footwear or clothing, while a person searching for noodles should see ads around edibles and snacks. In other words, we now have an added layer of information about the user: his/her intentions, or the context of the visit to the website. We wish to use this information to show more relatable ads to that particular user. Right?

But a MAB can’t incorporate context; it considers only actions while estimating the reward. So how do we add context to these MABs?

Contextual Bandits

Contextual bandits help us add context before taking an action, hence making the whole system more personalized. How?

By introducing the concept of a State, which can be taken as an alias for context. In the case of MABs, we declare a 1-D array where each element represents an action and the value is its estimated reward. In the case of Contextual Bandits, we have an NxM matrix where N = total contexts we can have and M = unique actions that can be taken; the rest of the process remains the same as for MABs. Given the context, we choose an action and update its estimated reward in this NxM matrix, just as we do in the 1-D reward array for MABs. Easy !!

MAB vs Contextual Bandit

If you look at the above image, the 1st diagram represents a MAB, where we have accumulated estimated rewards irrespective of the state. In contrast, the 2nd diagram represents Contextual Bandits, where we have states/contexts (Shoes, Medicine, Chips, Diapers) and the respective estimated rewards per action. Hope the difference is clear !!
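Before moving to the neural-network version, here is a minimal tabular sketch of the NxM idea described above. The environment (true_reward) and the 4x3 sizes are stand-ins I picked for illustration, not part of the use case below.

import numpy as np

n_contexts, n_actions = 4, 3
epsilon = 0.1
q = np.zeros((n_contexts, n_actions))        # the NxM estimated-reward matrix
counts = np.zeros((n_contexts, n_actions))

def true_reward(context, action):
    # stand-in environment: the best action differs per context
    return 1.0 if action == context % n_actions else np.random.rand() * 0.5

for _ in range(5000):
    context = np.random.randint(n_contexts)          # the context is observed, not chosen
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)        # explore
    else:
        action = int(np.argmax(q[context]))          # exploit within this context
    r = true_reward(context, action)
    counts[context, action] += 1
    q[context, action] += (r - q[context, action]) / counts[context, action]

print(q.argmax(axis=1))   # best action per context, e.g. [0 1 2 0]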

Let’s code out a Contextual Bandit using a Neural Network. The flow would be as below:

Define states/context and possible action space

Define reward function

Train a Neural Network which takes One-Hot encoded states/contexts as input and outputs an estimated reward for each possible action

Finally, depending on the policy chosen (Greedy or epsilon-Greedy), choose an action per context

1. Import the required libraries

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import copy
from tensorflow.keras.callbacks import EarlyStopping  # optional; not used in this example

# run tf.functions eagerly (optional; makes debugging easier)
tf.config.run_functions_eagerly(True)

2. Function to convert states (integers) into One-Hot encodings to feed to the Neural Network

# states = random states generated for training
# total_states = count of possible states
def ohe_generator(states, total_states):
    ohe = np.zeros((len(states), total_states))
    for index, array in enumerate(ohe):
        ohe[index][states[index]] = 1
    return ohe
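As a quick sanity check on the encoder (assuming 5 possible states just for the sake of the example):

print(ohe_generator(np.array([0, 3]), 5))
# [[1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0.]]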

3. Define the Contextual Bandit class, which includes all the necessary utilities

class contextual_bandits:
    def __init__(self, states, actions):
        # total number of possible states (contexts) and actions
        self.states = states
        self.actions = actions

    def reward(self, state, action):
        # dummy reward function: returns a noisy value roughly in the 0-1 range,
        # with the pattern depending on the (state, action) pair
        if (state * action) % 2 == 1:
            return 0.5 + 0.05 * ((state + action) % 10) + np.random.rand() * 0.1
        else:
            return 0.9 - 0.1 * ((state + action) % 10) + np.random.rand() * 0.1

    def network(self):
        # shallow network: One-Hot encoded state in, one estimated reward per action out
        input_ = Input(shape=(self.states,))
        dense1 = Dense(128, activation='relu')(input_)
        dropout1 = Dropout(0.1)(dense1)
        dense2 = Dense(64, activation='relu')(dropout1)
        dropout2 = Dropout(0.1)(dense2)
        dense3 = Dense(self.actions, activation='sigmoid')(dropout2)
        model = Model(input_, dense3)

        opt = Adam(learning_rate=0.0001, beta_1=0.9, beta_2=0.999)
        model.compile(loss="mean_absolute_error", optimizer=opt, metrics=["mean_absolute_error"])
        return model

This requires some explanation

  • __init__(): Initializes basic variables like the total number of states & actions
  • reward(): Defines the dummy reward function
  • network(): Defines a shallow Neural Network which takes an OHE state as input and outputs an estimated reward per action (a quick sanity check follows below)
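To see the pieces fit together before training, you can instantiate the class and poke at it. Nothing here is trained yet; it just confirms the layer sizes and the rough reward scale (the 100 states and 4 actions match the training setup below).

cb = contextual_bandits(states=100, actions=4)
model = cb.network()
model.summary()               # (None, 100) -> 128 -> 64 -> (None, 4)

# rewards land roughly in the 0-1 range, which is why the output layer uses a sigmoid
print(cb.reward(0, 0), cb.reward(93, 3))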

4. Training

batch_size = 128
states = 100
actions = 4

def training():
    cb = contextual_bandits(states, actions)
    model = cb.network()
    # sample random states/contexts for training and One-Hot encode them
    sample_states = np.random.choice(range(states), size=batch_size*100)
    state_ohe = ohe_generator(sample_states, states)
    # ground truth: the reward of every action, for each sampled state
    actual_reward = [[cb.reward(x, y) for y in range(cb.actions)] for x in sample_states]
    actual_reward_matrix = np.zeros((len(state_ohe), cb.actions))
    for index, x in enumerate(actual_reward):
        actual_reward_matrix[index] = np.array(x)
    model.fit(state_ohe, actual_reward_matrix, batch_size=batch_size, epochs=20)
    return model

Let’s understand the training function

  • First of all, we create the contextual_bandits object and the neural network
  • Then we generate some random states & One-Hot encode them. This becomes the input to our Neural Network
  • Then, for each state, we calculate the reward of every action using the reward function. This becomes our ground truth
  • Finally, we train the network on this dataset
By the end of the 20th epoch, the training loss has reduced significantly (it started at around 0.25).
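Calling the function runs the 20 training epochs and hands back the fitted model, which we keep around for the checks below:

# train the network and keep the returned model for evaluation
model = training()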

Now, let’s check how good this neural network is at recognizing the best action, following a greedy policy.

# One-Hot encode all 100 states and predict the estimated reward per action
state_ohe = ohe_generator(np.array([x for x in range(100)]), states)
estimated_reward = model.predict(state_ohe)

# greedy policy: for each state, pick the action with the highest estimated reward
print({x: np.argmax(y) for x, y in enumerate(estimated_reward)})

The printed dictionary gives us the state → best action pairs (here key = state, value = best action).

Let’s cross-check for a few states

cb = contextual_bandits(100, 4)

print('\nreward for state {}\n'.format(0))
for x in range(4):
    print(cb.reward(0, x))

print('\nreward for state {}\n'.format(93))
for x in range(4):
    print(cb.reward(93, x))

Rewards for each action, given the state.

As you can see, the best action for state 0 is 0 and for state 93 it is 3, which is in sync with the reward function outputs as well.
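Step 4 of the flow also mentioned epsilon-greedy as an alternative to the purely greedy policy used above. Here is a minimal sketch of what that could look like at serving time; the helper name choose_action and the exploration rate of 0.1 are my own choices, not part of the training code above.

def choose_action(model, state, total_states=100, n_actions=4, epsilon=0.1):
    # with probability epsilon, explore a random action
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    # otherwise exploit: pick the action with the highest estimated reward
    state_ohe = ohe_generator(np.array([state]), total_states)
    return int(np.argmax(model.predict(state_ohe, verbose=0)))

print(choose_action(model, state=93))   # usually 3, occasionally a random exploratory action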

With this, it’s a wrap
