Reinforcement Learning: Part 1: Multi-Armed Bandits

Reinforcement learning is inspired by how humans/animals learn through trial and error.

Mehul Jain

Imagine you’re teaching a dog to sit when you say “Sit” and stand when you say “Stand.”

· At first, the dog doesn’t know what to do. So, you start by trying different things.

· You say “Sit” and gently guide the dog into a sitting position.

· If the dog sits, you give it a treat as a reward.

· If the dog stands, you don’t give a treat.

· The dog starts to learn that sitting when you say “Sit” leads to a reward.


The dog (the agent) is learning by exploring different actions in an environment, receiving feedback in the form of rewards, and adjusting its behavior to maximize its expected reward.

Now, imagine you’re in a casino with many slot machines (also called bandits). You pull a machine’s lever and receive some reward.

Each slot machine has a different payout rate, meaning some machines are more likely to give you a win than others.

Now, you’re given a limited number of chances and want to maximize your winnings.

If someone told you which machine gives the maximum reward, you would simply pull that machine’s lever every time and collect the maximum reward.

But since you don’t know the exact payout rate of each machine, you have to figure out which machines are the most rewarding by trial and error.

You have to find the best strategy to decide which slot machine to play in order to maximize your winnings over time. This is the Multi-Armed Bandit problem.
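To make this concrete, here is a minimal sketch of such an environment. The payout rates are made up for illustration; in the real problem, the player does not know them:

import numpy as np

# Hypothetical payout rates for 4 slot machines (made up for illustration;
# the player does not get to see these numbers).
payout_rates = [0.10, 0.25, 0.60, 0.45]
rng = np.random.default_rng(seed=0)

def pull(machine):
    # A pull pays 1 with that machine's payout rate, otherwise 0
    return int(rng.random() < payout_rates[machine])

# Pull each machine a few times to see how noisy individual outcomes are
for machine in range(len(payout_rates)):
    rewards = [pull(machine) for _ in range(10)]
    print(f"Machine {machine + 1}: {rewards}")

Even ten pulls per machine are often not enough to tell the machines apart, which is exactly why we need a systematic strategy.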

What could be an intuitive approach?

At first, you might try a random machine, since you don’t know anything about the environment. If you win, you might want to stick with that machine. If you lose, you can either keep playing the same machine or switch to a different one.

Here, there are two strategies:

· Stick to the machine you initially selected, hoping that it will eventually give you a win. This is called exploitation.

· Keep switching machines after each loss, hoping to reach a more rewarding one. This is called exploration.

Exploration is trying out different options to gather more information about their outcomes. By exploring, the agent can discover better options that it may not have considered before. However, during exploration, there is a risk of obtaining lower immediate rewards.

Exploitation utilizes the learned knowledge to select the option that is believed to be the best based on past experiences. By exploiting, the agent can maximize its immediate rewards by sticking with the option it believes to be the most rewarding. However, this prevents the agent from discovering better options that it has not yet explored. We call this the greedy choice.
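In code, the two choices boil down to something like this minimal sketch (the value estimates Q are placeholder numbers, not learned values):

import numpy as np

Q = np.array([0.2, 0.8, 0.5])  # placeholder value estimates for 3 machines

exploit_action = np.argmax(Q)               # greedy choice: best-known machine
explore_action = np.random.randint(len(Q))  # exploratory choice: any machine at random

print(exploit_action, explore_action)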

How to evaluate your action?

To decide which machine is the best, we need a function that evaluates each action. These values are called action values.

We define the value of selecting an action as the expected reward we receive when taking that action:

q*(a) = E[R_t | A_t = a] = Σ_r p(r | a) · r,  for a = 1, 2, …, K

where K is the number of slot machines, A_t is the action selected at time step t, R_t is the corresponding reward, and p(r | a) is the probability of receiving reward r when action a is taken.

The goal is to maximize the expected reward by choosing the action that maximizes this action-value function.

The main issue is that we don’t know the exact probability distribution or environment dynamics of any machine.

One natural way is to average all the rewards observed so far for a particular arm:

Q_t(a) = (sum of rewards received when a was taken prior to t) / (number of times a was taken prior to t)

This is known as the Sample-Average Method.

You can then update the estimate incrementally, without storing every past reward:

Q_{n+1} = Q_n + (1/n) · (R_n − Q_n)

where Q_n is the estimate after the arm has been chosen n − 1 times and R_n is the n-th reward received from it.
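As a quick sanity check, here is a minimal sketch (the reward values are arbitrary) showing that the incremental update gives the same result as averaging all the rewards directly:

import numpy as np

rewards = [1.0, 0.0, 2.0, 1.5, 0.5]  # arbitrary rewards observed for one arm

Q = 0.0  # running estimate of the arm's value
for n, R in enumerate(rewards, start=1):
    Q += (R - Q) / n  # incremental sample-average update

print(Q)                 # 1.0
print(np.mean(rewards))  # 1.0, identical to the running estimate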

Here, we are selecting greedy actions to get the maximum immediate reward, but we might miss out on better opportunities.

The challenge is to find the right balance between exploring new options and exploiting the currently known best option. This is called the explore-exploit dilemma.

How to get the best of both worlds?

Epsilon-greedy action selection is a simple way to achieve this.

· At each step, the agent chooses between exploration and exploitation.

· With a probability of ε, the agent selects a random action, exploring the environment. This allows the agent to discover potentially better actions that it hasn’t tried before.

· With a probability of 1-ε, the agent selects the action with the highest estimated value (Q-value) based on its current knowledge, exploiting what it believes to be the best action.

A hyperparameter ε!

· A higher value of ε encourages more exploration, which is beneficial in the early stages of learning or when the agent has limited knowledge about the environment.

· A lower value of ε promotes more exploitation, meaning the agent relies heavily on its current knowledge to choose actions that it believes are the best based on past experiences.

· Ideally, ε should be gradually decreased over time as the agent learns more about the environment, allowing it to transition from more exploratory behavior to more exploitative behavior.

import numpy as np

class Bandit:
    def __init__(self, num_arms):
        self.num_arms = num_arms
        # True mean reward of each arm, unknown to the agent
        self.true_rewards = np.random.normal(0, 1, self.num_arms)

    def get_reward(self, action):
        true_reward = self.true_rewards[action]
        # Observed reward is noisy around the arm's true mean
        reward = np.random.normal(true_reward, 1)
        return reward


class EpsilonGreedyAgent:
    def __init__(self, num_arms, epsilon):
        self.num_arms = num_arms
        self.epsilon = epsilon
        self.Q = np.zeros(self.num_arms)              # estimated action values
        self.action_counts = np.zeros(self.num_arms)  # times each arm was pulled

    def choose_action(self):
        if np.random.random() < self.epsilon:
            # Explore: choose a random action
            action = np.random.randint(self.num_arms)
        else:
            # Exploit: choose the action with the maximum estimated value
            action = np.argmax(self.Q)
        return action

    def update_estimates(self, action, reward):
        self.action_counts[action] += 1
        # Incremental sample-average update
        self.Q[action] += (reward - self.Q[action]) / self.action_counts[action]


# Initialize bandit environment and agent
num_arms = 5
epsilon = 0.1
bandit = Bandit(num_arms)
agent = EpsilonGreedyAgent(num_arms, epsilon)

num_steps = 1000

# Run the epsilon-greedy agent
for step in range(num_steps):
    action = agent.choose_action()
    reward = bandit.get_reward(action)
    agent.update_estimates(action, reward)

# Print the estimated values for each action
print("Estimated Action Values:")
for i in range(num_arms):
    print(f"Action {i+1}: {agent.Q[i]}")

So far, we have learned about MAB (Multi-Armed Bandit) problems. These are also known as stateless Reinforcement Learning problems, where we take a single action (pull an arm) and get a reward.

What if the best action differs across situations (states)?

Imagine you are a manager of an online store, and you have multiple versions of an advertisement that you can show to different customers. Each version of the advertisement is like an “arm” of a slot machine in the traditional multi-armed bandit problem.

If you follow the plain epsilon-greedy approach, you might end up showing an irrelevant advertisement that the customer would never click, degrading the customer experience.

In contextual bandits, you have some information about each customer, such as their previous state, age, gender, location, and browsing history. This additional information is called the “context.”

These types of problems lie between stateless RL and the full RL problem. We will talk more about full RL in upcoming blogs.

import numpy as np

class ContextualBandit:
    def __init__(self, num_actions, num_features):
        self.num_actions = num_actions
        self.num_features = num_features
        # True (hidden) weight vector for each action
        self.theta = np.random.normal(0, 1, size=(num_actions, num_features))

    def get_reward(self, action, context):
        theta_a = self.theta[action]
        context = np.array(context)
        # Softmax over actions gives the reward probability for the chosen action
        prob = np.exp(np.dot(theta_a, context))
        prob /= np.sum(np.exp(np.dot(self.theta, context)))
        reward = np.random.binomial(1, prob)
        return reward


class ContextualBanditAgent:
    def __init__(self, num_actions, num_features):
        self.num_actions = num_actions
        self.num_features = num_features
        self.theta_hat = np.zeros((num_actions, num_features))  # estimated weights
        self.action_counts = np.zeros(num_actions)

    def choose_action(self, context):
        context = np.array(context)
        # Softmax over estimated scores; higher-scoring actions are picked more often
        action_probs = np.exp(np.dot(self.theta_hat, context))
        action_probs /= np.sum(action_probs)
        action = np.random.choice(self.num_actions, p=action_probs)
        return action

    def update_estimates(self, action, context, reward):
        context = np.array(context)
        self.action_counts[action] += 1
        step_size = 1 / self.action_counts[action]
        # Move the chosen action's weights toward the observed reward
        self.theta_hat[action] += step_size * (reward - np.dot(self.theta_hat[action], context)) * context


# Initialize contextual bandit environment and agent
num_actions = 3
num_features = 5
bandit = ContextualBandit(num_actions, num_features)
agent = ContextualBanditAgent(num_actions, num_features)

num_steps = 1000

# Run the contextual bandit agent
for step in range(num_steps):
    context = np.random.normal(0, 1, num_features)  # random context for each step
    action = agent.choose_action(context)
    reward = bandit.get_reward(action, context)
    agent.update_estimates(action, context, reward)

# Print the estimated theta values for each action
print("Estimated Theta Values:")
for i in range(num_actions):
    print(f"Action {i+1}: {agent.theta_hat[i]}")

Thanks for spending your time on this blog. I am open to suggestions and improvements. Please let me know if I missed any details in this article.

Reference:

Reinforcement Learning: An Introduction — Richard S. Sutton and Andrew G. Barto
