Win at Blackjack with Reinforcement Learning

Artem Arutyunov
Published in The Power of AI · Dec 30, 2022 · 13 min read

Blackjack is a popular casino card game, and many have studied it closely in order to devise strategies that improve their likelihood of winning.

In this project, we will use Reinforcement Learning to find the best playing strategy for Blackjack. We will use Monte Carlo Reinforcement Learning algorithms to do it; you will see how Reinforcement Learning can determine the optimal Blackjack strategy in just a few minutes. You will quickly grasp important concepts of Reinforcement Learning and apply OpenAI's Gym, the go-to framework for Reinforcement Learning.

To see detailed explanations of the mentioned concepts and to analyze or experiment with the code for this blog, click on Win Blackjack with Reinforcement Learning.

You can also take a lot of FREE courses and projects about data science and other technology topics from Cognitive Class.

Let’s start.

What’s Reinforcement Learning?

Reinforcement Learning is a machine learning method based on rewarding desired actions/outputs and punishing undesired ones. A reinforcement learning model chooses which action to take based on the expected return of each action. The model takes some information about the current situation and the possible actions as input, and you reward it based on the decision it makes. Reinforcement learning models learn to perform a task through repeated trial-and-error interactions with an environment, and they do so without any human intervention.

Basics

  • Agent is your reinforcement learning model; it is the decision maker and learner.
  • Environment is the world around your agent; the agent learns and acts inside of it. The environment takes the action provided by the agent and returns the next state and the reward.
  • State is a complete description of the current situation of the environment.
  • Action is the way the agent interacts with the environment. The Action Space is the set of all possible actions.
  • Reward is the feedback from the environment; it can be negative or positive. It impacts the agent and serves as an indication of what it should achieve. Rewards are generally unknown, and agents learn how to correctly estimate them.
  • Policy is the rule an agent uses to decide which action to take given a specific state. It works as a map from state to action and can sometimes be defined as a set of probabilities for each action in the action space.
  • Value Function is the function that returns the expected total reward your agent can get by following a specific policy. The agent uses this value function to make decisions and learns by updating the expected reward values of this function. In this project, we will be using the state-action value function, so our function Q(s,a) will take a state-action pair and return an estimated reward for taking action a from state s.

The reinforcement learning process can be summarized as follows:

1. The agent plays a number of games.
2. In every game, the agent chooses an Action from the action space by using the Policy and Value Function.
3. The Action impacts the environment, and the Reward and the new State are returned to the agent.
4. The agent keeps track of what reward it received after choosing a certain action from a certain state.
5. After completing the game, the agent updates the estimated reward for each state and action by using the actual reward values received while playing the game.
6. The whole process repeats again (see the code sketch below).

Summary of RL process.
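In code, that loop might look roughly like the following sketch. It uses the Gym interface introduced below; `choose_action`, `update_estimates` and `number_of_games` are hypothetical placeholders for the policy, the learning step and the training length, and the exact return values of `reset`/`step` depend on your gym version.

for game in range(number_of_games):               # 1. play a number of games
    state = environment.reset()                   # start a new episode
    episode = []
    done = False
    while not done:
        action = choose_action(state)             # 2. pick an action using the policy / value function
        next_state, reward, done, info = environment.step(action)  # 3. environment returns the reward and new state
        episode.append((state, action, reward))   # 4. keep track of what happened
        state = next_state
    update_estimates(episode)                     # 5. update the estimated rewards using the actual rewards
                                                  # 6. the loop repeats for the next game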

Famous RL models that play Chess, Go or Atari games at superhuman levels are all based on the aforementioned principles and concepts.

Let’s check our environment.

BlackJack Environment

Blackjack is a card game played against a dealer. At the start of a round, both player and dealer are dealt 2 cards. The player can only see one of the dealer’s cards. The goal of the game is to get the value of your cards as close to 21 as possible, without crossing 21. The value of each card is listed below.

  • 10/Jack/Queen/King → 10
  • 2 through 9 → Same value as the card
  • Ace → 1 or 11 (player's choice). Note that an ace is called usable when it can be counted as 11 without going bust.

If the player has less than 21, they can choose to “hit” and receive a random card from the deck. They can also choose to “stand” and keep the cards they have. If the player exceeds 21, they go “bust” and automatically lose the round. If the player has exactly 21, they automatically win. Otherwise, the player wins if they are closer to 21 than the dealer.
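To make these card values concrete, here is a tiny helper (purely illustrative, not part of the Gym environment used below) that sums a hand the way the rules above describe, treating an ace as 11 only when that does not bust the hand:

def hand_value(cards):
    # cards holds numeric card values, with every ace stored as 1
    total = sum(cards)
    usable_ace = 1 in cards and total + 10 <= 21   # an ace may count as 11 without busting
    return (total + 10 if usable_ace else total), usable_ace

print(hand_value([1, 6]))      # (17, True)  -> the ace is counted as 11
print(hand_value([1, 6, 9]))   # (16, False) -> counting the ace as 11 would bust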

A few notes on OpenAI Gym: OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. This open-source library gives you access to a standardized set of environments, which is quite useful in our case since we are working with a fairly standard environment. It also allows you to create your own custom environments, but for now we will only explore the pre-defined Blackjack environment.

We create an OpenAI Gym Blackjack environment by using the `make` function:

import gym

environment = gym.make('Blackjack-v1')

Now let's see what the observation space for our environment is. The observation space is the set of all possible states. We can view it using the `observation_space` attribute:

environment.observation_space

Which returns:

Tuple(Discrete(32), Discrete(11), Discrete(2))

The observation is a 3-tuple of: the player's current sum, the dealer's one showing card (1–10, where 1 is an ace), and whether or not the player holds a usable ace (0 or 1, i.e. `False` or `True`).
We can then explain `Tuple(Discrete(32), Discrete(11), Discrete(2))` as follows:

  • The player's sum is encoded as one of 32 discrete values (0–31); the largest sum you can actually reach is 31, for example by hitting on 21 and drawing a 10, so there are 32 states for the player's score.
  • The dealer only shows one card, which can be anything from 1 (an ace) to 10, encoded in a space of size 11.
  • The 'usable ace' flag is True/False, so the size of that space is 2.

So there are 32x11x2 = 704 possible states.

Let's check the action space of this environment. Think about what it should be before running the code:

environment.action_space.n

It returns 2, since we can either hit or stand. We can also sample an observation from the observation space and inspect the player's and dealer's hands:

environment.reset()  # deal the initial hands so that player and dealer exist
print(environment.observation_space.sample())  # a random sample from the observation space, not the current state
print(environment.player)
print(environment.dealer)

Which returns:

(23, 7, 1)
[4, 8]
[7, 2]

An episode is one agent-environment interaction from the initial state to the final state, so it's one game that the agent plays. In addition, our agent operates in a discrete-time game: each time-advancing decision (e.g., taking some action from some state) is a step. It's easy to see that each episode consists of a series of steps.
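For example, one episode of purely random play can be generated like this (a sketch; the exact return values of `reset`/`step` depend on your gym version, and the printed episode is just an illustration):

state = environment.reset()
episode = []
done = False
while not done:
    action = environment.action_space.sample()            # random action: 0 = stand, 1 = hit
    next_state, reward, done, info = environment.step(action)
    episode.append((state, action, reward))               # one step of the episode
    state = next_state
print(episode)   # e.g. [((13, 7, False), 1, 0.0), ((19, 7, False), 0, 1.0)]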

Example of 2 episodes of BlackJack game.

What will help us win at this casino game is, ironically, a method with a casino name: the Monte Carlo method. But before jumping there, let's define some things that will build the infrastructure for our learning process and elaborate on and implement the previously defined terminology.

Epsilon-Greedy Policy

If you remember, as was mentioned before, a policy is just a function that defines which action our agent should take based on the current state. In our environment, a simple deterministic policy π for the state (15, 10, 0) would simply map that state to a single action, for example "hit".

Now let's clarify a few things about the title. Epsilon is just some constant with 0 ≤ epsilon ≤ 1, and it will define a probability. Greedy refers to the concept in computer science where a greedy algorithm is one that makes a locally optimal choice at each stage. In our case, a greedy policy is one that chooses the action with the biggest estimated return.

For now, assume that Q(s,a) is our value function: it returns an estimated reward for the given state and action. Let A be the action space; then our policy can be simply defined as:
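With probability 1 - epsilon: π(s) = argmax over a in A of Q(s, a) (exploit the current estimates)
With probability epsilon: π(s) = an action chosen uniformly at random from A (explore)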

You may ask: if we want to maximize our returns, why can't we always use the best action, the action with the best estimated reward? What's the point of epsilon? To answer this question we need to learn about 2 more concepts:

  • Exploitation happens when the agent makes the best decision given the current information; it uses the best estimated action to maximize the reward.
  • Exploration happens when the agent takes a random action to explore more opportunities and gather more information about the possible actions and the environment.

Epsilon defines the trade-off between exploration and exploitation. We need it because the best long-term strategy may involve short-term sacrifices, and in most cases agents must explore the environment and gather enough information to make the best overall decisions. Exploration can save our agent from settling for decisions that merely work instead of finding the best actions.
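As a minimal sketch in code (assuming the Q values are stored in a dictionary that maps each state to a numpy array of estimated returns, one per action):

import numpy as np

def epsilon_greedy_action(Q, state, n_actions, epsilon):
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore: random action
    return int(np.argmax(Q[state]))           # exploit: action with the largest estimated return

action = epsilon_greedy_action(Q, state, environment.action_space.n, epsilon=0.1)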

Monte Carlo Method

Let’s talk about the heart of our algorithm, the Value function that we will be using and how it estimates the reward for each action given the state.

The Monte Carlo method was invented by Stanislaw Ulam in the 1940s while he was trying to calculate the probability of a successful Canfield solitaire (he was sick and had nothing better to do). Ulam randomly laid the cards out and simply counted the number of successful plays. We will apply the same approach to create our value function. The basic principle of the Monte Carlo method can be summarized in 4 steps:

  1. Define the Domain of Possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Average the results
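As a tiny illustration of these four steps (before we return to Blackjack), here is the classic example of estimating pi by sampling random points in the unit square:

import numpy as np

x, y = np.random.rand(2, 100_000)       # 1. and 2. random inputs drawn uniformly from the unit square
inside = (x ** 2 + y ** 2) <= 1.0       # 3. deterministic computation: is the point inside the quarter circle?
print(4 * inside.mean())                # 4. average the results: roughly 3.14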

Before we can see it in action, let's define a few things. Recall that an episode is a set of agent-environment interactions from the initial to the final state, consisting of steps in a discrete-time game.

Monte Carlo reinforcement learning learns from episodes of experience; it works by setting the value function equal to the empirical mean return.
Let's assume that we have some initialized policy π that our agent follows. Then let's play the game once and record the resulting episode: a sequence of states, actions and rewards S_0, A_0, R_1, S_1, A_1, R_2, …, S_{T-1}, A_{T-1}, R_T.

Now let's look at the total expected reward of taking an action A_t in the state S_t, where t is some time step.

At time step t = 0 (the first time step), the environment (including the agent) is in some state S_t = S_0 (the initial state); the agent takes an action A_t = A_0 (the first action in the game), receives a reward R_{t+1} = R_1, and the environment moves to the next state S_{t+1} = S_1. Let's define a function G that gives us the expected total discounted reward at each time step:
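G_t = R_{t+1} + gamma * R_{t+2} + gamma² * R_{t+3} + gamma³ * R_{t+4} + …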

The discount factor gamma, with 0 ≤ gamma ≤ 1, is an important constant. We add the immediate reward R_{t+1} as it is, without modifying its value; the next reward R_{t+2} is multiplied by gamma, so it is only partially added; R_{t+3} is multiplied by gamma², R_{t+4} by gamma³, and so on. Gamma determines how much the reinforcement learning agent cares about rewards in the distant future relative to those in the immediate future. Note that if gamma = 0, the total expected return is defined just by the immediate reward, so the agent will only learn and care about actions that produce an immediate reward.

Now we can define our action-value function Q_π(s, a) for some state s and action a as:
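Q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]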

So the value function returns the expected value of the total discounted reward G_t for a time step t at which S_t = s and A_t = a.

Now, after completing a series of episodes of the game, how can we adjust the expected values? Or, a bigger question: how does the learning process itself happen in the Monte Carlo method? For that, we will use the concept of incremental means.

The incremental mean is just the average of a set of values, computed incrementally. Let x_1, x_2, …, x_n be the set of values and let μ_1, μ_2, …, μ_n be the sequence of means, where μ_1 = x_1, μ_2 = (x_1 + x_2)/2, μ_3 = (x_1 + x_2 + x_3)/3, and so on. Let's see how the mean can be defined incrementally:
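μ_n = (x_1 + … + x_n) / n
    = (x_n + (n - 1) * μ_{n-1}) / n
    = μ_{n-1} + (1/n) * (x_n - μ_{n-1})

So each new value only nudges the previous mean by a fraction 1/n of the difference, and we never need to store all the past values.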

Now we can put everything together to describe the Monte Carlo learning process. Let's have an episode:

For each (state, action) pair we will keep track of the number of times this (state, action) pair was visited; let's define a counter function N(s_t, a_t). Then, every time we visit a (state, action) pair we will update the visit counter and adjust the running mean:
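N(s_t, a_t) ← N(s_t, a_t) + 1
Q(s_t, a_t) ← Q(s_t, a_t) + (1 / N(s_t, a_t)) * (G_t - Q(s_t, a_t))

This is exactly the incremental mean from above, with the observed return G_t playing the role of the new value x_n.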

Let's look at an example episode where we update our Q function:

In this episode, before the game begins, the number of visits of this state with this action is 0; then the game starts and the visit counter is updated. The return is 0 at the beginning. The action is chosen using the predefined policy, the state changes, and the reward is received since we have more than the dealer. The total reward is updated, and the Q function is updated by calculating the average reward of taking action 0 in this state. Since this was the first time this state-action pair was visited, we simply compute 1/1, so we get 1.

So now we know how to update the action-value function and how to use it in combination with our policy to maximize the rewards. It can be summarized as:

The Monte Carlo method is a type of model-free reinforcement learning: the agent does not use predictions of the environment's response, so it is not trying to create a statistical model of the environment.

We will add a few more tricks and parameters to make this method more efficient.

Monte Carlo with First Visits

Note that for every episode we update our Q function based on the states and actions that were visited, and some state-action pairs can be visited more than once per episode. Every-visit Monte Carlo averages the returns for every time a state-action pair is visited in an episode, whereas first-visit MC averages the returns only for the first time a state-action pair is visited in an episode.

Let's look at the first-visit MC update step:
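Here is a sketch of that update in Python (not the exact code from the linked notebook; Q and N are dictionaries mapping each state to an array of per-action values and visit counts, and `episode` is the list of (state, action, reward) steps collected during one game):

from collections import defaultdict
import numpy as np

n_actions = environment.action_space.n
Q = defaultdict(lambda: np.zeros(n_actions))   # estimated returns
N = defaultdict(lambda: np.zeros(n_actions))   # visit counts

def update_from_episode(episode, Q, N, discount_factor=1.0, first_visit=True):
    # episode = [(S_0, A_0, R_1), (S_1, A_1, R_2), ..., (S_{T-1}, A_{T-1}, R_T)]
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):                  # t = T-1, T-2, ..., 0
        state, action, reward = episode[t]
        G = reward + discount_factor * G                       # running return G_t
        seen_earlier = (state, action) in [(s, a) for s, a, _ in episode[:t]]
        if first_visit and seen_earlier:                       # only the first visit counts
            continue
        N[state][action] += 1
        Q[state][action] += (G - Q[state][action]) / N[state][action]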

Here T is the last step in the episode, T-1 is the second-to-last one, and so on.

Since it's impossible for the same state to appear more than once in a Blackjack episode, the two variants coincide here, and we will simply use the first-visit implementation.

Let’s train our model:

policy, V, Q, DELTA = monte_carlo_ES(environment, N_episodes=50000, discount_factor=1, epsilon=0.1, first_visit=True, theta=0)

We use a value function V(s) for some state s as an indication of what our model thinks the expected return is when following the best action given by the Q function, so V(s) = max(Q(s,a)) over all a in the action space. Our implementation returns the value function V as a dictionary. DELTA indicates the amount by which the V function for each state was updated. The last parameter is theta, which specifies a stopping threshold: if delta becomes very small, there is no reason to continue training, since updates to the values are minimal, so we stop the training process.
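With Q stored as a dictionary of per-action value arrays (as in the sketch above), deriving V is a one-liner (again just a sketch under that assumption):

V = {state: float(np.max(action_values)) for state, action_values in Q.items()}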

Let’s plot the delta value for each episode to see how the update rates are changing:

As you can see, the delta parameter generally decreases, which makes sense since our model comes closer and closer to the optimal values of the value function. Speaking of which, let's see the map of expected returns based on what the player has and what the dealer is showing.

We see a general trend: as the score of the player increases, the value function takes on higher values, so our expected return grows. Let's see the average result of playing ten thousand games and compare, first using a random policy.
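A rough way to produce such a comparison (a sketch, assuming the `policy` returned by the training call above can be indexed by state to give an action, and that `reset`/`step` return values in the order shown, which depends on your gym version):

def win_rate(environment, choose, n_games=10_000):
    wins = 0
    for _ in range(n_games):
        state = environment.reset()
        done = False
        while not done:
            state, reward, done, info = environment.step(choose(state))
        wins += (reward > 0)                     # the final reward is positive on a win
    return wins / n_games

print(win_rate(environment, lambda s: environment.action_space.sample()))   # random policy
print(win_rate(environment, lambda s: policy[s]))                           # trained policy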

The random policy gives us around 28% wins, while the agent using the trained policy wins around 42% of games, an improvement of more than 10 percentage points. Let's visualize the policy itself.

It looks like the optimal policy for blackjack is: if the agent has no usable ace, the higher the card the dealer is showing, the more likely the agent is to hit; the exception is when the dealer shows an ace. If the agent has a usable ace, the strategy is different: the agent keeps hitting at lower sums and, for the most part, stands once the player's sum is over 18.

We have created a successful algorithm that was able to give us, perhaps not a winning strategy, but an optimal one for blackjack.

If you want to know the answers to the aforementioned questions, learn some simple tricks that may help you improve your algorithm, or look at the implementation of the algorithm itself, then click on Win Blackjack with Reinforcement Learning, or explore other FREE courses and projects about data science and machine learning on Cognitive Class.

Thanks for reading.


Hey, Artem here, I love helping people to learn, and learn myself. IBM Data Scientist + Studying Math and Stats at University of Toronto.