Reinforcement Learning: Optimal Blackjack Strategies

Simulating Blackjack games to learn the optimal playing strategy using Monte Carlo and TD(λ)

Neel Modha
MITB For All
10 min read · Jun 14, 2024


Will Robots start dealing for us? Generated by DALL-E 3 via ChatGPT 4o

The content within this article was submitted in fulfillment of the requirements for SMU MITB’s Introduction to Reinforcement Learning module. Permission was granted to the author to reproduce the assignment context and our submissions for this article.

Imagine an evening during Chinese New Year. Your cousins pull out a deck of cards and decide to play some Blackjack. Sitting down at the table, you start thinking about how you can maximize your winnings (or at least, minimize your losses…) — and the machine learning enthusiast inside you ponders about the optimal strategies to use.

Turns out, Reinforcement Learning (RL) is a suitable paradigm for determining optimal strategies for Blackjack! In this article, we will explore how we can use two popular RL approaches, Monte Carlo and TD(λ), to find the best strategy as a Blackjack player. We will:

  • Provide a brief introduction to Monte Carlo methods, TD(λ) methods and the Blackjack game environment
  • Discuss implementations for Monte Carlo and TD(λ)
  • Review the results of running these algorithms on our Blackjack game

Brief Introduction to Monte Carlo and TD(λ) Methods

To avoid turning this short article into a textbook, it assumes basic knowledge of Monte Carlo and TD(λ) methods in the context of RL. If you need a primer, there are several articles on Medium which explain these algorithms in much more detail; I would highly recommend those written by Prof. James, our course instructor: (i) Key Takeaways 3 (MC Methods) From Sutton and Barto RL textbook and (ii) Key takeaways 4 (Temporal Difference Learning) — Sutton and Barto RL textbook

That being said, I provide a brief introduction below to the methods and the variants we use to simulate Blackjack.

(1) Monte Carlo Methods

In simple terms, Monte Carlo (MC) methods learn optimal strategies by running many simulated episodes (in our context, an episode is a single 1-player vs. dealer game of Blackjack, played until the hand is resolved). Based on the states reached, actions taken, and rewards received across each episode, the Monte Carlo agent learns by averaging the returns across these episodes, with state value updates occurring at the end of each episode.

In each episode, it is possible for the agent to visit a specific state s multiple times. When we want to update our state value, v_{\pi}(s), we have two choices here:

  • First-visit MC: Update state values based on the average of returns following the first visit to each state s
  • Every-visit MC: Update state values based on the average of returns following every visit to each state s

In the context of Blackjack, while we can implement Every-visit MC, it is not needed, as it is not possible for the player to visit the same state twice in a single episode (after “hitting”, the value of the cards you hold will always be different; after “standing”, the game ends with a win or loss). As such, we will analyze the results of First-visit MC in a later section, with the pseudocode for the algorithm as follows:

Pseudocode for the First-Visit Monte Carlo method
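For readers who prefer code to pseudocode, here is a minimal sketch of First-visit MC prediction in Python. It is not the article's implementation: the env object (with reset/step returning (next_state, reward, done)), the policy callable, and hashable tuple states are all assumptions made purely for illustration.

```python
from collections import defaultdict

def first_visit_mc(env, policy, n_episodes=500_000, gamma=1.0):
    """Estimate Q(s, a) with First-visit Monte Carlo (illustrative sketch)."""
    returns_sum = defaultdict(float)    # cumulative return per (state, action)
    returns_count = defaultdict(int)    # number of first visits per (state, action)
    Q = defaultdict(float)              # current state-action value estimates

    for _ in range(n_episodes):
        # Generate one episode: a list of (state, action, reward) tuples.
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            state_next, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = state_next

        # Walk backwards through the episode, accumulating the return G.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: only update at the earliest occurrence of (s, a).
            if all((s_prev, a_prev) != (s, a) for s_prev, a_prev, _ in episode[:t]):
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```

As noted above, in Blackjack the first-visit check never actually triggers, since a state-action pair cannot repeat within one hand.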

(2) TD(λ) Methods

Again, in simple terms, Temporal Difference (TD)(λ) methods combine Monte Carlo's (MC) approach of simulating many episodes with a dynamic programming-style approach, where state value updates occur at each step of the episode rather than only after the episode ends. (To understand the role of dynamic programming in RL, also read Key Takeaways 2 (DP and GPI) From Sutton and Barto RL Textbook from Prof. James.)

Mathematically, the difference can be shown as follows:

Formula 1: Monte Carlo update step: V(S_{t}) \leftarrow V(S_{t}) + \alpha[G_{t} - V(S_{t})]
Formula 2: TD(0) update step: V(S_{t}) \leftarrow V(S_{t}) + \alpha[R_{t+1} + {\gamma}V(S_{t+1}) - V(S_{t})]

In the MC update step, G_{t} is the full return, which is only known at the end of each episode. In the TD update step, R_{t+1} is the immediate reward received at time-step t+1, and {\gamma}V(S_{t+1}) is the discounted estimate of the next state's value.

This approach qualifies TD methods as online methods, which update state values as the episode progresses, as compared to MC methods which are offline methods, updating state values only at the end of each episode.
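To make the contrast concrete, here is a small sketch of the two update rules from Formulas 1 and 2. The function and variable names are illustrative only; V stands for any dict or array of value estimates.

```python
def mc_update(V, state, G, alpha=0.1):
    """Monte Carlo: move V(s) towards the full episode return G (Formula 1)."""
    V[state] += alpha * (G - V[state])

def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """TD(0): move V(s) towards the bootstrapped target R + gamma * V(s') (Formula 2)."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])
```

The key difference is simply what the estimate is pulled towards: the observed return G (known only after the episode ends) versus the bootstrapped one-step target (available immediately).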

Ok, so what does the λ in TD(λ) mean, then? Personally, I find it useful to think of λ as a sliding scale with 0 and 1 at each end.

  • When λ = 0, we are using a fully TD approach (a.k.a. TD(0)), where the update to state values is made with the reward and estimated next-state value from the immediate next time-step. This is akin to Formula 2 shown above.
  • When λ = 1, we are using a fully MC approach (a.k.a. TD(1)), where the update to state values is made with the full return from the current state to the end of the episode. This is akin to Formula 1 shown above.
  • When 0 < λ < 1, we are using a hybrid approach, where all steps until termination are considered, but with diminishing weights (a small numeric sketch of this weighting follows this list). For a deeper explanation of TD(λ), Eligibility Traces and their mathematical formulas, refer to Key takeaways 5 (Eligibility Traces) — Sutton and Barto RL textbook, written by Prof. James.
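As a quick illustration of the "diminishing weights" idea, the sketch below computes the weights that the λ-return assigns to each n-step return in an episode of T steps. The snippet is purely illustrative and not part of the article's code.

```python
def lambda_return_weights(lambd, T):
    """Weights on the n-step returns in the lambda-return for a T-step episode.

    Each intermediate n-step return gets weight (1 - lambda) * lambda**(n - 1),
    and the final (Monte Carlo) return absorbs the remaining mass lambda**(T - 1).
    """
    weights = [(1 - lambd) * lambd ** (n - 1) for n in range(1, T)]
    weights.append(lambd ** (T - 1))  # terminal return gets the leftover weight
    return weights

print(lambda_return_weights(0.0, 5))       # [1.0, 0.0, 0.0, 0.0, 0.0] -> pure TD(0)
print(lambda_return_weights(1.0, 5))       # [0.0, 0.0, 0.0, 0.0, 1.0] -> pure MC
print(sum(lambda_return_weights(0.8, 5)))  # the weights always sum to 1.0
```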

One final concept I’d like to introduce (sorry, final bit of theory!) is On-Policy and Off-Policy learning using TD(0).

  • On-Policy learning evaluates and improves the same policy that selects actions. That policy keeps a small amount of exploration to encourage trying new, possibly better actions, so the values learned reflect this exploratory behavior. However, limiting exploration this way may not be desirable during training, where we want to explore widely to find the optimal policy as soon as possible.
  • This is where Off-Policy learning comes in: the agent follows a more exploratory behavior policy while learning the values of a separate, greedier target policy, allowing a large amount of exploration and potentially faster convergence to the optimal policy.

Two well-known TD(0) algorithms implement On-Policy and Off-Policy learning respectively: SARSA (On-Policy) and Q-Learning (Off-Policy), with their update formulas shown below:

Formula 3: SARSA updates the Q-value based on the action actually taken by the policy: Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha[R_{t+1} + {\gamma}Q(S_{t+1}, A_{t+1}) - Q(S_{t}, A_{t})]
Formula 4: Q-Learning updates the Q-value based on the maximum estimated Q-value of the next state: Q(S_{t}, A_{t}) \leftarrow Q(S_{t}, A_{t}) + \alpha[R_{t+1} + {\gamma}\max_{a}Q(S_{t+1}, a) - Q(S_{t}, A_{t})]
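A minimal sketch of the two update rules in Python, assuming Q is a defaultdict keyed by (state, action) and a two-action encoding (0 = Stand, 1 = Hit); these containers and names are illustrative, not the article's code.

```python
from collections import defaultdict

Q = defaultdict(float)   # assumed container: Q-values keyed by (state, action)
ACTIONS = (0, 1)         # assumed encoding: 0 = Stand, 1 = Hit

def sarsa_update(s, a, r, s_next, a_next, done, alpha=0.1, gamma=1.0):
    """SARSA (On-Policy): bootstrap from the action the policy actually took next."""
    target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next, done, alpha=0.1, gamma=1.0):
    """Q-Learning (Off-Policy): bootstrap from the best estimated next action."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    target = r + (0.0 if done else gamma * best_next)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

The only difference between the two is the bootstrap term: SARSA uses the next action actually chosen by the (exploratory) policy, while Q-Learning always uses the greedy next action.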

Blackjack Game Environment

Game Setup:

  • Classic Blackjack rules with a modification: No Kings in the deck. The deck consists of Ace, 2–10, Jack, and Queen.

Card Values:

  • Number cards (2–10): Face value.
  • Jack and Queen: 10 points.
  • Aces: Can be 1 or 11 points.

Game Rules:

  • Player wins with a total of 21 on the first two cards (“blackjack”), unless the dealer also has a blackjack (resulting in a tie).
  • Wins are paid 1 to 1, and blackjacks are paid 3 to 2. Ties result in no payout.
  • The player is initially dealt two cards and can choose to “hit” (take another card) or “stand” (end turn).
  • A hand exceeding 21 points is a “bust” and a loss, regardless of the dealer’s outcome.

Dealer’s Rules:

  • The dealer has one face-up card and no hole card.
  • The dealer must draw cards while the hand is a “hard 16” or lower, or a “soft 17” (a hand with an Ace counted as 11) or lower (see the sketch after this list).
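To make the rules concrete, here is a small sketch of how the modified deck and the dealer's drawing rule could be expressed in Python. This is not the article's environment code (the real implementation is in the Gist linked below), and drawing with replacement from an infinite shoe is my own simplifying assumption.

```python
import random

# Modified deck: Ace, 2-10, Jack (10), Queen (10) -- no Kings.
CARD_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10]  # Ace stored as 1

def draw_card():
    """Draw with replacement, as if from an infinite shoe (an assumption)."""
    return random.choice(CARD_VALUES)

def hand_value(cards):
    """Return (total, usable_ace): count one Ace as 11 if it does not bust the hand."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10, True
    return total, False

def dealer_should_hit(cards):
    """Dealer draws on hard 16 or lower, and on soft 17 or lower."""
    total, usable_ace = hand_value(cards)
    return total <= 16 or (usable_ace and total <= 17)

print(hand_value([1, 6]))          # (17, True)  -> soft 17, dealer would still hit
print(dealer_should_hit([1, 6]))   # True
print(dealer_should_hit([10, 7]))  # False -> hard 17, dealer stands
```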

Code for this modified Blackjack environment can be found at this GitHub Gist (the full code is not embedded here as it is very long).

For the full, detailed implementation of the environment, and RL algorithms in Python, refer to the complete notebook in GitHub.

(P.S. The code blocks you see below have been edited slightly from the GitHub repo above, for readability and simplicity.)

Implementations of Reinforcement Learning Algorithms

(1) Monte Carlo Implementation

We start by defining the agent and its associated methods for the Monte Carlo implementation (a simplified sketch follows the list below):

  • The self.Q_sa variable holds a 4D list of state-action values, initialized to zeros and updated as the agent is trained according to an epsilon-greedy policy.
  • The target_policy method acts according to an epsilon-greedy strategy, with an epsilon (exploration) factor of 0.1 as a default.
  • The sample_episode method generates a round of Blackjack from start to finish, following the epsilon-greedy strategy for action selection.
  • The act method is used after the agent has been trained, and identifies the best (greedy) action to take after all state-action values have been updated after training.
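The full agent lives in the linked notebook; below is a simplified sketch of the structure described above. The method names follow the article's description, but the array sizes, state indexing, and the env interface (reset/step returning (next_state, reward, done)) are my assumptions, not the author's exact code.

```python
import numpy as np

class MCAgent:
    def __init__(self, epsilon=0.1):
        # Q_sa[player_sum][dealer_card][usable_ace][action], initialized to zeros.
        # Index sizes are illustrative; the real code sizes them to the environment.
        self.Q_sa = np.zeros((32, 11, 2, 2))
        self.epsilon = epsilon

    def target_policy(self, state):
        """Epsilon-greedy action selection over the current Q estimates."""
        player_sum, dealer_card, usable_ace = state
        if np.random.rand() < self.epsilon:
            return np.random.randint(2)     # explore: random Hit / Stand
        return int(np.argmax(self.Q_sa[player_sum, dealer_card, int(usable_ace)]))

    def sample_episode(self, env):
        """Play one full hand, returning a list of (state, action, reward) steps."""
        episode, state, done = [], env.reset(), False
        while not done:
            action = self.target_policy(state)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        return episode

    def act(self, state):
        """Greedy action after training (no exploration)."""
        player_sum, dealer_card, usable_ace = state
        return int(np.argmax(self.Q_sa[player_sum, dealer_card, int(usable_ace)]))
```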

Speaking of updating state-action values, the predict_q method iterates through each episode and updates the relevant entry in self.Q_sa (line 33 of the full code). You will also notice that this method can toggle between First-visit and Every-visit MC, based on the first_visit parameter passed to it! When using First-visit MC, the update step is skipped whenever the same state-action pair is visited again.
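Continuing the sketch above (and again not the author's exact code), a predict_q with the first_visit toggle might look something like this:

```python
import numpy as np

# Intended as a method on the MCAgent sketch shown earlier.
def predict_q(self, env, n_episodes=500_000, first_visit=True, gamma=1.0):
    """Average sampled returns into self.Q_sa (illustrative sketch)."""
    counts = np.zeros_like(self.Q_sa)          # visit counts for incremental averaging
    for _ in range(n_episodes):
        episode = self.sample_episode(env)

        # Backward pass: the return G_t following each time-step t.
        returns, G = [], 0.0
        for (_, _, reward) in reversed(episode):
            G = gamma * G + reward
            returns.append(G)
        returns.reverse()

        # Forward pass: update each (state, action), skipping repeats under First-visit MC.
        seen = set()
        for ((player_sum, dealer_card, usable_ace), action, _), G in zip(episode, returns):
            key = (player_sum, dealer_card, int(usable_ace), action)
            if first_visit and key in seen:
                continue
            seen.add(key)
            counts[key] += 1
            self.Q_sa[key] += (G - self.Q_sa[key]) / counts[key]   # incremental mean
    return self.Q_sa
```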

And that’s it for the Monte Carlo implementation! I will discuss how the model is actually trained in a later section. For now, let’s move on to the TD(λ) implementation :)

(2) TD(λ) Implementation

Again, we start by defining the agent and its associated methods for the TD(λ) implementation:

  • The self.Q_sa variable, target_policy method, and act method are the same as in the MC implementation.
  • A new variable, self.Et, captures the Eligibility Traces for TD(λ).

Note that there is no sample_episode method in the TD agent definition. This is because generating and iterating through an episode both occur in the predict_q method below (recall that, as TD is trained in an online fashion, the update step occurs after each episode step is generated).

In the above code block, you can see that within the while loop (which starts at line 15), each episode step is followed by a set of update steps (lines 41–62), after which the next episode step is generated and used for the next round of updates, i.e. in an online fashion. This repeats for n episodes.

Note that whenever self.lambd (λ) = 1, the algorithm saves the episode's steps one by one (lines 35–39) and then replays them through the TD(1) update (after line 69), which is equivalent to an MC implementation.
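For intuition, here is a compact sketch of one episode of on-policy TD(λ) (SARSA(λ)) with accumulating eligibility traces. It mirrors the structure described above but is not the author's code: it assumes an agent object carrying Q_sa, Et, lambd, and target_policy as sketched earlier, the same hypothetical env interface, and it omits the λ = 1 episode-replay special case the author describes.

```python
import numpy as np

def td_lambda_episode(agent, env, alpha=0.05, gamma=1.0):
    """One episode of on-policy TD(lambda) with accumulating eligibility traces."""
    def to_index(s, a):
        # (player_sum, dealer_card, usable_ace, action) as integer indices
        return (s[0], s[1], int(s[2]), a)

    agent.Et = np.zeros_like(agent.Q_sa)      # reset eligibility traces each episode
    state, done = env.reset(), False
    action = agent.target_policy(state)
    while not done:
        state_next, reward, done = env.step(action)
        action_next = None if done else agent.target_policy(state_next)

        # TD error: bootstrapped target minus the current estimate.
        q_next = 0.0 if done else agent.Q_sa[to_index(state_next, action_next)]
        delta = reward + gamma * q_next - agent.Q_sa[to_index(state, action)]

        # Bump the trace for the visited pair, update every pair in proportion
        # to its trace, then decay all traces by gamma * lambda.
        agent.Et[to_index(state, action)] += 1.0
        agent.Q_sa += alpha * delta * agent.Et
        agent.Et *= gamma * agent.lambd

        state, action = state_next, action_next
```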

With that, we have completed both the Monte Carlo and TD(λ) implementations! The following code block shows how several variations of MC and TD can be trained using the agents defined above. Next, we will visualize and evaluate some of the results from the top-performing models.
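As a rough idea of what such a training sweep might look like, here is a hedged sketch. BlackjackEnv, MCAgent, and TDAgent are placeholder names standing in for the classes defined in the notebook, and the exact configuration grid shown is illustrative only.

```python
# Hypothetical training sweep; class names are placeholders, not the author's exact code.
N_EPISODES = 500_000
env = BlackjackEnv()

runs = {
    "MC First-visit":        MCAgent(epsilon=0.1),
    "MC Every-visit":        MCAgent(epsilon=0.1),
    "TD(0) SARSA":           TDAgent(lambd=0.0, epsilon=0.1),
    "TD(1) (MC-equivalent)": TDAgent(lambd=1.0, epsilon=0.1),
}

for name, agent in runs.items():
    if isinstance(agent, MCAgent):
        agent.predict_q(env, n_episodes=N_EPISODES, first_visit=("First" in name))
    else:
        agent.predict_q(env, n_episodes=N_EPISODES)   # TD agent updates online per step
    print(f"{name}: trained for {N_EPISODES:,} episodes")
```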

Training Results

Below are some policy plots from my top-performing models. The green regions indicate states where the best action is to Hit; conversely, the red regions indicate states where the best action is to Stand. This is essentially the Greedy Policy learned through training.

Policy Plot (Q_sa) for Monte Carlo — First Visit (500K Episodes)
Policy Plot (Q_sa) for TD(1) — SARSA (On-Policy) (500K Episodes)

We can see that the higher the starting value of the player’s hand, the more the agent has learned to Stand, subsequently attaining a higher Expected Reward, as indicated by the “wall of red” towards the back of the plots above.

Conversely, the lower the starting value of the player’s hand, the more the agent has learned to Hit; despite this, the Expected Reward tends to be mostly negative, or only very slightly positive in the “Usable Ace” case.

Intuitively, this makes sense, as the game is ultimately designed to make the house money over the long run. Win rates for Blackjack, both from my models and empirically from other studies, tend to be around 40–45%, ensuring that in the long run the player never comes out ahead of the house.

Both algorithms have converged to very similar policies: the green zones (Action: Hit) lie where the player’s hand sum is low, and the red zones (Action: Stand) lie where it is high. In other words, the agent has correctly learned that it is better to Hit when the risk of busting is lower, which aligns with intuition.

FYI, code for plotting these charts can be found in this GitHub Gist.
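For readers who want something similar without the Gist, here is a minimal sketch of a 2D greedy-policy heatmap with matplotlib. The article's actual charts are 3D plots of Q_sa, so this is only an approximation, and it assumes the Q_sa indexing from the earlier sketches.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_greedy_policy(Q_sa, usable_ace=False):
    """2D heatmap of the greedy action (0 = Stand, 1 = Hit) per state."""
    # Q_sa is assumed to be indexed as [player_sum, dealer_card, usable_ace, action].
    greedy = np.argmax(Q_sa[:, :, int(usable_ace), :], axis=-1)
    plt.imshow(greedy[4:22, 1:11], origin="lower", cmap="RdYlGn",
               extent=[1, 11, 4, 22], aspect="auto")
    plt.xlabel("Dealer's face-up card")
    plt.ylabel("Player hand sum")
    plt.title(f"Greedy policy (usable ace = {usable_ace}): green = Hit, red = Stand")
    plt.colorbar(label="Greedy action")
    plt.show()
```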

Conclusion

In this article, we briefly went through the mechanics of Monte Carlo and TD(λ), showing the versatility of, and degree of overlap between, both approaches. We then applied them to a Blackjack scenario, where we successfully trained an agent and learned a policy that grants us a 40–45% win rate, corresponding to an expected return of -0.027 once ties and blackjacks are accounted for.

I hope this article gave you a practical understanding of how these algorithms work. Thank you for taking the time and all the best in your Blackjack endeavors!

References

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction (2018), MIT press

[2] kavisherlock, BlackJack-RL (2018), GitHub

[3] jovsa, rl-examples-sutton-and-barto-book (2018), GitHub

All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
