An income back test of three roulette strategies based on Reinforcement Learning

Taiwan AI Academy
Published in
8 min read · Apr 1, 2020


Roulette is a casino game named after the French word meaning little wheel. In the game, players may choose to place bets on a single number, on the colors red or black, or on various groupings of numbers.

This article uses Reinforcement Learning (RL) to evaluate three roulette strategies in terms of total payout and winning percentage. I will first introduce how to play roulette, followed by a brief introduction to RL, and then explain how to simulate a roulette environment for RL training. After that come the algorithm I used and the experiment method. The last part of the article gives the final outcome and some discussion of the results.

The code for this project can be found on GitHub.

How to play roulette

In each round of the roulette game, you first select the chip size that you’d like to bet with, then click on the table layout to place your bet.

The typical roulette table and the odds are shown below:

In this project, I assume we are playing European-style roulette, which has a single zero. American-style roulette adds a double zero, so the ball falls into one of 38 pockets instead of 37.
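To make the difference between the two wheels concrete, here is a small sketch of the expected return on a one-unit bet on a single number. It assumes the standard 35 to 1 payout for a single-number bet, which is not stated in the text above, and the function name is only for illustration.

```python
from fractions import Fraction


def straight_up_expected_return(pockets, payout=35):
    """Expected net return per unit staked on a single number.

    `payout` is the assumed 35-to-1 payout for a single-number bet.
    """
    p_win = Fraction(1, pockets)
    # Win `payout` units with probability p_win, lose the unit stake otherwise.
    return p_win * payout - (1 - p_win)


european = straight_up_expected_return(37)  # single zero
american = straight_up_expected_return(38)  # single and double zero
print(float(european), float(american))
```

Under these assumptions the extra pocket on the American wheel roughly doubles the expected loss per unit staked.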

After you place your bet, the dealer spins the wheel. It is important to understand how the dealer spins the wheel because we want to simulate a roulette environment. If you are not familiar with roulette, I suggest watching some videos of the game on YouTube or elsewhere to see how it goes.

The wheel will come to a stop and the ball will settle in a pocket. The dealer will issue your payout if you’ve been lucky.

Introduction to Reinforcement Learning

Reinforcement Learning is a branch of machine learning where an agent learns to behave in an interactive environment by performing certain actions given the current state and observing the rewards it gets from those actions.

Unlike supervised learning, we don’t tell the agent whether an action is good or bad. For example, in a Tic-Tac-Toe game, the agent first randomly selects a place in the 3 x 3 grid. It might place a mark on a corner, which may be a bad move, but you can’t tell yet because the game is not over. What we do is continue the process and feed the final result back to the previous states. After several training episodes, the agent selects the best action based on past experience, and when it returns to the initial state, it would mark the middle because the winning percentage should be higher there.

Set up a roulette environment

To use RL we need two components: the agent and the environment. The agent is the algorithm we use; the environment plays the roulette game and then gives the outcome and the reward to the agent. The ideal environment would be built from real roulette records obtained from a casino. However, since nobody has shared such a dataset, we will have to make our own. Fortunately, we don’t have to define the reward because there is already a payout table for this game. What we need now is to simulate how the dealer spins the wheel.

There are many open-source roulette projects on GitHub written in Python. However, when it comes to generating the next position, what you will most commonly see is something like this:

random.randint(0, 37)

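This is not ideal because the process by which the ball settles in a pocket is far more complicated than a single uniform draw. Note also that Python's random.randint includes both endpoints, so this call yields 38 possible values (0 through 37), one more than a European wheel has.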

Six factors affect the final outcome:

  1. The dealer picks up the ball from its previous position, spins the wheel in one direction, and rolls the ball in the opposite direction.
  2. The direction used in the previous step alternates between clockwise and counterclockwise from one round to the next.
  3. The force of the dealer's spin.
  4. Collisions between the ball and the pockets.
  5. Whether the wheel is tilted.
  6. Human influence from the dealer. The dealer may forget to switch direction after several rounds, or a change of dealer may result in a different distribution.

I started from an open-source project and modified the part that generates the next position. The first three factors can be simulated in code by setting the directions and drawing two random forces (one for the spinning wheel and one for the rolling ball) within a reasonable range. The fifth and sixth factors can be ignored if we assume we are playing on a roulette machine, which is what I did. Factor 4 is more complicated, and I left it for future improvement.
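The idea behind the first three factors can be sketched as follows. This is not the project's actual code: the class name, the travel ranges standing in for the two forces, and the pocket order (the usual European single-zero layout, as far as I know) are all assumptions made for illustration.

```python
import random

# Assumed pocket order around a European single-zero wheel.
WHEEL = [0, 32, 15, 19, 4, 21, 2, 25, 17, 34, 6, 27, 13, 36, 11, 30, 8, 23,
         10, 5, 24, 16, 33, 1, 20, 14, 31, 9, 22, 18, 29, 7, 28, 12, 35, 3, 26]


class SimulatedWheel:
    """Hypothetical spin model: start from the previous pocket, alternate direction."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.index = 0       # position of the previous result on the wheel
        self.direction = 1   # +1 clockwise, -1 counterclockwise

    def spin(self):
        # Two random "forces", expressed as pockets travelled (placeholder ranges).
        wheel_travel = self.rng.randint(37 * 3, 37 * 6)
        ball_travel = self.rng.randint(37 * 8, 37 * 15)
        # Wheel and ball move in opposite directions, so their relative travel adds up.
        offset = self.direction * (wheel_travel + ball_travel)
        self.index = (self.index + offset) % len(WHEEL)
        self.direction = -self.direction  # switch direction for the next spin
        return WHEEL[self.index]


wheel = SimulatedWheel(seed=0)
print([wheel.spin() for _ in range(10)])
```

Collisions, tilt, and dealer behavior (factors 4 to 6) are deliberately left out of this sketch.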

After setting up the environment, I generated 100,000 spin records and used the first 90,000 for training and the rest for testing. Below are bar graphs of the numbers and colors for the training set and the test set.

As you can see, our data still has a roughly uniform distribution even after we added some constraints. Let’s see if we can train the RL agent to win money in our simulated environment.

Algorithm and experiment method

To estimate the performance of the different strategies, I split the records into rounds. Each round has 100 records: the first 20 are used for observation, and the agent plays on the remaining 80. This gives 900 rounds for training and 100 rounds for testing. I chose 20 for observation because casinos show the last 20 numbers by default.
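The split described above can be sketched in a few lines. The function name and the placeholder data are assumptions for illustration; in the real experiment the records would come from the simulated wheel.

```python
def split_into_rounds(records, round_size=100, n_observe=20):
    """Cut a flat list of spin results into (observation, play) pairs."""
    rounds = []
    for start in range(0, len(records) - round_size + 1, round_size):
        chunk = records[start:start + round_size]
        # First 20 results are only observed; the agent bets on the other 80.
        rounds.append((chunk[:n_observe], chunk[n_observe:]))
    return rounds


records = list(range(100_000))  # placeholder standing in for simulated spins
train_rounds = split_into_rounds(records[:90_000])
test_rounds = split_into_rounds(records[90_000:])
print(len(train_rounds), len(test_rounds))
```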

One of the most popular algorithms in RL is Q-learning. The ‘Q’ in Q-learning stands for quality: the method is based on the idea of assessing the quality of an action taken in a given state. We create a q-table to store the q-value of taking action a in state s. The q-values are initialized to zero and, as we observe the rewards r obtained from various actions, they are updated using the Bellman equation.
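The update rule itself is not reproduced in this text. In its standard textbook form, with a learning rate α and a discount factor γ (both symbols are assumed here, since the original figure is unavailable), the Q-learning update reads:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Here s' is the state reached after taking action a in state s, and the maximum runs over the actions a' available in s'.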

This seems reasonable, so can we apply it to our case directly?

Well, not so fast, there’s one little problem to be solved…

Remember that we take the last 20 numbers as the observation state? In European-style roulette there are 37 numbers on the wheel, so our state space contains 37²⁰ possible states.

So what is the problem? Suppose we use FP32 to store each q-value. With a state space of 37²⁰ states, the q-table would take about (number of actions) x 10³² bytes. This exceeds the capacity of any data center in the world, not to mention that we would also need a comparable amount of data to train the agent.
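The size estimate above can be checked with a few lines of integer arithmetic. The constant names are only illustrative.

```python
NUM_POCKETS = 37      # European wheel
HISTORY_LENGTH = 20   # number of previous results in the state
BYTES_PER_VALUE = 4   # one FP32 q-value

# Every possible sequence of 20 results is a distinct state.
num_states = NUM_POCKETS ** HISTORY_LENGTH
# Storage needed for one q-value per state, for a single action.
bytes_per_action = num_states * BYTES_PER_VALUE

print(f"states: {num_states:.3e}")
print(f"bytes per action: {bytes_per_action:.3e}")
```

The result is a little under 10³² bytes for each action, which is where the order of magnitude quoted above comes from.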

This is where deep learning comes in. We can use a deep neural network to replace the q-table; that’s the spirit of deep Q-learning.

In deep Q-learning, the q-network predicts the q-values of the actions, and the next action is determined by the maximum output of the q-network rather than by a lookup table. The weights of the neural network are adjusted using a loss function and backpropagation.
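The two pieces described above, picking the action with the largest predicted q-value and building a training target for the loss, can be sketched without any deep learning library. The epsilon-greedy exploration, the discount factor, and the squared-error loss are common choices that I am assuming here; the article does not say which ones the project uses, and the function names are hypothetical.

```python
import random


def select_action(q_values, epsilon=0.1, rng=random):
    """Pick the highest-valued action, exploring at random with probability epsilon."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Target for the q-value of the action that was just taken."""
    if done:
        return reward  # no future value after the round has ended
    return reward + gamma * max(next_q_values)


def squared_loss(predicted_q, target):
    """Loss whose gradient would be backpropagated through the q-network."""
    return (predicted_q - target) ** 2


q_out = [0.1, 0.9, 0.3]  # stand-in for the q-network's output for one state
action = select_action(q_out, epsilon=0.0)
print(action, squared_loss(q_out[action], td_target(1.0, [0.5, 2.0], gamma=0.5)))
```

In a real setup `q_out` would come from a forward pass of the network, and the loss would be minimized by a framework's optimizer.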

Back to our experiment: I picked three strategies to evaluate.

  1. Straight up: pick any single number, including 0
  2. 1 to 1 combinations: select from the bets that pay 1 to 1 (Odd, Even, Red, Black, 1 to 18, 19 to 36)
  3. 2 to 1 combinations: select from the bets that pay 2 to 1 (1st Column, 2nd Column, 3rd Column, 1st Dozen, 2nd Dozen, 3rd Dozen)

The agent chooses one available place to bet according to the strategy in use. There is also an option to pass, because it is more realistic to let the agent simply observe and wait for desirable conditions.
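The three action spaces and their rewards can be sketched as below. The strategy keys, the bet names, and the 35 to 1 payout for a single number are assumptions for illustration (the text only gives the 1 to 1 and 2 to 1 ratios). The red numbers and the column and dozen layout follow the standard roulette table as I understand it.

```python
RED = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}

# Each bet name maps to a test that says whether a spin result wins the bet.
ONE_TO_ONE = {
    "odd": lambda n: n % 2 == 1,
    "even": lambda n: n != 0 and n % 2 == 0,
    "red": lambda n: n in RED,
    "black": lambda n: n != 0 and n not in RED,
    "1-18": lambda n: 1 <= n <= 18,
    "19-36": lambda n: 19 <= n <= 36,
}
TWO_TO_ONE = {
    "col1": lambda n: n % 3 == 1,
    "col2": lambda n: n % 3 == 2,
    "col3": lambda n: n != 0 and n % 3 == 0,
    "dozen1": lambda n: 1 <= n <= 12,
    "dozen2": lambda n: 13 <= n <= 24,
    "dozen3": lambda n: 25 <= n <= 36,
}


def net_reward(strategy, action, number, bet=50):
    """Net change in bankroll for one step; action None means pass."""
    if action is None:
        return 0  # nothing staked, nothing won or lost
    if strategy == "straight_up":
        won, ratio = action == number, 35  # assumed single-number payout
    elif strategy == "one_to_one":
        won, ratio = ONE_TO_ONE[action](number), 1
    elif strategy == "two_to_one":
        won, ratio = TWO_TO_ONE[action](number), 2
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return bet * ratio if won else -bet


print(net_reward("one_to_one", "red", 0), net_reward("two_to_one", "col1", 1))
```

Note that 0 loses every 1 to 1 and 2 to 1 bet in this sketch, which is what gives the house its edge.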

Each round of the game starts with $1000, and every bet is $50. The round ends when the agent wins $2000 or runs out of money; it also ends after 80 steps. The experiment compares the total winnings, the winning percentage, and the number of steps needed to reach an end condition for three agents, each based on a different strategy.
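The round rules above can be sketched as a simple loop. The function and parameter names are hypothetical, and I am reading "wins $2000" as the bankroll reaching $2000, which is an assumption; change `target` if the intended meaning is a $2000 profit.

```python
def play_round(outcomes, choose_bet, settle, start=1000, bet=50, target=2000):
    """Play one round and return (final bankroll, steps taken).

    outcomes   : spin results available for play (80 per round in the article)
    choose_bet : hypothetical policy; returns a bet, or None to pass
    settle     : hypothetical payout rule; returns the net gain for a bet and a result
    target     : bankroll level that ends the round (assumed interpretation)
    """
    bankroll = start
    steps = 0
    for number in outcomes:
        # Stop once the target is reached or the agent cannot afford a bet.
        if bankroll >= target or bankroll < bet:
            break
        wager = choose_bet(bankroll)
        if wager is not None:
            bankroll += settle(wager, number, bet)
        steps += 1
    return bankroll, steps


# An agent that always passes ends every round even after all 80 steps.
print(play_round(range(80), lambda bankroll: None, lambda w, n, b: 0))
```

Because the loop iterates over the 80 play records, the 80-step limit is enforced by the length of `outcomes` rather than by a separate counter.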

Results and Discussion

The results of the 100 test rounds for the three agents are shown below:

[Figures: results for Agent One, Agent Two, and Agent Three]

Try to guess which strategy each agent corresponds to. Agent One has the highest winning percentage and won $47,450 in total. Most of the time it takes all 80 steps to reach an end condition. Agent Two tends to pass no matter what the current state is. As a result, all 100 rounds break even and each round terminates after 80 steps. Agent Three won the most money, with a winning percentage of 0.62. It also ends the game earlier than the others; most of the time it does not need all 80 steps.

It turns out that Agent One uses the 2 to 1 combinations strategy and Agent Two is based on the 1 to 1 combinations strategy. That leaves Agent Three with the Straight up strategy, the one that won 122.7 times the initial money!

My explanation for what we have seen here is that the reward ratio of the 1 to 1 strategy is relatively low and harder to optimize, so after some frustration in the early stage of training the agent decided to give up. The 2 to 1 strategy would be a good choice for a conservative player. If your goal is to make a killing, you should go with the Straight up strategy.

We have come to the very end of the article. We covered the basics of roulette and reinforcement learning and gave an example that combines the two by backtesting three roulette strategies with deep Q-learning. There is a lot more to learn, but hopefully this is enough to get you started and interested.

For future work on this project, I will improve the agent to support multiple bets and adjustable bet sizes. I will also try to add collisions to the simulated environment. If you have any suggestions or want to share some real data for research purposes, please feel free to contact me.

I hope you enjoyed the article and are willing to take it forward by trying reinforcement learning in your own applications. Thanks for reading!