# Solutions for a Dice Game

## Introduction

I love board games, and I love thinking about board games in mathematical frameworks. There, I said it. The distributions of random events, suboptimal steps, the relative strength of player vs engine as the game progresses in different co-operative games, etc, I could talk about these for hours. One of the most common mathematical features in tabletop games is the outcome of dice rolls. There are tons of games out there with random number generators of varying complexity that all run on your everyday, humble, 6-sided dice.

In this post we are going to formulate a problem around a simple dice game, and we are going to solve it in increasingly (over)complicated ways. I have heard about job interviews where this question came up, so we will start with a non-technical way of approaching the problem, briefly describe how it can be solved with a recursive function, and then dive into the world of reinforcement learning.

If you are familiar with the basic probability and programming aspects, or have worked out this example before, I recommend that you skip Solutions 1 and 2, and head straight to Solution 3, where the reinforcement learning magic happens.

Please refer to my **GitHub repository** for the calculations and code, done in Python, organised in Jupyter Notebooks.

## The Problem and the Notations

Here is our problem in simple and sloppy terminology: you have some dice that you roll one by one, decide when to stop, and get the final die roll as the outcome of the game.

Giving it a somewhat more structured, mathematical notation, our game is denoted *G(n)*, where *n* is the maximum possible number of dice rolls. The game is played in a series of rounds, *k* rounds to be precise, where *k ≤ n*. Every round starts in a random state, denoted *S*, described by two indicators: the current reward, *d(i)*, and *i*, the number of rounds remaining after the current one. Note that we are “counting from the end”: *i* is the number of potential rounds remaining, not the number of rounds that have already passed. In every state, *d(i)* is a value of a random variable with a discrete uniform distribution between 1 and 6 (i.e., the result of a completely fair die roll). This will make more sense later on. *A(i)* is the action the agent takes, based on the information in the current state, *S(i, d(i))*. The two possible actions are *Continue* and *Stop*. If the chosen action is *Continue*, we go through the same process again, starting with a fresh die roll. If the chosen action is *Stop*, the game ends, with the current die roll as the result.

You probably think it is a bit of over-engineering, but building our problem in this structured way will make our life much easier going forward, when we come to the simulations and reinforcement learning.
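To make the notation concrete, here is a minimal sketch of the game loop in Python (the function name and signature are my own illustration, not taken from the repository):

```python
import random

def play_game(n, policy):
    """Play G(n): roll up to n dice, stopping when the policy says so.

    policy(i, d) -> True to Continue, False to Stop, where i is the
    number of rounds remaining after the current roll d.
    """
    for i in range(n - 1, -1, -1):   # i counts remaining rounds, down to 0
        d = random.randint(1, 6)     # d(i): a fair six-sided die roll
        if i == 0 or not policy(i, d):
            return d                 # Stop (or forced stop): keep this roll
```

For example, `play_game(5, lambda i, d: False)` always keeps the very first roll, while `play_game(5, lambda i, d: True)` always ends up with the last, fifth roll.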

## Solution 1 — Pen and Paper

Let’s see how we would solve this problem in the old-fashioned way, using only a pen and some paper. And possibly a calculator. And our math knowledge.

The “trick” is that you have to think backwards, and first determine the optimal policy in *S(1, d(1))*, as a function of *d(1)*. This might be your instinct when you approach the project, and without going into too much detail, I find it useful to try and conceptualise why our instinctive feeling is correct.

- First of all, the problem is a discrete Markov process, which basically means that in every state, the future states only depend on the present one, and not on the ones in the past. This also means that we can find the optimal action in any given state without considering the past states. Check out the relevant Wikipedia page for more information.
- Second, following Bellman’s Principle of Optimality, we can apply a recursive approach to the optimisation: we have to have an optimal strategy applied at the very end of the game in order to find the optimal steps before it. Once again, check out the relevant Wikipedia page for more information.

With these in mind, let’s start working our way back from the very end. We have no meaningful action at *i = 0*: if we are at that point, the last die roll is returned as the result of the game. So our last possible action is at *i = 1*. Given *S(1, d(1))*, our two options are *Continue* and *Stop*. What are the two expected payouts?

- If we *Continue*, we get the expected result in *S(0, d(0))*, which equals the expected value of *d(0)*: 3.5.
- If we *Stop*, on the other hand, we get the value of the actual roll in *S(1, d(1))*, which is *d(1)*.

Considering our aim is to maximise the expected outcome of the game, we can easily formulate our optimal policy:

*A(1) = {Continue if d(1) < 3.5, Stop if d(1) ≥ 3.5 }*

We could rephrase the formula with discrete values, but I did not find it necessary. From here on, when we refer to *A(i),* we mean the optimal action.

All right, we know what to do if there is one round left, so what’s next? Let’s consider our options at *i = 2*, assuming we follow the optimal *A(1)* strategy. Let’s introduce a new variable, *R(i)*: the expected outcome of the remaining rounds of the game, assuming we follow the optimal actions for all remaining rolls. So in state *S(2, d(2))*, the agent has two options: *Stop*, and get *d(2)*, or *Continue*, and get an expected result of *R(2)*.

All we need is the expected value of *R(2)*, which can originate from two random variables: *R(2)* will be equal to either *d(1)* or *d(0)*. If *A(1)* is *Stop*, we get *d(1)*; if it is *Continue*, we get *d(0)*. As we have shown above, *A(1)* depends on *d(1)*, and it will be *Continue* with a probability of 0.5, and *Stop* with a probability of 0.5. We know the expected value of *d(0)*, but what is the expected value of *d(1)*? Well, on its own, 3.5, of course, but we are only considering the cases where the decision at *A(1)* was *Stop*, which happens if *d(1) ≥ 3.5*, meaning it is 4, 5, or 6, the expected value of which is 5.

In a more structured formula:

*R(2) = Pr { A(1) = Stop } * E { d(1) | A(1) = Stop } + Pr { A(1) = Continue } * E { d(0) | A(1) = Continue}*

The | signs in the formula denote conditioning; you can find more information on conditional probability and conditional expectation elsewhere. *E { d(1) | A(1) = Stop }* means the expected value of a die roll, given that our chosen (optimal) action based on that die roll was to *Stop*. In the second part of the equation, we can omit the condition from the expected value: the value of *d(0)* does not depend on our decision *A(1)*, because, remember, Markov process.

Let’s insert the numbers in the equation above:

*R(2)* = 0.5 * 5 + 0.5 * 3.5 = 4.25
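This back-of-the-envelope arithmetic is easy to verify in a few lines of Python (variable names are mine):

```python
import numpy as np

ev_die = np.mean(range(1, 7))   # expected value of a fair die roll: 3.5
ev_stop = np.mean([4, 5, 6])    # E{ d(1) | A(1) = Stop }: we only kept 4, 5 or 6
ev_cont = ev_die                # E{ d(0) }: a fresh roll, unconditionally 3.5

# R(2) = Pr{Stop} * E{d(1) | Stop} + Pr{Continue} * E{d(0)}
r2 = 0.5 * ev_stop + 0.5 * ev_cont
```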

Which means that if we choose *Continue* for *A(2)*, we can expect 4.25 in the following two rolls. Formulating this:

*A(2) = {Continue if d(2) < 4.25, Stop if d(2) ≥ 4.25 }*

That’s interesting, it turns out if we have two remaining dice rolls, it is worth re-rolling 4’s as well. So the next question to consider would be when to re-roll 5’s.

We could continue calculating the results iteratively to any number of potential dice rolls, but that is why we have machines, and in the next section, we are going to examine a recursive function that easily handles this.

## Solution 2 — Recursive Function

Without further ado, this is one way of coding the iterative process above:

```python
import numpy as np

def R(i):
    """
    Determines the expected outcome of i remaining dice rolls,
    assuming we follow the strategy that maximizes expected value.
    """
    if i == 1:
        result = np.mean(range(1, 7))
    else:
        r_prev = R(i - 1)              # expected value of continuing optimally
        prob_reroll = int(r_prev) / 6  # rolls below R(i-1) get re-rolled
        not_rerolled_outcomes = range(int(r_prev) + 1, 7)
        curr_expected_assuming_no_reroll = np.mean(not_rerolled_outcomes)
        result = prob_reroll * r_prev + \
            (1 - prob_reroll) * curr_expected_assuming_no_reroll
    return result
```

The results of the function above (rounded to 2 decimal points) are:

- R(1) = 3.50
- R(2) = 4.25
- R(3) = 4.67
- R(4) = 4.94
- R(5) = 5.13

That is as far as we need to go, this is enough to formulate a policy that handles the actions and gives an optimal action under every possible state in a *G(n)*.

Optimal policy in state *S(i, d(i))*, *i > 0*, is to **Continue**, **unless** one of these statements is true:

- you rolled a 6: **d(i) = 6**;
- you rolled a 5 and there are 4 or fewer remaining rounds after the current round: **d(i) = 5 and i ≤ 4**;
- you rolled a 4 and there is only 1 round remaining after the current round: **d(i) = 4 and i = 1**.
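These rules translate directly into a tiny policy function (the function name and the True = *Continue* / False = *Stop* encoding are my own):

```python
def optimal_action(i, d):
    """Optimal action in state S(i, d) for i > 0.

    Returns False (Stop) if one of the stopping rules applies,
    True (Continue) otherwise.
    """
    if d == 6:
        return False          # always keep a 6
    if d == 5 and i <= 4:
        return False          # keep a 5 with 4 or fewer rounds left
    if d == 4 and i == 1:
        return False          # keep a 4 only at the very last decision
    return True               # otherwise, re-roll
```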

## Solution 3 — Reinforcement Learning

We have the solution, does that mean we are done? Well, no, far from it, there is plenty more left to play with here. It’s time to enter the world of simulations! In this last section, we are finally getting to training an algorithm to play the game.

*Sources:* I heavily relied on the book *Reinforcement Learning: An Introduction* by Richard S. Sutton and Andrew G. Barto. It is a fantastic summary of the topic, its history, applications, and potential future development. If you do not want to go back to the source, basically all the blog posts I found on reinforcement learning summarise certain sections of this book. There are also great blog posts out there detailing, for example, the process of teaching a tic-tac-toe program with reinforcement learning.

Before we jump to the exciting part, a quick introduction to the main concepts used.

*Reinforcement learning* is a type of machine learning process where the algorithm, represented by an *agent* (or agents) in a simulated *environment*, learns to choose its *actions* in different *states* based on the cumulative *rewards* it has gathered in the past. *Exploration* and *exploitation* are the two possible strategies the agent can follow when deciding on actions, and one of the central points of reinforcement learning is to find a balance between these two routes. If the agent decides to explore, it will try to gather more information about the environment, without considering the consequences of taking a suboptimal route. If it decides to exploit, the experience of past explorations is used to determine the action that promises the highest expected reward.

- One of the classic examples is the *Multi-Armed Bandit Problem*, in which the agent faces a number of slot machines that all have unknown and potentially different payout distributions. The resources are limited, and the agent must decide how long to spend trying to figure out which machine pays the highest rewards (the exploration phase), and then how long to spend playing the machine it deemed best (the exploitation phase).
- *ϵ-greedy* is a type of policy used to decide between exploration and exploitation: the agent explores with probability *ϵ*, and exploits otherwise.

It’s time to build a proper model for this whole process. Please check out the GitHub repository for the details; here is an outline of the project:

- *Game class*: handles and keeps track of the states, generates a new state based on the action of the Agent, and contains the reward logic.
- *Agent class*: its most important function is to decide what action to take based on the state; it also keeps track of the current game’s experience, as well as the cumulative rewards.
- *main function*: handles the whole simulation, resets the Game objects, makes the Agent assign rewards between games, and loops through the process a given number of times.

Let’s start by discussing the Agent’s decision process. The first thing it decides, every time it needs to come up with an action, is whether to explore or exploit in that turn. In our case, this is handled by the exploration_rate parameter, which is set to 1 at the beginning, and “decays” by a certain amount every turn, until it reaches the min_exploration_rate, 0.1. So we are using a type of ϵ-greedy policy, where *ϵ* starts at 1 and gradually decreases. Every time the Agent needs to decide which way to go, a random *U(0,1)* number is generated, and the Agent will explore if the random number is lower than the current exploration_rate. So in the beginning, every action is exploration (which makes sense, considering we have no information to exploit), and the probability of exploring gradually decreases, but never becomes zero.
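This decision mechanism can be sketched in a few lines (the linear decay schedule and the decay amount are assumptions on my part; min_exploration_rate matches the 0.1 mentioned above):

```python
import random

def choose_mode(exploration_rate):
    """Return 'explore' with probability exploration_rate, 'exploit' otherwise."""
    return 'explore' if random.random() < exploration_rate else 'exploit'

def decay_rate(exploration_rate, decay_amount=1e-4, min_exploration_rate=0.1):
    """Decrease the exploration rate by a fixed amount, never below the minimum."""
    return max(min_exploration_rate, exploration_rate - decay_amount)
```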

And what happens after the Agent decides what kind of action to take? It’s quite simple, actually: under the explore option, it basically flips a coin between True and False (which stand for *Continue* and *Stop*). Under exploit, the Agent checks its record of the previous cases when it was in the exact same situation, compares the cumulative rewards it got when it chose each of the two actions, and chooses the action that resulted in a higher reward in the past.

You might wonder at this point how we are ever going to achieve the “first level of intelligence” in this problem, which I define for myself as the point where the agent realises that if it has many remaining rolls, 4’s and 5’s should be re-rolled as well, since this decision depends on the understanding that optimal actions will be taken in future turns.

Well, the solution is a formula that I found to be the most exciting when working on this project, in its generalised form:

*V(t) ← V(t) + α * [G(t) − V(t)],*

where *V(t)* is the value of a state-action pair at time *t* in the process, *G(t)* is the reward (please note that it can be defined flexibly, so it could be the reward assigned at the end of the episode, or it could be derived from the value of the following state), and *α* is the so-called step size, or learning rate. We will use a simple version of the general formula where the rewards are handed out at the end of the game, and earlier steps receive the full reward, without discounting it.
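In code, the update is a one-liner (a sketch; the function name and the default alpha are hypothetical):

```python
def update_value(v, g, alpha=0.1):
    """One learning step: V(t) <- V(t) + alpha * (G(t) - V(t)).

    v is the current estimate of a state-action value, g is the observed
    return, and alpha is the step size (learning rate).
    """
    return v + alpha * (g - v)
```

Repeated updates form an exponentially weighted average of past returns, which is why older experiences fade while newer ones carry more weight.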

So what actually happens in the simulation? When a game finishes, and the Agent takes some time to reminisce about the fun it had playing the game, it assigns the new reward to the actions it made during the game. (Please note that I re-coded the final die outcome (1 to 6) to a (-1 to 1) reward, mainly for better comparability with other systems; a linear transformation should not have an impact on the efficiency of the algorithm.) However, simply overwriting the value with the latest outcome would mean that we keep none of the previous experience. We obviously do not want that. An alternative is to keep an average of the results, which is how the classic Multi-Armed Bandit Problems work. In this case, however, we would easily get stuck in a suboptimal policy. So what the formula actually does is take the difference between the actual reward and the previous estimate of the value of the action, and add a small portion of that difference to the historical evaluation.
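The recoding mentioned in the parenthesis is just a linear rescaling; a sketch of how it could look (my own version, assuming the midpoint 3.5 maps to 0):

```python
def recode_reward(die):
    """Map a die outcome in {1, ..., 6} linearly onto [-1, 1]."""
    return (die - 3.5) / 2.5
```

With this mapping, stopping on a 2 is worth -0.6 and stopping on a 5 is worth 0.6, consistent with the values in the example output later in this post.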

One really great aspect of using this formula is that older rewards become less and less impactful as time goes by, and newer experiences have a higher weight in the formula. (It’s not immediately apparent, at least it wasn’t for me, see the Reinforcement Learning book if you are interested in the proof of this theory.)

As a result, when our Agent starts exploring, and comes to the wrong conclusion that it is not worth re-rolling 4’s and 5’s, because everything is random anyway, that false notion can be corrected by future exploration steps, once the later steps are already optimised and exploitation works correctly. In some problems, you exploit because you want an actually higher outcome; here, the reason for exploiting is so that our explorations elsewhere can result in a signal that accurately describes the expected outcomes of the explored action.

Let’s see one example of a *G(5)*. The states on the left are coded as <current_die>-<remaining_rounds>; the first element in the arrays is the cumulative reward the Agent assigned to the False (*Stop*) action, the second is what it assigned to the True (*Continue*) action, and finally we can see the decision it would make, facing the same state now.

```
1-1 [[-0.99997397 -0.01910152]] True
1-2 [[-0.99999998 0.25753547]] True
1-3 [[-1. 0.40410508]] True
1-4 [[-1. 0.53263981]] True
2-1 [[-0.59999153 -0.01650078]] True
2-2 [[-0.59999999 0.2838495 ]] True
2-3 [[-0.6 0.38229775]] True
2-4 [[-0.6 0.49330816]] True
3-1 [[-0.19999737 -0.00587907]] True
3-2 [[-0.19999999 0.22689457]] True
3-3 [[-0.2 0.37763388]] True
3-4 [[-0.2 0.4962943]] True
4-1 [[0.2 0.03144677]] False
4-2 [[0.2 0.23823287]] True
4-3 [[0.2 0.47342095]] True
4-4 [[0.2 0.5217909]] True
5-1 [[0.6 0.02504011]] False
5-2 [[0.6 0.23787624]] False
5-3 [[0.6 0.42414841]] False
5-4 [[0.6 0.49288091]] False
6-1 [[ 1. -0.02222511]] False
6-2 [[1. 0.28468875]] False
6-3 [[1. 0.37943137]] False
6-4 [[1. 0.5282986]] False
```

Great, it worked! We only had to set the algorithm loose in the environment without any details, and it managed to correctly set up the optimal strategy for the scope in question, exactly as we outlined above in Solutions 1 & 2.

If we analyse the two values in each row, the number on the left representing the value the algorithm put on *Stopping* in that state (False), and the number on the right representing the value of *Continuing* (True), there are a couple of interesting conclusions we can draw:

- Stopping at any state yields a deterministic result (-1 if we stop at 1, -0.6 if we stop at 2, etc). This is very straightforward.
- Continuing at any current die roll in a state with one remaining roll (so in a state coded as <k>-1) has a value of around 0. This is because the result of these games will be a value of a random die roll, expected value of which is 3.5, and that is coded as 0 in our reward function.
- The learning rate is a great concept when it comes to organically building on previous strategies, but a disadvantage is that the final value estimates will be less accurate. The last increment added to the action values discussed in the previous point could be, for example, learning_rate * 1. As a result, we will be close to the theoretical value of 0, but we will never exactly converge to it; the estimate is constantly pushed off by new experiences.
- Similar to the <k>-1 states, all the other <k>-<i> states grouped by <i> vary around the same value. This makes sense: the outcome of deciding to *Continue* with a certain number of remaining rolls should be the same, regardless of the die shown prior to the decision. But there is a difference between, say, 6-4 and 1-4 in the table above: 0.5283 vs 0.5326. The reason is, once again, the learning rate; even with a sufficiently large number of runs, the values will not converge exactly. The discrepancy could be avoided by grouping the action outcomes, pooling the results of actions coming from different states whenever the outcomes come from the same distribution.

There is definitely room for improvement when it comes to accuracy or efficiency, but as we will see in the next section, as a tradeoff the algorithm is very robust.

## What I Learned

In retrospect, I think it was very much worth going through all this trouble to get the results of a problem we had already easily solved without any simulations.

I had to think really thoroughly about the meaning of each of the parameters, and had to tweak them a lot to get the desired result. In my experience, setting the min_exploration or the learning_rate parameters too low would greatly decrease the efficiency of the algorithm, and many more simulation rounds would be necessary to arrive at the optimal strategy. Setting them too high, on the other hand, could result in a suboptimal policy. For example, if min_exploration is too high, the outcome is still completely random in a large portion of the runs, which decreases the chance of correctly identifying the optimal actions for states that occur early in the game.

One other thing that I slowly came to realise is that I should not underestimate the power of the algorithm that I trained. It seems like overkill at first sight, but let’s not forget that the problem it solves is not exactly the same problem as in Solutions 1 & 2, the Agent managed to solve the question without being aware of a lot of crucial information! Just a couple of things the Agent didn’t know:

- anything about the distributions of the dice rolls;
- the sequential nature of the game;
- that the different states are connected, and they do not have to be solved on a state-by-state basis;
- anything about the rules, for example the fact that if it does not accept any of the dice, it will be left with a random roll after a while.

When you make a list of the unused information, it is kind of amazing that the Agent managed to teach itself the optimal policy at all, however inefficient it might have been!

## Conclusion

There are many ways to enhance the current theoretical problem. One example would be setting the explore/exploit split differently for states based on remaining rolls, so that it decays very rapidly at the end, and slowly in the beginning.

Enlarging the environment and the game rules is another possible route, a more complicated sequential dice game (involving sums of certain rolls, varying addition rules, manipulating dice rolls, etc) can be solved in the same way.

If you are interested in more complicated theoretical aspects, I recommend that you check out Sequential Multi-Armed Bandits. I would also like to mention that there is a Python library built to solve reinforcement learning problems like this, called gym, but it felt more appropriate to build everything from scratch in order to better understand the process. I intend to explore this topic further; I feel like this is not even the tip of the iceberg yet.