Reinforcement learning on Reversing Stones

In this tutorial I will explain reinforcement learning. First I will explain what machine learning is; if you are already familiar with it, you can skip ahead to Reversing Stones. I will then explain reinforcement learning using an example that you can apply to the game Reversing Stones.

You can find the code on GitHub.

Machine learning

Machine learning is a technique in which, rather than writing a program that solves a certain problem, you write a program that teaches itself to solve that problem. Machine learning solutions can be categorized into the following branches:

  • Supervised learning;
  • Unsupervised learning;
  • Semi-supervised learning;
  • Reinforcement learning.

Supervised learning

To train a model (algorithm) that uses supervised learning, you feed it a (large) dataset. This dataset contains both the inputs and the desired outputs. A dataset is usually split into training data and test data. The model is trained on the training data and then tries to predict the outputs for the test data; these predictions are compared to the actual outputs. The split is typically 30-40% test data and 60-70% training data.

Neural networks

Both supervised learning and reinforcement learning can make use of neural networks. These are structures inspired by the human brain. A neural network consists of several “layers”, each consisting of a certain number of nodes. Traditionally, every node is connected to all nodes of both the next and the previous layer, and each of these connections has a weight. The inputs set the values of the input layer; these values then flow through the hidden layer(s) and end up at the output layer. When training a model / neural network, the weights between the nodes are adjusted to get closer to the desired output.

Reinforcement learning

When using reinforcement learning, you don’t need an existing dataset; instead, the data is generated as you go. The model tries to do the thing you want it to do (turn over tiles) and you reward it when it does something good (wins the game) or punish it when it does something wrong. These rewards are very important, since the model will try to obtain the largest reward possible. This can mean that the model does something completely unexpected. Using this technique, the model can play thousands or even millions of games in a few hours’ or days’ time.

The agent

The agent is the one taking actions, deciding which moves to perform, and is the component that is being trained to perform better. After the training is complete, we can take the agent and put it to work in another, similar environment.

The environment

This is what the agent takes actions in. The environment is a place with a certain set of rules. In our case it is the game: the board, the opponent’s stones, our stones and the empty tiles.

The interpreter

This is the part that judges whether an action is good or bad and rewards or punishes the agent accordingly. It is only used while training and is usually implemented as a reward function.

Reversing Stones

Play starts with a board with four stones on it (2 dark and 2 light) and you will be randomly allocated the colour dark or light.

Dark must place a stone with the dark side up on the board, in such a position that there exists at least one straight (horizontal, vertical, or diagonal) occupied line between the new stone and another dark stone, with one or more contiguous light stones between them.

After placing the stone, dark turns over (flips, captures) all light stones lying on a straight line between the new stone and any anchoring dark stones.

Then it’s light’s turn: they do the same, turning over dark stones. If a player cannot make a valid move, their turn is skipped.

The game ends when no player can make a valid move or the board is full. The winner is the player with the most stones of their colour on the board.


You can find more information about installing TensorFlow here. Please note that TensorFlow requires a 64-bit installation of Python 3.

To play the game itself, go to the downloads page of AIgaming and download the aigamingReversingStones library.

I use the following libraries for this project:

  • aigamingReversingStones;
  • TensorFlow;
  • NumPy.


First we will need to import all the required modules.

Next we will define some configuration variables; this way, if we decide to use a different board size, we can just alter these variables and start training.

  • CONFIGURATIONS: A list of all possible board sizes;
  • USE_CONFIGURATION: The index of the configuration we use;
  • BOARD_SIZE: The size of the board.
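As a sketch, these variables might look like this (the exact board sizes listed are assumptions; adjust them to taste):

```python
# A few possible board sizes; each entry is [#columns, #rows].
# The actual values are assumptions for illustration.
CONFIGURATIONS = [[6, 6], [8, 8], [10, 10]]

# The index of the configuration we use.
USE_CONFIGURATION = 1

# The size of the board: the total number of tiles.
BOARD_SIZE = (CONFIGURATIONS[USE_CONFIGURATION][0]
              * CONFIGURATIONS[USE_CONFIGURATION][1])
```

Changing USE_CONFIGURATION is then all that is needed to train on another board size.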

This is a function that initializes the model. We create a neural network that has one hidden layer that has the same number of nodes as the input and output layer. We initialise every layer to a pseudo-random value. Then we connect the layers and initialise a gradient descent optimizer.

Note: The variables below are not actual values. They only hold a value if you call[the variables you want to know the value of], feed_dict={values for the placeholders}).

  • input_positions: The inputs of the network / the input layer;
  • labels: The expected result of the network. This is made using the actual action you took rather than the list of probabilities that the network produces. It is represented using the index of the ‘1’ in the list;
  • learning_rate: How much the model learns from the current move. This depends on the action taken and how good that action ended up being;
  • W1: The weights that connect the input layer to the hidden layer;
  • b1: The biases of the hidden layer (this is a different value for every node);
  • h1: The actual hidden layer. This contains the values of the hidden layer’s nodes;
  • W2: The weights that connect the hidden layer to the output layer;
  • b2: The biases of the output layer;
  • logits: The values of the output layer;
  • probabilities: The values of the output layer, scaled to the range [0, 1];
  • cross_entropy: The difference between the logits and the labels, converted to values in the range [0, 1];
  • train_step: This tells the model how to minimize the cross_entropy using the step size determined by the learning_rate.
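To make the shapes concrete, here is the forward pass written out with NumPy rather than TensorFlow; the initialisation scale and the activation function are assumptions, so treat this as a sketch of the graph above, not the actual implementation:

```python
import numpy as np

BOARD_SIZE = 64  # one node per tile on an 8x8 board
rng = np.random.default_rng(0)

# Pseudo-randomly initialised weights and biases, as in the graph above.
W1 = rng.normal(scale=0.1, size=(BOARD_SIZE, BOARD_SIZE))  # input -> hidden
b1 = np.zeros(BOARD_SIZE)                                  # hidden biases
W2 = rng.normal(scale=0.1, size=(BOARD_SIZE, BOARD_SIZE))  # hidden -> output
b2 = np.zeros(BOARD_SIZE)                                  # output biases

def forward(input_positions):
    """One forward pass: input layer -> hidden layer -> probabilities."""
    h1 = np.tanh(input_positions @ W1 + b1)  # values of the hidden layer
    logits = h1 @ W2 + b2                    # values of the output layer
    exps = np.exp(logits - logits.max())     # softmax: map to range [0, 1]
    probabilities = exps / exps.sum()
    return probabilities

probabilities = forward(np.zeros(BOARD_SIZE))  # an empty board as input
```

The output is one probability per tile, which is exactly what the move-selection step below consumes.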

We can simply initialise a game using the library aigamingReversingStones. The initialise function has the following parameters:

  • player_index (0): The index of the player that initializes the game. This doesn’t affect the game;
  • player_ids ([“Player0”, “Player1”]): The names of the players in the game. This doesn’t affect the game;
  • move_time (3000): The time each player has to respond in ms;
  • board_dimensions ([8, 8]): The dimensions of the board [#columns, #rows];
  • shape (“Square”): Currently not in use;
  • holes (False): Currently not in use.

This function returns a dict with the following fields:

  • Result: “SUCCESS” if the game was successfully created;
  • StepTexts: Will hold an English description stating that the game has started;
  • SpecificGameState: A gamestate similar to the one you get when you play the game on the AIgaming website;
  • GeneralGameStates: A list of one or more gamestates. Use the last entry in this list to make the first move.

Process move

To make a move, we need to process the board through the neural network. Once this is done, we have a list of probabilities. We only want to make a valid move, so we remove all impossible moves from the probabilities and recalculate.

Note that np.random.choice needs the sum of all probabilities to be exactly 1. Therefore, we take all values to six decimal places. We then recalculate all probabilities so that the sum is exactly 1. If there are no possibilities left, we change the list to give all valid moves an equal chance.
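A sketch of that selection step (the function name and the boolean valid-move mask are my own; the real code works from the library's list of valid moves):

```python
import numpy as np

def choose_move(probabilities, valid_moves):
    """Pick a move index, considering valid moves only.

    probabilities: 1-D array from the network, one entry per tile.
    valid_moves:   boolean array, True where a move is legal.
    """
    probs = np.where(valid_moves, probabilities, 0.0)
    probs = np.round(probs, 6)   # np.random.choice wants the sum to be exactly 1
    if probs.sum() == 0:         # no probability mass left: all valid moves equal
        probs = valid_moves.astype(float)
    probs = probs / probs.sum()  # recalculate so the sum is 1
    return np.random.choice(len(probs), p=probs)
```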

We make a move in the game using the library aigamingReversingStones. The move function has the following parameters:

  • player_index (0): The index of the player that is making the move. This needs to be equal to gamestate[“Mover”];
  • player_ids ([“Player0”, “Player1”]): The names of the players in the game. This doesn’t affect the game;
  • gamestate ({}): The most recent gamestate, this is holding all data for the game;
  • current_move ([8, 8]): The move you want to make, this is in the format {“Row”: int, “Column”: int}.

This function returns a dict with the following values:

  • Result: Either “SUCCESS” if the move was valid, “INVALID_MOVE” if it wasn’t, or “GAME_HAS_ENDED” if the move finished the game;
  • StepTexts: Will hold an English description of the move that was made;
  • SpecificGameState: A gamestate similar to the one you get when you play the game on the AIgaming website;
  • GeneralGameStates: A list of 1 or more gamestates. Use the last entry in this list to make the next move.

Play game

To play a full game, we will first initialise player_index, player_ids and result. result will hold the state of the game and all information used by the library. There are also some logs that need to be initialised.

The game keeps running as long as the moves succeed. Once the game has finished, result[“Result”] will change to “GAME_HAS_ENDED” and we exit the loop. Every move follows the same pattern:

  • We get the current player;
  • We get the current board;
  • The board is converted to a uniform input; this is done so the algorithm always plays as player ‘0’;
  • We log the current board so we can use this to train later;
  • The model returns a move that we then convert to the correct format;
  • We then make the move and update result;
  • We then log the action taken and the probability of that action.

When the game has finished playing, we return all the logs and the final gamestate.
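The uniform-input conversion from the loop above could be sketched like this (the 0/1/None board encoding is an assumption for this illustration):

```python
def to_uniform_board(board, mover):
    """Relabel the board so the mover always appears as player 0.

    board: flat list of tiles, 0 or 1 for the players' stones and
           None for empty tiles (encoding assumed for this sketch).
    mover: index (0 or 1) of the player about to move.
    """
    if mover == 0:
        return list(board)  # already from player 0's point of view
    swap = {0: 1, 1: 0}
    return [swap.get(tile, tile) for tile in board]
```

This way the network never needs to learn both colours; it always sees itself as player 0.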

Learning / Rewarding

When the game ends, result[“WinnerIndex”] will hold one of three values: 0 or 1 if that player has won, or -1 if the game ended in a draw. You would usually create a rewards function that rewards each move individually depending on its contribution to winning the game. To keep this tutorial understandable, I give all moves the same reward. A winning game results in a reward of 1 for each move. This makes the model more likely to make that move again if it ever finds itself in a similar situation. If, however, the player didn’t win, I give a reward of 0. Note that this doesn’t stop the model from making that move again: we don’t know that the move was bad, we only know that the sequence of moves made by that player didn’t result in a victory this time.

Note: this code is included in the training part.
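As a standalone sketch, the reward rule described above boils down to this (the function name is mine; in the real code this logic sits inside the training loop):

```python
def rewards_for_player(moves, player_index, winner_index):
    """Reward every move 1 if this player won the game, otherwise 0."""
    reward = 1 if winner_index == player_index else 0
    return [reward] * len(moves)
```

A draw (winner_index of -1) therefore rewards neither player.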


To test how successfully the algorithm learned, I created a function that lets the model play a game against a randomly moving player. This function is similar to the play_game function.

To run this code, you can simply call it. It will print the results after every 100 games played and return the overall percentage.
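The benchmarking boils down to a loop like this (here the game itself is stubbed out by a function that reports whether the model won; the 70% win rate is an arbitrary stand-in):

```python
import random

def benchmark(play_one_game, number_of_games=1000):
    """Play a number of games and report the model's overall win rate.

    play_one_game is assumed to return True when the model wins.
    """
    wins = 0
    for game in range(1, number_of_games + 1):
        if play_one_game():
            wins += 1
        if game % 100 == 0:  # print the results after every 100 games
            print(f"{game} games played, win rate so far: {wins / game:.1%}")
    return wins / number_of_games

# Stub opponent for illustration: pretend the model wins 70% of games.
random.seed(42)
overall = benchmark(lambda: random.random() < 0.7)
```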

Executing the training

Before we start the training, we want to execute all initializing functions. You also have to create and initialise a Tensorflow session.

ALPHA is a multiplier on the training steps the model will make. This parameter helps prevent underfitting and overfitting. It is very algorithm dependent: when you, for example, change the rewards, you need to find the value that works best for that algorithm. This is simply done by trying different values (I usually change it by a factor of 10 until I’m happy with the result).

The all_ variables are for logging purposes.

I do quite a lot of logging during the training so that I know what is happening. All print statements and time-calls are directly related to the logging. The result is that I know how long the training took and how much longer it will take to finish the training. NUMBER_OF_GAMES defines how many games are played during the training.

A game consists of the following steps:

  • First, the actual game is being played;
  • We log all the things we might want to look into;
  • We print the progress every percentage;
  • For both players, we check if they won and reward accordingly;
  • We then train for each player, every move of the game with the input and what move they actually took. We then make a training step that depends on ALPHA and how their game went.
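The last two steps above could be sketched as follows (train_step here is a stand-in for feeding the logged input and the taken action into the model's training op, and the way the learning rate combines ALPHA and the reward is an assumption):

```python
ALPHA = 0.01  # global multiplier on the step size

def train_on_game(board_logs, action_logs, rewards, train_step):
    """Run one training step per logged move of one player.

    train_step(board, action, learning_rate) stands in for the real
    session.run call on the model's training op.
    """
    for board, action, reward in zip(board_logs, action_logs, rewards):
        learning_rate = ALPHA * reward  # a reward of 0 means no update
        if learning_rate != 0:
            train_step(board, action, learning_rate)
```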

After the game has finished, we log some data that may be of interest.

Note: In this game, “Avg of last step” doesn’t indicate how well the model is doing. In another game where there is a more explicit measure of progress, you might want to print that instead.

Note: If you are using Jupyter Notebook, you can re-execute this cell to keep improving the model.

Using the trained model at AIgaming

Before we can use the model we need to save it. This can be done in a number of ways; I prefer to use tf.train.Saver().

Before loading, you need to initialise all variables in the same way as during training; you can then load the model from these files. Note that you need all the files the saver created. Once the model is loaded, you can use it like I did during training.

The best part is to have your code play against other people’s code. This can be done at AIgaming. You can use load_model, initialise_tf and guess_move without many changes (just disable the benchmarking).

To make the algorithm work, we need the model. I use Google Drive to make my model accessible over the internet. You can do this by creating a shareable link and copying the id (you will get something like this: “{id}/view?usp=sharing" or like this: “{id}"). Do this for all four files and replace the ‘xxxxx’ ids in the list of links. You can use a different service if you prefer; this is just the way I implemented it in my code.

The calculateMove function is the main function of the program. It is executed every time you need to make a move. We first check if the model exists in the /tmp folder (this is the only folder you can access). If the model is not there, we download it there. If it is there, we load the session and guess a move. Simply return that move in the correct format and you’re done.
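A sketch of that caching logic (the checkpoint file names and the download callback are placeholders, not the actual ones from the project):

```python
import os

MODEL_DIR = "/tmp"  # the only folder the bot can access
MODEL_FILES = ["model.ckpt.index", "model.ckpt.meta"]  # placeholder names

def model_is_cached(model_dir=MODEL_DIR, model_files=MODEL_FILES):
    """True when every file the saver created is already present."""
    return all(os.path.exists(os.path.join(model_dir, f)) for f in model_files)

def ensure_model(download, model_dir=MODEL_DIR, model_files=MODEL_FILES):
    """Download the model once; later moves reuse the cached copy."""
    if not model_is_cached(model_dir, model_files):
        download(model_dir)  # e.g. fetch the shared files from Google Drive
```

Because the files survive between moves, only the very first move of a match pays the download cost.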

If everything works out, you will be able to defeat housebot-practise with ease.

What’s next

If you managed to get this working, you can start iterating and create a more successful model / algorithm.

One of the biggest improvements you can make is in the rewarding of moves. Currently all moves in a game are rated equal, if you were to write a function that could detect which move is better and which is worse, you would be able to make significant improvements.

An example of this would be a function that counts how many stones you convert. You then rate every move accordingly and reward better moves more and worse moves less (negative or 0). You can increase this reward when there are fewer of the opponent’s stones on the board. Other things to consider are whether a stone has some strategic importance later in the game or whether it is in a good location.

Another thing that you might want to do is use multiple models. You can, for example, use one model for each colour, or one model for the first part of the game and another to finish the game, or combine the two.

When you feel confident, I would strongly suggest trying to create an algorithm to train on a different game or challenge.

About the author

This article is written by William Verhaeghe.