Training Tic-tac-toe AI in Unity ML

AurelianTactics
Sep 14, 2020

I plan on doing a Reinforcement Learning (RL) project on a turn-based environment using Unity Machine Learning (ML). Before I dive into that project, I wanted to refresh my Unity ML knowledge and start with a simpler turn-based environment. In this post I'll walk through setting up a tic-tac-toe game in Unity ML and training an AI to play it. While I managed to get this project working and train a tic-tac-toe agent, I'm not sure I implemented everything correctly, and I'm not confident at all that I'm following best practices for a Unity ML project. I'll end the post with some of the questions and uncertainties I had with using Unity ML and where I plan on going next.

Unity ML Refresher

Unity ML-Agents is a package that lets you train RL agents in Unity environments. Since I last wrote about Unity ML, the package has released version 1.0 and undergone some changes. To refresh my knowledge, I read through the getting started guide, re-implemented the RollerBall example environment, went through a free mini-course on Unity Learn, read through the docs, and went through all the code and Unity scene setups in the example environments.

Tic-Tac-Toe Environment

In the Unity framework, Scenes contain the environment. The tic-tac-toe scene is where I create and run the RL environment. I made a GameObject called Area to hold my game board. The board itself is a flat plane in the shape of a square with the main camera focused on it. To simulate placing the X's and O's for tic-tac-toe, I created a GameObject on each square of the board. When an agent makes a move and places an X or O on the board, the corresponding GameObject is activated. Attached to the Area GameObject is the TicTacToeArea.cs script, which holds the code for running the game.

View of the board. The green +'s represent the O's
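To make the setup concrete, here is a rough sketch (not my exact script) of how the piece GameObjects get activated when a move is played. The field and method names here are just illustrative:

// Sketch only: field and method names are illustrative, not my exact TicTacToeArea.cs.
using UnityEngine;

public class BoardSketch : MonoBehaviour
{
    // One X piece and one O piece sit on each of the 9 squares, all deactivated at the start.
    public GameObject[] xPieces = new GameObject[9];
    public GameObject[] oPieces = new GameObject[9];

    // 0 = empty, -1 = X, 1 = O
    private int[] boardState = new int[9];

    public void PlacePiece(int square, bool isO)
    {
        boardState[square] = isO ? 1 : -1;
        // Activating the GameObject makes the piece visible on the board.
        (isO ? oPieces : xPieces)[square].SetActive(true);
    }

    public void ResetBoard()
    {
        for (int i = 0; i < boardState.Length; i++)
        {
            boardState[i] = 0;
            xPieces[i].SetActive(false);
            oPieces[i].SetActive(false);
        }
    }
}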

Area also holds the agents that are trained to play tic-tac-toe. I created an agent for O and X and used Unity ML’s self-play options to train the agents against each other. In self-play, the AI trains by playing against itself. An alternative way of training the AI would be to create one agent and have it learn against a random or pre-programmed opponent. Attached to each agent is a BehaviorParameters.cs script for training the Unity ML agent and a TicTacToeAgent.cs script I wrote.

The docs have an explanation for the Behavior Parameters. In this project I have 10 observations: 1 for each of the 9 squares of the game board and 1 for which team the agent is on. The action space is also 10: a ‘no action’ option and 9 actions for placing a piece on a game board square. Since this is self-play, I set the Team Id to 0 for the O agent and 1 for the X agent. The model attached to the agents is the neural network produced from training the AI. The Behavior Type lets you determine who controls which agent. If Behavior Type is set to Heuristic, the human player can control the agent by pressing q, w, e, a, s, d, z, x, or c to play a piece on the game board.

I turned the Area GameObject into a Prefab. Prefabs in Unity allow for easy copying and editing. This is helpful in my example because during training I have 12 copies of Area running simultaneously, each holding its own board and its own agents. By running multiple copies, the agents can train much faster.

Using prefabs to run multiple game boards

The game and agent logic takes place in the TicTacToeArea.cs and TicTacToeAgent.cs scripts which I’ll go over next.

TicTacToeArea.cs Script

The game logic runs in this script. The TakeAction() and UpdateBoard() functions take input from the agent and update the board accordingly. A few functions check whether the game is over and what the result is. After the game is over, I set each agent's reward with Unity ML's SetReward() function (1 for a win, 0 for a draw and -1 for a loss) and call the EndEpisode() function before resetting the game board.
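As a rough sketch of that end-of-game flow (CheckWinner() and the agent references are placeholders, but SetReward() and EndEpisode() are the actual ML-Agents calls):

// Sketch of the end-of-game handling in an area-style script.
// CheckWinner() is a placeholder; SetReward()/EndEpisode() come from the Agent class.
using Unity.MLAgents;
using UnityEngine;

public class GameOverSketch : MonoBehaviour
{
    public Agent oAgent;
    public Agent xAgent;

    void ResolveGameIfOver()
    {
        // Placeholder convention: 1 = O wins, -1 = X wins, 0 = draw, 2 = still playing.
        int winner = CheckWinner();
        if (winner == 2) return;

        oAgent.SetReward(winner == 0 ? 0f : (winner == 1 ? 1f : -1f));
        xAgent.SetReward(winner == 0 ? 0f : (winner == -1 ? 1f : -1f));
        oAgent.EndEpisode();
        xAgent.EndEpisode();
        // ...then reset the board for the next game.
    }

    int CheckWinner()
    {
        // Check the rows, columns and diagonals of the board state here.
        return 2;
    }
}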

TicTacToeAgent.cs Script

The agent script inherits from Unity ML's Agent class. The agent waits until it is its turn (a value updated by the TicTacToeArea.cs script) and then calls RequestDecision(). From my understanding, the RL loop of observation-decision-action-reward repeats when the Agent requests a decision. This can be done by adding a DecisionRequester.cs script to the Agent GameObject or by calling RequestDecision() manually. Most of the Unity ML examples use the DecisionRequester.cs script, but those are not turn-based games, and the docs recommend calling RequestDecision() for turn-based games. The GridWorld example has RequestDecision() called in the Agent script, so that's where I put my RequestDecision() call. It may make more sense to have RequestDecision() called from TicTacToeArea.cs since that script manages the turns.
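Roughly, the turn check looks something like this (the isMyTurn flag is illustrative; RequestDecision() is the real Agent method):

// Sketch only: the agent asks for a decision once per turn instead of every step.
using Unity.MLAgents;

public class TurnSketch : Agent
{
    // Set by the area script when it becomes this agent's turn.
    public bool isMyTurn;
    private bool decisionRequested;

    void FixedUpdate()
    {
        if (isMyTurn && !decisionRequested)
        {
            decisionRequested = true; // avoid requesting several decisions in one turn
            RequestDecision();        // triggers CollectObservations() and OnActionReceived()
        }
        else if (!isMyTurn)
        {
            decisionRequested = false;
        }
    }
}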

Once RequestDecision() is called, the CollectObservations() and OnActionReceived() functions are run. These functions are overrides of the Agent class and are written by the developer depending on the environment. CollectObservations() gathers the observations that the AI uses to decide what action to take and to train the neural network. Observations are added with sensor.AddObservation() and Unity ML takes care of the rest. For the observations I have the agent's team ID (since the O agent and X agent will react differently to the same board) and the status of the 9 squares of the game board (0 for empty, -1 for X and 1 for O). One of the open questions I have is how I can view the actual observations that are fed to the neural network. I'm used to always debugging my observations so I can see what the agent sees and troubleshoot it. I wasn't able to do that in this case, and only after multiple training runs did I find out that visual observations were accidentally being sent to the neural network. I'm also not quite clear on how the self-play features affect the observations and would like to see the observations to clarify that as well.
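The observation code itself is short. A sketch of roughly what mine looks like (boardState and teamId are illustrative names maintained elsewhere):

// Sketch of the 10 vector observations: the team ID plus the 9 board squares.
using Unity.MLAgents;
using Unity.MLAgents.Sensors;

public class ObservationSketch : Agent
{
    public int teamId;                    // 0 for the O agent, 1 for the X agent
    public int[] boardState = new int[9]; // 0 = empty, -1 = X, 1 = O

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(teamId);    // 1 observation for which side this agent plays
        for (int i = 0; i < boardState.Length; i++)
        {
            sensor.AddObservation(boardState[i]); // 9 observations, one per square
        }
    }
}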

OnActionReceived() takes the action given by the AI (or in some cases the human player) and then updates the environment accordingly. For example, in the Hummingbird environment, OnActionReceived() will move the hummingbird around. In tic-tac-toe I pass the action to TicTacToeArea.cs and update the game board.
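A sketch of the tic-tac-toe version (the TakeAction() signature is a guess at how the move gets handed to the area script, and newer ML-Agents releases use ActionBuffers instead of the float[] signature shown here):

// Sketch only: the exact TakeAction() signature and field names are assumptions.
using Unity.MLAgents;
using UnityEngine;

public class ActionSketch : Agent
{
    public int teamId;
    public TicTacToeArea ticTacToeArea; // the area script that owns the game board

    // ML-Agents 1.0-era signature; later releases pass ActionBuffers instead of float[].
    public override void OnActionReceived(float[] vectorAction)
    {
        int action = Mathf.FloorToInt(vectorAction[0]); // 0 = no action, 1-9 = board squares
        if (action > 0)
        {
            ticTacToeArea.TakeAction(action - 1, teamId); // hand the move to the area script
        }
    }
}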

The last function from this script is Heuristic(). This is another Agent class override that allows a human to control an agent in the environment. When you set the Behavior Parameters’ Behavior Type parameter to Heuristic, this function allows you to control the agent. This is useful for debugging and testing the environment. This is also used for playing against the agent in tic-tac-toe. When it came time to play against the trained AI, I set one agent to Heuristic, which I then controlled.
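A sketch of the key mapping (again using the 1.0-era float[] signature; newer releases write into ActionBuffers instead):

// Sketch only: q/w/e, a/s/d, z/x/c map to the top, middle and bottom rows of the board.
using Unity.MLAgents;
using UnityEngine;

public class HeuristicSketch : Agent
{
    private static readonly KeyCode[] keys =
    {
        KeyCode.Q, KeyCode.W, KeyCode.E,
        KeyCode.A, KeyCode.S, KeyCode.D,
        KeyCode.Z, KeyCode.X, KeyCode.C
    };

    public override void Heuristic(float[] actionsOut)
    {
        actionsOut[0] = 0f; // default to the 'no action' option
        for (int i = 0; i < keys.Length; i++)
        {
            if (Input.GetKey(keys[i]))
            {
                actionsOut[0] = i + 1; // squares are actions 1 through 9
                break;
            }
        }
    }
}

This is only a sketch of the idea; my actual implementation ran into the problems described below.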

I really struggled to get the Heuristic() function to work properly, and I'm not sure what exactly I did wrong. I can play tic-tac-toe in this environment, but only after making it so that if there is a human player in the game, the game waits 2 seconds before resetting the board. In those two seconds the human player has to choose their next move; if they don't, their last move from the previous game gets played again. I'm not sure why this happens. When Heuristic mode is on, the last input I entered constantly spams the OnActionReceived() function and isn't updated until I choose a new move. There's no turn-based environment in the Unity ML examples to guide me here, so I'll have to come back to this.
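The 2-second delay itself is simple to sketch with a coroutine (names here are illustrative, and this is the workaround, not a real fix):

// Sketch of the delayed-reset workaround for games with a human player.
using System.Collections;
using UnityEngine;

public class DelayedResetSketch : MonoBehaviour
{
    public bool hasHumanPlayer;

    public void OnGameOver()
    {
        if (hasHumanPlayer)
        {
            // Give the human a 2 second window to enter their next move before the board resets.
            StartCoroutine(ResetAfterDelay(2f));
        }
        else
        {
            ResetBoard();
        }
    }

    IEnumerator ResetAfterDelay(float seconds)
    {
        yield return new WaitForSeconds(seconds);
        ResetBoard();
    }

    void ResetBoard()
    {
        // Clear the board state and deactivate the piece GameObjects here.
    }
}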

Training the Agent

I trained the agent following the standard Unity ML process. I installed mlagents into a virtual environment and activated the env in the console. I set up a configuration YAML file, using the docs to help me choose hyperparameters. From the console prompt, I trained with the config file and the mlagents-learn call:

mlagents-learn tictactoe.yaml --run-id=ttt21
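For reference, here is a rough sketch of what a self-play config for this setup could look like. The behavior name has to match the Behavior Name set in the Behavior Parameters, and the hyperparameter values below are illustrative rather than my exact file:

# tictactoe.yaml sketch: behavior name and values are illustrative.
behaviors:
  TicTacToe:
    trainer_type: ppo
    hyperparameters:
      batch_size: 64
      buffer_size: 2048
      learning_rate: 3.0e-4
    network_settings:
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 500000
    self_play:
      save_steps: 20000
      team_change: 100000
      swap_steps: 10000
      window: 10
      play_against_latest_model_ratio: 0.5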

The results can be monitored from TensorBoard, which Unity ML makes easy to use.
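With the default setup, that's just a matter of pointing TensorBoard at the results folder (the folder name can differ depending on your ML-Agents version):

tensorboard --logdir results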

I went through about 21 testing runs before I debugged things well enough to get my code to work. I had a number of issues to fix. I used bad observations (at first I didn't have the team ID as an observation, and then I accidentally copied over a GameObject that made the agent rely on visual observations). I failed to sync the agents properly with the game board (my agents were acting hundreds of times per tic-tac-toe game). My agents picked moves on squares that already had pieces rather than empty squares (in order to avoid losing). Also, my attempt to fix the Heuristic() function caused an issue with the overall RL loop.

Questions

  • What are the best practices for a turn-based env? Are there any examples?
  • How do I see the observations my agent is seeing?
  • How should I be calling RequestDecision()?
  • Why does my episode length statistic not match what the agent in the tic-tac-toe game is doing?
  • How should I be using the Academy singleton? Prior versions emphasized this but I'm not seeing much usage of it in the current docs or Unity ML 1.0.
  • WTF am I doing wrong with the Heuristic() function? Why is my OnActionReceived() constantly being called with prior actions when the Heuristic option is activated?

Next Steps

Up next I'm going to try a GridWorld-type env and maybe a maze-type env in an old Unity game I made years ago. If I can get those to work, I'm going to try some simple applications to a turn-based strategy game. Meanwhile, I'm going to do some more research on Unity ML (in particular the actual code, more examples, and the forums) to try to figure out best practices and get a better understanding. I hope that, despite my weaknesses and gaps in Unity ML knowledge, this post helps.
