As a side project during our studies at Cambridge, three other engineers and I built our own Go-playing AI, based on DeepMind’s AlphaGo Zero paper. We called it BetaGo, as it uses the same approach as AlphaGo but plays on a much smaller 5x5 board (due to computational restrictions). After nearly two months of debugging and training, our model was good enough to play against amateur humans.
Check out our video of the project here:
There are already many resources online explaining how AlphaGo works, so we won’t go into the details here. Instead, in this blog post, we want to share some of the things we learnt about organising projects, as well as give a few tips for debugging Reinforcement Learning (RL) projects.
Below is a list of great resources that explain how AlphaGo and RL in general work, for anyone who’s interested:
- Andrej Karpathy’s post on reinforcement learning http://karpathy.github.io/2016/05/31/rl/
- Surag’s blog post about how AlphaGo Zero works https://web.stanford.edu/~surag/posts/alphazero.html
- DeepMind’s paper, if you want to dive into the details https://deepmind.com/blog/alphago-zero-learning-scratch/
- Sutton and Barto’s book: Reinforcement Learning: An Introduction. Read this if you want a more thorough understanding of the theory behind RL http://incompleteideas.net/book/the-book-2nd.html
Making the team work
In a normal hackathon project that takes place over a single weekend, teams usually distribute the work by splitting the project up into modules and assigning one to each member. This approach makes things very easy to manage, especially if the module assignments can leverage the experience and capabilities of the different individuals in the team. Longer-term projects, such as the ones at Hackbridge Cluster, allow teams to go into much more depth on any particular topic, so projects can develop to a more mature state. However, there is a lot of danger in using the ‘module assignment’ strategy to delegate tasks in long-term projects.
Often in a project, there will be “fun” parts like coding up algorithms or training models, but also “boring” parts like building the UI. We all do projects to learn and improve our skills, and no one wants to be doing the grunt work all the time. Having a predefined module for each team member can make people feel that they are not getting much value from doing the project, and this might cause the team to fall apart.
We faced this problem ourselves when building BetaGo. As we progressed, some of the plumbing was completed first, but the Monte-Carlo Tree Search (MCTS) module was still riddled with bugs. This left some team members with nothing to contribute, while giving others a substantial workload. After some firefighting, we recognised this was a problem and decided to switch to a task distribution system that more closely resembles open-source software contribution. Rather than just changing the code by ourselves, we would create issues and feature requests on GitHub, and each member would then pick one that suited their abilities to work on. This way, people could always choose what they wanted to work on, rather than being confined to a predefined set of tasks from the outset. We found that this approach improved the overall efficiency of our team, as well as the level of satisfaction we each received from doing the project.
Testing different parts of the pipeline
It is almost inevitable that there will be bugs in the code, but reinforcement learning projects are notoriously hard to debug. All you can see is that the results are poor, but it’s hard to tell whether this is due to a bug, incorrect hyper-parameters, or simply insufficient training. Therefore, it is important to break the pipeline down into sections that can be tested individually. Our approach was to divide the pipeline into three sections: simulation, data collection, and training.
The simulator is deterministic (i.e. we know exactly what should happen when an action is made) and is therefore easy to write tests for. A set of unit tests can give you confidence that this module is working as expected, though care must of course be taken to cover all the edge cases. It can also be useful to stress test the simulator by generating large amounts of random inputs and checking the results against precomputed ground truths, or just checking that it doesn’t throw an error. This strategy exposed a bug in one of our early implementations, where the simulator accepted a move that placed a stone directly on top of another stone.
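As an illustration of the random stress test described above, here is a minimal sketch. It does not use BetaGo’s actual simulator: `place_stone` is a hypothetical toy stand-in that only handles placement (no captures, ko or suicide rules), and `stress_test` is an assumed helper name. The point is the shape of the test: play many random moves and assert invariants after every call.

```python
import random

SIZE = 5  # board side length, matching the 5x5 board in the post
EMPTY, BLACK, WHITE = 0, 1, 2


def place_stone(board, player, move):
    """Toy stand-in for the simulator: returns a new board or raises ValueError.

    Captures, ko and suicide rules are deliberately left out.
    """
    row, col = move
    if not (0 <= row < SIZE and 0 <= col < SIZE):
        raise ValueError("move is off the board")
    if board[row][col] != EMPTY:
        raise ValueError("point is already occupied")
    new_board = [list(r) for r in board]
    new_board[row][col] = player
    return new_board


def stress_test(num_games=200, seed=0):
    """Play random moves and check invariants after every call."""
    rng = random.Random(seed)
    checked = 0
    for _ in range(num_games):
        board = [[EMPTY] * SIZE for _ in range(SIZE)]
        player = BLACK
        for _ in range(2 * SIZE * SIZE):
            move = (rng.randrange(SIZE), rng.randrange(SIZE))
            occupied = board[move[0]][move[1]] != EMPTY
            try:
                new_board = place_stone(board, player, move)
            except ValueError:
                # Rejection is only correct if the point was occupied.
                assert occupied, f"legal move {move} was rejected"
            else:
                assert not occupied, f"stone placed on top of another at {move}"
                # Exactly one point may change (no captures in this toy version).
                changed = sum(
                    board[r][c] != new_board[r][c]
                    for r in range(SIZE)
                    for c in range(SIZE)
                )
                assert changed == 1
                board = new_board
                player = WHITE if player == BLACK else BLACK
            checked += 1
    return checked
```

Calling `stress_test()` returns the number of moves it checked and raises an `AssertionError` as soon as an invariant is violated. With a real Go simulator the "exactly one point changes" check would need to account for captures.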
The second thing to try when debugging is to save all the generated data and look through it to see if it makes sense. For us, the data would be a list of (board, action, predicted value) tuples. By looking through the saved data, we realised that our code had an erroneous minus sign that gave the same value to our victory and the opponent’s victory. No wonder it wasn’t learning anything useful at the time! 😂
Here’s a code snippet of this particular bug:
next_state = find_next_state(move)
if next_state is None:
    next_board, next_player = simulate_move(board, player, move)
    next_state = evaluate(next_board, next_player)
total_value += next_state.value  # this is the bug
When we evaluate a new state, the value is always from the perspective of the opponent, who is the next player to move. Therefore the opponent’s value should be multiplied by -1 before being added to the current player’s total value. This is a subtle error that can have significant consequences for the success of an RL implementation!
Once you are confident that the generated data is correct, it is important to make sure that the model actually improves with more training data. The last thing you want is to leave it training for many hours and come back to a model that isn’t any better. To test this, we saved all the data from our self-play episodes and trained a model on this large but low-quality dataset. Luckily for us, our pre-trained model was able to beat the random benchmark on the first try. This gave us the confidence to scale up the training by leaving the model to train for an entire day and collecting data simultaneously on multiple computers.
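To make the sign flip concrete, here is a small self-contained sketch. The names `Evaluation`, `backed_up_value`, `total_value_buggy` and `total_value_fixed` are hypothetical and not taken from the BetaGo code; the only assumption carried over from the snippet above is that an evaluation returns a value from the perspective of the player to move in that state.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    """Value of a position from the perspective of the player to move."""
    value: float


def backed_up_value(child: Evaluation) -> float:
    """Convert a child's value into the parent's perspective.

    The child position is evaluated for the opponent (the player to move
    there), so a good position for them is a bad one for us.
    """
    return -child.value


def total_value_buggy(children):
    # Reproduces the bug: opponent values are added without a sign flip.
    return sum(child.value for child in children)


def total_value_fixed(children):
    # Each child's value is negated before it is accumulated.
    return sum(backed_up_value(child) for child in children)


# Two replies after which the opponent is winning (+1 from their side).
children = [Evaluation(1.0), Evaluation(1.0)]
print(total_value_buggy(children))  # 2.0: looks great for us, but is wrong
print(total_value_fixed(children))  # -2.0: correctly bad for us
```

In the snippet above, the equivalent one-line change would be to subtract `next_state.value` from `total_value` instead of adding it.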
Understanding the theory can be really useful
When we started, we just wanted to get the code written quickly and didn’t want to spend time figuring out the theory behind why this approach works in the first place. However, as the project developed, we found ourselves randomly trying out hyper-parameters and hoping that something would work. Obviously, this wasn’t the best strategy!
Eventually we decided to read Sutton and Barto’s book to get an understanding of the theory behind AlphaGo Zero. This effort paid off in many ways.
One example of how theory helped is when we were tuning the hyper-parameters for Monte Carlo Tree Search. In our implementation, there was a constant called cpuct that controlled the tradeoff between exploration and exploitation. We initially left it at the value suggested in the paper, but after reading through the book, we realised that since we were using fewer search iterations, the value of cpuct should actually be an order of magnitude larger. Using our new understanding, we were able to calculate the value it would need to be, and changing this hyper-parameter in an informed manner made the model significantly better.
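To show why the number of search iterations matters here, the sketch below uses the selection rule from the AlphaGo Zero paper, where each child is scored as Q + U with U = cpuct · P · sqrt(total visits) / (1 + child visits). The function names and all the numbers are illustrative assumptions, not the values or the calculation we used; the constant is written `c_puct` in the code.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct):
    """Q + U for one child, using the PUCT rule from the AlphaGo Zero paper."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + exploration


def select_child(children, c_puct):
    """Return the index of the child with the highest PUCT score.

    `children` is a list of (q, prior, visits) tuples.
    """
    parent_visits = sum(visits for _, _, visits in children)
    scores = [
        puct_score(q, prior, parent_visits, visits, c_puct)
        for q, prior, visits in children
    ]
    return max(range(len(children)), key=scores.__getitem__)


# A node after only 8 simulations: one child got every visit so far,
# the other has the same prior but has never been tried.
children = [(0.3, 0.5, 8), (0.0, 0.5, 0)]

print(select_child(children, c_puct=0.1))  # 0: keeps exploiting the visited child
print(select_child(children, c_puct=1.0))  # 1: finally tries the unvisited child
```

Because the exploration term grows with the square root of the total visit count, a search with few iterations produces a small U for any fixed constant, so a larger constant is needed before unvisited moves get a look-in.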
Personally, I think the best way to learn is to work on theory and practice in alternation. Practical work makes you appreciate the importance of theory, while learning the theory helps you solve the hard practical challenges.
We hope that these lessons can help those working on their own RL projects to avoid some of the pitfalls we faced. Good luck, and let us know how you do!
All our code is open source and can be found on GitHub here.
Lastly, we’d like to say a massive thank-you to Hackbridge and Entrepreneur First for sponsoring the food and drink that helped us through the long nights. If you’re a Cambridge student and you want to build cool things with amazing people, definitely attend some of Hackbridge’s events! You never know, we might find ourselves working on a project together ;)