Project Bonsai — our first look at it

Mauro Krikorian
Published in SOUTHWORKS
Apr 5, 2021

Over the last few weeks here at SOUTHWORKS we have been playing with an emerging technology from Microsoft called Project Bonsai, an artificial intelligence platform currently in preview. It allows solving machine learning problems using deep reinforcement learning without requiring an extensive machine learning background.

This tool relies on people's expertise to break a problem into simpler tasks and give machine learning models important clues about how to find a solution faster. This concept is called machine teaching: a programming paradigm that expresses the solution to a problem in terms of how you teach the computer to find the solution, as opposed to how to calculate it.

Please do not forget to read the introductory post to the machine learning world here.

Reinforcement learning

Reinforcement learning is a machine learning method in which an agent learns by interacting with an environment.

When an episode starts, the system begins in a predefined state (St). Based on that state, the agent chooses an action (At); the environment then provides a reward for the performed action (Rt) and calculates the new system state (St+1). The goal of the agent is to select the actions that maximize its cumulative reward over the episode.
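
To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction (the toy environment and the random agent below are just our own illustration, not part of Bonsai):

```python
# Minimal sketch of the reinforcement learning loop described above.
import random

class ToyEnvironment:
    """Trivial environment: the state is a counter the agent tries to land on 10."""
    def reset(self):
        self.state = 0                       # St at the start of the episode
        return self.state

    def step(self, action):
        self.state += action                 # compute St+1 from St and At
        reward = 1.0 if self.state == 10 else -0.1   # Rt for the performed action
        done = self.state >= 10
        return self.state, reward, done

env = ToyEnvironment()
state = env.reset()
cumulative_reward = 0.0
done = False
while not done:
    action = random.choice([1, 2])           # the agent picks At based on St
    state, reward, done = env.step(action)   # the environment returns Rt and St+1
    cumulative_reward += reward              # this sum is what the agent maximizes
print("episode return:", cumulative_reward)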

Bonsai

Bonsai aims to solve reinforcement learning scenarios without requiring expert machine learning knowledge from the user. The only inputs it needs from the user are:

  • A definition of the state of the system.
  • The list of actions it can select.
  • A specific goal to achieve (i.e. conditions that need to be fulfilled or avoided by the system state), or a reward function based on the current system state and the action taken.
  • A connection to a simulated environment that can calculate the new system state based on the current system state and the action taken by the agent.

Based on these inputs, Bonsai builds an agent (called a BRAIN: Basic Recurrent Artificial Intelligence Network) that selects actions based on the current state, and trains it to fulfill the specified goals or maximize the cumulative reward. To build this agent, it automatically selects a machine learning model based on the problem structure and tunes its hyperparameters during training.
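
As a rough sketch of what these inputs look like in code (in Bonsai the state, actions and goals are actually declared in Inkling and the simulator is connected through the platform's simulator API; the Python names below are just ours for illustration):

```python
# Illustrative mapping of the four inputs Bonsai needs; names are hypothetical.
from dataclasses import dataclass

@dataclass
class SimState:          # 1. definition of the system state
    position: float
    velocity: float

@dataclass
class SimAction:         # 2. the actions the brain can select
    throttle: float      # e.g. a value in [-1, 1]

def reward(state: SimState, action: SimAction) -> float:
    # 3. a reward function (or, alternatively, goals over the state)
    return -abs(state.position)              # closer to 0 is better

class Simulator:
    # 4. a simulated environment that computes the next state
    def step(self, state: SimState, action: SimAction) -> SimState:
        new_velocity = state.velocity + action.throttle
        return SimState(position=state.position + new_velocity,
                        velocity=new_velocity)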

What we did

To better understand the tool, we chose several AI-related scenarios that are a good fit for reinforcement learning. In this article we talk briefly about two of them and in a bit more depth about the remaining one (you can find everything explained in more detail in this repository).

These are the three scenarios we are going to discuss here:

  1. Leaving Home: An agent starts in a random room in a house with a fixed layout and has to move through the house to reach the exit.
  2. Tic Tac Toe: An agent needs to play tic tac toe trying to maximize its chance of winning.
  3. Bipedal Walker: An agent moves its hips and knees in order to learn to walk across flat terrain and reach the end.

Leaving Home scenario

We are not going to talk in detail about the state, actions and reward functions in this article; if you want to dig deeper into the setup, training and results for this scenario, you can browse here.

Small house version

The initial scenario consists of a home with five rooms and one exit. An agent is placed in one of the rooms and its goal is to find the exit. The agent is only allowed to move from one room to another if they are connected.
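
Conceptually, the simulator for this scenario boils down to a room graph, a move action and a reward for reaching the exit. The sketch below illustrates the idea in Python; the room connectivity shown is hypothetical, and the real layout plus the Inkling definition are in the repository.

```python
# Conceptual sketch of the Leaving Home simulator (hypothetical room layout).
import random

EXIT = 5
ROOMS = {                 # hypothetical connectivity between rooms
    0: [4],
    1: [3],
    2: [3],
    3: [1, 2, 4],
    4: [0, 3, EXIT],
}

def reset() -> int:
    return random.choice(list(ROOMS))       # the episode starts in a random room

def step(room: int, target: int):
    if target not in ROOMS[room]:            # illegal move: stay put, no reward
        return room, 0.0, False
    if target == EXIT:                       # reaching the exit ends the episode
        return target, 1.0, True
    return target, 0.0, False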

After defining the inputs for this really simple scenario in Bonsai's language (Inkling, a domain-specific language), we were able to train the brain to reach the exit from any initial room in approximately 100,000 training iterations, as seen in the image below:

Large house version

After getting good results in the small house version, we made the scenario more complex with a bigger graph. We created a layout of ten rooms and an exit. The image below shows the 10 rooms and indicates that rooms number 4 and 9 are connected to the exit.

As in the previous version, this one learnt how to reach the exit of the house starting from any of the 10 different rooms. The training took approximately 160,000 iterations.

Large house: Shortest path version

In our previous versions, the brain was rewarded when reaching the exit but had no incentive to prioritize short paths. Therefore, on multiple occasions it took unnecessarily long paths to exit the house. In this version, we made the reward inversely proportional to the length of the path taken.
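
The idea can be sketched as follows (an illustration of the shaping, not the exact function we used, which is in the repository):

```python
# Shaping the exit reward so that shorter paths earn more.
def exit_reward(steps_taken: int, reached_exit: bool) -> float:
    if not reached_exit:
        return 0.0
    return 1.0 / steps_taken      # a 2-step path earns 0.5, a 10-step path only 0.1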

This version took around 1.5 million iterations to train, considerably longer than the previous one, but it managed to reach the exit using shorter paths for almost all starting rooms:

Tic-Tac-Toe scenario

The goal was to train an agent to learn how to play Tic-tac-toe against a simulated player. The agent filled the role of player 1, while the environment simulated a player 2 that moved at random.
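
The reward functions we experimented with had roughly the following shape: penalize invalid moves, reward wins and penalize losses, while the environment plays the opponent's moves at random. The snippet below is an illustrative Python sketch only; the exact variants are in the repository.

```python
# Illustrative sketch of the reward shaping and the random opponent.
import random

def opponent_move(board: list) -> int:
    # the environment picks any empty cell (0) for player 2
    return random.choice([i for i, cell in enumerate(board) if cell == 0])

def reward(board: list, move: int, winner: int) -> float:
    if board[move] != 0:          # the chosen cell is already taken
        return -10.0              # heavy penalty for invalid moves
    if winner == 1:               # the agent (player 1) wins
        return 1.0
    if winner == 2:               # the random opponent wins
        return -1.0
    return 0.0                    # draw, or the game is still in progress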

After training the model with reward functions of this kind, we did not manage to make the agent play tic-tac-toe correctly. The model kept executing invalid moves, and no significant improvements were achieved after trying multiple variants of the reward function (you can read about the reasons in more detail by browsing this repository).

This example shows that the techniques used by the Bonsai platform are not always able to adapt to every possible scenario. In this case, we believe that having an opponent that executes random moves makes it hard for the agent to learn how to play, since the evolution of the state depends not only on the previous state and the agent's action but also on factors outside the agent's control (the other player's moves).

Bipedal Walker scenario

The Bipedal Walker is a standard sample scenario available from OpenAI, significantly more complex than the Leaving Home scenario. It consists of a fixed-length stretch of flat terrain along which a robot has to walk from a starting point to an end point. The robot has a hull as its body and two legs with two joints each: one representing the robot's hip and the other its knee. Additionally, the robot has a LiDAR rangefinder.
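
The environment itself comes from OpenAI Gym. Assuming the Box2D extras are installed, you can inspect its state and action spaces like this (the environment id may vary by version; BipedalWalker-v3 at the time of writing):

```python
import gym

env = gym.make("BipedalWalker-v3")
print(env.observation_space)   # 24 continuous values: hull angle and velocities,
                               # joint positions/speeds, leg contacts and LiDAR readings
print(env.action_space)        # 4 continuous torques: hip and knee of each leg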

The image below shows the UI with the robot leaving from the starting point marked by the red flag:

If you are interested in seeing how we defined the state and actions for this scenario to train a brain using Bonsai, you can head over here. Within this section, though, it is interesting to see how we evolved the reward function (briefly described below) using a trial-and-error approach that let us make progress and solve 'little' problems step by step.

Using goals

Our first version used Bonsai goals:

  • Avoid walking backwards: the position of the hull on the horizontal axis must be greater than 0.
  • Avoid falling: the robot must not trigger the game_over flag that the environment sends when the robot falls.
  • Reach the target position: the horizontal hull position must reach the end of the terrain.
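
In Inkling these goals are written with goal statements (avoid/reach); the underlying conditions are roughly the following, rendered here as plain Python predicates with field names of our own choosing:

```python
# The three goals as plain predicates over the simulator state (field names are ours).
TERRAIN_LENGTH = 88.0    # hypothetical end-of-terrain coordinate

def avoid_walking_backwards(state) -> bool:
    return state["hull_x"] > 0                   # horizontal hull position stays positive

def avoid_falling(state) -> bool:
    return not state["game_over"]                # the environment flags a fall as game over

def reach_target_position(state) -> bool:
    return state["hull_x"] >= TERRAIN_LENGTH     # the hull reaches the end of the terrain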

After completing the training process, we found that within 800,000 iterations all goals were accomplished except for reaching the target position. The robot successfully learned to avoid moving backwards and falling, but could not find a way to move forward. The image below shows the satisfaction percentage for each goal as a function of the number of iterations; the blue plot shows the average of the three goals.

Using reward and terminal functions: Reaching the end point

We then created new versions using reward and terminal functions instead of goals. Defining the model's reward function directly gives the user more granular control over how the agent is rewarded, instead of leaving it up to Bonsai.

After multiple iterations we arrived at the following reward function, which rewards the agent when it moves forward with its hull straight and penalizes it for falling down or going backwards.
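
Its shape is roughly the following (an approximate Python rendering with hypothetical weights; the exact reward and terminal functions are in the repository):

```python
# Approximate shape of the reward and terminal functions (weights are hypothetical).
TERRAIN_LENGTH = 88.0    # hypothetical end-of-terrain coordinate

def walker_reward(state, prev_state) -> float:
    reward = 0.0
    forward_progress = state["hull_x"] - prev_state["hull_x"]
    reward += 10.0 * forward_progress            # reward moving forward ...
    reward -= 5.0 * abs(state["hull_angle"])     # ... while keeping the hull straight
    if forward_progress < 0:
        reward -= 1.0                            # penalize going backwards
    if state["game_over"]:
        reward -= 100.0                          # heavy penalty for falling down
    return reward

def walker_terminal(state) -> bool:
    # the episode ends when the robot falls or reaches the end of the terrain
    return state["game_over"] or state["hull_x"] >= TERRAIN_LENGTH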

We can see that the cumulative reward stabilizes after 4 million iterations:

With this reward function we managed to get the robot to reach the end of the terrain without falling. However, it did so in an awkward manner, dragging one leg on the floor.

Using reward and terminal functions: Keeping the hull high

After a few more versions, we managed to stop the robot from dragging its knees by adding a reward for keeping its hull above a predefined height.
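
The extra term is essentially a bonus whenever the hull stays above a threshold, something along these lines (the threshold value here is hypothetical):

```python
# Additional term added on top of the previous reward function.
MIN_HULL_HEIGHT = 1.0    # hypothetical threshold; the real value is in the repository

def hull_height_bonus(state) -> float:
    return 1.0 if state["hull_y"] > MIN_HULL_HEIGHT else 0.0   # reward an upright hull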

With these modifications, the robot is able to reach the end without crawling or dragging a leg on the floor by staying in a mostly upright position throughout the process. However, it still does not walk in a ‘natural way’ since its legs never alternate: one always leads and, when the other one gets closer, the robot hops forward. These are the best results we were able to achieve but there is still room for improvement.

As mentioned, this scenario is significantly more complex and has a much bigger state space to explore than the others discussed before. As a result, it took a much larger number of versions to start getting good results in which the agent moved forward as desired.

The fact that the final version of the agent manages to advance without alternating its legs demonstrates how difficult it can be, in some cases, to select reward functions that promote the desired behaviour of the agent, since it is not always easy to translate a desired behaviour into a reward function.

An additional step to consider for this scenario would be to decompose the problem into several concepts and lessons focused on making the agent walk in a 'human' way. This is not an easy task, as it requires a lot of brainstorming and trial-and-error experimentation with different breakdowns, but building such a learning graph could help us achieve the results we want. We leave this for another article.

Conclusions

Bonsai is certainly a tool with many interesting and powerful features and a lot of potential. The fact that it does not demand a lot of machine learning knowledge from the user makes it a very interesting platform for developers without an artificial intelligence background.

If the user does have some experience with machine learning, they can tune certain settings more granularly, such as manually selecting the algorithm used for the agent. They can also use reward and terminal functions directly instead of goals, which requires a bit more expertise but also gives more control and can lead to better results. However, not everything can be controlled directly by the user: for instance, the tuning of algorithm hyperparameters is done automatically by Bonsai and cannot be overridden.

As it stands today, the platform is especially useful for end users who do not have an extensive machine learning background and just want to create and train models to solve specific problems without worrying too much about the internal algorithm details. We look forward to continuing to explore it, and to seeing more extensibility and fine-tuning options in the near future.
