Unity ML Intro Tutorial — Super Cart Pole — Part 1

I will show you how you can use Reinforcement Learning to teach an AI to balance an inverted pendulum and more

Gonçalo Chambel
13 min read · Oct 17, 2022

Reinforcement Learning is probably my favorite type of learning (and probably the one I hate the most at the same time). It allows you to train an AI to do pretty much anything without needing a pre-existing dataset (the good part); however, the learning process can be very tedious depending on the problem (the bad part).

In this article, I will give you a VERY brief introduction to Reinforcement Learning, and then we will jump right into applying it on a real problem using Unity. You can find all the code for this project here.

Reinforcement Learning Introduction

Reinforcement Learning (or RL for short) is a type of learning where an AI learns through positive and negative reinforcement. Very self-explanatory. It's easier if you consider the following analogy. Let's say you are trying to learn how to do a cartwheel. You start with small, random attempts because you are not sure what to do. But as you keep practicing, you start to understand which moves give you an overall better cartwheel and which don't. This is the basic concept of Reinforcement Learning.

RL loop

The main components of the RL loop are the Agent, which in the example above is you, and the Environment. At a given time instant, the agent will take a look at the environment (for example, what state of the cartwheel you are currently in), then decide on an action. This action will trigger a change in the environment, which will generate a new state and a reward associated with that action.
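To make the loop concrete, here is a small sketch in C# of how that interaction could be written down. The IEnvironment and IAgent interfaces here are made up purely for illustration; they are not part of Unity ML.

// Conceptual sketch of the RL loop described above (illustrative interfaces only).
public interface IEnvironment
{
    float[] Reset();                                               // start a new episode, return the initial state
    (float[] nextState, float reward, bool done) Step(int action); // apply an action, return new state and reward
}

public interface IAgent
{
    int ChooseAction(float[] state);                                          // decide on an action given the state
    void Learn(float[] state, int action, float reward, float[] nextState);   // update from the received reward
}

public static class RlLoop
{
    public static void RunEpisode(IEnvironment env, IAgent agent)
    {
        float[] state = env.Reset();
        bool done = false;
        while (!done)
        {
            int action = agent.ChooseAction(state);                            // the agent looks at the state and acts
            (float[] nextState, float reward, bool isDone) = env.Step(action); // the environment reacts
            agent.Learn(state, action, reward, nextState);                     // positive/negative reinforcement
            state = nextState;
            done = isDone;
        }
    }
}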

There are a lot of aspects of RL that I’m not discussing here for the sake of simplicity, but there is one important aspect that I would like you to think about. Learning how to do a cartwheel is not a very hard task for humans. Of course it depends on the person but almost everyone should be able to achieve one in no time. However, this is only true given a very big assumption, which is that you know what a cartwheel is, and possibly even understand the physics behind it. This is not the case for RL methods.

RL methods in general have no prior knowledge of the environment, meaning they know nothing, so before they could even try to learn how to cartwheel, they would first have to learn how to stand on their feet. And as if that were not hard enough, they would then have to learn how to do a cartwheel without even knowing what one is, hoping that some sequence of actions eventually leads to something that resembles a cartwheel. This is why RL algorithms usually take a long time to train. At first, they perform random actions and hope that, by chance, some sequence of actions results in the desired objective. There are techniques that can speed this up, but I will not discuss them here.

Cart Pole AI Implementation

To implement this AI, we will be using Unity ML. This package allows you to create an agent and an environment inside Unity, and then directly train your agent with state-of-the-art RL algorithms such as PPO and SAC.

The first step is to create our agent, the cart pole. It will consist of a box and a cylinder. Let's create an empty game object, name it "CartPole", and add a Rigidbody component to it. Under the Rigidbody component you just added, expand the "Constraints" parameter and freeze everything except the x position. This will make sure that the cart only moves in one dimension and does not rotate. Next, under the object you just created, create a Cube and name it "Cart", and a Cylinder and name it "Pole". Here are the transforms I'm using:

Transforms for the Cart and Pole game objects.

Add a Rigidbody component to the Pole game object. We also need a way for the pole to hinge on the cart. For that, add a Hinge Joint component to the Pole game object and drag the CartPole Rigidbody into the "Connected Body" field of the Hinge Joint component. Next, we need to define where our hinge is located and around which axis it will rotate. See the image below and copy the fields to your project.

Hinge Joints parameters

What is important here is that the hinge is located at the intersection between the Cart and Pole game objects, which in my case is at (0, -1, 0). It is up to you if you want to change the remaining parameters of the hinge component. One last thing is to adjust the masses of your Rigidbodies. I used a ratio of 1:100, meaning the cart's mass is 100 times bigger than the pole's mass. This should stop the cart from moving when the pole swings down.

You can actually experiment with different mass ratios and see if your AI can adapt!

We can now test whether the physics of the pole are working. Just press play and you should see something like this. (I've also added a face to the cart just so it is less boring to look at :). You can find the code for it in the GitHub repository.)

The pole may take a while to fall because it is in equilibrium at the start but it should fall eventually.

Now we can start coding our AI.

Like I mentioned above, we will be using the Unity ML package to build and train our agent. To install the package, just go to the Package Manager, search for ML Agents, and click Install. You will also need to install the Python tools used to train the models; you can follow the official documentation here. Once you have everything set up, we can start coding.

The first thing to do is to add a script to your CartPole game object, which will control the agent. I named it CartPoleAgent.cs. In order for Unity ML to correctly process your agent and interact with it, there are four functions you need to implement:

  • CollectObservations(VectorSensor sensor)
  • OnActionReceived(ActionBuffers actions)
  • OnEpisodeBegin()
  • Heuristic(in ActionBuffers actionsOut)

These are the minimum requirements for you to set up the agent. Well, the Heuristic function is not strictly necessary but it is helpful for debugging.

In order to override these functions, you must first add using Unity.MLAgents; at the top of your script (plus using Unity.MLAgents.Sensors; and using Unity.MLAgents.Actuators; for the VectorSensor and ActionBuffers types) and make your agent class inherit from the Agent class (just replace MonoBehaviour with Agent).
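To give you an idea of how the script might be laid out before we fill in those functions, here is a minimal sketch. The field names (rb, poleRb, pole, speed) are simply the ones I will use in the snippets below; they are my own choices, not anything required by Unity ML.

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Sensors;
using Unity.MLAgents.Actuators;

public class CartPoleAgent : Agent
{
    [SerializeField] private Transform pole;     // the Pole child object
    [SerializeField] private Rigidbody poleRb;   // the Pole Rigidbody
    [SerializeField] private float speed = 2f;   // velocity change applied per action (placeholder value)

    private Rigidbody rb;                        // the CartPole Rigidbody

    private void Awake()
    {
        rb = GetComponent<Rigidbody>();
    }

    // The four overrides discussed below go here:
    // CollectObservations, OnActionReceived, OnEpisodeBegin and Heuristic.
}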

Let's start with the OnActionReceived function. This function is responsible for processing the actions output by the training algorithm (which in our case will be PPO) and applying them to the agent. There are two types of actions, continuous and discrete. A continuous action can take any value between -1 and 1, whereas a discrete action can take any integer value from 0 up to a given value. Since in our problem we only need to tell the agent to move left or right, we can use discrete actions. For that, we can use the actions.DiscreteActions property of the function parameter, which is an array containing all the discrete actions.
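As a quick illustration of the difference, this is how the two kinds of actions would be read inside OnActionReceived. This hypothetical version assumes the behavior is configured with both action types; our cart pole will only use the single discrete branch.

// Illustration only -- not the OnActionReceived we will actually use.
public override void OnActionReceived(ActionBuffers actions)
{
    int discreteMove = actions.DiscreteActions[0];        // an integer in [0, branch size - 1]
    float continuousPush = actions.ContinuousActions[0];  // a value in [-1, 1] (not used in this tutorial)
}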

This leads us to the following question: how many actions are there? Well, I previously said the agent can take two actions, go left or go right, but in reality we can achieve both with a single action. If we say that our action can take the values 0 or 1, then in the OnActionReceived function we can define that 0 means go left and 1 means go right, like so:

public override void OnActionReceived(ActionBuffers actions)
{
    int move = actions.DiscreteActions[0];

    if (move == 1)
        rb.velocity += new Vector3(speed, 0, 0);
    else
        rb.velocity -= new Vector3(speed, 0, 0);
}

Here, rb is a reference to the CartPole Rigidbody component and speed is the amount of velocity added per action. Now the agent knows what to do for each action.

I've discussed that we only need a single discrete action to move our agent, but we also need to inform the algorithm of the possible actions it can take. To do this, we need to add another script to our agent: "Behavior Parameters". This script is already part of the Unity ML package, so do not create a new one. If it was not added automatically, just add it manually to your agent game object like this.

There are a few things we need to change here. The first is to name our behavior; I named mine "CartPoleAgent". Next we see a field to set our observation vector, but let's skip that for now. Then we have a field for our actions, where we have the option to set our discrete and continuous actions. As mentioned above, we will only use one discrete action, so under Discrete Branches type 1, and under Branch 0 Size put 2 (this value is the number of integer values the action can take, starting from 0, so this action can take the values 0 and 1). We do not need to worry about the rest of the parameters for now. While we are at it, we also need to add the Decision Requester script (if it wasn't already added) to the agent game object.

The agent can now move; however, if we press play, we can see that the agent only moves left and we cannot control it. This is because, since we are not training or running a trained model, the actions will always be the same (which in this case makes the cart go left). We need a way to directly change the actions so we can test our agent, and this is where the Heuristic function comes in handy.

This function allows us to directly change the values of the action buffer so that we can control the agent.

public override void Heuristic(in ActionBuffers actionsOut)
{
    ActionSegment<int> discreteActions = actionsOut.DiscreteActions;

    // 1 -> move right
    // 0 -> move left
    if (Input.GetAxisRaw("Horizontal") > 0)
        discreteActions[0] = 1;
    else if (Input.GetAxisRaw("Horizontal") < 0)
        discreteActions[0] = 0;
}

We are now directly changing the values of the discreteActions variable. Now if we press play, we should be able to move the player using the Right Arrow on our keyboard.

The controls are not very intuitive because the agent will always move left, unless you press the Right Arrow.

OK, now that we have a working agent, we need to tell the algorithm what information it will have access to in order to learn properly. For this, we will override the CollectObservations function.

This function is responsible for collecting the necessary inputs from the Unity environment and passing them to the algorithm so it can process them. There are four inputs we are going to provide to the algorithm:

  • The agent's local position (we only need the x component)
  • The agent's velocity (again, only the x component)
  • The pole's angular velocity (in this case we need the z component)
  • The pole's angle (again, only the z component)

So our function will look something like this

public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(transform.localPosition.x);
    sensor.AddObservation(rb.velocity.x);
    sensor.AddObservation(poleRb.angularVelocity.z);
    sensor.AddObservation(RoundAngle(pole.eulerAngles.z));
}

To finish up, we need to tell the algorithm how many values we are passing in. To do this, we are going to go back to the Inspector and in the Behavior Parameters script, we change the Space Size of the Vector Observation to 4.

It is important to note that some Unity variables, even though they are one variable, contain multiple values. In the case of the Vector3 for example, if we passed this to the observations, our Space Size would have to increase by 3 instead of 1, because a Vector3 has 3 values.
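For example, a hypothetical observation function like the one below would need a Space Size of 4 even though it only makes two AddObservation calls: 3 for the Vector3 plus 1 for the float. This is illustrative only, not the observation set we use in this tutorial.

// Illustrative only -- not the observations used in this tutorial.
public override void CollectObservations(VectorSensor sensor)
{
    sensor.AddObservation(rb.velocity);                // a Vector3 counts as 3 values
    sensor.AddObservation(transform.localPosition.x);  // a float counts as 1 value
}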

We are almost ready to start training; there are only a couple of steps left. The first one is to decide how the agent will be rewarded. Remember, in RL the agent decides whether a certain action or set of actions was good or not based on the value/reward it generated (or is predicted to generate). So with the reward function, we want to encourage good actions and penalize bad ones. In this case, we will reward the agent with a value of 0.1 for every frame in which the pole angle is within 20 degrees of vertical. So we can update our OnActionReceived function to the following:

public override void OnActionReceived(ActionBuffers actions)
{
    int move = actions.DiscreteActions[0];

    if (move == 1)
        rb.velocity += new Vector3(speed, 0, 0);
    else
        rb.velocity -= new Vector3(speed, 0, 0);

    if (Mathf.Abs(RoundAngle(pole.eulerAngles.z)) < 20)
        AddReward(0.1f);
    else
        EndEpisode();
}

And here is the code for the RoundAngle function

float RoundAngle(float angle)
{
    angle %= 360;
    return angle > 180 ? angle - 360 : angle < -180 ? angle + 360 : angle;
}

This function is actually very important because, if we did not use it, the pole angle would have a discontinuity (it would jump from values near 0 to values near 360 as the pole fell to one side). With this function we get rid of the discontinuity, which would otherwise hinder the learning process.
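If you want to convince yourself, here is a quick check you could temporarily drop into Start() while debugging (the input values are just examples):

// Without RoundAngle, a small tilt to one side reads as ~350 instead of -10.
Debug.Log(RoundAngle(350f)); // prints -10
Debug.Log(RoundAngle(10f));  // prints 10
Debug.Log(RoundAngle(190f)); // prints -170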

The absolute value of each reward signal is a bit arbitrary; what is important is that, if you have multiple reward signals, the relation between their magnitudes is taken into account. More on that later.

The last thing we need to code is a way to stop the episode. The first way of doing this is to limit the space the agent can move in. This can be done by adding colliders on each side (I added two cubes at +10 and -10 on the x axis from the agent, set their colliders to trigger, and tagged them "Wall"). Then, in the agent script, we just need to check if the agent collided with these objects and reset the episode:

private void OnTriggerEnter(Collider other)
{
    if (other.tag == "Wall")
    {
        AddReward(-75f);
        EndEpisode();
    }
}

Notice here that I apply a reward of -75 to teach the agent not to collide with the walls.

Also notice that the absolute values of each reward now need to be considered, because we have different reward signals. I chose a negative reward of -75 because I really don't want the AI to collide with the walls: since the maximum positive reward in an episode is 50 (500 steps * 0.1), as described below, the negative reward will always outweigh the positive one, forcing the AI to work on avoiding the walls first.

You also see the EndEpisode() function there, which we have not talked about yet. This function simply resets the episode and is part of the Agent class. We can also set the MaxStep variable in the Inspector to 500 to stop an episode from running indefinitely. Speaking of episodes, once an episode is reset, we also need to reset the environment, which we can do by overriding the OnEpisodeBegin function.

public override void OnEpisodeBegin()
{
    rb.transform.localPosition = new Vector3(Random.Range(-2f, 2f), 0, 0);
    rb.velocity = Vector3.zero;

    pole.localRotation = Quaternion.Euler(0, 0, Random.Range(-5f, 5f));
    pole.localPosition = new Vector3(0, 2, 0);
    poleRb.velocity = poleRb.angularVelocity = Vector3.zero;
}

We are done coding our AI!

In order to start training, we just need to set up our algorithm configuration file, since we do not want to use the default one. To do this, go to your project folder and create a new folder named "config". Then, inside that folder, create a new file named "cartPoleAgent.yaml" and copy-paste the following configuration:

behaviors:
  CartPoleAgent:
    trainer_type: ppo
    hyperparameters:
      batch_size: 64
      buffer_size: 4096
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: false
      hidden_units: 64
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
    max_steps: 3000000
    time_horizon: 32
    summary_freq: 20000

I won't go into detail here about what each hyperparameter means, but you can find out more here. (Note: your "config" folder should be in the same folder as your "Assets" and "Library" folders.)

IMPORTANT: If you used a different name for your Behavior in the Inspector, make sure it matches with the one at the top of the file.

Now, in the same folder that contains your config folder, open a command prompt and type

mlagents-learn config/cartPoleAgent.yaml --run-id=CartPoleAgent_V1

If everything is well set up, this should give you a message saying that you can start training your model by pressing the play button in the Unity Editor. Then your model should start training!

Tip: If you want to speed up progress, you can create a prefab of your environment and duplicate it multiple times on the scene. This is equivalent to having multiple agents training at the same time, while all sharing what they have learned.

While your model is training, you can visualize the progress using TensorBoard. In that same folder, open a new command prompt and type

tensorboard --logdir results

This will print a URL; open it in a new browser tab and you should see something like this. The page shows important information about the progress of your training, such as the total reward. Here is how mine looked after training.

In the config file, we set the maximum number of steps to 3 million, but we can see that after about 500k steps the reward is pretty much stable and maxed out (the maximum reward is 50), so we can stop the training earlier (but only after 500k steps, because that is when checkpoints are saved by default). We can also observe that the episode length approaches 100 as training progresses. This is one more sign that the model learned properly (because we were ending episodes early whenever the model failed). The fact that it shows 100 (instead of 500, which is the maximum) is due to a parameter we set in the Decision Requester script (the decision period), but don't worry about that.

It’s finally time to test our trained model. To do so, go into the “results” folder, grab the .onnx file and drag it into the Unity project.

Then, in the Inspector of our agent, you can assign the model in the “Model” variable. When we press play (and since our Behavior Type variable is set to Default or Inference Only) the agent will be controlled by the trained model!
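As a side note, if you ever prefer to assign the model from code rather than through the Inspector, the Agent class also exposes a SetModel method. Here is a small sketch; the ModelSwapper class and its field names are my own invention, and the behavior name passed in must match the one set in Behavior Parameters.

using UnityEngine;
using Unity.Barracuda;   // NNModel, the type of the imported .onnx asset

public class ModelSwapper : MonoBehaviour
{
    [SerializeField] private CartPoleAgent agent;   // the agent from this tutorial
    [SerializeField] private NNModel trainedModel;  // the .onnx file dragged into the project

    private void Start()
    {
        // Swap the model at runtime; "CartPoleAgent" is the behavior name we set earlier.
        agent.SetModel("CartPoleAgent", trainedModel);
    }
}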

You can see that the model got pretty good at balancing the pole! I honestly cannot do better than the AI…

In the next part, I will show how we can improve this AI by teaching it how to move to a goal, while balancing the pole, and compensating for external forces so stay tuned for that!
