TF Jam — with ML-Agents 🤖

Sebastian Schuchmann
6 min read · Oct 30, 2018


This article is about learning how to train a machine learning model in Unity3D using the ML-Agents Toolkit.

This is an unofficial follow-up to the original TF-Jam — Shooting Hoops with Machine Learning post by Abe Haskins (Twitter, Github). So first of all, a big thank you to Abe Haskins for making this all possible 👏, and go read the original article for a lot of the background info!

After 7 hours of training on my MacBook Pro

Motivation

The goal of this project was to extend the original TF-Jam project by implementing the ML-Agents Toolkit. This means we can now train the AI inside Unity using Reinforcement Learning.

Everything you need is in the Github repository; check out the readme to get started.

Getting Started

Really, check out the Github repository. Everything you need to know to get started is listed there.

From Supervised to Reinforcement Learning

From the original article:

What is our goal?

To keep things simple, our desired outcome for this project will be incredibly simple. We want to solve: if the shooter is X distance away from the hoop, shoot the ball with Y force. That’s it! We will not try to aim the ball or anything fancy. We are only trying to figure out how hard to throw the ball to make the shot.

Learning in the original project was done via supervised learning, meaning we first:

  1. Let the agent shoot the ball randomly from various positions
  2. Collect data (distance, force) for all successful shots
  3. Then train a model to best fit that labeled data

Now we are using reinforcement learning, meaning we let the agent interact with the environment directly. First, it observes the environment (distance), takes an action (force), collects rewards and then updates its policy accordingly. The policy determines how the agent acts in the environment and which action it takes given an observation.

Source — from Unity ML-Agents docs.

Learn more about how this is implemented in Unity here.

Reinforcement Learning in Unity

As this is a completely different way of learning, we have to change the setup. An example learning environment in Unity looks like this:

Source — from Unity ML-Agents docs.

The Learning Environment contains three components that help organize the Unity scene:

  • Agents 🤖 handle generating observations, performing the actions they receive and assigning a reward (positive / negative) when appropriate. Each Agent is linked to exactly one Brain (see the sketch after this list).
  • Brains 🧠 encapsulate the logic for making decisions for the Agent. In essence, the Brain is what holds on to the policy for each Agent and determines which actions the Agent should take at each instance.
  • The Academy 🏫 orchestrates the observation and decision-making process. Within the Academy, several environment-wide parameters such as the rendering quality and the speed at which the environment is run can be specified. The External Communicator lives within the Academy.
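To make this concrete, here is a rough skeleton of such an Agent in C#. This is a minimal sketch assuming the pre-1.0 ML-Agents API that was current at the time; the class name BallShooterAgent is made up for illustration and not taken from the repository.

    using MLAgents;
    using UnityEngine;

    // Minimal sketch of an ML-Agents Agent (pre-1.0 API).
    // The Brain it uses is assigned to the Agent in the Unity Inspector.
    public class BallShooterAgent : Agent
    {
        // Called whenever the Brain asks this Agent for observations.
        public override void CollectObservations()
        {
            // AddVectorObs(...) for every value the agent should "see"
        }

        // Called with the action the Brain (i.e. the policy) has chosen.
        public override void AgentAction(float[] vectorAction, string textAction)
        {
            // act in the scene, then AddReward(...) when appropriate
        }

        // Called when an episode ends and the agent needs to be reset.
        public override void AgentReset()
        {
            // move the agent back to a starting position, etc.
        }
    }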

Designing the environment

Equipped with this knowledge, we can now start designing our environment. We have to answer a few simple questions:

  1. What does our agent observe? 👀
  2. What actions can it take? ✋
  3. How can the agent be rewarded? 💰

What does our agent see? — One of the most important parts of creating a learning environment is to start as simple as possible. It is also a good idea to encode positional information relative to the agent, so instead of observing the position of the hoop and the position of the agent separately, we boil it down and only observe the agent's distance to the court (hoop). This value has to be normalized to the range 0 to 1 (or -1 to +1). In our case the maximum value seems to be around 25.4, the minimum is around 0, and we want the range -1 to +1, so we can easily calculate it like this:
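The calculation is just a linear rescale from [0, 25.4] to [-1, +1]. A minimal sketch of how the observation could be collected; distanceToHoop and the constant are assumptions based on the values above:

    // Inside CollectObservations(): rescale the distance from [0, 25.4] to [-1, +1].
    const float maxDistance = 25.4f;                                 // observed maximum, see above
    float normalizedDistance = (distanceToHoop / maxDistance) * 2f - 1f;
    AddVectorObs(normalizedDistance);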

What actions can it take? — Again, we want to start as simple as possible. The agent only decides how much force to use when throwing the ball. For continuous actions the range is clipped between -1 and 1. We have to make sure our calculations fit within this range.
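A sketch of how the incoming action could be mapped onto a throwing force; maxThrowForce, ballRigidbody and throwDirection are hypothetical names, not taken from the project:

    // Inside AgentAction(float[] vectorAction, string textAction):
    float action = Mathf.Clamp(vectorAction[0], -1f, 1f);   // continuous action in [-1, +1]
    float force = (action + 1f) * 0.5f * maxThrowForce;     // remap to [0, maxThrowForce]
    ballRigidbody.AddForce(throwDirection * force, ForceMode.Impulse);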

How can the agent be rewarded? — We want the agent to hit the court (hoop), so we reward it when it does. You really don’t want to over-design your rewards, because that could easily lead to reward-exploitation by the agent.
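In code this is a single sparse reward, for example handed out from a trigger collider on the hoop. OnHoopHit is a hypothetical callback name, not the project's actual method:

    // Hypothetical callback, invoked when the ball passes through the hoop's trigger collider.
    public void OnHoopHit()
    {
        AddReward(1.0f);   // the only reward signal: a successful shot
    }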

Challenges

We have a good foundation to build a learning environment, but there are a few challenges we have to solve. If we let the agent continuously throw balls at the hoop and reward it accordingly, it can be difficult for it to associate reward and action.

Thanks to Abe Haskins — source.

It could get rewarded for a ball thrown earlier and conclude that the last throw was a great action. So what can we do to prevent this?

We wait until the ball hits the hoop, or, if it misses and flies past, we destroy it and only then let the agent throw again. In ML-Agents terms this is called On Demand Decision Making:

On demand decision making allows Agents to request decisions from their Brains only when needed instead of receiving decisions at a fixed frequency. (source)

It sounds fancy and complicated, but in reality it just means we have to tick the On Demand Decisions checkbox (pretty much…).
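Under the hood the Agent then only acts when RequestDecision() is called, so we throw, wait until the ball either scores or gets destroyed, and only then ask the Brain for the next action. A minimal sketch; OnBallResolved is a hypothetical callback:

    // Hypothetical callback, fired once the thrown ball has either hit the hoop or been destroyed.
    public void OnBallResolved()
    {
        // Only now request the next decision, so each reward can be matched to the throw that earned it.
        RequestDecision();
    }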

Another challenge arises

This solution works, but it slows down the training process. Even when speeding up the game time during training, it still takes a lot of time to gather relatively little data. Thankfully, this is not the real world, and one brain is not limited to one agent.

By using multiple agents we compensate for the slowdown caused by On Demand Decision Making. Remember, they all use the same brain — meaning all learning progress is shared. It would be entirely possible to train them in different environments, but this is fine for now.

Result

It seems to be working pretty well! 👍

Ending Notes

If you have come this far, thanks a lot! This is my first Medium article. I am by no means an expert in machine learning; I am just sharing what I have learned so far.

I'm really curious what you have to say. Feel free to leave suggestions and feedback; I am sure there is a lot I can optimize.

✌️ @seppischuchmann
