Imitation Learning in Game Development

Christos Perhanidis
7 min read · Jun 1, 2020


Example of Unity ML

Intro

What if we could somehow transfer human knowledge about the world to an agent? How can we extract all this information? How can we turn it into a model? The answer is Imitation Learning.

The idea behind Imitation Learning (IL) is to feed the agent information about the environment directly and have it mimic what it is shown. Instead of trying to learn from rewards or from a manually designed reward function, an expert (typically a human) provides a set of demonstrations. The agent then tries to learn the optimal policy by imitating the expert’s decisions.

IL techniques aim to mimic human behavior in a given task. An agent is trained to perform the task from demonstrations by learning a mapping between observations and actions. The paradigm of learning by imitation is gaining popularity because it makes teaching complex tasks to an agent much easier: only basic knowledge of the task is needed. Generic imitation learning methods could potentially reduce the problem of teaching a task to that of providing demonstrations, without the need to design a reward function for each specific task.

It sounds good, but what about real life? Unity has recently released the ML-Agents Toolkit, an open-source project that allows games and simulations to serve as environments for training intelligent agents. Agents can be trained using reinforcement learning, imitation learning, neuroevolution or other machine learning methods through a simple-to-use Python API. You can learn more here: https://unity3d.com/machine-learning. ML-Agents provides a central platform where advances in AI can be evaluated on Unity’s rich environments and then made accessible to the wider research and game developer communities.

In this simple project we will use ML-Agents Release 1 with Unity 2018.4.19f1. The source code is available here: https://github.com/pmchrist/ml-agents

Assumptions

There are several questions behind this project. Can Imitation Learning actually speed up agent training? Are we bound by the expert’s performance in the demonstrations? How many demonstrations are needed to train an agent? And could the expert pass their “playstyle” to the agent and make it follow unnecessary rules?

Project Overview

We are not going to get into the details of how Unity works, but we will cover the basics needed to understand the project. The main Unity window is divided into sub-windows: Project (shows all available assets), Scene (where we create our environment and place the Agent in the world), Hierarchy (the list of objects in the Scene), Inspector (shows the parameters and scripts attached to the selected object) and Game (the window where we can watch the Agent perform).

Window of Unity Editor

Our Scene (the simulation environment) is very simple. It consists of four main parts: the Floor with Walls (just impenetrable colliders), the Target (a cube that the Agent must push into a green zone), the Goal (the green zone that ends the episode when the Target is inside it) and the Agent itself.

Learning Environment created in Unity

Now we can look at our Agent in more detail. It has several components: a Transform (its position in the world), a Rigidbody (which makes the Agent obey physics), a Box Collider (which defines its collision volume in the world) and various C# scripts. Before we dive into the scripts, we need to talk about the Agent’s logic and how it is supposed to work. Our Agent is a simple cube that obeys the laws of physics and navigates the world using 7 raycasts that scan the environment. Each raycast returns the distance to and the tag of the object it hit, and the Agent acts on this information. It can rotate left/right and move forward/backwards. The environment tags are: Ground (floor), Area (walls), Goal (green zone) and Block (target cube).
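To make this concrete, here is a minimal sketch of what such an agent script might look like with the ML-Agents Release 1 C# API. It is not the project’s actual Push Agent Basic script: the class name, the force and turn values, and the exact action ordering are illustrative, and only the movement logic is shown here.

```csharp
// Minimal sketch of a push-block agent (ML-Agents Release 1 C# API).
// Class name and tuning values are illustrative, not the project's actual script.
using UnityEngine;
using Unity.MLAgents;

public class SimplePushAgent : Agent
{
    Rigidbody rb;                  // the Agent obeys physics through its Rigidbody
    public float moveForce = 2f;   // illustrative movement strength
    public float turnSpeed = 200f; // illustrative turn rate, degrees per second

    public override void Initialize()
    {
        rb = GetComponent<Rigidbody>();
    }

    // No CollectObservations() override is needed here: the Ray Perception Sensor
    // component attached in the Inspector feeds the raycast hits (tag + distance)
    // to the policy automatically.

    // One discrete action per decision, matching the six actions described above.
    void MoveAgent(float[] act)
    {
        var dirToGo = Vector3.zero;
        var rotateDir = Vector3.zero;

        switch (Mathf.FloorToInt(act[0]))
        {
            case 0: dirToGo = transform.forward;  break; // move forward
            case 1: dirToGo = -transform.forward; break; // move backward
            case 2: dirToGo = transform.right;    break; // move right
            case 3: dirToGo = -transform.right;   break; // move left
            case 4: rotateDir = transform.up;     break; // turn clockwise
            case 5: rotateDir = -transform.up;    break; // turn counterclockwise
        }

        transform.Rotate(rotateDir, Time.fixedDeltaTime * turnSpeed);
        rb.AddForce(dirToGo * moveForce, ForceMode.VelocityChange);
    }

    public override void OnActionReceived(float[] vectorAction)
    {
        MoveAgent(vectorAction);
    }
}
```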

Now we can understand the role of each script. The Ray Perception Sensor manages the raycasts, and Push Agent Basic decides which action the Agent must take based on the returned information. Behavior Parameters is the Agent’s mind: it contains the neural network model that we are going to train. Decision Requester periodically requests a decision for the Agent from Behavior Parameters. Model Overrider gives us the opportunity to start from a pre-trained model, to speed up or continue training. Demonstration Recorder records our demonstrations so they can be fed into training.
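To record demonstrations, the Behavior Type in Behavior Parameters is typically set to Heuristic Only, so a human drives the Agent through its Heuristic() method while the Demonstration Recorder writes the resulting observation-action pairs to a .demo file. A rough sketch of such a heuristic, added to the SimplePushAgent class above (the key bindings are illustrative):

```csharp
// Illustrative keyboard heuristic for recording human demonstrations.
// With Behavior Type set to "Heuristic Only" and the Demonstration Recorder enabled,
// these key presses become the discrete actions that get written into the .demo file.
public override void Heuristic(float[] actionsOut)
{
    if (Input.GetKey(KeyCode.S))      actionsOut[0] = 1f; // move backward
    else if (Input.GetKey(KeyCode.D)) actionsOut[0] = 2f; // move right
    else if (Input.GetKey(KeyCode.A)) actionsOut[0] = 3f; // move left
    else if (Input.GetKey(KeyCode.E)) actionsOut[0] = 4f; // turn clockwise
    else if (Input.GetKey(KeyCode.Q)) actionsOut[0] = 5f; // turn counterclockwise
    else                              actionsOut[0] = 0f; // fall back to moving forward,
                                                          // since this simplified action
                                                          // set has no explicit "do nothing"
}
```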

Agent “Object” with attached Scripts

Training

Training in the ML-Agents Toolkit is powered by a dedicated Python package. We will train our Agent with a current state-of-the-art, model-free imitation learning method called Generative Adversarial Imitation Learning (GAIL), whose basic principles we have already covered.

We are going to train our agent in four ways: without any demonstrations at all; with “Good” gameplay demonstrations, completing the game in a short time; with “Perfect” gameplay demonstrations, completing the game in the shortest time possible; and with “Bad” gameplay demonstrations, where we play poorly and always push the block to the side. We’ll see whether our assumptions hold up.

Example of Demonstration with “Good” performance
Example of Demonstration with “Bad” performance

Here are the hyperparameters that were used for training.

Hyperparameters used for training

You can learn more here: https://github.com/Unity-Technologies/ml-agents/blob/release_1/docs/Training-Configuration-File.md

We chose the simplest possible reward function to keep the Agent’s training as unbiased as possible: a big reward of +1.0 for completing the task and a small penalty of -0.0025 for each step it takes to complete it (the default value).
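Wired into the SimplePushAgent sketch from earlier, this amounts to only a few lines: the per-step penalty goes into OnActionReceived(), and the goal bonus is granted by a hook (here named OnGoalReached(), an illustrative name) that a trigger script on the Goal zone would call when the Block enters it.

```csharp
// Reward function described above: -0.0025 per step, +1.0 when the Block reaches the Goal.
public override void OnActionReceived(float[] vectorAction)
{
    AddReward(-0.0025f);     // small penalty for every step taken (the default value)
    MoveAgent(vectorAction); // movement switch from the earlier sketch
}

// Illustrative hook, called by a trigger script on the Goal zone
// when the Target block enters the green area.
public void OnGoalReached()
{
    AddReward(1.0f);         // big reward for completing the task
    EndEpisode();            // end the episode so the environment can reset
}
```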

Training screen

Final Overview

  • Set-up: A platforming environment where the agent can push a block around.
  • Goal: The agent must push the block to the goal.
  • Agents: The environment contains one agent.
  • Agent Reward Function:
    • -0.0025 for every step.
    • +1.0 if the block touches the goal.
  • Behavior Parameters:
    • Vector Observation space: (Continuous) 70 variables corresponding to 14 ray-casts, each detecting one of three possible objects (wall, goal, or block); each ray contributes a one-hot encoding of the detected tag, a “hit nothing” flag and a normalized hit distance, i.e. five values per ray.
    • Vector Action space: (Discrete) Size of 6, corresponding to turning clockwise and counterclockwise and moving along four different face directions.
  • Benchmark Mean Reward: 4.5
Training process of “Bad” performing Agent
Training process of “Good” performing Agent
Training process of “Perfect” performing Agent
Training process with RL of Agent

Results

We trained five models for our Agent to follow.

The first one uses GAIL, based on 30 demonstrations of “Good” performance (mean reward = 4.80). Training time: 30 minutes.

The second one is the same as the first, but training lasted 2 hours instead of 30 minutes.

The third one uses GAIL, based on 100 demonstrations of “Perfect” performance (mean reward = 4.92). Training time: 30 minutes.

The fourth one uses GAIL, based on 15 demonstrations of “Bad” performance (mean reward = 4.09). Training time is the same: 30 minutes.

The fifth one does not use GAIL; the Agent tries to figure out everything on its own with classical Reinforcement Learning.

Training results in TensorBoard

The results from TensorBoard are very interesting. Cumulative reward, i.e. the Agent’s performance, ends up highest when the Agent learns everything on its own. But when the Agent is given enough very good demonstrations, GAIL shows its true potential: the Agent reaches good rewards much faster. The Agent trained on bad demonstrations cannot reach high rewards and seems bound to a mean reward of about 4.0, the same mean reward we had in our “Bad” demonstrations. The most interesting part is that the Agent with the smaller sample of “Good” demonstrations achieved the worst performance.

Performance of Agent with “Bad” model
Performance of Agent with “Good” model
Performance of Agent with “Perfect” model
Performance of Agent trained with RL

There is one more thing, though. As we can see from the Agent’s gameplay, it plays with the “playstyle” of the demonstrations we showed it. In the fourth case, it moves the block toward the wall (as shown in the “Bad” demonstrations), shuffles it around, and only finishes the game if it has not pushed the block into an unreachable corner. This means that, without changing the reward function, we managed to pass a “playstyle” to the Agent with GAIL.

In summary, all of our Imitation Learning assumptions held up. For training to be successful, we must give our Agent enough demonstrations to learn from and check that they show optimal performance. The Agent may imitate a “playstyle”, which can be useful for training different Agent behaviors without changing the reward function. And the Agent can only reach a human-like level of performance, but in a videogame this is probably a good result for the AI. Games have to be fair to the player, after all.
