Killers and Explorers: Training RL Agents in Unreal Engine

Jonathan Jørgensen
Apr 18, 2023

In my previous story, I explored a new experimental plugin for training reinforcement learning agents in Unreal Engine. Today I want to show two examples of agents that actually learn to solve a task. As the plugin is experimental and still under development, I’ll also highlight some updates that have been pushed during the past few weeks.

The initial state of the cube environment. Despite their stern look, these are the explorers, not the killers.

To demonstrate learning, I designed two environments, inspired by two supposed archetypes of players: the killer and the explorer. The first environment, admittedly a crude simplification of the concept of exploration, rewards the agent for chasing down a green cube. The second environment pits the agents against each other, rewarding those who can shoot without being shot themselves.

Update: The project is now available as a public GitHub repository.

Plugin Updates

Since last time, Epic Games has continued to develop their plugin. The first section of this post will therefore cover what is different this time around. According to the developers, the current architecture is not likely to change much in the foreseeable future.

Learning Manager

In the previous post, I included a “Learning Manager” that tied everything together. This is now a canonical approach in the plugin, and such managers can derive from the LearningAgentsManager class. This class contains functions for adding and managing agents, as well as callback events that fire when agents are added. One way to use these callbacks is to provide setup procedures that are specific to the environment, while still using generic pawn agents.
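As a minimal sketch of that idea, here is what an environment-specific setup routine hooked to the manager’s agent-added callback could look like. The function name and the way it is wired into the manager are my own placeholders, not the plugin’s actual API; only the body uses standard engine calls.

```cpp
// Illustrative sketch: environment-specific setup to run when the manager's
// "agent added" event fires. The function name and the way it is bound to the
// manager are placeholders; the body uses only standard engine calls.
#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"

// Called (e.g. from the manager's agent-added callback) for each new agent.
static void HandleAgentAdded(APawn* Agent)
{
    if (Agent == nullptr)
    {
        return;
    }

    // Environment-specific setup while keeping the pawn itself generic:
    // drop the agent at a random location and yaw inside the arena
    // (arena bounds are an assumption for illustration).
    const FVector SpawnLocation(
        FMath::FRandRange(-1000.0f, 1000.0f),
        FMath::FRandRange(-1000.0f, 1000.0f),
        100.0f);
    const FRotator SpawnRotation(0.0f, FMath::FRandRange(0.0f, 360.0f), 0.0f);

    Agent->SetActorLocationAndRotation(SpawnLocation, SpawnRotation);
}
```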

Interactor

What was previously referred to as the AgentType has taken on the more descriptive name “Interactor”: it is through this class that the learning algorithm interacts with the world, and vice versa. The class is more or less the same as before, but note that the plugin now supports a few more types of observations and actions.

Policy & Critic Classes

The agent policy and the critic are now exposed to Blueprint as their own classes, each with its own inference interface and configuration struct. Currently they are used as data-only Blueprints, but they require their own setup in the manager.

Training Loop

One of the most satisfying additions is a helper function that runs everything with a single call, in contrast to the chain of nodes from the previous post that had to be called in the right order. In my project this still lives in the manager’s tick event. Note that it will only work if everything is set up appropriately first. Due to the visual size of the Blueprint, I will describe the setup as a sequence of calls rather than a screenshot:

  1. Setup Manager
  2. Setup Interactor
  3. Setup Critic
  4. Setup Policy
  5. Add all agents with a loop
  6. Setup Trainer

At this point the IsSetup function should return true and training can begin. Note that the Run Training node will also initialize the training if that has not already been done explicitly (not to be confused with Setup Trainer). The parameters passed in (aside from Target) are only used for this purpose.
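Since the actual graph is too large for a screenshot, here is the same flow written out as a self-contained C++ sketch. The method names simply mirror the Blueprint nodes listed above and the facade class is entirely hypothetical; the real plugin’s signatures and parameters differ, so read this only as a summary of the order of operations.

```cpp
// Reading aid only: the Blueprint setup chain written out in order, using a
// hypothetical facade whose method names mirror the nodes above. The real
// plugin's Blueprint/C++ signatures differ; this captures the flow, not the API.
#include "CoreMinimal.h"

class FTrainingSetupFacade
{
public:
    void SetupManager()    {} // stub standing in for the "Setup Manager" node
    void SetupInteractor() {}
    void SetupCritic()     {}
    void SetupPolicy()     {}
    void AddAgent(int32 /*AgentIndex*/) {}
    void SetupTrainer()    {}
    bool IsSetup() const   { return bIsSetup; }
    void RunTraining()     {} // stub; the real node also initializes training on its first call

    // Steps 1-6 from the list above, run exactly once.
    void SetupOnce(int32 NumAgents)
    {
        SetupManager();
        SetupInteractor();
        SetupCritic();
        SetupPolicy();
        for (int32 i = 0; i < NumAgents; ++i)
        {
            AddAgent(i);
        }
        SetupTrainer();
        bIsSetup = true;
    }

    // Intended to run every tick: set up once, then let Run Training do the rest.
    void TickTraining(int32 NumAgents)
    {
        if (!IsSetup())
        {
            SetupOnce(NumAgents);
        }
        RunTraining();
    }

private:
    bool bIsSetup = false;
};
```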

New training loop, which both starts and runs the entire training

I will also attach the tooltip for the node, as it explains everything perfectly:

Tooltip for the “Run Training” node

Now we move on to the main course of this post: training reinforcement learning agents to solve simple tasks in Unreal Engine. These are still toy problems, but they serve both as a starting point for developers who are curious to try this out themselves and as proof that the learning actually amounts to something.

Environment #1: Cube

The cube environment represents a very simple navigation task: stay in contact with a green cube. The catch is that by touching the cube, the agent also pushes it away. Every 1000 steps, the agents are reset to a random location and rotation.

Observations, Actions and Rewards

The observation space of this environment consists of six line traces. The first three traces point straight ahead, 20 degrees to the right, and 20 degrees to the left, and return the distance to the first wall hit. The remaining three traces use the same directions, but only return the distance to the target cubes. The values are normalized by dividing the distance by the maximum trace range.
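For the curious, a possible C++ rendering of these trace observations is sketched below. The trace range, the collision channels used to separate walls from cubes, and the helper names are assumptions on my part; only the engine calls themselves are standard.

```cpp
// Sketch of the six-trace observation: three angles, traced once against walls
// and once against target cubes. Range and channels are illustrative assumptions.
#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"
#include "Engine/World.h"

static float NormalizedTraceDistance(const APawn& Agent, float YawOffsetDeg,
                                     ECollisionChannel Channel, float MaxRange)
{
    const FVector Start = Agent.GetActorLocation();
    const FRotator TraceRotation = Agent.GetActorRotation() + FRotator(0.f, YawOffsetDeg, 0.f);
    const FVector End = Start + TraceRotation.Vector() * MaxRange;

    FCollisionQueryParams Params;
    Params.AddIgnoredActor(&Agent);

    FHitResult Hit;
    const bool bHit = Agent.GetWorld()->LineTraceSingleByChannel(
        Hit, Start, End, Channel, Params);

    // Normalize by the maximum trace range; 1.0 means "nothing hit in range".
    return bHit ? Hit.Distance / MaxRange : 1.0f;
}

static void GatherCubeObservations(const APawn& Agent, TArray<float>& OutObservations)
{
    constexpr float MaxRange = 3000.f;            // assumption
    const float Angles[3] = { 0.f, 20.f, -20.f }; // forward, right, left

    OutObservations.Reset();
    for (float Angle : Angles) // distance to the first wall
    {
        OutObservations.Add(NormalizedTraceDistance(Agent, Angle, ECC_WorldStatic, MaxRange));
    }
    for (float Angle : Angles) // distance to target cubes (assumed custom trace channel)
    {
        OutObservations.Add(NormalizedTraceDistance(Agent, Angle, ECC_GameTraceChannel1, MaxRange));
    }
}
```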

There are only two actions. The first is a walking action, treated as binary: the character walks forward at a constant pace if the action value is positive, and stands still if it is negative. The second is a continuous rotation delta, which rotates the character’s yaw directly by the output value of the action. The reward is 1 for every tick the character is in contact with a cube and -0.01 otherwise; touching the outer walls incurs a penalty of 0.1.
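A matching sketch of the actions and reward could look like this. The walk speed and whether the wall penalty stacks with the per-tick penalty are my interpretation; the reward constants follow the description above.

```cpp
// Sketch of the two actions and the reward scheme described above.
// Walk speed and the contact checks are illustrative assumptions.
#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"

// Action 0: walk forward at a constant pace when positive, stand still otherwise.
// Action 1: continuous yaw delta, applied directly.
static void ApplyCubeActions(APawn& Agent, float WalkAction, float RotateAction, float DeltaTime)
{
    constexpr float WalkSpeed = 400.f; // cm/s, assumption

    if (WalkAction > 0.f)
    {
        Agent.AddActorWorldOffset(Agent.GetActorForwardVector() * WalkSpeed * DeltaTime, /*bSweep=*/true);
    }
    Agent.AddActorWorldRotation(FRotator(0.f, RotateAction, 0.f));
}

// Reward: +1 per tick in contact with a cube, -0.01 otherwise,
// plus an extra -0.1 when touching the outer walls (stacking is my interpretation).
static float ComputeCubeReward(bool bTouchingCube, bool bTouchingWall)
{
    float Reward = bTouchingCube ? 1.0f : -0.01f;
    if (bTouchingWall)
    {
        Reward -= 0.1f;
    }
    return Reward;
}
```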

The “Cube” environment, where players are rewarded for being in contact with the green cubes. The cubes have physics enabled, which means they will be pushed around by the agent, making the task non-trivial.

Results

By running training overnight and logging the output with W&B, we can see that the agents solved the problem to some extent. However, the eagle-eyed reader will notice a significant drop after around 10–20k steps, and another around 120k. This turned out to be a silly bug: the green cubes could be pushed out of the environment permanently. With a total of three cubes, this was not sustainable. In addition, the agents collide with one another, which means they can “occupy” a cube and prevent the others from reaching it. With four agents and (at most) three cubes, this explains why the average reward decreases as the cubes disappear, one by one.

Average episodic reward from training in the “Cube” environment

Environment #2: Shoot

The second environment brings us closer to a typical game by introducing a shooting mechanic for the agents. When an agent chooses to shoot, it traces a line straight ahead, and if another agent is the first object hit, that agent is shot. Successfully shooting another agent gives a reward of 1, and being shot gives a penalty of 0.5. To give the problem more depth, the shooting ability has a cooldown of 20 steps, preventing the agents from shooting constantly. The observations are similar to the first environment, with the difference that the second set of traces responds to other agents instead of reward cubes. The observation also contains the current cooldown value, where zero means ready to shoot.
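The shooting mechanic with its cooldown might be sketched as follows. The trace range, the choice of visibility channel (so walls can block a shot), and when exactly the cooldown restarts are assumptions; the reward of 1, the penalty of 0.5, and the 20-step cooldown follow the description above.

```cpp
// Sketch of the shoot action with a 20-step cooldown. Trace range and the way
// a hit agent is identified are illustrative assumptions.
#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"
#include "Engine/World.h"

struct FShooterState
{
    int32 Cooldown = 0; // 0 means ready to shoot (also fed in as an observation)
};

// Returns the shooter's reward for this step; reports the victim and its -0.5
// penalty through the output parameters so the caller can apply them.
static float TryShoot(APawn& Shooter, FShooterState& State, bool bWantsToShoot,
                      APawn*& OutVictim, float& OutVictimPenalty)
{
    OutVictim = nullptr;
    OutVictimPenalty = 0.f;

    // Tick the cooldown down once per step.
    if (State.Cooldown > 0)
    {
        --State.Cooldown;
    }

    if (!bWantsToShoot || State.Cooldown > 0)
    {
        return 0.f;
    }

    constexpr float ShotRange = 5000.f; // assumption
    const FVector Start = Shooter.GetActorLocation();
    const FVector End = Start + Shooter.GetActorForwardVector() * ShotRange;

    FCollisionQueryParams Params;
    Params.AddIgnoredActor(&Shooter);

    // Trace on the visibility channel so walls can block the shot.
    FHitResult Hit;
    const bool bHit = Shooter.GetWorld()->LineTraceSingleByChannel(
        Hit, Start, End, ECC_Visibility, Params);

    State.Cooldown = 20; // assumption: any fired shot restarts the cooldown

    // Only a reward if another agent is the *first* thing the trace hits.
    if (bHit)
    {
        if (APawn* HitPawn = Cast<APawn>(Hit.GetActor()))
        {
            OutVictim = HitPawn;
            OutVictimPenalty = -0.5f;
            return 1.0f;
        }
    }
    return 0.f;
}
```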

The “Shoot” environment, where the players can fire a shot straight forward. If they hit another player, they are rewarded, while the victim is penalized. Shots have a cooldown.

Results

The reward plot of the shooter environment looks like a textbook reinforcement learning result: the agents converge around a ceiling and stay there. Before ending the training, I observed their behavior. The strategy they appear to use is to start rotating quickly as they spawn until they face another agent, and then they start blasting. As an experiment, I added strafing (sideways movement), as it might serve as a tool for dodging shots. However, the results did not change significantly after this addition. This remains future work.

Average episodic reward from training in the “Shoot” environment

Final Remarks

While still experimental, I think the LearningAgents plugin is functional to the point where you can run normal experiments with a low risk of technical disasters. However, there is a big part of the plugin that I have barely mentioned and not yet touched: imitation learning. I might write a new post to cover it eventually, but for the time being, my plan is to try to push the limits of the current setups with more complex agents and environments.

As always, don’t hesitate to reach out, and if there is a demand for it, I can clean up the project and turn it into a public git repository.

Until next time! 👋
