Creating a Custom Reward Function in Retro Gym and Other Utilities

AurelianTactics · Jul 2, 2018

See my prior blog post for an intro to this and this repo for the files I’ll be discussing.

The UC Berkeley Deep Reinforcement Learning (RL) course has a lecture by John Schulman where he talks about reward shaping. One part that stood out to me was his discussion of reward shaping and smoothing. To paraphrase, he says that hacking reward functions to give hints to the agent isn't talked about much in RL because it's considered a bit like cheating. However, it's very necessary and effective, particularly in robotics problems, and he gives an example of a heavily modified reward function that was used in a paper.

Why modify the reward function? Imagine an RL problem where the agent starts at 0 on the number line. The agent can either go right (+1) or left (-1). The agent gets a reward upon reaching a specific positive value, say 3. Seems simple enough to solve. The agent goes right, right, right, gets a reward. The reward signal eventually propagates and the agent learns to go right and solve the problem. However, what if the only reward comes when the agent reaches 100? 10,000? 1,000,000? If your agent starts out randomly selecting left or right, it will rarely if ever visit 1,000,000 (unless it acts for a huge number of timesteps). Without some sort of reward modification, how will the agent ever solve this simple toy problem in a reasonable amount of time?

Something similar happens in the Sonic the Hedgehog Retro Contest. The fourth level isn't solvable by Rainbow Deep Q Network (DQN) or Proximal Policy Optimization (PPO). The tech report scores are 3,000 for DQN and 2,700 for Joint PPO, where 9,000 points signals completing the level. The issue is that the contest's reward function scores you by how much positive horizontal progress Sonic makes (i.e. going right). On any level where you have to move vertically or move left, the reward function doesn't give any feedback. Here's the level map of the fourth level (Marble Zone Act 1):

Note how about halfway through the level Sonic has to drop down, move left, drop down further, and go right to finally get a tiny reward increase. For half the level the reward function doesn't give Sonic any feedback. That is why the Sonic agents fail to get through this level: Sonic goes from interacting with a steady stream of rewards to a sparse-reward environment, which is difficult to solve.

The simplest way to help Sonic is to modify the reward function. I came up with a few methods that rely on human demonstration trajectories. Human demonstrations can be easily made with the retro-movies repo. I created a retro_movie_trajectory.py file that takes these demonstrations and turns them into useful information for a modified reward function. Basically the script steps through each frame of the human demonstration and outputs three files:

  • Trajectory dictionary: The exact x,y coordinates Sonic took through the level, stored with the x,y position as the key and the step in the trajectory at which Sonic reached it as the value (see the sketch after this list). This can be used in the reward function to offer more positive reward the further Sonic gets along the trajectory (i.e. closer to finishing the level). While making a human demonstration you can press a key to keep coordinates from being included in this dictionary, which is useful for areas where you do some backtracking or where it is essential that Sonic reaches a certain point before proceeding onward (e.g. you want Sonic to step on a switch first).
  • Follow dictionary: Basically the trajectory dictionary with the key and value swapped. Sonic has to follow the exact trajectory step by step to get points, which is hard for Sonic to do. In your reward function you can require that Sonic hit the exact x,y values or only get somewhat close to them.
  • Waypoint dictionary: The user presses and holds a key at a small number of crucial points in the level. The reward function rewards Sonic for approaching the first waypoint. When Sonic reaches that waypoint, the reward function then rewards Sonic for approaching the next waypoint.
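To make these concrete, here's an illustrative sketch of what the first two dictionaries might look like once pasted into the Lua reward script. The "x_y" key format and the coordinate values are made up for illustration; the format produced by retro_movie_trajectory.py may differ.

-- Illustrative only: the trajectory dictionary as a Lua table.
-- Keys are "x_y" position strings, values are the step in the human demonstration.
trajectory_table = {
  ["96_812"] = 1,
  ["98_812"] = 2,
  ["101_812"] = 3,
  -- ... one entry per frame of the demonstration
}

-- The follow dictionary is the same mapping with key and value swapped:
follow_table = {
  [1] = {x = 96, y = 812},
  [2] = {x = 98, y = 812},
  -- ...
}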

The retro_movie_trajectory.py script can also be used for analyzing runs. If you call it with debug mode on, the script will render the level and print out the reward Sonic is getting frame by frame so you can diagnose what is going wrong with the run. To use from the terminal:

> python3 retro_movie_trajectory.py 0 #print the dictionaries
> python3 retro_movie_trajectory.py 1 #debug mode

Custom Reward Functions

To use a custom reward function, you will need to create two files similar to custom_scenario.json and custom_reward_script.lua (see the repo). The .json file is used when the environment is made (set the scenario parameter as shown in my previous blog post) and points to custom_reward_script.lua. The reward script is where the modifications happen that give your agent the reward signals it needs to complete the task.
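For orientation, the scenario file that points the reward at a Lua function looks roughly like the sketch below. The keys follow gym-retro's scenario format as I understand it, and the function name custom_reward is a placeholder; check the repo's custom_scenario.json for the real fields.

{
  "reward": {
    "script": "lua:custom_reward"
  },
  "scripts": [
    "custom_reward_script.lua"
  ]
}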

One annoying thing you will notice from the dictionaries I created and the .lua file is that I manually copied and pasted the dictionaries in. I created a script to automate this, and it works in Lua 5.1. However, it doesn't work in the modified Lua version that is included in retro-gym. If anyone knows how to read from a file in the retro-gym Lua version, please let me know.
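For reference, the kind of loader that works in stock Lua 5.1 looks something like the sketch below; the same code does not run under the Lua build bundled with retro-gym, as noted above. The file name and the "x,y,step" line format are made up for illustration.

-- Sketch: load the trajectory dictionary from a text file in stock Lua 5.1.
-- Assumes each line of the file is "x,y,step".
function load_trajectory_from_file(path)
  local t = {}
  for line in io.lines(path) do
    local x, y, step = line:match("(%-?%d+),(%-?%d+),(%d+)")
    if x ~= nil then
      t[x .. "_" .. y] = tonumber(step)
    end
  end
  return t
end

trajectory_table = load_trajectory_from_file("trajectory_dict.txt")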

Let's walk through some of the reward function modifications I made. In Sonic the Hedgehog, if you take damage you either lose all your rings or, if you have no rings, you die. Sonic sort of learns this behavior in some cases, but if you want to make it more explicit you can use the reward_by_ring() function:

prev_ring_num = 0

function reward_by_ring(data)
  local ring_reward = 0
  local current_ring_num = data.rings
  if current_ring_num < prev_ring_num then
    -- Sonic lost rings (took damage): large penalty
    ring_reward = -10.0
  elseif current_ring_num > prev_ring_num then
    if prev_ring_num == 0 then
      -- going from 0 rings back to having rings is especially valuable
      ring_reward = 2.0
    else
      -- small bonus for picking up additional rings
      ring_reward = 0.1
    end
  end
  prev_ring_num = current_ring_num
  return ring_reward
end

And if you want to punish Sonic for having 0 rings, which is a dangerous state to be in, you can do something like this in the reward function:

if first_ring_gotten == 1 and data.rings == 0 then
  -- small per-step penalty for sitting at 0 rings after the first ring was collected
  reward = reward - 0.05
end

While rewarding Sonic for proper ring behavior sounds beneficial, it can be tricky to tune correctly in practice. Scale the rewards the wrong way and Sonic becomes too afraid to act and lose rings, so he stands around doing nothing. I only used ring rewards to solve one level, and they were mainly designed to solve this specific part:

A proper Sonic agent would push the block, ride on it until the lava flow carried him up, and then jump off to continue. My Sonic was doing a suicidal jump that let him get a good distance reward score at the expense of all his rings and then his life. I modified the reward function to only let Sonic collect the distance reward if he had more than 0 rings, and penalized him a bit for losing rings and for sitting at 0 rings. The first couple of times I tried this I made the penalties too large and Sonic failed to even reach this part of the level.

The Retro Contest reward has two variants. One rewards Sonic based on his horizontal progress and penalizes him for negative progress. The other, implemented in Python outside the .lua file, gives Sonic a reward for reaching a new max horizontal position and doesn't penalize him for moving away from that max. My reward functions follow the same two variants: one set rewards Sonic based on progress towards/away from the goal (reward_by_trajectory() and reward_by_waypoint()), while the other only rewards and never punishes, paying out when Sonic reaches a personal best (reward_by_max_trajectory() and reward_by_max_waypoint()).

I found the reward-by-max variants worked better in practice. When penalized for negative progress, Sonic would become too timid and get stuck at parts of a level where some temporarily negative movement was needed. Let's walk through reward_by_max_trajectory():

The variable prev_step_max keeps track of the furthest point along the trajectory Sonic has reached. Every step, the function uses calc_trajectory_progress() to see if a new max (temp_progress) has been reached and uses the difference between prev_step_max and the new max to give Sonic a reward. The function also calls reward_by_ring(); if you don't want to use it, comment out that line and write local reward = 0 instead. level_done is used by the Retro Contest to see if Sonic has reached the end of the level and, if so, to reward him a bonus based on completion time. calc_trajectory_progress() looks up Sonic's position in the trajectory dictionary. This can be modified a bit so Sonic doesn't have to find the exact position, just a nearby one (see the file for details).

The full reward_by_max_trajectory() script is in the repo.
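Here is a minimal sketch of the idea. It is not the exact script: trajectory_table, the "x_y" key format, and the data.x / data.y position variables are assumptions based on the description above, and the level completion bonus (level_done) is omitted.

-- Sketch of the max-trajectory reward; illustrative, not the repo's script.
prev_step_max = 0

-- How far along the human demonstration trajectory is Sonic right now?
function calc_trajectory_progress(data)
  local key = tostring(data.x) .. "_" .. tostring(data.y)
  local step = trajectory_table[key]
  if step ~= nil then
    return step
  end
  return 0
end

function reward_by_max_trajectory(data)
  local reward = reward_by_ring(data) -- or: local reward = 0
  local temp_progress = calc_trajectory_progress(data)
  -- Only pay out when Sonic sets a new personal best along the trajectory;
  -- moving backwards is never punished.
  if temp_progress > prev_step_max then
    reward = reward + (temp_progress - prev_step_max)
    prev_step_max = temp_progress
  end
  return reward
end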

Unfortunately, the waypoint version, reward_by_max_waypoint(), is complicated by the fact that the episode doesn't immediately end when Sonic dies. Instead, Sonic jumps up and then travels a large vertical distance, which has to be accounted for or Sonic's deaths will produce a huge reward or penalty depending on where the waypoint is. I also used Manhattan distance (the sum of the absolute coordinate differences) to the waypoint instead of Euclidean distance (the square root of the sum of squared differences).

Line 13 of the waypoint script has a bit of code that is only called once. It sets up some variables that can't be initialized until the script is running. Of note is waypoint_reward_scale, which scales the distance_reward by the total distance Sonic has to travel from waypoint to waypoint, keeping the reward given consistent between levels. curr_distance (line 33) is where we check how much distance Sonic has traveled toward the current waypoint and whether that exceeds prev_max_distance_reward. If it does, Sonic gets a distance_reward.
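The full waypoint script is in the repo; the sketch below shows the general shape of the max-waypoint reward. The waypoint_list table, the data.x / data.y position variables, the arrival threshold, and the way waypoint_reward_scale is set are all assumptions for illustration.

-- Sketch of a max-waypoint reward using Manhattan distance; illustrative only.
-- Assumes waypoint_list is an ordered list of {x, y} waypoints pasted into the
-- .lua file, and that data.x / data.y expose Sonic's position.
waypoint_index = 1
dist_at_waypoint_start = nil
prev_max_distance_reward = 0.0
waypoint_reward_scale = 1.0 -- would be set once from the total waypoint-to-waypoint distance

function manhattan_distance(x1, y1, x2, y2)
  return math.abs(x1 - x2) + math.abs(y1 - y2)
end

function reward_by_max_waypoint(data)
  local wp = waypoint_list[waypoint_index]
  local curr_distance = manhattan_distance(data.x, data.y, wp.x, wp.y)
  -- Remember how far away Sonic was when this waypoint became the target.
  if dist_at_waypoint_start == nil then
    dist_at_waypoint_start = curr_distance
  end
  local distance_reward = 0.0
  -- Progress toward the waypoint is the distance closed since it became active.
  local progress = dist_at_waypoint_start - curr_distance
  -- Only reward new personal bests; moving away from the waypoint is not punished.
  if progress > prev_max_distance_reward then
    distance_reward = (progress - prev_max_distance_reward) * waypoint_reward_scale
    prev_max_distance_reward = progress
  end
  -- When Sonic gets close enough, switch to tracking the next waypoint.
  if curr_distance < 10 and waypoint_index < #waypoint_list then
    waypoint_index = waypoint_index + 1
    dist_at_waypoint_start = nil
    prev_max_distance_reward = 0.0
  end
  return distance_reward
end

A real version also needs to detect Sonic's death animation so the big vertical move on death doesn't register as progress toward (or away from) a waypoint, as described above.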

I’ll have more on how I used these scripts and reward functions to beat Sonic levels in an upcoming post.
