Attempting to Beat Sonic the Hedgehog with Reinforcement Learning

Published in

aureliantactics

5 min readJul 9, 2018

OpenAI held a Retro Contest where competitors trained Reinforcement Learning (RL) agents on Sonic the Hedgehog. The goal of the competition was to train an agent on levels of Sonic from the first three games and see how well the agent performed on new levels created specifically for the competition. While I didn’t accomplish anything of note in the competition, I wondered whether it was possible to beat the first Sonic the Hedgehog for Genesis using current RL tools.

The tech report that went along with the competition reported how well the baseline algorithms that OpenAI provided for the competition did on the Sonic levels. Only 6 of the 17 levels were successfully cleared with the baseline algorithms. I attempted to beat the remaining 11 levels. I was successful on 8 of the 11 and can see possible paths to clearing 2 more levels using my methods. Beating the first Sonic is definitely doable though I came a bit short in clearing the game.

The baseline algorithms feature Rainbow DQN and PPO. While I was able to complete the second level of Sonic at a higher success rate and get incrementally further in the fourth level using DQN with human demonstrations, I found more success with PPO for this task. PPO is a policy gradients method that uses a surrogate loss function to keep improvements in the policy in a stable region.

My main approach to using PPO was to change the reward function to use human trajectories. As I detailed in a prior post, the default reward function rewards Sonic for making positive horizontal progress. I instead used human demonstrations to come up with trajectories and modified the reward function to reward Sonic for progressing further along these trajectories. That combined with only rewarding Sonic for positive progress — and not punishing him for moving the wrong way in the level — allowed Sonic to clear 8 of the 11 levels.

I mostly kept the default PPO parameters. One exception was increasing the max timesteps from 10,000,000 to 20,000,000. While a rigorous search of hyperparameters may be helpful for this task, the levels I attempted different hyperparameters on had worse performance. I also limited Sonic to the following actions: left, right, jump, down (some levels only), and no action (some levels only).

Here is a reward score comparison. This is not a fair comparison since the tech report is not optimizing per level but seeks to improve upon unseen levels and the algorithms use different reward function. I am including the best run for the trajectory PPO I used.

+------------------------+-----------+-----------+---------+
|         Level          | Joint PPO*| Traj. PPO*| Cleared |
+------------------------+-----------+-----------+---------+
| Marble Zone Act 1      |      5007 | 8500      | Yes     |
| Marble Zone Act 2      |      1620 | 7089      | Yes     |
| Marble Zone Act 3      |      2054 | 2509      | No      |
| Spring Yard Zone Act 1 |      4663 | 8486      | Yes     |
| Spring Yard Zone Act 2 |      9300 | NA        | Yes     |
| Spring Yard Zone Act 3 |      2608 | 7550      | Yes     |
| Labyrinth Zone Act 1   |      5041 | 8174      | Yes     |
| Labyrinth Zone Act 2   |      1338 | 3784      | No      |
| Labyrinth Zone Act 3   |      1919 | 3662      | No      |
| Star Light Act 1       |      6363 | 8307      | Yes     |
| Star Light Act 2       |      8336 | NA        | Yes     |
| Star Light Act 3       |      2637 | 7159      | Yes     |
| Scrap Brain Act 1      |      2112 | 7769      | Yes     |
| Scrap Brain Act 2      |      1404 | 8262      | Yes     |
+------------------------+-----------+-----------+---------+
*Average of final 10 runs for Joint, 100 runs for Traj.

The trajectory PPO I used was effective in almost all cases. In some levels, PPO achieved better than human performance scores mentioned in the tech report. While agents trained with trajectory PPO generally followed my demonstration trajectories, the agents do not exactly mirror my runs. The agents often come up with their own solutions to pass obstacles. In most cases one human demonstration was sufficient. In some levels I had to try a different demonstration that took easier parts in the level or more explicitly emphasized certain trajectories.

The three levels that I was unable to clear are Marble Zone Act 3, Labyrinth Zone Acts 2 and 3. Marble Zone Act 3 has a difficult part where Sonic has to push a block into lava, patiently ride on top of it, and jump off it when the lava erupts and pushes Sonic up. The Sonic agent is frantic and trains poorly on tasks where Sonic has to be patient. The reward shaping of ring rewards I detailed in this post helped Sonic eventually clear this part — after many attempts. Punishing Sonic for losing rings and for not having any rings and not allowing him to get reward while having no rings worked well at that specific part of the level but hurt Sonic at earlier and later parts of the level:

I am confident that Marble Zone Act 3 — with proper reward and hyperparameter optimization — can be completed. While the level is long, the part that Sonic fails at— swinging platforms over lava — was successfully cleared in Marble Zone Act 2. Labyrinth Zone Act 2 can possibly be cleared. The part where Sonic failed was due to running out of timesteps and not an insurmountable challenge. However to get that far I had to use waypoints as my reward function and they were much trickier to get working than trajectories. Labyrinth Zone Act 3 fails due to Sonic drowning at a difficult part. Perhaps with more access to the game data and the creation of a Sonic breath indicator, better hyperparameters, or better reward crafting Sonic can clear the level but I did not make much progress in the difficult part.

For those who are interested here is a jupyter notebook with results and some functions useful for turning retro gym logs into graphs. I’ll finish this post with some successful runs and quirks from training.

Breakthrough run. Sonic finally learns to get past the ‘push block onto switch’ part by doing a glitchy jump.

The run highlights the quirks of training by human trajectory. The RL agent learns something clever and something dumb at the same time. The way the seesaw jump works is that you jump onto it, a spiky ball on one end of the seesaw is launched, and when it lands on the other end of the seesaw Sonic will be thrown upwards. In the human trajectory, I jump onto the seesaw and launch the ball. I then move to the best spot to be launched from and wait. I get launched onto the platform, jump over the exploding bad guys and continue on. The trained Sonic agent does this differently. When he gets launched onto the platform he jumps back down again. This causes the exploding bad guys to not hurt him. When Sonic returns to the platform it is safe and it is easier for him to continue onwards.

However, note how Sonic learns to get launched. He jumps to the far right part of the seesaw (where I get launched from) but does so while the spiky ball is there. This causes him to take damage and lose all his rings. Later on in the level there is another seesaw jump sequence. Sonic dies constantly on this because he continually runs into the spiky ball — thinking this is the key to clear seesaw jumps — before he eventually figures out the trick.

Sonic clears the last level in the game.

Attempting to Beat Sonic the Hedgehog with Reinforcement Learning

Written by AurelianTactics