At the Speed of Reinforcement Learning: an OpenAI Contest Story

Team

Sergey Kolesnikov (scitator)
RL enthusiast. Senior Data Scientist @ Dbrain. Defended his master's thesis at MIPT during the contest.

Reinforcement Learning and Sonic

Baselines

As a baseline, we had complete guides for training Rainbow (a DQN approach) and PPO (a policy-gradient approach) agents on one of the possible Sonic levels and submitting the resulting agent. The Rainbow version was based on the anyrl project, while the PPO one used the well-known OpenAI baselines. The published baselines differed from the ones described in OpenAI's tech report: they were simpler and had fewer training-acceleration hacks.

Approaches and results

After a quick review of the suggested baselines, we chose OpenAI's PPO approach as the more mature and interesting basis for our solution; according to the OpenAI tech report, the PPO agent also handled the task better. So, our main features were:

1. Joint PPO training
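In joint PPO training (the setup described in OpenAI's tech report), a single policy gathers experience from all available training levels at once instead of specializing on a single level. Below is a minimal sketch of that setup, assuming gym-retro and the vectorized-environment utilities from OpenAI baselines; the level list and helper names are illustrative, not the contest code.

```python
from functools import partial

from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv
import retro

def make_sonic_env(game, state):
    # One Gym-style environment per training level.
    return retro.make(game=game, state=state)

# Two of the contest's training levels, for illustration only.
train_levels = [
    ("SonicTheHedgehog-Genesis", "GreenHillZone.Act1"),
    ("SonicTheHedgehog-Genesis", "SpringYardZone.Act2"),
]

env_fns = [partial(make_sonic_env, game, state) for game, state in train_levels]
vec_env = SubprocVecEnv(env_fns)  # one PPO policy then learns from all levels jointly
```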

2. Finetuning during the testing process

According to the main idea of the contest, we had to find the approach with the best generalization. This required the agent to finetune its policy on the test level, and that is essentially what we did: at the end of each episode/game, the agent estimated the collected reward and then updated its policy to maximize the expected reward.
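A minimal sketch of that test-time loop, assuming a hypothetical PPOAgent with act/store/update methods and a Gym-style Sonic environment; the actual solution built on ppo2 from OpenAI baselines, so the names and episode budget here are illustrative.

```python
def finetune_on_test_level(env, agent, num_episodes=100):
    """Run test episodes and keep improving the policy after each one."""
    best_return = float("-inf")
    for _ in range(num_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action = agent.act(obs)                 # sample from the current policy
            next_obs, reward, done, _ = env.step(action)
            agent.store(obs, action, reward, done)  # keep the transition for the update
            episode_return += reward
            obs = next_obs
        agent.update()                              # PPO-style policy improvement step
        best_return = max(best_return, episode_return)
    return best_return
```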

3. Exploration bonuses

Let's now look at the levels' reward conditions. The agent gets rewarded for progress along the x coordinate, so it can get stuck whenever it has to move forward and then backward. To counter this, we added an extra reward known as a count-based exploration bonus, given for every new state the agent manages to reach.
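A minimal sketch of such a bonus, assuming observations are numpy frames and state identity is approximated by a coarse hash of the downsampled screen (the same idea behind the perceptual-hash bonus in OpenAI's tuned baselines); the bonus scale and hashing scheme are illustrative.

```python
import hashlib
from collections import defaultdict

import numpy as np

class CountExplorationBonus:
    def __init__(self, scale=0.005):
        self.scale = scale
        self.counts = defaultdict(int)

    def _key(self, frame):
        # Downsample and quantize so visually similar screens map to the same key.
        small = frame[::8, ::8] // 32
        return hashlib.md5(small.tobytes()).digest()

    def __call__(self, frame):
        key = self._key(frame)
        self.counts[key] += 1
        # Novel states (low visit count) yield a larger bonus on top of the game reward.
        return self.scale / np.sqrt(self.counts[key])
```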

4. Search for the best initial policy

This improvement contributed greatly to the final result. The idea was quite straightforward: we trained several policies with different hyperparameters. At test time, each policy was evaluated on the first few episodes, and we then picked the best one for further finetuning.
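A minimal sketch of this selection step, assuming a list of pre-trained policies (checkpoints trained with different hyperparameters) and a Gym-style environment; the probing budget is illustrative.

```python
def run_episode(env, policy):
    """Play one episode with the given policy and return its score."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy.act(obs))
        total += reward
    return total

def pick_best_policy(env, policies, probe_episodes=3):
    """Probe each candidate policy for a few episodes and keep the best one."""
    scores = []
    for policy in policies:
        returns = [run_episode(env, policy) for _ in range(probe_episodes)]
        scores.append(sum(returns) / probe_episodes)
    best_index = max(range(len(policies)), key=lambda i: scores[i])
    return policies[best_index]  # this policy is then finetuned for the remaining budget
```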

Bloopers

What didn't work well:

  1. NN architecture changes: SELU activation, self-attention, SE blocks
  2. Neuroevolution
  3. Creating our own Sonic levels: we prepared the whole pipeline but didn't have enough time to make use of it
  4. Meta-learning algorithms such as MAML and Reptile
  5. Model ensembling with importance sampling

Results

OpenAI published the results three weeks after the competition ended. Our team took an honorable 4th place on the 11 additional test levels, jumping up from 8th place on the public leaderboard, and outperformed OpenAI's tuned baselines, whose tuning consisted of:

  1. Augmented action space with more common button combinations
  2. Exploration bonus based on the perceptual hash of the screen
  3. More training levels from the Game Boy Advance and Master System Sonic games
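For reference, the first of those changes builds on the action-space discretization used throughout the contest. Below is a minimal sketch in the spirit of the SonicDiscretizer wrapper from OpenAI's retro-baselines; the extra button combinations appended here are an assumption, not the exact tuned set.

```python
import gym
import numpy as np

class SonicActionWrapper(gym.ActionWrapper):
    BUTTONS = ["B", "A", "MODE", "START", "UP", "DOWN",
               "LEFT", "RIGHT", "C", "Y", "X", "Z"]
    COMBOS = [["LEFT"], ["RIGHT"], ["LEFT", "DOWN"], ["RIGHT", "DOWN"],
              ["DOWN"], ["DOWN", "B"], ["B"],
              # hypothetical extra combinations for the "augmented" action space
              ["RIGHT", "B"], ["LEFT", "B"]]

    def __init__(self, env):
        super().__init__(env)
        self._actions = []
        for combo in self.COMBOS:
            arr = np.zeros(len(self.BUTTONS), dtype=np.bool_)
            for button in combo:
                arr[self.BUTTONS.index(button)] = True
            self._actions.append(arr)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, a):
        # Map a discrete action index back to the 12-button binary vector.
        return self._actions[a].copy()
```

Collapsing the 12-button MultiBinary space down to a short list of meaningful combinations keeps the policy head small while still covering the moves Sonic actually needs.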
