Hello! I’m Sergey Kolesnikov, Senior Data Scientist at Dbrain, a lecturer at MIPT and HSE, and an open-source developer. My main interests are RL (3rd place in the NIPS Learning to Run challenge), sequential decision making, and planning.
This spring, OpenAI hosted a reinforcement learning competition, the Retro Contest. The main goal was to come up with a meta-learning algorithm that could transfer knowledge from a set of training levels of “Sonic The Hedgehog” to a set of previously unseen test levels made specifically by OpenAI. Our team took 4th place out of 900+ teams. Reinforcement learning differs from typical machine learning, and this contest stood out from other RL competitions. You can read the details below.
Reinforcement Learning and Sonic
Reinforcement learning is a body of theory and algorithms uniting temporal difference (TD) methods from machine learning with optimal control theory. In practice, reinforcement learning algorithms are used to solve optimal control problems that are too difficult to specify mathematically. They achieve this through an agent that learns from experience by interacting with its environment. The environment rewards the agent based on its behaviour: the better the agent behaves, the higher the reward it gets. Hence, good controllers are obtained by having the agent learn to maximize the reward it receives by performing optimal actions.
During the Sonic competition, the agent observed the environment as RGB images, and as an action it had to choose which buttons to press on a virtual controller. As in the original game, the reward grew with the rings collected and the speed of completing the level. Basically, we had the original Sonic game with our agent as the main character.
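The agent-environment loop described above can be sketched in a few lines. The environment below is a toy stand-in (the real contest used gym-retro with RGB frames as observations and button combinations as actions); the action names and the "reward for horizontal progress" rule are simplified assumptions for illustration:

```python
class DummySonicEnv:
    """Toy stand-in for the gym-retro Sonic environment (illustrative only)."""
    ACTIONS = ["LEFT", "RIGHT", "JUMP", "NOOP"]  # hypothetical discrete actions

    def reset(self):
        self.x = 0  # agent's horizontal position
        return self.render()

    def step(self, action):
        # As in the contest, reward is driven by horizontal progress.
        dx = 1 if action == "RIGHT" else -1 if action == "LEFT" else 0
        self.x = max(0, self.x + dx)
        done = self.x >= 10  # toy "end of level"
        return self.render(), dx, done, {}

    def render(self):
        # Real observations are RGB frames; a coordinate stands in here.
        return self.x

def run_episode(env, policy):
    """Standard agent-environment interaction loop."""
    obs, total_reward, done = env.reset(), 0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

env = DummySonicEnv()
score = run_episode(env, lambda obs: "RIGHT")  # always move right
```

Everything that follows (training, finetuning, exploration bonuses) is built around exactly this loop, just with a real game and a learned policy instead of "always move right".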
As a baseline, we had full guides for training Rainbow (a DQN-based approach) and PPO (a policy-gradient approach) agents on one of the Sonic levels and for submitting the resulting agent. The Rainbow version was based on the anyrl project, while PPO used the well-known OpenAI baselines. The published baselines differed from the ones described in the tech report: they were simpler and had fewer learning-acceleration hacks.
How the results were evaluated
In a typical RL competition, algorithms are tested in the same environment they were trained in, which benefits algorithms that are good at memorizing and have many hyperparameters. In this competition, the agent was tested on new Sonic levels designed by the OpenAI team specifically for the contest. Additionally, the agent kept receiving rewards during testing, which made finetuning possible. Nevertheless, there were test-time limits to keep in mind: at most 24 hours and 1 million ticks.
In this contest, teams submitted their solutions as a docker image exposing an agent API. This way of retrieving a policy is fairer, since resources and timing are limited by the docker image. I really appreciate this approach, because it puts researchers who lack a home DGX cluster or an AWS budget in the same conditions as the lovers of stacking over9000+ models.
By the way, we use an identical contest policy at Dbrain. The main aim we pursue is insightful model development. The solutions we get from the participants of our competitions are kept as docker images with an exposed API, which gives us both the predictions and the procedure for producing them. I really hope to see more such contests in the future.
Approaches and results
After a quick review of the suggested baselines, we chose OpenAI's PPO approach as a more mature and interesting basis for our solution; according to the OpenAI tech report, the PPO agent also handled the task better. So, our main features:
1. Joint PPO training
The baseline we got could only learn on one of the 27 Sonic levels. We modified the learning process to train on all 27 levels in parallel, which gave the agent better generalization and better orientation in the Sonic world.
2. Finetuning at test time
In line with the main idea of the contest, we had to find the approach with the best generalization. This required the agent to finetune its policy on the test level, and that is essentially what we did: at the end of each episode, the agent estimated the reward it had received and updated its policy to maximize the expected reward.
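The test-time loop can be sketched as "act, collect a trajectory, update, repeat" under the 1-million-tick budget. The `act`/`update` agent interface below is a hypothetical stand-in for a PPO implementation, not the actual contest code:

```python
def finetune_at_test_time(env, agent, max_ticks=1_000_000):
    """Sketch of test-time finetuning: after each episode, take a policy
    update step on the freshly collected trajectory."""
    ticks = 0
    while ticks < max_ticks:
        trajectory, obs, done = [], env.reset(), False
        while not done and ticks < max_ticks:
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward))
            obs, ticks = next_obs, ticks + 1
        # A PPO gradient step on the episode's data adapts the policy
        # to the unseen test level.
        agent.update(trajectory)

# Minimal stubs to show the loop in motion.
class StubEnv:
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= 5, {}

class StubAgent:
    def __init__(self):
        self.updates = 0
    def act(self, obs):
        return 0
    def update(self, trajectory):
        self.updates += 1

agent = StubAgent()
finetune_at_test_time(StubEnv(), agent, max_ticks=50)
```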
3. Exploration bonuses
Let’s now dive into how the levels reward the agent. The agent is rewarded for progress along the x coordinate, so it can get stuck in places where it needs to move forward and then backward. We therefore added an extra reward, a count-based exploration bonus, given for reaching new states.
The exploration bonus came in two flavours: an image-based one using pixel similarity, and an x-coordinate-based one using how often the agent visits a particular location. Both grew with the novelty of the visited state, i.e. they were inversely related to how frequently the agent had been there.
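The x-coordinate flavour can be sketched as follows. The bucketing scheme and the 1/sqrt(N) shape are illustrative assumptions, not the exact contest code:

```python
from collections import Counter

class CountBasedBonus:
    """Count-based exploration bonus: the reward shrinks as the
    (discretised) state gets visited more often."""

    def __init__(self, scale=1.0, bucket=16):
        self.counts = Counter()
        self.scale = scale
        self.bucket = bucket  # coarseness of the x-coordinate bins

    def bonus(self, x):
        key = int(x) // self.bucket              # discretise the x coordinate
        self.counts[key] += 1
        return self.scale / self.counts[key] ** 0.5  # ~ scale / sqrt(N(s))

bonus = CountBasedBonus()
first = bonus.bonus(5)   # unseen bucket -> full bonus
second = bonus.bonus(7)  # same bucket (5 // 16 == 7 // 16) -> smaller bonus
```

The image-based flavour works the same way, except the key is derived from the frame's pixels instead of the x coordinate.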
4. The best initial policy search
This improvement contributed greatly to the final result. The idea was quite straightforward: we trained several policies with different hyperparameters; at test time, each policy was evaluated on the first few episodes, and we then chose the best one for further finetuning.
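The selection step reduces to a few lines. `evaluate(policy)` is a hypothetical callable that runs one episode with the given policy and returns its reward:

```python
def pick_best_policy(policies, evaluate, probe_episodes=3):
    """Run each pretrained policy for a few probe episodes and return
    the one with the highest mean reward."""
    def mean_reward(policy):
        return sum(evaluate(policy) for _ in range(probe_episodes)) / probe_episodes
    return max(policies, key=mean_reward)

# Toy usage: pretend the reward depends only on which policy is running.
scores = {"policy_a": 1.0, "policy_b": 3.0, "policy_c": 2.0}
best = pick_best_policy(list(scores), evaluate=lambda p: scores[p])
```

The chosen policy then becomes the starting point for the test-time finetuning described above, spending the remaining tick budget on the most promising candidate.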
What didn’t play well:
- NN architecture changes: SELU activation, self-attention, SE blocks
- Creating our own Sonic levels: we prepared the whole pipeline but didn't have enough time to put it to use
- Meta-learning algorithms like MAML and Reptile
- Model ensembling with importance sampling
OpenAI published the results three weeks after the competition ended. Our team took an honorable 4th place on the 11 additional levels, jumping up from 8th place on the public test and outperforming the tuned OpenAI baselines.
Cool features from top three solutions:
- Augmented action space with more common button combinations
- Exploration bonus based on the perceptual hash of the screen
- More training levels from the Game Boy Advance and Master System Sonic games
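The perceptual-hash idea from the top solutions can be sketched with a tiny average hash: downscale the frame, threshold each cell against the mean, and use the resulting bit pattern as the state key for the count-based bonus. The winners used proper perceptual hashing of the real game screen; this minimal version is only illustrative:

```python
def average_hash(frame, hash_size=8):
    """Tiny average-hash sketch over a 2D grayscale frame (list of rows):
    nearest-neighbour downscale, then threshold against the mean."""
    h, w = len(frame), len(frame[0])
    cells = [
        frame[i * h // hash_size][j * w // hash_size]
        for i in range(hash_size) for j in range(hash_size)
    ]
    mean = sum(cells) / len(cells)
    return tuple(c >= mean for c in cells)

# Visually identical frames hash to the same key, so revisiting the same
# screen yields no novelty bonus.
frame = [[(i + j) % 256 for j in range(16)] for i in range(16)]
h1 = average_hash(frame)
```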
By the way, I appreciate that OpenAI also ran a best write-ups track.
Feel free to comment and ask questions!