A Deep Dive into the Jerk Agent

Days 22–25 of the OpenAI Retro Contest

Without a great idea of where to go next, I took a stab at tuning some of the hyperparameters for the PPO2 baseline agent. It didn’t go so well, so instead I decided to take a closer look at how the JERK baseline agent actually works.

Initialization

Starting with the main function, the first thing that gets set up is the remote environment. I conceptualize this as being the same as the local environment that I could render, except that it is specified by the retro-contest command instead. The environment is also wrapped in a larger TrackedEnv class, which takes the initial environment and gives it the following additional properties that will be used for learning:

self.action_history = []   # every action taken this episode
self.reward_history = []   # running total of reward after each step
self.total_reward = 0      # cumulative reward for the current episode
self.total_steps_ever = 0  # steps taken across all episodes
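
For context, a wrapper like this can be built on top of gym.Wrapper. Here is a minimal sketch based on my reading of the baseline (the best_sequence method shows up again in the learning section further down):

import gym

class TrackedEnv(gym.Wrapper):
    # Wraps the contest environment and records every action and cumulative reward.
    def __init__(self, env):
        super(TrackedEnv, self).__init__(env)
        self.action_history = []
        self.reward_history = []
        self.total_reward = 0
        self.total_steps_ever = 0

    def best_sequence(self):
        # The actions taken up to the point of maximum cumulative reward.
        max_cumulative = max(self.reward_history)
        for i, rew in enumerate(self.reward_history):
            if rew == max_cumulative:
                return self.action_history[:i + 1]
        raise RuntimeError('unreachable')

    def reset(self, **kwargs):
        self.action_history = []
        self.reward_history = []
        self.total_reward = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        self.total_steps_ever += 1
        self.action_history.append(action.copy())
        obs, rew, done, info = self.env.step(action)
        self.total_reward += rew
        self.reward_history.append(self.total_reward)
        return obs, rew, done, info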

Move(env, num_steps, left, jump_prob, jump_repeat)

Move is where all of the actual action occurs. It first sets up some new variables to use later:

total_rew = 0.0          # reward accumulated during this call to move()
done = False             # whether the episode ended mid-move
steps_taken = 0
jumping_steps_left = 0   # how many more frames to hold the jump button

Then, inside a loop, the agent builds an action to send to the environment. An action is a 1x12 array of booleans, one entry for each controller button, and a freshly initialized one looks like this:

[False False False False False False False False False False False False]

The left or right button (and, some fraction of the time, the jump button) gets set to True, and the action is passed to the environment:

_, rew, done, _ = env.step(action)

Along with the screen observation, which the agent never looks at and so gets wasted away in the first _, env.step() returns:
  • rew is the incremental reward achieved by executing this action.
  • done is a boolean value describing whether the game is over, either through Sonic dying or getting to the end.
  • info is a dictionary of all the relevant game information that the gym environment is pulling from the emulator’s memory. It has a ton of useful info such as the x & y position of Sonic, the number of rings & lives, etc. The baseline JERK agent does not use any of that, so it is also wasted away in _. For the third level of the first zone of Sonic, it looks like this:
{'act': 2, 'screen_x': 857, 'zone': 0, 'level_end_bonus': 0, 'score': 0, 'lives': 3, 'screen_x_end': 10592, 'rings': 2, 'x': 1017, 'y': 812}
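
Putting those pieces together, the core of move() might look something like the sketch below. The parameter names follow the signature above; the specific button indices and default values are my assumptions from reading the baseline, not gospel:

import random
import numpy as np

def move(env, num_steps, left=False, jump_prob=1.0 / 10.0, jump_repeat=4):
    total_rew = 0.0
    done = False
    steps_taken = 0
    jumping_steps_left = 0
    while not done and steps_taken < num_steps:
        action = np.zeros((12,), dtype=np.bool_)
        action[6] = left        # hold left...
        action[7] = not left    # ...or hold right
        if jumping_steps_left > 0:
            action[0] = True    # keep the jump button held
            jumping_steps_left -= 1
        elif random.random() < jump_prob:
            jumping_steps_left = jump_repeat - 1
            action[0] = True    # start a new jump
        _, rew, done, _ = env.step(action)
        total_rew += rew
        steps_taken += 1
    return total_rew, done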

Backtracking

If the move function does not make any positive impact on the reward, then the agent “backtracks”, which is exactly what it sounds like: moving backwards. This is achieved by just calling move() again, but with the left parameter set to True so that Sonic will move to the left. This is an essential aspect of the agent, since only going to the right can get Sonic stuck on plenty of walls.
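
In the main loop this boils down to a check on the reward that move() returns; roughly like this (the step counts here are my placeholders, not necessarily the baseline’s exact values):

# Run to the right for a while; if that earned nothing, back up for a bit.
rew, done = move(env, 100)
if not done and rew <= 0:
    _, done = move(env, 70, left=True)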

[Animation: life without backtracking]

The Learning part of Machine Learning

Now that we have gone through how to get Sonic through the environment, let’s take a look at how our agent learns. At the end of the episode, the maximum total cumulative reward (the largest value in the running total of all the rewards achieved) is stored in the solutions array, along with an array of all of the moves that were made (that is, a long list of 1x12 arrays that are mostly filled with False values). The array of all the moves is created by the TrackedEnv’s best_sequence method, which returns all the moves made up until the point where the maximum total reward was achieved. For reference, a run that didn’t go well for me looked like this:

[(
  [1903800.0],
  [array([False, False, False, False, False, False, False, True, False, False, False, False]),
   array([False, False, False, False, False, False, False, True, False, False, False, False]),
   ...
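
The bookkeeping that produces an entry like that is essentially one line at the end of each exploration episode (my paraphrase of the baseline; the reward sits inside a list for reasons that will make sense in a moment):

# Pair the best cumulative reward seen this episode with the action
# sequence that produced it, and remember both.
solutions.append(([max(env.reward_history)], env.best_sequence()))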

Exploitation

With one run complete, we now have a viable sequence of moves and the reward that we achieved by making those moves. Now it is time to exploit. Back in the main while loop, when you start a new episode and already have a solution, there is a check for whether you should exploit your best solution or try for a better one with random movements. The check looks like this:

random.random() < EXPLOIT_BIAS + env.total_steps_ever / TOTAL_TIMESTEPS
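
If that condition is true, the agent replays its best-scoring sequence instead of exploring. Here is a sketch of what that branch could look like, where exploit() is a helper (not shown in this post) that steps through a saved action sequence and returns the reward it earns, and np is numpy:

if solutions and random.random() < EXPLOIT_BIAS + env.total_steps_ever / TOTAL_TIMESTEPS:
    # Rank the saved solutions by their average reward and replay the best one.
    solutions = sorted(solutions, key=lambda x: np.mean(x[0]))
    best_pair = solutions[-1]
    new_rew = exploit(env, best_pair[1])
    # Record how the replay scored so the average stays honest.
    best_pair[0].append(new_rew)
else:
    # Otherwise explore with the random move()/backtrack behaviour described above.
    ...

Because each solution’s reward is kept in a list, every replay appends its own score and the sort uses the average, so a sequence that replays poorly drifts down the ranking. And since env.total_steps_ever only ever grows, the right-hand side of the check creeps upward over time: the agent exploits more and explores less as its time budget runs out.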
