My final submission: the Improved JERK.

All the improvements I was able to make, and the best AI agent team Bobcats 🐯 submitted by the end of the OpenAI Retro Contest.

Tristan Sokol
5 min read · Jun 10, 2018

With my limited schedule, I focused on trying to get the most from the JERK agent to see if I could get a decent score. I was able to make some meaningful code changes and do some fine tuning to get a ~15% improvement over my submission of the original baseline JERK agent (although in reality that might be well within the variance of learning with a randomization-based agent 😉).

Here is a list of changes with the associated scores from submitting them.

  • v11: One of the first changes I made was experimenting with the “when to exploit” decision-making process. I changed the if statement to use the best score divided by 10,000 as the exploit bias, so the agent would exploit more as the scores got better: 3459.76
...
if new_ep:
    if (solutions and random.random() < best_run / 10000):
        solutions = sorted(solutions, key=lambda x: np.mean(x[0]))
...
  • v12: That didn’t seem to work well, so I tried to make it a little more interesting: I used a squared factor and included the time-based factor back in as well: 3489.56
if (solutions and random.random() < math.pow(best_run/10000,2) + env.total_steps_ever / TOTAL_TIMESTEPS):
  • v13: Since those changes seemed to be steps in the wrong direction, I reverted them. Instead, I took a look at the exploit function. The baseline just has Sonic stand still once he gets past the end of the best replayed sequence of moves, so I added code for Sonic to keep moving instead: 3726.14
def exploit(env, sequence):
    env.reset()
    done = False
    idx = 0
    while not done:
        if idx >= len(sequence):
            # _, _, done, _ = env.step(np.zeros((12,), dtype='bool'))
            # baseline ⬆️
            _, done = move(env, 5, jump_repeat=1)
        else:
            _, _, done, _ = env.step(sequence[idx])
        idx += 1
    return env.total_reward
  • v14: Dropped the backtracking down to 45 steps (from 70); see the sketch after this list for where that parameter lives: 3817.38
  • v15: Lowered the exploit bias to .20 (from .25); this got me the best score so far: 3944.49
  • v16: I went back to editing the exploit decisions. Now I was keeping track of the best score ever made and using that to determine exploitation in addition to the exploit bias and time parameters: 3903.91
random.random() < EXPLOIT_BIAS + env.total_steps_ever / TOTAL_TIMESTEPS + best_run/10000
  • v17: Added backtracking to the exploit function. It seemed like it would be very minor, but not knowing what the maps would look like, it could be impactful: 3793.01
  • v18: I kept the backtracking even though it didn’t seem to really help (assuming the difference was within run-to-run variance) and lowered the exploit bias to try to get back to the v15 high score: 3936.67
  • v19: Instead of continuing exploited episodes that run out of moves, it might be more worthwhile to just end the episode and use the timesteps to do more runs. This didn’t seem effective: 3519.35
  • v20: Upped the backtracking much more (to 100 steps) in the post-exploit moves: 3889.45
  • v21: Lowered the backtracking to 45 but also lowered the exploit bias to .15: 3852.97
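
For context on where all of these knobs live, here is a rough sketch of the agent’s main loop with my tweaks folded in. It is a reconstruction from the snippets above rather than a copy of the actual submission: names like BACKTRACK_STEPS and the best_run bookkeeping are just illustrative, and move(), exploit(), and the TrackedEnv wrapper around env come from the baseline script.

import random
import numpy as np

EXPLOIT_BIAS = 0.13        # tuned repeatedly; .13 by v26
BACKTRACK_STEPS = 45       # v14/v20/v21: how far to walk left after a stall
TOTAL_TIMESTEPS = int(1e6)

def run_jerk(env):
    new_ep = True
    solutions = []   # (score history, action sequence) pairs from past episodes
    best_run = 0.0   # v16: best episode reward seen so far
    while True:
        if new_ep:
            # exploit more often as time passes and as the best reward climbs
            if (solutions and
                    random.random() < EXPLOIT_BIAS
                    + env.total_steps_ever / TOTAL_TIMESTEPS
                    + best_run / 10000):
                solutions = sorted(solutions, key=lambda x: np.mean(x[0]))
                best_pair = solutions[-1]
                new_rew = exploit(env, best_pair[1])
                best_pair[0].append(new_rew)
                best_run = max(best_run, new_rew)
                continue
            env.reset()
            new_ep = False
        rew, new_ep = move(env, 100)
        if not new_ep and rew <= 0:
            # "backtracking": walk left for a bit when progress stalls
            _, new_ep = move(env, BACKTRACK_STEPS, left=True)
        if new_ep:
            best_run = max(best_run, env.total_reward)
            # reward_history and best_sequence() are helpers on the baseline's
            # TrackedEnv wrapper
            solutions.append(([max(env.reward_history)], env.best_sequence()))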

At this point it seemed like the most useful thing would be to explore tuning the EXPLOIT_BIAS hyperparameter.

  • v22: Upped to a .225 exploit bias: 3753.14
  • v23: .175 exploit bias — Best ever! 4149.36
  • v24: Back to a .2 exploit bias (should be the same as v18?): 3769.02. It seems like 200 points is well within inter-run variance.
  • v25: A .18 exploit bias, a new record and probably the best run I’ll achieve. It is two days before the contest ends: 4174.88
Pretty far from Bobcats’ best placement of second on Day 3.
  • v26: The dynamic factor in my exploit decisions might be having a bigger impact than I think. Let’s try a .13 exploit bias: 4188.64
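
To put some rough numbers on that dynamic factor (the best_run value here is a guess, not something I measured): halfway through the timestep budget, with a best episode reward around 3000, the exploit probability is already close to 1, which would mean the agent exploits on almost every new episode late in a run regardless of the bias constant.

EXPLOIT_BIAS = 0.13
p = EXPLOIT_BIAS + 500_000 / 1_000_000 + 3000 / 10000
print(p)  # ≈ 0.93, so the agent exploits on ~93% of new episodes at this point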

From that point, the night before the contest ended, I tried to finish up some hyperparameter tuning to see if I could eke out any more points. I thought there might be some kind of sweet spot for the exploit bias, but in reality I think I was just seeing the variability that comes with the agent’s randomized learning.
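
If I wanted to check that hunch properly, the thing to do would be to repeat runs with a fixed bias and look at the spread instead of chasing single submissions. A minimal sketch, assuming a hypothetical run_once(exploit_bias) helper (not part of my repo) that does one full local evaluation and returns its final score:

import statistics

def score_spread(run_once, exploit_bias, n_runs=5):
    # repeat local evaluations with the same exploit bias and report the
    # mean score and its standard deviation
    scores = [run_once(exploit_bias) for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)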

Looking at the breakdown of the scores, the difference between a ~3800 score and a ~4200 score came down almost entirely to the first task: the former scored ~5200 on it while the latter usually scored close to 8000. I am betting that when we see the levels, task 1 will have some kind of crux similar to the loop in episode 1–1.

🎵If you can make it here, you can make it anywhere 🎵

All that work did get team Bobcats to an all-time top score of 4223.46.

You can see a final copy of my code (along with all of the interim copies) in tristansokol/Bobcats. All of the useful tools I made during the past two months are collected in this gist; please feel free to use them when making your own write-ups! I’m still working on the final write-up for the contest, but it will be housed here.

Overall I am more than thrilled with how the contest turned out for team Bobcats. I had never made any real attempt at machine learning or artificial intelligence (or really done much work in Python), so this was an awesome learning opportunity. After this I think I am going to try out some of the new TensorFlow.js, but first I need to finish my write-up!
