Discovering Q learning

My first foray into “real” machine learning didn’t go so well. Days 38 - 53 of the OpenAI Retro Contest.

Tristan Sokol
11 min read · May 24, 2018

I got pretty hung up on the idea of treating the level like a maze and exploring it as a means of getting the best score. Without being able to easily locate Sonic on the map, I had to come up with something a little more robust.

The basic idea was to associate the reward from a move with the screen that the move was made on. That way it should be possible to build a mapping from the last step’s screen data to the best move at any point in the game. To do that, my plan was to keep a buffer of the last couple of screens and associate those with a move and a score. I would have Sonic move randomly to generate a large amount of screen-move-reward data; then, at any given point, Sonic would know what move to make in the future to get the most score. It will probably be important to use at least a couple of screens so that I can capture momentum, but to start I might try just one. There are a lot of challenges to this approach (credit assignment and storing massive amounts of screen data, just off the top of my head), but I still wanted to see if it would work and address each one in turn.

Getting a history of screen data

This was easy. I just started returning the observation variable from stepping in the environment. I am almost surely going to have to cut this data down to reduce the total number of distinct views, but for now this seems OK.
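As a rough sketch of what that looks like (assuming the standard gym-retro interface; the game and state names here are just examples), stepping the environment and keeping a short history of screens is something like:

from collections import deque

import retro

env = retro.make(game='SonicTheHedgehog-Genesis', state='GreenHillZone.Act1')
obs = env.reset()

screen_history = deque(maxlen=3)  # keep the last few screens to capture momentum
screen_history.append(obs)

action = env.action_space.sample()          # random button presses for now
obs, reward, done, info = env.step(action)  # obs is the raw RGB screen array
screen_history.append(obs)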

Storing a memory

I created an object (er, dictionary?) to store the “memories” of each screen. Keying on the whole array seemed silly, so I flattened the array, turned it into a string, and then hashed it. I’m not totally sure how Python’s hash function works, so if it is very sparse I might have to replace it. In essence my function looked like:

hash(obs.flatten().tostring())

So a whole RGB pixel array might correspond to something like: -7691142970161541456. Each of those hashes was a key to another object that stored the reward from each move that could be made, kind of like this:

{
    -7691142970161541456: {
        'up': 0,
        'down': 0,
        'left': 0,
        'b': 0,
        'right': 10,
    },
    ...
}
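A minimal sketch of how that memory gets filled in and queried (the helper names are mine, and the random fallback for frames with no positive reward is the tweak I describe a little further down):

import random

MOVES = ['up', 'down', 'left', 'b', 'right']
memory = {}

def screen_key(obs):
    # hash the flattened pixel array so it can be used as a dictionary key
    return hash(obs.flatten().tostring())

def remember(obs, move, reward):
    rewards = memory.setdefault(screen_key(obs), {m: 0 for m in MOVES})
    rewards[move] += reward

def best_move(obs):
    rewards = memory.get(screen_key(obs))
    if rewards is None or max(rewards.values()) == 0:
        return random.choice(MOVES)  # unseen frame, or nothing has paid off yet: explore
    return max(rewards, key=rewards.get)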

This wasn’t too hard to implement, but I ran into my first problem. There is a ton of motion in a given frame of Sonic.

There are all sorts of active environmental effects like the clouds, waves in the water, and the text at the top. With just the timer alone, I would have to be in the exact same position, at the exact same second to have another frame line up. Luckily, there are some obvious improvements.

Cropping

Cutting out most of the frame seemed like an easy optimization; after all, the clouds won’t likely have much to do with which move to make. I wanted to start off small, so my first attempt used a 10x10 subset of the total screen space in the center of the screen. This wasn’t too hard to do after I had worked with the array operations for creating my level visualizations, but a 10x10 space was much too small. A 50x50 space was still pretty puny:

I moved to a section that was generally on the bottom right side of the frame and about a third of the screen in area. It seems like Sonic should generally be moving to the right, so that seemed like where the important stuff would be. With my initial cropping, I could never get past the beginning because of my inherent credit assignment problem: Sonic starts the level far to the left, so early moves don’t move the screen. I increased the frame to 250 pixels wide and that seemed to help. Now my hashing function looked like this:

# crop window size and offset (in pixels)
window_size_x = 250
window_size_y = 100
window_offset_x = 80
window_offset_y = 100

# hash only the cropped region of the screen: obs[y:y+h, x:x+w]
hash(obs[
    window_offset_y:(window_offset_y + window_size_y),
    window_offset_x:(window_offset_x + window_size_x)
].flatten().tostring())

Now Sonic was thoroughly exploring the entire area, but subsequent replays would follow the exact same path, never gathering any new information. I realized that my code would replay a move for a given frame even if it did not get a good reward, so I changed it to do a random move instead if the maximum reward for a frame was 0. I gave this strategy a bit of time to see how it would do. Sonic didn’t really make much forward progress, so the episodes usually timed out. One big disadvantage of timing out late in an episode was that it really ate into the million-timestep limit:

Just 9 runs had already taken a couple hundred thousand timesteps, but they had also generated nearly that many screen values to draw from. By my millionth timestep, I had created just under 800,000 different screen combinations and didn’t really make it very far into the level:

One of the neat things about this strategy is that I should be able to store a bunch of states, and save them between agent runs, building up a rich corpus of these actions. I gave it a few more runs but didn’t make any more progress, even though I had over 2 million different hashes saved.
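Persisting that corpus between runs could be as simple as pickling the dictionary (a sketch; the filename is made up):

import pickle

# save the screen-hash -> reward memory at the end of a run
with open('sonic_memory.pkl', 'wb') as f:
    pickle.dump(memory, f)

# load it back at the start of the next run
with open('sonic_memory.pkl', 'rb') as f:
    memory = pickle.load(f)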

I was about this far into the idea when I read Ravi Munde’s article How I built an AI to play Dino Run (based on Using Keras and Deep Q-Network to Play FlappyBird) and found out that I had been implementing Q-learning.

Q-learning is apparently a very well-known technique, and it works by iterating through a loop of actions on a state and seeing the reward that each one produces.

In our case, the state is the game screen, the action is a button press, and the reward comes from the gym’s reward function.

All of those states, actions, and rewards go into a large matrix called a Q-table. The Q-table has all of the states on one axis, the actions on the other, and the rewards in each cell. The Bellman equation is used to fill in the reward values as you explore the actions for each state; a completed table should be able to tell you the expected long-term reward for a given action on a given state.
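The update itself is small. Here is a sketch of the standard Q-learning (Bellman) update, assuming the Q-table is a dict of dicts like my memory object, with placeholder values for the learning rate and discount:

alpha = 0.1   # learning rate (placeholder)
gamma = 0.95  # discount factor (placeholder)

def q_update(Q, state, action, reward, next_state):
    # expected long-term reward = immediate reward + discounted best future reward
    best_next = max(Q[next_state].values()) if next_state in Q else 0
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])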

The main challenge here is that while there might be only a few rows in this table (since there is a limited number of moves you can make), there are many billions of states if you count the full screen data, so being clever about the image processing to cut that number down is key. If you are looking for a good introduction to Q-learning, forget anything I tell you and take a look at something like this Introduction to Q-Learning.

Image processing

It was clear I needed to get higher tech than my cropping technique, so I basically tried re-implementing the work in that blog post. I started with a normal image (being sure to convert the gym’s RGB array into OpenCV’s BGR).
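That conversion is a one-liner (a sketch, with obs assumed to be the raw observation array):

import cv2

# gym hands back RGB, but OpenCV expects BGR channel ordering
image = cv2.cvtColor(obs, cv2.COLOR_RGB2BGR)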

And then I applied the same image processing:

image = cv2.resize(image, (0,0), fx = 0.15, fy = 0.10)  # scale to 15% width, 10% height
image = image[2:38, 10:50]  # crop: img[y:y+h, x:x+w]
image = cv2.Canny(image, threshold1 = 100, threshold2 = 200)  # keep only the edges

That resizes the image to 0.15x the width and 0.10x the height, crops it down to only a handful of pixels, and then uses Canny edge detection to pick out the edges. I ended up with something like this:

It would be interesting to see if a trained model could get that to work (the Dino Run model’s images were pretty opaque to me), but I felt that I needed a better starting point. I cut the scaling back to 50% and stopped cropping, and ended up with something that looked more reasonable:

I knew that the top part of the screen with the timer was nothing but trouble, so I recropped that out. A non-resized image actually looked pretty good:

There still seemed to be way too much noise, but a brief fiddle with Canny’s parameters got me nowhere, so I decided to leave it noisy. Adding grayscale conversion did help out a bit:

Being able to filter out the background would be much better, but all my ideas for that either needed color or seemed too complex. I decided to put that on the back burner for now and move forward with the images I had.
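Pulled together, the preprocessing I landed on looks roughly like this (a sketch; the exact number of rows cropped to remove the timer text is an assumption):

import cv2

def preprocess(obs):
    image = cv2.cvtColor(obs, cv2.COLOR_RGB2BGR)      # gym's RGB -> OpenCV's BGR
    image = image[40:, :]                             # crop out the score/time/rings text (row count assumed)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # grayscale cut the noise down a bit
    return cv2.Canny(image, threshold1=100, threshold2=200)  # keep only the edges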

Training the neural network

From there I started training my model: doing a fair amount of observation, then some training, and slowly moving away from random moves toward only doing the action that the model predicted would be best.
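That shift from random to predicted moves is typically handled with an exploration rate that decays over time; here is a minimal sketch (the constants and the model object are placeholders, not my actual settings):

import random

epsilon = 1.0           # start fully random (the "observation" phase)
epsilon_min = 0.05      # never stop exploring entirely
epsilon_decay = 0.9999  # shrink a little every timestep

def choose_action(model, state, n_actions):
    global epsilon
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore with a random move
    q_values = model.predict(state[None, ...])  # ask the network for its predictions
    return int(q_values[0].argmax())            # exploit the best predicted action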

It was not successful.

Right off the bat it was clear that I was going to run out of RAM with the current image size. The example I was working from kept a record of every state (processed screen image) that had been seen, and quickly started taking up all of my RAM. Shrinking the images to a quarter of the size (50% height and 50% width) at least made it possible for the stored images and the associated neural network to fit more comfortably in my memory.

Observing Sonic to generate states to train the neural network went by quickly, but when the model started training, things started crawling along. I powered through for a few hundred thousand timesteps, but after a while my loss turned into NaN, so it seemed like a good time to stop.

Up, up and away 🛫

It seems like my loss kind of went into the stratosphere, and if my intro ML classes taught me anything, that probably means my learning rate is too high?

Plotting the distribution of the random moves against the moves with the best predicted outcome was pretty neat though:

The random moves look pretty random and well distributed, but the model really favored moves 0 and 6, which were jumping without moving and walking left. It’s no wonder that my agent didn’t really make it very far:

Timesteps,    Total_Score
36096, 0.0
36097, 0.0
36095, 703.836
36108, 565.345

All four episodes timed out, and the first two didn’t make it anywhere!

I tuned a couple of parameters and tried to do a bit more training. A quick Google of what to do when your loss gets too high suggested changing my loss function to mean_squared_logarithmic_error, so I tried that too.
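In Keras (the 2018-era API) that change is just a different compile call; a sketch with a placeholder, smaller learning rate, since the earlier blow-up suggested mine was too high:

from keras.optimizers import Adam

model.compile(
    loss='mean_squared_logarithmic_error',  # the loss I switched to
    optimizer=Adam(lr=1e-4),                # placeholder: a smaller learning rate
)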

It didn’t do much better.

My loss actually got down to very reasonable numbers (I think?):

I think the sawtooth pattern is from the end of an episode and beginning of another, but I am not totally sure.

The graphs of the distribution of initial moves gave me lots of hope, though: 4 and 5 were standouts for the agent, and if those were “move right” and “jump right” then I could really be on the right track!

They weren’t.

Move 4 is the B button plus down, which would do the spin dash but in my version of Sonic just jumped instead, and move 5 was a no-op: no button press at all. Somehow I had reinforced standing still and jumping in place, two moves that would probably have zero reward difference at any point in the levels. Sonic’s progress in the level confirmed that:

The random move agent made about the same amount of progress.

My AI agent may not have learned anything these last couple of weeks, but I sure have! With only ~10 days left in the contest, I probably won’t be able to get name-brand tools like Keras or TensorFlow into my agent, and will have to go back to making improvements on the jerk agent like before. Until next time!

I would love for anyone more familiar with the space to take a look at my notebook below and offer up some tips as to where I went wrong. I know that I would probably need 10–100x more training to get a viable agent, and that there is probably something fundamentally wrong with my model (should loss stay between 0 and 1?).

Bonus #1: Notebook of all the code

One of my primary motivations in this contest is to learn about the tooling and workflows in the machine learning space. Jupyter notebooks are definitely one of those tools, so this time I did all my development inside a notebook. You can see the results here: https://gist.github.com/tristansokol/2c9eace463b65ac2730bd55d96a3feb2

Bonus #2: Creating a Python venv for Jupyter

I am new to Python, but right off the bat Python’s package management seems kind of crazy. At PyCon in Cleveland a couple of weeks ago, I asked everyone I met how they managed packages and Python environments for each project they were working on, and two things stood out: everyone said something different (pipenv, venv, virtualenv, etc.), and almost everyone thought that what they did was the most popular option. All I wanted was to figure out how to isolate my environment and packages for running my notebook, and I found this helpful article: https://anbasile.github.io/programming/2017/06/25/jupyter-venv/

