Mario, Please Jump: An Informal Blog detailing My Journey and Struggles with Deep Reinforcement Learning

Eisuke H.
7 min read · Jan 24, 2022

--

The following link is my GitHub repo with the stuff I’ve been doing, though the README is so messy that I’d suggest just reading this blog instead. Also, heads up: this blog is probably better suited to those who understand the basics of RL. If you want to learn DRL, at the bottom of my GitHub README you can find a list of sources (tutorials, textbooks, documentation, etc.). All of them are sources I used myself, so I recommend looking at any of them.

When I first got interested in RL, I stumbled upon a video that caught my attention. The video explained how an agent learned to play Sonic within 2 hours of training, which I found amazing. 2 hours. That’s not that much time. So obviously, I wanted to learn how they did it, but I soon found that I’d have to get a ROM for Sonic, which would cost me $5 (legally), to which I was like: “No way do I wanna spend $5 on something when I have no idea what I’m doing.”

Trying to find other (somewhat) complicated games to do DRL with, I found the PyTorch Super Mario Bros tutorial. All I had to do was install gym-super-mario-bros through pip and I was set. And so, my DRL journey began.
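If you want to poke at the environment yourself, the setup really is just a few lines. Here’s a minimal sketch (assuming gym-super-mario-bros with nes-py and the old gym step API; this isn’t my training code, just the sanity-check loop):

```python
# pip install gym-super-mario-bros
# Minimal sanity check: random actions in the Mario environment.
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

env = gym_super_mario_bros.make("SuperMarioBros-v0")
env = JoypadSpace(env, SIMPLE_MOVEMENT)  # cut the NES controller down to a few useful button combos

state = env.reset()
done = False
while not done:
    state, reward, done, info = env.step(env.action_space.sample())  # random agent
    env.render()
env.close()
```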

I first started with DQN, but tl;dr it was a major failure. I then switched to DDQN and still ended in failure. I tested them on Breakout (because training on Breakout probably takes less time than Mario), and they barely learned anything: I ran it overnight and the average reward was 15, which most likely meant there was a bug in my code, but at that point I was too frustrated to keep digging. So I thought about trying some other algorithm. Since I only have my MacBook, I don’t have a GPU to speed up training, and things would take forever, so I wanted to use the “most efficient” algorithm. I decided to rewatch the Two Minute Papers video on Sonic and found that the top scorers often used PPO. So, I switched to that.
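(For context, the only real difference between my DQN and DDQN attempts is the target: DDQN picks the next action with the online network but evaluates it with the target network. A rough PyTorch sketch of just that piece, with hypothetical q_net / target_net modules, not my actual code:)

```python
# Rough sketch of the DQN vs. DDQN target computation (q_net / target_net are hypothetical networks).
import torch

def td_target(reward, next_state, done, q_net, target_net, gamma=0.99):
    with torch.no_grad():
        # DQN would do: next_q = target_net(next_state).max(dim=1).values
        # DDQN: the online network picks the action, the target network evaluates it
        best_action = q_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * (1.0 - done) * next_q
```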

I found multiple PPO tutorials that were great, but I mainly credit Costa Huang’s tutorial, as my code follows his closely. This time, my Breakout run went well: it only took 2 million timesteps to break a score of 250. I could have run it for longer, but 10 million timesteps would take me like 12 hours, I have nothing to kill that kind of time with in my college dorm, and it’s like -20 degrees outside so I can’t leave my room.

My graph recording the return per episode during training
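The core of PPO (and of Costa Huang’s tutorial) is the clipped surrogate objective. As a refresher, the policy loss boils down to something like this sketch (CleanRL-style variable names; my actual code follows his tutorial much more closely):

```python
# Condensed sketch of PPO's clipped policy loss (not a full training loop).
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_coef=0.2):
    ratio = (new_logprobs - old_logprobs).exp()  # pi_new(a|s) / pi_old(a|s)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize per batch
    loss_unclipped = -advantages * ratio
    loss_clipped = -advantages * torch.clamp(ratio, 1 - clip_coef, 1 + clip_coef)
    return torch.max(loss_unclipped, loss_clipped).mean()  # pessimistic (clipped) objective
```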

Seeing that PPO worked on Breakout, I diverted my attention back to my original goal: Mario. I had to comment out some tricks that are used for Atari games (e.g., NoopResetEnv, FireResetEnv), as they don’t apply to Mario. I also had to change how I record data, since recording episodic returns won’t help: episodes won’t end quickly the way they do in Breakout, where the paddle misses the ball a handful of times until there are no lives left. Rather, Mario will often get stuck on a pipe, make no progress, and just sit there without losing a life, so the episode won’t end and there is no episodic return to record. After removing all of these, I started training.
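For reference, the wrapper stack ended up looking roughly like this (a sketch assuming the stable-baselines3 Atari wrappers that CleanRL imports; the commented-out lines are the Atari-only ones I dropped, and this is close to, not exactly, my code):

```python
# Sketch of the Mario wrapper stack with the Atari-only tricks commented out
# (assumes stable-baselines3's atari_wrappers plus standard gym wrappers).
import gym
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace
from stable_baselines3.common.atari_wrappers import MaxAndSkipEnv
# from stable_baselines3.common.atari_wrappers import NoopResetEnv, FireResetEnv  # Atari-only, removed

def make_mario_env():
    env = gym_super_mario_bros.make("SuperMarioBros-v0")
    env = JoypadSpace(env, SIMPLE_MOVEMENT)
    # env = NoopResetEnv(env, noop_max=30)  # Atari "random no-ops on reset" trick, not needed here
    # env = FireResetEnv(env)               # Mario has no FIRE-to-start button
    env = MaxAndSkipEnv(env, skip=4)        # 4-frame Frame Skipping (removed and re-added later in this post)
    env = gym.wrappers.ResizeObservation(env, (84, 84))
    env = gym.wrappers.GrayScaleObservation(env)
    env = gym.wrappers.FrameStack(env, 4)
    return env
```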

After running for some timesteps, though, I realized that I couldn’t just comment out the Atari tricks; I also had to tune some hyperparameters. Looking at the following image, we see that the rewards plateau around 940–950 (this score is for 8 environments combined, so the average score per agent is 945/8, or around 120). Looking at my hyperparameters, it was obvious that I had to change my num_steps variable (number of frames per trajectory) from 128 (originally chosen following the PPO paper) to something higher (I changed it to 2000 and also removed Frame Skipping). 128 steps (technically 512 frames due to 4-frame Frame Skipping) may make sense for Breakout, but for Mario, that length of time is way too short: 512 frames is only roughly 8.5 seconds of game time at 60 fps. Since the trajectory was too short, the game would cut off with the agent achieving little to nothing and getting barely any training signal.

My first test run of ppo on Mario

Aside from changing num_steps, I also changed the learning rate from 2.5e-4 to 5.0e-4 (chosen arbitrarily). After this, I ran the model overnight for 10 million steps; you can see the results in the following image.

Test run after tuning some hyperparameters and running for 10 million timesteps

“An average score of around 3500… Holy sh-.” Excited, I then decided to evaluate the model and render it. … “What is this?” I found that this score meant nothing. Looking at the following GIF (which is ONE trajectory), we see that the agent runs a bit, then purposely dies, respawns, and repeats the same steps. Since the environment automatically resets while the rewards keep adding up, this is the most efficient way to rack up reward, albeit completely unwanted behavior.

I guess Mario’s favorite food is mushrooms

In Breakout, this behavior wouldn’t happen: time-wise, it’s far more efficient to keep hitting the ball and break multiple blocks per episode than to break one block per episode over and over. My thought process (which would later prove to be wrong) was that in Mario, since the agent gets stuck on a pipe and doesn’t know how to jump over it, waiting out the episode simply takes too long, so dying and restarting is the faster way to collect reward.

So, I removed another trick: resetting the env immediately at the end of an episode, which is what let the env reset while the rewards kept adding up. Now, when the env resets, the rollout is done. No more adding score to the previous episode, and only ONE episode per trajectory. Alongside that, I set num_steps to 1000 (thinking that a long trajectory might cause the agent to lose points from doing nothing in front of a pipe, due to the reward function), ran it again, and this time it actually showed progress! It didn’t just repeatedly die; it decided to keep running forward. Though, since num_steps was too short, it couldn’t progress any further than the second pipe. So, I decided to up num_steps and lr to 1600 and 5.0e-4 respectively.
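In code, the change is basically in the rollout collection loop: stop collecting when the episode ends instead of auto-resetting and letting the reward bleed into the next life. A simplified sketch (single env, old gym API; my real loop also stores values and log-probs for PPO):

```python
# Sketch of the "one episode per trajectory" rollout (simplified; not my exact loop).
def collect_rollout(env, policy, num_steps=1000):
    obs = env.reset()
    trajectory = []
    for _ in range(num_steps):
        action = policy(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, reward, done))
        obs = next_obs
        if done:      # before: obs = env.reset() and keep accumulating reward in the same trajectory
            break     # now: the trajectory ends with the episode, no reward carries over
    return trajectory
```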

WHY CAN’T YOU PRESS JUMP FOR ONE MORE FRAME PLEASE

That was garbage. I tried tinkering around but kept getting garbage results. I decided to look for help and found this GitHub repo, which showed me an obvious thing I wasn’t doing: tuning the minibatch size. I was honestly embarrassed that I hadn’t thought to modify it yet. Even when I increased the trajectory length from 512 frames to 1000+ frames, I kept using 4 minibatches; obviously that wouldn’t work.

I followed that repo’s hyperparameters and went back to 512 steps (and added back 4-frame Frame Skipping, i.e., 2048 game frames per trajectory). Since num_steps was increased from 128 to 512, it made sense to also up the number of minibatches from 4 to 16 to keep the same ratio. Also, I know I just said I removed the env reset trick, but I decided to put it back in (which I’ll explain later).
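To make that ratio explicit, here’s the arithmetic with CleanRL-style names (8 parallel envs, as before):

```python
# How the rollout and minibatch sizes relate (numbers from this run).
num_envs = 8            # parallel Mario environments
num_steps = 512         # agent steps per rollout (x4 frame skip = 2048 game frames)
num_minibatches = 16    # scaled up from 4 alongside num_steps (128 -> 512)

batch_size = num_envs * num_steps               # 8 * 512 = 4096 transitions per PPO update
minibatch_size = batch_size // num_minibatches  # 4096 / 16 = 256, same as the original 128-step / 4-minibatch setup
```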

Following these changes (along with changing the number of update epochs from 4 to 10 and the lr to 1e-4), these were my results:

Rewards with changed hyperparameters
Mario looking like a god (until he falls into the hole)

Good job Mario, good job. I guess my earlier decision to remove the env reset was not a good idea. Given that there are more minibatches, it makes sense to reset the env and feed it more data. I learned here that resetting the env is critical, but it can be a double-edged sword if I don’t tune the hyperparameters around it.

I only ran it for a bit over 1.3 million timesteps, but we can see the agent still wanting to go right. Clearly, if I kept training, the reward would continue to increase. I don’t want to keep running my laptop though, as I’d need to keep it charging overnight, which would ruin its battery life. So at this point, I think my journey with this has ended, though I might look at other algorithms (I’m looking at you, Rainbow). We’ll see. Also, I really need to keep reading the Sutton & Barto RL book, because I spent a whole month doing DRL instead of continuing the textbook. Either way, this was a fun project to do over the winter break between the Fall and Spring semesters. Now that Spring is here, it’s time to struggle in class instead of teaching Mario how to damn jump over a pipe.
