A self-learning Atari Dragster bot in Python

Andreas Thiele
Feb 19, 2019


A screenshot of the Dragster game. What’s wrong with this picture?

This post will cover a little journey of mine involving a game from 1980. The game is incredibly short, and the best players in the world will finish it in slightly more than 5.5 seconds. And as you might have guessed, this is the Dragster game for Atari 2600.
Dragster is interesting because it has a unique place in the history of computer games and especially in speed-running. The best time ever was supposedly 5.51 seconds, and this held the Guinness world record for being the longest-standing record in any video game (a record for having a record is really quite meta). I say “supposedly” because the claim was not that well supported: the best time anyone else could get was 5.57 seconds, which multiple people achieved, but no-one else ever did 5.51 or 5.54 (in that range the timer can only show 5.51, 5.54 and 5.57; other times are impossible due to time-tracking mechanics that I'll explain later). This brought on a massive hunt to understand how the record was set, and after serious digging through the code and piles of mounting evidence, 5.51 seconds was deemed impossible and the record was rejected.

Dragster is a conceptually simple game in which you control a dragster. All you do is drive from the left of the screen to the right. It is actually a two-player game so that you can play with a friend, but for speed-running it is played solo and usually as the top dragster. Here’s a video of a person doing a 5.57.

This game is so simple, right? That was my first thought. And surely, it can’t be that hard, right? That was my second thought. Although I didn’t spend many hours, I had issues getting close to even 6 seconds, so yeah, it is quite hard to make a world record. But then, could I make something that ties the world record? That is actually quite simple, since I’m using a computer and can optimize inputs for every single frame of the game, so this isn’t much of a challenge. But could I make a bot that teaches itself to do well on the game? That sounded like a fun challenge. As will be explored later, Dragster is actually a rather complex game, and the best record my bot ever got is 5.77, which is definitely a decent time. Although it’s no world record, it’s better than most people ever manage to do. And this post will take you all the way from zero to that point.

Technical background on the Atari 2600

The Atari 2600 is by today’s standards an incredibly slow machine. But it is also rather fascinating in its own way. Here are some of its specifications:

  • 1.19 MHz CPU
  • 128 bytes of RAM (yes, bytes)
  • Maximum resolution of 160 * 192 pixels
  • 128 colors

The machine isn’t what you would call a powerhouse. On top of that, common operations like multiplying two numbers were something developers needed extra code to do quickly, and support for floating point numbers was also left to the developer (side-note: a well-known set of floating point routines for the 6502 has Steve Wozniak, co-founder of Apple, as a co-author).

Dragster insights

Remember me saying that Dragster is simple? For a game that takes less than 6 seconds, it is actually very complicated. What is not shown in the video of the world record is how you can actually lose. Here are the three ways to lose:

  • Before the run starts, there is a count-down. If you try to shift up the gear during the count-down, you lose.
  • Any time in the game, if the tachometer gets too high, you will blow your engine and lose.
  • Player 2 finishes before you do.

In my case the last scenario is irrelevant as there’s no player 2. I will however add another way to lose, namely not finishing quickly enough: if you take more than 19.99 seconds to reach the finish line, you lose.

I also want to know when the player reached the finish line and what the finishing time was. This could all be done by applying some kind of image recognition to a screenshot of the game, but it can also be done by looking at the memory of an emulator. Looking at memory is much easier and faster, and later on I will have access to it anyway.

And here’s where the post takes a very technical turn.

Poking through Dragster’s memory

To get state information out of Dragster, I needed to identify where in Dragster’s memory each required piece of information is kept. Luckily, the Atari 2600 only has 128 bytes of memory, so it was a manageable task. I used the Stella emulator and its debugger for this.

The count-down to start

Stella’s debugger is nice enough to highlight the changing memory addresses with a red background. Here I’m trying to find the count-down; by its very nature, I would expect it to be a value that decreases between the frames. The only value that consistently decreases is the one at 8d. When 8d goes from value 30 to 2f, the on-screen counter decreases from 4 to 3, which confirms that this is the correct address.
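The same filtering idea can be expressed in a few lines of Python, given successive RAM snapshots from whatever emulator you have access to; how you capture the snapshots is up to you, and this is only a sketch of the approach, not the exact tooling I used.

```python
# Sketch: given a list of 128-byte RAM snapshots taken on consecutive
# frames, keep the addresses whose values never increase and that change
# at least once, i.e. the count-down candidates.
def find_decreasing_addresses(snapshots):
    candidates = set(range(len(snapshots[0])))
    for prev, curr in zip(snapshots, snapshots[1:]):
        candidates = {a for a in candidates if curr[a] <= prev[a]}
    changed = {a for a in candidates if snapshots[-1][a] < snapshots[0][a]}
    # The 2600 maps its RAM at $80-$ff, so array index 0x0d corresponds to
    # the address 8d that Stella's debugger shows.
    return sorted(hex(a + 0x80) for a in changed)
```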

Finding the tachometer memory address

Although it’s not strictly needed to play the game, it’s useful to have a way of measuring the tachometer. At a8 there’s a value that constantly increases by 3 in sync with the tachometer on-screen, so that is the memory address.

Blowing the engine

This gif shows the engine blowing up as the tachometer gets too high. This is linked to the tachometer value at a8, and here you see the blow-up happening and the tachometer resetting to 0. The engine blows up when tachometer >1f (that’s 31 in decimal).

From what I’ve tested, it seems that when player 1’s engine is blown, then ce equals 1a and d4 equals 1.

Starting too early

In the gif above, player 1 is starting too early, which happens when you try to shift gears during count-down. The address indicating this, as far as I’ve gathered, is d4, which equals 1d when the player started early.
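Put together, the loss conditions can be read straight from a RAM snapshot. The helper below is a sketch based on the observations above; the indices assume a plain 128-byte RAM array, so Stella's addresses have 0x80 subtracted, and the constants are empirical rather than documented.

```python
# Loss conditions for player 1, as observed in the debugger.
def engine_blown(ram):
    return ram[0xce - 0x80] == 0x1a and ram[0xd4 - 0x80] == 0x01

def started_too_early(ram):
    return ram[0xd4 - 0x80] == 0x1d
```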

Driving and finishing the race

Finally, I wanted to figure out how to read the player distance. Quite a lot is going on here. The first address I’ll mention is ba: it holds the current player distance and stops at 61 (97 in decimal), so it gives both the distance and a way to tell when the finish line is reached. Neat :) The distance actually has a fractional part, although it’s not obvious: c2 counts the part after the decimal point in 256ths. So if ba = 95 and c2 = b3 (179 in decimal), the total distance is 95 + 179/256 ≈ 95.70.
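In code, decoding the distance from a RAM snapshot is a couple of lines (again indexing a plain 128-byte array, so 0x80 is subtracted from Stella's addresses):

```python
FINISH = 0x61  # 97 in decimal; ba stops here once the finish line is reached

def player1_distance(ram):
    # Whole units in ba, 256ths of a unit in c2.
    return ram[0xba - 0x80] + ram[0xc2 - 0x80] / 256.0

def player1_finished(ram):
    return ram[0xba - 0x80] >= FINISH
```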

Player run-times also look fractional, but they are handled with integers. You can see player 1’s time in b3 and b5, where b3 stores seconds and b5 stores hundredths of a second; note that they are stored as decimal values, not plain binary. Dragster implements time-tracking as two integers, with each frame adding 3 or 4 hundredths of a second, and that’s how it avoids floating point numbers in a simple and interesting way that mostly computer people appreciate.
This way of time-keeping makes most Dragster times impossible, and it is the reason why 5.51, 5.54 and 5.57 are times the timer can show, whereas 5.52, 5.53, 5.55, 5.56, 5.58 and 5.59 are not.

Who is driving?

One thing you may or may not have noticed in all the gifs: 8f keeps flipping between 0 and 1, which is odd. As can be seen in the gif showing the finish, the player times are not updated in the same frame. Instead, each player gets its own frame in which its state is updated and its inputs are read. This theoretically allows a player to press a button during the other player’s frame and release it again without the input being registered. Dragster was released in North America, which uses NTSC, meaning 60 frames are calculated per second, 30 per player; so you would have to press and release within 1/60 of a second and be unlucky enough to hit the opponent’s frame for the input not to register. I don’t know whether that’s even physically possible for a human, but it is possible for a bot.
This detail is not game-breakingly important, but it did allow me to skip all the frames in which input was ignored anyway.

A technical peculiarity: because each player gets 30 updates per second, it makes perfect sense that the players’ times increase by 3 or 4 hundredths of a second per frame. If 20 of the 30 frames add 0.03 seconds and the remaining 10 add 0.04, you end up with 20*0.03 + 10*0.04 = 1 second of timer increase per second.
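The arithmetic is easy to sanity-check:

```python
# 30 player frames per second: 20 of them add 0.03 s and 10 add 0.04 s,
# which adds up to exactly one second of timer per real second.
assert abs(20 * 0.03 + 10 * 0.04 - 1.0) < 1e-9
```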

In a different universe where Dragster was released in a PAL country with 50 frames per second, each player would have 25 frames per second, making each time-step 0.04 seconds. Had the game ever been released in a PAL country, one could possibly have made a 5.56 (56 is divisible by 4) and thus set a world record just by having a different version of the game. But that never happened, and it’s just my wild speculation. It does hint at the time-keeping headache some programmer must have had when someone from marketing undoubtedly proposed releasing Dragster for PAL regions. Poor fella.

But now that I’ve got all memory addresses, it’s time to move on.

What do other people do?

At least to my knowledge, there is no post like this one on the interwebs that details making a self-learning bot specifically for Dragster.
However, making bots that play games is something people do a lot these days, and you have likely heard of Google’s superior chess bot or OpenAI’s DOTA 2 bot playing against real people. It is done with good reason; games are easy to relate to, and unlike other machine learning areas, you don’t actually need data: you just make your bot play the game.

Speaking of OpenAI, one of the things they’ve done to make reinforcement learning more approachable is to provide ready-made game setups through their OpenAI gym. They already have a decent collection of Atari games to play with; too bad Dragster isn’t one of them.

Since OpenAI couldn’t/wouldn’t support all games, they made things much more approachable by creating retro, which enables you to play a game through the Stella emulator. A caveat to retro, though, is that you will need to provide the ROM yourself, and it will have to match a specific checksum.
Unsurprisingly, retro doesn’t support Dragster either. So here, too, I didn’t catch a break. But their work made my task much easier, as I could add Dragster myself, and so I did in a fork of the repository. I will not go into the deep technical details of what I did and how; if you really wish to know, just check my commits. The most important thing I did was creating data.json, which holds information on where in memory to find the data needed to play the game later on. That’s what I needed those memory addresses for.
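To give an idea of what that looks like, here is a sketch of the kind of entries my data.json contains, written as a small Python script that emits the file. The variable names are my own choice, and the exact address encoding depends on how the emulator core exposes RAM, so treat the numbers as illustrative rather than copy-paste ready.

```python
import json

# retro's data.json maps named game variables to memory locations.
# The addresses below are the Stella zero-page addresses found earlier;
# the base/offset that retro actually expects may differ on your setup.
data = {
    "info": {
        "countdown":       {"address": 0x8d, "type": "|u1"},
        "tachometer":      {"address": 0xa8, "type": "|u1"},
        "distance":        {"address": 0xba, "type": "|u1"},
        "time_seconds":    {"address": 0xb3, "type": "|u1"},
        "time_hundredths": {"address": 0xb5, "type": "|u1"},
    }
}

with open("data.json", "w") as f:
    json.dump(data, f, indent=2)
```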

What is reinforcement learning?

There’s an issue I’ve blissfully ignored until now, namely that the bot has to teach itself the game, and so far I haven’t mentioned at all how to do that.
Here’s a very superficial conceptual primer on reinforcement learning and the choices I made where there were any to be made.

Agent

The agent is the entity that interacts with the environment and tries to learn something. So, this is the bot.

Environment

This is, abstractly, what the agent interacts with. It can be reset to some predefined state, it can be observed (in the Dragster case, by getting the current game frame) and it can be interacted with through a limited set of actions (button inputs, or no input, for Dragster). After each action, the environment returns a new observation reflecting the applied action, some reward (more on that below) and an indication of whether the game is over.

Reward

Imagine trying to teach a dog to roll over. You start by saying “roll”; after which the dog will likely stare at you. So you say “roll” once more, and you keep doing so, until the dog actually does roll over, and then you give it a treat — the reward. You repeat the process, and for each time it rolls over, you give it a treat, thus reinforcing the behavior of the dog. In the end, the dog will have connected the command “roll” with doing a roll and getting a treat. This is the very core concept of reinforcement learning — reward the bot when it does something good, and reward it more the better it does.
In Dragster terms, this means blowing the engine during start is very bad, while starting and not finishing is not good, but at least it’s better than not starting at all. And finally I want to reward the bot more the better a time it gets.

The reward function will always be some function with numeric output, which the bot tries to optimize so that the total sum of rewards earned throughout a game is maximized. Defining the reward function is one of the most important and difficult parts of working with reinforcement learning, so I had to pay special attention to it.

My requirements for a reward function were that it:

  • Gets lower the longer time the bot takes to finish
  • Is negative if the bot starts too early, but less negative the less time is left of the count-down
  • Rewards the bot if it finishes the count-down without starting early
  • Penalizes the bot if it blows the engine, but the penalty should be less the farther the bot has driven
  • Rewards the bot if it finishes the game

Optimizing this function requires the bot to start, drive without blowing the engine and finish quickly; the quicker it finishes, the better. And if it fails in some way, it is still rewarded more the better it did up until that point.

I’ve put together a composite reward function that fulfills these requirements.
During the count-down frames, if the bot starts too early, it receives a negative reward that scales with the number of frames left of the count-down. This means that the longer it takes for the bot to fail during the count-down, the less penalty it will receive.

After the count-down has finished and the game has started, a positive reward is given to incentivize the bot not to start too early. So not only is the bot not penalized for avoiding an early start, it also receives a nice reward for simply getting the run started.

After the bot has successfully survived the count-down, a reward is given in each frame. These rewards should be seen as an if-elseif-else case; only one of them is given per frame:

  • I want the bot to be penalized if the engine is blown, but it should be penalized less if it managed to move a lot. If it just survives the count-down and blows the engine, that’s not as good as getting 90% of the way.
  • To incentivize reaching goal, an additional 100 is given once goal is reached.
  • To incentivize moving forward, 10 is given each time the Dragster moves a unit forward.
  • Finally, to make the bot want to finish quicker, it receives a reward of -1 for each frame spent driving without moving forward, meaning faster driving results in a higher total reward.

The maximum achievable sum of rewards for this reward function is 1210, which is achieved by getting a time of 5.57.
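As a summary, here is a condensed sketch of the shape of that reward function. The count-down constants are placeholders (the exact formulas are not reproduced here), and the state fields are hypothetical names, but the branching mirrors the list above.

```python
def reward(prev, curr):
    # Started too early: negative, but less so the closer we got to a
    # legal start (placeholder scaling).
    if curr.started_too_early:
        return -1.0 * curr.frames_left_of_countdown
    # Just survived the count-down: a one-off bonus (placeholder value).
    if prev.in_countdown and not curr.in_countdown:
        return 10.0
    # From here on, an if-elif-else per frame, as described in the list.
    if curr.engine_blown:
        return -100.0 + curr.distance      # softened by how far we got
    if curr.finished:
        return 100.0
    if curr.distance > prev.distance:
        return 10.0 * (curr.distance - prev.distance)
    return -1.0                            # a frame without progress
```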

Exploration

Think once more of the dog from before. How would it know what to do to get the treat? It wouldn’t. So it would just have to try out different things until you gave it a treat. This is what exploration is about; exploring new things in the hopes of more reward.

At some point the dog will have to decide whether to just exploit its knowledge and get one treat by doing a roll, or to explore other options in search of potentially more treats, like getting two treats for rolling over twice. So how does it decide? Since it’s a dog, I don’t know, but this is known as the trade-off between exploitation and exploration, and it is a trade-off that has to be made.
If an agent explores too much in a complex game, the more complex states are rarely reached because it does not make the right choices to get there. On the other hand, if it explores too little, it might never try out new actions that would give higher rewards. Making a world record in Dragster is actually very difficult, since you need to time some of your inputs down to a single frame, so getting there by randomness alone is highly unlikely.

I used epsilon annealing, which makes the bot explore new actions 25% of the time at the beginning of training and gradually drops that to 1% of the time after 1,000,000 actions.
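Assuming a linear decay (the exact shape of the schedule is a detail), the annealing looks like this:

```python
EPS_START, EPS_END, DECAY_ACTIONS = 0.25, 0.01, 1_000_000

def epsilon(actions_taken):
    # Fraction of the decay period completed, capped at 1.
    progress = min(actions_taken / DECAY_ACTIONS, 1.0)
    return EPS_START + progress * (EPS_END - EPS_START)

# epsilon(0) == 0.25, epsilon(1_000_000) == 0.01, and anything beyond
# that stays at 1% exploration.
```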

A hard-coded world record

By this point I had the knowledge needed to set up an environment with reward functions and game interactions.

Dragster has already been heavily analyzed by a gamer known as Omnigamer. He put together a spreadsheet laying out different ways to get a 5.57, so I implemented one of the strategies from his spreadsheet. As you can see from the gif above, which I recorded, this gives a perfect run. So everything was good so far and the environment was working :)

Game simplifications

Since Dragster is a rather complex game, I decided to make some further non-game-breaking simplifications, discussed below, to make building a bot easier.

Limit available actions

The agent doesn’t have too many inputs to consider; it can press:

  • Left
  • The single button on the joystick
  • Both of the above
  • Nothing

Other actions like up, right and down are available on the Atari 2600, but they are not needed for Dragster, so I limited the bot to this subset of actions, as sketched below.
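How these names map onto the emulator's button vector depends on the integration, so treat the labels as illustrative:

```python
# The four actions the agent is allowed to choose between.
ACTIONS = [
    [],                   # press nothing
    ["LEFT"],
    ["BUTTON"],
    ["LEFT", "BUTTON"],   # both at once
]
```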

Only use every second frame

As only every second frame reads any input, there’s no point in showing all the frames to the bot and asking for actions in each of them. If I did show the frames in which the bot’s actions don’t change anything, the bot would have a harder time learning, as only half of its actions would impact the game’s state. Skipping them also halves the number of tensors used for training, which makes for faster training and lower memory usage.
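One way to realize this is a small stepping helper that repeats the chosen action for two emulator frames and only hands every second frame to the agent; here is a sketch assuming a gym-style env.step interface:

```python
def step_skipping(env, action, skip=2):
    """Repeat `action` for `skip` emulator frames and accumulate the reward."""
    total_reward, done, info = 0.0, False, {}
    for _ in range(skip):
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done, info
```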

Start the bot's game interaction when there are 4 input-frames left until game-start

Dragster starts with a count-down of 160 frames, but you can only act in half of those, so it’s really 80 useful frames.
Remember that you can start too early or blow the engine. What are the chances that the bot avoids a bad combination of actions for 80 consecutive actions? And even if it does manage to start at some point, it’s likely because it learned to not press anything, which is not the wanted behavior either. So limiting the period in which the bot can screw up greatly reduces the training time.

Once more referring to Omnigamer’s spreadsheet, a good many of the ways to reach 5.57 require 4 input-frames during count-down, so I decided on using 4 frames during count-down.

Stop game after 20 seconds

There’s no actual limit on the length of a Dragster game; the timer stops at 99.99, but if you haven’t finished, you can still press buttons and keep driving. Seeing that the world record is far below that, I stop the game after 20 seconds. This increases training speed a lot.

Self-learning agent implementation

The time to actually implement the bot has finally come. I decided not to implement the self-learning algorithm myself, but instead use Tensorforce, a reinforcement learning library built on top of TensorFlow. Tensorforce also comes with a collection of tools made for exactly this purpose.

Tensorforce supports some preprocessing of the data it receives, and I’m using its grayscale preprocessor, which converts the image to grayscale and thus removes a lot of the color information. A raw Atari frame is 210*160 pixels, and a game can now take up to 20 seconds with 30 input-frames per second, which is 600 of those frames, so removing the color information makes the frames take up much less space.
I also cut away big parts of the picture; for instance, I don’t care about player 2 and can thus cut away half the image. Similarly, many other unimportant rows and columns of pixels can be removed without losing essential information.
After cutting, only 150*35 pixels are left. This is what a sample frame shown to the bot looks like, before grayscaling is applied:

Cutting the frames also makes the tensors smaller, making for faster training.
This kind of preprocessing by cutting the image isn’t supported by Tensorforce, so I had to do it myself.
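Here is a sketch of that cropping step; the exact row and column offsets are illustrative, what matters is that the result is the 150*35 region around player 1's lane:

```python
import numpy as np

def crop_frame(frame):
    """Crop a raw 210*160 RGB frame down to player 1's strip (35 rows, 150 columns)."""
    cropped = frame[55:90, 5:155]          # assumed offsets
    return np.ascontiguousarray(cropped)   # shape (35, 150, 3) before grayscaling
```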

For the neural network architecture, my first layer is a convolutional layer with 16 8*8 filters, stride 4 and leaky ReLU as the activation function. The second layer is another convolutional layer with 32 4*4 filters, stride 2 and leaky ReLU, followed by a flattening layer and finally a fully connected layer of size 256.
This network architecture is taken from this article that teaches a bot to play Breakout, which in turn takes it from an article by the DeepMind team about playing different Atari games.
I am using leaky ReLU instead of the more commonly used ReLU. This is a very minor tweak to avoid parts of the network dying.
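For illustration, here is the same architecture written out in plain Keras; the actual model is specified through Tensorforce's network spec, whose syntax differs between versions, so this is just a sketch of the shape of the network:

```python
import tensorflow as tf

def build_network(input_shape=(35, 150, 1)):
    # Two conv layers, a flatten, then a dense layer of 256 units,
    # all with leaky ReLU activations.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 8, strides=4, activation=tf.nn.leaky_relu,
                               input_shape=input_shape),
        tf.keras.layers.Conv2D(32, 4, strides=2, activation=tf.nn.leaky_relu),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation=tf.nn.leaky_relu),
    ])
```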

Differently from the DeepMind work referenced above, I’m using a PPO (Proximal Policy Optimization) agent; for the record, DeepMind used a Deep Q-Network. PPO is the agent Tensorforce uses in its examples since it generally performs well, and I have had a lot of good experiences with it.

PPO is usually quick to learn, and it is with this algorithm and the architecture described above that I managed to get a time of 5.77.

Epilogue

Phew, this was all quite a mouthful for me and it has taken a lot of time. I wanted to tie the world record and failed to do so, but I got close enough that I can live with it.
If you wish to try making a bot yourself, feel free to use any of my code. If you do beat my time with something self-learning, please write a comment and let me know how you did it :)

I hope you’ve enjoyed the journey into this legendary Atari game and machine learning as much as I have. It’s been really rewarding (get it? Sorry…)

Suggested reading

I haven’t scratched the surface of implementing reinforcement learning algorithms here, as I believe many other sources do that much better than I can. Here are some:

My reinforcement learning primer also only covered the bare minimum required to understand the concepts. The links above do a much better job than I can if you are interested in the deeper details. This blog series is also a great resource.

I went for the PPO algorithm for my agent, which isn’t one you will encounter often when first learning about reinforcement learning. Instead, you will likely hear more about Q-learning, or Deep Q-Networks, which are just a deep neural network implementation of Q-learning. That is also the algorithm used by the sources I linked just above. I went with PPO simply because it produced good results.
Tensorforce implements a lot of other agents that you can play around with if you wish to. Generally, if you want to work more with Tensorforce, you should visit their GitHub and their documentation.
