Pong, Machine Learning, and Never Standing Still

What We Did With Two Weeks of Bench Time

Thoughtworks Canada
Connected
5 min read · Apr 3, 2018


By Tsun Yin Lip and William Wen

Summary: As software engineers at Connected Lab, we had two weeks of bench time for research. We decided to spend that time investigating a well-known application of machine learning to the classic Atari game Pong. We were curious to know: why does the AI’s paddle never stop moving?

Animation by Luc Palombo

Background

In a 2016 blog post, Andrej Karpathy, a founding member of OpenAI, recounts how he used a type of machine learning called deep reinforcement learning to teach a neural network how to play Pong.

By feeding the difference between successive frames of gameplay into a neural-network-based learning agent, Karpathy was able to train the agent to adjust its behaviour based on the outcomes of its actions, eventually learning to beat the computer. Using the principles of reinforcement learning, the agent made recurring micro-adjustments to the network's weights, so that actions leading to victories were reinforced and actions leading to losses were discouraged.
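To make that reward mechanism concrete, here's a minimal sketch (our own illustration, not Karpathy's exact code) of how a reward earned at the end of a rally gets spread backwards over the frames that led up to it. These discounted values are what scale the gradient, so actions that preceded a win get reinforced and actions that preceded a loss get discouraged:

```python
import numpy as np

def discount_rewards(rewards, gamma=0.99):
    """Spread each end-of-point reward backwards over the frames before it."""
    discounted = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running = 0.0  # Pong only gives +1/-1 when a point ends; reset there
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

# Example: a five-frame rally that ends in a lost point (-1).
# Frames closer to the loss get blamed more heavily.
print(discount_rewards(np.array([0, 0, 0, 0, -1.0])))
# [-0.96 -0.97 -0.98 -0.99 -1.] (approximately)
```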

By training his agent in this way over three nights, Karpathy was able to construct an AI that beat the computer in the majority of cases.

The Problem of Stillness

But as we went through Karpathy’s code, studying the algorithms he used and experimenting with parameters to better understand how he trained it, we noticed that the AI paddle he made never stopped twitching. It shook like a coffee addict on a Monday morning.

Pong AI with Policy Gradients, published by Andrej Karpathy on YouTube

The restlessness of the paddle made us wonder: Was the lack of a stillness function a simple oversight on the part of its creator? Did Karpathy leave it out on purpose, for simplicity’s sake? Or was stillness always an option, and the agent simply learned, better than any human ever could, that continuous movement is the best way to win? We had to know.

Finding out required us to understand how the agent ultimately decides what to do at each time-step. To our surprise, we found out that Karpathy didn’t even give the agent the option to stay still. What did Karpathy know that we didn’t?

We decided to experiment: If we gave the agent the ability to stay still, would it ever do so? The original Atari game lets humans do it, so surely it must serve a purpose other than aesthetics, right?

Changing the Game

In the original implementation, the neural network only had to decide between two actions at any given time: it output a single probability that moving up was the correct choice, with moving down implied as the remainder. To add the option to stand still, we had to change the output to three probabilities: up, down, and still. Adding this extra dimension required swapping out functions and algorithms for more generic equivalents. Here’s what that looked like.

In Karpathy’s original implementation, after feeding the inputs (namely, the arrangement of pixels on the screen at a given moment of gameplay) into the neural network, the agent used a sigmoid function to decide whether the paddle should move up or down. Two properties of the sigmoid function make this work: 1) it is differentiable, and 2) it returns a single value between 0 and 1, representing the agent’s best guess that a given paddle movement is the right choice.
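As a rough sketch of that decision step (with illustrative names and shapes, not a copy of Karpathy's script), the forward pass squashes the network's output through a sigmoid and then samples from the resulting probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def choose_action_two_way(x, W1, W2):
    """x: flattened difference frame; W1, W2: the network's weights (illustrative)."""
    h = np.maximum(0, W1 @ x)      # hidden layer with a ReLU nonlinearity
    p_up = sigmoid(W2 @ h)         # a single number in (0, 1): P(move up)
    # Sample rather than always taking the likelier action, so the agent keeps exploring.
    return "UP" if np.random.uniform() < p_up else "DOWN"
```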

If the agent thinks that moving down is the best move, for example, that output lands closer to 0 than to 1. With a fully trained agent, the output shifts in whatever way maximizes the chance of winning.

Unfortunately, since we wanted to add a third action (stillness), the conveniences of the sigmoid function didn’t apply to our case. There’s no way to divide a line into three sections with a single point, so we had to find something else.

What we found was the softmax function. Softmax is a generalization of the sigmoid function that 1) is also differentiable and, more importantly, 2) can return any number of values that add up to 1. That let us split the 0–1 line into three sections and assign each of the three actions (up, down, and still) its own probability.
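Here is the same decision step reworked around softmax (again a sketch with illustrative names, not our production code): the output layer now produces three scores, and softmax turns them into three probabilities that sum to 1:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def choose_action_three_way(x, W1, W2):
    """W2 now has three output rows, one per action, instead of a single row."""
    h = np.maximum(0, W1 @ x)        # hidden layer, unchanged from before
    probs = softmax(W2 @ h)          # [P(up), P(down), P(still)], sums to 1
    return np.random.choice(["UP", "DOWN", "STILL"], p=probs)
```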

Results: Movement for the Win

After all that effort, we discovered that, even with the stillness function added, the paddle kept twitching. At this point we could say with greater certainty that the game simply doesn’t favour that action: it’s always better to keep the paddle moving. Staying still, while aesthetically pleasing, didn’t necessarily win games. Besides, the agent wasn’t interested in looking good — it was interested in winning as much as possible. Perhaps this is why Karpathy never bothered to add it.

The agent wasn’t interested in looking good — it was interested in winning as much as possible.

When you think about it, this makes sense (Pong players, take note). The default Atari opponent (i.e., the computer) is programmed to blindly track the current position of the ball, so to score against it you have to make the ball move faster than its paddle can follow. And since hitting the ball while your paddle is in motion actually speeds the ball up, it stands to reason that our agent learned to always be moving.

Conclusion: A Faster Agent

Our exploration of Karpathy’s work taught us a lot about how deep reinforcement learning works and even gave us the opportunity to make a few tweaks. On top of adding a stillness function, we also managed to dramatically speed up the agent’s learning by tuning the hyperparameters.

Though it took a good deal of noodling around before we found something that worked well (we weren’t using any ML tooling, so it was all done by hand), speeding up the learning was ultimately something of a quick win. In his post, Karpathy mentioned that he “did not tune the hyperparameters too much,” so we thought we’d take a little time doing it ourselves. We ended up with an agent that gets up to winning speed after training over a single night.

We ended up with an agent that gets up to winning speed after training over a single night.
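For context, these are the kinds of knobs we were turning. The names roughly mirror the ones exposed in Karpathy's script, but the values below are illustrative rather than the exact settings we landed on:

```python
# Hyperparameters we experimented with while tuning by hand.
# Values shown are illustrative defaults, not our final configuration.
hyperparams = {
    "hidden_units": 200,     # size of the single hidden layer
    "batch_size": 10,        # episodes played before each parameter update
    "learning_rate": 1e-4,   # step size used by RMSProp
    "gamma": 0.99,           # discount factor for future rewards
    "decay_rate": 0.99,      # decay of the RMSProp gradient memory
}
```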

After walking in Karpathy’s footsteps and making a few changes along the way, we’re better equipped to tackle machine learning problems than before we started. Make no mistake, there’s still tons to learn, but we now have a foundation to build on. Thanks, Andrej.
