An OpenAI Codex Experiment: Coding the Game of Pong using an AI co-pilot
There is much hype around this technology (and for good reason), but to understand this state-of-the-art automatic code generation technology better, we need to look at a specific example. A good example is one that can be described in just a few lines of text, can easily be understood through visualization, and lies in a domain outside of your expertise, so you can better appreciate the value a “code co-pilot” brings to non-experts.
So let’s build the game of Pong!
Note: I have no affiliation with GitHub, Microsoft nor OpenAI, so what follows is the honest feedback of someone who has been writing software in a variety of languages for over 25 years.
First Try — Code generation from a general description
When you look up “the game of pong” on Google, you are pointed to Wikipedia, which offers the following description:
Pong is a two-dimensional sports game that simulates table tennis. The player controls an in-game paddle by moving it vertically across the left or right side of the screen. The player can compete against another player controlling a second paddle on the opposing side. Players use the paddles to hit a ball back and forth. The goal is for each player to reach eleven points before the opponent; points are earned when one fails to return the ball to the other.
If you believed the popular press, you would expect the code generation problem to be entirely solved already. So let’s see what happens when we feed this description to Codex.
Not too surprisingly, we didn’t get a working solution out of that, but it did introduce pygame, which I didn’t know existed! So, bonus points for recommending a useful library!
Of course, the above description is largely under-specified. We humans know what table tennis looks like in real life and can therefore make some implicit assumptions about what this game should look like.
Second Try — Code generation from pseudocode
Let’s try to be more specific about the code we wish to write. Still, we want to avoid writing code ourselves, so let’s just stick to English (pseudocode).
Screen Layout instructions:
- draw a rectangle of 800 pixels wide and 600 pixels tall and call it the screen
- make the background white
- draw a black vertical line in the middle of the screen
- draw a black rectangle of 80 pixels high and 5 pixels wide at the left of the screen. This is paddle1
- draw a black rectangle of 80 pixels high and 5 pixels wide at the right of the screen. This is paddle2
- draw a red circle with diameter 10 pixels at the center of the screen. This is the ball.
Game Control instructions:
- there are 2 players. Player 1 controls paddle1. Player 2 controls paddle2.
- player1 uses key W to move paddle1 upwards at 10 pixels per second
- player1 uses key S to move paddle1 downwards at 10 pixels per second
- player2 uses the up arrow to move paddle2 upwards at 10 pixels per second
- player2 uses the down arrow to move paddle2 downwards at 10 pixels per second
Game play and scoring instructions:
- both players start with a score of 0. Draw the scores at the top of the screen.
- a game consists of multiple rounds
- at the start of a round, position the ball at the middle of the screen and move the ball in a random direction at 10 pixels per second
- when the ball hits the top or the bottom of the screen, bounce the ball
- when the ball hits paddle1 or paddle2, bounce the ball
- when the ball hits the left side of the screen, player2 wins and gets 1 point. Then, a new round is started
- when the ball hits the right side of the screen, player1 wins and gets 1 point. Then, a new round is started
- when a player accumulates 11 points, that player becomes the winner and the game is over.
- when the game is over, print a victory message
Now, let’s feed these pseudocode steps one at a time and see what code gets generated. In the code fragments below, the first comment line of each fragment is our input (question) to Codex, and the following lines are produced by Codex. Sometimes Codex actually generates a second comment, as you will see.
Part One — Draw the screen
Codex correctly calls the set_mode() function with the requested size and assigns the result to the variable of our choice!
Again, nicely done! It knows we are referring to the background of the screen, and it correctly uses the RGB value of white.
All good! The color is RGB 0, 0, 0 (black) and it correctly calculates the middle of the screen to be at x coordinate 400 (= 800 / 2). If you pick a different screen size to start with, these values are correctly updated, so it seems like Codex knows how to do basic math!
But does it really? Let’s try some nonsensical values for the screen width that it will surely never have seen in its training set: 123/2 = 60, 777/2 = 400, 20560/2 = 10080. Hmmm. Not exactly right, but pretty close nevertheless. The first two seem to apply some sensible rounding before dividing, but the last one is just weird. For screen positions this might still be OK-ish, but for any other use case where precision matters, we have a problem!
Now let’s draw the paddles and the ball.
Nice job again.
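The original write-up showed the generated fragments as screenshots. Stitched together, the drawing code for this layout might look roughly like the sketch below; the function wrapper, constant names and exact coordinates are my own reconstruction, not the verbatim Codex output.

```python
import pygame

WIDTH, HEIGHT = 800, 600
WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)

def draw_board(screen):
    """Draw the static board: background, centre line, paddles and ball."""
    screen.fill(WHITE)                                                      # white background
    pygame.draw.line(screen, BLACK, (WIDTH // 2, 0), (WIDTH // 2, HEIGHT))  # centre line at x = 400
    pygame.draw.rect(screen, BLACK, (0, HEIGHT // 2 - 40, 5, 80))           # paddle1, left edge
    pygame.draw.rect(screen, BLACK, (WIDTH - 5, HEIGHT // 2 - 40, 5, 80))   # paddle2, right edge
    pygame.draw.circle(screen, RED, (WIDTH // 2, HEIGHT // 2), 5)           # ball, 10 px diameter
```

To actually see the board, you would call pygame.init(), create the screen with pygame.display.set_mode((WIDTH, HEIGHT)), then call draw_board(screen) followed by pygame.display.flip().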
But if we were to run this generated program now, nothing would happen. First we need to import the relevant libraries (easy enough for a human non-expert to do), but even then, you would see a black screen pop up and immediately close again. This is because pygame requires an active event loop, which prevents the program from terminating. This event loop is also in charge of actually drawing the items on screen at each iteration, so as to allow for moving objects.
Knowing this, let’s ask Codex to create such an event loop for us, by just asking “handle events”.
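For reference, a minimal pygame event loop of the kind Codex produces here looks something like the following. The function wrapper and the draw_frame callback are my own framing; the generated code was inlined at the top level of the script.

```python
import pygame

def run(screen, draw_frame):
    """Minimal event loop: keep the window alive until it is closed,
    redrawing the scene on every iteration."""
    clock = pygame.time.Clock()
    running = True
    while running:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:   # window close button pressed
                running = False
        draw_frame(screen)                  # redraw paddles, ball, scores, ...
        pygame.display.flip()               # show the freshly drawn frame
        clock.tick(60)                      # cap at 60 frames per second
```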
Running this gives the following:
Wow! That’s almost what we asked for. Only the paddles are drawn horizontally rather than vertically. It’s easy enough to manually flip the width and height parameters, but let’s figure out what causes Codex to get this wrong. Is it because most comments it has seen in the wild first mention width, then height?
Could be, because flipping the spec for width and height did the trick. But now it no longer positions the paddles vertically in the middle of the screen, something we never actually specified but which it seems to have picked up from some pong implementation it has seen out there somewhere…
Let’s try different wording. Using tall rather than high seems to fix it as well... but that is kind of scary, isn’t it? For graphical programs it’s easy to spot that something is wrong, but what about a complex mathematical transformation? The spec can use words that are synonymous in English (or at least appear synonymous to a non-native English speaker), yet the generated code is different...
With the above modification, the game screen now looks like this:
Before we continue, let’s remove the event loop again, until we have defined all events we wish the system to control.
Part Two — Make the game interactive
Let’s add some code to make the paddles controllable by the players.
So in this case, Codex actually produces a comment stating which keys the respective players will use. Indeed, the request above was under-specified, and Codex extended the specification. Pretty cool, no? But no code.
Let’s be more specific in what exactly we want.
Interestingly, Codex produces a function for this and introduces paddle1_velocity as a variable we haven’t used before. It also refers to a global event variable, which is normally introduced in the event loop. This code obviously requires either that paddle1_velocity be declared as a global variable, that event be added as a function argument, or that the code be inlined in the event loop. Still, the generated code is super useful.
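A repaired version of such a handler, with event passed in as an argument rather than read from a global, might look like this sketch (the function name and the return-value style are mine; only the variable name paddle1_velocity comes from the generated code):

```python
import pygame

def handle_paddle1_keys(event, paddle1_velocity):
    """Return paddle1's new velocity for a single keyboard event:
    W moves up (negative y), S moves down (positive y)."""
    if event.type == pygame.KEYDOWN:
        if event.key == pygame.K_w:
            return -10          # up means decreasing y
        if event.key == pygame.K_s:
            return 10
    elif event.type == pygame.KEYUP and event.key in (pygame.K_w, pygame.K_s):
        return 0                # stop when the key is released
    return paddle1_velocity     # any other event leaves the velocity unchanged
```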
It does highlight a limitation of the current state of the art, though. When Codex introduces new variables like event and it finds an event loop in its context, it should be able to inject the generated code in the right location. If, on the other hand, it doesn’t find a definition for event, then it should ideally generate the event loop as well.
That being said, both of these features could be offered by the “app” surrounding the Codex AI model. Not everything needs to be part of the AI model itself.
Another interesting point is that Codex understands that the coordinate system has its origin (0, 0) at the top left of the screen and points downwards, so that moving up means decreasing the y value! Of course, this quirk is more error-prone for human developers than for their AI counterparts, since humans intuitively expect the y axis to point up, while a typical computer coordinate system points down…
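To make the up-is-minus-y convention concrete, here is a tiny hand-written helper (not Codex output) that moves a paddle and clamps it to the screen:

```python
def move_paddle(y, direction, speed=10, screen_height=600, paddle_height=80):
    """Move a paddle's top y coordinate: 'up' subtracts, 'down' adds,
    because pygame's origin (0, 0) is the top-left corner."""
    if direction == "up":
        y -= speed
    else:
        y += speed
    # keep the paddle fully on screen
    return max(0, min(y, screen_height - paddle_height))
```

For example, move_paddle(300, "up") returns 290: a smaller y means higher up on the screen.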
Similarly, let’s generate the code for the other control keys.
Now let’s ask for generating the event loop again.
Really, really well done! That’s a big piece of code right there. It calls all the functions it defined before and copied all our earlier rendering code into the event loop, including our own comments! The only problem is that it forgot to pass the event parameter to the functions, but that’s again an easy fix. It also now uses its self-defined velocity variables such as paddle2_velocity to update the corresponding paddle position variables such as paddle2_pos. It is our task to correctly initialize these position variables. So let’s manually add this code to the top of the program we have so far:
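The manually added initialization might look like this (the values assume the 800×600 layout from the spec; the variable names follow the ones Codex introduced):

```python
# screen layout constants from the spec
WIDTH, HEIGHT = 800, 600
PADDLE_HEIGHT = 80

# start both paddles vertically centred and standing still
paddle1_pos = HEIGHT // 2 - PADDLE_HEIGHT // 2   # 260
paddle2_pos = HEIGHT // 2 - PADDLE_HEIGHT // 2
paddle1_velocity = 0
paddle2_velocity = 0
```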
and also inline the key handler code in the event loop.
Let’s run the game again.
The screen looks fine, but the paddles are not moving. Why? Because the generated code does not use the paddle1_pos and paddle2_pos variables in the updated drawing instructions! Let’s correct that.
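The corrected drawing instructions read the position variables instead of hard-coded coordinates; a sketch (my own function wrapper) of what that correction amounts to:

```python
import pygame

BLACK = (0, 0, 0)

def draw_paddles(screen, paddle1_pos, paddle2_pos, width=800):
    """Draw both paddles at their *current* y positions."""
    pygame.draw.rect(screen, BLACK, (0, paddle1_pos, 5, 80))          # left paddle
    pygame.draw.rect(screen, BLACK, (width - 5, paddle2_pos, 5, 80))  # right paddle
```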
And now the game has functional paddles! All in all, very few changes were needed, and they didn’t require any in-depth knowledge of the pygame library at all.
Part Three — Scoring
No game if we can’t win!
The result is a comment indicating when a player scores. Let’s split it up into two separate questions:
It remembers there are two players and correctly sets the initial scores to 0.
Cool, now it actually makes a decent attempt at drawing the scores! But it doesn’t work; we get an error indicating the fonts have not been initialized.
Traceback (most recent call last):
File "~/codex/play/pong.py", line 37, in <module>
font = pygame.font.Font(None, 50)
pygame.error: font not initialized
So let’s be bold and ask Codex to fix that error. You never know…
Sure enough, that’s the fix! I’m not even sure if this is an intentional feature of the model, but if it’s not, it’s even cooler!
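The fix, together with the score-drawing code it enables, might look like the sketch below; the function wrapper and the exact blit positions are my own guess at “top of the screen”, not the verbatim Codex output.

```python
import pygame

pygame.font.init()                  # the fix: initialise the font module first
font = pygame.font.Font(None, 50)   # default font, size 50

def draw_scores(screen, score1, score2, width=800):
    """Render each player's score at the top of their half of the screen."""
    screen.blit(font.render(str(score1), True, (0, 0, 0)), (width // 4, 10))
    screen.blit(font.render(str(score2), True, (0, 0, 0)), (3 * width // 4, 10))
```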
Note: the initialization code should be moved to the top and run only once, while the drawing code must be part of the event loop, since the scores need to be redrawn at each iteration. This we have to do manually.
Looks like a Minecraft hamster, but ok, Codex did a decent job at positioning the scores in a reasonable place!
Part 4 — Now let’s start playing…
Granted, the input was not actionable, and we get a comment as the result. Sounds reasonable. Now let’s play along with the comment it presented to us:
Ok, the ball is positioned centrally, but not moving.
OK, it sets up a speed, but nothing is moving just yet, because the movement would be part of the event loop. So let’s remove the original event loop again and ask Codex to regenerate it.
And sure enough! What an excellent piece of code this is! It effectively adds code to move the ball, and it uses all the right dimensions we set up earlier, including hit detection with the paddles… It makes you wonder, though, how much of this is “parroted” verbatim from a pong game it has seen as part of its training data. However, a quick word-for-word search on Google didn’t reveal any hits for the comment it generated itself: “check for collisions with paddle1”. Of course, if you can deduce from the context that we are building the game of pong, then, statistically speaking, it will have seen hit detection logic closely associated with the already written code. The fact that it also generates a nice descriptive comment for the collision detection parts is super awesome, as it considerably helps in understanding what code has been generated!
Note: this last piece of code was cut off, because there is a limit on the number of tokens you can ask the model to generate. It would have correctly completed the next few lines of code as well, which is really impressive for a piece of code this long! But in fact, we already generated this code earlier, so we can just extract the new piece having to do with moving the ball and edge detection.
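Stripped of the paddle logic, the ball-movement and edge-bouncing part boils down to something like the following plain-Python sketch; the helper functions and names are mine, following the spec’s “random direction” and “bounce off top/bottom” rules, not the verbatim Codex output.

```python
import math
import random

def new_round(width=800, height=600, speed=10):
    """Centre the ball and launch it in a random direction at `speed`."""
    angle = random.uniform(0, 2 * math.pi)
    pos = [width // 2, height // 2]
    vel = [speed * math.cos(angle), speed * math.sin(angle)]
    return pos, vel

def step_ball(pos, vel, height=600, radius=5):
    """Advance the ball one step, bouncing off the top and bottom edges."""
    pos[0] += vel[0]
    pos[1] += vel[1]
    if pos[1] - radius <= 0 or pos[1] + radius >= height:
        vel[1] = -vel[1]    # vertical bounce
    return pos, vel
```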
If you look more closely at the hit detection part of the code, you will notice that it is wrong. It uses fixed vertical offsets (300 < y < 400) rather than the actual position of the paddle.
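A corrected check, using the paddle’s actual position instead of a fixed band, could be as simple as this hand-written helper (the name and signature are mine, assuming the 5×80 paddles and 10 px ball from the spec):

```python
def hits_paddle(ball_x, ball_y, paddle_x, paddle_y,
                paddle_width=5, paddle_height=80, radius=5):
    """True when the ball (a circle of `radius`) overlaps the paddle's
    bounding box, based on the paddle's *current* position."""
    return (paddle_x - radius <= ball_x <= paddle_x + paddle_width + radius
            and paddle_y - radius <= ball_y <= paddle_y + paddle_height + radius)
```

With paddle1 at y = 100, a ball at (4, 140) is correctly detected as a hit, while a ball at (4, 340) sails past even though the hard-coded 300 < y < 400 check would have counted it.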
Interestingly, if we are specific about the speed at which the ball should be moving, e.g. 10 pixels per second, Codex nicely adapts the speed variables. But then it also generates different, and slightly better code in the event loop, since it uses the right height of the paddles (but still fixed offsets):
If we limit the context to just the essentials (i.e. stripping out the key handlers), the code looks much better, with variable paddle positions taken into account as well:
The problem, though, is that you have to try a lot of different things to finally get the right code out, and even then there seems to be no consistency in the approach to take. It feels a bit like Google’s old “I’m Feeling Lucky” button…
Finally, massaging the code generated so far a bit gives us an actual working version of the game of pong!
- Codex is really good at surfacing the right libraries / API calls to use. To some degree it knows which parameter values to use, although in some cases it gets it completely wrong. Developers using Codex need to carefully watch parameter values.
- Codex generates different code depending on your wording, even when the meaning in English is the same. This indicates that Codex doesn’t really understand what you are asking, but mostly looks things up from memory; if you pick wording X vs. Y, you are likely to be presented with a fragment of code biased towards project A vs. B. This makes me wonder whether this problem will become worse as model sizes grow: the more parameters, the more room for memorizing. This is a fundamental issue, since the trend in AI has always been to throw more resources (and data) at the problem to get better results.
- Since there is a limit to the amount of context you can pass into Codex and the number of tokens you can get out as a result, in any realistic program you will often need to distill the context down to the essentials pertaining to your next question, so that the model focuses only on that part. This takes some getting used to, and even then, what comes out sometimes depends on the order of the statements in the context…
- It is very obvious that a lot of memorization is going on, because Codex often generates much more than we explicitly asked for. For example, the whole hit-detection and bouncing logic was generated completely for free. But I want to be clear that I could not find any hard evidence that a fragment of code it generated was copied verbatim from some project it has seen before! It’s more likely that this is just statistics at play: if you have done X, then Y is the most likely next step. And in this case, that kind of memorization is a desirable “feature” of the AI model.
- Don’t assume the model knows (even basic) math! If you rely on exact values to be calculated, the model is not your friend. It will do its best to approximate values, but that only works for graphical applications or toy problems where accuracy may not matter that much.
- Expect average implementations to be predicted by the model. When looking more closely into the Python Game of Pong implementations on GitHub, you often see much more elaborate implementations, using better API calls, wrapping logic into classes/functions, and better algorithms (like bouncing the ball with speedups or more natural angles). This is where “traditional” developer support sites like Stack Overflow (SO) still have an edge over AI models like Codex. When you look something up on SO, you will often see people suggest better ways of doing it. Given that the AI is trained to produce “familiar” code (i.e., code that is most likely to appear in its training set), and that it is trained on a very large corpus of code where quality varies tremendously across projects, you can expect the AI to produce rather “average” solutions in some cases.
- A real AI co-pilot should not just be useful in an append-only mode, but should also be able to add stuff to existing code (e.g. within a function, a class or an event loop) or even update it. Note that we had to remove the event loop multiple times to have it be regenerated. This could of course again be a feature of the “app” surrounding the model, but does the model also support context following the to-be-predicted code?
All in all, not such a bad experience… for building a game. If you were to build anything serious, though, you might lose a lot of time figuring out where Codex “misses the ball”. In the end, your job becomes debugging someone else’s code, in this case code written by a machine, which may make very subtle mistakes no human would ever think of making. So it is unclear whether you would actually gain or lose time using an AI co-pilot.
That being said, I did manage to build the game of pong in little to no time, even though I had absolutely no experience with the pygame library. So for this toy example, of which there are a ton of implementations on GitHub already, it did make me more productive.
Please note: these views are my own and do not necessarily reflect the views or opinions of my employer
- Disclaimer: these experiments were performed in early October 2021, by directly calling the OpenAI API on the davinci-codex model in its default configuration. We did not have access to Microsoft’s GitHub Copilot tool, which may wrap additional logic around this model or may use different hyper-parameter settings. The experiments may not be exactly reproducible, since the model may have changed in the meantime or because of potential randomness in the predictions.
- New York Times headline — AI Can Now Write Its Own Computer Code. That’s good news for humans
- I’m using the term “memorizing” here, which should really be understood as “stochastic parroting”, as coined in the controversial paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, defined as “a way of haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning.”
- As an example: developers offering better solutions for collision detection using pygame: pong collision pygame or how do I detect collision in pygame
- In fact, the query 'pong pygame language:python' on GitHub results in 1105 repositories at the time of writing.