Welcome to my write up of the IOCaste physics routine! What’s IOCaste? It’s an Atari 2600 demo that won first place at the Flashback 2019 old school demo competition!
You can check it out here:
The routine in question begins at 1:35.
Still no idea what I’m talking about? Well, you could google “demoscene” but if you need to do that this article is unlikely to be of interest to you.
Anyway! The physics routine. I’ve been asked about this a lot, so I’m going to attempt to explain the whole thing here.
The physics routine is split over three 4KB ROM banks, coming in at a total of 12KB.
This does not include the music, which has its own bank, or the text at the bottom, which is located in another bank altogether.
So each frame that you see is running code from five different banks.
Of the 12KB dedicated to the physics 3KB is code and the remaining 9KB is, yes, for those who have been asking — precalculated.
So how does it all work?
Lets start by looking at how each frame is rendered.
ATARI GRAPHICS BASICS
The Atari 2600 draws each raster line with a chip called the Television Interface Adapter (TIA). Calling it a graphics chip I think is a bit generous since it doesn’t even generate a frame of video signal. For any given state it only generates one line. The CPU needs to handle all the timing and tell it when to start and end each frame, timing the vertical blanking interval etc..
Each line consists of 160 visible pixel clocks (and an additional 68 invisible ones in horizontal blanking). There are exactly three pixel clocks per CPU cycle. A pixel clock is about twice as wide as it is tall, like multi-color mode on the C64. For our purposes we’ll assume that they’re exactly twice as wide as they are tall. It’s not really correct to say that one line consists of 160 pixels because there is no frame buffer and the TIA cannot just arbitrary draw what it likes at any pixel clock position. Rather it generates each line from six different graphics objects.
The two most complex graphics elements are two 8 bit “player” graphic registers. They’re not exactly sprites since the CPU has to manually load these two registers on a line by line basis to change them, but we’ll refer to them as such anyway. Each bit is shifted out of the TIA per pixel clock, so each bit is a pixel on the screen.
The other graphics object we need to know about is the playfield. The playfield is 20 bits wide and each bit is drawn over 4 pixel clocks. The astute reader will notice that this is only enough to fill half a raster line. The other half is drawn by either repeating or mirroring the first 20 bits. For our routine we’ll repeat. (It is possible to reprogram the registers in the middle of the scan line but we have enough going on as it is).
There are three other graphics objects, two “missiles” and a “ball” but we don’t use them.
So now that we know about the 2600 graphics objects that we’ll be using we can see how they’re used to construct on frame.
Interested in learning more? Here’s the Atari 2600 documentation: https://atariage.com/2600/programming/2600_101/docs/stella.html
ANATOMY OF A FRAME
This is a screen shot of the routine taken with the Stella emulator and set to use debug colors to make it clear what’s going on.
The pegs are generated using the playfield registers. As you can see, the left and right half of the screen are the same. Every 8 raster lines we write a different set of 20 bits into the playfield registers. There’s no more to it than that. The actual playfield layout is defined in a text file and a tool converts the text file into the bits that need to be written to the playfield registers.
Then there are three falling boxes. The state of each box is defined by three numbers: X and Y coordinates and an angle of rotation. Each XYA triplet is stored in a “box slot”. We can assume that the boxes are sorted from top to bottom. Additionally we are allowed to assume that at no point do any of the three boxes occupy the same raster line. We’ll return to how this constraint is guaranteed, but for now we’ll just assume that it is.
At first it might appear as though there are three 16 pixel wide sprites here. But the 2600 only has 8 pixel wide sprites so in reality we are seeing three pairs of 8 pixel wide sprites, positioned at three different x coordinates. The left and right half are colored differently here so you can see the two sprites being combined to form one box. So we have three boxes and each box takes up a 16x16 pixel grid requiring both 8 bit sprites per line.
Earlier on I said that each sprite pixel is twice as wide as it is tall. But this cancels out because we draw the entire screen in line pairs. Each pair of raster lines draws the same thing, giving us more time to calculate each pair and conveniently dealing with the aspect ratio issue.
So now that we know how each box is put to the screen we are interested in how the rotation is done.
PRECALCULATION NUMBER ONE
The box rotation is precalculated in the following way:
A C++ program generates a 16x16 grid and in the middle places an 11x11 pixel square. The size is selected such that the square can be rotated any angle and fix inside the 16x16 grid. The program then rotates the box 15 times to generate 16 frames. Each rotation increments by 5.625 degrees so 16 frames cover 90 degrees which gets us to where we started.
The grid is then split into left and right halves of 8 pixels each. Each set of 8 pixels fits in a byte, each box is 16 pixels high and 16 frames long so we can fit each half in exactly 256 bytes. Each half gets placed inside the ROM with a 256 byte alignment so it can be addressed with one byte. The second half follows the first so we can access each 8 pixel line segment’s complement by adding 256 bytes. All of this is contained in the same bank as the code.
DRAWING THE FRAME
Now we’re ready to put it all together and go through the complete process of how the screen is drawn.
Each line pair can be in one of four states. In each state it updates the playfield in exactly the same way so we can ignore that.
The pointer to the top most box slot is available to us, we’ll get to that later on.
The behavior of each line pair depends on the state as follows:
Sprite on: the sprite is drawing. We grab the left half of the line and shove it into sprite register 0 and do the same for the right half. We then advance our line pointer by one and check to see if we’ve drawn the 16th line of that frame. If we have then we change into the transition-to-off state.
Transition to off: We adjust our box slot pointer to point to the next box slot. Remember they’re guaranteed to be in order. The only complication is that any of the three box slots can be the top most one and they run backwards (we’ll get to why later). So we decrement and wrap around if we have to. At this point we have access to the Y coordinate of the next box which we’re going to need for the next state:
Sprite off: Here we are simply waiting for the raster beam to reach the Y coordinate of the next box which we obtained when we transitioned to off. Once we get there we need to switch to the final and by far the most complex state:
Transition to on: First the easy part. We grab the angle from the box slot and use it to point to the top line of the correct frame of the box animation.
Then comes the tricky part. We need to set the X coordinate. Those of you who have some experience coding 80s era machines might be thinking that you just take the X coordinate from the box slot, stick it in the sprite 0 X coordinate register, add 8 and stick that into the sprite 1 X coordinate register. And if we were coding an 80s era machine you would be totally correct. But the 2600 is 70s tech and we have some extra work to do. If you want to know why you should look up “polynomial counter”. Here we’re only interested in the how.
To position a sprite we need to wait for the raster beam to reach the position we want and then write into a strobe a register. Because there are three pixel clocks per CPU clock we could be off by one or two pixels. Fortunately there is another register than can do delta adjustments at the single pixel level. So we burn cycles until the right time and write into the strobe register for sprite 0 followed immediately by a write into the strobe register for sprite 1. The STA instruction takes three cycles which equals nine pixel clocks which means that sprite 1 will appear to the right of sprite 0 but with a one pixel gap. Since we need to apply an offset to the sprites for single pixel accuracy anyway this isn’t much of a problem — we just adjust the second offset by one pixel. The white lines on the left of the screen shot is where the 2600 makes the single pixel adjustment.
Finally we are in the sprite on state!
We are almost able to render our frame. But there is one more minor complication to deal with: the top sprite could be partially off screen. So we need to clip it by figuring out how many lines are off screen and adjusting the line pointer accordingly. Since we know sprites can’t overlap we can be guaranteed that only the top most sprite can be clipped.
So now we know how to take three XYA triplets from our box slots and render a frame. But where do those triplets come from? Well it isn’t actual real time physics simulation. It’s a bunch of precalculations.
PRECALCULATION NUMBER TWO
To calculate the actual physics I used a fantastic library called Box2D. Check it out, it does lots of cool stuff and pretty much every game you’ve ever seen that uses 2D physics is almost certainly using it.
Anyway, we’re going to simulate a bunch of box physics in Box2D and we’re going to store out each frame as an XYA triplet.
So lets begin with a quick calculation. If we store the state of one box with a 3 byte XYA triplet and we run at 50fps (this is a PAL demo) then we could animate that box at a rate of 150 bytes per second. If we dedicate two banks to store the animation for that box we could have 8192 / 150 = approximately 55 seconds of animation.
Not bad but we want three boxes so that gives us less than 20 seconds per box. Also we can’t quite fill each of the banks because of various overheads.
So we need some way to produce some extra variety. To start we’re going to generate 12 independent animation segments, each consisting of a box falling into the top of the screen and bouncing down through the pegs until eventually falling off the bottom. Why 12? Because that’s how many I got up to before running out of ROM space. The animation segments vary between about 2.5 and 4.5 seconds.
To generate the animation segments we need to configure Box2D. First the text file that defines the playfield is loaded in and each peg is set at the appropriate position as a static body.
Next up we need to create an initial condition for each animation and set it in motion. I did this by trial and error until I had enough animations that looked good and varied.
For each frame of the simulation the coordinates and angle of the box are taken from Box2D and quantized for the 2600’s resolution. The code that stores the animation to file only does so if the box would be visible. This means the initial conditions can be set to be off screen which makes it easier to find a good set of initial conditions. One a box is off screen the recording stops.
So now we have 12 independent animations segments. This data is stored in the two extra banks that the routine uses.
How exactly are we going to play them back to produce a lot of variety and, very importantly, ensure that no two boxes ever overlap. Well, for that we need some more precalculations:
PRECALCULATION NUMBER THREE
The third and final precalculation generates a “script”. It simply consists of sets of number that say “wait for X frames and then begin animation starting of address Y and going to address Z”.
We start by generating the sequence of animations. We could just use a random number generator to select numbers between 0 and 11. However people don’t find actual randomness very random. A truly random ordering would inevitably produce the same animation appearing 2, 3 or more times in a row. So to avoid this we use the CD shuffle method (generate the numbers 0 to 11 in a random order). We repeat the process until we fill enough time, just under a minute in our case.
Now we get to the final piece of the precalculation puzzle. We need to generate a 12x12 matrix that we index using the current animation and the next animation to obtain the number of frames that those two animations need to be separated by to ensure that no overlap can ever happen. But to keep boxes coming onto the screen as fast as possible we want to keep this number as low as possible. And to do that we use trial and error brute force. For each of the 144 combinations of animation pairs our program is going to do the following:
Start optimistic. Lets see if we can play our pair of animations zero frames apart. We work out the Y ranges of the first frame of both animations and check for an overlap. In this case there will be one. So we add an offset of one frame and try again and again. Eventually there will be no overlap. But that doesn’t mean there won’t be one in the future. So we advance both animations by one frame and do the check again. We might get through a dozen frames at which point the second box catches up and overlaps the first box. So we add another frame of offset and start all over again.
Eventually we will reach an offset such that the leading box falls off the bottom of the screen without any overlap being detected. And this offset goes into the matrix entry. Note that the matrix is not symmetrical. To see why imagine two boxes that just fall straight down, box A and B. The difference is that A is faster. If A goes first then B can go as soon as A drops a few lines. But if B goes first then A is going to have to give it a head start to make sure it doesn’t catch up. So our matrix entry for A,B is going to be different to B,A.
With our matrix and our CD shuffle randomization we can generate our script. We save out the delay time and the animation segment address. Since the animation segments are split into two banks we encode the bank in the high bit, which is ignored since we can only address up to 4K of ROM at a time. The playback code will use it to determine which bank to switch to.
So we have our precalculations and we have the ability to draw three boxes provided that they don’t overlap on the vertical axis. The only thing left is to update the box slots each frame.
The playback consists of two parts. The first part reads the script and sets up the animation segments. The second part updates the box slots each frame.
The first part runs a timer until the next box needs to be dropped. When the timer runs out it then checks to make sure that three boxes aren’t still on the screen, and if not it then grabs the next animation segment from the script and puts the correct animation offset into the box slot. This box is now the top-most box so this is updated for the render pass, since it must render in top to bottom order. The code then advances to the next box slot, wrapping if it needs to. The update adds the boxes in forward order, with the top most box being the most recently one added, and the renderer does the reverse, rendering the most recently added box first.
The second part updates each box slot every frame. It advances the animation pointer (checking to see if we’ve reached the end) which in turn gives us the location the XYA triplet.
Now that we know the triplet location we need to switch to the appropriate bank. You might think it would be a simple matter of writing some code to switch to the appropriate bank, read the required values and switch back. The problem with this approach is that as soon as we switch banks our code gets switched out as well and execution resumes at the equivalent address in the data bank. So we need to be a bit indirect.
We jump to one of two fixed addresses at the bottom of the main ROM bank, one for each of the two possible banks that contains our XYA triplet. The code here performs the bank switch. Then code in the destination bank takes over, grabs the correct values, jumps to another fixed address which switches us back to our main bank and into some code to return us back to the update code. All the code is carefully arranged so that when the banks switch, execution correctly resumes from the address of the next instruction in the new bank.
Now everything is updated, the render code will draw everything correctly and the routine is complete!
This routine a little unusual for me. Usually I try to write code that pushes the hardware to the brink.
But this one is more like a magic trick. The magician hasn’t really chopped anyone in half. It just looks like it. Similarly the Atari 2600 with its 128 bytes of RAM and budget version 6502 isn’t really simulating physics and resolving collisions, but it’s my hope that to those of you who watched the demo is actually looked like it.
It’s also by far the most elaborate bit of precalculation I’ve ever done. To generate 9KB I used a full featured off the shelf library and wrote 748 lines of code (just for the physics, excluding the other parts of the demo). The size of the precalculation code comes in at almost 18KB, so on average two bytes of code to precalculate one byte of data!