Writing a retro 3D FPS engine from scratch

EDIT: How to play: Install the TIC-80 virtual console on your computer. Launch TIC-80 and type the surf command. Select [tic.computer/play] and press Z to select it. Find FPS80 on the list and press Z to start. You can also play the game on the TIC-80 website (but sound is broken).

I always liked the classic first-person shooters of the 90s. I used to sit in front of my 386 playing Doom for hours, amazed at how someone could possibly write code that drew real-time 3D graphics on my screen at the amazing resolution of 320x200. I knew some programming (I had just started learning BASIC), so I realized that deep down it was all a bunch of math and bytes getting written to video memory. But arrays were a hard enough concept for me at the time, so I couldn’t even begin to fathom the complexity behind 3D rendering.

Back then, everyone wrote their 3D engines from scratch because there was no other way to do it. But nowadays, writing 3D rendering logic from scratch is probably a really bad idea. Really bad. Talk about reinventing the wheel! With so many 3D engines and libraries out there, far better tested and optimized than anything you could come up with, there’s no way any reasonable developer could justify writing their own home-grown engine.

Unless…

Suppose you could go back in a time machine to the 90s, a time before OpenGL and DirectX, before GPUs. All you had was the CPU and a screen full of pixels, and you had to write everything from scratch.

If that’s your idea of fun, you’re not alone: that’s precisely the kind of thing that you can do with a fantasy console like the TIC-80.

After I wrote 8-bit panda, my first game for that console (you can read about it in my previous article), I was looking for an idea for a new game. So I decided to take on the challenge of writing a simple 3D engine from scratch and make a minimal first-person shooter to go with it.

That’s how I started writing FPS80.

In this article, I’ll describe how 3D rendering in FPS80 works, and talk a bit about all the performance tradeoffs (and numerous horrible hacks!) to implement fast 3D rendering on this limited machine.

Basics of 3D graphics

If you already know how 3D rendering works, this section will be boring. Feel free to skip ahead to the next one!

A full explanation of 3D graphics is beyond our scope, but let’s give a quick summary as relevant to our usage.

The fundamental idea of 3D graphics is mapping 3D space into a 2D surface (your screen). In other words, one of the most basic things a 3D engine has to do is, given the (x, y, z) coordinates of a point in space, find the 2D coordinates (x, y) where that point should be on your screen.

To do this, 3D engines employ a series of transformations. A transformation is just a fancy way of saying “a mapping from one coordinate system to another.” They can represent motion (translation, rotation, scale) and projection. They are normally represented by matrices.

To calculate the screen position of a point, we first transform it by its model matrix M, which brings it to world space. Then we multiply by the view matrix V to account for the camera’s position, which brings it to camera (or eye) space. After that, we apply the projection matrix P, which does the perspective transformation, responsible for making closer objects look bigger and farther objects look smaller. After that comes the perspective divide, which brings us to actual screen (viewport) coordinates:

p’ = P * V * M * p

Applying this sequence of operations to a point is often called “projecting” the point. But scenes are made of objects, not points, so this is not all that a 3D engine does.
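To make the pipeline concrete, here’s a minimal Lua sketch of projecting a point with a combined matrix: multiply, divide by w, map to the viewport. This is purely illustrative (the names and the 240x136 viewport mapping are mine); as we’ll see later, FPS80 doesn’t actually do full matrix math.

```lua
-- Minimal sketch of the projection pipeline: multiply a 4x4 matrix by
-- a point, then do the perspective divide. Illustrative only.
local function matMulVec(m, v)
  local r = {0, 0, 0, 0}
  for i = 1, 4 do
    for j = 1, 4 do
      r[i] = r[i] + m[i][j] * v[j]
    end
  end
  return r
end

-- Project a point p={x,y,z,1} with a combined matrix pvm (P*V*M),
-- mapping the result onto a 240x136 screen.
local function project(pvm, p)
  local clip = matMulVec(pvm, p)
  local ndcx, ndcy = clip[1] / clip[4], clip[2] / clip[4] -- perspective divide
  return 120 + ndcx * 120, 68 - ndcy * 68                 -- viewport transform
end
```

With the identity matrix, a point at the origin lands dead center of the screen (120, 68), which is a handy sanity check.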

A 3D object is made of polygons. To draw it, the engine normally divides each polygon into triangles, then projects each vertex of each triangle using the formula above, then draws the resulting screen-space triangles to the screen. This process is called rasterization: it’s the conversion from something that’s vector-based (a triangle) to something that’s raster-based (the actual pixels on the screen). But drawing the triangles in a random order wouldn’t give good results, because you might wind up with a triangle that’s farther from you drawn on top of a triangle that’s nearer, so the engine has to have a strategy to avoid that. That’s normally accomplished either by sorting the polygons beforehand or by having a depth buffer, which records the depth of each pixel drawn to the screen, so you know when to draw a pixel and when not to.
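The depth-buffer idea fits in a few lines of Lua. This is a generic sketch of the concept, not any engine’s actual code:

```lua
-- A tiny depth-buffer sketch: remember the depth of every pixel drawn
-- so far, and only draw when the new pixel is nearer. Illustrative only.
local function makeZbuf() return {} end

local function plotWithDepth(zbuf, screen, x, y, z, color)
  local key = y * 240 + x        -- flatten (x, y) into one table key
  local old = zbuf[key]
  if old == nil or z < old then  -- depth test: nearer (smaller z) wins
    zbuf[key] = z
    screen[key] = color
    return true
  end
  return false                   -- occluded by something nearer
end
```

A far pixel drawn after a near one at the same position simply gets rejected, so draw order stops mattering.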

Add texture mapping and lighting, and that’s a whole other layer of math that the engine has to do to figure out what color the final pixel should be.

Do we need full 3D?

I actually started out implementing the full workflow for 3D graphics I described above, but realized that this was going to be too slow on the TIC-80. Just like on the old PCs, all the math and drawing can easily weigh down the CPU and quickly make the game laggy. I wrote a sample that drew just 4 texture-mapped rotating cubes, and even that was enough to sink the frame rate below 10fps. Clearly going “full 3D” wasn’t going to work. In particular, the bottleneck is the fill rate: the problem is not so much the math to project points, it’s the time it takes to draw all the pixels of the final triangles to the screen, especially when you have to do it multiple times because several polygons occupy overlapping regions of the screen.

But do we really need full 3D? Actually, it turns out that a lot of the old games of the 90s don’t do “full 3D”. Instead, they implement some clever mechanisms to restrict the graphics such that they’re easier and cheaper to render.

So here are the restrictions I imposed:

  • The entire level is flat: no stairs, pits, balconies, second stories or anything of that sort. That is, the floor and ceiling are at fixed heights.
  • All walls are the same height, extending from floor to ceiling. And they are all opaque.
  • The player can turn to face any direction, but can’t tilt their head up or down, or tilt it sideways. In other words, the camera only has one degree of rotational freedom: the heading or yaw.
  • All objects are drawn as “billboards”, that is, 2D images painted in 3D space such that they always face the camera.
  • There are only point lights. One of them is always on the camera (like a hand-held torch that the player is carrying). Others can be created briefly for certain effects like explosions.

How do these restrictions help us?

The fact that the floor and ceiling are at fixed heights and that every wall goes from floor to ceiling (and that the player can’t look up/down) creates an extraordinary property that will immensely help us: at every X position on the screen, there will be at most one wall. In other words, all the visible walls occupy the middle part of the screen and you never have two different walls occupying the same X coordinate:

Let’s highlight the walls in different colors so it’s easier to see:

As you can see, there are 6 walls in this shot (the door is also a “wall”), and they are all neatly arranged side by side on the screen: each X coordinate on the screen corresponds to exactly one wall.

This is very useful to us because we can render the walls in two steps:

  • Figure out which wall should end up on each X coordinate (in other words, build a map from X coordinates to walls).
  • Iterate (only once!) over all the X coordinates, drawing the correct piece of the wall for each.

For the entities (objects) like monsters, projectiles, columns, fountains, trees and the like, we will use a common technique called billboarding. Instead of having an actual 3D object representing the entity, we will simply draw a flat 2D image at a 3D position, like a cardboard cutout. This is very cheap to draw too.

Our projection routine

Even in our vastly simplified 3D world, we ultimately still need to project points from 3D space to 2D space. However, our constraints make the math significantly easier and faster to process. Instead of doing full matrix multiplications, we can actually get away with something a lot simpler: we figure out the fixed parts of the matrix math (the ones that don’t depend on the player’s or object’s position), and then just bake the pre-computed numbers into the code. We’ll give a quick overview, but if you’re interested in the full math, here’s the full derivation.

Remember the formula is:

p’ = P * V * M * p

Where p’ is the screen coordinates (output) and p is the world coordinates (input). In our simple model, M is the identity matrix (everything is represented in world space), so this becomes:

p’ = P * V * p

The view matrix V can be computed from just the translation and rotation of the camera, which is restricted to rotations about the Y axis. If the rotation is R and the translation is T, then:

p’ = P * R * T * p

The projection matrix P can be pre-computed offline because it depends only on the field of view and the near/far clipping planes, which are fixed throughout the game. Boiling down all the math, we end up with a pretty straightforward (if extremely hard-coded) projection function:

function S3Proj(x,y,z)
local abs=math.abs -- alias for brevity
local c,s,a,b=S3.cosMy,S3.sinMy,S3.termA,S3.termB
-- Rotation, translation and projection baked into a few multiplies:
local px=0.9815*c*x+0.9815*s*z+0.9815*a
local py=1.7321*y-1.7321*S3.ey
local pz=s*x-z*c-b-0.2
local pw=x*s-z*c-b
-- Perspective divide, then map to the 240x136 viewport:
local ndcx,ndcy=px/abs(pw),py/abs(pw)
return 120+ndcx*120,68-ndcy*68,pz
end

Notice all the magic numbers: 0.9815, 1.7321, etc. These all come from the pre-baked matrix computations and have no intuitive meanings by themselves (which is why I didn’t even bother to make them constants).

The variable terms cosMy, sinMy, termA and termB are all computed from the current camera translation and rotation before this function is called, since they only need to be computed once per frame rather than once per point.
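The article doesn’t show that per-frame setup, but it might look something like this. To be clear, this is my reconstruction, not FPS80’s actual code: the formulas for termA and termB are guesses, chosen so that a point at the camera’s own position maps to zero before projection (which is what the translation part of the view matrix should do).

```lua
-- Hedged reconstruction of the per-frame camera setup for S3Proj.
-- ex, ez: camera position; yaw: the camera's heading (Y-axis rotation).
local function setupCamera(S3, ex, ez, yaw)
  S3.cosMy, S3.sinMy = math.cos(yaw), math.sin(yaw)
  -- Translation folded into the rotated frame (my guess at how
  -- termA/termB are derived from the camera position):
  S3.termA = -(S3.cosMy * ex + S3.sinMy * ez)
  S3.termB = S3.sinMy * ex - S3.cosMy * ez
end
```

With these definitions, plugging the camera’s own position into the rotated-and-translated terms of S3Proj yields zero, as you’d expect for a point sitting exactly at the eye.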

Rendering the walls

For the first step, we will use an array, which we call the H-Buf (horizontal buffer), to indicate what will be drawn at each X coordinate of the screen. “H-Buf” isn’t standard nomenclature, it’s just something I made up. It’s called H-Buf because it’s H (horizontal) and it’s a Buf (buffer). I’m not terribly creative with names.

The TIC-80 screen has 240 columns (240x136), so our H-Buf has 240 positions. Each position corresponds to an X coordinate of the screen, and contains the information about what piece of what wall will be drawn there. To make things simpler, instead of saying “1-pixel wide piece of wall”, we will call this a “wall slice”. So each position in the H-Buf gives the information about what wall slice will be drawn at each X coordinate on the screen:

  • wall (object): a reference to the wall that will be drawn on this slice.
  • z (float): the screen-space z coordinate (depth) of this slice.
  • and more stuff… but not yet!
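Setting up such a buffer is straightforward. Here’s one possible way to do it (a sketch; the real buffer in FPS80 carries more fields, as hinted above):

```lua
-- Build an empty H-Buf: one slot per screen column, each holding the
-- wall to draw there and its depth. Illustrative sketch only.
local SCREEN_W = 240

local function makeHbuf()
  local hbuf = {}
  for x = 1, SCREEN_W do                   -- 1-indexed, like Lua tables
    hbuf[x] = {wall = nil, z = math.huge}  -- "infinitely far" = empty
  end
  return hbuf
end
```

Initializing z to infinity means the first wall slice tested at any column automatically passes the depth test.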

At least for the first step, this is all we really need! So we iterate over all walls in the level (we’ll talk about how to make that more efficient soon). For each wall, we:

  • Use the projection function on its four corners to determine where it is on the screen: left x, right x, left-top y, left-bottom y, right-top y, right-bottom y, left z (depth), right z (depth).
  • Iterate over every X coordinate (from left_x to right_x) of the wall, and fill the slice of the wall in the H-Buf, whenever its depth is smaller than the existing slice at that position in the H-Buf.

The test for “this slice has smaller depth than the existing slice” is called a depth-test. This is what enables us to render the walls in the correct order — the ones nearer to the player should occlude the ones that are farther. Here’s the relevant piece of code:

function _AddWallToHbuf(hbuf,w)
local startx=max(S3.VP_L,S3Round(w.slx))
local endx=min(S3.VP_R,S3Round(w.srx))
local step
local nclip,fclip=S3.NCLIP,S3.FCLIP
startx,endx,step=_S3AdjHbufIter(startx,endx)
if not startx then return end
for x=startx,endx,step do
-- hbuf is 1-indexed (because Lua)
local hbx=hbuf[x+1]
local z=_S3Interp(w.slx,w.slz,w.srx,w.srz,x)
if z>nclip and z<fclip then
if hbx.z>z then -- depth test.
hbx.z,hbx.wall=z,w -- write new depth and wall slice
end
end
end
end

After we have done this for every wall, the H-Buf will correctly represent what wall slice we need to render on each X coordinate, so to render them we can just iterate and render the correct portion of the correct wall. This is what makes this approach fast: when we draw the walls, there’s no hesitation and no wasted effort: we know exactly what to draw where, and only touch exactly the pixels that we need to touch. We never draw some part of the wall and then another wall on top of it (overdraw).

Here’s the rendering code. Note how it’s pretty straightforward: for every X, just render a column of a texture at that X coordinate, nothing else:

function _S3RendHbuf(hbuf)
local startx,endx,step=_S3AdjHbufIter(S3.VP_L,S3.VP_R)
if not startx then return end
for x=startx,endx,step do
local hb=hbuf[x+1] -- hbuf is 1-indexed
local w=hb.wall
if w then
local z=_S3Interp(w.slx,w.slz,w.srx,w.srz,x)
local u=_S3PerspTexU(w,x)
_S3RendTexCol(w.tid,x,hb.ty,hb.by,u,z,nil,nil,
nil,nil,nil,w.cmt)
end
end
end

Get twice the frame rate with this one weird trick

Even though the TIC-80’s screen is just 240x136, filling it still takes quite a bit of CPU power, even if we don’t waste any time at all and know exactly what to draw. Even the fastest rendering algorithm would only hit about 30fps on simple scenes, and drop significantly lower on more complex ones.

How do we get around this problem?

We have to question our assumption that we need to fill the entire screen. We don’t. The human eye is very forgiving of things happening in quick succession (and people who play retro games are very forgiving of slight visual glitches!), so we can speed up our rendering by only filling half the screen on each frame.

We do this:

  • On even frames, only draw the even-numbered X coordinates.
  • On odd frames, only draw the odd-numbered X coordinates.

This is also known as interleaved rendering. This means that on each frame we’re actually only rendering 120x136 pixels, not the full 240x136. We don’t clear the screen at each frame, so the pixels we don’t draw just remain as left-overs from the previous frame.
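The column-selection logic might look something like the sketch below. This is my guess at the spirit of _S3AdjHbufIter, not its actual code: adjust the start of the range to the current frame’s parity, then step by 2.

```lua
-- Interleaved rendering sketch: on even frames visit even columns,
-- on odd frames visit odd columns. Hypothetical helper, names are mine.
local function interleavedRange(startx, endx, frame)
  local parity = frame % 2
  if startx % 2 ~= parity then
    startx = startx + 1          -- snap to this frame's parity
  end
  if startx > endx then return nil end -- nothing to draw this frame
  return startx, endx, 2         -- use as: for x=startx,endx,2 do ... end
end
```

Each frame therefore touches only half the columns, and the untouched half simply keeps last frame’s pixels.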

This causes some visible glitching, especially during fast movement, but it actually works for the game instead of against it, giving it a retro TV-set feel.

So this is what we actually render each frame (at least, this is what it would look like if the other columns weren’t already filled with the pixels from the last frame):

One frame of interleaved rendering. Unrendered columns left blank for clarity. In reality, they would be filled with the last frame’s rendering.

Rendering a wall slice

Ok, so now we’re down to rendering a wall slice. We already know which slice to render, and we already know where. All we have to do is figure out what pixels to draw on that column, and what colors they should be. How do we do that?

A wall slice (1-pixel-wide vertical column of a wall).
  • Calculate the light incident on the slice by taking the light sources and distances into account.
  • Calculate the horizontal texture coordinate for the wall slice (with perspective correction).
  • Calculate the top and bottom screen Y coordinate for the slice, by interpolating the end points of the wall.
  • Fill each pixel from top to bottom, which means figuring out the vertical texture coordinate (affine mapping), sampling the texture, then modulating by the light, and finally writing the pixel to the screen.

We can get away with affine mapping (as opposed to perspective-correct mapping) for the vertical texture coordinate because the walls are always upright with respect to the screen (this is guaranteed by the fact that the player can’t tilt their head up and down, or sideways). If this were not the case, we’d have to implement perspective-correct texture mapping for both horizontal and vertical texture coordinates, which would be slower.
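Here’s just the affine vertical mapping from the last step, in isolation. This is a simplified stand-in for _S3RendTexCol: lighting, stencil and clipping are omitted, the texture is a plain Lua table of rows, and all names besides _S3RendTexCol are mine.

```lua
-- Render one wall slice with affine vertical texture mapping (sketch).
-- tex: texture as a table of rows; texH: texture height in pixels;
-- topY..botY: screen extent of the slice; u: horizontal texture column;
-- col: output table collecting one color per screen row.
local function renderTexCol(tex, texH, topY, botY, u, col)
  for y = topY, botY do
    -- Affine v coordinate: plain linear interpolation down the column.
    local v = (y - topY) / math.max(botY - topY, 1)
    local ty = math.floor(v * (texH - 1)) + 1
    col[y] = tex[ty][u]  -- sample the texture (lighting omitted)
  end
end
```

Because v is a straight linear ramp in screen space, this only looks correct when the wall is upright on screen, which is exactly what the no-tilt restriction guarantees.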

Rendering entities (enter the stencil buffer!)

Ok, walls are fun, but it wouldn’t be a game without some enemies. As we said before, the enemies are rendered as billboards, that is, floating images that always face the player.

Entities drawn as billboards (always facing the camera).

These images are not plain 2D though: they are drawn at a particular point in space, so they have depth. If an enemy goes behind a wall, the wall should be rendered in front of the enemy (the enemy should be occluded by the wall), not the other way around.

How do we achieve this effect?

If we wanted to keep things very simple, we could use the painter’s algorithm (which is more like the lack of an algorithm): just draw things back to front, and naturally the stuff in the front will be drawn over the stuff in the back, resulting in the right effect.

So what’s wrong with that? It’s simple to understand. It’s easy to write. But… it’s very slow, especially on the TIC-80, which is fill-limited. Doing that means you would be wasting a lot of effort drawing beautifully textured and lit pixels that will only be clobbered later by something else that you’ll draw on top.

This is where we have to introduce a new buffer: the stencil buffer. In 3D rendering, the stencil buffer is a 2D surface, as large as the screen, that indicates where you can and can’t draw. It’s like a filter. Each pixel in the stencil buffer can be “on” to mean “blocked, don’t draw”, or “off” to mean “free, please draw”. Depending on the implementation, the meanings might be switched, but in our code we use “true” to mean “occupied, don’t draw”.

When something “reads stencil”, it means that, before drawing, it will test to see if the stencil buffer is “on” at each position, and won’t render over an existing pixel if so.

When something “writes stencil”, it means that whenever it renders something at a given screen position, it writes to the stencil buffer to indicate that that position is now filled, and shouldn’t be drawn again.
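A minimal sketch of the read/write stencil operations (my own illustration; FPS80’s actual buffer may be laid out differently):

```lua
-- Stencil buffer sketch: one boolean per screen position, with "true"
-- meaning "occupied, don't draw" (as in the article). Illustrative only.
local function makeStencil() return {} end

-- Returns true if the caller may draw at (x, y). Reads the stencil,
-- and optionally writes it to mark the position as occupied.
local function stencilPix(sten, x, y, write)
  local key = y * 240 + x
  if sten[key] then return false end -- read stencil: blocked
  if write then sten[key] = true end -- write stencil: now occupied
  return true
end
```

An entity renderer would call this with write=true, so its pixels become final; a later pass (like the walls) calls it with write=false and simply skips anything already claimed.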

So here’s what we do. Before we render the walls, but after we compute the H-Buf, we draw the entities. The full sequence is:

  1. Clear stencil buffer.
  2. Compute H-Buf.
  3. Render entities from near to far (read/write stencil, depth-test against H-Buf).
  4. Render walls using H-Buf (read stencil).

Every pixel of every entity we render will write the stencil buffer, preventing anything else from rendering over it, which is why we have to render from near to far. Whatever we draw is final: no overdraw, and we never waste a pixel we have put on the screen.

When drawing the entities, we know their screen-space depth and we know the depth of the wall at that X position (courtesy of H-Buf!) so we can also depth-test that column. This means that we will correctly refrain from rendering any parts of an entity that will be later hidden behind a wall (even though, at that point in the rendering, the wall isn’t actually there yet!). Again, we don’t want to waste a single pixel. They are expensive.

Here’s an example scene with 4 billboards occluding each other, and the order in which they are rendered. First the fountain (#1, nearest), then the tree (#2), then the green monster (#3), then the orange monster (#4, farthest).

What about the floor and ceiling?

The floor and ceiling are rendered as the last step in the process, because they are the only pixels left over after rendering the entities and the walls. The ceiling is easy: it’s just solid black, we just fill those pixels in by looking at the H-Buf and figuring out where the wall starts. Every pixel above that is black (but we still check the stencil buffer). Done.

For the floor, it’s more complicated because we have to apply the lighting effect: we want the parts of the floor that are farther from the player to be dimmer.

But to compute the lighting, we have to know the depth of each pixel on the floor. Figuring this out for every pixel is actually surprisingly easy, because we can work backwards from the screen space coordinates to find the world space coordinates, and figure out the distance from the player from that. And what’s more: we can precompute everything by hand and bake those magic numbers in, so the routine that does this is actually pretty simple:

function _S3FlatFact(x,y)
local z=3000/(y-68)
return _S3LightF(x,z)
end

Given x,y, the screen coordinates of a pixel on the floor, the corresponding Z coordinate on the floor is just 3000/(y-68). I have no idea why, that’s just what came out of the pen-and-paper math (it makes some intuitive sense, though: the floor is at a fixed height, so how far a pixel sits below the horizon line at y=68 depends only on its depth). So we use that in _S3LightF to compute the amount of light incident at that spot, and modulate the floor color appropriately.

Here’s what the light pattern on the floor looks like. Notice how it gradually gets darker as we get far from the player:

Additional Point Lights

We allow for additional temporary point lights for lighting effects like when the player throws a flame orb and it explodes:

The principle is the same: we compute the screen-space distance from the point we’re drawing to the secondary light source (the explosion), and add those contributions.
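As a rough sketch of that idea (heavily hedged: the names and the falloff curve here are made up, and the real math lives in _S3LightF):

```lua
-- Hedged sketch of combining the player's torch with a secondary
-- screen-space light. Linear falloff chosen for simplicity; FPS80's
-- actual curve may differ.
local function falloff(dist, radius)
  return math.max(0, 1 - dist / radius) -- fades to 0 at the radius
end

local function totalLight(torchLevel, px, py, lx, ly, radius)
  local dx, dy = px - lx, py - ly
  local d = math.sqrt(dx * dx + dy * dy) -- screen-space distance to the light
  -- Contributions add up, clamped to full brightness:
  return math.min(1, torchLevel + falloff(d, radius))
end
```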

Note that we don’t use this effect for permanent lights (torches on the wall?) because it’s expensive, and also not very robust: since it’s computed in screen space, it doesn’t deal well with rotation and translation of the camera, and it has nasty singularities and divisions by zero when a light gets too close to the player.

Color modulation and dithering

The TIC-80 only allows for 16 colors, so how can we create lighting effects, making pixels lighter or darker?

The technique I used was to create 3 different “hues” that have 4 different shades, going from dark to light. These hues can be modulated by selecting one of these 4 shades of each, plus black (color 0) for the darkest. So we end up with 3 “ramps”: gray, green and brown:

clrM={
-- Gray ramp
{1,2,3,15},
-- Green ramp
{7,6,5,4},
-- Brown ramp
{8,9,10,11}
},

These are just defined as arrays, so that when we’re trying to modulate a color by a certain light amount, all we have to do is look up the color and figure out what other shade on the same “ramp” we should pick.
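The lookup might go something like this (a sketch of the idea, with the ramps re-declared for self-containment; the clamping details are my own choice):

```lua
-- Modulate a color by a light level in [0,1]: find the ramp the color
-- lives on, then pick the shade matching the light. Illustrative sketch.
local clrM = {
  {1, 2, 3, 15},  -- gray ramp
  {7, 6, 5, 4},   -- green ramp
  {8, 9, 10, 11}, -- brown ramp
}

local function modulate(color, light)
  for _, ramp in ipairs(clrM) do
    for i = 1, #ramp do
      if ramp[i] == color then
        if light <= 0 then return 0 end -- fully dark: black (color 0)
        local shade = math.min(#ramp, math.max(1, math.ceil(light * #ramp)))
        return ramp[shade]
      end
    end
  end
  return color -- not on any ramp: leave unchanged
end
```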

What about in-between colors?

For that, I adopted a strategy used by old graphics hardware: dithering. Since we can’t have a single pixel of an intermediary color, we draw a pattern where some pixels have one color and some pixels have another color, and the net effect to the eye is that we get the average color:
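The simplest form of this is a checkerboard: alternate two shades on even and odd pixels. A sketch (illustrative only, not FPS80’s actual dither code):

```lua
-- 2x2 checkerboard dithering sketch: alternate two shades by pixel
-- parity so the eye averages them into an in-between color.
local function ditherPick(x, y, darkShade, lightShade)
  if (x + y) % 2 == 0 then
    return darkShade
  end
  return lightShade
end
```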

Particle effects

To give the player some nice visual feedback when defeating monsters or blasting things, we use simple particle effects:

These are just little rectangles simulated in world space and transformed to screen space at render time. For speed, they aren’t even clipped or stenciled, they are just drawn on top of everything else (minor deviation from the “don’t waste pixels” principle). Luckily for us, interleaved rendering disguises the fact they are rectangles: since they move so fast, and get rendered at different positions every frame, they get torn apart by the interleaving and look more “organic”.

Conclusion

In this article we have briefly examined how rendering works in FPS80. I didn’t have any previous experience writing 3D engines from scratch, so I’m sure I could have done many things better, and my explanations are probably wrong in many aspects. Please, feel free to correct me (contact me on twitter).

Developing my own 3D rendering logic was entertaining and instructive. The need to squeeze every single bit of performance out of the algorithms was particularly fun, and gave me a (small) window into what it was like to create the 3D engines of the 90s!

The source code for FPS80 and the rendering engine is available on github. Please feel free to reuse in your own projects!