Two weeks ago, I became intrigued with the idea of doing some form of real-time lighting on the PICO-8, a tiny fantasy console. I figured that I would fiddle with it for a while and decide it’s impossible on the puny simulated CPU, but this turned out to be wrong — PICO-8 proved to be a pretty beefy machine after all! Working on this was so fun that I couldn’t stop pushing it, adding new features and optimizing the engine behind it until I got to this point:
I’ve been working on this for quite a while and happily tweeting about it, slowly turning this engine into a game. However, I recently got it to run at 60 FPS and started getting some disturbing responses.
Determined to prove I’m not a witch, I decided to release the magic incan… code as a cartridge, along with these totally scientific and completely non-esoteric explanations of how it all works.
Some things in here will be very specific to the engine, while other techniques will hopefully be more general and applicable to any piece of PICO-8 code you would like to heavily optimize. I tried to keep the post accessible, but I had to assume some familiarity with hexadecimal notation and bitwise operations, just to keep the already ballooning word count to a reasonable level.
When all you have is a palette, everything is a nail
In modern games, lighting is done on the GPU by shaders that have a 24-bit colorspace to work with and the horsepower to do gazillions of calculations per pixel. On the PICO-8, things are a bit more quaint — anything that has to do with manipulating color will usually boil down to one thing: a palette effect mapping the 16 colors to different ones. This is the way to do basic fades, and a lighting effect is just a more complicated fade. Instead of uniformly moving all pixels towards black the same amount, we pick that amount based on light level.
So, we know the basis of our effect will be setting up a few palettes — one for each lighting level we will support. For Dank Tombs, palette #1 is basically a palette that does nothing, since it represents “fully-lit”. Then, we have a number of intermediate palettes, with colors in each getting successively darker. The last palette will be for totally unlit regions — a palette that maps every single color to black.
I’ve chosen to have six lighting levels, including full brightness and pitch black. Every lighting level added is a significant performance hit — so six was chosen as kind of a sweet spot. The palettes are kept in sprite memory and read from them during initialization— that way they’re convenient to edit, and it’s easy to eyeball what color should go where for it all to work.
When it comes to using these palettes, there are basically two choices:
- For each thing to be drawn, set the right palette first (based on how it is lit), then draw it.
- Draw everything fully lit first, then manipulate pixels on the screen to apply the lighting effect.
The former option seems more attractive at first, since that’s exactly what PICO-8 supports out of the box through the
pal() function. The problem with that idea is that light level changes happen in the middle of objects — so either we couldn’t make it look quite right (supporting only one light level per object), or we’d have to draw each object multiple times with a masking/clipping setup reaching byzantine levels of complexity. Let’s not even think what would be needed to handle backgrounds —
map() is definitely not going to work when lighting changes once every tile.
We are left with the second option, which is refreshingly simple in comparison. We draw everything first just as we would normally, having the full comfort of using
spr() for their intended purposes. Once we’re done with that, we manipulate each pixel on the screen to apply the right palette.
On dividing, conquering, and getting sidetracked a few paragraphs in
Before we dive into coding, it’s worth taking a moment to consider how our filter is going to work. As a core concept, it will be a series of concentric discs centered on the light — each disc applying a darker lighting level as we go away from the center.
Writing code that does this in one fell swoop is kind of daunting, but every complex task can be broken down into simple ones. In graphics programming there is a standard approach to handling complex shapes — breaking them down into horizontal lines. Drawing a horizontal line is simple, so we can concentrate on implementing whatever effect we want within a tight loop. Meanwhile, the complex calculations involved into figuring out where exactly to draw the lines can be done in a separate place.
This is how the engine works, and part 1 of this series will concentrate on fading a single horizontal line as fast as possible, leaving the gruesome details of how to put them together for future posts. I’ll be using mostly pseudo-code here — for actual, final code, look up the
fl_blend() function in the cartridge.
If you want it done well, do it yourself
Before I went into an aside on general programming principles for three full paragraphs, we decided that we’re going to apply our effect after drawing everything onto the screen.
The problem with this approach is that built-in PICO-8 support for palettes cannot help us. If we had a function that worked like
spr() but copied data from screen memory, we could use that — but alas, there is no such thing. Woe us.
This means that we have to do all the work manually. Palettes just replace pixels of one color with a new color, so the straightforward way to apply one is something like that:
for each x, y we want to change:
local pixel = pget(x, y)
local mapped_pixel = palette[pixel]
pset(x, y, mapped_pixel)
That is very simple, and like most trivial algorithms, it’s dog slow. Applying an effect this way to more than a 32x32 region already makes us drop frames. That’s because
pget() have a lot of extra work to do: they have to clip the coordinates, apply the camera position, and they also have to deal with one extra thing — which I will now explain in more detail, as it will also be the idea behind our solution to the problem.
Going off the deep end: PICO-8 screen memory
PICO-8 simulates an old fashioned computing system. Our pixels don’t live somewhere in an anonymous screen buffer texture on a separate GPU card accessed via a 200MB driver. Instead, they are just chilling at address 0x6000 in our simulated RAM, waiting to be directly manipulated.
Since PICO-8 only has 16 colors available and 16*16=256, the designers of our imaginary hardware made the clever decision of storing two pixels in each memory byte. Each screen line is 64 bytes long, and each of those bytes stores two pixels as follows: byte = (left pixel) + 16 * (right pixel). It looks like this:
pset() has its work cut out for it — not only does it handle clipping/camera/palette, it actually has to go into the right byte and only manipulate half of it. That’s part of why it’s simulated to be quite slow, and why it would be slow if PICO-8 was real silicon.
Peeking and looking up
Fortunately, we can avoid all the extra work
pset() does — we can just go mad scientist on that thing and manipulate screen memory directly! This is made easy thanks to the uber-fast
If we think about it carefully, there is no reason our filter has to work one pixel at a time. A palette is basically a 16-entry table that says: for color X, use color Y instead. We can cook up a similar table, but one with 256 entries. That way each entry can handle an entire byte containing two pixels (like the
A8 representing the red-yellow pair in the image above) and apply the palette to both at the same time, returning a new byte with two new colors.
This is a very general technique called a look-up table, and various effects can be sped up by precalculating the results and just looking them up in the table at runtime.
Once we prepare our look-up table during initialization, based on our palettes in sprite memory, a first sketch of a solution would be something like this:
local start = calculate_address(x1, y)
local end = calculate_address(x2, y)
for address = start, end do
That’s much faster than
pset(), but since we’re mad scientists, we don’t want fast — we want crazy fast!
Doing more work to do less work
Just like before, the one-liner above has a place where a lot of extra work is hidden:
lookup[…] is secretly a performance hog.
In compiled languages like C, arrays are simply block of continuous memory. Accessing an element is simple: just take the address of the array, add the index (multiplied by element size), and voila!
In Lua, the table structure has to handle everything from simple arrays to complex nested objects. It doesn’t matter that we know our look-up table has only integers as indices. As far as Lua is concerned, we could change our mind at any time, say
lookup[“banana”] = “cucumber” and ruin its day. This means that Lua can’t use the simple array representation C does, relying on slower hashtables instead.
What are we to do as mad scientists? What mad scientists always do — implement it ourselves! There is a nice chunk of memory reserved for our convenience in the PICO-8 memory map, starting at 0x4300. If we keep our look-up tables in memory, we can replace
peek(tableaddr+index), which turns out to be much faster.
But we can do even better! Our tables are exactly 256 bytes long, so if we keep them at “round” (0x??00) addresses, we can get away with using bitwise OR (
bor() on PICO-8) in place of addition, taking advantage of that operation being even quicker.
After all is said and done, our loop now looks like this:
for screenaddr=start,end do
Sure doesn’t look like an improvement, but it’s not the first time optimization made something ugly.
Against all odds
We’ve got our filter working at ludicrous speeds, but as usual, there is one more problem. Screen coordinates are like people — about half of them are odd. When the line segment we want to draw starts or ends at an odd X coordinate, we can’t change the entire byte — we have to carefully change only the part describing the left (or right) pixel.
We could handle this inside the loop, but adding conditionals to the most-often repeated piece of code is right at the top of the list of big optimization no-no’s. This is why we handle the thorny stuff outside the main loop, carefully using bitwise operations to change only one half of the byte:
start = address of first full byte of our line
if x1 is odd:
local addr = start-1
local previous = peek(addr)
Similar handling is done for
x2 as well (but the unfortunate case is when
x2 is even). The real code has some extra magic in the condition in order to handle fractional coordinates as well, but the principle is the same.
End of the line
That’s the end of part 1. We are now able to darken a single horizontal line segment, which probably isn’t enough to impress people at parties yet.
But have no fear! The next post will be all about taking our horizontal line routine and putting the effect together. So, if you’re in the mood for more mad science, stay tuned for part 2 and look for it on my Twitter or on the lexaloffle.com forums. Let me know if you liked this part, and if you have any questions, feel free to tweet at me.
Until then, have fun hacking on the PICO-8!