Simulating an 8-bit 6502 microprocessor chapter 5: ray-tracing is assembly language

Linear algebra is not welcome here

William Seymour
9 min readJun 17, 2024

This is the fifth chapter in an odyssey that encompasses:

  1. Using test-driven development (TDD) to create an instruction-level simulation of the 6502 microprocessor.
  2. Reverse engineering the keyboard input and graphics output of an Apple 2 computer (a 1970s computer with the 6502 at its core).
  3. Combining the pieces into an emulation of the Apple 2 computer accurate enough to run real 1970s software.
  4. Rewriting the simulation in C++ to make it 500x faster than it was in reality.
  5. Testing that speed with a ray-tracing graphics engine written in 6502 assembly.
A sphere rendered with ray-tracing in 6502 assembly code, generated in 2–3 seconds @300Mhz

I: First forays in assembly

After drastically speeding up my 6502 to run at 100s of MHz (instead of the 1Mhz it originally achieved), I found myself wondering what this could enable. What would computer history have looked like, had 8-bit processors reached such speeds in the 1970s? The one way to find out, I thought, would be to try and build something that would have blown an original Apple 2 user’s mind.

This would mean coding in assembly language. Assembly is hard. You are programming at the instruction-level, telling the cpu exactly what to do each clock cycle. It doesn’t help that, in this language, ChatGPT is completely illiterate.

Below you can see a subroutine (essentially a function) that loops over each pixel in a matrix. It is equivalent to the preceding function in Python:

# Python 2D matrix loop

def run():
for row in rows:
for col in cols:
draw_pixel()
; Assembly 2D matrix loop

Run:
lda #$00
sta RowCounterAddress
sta BlockCounterAddress
sta ThirdCounterAddress

ForEachRow:
lda RowCounter
cmp #MaxRows
bcs EndRowLoop

jsr ResetColOffsetIndex

ForEachCol: ; each col is a pixel - 7 per address
; jsr DrawPixel

; Increment col offset index
lda Temp2 ; Increment col offset index with carry
clc
adc #$05
sta Temp2
lda Temp1
adc #$00
sta Temp1

lda ColCounter
clc
adc #$01
sta ColCounter
bcc ForEachCol ; Hard-coded 256 columns

AtEndOfRow:
; Increment row offset index
lda Temp4 ; Increment row offset with carry
clc
adc #$05
sta Temp4
lda Temp3
adc #$00
sta Temp3

; Reset col counter
lda #0
sta ColCounter

; Increment row counter
inc RowCounter
lda RowCounter

jmp ForEachRow

EndRowLoop:
rts

To get a feel for things, I started out trying to get a white square to bounce around in a field of static. This was surprisingly challenging, but I started to get my head around things and decided to raise my ambitions.

II: Towards ray-tracing

In a previous project, I built a ray-tracing graphics engine, and I wanted to see how hard it would be to get a basic version of it working on the Apple 2. Ray-tracing requires lots of floating-point arithmetic, so I would first need to figure out how to do that.

Later revisions of the Apple 2 included floating point routines in ROM, and these are indeed used by Basic. They are theoretically accessible to the assembly programmer, but I could find no documentation — only this short historic pamphlet, apparently the result of some forensic engineering:

Using these functions is pretty fiddly. To multiply two floats together, you have to do set up some pointers, put some values into memory, and then do something like this:

ldy TempF1High
ldx TempF1Low
jsr MOVMF ; Pack FAC to address
lda TempF1Low
jsr FMULT ; Unpack and multiply
ldx TempF1Low
jsr MOVMF ; Pack

Unfortunately, I found that a lot of these functions work in some unexpected, even misleading, ways. This, for instance, is supposed to be a sine wave:

After a lot of testing, I was able to redocument these subroutines and had uncovered enough functionality to do proper floating point arithmetic.

III: A new graphics card

You’ll remember from chapter 4 that high-res graphics mode is far from ideal for ray-tracing. We want to be able to control every pixel independently, but on the Apple 2 this isn’t really possible. One reason is how much memory it would require — an 8-bit, or 256-value, greyscale bitmap comprising 192 rows of 256 columns would take up 75% of the total addressable space of the cpu. In other words, it’s not practicable.

I solved this problem by inventing my own graphics card standard. It’s very simple, and would have been entirely feasible in the 1970s; it’s just that no-one would have made it because reducing memory usage was far more important than creating simple APIs.

My graphics card is essentially a separate memory bank that holds the 192x256 bitmap, and is accessed by writing to a few memory addresses, exactly how a standard add-on card for the Apple 2 works. To use my graphics card, all you need to do is provide an address and a value. Writing the value triggers the card to put that value in its address.

Now with our imaginary graphics card installed, and using floating point arithmetic, we can draw smooth greyscale images!

Run:
; initialise graphics card
lda #$00
sta VCARDLOW
sta VCARDHIGH

ForEachRow:
lda RowCounter
cmp #MaxRows
bcs EndLoop

ForEachCol:
jsr DrawPixel
lda ColCounter
clc
adc #$01
sta ColCounter
inc VCARDLOW
bcc ForEachCol ; Hard-coded 256 columns

lda #0
sta ColCounter

inc RowCounter
inc VCARDHIGH

jmp ForEachRow

EndLoop:
rts

IV: Ray-tracing

I set myself the goal of rendering a single sphere with a single light source. To start, I prototyped a function in Python and made sure it was as simple as possible. The algorithm goes something like this:

  • For each pixel, we fire a ray
  • The ray has an origin
  • The ray has a direction
  • A sphere has a centre
  • A sphere has a radius
  • Does the ray hit the sphere?
  • If it does, where is the point it first hits (not the point it exits)
  • What is the normal vector of that point (ie what direction is that point on the sphere facing)?
  • How close to the light vector is the normal vector? Or, how much light hits that point on the sphere.

Keen eyes will notice a couple of square root operations avoided. Square roots have a HUGE penalty in this architecture. A single extra square root operation visibly slows things down.

import numpy as np
import matplotlib.pyplot as plt

def ray_sphere_intersection(ray_origin, ray_direction, sphere_center, sphere_radius):

# Convert inputs to numpy arrays for vector operations
O = np.array(ray_origin)
D = np.array(ray_direction)
C = np.array(sphere_center)
r = sphere_radius

# Normalize the direction vector D
# D = D / np.linalg.norm(D) # removing sqrt seems to have no effect

# Compute coefficients of the quadratic equation
a = np.dot(D, D) # 1 if D is normalized - corrects for non-normalised
OC = O - C
c = np.dot(OC, OC) - r**2
b = 2 * np.dot(D, OC)

# Compute the discriminant
discriminant = b**2 - 4 * a * c

# Check for real solutions
if discriminant < 0:
return None
else:
# Compute the two points of intersection t1 and t2
sqrt_discriminant = np.sqrt(discriminant)

t1 = (-b - sqrt_discriminant) / (2 * a)
t1 = (-sqrt_discriminant -b) / (2 * a)

t2 = (-b + sqrt_discriminant) / (2 * a)

# Points of intersection
P1 = O + t1 * D
P2 = O + t2 * D

return {
"t1": t1,
"Point 1": P1,
"t2": t2,
"Point 2": P2,
"normal": C - P1,
}

# Example usage:
ray_origin = [0, 0, 0] # Ray starts at the origin
sphere_center = [0, 0, 5] # Center of the sphere
sphere_radius = 3 # Radius of the sphere

light_vector = np.array([-0.1, 1, 0.2])

# Call the function

pixels = np.zeros((200, 200))

z = 100
for i, y in enumerate(np.arange(-100, 100, 1)):
for j, x in enumerate(np.arange(-100, 100, 1)):
ray_direction = [x, y, z]

intersection = ray_sphere_intersection(ray_origin, ray_direction, sphere_center, sphere_radius)

if intersection != None:

value = np.dot(intersection['normal'], light_vector)

pixels[i][j] = max(0, value)

plt.imshow(pixels)
Target output

V: Linear algebra on an 8-bit microprocessor

This was a real test. For our MVP ray-tracer, we need a few linear algebra operations: we need to add and subtract 3D vectors, and also calculate dot products. Remember that the 6502 can only handle integers between 0 and 255? Floating point numbers are encoded using 5 bytes, using the schema below:

   7       0     31                                  0
+-+-------+ +-+-------+--------+--------+--------+
|S|eeeeeee| |S|mmmmmmm|mmmmmmmm|mmmmmmmm|mmmmmmmm|
+-+-------+ +-+-------+--------+--------+--------+
sign exponent sign implied-1 mantissa

Each vector comprises three 5-byte floating points. Because the 6502 can only operate on two 1-byte operands at a time, something like a vector addition will take thousands of clock cycles as the component bytes are moved in and out of registers.

Here’s how to calculate the dot product from two vectors in Python:

A = np.array([1, 2, 3])
B = np.array([-1, 0.5, 2])

dot_product = np.dot(A, B)

Here’s the subroutine I wrote to do the same thing in assembly. You’ll notice that this is quite involved, pretty obscure, and, as a novice in assembly, it took me some time.

; All of this code puts pointers to the 2 operand vectors
; into the vector pointers. Because it is a 16-bit
; address space with an 8-bit processor, each vector has
; two pointers, the high 8-bits and the low 8-bits.

; VECTOR ARITH AREA
; Pointers to Vectors
Init:
V1XHigh = $2400
V1XLow = $2401
V1YHigh = $2402
V1YLow = $2403
V1ZHigh = $2404
V1ZLow = $2405

V2XHigh = $2406
V2XLow = $2407
V2YHigh = $2408
V2YLow = $2409
V2ZHigh = $240a
V2ZLow = $240b

V3XHigh = $240c ; Where you want the result to get stored
V3XLow = $240d ; Might be a whole vector, might be a scalar
V3YHigh = $240e ; Consider this space used
V3YLow = $240f
V3ZHigh = $2410
V3ZLow = $2411

; LIGHT SOURCE 1 VECTOR3
Light1XHigh = $20
Light1XLow = $28
Light1YHigh = $20
Light1YLow = $2d
Light1ZHigh = $20
Light1ZLow = $32

; RAY ORIGIN VECTOR3
RayOriginXHigh = $20
RayOriginXLow = $0a
RayOriginYHigh = $20
RayOriginYLow = $0f
RayOriginZHigh = $20
RayOriginZLow = $14

SetupPointers:
lda #Light1XHigh
sta V1XHigh
lda #Light1XLow
sta V1XLow

lda #Light1YHigh
sta V1YHigh
lda #Light1YLow
sta V1YLow

lda #Light1ZHigh
sta V1ZHigh
lda #Light1ZLow
sta V1ZLow

lda #RayOriginXHigh
sta V2XHigh
lda #RayOriginXLow
sta V2XLow

lda #RayOriginYHigh
sta V2YHigh
lda #RayOriginYLow
sta V2YLow

lda #RayOriginZHigh
sta V2ZHigh
lda #RayOriginZLow
sta V2ZLow

lda #$25
sta V3XHigh
lda #$00
sta V3XLow

lda #$25
sta V3YHigh
lda #$05
sta V3YLow

lda #$25
sta V3ZHigh
lda #$0a
sta V3ZLow

Vector3DotProduct:
; Dot Product of Vector pointed to by V1 from Vector pointed to by V2
; result left in floating point accumulator as float
; uses rest of V3 as temporary storage

Vector3DotProductX:
ldy V1XHigh
lda V1XLow
jsr CONUPK
jsr MOVFA

ldy V2XHigh
lda V2XLow
jsr FMULT

ldy V3XHigh
ldx V3XLow
jsr MOVMF

Vector3DotProductY:
ldy V1YHigh
lda V1YLow
jsr CONUPK
jsr MOVFA

ldy V2YHigh
lda V2YLow
jsr FMULT

ldy V3YHigh
ldx V3YLow
jsr MOVMF

Vector3DotProductZ:
ldy V1ZHigh
lda V1ZLow
jsr CONUPK
jsr MOVFA

ldy V2ZHigh
lda V2ZLow
jsr FMULT

Vector3DotProductSum:
ldy V3XHigh
lda V3XLow
jsr FADD

ldy V3YHigh
lda V3YLow
jsr FADD

Finally, we could behold ray-tracing on an Apple 2…

That’s it, for now. I have no desire to programme anything else in assembly. And I don’t think I have much to contribute to the retro computing / emulation community, which is already so well served by 6502-based simulations. But I hope this has been interesting, and I can thoroughly recommend this kind of project.

Don’t hesitate to hmu with any questions!

--

--