How to Optimize Your Rust Program for Slowness

Write a Short Program That Finishes After the Universe Dies

20 min read · Apr 1, 2025


Source: https://openai.com/dall-e-2/. All other figures from the author.

Also available: A Python version of this article.

Everyone talks about making Rust programs faster [1, 2, 3], but what if we pursue the opposite goal? Let’s explore how to make them slower — absurdly slower. Along the way, we’ll examine the nature of computation, the role of memory, and the scale of unimaginably large numbers.

Our guiding challenge: write short Rust programs that run for an extraordinarily long time.

To do this, we’ll explore a sequence of rule sets — each one defining what kind of programs we’re allowed to write, by placing constraints on halting, memory, and program state. This sequence isn’t a progression, but a series of shifts in perspective. Each rule set helps reveal something different about how simple code can stretch time.

Here are the rule sets we’ll investigate:

  1. Anything Goes — Infinite Loop
  2. Must Halt, Finite Memory — Nested, Fixed-Range Loops
  3. Infinite, Zero-Initialized Memory — 5-State Turing Machine
  4. Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)
  5. Infinite, Zero-Initialized Memory — Plain Rust (compute 10↑↑15 without Turing machine emulation)
  6. Bonus — Running Slow Things Fast(er)

Aside: 10↑↑15 is not a typo or a double exponent. It's a number so large that “exponential” and “astronomical” don’t describe it. We’ll define it in Rule Set 4.

We start with the most permissive rule set. From there, we’ll change the rules step by step to see how different constraints shape what long-running programs look like — and what they can teach us.

Rule Set 1: Anything Goes — Infinite Loop

We begin with the most permissive rules: the program doesn’t need to halt, can use unlimited memory, and can contain arbitrary code.

If our only goal is to run forever, the solution is immediate:

fn main() {
    loop {}
}

This program is short, uses negligible memory, and never finishes. It satisfies the challenge in the most literal way — by doing nothing forever.

Of course, it’s not interesting — it does nothing. But it gives us a baseline: if we remove all constraints, infinite runtime is trivial.

In the next rule set, we’ll introduce our first constraint: the program must eventually halt. Let’s see how far we can stretch the running time under that new requirement — using only finite memory.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

If we want a program that runs longer than the universe will survive and then halts, it’s easy. Just write four nested loops, each counting over a fixed range from 0 to 2¹²⁸−1:

fn slow_enough() {
    for a in 0..u128::MAX {
        for b in 0..u128::MAX {
            for c in 0..u128::MAX {
                for d in 0..u128::MAX {
                    if d % 1_000_000_000 == 0 {
                        println!("{} {} {} {}", a, b, c, d);
                    }
                }
            }
        }
    }
}

You can see that this program halts after
u128::MAX × u128::MAX × u128::MAX × u128::MAX steps. That’s 340,282,366,920,938,463,463,374,607,431,768,211,455⁴, or about 10¹⁵⁴, steps. And — ignoring the println! — this program uses only a small amount of stack memory to hold its four u128 loop variables: just 64 bytes.

My desktop computer runs the release build of this program at about 1 billion steps per second. But suppose it could run at Planck speed — one step per Planck time, the smallest meaningful unit of time in physics. That would be about 10⁵⁰ steps per year, so about 10¹⁰⁴ years to complete.

Current cosmological models estimate the heat death of the universe in 10¹⁰⁰ years, so our program will run about 10,000 times longer than the projected lifetime of the universe.

Aside: Practical concerns about running a program beyond the end of the universe are outside the scope of this article.

For an added margin, we can use more memory. Instead of 64 bytes for variables, let’s use 64 gigabytes — about what you’d find in a well-equipped personal computer. That’s 10⁹ times more memory, which gives us about 4 × 10⁹ variables instead of 4. If each variable iterates over the full u128 range, the total number of steps becomes roughly u128::MAX^(4×10⁹), or about 10^(165 billion) steps. At Planck speed — roughly 10⁵⁰ steps per year — that corresponds to 10^(165 billion − 50) years of computation.

Can we do better? Well, if we allow an unrealistic but interesting rule change, we can do much, much better.

Rule Set 3: Infinite, Zero-Initialized Memory — 5-State Turing Machine

What if we allow infinite memory — so long as it starts out entirely zero?

Aside: Why don’t we allow infinite, arbitrarily initialized memory? Because it trivializes the challenge. For example, you could mark a single byte far out in memory with a 0x01—say, at position 10¹²⁰—and write a tiny program that just scans until it finds it. That program would take an absurdly long time to run — but it wouldn’t be interesting. The slowness is baked into the data, not the code. We’re after something deeper: small programs that generate their own long runtimes from simple, uniform starting conditions.

My first idea was to use the memory to count upward in binary:

0
1
10
11
100
101
110
111
...

We can do that — but how do we know when to stop? If we don’t stop, we’re violating the “must halt” rule. So, what else can we try?

Let’s take inspiration from the father of computer science, Alan Turing. We’ll program a simple abstract machine — now known as a Turing machine — under the following constraints:

  • The machine has infinite memory, laid out as a tape that extends endlessly in both directions. Each cell on the tape holds a single bit: 0 or 1.
  • A read/write head moves across the tape. On each step, it reads the current bit, writes a new bit (0 or 1), and moves one cell left or right.
A read/write head positioned on an infinite tape.
  • The machine also has an internal variable called state, which can hold one of n values. For example, with 5 states, we might name the possible values A, B, C, D, and E—plus a special halting state H, which we don’t count among the five. The machine always starts in the first state, A. You can think of the set of states as similar to a Rust enum:
enum State { A, B, C, D, E, H }

We can express a full Turing machine program as a transition table. Here’s an example we’ll walk through step by step.

A 5-state Turing machine transition table.
  • Each row corresponds to the current tape value (0 or 1).
  • Each column corresponds to the current state (A through E).
  • Each entry in the table tells the machine what to do next:
    → The first character is the bit to write (0 or 1).
    → The second is the direction to move (L for left, R for right).
    → The third is the next state to enter (A, B, C, D, E, or H, where H is the special halting state).

Now that we’ve defined the machine, let’s see how it behaves over time.

We’ll refer to each moment in time — the full configuration of the machine and tape — as a step. This includes the current tape contents, the head position, and the machine’s internal state (like A, B, or H).

Below is Step 0. The head is pointing to a 0 on the tape, and the machine is in state A.

Looking at row 0, column A in the program table, we find the instruction 1RB. That means:

  • Write 1 to the current tape cell.
  • Move the head Right.
  • Enter state B.

Step 0:

This puts us in Step 1:

The machine is now in state B, pointing at the next tape cell (again 0).

What will happen if we let this Turing machine keep running? It will run for exactly 47,176,870 steps — and then halt.

That number is astonishing on its own, but seeing the full run makes it more tangible. We can visualize the execution using a space-time diagram, where each row shows the tape at a single step, from top (earliest) to bottom (latest). In the image:

  • The first row is blank — it shows the all-zero tape before the machine takes its first step.
  • 1s are shown in orange.
  • 0s are shown in white.
  • Light orange appears where 0s and 1s are so close together they blend.
Space-time diagram for the champion 5-state Turing machine. It runs for 47,176,870 steps before halting. Each row shows the tape at a single step, starting from the top. Orange represents 1, white represents 0.

In 2024, an online group of amateur researchers organized through bbchallenge.org proved that this is the longest-running 5-state Turing machine that eventually halts.

Aside: They formally verified their proof with the Coq proof assistant. If you’re curious about formal verification and Rust, see Check AI-Generated Code Perfectly and Automatically and Nine Rules to Formally Validate Rust Algorithms with Dafny in Towards Data Science.

Want to see this Turing machine in motion? You can watch the full 47-million-step execution unfold in this pixel-perfect video:

Or interact with it directly using the Busy Beaver Blaze web app:

The video generator and web app are part of busy-beaver-blaze, the open-source Rust project that accompanies this article. For tips on porting Rust code to WASM, see my Nine Rules for Running Rust in the Browser in Towards Data Science.

It’s hard to believe that such a small machine can run 47 million steps and still halt. But it gets even more astonishing: the team at bbchallenge.org found a 6-state machine with a runtime so long it can’t even be written with ordinary exponents.

Rule Set 4: Infinite, Zero-Initialized Memory — 6-State Turing Machine (>10↑↑15 steps)

As of this writing, the longest running (but still halting) 6-state Turing machine known to humankind is:

    A   B   C   D   E   F
0 1RB 1RC 1LC 0LE 1LF 0RC
1 0LD 0RF 1LA 1RH 0RB 0RE

Here is a video showing its first 10 trillion steps:

And here you can run it interactively via a web app.

So, if we are patient — comically patient — how long will this Turing machine run? More than 10↑↑15 steps, where “10↑↑15” means a power tower of fifteen 10s: 10^(10^(10^(⋯^10))).

This is not the same as 10¹⁵ (which is just a regular exponent). Instead:

  • 10↑↑1 = 10
  • 10↑↑2 = 10¹⁰ = 10,000,000,000
  • 10↑↑3 = 10^(10¹⁰), already unimaginably large.
  • 10↑↑4 is so large that it vastly exceeds the number of atoms in the observable universe.
  • 10↑↑15 is so large that writing it in exponent notation becomes annoying.

Pavel Kropitz announced this 6-state machine on May 30, 2022. Shawn Ligocki has a great write up explaining both his and Pavel’s discoveries. To prove that these machines run so long and then halt, researchers used a mix of analysis and automated tools. Rather than simulating every step, they identified repeating structures and patterns that could be proven — using formal, machine-verified proofs— to eventually lead to halting.

Up to this point, we’ve been talking about Turing machines — specifically, the longest-known 5- and 6-state machines that eventually halt. We ran the 5-state champion to completion and built visualizations to explore its behavior. But the discovery that it’s the longest halting machine with 5 states — and the identification of the 6-state contender — came from extensive research and formal proofs, not from running them step-by-step.

That said, I’ve built a Rust-based visualizer that can emulate these machines for trillions of steps — just enough to explore their early behavior. But even 10 trillion steps isn’t an atom in a drop of water in the ocean compared to the full runtime of the 6-state machine. And running it as far as we did doesn’t get us any closer to understanding why it runs so long.

Aside: Rust “interpreted” the Turing machines up to some point — reading their transition tables and applying the rules step by step. You could also say it “emulated” them, in that it reproduced their behavior exactly. I avoid the word “simulated”: a simulated elephant isn’t an elephant, but a simulated computer is a computer.

Returning to our central challenge: we want to understand what makes a short program run for a long time. Instead of analyzing these Turing machines, let’s construct a Rust program whose 10↑↑15 runtime is clear by design.

Rule Set 5: Infinite, Zero-Initialized Memory — Plain Rust (compute 10↑↑15 without Turing machine emulation)

Our challenge is to write a small Rust program that runs for at least 10↑↑15 steps, using any amount of zero-initialized memory.

To achieve this, we’ll compute the value of 10↑↑15 in a way that guarantees the program takes at least that many steps. The ↑↑ operator is called tetration—recall from Rule Set 4 that ↑↑ stacks exponents: for example, 10↑↑3 means 10^(10^10). It’s an extremely fast-growing function. We will program it from the ground up.

Rather than rely on built-in operators, we’ll define tetration from first principles:

  • Tetration, implemented by the function tetrate, as repeated exponentiation
  • Exponentiation, via exponentiate, as repeated multiplication
  • Multiplication, via multiply, as repeated addition
  • Addition, via add, as repeated increment

Each layer builds on the one below it, using only zero-initialized memory and in-place updates.

We’ll begin at the foundation — with the simplest operation of all: increment.

Increment

Here’s our definition of increment:

fn increment(increment_acc: &mut BigUint) {
    *increment_acc += 1u32;
}

For example:

let mut b = BigUint::from(4u32);
print!("++{b}\t= ");
increment(&mut b);
assert_eq!(b, BigUint::from(5u32));
println!("{b}");

Output:

++4     = 5

We’re using BigUint, a type from the num-bigint crate that represents arbitrarily large unsigned integers—limited only by available memory. In this article, we’ll pretend memory is infinite.

To keep things simple and consistent with our rules, we’ll restrict ourselves to just a few operations on BigUint:

  • Creating a BigUint::ZERO
  • In-place increment (+= 1u32) and decrement (-= 1u32)
  • Comparison with BigUint::ZERO

We don’t allow copying, cloning, or arithmetic beyond increment and decrement. All work is in-place — no temporary values.

Each function among our computation layers follows a consistent structure: it updates a mutable accumulator in place. Most functions also take an input value a, but increment—being the simplest—does not. We give our accumulators descriptive names like increment_acc, add_acc, and so on to make each operation clear and to prepare for later sections, where multiple accumulators will appear together.

Addition

With increment defined, we next define addition as repeated incrementing.

fn add(a: u32, add_acc: &mut BigUint) {
    for _ in 0..a {
        // We in-line `increment` manually, to keep our work explicit.
        *add_acc += 1u32;
    }
}

The function adds a to add_acc by incrementing add_acc a times.

Aside: You might wonder why add doesn't just call our increment function. We could write it that way—but we're deliberately in-lining each level by hand. This keeps all loops visible, makes the control flow explicit, and helps us reason precisely about how much work our function does.

Let’s use our new add function:

let a = 2;
let mut b = BigUint::from(4u32);
print!("{a} + {b}\t= ");
add(a, &mut b);
assert_eq!(b, BigUint::from(6u32));
println!("{b}");

Output:

2 + 4     = 6

Even though BigUint supports direct addition, we don’t use it. We're deliberately working at the most primitive level possible—incrementing by 1u32—to keep the logic simple and slow, and to make the amount of work explicit.

As with increment_acc, we update add_acc in place, with no copying or temporary values. The only operation we use is += 1u32, repeated a times.

Next, we define multiplication.

Multiplication

With addition in place, we can now define multiplication as repeated addition. Here’s the function:

fn multiply(a: u32, multiply_acc: &mut BigUint) {
    let mut add_acc = BigUint::ZERO;
    for () in multiply_acc.count_down() {
        for _ in 0..a {
            add_acc += 1u32;
        }
    }
    *multiply_acc = add_acc;
}

This multiplies a by the value of multiply_acc by adding a to add_acc once for each time multiply_acc can be decremented. Then it stores the result back into multiply_acc.

Let’s use multiply:

let a = 2;
let mut b = BigUint::from(4u32);
print!("{a} * {b}\t= ");
multiply(a, &mut b);
assert_eq!(b, BigUint::from(8u32));
println!("{b}");

Output:

2 * 4     = 8

As in the previous functions, we perform all computation in-place, using no copying, temporary values, or built-in arithmetic. The only operations allowed are incrementing, decrementing, and zero comparisons.

You might wonder what this line does:

for () in multiply_acc.count_down()

Because BigUint doesn't support loops like for _ in 0..n, we use a custom method, .count_down(). It loops as long as the value is nonzero, decrementing it by 1 each time. This gives us a controlled, in-place version of "loop n times," even when n is arbitrarily large.

We’ve built multiplication from repeated addition. Now it’s time to go a layer further: exponentiation.

Exponentiation

We define exponentiation as repeated multiplication. As before, we’ll perform every multiplication step in-place, using only zero-initialized memory.

Here’s the function:

fn exponentiate(a: u32, exponentiate_acc: &mut BigUint) {
    assert!(
        a > 0 || *exponentiate_acc != BigUint::ZERO,
        "0^0 is undefined"
    );

    let mut multiply_acc = BigUint::ZERO;
    multiply_acc += 1u32;
    for () in exponentiate_acc.count_down() {
        let mut add_acc = BigUint::ZERO;
        for () in multiply_acc.count_down() {
            for _ in 0..a {
                add_acc += 1u32;
            }
        }
        multiply_acc = add_acc;
    }
    *exponentiate_acc = multiply_acc;
}

This raises a to the power of exponentiate_acc, using only incrementing, decrementing, and loop control. We initialize multiply_acc to 1 with a single increment—because repeatedly multiplying from zero would get us nowhere. Then, for each time exponentiate_acc can be decremented, we multiply the current result (multiply_acc) by a. As with the earlier layers, we inline the multiply logic directly instead of calling the multiply function—so the control flow and step count stay fully visible.

Here’s how we use exponentiate:

let a = 2;
let mut b = BigUint::from(4u32);
print!("{a}^{b}\t= ");
exponentiate(a, &mut b);
assert_eq!(b, BigUint::from(16u32));
println!("{b}");

Output:

2^4     = 16

Aside: And how many times is += 1u32 called? Obviously at least 2⁴ times—because our result is 2⁴, and we reach it by incrementing from zero. More precisely, the number of increments is:

• 1 increment — initializing multiply_acc to one
Then we loop four times, and in each loop, we multiply the current value of multiply_acc by a = 2, using repeated addition:

• 2 increments — for multiply_acc = 1, add 2 once
• 4 increments — for multiply_acc = 2, add 2 twice
• 8 increments — for multiply_acc = 4, add 2 four times
• 16 increments — for multiply_acc = 8, add 2 eight times

That’s a total of 1 + 2 + 4 + 8 + 16 = 31 increments, which is 2⁵ − 1. In general, the number of increments grows exponentially, though it is not the same exponential that we are computing.

With exponentiation defined, we’re ready for the top of our tower: tetration.

Tetration

Here’s the function:

fn tetrate(a: u32, tetrate_acc: &mut BigUint) {
    assert!(a > 0, "we don’t define 0↑↑b");

    let mut exponentiate_acc = BigUint::ZERO;
    exponentiate_acc += 1u32;
    for () in tetrate_acc.count_down() {
        let mut multiply_acc = BigUint::ZERO;
        multiply_acc += 1u32;
        for () in exponentiate_acc.count_down() {
            let mut add_acc = BigUint::ZERO;
            for () in multiply_acc.count_down() {
                for _ in 0..a {
                    add_acc += 1u32;
                }
            }
            multiply_acc = add_acc;
        }
        exponentiate_acc = multiply_acc;
    }
    *tetrate_acc = exponentiate_acc;
}

This computes a ↑↑ tetrate_acc, meaning it exponentiates a by itself repeatedly, tetrate_acc times.

For each decrement of tetrate_acc, we exponentiate the current value. We in-line the entire exponentiate and multiply logic again, all the way down to repeated increments.

Here’s how we use tetrate:

let a = 2;
let mut b = BigUint::from(3u32);
print!("{a}↑↑{b}\t= ");
tetrate(a, &mut b);
assert_eq!(b, BigUint::from(16u32));
println!("{b}");

Output:

2↑↑3     = 16

As expected, this computes 2^(2^2) = 16. You can run this yourself on the Rust Playground.

We can also run tetrate on 10↑↑15. It will start running, but it won’t stop during our lifetimes — or even the lifetime of the universe:

let a = 10;
let mut b = BigUint::from(15u32);
print!("{a}↑↑{b}\t= ");
tetrate(a, &mut b);
println!("{b}");

Let’s compare this tetrate function to what we found in the previous Rule Sets.

Rule Set 1: Anything Goes — Infinite Loop

Recall our first function:

fn main() {
    loop {}
}

Unlike this infinite loop, our tetrate function eventually halts — though not anytime soon.

Rule Set 2: Must Halt, Finite Memory — Nested, Fixed-Range Loops

Recall our second function:

fn slow_enough() {
    for a in 0..u128::MAX {
        for b in 0..u128::MAX {
            for c in 0..u128::MAX {
                for d in 0..u128::MAX {
                    if d % 1_000_000_000 == 0 {
                        println!("{} {} {} {}", a, b, c, d);
                    }
                }
            }
        }
    }
}

Both slow_enough and our tetrate function contain a fixed number of nested loops. But tetrate differs in an important way: the number of loop iterations grows with the input value. In slow_enough, each loop runs from 0 to u128::MAX—a hardcoded bound. In contrast, tetrate’s loop bounds are dynamic — they grow explosively with each layer of computation.

Rule Sets 3 & 4: Infinite, Zero-Initialized Memory — 5- and 6-State Turing Machine

Compared to the Turing machines, our tetrate function has a clear advantage: we can directly see that it will call += 1u32 more than 10↑↑15 times. Even better, we can also see — by construction — that it halts.

What the Turing machines offer instead is a simpler, more universal model of computation — and perhaps a more principled definition of what counts as a “small program.”

That wraps up our efforts to make programs slower. In the conclusion, we’ll summarize and reflect on the surprises.

But before that, a twist: let’s make some slow programs run faster.

Bonus: Rule Set 6 — Running Slow Things Fast(er)

After all that effort deliberately slowing things down, here’s a surprise: let’s look at techniques for making Rust programs run faster.

Some simple but effective optimizations made the Turing machine visualizer fast enough to render 10 trillion steps using pixel binning (a term I’ll define shortly).

Even the Rust WASM version — now part of bbchallenge.org — can run 100 million steps in under a minute on many devices, including phones and tablets. (Try: Blaze mode, dark theme, steps = 100 million.)

bbchallenge — select “Blaze” & dark. Set steps to 100 million.

Rather than cover every optimization, I’ll just outline the ones that mattered most.

Optimization: Keep the Turing Machine Simple

The Turing machine emulator represents the infinite tape using two Vec<u8>: one for negative indices and one for non-negative. Each cell uses a full u8, even though it only holds 0 or 1.

Why not compress further? In theory, bit-level storage would save memory and improve cache usage by fitting more of the tape into the cache. But in practice, using run-length encoding with my range-set-blaze crate made things slower.

I haven’t tried bitvec, but in this case, simplicity—and letting the CPU’s memory system do its job—seems to win.

Visualization: Background Information

To understand the visualization optimizations, we first need to look at how the visualizer handles massive histories. To visualize step counts beyond a trillion, the visualizer compresses the history into a fixed-size image — for example, 1920×1080 pixels.

Key techniques:

  • Pixel Binning:
    Much like digital image sensors, the visualizer uses pixel binning: each pixel represents the average value over a rectangular region — in this case, a block of steps and tape positions. For example, a single pixel might summarize 500 billion steps across 1,000 tape cells, compressed into one 8-bit value.
  • Adaptive Binning:
    Initially the visualizer stores every step and tape cell. Once the data exceeds twice the image resolution in either dimension, it halves that dimension using averaging. This keeps both memory usage and runtime proportional to the image size — not to the step count or tape width.
  • Practical Tape Width:
    Though the tape is conceptually infinite, we only track the active region — from the leftmost to the rightmost written cell.

This method enables compact yet faithful representations of vast computations, preserving meaningful structure even at extreme compression ratios.

Visualization Optimization 1: SIMD Everywhere

For over 20 years, CPUs have been able to perform dozens of simple operations in a single instruction. This is called SIMD (Single Instruction, Multiple Data). Even WebAssembly (WASM) — the portable language of the web — supports it.

While traditionally hard to program, SIMD becomes approachable in Rust with the nightly compiler, thanks to its standard, portable SIMD library. (See my Nine Rules for SIMD Acceleration of Your Rust Code in Towards Data Science.)

In the code, I replaced the Vec<u8>’s with AVec<u8> from aligned-vec. I then used SIMD everywhere I could think of:

  • Averaging many tape values into one pixel value
  • Merging adjacent pixels on the same row
  • Merging two adjacent rows of pixels into one row

Each of these operations benefits from SIMD — leading to fewer instructions, better cache usage, and higher throughput.

Visualization Optimization 2: One-Pixel Update

At each simulation step, the tape changes at most one cell. That means the resulting line of pixels — after averaging — can differ from the previous one in at most one pixel.

So instead of recomputing the entire line from scratch, we:

  • Copy the previous line of pixels.
  • Compute just the one possibly new pixel.

This replaces full recomputation with a simple copy plus a targeted update. If your line is 2000 pixels wide, you just copy 2000 bytes and compute one pixel.

This optimization leverages temporal coherence. It’s a small change in code, but a big win in efficiency.

Visualization Optimization 3: “Impossible” Multiprocessing

As far as I know, it’s impossible to speed up our Turing machine itself using multiple processors. Its step-by-step, stateful nature resists all parallelization.

That leaves visualization as the target for parallelism. But attempts to use multiple processors in the inner loops of visualization — such as merging two rows of pixels — backfire. They hurt cache locality and, in my benchmarks, slowed visualization down.

I have a 16-core desktop computer. Rendering these videos was taking over a week, while CPU usage hovered around a pathetic 7% — meaning 93% of my machine sat idle. Frustrating.

The solution: Visualization is roughly 50× slower than running the Turing machine alone. So instead of parallelizing inner loops, I created a new outer loop. Specifically, I divided the total number of Turing machine steps into 32 independent chunks, each responsible for visualizing 1/32 of the final image.

Each thread does two things:

  1. Runs the Turing machine as fast as possible until it reaches the first step it needs to visualize.
  2. Generates the visualization information for its assigned range.

This approach keeps all CPU cores busy, pushing usage to 100%. Once all the chunks are complete, it carefully stitches the results together to produce the final video or visualization output.

Conclusion

So, there you have it — a journey through writing absurdly slow programs, culminating in techniques to make even the slow parts fast(er). Along the way, we explored the boundaries of computation, memory, and performance, using everything from nested loops to Turing machines to a manually-inlined tetration function.

Here’s what surprised me:

  • Nested loops are enough.
    If you just want a short program that halts after outliving the universe, four nested loops with 64 bytes of memory will do the job. I hadn’t realized it was that simple.
  • Turing machines escalate fast.
    The jump from 5 to 6 states unleashes a dramatic leap in complexity and runtime. Also, the importance of starting with zero-initialized memory is obvious in retrospect — but it wasn’t something I’d considered before.
  • There’s something beyond exponentiation — and it’s called tetration.
    I didn’t know this. I wasn’t familiar with the ↑↑ notation or the idea that exponentiation could itself be iterated to form something even faster-growing. It was surprising to learn how compactly it can express numbers that are otherwise unthinkably large.
    And because I know you’re asking — yes, there’s something beyond tetration too. It’s called pentation, then hexation, and so on. These are part of a whole hierarchy known as hyperoperations. There’s even a metageneralization: systems like the Ackermann function and fast-growing hierarchies capture entire families of these functions and more.
  • Writing Tetration with Explicit Loops Was Eye-Opening
    I already knew that exponentiation is repeated multiplication, and so on. I also knew this could be written recursively. What I hadn’t seen was how cleanly it could be written as nested loops, without copying values and with strict in-place updates.
  • Rust’s portable SIMD (Single Instruction, Multiple Data) on nightly is a joy.
    The nightly SIMD library gave a huge boost to performance. I hope it gets stabilized soon.
  • Visualization: Easy and Hard
    Adaptive pixel binning kept memory use constant and scaling smooth. The hard part? Dividing the work into 32 chunks was easy — but stitching them back together without introducing dozens of off-by-one errors was not.

Thank you for joining me on this journey. I hope you now have a clearer understanding of how small Rust programs can run for an astonishingly long time — and what that reveals about computation, memory, and minimal systems. We’ve seen programs that halt only after the universe dies, and others that run even longer.

Interested in future articles? Please follow me on Medium. I write about Rust and Python, scientific programming, machine learning, and statistics. I tend to write about one article per month.

Written by Carl M. Kadie

Ph.D. in CS & ML. Retired MSFT & Microsoft Research. Volunteer, open-source projects related to ML, Genomics, Rust, and Python. Published in Science & Nature.
