Curta Gas Golf Course #2-Factorial

Rohan Nero
28 min readApr 6, 2024

--

Curta is a place where developers can come together to solve smart contract puzzles and create optimized bytecode solutions. This article documents my own personal process and experience with this challenge, as well as other public solutions submitted by other developers.

Screenshot from https://twitter.com/curta_ctf/status/1768734398583583198

We are focusing on the second Gas Golfing course called Factorial. We can see from the course page that the requirements to satisfy this challenge are that your contract can take a uint256 input and return the factorial as uint256 output, unless the input is greater than 57, in this case, the call should revert.

// SPDX-License-Identifier: MIT
pragma solidity ^0.8.21;

import { ICourse } from "src/interfaces/ICourse.sol";

interface FactorialLike {
function perform(uint256 input) external pure returns (uint256 output);
}

contract Factorial is ICourse {
/// @inheritdoc ICourse
function name() external pure returns (string memory) {
return "Factorial";
}

/// @inheritdoc ICourse
function run(address target, uint256 seed) external view returns (uint32 usedGas) {
/// @solidity memory-safe-assembly
assembly {
function verifyReturns(target_, input_, output_) {
mstore(0x00, 0xd097bc0d) // `perform(uint256)`.
mstore(0x20, input_)
let success_ := staticcall(gas(), target_, 0x1c, 0x24, 0x00, 0x20)
success_ := and(gt(returndatasize(), 0x1f), success_)
success_ := and(eq(mload(0x00), output_), success_)
if iszero(success_) {
mstore(0x00, 0x62cebecf) // `IncorrectSolution()`.
revert(0x1c, 0x04)
}
}

function verifyReverts(target_, input_) {
mstore(0x00, 0xd097bc0d) // `perform(uint256)`.
mstore(0x20, input_)
if staticcall(gas(), target_, 0x1c, 0x24, 0x00, 0x00) {
mstore(0x00, 0x62cebecf) // `IncorrectSolution()`.
revert(0x1c, 0x04)
}
}
}
}

Before we get started let’s think through what our code needs to do:

  1. Read input and store it somewhere
  2. Revert if input is greater than 57
  3. Return the factorial of the input. Which may look like a loop that starts at 1 and increments by 1 until reaching the value of the input. We will also need to store our output somewhere and multiply it by the loop counter each iteration.

Now that we know what our code must do, let's start writing it. For this, I’ll be using EVM Codes playground.

First, we use CALLDATACOPY to store the input inside memory:

PUSH1 0x20 // 32 byte size since input is uint256
PUSH1 0x04 // 4 byte offset accounting for function signature
PUSH1 0x00 // 0 byte destOffset in memory
CALLDATACOPY

Then we check to ensure the input value is between 0 and 57:

PUSH0 // 0 offset in memory
MLOAD // load our input value
PUSH1 0x3a // 58 in hex
GT // compare 58 > input ? 1 : 0
PUSH1 0x14 // JUMPDEST location
JUMPI // if input is less than 58 we jump
PUSH1 0x00
PUSH1 0x00
REVERT // if input is greater than 57 revert with no message
JUMPDEST

Ok now that we’ve ensured the input is valid, all that is left to do is calculate the factorial:

PUSH1 0x01 // initialize loop counter as 1
PUSH1 0x20 // 32 in hex representing the memory offset
MSTORE // store at offset
PUSH1 0x01 // initialize output as 1
PUSH1 0x40 // 64 in hex representing the memory offset
MSTORE // store at offset

JUMPDEST // landing destination for restarting loop


PUSH0 // memory offset 0
MLOAD // load input
PUSH1 0x20 // memory offset 32
MLOAD // load loop counter

// Compare the values (loop counter < input)
// If 1 continue loop, if 0 jump to return
LT // compare loop counter < input ? 1 : 0
ISZERO // negate the result so that we jump when loop counter = input

PUSH1 0x5a // final JUMPDEST location right before RETURN
JUMPI
// If here, need to increment index, store it. Multiply index by output, store it. Jump to previous JUMPDEST.
// Load index and add 1 to it
PUSH1 0x20
MLOAD // load loop counter from memory
PUSH1 0x01
ADD // add 1 to loop counter
PUSH1 0x20
MSTORE // store the incremented counter
PUSH1 0x20
MLOAD // load incremented loop counter
PUSH1 0x40
MLOAD // load current output
MUL // multiply them together
PUSH1 0x40
MSTORE // store updated output
PUSH1 0x2d // JUMPDEST location to restart loop
JUMP // always jumps

and return it:

JUMPDEST // final landing location for skipping the loop
PUSH1 0x20 // 32 byte output size
PUSH1 0x40 // output memory offset
RETURN

Now we have our runtime code:

602060046000375f51603a1160145760006000fd5b600160205260016040525b5f51602051101560405760205160010160205260205160405102604052601f565b60206040f3

Which can be used to make this creation code:

7f602060046000375f51603a1160145760006000fd5b600160205260016040525b5f527f5f51602051101560405760205160010160205260205160405102604052601f56602052655b60206040f360d01b60405260465ff3

Congratulations we have now completed the challenge and created a minimalistic bytecode contract that returns the factorial of any input passed to i-

Damn.

We may have met the requirements but our solution cost…. a lot more than the most optimized one. Let's fix that.

(Please note Rohan Nero Inc. made a thoughtful decision to revise the opcodes in this first solution slightly for this article because I wanted to save the reader some pain and the original is an abomination that I despise much like Victor and his tragic creation. If you absolutely must see it, then I’m sure you will find it).

Based on the next solution above mine from 0xjust.q00t.eth, we can see that they handled storing values much differently than I did.

My solution at the end of execution when passed `7`
0xjust.q00t.eth’s solution at end of execution when passed `7`

Instead of constantly storing things in memory with MSTORE, they let things live on the stack and would use a combination of SWAP and DUP to move them around as well as POP to remove items from the stack when necessary.

Some of the opcodes used in their solution

A big reason why this solution is more efficient than ours is because it avoids unnecessary memory expansion by using a single slot for the output and never storing the loop counter or original input in memory. With this in mind, we can start again on a more optimized version of our solution.

First, we load our input value and leave it on the stack. Then, since our loop will start by multiplying the input by the input minus 1, we will jump past the loop and return 1 if the input value is less than 2.

PUSH1 0x04 // 4 byte offset for function sig
CALLDATALOAD // load input to stack
PUSH1 0x01 // push 1, which will be output if we jump
PUSH1 0x02 // push 2 to stack so we can compare it against input
DUP3 // dup the input value so we can use it in the comparison
LT // compare input < 2 ? 1 : 0
PUSH1 0x29 // JUMPDEST location
JUMPI // jump to the RETURN
POP // remove the 1 from stack if we didn't jump

Next, following our 3-step program, we will compare the input value to 58 to ensure it is within the allowed range.

DUP1 // duplicate the input
DUP1 // duplicate the input
PUSH1 0x3a // 58 in hex
GT // compare 58 > input ? 1 : 0
PUSH1 0x18 // JUMPDEST location for skipping revert
JUMPI // jumps if input is less than 58
PUSH0
PUSH0
REVERT // revert with no reason
JUMPDEST // landing point for input values 2-57

Finally, all that’s left to do is calculate the factorial and return it.

JUMPDEST // starting location for loop
PUSH1 0x01 // push 1 to stack
SWAP1 // swap 1 and our current loop counter
SUB // subtract 1 from our loop counter
DUP1 // duplicate the updated loop counter
SWAP2 // swap loop counter and current output value
MUL // multiply current output by current loop counter
SWAP1 // swap the current output with loop counter
PUSH1 0x02 // push 2 to stack
DUP2 // dup the current loop counter
GT // compare current counter > 2
PUSH1 0x1a // starting loop JUMPDEST location
JUMPI // jumps if current loop counter is greater than 2
POP // remove loop counter from stack
JUMPDEST // landing point for input values 0 and 1
MSIZE // push 0 to stack as offset since we know current memory is empty
MSTORE // store output so we can RETURN it
MSIZE // push 32 bytes to stack since we just stored output in slot 0
PUSH0 // push 0 to stack as memory offset
RETURN

We’ve recreated the exact same function using an entirely different method of handling values.

The runtime code for this solution: 600435600160028210602b57508080603a11601a5760006000fd5b600190038091029060028111601a57505b5952595ff3

The creation code for this solution: 0x7f600435600160028210602b57508080603a11601a5760006000fd5b60019003805f527091029060028111601a57505b5952595ff360781b60205260315ff3

Instead of using 3 memory slots, we used only one for the output which saved us a lot of gas:

Screenshot of the top 8 positions on the leaderboard

Using this one new piece of knowledge we were able to optimize our code and move up in the leaderboards, but the majority of the solutions are still more efficient than ours. The next two solutions have almost identical gas costs so they probably used the same optimization method, lets take a look at cwu.eth and atarpara.eth’s solutions to see what we can improve on.

The first thing I notice is how long the runtime code is: 600435600101600a81116022576001805b8160010191028183116010573d52593df35b60148111604257600a620589805b8160010191028183116030573d52593df35b601e811160675760146701b02b93068900005b8160010191028183116055573d52593df35b60288111609157601e6c6f99461a1e9e1432dcb60000005b816001019102818311607f573d52593df35b6032811160c2576028730392ac33e351cc7659cd325c1ff9ba88000000005b81600101910281831160b0573d52593df35b603a811160fa5760327a017a88e4484be3b6b92eb2fa9c6c087c778c8d79f9c000000000005b81600101910281831160e8573d52593df35b3d3dfd

How can more code be more efficient? Let's look at the mnemonic form of the bytecode and see if we can make sense of it.

PUSH1 0x04
CALLDATALOAD // Load calldata

PUSH1 0x01
ADD // Increment input by 1
PUSH1 0x0a
DUP2
GT // Compare input + 1 > 0x0a
PUSH1 0x22
JUMPI // If input + 1 > 0x0a JUMP

PUSH1 0x01
DUP1

JUMPDEST // Loop start
DUP2
PUSH1 0x01 // push 1
ADD // add 1 to input
SWAP2
MUL
DUP2
DUP4
GT // i+1 > 2, JUMP if i+1 > loop counter
// Using stack value as loop counter to track how many times the loop has run

PUSH1 0x10 // Restart loop
JUMPI
// Once loop counter = input + 1, we stop JUMPing
// Stack = output, index, input+1
RETURNDATASIZE // Pushes 0 to stack, not sure why these are being used but it is currently used
// as memory offset for the return value
MSTORE // Output in memory, Stack = i+1, loop counter
MSIZE
RETURNDATASIZE // Using this to push 0 for memory offset, same as PUSH0
RETURN // Return the output (ends here for input 0-10)

JUMPDEST // Landing point for input > 10
PUSH1 0x14 // 20 in hex
DUP2 // Dup i+1
GT // i+1 > 20 ? 1 : 0
PUSH1 0x42
JUMPI // If input is greater than 20 we jump again.

PUSH1 0x0a
PUSH3 0x058980
JUMPDEST // landing for inputs 11-20
DUP2
PUSH1 0x01
ADD
SWAP2
MUL
DUP2
DUP4
GT
PUSH1 0x30
JUMPI
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN // Ending point for inputs 11-20

JUMPDEST // Landing point for inputs > 20
PUSH1 0x1e
DUP2
GT
PUSH1 0x67
JUMPI
PUSH1 0x14
PUSH8 0x01b02b9306890000
JUMPDEST
DUP2
PUSH1 0x01
ADD
SWAP2
MUL
DUP2
DUP4
GT
PUSH1 0x55
JUMPI
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN // Ending point for inputs 21-30

JUMPDEST // Landing point for inputs > 30
PUSH1 0x28
DUP2
GT
PUSH1 0x91
JUMPI
PUSH1 0x1e
PUSH13 0x6f99461a1e9e1432dcb6000000
JUMPDEST
DUP2
PUSH1 0x01
ADD
SWAP2
MUL
DUP2
DUP4
GT
PUSH1 0x7f
JUMPI
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN // Ending point for inputs 31-40

JUMPDEST // Landing point for inputs > 40
PUSH1 0x32
DUP2
GT
PUSH1 0xc2
JUMPI
PUSH1 0x28
PUSH20 0x0392ac33e351cc7659cd325c1ff9ba8800000000
JUMPDEST
DUP2
PUSH1 0x01
ADD
SWAP2
MUL
DUP2
DUP4
GT
PUSH1 0xb0
JUMPI
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN // Ending point for inputs 41-50

JUMPDEST // Landing point for inputs > 50
PUSH1 0x3a
DUP2
GT
PUSH1 0xfa
JUMPI
PUSH1 0x32
PUSH27 0x017a88e4484be3b6b92eb2fa9c6c087c778c8d79f9c00000000000
JUMPDEST
DUP2
PUSH1 0x01
ADD
SWAP2
MUL
DUP2
DUP4
GT
PUSH1 0xe8
JUMPI
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN // Ending point for inputs 51-57

JUMPDEST // Landing point for input > 57
RETURNDATASIZE
RETURNDATASIZE
REVERT // Revert with no reason REVERT(0,0)

What atarpara.eth did here was pretty clever. Instead of starting the factorial calculation loop once after confirming the input is within an acceptable range, they created 5 separate loops that start at the maximum output value of the previous loop plus 1.

This is important because instead of looping 0–57 times, the maximum amount of times a jump can be made is 15. This is possible because each loop can run a maximum of 10 times since that is the range of input the loops handle (The first loop handles input values 10 or less, the second loop handles input values 11–20, and so on), and the largest amount of times an input value can jump before reaching the factorial calculation loop is 5 (Input values 51–57 have to jump past the first four loops to reach the loop that handles input values above 50).

Now let’s see how similar cwu.eth’s solution is to this one and what makes it slightly more efficient.

I am speechless after reading through this bytecode, I almost didn’t bother reading it assuming that it was extremely similar to the Atarpara’s solution, but I am so glad I did. This bytecode is built on the same idea of breaking the one loop up into smaller loops depending on what group the input falls under, however, that is essentially where the similarities end. It took me a while to even understand what was going on here. I still have no idea how he built this or why he decided on certain values, I just know that this has blown my mind.

PUSH1 0x04
CALLDATALOAD // get input
PUSH1 0x3a // push 58 to stack
DUP2 // dup input
LT // compare input < 58
PUSH2 0x000e
JUMPI // jump past revert if input is less than 58
PUSH0
PUSH0
REVERT // for input > 57

JUMPDEST // land here if input < 58
PUSH1 0x0a
DUP2
GT // input > 10 ? 1 : 0
PUSH2 0x0047
JUMPI // Jump if input is greater than 10
// lookup table containing factorials for 0-10
PUSH31 0x0375f0016260009d80004ec0002d00001e0000180000180000200000400001
DUP2 // dup input
PUSH1 0x16 // push 22 to stack
MUL // multiply input by 22
SHR // Shift bytes using lookup table as value and our product as shift size
PUSH3 0x3fffff // push this as mask
AND // bitwise AND the unmasked bytes from lookup table with 0x3fffff
PUSH0 // memory offset 0
MSTORE // store the output in memory
PUSH1 0x20 // push 32 bytes size
PUSH0 // memory offset 0
RETURN // return the output

JUMPDEST // landing point for input values greater than 10
PUSH1 0x01 // push 1 to stack
ADD // add 1 to input
PUSH0
MSTORE // store input + 1 in memory
PUSH1 0x0b // push 11 to stack
PUSH3 0x375f00 // push 10! (3628800)
PUSH1 0x0b // push 11 to stack
PUSH1 0x1f // push 31 to stack
PUSH0 // push 0 to stack
MLOAD // load input + 1 from memory
LT // compare input + 1 < 31 ? 1 : 0
PUSH2 0x0070 // JUMPDEST location
JUMPI // jump if input + 1 is less than 31
PUSH1 0x1e // push 30 to stack
PUSH13 0x6f99461a1e9e1432dcb6000000 // 30! / 30
PUSH1 0x1e // push 30 to stack
JUMPDEST
MUL // multiply 30 by lookup table
DUP2 // dup 30 so top of stack is 30,product,30,10, 10!,10
PUSH1 0x01 // push 1
ADD // add 1 to 30 so stack is 31, product, 30
SWAP1 // stack = product, 31, 30
DUP2
PUSH0 // push 0
MLOAD // load input + 1 from memory
DUP2 //
LT // 31 (loop counter) < input + 1
PUSH2 0x0070 // JUMPDEST location
JUMPI // jump if 31 (loop counter) is less than input + 1
POP
PUSH0
MSTORE // store product in memory overriding input + 1
PUSH1 0x20 // 32 bytes size for RETURN
PUSH0 // memory offset 0
RETURN

Instead of breaking up the 0–57 range into 6 separate loops for 0–10,11–20, 21–30, 31–40, 41–50, and 51–57, as atarpara did, cwu broke the range into 3 groups. The first group is any input within 0–10, the second is 11–30, and the final group is 31–57. Both solutions store the previous group’s max value inside the next group but cwu uses a lookup table to retrieve the factorial directly for input 0–10 instead of always calculating it with a series of multiplication.

The look-up table handles retrieving any factorials for inputs 0–10: 0x0375f0016260009d80004ec0002d00001e0000180000180000200000400001

This second value that looks like a lookup table is the hex version of 30! / 30: 0x6f99461a1e9e1432dcb6000000

For values bigger than 30 he uses old-fashioned JUMPI after comparing the loop counter with LT.

How is it possible to store values and retrieve values from inside a lookup table? And how can we create a more optimized or at least similar version of this ourselves? First, I need to decide what the lookup table will store. I.e factorials or JUMPDEST locations.

After thinking about it for a while, I’ve decided that I want to use a combination of both. I want to use the first lookup table to store JUMPDEST locations, and then once the code has reached the respective landing point, we will use a second lookup table containing the factorials for 0–10, exactly like cwu did. For input values greater than 10, we will use hard-coded factorials in tandem with simple loop structures like atarpara did.

Hopefully by combining these two solutions we can further optimize our own. The hard part about constructing the first lookup table is that all of the opcodes must be present before we can determine what the JUMPDEST locations will be. So I’ll begin by writing the entire code with filler values for now.

PUSH1 0x04
CALLDATALOAD
// original lookup table with filler data
PUSH8 0xffffffffffffffff

Another challenging part about this, is that we need to determine how our lookup table is going to work, and what groupings our input will fall under. Since we are using cwu’s lookup table for inputs 0–10, we know that is our first group. Atarpara used groupings of 10 for his loops, and since we want to be more efficient than him, we will try groups of 8.

// 0-10 - get output from lookup table
// 11-18, 19-26, 27-34, 35-42, 43-50, 51-57 - basic loop calculation
// 58+ - revert

Now that we have our groups lined up, we need to figure out how the correct value will be retrieved from our lookup table. Since we want 0–10 to fall into the same group, we will have to alter our input somehow. I decided that to get the correct byte out of our table, we first alter the input by adding 5 to it, then divide it by 8.

// How are we going to get correct JUMPDEST from the table?
// if our groups are 0-10,11-18,19-26,27-34,35-42,43-50,51-57,58+
// A: we add 5 to the input and then divide by 8.
// input 0-10 will fall into category 0-1
// input 11-18 will fall into category 2
// input 19-26 will fall into category 3
// input 27-34 will fall into category 4
// input 35-42 will fall into category 5
// input 43-50 will fall into category 6
// input 51-57 will fall into category 7
// input 58+ will fall into category 8

Perfect! Now we have our structure planned out and its time to implement it in order to get the JUMPDEST locations for our first lookup table.

// JUMPDEST locations: 19,4a,66,86,ab,d6,ff,0139

We got our locations, but now there is a new problem. The final JUMPDEST location takes up two bytes, which means we can’t use BYTE to retrieve the correct value for the input belonging to that group. In order to resolve this, I slightly altered the code to contain a new JUMPDEST location belonging to a single byte offset, that literally only lands, and then JUMPs again to the real start of the group’s factorial calculation.

JUMPDEST // This has a single byte offset location,
PUSH2 0x0139 // and has the sole purpose of jumping to the final group.
JUMP // Go to start of factorial calculation for inputs 51-57.

Once we fill in the lookup table filler bytes with the real JUMPDEST locations we’ve got a working solution! Just kidding nothing is ever that easy, the current logic is slightly flawed.

Since we use a lookup table to get the value for any input, and the final group JUMPDEST simply reverts upon landing, we expect any input greater than 57 to follow this path. However, this only works correctly for input values 58–65 and anything higher than this will retrieve 0 from the lookup table resulting in an invalid jump.

Against my initial desires, I was forced to remove that final group and add a conditional to the start of the code that ensures the input is less than 58 before continuing.

PUSH1 0x04 // skip function signature
CALLDATALOAD // load input
DUP1 // dup input
PUSH1 0x3a // 58 in hex
GT // 58 > input? 1 : 0
PUSH1 0x0d // JUMPDEST location
JUMPI
PUSH0
PUSH0
REVERT // revert with no message for invalid input
JUMPDEST // landing for inputs less than 58

Finally, I think we’ve fixed all the little bugs and can finish our solution:

PUSH1 0x04
CALLDATALOAD // get input
DUP1
PUSH1 0x3a // push 58
GT // 58 > input
PUSH1 0x0c // JUMPDEST location
JUMPI
PUSH0
REVERT
JUMPDEST
DUP1
PUSH1 0x05
ADD // input + 5
PUSH1 0x08 // push 8
SWAP1 // swap 8 and input to divide
DIV // (input + 5) / 8
PUSH1 0x18 // 24 in hex
ADD // add 25 to offset to account for 22 bytes of padding
// Now use "key" to get correct JUMPDEST location from our lookup table.
// First group's jumpdest location is in lookup table twice
// since it takes up two groups.
PUSH8 0x2525567292b6e4df// 8 bytes for each checkpoint/group
SWAP1
BYTE // retrieve the JUMPDEST location from table
JUMP // Jump to correct group starting point

// inputs 0-10
JUMPDEST
PUSH31 0x0375f0016260009d80004ec0002d00001e0000180000180000200000400001
DUP2 // dup input
PUSH1 0x16 // push 22 to stack
MUL // multiply input by 22
SHR // Shift bytes using lookup table as value and our product as shift size
PUSH3 0x3fffff // push this as mask
AND // bitwise AND the unmasked bytes from lookup table with 0x3fffff
PUSH0 // memory offset 0
MSTORE // store the output in memory
PUSH1 0x20
PUSH0
RETURN // Return output for inputs 0-10

// landing for inputs 11-18
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x0a // push 10
PUSH3 0x058980 // push 9!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH1 0x5f // loop start for input 11-18
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 11-18

// landing for inputs 19-26
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x12 // push 18
PUSH7 0x1437EEECD8000 // push 17!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH1 0x7f // loop start for input 19-26
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 19-26

// landing for inputs 27-34
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x1a // push 26
PUSH11 0x0CD4A0619FB0907BC00000// push 25!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH1 0xa3 // loop start for input 27-34
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 27-34

// landing for inputs 35-42
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x22 // push 34
PUSH16 0x0688589CC0E9505E2F2FEE5580000000 // push 33!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH1 0xcc // loop start for input 35-42
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 35-42

// offset for last checkpoint so each checkpoint takes up a single byte
JUMPDEST
// landing here we know input is 50-57
PUSH2 0x0111 // inputs 51-57 landing
JUMP

// landing for inputs 43-50
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x2a // push 42
PUSH21 0x16E39F2C684405D62F4A8A9E2CD7D2F74000000000 // push 41!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH1 0xff // loop start for input 43-50
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 43-50

// landing for inputs 51-57
JUMPDEST // offset 3c
PUSH1 0x01
ADD // add 1 to input
PUSH1 0x32 // push 50
PUSH27 0x17A88E4484BE3B6B92EB2FA9C6C087C778C8D79F9C00000000000// push 49!
JUMPDEST // loop start
DUP2 // dup loop counter
PUSH1 0x01 // push 1
ADD // add 1 to count
SWAP2 // swap count with input
MUL // mul input x output
DUP2 // dup counter
DUP4
GT // compare counter to input
PUSH2 0x0133 // loop start for input 51-57
JUMPI // jump if input > counter
RETURNDATASIZE // push 0
MSTORE // store in memory
MSIZE // push 0x20
RETURNDATASIZE // push 0
RETURN // Ending point for inputs 51-57

Which translates to this runtime code: 60043580603a11600d575f5ffd5b8060050160089004601801672525567292b6e4df901a565b7e0375f0016260009d80004ec0002d00001e0000180000180000200000400001816016021c623fffff165f5260205ff35b600101600a620589805b8160010191028183116060573d52593df35b60010160126601437eeecd80005b8160010191028183116080573d52593df35b600101601a6a0cd4a0619fb0907bc000005b81600101910281831160a4573d52593df35b60010160226f0688589cc0e9505e2f2fee55800000005b81600101910281831160cd573d52593df35b610113565b600101602a7416e39f2c684405d62f4a8a9e2cd7d2f740000000005b816001019102818311610100573d52593df35b60010160327a017a88e4484be3b6b92eb2fa9c6c087c778c8d79f9c000000000005b816001019102818311610135573d52593df3

And this creation code: 7f60043580603a11600d575f5ffd5b8060050160089004601801672525567292b65f527fe4df901a565b7e0375f0016260009d80004ec0002d00001e00001800001800006020527f200000400001816016021c623fffff165f5260205ff35b600101600a620589806040527f5b8160010191028183116060573d52593df35b60010160126601437eeecd80006060527f5b8160010191028183116080573d52593df35b600101601a6a0cd4a0619fb0906080527f7bc000005b81600101910281831160a4573d52593df35b60010160226f06885860a0527f9cc0e9505e2f2fee55800000005b81600101910281831160cd573d52593df35b60c0527f610113565b600101602a7416e39f2c684405d62f4a8a9e2cd7d2f7400000000060e0527f5b816001019102818311610100573d52593df35b60010160327a017a88e4484b610100527fe3b6b92eb2fa9c6c087c778c8d79f9c000000000005b8160010191028183116161012052670135573d52593df360c01b610140526101485ff3

Updated screenshot of leaderboards with rohannero at #5

As you can see, our new solution uses 128,363 gas! Much lower than our previous solution of 179,986. This allowed us to move up to the #5 spot on the leaderboards!

Ok, that was a lot but now that we’ve figured out how to use a lookup table and used it to further optimize our code, let's see how much crazier the next solution, by fedebianu.eth, is!

PUSH1 0x04
CALLDATALOAD
PUSH1 0x01
PUSH1 0x3a // 58 in hex
DUP3
LT // compare input < 58 and leave 1 or 0 on stack
PUSH11 0x92615c565049423b342c24 // lookup table
PUSH1 0x15
PUSH1 0x06
DUP6
PUSH1 0x05
ADD // add 5 to input
DIV // divide by 6
ADD // add 21 (input = ((input + 5) / 6) + 21)
BYTE // takes the above value as offset and takes the
// hard-coded lookup table as value

JUMPI // jumps as long as input is less than 58
RETURNDATASIZE
RETURNDATASIZE
REVERT // for inputs > 57

JUMPDEST // landing for input 55,56,57
PUSH5 0x045461b590 // 54! - 48!
MUL
JUMPDEST // landing for input 49,50,51,52,53,54
PUSH5 0x020ea2db80 // 48! - 42!
MUL
JUMPDEST // landing for input 43,44,45,46,47,48
PUSH4 0xe11fed20 // 42! - 36!
MUL
JUMPDEST // landing for input 37,38,39,40,41,42
PUSH4 0x53971500 // 36! - 30!
MUL
JUMPDEST // landing for input 31,32,33,34,35,36
PUSH4 0x197b6830 // 30! - 24!
MUL
JUMPDEST // landing for input 25,26,27,28,29,30
PUSH4 0x05c6b740 // 24! - 18!
MUL
JUMPDEST // landing for input 19,20,21,22,23,24
PUSH3 0xcbf340 // 18! - 12!
MUL
JUMPDEST // landing for input 13,14,15,16,17,18
PUSH3 0x0a26c0 // 12! - 6!
MUL
JUMPDEST // landing point for input = 7,8,9,10,11,12
PUSH2 0x02d0 // 6! value
MUL

JUMPDEST // landing point for input < 7
PUSH6 0x72908a847e78 // look up table
PUSH1 0x1a // push 26
PUSH1 0x06 // push 6
DUP5 // duplicate input
MOD // input % 6
ADD // add to 26
BYTE // get byte out of lookup table using the result from (input % 6) + 26
JUMP // Jump
JUMPDEST // landing point for 54,48,42,36,30,24,18,12,6 (9 items)
PUSH1 0x05
DUP3
SUB // input - 5
MUL
JUMPDEST // landing point for 53,47,41,35,29,23,17,11,5 (9 items)
PUSH1 0x04
DUP3
SUB // input - 4
MUL
JUMPDEST // landing point for 52,46,40,34,28,22,16,10,4 (9 items)
PUSH1 0x03
DUP3
SUB // input - 3
MUL
JUMPDEST // landing point for 57,51,45,39,33,27,21,15,9,3 (10 items)
PUSH1 0x02
DUP3
SUB // input - 2
MUL
JUMPDEST // landing point for 56,50,44,38,32,26,20,14,8,2 (10 items)
PUSH1 0x01
DUP3
SUB // input - 1
MUL
JUMPDEST // landing point for 55,49,43,37,31,25,19,13,7,1 (10 items)
MUL
JUMPDEST // landing location for 0 input
RETURNDATASIZE
MSTORE
MSIZE
RETURNDATASIZE
RETURN

This solution is quite clever as well. Instead of chunking the loop into multiple smaller loops, it uses a dynamic JUMPDEST location that skips further ahead depending on the input. I.e. bigger input skips the least distance and then goes through each subsequent multiplier for each group. The main difference is that these groups aren’t directly loops, more like checkpoints where smaller inputs skip past the bigger multipliers that don’t apply to them.

Leftover values still needing to be multiplied, which were missed by the broad checkpoint value, are put into 6 groups since there are six tiers to the “leftovers”. For example, the checkpoint for input values 13–18 holds the factorial of 12! — 6!, and each input in the group will get multiplied by 6! to end at 12!. Meaning input value 13 falls into the first tier of leftovers because it only needs to be multiplied once by the output (12!) to get the actual output (13!). On the opposite end, 18 is in the highest tier of leftovers since it needs to be multiplied by 13,14,15,16,17, and 18 to get the actual output (18!).

fedebianu.eth’s solution is similar to cwu’s because it uses lookup tables to retrieve values depending on the input, yet different from his because cwu used lookup tables to retrieve factorials while fedebianu used them to retrieve JUMPDEST offset locations. It is also structured differently from cwu and atarpara’s because it skips past larger input factorial calculations for smaller inputs to land at a single RETURN, in contrast to previous solutions that had separate grouped loops that each containing their own RETURN.

To explain this better reference this diagram. Imagine that the range of possible inputs is 0–3 and that the starting value being multiplied is 1.

Extremely simplified visual representation of fedebianu.eth’s solution

The input starts on the left in the blue square. This is where the first lookup table is used to retrieve the appropriate JUMPDEST location for the given input. If inputted with 1 it should jump to the third red triangle representing our JUMPDEST. If inputted with 3, it should start at the first triangle and be multiplied by 3 and 2. If the values were small, this solution would work exactly like this. However, since the range of values is 0–57, fedebianu grouped JUMPDEST locations together.

This time imagine the possible input values are 0–7:

Simplified diagram of fedebianu.eth’s grouping solution. (0–7 input)

Grouping the JUMPDEST almost tripled the amount of input variables we can handle while maintaining the same amount of code! If we input 5, we get 60 as our product from the first circle and 120 as our final product after the second circle, which is correct since 120 is the factorial of 5. If we try again with 2 as the input we end up with 2 as our output which is also correct! But what happens if we try 3 or 7? Currently, it isn’t possible to get the correct output for half of the allowed input. This is where the second lookup table comes into play.

Simplified diagram of second lookup table structure in fedebianu.eth’s solution. (0–7 input)

Now that we have our initial output value, we will check for leftover values that still need to be multiplied and handle them accordingly. For example, take an input of 4, we start with 2 based on our calculated output from the previous diagram, then we multiply it by 3, (input-1), to get 6. Finally, we multiply 6 by the 4, (input), to get the output of 24. If we try this again with any of the other values with leftovers, we will get the correct output factorial.

Here is a simplified overview of the solution:

Simplified overview of fedebianu.eth’s solution (0–7 input)

By calculating a factorial in this way, you are able to limit the maximum number of iterations/opcodes your code will have to run.

All of the solutions thus far have been extremely clever and I’m excited to see what the final group of optimizations look like:

Screenshot of top 3 leaderboard positions

We’ll continue up the leaderboards starting with divergencearran’s solution. Right away we can see that it uses the same structure as fedebianu used, starting with a lookup table to determine which checkpoint JUMPDEST to go to and then a second lookup table storing which leftover JUMPDEST to go to.

// divergencearran.eth
PUSH1 0x04
CALLDATALOAD // load input without function signature
PUSH1 0x01 // push 1
PUSH1 0x3a // push 58
DUP3 // dup input
LT // compare input < 58
PUSH12 0x13375e59534d463f38312921 // first lookup table
PC // push 0x16 (20) to stack
PUSH1 0x06 // push 6 to stack
DUP6 // dup input
DIV // input / 6
ADD // (input / 6) + 20
BYTE // get JUMPDEST location from lookup table using our modified input
JUMPI // jump if input < 58
RETURNDATASIZE // push 0
RETURNDATASIZE // push 0
REVERT // revert if input > 57

// start of checkpoint system
JUMPDEST // landing for 54-57
PUSH5 0x045461b590 // 54! / 48!
MUL
JUMPDEST // landing for 48-53
PUSH5 0x020ea2db80 // 48! / 42!
MUL
JUMPDEST // landing for 42-47
PUSH4 0xe11fed20 // 42! / 36!
MUL
JUMPDEST // landing for 36-41
PUSH4 0x53971500 // 36! / 30!
MUL
JUMPDEST // landing for 30-35
PUSH4 0x197b6830 // 30! / 24!
MUL
JUMPDEST // landing for 24-29
PUSH4 0x05c6b740 // 24! / 18!
MUL
JUMPDEST // landing for 18-23
PUSH3 0xcbf340 // 18! / 12!
MUL
JUMPDEST // landing for 12-17
PUSH3 0x0a26c0 // 12! / 6!
MUL
JUMPDEST // landing for 6-11
PUSH2 0x02d0 // 6!
MUL
JUMPDEST // landing for 1-5
PUSH6 0x8987817b756f // second lookup table
PUSH1 0x1a // push 26
PUSH1 0x06 // push 6
DUP5 // dup input
MOD // input % 6
ADD // (input % 6) + 26
BYTE // get JUMPDEST location from lookup table using our modified input
JUMP // always jump to correct leftover JUMPDEST

// start of leftover logic
JUMPDEST // landing for 5,11,17,23,29,35,41,47,53
PUSH1 0x04
DUP3
SUB // input - 4
MUL
JUMPDEST // landing for 4,10,16,22,28,34,40,46,52
PUSH1 0x03
DUP3
SUB // input - 3
MUL
JUMPDEST // landing for 3,9,15,21,27,33,39,45,51,57
PUSH1 0x02
DUP3
SUB // input - 2
MUL
JUMPDEST // landing for 2,8,14,20,26,32,38,44,50,56
PUSH1 0x01
DUP3
SUB // input - 1
MUL
JUMPDEST // landing for 1,7,13,19,25,31,37,43,49,55
MUL
JUMPDEST // landing for 0,6,12,18,24,30,36,42,48,54

RETURNDATASIZE // push 0
MSTORE // store output in memory
MSIZE // 32 bytes
RETURNDATASIZE // push 0
RETURN

There are a few differences between this code and fedebianu’s that causes this one to cost less gas. The biggest difference is that the checkpoint groups in this solution actually contain the checkpoint value in them, whereas fedebianu’s solution required the lowest value in a checkpoint to still be multiplied at least once to get the correct output value.

For example, this solution’s smallest checkpoint value is 6! and the values inside that checkpoint are 6–11. Meaning if 6 is the input, it will land inside the checkpoint and the output will be immediately ready. On the other hand in fedebianu’s solution, the smallest checkpoint value is 6! but the checkpoint values are 7–12. So if we input 7 we still have to multiply the 6! by 7 to get the correct output (7!).

Because of this, we can see that the largest checkpoint in fedebianu’s code contains 54–57 while divergencearran’s is 55–57. We can also see that this code contains only 6 leftover JUMPDEST codes and the output can be multiplied by a maximum of 5 times, while fedebianu’s contains 7 JUMPDEST locations and can multiply the output a maximum of 6 times. Those are the main optimizations between the two solutions, an example of a small one would be divergencearran using PC to push 20 to the stack as opposed to a PUSH1 0x14 that takes 2 bytes.

Screenshot of top 2 leaderboard positions

The top two solutions are identical meaning we’ve seen and understood the second most optimized solution. Let’s see what the final and best one looks like:

// philogy and shung.crypto-frens.eth
PUSH1 0x04
CALLDATALOAD // load input
PUSH1 0x01 // push 1
PUSH1 0x3a // push 58
DUP3 // dup input
LT // compare input < 58
PUSH11 0x5c57514b443d362f271f00 // first lookup table (11 bytes instead of 12 like divergence)
PC // push 0x15 (21) to stack
PUSH1 0x06 // push 6
DUP6 // dup input
DIV // input / 6
ADD // (input / 6) + 21
BYTE // get JUMPDEST location from lookup table
JUMPI // jumps if input < 58
RETURNDATASIZE // (only 1 RETURNDATASIZE, whereas divergence used 2)
REVERT // revert if input > 57

// checkpoint system start
JUMPDEST // landing for 54,55,56,57
PUSH5 0x045461b590 // 54! / 48!
MUL
JUMPDEST // landing for 48-53
PUSH5 0x020ea2db80 // 48! / 42!
MUL
JUMPDEST // landing for 42-47
PUSH4 0xe11fed20 // 42! / 36!
MUL
JUMPDEST // landing 36-41
PUSH4 0x53971500 // 36! / 30!
MUL
JUMPDEST // landing for 30-35
PUSH4 0x197b6830 // 30! / 24!
MUL
JUMPDEST // landing for 24-29
PUSH4 0x05c6b740 // 24! / 18!
MUL
JUMPDEST // landing for 18-23
PUSH3 0xcbf340 // 18! / 12!
MUL
JUMPDEST // landing for 12-17
PUSH3 0x0a26c0 // 12! / 6!
MUL
JUMPDEST // landing for 6-11
PUSH2 0x02d0 // 6!
MUL
JUMPDEST // landing for 0-5
PUSH6 0x87857f79736d // second lookup table
PUSH1 0x06 // push 6
DUP4 // dup input
MOD // input % 6
PUSH1 0x1a // push 26
ADD // (input % 6) + 26
BYTE // get JUMPDEST location from lookup table
JUMP // always jump to appropriate leftover tier
JUMPDEST // second landing for 5,11,17,23,29,35,41,47,53
PUSH1 0x04
DUP3
SUB
MUL
JUMPDEST // second landing for 4,10,16,22,28,34,40,46,52
PUSH1 0x03
DUP3
SUB
MUL
JUMPDEST // second landing for 3,9,15,21,27,33,39,45,51,57
PUSH1 0x02
DUP3
SUB
MUL
JUMPDEST // second landing for 2,8,14,20,26,32,38,44,50,56
PUSH1 0x01
DUP3
SUB
MUL
JUMPDEST // second landing for 1,7,13,19,25,31,37,43,49,55
MUL
JUMPDEST // second landing for 0
RETURNDATASIZE // push 0
MSTORE // store output in memory at offset 0
MSIZE // push 0x20 (32) to stack
RETURNDATASIZE // push 0
RETURN

Philogy made minor optimizations that put him ahead of the competition. The first one we see is that his lookup table consists of 11 bytes whereas the previous solution contained 12. This solution also utilized PC to push a value to the stack as opposed to the slightly more expensive PUSH1 0x15. Finally we can also see that he removed one of the RETURNDATASIZE opcodes to cut his solution down by an extra byte, using previous stack items instead of pushing two 0 values.

Gas golfing is a beautiful art form. There are many different initial approaches to solving a certain problem or creating a contract that handles specific functionality, such as returning the factorial of input less than 58. However, once someone figures out an efficient method to solve a problem, it's a race from there to find micro-optimizations to improve the best structure even further.

Screenshot of gas golf progression tab

The amount of creativity and freedom you are given makes these challenges extremely fun! I look forward to the next Gas Golfing course. Huge thank you to everyone at Curta for providing developers the platform to come together and compete against one another while learning and sharing ideas. Also thank you to Harrison and optimizoor for creating this challenge!

You can find links to everyone involved in this challenge below, as well as a link to the course page on Curta’s website.

Curta Course page: https://www.curta.wtf/golf/2

Curta twitter: https://twitter.com/curta_ctf

Harrison/Pop Punk: https://twitter.com/PopPunkOnChain

Gaslite twitter: https://twitter.com/gasliteGG

vectorized.eth: https://twitter.com/optimizoor

philogy.eth: https://twitter.com/real_philogy

shung.crypto-frens.eth: https://twitter.com/shunduquar

divergencearran.eth: https://twitter.com/divergencearran

0xjust.q00t.eth: https://twitter.com/0xjustadev

atarpara.eth: https://twitter.com/atarpara

fedebianu.eth: https://twitter.com/fedebianu

cwu.eth: https://twitter.com/chiachih_wu

Tools I used or found helpful:

Evm Codes: https://www.evm.codes/?fork=cancun

Remix: https://remix.ethereum.org

Tenderly: https://tenderly.co/transaction-simulator

Hex <-> Decimal: https://www.rapidtables.com/convert/number/hex-to-decimal.html

Bother Phil: https://web.telegram.org/k/#@philogy

--

--