Speeding up factorial computation by changing the order of multiplications

Published in

Concerning Pharo

5 min readMay 24, 2016

This is a story about a small, seemingly innocent code change that speeds up a very simple computation. It is pretty magical and serves as an example of how things are not always what they seem.

Factorial

If you know a bit about programming and mathematics, you probably know factorial. The factorial of a non-negative integer number n, denoted as n!, is defined as the product of all positive integers less than or equal to n. In formula form,

n! = n x (n-1) x (n-2) x (n-3) x … x 3 x 2 x 1

According to the convention for an empty product, the factorial of zero is one, o! = 1

Factorial is a popular example to explain recursion, as it is defined by the following two relations,

factorial(0) = 1factorial(n) = n * factorial(n-1)

In programming languages that are powerful enough to transparently switch between different integer representations, factorial is often used to demonstrate this ability — the result grows very quickly.

The largest factorial that fits in a 32-bit integer is 12! and the largest one that fits in a 64-bit integer is 20! — after that you get an overflow.

In Pharo, factorial is built in, as a message understood by Integer objects. Computing 32! is seamless,

32 factorial  => 263130836933693530167218012160000000

The standard implementation

The easiest, clearest and most beautiful way to implement factorial is to directly translate the recursive definition. This is the implementation that we find in Pharo.

Integer>>#factorial
  “Answer the factorial of the receiver.”  self = 0 ifTrue: [ ^ 1 ].
  self > 0 ifTrue: [ ^ self * (self — 1) factorial ].
  self error: ‘Not valid for negative integers’

This does n multiplications, going down in the recursion and coming back out of it, starting with the smallest number building up towards a potentially large result.

Yes, replacing the recursion with a simple iteration will be a bit faster, but it won’t make much difference.

This implementation works quite well in practice. When you actually try it, looking at the result, and thus rendering such huge numbers, usually takes longer than the computation itself. How large a number are we talking about, you might ask ?

10000 factorial numberOfDigits  => 35660

This last number is not even representable as a double IEEE floating point number, as it is ~ 2.846259681 x 10 ^ 35659.

The faster implementation

Here is a slightly different way to compute the product of all numbers between 1 and n. We can reformulate factorial using the more general product of all integers between n and m. Here is our particular implementation.

Integer>>#productTo: upper
  “Return the product of all integers between 
   the receiver (exclusive) and upper (inclusive).”
 
  | delta product middle |
  delta := upper — self.
  delta <= 0 ifTrue: [ ^ CollectionIsEmpty signal ].
  ^ delta < 5
      ifTrue: [ 
        product := upper.
        1 to: delta — 1 do: [ :each | 
          product := (upper — each) * product ].
        product ]
      ifFalse: [ 
        middle := (self + upper) bitShift: -1.
        (self productTo: middle) * (middle productTo: upper) ]

We start by taking the difference, delta, between the upper and lower bounds of the interval for which we have to do the multiplications. If the delta is zero or negative we signal an exception — the interval is empty. Remember, the lower bound is exclusive.

In the most general case, the ifFalse block at the bottom, we apply a divide and conquer strategy. We split the interval in the middle and call ourselves recursively on each half, multiplying both sub results and we’re done.

If the difference is small, here less than 5, we don’t bother and just do the necessary multiplications. For example, if the delta between n and m is is 3, we compute and return m x (m-1) x (m-2).

As you can see, we do the same multiplications, just in a different order.

The speedup

Let’s run some benchmarks to see how our new implementation compares to the original one, and to test the new implementation for correctness.

(1 to: 16) collect: [ :each | 
  | n t1 t2 f1 f2 |
  n := 2 ** each.
  t1 := [ f1 := n factorial ] timeToRun.
  t2 := [ f2 := 1 productTo: n ] timeToRun.
  self assert: f1 = f2.
  { n. t1. t2 } ]

We compute the factorials of the first 16 powers of 2 and see how long each computation took. This is the result.

an Array(
  an Array(2 0:00:00:00 0:00:00:00)
  an Array(4 0:00:00:00 0:00:00:00)
  an Array(8 0:00:00:00 0:00:00:00)
  an Array(16 0:00:00:00 0:00:00:00) 
  an Array(32 0:00:00:00 0:00:00:00) 
  an Array(64 0:00:00:00 0:00:00:00)
  an Array(128 0:00:00:00 0:00:00:00)
  an Array(256 0:00:00:00 0:00:00:00)
  an Array(512 0:00:00:00 0:00:00:00)
  an Array(1024 0:00:00:00 0:00:00:00)
  an Array(2048 0:00:00:00.003 0:00:00:00)
  an Array(4096 0:00:00:00.011 0:00:00:00.001)
  an Array(8192 0:00:00:00.047 0:00:00:00.006)
  an Array(16384 0:00:00:00.097 0:00:00:00.026)
  an Array(32768 0:00:00:01.198 0:00:00:00.115)
  an Array(65536 0:00:00:07.913 0:00:00:00.535))

As you can see, from the moment the computation takes longer than our clock’s resolution, the new implementation is faster, by about a factor 10 to about a factor 20. Let’s concentrate on some of the larger numbers.

#(25000 50000 75000 100000) collect: [ :each | 
  | t1 t2 f1 f2 |
  t1 := [ f1 := each factorial ] timeToRun.
  t2 := [ f2 := 1 productTo: each ] timeToRun.
  self assert: f1 = f2.
  { each. t1. t2 } ]

Which gives us similar results.

an Array(
  an Array(25000 0:00:00:00.787 0:00:00:00.065)
  an Array(50000 0:00:00:04.550 0:00:00:00.433)
  an Array(75000 0:00:00:17.679 0:00:00:00.852)
  an Array(100000 0:00:01:12.878 0:00:00:01.623))

How is that even possible, since it is basically doing the same number of multiplications ?

The explanation

The reason our alternative implementation is faster is that it tries to multiply numbers that are of similar size, specifically when the numbers become large.

The original implementation builds up a large number by multiplying the running product by each consecutive small number in the interval.

The divide and conquer strategy of the alternative implementation creates a tree of intervals, to gradually grow the size of the numbers in a more balanced way.

And why would that be faster ?

The answer lies in how large (or big) integer arithmetic, specifically multiplication, is implemented. The standard algorithms to do this are variants of the Karatsuba algorithm which is more efficient when multiplying two large numbers of comparable size.

By modifying the order in which our multiplications are done, we use that fact to our advantage.

More about factorial

The improved factorial implementation presented here is in no way the best one. If you want to explore the many possible implementations of this simple function, consult the incredible detailed Fast Factorial Functions website.