Solving the mystery behind Abstract Algorithm’s magical optimizations

Yesterday, I reported the bizarre observation that certain functions can behave as if they had negative complexity. You don’t need to read that article before this one. In short, the λ-term f(bits) = copy(comp(inc,n,bits)) is asymptotically faster than g(bits) = comp(inc,n,bits); i.e.,copy (a O(1) operation for a fixed size) behaves as if it had a O(1/n) complexity, causing the program to run faster by doing more things (!?).

That’s not the only bizarre complexity result I had when experimenting with the Abstract Algorithm. Years ago, I noticed that it is capable of computing programs that wouldn’t fit in a computer the size of the universe. Recently, I posted how it can apply an O(1) operation N times in O(log(N)) time (!?). Despite insightful answers being provided, it was hard to make sense of such unintuitive behaviors.

After a lot of research and some recent insights, it finally hit me. Turns out the explanation for all of that is pretty simple, and, in this article, I’ll explain what is going on carefully (not jokingly this time). If you’re in a hurry, feel free to skip to the tl;dr at the end.

1. Optimizable functions are those that fuse when self-composed

First, let’s start with the simplest example on which something interesting happens: repeated not. Let’s implement it in two different ways:

True  = λt. λf. t
False = λt. λf. f
not_a = λb. (b False True)
not_b = λb. λt. λf. (b f t)

If you don’t understand those definitions, you can read about lambda encodings, but that’s not too important. Now, how would consecutive applications of not_a and not_b perform in Absal (an implementation of the Abstract Algorithm)? Here is a chart:

Complexity (total rewrites) of applying “not” N times to True

Here, complexity is measured as graph rewrites. Don’t worry about that, though, just assume the time it takes to run a λ-term is proportional to the number of graph rewrites. As you can see, not_a performs linearly, while not_b performs logarithmically. That’s a huge difference, for such a small change! And is also very unintuitive, because you shouldn’t be able to apply N nots in O(log(N)). Then why that happens? Before answering, let’s first make an experiment. Let’s compose both functions with themselves and check their normal forms.

  • norm(not_a.not_a): λb. (b (λt.λf.f) (λt.λf.t) (λt.λf.f) (λt.λf.t))
  • norm(not_b.not_b): λb. λt. λf. (b t f)

Interesting: once self-composed, not_a grows larger, but not_b stays the same size. Why? Because the later function fused. That’s the same effect that allows Haskell to turn several passes over a list in a single pass, eliminating intermediate structures. This leads us to hypothesize that the magical speedup could have something to do with that property. Let’s check that hypothesis by testing the map operation on lists:

Nil   =         λc. λn. n
Cons = λh. λt. λc. λn. c h (t c n)
map_a = λf. λl. l (λh. λt. Cons (f h) t) Nil
map_b = λf. λl. λc. λn. l (λh. λt. c (f h) t) n
list = Cons True (Cons False (Cons True (Cons False Nil)))

Here, we once again defined it in two different ways: map_b fuses and map_a doesn’t. Let’s measure the complexity of applying map repeatedly to a list:

Complexity (total rewrites) of applying “map” N times to a list of 4 elements

As you can see, the same behavior is observed here: the version that fuses performs logarithmically, while the version that doesn’t performs linearly, supporting our hypothesis. Woa! But why would that happen?

(Here is the code for reproducing that experiment and generate the charts.)

2. Exponentiation by squaring = composition by sharing

To understand that, let’s first talk about integer exponentiation. If you’re a developer, chances are you’ve heard about exponentiation by squaring. It is an ancient, simple algorithm capable of computing A^B in much less steps than it’d take to multiply A repeatedly, B times. To refresh our minds, let’s visualize how it works:

# Computing 13^8 with repeated multiplication
13^2 = 13 * 13 = 169
13^3 = 13 * 169 = 2197
13^4 = 13 * 2197 = 28561
13^5 = 13 * 28561 = 371293
13^6 = 13 * 371293 = 4826809
13^7 = 13 * 4826809 = 62748517
13^8 = 13 * 62748517 = 815730721
# Computing 13^8 with exponentiation by squaring
13^2 = 13 * 13 = 169
13^4 = 169 * 169 = 28561
13^8 = 28561 * 28561 = 815730721

As you can see, the first approach multiplies 13 one-by-one, needing N multiplications to reach 13^N (linear time), while the second repeatedly squares the first number instead, allowing it to jump from 13^1 to 13^2 to 13^4 to 13^8 in a single multiplication for each jump (logarithmic time). To compute values that aren’t perfect squares, you just multiply sub-terms, i.e., 13^11 = 13^8 * 13^2 * 13^1.

Neat, right? But how is that relevant? Due to a less known fact, for any function that can be fused by composition, we can also compute its repeated application in O(log(N)) steps using that same trick! Let’s visualize it too:

# Computing not^8(t) with repeated application
not^2(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b f t) = (λb.λf.λt. b t f)
not^3(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b t f) = (λb.λf.λt. b f t)
not^4(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b f t) = (λb.λf.λt. b t f)
not^5(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b t f) = (λb.λf.λt. b f t)
not^6(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b f t) = (λb.λf.λt. b t f)
not^7(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b t f) = (λb.λf.λt. b f t)
not^8(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b f t) = (λb.λf.λt. b t f)
# Computing not^8(t) with composition by sharing
not^2(t) = (λb.λf.λt. b f t) . (λb.λf.λt. b f t) = (λb.λf.λt. b t f)
not^4(t) = (λb.λf.λt. b t f) . (λb.λf.λt. b t f) = (λb.λf.λt. b t f)
not^8(t) = (λb.λf.λt. b t f) . (λb.λf.λt. b t f) = (λb.λf.λt. b t f)

Ok, that is way too hard to read. Let’s replace lambdas by their names:

# Computing not^8(x) with repeated application
not^2(x) = not . not = id 
not^3(x) = not . id = not
not^4(x) = not . not = id
not^5(x) = not . id = not
not^6(x) = not . not = id
not^7(x) = not . id = not
not^8(x) = not . not = id
# Computing not^8(x) with composition by sharing
not^2(x) = not . not = id 
not^4(x) = id . id = id
not^8(x) = id . id = id

Can you see it? The first approach applies nots one by one, needing N applications to reach the final result (linear time), while the second repeatedly self-composes the initial function, jumping from not^2 to not^4 to not^8 with just a single composition for each jump (logarithmic time). In other words, as long as the self-composition of f fuses, we can implement repeated application in logarithmic time, using the same trick behind exponentiation by squaring!

3. Absal performs runtime fusion

Sadly, knowing that trick isn’t sufficient to speed-up fusible algorithms: we also need a functional language capable of performing short-cut fusion during runtime. Haskell, for one, isn’t capable of doing that for native functions. To illustrate, let’s implement “composition by sharing” in it:

This function takes a number n, a function f, and returns a new function that applies f repeatedly to x, n times, using the “composition by sharing” method. As an example, (comp(not,10,x)) evaluates to the equivalent of:

Notice how this function applies not 10 times to its input, even though it only uses composition 3 times? That’s why it should be efficient. So, what is the complexity of comp(not, N, x) in Haskell? Not logarithmic as we’d expect, but linear! You can test that fact by running this Gist. If you change 2^20 to something higher such as 2^40, it will run out of memory. Why? Because it is too lazy and never evaluates inside lambdas, which would be necessary to fuse those intermediate expressions. And, even if it did, it’d still not work. This separate posts investigates why, examining Haskell and JavaScript’s evaluation model and why they’re not optimal.

Now we’re real close to solving our puzzle. But, first, we need to talk about parallel universes; I mean, parallel reductions. The main insight behind Absal is the use of interaction nets instead of simpler textual representations. Interaction nets are a model of computation based on graphs with labelled ports (one being the “main port”) and a very peculiar caveat: whenever two “main ports” met, the nodes connected by them are rewritten.

Rewrite rules of interaction combinators, a type of interaction net

That causes a very interesting effect: no matter what order you rewrite your nodes, the total number of steps stays constant. If, thus, you compile λ-terms to interaction nets, then you can reduce expressions in any order: functions first, arguments first, even in parallel. In the end, the performed work will be the same!

The trick, of course, is how to properly translate λ-terms to interaction nets. Sadly, there is no solution to that covers all λ-terms efficiently. But, for a huge class of them (probably whatever you’d write in practice), there is a simple and elegant way to do it: that’s the Abstract Algorithm. To explain it shortly, each lambda and each application of a λ-term becomes a ternary node on the net, and each repeated occurrence of a variable becomes a “fan node”. Also, the main port of an application node points to the function port, which is why redexes can never be duplicated.

If you’re overwhelmed, don’t worry. The main insight is that, by being free to reduce terms in whatever order it wants, Absal is capable of being as eager as possible without any performance penalty, which, in turn, allows it to fuse intermediate compositions. This is exactly what is needed to exploit the “composition by sharing” trick. If you do want to have a better understanding of the Abstract Algorithm, I recommend reading this book, it is extremely pedagogical. Alternatively, you can just contemplate the image below:

It depicts the evaluation of the(λg. g (g λx. x)) (λh. (λf. f (f λz. z)) (h λy. y)) term on the abstract algorithm. Notice how sub-expressions can be shared by several terms, how we can evaluate inside lambdas without ever risking duplicating work (and while still being able to be lazy), and how a single constant-time beta-reduction in this format is equivalent to several beta-reductions on more traditional representations. Beautiful, no?


But what about the negative complexity stuff?

Let’s recap all that we have learned so far. First, some λ-calculus functions fuse when they are self composed. By “fuse”, I mean the size of their normal forms stay constant. Second, those functions can be applied N times with O(log(N)) complexity using the “composition by sharing” strategy, as long as the host language can eagerly normalize sub-expressions inside lambdas. Third, Absal is capable of evaluating inside lambdas, which allows it to do exactly that, fusing inner compositions during runtime. That leads us to one logical conclusion: copy must somehow cause inc to fuse! Wow! Could it be? Let’s test that hypothesis by looking normal forms!

-- Normal form of `inc`:
λa. λb. λc. λd. (((a c) λe. (b λf. λg. λh. (((e g) λi. (f λj.
λk. λl. (((i k) λm. (j m)) l))) h))) d)
-- Normal form of `inc . inc`:
λa. λb. λc. λd. (((a λe. (b λf. λg. λh. (((e g) λi. (f λj. λk
. λl. (((i k) λm. (j m)) l))) h))) λe. (c λf. λg. λh. (((e g)
λi. (f λj. λk. λl. (((i k) λm. (j m)) l))) h))) d)
-- Normal form of `inc . inc . inc`:
λa. λb. λc. λd. (((a λe. (c λf. λg. λh. (((e g) λi. (f λj. λk
. λl. (((i k) λm. (j m)) l))) h))) λe. (b λf. λg. λh. (((e λi
. (f λj. λk. λl. (((i k) λm. (j m)) l))) λi. (g λj. λk. λl. (
((i k) λm. (j m)) l))) h))) d)
-------------------------------------------------------------
-- Normal form of `copy . inc`:
λa. (((a λb. λc. λd. λe. (d (((b λf. λg. λh. λi. (g (((f λj.
λk. λl. λm. (k j)) λj. λk. λl. λm. (l j)) λj. λk. λl. l))) λf
. λg. λh. λi. (h (((f λj. λk. λl. λm. (k j)) λj. λk. λl. λm.
(l j)) λj. λk. λl. l))) λf. λg. λh. h))) λb. λc. λd. λe. (c (
((b λf. λg. λh. λi. (h (((f λj. λk. λl. λm. (k j)) λj. λk. λl
. λm. (l j)) λj. λk. λl. l))) λf. λg. λh. λi. (g (((f λj. λk.
λl. λm. (l j)) λj. λk. λl. λm. (k j)) λj. λk. λl. l))) λf. λg
. λh. h))) λb. λc. λd. d)
-- Normal form of `copy . inc . inc`:
λa. (((a λb. λc. λd. λe. (c (((b λf. λg. λh. λi. (h (((f λj.
λk. λl. λm. (k j)) λj. λk. λl. λm. (l j)) λj. λk. λl. l))) λf
. λg. λh. λi. (g (((f λj. λk. λl. λm. (l j)) λj. λk. λl. λm.
(k j)) λj. λk. λl. l))) λf. λg. λh. h))) λb. λc. λd. λe. (d (
((b λf. λg. λh. λi. (h (((f λj. λk. λl. λm. (k j)) λj. λk. λl
. λm. (l j)) λj. λk. λl. l))) λf. λg. λh. λi. (g (((f λj. λk.
λl. λm. (l j)) λj. λk. λl. λm. (k j)) λj. λk. λl. l))) λf. λg
. λh. h))) λb. λc. λd. d)
-- Normal form of `copy . inc . inc . inc`:
λa. (((a λb. λc. λd. λe. (d (((b λf. λg. λh. λi. (h (((f λj.
λk. λl. λm. (k j)) λj. λk. λl. λm. (l j)) λj. λk. λl. l))) λf
. λg. λh. λi. (g (((f λj. λk. λl. λm. (l j)) λj. λk. λl. λm.
(k j)) λj. λk. λl. l))) λf. λg. λh. h))) λb. λc. λd. λe. (c (
((b λf. λg. λh. λi. (g (((f λj. λk. λl. λm. (l j)) λj. λk. λl
. λm. (k j)) λj. λk. λl. l))) λf. λg. λh. λi. (h (((f λj. λk.
λl. λm. (l j)) λj. λk. λl. λm. (k j)) λj. λk. λl. l))) λf. λg
. λh. h))) λb. λc. λd. d)
-- Note: using a fixed depth of 3 for better visualization

There it is! That’s exactly what happens. Don’t worry about the contents of those terms, we’re just considering their sizes. Here, the composition of inc with itself is small, but as we keep appending more inc.s, the normal form grows bigger, linearly. In other words, inc does not fuse with itself. If we just append copy. to the sequence of inc.s, then the normal form of the resulting function is larger, but stays constant, regardless of the number of compositions. In other words, its self-composition merely permutes variables and, thus, fuses. That, in turn, allows the “composition by sharing” effect to kick off, which causes the Nth composition of inc to have logarithmic complexity. Feels good, inc.!

And that’s it. Mystery solved. Here is the code for replicating this experiment with Absal (see my previous post for install instructions).


A last pending question would be: why copy causes inc to fuse? But that’s not hard to understand. First, let’s visualize the definitions of copy and inc:

Now, let’s fix a depth d=2 and visualize what inc becomes when composed with copy:

Turns out copy sucked inc into itself, feeding it a bunch of constructors, which transformed it into a function that exhaustively pattern-matches all branches of its input to the maximum depth, instead of prematurely returning when no more work is needed, as it did before. Notice how line 16 on the first snippet was replaced by lines 8-21 on the second one. In other words, inc essentially became a big permutation / lookup table instead of a normal recursive function. That sounds like a bad idea, but turns out this complete coverage is exactly what was needed for the function to be able to fuse with itself. Without it, the symbolic variables that are prematurely returned in certain branches couldn’t be reduced further, accumulating in self-composed terms, causing them to grow larger and larger. Or, alternatively, the additional pattern matches self-destroy under self-composition, keeping the function’s body small.

Conclusion

Well, that was quite a trip. Turns out that the Abstract Algorithm had simple explanations behind its magical results. I’m glad I finally figured those things out, and thankful to the Redditors who commented with insights and speculations; in special, /u/matt-noonan, who mentioned the composition by sharing trick which put all pieces together in my head.

tl;dr

In Absal, an optimal implementation of functional languages, repeated application is performed with church-numbers, which are represented by functions with a high amount of internal sharing. For example, 21 is represented as:

λf1. λx. 
let f2 = f1 . f1 in
let f4 = f2 . f2 in
let f8 = f4 . f4 in
let f16 = f8 . f8 in
f16 (f4 (f1 x))

That format is very similar to “exponentiation by squaring”, a classic algorithm capable of computing the exponentiation of A^B in O(log(N)) instead of O(N). That alone isn’t sufficient to explain the phenomena; in Haskell, for example, using that function gives no speedups. Absal, though, is extremely eager (despite suffering no slowdowns for it), which makes it capable of computing the normal form of intermediate f . f compositions, essentially performing fusion at runtime. That’s only useful when f fuses when self-composed, otherwise the growth of its normal form would cause an exponential slowdown due to the cost of copying increasingly larger terms.

If f fuses under self-composition, though, then Absal manifests a “composition by sharing” effect, which is similar to “exponentiation by squaring”, except with functions and composition instead of integers and multiplication. This is what allows it to compute N repeated applications of f in O(log(N)) time. Finally, turns out copy . inc transforms inc into a function that self-fuses, by replacing symbols that accumulate by lambdas that self-destroy under self-composition, which is why it is asymptotically faster than inc alone.

tl;dr tl;dr

Magic.