Laziness in Clojure

As you probably already know, Clojure employs a call by value parameter passing mechanism and is, in most cases, eagerly evaluated. So where does lazy evaluation fit into the language? The answer is in lazy sequences. Clojure has the ability to treat any data structure conforming to the sequence interface as a lazy sequence, which provides countless benefits to those who know how to handle them.

How it works.

I like to think of lazy sequences in Clojure as anonymous collections. We don’t know what elements inhabit this collection, nor do we know how many elements are in the collection. The only things we do know about a lazy sequence is that it implements the Clojure’s sequence abstraction, and that it’s got a head.

The head of a lazy sequence is the first unrealized element in the sequence, where unrealized means that Clojure has not yet computed its final value. Inside the head of every lazy sequence lives a thunk. The thunk is an uninvoked closure that takes no arguments, and when called, will realize the rest of the lazy sequence. To give you an idea of how the head and the thunk work together, take this example from The Joy of Clojure.

Notice that, although we’re saving the result of the first 10 loops of lazy-range to a first-ten reference, the lazy-range function has only looped twice. That’s because lazy-range returns a lazy sequence! When we call take 2 and print the result to the console, we’re forcing Clojure to evaluate the first two values of first-ten, but nothing else. The diagram below illustrates how this process works.

The first two steps are the realized state of the lazy sequence at some point, and together, they represent the head of this sequence. The thunk is simply an uninvoked evaluation on the rest of our sequence. Because we only grabbed the first two items with take, our sequence was only stepped through twice and the function body was only executed twice.

Clojure usually applies a technique called “chunking” (which I’ll get to later) to batch evaluate steps of a lazy-sequence, but we were able to combine recursion and the lazy-seq macro to generate a de-chunked lazy sequence in order to demonstrate the process one step at a time.

So if we’re taking 3 in the above snippet, REALIZED will only get printed once because Clojure will cache all of the realized elements of first-ten to avoid re-computation.

How can I take advantage of Lazy Sequences in Clojure?

Since lazy sequences are such an integral part the language, many of Clojure’s core functions are built around them to optimize computation. Take this imperative Java example:

If you’re unfamiliar with Java syntax, we’re just looping through a list of titles, and running a handful of computations for each title — first we compute the checksum, and we increase our counter if our verifyOK(checksum) function returns a truthy value. An idiomatic Clojure equivalent would look like the following:

You’ve probably noticed that this interpretation is passing over the collection more times than its Java counterpart. However, because both map and filter are optimized to work with lazy sequences, their values aren’t computed until the sequence is realized. Therefore, functions like map and filter can be chained together, and Clojure will compose these functions, only passing over the collection when you need to realize it (in our case, with count).

lazy-seq

Clojure provides the lazy-seq macro which allows the developer to return a lazy sequence when producing collections. Even though many of Clojure’s core functions are built to work with lazy sequences, wrapping the lazy-seq macro around the body of your function ensures that everything in your collection will become realized only as they are needed.

Let’s assume we have a client that’s quite fond of the number 23, and wants a program that will take a bunch of numbers and return a collection of those numbers multiplied by 23.

We might write a solution like this and feel good about ourselves for using recursion. But if our program took a large input…

…it breaks. However if we just wrap our function body in lazy-seq

… our function can handle large inputs. This is just one example of how the lazy-seq macro can be especially useful when using recursion to build collections, because it will not consume as much of the stack.

Make your code work with lazy sequences

Let’s see what happens when we take our lazy-range function from before and plugged it into our original multiply-by-twenty-three function.

Even though our lazy-range function returns a lazy sequence, all of our elements became realized as soon as we passed them into multiply-by-twenty-three. This is because multiply-by-twenty-three isn’t optimized to work with lazy sequences, and has to realize its argument in order to operate on that sequence and return a new one.

Since lz-multiply-by-twenty-three utilizes the lazy-seq macro, it never realizes more of the sequence than it needs to. In this case it only needs to realize the first element, its head.

Make use of core functions designed for lazy seq’s

Other than than map and filter, Clojure has countless functions designed with lazy sequences in mind. For example, we could have re-written our lz-multiply-by-twenty-three example without explicitly including the lazy-seq macro:

In this example, lazy-cat does the same thing at using concat inside lazy-seq.

Clojure also has a number of functions that can produce infinite data structures. Functions like range, cycle, repeat, and iterate all produce lazy sequences. It’s helpful to know the algorithmic complexity of certain functions before using them on a lazy sequence. Head-item functions such as take and take-while are relatively inexpensive because they operate on the head of the sequence. Tail-item functions like butlast and take-last run the risk of becoming very computationally expensive because they have to realize the majority of the sequence they act on.

Using infinite data structures and functions that are optimized for lazy sequences is crucial to writing idiomatic Clojure. And although it can be intimidating at first, the benefits to taking advantage of this core language feature are enormous.

Gotcha’s

Some common “gotcha’s” when dealing with lazy sequences will come in the form of realizing your lazy sequence before you intended to. Imagine we wanted our lz-multiply-by-twenty-three function to return a vector, we might be able to write our new implementation like this:

It looks like we’ve achieved our goal, but let’s make sure that it can handle large inputs…

What happened? We didn’t remove lazy-cat, and all we did was add an innocent vec function call to the end of our chain — so why has the laziness been sapped from our function?

The answer is because we wanted to turn our sequence into a vector at each step of the way. When we call the vec function, we’re converting our return type from LazySeq to PersistentVector. When we try to cast our LazySeq to another type, we’re forcing the sequence to realize all of its elements — essentially un-lazifying our sequence. Changing a sequence from one type to another (via into or sequence “create” functions like set, vec, list) can mistakenly be seen as a trivial undertaking for those new to the language and not well versed in the fundamental differences between Clojure’s data structures, but doing so on a lazy sequence can yield unintended consequences, especially in the context of infinite sequences.

Hanging on to your Head

This gotcha is can be more difficult to recognize than accidentally converting a lazy sequence. Consider this example from Clojure Programming:

In the snippet above, both t and d represent the same values. That is, they are the de-structured parts of an enormous sequence —(range 1e8) — with t representing the head and d representing the tail. It’s important to keep in mind that both t and d are both lazy sequences themselves as well.

The difference between the two snippets is that in the bottom one, the number of elements in d is being calculated before the elements in t, which causes an OutOfMemoryError to be thrown.

Remember that the head of a lazy sequence holds a thunk, or a function that points to our original input and will realize the rest of the t when called. In the top example, Clojure will evaluate t until it is fully realized. At this point, Clojure knows that this is the last use of t and begins the garbage collection process for t while it executes (count d).

The OutOfMemory error in the bottom example occurs for a combination of reasons. Before it is realized, the thunk being held in t keeps a pointer to the original input of our function, (range 1e8), as the head must do. When we start counting d before realizing t, t will still hold a reference to our large input function. So as we realize d through (count d), we’re still holding on to the unrealized head of our original input.

In the first example, our code executes as expected when (count t) gets run first because the thunk will get smaller and smaller and t realizes its elements. On the other hand, by executing (count d) first we’re forcing Clojure to hold a reference to the onerous head of (range 1e8) while our program counts to 999 million, causing the OutOfMemory error.


When you hear some variation of the phrase, “Don’t hang on to your head”, in the context of lazy sequences, you’ll know the reference is to this particular topic. The main takeaway is that when working with lazy sequences, be sure not to hold a reference to that sequence (or its head) over continuous iterations via a let, def, or argument binding.

Lazy Sequences & Side Effects

Lazy sequences are great for allowing the developer to express computations without specifying when those computations should happen. Because we’re not specifying the point at which these computations should be executed, lazy sequences do not work well when the realization process involves side-effects.

Let’s assume that we have a program that takes a collection of identification numbers, and returns a collection of records from the database filtered according to a vampire? specification. Our code might look like this:

In the above snippet, vampire-database is simply a map of maps used to loosely simulate a real database, get-vampire takes a single id and gets that corresponding record with a 100 millisecond penalty, vampire? acts as the litmus test for determining what constitutes a vampire, and identify-vampires takes a collection of ids and puts together the final result.

If we wanted to track down the first vampire, we could pass in a collection of numbers, then grab the first element. There’s only a 100ms penalty on the database lookup and since we’re only realizing the first element, this process should only take 100ms.

The good news is that we got our first vampire, the weird part is that we had to wait about 3.2s to get the first element. This is because Clojure uses a technique called “chunking” every time it realizes the head of a lazy sequence. In this process, Clojure will “chunk” together the first 32 elements of a lazy sequence and realize them all at once. Although this might seem counterintuitive, the strategy actually makes lazy sequences much more efficient when dealing in larger quantities, which is what lazy sequences are intended for. However in this example, chunking shows us how delayed computation can become cumbersome when that computation involves a side-effect.

There are multiple techniques that can generate a “non-chunking” lazy sequence (see our first example), but Clojure’s default chunking behavior is preferred since the developer usually won’t notice the effects of chunking in their regular workflow…unless they design with side effects.


Conclusion

Lazy sequences in Clojure have two parts, the head is the first unrealized element of the sequence, and the thunk is an uninvoked, argument-less function that, when called, will realize the rest of the elements in the sequence starting from the head. Clojure’s core functions, like map, filter, and others are already optimized to work with lazy sequences, but you can also utilize the lazy-seq macro to produce a lazy sequence when producing potentially large collections. Lazy sequences can have some unexpected “gotcha’s”. Type coercing any lazy sequence will realize it’s value in full, and holding a reference to the head of a lazy sequence can cause a program to balloon memory. Side effects should also be avoided when dealing with lazy sequences due to Clojure’s chunking optimization. If you have any questions, comments, or disputes about the content of this post, then feel free to leave a comment — all of this information was new to me at the time of writing this, so I’d love to hear different perspectives on the subject matter.

Sources

Show your support

Clapping shows how much you appreciated Pat Wentz’s story.