Mastering data structures in Ruby — Persistent lists

To close this series, I want to introduce the concept of persistent data structures, a term coined by Driscoll et al. in a 1986 article and one that is at the core of most purely functional data structures these days.

The first time I heard the term was in a talk by Rich Hickey, the creator of Clojure, where he showed how Clojure’s data structures internally deal with mutations (or the lack of them, actually).

In this post, we are going to implement a “persistent” linked list, a data structure that behaves almost the same as the singly linked list we built in the first post of this series but is immutable and has copy-on-write semantics.

Let’s start by reviewing how mutations work.

In imperative programming languages, when you want to modify a data structure, you do it in such a way that the current version of the data structure is destroyed and replaced by the new one. For instance, if you have a linked list that contains an item and you add another one, you end up with a list that contains two items. The previous version of the list (the one that contained one item) is no longer available; hence these structures are not persistent. (Incidentally, that is why this kind of update is called a “destructive assignment.”)

> list = (1)
> list.insert(2)
> list.print
(1, 2)

In functional programming languages, though, things are a bit different because you can’t modify data structures.

Although it sounds a bit crazy at first, in functional programming languages variables are immutable, so you can’t change them. In functional settings, updates in place are not even an option; if you have a linked list that contains an item and you add another one, you end up with two lists: the original version that contains the sole item, and the new version that contains two. This kind of data structure is said to be persistent because, as long as you hold a reference to them, different versions of the same data structure can coexist in our programs.

Persistence is achieved by making a copy of the original data structure and applying the changes to that copy rather than modifying the original. A conventional technique for this kind of update is called COW (copy on write), which, as its name implies, consists of making a copy of the data structure before writing changes to it.

> list1 = List.new(1)
> list2 = List.insert(list1, 2)
> list1.print
(1)
> list2.print
(1, 2)

Notice how, as long as we hold a reference to them, both versions of the linked list can coexist in the same program.

A nice thing about persistent data structures is that they can be shared in concurrent programs without worrying about locks or unwanted mutations. Since they are immutable, and changes to one version are not visible to the others, they can be safely shared between multiple threads and execution contexts.
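As a quick illustration of that point, here is a minimal self-contained sketch of several threads reading a shared immutable structure with no locks at all (a plain frozen Array stands in for our list, just to keep the example short):

```ruby
# A frozen Array stands in for a persistent list in this sketch.
shared = [1, 2, 3].freeze

# Four threads read the shared structure concurrently; since nobody can
# mutate it, no mutexes or locks are needed.
threads = 4.times.map do
  Thread.new { shared.sum }
end

puts threads.map(&:value).inspect # => [6, 6, 6, 6]
```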

(If you are thinking that having to copy the entire list over and over may introduce performance penalties, you are right. However, you don’t have to worry about that, because as we will see when we get to the implementation details, there are a couple of techniques we can apply to reuse unaffected nodes and amortize the whole process.)

Persistent list interface

The interface of this data structure is strongly shaped after the interface of the singly linked list that we built in the first post of this series, but there is one significant difference: on persistent lists, all methods are class methods. We cannot make calls directly on instances of this data structure; since instances are immutable, all interactions happen through class methods.

Methods

Notice that, in contrast to what happens with singly linked lists, the time complexity of most methods is O(n). That’s because before doing any work on the list they have to make a copy of it.

Implementation details

For the internal storage, we are going to use a modified version of the singly linked list from the first post of this series. We cannot use the original version as-is because we need read-only nodes to store values, and a method that allows us to reuse nodes from a given position to the tail of the list.

To store values in the persistent list, we are going to use a homemade version of “write-once nodes” to emulate immutable behavior. It’s not a perfect solution, but it’ll be good enough for the scope we are trying to cover in this post. What we are going to do is call the freeze method once the node’s data has been set, to prevent accidental mutations.

As it happens with regular linked lists, each node has two attributes.
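In code, a sketch of such a write-once node could look like this (the attribute names data and next_node are assumptions based on the list from the first post):

```ruby
# Sketch of a "write once" node; attribute names are assumptions.
class Node
  attr_reader :data
  attr_accessor :next_node

  def initialize data
    @data = data
    @data.freeze # prevent accidental mutations of the stored value
  end
end
```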

Now that we know the node representation and the internal storage for this list, let’s jump to the code and see how each method works.

Empty

This method creates an empty list. Its complexity is O(1).

def self.empty
  list = LinkedList.new
  list.freeze
end

Insert

This method inserts a new item into a read-only copy of the specified list and returns that copy.

Since we have to copy the whole list the complexity of this method is O(n).

def self.insert list, data
  ret = self.copy list
  ret.insert data
  ret.freeze
end

Update

Updates an item in a read-only copy of the specified list and returns that copy.

Update is the first method where amortization takes place. The strategy we use is:

  • Create a copy of each element until we get to the target node.
  • Update the value on a new version of the target node.
  • Reuse elements from the node that’s next to the target node until we get to the end of the list.

The complexity of this method is O(n) where n is the number of elements from head to the target node.

def self.update list, node, data
  ret = LinkedList.new
  reuse = false
  found = false
  list.each do |nd|
    unless found
      found = (nd.data == node.data)
      if found
        ret.insert(data)
        reuse = true
        next
      end
    end
    if reuse
      # Reuse nodes from target to tail.
      ret.reuse_from_node(nd)
      break
    else
      ret.insert(nd.data) # copy nodes that precede the target as-is
    end
  end
  ret.freeze
end
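The node-reuse part of this strategy is easier to see on a stripped-down cons cell (the Cell struct below is a hypothetical stand-in for our list nodes): every node before the target is copied, while the whole suffix after it is shared between versions.

```ruby
# Hypothetical cons cell standing in for the list's nodes.
Cell = Struct.new(:data, :next_cell)

def update head, target, data
  return nil unless head
  if head.data == target
    # New node for the target; the suffix (head.next_cell) is reused as-is.
    Cell.new(data, head.next_cell).freeze
  else
    # Nodes before the target are copied.
    Cell.new(head.data, update(head.next_cell, target, data)).freeze
  end
end

c3 = Cell.new(3, nil).freeze
c2 = Cell.new(2, c3).freeze
c1 = Cell.new(1, c2).freeze

v2 = update(c1, 2, 99)
v2.next_cell.data                 # => 99
v2.next_cell.next_cell.equal?(c3) # => true, the tail is shared
```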

Remove

This method removes an item from a read-only copy of the specified list and returns that copy.

The strategy we use here is the same we use for updates, but instead of creating a new version of the target node, we just skip it while copying the list.

As it happens with updates, reusing as many nodes as we can is a top priority.

The complexity of this method is O(n) where n is the number of elements from head to the target node.

def self.remove list, node
  ret = LinkedList.new
  reuse = false
  found = false
  list.each do |nd|
    unless found
      found = (nd.data == node.data)
      if found
        reuse = true
        next # skip the target node.
      end
    end
    if reuse
      # Reuse nodes from target to tail.
      ret.reuse_from_node(nd)
      break
    else
      ret.insert(nd.data)
    end
  end
  ret.freeze
end

Cat

This method creates a read-only list that contains a copy of all nodes from LHS (left-hand side) and RHS (right-hand side).

Cat is another method where amortization takes place; this time the trick is to copy LHS only and concatenate RHS to its end. Since RHS remains untouched, it can be safely shared with the rest of our program.

The complexity of this method is O(n) where n is the length of LHS.

def self.cat lhs, rhs
  ret = self.copy lhs
  ret.cat rhs
  ret.freeze
end

Len

This method returns the length of the given list.

Since the internal list has an attribute that contains that information, the complexity of this method is O(1).

def self.len list
  list&.length || 0
end

Find First

This method finds the first occurrence of the given predicate applied to the specified list.

The complexity of this method is O(n).

def self.find_first list, &predicate
  return nil unless list
  list.find_first(&predicate)
end

Each

This method walks the specified list yielding one element at a time to the given block.

The complexity of this method is O(n).

def self.each list, &block
  return nil unless list
  list.each(&block)
end

Print

This method prints the contents of the specified list. The complexity of this method is O(n).

def self.print list
  if list
    list.print
  else
    # Call Kernel.print explicitly; a bare `print` here would dispatch to
    # this class method and recurse.
    Kernel.print "empty"
  end
end

Copy

This method is not part of the public API, but I decided to include it as part of the analysis of this interface because it is the only method that can mutate lists. What this method does is copy elements from the source list into a new list and return that list.

Note that the list that this method returns is not read-only! We have to take extra care each time we use it from our code, and, of course, never expose it to the public.

The complexity of this method is O(n).

# The `private` keyword does not apply to class methods defined with
# `def self.`, so we use private_class_method to keep copy out of the
# public API.
def self.copy src
  dst = LinkedList.new
  src.each { |node| dst.insert node.data }
  dst
end
private_class_method :copy

When to use persistent data structures

I like to think of persistence as a nice side effect of not being able to afford updates in place. So any use case that meets that constraint would be a nice fit for this kind of data structure.

Also, if your code is tangled with mutexes and locks to coordinate execution in concurrent environments, you may want to give purely functional data structures a try. They are well suited to work in such settings.

So that’s it for this brief introduction to persistent data structures. I hope you enjoyed it!

As usual, you can get the code from here.

Now that this series is done, what is next? I will probably do a short series on mastering algorithms in Ruby as a follow-up to this one. If you have liked it so far, stay tuned!

Thanks for reading! Also, don’t forget to clap if you like this post :)

PS: This post is part of a series on mastering data structures in Ruby. As we move forward, subjects that were thoroughly discussed in previous posts are sometimes glossed over (if mentioned at all). If you want to get the most out of this series, I recommend you start from the very first post, on singly linked lists.

PS2: If you are a member of the Kindle Unlimited program, you can read a slightly more polished version of this series for free on your Kindle device.

Data Structures From Scratch — Ruby Edition.