Faster Purely Functional Data Structures for Java
Since the arrival of Java 8 there has been a Cambrian explosion in the number and variety of libraries offering persistent collections. PCollections has been around for many years and more recently has been joined by Paguro, Dexx and JavaSlang among others.
A common set of interfaces
This article sets out to demonstrate both why maintaining flexibility of implementation and a range of choice (of Persistent Collections) is critical to building high performance applications, and how a common set of interfaces can help achieve that.
Cyclops extends the interfaces defined in PCollections to provide more complete set of functionality for working with persistent collections. We have also ensured that much of that additional functionality executes lazily — driving performance gains as chained operations can be carried during a single iteration over a collection. With eagerly executed commands, each additional command in the chain results in an additional iteration.
What are persistent collections and why should I care?
Persistent collections are immutable collections with efficient copy-on-write operations. That is rather than mutate a collection in place we efficiently generate a new copy by leveraging the existent parts of our current collection (all of which are immutable). [For a more detailed analysis see Chris Osaki famous thesis on Purely Functional Data Structures here!]
Mutable state breeds complexity
Mutable state increases the complexity of our applications, the more parts can change within a component the less sure we are of its state at any point in time and the more unit tests we need to write to be confident that it works. Once mutable components integrate with other mutable components we get a combinatorial effect on complexity and an application that is challenging to reason about and fully test. Writing purely functional programs presents it’s own set of challenges, many of which, the Java language is ill-equiped to meet. But we can take some of the best ideas from functional programming and leverage them in a dual paradigm Object-Functional language like Java to minimize application complexity.
Persistent collections bring simplicity
We can visualize a persistent List as a series of linked nodes, terminated by an empty list. To prepend a value to the list we simply create a new node that references the start of our existing list. Prepending to persistent lists is very efficient, operating in constant time.
When choice can equal performance
We can safely share list with other classes, API consumers or even across threads secure in the knowledge it can’t change. Operations that add, remove or update elements in the list all result in new Lists without changing the original. The key to all this is to perform these operations as efficiently as possible, and each of the different implementations will involve trade-offs in efficiencies for different operations. Having a broad choice of persistent collection implementations can be crucial to having good application performance.
Persistent collections matter because as engineers we are increasingly building applications that execute code in parallel across multiple cores. With Immutable collections we can be sure that our data is not being modified on another thread, and can forgo the complexities of synchronization and locking. With persistent collections we gain the benefits of immutability with the ability to grow / shrink and update our collections traditionally the realm of mutable collections.
Persistent collections in Java
PCollections defined both a set of interfaces for Persistent Collections that extend JDK collection interfaces, using plus and minus as functional analogues to mutable add and remove methods, and also a suite of persistent implementations PStack (List), HashTreeSet, HashTreeBag, TreeVector and more. Later libraries aimed to port persistent collections from Clojure (e.g. Paguro) and Scala (e.g. JavaSlang and Dexx). The upshot is, as of late 2016 Java Developers have a rich choice of Persistent Collection libraries to pick from. The 8.4.0 release of cyclops added a couple more — the persistent collections from Scala and Clojure themselves.
Calling Scala directly from Java
You could already use Scala and Clojure collections from Java (if you are of a particularly masochistic nature). E.g. to prepend one list to another we could define this monstrosity in Java -
Calling Scala from cyclops in Java
The equivalent code using cyclops in Java is altogether more reasonable.
JDK interfaces and their persistent analogues
The extended interfaces used by cyclops are listed below :-
In cyclops-react itself all implementations of P<xxx>X are eager. However in the cyclops-integration modules we leverage Pivotal’s powerful Reactor library to implement a much faster set of Lazy Extended Collections.
The interfaces provide a common way to do a large range of functional operations such as map / flatMap / zip/ reduce ( and many, many more).
The Cyclops Performance benefit
Ok, so working smoothly with Scala and Clojure collections is something new, but we already have JavaSlang, why do we need cyclops-javaslang, what value does it add (you might well ask)?
We can define a benchmark that will highlight why users of other persistent collections might want to leverage the Lazy extensions support in cyclops. It creates a large Vector and performs some pretty standard functional operations on it — map, filter, reduce
The difference in result is significant as we can see in the chart below. cyclops-javaslang captures all of the functional operations and executes them in one single iteration resulting in much lower ms per op.
Note : We would see a similar performance gain when performing these types of operations with Scala or Clojure collections (or even the PCollections as bundled directly with cyclops-react) but the code to write the benchmark for the former in Java would be significantly more gnarly and not presentable in a respectable blog!
To answer the question posed at the start of this section, Lazy Persistent Collections offer significant performance gains when performing multiple chained functional operations regardless of the persistent collection type.
How to choose your collection type
There are many different ways to implement a persistent collection, and not all persistent collections are created equal, different operations have different time complexities per implementation algorithm. As we have seen already a persistent LinkedList is a very simple structure that has efficient prepends, unfortunately appends are lot slower especially when compared to structures like mutable ArrayLists. Fortunately their are a lot of different algorithms for implementing persistent collections — and a broad of libraries and implementations for Java devs to choose from.
The performance difference between the BitMappedVectorTrie backed Vector implementations and the TreeVector alternative is explained well by Daniel Spiewek in his talk “Extreme Cleverness : Functional Data Structures in Scala”.
Set and Map Types
Rich Hickey’s talk on Clojure’s Data Structures is useful if you are thinking of making use of them in your application.
Smooth integration with third party APIs
There is a rich eco-system of Clojure and Scala open source libraries with APIs that accept and return native collection types. With cyclops-clojure and cyclops-scala we hope that you will be able to build applications that integrate smoothly with these APIs from Java, without incurring collection conversion costs.
Heterogeneous development environments
Many of us are lucky enough to work in modern heterogenous development environments where some teams work with Scala, other Java or Clojure. Perhaps even on the same team some of your services are written in a different JVM language to the rest.
Whether you want to use third party APIs, or use libraries written by collegues in Scala from Java, we hope that you will find our Scala (and other) integrations useful.
Immutable collections help us build simpler applications that are more readily parallelizable. However building efficient persistent collections involves trade-offs and research breakthroughs can make a larger difference here than in the mutable world of the Java Collection Framework. As development on the traditional Java Persistent Collection Framework (PCollections) scaled back, we found we had to find another way to ensure our applications built using these libraries could stay current and as performant as possible. Extending our powerful Lazy Extended Collection infrastructure across the eco-system seemed the best course of action to ensure flexibility and choice in core persistent implementation selection, coupled with the real power and speed advantages offered by the cyclops lazy extensions.
Where to find out more?
To find out more about cyclops and it’s extended collections checkout the user guide here.