Dysfunctional programming in Java 7 : Immutable Collections
From Dysfunctional to Functional
In episode 2 of Dysfunctional Programming in Java we covered why and how to make your Java Objects Immutable. In this article we will dive deeper with Immutable Collections. (Elsewhere in the series we have covered laziness, functional composition, null handling, error handling, and concurrency.)
Recap on Immutability
Method signatures like this are painfully common in Java
Looking at the signature it isn’t clear what it does, but a reasonable guess would be that it performs some form of I/O or mutation of its input parameters. The void return type lets us know that it must be an impure statement: there is no return value and thus no way of interacting with the outside world other than by changing its state in some way.
A good principle to follow when developing imperative applications is Command-Query Separation. From Wikipedia:
It states that every method should either be a command that performs an action, or a query that returns data to the caller, but not both. In other words, asking a question should not change the answer. More formally, methods should return a value only if they are referentially transparent and hence possess no side effects.
It’s good practice in OO to use void as the return type for methods that mutate state or perform I/O. If handleAccounts is only performing I/O and merely reading from its parameters then it would be far from the worst method with this type of signature I’ve seen. Even C++ programmers regard the mutation of input parameters as inherently dysfunctional (🙊). We can prevent the possibility by making our Objects immutable.
Mutable Collections Result in Incidental Complexity
Making use of Immutable Collections in our APIs can prevent ourselves and our teammates producing pseudo-functional code like this :-
We managed to put a nice fluent Stream together and then ruin it all by mutating an input parameter. But why do I think that is bad? Three reasons:
- ⚠️🦟Mutation Bugs🕷⚠️ It is a potential source of bugs when clients calling handleAccounts do not expect the List<Account> they provide to be mutated, or when it gets mutated in a different way than expected in the future.
- ⚠️🤷🏽♀️Distributed Object Creation😕⚠️. It can be difficult to keep track of the scope of changes made to input parameters (building up an Object or Collection can become incredibly distributed across your code base)
- ⚠️🏎Race Conditions🏇⚠️. It is not safe to call handleAccounts asynchronously if the List<Account> is expected to be used on another thread in some manner (problems can occur whether reading from or writing to the collection)
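The first of these hazards can be reproduced in a few lines of plain Java. This is a hypothetical sketch, not the article's original listing: the body of handleAccounts, the "closed-" naming rule, and the use of String ids in place of Account objects are all assumptions made to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: account ids are plain Strings to keep the example short.
final class MutatingHandler {

    // A fluent Stream pipeline, followed by a mutation of the input parameter.
    static void handleAccounts(List<String> accounts) {
        List<String> archived = accounts.stream()
                .filter(id -> id.startsWith("closed-"))
                .map(id -> "archived-" + id)
                .collect(Collectors.toList());
        accounts.addAll(archived); // side effect visible to every holder of the list
    }

    public static void main(String[] args) {
        List<String> accounts = new ArrayList<>(List.of("open-1", "closed-2"));
        handleAccounts(accounts);
        // The caller's list has silently grown from 2 to 3 entries.
        System.out.println(accounts); // [open-1, closed-2, archived-closed-2]
    }
}
```

Nothing in the void signature warns the caller that the list they passed in will come back changed.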
Immutable Collections Keep Things Simple
A better approach is to make use of an Immutable List type as input and return a new list with the accounts added.
Cyclops and other functional libraries for Java provide a range of immutable collections that efficiently reuse shared memory to allow ‘updates’ or changes to the collections.
We can make use of the Cyclops immutable LinkedList type called Seq for this purpose
Switching to an Immutable (and functional) type offers the following advantages
- ✅Eliminate Mutation Bugs✅ We avoid bugs caused by client code expecting an unchanged Seq<Account> list passed as an input parameter — it never changes!
- ✅Eliminate Distributed Object Creation / Population✅ Our ultimate Seq<Account> creation can be managed in a single place
- ✅Eliminate Race Conditions✅. We can safely share a Seq<Account> list across threads.
- ✅Eliminate Boilerplate✅ Because Seq has a lot of useful functional combinators!
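As a rough illustration of the same idea without any library, the sketch below uses the JDK's unmodifiable lists in place of Seq. That swap is an assumption made so the example is self-contained: unlike the shared-memory collections described above, a JDK unmodifiable list is copied on every 'update'. The String ids and the "closed-" rule are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch using JDK unmodifiable lists as a stand-in for Seq.
final class ImmutableHandler {

    // Returns a new unmodifiable list; the input is only read, never changed.
    static List<String> handleAccounts(List<String> accounts) {
        Stream<String> archived = accounts.stream()
                .filter(id -> id.startsWith("closed-"))
                .map(id -> "archived-" + id);
        return Stream.concat(accounts.stream(), archived)
                .collect(Collectors.toUnmodifiableList());
    }

    public static void main(String[] args) {
        List<String> accounts = List.of("open-1", "closed-2");
        List<String> updated = handleAccounts(accounts);
        System.out.println(accounts); // [open-1, closed-2]
        System.out.println(updated);  // [open-1, closed-2, archived-closed-2]
    }
}
```

The caller now has to use the return value to see the new entries, which makes the data flow visible in the signature.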
Imperative code with mutable datastructures
Going back to our application, we have a method where we attempt to load all of a user’s Data files into memory given a List of the DataFileMetadata Objects that describe where the files are and how to load them. There are a number of constraints:
- We need to ensure that the User is authorized to load files
- We need to ensure that the User actually has files to load
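An imperative method of the kind described here might look roughly like the sketch below. It is an assumption-laden reconstruction, not the original code: the String user, the "admin" rule, the unchecked UserNotAuthorizedException, and the fake getContents body are all hypothetical stand-ins.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ImperativeLoader {

    // Assumed to be unchecked here; the real exception may differ.
    static final class UserNotAuthorizedException extends RuntimeException {
        UserNotAuthorizedException(String user) { super(user + " may not load files"); }
    }

    // Hypothetical stand-in for the article's DataFileMetadata.
    static final class DataFileMetadata {
        private final String location;
        DataFileMetadata(String location) { this.location = location; }

        // Real code would read from the file system or the cloud.
        String getContents() throws IOException {
            if (location.isEmpty()) throw new IOException("no location");
            return "contents of " + location;
        }
    }

    // Hypothetical authorization rule: only "admin" may load files.
    static boolean isAuthorized(String user) { return "admin".equals(user); }

    static List<String> processUsersFiles(String user, List<DataFileMetadata> files) throws IOException {
        if (!isAuthorized(user)) throw new UserNotAuthorizedException(user);
        if (files.isEmpty()) throw new IllegalArgumentException("no files to load");
        List<String> result = new ArrayList<>();   // mutable accumulator
        for (DataFileMetadata file : files) {
            result.add(file.getContents());        // sequential I/O, any failure aborts
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        List<DataFileMetadata> files = List.of(new DataFileMetadata("a.txt"), new DataFileMetadata("b.txt"));
        System.out.println(processUsersFiles("admin", files)); // [contents of a.txt, contents of b.txt]
    }
}
```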
There are a number of potential pitfalls with the code as it stands
1. It’s not particularly robust
We are throwing three different types of Exceptions and there are as many opportunities to fail as there are Data files to load (+2: the empty check and the User Authorization check). Any one of these failures fails the entire process.
2. It’s doing a lot of I/O sequentially on a single thread
It’s going to perform like a dog. getContents is loading data either from the file system or the cloud, and we are currently loading each file in turn.
3. Because of the mutable state parallelism is harder to implement
The method is still pretty simple so far, but even moving getContents onto a separate set of threads that populates the mutable List<String> result is fraught with potential gotchas — as our application grows the potential problem areas will multiply!
Let’s simplify with more functional collections
As a first step we can rework the method signature to accept a Vector, which is an efficient Immutable analog of an ArrayList.
Being a functional type Vector supports a number of very useful functional combinators. We can, for example, replace the external iteration over the files and population of a mutable ArrayList with a simple map operation. Map transforms all the contents of the Vector from one state to another, in this case from our input type DataFileMetadata to our output type String.
Making Illegal States Unrepresentable
It is good practice to design our APIs in such a way that they guide the developers using them along the correct path. Rather than allowing illegal states to be represented in code, which forces us to throw an Exception, it would be better to use appropriate types to prevent them.
If a List must have at least one entry, then we should only accept a NonEmptyList!
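To show how a type can rule out the empty case, here is a minimal sketch of a non-empty list in plain Java. It is a hypothetical illustration, not the Cyclops NonEmptyList class, and its method names (of, head, size, map, toList) are chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

// Hypothetical minimal type, not the Cyclops class: a list that cannot be empty.
final class NonEmptyList<T> {
    private final T head;
    private final List<T> tail; // unmodifiable, possibly empty

    private NonEmptyList(T head, List<T> tail) {
        this.head = head;
        this.tail = tail;
    }

    // The signature demands at least one element, so no runtime empty check is needed.
    @SafeVarargs
    static <T> NonEmptyList<T> of(T head, T... tail) {
        return new NonEmptyList<>(head, List.of(tail));
    }

    T head() { return head; }

    int size() { return 1 + tail.size(); }

    // map preserves non-emptiness: every input element yields exactly one output element.
    <R> NonEmptyList<R> map(Function<? super T, ? extends R> fn) {
        List<R> mappedTail = new ArrayList<>();
        for (T t : tail) mappedTail.add(fn.apply(t));
        return new NonEmptyList<>(fn.apply(head), Collections.unmodifiableList(mappedTail));
    }

    List<T> toList() {
        List<T> all = new ArrayList<>();
        all.add(head);
        all.addAll(tail);
        return Collections.unmodifiableList(all);
    }

    public static void main(String[] args) {
        NonEmptyList<String> files = NonEmptyList.of("a.txt", "b.txt");
        System.out.println(files.map(String::length).toList()); // [5, 5]
    }
}
```

A method that accepts this type can never be handed zero files, so the IllegalArgumentException branch has nothing left to guard against.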
We can update our functional operations on files to also include a conversion to vector.
Now we can remove the empty check and IllegalArgumentException when files is empty, and our updated code now looks like this :-
Errors as values, not Exceptional control flow
Instead of throwing a UserNotAuthorizedException, we could return an Either instead. That is our return type would either be an Error or the Vector<String>.
The type signature is verbose, but with local variable type inference in Java 10 we should be able to avoid having to redeclare it elsewhere.
We can refactor our method implementation to return an Either.left if the user is unauthorized and an Either.right with the Vector of Strings if they are and we download the data.
We can make this code better still if we modify the return type of AuthorizationService::isAuthorized to be an Either, e.g.
This means we can simply chain additional operations onto the returned Either.
Where loadContents transforms the Vector<DataFileMetadata> into Vector<String>
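The chaining described above can be sketched with a tiny hand-rolled Either. Everything here is an assumption for illustration: this Either is not the Cyclops type, errors are plain Strings, files are represented by their names, and the "admin" rule stands in for the real authorization check.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class EitherPipeline {

    // Hypothetical stand-in for AuthorizationService::isAuthorized: only "admin" passes.
    static Either<String, List<String>> isAuthorized(String user, List<String> files) {
        if ("admin".equals(user)) return Either.right(files);
        return Either.left("user not authorized: " + user);
    }

    // Stand-in for the real loading step.
    static List<String> loadContents(List<String> files) {
        return files.stream()
                .map(file -> "contents of " + file)
                .collect(Collectors.toUnmodifiableList());
    }

    // The whole method is one chained expression; a left skips loadContents entirely.
    static Either<String, List<String>> processUsersFiles(String user, List<String> files) {
        return isAuthorized(user, files).map(EitherPipeline::loadContents);
    }

    public static void main(String[] args) {
        List<String> files = List.of("a.txt");
        String granted = processUsersFiles("admin", files).fold(e -> "Left: " + e, v -> "Right: " + v);
        String denied = processUsersFiles("guest", files).fold(e -> "Left: " + e, v -> "Right: " + v);
        System.out.println(granted); // Right: [contents of a.txt]
        System.out.println(denied);  // Left: user not authorized: guest
    }
}

// Hypothetical minimal Either: holds a left (error) or a right (value), never both.
final class Either<L, R> {
    private final L leftValue;
    private final R rightValue;
    private final boolean rightSide;

    private Either(L leftValue, R rightValue, boolean rightSide) {
        this.leftValue = leftValue;
        this.rightValue = rightValue;
        this.rightSide = rightSide;
    }

    static <L, R> Either<L, R> left(L value) { return new Either<>(value, null, false); }

    static <L, R> Either<L, R> right(R value) { return new Either<>(null, value, true); }

    boolean isRight() { return rightSide; }

    // Transforms the right value; a left passes through untouched.
    <T> Either<L, T> map(Function<? super R, ? extends T> fn) {
        if (rightSide) return Either.right(fn.apply(rightValue));
        return Either.left(leftValue);
    }

    // Collapses both cases into a single result.
    <T> T fold(Function<? super L, ? extends T> onLeft, Function<? super R, ? extends T> onRight) {
        return rightSide ? onRight.apply(rightValue) : onLeft.apply(leftValue);
    }
}
```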
We have reduced our implementation of processUsersFiles down to a single line, while capturing or removing validation errors, but we are still processing all of the Data files sequentially and a single I/O error will fail the entire batch.
ReactiveStreams : Adding in some concurrency
So far in this article we’ve been making use of the original imperative DataFileMetadata, but in another article we covered its refactor to a more functional style. The Functional version had a loadAsync method that we could use to load the files concurrently.
Also, rather than throw an IOException it gives us an Either Object that allows us to deal separately with the error case without disrupting the application’s flow. We can refactor the loadContents method to use the reactive mergeMap operator on Vector and the loadAsync method on DataFileMetadata, which returns a reactive-streams Publisher (LazyEither).
⏫The code above⏫ will load 10 data files concurrently and return their contents in a Vector.
How we respond to I/O errors has also changed: the default behavior will be to aggregate the successful results and ignore the errors. While this is better than failing the full batch due to one failure, it is still sub-optimal.
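The "load concurrently, keep the successes, drop the failures" behavior can be approximated with JDK classes alone. This sketch swaps the reactive mergeMap pipeline for CompletableFuture on a 10-thread pool, which is an assumption made for self-containment; unlike the reactive version it blocks while collecting results, and the load function and file names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

final class ConcurrentLoader {

    // Loads every file on the supplied executor; failed loads are dropped from the result.
    static List<String> loadContents(List<String> files, Function<String, String> load, Executor pool) {
        // Start all loads first so they run concurrently...
        List<CompletableFuture<Optional<String>>> pending = files.stream()
                .map(file -> CompletableFuture
                        .supplyAsync(() -> Optional.of(load.apply(file)), pool)
                        .exceptionally(error -> Optional.empty()))
                .collect(Collectors.toList());
        // ...then wait for each in turn, keeping only the successes (input order is preserved).
        return pending.stream()
                .map(CompletableFuture::join)
                .flatMap(Optional::stream)
                .collect(Collectors.toUnmodifiableList());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(10); // at most 10 loads in flight
        try {
            Function<String, String> load = file -> {
                if (file.startsWith("missing")) throw new IllegalStateException("cannot load " + file);
                return "contents of " + file;
            };
            System.out.println(loadContents(List.of("a.txt", "missing.txt", "b.txt"), load, pool));
            // [contents of a.txt, contents of b.txt]
        } finally {
            pool.shutdown();
        }
    }
}
```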
Retrying after loading failures
We can build retry functionality into how we handle the result of DataFileMetadata::loadAsync. To keep things clean, the first step is to move the call out into its own method.
The return type of loadAsync is a LazyEither, a type that is both a reactive-streams Publisher that can handle asynchronous population & processing and has stack-safe recursive support in its map and flatMap operators.
We can implement limited retry support by recursively calling back into our retry method on failure.
Let’s unpack what is happening in this very short, but information dense method
- flatMapLeft : is only ever called if loadAsync fails
- When flatMapLeft is invoked we recursively call asyncWithRetry (providing additional retries are allowed)
- We call asyncWithRetry on the current Thread (r -> r.run() is a simple implementation of an Executor which immediately runs the supplied Runnable on the current thread).
If loading fails, on the same thread we recursively call asyncWithRetry until we successfully load the data or run out of retries.
It doesn’t matter how large the number of retries initially supplied is — we won’t blow the stack, because of the built-in trampoline in LazyEither.
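The recursive retry idea can be sketched with the JDK's CompletableFuture standing in for LazyEither. That substitution is an assumption made so the example runs without Cyclops, and it loses one property described above: there is no trampoline here, so a loader that fails synchronously many times will deepen the call stack. The flaky loader is hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class RetryingLoader {

    // Calls loadAsync, and on failure calls it again until it succeeds or retries run out.
    static <T> CompletableFuture<T> asyncWithRetry(Supplier<CompletableFuture<T>> loadAsync, int retries) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(loadAsync, retries, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> loadAsync,
                                    int retriesLeft,
                                    CompletableFuture<T> result) {
        loadAsync.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);                      // success: stop here
            } else if (retriesLeft > 0) {
                attempt(loadAsync, retriesLeft - 1, result); // failure: recurse with one retry fewer
            } else {
                result.completeExceptionally(error);         // out of retries: surface the last error
            }
        });
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical loader that fails twice before succeeding.
        Supplier<CompletableFuture<String>> flaky = () -> calls.incrementAndGet() < 3
                ? CompletableFuture.<String>failedFuture(new IllegalStateException("load failed"))
                : CompletableFuture.completedFuture("file contents");
        System.out.println(asyncWithRetry(flaky, 5).join() + " after " + calls.get() + " calls");
        // file contents after 3 calls
    }
}
```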
Failing if loading fails
Errors in loading are still ignored, but if we encounter a loading error we will now retry the specified number of times before giving up. We may prefer to fail the entire batch with an Error value if the retries keep failing.
The refactoring of loadContents to support this is slightly more involved (don’t worry, we will break it down in a moment), but one of the nice features of this implementation is that everything is now asynchronous and non-blocking, and this flows through from loadContents into processUsersFiles. We need to refactor the processUsersFiles implementation as well.
The map operator needs to change to flatMap (because the return type of loadContents is now also an Either).
Let’s unpack what is happening inside loadContents :
- We create a (potentially) reactive-stream from the Vector
- mergeMap introduces concurrency & asynchronicity to the Stream, files are loaded on the supplied Executor and any failures are retried up to 10 times
- reduceAll : collects all of the resultant Strings asynchronously into a Vector (still inside the Stream)
- findFirstOrError : will return either the successfully collected Strings in a Vector or the first Exception if one occurs
- mapLeft : converts the Exception into a loading Error value
The resultant Either is a reactive LazyEither instance that will be populated asynchronously. We can carry on with the processing of the rest of our application and access data from the LazyEither when we need it, or tee up additional operations to perform.
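A JDK-only approximation of this all-or-nothing behavior is sketched below. It is not the Cyclops pipeline: CompletableFuture.allOf stands in for reduceAll and findFirstOrError, and an exceptionally completed future plays the role of the left Either. Those substitutions, the String file names, and the loadAsync function are assumptions for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.stream.Collectors;

final class AllOrNothingLoader {

    // Completes with every file's contents, or exceptionally if any single load fails.
    static CompletableFuture<List<String>> loadContents(List<String> files,
                                                        Function<String, CompletableFuture<String>> loadAsync) {
        // Kick off every load without waiting for any of them.
        List<CompletableFuture<String>> pending = files.stream()
                .map(loadAsync)
                .collect(Collectors.toList());
        // allOf finishes once every load has finished; the joins below then return immediately.
        return CompletableFuture
                .allOf(pending.toArray(new CompletableFuture[0]))
                .thenApply(done -> pending.stream()
                        .map(CompletableFuture::join)
                        .collect(Collectors.toUnmodifiableList()));
    }

    public static void main(String[] args) {
        Function<String, CompletableFuture<String>> loadAsync = file ->
                CompletableFuture.supplyAsync(() -> "contents of " + file);
        // The caller gets a future back straight away and can tee up further steps on it.
        loadContents(List.of("a.txt", "b.txt"), loadAsync)
                .thenAccept(System.out::println) // [contents of a.txt, contents of b.txt]
                .join();                         // only so the demo waits before exiting
    }
}
```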
Our complete code for this session now looks like this :
Testing the code
We can run the processUsersFiles passing in different test fixtures.
If we run it with a failing URL and print out the result we will see a left Either with an Error.
And with a URL that we are able to successfully download contents from, it will look more like this (a right Either with the contents):-
We can demonstrate the non-blocking nature of the code by adding some logging to the console.
Now we are going to print the start time before we set up our full data flow. Then we will print a couple of messages with their timestamps to test whether processUsersFiles has blocked us, before attempting to print ‘x’, which will contain the result of our asynchronous processing. We won’t be able to print the “Completed at” time until after the asynchronous processing has completed. If we run the code, the output will look something like this :-
What this shows us is that we are able to print out “Blocked?” and “No…” relatively shortly after setting up the dataflow (about 200ms), but that it took over 10 seconds for us to be able to print the completed message. With relatively few lines of code we have set up a robust, asynchronous & concurrent dataflow for processing our files, free of many of the bugs and challenges that plague traditional imperative Java code.
compile group: 'com.oath.cyclops', name: 'cyclops', version: '10.0.3'