Benchmarking Functional Error Handling in Scala

Marcin Rzeźnicki
Published in Iterators
Aug 7, 2019

Conventional wisdom has it that using too many functional abstractions in Scala is detrimental to overall program performance.

Yet, these abstractions are an immense help if you want to write clean and abstract code. So, should practitioners of FP drown in guilt for writing inefficient code? Should they give way to less functional code?

Let’s find out!

The question I’ve been hearing a lot recently is:

I have used EitherT all through my code base because it helps with concise error handling. But I heard it is very slow. So, should I abandon it and write error handling myself? But if I do that, isn’t the pattern-matching slow? Meaning the best solution would be to simply throw exceptions?

It’s not so easy to answer that…

Yes, the gut feeling every Scala developer has is that all the fancy monadic transformers add a lot of non-optimizable indirection (at the bytecode level) that throws JIT off and is slower than what your Java colleagues might have written. But how bad is it?

On the other hand, if you stop using the benefits of functional abstractions made possible by Scala’s powerful type system, then you’re left with just a “better Java” kind of language. You may as well throw in the towel and rewrite everything in Kotlin.

Another gut feeling you might have happens when your code starts calling other systems via network. It’s then that whatever you are doing in your code is mostly irrelevant because communication costs dwarf any benefits or losses.

So let’s try to go beyond these hunches and try to measure the impact of being uncompromising functional programmers. I’ll use JMH to do that.

Devising an Experiment

The first step is to create a piece of code that’s representative of the problems you want to measure, which, in this case, means typical code that deals with error handling in business logic. This usually means code that takes some sort of input and validates it. Once validated, the code kicks off a transformation, fetches additional data, calls the outside world, and waits for a result.

If the result is correct, the code performs some additional processing and returns the final result. If it isn’t, the code performs some bookkeeping and propagates the error back to the caller.

This pattern is generic enough to be applicable in a wide variety of circumstances (e.g., authentication and calling external services) and allows for measuring the impact of various techniques (e.g., EitherT and exceptions) without being too restrictive.

So, let’s start with:

The parameters of the functions represent benchmark parameters you’d like to control. You start with a random Input, which holds a number in [0, 100), and validInvalidThreshold controls how often the validation function returns Right; initially, 80% of cases pass.

We also simulate (with failureThreshold) how often our interaction with The Dark Side ends with an error (we’ll be using these parameters to check if the performance of error handling techniques depends on error distribution).

Last but not least, you’ll want to use JMH’s Blackhole. It helps simulate long-running code by consuming an arbitrary number of time tokens in a way that JIT cannot optimize away.

Two additional state params, baseTimeTokens and timeFactor, control the timings. baseTimeTokens sets an arbitrary delay inside the transform function. Let’s say that your transformation is a bit more complex than just copying the input. timeFactor specifies how many times slower the other functions are - i.e., initially you’d say that interacting with the outside world, AKA ‘The Dark Side,’ is 5 times slower than what you’re doing within your system. You’ll be using these parameters to simulate more complex code.
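Put together, the state described above can be sketched as follows, minus the JMH annotations. This is a dependency-free sketch, not the benchmark’s actual code: the parameter names come from the text, while the defaults for baseTimeTokens and failureThreshold are made up.

```scala
import scala.util.Random

// A sketch of the benchmark state (the real version is a JMH @State
// class with @Param fields). Names come from the article; the defaults
// for baseTimeTokens and failureThreshold are assumptions.
final case class Input(n: Int) // holds a number in [0, 100)

class BenchState(
    val validInvalidThreshold: Int = 80, // % of inputs that pass validation
    val failureThreshold: Int = 20,      // % of external calls that fail
    val baseTimeTokens: Long = 10,       // simulated cost of `transform`
    val timeFactor: Long = 5             // how much slower external calls are
) {
  def randomInput(): Input = Input(Random.nextInt(100))

  def validate(in: Input): Either[String, Input] =
    if (in.n < validInvalidThreshold) Right(in)
    else Left(s"invalid input: ${in.n}")
}
```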

Let’s start with Scala Future – while I’m sure you’re aware that it is rarely the recommended effect these days, it’s still very popular.

Future

EitherT vs Either

Let’s measure the impact of EitherT compared to hand-rolled handling of Either in Future.

The two benchmarks above perform the same routine we devised earlier. The latter is what a human would write without EitherT.
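The hand-rolled variant might look roughly like this. It is a sketch with hypothetical helper names, not the benchmark’s actual code; the EitherT version wraps the same steps in EitherT and chains them with flatMap.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Sketch: threading Either through Future by hand, with explicit
// pattern matching instead of EitherT's flatMap.
def validate(n: Int): Either[String, Int] =
  if (n < 80) Right(n) else Left("invalid input")

def callOutside(n: Int): Future[Either[String, Int]] =
  Future.successful(Right(n * 2)) // stand-in for the remote call

def program(n: Int): Future[Either[String, Int]] =
  validate(n) match {
    case Left(err) => Future.successful(Left(err))
    case Right(v) =>
      callOutside(v).map {
        case Right(out) => Right(out + 1) // additional processing
        case Left(err)  => Left(err)      // bookkeeping would go here
      }
  }
```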

Quirks

You may be wondering why you need await at the end of each benchmark and why await is implemented as a busy loop instead of handy Scala Await.

First, if you do not await a future, that future will still run when the next benchmark is performed, occupying the thread pool (execution context) and affecting the results. You’ll no longer be measuring the average time each method takes to execute independently.

Second, Scala’s Await tends to put your threads to sleep – which will skew the results, as you’ll be adding random (and potentially long) times of thread scheduling “tax” to each run.
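A minimal sketch of such a busy-loop await (a hypothetical helper, not necessarily the benchmark’s exact code):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Spin until the future completes instead of parking the thread the way
// scala.concurrent.Await does; no scheduler wake-up latency is added.
def await[A](f: Future[A]): A = {
  while (!f.isCompleted) {}
  f.value.get.get // safe: the future is completed at this point
}
```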

The Use of Inliner

Benchmarks are compiled with -opt:l:inline and -opt-inline-from:**. These make a lot of higher-order methods disappear from the call stack. For instance, this code:

biSemiflatMap(
  err =>
    doSomethingWithFailure(baseTokens, timeFactor)(err)
      .map(_ => err),
  doSomethingWithOutput(baseTokens, timeFactor))

Becomes:

new EitherT(
  catsStdInstancesForFuture(executionContext)
    .flatMap(eitherT.value) { f })

in the generated bytecode (compare with the definition of biSemiflatMap:

def biSemiflatMap[C, D](fa: A => F[C], fb: B => F[D])(implicit F: Monad[F]): EitherT[F, C, D] =
  EitherT(F.flatMap(value) { f })

).
You can read more about these optimizations here. I believe that they’re beneficial for FP-heavy code because they eliminate megamorphic callsites. So, I recommend that everyone turn them on unless you’re building a library.

Results

Observations:

  • Yay! The hand-coded version is 1.5x faster than EitherT for short tasks.
  • For long tasks, the differences are probably too small (~10%) to make any practical difference unless performance is your main concern. In that case, stay away from this combination.
  • As the timeFactor parameter increases, the relative speedup of not using EitherT tends to become negligible.

Once your computations start to hit a database, an external service, etc., which you simulate by setting timeFactor to, say, 200 (meaning it’s 200x more costly to call some functions, not an unreasonable setting if you pretend that they’re calling an HTTP service), your real worry should not be EitherT.

Analysis

(click here for SVG)

Insights:

  • There is a considerable price to be paid for creating EitherT instances via right, pure, and extra map calls.
  • EitherT code compiles to a lot of extra invokedynamic and invokeinterface instructions compared to the plain Future version, but it does not seem to be that much of a problem. Please note that it is quite possible that JIT has been able to perform aggressive monomorphization because there is only one instance of Monad, Functor, etc. On the other hand, I wasn’t able to obtain different results even when I experimented with force-loading other Monad implementations.
  • Inliner is helpful. It can inline all the EitherT.{subflatMap, biSemiflatMap, flatMapF, map} calls, reducing one level of indirection.
  • The biggest factor is the cost of submitting tasks to the thread pool.

If your tasks are short, you’ll experience a substantial performance gain if you utilize thread-pool sparingly — e.g., by coalescing long chains of Future calls into a single call. If, on the other hand, your tasks are long, the cost of thread-pool management will be amortized over the time it takes to run tasks.
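The “coalescing” idea can be illustrated like this (hypothetical functions; the exact savings depend on the execution context in use):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Each map below can submit a fresh task to the thread pool...
def chained(n: Int): Future[Int] =
  Future(n + 1).map(_ * 2).map(_ - 3)

// ...while the coalesced version computes the same result in one task.
def coalesced(n: Int): Future[Int] =
  Future(((n + 1) * 2) - 3)
```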

Performance problems with EitherT wrapped around Future seem to center on a certain mismatch between the two. While Future favors a small number of bigger chunks of work, EitherT, being effect-agnostic, interacts with its effect through generic abstractions like Functor or Monad, which tend to break programs down into a larger number of smaller steps translated into chains of map and flatMap calls. But, as you observed, Future makes these calls expensive for short computations. This effect largely diminishes when tasks perform a lot of work, so just use EitherT, as it leads to clean and concise code (again, unless performance is your main concern).

Either vs Exceptions

The source of doubts for almost everyone:

Is it better to forgo Either and go with exceptions?

After all, exceptions are by default caught by both Future and IO, making them effectively isomorphic to Either[Throwable, A]. Consequently, you can drop explicit Either at the expense of losing some precision, because you get an unrestricted Throwable instead of a more specific error type.

Let’s then create a set of functions that, instead of signaling an error by constructing a Left instance of Either, throw an exception.

Since functions throwing exceptions are not composable, I needed to rewrite things a bit.
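The flavor of the rewrite can be sketched like this (hypothetical names; the real benchmark applies this to the full routine):

```scala
// Either-based: the error is a value, so functions compose with flatMap.
final case class ValidationError(msg: String)

def validateE(n: Int): Either[ValidationError, Int] =
  if (n < 80) Right(n) else Left(ValidationError(s"rejected: $n"))

// Exception-based twin: same logic, but failures become throws, which
// forces callers into try/catch instead of map/flatMap chains.
def validateX(n: Int): Int =
  if (n < 80) n else throw new IllegalArgumentException(s"rejected: $n")
```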

Results

Observations:

  • All things being equal, exceptions aren’t really faster than their Either-based counterparts. In extreme cases, they can be 50% slower.
  • Exceptions get relatively faster the more you throw them (50% slower for short tasks with a high error ratio vs. around 15% for longer tasks). But even as the failure rate grows, it’s unlikely that you’ll ever reach a point where exception-based methods are on par with Either, so don’t bother.

Analysis

(click here for SVG)

Insights:

  • Filling stack traces can cost a lot — the more you throw, the more you’ll pay.
  • Stack traces are filled in the Throwable constructor – you do not even have to throw.
  • So, the cost of constructing exceptions competes with whatever you gain from short-circuiting and recovery; in this case, more than 5% of samples were devoted to filling stack traces.

Verdict

  • EitherT: Only use for long-running tasks.
  • Exceptions: Don’t bother.

IO

You observed that under some circumstances, EitherT is not so performant when the underlying effect is expensive to transform.

Let’s see how it fares with an effect where that is not the case: the IO monad.

EitherT vs Either

These benchmarks correspond to the ones where you tested Future: an EitherT version and a version where Either is handled manually.

Quirks

Since IO is lazy, stopping the benchmark after an instance of IO is produced is going to measure only construction costs. To be comparable with Future benchmarks, you need to force an evaluation (via unsafeRunSync) of every IO at the end of each benchmark.

This generally “pollutes” the results with the cost of running the IO loop, which would not be present in a real setting, where users are encouraged to run the computation as late as possible. It also means you should not cross-compare actual timings between, e.g., IO and ZIO, because this kind of benchmark favors effect systems optimized for short-running computations.
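Concretely, each IO benchmark has to end in something like this (a sketch assuming the cats-effect 2 API):

```scala
import cats.effect.IO

// IO is lazy: constructing the value below does no work, so stopping
// here would only measure construction costs.
val program: IO[Either[String, Int]] = IO.pure(Right(42))

// Forcing it runs the interpreter loop, whose cost is now part of the
// measurement (in real code it would often be paid once, at the edge).
val result: Either[String, Int] = program.unsafeRunSync()
```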

Results

Observations:

  • There are almost no differences between using EitherT and coding by hand, which confirms the earlier analysis. EitherT is well-suited to IO: there is no 1.5x slowdown as in the case of Future.

Analysis

(Click here for SVG)

Insights:

  • unsafeRunSync takes a significant share of time. I guess that this is expected – this is the IO interpreter running. EitherT methods do not even show up on the flamegraph. You can conclude that it does not matter how an IO instance has been constructed.
  • Async boundaries are costly. You need to make sure you introduce them in the right place — before long-running, potentially blocking operations, otherwise pointless context shifts can seriously degrade performance.
  • As a corollary — fine-tuning execution aspects (context shifts) seems to be far more important than obsessing over monad transformers in this kind of code.

Either vs Exceptions

Note how you could use specialized methods for dealing with exceptions.
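For instance (a sketch assuming the cats-effect 2 API; darkSide is a hypothetical exception-throwing function):

```scala
import cats.effect.IO

// Exceptions raised inside IO are caught by the runtime...
def darkSide(n: Int): IO[Int] =
  IO(if (n < 0) throw new IllegalArgumentException("negative") else n * 2)

// ...and can be surfaced as a value, or recovered from, without try/catch:
val surfaced: IO[Either[Throwable, Int]] = darkSide(-1).attempt
val recovered: IO[Int] = darkSide(-1).handleErrorWith(_ => IO.pure(0))
```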

Results

Observations:

  • As before, exceptions are not faster than Either. The relative differences are not as large as before, though, which makes it a less painful choice if you really have to deal with functions that throw exceptions.

Analysis

(click here for SVG)

Insights:

  • You see that a whopping 25% of samples consist of filling stack traces. Not only does that mean that exceptions are costly, but also that IO is much better optimized than Future, where the dominating cost is thread-pool management.

Verdict

  • EitherT: Yes, by all means, don’t waste your time coding Either by hand.
  • Exceptions: Don’t bother. But if you deal with a code that throws exceptions, then use IO rather than Future.

ZIO

Measuring the performance of ZIO, as was outlined to me by John De Goes, is tricky. That’s because, as opposed to IO, ZIO is more optimized towards long-running or even infinite processes.

That means that such short-lived benchmarks are polluted by the high costs of setup/teardown times for the interpreter. As a corollary, you should not use this benchmark to conclude which effect system is faster. Instead, given the effect system, check which programming style is the most effective to use.

EitherT vs Either

Results

Observations:

  • You can repeat everything that was written for IO: there is almost no difference between using EitherT and coding by hand. EitherT is well-suited to ZIO.

Either vs Exceptions vs Bifunctor

ZIO contains a unique, bifunctor-based approach to handling errors: ZIO can encode error values of an arbitrary type alongside the result type and retain the precise type of an error.

It makes sense to include this mechanism in the comparison, as it can avoid “expensive” throwables while keeping all the benefits of optimized error-handling paths.
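In outline, the bifunctor encoding looks like this (a sketch assuming the ZIO 1.x API; AppError is a made-up error type):

```scala
import zio.{IO, ZIO}

// The error channel E in IO[E, A] carries a precise, user-defined type;
// failures short-circuit like Left, but no stack trace is filled.
final case class AppError(msg: String)

def validate(n: Int): IO[AppError, Int] =
  if (n < 80) ZIO.succeed(n) else ZIO.fail(AppError(s"invalid: $n"))

val program: IO[AppError, Int] = validate(10).map(_ * 2)
```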

Results

Observations:

  • The bifunctor mechanism offers excellent performance and principled error handling.
  • Its performance is a lot better compared to mechanisms based on throwables, so I’d favor it over those as much as possible.

Analysis

(Click here for SVG)

Insights:

  • You’re seeing implementation internals almost exclusively, which means that you’re not utilizing ZIO to its full potential. In that case, it’s best not to draw conclusions from the absolute numbers.
  • Again, as was the case with IO, construction details almost do not matter. So ZIO, too, seems well suited to any programming style.
  • Because of the richer (and heavier) interpreter, ZIO should not be used for one-shot or short-lived methods in isolation.

Verdict

  • EitherT: No problem, but ZIO has its own unique mechanism which offers a slightly more ergonomic model.
  • Exceptions: If you have to, but ZIO has its own unique mechanism…
  • Bifunctor: Yes!!

Tagless final

As a bonus, let’s measure the impact of having an abstract effect wrapper. This technique, sometimes called tagless final, lets you write your logic in terms of an abstract higher-kinded type accompanied by a set of known capabilities used for operating the wrapper without knowing its exact implementation.
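A dependency-free sketch of the idea (the Monad type class is hand-rolled here to keep the example self-contained; the benchmarks presumably use cats.Monad):

```scala
object TaglessDemo {
  // The capability: all the program knows about its effect F[_].
  trait Monad[F[_]] {
    def pure[A](a: A): F[A]
    def flatMap[A, B](fa: F[A])(f: A => F[B]): F[B]
  }

  // Effect-oblivious logic: it compiles once, runs for any F chosen later.
  def program[F[_]](n: Int)(implicit F: Monad[F]): F[Int] =
    F.flatMap(F.pure(n))(v => F.pure(v * 2))

  // One concrete interpretation: F = Either[String, *].
  type Result[A] = Either[String, A]
  implicit val resultMonad: Monad[Result] = new Monad[Result] {
    def pure[A](a: A): Result[A] = Right(a)
    def flatMap[A, B](fa: Result[A])(f: A => Result[B]): Result[B] =
      fa.flatMap(f)
  }
}
```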

It’s wildly popular these days, and it would be interesting to know if this abstraction boost adds any significant performance penalty.

Rewritten code used to benchmark the abstract effect:

As you can see, I rewrote the measured functionality to operate on the abstract effect. Additionally, I created two versions of the non-EitherT functions, one using syntax extensions (so you can write f.map(...)) and one without, to further quantify the impact of the Scala way of enriching existing classes.

As you probably know, the compiler must create a new instance of a class implementing the “pimped” method under the hood, which can have a negative impact on overall performance.

Armed with these, you can write benchmarks that call effect-oblivious functions with concrete effect types and compare them with non-tagless measurements from previous benchmarks.
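The syntax-extension mechanism mentioned above can be sketched like this (a generic illustration, not the benchmark’s actual syntax class); extending AnyVal usually lets the compiler or JIT elide the wrapper allocation:

```scala
object Syntax {
  // "Pimp my library": adds `doubled` to Int; each use conceptually
  // allocates a DoubledOps wrapper around the receiver.
  implicit class DoubledOps(private val n: Int) extends AnyVal {
    def doubled: Int = n * 2
  }
}
```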

Quirks

To be fair, I tried to eliminate the effects of various compiler/JIT tricks that could not possibly be performed if the code were part of a larger system.

AllInstances is extended to have many Monad instances to choose from, possibly defeating monomorphization tricks. Additionally, methods are marked as noinline to prevent the inliner from doing its job.

Results

Observations:

  • Do not be afraid of syntax extensions. This use case (short-lived object with no state) is well optimized by JIT.
  • I did not find the tagless final style to be slower, so do not avoid it if it suits you.

Analysis

(Click here for SVG)

Insights:

  • When looking for signs of performance degradation caused by F[..] on the flamegraph, I decided to look for itable stubs, and I noticed that they are responsible for only 0.6% of all samples, which seems small.
  • I tried various tricks to observe the effects of megamorphic dispatch (like importing AllInstances) but did not notice any significant discrepancies.

Final conclusions

  • Unless you’re building a library, compile with inliner enabled (“-opt:l:inline”, “-opt-inline-from:**”).
  • If your workload mainly comprises calling DBs, REST, or, generally, long computations — avoid Future and use more efficient and optimized effect systems like IO or ZIO. Also, use the most readable FP-ish methods for error handling. In my case, that would be EitherT[IO] or ZIO’s bifunctor. Obviously, you have to think about context shifts to control blocking and fairness, but at least you control them fully. Future does not give you a choice, and it suffers when combined with EitherT.
  • If you really have to live with Future — optimize thread-pool utilization. Generally, that means you can’t rely on generic mechanisms like EitherT, as they’re not written with thread-pool costs in mind.
  • Forget about exceptions. They do not seem to have any performance advantages (but they can have disadvantages if you throw them a lot) and you lose composability. I’d reserve usage of exceptions for system failures (good thing that all the effect systems catch them) and use Either for logical errors.
  • Do not trust my benchmarks. Make your own. And if they’re interesting, I will post them here. :-)
  • If you see any stupid things, please leave a comment.
  • If you have some extra insights, please comment as well. :-)
  • If you’re interested in more benchmarks — e.g., measuring long-running effects — please let us know.
