How to successfully manage a ZIO Fiber’s lifecycle

Natan Silnitsky
Wix Engineering
Published in
7 min readMay 23, 2021
Photo by Héctor J. Rivas on Unsplash

Fibers are the backbone of the highly performant, asynchronous and concurrent abilities of ZIO. They are lightweight “green threads” implemented by the ZIO runtime system.

Forking any IO[E, A] effect means it will immediately run on a new fiber.

The ZIO documentation recommends that non-advanced programmers avoid fibers and instead use concurrent primitives like raceWith, zipPar, foreachPar, and so forth, which is sound advice.

But very soon in a lot of different applications, there comes a need to fork a new fiber. The best example would be — running some recurring job.

Fiber management edge cases
When managing the application’s fibers there are inevitable edge cases each ZIO user has to be familiar with and be comfortable with.

Errors propagate much better in the asynchronous ZIO runtime (via tracing) than within the Scala Future mechanism. But there are still cases where an error can cause a fiber to die unexpectedly.

Fibers can also get stuck on some blocking IO operation, and while it may be interruptible, the fact that the fiber is stuck may not be obvious as no error is thrown.

Track, debug and avoid
This article is about how to track, debug and avoid unexpected behaviour in ZIO fibers, including:

  1. Making sure that fibers do not die without at least some kind of reporting
  2. Making sure execution will not get stuck when interrupting a fiber
  3. Having easily available fiber tracking (fiber dump) in place through ZMX or a custom solution

1. Handling unexpected failures that can cause fibers to die

The code of the example application shown below, creates a periodic job (using the repeat operator) that is forked to run on its own fiber.

In order to make sure the job continues to run periodically even when an error occurs the catchAll operator is used before the repeat operator.

catchAll doesn’t catch all failures
Unfortunately, the catchAll operator only catches expected failures. E.g. throwable error on side-effects imported by ZIO.effect or pure errors produced by ZIO.fail

Unexpected failures (aka defects), such as an exception thrown in pure effect ZIO.succeedare not caught by catchAll and will cause the fiber to die.

This is demonstrated in the standard output of the application:

do other stuff
Fiber failed.
An unchecked error was produced.
java.lang.RuntimeException: unexpected defect

do other stuff
do other stuff
do other stuff

(Note that tapError & retry operators also don’t catch such failures).

catchAllCause does catch all failures
The E type of ZIO[R, E, A] refers only to expected failures. Cause[E] is a description of a full story of failure. Cause can be either Fail[+E](value: E) or Die(value: Throwable)

Changing catchAll to catchAllCause will make sure the fiber continues even in case of unexpected error.

This is reflected in the output of the altered program:

do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…
do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…

(Note that you can use cause.squashTrace that squashes a Cause down to a single Throwable)

Usually it is not recommended to keep the fiber going in such extreme cases, but you may require otherwise in your use-case.

tapCause to help with debugging
tapCause is not able to keep the fiber going, but it will at least allow the unexpected error to be reported to the log with all additional information to help with investigation:

do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…). extra information…
Fiber failed.
An unchecked error was produced.

withReportFailure
You can also set your Runtime Platform’s .withReportFailure in order to report on any uncaught errors, including defects.

runtime.mapPlatform(_.withReportFailure(_ => [report unexpected failure]))

2. Interrupt a Fiber after a grace period

Unlike with Scala Futures, ZIO fibers can almost always be interrupted. Only if uninterruptible is specified explicitly, or inside managed resource acquire effect, will the fiber not be interrupted.

There are certain situations where interrupting a fiber is required. For example, when releasing a managed resource that forked a fiber when it was acquired.

Abrupt interruption
In the following example, a long running fiber is abruptly interrupted after a while.

The application output shows how the fiber termination happened, immediately after the interrupt effect was interpreted:

19:15:38.445 long running job…
19:15:38.501 do stuff
19:15:39.696 do stuff
19:15:40.701 do stuff
19:15:40.862 finalizing job!
19:15:40.871 do other stuff
19:15:41.873 do other stuff

The ensuring operator allows to run any finalization effect that is required at the end of fiber execution (in this case it helps with report of time of interruption)

Graceful interruption
In case it is beneficial to allow the fiber a grace period before interruption, the code can be changed as follows (see line 7):

  1. fiber.join — in order to set a timeout, the fiber has to be joined
  2. .resurrect.ignore resurrect converts unexpected failures (defects) to regular errors in order to make sure all types of errors are ignored. Otherwise any unexpected failure in the forked fiber will cause the main fiber to die as well when the former is joined.
  3. The timeout is specified together with disconnect in order to make sure that timeout will not be blocked by an uninterruptible effect (e.g. due to releasing a managed resource for instance). Setting the job as interruptible also makes sure the timeout will not be blocked

Line 8:

  1. timeout.isEmpty — If the timeout elapses then None is returned.
  2. In the case of a timeout the fiber is still not interrupted, so an explicit interrupt is needed. Here interruptFork is used in case the forked fiber was marked as uninterruptible in order to avoid waiting forever for an uninterruptible to get interrupted.

Now, the output is as follows:

19:23:01.422 long running job…
19:23:01.492 do stuff
19:23:02.689 do stuff
19:23:03.693 do stuff
19:23:08.960 finalizing job!
19:23:08.963 do other stuff
19:23:09.964 do other stuff

As the output shows, the fiber was only interrupted after 5 seconds (the timeout period)

Important note: marking an effect to be forked as interruptible will not have meaning in case the effect itself is uninterruptible. interruptible is only helpful inside some part of a bigger uninterruptible effect.

3. Monitoring Fibers state

Leaking fibers can cause instability in any ZIO application. For example, performing too many fork operations on the zio Blocking thread pool (which has dynamic size) e.g. effectBlocking([some blocking effect]).forkDaemon can cause memory overloading due to too many open threads.

Fibers can also die suddenly (as we saw in the first part of this article on defects) or become stuck.

ZMX fiber dump
Performing a fiber dump is sometimes essential to debug the current (troubled) state of your application through the state of its fibers.

Fortunately, there is ongoing work to introduce such an ability to ZIO through the ZIO-ZMX project.

On the application (server) side, the requirement is to setup a diagnostics layer

val diagnosticsLayer: ZLayer[ZEnv, Throwable, Diagnostics] = Diagnostics.live(“localhost”, 1111)

And to supply the runtime Platform with a ZMXSuprvisor:

val runtime: Runtime[ZEnv] = Runtime.default.mapPlatform(_.withSupervisor(ZMXSupervisor))

Supervisors allow the current state of fibers to be tracked.

On the client side, the interactions is through TPC directly

echo -ne ‘*1\r\n$4\r\ndump\r\n’ | nc localhost 1111

Or through a Scala ZMXClient.

For more information, visit the Diagnostics documentation page of ZMX

Roll your own fiber dump
ZMX is not yet officially released (as of May 2021), and regardless you may wish to create your own custom way of accessing a fiber dump.

In order to implement your own fiber dump mechanism, follow these steps:

  1. Create a FiberTracking Service that will access the supervisor’s fibers

2. Create a Runtime extension method that creates a supervisor with fiber tracking enabled (Supervisor.track(weak = false)), maps the runtime Platform to include this supervisor, and includes the FiberTracking service in the runtime environment.

3. Create a runtime for your application’s code that sets fiber tracking on using the extension method from the previous phase.

4. Upon demand, retrieve a snapshot of the currently running fibers from the FiberTracking service and secure a fiber dump string to do with as you wish.

Note: This example includes a simple console application, but the exact same code can be used for a web server as well.

Summary

This article described ZIO fiber related edge cases such as a fiber dying due to unexpected failure, or the requirement to interrupt a fiber, either abruptly or gracefully in a successful manner.

It recommended to use such ZIO operators as ensuring, tapCause, catchAllCause, withReportFailure, resurrect, and disconnect in or to make sure unexpected errors are caught (or at least reported), resources are then released and also that fibers are interrupted safely.

It also offered a detailed description of how to set up fiber tracking and executing a fiberDump for increased debuggability of your ZIO application.

I would like to thank Noam Berman, Dmitry Karlinsky and Asaf Jaffi for their help in making this article.

Thank you for reading!

If you’d like to get updates on my future ZIO related blog posts, follow me on Twitter and Medium.

You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.

If anything is unclear or you want to point out something, please comment down below.

A talk I gave at Functional Scala 2021 based on this article

--

--