How to successfully manage a ZIO Fiber’s lifecycle
Fibers are the backbone of the highly performant, asynchronous and concurrent abilities of ZIO. They are lightweight “green threads” implemented by the ZIO runtime system.
Forking any IO[E, A]
effect means it will immediately run on a new fiber.
The ZIO documentation recommends that non-advanced programmers avoid fibers and instead use concurrent primitives like raceWith
, zipPar
, foreachPar
, and so forth, which is sound advice.
But very soon in a lot of different applications, there comes a need to fork a new fiber. The best example would be — running some recurring job.
Fiber management edge cases
When managing the application’s fibers there are inevitable edge cases each ZIO user has to be familiar with and be comfortable with.
Errors propagate much better in the asynchronous ZIO runtime (via tracing) than within the Scala Future mechanism. But there are still cases where an error can cause a fiber to die unexpectedly.
Fibers can also get stuck on some blocking IO operation, and while it may be interruptible, the fact that the fiber is stuck may not be obvious as no error is thrown.
Track, debug and avoid
This article is about how to track, debug and avoid unexpected behaviour in ZIO fibers, including:
- Making sure that fibers do not die without at least some kind of reporting
- Making sure execution will not get stuck when interrupting a fiber
- Having easily available fiber tracking (fiber dump) in place through ZMX or a custom solution
1. Handling unexpected failures that can cause fibers to die
The code of the example application shown below, creates a periodic job (using the repeat
operator) that is forked to run on its own fiber.
In order to make sure the job continues to run periodically even when an error occurs the catchAll
operator is used before the repeat
operator.
catchAll doesn’t catch all failures
Unfortunately, the catchAll
operator only catches expected failures. E.g. throwable error on side-effects imported by ZIO.effect
or pure errors produced by ZIO.fail
Unexpected failures (aka defects), such as an exception thrown in pure effect ZIO.succeed
are not caught by catchAll
and will cause the fiber to die.
This is demonstrated in the standard output of the application:
do other stuff
Fiber failed.
An unchecked error was produced.
java.lang.RuntimeException: unexpected defect
…
do other stuff
do other stuff
do other stuff
…
(Note that tapError
& retry
operators also don’t catch such failures).
catchAllCause does catch all failures
The E
type of ZIO[R, E, A]
refers only to expected failures. Cause[E]
is a description of a full story of failure. Cause can be either Fail[+E](value: E)
or Die(value: Throwable)
Changing catchAll
to catchAllCause
will make sure the fiber continues even in case of unexpected error.
This is reflected in the output of the altered program:
do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…
do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…
…
(Note that you can use cause.squashTrace
that squashes a Cause
down to a single Throwable
)
Usually it is not recommended to keep the fiber going in such extreme cases, but you may require otherwise in your use-case.
tapCause to help with debuggingtapCause
is not able to keep the fiber going, but it will at least allow the unexpected error to be reported to the log with all additional information to help with investigation:
do other stuff
job failed with Traced(Die(java.lang.RuntimeException: unexpected defect)…). extra information…
Fiber failed.
An unchecked error was produced.
…
withReportFailure
You can also set your Runtime
Platform
’s .withReportFailure
in order to report on any uncaught errors, including defects.
runtime.mapPlatform(_.withReportFailure(_ => [report unexpected failure]))
2. Interrupt a Fiber after a grace period
Unlike with Scala Futures, ZIO fibers can almost always be interrupted. Only if uninterruptible
is specified explicitly, or inside managed resource acquire effect, will the fiber not be interrupted.
There are certain situations where interrupting a fiber is required. For example, when releasing a managed resource that forked a fiber when it was acquired.
Abrupt interruption
In the following example, a long running fiber is abruptly interrupted after a while.
The application output shows how the fiber termination happened, immediately after the interrupt effect was interpreted:
19:15:38.445 long running job…
19:15:38.501 do stuff
19:15:39.696 do stuff
19:15:40.701 do stuff
19:15:40.862 finalizing job!
19:15:40.871 do other stuff
19:15:41.873 do other stuff
The ensuring
operator allows to run any finalization effect that is required at the end of fiber execution (in this case it helps with report of time of interruption)
Graceful interruption
In case it is beneficial to allow the fiber a grace period before interruption, the code can be changed as follows (see line 7):
fiber.join
— in order to set a timeout, the fiber has to be joined.resurrect.ignore
—resurrect
converts unexpected failures (defects) to regular errors in order to make sure all types of errors areignore
d. Otherwise any unexpected failure in the forked fiber will cause the main fiber to die as well when the former is joined.- The
timeout
is specified together withdisconnect
in order to make sure that timeout will not be blocked by anuninterruptible
effect (e.g. due to releasing a managed resource for instance). Setting the job asinterruptible
also makes sure the timeout will not be blocked
Line 8:
timeout.isEmpty
— If the timeout elapses thenNone
is returned.- In the case of a timeout the fiber is still not interrupted, so an explicit
interrupt
is needed. HereinterruptFork
is used in case the forked fiber was marked asuninterruptible
in order to avoid waiting forever for anuninterruptible
to get interrupted.
Now, the output is as follows:
19:23:01.422 long running job…
19:23:01.492 do stuff
19:23:02.689 do stuff
19:23:03.693 do stuff
19:23:08.960 finalizing job!
19:23:08.963 do other stuff
19:23:09.964 do other stuff
As the output shows, the fiber was only interrupted after 5 seconds (the timeout period)
Important note: marking an effect to be forked as interruptible
will not have meaning in case the effect itself is uninterruptible
. interruptible
is only helpful inside some part of a bigger uninterruptible
effect.
3. Monitoring Fibers state
Leaking fibers can cause instability in any ZIO application. For example, performing too many fork operations on the zio Blocking thread pool (which has dynamic size) e.g. effectBlocking([some blocking effect]).forkDaemon
can cause memory overloading due to too many open threads.
Fibers can also die suddenly (as we saw in the first part of this article on defects) or become stuck.
ZMX fiber dump
Performing a fiber dump is sometimes essential to debug the current (troubled) state of your application through the state of its fibers.
Fortunately, there is ongoing work to introduce such an ability to ZIO through the ZIO-ZMX project.
On the application (server) side, the requirement is to setup a diagnostics layer
val diagnosticsLayer: ZLayer[ZEnv, Throwable, Diagnostics] = Diagnostics.live(“localhost”, 1111)
And to supply the runtime Platform
with a ZMXSuprvisor
:
val runtime: Runtime[ZEnv] = Runtime.default.mapPlatform(_.withSupervisor(ZMXSupervisor))
Supervisors allow the current state of fibers to be tracked.
On the client side, the interactions is through TPC directly
echo -ne ‘*1\r\n$4\r\ndump\r\n’ | nc localhost 1111
Or through a Scala ZMXClient
.
For more information, visit the Diagnostics documentation page of ZMX
Roll your own fiber dump
ZMX is not yet officially released (as of May 2021), and regardless you may wish to create your own custom way of accessing a fiber dump.
In order to implement your own fiber dump mechanism, follow these steps:
- Create a
FiberTracking
Service that will access the supervisor’s fibers
2. Create a Runtime
extension method that creates a supervisor with fiber tracking enabled (Supervisor.track(weak = false)
), maps the runtime Platform
to include this supervisor, and includes the FiberTracking
service in the runtime environment.
3. Create a runtime for your application’s code that sets fiber tracking on using the extension method from the previous phase.
4. Upon demand, retrieve a snapshot of the currently running fibers from the FiberTracking
service and secure a fiber dump string to do with as you wish.
Note: This example includes a simple console application, but the exact same code can be used for a web server as well.
Summary
This article described ZIO fiber related edge cases such as a fiber dying due to unexpected failure, or the requirement to interrupt a fiber, either abruptly or gracefully in a successful manner.
It recommended to use such ZIO operators as ensuring
, tapCause
, catchAllCause
, withReportFailure
, resurrect
, and disconnect
in or to make sure unexpected errors are caught (or at least reported), resources are then released and also that fibers are interrupted safely.
It also offered a detailed description of how to set up fiber tracking and executing a fiberDump for increased debuggability of your ZIO application.
I would like to thank Noam Berman, Dmitry Karlinsky and Asaf Jaffi for their help in making this article.
Thank you for reading!
If you’d like to get updates on my future ZIO related blog posts, follow me on Twitter and Medium.
You can also visit my website, where you will find my previous blog posts, talks I gave in conferences and open-source projects I’m involved with.
If anything is unclear or you want to point out something, please comment down below.