Towards a better API for I/O

Tim Spence
Feb 19 · 5 min read

The core premise of Iteratees (and their ideologically similar derivatives such as fs2 in Scala and pipes and conduit in Haskell) is that there are three properties you should seek in an I/O API:

  • Incrementality — they support streaming/incrementally reading from file descriptors, etc
  • Resource safety—resources should be released in a prompt and deterministic manner
  • Compositionality

Hopefully the first two properties are self-explanatory, but compositionality may require elaboration.

Consider a function that takes a file and returns the number of x’s in the first 5 lines that are longer than 100 characters:

Now suppose we wanted to return the number of ‘x’s in the first three lines that also contain a ‘y’. There’s no way to re-use the previous code because our application logic is tightly intertwined with the I/O handling - handle based I/O does not compose!

Oleg Kiselyov’s original paper on iteratees presents a particularly concise and instructive implementation of such an I/O API.

In this post we’ll skip over some of the more complex implementation details and focus more on the core design of iteratees and the reasoning behind the types Oleg defines.

A digression on folds

If we ignore Foldablefor a moment, the Haskell prelude defines a fold as:

This processes a list from left to right, using a supplied function and initial value, by passing the output of one invocation of the function as input to the next.

If we take off our glasses and squint hard enough, this looks a bit like what we want — the source of the data (the list) is de-coupled from the computation we want to perform on it (the supplied folding function).

In fact, we can define a fold on a file as well (below is a simplistic implementation for text files):

Now we’re getting somewhere! We can (to a certain extent) compose functions which we want to apply to a source of data and maintain resource safety through the definition of fold.

We can view the folding function as a state machine for stream processing and fold as a resource manager which advances the state by passing in input. However, there are some limitations to this API:

  • The folding function cannot indicate to the fold that it is finished processing the data e.g. in the original “take the first three lines of a file” example, the computation could finish and the file descriptor be closed after the first three lines are read
  • A corollary of the above is that we cannot concatenate stream processors as each of them must consume the whole stream
  • We also cannot trivially concatenate input sources before passing them to our stream processor
  • There is no encapsulation of the state of the stream processor

Back to Iteratees

We’re now (finally!) in a position to define the basics of iteratees:

“Erm, what?”

Don’t panic! Firstly we’ve defined a stream type (Stream a) which can either be finished or contain a list of data of type a. We've also defined Iteratee i m o (like our folding function above), which represents a stream processor which takes a stream of type i and produces a value of type o, whilst performing effects of type m (m is probably IO if you are writing to a file or socket or similar). The iteratee can also return an exception to indicate an error (or as control flow) to its driver e.g. to request to rewind the stream, if that’s possible. You can think if it as a state machine - it has no ability to obtain data from anywhere, it is merely a representation of a computation.

The Monad{,Trans,IO} instances are included for interest and because they turn out to be very useful. For example, here’s a stream processor which counts the number of elements in a stream:

Don’t worry too much about the details — writing the internals of an iteratee library can get a bit messy but the high-level API you can create is well worth it!

Now that we have a way to define stream processors, we need stream producers — a way to drive input into the stream processors to advance their state machine (like fold in our original example):

Why is this the type of a stream producer? An enumerator must take in a stream processor (Iteratee i m o) describing the computation to perform. The value returned from the enumerator should be the final state of the computation (iteratee), wrapped in some context m (probably IO) as the enumerator will have to perform some action to obtain the input. In other words Iteratee i m o -> m (Iteratee i m o)!

The enumerator will probably consist of some sort of loop, performing I/O to obtain data and folding that with the current state of the iteratee. For example, we can define an enumerator for text files:

We’ve achieved quite a lot already but there’s one more type we need for general stream processing. Iteratee can only represent terminal operations which produce a value. We also need stream transducers to create derived streams so that we can represent concepts such as filtering or mapping streams or codecs:

Again, why is this what we want? A stream transducer should take an Iteratee inner m a describing the downstream computation that is to be performed. It should use this to create a new Iteratee that consumes a stream of outer , transforms it to a stream of inner, passes that to the inner Iteratee and returns the final state of the inner Iteratee as its result. In other words Iteratee inner m a -> Iteratee outer m (Iteratee inner m a) !

Note that this means thatEnumeratee is effectively an Iteratee and an Enumerator at the same time. For example, here is filter for streams:

Again, don’t worry too much about the details — you won’t have to program at this level to use iteratees!

Now we just need to define a couple of useful functions and we’re ready to try out our iteratees API!

Now our original example of counting the ‘x’s in a file under various conditions would look something like this (definitions for most of the functions omitted - the high level API is what concerns us here):

As you can see, we’ve split our computation into reusable, composable components :)

Now that you’ve finished you should (hopefully!) be better equipped to go and read Oleg’s Kiselyov’s original paper and John Lato’s equally excellent article in The Monad Reader 16 to see a more detailed explanation and implementation. Moreover, you’re hopefully more motivated to use one of the libraries that implement these concepts next time you have to perform I/O in your programs!

Caveat Emptor

The purpose of this article was to explain the high-level design of iteratees and their advantages over traditional handle-based I/O. The code is little tested and absolutely not production quality. Please use a well-established library like the ones listed above to write real code.


Permutive

We build a real-time data platform that out customers use to power online experiences. We use Medium to write about the technology, product and community we work in.

Thanks to Joe Pettersson.

Tim Spence

Written by

Software engineer @ Permutive

Permutive

Permutive

We build a real-time data platform that out customers use to power online experiences. We use Medium to write about the technology, product and community we work in.