Streams in Node
using minimal pipeable streams
Streams are an asynchronous abstraction for handling data in small chunks, which pushes the bottleneck into the I/O layer. This usually means lower memory use and better performance, which is a very good thing.
In streams, data flows from a source, through a bunch of through streams, into a sink:
┌────────┐   ┌─────────┐   ┌────────┐
│ Source │──▶│ Through │──▶│  Sink  │
└────────┘   └─────────┘   └────────┘
Node has shipped streams as part of its standard library since its early days. However, with each release new features, APIs and concepts were added, making the current implementation very unwieldy. In my years as a Node developer I’ve only met a handful of developers who felt comfortable using the current version of Node streams. In an ideal world, streams would be used as much in Node as pipes are in the shell.
Enter pull-stream
pull-stream is an alternative model for streams created by Dominic Tarr. Like Node streams, it has a concept of back pressure: instead of a source pushing out data as fast as it can, the consumer pulls data once it’s ready to handle more. As a result, a program never holds more data in memory than it strictly needs.
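To make the pull model concrete, here's a minimal sketch with no dependencies. It assumes the pull-stream convention that a source is a function `read(abort, cb)` and a sink is a function that receives such a `read` and calls it whenever it wants the next value. The `values` and `collect` functions are hand-rolled stand-ins for illustration, not the library's own code.

```javascript
// A source: returns a `read` function that hands out one item per call.
function values (items) {
  let i = 0
  return function read (abort, cb) {
    if (abort) return cb(abort)              // the consumer asked us to stop
    if (i >= items.length) return cb(true)   // `true` signals a normal end
    cb(null, items[i++])                     // deliver exactly one value
  }
}

// A sink: pulls from `read` until the source ends, then reports the result.
function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)  // finished cleanly
      if (end) return done(end)                     // finished with an error
      results.push(value)
      read(null, next)                              // ready: ask for one more
    })
  }
}

// Connecting them: the sink decides when each value is produced.
let collected
collect((err, results) => { collected = results })(values([1, 2, 3]))
console.log(collected) // [ 1, 2, 3 ]
```

Nothing is buffered between the two: the source only produces a value when the sink asks for it. This sketch recurses once per item for synchronous sources, so it's only meant for small inputs.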
The Node streams source is well over 1,200 lines, without even accounting for dependencies. The pull-stream source is just 28 lines, which is a whopping 0.4 kB minified.
Don’t be fooled by the simple exterior, though: pull-stream provides the same functionality as Node streams. With fewer lines of code there’ll be fewer bugs, and room to optimize every last bit of code.
Pull stream types
In pull streams there are three types of streams: source, through and sink. For data to flow, a source and a sink must be connected. A through stream is a sink and a source combined, so every connection in the pipeline is a source and a sink talking to each other. Conceptually it looks like this:
┌──────┐   ┌──────┐   ┌──────┐   ┌──────┐
│Source│──▶│ Sink │ ┌▶│ Sink │ ┌▶│ Sink │
└──────┘   ├──────┤ │ ├──────┤ │ └──────┘
           │Source│─┘ │Source│─┘
           └──────┘   └──────┘
If a source and a through stream are connected, they will not start emitting data until a sink is attached at the end. Likewise, if a through and a sink are connected, they will not start flowing data until a source is attached at the start. This allows composition of arbitrary streams into pipelines, similar to what pumpify provides for Node streams.
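Here's a rough sketch of that laziness, again without dependencies. It assumes a through stream is a function that takes a `read` and returns a new `read`. The names `counter`, `map` and `collect` are illustrative, and the `reads` counter exists only to show when the source is actually touched.

```javascript
let reads = 0

// Source that counts how often it is asked for data.
function counter (limit) {
  let n = 0
  return function read (abort, cb) {
    reads++
    if (abort) return cb(abort)
    if (n >= limit) return cb(true)
    cb(null, ++n)
  }
}

// Through: takes a `read` (acting as a sink) and returns a new `read`
// (acting as a source) that transforms each value on its way past.
function map (fn) {
  return function through (read) {
    return function mappedRead (abort, cb) {
      read(abort, function (end, value) {
        if (end) return cb(end)
        cb(null, fn(value))
      })
    }
  }
}

// Sink that gathers everything into an array.
function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)
      results.push(value)
      read(null, next)
    })
  }
}

// Source + through is just another source; nothing has been read yet.
const doubled = map(x => x * 2)(counter(3))
const readsBeforeSink = reads

// Attaching the sink is what starts the flow.
let output
collect((err, results) => { output = results })(doubled)
console.log(readsBeforeSink, output) // 0 [ 2, 4, 6 ]
```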
pull-streams use the pull() function to combine sinks and sources. Because sinks connect to sources, any number of streams can be connected. It’s functional composition all the way.
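The core idea behind `pull()` fits in a few lines. What follows is a simplified stand-in, not the library's source: it assumes the first argument is a source and the rest are throughs or a sink, and it leaves out the duplex objects and partial pipelines that the real function also handles. The helper streams are hand-rolled for the example.

```javascript
// Simplified stand-in for pull(): feed each stream's output into the next.
function pull (source, ...streams) {
  return streams.reduce((read, stream) => stream(read), source)
}

function values (items) {
  let i = 0
  return function read (abort, cb) {
    if (abort) return cb(abort)
    if (i >= items.length) return cb(true)
    cb(null, items[i++])
  }
}

function map (fn) {
  return function through (read) {
    return function mappedRead (abort, cb) {
      read(abort, (end, value) => end ? cb(end) : cb(null, fn(value)))
    }
  }
}

function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)
      results.push(value)
      read(null, next)
    })
  }
}

// Source, two throughs and a sink, composed left to right.
let result
pull(
  values([1, 2, 3]),
  map(x => x * 10),
  map(x => x + 1),
  collect((err, results) => { result = results })
)
console.log(result) // [ 11, 21, 31 ]
```

Because each step is a plain function call, leaving off the sink gives you back a source that can be handed to another `pull()` later.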
Duplex streams are objects that have .source and .sink properties. The following are equivalent:
pull(a, b, a)
pull(a.source, b.sink); pull(b.source, a.sink)
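Under the hood, `pull(a, b, a)` connects a's source to b's sink and b's source back to a's sink. The sketch below does that wiring by hand so it runs without the library. Both duplex objects are hypothetical: `a` emits three numbers and collects whatever comes back, and `b` echoes each value incremented by one.

```javascript
function values (items) {
  let i = 0
  return function read (abort, cb) {
    if (abort) return cb(abort)
    if (i >= items.length) return cb(true)
    cb(null, items[i++])
  }
}

function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)
      results.push(value)
      read(null, next)
    })
  }
}

// Duplex `a`: a source and a sink bundled in one object.
let received
const a = {
  source: values([1, 2, 3]),
  sink: collect((err, results) => { received = results })
}

// Duplex `b`: its source forwards to whatever `read` its sink was given,
// adding one to every value on the way through.
let upstream
const b = {
  sink (read) { upstream = read },
  source (abort, cb) {
    upstream(abort, (end, value) => end ? cb(end) : cb(null, value + 1))
  }
}

// Hand-wired version of pull(a, b, a):
b.sink(a.source)   // a's output flows into b
a.sink(b.source)   // b's output flows back into a
console.log(received) // [ 2, 3, 4 ]
```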
Let’s create a basic map-reduce pipeline using asyncMap where we asynchronously fs.stat an array of files, and gather the results in an array:
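A dependency-free sketch of such a pipeline might look like this. The library ships its own `pull.values`, `pull.asyncMap` and `pull.collect`. The versions here are simplified, hand-rolled stand-ins so the example runs on its own, and `statAll` plus the sample paths are assumptions made for the demo.

```javascript
const fs = require('fs')
const os = require('os')

// Simplified pull(): feed each stream's output into the next.
function pull (source, ...streams) {
  return streams.reduce((read, stream) => stream(read), source)
}

function values (items) {
  let i = 0
  return function read (abort, cb) {
    if (abort) return cb(abort)
    if (i >= items.length) return cb(true)
    cb(null, items[i++])
  }
}

// Through that runs a Node-style async function on each value.
function asyncMap (fn) {
  return function through (read) {
    return function mappedRead (abort, cb) {
      read(abort, function (end, value) {
        if (end) return cb(end)
        fn(value, function (err, result) {
          if (err) return read(err, () => cb(err)) // abort upstream, then fail
          cb(null, result)
        })
      })
    }
  }
}

function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)
      results.push(value)
      read(null, next)
    })
  }
}

// Stat every path and hand the array of fs.Stats objects to `done`.
function statAll (paths, done) {
  pull(values(paths), asyncMap(fs.stat), collect(done))
}

// Placeholder paths that should exist wherever Node runs.
statAll([process.execPath, os.tmpdir()], (err, stats) => {
  if (err) throw err
  console.log(stats.map(s => s.size))
})
```

Each path is only handed to `fs.stat` once the previous result has been collected, so the paths are processed one at a time.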
Because under the hood we’re just composing functions, the overhead of doing this is reduced to a bare minimum.
In Node streams, errors don’t propagate through .pipe() chains. It’s therefore common practice to either use helper libraries or attach an .on('error') listener to every stream. Getting errors wrong is not a great feeling.
In pull-streams, errors are passed into the callback, which brings the whole pipeline to a halt. It’s again the familiar API of cb(err) for an error and cb(null, value) for success.
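As a small illustration of that halt, here's a sketch in which a through stream fails halfway. The `parse` through and the `aborted` flag are made up for the example. The sketch assumes the convention that a failing stream first calls `read(err, ...)` to tell upstream to stop and then passes the error downstream.

```javascript
let aborted = false

function values (items) {
  let i = 0
  return function read (abort, cb) {
    if (abort) { aborted = true; return cb(abort) } // upstream was told to stop
    if (i >= items.length) return cb(true)
    cb(null, items[i++])
  }
}

// Through that parses JSON and reports failures through the callback.
function parse () {
  return function through (read) {
    return function parsedRead (abort, cb) {
      read(abort, function (end, value) {
        if (end) return cb(end)
        let parsed
        try {
          parsed = JSON.parse(value)
        } catch (err) {
          return read(err, () => cb(err)) // abort upstream, then fail downstream
        }
        cb(null, parsed)
      })
    }
  }
}

function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)   // a single place to handle every error
      results.push(value)
      read(null, next)
    })
  }
}

let error, output
collect((err, results) => { error = err; output = results })(
  parse()(values(['1', 'oops', '3']))
)
console.log(error instanceof SyntaxError, output, aborted) // true undefined true
```

The third item is never read: once the error reaches the sink, nobody pulls again.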
Here’s a source stream that returns a single fs.stat value:
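A minimal sketch of such a source, assuming the `read(abort, cb)` convention, could look like the following. `statSource` and `collect` are hypothetical names chosen for illustration, and `process.execPath` is only a convenient file that should exist on any machine running Node.

```javascript
const fs = require('fs')

// Source that emits one fs.Stats object for `file`, then ends.
function statSource (file) {
  let done = false
  return function read (abort, cb) {
    if (abort) return cb(abort)   // consumer aborted: acknowledge and stop
    if (done) return cb(true)     // value already delivered: signal the end
    done = true
    fs.stat(file, cb)             // cb(err) on failure, cb(null, stats) on success
  }
}

// Minimal sink that reads until the source ends.
function collect (done) {
  return function sink (read) {
    const results = []
    read(null, function next (end, value) {
      if (end === true) return done(null, results)
      if (end) return done(end)
      results.push(value)
      read(null, next)
    })
  }
}

collect((err, results) => {
  if (err) throw err
  console.log(results.length, results[0].isFile())
})(statSource(process.execPath))
```

Because `fs.stat` already speaks the cb(err) / cb(null, value) dialect, its callback can be handed straight through. A missing file simply surfaces as an error in the sink.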
Wrapping it up
And that’s it. I could keep whipping out pull-stream examples, but I think I’ve made my point: streams are a cool idea, and pull-stream is a neat implementation. If you’re keen to learn more, take a look at the links below. I hope this was useful; give pull-stream a try and let me know how you go! -Yosh
Originally published at github.com on March 21, 2016.