A Brief History of Node Streams pt.1

Introduction

From spew streams to suck streams, streams are a little-understood interface used in almost every internal module of Node.js and across thousands of NPM packages.

How exactly did streams come to exist? How do they vary from version to version of Node? This post takes a look at what streams are and what they do, with some examples along the way.

UNIX Background

The Streams interface in Node.js is an analogous implementation of the pipe interface found on UNIX systems.

So, what does that mean exactly?

We can think about a pipeline as the movement of information between two points in space.

The output of data from one process gets piped into the input of another process.

This diagram describes the nature of a pipe in UNIX systems:

(Diagram from The Linux Programming Interface, ©2010 Michael Kerrisk.) While the diagram depicts a pipeline as unidirectional, Node.js streams can also be bidirectional.

Stdin and stdout are standard streams in computer programming. They are simply communication channels. The former denotes standard input, and the latter, standard output.

In a common UNIX bash shell, you might write a command like this to list the files in the working directory.

$ ls
# The list of files is displayed in the shell output:
file1.js  file2.txt  file3.txt

In this instance, what you, the user, type is considered stdin, and the list of files being displayed is stdout.

You might decide to redirect that data to a file.

# This appends the list of files to a text file, creating it if it doesn't exist.
$ ls >> this_directory_files.txt

Or pipe it through a filter first:

# This appends only the files ending with a ".js" extension.
$ ls | grep '\.js$' >> this_directory_files.txt

This is an example of how pipes can be used to move and manipulate data from one endpoint to another in a concise manner.

The way UNIX pipes are used to transfer data from process to process is exactly how streams are used in Node.js. That is why they are so often used by developers and within the Node internal codebase itself.

Since file system I/O, http, crypto, and TTY all implement streams, it is easy to imagine the enormous range of use cases for transferring and manipulating data in Node.js.

What makes streams so powerful, though, is their affinity for one another: a stream constructed in one module will easily link to a stream from a completely different module.
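
As a minimal sketch of that composability (the file names here are just placeholders), the snippet below pipes a readable stream from the fs module through a transform stream from the zlib module and into a writable stream back in fs:

const fs = require('fs');
const zlib = require('zlib');

// A readable stream (fs), a transform stream (zlib), and a writable stream (fs)
// link together with .pipe(), even though they come from different modules.
fs.createReadStream('./input.txt')
  .pipe(zlib.createGzip())
  .pipe(fs.createWriteStream('./input.txt.gz'));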

Node.js is built with the UNIX philosophy in mind. Should you be unfamiliar, one of the most important takeaways is this:

Do One Thing and Do It Well

Following this principle, lightweight binaries and modules are created that each excel at executing one simple task. With the connective properties of pipes (and, by analogy, streams), these modules can link up to create a complex system that executes complicated tasks.

In Node.js, this congruency fosters an entire ecosystem of streaming on a global community scale. An example of this is found in build tools like gulp where developers often build and share plugins to introduce custom data manipulation in the application build!

Node Streams

To fully understand how Node.js streams are related to UNIX’s pipes, consider Node’s process.stdin and process.stdout objects. These are the most direct implementation of the UNIX standard streams.

                    Streams Hierarchy in Node.js
=================================================================
                          EventEmitter
                               |
                       Stream (base class)
                      /                 \
               Readable                Writable
                   |                       |
            process.stdin          process.stdout
               (stdin)                (stdout)
                   .                       .
                   .                       .
                   .                       .

Due to this inheritance, the Readable stream is said to be similar to stdin, while the Writable stream is similar to stdout.

Here’s an example of how you can use Node.js’ implementations of stdin and stdout to write a JavaScript shell script.

First, make a file that will contain your JavaScript commands.

$ touch your_script.js

Go ahead and make that file executable:

$ chmod u+x your_script.js

This will be the code to put in the file. The first line is a Node.js shebang; it tells the shell to interpret the following code using Node.js.

#!/usr/bin/env node
process.stdin.setEncoding('utf8');

process.stdin.on('readable', () => {
  var chunk = process.stdin.read();
  if (chunk !== null) {
    process.stdout.write(`${chunk}`);
  }
});

To run the script, write this into your command line:

$ ls | ./your_script.js

Voila! You have your first Node.js bash script.

Though this example is a verbose way to implement what is already native to the UNIX shell, I hope it might inspire ideas on how you could write shell scripts in JavaScript.

You could take this one step further and refactor the above code to look like this:

#!/usr/bin/env node
process.stdin.pipe(process.stdout);

Better yet, this is just one small dimension of the streams Node.js provides to help expand the potential and power of JavaScript. In any case, should the built-in modules not suit your needs, you can extend the stream API and build your own custom stream.
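
As a small sketch of what extending the API can look like, here is a toy Transform stream (the class name and behaviour are made up for illustration) that upper-cases whatever passes through it:

const { Transform } = require('stream');

// A toy Transform stream: it receives chunks, upper-cases them,
// and pushes them along to whatever it is piped into.
class UpperCase extends Transform {
  _transform(chunk, encoding, callback) {
    callback(null, chunk.toString().toUpperCase());
  }
}

process.stdin.pipe(new UpperCase()).pipe(process.stdout);

Because it implements the same interface as the built-in streams, it plugs into .pipe() just like stdin and stdout do.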

Following the UNIX philosophy, streams are designed to be easy to use and to require little knowledge of their underlying patterns and structures. But in order to really understand streams, it is important to ask these questions:

  • What happens when the source is a large file that contains hundreds of thousands of bits of data?
  • What ensures the integrity of the data?
  • What would happen if a process’ resources were used up before the entire file was sent?

Chunking & Buffering

Streams use internal tools and patterns to break up data into manageable pieces to send them from one process to another. This is a process known as chunking.

Chunking abstracts complex, larger globs of data into smaller parts, which are easier to transfer. The way chunking works is inherent in how streams receive data, and that workload is shifted onto buffers.

Node.js’ Buffer class is designed after a generic data buffer. In the simplest terms, Node’s implementation of buffers converts data into a fixed-size array of integers (according to the encoding you’ve set, where UTF-8 is the default). These integers represent bytes, and each buffer corresponds to raw memory allocated outside the V8 heap.
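
For a quick, concrete look (just a tiny sketch), here is what a buffer built from a short UTF-8 string holds:

// Each integer below is one byte of the UTF-8 encoded string 'node'.
const buf = Buffer.from('node', 'utf8');

console.log(buf);                   // <Buffer 6e 6f 64 65>
console.log(buf.length);            // 4 bytes
console.log(buf.toString('utf8'));  // 'node'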

Transferring binary data instead of strings ensures safe transportation, API universality, and speed. In instances where memory resources are used up, a back-pressure system kicks in.

Back-pressure

Back-pressure describes the regulation of the flow of data: specifically, the method that streams use to handle an influx of data they have no room left for.

From Wikipedia:

The term is also used analogously in the field of information technology to describe the build-up of data behind an I/O switch if the buffers are full and incapable of receiving any more data; the transmitting device halts the sending of data packets until the buffers have been emptied and are once more capable of storing information.

Below I’ve provided a visual example:

If we take a look at our friend Pacman, we see he is trying to consume a bunch of white orbs. Say, though, that he becomes too full and can no longer digest any more.

In this example, as a form of back-pressure, Pacman will signal to the system, “stop the incoming flow of orbs!” so he has time to empty his stomach, and once he has room, begin to eat again.

When back-pressure is applied, the stream has time to process all the data it has recently accepted, which is called draining its buffers.

Once the buffers are flushed, the stream resumes accepting incoming data.

In this instance, the caution tape is the back-pressure system and Pacman is the consumer. The source is where the orbs are being generated.
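
In code, that signalling looks roughly like the sketch below (file names are placeholders): a writable stream’s write() returns false when its internal buffer is full, and its 'drain' event fires once the buffer has been emptied.

const fs = require('fs');

const source = fs.createReadStream('./big-file.txt');
const destination = fs.createWriteStream('./copy.txt');

source.on('data', (chunk) => {
  // write() returns false when the writable's internal buffer is full.
  if (!destination.write(chunk)) {
    source.pause();                                    // "stop the incoming flow of orbs!"
    destination.once('drain', () => source.resume());  // room again, keep eating
  }
});

source.on('end', () => destination.end());

This is roughly what pipe() does for you under the hood.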

In earlier implementations of Node.js, back-pressure was automated by a utility function named .pump().

var fileSystem = require('fs');
var utils = require('util');

var inputFile = fileSystem.createReadStream('./input.txt');
var outputFile = fileSystem.createWriteStream('./output.txt');

utils.pump(inputFile, outputFile);

This was a small, simple interface that handled a lot of things. Pump attached event listeners to these native streams that were called when there was an error or when the queue was busy.

Node.js has evolved to the point where most streams in core have been unified. That means that, as a developer, you’ll find the interface even easier to understand, implement, and reuse, all of which continues to promote the UNIX philosophy.

var fileSystem = require('fs');

var inputFile = fileSystem.createReadStream('./input.txt');
var outputFile = fileSystem.createWriteStream('./output.txt');

inputFile.pipe(outputFile);

Finally, let’s take a look at how all of this fits together. In UNIX, communication from process to process is delegated through the kernel using signal codes. Node.js replicates this communication with the use of Events.

EventEmitters

Streams are built on EventEmitters. If you are familiar with jQuery or the browser’s EventTarget, you will find EventEmitters easy to understand.

An event is just what its name suggests and can be understood in the traditional sense. Events come in two parts: a listener and an emitter.

Any time a rule, action, or parameter is fulfilled, an EventEmitter will say to the rest of the program, ‘hey! this happened!’. However, the question is: if there is no one there to listen, does the EventEmitter exist? For this reason (practical and philosophical), a listener really matters.

An event listener has a function attached to it. Every time an event is triggered, the function will execute.
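
Here is a minimal sketch of that pair (the door object and its 'knock' event are made up for illustration):

const { EventEmitter } = require('events');

const door = new EventEmitter();

// The listener: a function attached to the 'knock' event.
door.on('knock', (who) => {
  console.log(`${who} knocked!`);
});

// The emitter side announces "hey! this happened!" to anyone listening.
door.emit('knock', 'somebody'); // prints: somebody knocked!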

A stream consists of multiple events that are triggered in succession. When data chunks are sent, they are sent as event payloads. This spreads out, over time, how data is managed, read, and processed: instead of overwhelming one process all at once, events allow for the slow trickle of data from one stream to another.

event: data
event: data
[ a process is busy ]
event: pause
[ wait until the buffer is drained ]
event: resume
event: data
event: data = null
event: end
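
As a sketch of how that sequence maps onto listeners, here is a readable file stream (the file name is a placeholder) with handlers for the events shown above:

const fs = require('fs');

const readable = fs.createReadStream('./input.txt');

readable.on('data', (chunk) => {                        // event: data
  console.log(`received ${chunk.length} bytes`);
});

readable.on('pause', () => console.log('paused'));      // event: pause
readable.on('resume', () => console.log('resumed'));    // event: resume

readable.on('end', () => console.log('no more data'));  // event: end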

Conclusion

So hopefully you have a better understanding of what streams are! Maybe you’re checking out tutorials across the web, but you might notice that there are discrepancies between guides from different years. One might call an event that another doesn’t, yet the results are close to identical.

Or maybe you’ve read terms like streams1, streams2, streams3, or classic streams being thrown around.

Are these external packages? Which one is better to use? So many questions! But fret not! All these monikers exist because Node.js is constantly evolving!

Each iteration of streams tends to be drastically different from the last, or implements a cool new feature. Learning these names and what they refer to will help you troubleshoot your project and understand the best practices for each iteration.

In part two, we’ll take a look into different versions throughout the years and how they vary: A Brief History of Node Streams pt.2.

Thanks for reading :)