What They (probably) Didn’t Teach You, Pt. 2: Streams—Overview

Brian Lee
7 min read · Jun 1, 2017


“The pipes, the pipes are calling”

(Brace yourselves, this one is going to be heavy on metaphors.)

If you’re not familiar, a stream might just be a geographical feature to you: a body of water defined entirely by one characteristic, a continuous surface flow.

Streams in programming are, quite predictably, a metaphor for this concept: instead of water, data flows. It’s important to note that a direction is inherent to the idea of a flow: data has to flow from somewhere to somewhere else; it flows in, and it flows out.

Input and Output

A computer can do a lot internally in a very short amount of time. That said, most of that behavior is fairly useless to humans. What makes a computer useful to any user is that we can control its behavior through inputs and receive results as outputs we can perceive.

By the time UNIX was developed, the scattered landscape of input and output devices had become unsustainable. To simplify the process, UNIX was built so that inputs and outputs were all a series of strings regardless of the device used; systems were now free to use whatever devices they wanted, as long as those devices sent data as a flow of text strings. Since then, text streams have been a major part of programming.

The Unix Way

A key concept that has guided software development since the spread of computers is the UNIX philosophy, which is summarized as:

  • Write programs that do one thing and do it well
  • Write programs to work together
  • Write programs to handle text streams, because that is a universal interface

The way these ideas come together may be better demonstrated with a conceptual example. Let’s say we want to retrieve text from a web server, search it for a certain word and save the results in a file on the local file system. Instead of writing a single program that does all of this, the UNIX philosophy would use three separate programs, one each for the HTTP request, the searching and the file system write. We would use one program to download the data, pass the result to a second program that only searches, take that result and finally run a third program to write it to local disk. Streams factor into this process as the mechanism that lets one program’s result feed into another program through pipelines.

The above example is written as a shell command in UNIX-like systems as follows:

$ curl 'http://url.to.text' | grep search-term | tee result.txt

curl is the program that lets you send HTTP requests from your terminal, grep searches text for matches and tee saves the incoming data to a file (while also passing it along to its own output). The three programs are connected by the | character, which establishes a pipe between each pair of programs.

The above command can be simplified with a plain redirection if the search result does not need to be piped any further: instead of | tee result.txt, use > result.txt.

Piping to/from Node.js

Node supports streaming interfaces. For example, this is entirely valid:

$ curl -s https://gist.githubusercontent.com/brianjleeofcl/643dff9b431e598c288ca7e725ae4309/raw/54dfb0883cf522e8e525535beb54eb755c8fd258/script.js | node

The script file downloaded through curl is fed into node. In the above example, the URL points to a script file which contains a single console.log statement:
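A minimal stand-in with the same effect (the exact message here is just for illustration) would be:

console.log('Hello from a script piped into node');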

Running curl ... | node executes the above file in Node, which simply prints a console message. Likewise, the runtimes for many other languages, such as | python, | ruby or | bash, can all run code that is piped in from an external source.

Be careful: executing scripts from an online source poses obvious risks. If the script contains malicious behavior, it can cause serious problems. Unless it comes from a trustworthy source, do not execute code downloaded from the internet.

The piping interface in the shell requires programs to implement three streams: a standard input, a standard output and a standard error, commonly known as stdin, stdout and stderr. In Node, the three streams are exposed on the process object.
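For instance, a minimal script that touches all three (the messages are arbitrary):

// the three standard streams live on the process object
process.stderr.write('starting up\n'); // stderr: diagnostics, kept out of the piped data
process.stdin.pipe(process.stdout);    // stdin flows straight through to stdout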

In the following example, we have three files: one reads an image given a file path, another resizes the image using ImageMagick and the third writes to a file given a file path.

In each of these files, one stream is piped into another. We’ll look into how these streams are implemented in the next section; for now, note the use of .pipe().
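Roughly, the three files could look something like the following sketch; the file names and the exact convert invocation are assumptions for illustration, but the piping is the point:

// read-image.js: stream the file at the given path to stdout
const fs = require('fs');
fs.createReadStream(process.argv[2]).pipe(process.stdout);

// resize-image.js: pipe stdin through ImageMagick's convert and back out to stdout
const { spawn } = require('child_process');
const convert = spawn('convert', ['-', '-resize', '50%', '-']); // '-' means stdin/stdout
process.stdin.pipe(convert.stdin);
convert.stdout.pipe(process.stdout);

// write-image.js: stream stdin into the file at the given path
const fs = require('fs');
process.stdin.pipe(fs.createWriteStream(process.argv[2]));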

To see how these files are executed, consider chaining them in the shell just as curl, grep and tee were chained above.

In the basic command, the process output from the image reader is piped into the image manipulator, whose output is then piped into the file writer.

In this case, there are three separate process-level instances of Node. This allows the operating system to assign each process to appropriate threads, which is one way of working around the fact that Node is a single-threaded runtime. Note that there is no shared memory involved and the code is executed in series as opposed to concurrently.

The benefits of piping become quite apparent once parts of the pipeline are swapped out. In one variation, the input is different: the source image is downloaded using curl rather than read from disk. In another, the output is different: the manipulated image is posted to a web server instead of being written to a file. In all of these cases, the code that handles the image manipulation is identical. The code is reusable since it does not depend on how the data is provided in order to complete its task: as long as appropriate input is streamed into stdin and the result is streamed out to stdout, it works.

The fact that separate processes can communicate and pass data from one to another using pipelines of streams is precisely what enables the UNIX philosophy of writing small programs that can accomplish a multitude of tasks when connected together.

Streams in Node.js

The shell implementation of streaming proved so powerful and efficient that streaming became a common pattern inside software as well. Node is no exception here.

Before we look at implementation details, let’s think about the stream metaphor once again. A stream by definition has two points: upstream and downstream. With those two points defined, we must connect the upstream to the downstream to ensure smooth drainage without any overflow at any point.

In Node.js, the built-in stream class has two separate interfaces in order to distinguish the upstream from the downstream: readableStream and writableStream. A readableStream allows reading of data while a writableStream allows writing of data. This means data must travel from a readableStream instance to a writableStream instance, establishing the direction of data flow. All readableStream instances have a .pipe(stream) method, which takes an instance of writableStream as its argument. Using stdin and stdout as an example,

process.stdin.pipe(process.stdout);

creates a process where data is passed straight through. Some streams implement both readableStream and writableStream, which means that in certain cases multiple .pipe(stream) calls can be chained.

Streams that implement both reading and writing are Duplex and Transform streams. Encryption/decryption and compression/decompression packages are often implemented this way.
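For example, the compression streams in Node’s built-in zlib module are Transform streams, so one can sit in the middle of a pipe chain:

const zlib = require('zlib');

// the gzip stream is writable (it consumes raw bytes) and readable (it emits compressed bytes)
process.stdin
  .pipe(zlib.createGzip())
  .pipe(process.stdout);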

In addition to .pipe(stream), all streams in Node are extensions of the EventEmitter class, meaning we can attach event listeners directly onto streams using .on(...) or .once(...). This is important because information about the current status of the stream is communicated to the process through events, which signal that the process should retrieve and use the data.
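For instance, a readable stream reports failures and completion through events (the file name here is arbitrary):

const fs = require('fs');

const input = fs.createReadStream('image.png');

// status is communicated through events rather than return values
input.once('error', (err) => console.error('read failed:', err.message));
input.once('end', () => console.error('no more data to read'));

input.pipe(process.stdout);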

There are two modes of reading from streams. When a readableStream has data available, it emits a readable event; in its listener, the process takes the data with the .read() method. Because the data is not retrieved and cleared from the internal buffer until the method is invoked, this is considered the “paused mode” of reading a stream. This default behavior is turned off when .pipe() is invoked, which feeds the data directly to a writableStream without any intervening steps, or when a listener for the data event is added, which makes the data available in the callback function. Paused mode is the default since it is relatively easier to control the inflow from the process side: no data will enter the process without .read() being called, as opposed to flowing streams, which invoke the listener automatically for each chunk of data and can only be paused and resumed through .pause() and .resume().

On the opposite end, writing to a writableStream is a fairly straightforward process: use .write(data) to push data to the stream, and use .end(data) to send the last bit of data and close the stream.
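With a file stream, for instance (the file name is arbitrary):

const fs = require('fs');

const out = fs.createWriteStream('result.txt');
out.write('first chunk\n'); // push data into the stream
out.end('last chunk\n');    // send the final bit of data and close the stream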

Note: you cannot close stdout. process.stdout.end() will throw with Error: process.stdout cannot be closed. If the process has no more output to stream out, you may need to exit explicitly via process.exit(0); otherwise, the process stops naturally once the event loop has nothing left to do.

To put this all together, let’s look at the two ways we can rewrite process.stdin.pipe(process.stdout):
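Sketching from the description above, the paused-mode version and the flowing-mode version look roughly like this:

// Paused mode: wait for 'readable', then pull chunks out explicitly with .read()
process.stdin.on('readable', () => {
  let chunk;
  while ((chunk = process.stdin.read()) !== null) {
    // data manipulation would go here
    process.stdout.write(chunk);
  }
});

// Flowing mode (as a separate script): a 'data' listener switches the stream to
// flowing and hands each chunk to the callback as it arrives
process.stdin.on('data', (chunk) => {
  // data manipulation would go here
  process.stdout.write(chunk);
});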

Since just passing data from input directly to output is not really accomplishing anything, it makes sense to write code that handles the data before writing it out; the appropriate places for the data manipulation are noted in the comments.

The above examples illustrate another benefit of utilizing streams: the processing handles smaller chunks of data. It is much more efficient to load a small part of a large file into memory, handle it, send it down the stream and let it be garbage-collected before repeating the process. The alternative would be to load all of the data into memory at once, which slows down the whole process and at worst creates a memory shortage.
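To make the contrast concrete (the file names are arbitrary), compare loading a whole file before writing it back out with streaming it through in chunks:

const fs = require('fs');

// whole-file approach: the entire file must fit in memory before anything is written
fs.readFile('big-input.iso', (err, wholeBuffer) => {
  if (err) throw err;
  fs.writeFile('copy-buffered.iso', wholeBuffer, (err) => {
    if (err) throw err;
  });
});

// streaming approach: only a small chunk is held in memory at any given moment
fs.createReadStream('big-input.iso').pipe(fs.createWriteStream('copy-streamed.iso'));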

Summary

Streams are a foundational concept on top of which modern computing has been developed. Understanding how they work and how to use them allows for creative, efficient programming. At a future date I would like to dig deeper into building streams and into advanced subjects such as internal buffers, corking and backpressure, but for the moment, hopefully this post serves as a quick introduction and a guide to bringing streams into your code.

Next up: Child Processes
