Building Graceful Node Applications in Docker

So the good news is that you have a Node application that’s up and running and serving traffic. Congratulations! The bad news is that at some point, your application is going to do something it’s not supposed to. Maybe a database connection goes stale, the node runs out of memory, or a deploy forces your server to restart. The question you need to ask is what happens to any requests that are in the middle of being processed? Chances are, when your application dies unexpectedly, so do any in-flight requests.

Graceful exiting is a way of handling these situations by allowing the application to finish responding to requests and shut itself down before the process actually dies. Although it is relatively easy to add graceful shutdown to a Node application, the way that both Docker and npm spawn child processes and handle signals leads to some unexpected differences between local and Dockerized implementations.

Request lifecycle with graceful exiting (in-flight requests will still be aborted, but processing requests will complete)

Graceless exit

To test this out, let’s create a very simple node application.

// package.json
{
"name": "simple_node_app",
"main": "server.js",
"scripts": {
"start": "node server.js"
},
"dependencies": {
"express": "^4.13.3"
}
}
// server.js
'use strict';
const express = require('express');
const PORT = process.env.port || 8080;
const app = express();
app.get('/', function (req, res) {
res.send('Hello world\n');
});
app.get('/wait', function (req, res) {
const timeout = 5;
console.log(`received request, waiting ${timeout} seconds`);
const delayedResponse = () => {
res.send('Hello belated world\n');
};
setTimeout(delayedResponse, timeout * 1000);
});
app.listen(PORT);

As expected, when we run our application locally, it does not exit gracefully.

# Start the application
$ npm install && npm start
> start simple_node_app
> node server.js

Make a request in another terminal

$ curl http://localhost:8080/wait

Send a SIGTERM signal to `npm` before the request has completed

# find the PID of the npm process
$ ps -falx | grep npm | grep -v grep
UID PID PPID CMD
502 68044 31496 npm
# send a SIGTERM (-15) signal to that process
$ kill -15 68044

See that npm has terminated and the request aborts midway

$ npm start
> node server.js
Running on http://localhost:8080
received request, waiting 5 seconds
Terminated: 15

$ curl http://localhost:8080/wait
curl: (52) Empty reply from server

Handle all the signals

To fix this, we need to add explicit signal handling in our server.js file (signal handling borrowed from this great post by Grigoriy Chudnov)

const server = app.listen(PORT);
// The signals we want to handle
// NOTE: although it is tempting, the SIGKILL signal (9) cannot be intercepted and handled
var signals = {
'SIGHUP': 1,
'SIGINT': 2,
'SIGTERM': 15
};
// Do any necessary shutdown logic for our application here
const shutdown = (signal, value) => {
console.log("shutdown!");
server.close(() => {
console.log(`server stopped by ${signal} with value ${value}`);
process.exit(128 + value);
});
};
// Create a listener for each of the signals that we want to handle
Object.keys(signals).forEach((signal) => {
process.on(signal, () => {
console.log(`process received a ${signal} signal`);
shutdown(signal, signals[signal]);
});
});

Now when we re-run our experiment from earlier, we can see that the server finishes processing the request and shuts down before exiting.

$ npm start
> node server.js
Running on http://localhost:8080
received request, waiting 5 seconds
process received a SIGTERM signal
shutdown!
sending response!
server stopped by SIGTERM with value 15

and the request completes

$ curl http://localhost:8080/wait
Hello belated world

Note: npm actually complains here because it's not expecting node to exit. However, since node is doing what it's supposed to do, this error isn't really meaningful.

npm ERR! simple_node_app@1.0.0 start: `node server.js`
npm ERR! Exit status 143

Dockerize all the things

Docker is a software containerization tool that offers the ability to efficiently package, deploy, and manage your application. I won’t go into all of the reasons that Docker is awesome, so just take my (and the internet’s) word for it.

Dockerizing our application is simple: we just add a Dockerfile, and then build and run our new container.

# Dockerfile
FROM node:boron
# Create app directory
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
# Install app dependencies
COPY package.json /usr/src/app/
RUN npm install --production --quiet
# Bundle app source
COPY . /usr/src/app
EXPOSE 8080
CMD ["npm", "start"]

We can then build and run our Docker application

$ docker build -q -t grace . && docker run -p 1234:8080 --rm --name=grace grace
> node server.js

Now we want to repeat our earlier experiment by sending a request to the dockerized application and shutting it down before the request completes. We do this by making a request to our new endpoint (Docker internally maps port 8080 to the external port 1234), and calling docker stop grace (which sends a SIGTERM signal to the docker container named grace).

$ curl http://localhost:1234/wait
curl: (52) Empty reply from server

What happened? Why are we seeing an aborted request, when the same code behaved gracefully when we ran it on our machine?

NPM guts

To understand this, let’s look a little deeper into how npm start works.

When we run npm start locally, it spawns the node process as a child. We know this because the Parent Process ID (PPID) of the node process is the Process ID (PID) of the npm process.

$ ps -falx | grep "node\|npm" | grep -v grep
UID PID PPID CMD
502 65378 31800 npm
502 65379 65378 node server.js

We can double-check that npm only spawns one child process by searching for all processes within that Process Group ID (PGID)

$ ps xao uid,pid,ppid,pgid,comm | grep 65378
UID PID PPID PGID CMD
502 65378 31800 65378 npm
502 65379 65378 65378 node

However, we see something a bit different when we check the process graph on our docker container.

$ ps falx
UID PID PPID COMMAND
0 1 0 npm
0 16 1 sh -c node server.js
0 17 16 \_ node server.js

In Docker, the npm process shawns a shell process, which then spawns the node process. This means that npmdoes not spawn the node process as a direct child.

Let’s investigate whether this discrepency is due to the way Docker executes CMD ["npm", "start"] or a more general quirk of npm running in Docker. To do this, we're going to ssh into a running Docker container and run npm startmanually to see how it spawns child processes.

# Add an extra port mapping to our container so that we can run two node servers
$ docker run -p 1234:8080 -p 5678:5000 --rm --name=grace grace
# SSH into the container in another terminal and check the currently-running processes
$ docker exec -it grace /bin/sh
$ ps falx
UID PID PPID COMMAND
0 1 0 npm
0 15 1 sh -c node server.js
0 16 15 \_ node server.js
# Start up a second node server on a different port
$ port=5000 npm start
> node server.js
Running on http://localhost:5000

Now we ssh into the container in another terminal and check to see what the process graph looks like

$ docker exec -it grace /bin/sh
$ ps falx
UID PID PPID COMMAND
0 22 0 /bin/sh
0 46 22 \_ npm
0 56 46 \_ sh -c node server.js
0 57 56 \_ node server.js
0 1 0 npm
0 15 1 sh -c node server.js
0 16 15 \_ node server.js

Here we can see that, regardless of how npm start is called, it always spawns a shell process, which then spawns a node process. This is different than how npm behaves locally, where it spawns the node process directly.

The great signal hand-off

I’m not sure why npm behaves differently in these two scenarios, but this spawning discrepancy is my best guess as to why the same code exits gracefully locally, but flops and dies in Docker. npm knows how to pass signals to its child process, but the shell process doesn't know how to pass signals to its child node process.

Note: there have been a number of great posts written about the way that the main docker process passes (or fails to pass) signals when it’s run as PID 1, including this one by Grigoriy Chudnov, this one by Brian DeHamer, and this one by Yelp. Many people have written solutions for this problem, including Yelp’s dumb-init library, the tini library and the docker run --init flag. However, it's important to note that the problem we're seeing centers around the way that npm passes signals, regardless of whether it's run as PID 1 or not.

The solution to this signal-passing problem is disappointingly simple: run node server.js directly from the Dockerfile instead of npm start.

# Dockerfile
EXPOSE 8080
CMD ["node", "server.js"]

This is unfortunate, since npm start offers a uniform entry point for starting your application. This is particularly useful if you're passing a number of flags or options to node, but I think the ability to exit gracefully trumps the niceties of being able to call a single start command.

With this change, we can see that calling docker stop propogates the signal to the node process so that it can respond and shut down gracefully before exiting.

$ docker build -q --no-cache -t grace . && docker run -p 1234:8080 --rm --name=grace grace
Running on http://localhost:8080
received request, waiting 5 seconds
process received a SIGTERM signal
shutdown!
sending response!
server stopped by SIGTERM with value 15

Our request completes as expected.

$ curl http://localhost:1234/wait
Hello belated world

Really really long requests

You might notice something strange if you have a particularly long-running request.

app.get('/wait', function (req, res) {
// increase the timeout
const timeout = 15;
console.log(`received request, waiting ${timeout} seconds`);
const delayedResponse = () => {
console.log("sending response!");
res.send('Hello belated world\n');
};
setTimeout(delayedResponse, timeout * 1000);
});

When we repeat our usual setup of running the docker container, making a request, and then stopping the container, we’re back to gracelessness.

$ docker build -q --no-cache -t grace . && docker run -p 1234:8080 --rm --name=grace --init grace
Running on http://localhost:8080
received request, waiting 15 seconds
process received a SIGTERM signal
shutdown!

$ curl http://localhost:1234/wait
curl: (52) Empty reply from server

The request aborts because docker stop has a default timeout of 10 seconds before it gives up and sends a kill signal. Recall that SIGKILL signals cannot be caught or ignored, which means we cannot exit gracefully once one has been sent. However, docker stop does have a --time, -t flag that you can use to increase the amount of time before the container gets killed. This is useful if you expect a single request to take more than 10 seconds.

A graceful conclusion

It is extremely important for a web application to be able to exit gracefully so that it may perform any cleanup work and finish serving in-flight request. This is easy to do in Node applications by adding explicit signal handling to the node process; however, this may not be sufficient for Dockerized applications because of the way processes spawn children and pass signals.

Any intermediary that is used to start node, such as a shell script or npm, may not be able to pass signals down to the actual node process. Because of this, it is better to start node directly from the Dockerfile, so that it can receive signals properly. In addition, because docker sends a KILL signal after a timeout following docker stop, applications with long-running requests may need to increase this timeout to allow requests to complete before the application shuts down.