Scaling-out with Node Clusters

Node Tips and Tricks for faster APIs (1)

In this article series, I will be sharing a few tips I have learned about building APIs with Node.js that are not only as fast as they can be, but also “responsive”.

Usually when we talk about faster programs, we consider code optimisations, and sometimes concurrency, which Node takes care of with its Event Loop via callbacks, and abstractions such as Promises, async-await, etc.

Because the Event Loop is single-threaded, our JavaScript code runs on a single processor core, even on a machine that has several.

This means we can improve the performance of our programs by taking advantage of these other processors via Clustering.

Before we proceed, let’s define a few terms …

What are clusters?

In Node, a cluster is a group of processes that share a single server port. Each process running in cluster-mode is capable of running on a separate CPU core.

This means the optimal number of processes a cluster can have would depend on the number of CPU cores available on the executing machine.
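As a rough illustration of sizing a cluster to the machine, the sketch below reads the core count with Node’s built-in os module. The helper name workerCount and its optional cap are assumptions made for this example; they are not taken from the repository.

```javascript
const os = require('os')

// Hypothetical helper: decide how many worker processes to start.
// `cap` is an optional upper limit, e.g. to leave cores free for other services.
function workerCount(cap = Infinity) {
  const cores = os.cpus().length // number of logical CPU cores reported by the OS
  return Math.max(1, Math.min(cores, cap)) // always start at least one worker
}

console.log(`Would start ${workerCount()} worker(s)`)
```

One process per core is a sensible starting point for CPU-bound work like ours; whether it is truly optimal for a given workload is something only a benchmark can confirm.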

What is a process?

A process is a running instance of a program. When you execute a program you have written, the operating system starts it as a process. A single-threaded process, such as our Node server, uses one CPU core at a time.

What is a port?

A network port is a software construct that serves as a communication endpoint, used to identify a process or application in a network.

A port is usually process or application specific. This means that if a process is listening on port 3000, no other process can listen on that port while the first process exists.
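To see this rule in action, here is a minimal sketch that tries to bind two servers to the same port using Node’s built-in net module. The function name listenTwice is made up for this example, and it asks the operating system for any free port instead of assuming that 3000 is available.

```javascript
const net = require('net')

// Bind one server, then try to bind a second one to the same port.
// Resolves with the error code of the second listen() call.
function listenTwice() {
  return new Promise((resolve, reject) => {
    const first = net.createServer()
    first.on('error', reject)
    // Port 0 asks the OS for any free port.
    first.listen(0, '127.0.0.1', () => {
      const { port } = first.address()
      const second = net.createServer()
      second.on('error', (err) => {
        // The port is taken, so the second server never starts listening.
        first.close(() => resolve(err.code))
      })
      second.listen(port, '127.0.0.1', () => {
        // Not expected: both servers managed to bind the same port.
        second.close(() => first.close(() => resolve('NO_ERROR')))
      })
    })
  })
}

listenTwice().then((code) => console.log('second listen() result:', code))
```

The second call should fail with EADDRINUSE. Clusters are the exception to this rule: the worker processes share one listening port through the process that forked them.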

Now, let’s build …

For the sake of this article, we will build a simple API server to return a list of prime numbers between 1 and a supplied maximum number.
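The repository has its own implementation, which is not reproduced here. As a sketch of the kind of CPU-bound work the endpoint performs, the example below uses a Sieve of Eratosthenes; the function names primesUpTo and parseMax, and the fallback value of 100, are assumptions made for illustration.

```javascript
// Sieve of Eratosthenes: returns every prime p with 2 <= p <= max.
function primesUpTo(max) {
  if (!Number.isInteger(max) || max < 2) return []
  const isComposite = new Uint8Array(max + 1)
  const primes = []
  for (let n = 2; n <= max; n++) {
    if (isComposite[n]) continue
    primes.push(n)
    // Mark multiples of n, starting at n * n (smaller multiples are already marked).
    for (let m = n * n; m <= max; m += n) isComposite[m] = 1
  }
  return primes
}

// Hypothetical helper: read ?max= from a request URL such as '/?max=100000'.
function parseMax(requestUrl, fallback = 100) {
  const { searchParams } = new URL(requestUrl, 'http://localhost')
  const max = Number.parseInt(searchParams.get('max'), 10)
  return Number.isNaN(max) ? fallback : max
}

console.log(primesUpTo(parseMax('/?max=30')).join(', '))
// prints: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29
```

Because this computation runs synchronously, a single process cannot serve any other request until it finishes, which is exactly the situation clustering is meant to relieve.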

We will then use clusters and benchmark the performance of both programs.

Our program will depend on Express. This tutorial assumes you already know how to create a basic Express app.

This code already exists in a GitHub repository, so feel free to clone it and run it on your machine.

git clone
cd prime-cluster
npm start

This will start the server on port 3030 on your computer. You can test it in your browser by navigating to http://localhost:3030.

Now, let’s modify our code to use clustering …

If you’re following this tutorial, you might want to make a branch of your original implementation, because we’ll need to use that when benchmarking.

In our modification, we’ll be focusing on the index.js file, because that’s where all the clustering work will be done.

Located here.

When running in cluster-mode, a process may be either a Master or Fork process.

The first process is always the master process, and it creates forks of itself which become Fork / Worker processes.

In our program, we determine the number of CPU cores available on the machine (my development machine has four, by the way), then we fork the master process once per core, as seen in …
for (let i = 0; i < cpuCount; i++) cluster.fork()

A process can determine whether it is a master or fork process and act accordingly, which is why we have …

if (cluster.isMaster) { /* master: fork the workers */ }
else { /* worker: start the Express server */ }

When a process determines that it is a fork, it starts an express instance, and begins serving HTTP requests.
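Putting these pieces together, a minimal end-to-end sketch might look like the following. It is not the repository’s index.js: Node’s built-in http module stands in for the Express app so the sketch runs without installing anything, and the --serve flag and the function names are assumptions made for this example.

```javascript
const cluster = require('cluster')
const http = require('http')
const os = require('os')

const PORT = 3030

// Node 16 introduced `isPrimary` as the new name for `isMaster`; fall back on older versions.
const isPrimary = cluster.isPrimary ?? cluster.isMaster

function startPrimary() {
  const cpuCount = os.cpus().length
  // One worker per CPU core; each worker re-runs this same file.
  for (let i = 0; i < cpuCount; i++) cluster.fork()
  console.log(`primary ${process.pid} started ${cpuCount} worker(s)`)
}

function startWorker() {
  // Stand-in for the Express app: reply with the id of the process that served the request.
  http
    .createServer((req, res) => {
      res.setHeader('Content-Type', 'application/json')
      res.end(JSON.stringify({ servedBy: process.pid }))
    })
    .listen(PORT, () => console.log(`worker ${process.pid} listening on ${PORT}`))
}

// Only start when asked to, e.g. `node cluster-sketch.js --serve`.
// Workers inherit the primary's command-line arguments, so the flag reaches them too.
if (process.argv.includes('--serve')) {
  if (isPrimary) startPrimary()
  else startWorker()
}
```

With the sketch running, repeated requests to http://localhost:3030 should come back with different servedBy values as connections are spread across the workers.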

It is important to note that fork processes may terminate due to exceptions, and it may be important to restart them when they do.
cluster.on('exit', (worker) => {
  console.log('mayday! mayday! worker', worker.id, ' is no more!')
  cluster.fork() // start a replacement worker
})
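Restarting unconditionally is not always what you want. The sketch below shows one possible policy, assuming we only want to replace workers that died unexpectedly; the helper name shouldRestart is hypothetical, while exitedAfterDisconnect is a property the cluster module sets on workers that were shut down on purpose.

```javascript
const cluster = require('cluster')

// Hypothetical policy: restart a worker unless it exited on purpose.
// `exitedAfterDisconnect` is true when the exit followed worker.disconnect() or worker.kill().
function shouldRestart(worker) {
  return !worker.exitedAfterDisconnect
}

// Node 16 introduced `isPrimary` as the new name for `isMaster`; fall back on older versions.
const isPrimary = cluster.isPrimary ?? cluster.isMaster

if (isPrimary) {
  cluster.on('exit', (worker, code, signal) => {
    console.log(`worker ${worker.id} exited (code: ${code}, signal: ${signal})`)
    if (shouldRestart(worker)) cluster.fork()
  })
}
```

Note that a worker which crashes immediately on startup would be re-forked over and over under this policy, so a production setup would probably add a delay or a restart limit.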

On my machine, we now have 4 running instances of our prime-numbers server rather than just 1 as we did initially.

This should mean better performance, but there’s only one way to find out.

Introducing Artillery: a powerful, modern load-testing toolkit built with Node.js. We’ll be using it to check the performance of our API with and without clustering.

Bring in the heavy artillery

First, install Artillery if you don’t already have it …

  • Open your terminal or command prompt and run
npm i -g artillery
  • With the server running, start the load test
artillery quick --count 50 -n 40 "http://localhost:3030?max=100000"

This tells artillery to create 50 virtual users, and each should send 40 HTTP requests to http://localhost:3030?max=100000.

When testing without clustering …

Don’t forget to switch to the master branch of the repo.

Benchmark Test result of API (without clusters)

Artillery gives its report in batches. So while we had scheduled 2000 requests (50 users * 40 requests each), it gives the results in batches of 765, 805, then 430 requests.

It then gives a summary report covering the entire 2000 requests.

Our focus is on the request latency section. Latency is the time taken in milliseconds from when the request was created till the response is received. We have parameters like:

  • Min: The minimum latency
  • Max: The maximum latency for the batch of requests being reported
  • Median: The median latency
  • p95: 95% of the requests had a latency less than or equal to this value
  • p99: 99% of the requests had a latency less than or equal to this value
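To make these definitions concrete, here is a small sketch that computes the same statistics from a list of latencies using the nearest-rank method. This is not necessarily how Artillery calculates its figures, and the sample latencies are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new RangeError('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  return sorted[rank - 1]
}

// Made-up latencies in milliseconds, for illustration only.
const latencies = [13, 20, 25, 31, 40, 52, 60, 75, 90, 1400]

console.log({
  min: Math.min(...latencies),
  max: Math.max(...latencies),
  median: percentile(latencies, 50),
  p95: percentile(latencies, 95),
  p99: percentile(latencies, 99),
})
```

With only ten samples, the single slow request dominates both p95 and p99. With thousands of samples, as in our benchmark, the percentiles say much more about what a typical user experiences than the maximum does.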

On average, without clustering, we have the results:

- min: 13ms
- max: 1443.1ms
- median: 610.3ms
- p95: 683.7ms
- p99: 1256.4ms

When testing with clustering …

Don’t forget to switch to the clusters branch of the repo.

Here, we have the average results …

- min: 13.5ms
- max: 1606.5ms
- median: 290.7ms
- p95: 560.6ms
- p99: 719.5ms

Let’s compare the two results …

- metric: (without clusters) vs (with clusters)
- min: 13ms vs 13.5ms
- max: 1443.1ms vs 1606.5ms
- median: 610.3ms vs 290.7ms
- p95: 683.7ms vs 560.6ms
- p99: 1256.4ms vs 719.5ms
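The size of each change is easier to judge as a percentage. The sketch below works it out from the figures above; the function name reduction and the keys single and clustered are just labels chosen for this example.

```javascript
// Latencies reported above, in milliseconds.
const results = {
  min: { single: 13, clustered: 13.5 },
  max: { single: 1443.1, clustered: 1606.5 },
  median: { single: 610.3, clustered: 290.7 },
  p95: { single: 683.7, clustered: 560.6 },
  p99: { single: 1256.4, clustered: 719.5 },
}

// Percentage by which latency dropped, rounded to one decimal place.
// A negative value means latency got worse.
function reduction(before, after) {
  return Math.round(((before - after) / before) * 1000) / 10
}

for (const [metric, { single, clustered }] of Object.entries(results)) {
  console.log(`${metric}: ${reduction(single, clustered)}% reduction`)
}
```

This works out to roughly a 52.4% reduction in median latency, 18% at p95 and 42.7% at p99, while the minimum and maximum are slightly worse (negative reductions of 3.8% and 11.3%).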

We’re focused on the p95 and p99, because we want the majority of our requests to finish in good time.

Whoop! Whoop! 🎉🍾

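Our implementation with clusters has better p95 and p99 performance: 99% of the 2000 requests completed within 719.5ms, compared with 1256.4ms for a single process, and the median latency roughly halved. The maximum latency was slightly higher with clusters, but that figure reflects only the slowest requests.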