When node.js is the wrong tool for the job

JavaScript is my first programming language, and I haven’t found the need to learn others, mostly because you can do almost everything with node.js and JavaScript. However, after recently building a high-volume service for a client, I found myself, for the first time ever, wishing I hadn’t written the service in node.js, though my client doesn’t mind.

When is node.js right for the job?

I mostly work on CRUD apps, which suits node.js perfectly well. If your requests are simply database calls without CPU-intensive logic, then node.js is good enough for your app. These are very simple apps that do not need anything special from the backend.

Simple CRUD apps are an even better fit for node.js when there is also a frontend, e.g. a React or Angular app. Having the entire stack written in JavaScript is great for the team: frontend developers can quickly investigate and, more importantly, understand the backend if needed. More people know JavaScript than any other language, so almost any available engineer can help when more hands are needed.

Another factor is speed. If your entire team has to learn Go or Rust to write a Go or Rust service, then it’s obviously going to take longer to develop than a node.js app (assuming everyone knows JavaScript). Sure, the Go or Rust version might be 2x as fast, but it might also be 2x as slow to develop, which may cost the company more than 2x as many servers.

I would really only recommend modern node.js. Years ago, before ES6 Promises were accepted by the community, I contributed heavily to node.js modules because a lot of them were simply broken; node.js was the wild west. Nowadays, the community and the modules have matured immensely and I am very grateful for that.

When is node.js wrong for the job?

One of my client’s requested services is a high-volume proxy server with “rules” associated with each type of request. Rules are stored in PostgreSQL and cached in Redis. Analytics and metrics are sent to Druid, AWS Redshift, Microsoft Azure Application Insights, and InfluxDB, with Application Insights pending removal. Their web servers are deployed on Microsoft Azure using 4-core instances, the largest available for their web app service. Their goal is to hit 400,000 requests per second in the near future after their product is finalized.

Unlike the services I usually work with, there is no frontend to this app. Thus, having JavaScript as a common language with frontend developers brings no benefit. The previous service was an nginx server with gigantic configuration files, but it was not as maintainable and dynamic as they would like.

In-memory Caching

The service has rules to proxy requests, and these rules are edited in a CMS and stored in PostgreSQL. A separate process saves the rules to Redis servers in every region. Each Redis server could be serving anywhere from 1 to 20 quad-core servers, the maximum available for Azure Web Apps.

On each server, rules are retrieved from Redis and cached in-memory using an LRU-cache. As node.js is not multi-threaded, we spin up 4 instances of node.js per server, 1 instance per CPU core. Thus, we cache in-memory 4 times per server. This is a waste of memory!
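To make the duplication concrete, here is a minimal sketch of a per-process LRU cache. The LruCache class, its maxEntries parameter, and the rule keys are hypothetical stand-ins, not the module the service actually used. The point is only that each of the 4 node.js processes holds its own instance like this, so nothing is shared between cores.

```javascript
// Minimal LRU cache built on Map, which iterates keys in insertion order.
// Hypothetical stand-in for whatever LRU module a service might use.
class LruCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    // Re-insert the entry so it becomes the most recently used.
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // The first key in iteration order is the least recently used.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }
}

// Each node.js process would create its own cache like this one.
const ruleCache = new LruCache(2);
ruleCache.set('rule:a', { target: 'a.test' });
ruleCache.set('rule:b', { target: 'b.test' });
ruleCache.get('rule:a');                        // touch rule:a
ruleCache.set('rule:c', { target: 'c.test' });  // evicts rule:b
```

With 4 processes per server, every entry that is hot on all cores ends up stored 4 times.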

We cache a geoip database in memory. The file itself is only about 60 MB, but because we have to cache the database 4 times (once per process), we end up using closer to 240 MB. In my experience, my node.js processes have never been more than 120 MB when there’s no in-memory caching. This service uses upwards of 400 MB per process.

The end result is that node.js uses a lot more memory than required. This never actually became an issue for us, both because we rearchitected before it could and because most servers have more RAM than node.js servers need.

Reaching Bandwidth Limits

Prior to breaking up our rules and querying Redis on every request, we stored rules as JSON objects, which were retrieved from Redis and cached locally in memory. However, we quickly saw performance degradation once we started serving really large rule sets.

Operations started adding rules with hundreds of thousands of domains, which made a single rule set about 10 MB. As we cached these rule sets in memory for about 30 seconds, each Redis cluster would see about 10 servers * 4 processes * (10 MB / 30 seconds) * (8 bits / 1 byte) ≈ 107 Mb/s of bandwidth usage per rule set. This was above the 100 Mb/s bandwidth limit of a standard Azure Redis cluster, so Azure started throttling the server, causing latencies to spike.

If we weren’t using node.js, we could cut this bandwidth by a factor of 4, as there would only be one connection per server retrieving rule sets from the Redis cluster, not 4 (1 for each node.js process). We would’ve probably hit this limit eventually with any language, but we hit it a lot faster with node.js.
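The arithmetic above can be written out as a small function. This is only a sketch: the function and parameter names are made up for illustration, and the result is in megabits per second, which is what multiplying megabytes by 8 yields.

```javascript
// Estimated Redis bandwidth for one rule set, in megabits per second.
// Every process re-fetches the full rule set once per cache TTL.
function redisBandwidthMbps({ servers, processesPerServer, ruleSetMB, cacheTtlSeconds }) {
  const fetchers = servers * processesPerServer;
  const megabytesPerSecond = fetchers * (ruleSetMB / cacheTtlSeconds);
  return megabytesPerSecond * 8; // 8 bits per byte
}

const shared = { servers: 10, ruleSetMB: 10, cacheTtlSeconds: 30 };

// One fetch per node.js process (4 per server) versus one fetch per server.
const perProcess = redisBandwidthMbps({ ...shared, processesPerServer: 4 });
const perServer = redisBandwidthMbps({ ...shared, processesPerServer: 1 });

console.log(perProcess.toFixed(1), perServer.toFixed(1)); // 106.7 26.7
```

With a single fetcher per server, the same rule set would stay well below the limit.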

So we broke up our rules so that certain rules were Redis sets. This lowered bandwidth usage, but increased latency, as each request may now require Redis calls and the average latency to an Azure Redis server from an Azure Web App is about 7ms. Previously, no Redis calls were necessary on a per-request basis.
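As a rough sketch of what a per-request lookup against a Redis set could look like: the key layout, the function names, and the client are all assumptions. The client is assumed to expose a promise-returning sismember(key, member) method, as ioredis does, and an in-memory stub stands in for it here so the snippet runs without a Redis server.

```javascript
// Hypothetical key layout: one Redis set of domains per rule.
function domainSetKey(ruleId) {
  return `rules:${ruleId}:domains`;
}

// One small round trip per request instead of re-downloading a 10 MB rule set.
async function ruleMatchesDomain(redis, ruleId, domain) {
  const isMember = await redis.sismember(domainSetKey(ruleId), domain.toLowerCase());
  return isMember === 1;
}

// In-memory stub with the same method shape, for illustration only.
const stubRedis = {
  sets: { 'rules:42:domains': new Set(['ads.example.test']) },
  async sismember(key, member) {
    const set = this.sets[key];
    return set && set.has(member) ? 1 : 0;
  },
};

ruleMatchesDomain(stubRedis, 42, 'Ads.Example.Test').then((matched) => {
  console.log('matched:', matched);
});
```

The trade-off is the one described above: the payload per call is tiny, but every request now pays a network round trip.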

Processing medium-sized datasets

Another reason for breaking up rule sets was processing time. When rule sets were 10 MB JSON strings, each node.js process would need to JSON.parse() the string every 30 seconds. We found that this blocked the event loop quite drastically, causing latency to spike above our desired 10ms.

What makes this worse in node.js is that this JSON.parse() would occur 4 times, once every 30 seconds per event loop. This is not an issue with multi-threaded languages as this processing can be done on a completely separate thread and once per server instead of per process.
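The cost of a synchronous JSON.parse() on a payload of this size is easy to measure. The sketch below builds a synthetic rule set of roughly 10 MB and times the parse. The rule-set shape and the helper names are invented for illustration, and the timing will vary by machine.

```javascript
// Build a synthetic rule set as a JSON string (hypothetical shape).
function buildRuleSetJson(domainCount) {
  const domains = [];
  for (let i = 0; i < domainCount; i++) {
    domains.push(`subdomain-${i}.example-domain-${i}.test`);
  }
  return JSON.stringify({ domains });
}

// Time one synchronous parse. Nothing else can run on the event loop
// while JSON.parse() is executing.
function timeParse(rawJson) {
  const start = process.hrtime.bigint();
  const rules = JSON.parse(rawJson);
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  return { elapsedMs, domainCount: rules.domains.length };
}

const ruleSetJson = buildRuleSetJson(250000);
const result = timeParse(ruleSetJson);
console.log(
  `${(ruleSetJson.length / 1e6).toFixed(1)} MB, ${result.domainCount} domains, ` +
  `event loop blocked for ${result.elapsedMs.toFixed(1)} ms`
);
```

Any request that arrives during that window waits for the parse to finish, in every process that performs it.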

We’ve moved this logic into a worker that breaks up rules into Redis sets. The worker still sees this event loop blocking, but because it runs in a separate process via child_process.fork(), the blocking no longer affects request latency.
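A minimal sketch of that pattern follows. It is not the service's actual worker: the message format and the processRulesInChild helper are hypothetical, the "work" is reduced to counting domains, and the worker source is written to a temporary file only so the example fits in one file.

```javascript
const { fork } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Worker code, inlined as a string only so this sketch fits in one file.
// In a real project this would be its own module passed to fork().
const workerSource = `
  process.on('message', (rawJson) => {
    // The expensive, blocking parse happens on the child's event loop.
    const rules = JSON.parse(rawJson);
    // Stand-in for the real work of writing rules into Redis sets.
    process.send({ domainCount: rules.domains.length });
  });
`;

function processRulesInChild(rawJson) {
  return new Promise((resolve, reject) => {
    const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'rules-worker-'));
    const workerPath = path.join(dir, 'worker.js');
    fs.writeFileSync(workerPath, workerSource);

    const child = fork(workerPath);
    const cleanUp = () => {
      child.kill();
      fs.rmSync(dir, { recursive: true, force: true });
    };

    child.once('message', (summary) => { cleanUp(); resolve(summary); });
    child.once('error', (err) => { cleanUp(); reject(err); });
    // If the child dies before replying, surface that as a failure.
    child.once('exit', (code) => reject(new Error(`worker exited with code ${code}`)));
    child.send(rawJson);
  });
}

processRulesInChild(JSON.stringify({ domains: ['a.test', 'b.test'] }))
  .then((summary) => console.log(summary));
```

The parent only pays for sending the string and receiving a small summary; the parse itself runs on the child's event loop.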

Excessive socket usage

Azure Web Apps have another interesting limitation: they limit socket connections to 8,192 per server. We hit that limit pretty quickly once we started making our own non-analytics external HTTP calls.

We have a lot of external requests set up, but the ones that made an external request on every HTTP request we received were:

There are a few that we batch, but still call more than once a second:

  • AWS Kinesis Firehose
  • InfluxDB

For every request we receive, you can expect a few outbound HTTP requests. Once we started proxying HTTP requests, which come in addition to the above requests, we quickly ran into this socket limit. The reason was that node.js, by default, does not pool connections: every HTTP request was creating a new socket. We fixed it by setting keepAlive=true on the globalAgent.
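A minimal sketch of connection pooling with keep-alive is shown below. It uses dedicated http.Agent and https.Agent instances passed to each request rather than changing the global agent, which is an assumption on my part and not the exact change described above. The agentFor helper is hypothetical.

```javascript
const http = require('http');
const https = require('https');

// Agents that keep idle sockets open and reuse them for later
// requests to the same host instead of opening a new socket each time.
const keepAliveHttpAgent = new http.Agent({ keepAlive: true });
const keepAliveHttpsAgent = new https.Agent({ keepAlive: true });

// Hypothetical helper: choose the agent matching the URL's protocol.
function agentFor(url) {
  return new URL(url).protocol === 'https:' ? keepAliveHttpsAgent : keepAliveHttpAgent;
}

// Example call (not executed here):
// const target = 'http://example.com/';
// http.get(target, { agent: agentFor(target) }, (res) => res.resume());
```

Newer node.js releases (19 and later) enable keep-alive on the global agent by default, so this mostly matters on older versions.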

However, we also needed to set how many sockets we pooled for each host. How do you figure that out? Well, there are 8,192 sockets per server, which means 2,048 sockets per node.js process. How many hosts do we have? At least 8: 4 for all our analytics and another 4 for the current approximate number of proxied domains. 2,048 / 8 = 256 max sockets per host per node.js process.

But we were still getting errors. I guess I was wrong! So we lowered it by trial and error until we didn’t receive any more errors. That number happened to be 128 sockets, but we expect to lower that number once we ramp up.
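The socket budget above, written out as code. The maxSocketsPerHost function and its parameter names are illustrative; maxSockets is the http.Agent option that caps concurrent sockets per host.

```javascript
const http = require('http');

// Split the server-wide socket limit across processes and upstream hosts.
function maxSocketsPerHost({ socketsPerServer, processesPerServer, hosts }) {
  return Math.floor(socketsPerServer / processesPerServer / hosts);
}

const theoreticalBudget = maxSocketsPerHost({
  socketsPerServer: 8192, // Azure Web Apps limit
  processesPerServer: 4,  // one node.js process per core
  hosts: 8,               // 4 analytics hosts + about 4 proxied domains
});
console.log(theoreticalBudget); // 256

// The value that stopped the errors in practice was half the theoretical one.
const pooledAgent = new http.Agent({ keepAlive: true, maxSockets: 128 });
```

Note that the budget shrinks as more proxied domains are added, since hosts is in the denominator.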

The issue with node.js here is that we had to divide 8,192 by 4, since we’re spinning up 4 node.js processes per server. If node.js were multi-threaded, we wouldn’t have had to do such complicated math and trial and error, as 256 sockets per host would’ve probably worked.

Development Speed

Like most projects, this service was not well-specified. I would frequently rewrite large portions of the server due to misunderstandings on functionality or churn on certain functionality. Fortunately, I am so quick at developing on node.js that these changes did not take me very long.

This churn would have made development a lot slower if we used a different language, specifically because the only other language this adtech team was experienced with is ActionScript.

Conclusion

When performance is absolutely critical, node.js is not the right tool for the job. It does not work well with in-memory caches and multi-core environments. However, your simple CRUD app is not going to hit this problem, so you probably don’t have to worry about it. And you probably overestimate how critical performance is at your company. Premature optimization is the root of all evil.

Above all, what matters more to an engineer is delivering. There’s no need to make optimized products if your product is never delivered. Hitting the market early with your product and getting product feedback from real customers is more important than fast code.