Investigating Node.js Performance: Event Loop and Network I/O (Part 1)

Xue Cai
The Airtable Engineering Blog
12 min read · Jun 3, 2021

A Performance Investigation Story at Airtable

By Xue Cai

With more than 250,000 organizations using Airtable, performance and reliability are absolutely critical. Performance issues are a normal part of any company's operations, and investigations into their causes can be challenging, but immensely satisfying when they're resolved. As an engineer on the performance and architecture team, I'm sharing the results of a performance investigation that uncovered an unexpected interaction between the Node.js event loop and network input/output (I/O), which significantly slowed down the loading of Airtable bases.

In a subsequent blog post, we'll share a more in-depth look at this investigation with code examples, as well as a deep dive into Node.js Async Hooks.

Background

Airtable bases store all of a team's information in one place. When a base has a lot of information, it may take longer than normal (sometimes tens of seconds) for the server to fetch all the data and display it to the user. However, in October 2020, we began seeing an unusual number of reports from users about the performance of their bases.

The Problem

Symptom 1: Slow or not loading bases

We began receiving customer reports of bases taking minutes to load, or even not loading at all. However, these users' bases loaded as expected when we investigated, and we couldn't consistently reproduce the problem.

Symptom 2: Server out of memory

At the same time, we also started getting an increased number of alerts indicating that our servers had run out of memory. The graph below shows this in action for a server process that was serving one of the slow bases: normally the server's resident set size (RSS) should remain below a few gigabytes, but here the RSS kept climbing and eventually exceeded 40GB.
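
As an aside, a process can observe its own memory usage directly; below is a minimal sketch of that kind of instrumentation (illustrative only, not our actual monitoring setup).

// Minimal sketch (not our actual monitoring code): sample the resident set
// size every 5 seconds so that runaway memory growth shows up in the logs.
setInterval(() => {
  const rssGb = process.memoryUsage().rss / 1024 ** 3;
  console.log(`rss=${rssGb.toFixed(2)}GB`);
}, 5000);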

Airtable Architecture in a Nutshell

To understand the root causes of these problems, it’s important to understand Airtable’s architecture:

There are several components in play between a user’s web browser and the server when a user loads a base.

  • AWS Elastic Load Balancer (ELB): the Amazon-managed load balancer that forwards requests to the web service.
  • Web service: a thin gateway service that primarily handles authentication, rate limiting, and serving web pages. Most business logic requests are forwarded to the worker service for additional handling.
  • Worker service (workerParent): the process type that delegates requests from the web service to workerChild processes. The workerParent also manages the lifecycle of workerChild processes such as spawning a new workerChild process.
  • Worker service (workerChild): the main process type that does the actual work. For a base loading request, the workerChild process fetches data from the database, then assembles and serializes the response, and sends it back to its workerParent process. Because the parent and child processes are on the same host, the workerChild process and its workerParent process communicate via a Unix domain socket. The process that ran out of memory in the example above was a workerChild process.

Both the web and worker services are written in TypeScript and run in Node.js.
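
As a rough illustration of the workerParent/workerChild link (hypothetical socket path and messages, not our actual protocol), two Node.js processes can communicate over a Unix domain socket with the net module:

import net from 'net';

// workerParent side: listen on a Unix domain socket (hypothetical path).
const SOCKET_PATH = '/tmp/worker-parent.sock';
const server = net.createServer((conn) => {
  conn.on('data', (chunk: Buffer) => {
    // ...forward the bytes from workerChild toward the web service...
  });
});
server.listen(SOCKET_PATH);

// workerChild side: connect to the parent and stream a response back.
const toParent = net.createConnection(SOCKET_PATH, () => {
  toParent.write('...serialized base-loading response bytes...');
});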

Reproducing the Problem

Setup: Many concurrent base-loading requests to a big base

We saw server memory increase at the same time that bases were slow to load. How could the server's RSS reach 40GB? We knew that some bases contain hundreds of megabytes of data, so a base-loading response could be just as large. If a single buffered response occupies up to 400MB of RAM, 100 concurrent base-loading responses could push the RSS to 40GB, and in turn the server could run out of memory.

We also observed that the timing of memory increases was correlated with the timing of elevated base-loading requests, which led to our hypothesis: the memory increase is caused by base-loading response buffering.

To test our hypothesis, we created a big test base and sent a large number of base-loading requests to it over a short time window. We successfully reproduced the memory increase symptom and crashed the workerChild process serving the base.

Early clues

If base-loading responses are not flushed to users' web browsers fast enough, where is the bottleneck? We use res.pipe() to stream responses all the way from the workerChild to the AWS ELB. Although the responses eventually accumulated in workerChild, congestion can occur anywhere along the path, which propagates backpressure all the way back to workerChild.
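
Conceptually, each hop in that chain looks like the simplified sketch below (not our actual plumbing): pipe() respects backpressure, so a slow reader anywhere downstream eventually pauses the source.

import { Readable, Writable } from 'stream';

// Simplified sketch of one hop in the streaming path: pipe() respects
// backpressure, so when the downstream Writable stops draining, the upstream
// Readable is paused and unsent bytes accumulate at the source (ultimately
// in workerChild).
function forward(upstream: Readable, downstream: Writable): void {
  upstream.pipe(downstream);
  upstream.on('error', (err) => downstream.destroy(err));
}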

We first suspected the ELB, because we observed that it had been processing more and more bytes over the preceding months, which could imply congestion if it wasn't scaling up properly. We also suspected that web clients were retrying too many requests because of misaligned timeouts between clients and servers. However, neither turned out to be the root cause.

We then observed a pattern in the timeline of our logs: the workerChild finished processing a request every ~3 seconds, but the web server finished processing a batch of requests only every 30+ seconds. This suggested that something between the workerChild and the web server was holding onto the responses.

To figure out what was going on, we looked at the logs in more detail. However, the logs exhibited very complex system behaviors including retried requests, cancellations, and rate limiting, as well as service crashes and self-healing. It was hard to follow one request end-to-end, let alone many concurrent ones.

Controlled Experiments

With an architecture as complex as ours, we decided to approach this problem through experimentation — holding all variables but one constant and seeing how the system reacts!

First experiment: Changing the number of concurrent requests

The first few variables we tried to eliminate were request retries and service crashes. We hoped to still observe the suspicious pattern after removing these variables. We did this by sending fewer requests at a time: we only sent 5 requests at a time, waited for their responses to return, then sent another batch of 5 requests, and so on.
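
A driver for this kind of experiment can be as simple as the sketch below (hypothetical endpoint; not the exact script we used):

import https from 'https';

// Hypothetical driver: send base-loading requests in batches of 5 and wait
// for every response in a batch before sending the next batch.
const BASE_LOAD_URL = 'https://dev-server.example.com/loadBase';  // hypothetical

function loadBaseOnce(): Promise<number> {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    https.get(BASE_LOAD_URL, (res) => {
      res.resume();  // drain the response body
      res.on('end', () => resolve(Date.now() - started));
    }).on('error', reject);
  });
}

async function run(batches: number, batchSize = 5): Promise<void> {
  for (let i = 0; i < batches; i++) {
    const latencies = await Promise.all(
      Array.from({ length: batchSize }, () => loadBaseOnce()),
    );
    console.log(`batch ${i}: latencies(ms) = ${latencies.join(', ')}`);
  }
}

run(10).catch(console.error);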

This time, we observed the same pattern, without any request retries or workerChild process crashes! All 5 requests returned with a 200 HTTP status code and 20+ seconds of latency. More specifically, as shown in the graph below:

  1. workerChild finished processing a request every ~3 seconds for the first batch (top).
  2. workerChild finished flushing the responses for all 5 requests in the first batch at almost the same time, 8 seconds after the last request finished processing (middle).
  3. Web service responded to all 5 requests immediately after workerChild finished flushing (bottom).

At this point, we were quite certain that the bottleneck was between the workerChild and workerParent processes, not the web service. The graphs also suggested that early requests in a batch "wait" for later requests, so that they all flush together.

We then repeated the same experiment by reducing the number of concurrent requests to 4, 3, and 2 respectively. We observed the same pattern. Moreover, the bigger the batch, the longer the delay for every request in the batch. How odd!

Second experiment: Repro on a development machine

To understand exactly what workerChild was doing, we collected a Node.js diagnostic report, which includes “JavaScript and native stack traces, heap statistics, platform information, resource usage etc.”

To make it easier to collect diagnostic reports, add ad-hoc logs, and change server configurations, we decided to run our experiments on a development machine. We created a very large base in our development database and then directed the script to send requests to our development servers in exactly the same way: 5 requests per batch. The result? We successfully reproduced the behavior!

While sending requests, we were also signaling the workerChild process every 100ms to generate a diagnostic report. Here is an example of the libuv section of a diagnostic report (libuv is a multi-platform support library with a focus on asynchronous I/O. It was primarily developed for use by Node.js):

{
  "type": "pipe",
  ...
  "sendBufferSize": 8192,
  "recvBufferSize": 8192,
  "fd": 28,
  "writeQueueSize": 27172784,
  ...
},

The workerChild process communicates with the workerParent process via a Unix domain socket which showed up here as a “pipe”. The report indicates that the pipe has a sendBufferSize of 8192 bytes, and currently there are 27,172,784 bytes to be sent (writeQueueSize). The socket can be uniquely identified by its file descriptor (fd) which is 28.
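
For reference, a Node.js process can write such a report on demand, for example whenever it receives a signal; here is a minimal sketch (Node.js can also do this without any code via the --report-on-signal flag, and on older versions the report feature sits behind --experimental-report):

// Minimal sketch: write a diagnostic report (including the libuv handle list
// shown above) whenever this process receives SIGUSR2. An external script can
// then send SIGUSR2 on an interval while the experiment runs.
process.on('SIGUSR2', () => {
  // By default this writes report.<date>.<time>.<pid>.<seq>.json to the
  // current working directory.
  process.report?.writeReport();
});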

We then visualized the progress of writeQueueSize (y axis) over time (x axis), grouped by file descriptor, with Airtable's Chart app. We ran the experiment with 1 request and 50 requests in a batch respectively; the results are shown below, after a quick sketch of how the reports can be reduced to chartable rows.
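
Producing the data points is just a matter of reducing each report's libuv section to (timestamp, fd, writeQueueSize) rows; here is a rough sketch, assuming the reports are saved as JSON files in a single directory (hypothetical layout, not our actual tooling):

import fs from 'fs';
import path from 'path';

// Rough sketch (hypothetical file layout): reduce each diagnostic report to
// one row per Unix-domain-socket ("pipe") handle, so the rows can be charted
// as writeQueueSize over time, grouped by file descriptor.
interface Row { timestamp: string; fd: number; writeQueueSize: number; }

function extractRows(reportDir: string): Row[] {
  const rows: Row[] = [];
  for (const file of fs.readdirSync(reportDir)) {
    if (!file.endsWith('.json')) continue;
    const report = JSON.parse(fs.readFileSync(path.join(reportDir, file), 'utf8'));
    const timestamp = report.header?.dumpEventTime ?? file;
    for (const handle of report.libuv ?? []) {
      if (handle.type === 'pipe' && typeof handle.writeQueueSize === 'number') {
        rows.push({ timestamp, fd: handle.fd, writeQueueSize: handle.writeQueueSize });
      }
    }
  }
  return rows;
}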

When sending the response for one request (the red line in the graph above), the workerChild first filled up writeQueueSize (~27MB), then quickly flushed the queue and thus finished sending the response to workerParent. The whole process took ~707ms.

When sending responses for 50 requests (the 50 lines in the graph above), the workerChild kept filling up writeQueueSize for different Unix domain sockets but never got a chance to flush them until the very last request had been processed. 24 seconds elapsed before the very first socket was flushed!

It seems that the workerChild was prioritizing the processing of new requests over the sending of outstanding responses. This smells like a Node.js event loop blockage.

Node.js and the event loop

Node.js relies on an event-driven architecture. The idea is that just a single thread can execute an entire program if the program’s operations are largely I/O bound. When one part of the program is waiting on I/O to complete, the event loop can schedule another part of the program to run.

With such an architecture, it is vital that no one part of the program occupies the event loop thread for very long; otherwise it risks starving the rest of the program. This phenomenon is known as “event loop blockage”.

For more information on how the event loop functions within Node.js, read this Node.js doc.
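
As a tiny, self-contained illustration (unrelated to our codebase), a burst of synchronous work delays everything else scheduled on the loop, including timers and pending socket writes:

const start = Date.now();

// The timer below is scheduled to fire after 10ms...
setTimeout(() => {
  console.log(`timer fired after ${Date.now() - start}ms (expected ~10ms)`);
}, 10);

// ...but one second of synchronous work keeps the single JavaScript thread
// busy, so the callback (like any pending I/O) is starved until the loop is
// free again. The log will report ~1000ms instead of ~10ms.
while (Date.now() - start < 1000) {
  // Busy-wait: synchronous "work" that blocks the event loop.
}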

Updated hypothesis: Root cause for poor performance

We then hypothesized that the combination of event loop blockage (from request processing) and small sendBufferSize is the reason for poor performance.

Before explaining the hypothesis, here is some background on how workerChild processes requests. Each workerChild process uses a queue to control how many requests are processed concurrently (maxConcurrentRequestsToProcess). However, once the workerChild finishes processing a request and sends its response for flushing, it immediately picks up the next request and starts processing. In other words, the processing of new requests often overlaps with the flushing of outstanding responses.
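
In sketch form, the scheduling behaves roughly like this (illustrative only, not our actual code; only the name maxConcurrentRequestsToProcess comes from the description above):

// Illustrative sketch: a queue that caps how many requests are *processed*
// concurrently. Flushing the response is not awaited, so processing the next
// request overlaps with flushing responses of earlier requests.
const maxConcurrentRequestsToProcess = 5;
const pending: Array<() => Promise<void>> = [];
let inFlight = 0;

function enqueue(handleRequest: () => Promise<void>): void {
  pending.push(handleRequest);
  maybeProcessNext();
}

function maybeProcessNext(): void {
  while (inFlight < maxConcurrentRequestsToProcess && pending.length > 0) {
    const handleRequest = pending.shift()!;
    inFlight++;
    handleRequest()
      .catch((err) => console.error('request failed', err))
      .finally(() => {
        // The response has been handed off for flushing (not awaited here);
        // immediately start processing the next queued request.
        inFlight--;
        maybeProcessNext();
      });
  }
}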

Now, let's explain the hypothesis. Consider the 1-request case, where the event loop needed to send out a ~27MB response. Because the sendBufferSize was only 8KB, the event loop needed ~3456 iterations (27MB / 8KB) to flush all the bytes. Fortunately, the event loop was dedicated to one request, so it could quickly run through all the iterations and finish flushing the response.

In comparison, in the 50-request case, the event loop was busy processing new requests while flushing the responses of earlier requests. Each loop iteration took significantly longer than in the 1-request case, so running through the ~3456 iterations for a single response also took much longer in total. The event loop blockage from request processing was so significant that no response could finish flushing until no more requests awaited processing (that's why all 50 lines in the graph above finished almost at the same time).

The graph below takes this situation to the extreme (with maxConcurrentRequestsToProcess = 1). If the request processing code contains just 10ms of synchronous operations, then each event loop iteration will take at least 10ms. Responses of earlier requests will keep accumulating and won't finish flushing until the last request finishes processing. If we have nonstop requests to process, it will take more than 3456 * 10ms ≈ 35 seconds to flush a single response!

To confirm that our request processing code blocks the event loop, we used the node-blocked library, which logs whenever the event loop is blocked for more than 10ms. We saw plenty of logs which supported our hypothesis.
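
The setup for that check is essentially a one-liner; a sketch along these lines (assuming the blocked package from the node-blocked repository; option support varies by version, so the threshold is applied in the callback):

import blocked from 'blocked';

// Sketch: log whenever the event loop is blocked for more than 10ms.
blocked((ms: number) => {
  if (ms > 10) {
    console.log(`event loop blocked for ${Math.round(ms)}ms`);
  }
});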

So, why does event loop blockage starve I/O? Initially this puzzled us; we were under the impression that I/O operations were handled by libuv worker threads, outside the Node.js event loop. It turns out that's not the case! This doc explains it very well. To summarize, libuv worker threads are only used "to handle 'expensive' tasks. This includes I/O for which an operating system does not provide a non-blocking version, as well as particularly CPU-intensive tasks." Modern OSes provide non-blocking APIs for network I/O (i.e. epoll on Linux, kqueue on macOS, event ports on Solaris, or IOCP on Windows), so network I/O is handled by the main event loop directly.

Now, how do we solve the poor performance problem? Attributing, fixing, and preventing event loop blockage is difficult. We experimented with Node.js Async Hooks for attribution, but they add significant overhead and are still experimental. If we can't solve event loop blockage easily, can we instead increase the sendBufferSize so that each response needs fewer loop iterations to flush? We decided to experiment further.

Third experiment: Changing Unix domain socket sendBufferSize

We couldn’t find an official API to set sendBufferSize, but we managed to do it using the Node.js Foreign Function Interface (FFI) for N-API:

import os from 'os';
import ref from 'ref-napi';
import ffi from 'ffi-napi';

// Platform-specific values for the SOL_SOCKET level and the SO_SNDBUF option
// (from <sys/socket.h> on each platform).
let SOL_SOCKET: number;
let SO_SNDBUF: number;
switch (os.platform()) {
  case 'linux':
    SOL_SOCKET = 1;
    SO_SNDBUF = 7;
    break;

  case 'darwin':
    SOL_SOCKET = 0xffff;
    SO_SNDBUF = 0x1001;
    break;

  default:
    throw new Error(`unsupported platform: ${os.platform()}`);
}

// Bind the native setsockopt(2) symbol from the current process:
// int setsockopt(int sockfd, int level, int optname, const void *optval, socklen_t optlen);
const cInt = ref.types.int;
const cVoid = ref.types.void;
const bindings = ffi.Library(null, {
  setsockopt: [cInt, [cInt, cInt, cInt, ref.refType(cVoid), cInt]],
});

/**
 * Call the native 'setsockopt' API to set an integer socket option.
 * See: Unix man page for setsockopt(2).
 */
function setsockoptInt(fd: number, level: number, name: number, value: number): void {
  const valueRef = ref.alloc(cInt, value);
  bindings.setsockopt(fd, level, name, valueRef, cInt.size);
}

// Usage: enlarge the send buffer of the Unix domain socket behind a response.
const fd = res.socket._handle.fd;  // res is an express.Response object
setsockoptInt(fd, SOL_SOCKET, SO_SNDBUF, sendBufferSizeInBytes);

When we ran the experiment, the results looked great!

The graph below shows the result of running the experiment with 50 requests (the 50 lines in the graph) in a batch, using Unix domain sockets with 4 MB sendBufferSize.

Compared with the previous 50-request experiment, this time each response finished flushing soon after it started. Earlier requests were not "waiting" for later requests. The whole run took ~16 seconds, but each individual request completed in less than 1 second!

Results and Future Work

We carefully rolled out the 4 MB sendBufferSize change to Airtable’s staging environment, then to the production environment. Since then, we have not received any new customer reports and there have been no more out-of-memory server crashes.

We also had other ideas on how to fix the problem. The root cause was the combination of event loop blockage and a low sendBufferSize. If we could solve the blockage, the problem should also go away. However, attributing and fixing the blockage is difficult, and so is keeping it from regressing once fixed. Instead of directly fixing the blockage, what if we 1) offloaded response flushing to a different thread with a different event loop, or 2) employed better priority/concurrency management of request processing vs. response flushing? We haven't felt a need to explore these two ideas because the sendBufferSize fix has worked well so far. That being said, we do want better monitoring (and especially attribution!) of event loop blockage, and a more sustainable solution to this problem.

Conclusion

In this blog post, we’ve shared a performance investigation story at Airtable. The symptoms were slow loading bases and servers running out of memory. The root cause was a combination of Node.js event loop blockage and a small sendBufferSize for Unix domain sockets. We solved the problem by increasing sendBufferSize to 4 MB, and things have been going well for six months.

Many thanks to folks that participated in the investigation, including Alexander Sorokin, Brian Larson, Jayden Navarro, Keunwoo Lee, Keyhan Vakil, Philip Zeyliger, and Xue Cai! Also special thanks to Adar Lieber-Dembo and Linjie Ding for reviewing this blog post!

If this blog post sounds interesting to you, come join us to work on interesting and challenging problems — we are hiring!
