NodeJS: Implications of Best Effort Timing During Event Loop Saturation
I’m rather new to node, and one theme that keeps coming up is timing. I’m beginning to think that great care must be taken when dealing with timing in node and when deriving insights from time-based operations. The node core documentation explains the event loop in detail and makes concrete recommendations for working with it and for managing loop latencies. This post focuses on ways to think about the node runtime (event loop/queues): it tries to reframe how we understand the event loop and how we can work with node timing to create reliable, high-fidelity alerts. It shows why node time cannot be trusted, how we can think about time in node, and how we can alert on it.
Time is an illusion
One of the core issues in dealing with the node event loop/queues is timing, and node makes no effort to hide this:
The timeout interval that is set cannot be relied upon to execute after that exact number of milliseconds. This is because other executing code that blocks or holds onto the event loop will push the execution of the timeout back. The only guarantee is that the timeout will not execute sooner than the declared timeout interval.
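This is easy to see with a minimal sketch (illustrative only, not part of the test program described below): a 100ms timer that fires late because synchronous work holds the event loop.

// block-demo.js: a 100ms timer pushed back by synchronous work
const start = Date.now();

setTimeout(() => {
  // Expected ~100ms; prints ~500ms because the busy loop below holds
  // the event loop until it finishes.
  console.log(`timer fired after ${Date.now() - start}ms`);
}, 100);

// Synchronously burn ~500ms of CPU before yielding to the event loop.
const busyUntil = Date.now() + 500;
while (Date.now() < busyUntil) {}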
The issue is that many timeout libraries rely on setTimeout to enforce application-level timeouts. To illustrate how quickly performance degrades, a test program was created. The purpose of the test is to see how far (if at all) the runtime falls behind a desired rate of operation. The test asks the runtime to perform an operation at a certain rate, i.e. 100 operations/second, and the results show the rate the program was actually able to achieve. The program uses setInterval with a configurable timeout, increments a counter on each operation, and averages the rate of operations. Data was then generated by executing it at several different timeout intervals: 1000 (1 op/sec), 500 (2 ops/sec), 250 (4 ops/sec), 100 (10 ops/sec), 10 (100 ops/sec), 1 (1000 ops/sec). For comparison, an equivalent go program was also created and benchmarked. The versions of go and node, and the commands, are shown below:
// INTERVAL_TIMEOUT_MS=1000,500,250,100,10,1
$ node --version
v10.15.2
$ go version
go version go1.11.2 linux/amd64
$ INTERVAL_TIMEOUT_MS=1000 node time.js
$ go run main.go -interval-timeout=1s
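A minimal sketch of time.js as described above might look like the following (the INTERVAL_TIMEOUT_MS variable matches the commands; the 5 second reporting window and everything else is illustrative):

// time.js (sketch): ask the runtime for one operation per interval and
// report the rate actually achieved.
const intervalMs = Number(process.env.INTERVAL_TIMEOUT_MS || 1000);
const expectedOpsPerSec = 1000 / intervalMs;

let ops = 0;
const start = Date.now();

setInterval(() => {
  ops++;
}, intervalMs);

// Report the observed average rate every 5 seconds.
setInterval(() => {
  const elapsedSec = (Date.now() - start) / 1000;
  const actualOpsPerSec = ops / elapsedSec;
  console.log(
    `expected ${expectedOpsPerSec} ops/sec, achieved ${actualOpsPerSec.toFixed(1)} ops/sec`
  );
}, 5000);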
The results of the tests:
The tests were executed on an unloaded 2-core linux machine. Both runtimes had a really difficult time keeping up with an expected rate of 1000 operations/second. Node achieved ~450 operations/second, while even go was only able to achieve ~650 operations/second. For node, this works out to ~2.2ms per operation instead of the expected ~1ms; 2.2x MORE LATENT than expected!
Imagine an operation that should take ~100ms on average. An alert has been created to fire on sustained latencies of >=250ms. What does the alert firing mean? Because of event loop latency, the only thing it indicates is that the operation took ≥250ms as observed by the node process; it cannot say whether that time was spent in the dependency or waiting in node’s own queues. The alert needs additional context in order to tell whether this is a runtime issue or something else. Timing cannot be trusted, but we live in a world where we need to use time to reason about the order and duration of events. The rest of the post explores a couple of ways we might address living in this world and still incorporate node timings into observability and alerts.
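One way to capture that additional runtime context is to record event loop delay alongside operation latency. A minimal sketch, assuming node >= 11.10 (newer than the v10.15.2 used for the tests above), which is when perf_hooks.monitorEventLoopDelay appeared:

// Sample event loop delay so a latency alert can be paired with a
// "was the runtime itself behind?" signal. Values are in nanoseconds.
const { monitorEventLoopDelay } = require('perf_hooks');

const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

setInterval(() => {
  const p99Ms = loopDelay.percentile(99) / 1e6;
  console.log(`event loop delay p99: ${p99Ms.toFixed(2)}ms`);
  loopDelay.reset();
}, 10000);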
Reframing Latency
The first issue is one of semantics: how we think of and refer to latency and timeouts. Timeouts by themselves only identify latency within the node service, not latency in any dependent operation. Because any node tick can take an arbitrary amount of time, we can’t use a timeout alone to infer anything about dependent services or operations.
The only thing that timing can reasonably tell us (and even then only with a monotonic clock; what happens when the wall clock jumps backwards?) is how long an operation takes within the context of the node process.
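Measuring that duration with the monotonic clock (process.hrtime.bigint() rather than Date.now()) sidesteps the clock-goes-backwards problem. A small sketch; timed is a hypothetical helper:

// Time an operation with the monotonic clock. This only tells us how long
// the operation took inside this node process, nothing more.
async function timed(name, operation) {
  const start = process.hrtime.bigint();
  try {
    return await operation();
  } finally {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${name} took ${elapsedMs.toFixed(2)}ms`);
  }
}

// Usage (hypothetical): await timed('getUser', () => db.getUser(id));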
If we reframe latency in node as a queuing system, the latency of an operation is the total time it spends in the queue plus the total time it spends being processed:
The only thing that we can say with some degree of certainty is that the operation took x long, and that x is made up of the time the operation spent on CPU plus the time it spent waiting for other operations to be processed:

OperationLatency = operationProcessTime + eventQueueLatency

(To be clear, the queue above is not how the event loop or its internal queues are actually structured, but it is a useful model for understanding the latency of an operation in node.)
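To make the decomposition concrete, here is a rough sketch (the variable names mirror the formula above; this is a model, not real instrumentation) that separates time spent waiting for the event loop from time spent processing:

// OperationLatency = operationProcessTime + eventQueueLatency, roughly:
// the queue latency is the gap between scheduling the work and the event
// loop actually getting around to running it.
function runOperation(work) {
  const enqueuedAt = process.hrtime.bigint();

  setImmediate(() => {
    const startedAt = process.hrtime.bigint();
    const eventQueueLatencyMs = Number(startedAt - enqueuedAt) / 1e6;

    work(); // synchronous processing, for the sake of the sketch

    const operationProcessTimeMs =
      Number(process.hrtime.bigint() - startedAt) / 1e6;
    const operationLatencyMs = eventQueueLatencyMs + operationProcessTimeMs;

    console.log(
      `queue=${eventQueueLatencyMs.toFixed(2)}ms ` +
      `process=${operationProcessTimeMs.toFixed(2)}ms ` +
      `total=${operationLatencyMs.toFixed(2)}ms`
    );
  });
}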
Strong Correlations
In order to effectively use node-based timings (latencies) to reason about the performance of a second-degree dependency, a connection needs to be created between node and its dependency. Engineers are often responsible for creating this connection (edge) and performing the correlation mentally. This can be seen when a client latency alert fires: one of the first steps a responder takes might be to check whether there is an increase in server-side latency as well:
In this case it’s the responder who is correlating two different events through their understanding of the system. If node latencies can only tell us how long an operation takes to process, then additional information is required to derive meaning from node timing events. An alternative is to let the alert itself do the correlation by incorporating both the client and server latency:
An alert that establishes a correlation would need to combine an increase in client latency with a signal from the server (increased server latency). Incorporating both server and client latency provides a much stronger signal about the health of the server than alerting on client-side latency alone. By shifting the correlation into the alert, the signal the responder receives is stronger.
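As a sketch of what shifting the correlation into the alert might look like in code (the thresholds and the metric-fetching helpers here are hypothetical):

// Hypothetical alert check: only fire when client-observed latency AND
// server-reported latency are both elevated, so the page is not explained
// by client-side event loop delay alone.
async function checkLatencyAlert({ getClientP99Ms, getServerP99Ms }) {
  const [clientP99Ms, serverP99Ms] = await Promise.all([
    getClientP99Ms(),
    getServerP99Ms(),
  ]);

  const clientElevated = clientP99Ms >= 250; // hypothetical threshold
  const serverElevated = serverP99Ms >= 250; // hypothetical threshold

  if (clientElevated && serverElevated) {
    return { fire: true, reason: 'client and server latency both elevated' };
  }
  if (clientElevated) {
    return { fire: false, reason: 'client latency elevated but server looks healthy; suspect client runtime/queueing' };
  }
  return { fire: false, reason: 'latency within expected bounds' };
}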
I hope this post illustrates why node timings cannot be trusted in isolation. Because there are no temporal guarantees, looking at node time on its own can be misleading. Correlating time with other signals turns it into an actionable signal.
Appendix: CPU Affinity
This test executes the original operation-rate tests, but with the process bound to a specific core:
$ INTERVAL_TIMEOUT_MS=1 taskset -c 1 node time.js
$ taskset -c 1 go run main.go -interval-timeout=1ms
Both runtimes do much better for this workload. Go is able to hold 1000 ops/second, and node keeps up much better as well, reaching 740 of the expected 1000. Certain workloads may benefit greatly from pinning with taskset.