Node.js Profiling, Event Loop Lag & Flame Charts 🔥
When working with Node.js, you need to be aware of how the event loop works. There are plenty of articles on that topic, but I wanted to share how you can use profiling to understand the behavior of a Node application.
We have a big Node.js stack, with 20+ different services processing data from connected cars. We use several tools to monitor the behavior of our system, for example:
- Datadog gives us a great overview of the infrastructure (instances, queues, load balancers)
- New Relic provides application monitoring, to ensure our many APIs are satisfying their clients
However, it can be hard to monitor the event loop.
Monitoring the event loop
One way of doing this is to put functions on the event loop and time how long it takes before they execute. If the functions aren’t executing fast enough, you’re experiencing event loop lag.
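In its simplest form, the technique looks something like this (a minimal sketch, not the code we run in production):

```js
// Schedule a check every second and measure how late it actually fires.
// If the event loop is busy with synchronous work, the timer fires late.
const check_interval_ms = 1000;
let last_check = Date.now();

setInterval(() => {
  const lag = Date.now() - last_check - check_interval_ms;
  if (lag > 40) {
    console.warn(`Event loop lag: ${lag} ms`);
  }
  last_check = Date.now();
}, check_interval_ms);
```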
In our services we use Heavy to keep track of performance metrics and log a message if the event loop delay exceeds 40 milliseconds.
```js
import Heavy from 'heavy';

const heavy_instance = new Heavy({
  sampleInterval: 1000,
});

const min_time_between_log_messages = 1000;
const event_loop_logging_threshold = 40;

let logging_interval;

function logIfAboveThreshold({ heavy, logger }) {
  if (!heavy || !heavy.load || !logger) {
    return;
  }
  if (heavy.load.eventLoopDelay > event_loop_logging_threshold) {
    logger.info({
      threshold: event_loop_logging_threshold,
      event_loop_delay: heavy.load.eventLoopDelay,
      heap_used: heavy.load.heapUsed,
      rss: heavy.load.rss,
    }, 'Event loop delay above threshold');
  }
}

export default {
  startLoggingPerformance({ logger }) {
    heavy_instance.start();
    if (logging_interval) {
      clearInterval(logging_interval);
    }
    logging_interval = setInterval(() => {
      logIfAboveThreshold({ heavy: heavy_instance, logger });
    }, min_time_between_log_messages);
  },
};
```
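At server startup we then wire this up with the application logger, roughly like so (the module paths here are illustrative; the logger is expected to expose an `info()` method, as used above):

```js
import logger from './logger';
import performance_logger from './performance_logger';

performance_logger.startLoggingPerformance({ logger });
```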
We monitor the logged metrics in Splunk, to understand how the system is performing.
In addition to the logging, we have also been rejecting HTTP requests with a `503 Too Busy` error when the lag gets too high, since event loop lag may indicate that the server is under too much load.
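The guard itself can be expressed as a small piece of middleware on top of the event loop delay reported by Heavy. The sketch below assumes access to the `heavy_instance` shown earlier and an Express-style `app`; the exact wiring in our services differs:

```js
// Illustrative threshold; not necessarily the one used by the real guard.
const max_event_loop_delay = 40;

// Reject incoming requests while the event loop is lagging badly,
// instead of letting them queue up behind the backlog.
function rejectWhenTooBusy(req, res, next) {
  const load = heavy_instance.load;
  if (load && load.eventLoopDelay > max_event_loop_delay) {
    res.status(503).send('Too Busy');
    return;
  }
  next();
}

app.use(rejectWhenTooBusy);
```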
After seeing a steady stream of logged delays, we started profiling one of the servers to understand where the lag came from.
Profiling in Production
In order to collect data, you can either:
- run tests on a production-like environment (e.g. using Apache Bench to put load on the server)
- enable profiling on production instances to get real data
We’re not very fond of production-like environments, which can be deceptive, and we have built a culture around working safely with the production environment, so the second option was an easy choice for us.
Data collection
All of our instances run in clusters (auto scaling groups). The process for collecting data from one of the instances was:
- Add an extra instance to the auto scaling group to compensate for the one to profile (the Target instance)
- Detach the Target instance from the Elastic Load Balancer (ELB) to stop sending traffic to it
- Configure the instance to start the node process with profiling enabled: `NODE_ENV=production node --prof server.js`
- Re-attach the instance to the ELB so it receives traffic again
- Allow ~30 minutes to collect data
- Monitor the instance’s logs to see whether event loop lag is logged during profiling
Preparing collected data for analysis
- Upload the V8 profiling log from the instance to S3: `aws s3 cp isolate-0x25ff140-v8.log s3://<bucket name>`
- Upload the application log from the instance to S3: `aws s3 cp api.log s3://<bucket name>`
- Load the V8 log in IntelliJ IDEA (Menu: Analyze V8 Profiling Log). There are other ways of analyzing the V8 log, but IntelliJ’s tools for this are outstanding.
- Look in api.log for the logged event loop lags and map their timestamps to the profiling timeline (i.e. seconds elapsed since server start), as sketched below
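The mapping itself is just a timestamp subtraction. A small, hypothetical helper (the timestamps are made up) illustrates the idea:

```js
// The V8 timeline is expressed in seconds since the profiled process started,
// so convert api.log timestamps into offsets from the server start time.
const server_start = new Date('2017-03-01T08:00:00.000Z'); // taken from the first line of api.log

function secondsIntoProfile(log_timestamp) {
  return (new Date(log_timestamp) - server_start) / 1000;
}

secondsIntoProfile('2017-03-01T08:14:32.500Z'); // => 872.5 seconds into the profile
```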
There were 5 logged event loop lags during the profiling time period.
Flame chart analysis
A good way of analyzing the V8 log is to look at the flame chart (a.k.a. flame graph). In short, the flame chart shows how long functions take to execute and what’s in the call stack.
- Look in the V8 log around timestamps of logged event loop delays
- Scan for wide flames, which indicate long execution times
- Look at the wide flames and try to understand if and why they caused the blocking behavior
- Use IntelliJ’s “Navigate to…” feature to find the longest and typical execution times for specific frames
- Carry out further analysis of the timeline outside of the event loop delays, to understand what the typical behavior is
Results
- Normally, function executions took 1–2 ms, with some longer executions around 10–20 ms (e.g. when building up arrays of database documents).
- 3 logged lag events were due to garbage collection
- 1 logged lag event was due to a slower (~70 ms) processing function that was already a candidate for being moved out of the API and into a separate processing pipeline (see the sketch after this list)
- 1 logged lag event was due to apm.js (from New Relic’s monitoring library) taking a long time to hook up a callback in its tracing mechanisms
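What “moving that processing function out of the API” could look like in practice is a separate pipeline fed by a queue. The sketch below is purely illustrative: the use of SQS, the queue URL and the payload are assumptions, not a description of our actual setup:

```js
import AWS from 'aws-sdk';

// Hypothetical: a queue feeding a separate processing service.
const sqs = new AWS.SQS({ region: 'eu-west-1' });
const processing_queue_url = 'https://sqs.eu-west-1.amazonaws.com/123456789012/processing-queue';

async function handleIncomingMessage(message) {
  // Instead of running the ~70 ms processing function on the API's event
  // loop, hand the work over to the separate pipeline and return quickly.
  await sqs.sendMessage({
    QueueUrl: processing_queue_url,
    MessageBody: JSON.stringify(message),
  }).promise();
}
```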
Conclusion
Overall, there were no systematic delays or signs of poorly written synchronous code.
Of all the traffic this specific API receives, 0.1% of the messages are followed by an event loop lag warning.
Since we do not consider the event loop lag to be a result of this API being overloaded, we can remove the behavior where the API would automatically respond with a 503 error.
However, it is important to keep in mind that event loop lag still affects the application: it delays the execution of functions, resulting in slower response times.
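The mechanism is easy to demonstrate: any synchronous work on the loop holds up every other callback, including the ones that write HTTP responses. A minimal illustration:

```js
const start = Date.now();

// A callback that should run after ~10 ms...
setTimeout(() => {
  console.log(`Timer fired after ${Date.now() - start} ms`); // ~100 ms, not 10
}, 10);

// ...is held up by 100 ms of synchronous work blocking the event loop.
while (Date.now() - start < 100) {
  // busy-wait
}
```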
Moving forward, we keep monitoring the event loop and have also configured alerts to answer:
- how frequent are the event loop lag warnings?
- how long is the event loop lag when the warnings are issued?
- is the number of warnings increasing over time?
That’s it.
👋🏼