Node for Holidays: 25 or 6 to 4
You have probably heard that Walmart has been heavily using Node, in particular for the UI layer — server side rendering, React, Redux, Electrode, etc. You will likely continue to hear more about Node and Walmart in that context and the successes of Node during the holidays. However, in this posting I will be writing about Node and my personal journey in a different context — me, a front-end engineer, jumping into the service layer and Node as the middleman between upstream Java services and a Node UI layer (walmart.com).
Front and back, side to side, and all that…
I have been working specifically in the front-end for the past 10 years or so of my career. I don’t have a computer science degree, I don’t know the HTTP specifications by heart, and I can’t read your entity relationship diagram. So a position on a services team was a perfect match, right?
Right or wrong, I jumped onto the services team because that is where help was needed. Needless to say, I quickly learned just how little I actually knew about end-to-end software engineering. It was an extremely humbling experience. I eventually managed to put my ego aside and began to open myself up to learning. I intend to share some of these lessons through a series of blog posts, beginning with what I learned about request timing metrics and Node.
Behind the parlor wall
Before I dive into what I learned about timing metrics, I want to provide some context on what we built, both for this lesson and for the others to follow. For the holidays we were tasked with replacing the business logic that orchestrated service calls and transformed data in an existing monolithic Java application with a stand-alone service that could be called by different verticals (pages) and by native mobile applications as well. The service was required to return page definitions and modules to be rendered, taking into consideration module scheduling in page zones (slots, buckets, outlets, etc.) and A/B testing. In addition to these requirements, the service needed to inflate data, such as product identifiers contained in module data, and shape product data from upstream services. There were also other integrations with upstream services such as personalization. Below are a high-level overview and a sequence diagram that describe the configuration-based solution in more detail. We have already open sourced the data-patching library, json-patchwork, and will be open sourcing the orchestration layer in the future.
High level overview
Data enrichment details
All of this needed to be scaled to handle 48,000 requests per second, but that is another story. Spoiler: We scaled it.
A day in the middle life
The front-end is challenging, but the majority of its challenges, such as rendering performance, are sandboxed to the browser environment. The back-end is equally challenging, but its challenges are distributed across different layers such as caches, databases, and service buses, all connected via a network that has its own challenges. In the middle tier everyone’s challenges (problems) are yours. If a UI application reports high latency with your service, it automatically becomes your problem regardless of whether the latency is in your service, an upstream, or the network. This is because you are the layer directly beneath them. In short, the middle child does have it bad.
It’s all about timing
So what do you do when a consumer of your service shows you their logs or pretty graphs, and says, “you are running slower than usual”, or even worse “you are not meeting the SLA”? Your initial reaction will likely be to internally panic, while saying, “thanks for letting us know; we’ll look into it”. This very thing happened to me, several times.
Don’t trust and verify
People often say, “Trust, but verify”. I have not found this to be very helpful advice when engineering software. It’s not that I think that people are being dishonest; it is just that systems are complicated, and you can waste an extensive amount of time trying to solve a non-issue if you react to every problem you are presented with.
When latency was reported, our service was often not to blame: our logs showed that we were meeting, and in fact beating, our SLA.
However, just because our logs indicated that we were not at fault it didn’t mean that there wasn’t a problem. It just meant that there was a discrepancy between the UI application and service logs, and that we needed to determine why the discrepancy existed. The first thing we did was compare what it was we were measuring.
Our service logs measured the time it took for our service to send a response to a request. The UI application logs measured the time from when a request was made until the UI application received the response. This led us to the missing link: the network (latency). Eureka! Well, not exactly. Trying to solve a networking issue is HARD. Conditions are never the same, and the network never behaves consistently. Just when you think you have identified the problem, something changes. I will just cut to the chase and let you know that the network was not the problem in this case.
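To make the two vantage points concrete, here is a minimal sketch, not our actual telemetry code. It runs a stand-in service and a stand-in UI client in one process and times the same request from both sides. The names startService, timedGet, and compareTimings are hypothetical, and the 20 ms setTimeout is an assumed placeholder for real work.

```javascript
const http = require('node:http');

// Hypothetical stand-in for the service. It times each request from the moment
// the handler runs until the response has been handed off ('finish').
function startService(onServiceTiming) {
  const server = http.createServer((req, res) => {
    const start = process.hrtime.bigint();
    res.on('finish', () => {
      onServiceTiming(Number(process.hrtime.bigint() - start) / 1e6);
    });
    // Assumed placeholder for orchestration work: reply after roughly 20 ms.
    setTimeout(() => res.end(JSON.stringify({ ok: true })), 20);
  });
  return new Promise((resolve) => {
    server.listen(0, '127.0.0.1', () => resolve(server));
  });
}

// Hypothetical stand-in for the UI application. It times the full round trip:
// connect, send the request, wait, and read the whole response.
function timedGet(port) {
  return new Promise((resolve, reject) => {
    const start = process.hrtime.bigint();
    http
      .get({ host: '127.0.0.1', port, agent: false }, (res) => {
        res.resume(); // drain the body so that 'end' fires
        res.on('end', () => {
          resolve(Number(process.hrtime.bigint() - start) / 1e6);
        });
      })
      .on('error', reject);
  });
}

async function compareTimings() {
  let reportServiceMs;
  const serviceTiming = new Promise((resolve) => {
    reportServiceMs = resolve;
  });
  const server = await startService(reportServiceMs);
  const [serviceMs, clientMs] = await Promise.all([
    serviceTiming,
    timedGet(server.address().port),
  ]);
  server.close();
  return { serviceMs, clientMs };
}

const timings = compareTimings();
timings.then(({ serviceMs, clientMs }) => {
  console.log(`service saw ${serviceMs.toFixed(1)} ms, client saw ${clientMs.toFixed(1)} ms`);
});
```

Even on loopback the client number comes out larger, because its clock starts earlier and stops later than the service clock. The gap between the two is everything the service cannot see from its own logs.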
Measure twice, cut once
After chasing our tails looking at the network, we were back at square one. So we decided to more closely examine what we were measuring again. First, we double-checked our service telemetry. Once we were satisfied we decided to dig into the UI application code and telemetry. Everything looked normal with the exception that the reply timing measurement occurred after the reply body had been parsed. This added on some time, but it wasn’t significant enough to account for the latencies being reported by the application. However, while insignificant, it was this finding that turned out to be the actual eureka moment.
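Where the clock is stopped matters. The sketch below is an illustration under assumptions, not the UI application’s real code: the hypothetical helper timeReply reports the elapsed time both when the body has been received and again after it has been parsed. Building the payload stands in for the network round trip.

```javascript
// Hypothetical helper: given the raw reply body and the request start time,
// report the elapsed time at two different measurement points.
function timeReply(rawBody, requestStartNs) {
  const receivedNs = process.hrtime.bigint(); // body fully received
  const data = JSON.parse(rawBody); // parsing runs on the event loop thread
  const parsedNs = process.hrtime.bigint(); // body received and parsed
  const toMs = (ns) => Number(ns) / 1e6;
  return {
    data,
    receivedMs: toMs(receivedNs - requestStartNs),
    parsedMs: toMs(parsedNs - requestStartNs),
  };
}

const requestStartNs = process.hrtime.bigint();
// Assumed payload shape, for illustration only.
const rawBody = JSON.stringify({
  modules: Array.from({ length: 50000 }, (_, id) => ({ id, title: `Module ${id}` })),
});
const reply = timeReply(rawBody, requestStartNs);
console.log(
  `received after ${reply.receivedMs.toFixed(2)} ms, ` +
    `parsed after ${reply.parsedMs.toFixed(2)} ms`
);
```

A timer stopped after parsing charges the consumer’s own CPU work to the service it called.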
Knowing is half the battle
Back to the lecture at hand…
So when the queue is large, any request-reply callback that gets pushed onto the queue could potentially have to wait a while for other messages in the queue to run to completion before it is executed.
This is precisely what was happening when the UI application began reporting service latencies during high traffic periods. So a fix was not required on our side, but we indirectly helped improve UI application performance by identifying the root of the latency, which was event loop latency in the UI application. This in turn helped the UI application teams identify and fix bottlenecks. These fixes ranged from rendering optimizations to adding more application servers.
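The queueing effect described above is easy to reproduce. In this minimal sketch a 10 ms timer stands in for a request-reply callback, and a synchronous busy loop stands in for other work, such as rendering, that occupies the event loop. The names busyWait and measureCallbackDelay and the durations are assumptions chosen for illustration.

```javascript
// Block the single JavaScript thread for roughly `ms` milliseconds.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin: nothing else can run on the event loop while we are here
  }
}

// Schedule a callback that is due in `expectedMs`, then keep the event loop
// busy for `blockMs`. Resolve with how long the callback really waited.
function measureCallbackDelay(expectedMs, blockMs) {
  return new Promise((resolve) => {
    const start = process.hrtime.bigint();
    setTimeout(() => {
      const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
      resolve({ elapsedMs, lateByMs: elapsedMs - expectedMs });
    }, expectedMs);
    busyWait(blockMs); // the callback cannot run until this returns
  });
}

const delay = measureCallbackDelay(10, 200);
delay.then(({ elapsedMs, lateByMs }) => {
  console.log(
    `callback ran after ${elapsedMs.toFixed(0)} ms, ${lateByMs.toFixed(0)} ms later than scheduled`
  );
});
```

A timer wrapped around that callback would report the full wait as service latency, even though the reply itself could have arrived long before the callback was allowed to run.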
I used to think that working in the front-end was difficult because every issue is typically filed against the UI first, so people usually assume that the UI is always to blame, no matter where in the stack the actual problem occurred. I now think that the middle is even more challenging, because not only is it the front-end for the back-end (it gets all the blame from the UI), but when debugging you also have to look in two directions as opposed to one. You have to understand the front-end that is calling you and the back-end interfaces, plus the network. You also need solid relationships with both sides and all associated teams in order to foster a sense of collaboration as opposed to blaming each other when problems occur.
In terms of technical takeaways, knowing the runtime environment, i.e., single threaded with an event loop concurrency model, is key to debugging issues. Also, ensure that you are gathering metrics such as event loop latency in addition to application and service telemetry. Lastly, never, ever trust data and always verify it.
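As one way to gather event loop latency, recent versions of Node ship monitorEventLoopDelay in the perf_hooks module. It is newer than the work described in this post, so treat the following as a sketch of the idea and not as what we ran in production. The helper sampleEventLoopDelay, the durations, and the simulated 100 ms block are assumptions for illustration.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Hypothetical helper: sample event loop delay for `durationMs` while
// `simulateLoad` runs, then resolve with a summary in milliseconds.
function sampleEventLoopDelay(durationMs, simulateLoad) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  return new Promise((resolve) => {
    // Start the load a little later so the monitor has already taken samples.
    setTimeout(simulateLoad, 50);
    setTimeout(() => {
      histogram.disable();
      // The histogram reports nanoseconds; convert to milliseconds.
      resolve({
        meanMs: histogram.mean / 1e6,
        p99Ms: histogram.percentile(99) / 1e6,
        maxMs: histogram.max / 1e6,
      });
    }, durationMs);
  });
}

// Assumed load: block the event loop for roughly 100 ms.
function blockFor100ms() {
  const end = Date.now() + 100;
  while (Date.now() < end) {
    // spin
  }
}

const loopStats = sampleEventLoopDelay(300, blockFor100ms);
loopStats.then(({ meanMs, p99Ms, maxMs }) => {
  console.log(
    `event loop delay: mean ${meanMs.toFixed(1)} ms, ` +
      `p99 ${p99Ms.toFixed(1)} ms, max ${maxMs.toFixed(1)} ms`
  );
});
```

In a real application you would report these values on an interval alongside request timings, so that a spike in reported service latency can be checked against a spike in event loop delay.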
Well, that’s it for now. Hopefully, you found this mildly entertaining and useful. Next time I’ll share how keep-alive nearly killed us on more than one occasion.