<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Aurelijus Banelis on Medium]]></title>
        <description><![CDATA[Stories by Aurelijus Banelis on Medium]]></description>
        <link>https://medium.com/@aurelijus-banelis-home24?source=rss-444ef2c0b995------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*YdeCcsggE6bTreNQIpjiRg.jpeg</url>
            <title>Stories by Aurelijus Banelis on Medium</title>
            <link>https://medium.com/@aurelijus-banelis-home24?source=rss-444ef2c0b995------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 17:07:46 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@aurelijus-banelis-home24/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Debugging GraphQL schema change in Golang app]]></title>
            <link>https://medium.com/home24-technology/debugging-graphql-schema-change-in-golang-app-64098de9ff06?source=rss-444ef2c0b995------2</link>
            <guid isPermaLink="false">https://medium.com/p/64098de9ff06</guid>
            <category><![CDATA[graphql]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[golang]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[lambda]]></category>
            <dc:creator><![CDATA[Aurelijus Banelis]]></dc:creator>
            <pubDate>Thu, 23 Sep 2021 08:24:38 GMT</pubDate>
            <atom:updated>2022-03-09T19:30:38.456Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-sa/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yUHPcCXtAY8QrNi6m_Jp5g.png" /></figure><p><a href="https://graphql.org/">GraphQL</a> was created to have <strong>stable APIs</strong> via <a href="https://graphql.org/learn/schema/">strict schemas</a> on a server-side and the more relaxed <a href="https://graphql.org/learn/queries/">query/mutate/subscribe</a> methods from the client-side. You might imagine a feeling when schema validation stops working <strong>randomly after some requests</strong>. And you might imagine that investigation of those kinds of issues are the fun<strong> </strong>ones.</p><p><em>Are you ready for </em><a href="https://medium.com/home24-technology/debugging-node-js-memory-leak-on-production-via-shadowed-traffic-cd8198d3df28"><em>another</em></a><em> debugging story?</em></p><h3>Schema difference tool</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/664/1*6r1cAEOSu0KgYYz8nCPHKA.png" /></figure><p>It all started, when we saw some false positives from <a href="https://github.com/kamilkisiela/graphql-inspector">graphql-inspector</a> tool.</p><p>We are checking schema changes after each GraphQL functionality change (Pull request hook) to prevent breaking changes for GraphQL clients in production.</p><p>From the first impression, it looked like some configuration issue after pre-production/staging infrastructure changes (it was a coincidence that both changes were made within a similar time range).</p><p>While we could easily double-check affected GraphQL queries (or mutations) — it was a low priority task.</p><h3>The obvious: maybe it is already fixed?</h3><p>One day we finally picked that difference checker task, hoping to just upgrade the tool and see the issue gone.</p><p>Unfortunately, the upgrade of the <a href="https://github.com/kamilkisiela/graphql-inspector">graphql-inspector</a> version did not help and we (thanks to a colleague, James) figured out, that pre-production and production environments were returning different results. Those two environments are supposed to run the same code and therefore – supposed to return the same results.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/e7f2d8b6d2913d0a5112799cd4fdf385/href">https://medium.com/media/e7f2d8b6d2913d0a5112799cd4fdf385/href</a></iframe><p>As you can see from the code example, ! means that argument <a href="https://graphql.org/learn/schema/#object-types-and-fields">is mandatory</a>. Therefore input arguments of type Locale! should be validated always<strong>. </strong>But for some docker containers (applications running for some time) — input validation did not work<strong>.</strong></p><h3><strong>Signs of a bigger problem</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*kPUwRH7mm_-cEISziu2SmA.png" /></figure><p>Testing environment you cannot trust (or releasing new versions blindly) was not a good thing, so this low-priority task had to be taken more seriously.</p><p>We started from obvious things: searching for some stupid typo mistake by aligning pre-production and production configuration. 
<h3><strong>Signs of a bigger problem</strong></h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/576/1*kPUwRH7mm_-cEISziu2SmA.png" /></figure>
<p>A testing environment you cannot trust (in other words, releasing new versions blindly) is not a good thing, so this low-priority task had to be taken more seriously.</p>
<p>We started with the obvious things: searching for a stupid typo by aligning the pre-production and production configuration. To our disappointment, the same Docker image, the same environment variables, and the same AWS roles were still reproducing the same error.</p>
<p>When hope was almost replaced by frustration, we finally saw the same issue in the staging environment.</p>
<p><em>So we were one small step further: the issue was reproducible (the good news), but (the bad news) it had started without a code change.</em></p>
<h3><strong>Managing the randomness</strong></h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kubVsgIRBrtNBq5e90GhfQ.png" /><figcaption>Comparing request types before and after the behavior changed</figcaption></figure>
<p>For GraphQL application development, we have multiple environments (local, staging, pre-production, and production) — some are best for debugging, others are best for catching integration and configuration mistakes.</p>
<p>Ideally, we wanted to reproduce the issue in the <em>Local</em> environment, so the problem could be narrowed down until the real cause was easy to find. Unluckily, the issue appeared randomly, after some time and some number of calls to the application.</p>
<p>Replaying requests did not seem like an option, because not all requests were <a href="https://en.wikipedia.org/wiki/Idempotence">idempotent</a>, and timing or caching were also candidates for the cause of the issue.</p>
<p>So we ended up with a simple Lambda cron (<a href="https://aws.amazon.com/lambda/">a small application on AWS</a>) that kept calling the Staging environment to reproduce the symptoms of the issue. The Staging environment had less traffic, so we hoped it would be easier to narrow things down to fewer examples to double-check.</p>
<p><a href="https://medium.com/media/0b1c753a6a1e7312c65fa5496b82a3ac/href">https://medium.com/media/0b1c753a6a1e7312c65fa5496b82a3ac/href</a></p>
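<p>For the curious, a minimal sketch of such a Lambda cron in Go could look roughly like this (the endpoint, operation name and query below are made up; only the idea of a dedicated OperationName for the synthetic traffic matters):</p>
<pre>// A rough sketch of a scheduled Lambda that keeps poking the Staging GraphQL
// endpoint. The endpoint, operation name and query are hypothetical.
package main

import (
    "bytes"
    "context"
    "encoding/json"
    "fmt"
    "net/http"

    "github.com/aws/aws-lambda-go/lambda"
)

const stagingEndpoint = "https://staging.example.com/graphql"

func handler(ctx context.Context) error {
    body, _ := json.Marshal(map[string]any{
        // A dedicated operation name makes the synthetic calls easy to filter in the logs.
        "operationName": "SchemaDriftCron",
        "query":         `query SchemaDriftCron($locale: Locale!) { product(id: "SKU-1", locale: $locale) { name } }`,
        "variables":     map[string]any{"locale": "de_DE"},
    })
    req, err := http.NewRequestWithContext(ctx, http.MethodPost, stagingEndpoint, bytes.NewReader(body))
    if err != nil {
        return err
    }
    req.Header.Set("Content-Type", "application/json")
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    fmt.Println("staging responded with", resp.StatusCode)
    return nil
}

func main() {
    // Triggered by an EventBridge (CloudWatch Events) schedule rule.
    lambda.Start(handler)
}</pre>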
<p><em>We automated everything we could — now we only needed to wait and gather more data (assuming our monitoring tools were also ready).</em></p>
<h3><strong>Monitoring GraphQL</strong></h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/867/1*GU7idEtV-wpcAdM01nf2kg.png" /><figcaption>Because the HTTP URL is always the same, we use the operation name as a short and meaningful identifier</figcaption></figure>
<p>Because GraphQL is flexible regarding queries, logging the full GraphQL input could fill up your storage quite fast. Also, we want to be careful not to store sensitive data (secrets, personal data, etc.). In the company, we have agreed on 2 GraphQL best practices:</p>
<ul><li><strong>Variables</strong> are used as arguments for common calls (this is also beneficial for <a href="https://gqlgen.com/reference/apq/">Automatic persisted queries</a>). Sensitive data in those parameters is obfuscated in the logs.</li><li><strong>OperationName</strong> is an optional parameter that provides a human-readable description of a query (or a mutation). OperationName is the closest thing to the <em>endpoint</em> concept in a <a href="https://swagger.io/resources/open-api/">REST API</a>.</li></ul>
<p>Logging the OperationName and cache-usage metrics was the optimal choice for both privacy and resource usage, even in the Production environment.</p>
<p>Of course, longer monitoring, filtering by container, and a special OperationName for the Lambda cron were useful additions.</p>
<p><em>Script to reproduce the error: ready. Behavior logging: ready. Monitoring to identify when the issue starts: ready. Text difference tools: ready. The only thing still missing was the hope to finally solve it.</em></p>
<h3><strong>Reproducible locally, what’s next</strong></h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*8IHURrXovfdX5EEp5WqxoQ.png" /><figcaption>Debugging variable state and looking for anomalies (E.g. NonNull changing from true to false)</figcaption></figure>
<p>The Lambda tool combined with GraphQL monitoring gave a much smaller set of OperationNames to double-check. Luckily, searching the company’s GitHub for a particular OperationName string gave quite good GraphQL query examples to test locally.</p>
<blockquote>I still remember that moment of joy when I loudly said: “<strong><em>yey, I reproduced it locally!</em></strong>”</blockquote>
<p><a href="https://medium.com/media/6771bc506230a002b7539b6096df300d/href">https://medium.com/media/6771bc506230a002b7539b6096df300d/href</a></p>
<p>After all this mystery-solving, we still needed to debug the Go (aka <em>Golang</em>) code:</p>
<ul><li>Check for obvious mistakes in our code</li><li>Make testing easy (E.g. simplify the GraphQL query so it has fewer fields and no special authentication)</li><li>Follow the execution flow and look for unexpected state</li><li>Narrow down the issue by changing the code (dirty, but fast fixes)</li><li>Once the hypothesis is confirmed, finish with a long-term solution: an automated test to reproduce the issue and an actual code fix that makes those tests pass</li></ul>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/795/1*8qv-HMPV8vRJPAM-h5AcgA.png" /><figcaption>A quick fix in the code — probably the best way to narrow down the problem</figcaption></figure>
<p>Regarding debugging in Golang, I was really happy that the source code of all dependencies was downloaded as well. So tools like debugger watchers, conditional breakpoints, and manual code edits were working across our own and the dependencies’ code base.</p>
<p><em>It does not matter where the issue is if it affects our clients or the stability of the service.</em></p>
<h3><strong>Issue in Open Source dependency</strong></h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/709/1*k733MtZSyOWCkAQzragbeA.png" /></figure>
<p>At <a href="https://home24.career.softgarden.de/en/">home24</a> our software relies on open source (keeping license terms in mind). So we do not only consume free code but also try to give back to the community. <a href="https://github.com/vektah/gqlparser/pull/161">Pull requests with bug fixes</a> are a perfect example of open source flourishing.</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/928/1*aLvvymdpPar9Qsk715u8Mg.png" /><figcaption><a href="https://github.com/vektah/gqlparser/pull/161">Pull request with test cases to reproduce the issue in the upstream library</a></figcaption></figure>
<p>Of course, the maintainers of an open source project are the ones to decide <a href="https://github.com/vektah/gqlparser/pull/158">how the issue should be resolved</a>. Still, notifying them about the issue or providing an alternative way to fix the problem is good practice in the software industry overall.</p>
<blockquote>The Go programming language has powerful tools to save some memory by <a href="https://tour.golang.org/moretypes/1">using references to objects</a>, but those same optimizations (as seen from the bug fixes) can also lead to <strong>unexpected </strong><a href="https://betterprogramming.pub/pass-by-value-and-reference-in-go-94423b6accf1"><strong>updates by reference</strong></a>. And that was the cause for this story to be written.</blockquote>
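<p>To make that class of bug more tangible (this is a made-up minimal example, not the actual gqlparser code): when several parts of a program share a pointer to the same definition, a change made through one reference silently leaks into every other place that holds it, which is exactly the kind of NonNull flipping from true to false that we saw in the debugger:</p>
<pre>// A made-up minimal example of an unexpected update by reference.
// Two queries share a pointer to the same argument definition; a change made
// while handling one request becomes visible to all later requests.
package main

import "fmt"

// ArgumentDefinition loosely mimics a schema node with a NonNull flag.
type ArgumentDefinition struct {
    Name    string
    NonNull bool
}

func relaxValidation(arg *ArgumentDefinition) {
    // Intended as a local tweak, but it mutates the shared definition.
    arg.NonNull = false
}

func main() {
    locale := &ArgumentDefinition{Name: "locale", NonNull: true}

    // Both "operations" reuse the same pointer to save memory.
    queryA := []*ArgumentDefinition{locale}
    queryB := []*ArgumentDefinition{locale}

    relaxValidation(queryA[0])

    // queryB was never touched, yet its argument is no longer mandatory.
    fmt.Println("queryB locale NonNull:", queryB[0].NonNull) // prints: false
}</pre>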
<a href="https://github.com/vektah/gqlparser/pull/161">Pull requests with bug fixes</a> are the perfect example of Open source flourishing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/928/1*aLvvymdpPar9Qsk715u8Mg.png" /><figcaption><a href="https://github.com/vektah/gqlparser/pull/161">Pull request with test cases to reproduce in the upstream library</a></figcaption></figure><p>Of course, maintainers of Opens source project are the ones to decide, <a href="https://github.com/vektah/gqlparser/pull/158">how the issues could be resolved</a>. Still, notifying about the issue or proving an alternative way to fix the problem — is a good practice in Software industry overall.</p><blockquote>Go programing langue have powerful tools to save some memory by <a href="https://tour.golang.org/moretypes/1">using references to objects</a>, but those same optimizations (as seen from bug fixes) can also lead to <strong>unexpected </strong><a href="https://betterprogramming.pub/pass-by-value-and-reference-in-go-94423b6accf1"><strong>updates by reference</strong></a>. And that was the cause for this story to be written.</blockquote><h3><strong>To sum up</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*anRSAEopvdTewnQHEl3z7A.png" /></figure><p>I hope this debugging journey was a good example, how someone could approach and think about the problems in Software development. It might be easy to find a quick solution by the error code, but the reality is not always that simple. The initial assumption may differ a lot from the cause of the issue: especially when it is a known issue somewhere deep in the dependency’s issue tracker.</p><p>Therefore practices like raising and testing assumptions, getting a fresh view from colleagues, prioritizing risks and tasks, using multiple tools, optimizing the process itself, and thinking more than you — are the ones that work, at least for me.</p><p><em>I wish you fun programming journeys, as we have here at </em><a href="https://home24.career.softgarden.de/en/"><em>home24</em></a><em>.</em></p><p>P.S. Summary version also available as an <a href="https://app.infinitymaps.io/maps/mbhn6qmq3p7/HLBrGmgjqTB">Infinity map</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=64098de9ff06" width="1" height="1" alt=""><hr><p><a href="https://medium.com/home24-technology/debugging-graphql-schema-change-in-golang-app-64098de9ff06">Debugging GraphQL schema change in Golang app</a> was originally published in <a href="https://medium.com/home24-technology">home24 technology</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Debugging Node.js Memory leak on Production via Shadowed traffic]]></title>
            <link>https://medium.com/home24-technology/debugging-node-js-memory-leak-on-production-via-shadowed-traffic-cd8198d3df28?source=rss-444ef2c0b995------2</link>
            <guid isPermaLink="false">https://medium.com/p/cd8198d3df28</guid>
            <category><![CDATA[nodejs]]></category>
            <category><![CDATA[cloud-native]]></category>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[debugging]]></category>
            <category><![CDATA[home24]]></category>
            <dc:creator><![CDATA[Aurelijus Banelis]]></dc:creator>
            <pubDate>Mon, 25 Jan 2021 08:11:27 GMT</pubDate>
            <atom:updated>2021-01-25T12:55:59.154Z</atom:updated>
            <cc:license>http://creativecommons.org/licenses/by/4.0/</cc:license>
            <content:encoded><![CDATA[<figure><img alt="Room as  background with various charts. Text: Debugging Node.js Memory leak on Production via Shadowed Traffic. At home24" src="https://cdn-images-1.medium.com/max/1024/1*T2P-Y6SjNEkjOvo8Iv8RVA.jpeg" /></figure><p>Memory leaks are hard to debug, especially when using programming languages with <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Memory_Management">garbage collection</a>. At <a href="https://home24.career.softgarden.de/en/">home24</a> we had quite a journey searching for the cause of <a href="https://www.geeksforgeeks.org/exit-codes-in-c-c-with-examples/">139</a> exit codes from crashed docker containers.</p><p>The outcome of the debugging was not only the fixed bug, but also an example how <strong>tools outside of the JavaScript ecosystem can help in debugging JavaScript code</strong>.</p><p>So let’s start the journey…</p><h3>In the beginning, it was just a 5xx status codes</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*F3xAB1UNSYUEkf4E8Sm61g.png" /><figcaption>Number of crashed docker containers via AWS CloudWatch metrics</figcaption></figure><p>In a Web service world, there is a convention to respond with <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500">500–599 HTTP status codes</a> when there is a nonrecoverable problem in the application itself. Therefore most popular <a href="https://nodejs.org/en/about/">Node.js</a> applications (e.g. <a href="https://expressjs.com/en/guide/error-handling.html#the-default-error-handler">Express</a>) and <a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html#load-balancer-http-error-codes">AWS Load balancers</a> also implement this behavior.</p><p>What cannot be found and fixed via <a href="https://jestjs.io/">Unit tests</a> while still in development, ends up in <em>5xx</em> exit codes in a live environment. For code logic errors — it is repetitively easy to fix because those are deterministic: with the same input and context, we should get the same error. But there are other issues like failed network, hardware and… insufficient <strong>memory management</strong>. We still get the same <em>5xx</em> error codes but fix range from <em>wait-and-retry</em> to <em>long story ahead</em>.</p><h3>Memory leaks in a docker container world</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/777/1*YAfextYOtiS3R6aaQ7qocQ.png" /><figcaption>Example of memory leak issue delegating to other teams</figcaption></figure><p>If you are coming from <a href="https://www.digitalocean.com/community/tutorials/how-to-use-pm2-to-setup-a-node-js-production-environment-on-an-ubuntu-vps">bare metal Node.js applications</a>, you would probably say “<a href="https://marmelab.com/blog/2018/04/03/how-to-track-and-fix-memory-leak-with-nodejs.html#restart-before-its-too-late">just restart pm2 and it is fixed</a>”. 
But that does not work in the world of containerized applications because:</p><ul><li>There is <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ECS_agent.html">ecs-agent</a>, <a href="https://kubernetes.io/">kubernetes</a>, <a href="https://www.docker.com/">docker</a> or another <a href="https://en.wikipedia.org/wiki/Hypervisor">hypervisor</a> that would restart the application instead of <em>pm2</em></li><li>Restarting the application mitigates the outcome, but we are still paying for the extra CPU or memory of an unoptimized application</li></ul>
<p>At <em>home24</em> we try to innovate fast and take advantage of the best programming language for the task, so we use <a href="https://www.docker.com/">docker</a> extensively. At first, we tried to increase the memory of the containers (think bigger servers), but it just postponed the issue.</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_GjtvJJIEX2EpmkBT9h85A.png" /><figcaption>Advertisement about discounts before Black Friday weekend</figcaption></figure>
<p>Black Friday (the biggest sale of the year) was coming, so “just postpone” was not a viable solution. The <em>Frontend</em> team was already filled with tasks of creating new features and fixing “reproducible” errors, so the memory leak issue was delegated to the <em>Scaling</em> team — a team that was more familiar with AWS infrastructure than with the internals of Node.js/JavaScript.</p>
<h3>Brace yourself, the memory leak hunter is coming</h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L-7My2fq_lFDQNymv-yxyw.png" /><figcaption>Slide from presenting bug hunting internally at home24 Demo night</figcaption></figure>
<p>Coming from other programming languages, the <em>Scaling</em> team assumed that unused memory would be released quite soon. But even with a <a href="https://expressjs.com/en/starter/hello-world.html"><em>Hello world Express application</em></a>, memory usage did not drop after the requests stopped. We realized that <a href="https://felixgerschau.com/javascript-memory-management/">JavaScript memory management is more complex</a> than we had anticipated. So as a Scaling team we went through different debugging tools (E.g. <a href="https://nodejs.org/en/docs/guides/debugging-getting-started/">Chrome inspector</a>, <a href="https://clinicjs.org/">clinic</a>, <a href="https://www.netdata.cloud/">netdata</a>, <a href="https://newrelic.com/">NewRelic</a>) and different debugging methods (E.g. building locally, running production containers, comparing logs of similar services).</p>
<p>As usual in this type of bug hunting, the best ideas came in the middle of the night. <em>Marius</em> saw that a lot of HTML strings were persisted in memory when he tried to reproduce the issue locally. He googled for similar issues in the JavaScript community and guess what — there was a known memory leak in the popular <em>axios</em> library:</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/929/1*KB_Iew_XEX_whlDPnUlNlw.png" /><figcaption><a href="https://github.com/axios/axios/issues/1997#issuecomment-463571527">Workaround for axios and HTML content</a></figcaption></figure>
<p>It was used in a 3rd-level dependency, so rebuilding (minifying) all upstream dependent services took some time. We kind of reproduced it locally and saw fewer retained strings via <a href="https://nodejs.org/en/docs/guides/debugging-getting-started/">Chrome inspector</a>, so we were already excited that we had managed to find the bug. 
Finally, we released the application with the fix and…</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YJOJohBXUZoUzOjgN6xWVA.png" /><figcaption>Release date of the new service version, but no visible decrease in errors</figcaption></figure>
<p>there was no significant impact.</p>
<p>We were running out of options, so we had to accept defeat. To be honest, it was the <em>Scaling</em> team, not the <em>Frontend</em> team, that had to.</p>
<p><em>But the story does not end here…</em></p>
<h3>Meanwhile, migration to Fargate</h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/913/1*8x5PQUUjFR6lE-RrTOoI-g.png" /><figcaption>The task to migrate the Node.js service to new infrastructure</figcaption></figure>
<p>As part of the Scaling team’s tasks, there was a migration from <a href="https://docs.aws.amazon.com/AmazonECS/latest/developerguide/getting-started-ecs-ec2.html">AWS EC2 based ECS</a> to <a href="https://aws.amazon.com/fargate/">Fargate based ECS</a>. From practice, we saw that we were not good at choosing the optimal servers (EC2 instances) and auto-scaling parameters. Therefore, for a stateless Node.js application, it made sense to migrate.</p>
<p>To get the right container sizes for the production load we used a technique called <strong>shadowed (mirrored) traffic:</strong></p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/620/1*KsYPlYCm4Qd3FlspoogAOw.png" /><figcaption>Traffic shadowing</figcaption></figure>
<p>Traffic shadowing is a quite common technique in <a href="https://landscape.cncf.io/">Cloud-native</a> tools like <a href="https://opensource.zalando.com/skipper/tutorials/shadow-traffic/">Skipper</a> or <a href="https://istio.io/latest/docs/tasks/traffic-management/mirroring/">Istio</a>. It is similar to <a href="https://en.wikipedia.org/wiki/Load_testing">load testing</a>, but instead of manually describing requests, copies of real ones are directed to a copy of the service.</p>
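<p>We used ready-made routing tools for this, but the idea itself is small enough to sketch (a made-up example in Go, not our actual setup): a tiny reverse proxy answers every request from the primary service and, in the background, fires a copy of the same request at the shadow service and throws the response away:</p>
<pre>// A made-up sketch of traffic shadowing (in reality we used Skipper/Istio-style
// tooling): every request is served by the primary backend, and a copy of it is
// sent to the shadow backend whose response is discarded.
package main

import (
    "bytes"
    "io"
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    primary, _ := url.Parse("http://primary.internal:8080")
    shadow := "http://shadow.internal:8080"

    proxy := httputil.NewSingleHostReverseProxy(primary)

    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        // Buffer the body so it can be replayed to the shadow copy.
        body, _ := io.ReadAll(r.Body)
        r.Body = io.NopCloser(bytes.NewReader(body))

        // Fire-and-forget copy of the request to the shadow service.
        go func() {
            req, err := http.NewRequest(r.Method, shadow+r.RequestURI, bytes.NewReader(body))
            if err != nil {
                return
            }
            req.Header = r.Header.Clone()
            if resp, err := http.DefaultClient.Do(req); err == nil {
                io.Copy(io.Discard, resp.Body)
                resp.Body.Close()
            }
        }()

        // The client only ever sees the primary backend's response.
        proxy.ServeHTTP(w, r)
    })

    log.Fatal(http.ListenAndServe(":8080", nil))
}</pre>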
<p>While our <em>Node.js</em> applications were <a href="https://medium.com/@rachna3singhal/stateless-over-stateful-applications-73cbe025f07">stateless</a> and the dependent services could handle 2x traffic, the migration plan seemed perfect… until we saw the response time chart:</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/912/1*np6eZbHMeiYbV7VljmPT_g.png" /><figcaption>Response times more unstable on Fargate</figcaption></figure>
<blockquote>But wait — why a response time chart, and how is it related to memory usage?.. It is not only the garbage collection calls that make the application slower. The load balancer also does not remove unhealthy containers instantly, so clients (services or browsers) keep waiting for valid connection close packets until the connection times out.</blockquote>
<p>It turned out that <a href="https://aws.amazon.com/fargate/">Fargate</a> was starting all containers at the same time, so memory was also filling up at the same time. To our disappointment, <strong>the memory leak issue was more visible on <em>Fargate</em> than on <em>EC2</em> instances</strong>.</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*csgJviR4q_ZFJA27AaEHuQ.png" /><figcaption>Slide from presenting bug hunting internally at home24 <em>Demo night</em></figcaption></figure>
<p>The <em>Scaling</em> team had already failed once to fix the memory leak issue. If other similar Node.js apps also had a similar memory issue, the whole migration-to-Fargate goal would be at risk.</p>
<p><em>It seemed that the Scaling team just had bad luck, unless…</em></p>
<h3>Introducing: Debugging on production</h3>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LHIZXqDAhIefzi76rJSpvA.png" /><figcaption>Connecting to a remote Node.js container via Chrome Inspector on a shadowed instance</figcaption></figure>
<p>This time the situation was different. We had an environment that did not affect real users but had the best test data possible (mirroring all LIVE traffic). <strong>So we could use all the heavy JavaScript debugging tools</strong>.</p>
<p>It turned out that <a href="https://developers.google.com/web/tools/chrome-devtools">Chrome inspector (DevTools)</a> was adding a lot of overhead: with Heap snapshot and Allocation instrumentation on timeline, containers were crashing within seconds (meaning connection reset errors in Chrome). So connecting to the “real production” would have been catastrophic (or at least not professional).</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/722/1*HL2H_bkVVAU2UMYL6Fp4TQ.png" /><figcaption>Chrome Inspector options for memory profiling</figcaption></figure>
<p>Of all the profiling types, only Allocation sampling survived the ~10 seconds until the container was marked as unhealthy (<em>note: we were using container sizes that had the same response time as the original ones</em>). And it was enough to see the real issue: the docCache <a href="https://github.com/apollographql/graphql-tag/search?q=docCache">variable</a>.</p>
<p>Also, this setup allowed us to <strong>remove false positives fairly quickly</strong> (E.g. is it related to the Node.js version, the Apollo library version, hooks, caching configuration, etc.).</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WzGMmskRHWV2Xt2azH8aiQ.png" /><figcaption>Slide from presenting bug hunting internally at home24 <em>Demo night</em></figcaption></figure>
<p><em>Finally, we saw a happy ending in this debugging story…</em></p>
<h3>The real cause: string interpolation instead of Apollo variables</h3>
<p>After the assumption was tested with real-user traffic, we could finally confirm the real cause of the issue:</p>
<p><a href="https://medium.com/media/bafbfc5d0aff875893829442f779a8bc/href">https://medium.com/media/bafbfc5d0aff875893829442f779a8bc/href</a></p>
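<p>The application itself was Node.js with Apollo, but the anti-pattern is language-agnostic, so here is a rough sketch (in Go, with made-up field and variable names): interpolating values straight into the query text produces a new unique query string for every call, so any cache keyed by the query string (such as graphql-tag’s docCache) keeps growing, while passing the values as GraphQL variables keeps a single stable query:</p>
<pre>// A language-agnostic sketch of the anti-pattern (the real code was Node.js with
// Apollo's gql tag). Field and variable names are made up.
package main

import "fmt"

// Anti-pattern: values are baked into the query text, so every product and every
// flag combination yields a brand new query string. Any cache keyed by the query
// string keeps growing.
func interpolatedQuery(productID string, withOptionalBackendParameters bool) string {
    return fmt.Sprintf(
        `query Product { product(id: %q) { name price @include(if: %t) } }`,
        productID, withOptionalBackendParameters)
}

// Fix: one stable query string; the changing values travel as variables.
const stableQuery = `query Product($id: String!, $withOptionalBackendParameters: Boolean!) {
  product(id: $id) { name price @include(if: $withOptionalBackendParameters) }
}`

func variables(productID string, withOptionalBackendParameters bool) map[string]any {
    return map[string]any{
        "id":                            productID,
        "withOptionalBackendParameters": withOptionalBackendParameters,
    }
}

func main() {
    // With interpolation, every call can produce a different query string:
    fmt.Println(interpolatedQuery("SKU-1", true))
    fmt.Println(interpolatedQuery("SKU-2", false))
    // With variables, the query string stays identical; only the values change:
    fmt.Println(stableQuery)
    fmt.Println(variables("SKU-2", false))
}</pre>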
<p>Of course, this is a simplified example. But when you are writing a lot of:</p><pre>) @include(if: $<em>withOptionalBackendParameters</em>)</pre><p><strong>It gets tempting to replace those with some simple generated strings.</strong></p>
<p>To summarize, at the start of debugging:</p>
<ul><li>There was a false assumption <a href="https://github.com/apollographql/graphql-tag/issues/182">that many different query parameters would impact memory usage</a></li><li>The bug was very hard to reproduce locally because not enough unique products/configurations were simulated</li></ul>
<h3>Lessons learned</h3>
<p>We could say that the <em>Scaling</em> team learned some new JavaScript tricks, but personally, I see it from a wider perspective:</p>
<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*P7okV7R2HYhN22eEBXxoDw.png" /><figcaption>For complex problems, cross-team effort gives more unique tools and views</figcaption></figure>
<p>If not for the unique insights from multiple team members, I doubt I could have written a happy ending for this debugging story.</p>
<p>Therefore I want to publicly give big thanks to the whole <em>Scaling</em> and <em>Frontend</em> teams (including Tomas, Marius, Džiugas, Danny, Olga, Robert, Karolis, Antonella), as well as many other colleagues from <a href="https://home24.career.softgarden.de/en/"><strong>home24</strong></a><strong>.</strong></p>
<hr><p><a href="https://medium.com/home24-technology/debugging-node-js-memory-leak-on-production-via-shadowed-traffic-cd8198d3df28">Debugging Node.js Memory leak on Production via Shadowed traffic</a> was originally published in <a href="https://medium.com/home24-technology">home24 technology</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>