Introducing GraphQL resolver tracing in AWS X-Ray

A technical walkthrough to adding GraphQL resolver details into AWS X-Ray to identify performance issues.

As our LIFE Fasting Tracker app user base has grown, we’ve seen the expected increase in API traffic. We’ve kept an eye out for places where slow requests might hurt user experience. With LIFE Fasting Tracker being our first big deployment of GraphQL into production, I was eager to see how it performed in the real world and learn tricks to make it perform more efficiently as we scale up.

When it comes to performance improvements, a good first step is to collect data about what is slow and where time is being spent on requests. Our infrastructure is 100% AWS Lambda functions, so our first step was to figure out which of our Lambda functions were slow.

AWS X-Ray is a service for collecting performance metrics and visualizing where time is being spent. AWS X-Ray includes support for high-level tracing of Lambda functions. It’s easy to enable, so we started there.

How to Enable X-Ray tracing for Node.js Lambda functions

The official AWS documentation for enabling X-Ray in your Lambda functions is simple to follow, but below is a screenshot of what the option looks like in the Configuration page in the AWS Lambda Management Console.

Just check the checkbox, save the configuration and wait to collect data.

If the thought of manually changing configuration options in the console makes you cringe, don’t worry, it can be automated too. At LifeOmic, we use Terraform to manage all of our infrastructure, and enabling it through code is just as easy.

After enabling tracing at the Lambda level, you’ll want to make some small application changes to enable detailed tracing of AWS APIs and other cross-service calls from your application. The code below is what you want to include early in the initialization of your Node.js function.

If your deployment includes multiple services, you’ll want to make the same changes to all services you want to collect performance data from. Push the code and configuration changes to your Lambda functions and wait for some API usage in your application.

How I found the Slowness

We used the X-Ray console to search for slow requests and see where time was being spent. Our infrastructure includes more than just services for the LIFE app, so I wanted to narrow the performance data to requests that hit a service I know is used by our mobile application. The X-Ray console allows filtering, so I used a service("social-circle-service") filter to narrow down the X-Ray traces to those that hit our LIFE app’s social circle service somewhere in the execution.

14 second requests! Plenty of room to improve.

Clearly, there is plenty of room for improvement in our social circle service! No requests in our system should be taking 14 seconds. Even one second would have been longer than I expected. I did a deeper dive into one of those executions by clicking on a link in the ID column.

An interesting section of one of our slow requests

From that trace data, you can see in the screenshot above that the service took two seconds and was executing a GraphQL. That was expected, as almost all of our application requests use GraphQL. However, I couldn’t identify exactly what part of the execution was slow.

Unfortunately, this is as far as the basic integration could take me. But with some work we can make X-Ray even more useful.

Add GraphQL resolver details into AWS X-Ray

I was unsatisfied with the level of detail in the screenshot above, so I created some custom X-Ray integration to show GraphQL resolver details in the console. With the extra resolver details added, the above screenshot was transformed into something much more useful for locating our slow code.

Now it’s easy to see the breakdown of the time across the different GraphQL resolvers. With this, we can work to improve the implementation of the resolvers involved.

At LifeOmic, we’re happy to share work like this with the community. We’ve open sourced the project as @lifeomic/graphql-resolvers-xray-tracing on GitHub and in NPM. To enable the integration, just depend on that project and add some code near GraphQL schema initialization.

As for my performance problem outlined above with the LIFE social circle service, one thing that stood out to me was that separate Lambda executions to the user-service were being made for both thename and membersresolvers. I realized they could probably be combined into a single request with some request batching.

With the resolver level of detail in this service and others, it was possible to quickly find and implement many easy performance enhancements. The service execution times dropped to much less than a second and we got feedback from our users that the app felt more responsive.

Hopefully you can use @lifeomic/graphql-resolvers-xray-tracing to find places to improve your own GraphQL-based services. Please try it out, open up issues if you find any, and feel free to send Pull Requests for any improvements you’d like to make. If you want to work for a company that’s excited about building and sharing this type of work, check out our jobs page.

Matt Lavin is the lead developer on the mobile backend services team at LifeOmic. We’re building the services to support our mobile health tracking LIFE apps using AWS Lambda and GraphQL for serverless APIs. Feel free to contact me on LinkedIn or Twitter to learn more.