Optimizing GraphQL using Apollo Engine

This post covers the importance of having analytics on your GraphQL endpoints, and when to take action on optimizing your app based on your observations. At OK GROW! we use Apollo Engine to analyze our GraphQL endpoints' performance and to optimize slow queries and mutations.

Optimization and when to avoid it

“Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.” — Donald Knuth

It’s important to include the full quote above, because it cuts both ways: we shouldn’t optimize things before they become a bottleneck, but we also shouldn’t ignore them once red flags are raised.

For example, when creating an MVP to show to investors or to be used for user testing, most of the time it’s more important to get the simplest solution implemented as quickly as possible than to try to come up with a clever solution that does the job efficiently.

Consider a chat app that consists of a public chat screen and a user profile. While implementing the profile page, the developer quickly copy-pasted code, so that every update to the profile now runs 10 queries and 20 mutations: one mutation for every field in the profile, each run twice due to a mistake. Meanwhile, sending a chat message makes only 3 mutations: one to indicate that the user is typing, one to send the message across, and one caused by a bug in the logic. Take a minute and think about this. Which one is a higher priority to fix?
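To make the profile half of that concrete: the wasteful version fires one mutation per field, when a single batched mutation would do. Here is a minimal sketch of the batched approach, with hypothetical operation and field names (`UpdateProfile`, `ProfileInput`) that stand in for a real schema:

```javascript
// Hypothetical: one mutation carrying every changed field, instead of
// one mutation per field. The operation and field names are illustrative.
const UPDATE_PROFILE = `
  mutation UpdateProfile($input: ProfileInput!) {
    updateProfile(input: $input) { id displayName avatarUrl }
  }
`;

// Collect only the fields that actually changed, then build one request.
function buildProfileUpdate(original, edited) {
  const input = {};
  for (const key of Object.keys(edited)) {
    if (edited[key] !== original[key]) input[key] = edited[key];
  }
  return { query: UPDATE_PROFILE, variables: { input } };
}
```

However many fields the user edits, the server sees a single call instead of a mutation per field, twice.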

To understand the problem better, let’s take a look at Fagner Brack’s blog post, The Problem You Solve Is More Important Than The Code You Write.

As you can see above, the priority of a task depends heavily on how many users actually use that feature. Once you’ve created an MVP and released it to production users, you can see how often people use a particular feature, and decide whether it’s worth fixing or whether precious developer time is better spent elsewhere.

For example, after releasing the chat app in our example and looking at the analytics data, you might find that 0.01% of your users change their profile information, and at most once a year. Meanwhile, every active user uses the chat feature every day, and fixing that one extra call per chat message can reduce the load on your server by 33%. At this point, your time is probably better spent on the higher-priority task, fixing the extra call for each chat message, even while the ugly code you wrote for the MVP gives you lucid nightmares that you can’t wake up from!

Using Apollo Engine to find bottlenecks

Apollo Engine provides the ability to get real-time analytics on GraphQL queries and mutations, along with some other cool features like caching.

After connecting your app with Apollo Engine, and a few days of having your app used in production, you can analyze the performance of your app by looking at your GraphQL metrics.
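As a rough sketch of what that connection can look like with Apollo Server 2, which has Engine support built in (older setups used the standalone `apollo-engine` proxy package instead), assuming you already have an API key from the Engine dashboard:

```javascript
// Minimal sketch: an Apollo Server 2 instance reporting to Apollo Engine.
// The schema is a placeholder; the `engine.apiKey` option is what matters.
const { ApolloServer, gql } = require('apollo-server');

const typeDefs = gql`
  type Query {
    hello: String
  }
`;

const resolvers = {
  Query: { hello: () => 'world' },
};

const server = new ApolloServer({
  typeDefs,
  resolvers,
  engine: {
    // Set ENGINE_API_KEY in your environment, or pass the key directly.
    apiKey: process.env.ENGINE_API_KEY,
  },
});

server.listen().then(({ url }) => console.log(`Server ready at ${url}`));
```

Once this is deployed, every query and mutation your users run shows up in the Engine dashboard automatically.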

Call Frequency

The first thing to observe is the most frequent queries and mutations. This can give you insights into both a) the behavior of your users and which parts of the app they find most interesting, and b) the efficiency of your code in terms of how often certain queries and mutations are called.

As you can see, in our example, the mutation for sending a chat message gets called about 750 times more often than updateProfile, and is therefore the higher priority to optimize.

Response Time Based on Distribution

Secondly, we want to take a look at the response time for each of the most frequently called queries and mutations.

The shape of the curve can give you information about the queries or mutations. For example, a normal distribution like the one in the second example can indicate a general problem with the query that causes it to respond slowly at random, whereas a right-skewed distribution can indicate that the response only becomes slow under certain circumstances, for example when a certain GraphQL field is requested or a GraphQL variable is set to a particular value. Clicking on a bar in the histogram lets you trace the queries in that percentile in more depth.
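A hypothetical pair of resolvers shows how such a right-skewed tail can arise: one field resolves from data already in memory, while another pays an extra slow lookup that only runs when a query actually asks for it. `fetchMessages` here is a stand-in for a second database round trip:

```javascript
// Stand-in for a slow data source, e.g. a second database round trip.
async function fetchMessages(userId) {
  return [{ userId, text: 'hello' }];
}

const resolvers = {
  User: {
    name: (user) => user.name,                  // cheap: data already loaded
    messages: (user) => fetchMessages(user.id), // expensive: only runs when requested
  },
};
```

Queries that never select `messages` land in the fast bulk of the histogram; the minority that do select it form the slow right tail.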

However, the most important data here is the response time of the bulk of your calls. Following the logic of fixing the most impactful problems first, we want to look at p95 (the response time that 95% of calls fall under) and make sure it is good enough there, before approaching the issues at p99 and p999.
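To build intuition for those numbers, here is one way (the nearest-rank method) to compute such percentiles from raw timings; Engine’s exact computation may differ:

```javascript
// Nearest-rank percentile over a list of response times in milliseconds.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

// 99 fast responses plus one slow outlier: p95 stays low, while p999
// surfaces the outlier.
const times = Array.from({ length: 99 }, (_, i) => 10 + (i % 5)).concat([900]);
```

This is why fixing p95 first pays off: it covers the experience of nearly all users, while p99 and p999 chase a handful of outliers.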

Apollo Engine provides you with a list of the slowest calls at the p95 percentile:

Looking at the first example, our response time is not that bad, whereas in the second example it’s a bit worse. (Those are the scientific terms!) But how do we know what a good or bad response time is? The answer can be subjective, depending on the query or mutation and the screen the user is interacting with, but the important thing is to show an appropriate loading indicator or other feedback when needed.

Jakob Nielsen talks about when feedback is needed in his book Usability Engineering. From chapter 5:

  • 0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.
  • 1.0 second is about the limit for the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.
  • 10 seconds is about the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done. Feedback during the delay is especially important if the response time is likely to be highly variable since users will then not know what to expect.
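A minimal client-side sketch of applying those thresholds: show nothing for fast calls, a spinner once roughly a second has passed, and extra progress feedback past ten seconds. The function name, callbacks, and delays are illustrative, not from any particular UI library:

```javascript
// Wrap any request promise so feedback only appears when the call is
// actually slow, following Nielsen's 1s / 10s thresholds.
function withFeedback(promise, {
  spinnerDelay = 1000,
  progressDelay = 10000,
  onSpinner = () => {},
  onProgress = () => {},
} = {}) {
  const spinnerTimer = setTimeout(onSpinner, spinnerDelay);
  const progressTimer = setTimeout(onProgress, progressDelay);
  // Whether the call succeeds or fails, cancel any pending feedback.
  return promise.finally(() => {
    clearTimeout(spinnerTimer);
    clearTimeout(progressTimer);
  });
}
```

Wrapping a mutation this way means a sub-second chat message renders with no spinner flicker, while a genuinely slow profile save still gets visible feedback.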

Verifying fixes

After every optimization, we can go back to Apollo Engine to verify that our fix has actually improved the call frequency or the response time.

When looking at data over time, make sure to take into account other patterns that may be affecting the rate of the calls. Day of the week, holidays, and marketing campaigns can influence your data, so be cautious of the conclusions you make based on the graphs.

This concludes our look at Apollo Engine as a tool to guide our optimizations. I hope you found this post helpful; let us know how you use Apollo Engine to learn about your app. To get started with Apollo Engine, check out the official documentation.