EXPEDIA GROUP TECHNOLOGY — SOFTWARE

How to Run GraphQL Directive-Driven Capacity Tests at Scale

Vrbo’s framework for testing its site using Apollo Server and hapi plugin support

Pat McCarthy
Expedia Group Technology


Every year, vacationers flock to the Vrbo™ site (part of Expedia Group™) around the holidays. It’s a time when families come together and plan their upcoming vacations, and a busy time for us: we spend a lot of energy ensuring we can handle all of the traffic spikes that come with the season. This year, however, we had an even bigger challenge: a bowl game sponsorship and many high-profile television advertisements that would drive large spikes to our system, particularly to our mobile apps.

Citrus Bowl logo with the Vrbo brand
The Citrus Bowl sponsorship drove huge traffic spikes to our website and mobile apps

At Vrbo, we ensure our system can handle these sorts of spikes by running capacity tests. Prior to adopting GraphQL and Apollo Server, our capacity tests relied on replaying unauthenticated/anonymized HTTP GET traffic. This is a safe way to ensure that we are not inadvertently replaying traffic that we should not. HTTP POST requests can change state, while GET requests do not. This style of test can provide a lot of confidence that our applications and routes with the highest call rates and scaling needs are ready for action.

But our GraphQL traffic is always sent as a POST request! These requests were not getting replayed during our load tests, so we were not stressing the systems we expected to stress. Replaying all POST requests was not an option. We needed a smarter, more targeted approach that provided the coverage we needed to know our systems could handle the strain, without changing the state of our system or writing invalid content to data stores.

Diagram showing the edge logging HTTP GET requests to S3 for consumption by the load testing tool, but not POST requests
GraphQL POST requests were not captured for load tests

At first we built custom log scrapers for high-load queries. We extracted request data plumbed through application logs and processed it with bash scripts to fit the load framework’s input format. Replaying searches with query arguments mined from application logs let us use real request parameters for all of our mobile searches in the peak capacity tests, and it successfully exercised the most critical service dependencies for our mobile app during traffic spikes. But it used a predefined query format and only worked for search queries, so we mirrored the approach for the search and details pages.

This approach gave us some coverage, but it didn’t truly represent all production traffic, and the scope problem kept growing as our website shifted most of its traffic over to GraphQL. Creating a new custom test for every new GraphQL operation was not an option.

Evolution of the prior diagram, showing POST requests stored in S3 via a log data scraper run on application logs.
Ad hoc log scraping is a labor intensive and error-prone manual process

Further complicating the situation, a wide variety of GraphQL traffic can come from a single endpoint — especially from our mobile apps — which means simply replaying traffic from certain routes would not completely solve the problem for us either.

At this point, we stepped back and defined requirements for what we really needed.

GraphQL capacity test requirements

1. Application-level logging enabled with a single systemwide switch

  • Logic can’t live at the edge as our GET logging does, because we would need to read and evaluate every POST body, which would be expensive and add latency.
  • Deciding what to log is not trivial and requires knowledge of the API; only the GraphQL Server can really know which API calls are safe to log.
  • Our performance engineering team manages and runs our capacity tests. We need a way for them to turn on and off log collection for all applications at once.
  • Since we cannot rely on our edge to log the POST traffic for replay, we need to do it at the application level, but the cost/complexity of handling this one application at a time is prohibitive.

2. All replay traffic anonymized without any authentication or personally identifying information

We didn’t want to open up a security hole by logging data we shouldn’t, like valid authentication headers or passwords. Anonymizing that data could also invalidate some queries: even if they didn’t change state, some queries relied on valid authentication headers or other secure information that we did not want to store for replay.

3. Opt into replay of GraphQL operations, not opt out

Requests need to be opted into the peak capacity test (PCT). We don’t want to rerun some requests during a PCT, like booking requests and inquiry submissions, because they may change the state of the system. And no matter where a request is executed, we must never accidentally log something we’re not supposed to.

Explicitly selecting which GraphQL operations to replay is the safest approach to deciding what to log. A wide variety of requests can go through a single GraphQL endpoint. An opt-out or blocklist approach is error-prone, and can get out of date with what is actually being requested.

When a client makes a GraphQL request, we examine the request and mark it as safe to log or not based on whether the operation has been labeled as replayable.

4. The GraphQL schema is the single source of truth for what can be replayed

Making the schema the source of truth for what can be replayed has the following benefits:

  • Makes the decision discoverable
  • Puts the decision in the hands of the schema owner/domain expert
  • Applies the decision universally to all consumers of the schema
  • Keeps the decision traceable to a single code change/ticket, making security reviews easier to manage

Due to how we compose our GraphQL APIs, schemas are reused across many applications. That makes it difficult to accurately represent the load of our entire system during a PCT without a centralized decision about what to log. Relying on each client to configure their load testing queries would put the decision in the wrong hands and would likely lead to drift from reality.

Final architecture

An evolution of the first diagram, showing plugins on each app communicating to the A/B platform and to Kafka, thence to S3
A hapi plugin gates logic for which operations can be replayed

Implementation

After breaking down the problem, we identified four discrete pieces of functionality we needed to implement:

  1. Managing which routes to record and providing hooks to inject business logic for how and what to log
  2. Configuring and controlling PCT logging across many applications
  3. Deciding which GraphQL requests to log
  4. Collecting the logs and running the load tests

Managing which routes to record and providing the hooks

We needed a framework to hook our GraphQL-specific logic into our existing webapps. To do that, we built a generic internal plugin, hapi-request-logging, that provides those hooks.

Because hapi allows you to apply plugins to routes, we effectively solved the “which routes to log” problem by encapsulating the logic in a plugin — we just configure which routes to record by including the plugin in those route configurations. To make things simple for our webapp developers, we include this plugin whenever we configure the GraphQL server plugin for a route.
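As a rough sketch of that wiring (hapi-request-logging is an internal plugin, so the import path, route option, and port below are illustrative assumptions, not its real API):

```typescript
// Sketch only: hapi-request-logging is an internal Vrbo plugin, so its import
// path and options here are assumptions for illustration.
import Hapi from '@hapi/hapi';
import requestLogging from 'hapi-request-logging'; // internal package, not on public npm

async function start(): Promise<void> {
  const server = Hapi.server({ port: 8080 });

  // Register the request-logging plugin once for the whole server.
  await server.register({ plugin: requestLogging });

  // Opt individual routes into recording via hapi's route-level plugin
  // configuration, right next to the GraphQL handler for that route.
  server.route({
    method: 'POST',
    path: '/graphql',
    handler: () => 'ok', // stand-in for the GraphQL server plugin's handler
    options: {
      plugins: {
        'hapi-request-logging': { record: true }, // hypothetical route option
      },
    },
  });

  await server.start();
}

start();
```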

hapi-request-logging provides three hooks for injecting business logic. These are lists of functions that are applied during the hapi request lifecycle:

  • transformers: how to map the request object to your log format
  • loggers: how and where to log — if needed you could log to many places.
  • skippers: determine what requests to log — again, we can apply multiple skipper functions to refine what is being logged with different logical checks. In our implementation, we have two skippers: an on/off switch for log collection, and another for injecting our GraphQL-specific business logic.

This provides all the generic tools we need to log GraphQL traffic to a location that the performance engineers can collect inputs from. Now we just need to provide the business logic to inject into the transformers, loggers, and skippers hooks.
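Because hapi-request-logging is internal, its exact API isn’t public. The TypeScript sketch below only illustrates the shape of the three hook families, assuming the plugin accepts them as arrays of functions in its registration options:

```typescript
import type { Request } from '@hapi/hapi';

// Illustrative shapes for the three hook families; the real internal plugin's
// option names and signatures may differ.
type Transformer = (request: Request) => Record<string, unknown>;
type Logger = (entry: Record<string, unknown>) => Promise<void>;
type Skipper = (request: Request) => boolean; // true means "skip this request"

// Hypothetical registration options for hapi-request-logging.
export const requestLoggingOptions: {
  transformers: Transformer[];
  loggers: Logger[];
  skippers: Skipper[];
} = {
  // transformers: map the hapi request to the log format we want to store
  transformers: [
    (request) => ({
      path: request.path,
      method: request.method,
      payload: request.payload,
    }),
  ],

  // loggers: where to send each transformed entry (console here; Kafka in production)
  loggers: [
    async (entry) => {
      console.log(JSON.stringify(entry));
    },
  ],

  // skippers: any skipper returning true prevents the request from being logged
  skippers: [
    (request) => request.method !== 'post', // only record POST traffic
  ],
};
```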

Configuring and controlling capacity test logging across many applications

The second piece is providing the logic for where requests are logged and when logging is enabled. This is handled by another internal plugin, hapi-pct-logging, which contains all of the implementation-specific logic about where and how to log, as well as the universal switch to turn on log capture.

In other words, hapi-pct-logging provides a transformer, a skipper, and a logger to hapi-request-logging.

The transformer function it provides maps the hapi request object to the Avro schema that defines the Kafka topic. This is where we filter out headers and other request information that we do not want to include in our load test.

The skipper function acts as our universal switch to turn on logging. It checks whether the global feature gate that enables logging is turned on.

The logger function handles logging the transformed request to Kafka.
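A condensed sketch of those three functions, with kafkajs standing in for our Kafka client, a placeholder feature-gate check standing in for our A/B platform, and a plain JSON record standing in for the real Avro schema:

```typescript
import { Kafka } from 'kafkajs';
import type { Request } from '@hapi/hapi';

// Assumptions for illustration: kafkajs as the Kafka client, a placeholder
// feature-gate check, and a simplified record shape instead of the real Avro schema.
const kafka = new Kafka({ clientId: 'hapi-pct-logging', brokers: ['kafka:9092'] });
const producer = kafka.producer(); // producer.connect() is awaited at plugin startup

const SENSITIVE_HEADERS = new Set(['authorization', 'cookie', 'x-csrf-token']);

// transformer: map the hapi request to the topic's record shape, dropping
// headers we never want to store for replay.
export function transformer(request: Request) {
  const headers = Object.fromEntries(
    Object.entries(request.headers).filter(([name]) => !SENSITIVE_HEADERS.has(name)),
  );
  return {
    path: request.path,
    method: request.method,
    headers,
    payload: request.payload,
  };
}

// skipper: the universal switch. Skip logging unless the global feature gate
// for PCT request capture is turned on.
export function skipper(_request: Request): boolean {
  return !isFeatureEnabled('pct-request-capture'); // hypothetical gate name
}

// logger: write the transformed request to the Kafka topic that feeds S3.
export async function logger(entry: ReturnType<typeof transformer>): Promise<void> {
  await producer.send({
    topic: 'pct-replay-requests', // hypothetical topic name
    messages: [{ value: JSON.stringify(entry) }],
  });
}

// Placeholder for the real A/B platform client.
function isFeatureEnabled(gate: string): boolean {
  return process.env[`FEATURE_${gate.toUpperCase().replace(/-/g, '_')}`] === 'on';
}
```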

Now we had a generic framework for logging POST requests to a well-defined location, one that could be turned on and off across applications with a single switch. That left us with building out support for determining which GraphQL requests we could log.

Deciding which GraphQL requests to log

With these two plugins, we had all we needed to log POST requests to a Kafka topic whenever the performance engineering team enabled the feature gate throttle in our A/B testing tool. But we still needed to filter out traffic we shouldn’t be replaying. This was a big problem for our mobile GraphQL API, where all our traffic goes through a single endpoint.


We decouple schema ownership from application deployment ownership, and many different applications can host the same schema. It’s not feasible to ask every service owner to determine which GraphQL requests they feel are safe to log. The schema owners are the proper owners for that decision.

Additionally, the schema really is the interface of a GraphQL server, so by annotating the schema with a directive that determines what is replayable, we can make that information more discoverable.

We decided to define a GraphQL directive called @replayable, which can be added to queries and mutations. We wrap the resolve function of each annotated field with code that decorates the request context, so that at runtime, when that operation executes, the request is marked as safe to log for replay.
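One way to express that wrapping with the public @graphql-tools/utils helpers (our internal graphql-server library does this inside its own wrapper, and the context flag name here is an assumption):

```typescript
import { mapSchema, getDirective, MapperKind } from '@graphql-tools/utils';
import { defaultFieldResolver, GraphQLSchema } from 'graphql';

// Wrap every field annotated with @replayable so that executing it marks the
// request context as eligible for PCT logging. The `replayable` flag name on
// the context is an illustrative assumption.
export function replayableDirectiveTransformer(schema: GraphQLSchema): GraphQLSchema {
  return mapSchema(schema, {
    [MapperKind.OBJECT_FIELD]: (fieldConfig) => {
      // Only touch fields that carry the @replayable directive.
      const directive = getDirective(schema, fieldConfig, 'replayable')?.[0];
      if (!directive) {
        return fieldConfig;
      }

      const { resolve = defaultFieldResolver } = fieldConfig;
      return {
        ...fieldConfig,
        resolve: (source: unknown, args: unknown, context: any, info: any) => {
          // Decorate the context so a downstream check can see that this
          // operation was explicitly opted in to replay.
          context.replayable = true;
          return resolve(source, args, context, info);
        },
      };
    },
  });
}
```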

When we register the GraphQL server on a route in our hapi server, we register the hapi-request-logging plugin and add another skipper function that evaluates the state of the decorated request context.
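That skipper only needs to read the flag the resolver wrapper set. Assuming the GraphQL context shares a flags object with request.app (an implementation detail not covered in this post), it could look something like:

```typescript
import type { Request } from '@hapi/hapi';

// Assumption: when building the GraphQL context for a request, we also stash
// the same flags object on request.app, so whatever the @replayable resolver
// wrapper wrote during execution is visible here.
export function graphqlReplaySkipper(request: Request): boolean {
  const flags = request.app as { replayable?: boolean };
  // Skip (do not log) unless the executed operation was marked replayable.
  return flags.replayable !== true;
}
```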

We also ensure that the directive is defined for all instances of our internal graphql-server library (which wraps apollo-server).

All that’s left is to decorate the operations we want to expose to our peak capacity test. After adding the directive, the schema definition looks like this:
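(Illustrative sketch only: the type and field names below are made up, and we use graphql-tag’s gql helper; the point is where @replayable lands.)

```typescript
import gql from 'graphql-tag';

// Illustrative schema, not our production one: a read-only query opted in to
// replay with @replayable, and a state-changing mutation left out.
export const typeDefs = gql`
  directive @replayable on FIELD_DEFINITION

  type SearchResults {
    total: Int!
  }

  type Query {
    "Read-only search; safe to capture and replay during a peak capacity test."
    propertySearch(term: String!): SearchResults @replayable
  }

  type Mutation {
    "Changes state, so it is never opted in to replay."
    submitInquiry(message: String!): Boolean
  }
`;
```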

Collecting and running the requests with our load testing framework

A key win for this change was reducing the amount of overhead per PCT. At the time we finished this work, we had three ad hoc tests that generated load, and we were likely going to need more and more if we wanted to keep improving our API coverage.

Now, our reliability engineers have a single, simple process:

  1. Enable a feature gate throttle in our A/B testing framework
  2. Wait
  3. Run a quick job to extract the data from S3

And as we scale out our GraphQL schema to different applications and teams, we can still get coverage for PCT tests for those calls. The performance engineering team’s process never changes — they just get that traffic for free.

Case study: Prepping mobile for peak capacity during the Citrus Bowl

With all of the infrastructure in place, we were able to set up a load test prior to peak traffic that pushed all of our systems to their limits. We effectively tested mobile traffic at spikes of up to five times our daily maximum with real, representative queries.

Graphs showing stable response times while load is ramping up
During the capacity test, our response times stay relatively stable (left) while traffic ramps up (right)

That load test did strain our infrastructure and made it clear that we needed to scale it up before January 1.


When that day arrived, we saw big spikes, but at significantly lower load than we had tested (only around three times our peak traffic), and we sailed through them without much difficulty. The system had minor hiccups initially as we scaled up to support the new traffic rate, but we saw only small increases in our p95 and p99 response times.

Graphs showing mostly graceful degradation with minor spikes
For the Citrus Bowl surge, our response times stayed stable (left) despite traffic spikes (right)

Learn more about technology at Expedia Group
