Moonpig’s Journey to GraphQL — Part 2

Jakob Nordlander
Published in Moonpig Tech Blog
Oct 22, 2020

In this series of blog posts we share Moonpig’s journey to start using GraphQL. In part 1 we talked about why we decided to use GraphQL. In this part we will tell you about how we created a GraphQL Gateway to allow multiple teams to contribute to the Moonpig graph, while still allowing each team to work independently.

The monolithic GraphQL API

A common way of implementing a GraphQL API is to have a single GraphQL server with a single schema and many resolvers. The resolvers will point to data sources (other APIs, databases etc.) that are developed by different teams. To manage this API you have a couple of options. You could have a dedicated GraphQL team that works closely with the other teams to integrate any required changes. Another option is that all teams are responsible for a part of the GraphQL API, maintaining the schema and resolvers that belong to them. We felt that both of these approaches have some drawbacks.

At Moonpig we believe that teams should be able to develop their software autonomously and dependencies on other teams should be minimised. Therefore we were not keen on the option of having a dedicated team to implement the GraphQL API.

That left us with the option of having multiple teams working in the GraphQL API. Speaking with other companies about their experiences, and from our own experience doing this for other projects, we knew that having a single repository with multiple teams working in it can get messy.

Some problems we have seen with a shared codebase are:

  • Unclear ownership of common code
  • Unclear ownership of broken builds and deployments
  • Unclear ownership of production issues

To avoid these problems we were keen on exploring the options available to us that would allow teams to work independently on their own GraphQL APIs, but only expose a single schema to the clients from one HTTP endpoint.

The search for an alternative

As with most software engineering problems the first step is to see if anyone else has already solved the problem. We pretty soon came across a promising technology called “schema stitching”. Schema stitching was developed by Apollo and they define it as:

“the process of creating a single GraphQL schema from multiple underlying GraphQL APIs.”

Schema stitching is implemented inside the graphql-tools package. It exposes methods to merge multiple schemas together and it allows you to make modifications to the schemas in the process (if required). The merged schema can be served by Apollo Server (Apollo’s open source spec-compliant GraphQL server). Each merged schema will have an associated data source that tells Apollo Server how to get the data for that schema. At Moonpig we call the Apollo Server that merges the schemas together the GraphQL Gateway, and the underlying GraphQL APIs micrographs (a portmanteau of microservices and GraphQL).
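To make this concrete, below is a minimal sketch of stitching micrographs together with graphql-tools as it existed at the time; the micrograph URLs and the buildGatewaySchema name are made up for illustration.

import fetch from 'node-fetch';
import { HttpLink } from 'apollo-link-http';
import { introspectSchema, makeRemoteExecutableSchema, mergeSchemas } from 'graphql-tools';

// Hypothetical micrograph endpoints, for illustration only
const micrographUrls = [
  'https://products.micrograph.internal/graphql',
  'https://customers.micrograph.internal/graphql',
];

export async function buildGatewaySchema() {
  const schemas = await Promise.all(
    micrographUrls.map(async (uri) => {
      // The link is used for the introspection query and later becomes the
      // data source the Gateway delegates to for this part of the graph
      const link = new HttpLink({ uri, fetch: fetch as any });
      const remoteSchema = await introspectSchema(link);
      return makeRemoteExecutableSchema({ schema: remoteSchema, link });
    })
  );
  // A single merged schema is exposed to clients from one endpoint
  return mergeSchemas({ schemas });
}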

To figure out if schema stitching would be suitable we ran a series of POCs. The tests were mostly successful but uncovered a few issues that needed further investigation. The main ones were:

  1. Cache-Control headers were not propagated from the micrographs to the client
  2. An update to a micrograph schema would require a redeployment of the GraphQL Gateway to re-introspect and compile a new merged schema.
  3. When the Gateway is initialized there is a dependency on the micrographs to serve their schema definitions via introspection queries. If they are not available to respond to the introspection requests, the Gateway is unable to include them in the merged schema.
  4. Any GraphQL operation that has connected queries (the output from one query is used as input to another) would either need special resolvers added to the Gateway or would need to be broken up into multiple client-server requests.

We also discovered a more pressing issue: schema stitching was no longer being actively maintained. Apollo had just released another technology, Apollo Federation, to replace it. Federation promised to solve a lot of the problems we had identified with schema stitching.

We now faced a decision: we could either implement Federation, or we could stick with schema stitching. If we went with Federation we would be one of the early adopters, potentially experiencing quite a few teething issues. If we went with schema stitching we would most likely need to migrate to Federation in the future.

To help us make the decision we answered the following questions:

  • How many companies are using Federation in production?
  • How easy is it to find information about how to set up Federation?
  • What is the main reason for moving from schema stitching to Federation?
  • How difficult would it be to move to Federation later?
  • How much throwaway work would we do if we started by implementing schema stitching?
  • How many of the problems we discovered with schema stitching would be solved by Federation?

In summary, we came to the conclusion that Federation would be a better option for us. However, for the reasons listed below, we decided to delay that work and implement schema stitching first.

  • Federation was brand new and we didn’t want to be early adopters
  • We didn’t have any immediate need for the improvements made over schema stitching
  • The work required to move from schema stitching to federation appeared manageable

We will come back to Federation and its benefits in a later blog post so stay tuned for that!

With that decision made we needed to resolve the problems we had found with schema stitching.

Queries across micrographs

In schema stitching, querying across micrographs requires explicitly declaring those relationships in the Gateway, as sketched below. This would turn the Gateway into a single repository with many teams working in it, which is exactly what we were trying to avoid.
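For illustration, declaring such a relationship in the Gateway with the stitching API of the time would look roughly like this; the Order and Product types, their fields, and the ordersSchema/productsSchema variables are hypothetical.

import { mergeSchemas } from 'graphql-tools';

// ordersSchema and productsSchema are assumed to be the remote executable
// schemas for two micrographs; the relationship itself lives in the Gateway
const linkTypeDefs = `
  extend type Order {
    product: Product
  }
`;

const gatewaySchema = mergeSchemas({
  schemas: [ordersSchema, productsSchema, linkTypeDefs],
  resolvers: {
    Order: {
      product: {
        fragment: '... on Order { productId }',
        resolve(order: any, _args: unknown, context: any, info: any) {
          // The Gateway has to know how to delegate to the products micrograph
          return info.mergeInfo.delegateToSchema({
            schema: productsSchema,
            operation: 'query',
            fieldName: 'product',
            args: { id: order.productId },
            context,
            info,
          });
        },
      },
    },
  },
});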

It turns out that this is one of the problems that Federation solves in a much better way. When we looked at our schema we realised that we had no immediate need for these types of queries, so we decided to avoid them until we were ready to move to Federation.

Forwarding Cache Headers

Apollo Server has the concept of setting cache headers by using schema directives. This, paired with persisted queries allows you to cache GraphQL responses at the edge. We found that when we introduced the Gateway in front of the micrographs, the cache headers that they returned were not included in the Gateway’s combined response.
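For context, this is roughly what the directive looks like in a micrograph; the Product type, resolver and configuration below are illustrative rather than Moonpig's actual schema.

import { ApolloServer, gql } from 'apollo-server-lambda';

const typeDefs = gql`
  type Product @cacheControl(maxAge: 1800, scope: PUBLIC) {
    id: ID!
    name: String!
  }

  type Query {
    product(id: ID!): Product
  }
`;

const resolvers = {
  Query: {
    product: (_: unknown, { id }: { id: string }) => ({ id, name: 'Birthday card' }),
  },
};

const server = new ApolloServer({
  typeDefs,
  resolvers,
  // Ask Apollo Server (v2) to translate the directive hints into a Cache-Control header
  cacheControl: { calculateHttpHeaders: true },
});

export const handler = server.createHandler();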

We used two pieces of Apollo functionality to propagate the cache control headers to the client. The first, apollo-link, makes it possible to add middleware to the micrograph request pipeline. The second, an Apollo Server plugin, allows us to modify the response headers before they are sent back to the client.

Our custom link component is inserted into the link stack for each micrograph. When it receives a response, it looks for cache control headers. If it finds any, it pushes them to an array of cache headers stored in the request context, which makes them accessible to the Apollo Server plugin. We need an array because a single Gateway request can fan out to multiple micrographs, so we collect the headers from each one and then decide which to return to the client.

return new ApolloLink((operation, forward) => {
  return forward(operation).map((response) => {
    const context = operation.getContext();
    // apollo-link-http stores the raw HTTP response on the operation context
    const headers = context.response.headers;
    if (context.graphqlContext) {
      if (!context.graphqlContext.resultCacheHeaders) {
        context.graphqlContext.resultCacheHeaders = [];
      }
      // Collect this micrograph's cache-control header for the plugin to combine later
      const ccHeader = new CacheControl(headers);
      context.graphqlContext.resultCacheHeaders.push(ccHeader);
    }
    return response;
  });
});
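The link above has to sit in front of the HTTP link that actually calls the micrograph. A rough sketch of that wiring, where createCacheHeaderLink is a hypothetical factory around the code above and micrographUrl/introspectedSchema are placeholders:

import fetch from 'node-fetch';
import { ApolloLink } from 'apollo-link';
import { HttpLink } from 'apollo-link-http';
import { makeRemoteExecutableSchema } from 'graphql-tools';

// The cache-header link runs before the HTTP transport, so every micrograph
// response passes through it on the way back to the Gateway
const link = ApolloLink.from([
  createCacheHeaderLink(),
  new HttpLink({ uri: micrographUrl, fetch: fetch as any }),
]);

const executableSchema = makeRemoteExecutableSchema({ schema: introspectedSchema, link });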

Apollo Server's plugin API exposes a willSendResponse lifecycle hook that is executed once per Gateway request, just before the response is sent back to the client. At this point our plugin aggregates the cache control headers stored in the request context into a single header. The result is a combination of the least permissive options.

For example, if micrograph A returns “cache-control: private” and micrograph B returns “cache-control: public, max-age=1800”, then the Gateway responds with “cache-control: private”.

If micrograph A returns “cache-control: public, max-age=1800” and micrograph B returns “cache-control: public, max-age=900” the Gateway would respond with “cache-control: public, max-age=900”, as it has the lower max-age.
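Moonpig's CacheControl class isn't shown in full here; it is constructed from the response headers and exposes outputHeaders(), as the snippet below uses. A minimal sketch of the least permissive merge it performs might look like this (illustrative, not the production implementation):

type Scope = 'public' | 'private';

class CacheControl {
  constructor(readonly scope: Scope, readonly maxAge: number = 0) {}

  combine(other: CacheControl): CacheControl {
    // "private" always wins over "public"; otherwise the shorter max-age wins
    const scope: Scope =
      this.scope === 'private' || other.scope === 'private' ? 'private' : 'public';
    return new CacheControl(scope, Math.min(this.maxAge, other.maxAge));
  }

  toHeaderValue(): string {
    return this.scope === 'private' ? 'private' : `public, max-age=${this.maxAge}`;
  }
}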

// Combine the collected micrograph headers into a single, least permissive header
const resultCacheHeader: CacheControl =
  requestContext.context["resultCacheHeaders"]
    .reduce((combined: CacheControl, header: CacheControl) => {
      if (!header) return combined;
      return combined.combine(header);
    });

// Write the combined cache-control header onto the Gateway response
const resultantHeaders = resultCacheHeader.outputHeaders();
Object.entries(resultantHeaders).forEach(([name, value]) => {
  responseHttp.headers.append(name, value);
});
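For completeness, this is roughly how that logic might be hosted in an Apollo Server 2 plugin; the plugin below is a sketch rather than Moonpig's actual code, and mergedSchema stands for the schema produced by the stitching step.

import { ApolloServer } from 'apollo-server-lambda';
import { ApolloServerPlugin } from 'apollo-server-plugin-base';

const cacheHeaderPlugin: ApolloServerPlugin = {
  requestDidStart() {
    return {
      willSendResponse(requestContext) {
        const collected = requestContext.context['resultCacheHeaders'];
        if (!collected || collected.length === 0) return;
        // ...reduce the collected headers and append the result to
        // requestContext.response.http.headers, as in the snippet above
      },
    };
  },
};

// Registered on the Gateway's Apollo Server instance
const server = new ApolloServer({ schema: mergedSchema, plugins: [cacheHeaderPlugin] });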

Handling Schema Updates & Failing Introspections

At Moonpig we run most of our software inside AWS Lambda. To run Apollo Server inside a Lambda function you create an instance of Apollo Server and call createHandler on it; the resulting handler is used to respond to Lambda requests. To create the Apollo Server you need to pass it the merged schema, and to get that schema the Apollo documentation suggests introspecting the micrographs and passing the responses to the mergeSchemas function. Once the handler has been created the schema cannot be changed, which raised some concerns (the basic setup is sketched after the list below).

  • What do you do if one (or many) of the micrographs are unavailable during introspection?
  • How are updates to the schema applied?
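For reference, a sketch of the basic setup that creates this constraint, assuming mergedSchema is the schema produced by the stitching code shown earlier:

import { ApolloServer } from 'apollo-server-lambda';

// Built once at cold start; after createHandler() is called the schema is
// fixed for the lifetime of this Lambda instance
const server = new ApolloServer({ schema: mergedSchema });

export const handler = server.createHandler();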

We needed to come up with a solution that enabled the schema to change after the lambda had started.

We solved this by introducing a caching mechanism for the Lambda handler. When the Lambda receives a request it first attempts to retrieve an existing handler from the cache. If one is available it is used to send the response; if not, a new handler is created and then used to respond to the request. The handler is cached for 60 seconds, so if a micrograph is unavailable during the first startup, 60 seconds is the longest it would be excluded from the schema.
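A minimal sketch of the handler cache, assuming buildGatewaySchema performs the introspection and merge shown earlier; the names and structure are ours for illustration, not the production code.

import { ApolloServer } from 'apollo-server-lambda';
import { APIGatewayProxyEvent, APIGatewayProxyResult, Context } from 'aws-lambda';
import { GraphQLSchema } from 'graphql';

declare function buildGatewaySchema(): Promise<GraphQLSchema>; // introspect + merge, as above

type LambdaHandler = (event: APIGatewayProxyEvent, context: Context) => Promise<APIGatewayProxyResult>;

const HANDLER_TTL_MS = 60 * 1000; // rebuild (and re-introspect) at most once a minute

let cachedHandler: LambdaHandler | undefined;
let cachedAt = 0;

async function getHandler(): Promise<LambdaHandler> {
  if (!cachedHandler || Date.now() - cachedAt > HANDLER_TTL_MS) {
    const schema = await buildGatewaySchema();
    const apolloHandler = new ApolloServer({ schema }).createHandler();
    // apollo-server-lambda v2 hands back a callback-style handler; wrap it so it can be awaited
    cachedHandler = (event, context) =>
      new Promise((resolve, reject) =>
        apolloHandler(event, context, (err, result) => (err ? reject(err) : resolve(result!)))
      );
    cachedAt = Date.now();
  }
  return cachedHandler;
}

export const handler: LambdaHandler = async (event, context) => (await getHandler())(event, context);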

This approach ensures that we can self-heal from micrograph introspection failures, and it also ensures that schema updates are picked up regularly. However, it comes with a performance penalty.

Whenever the Gateway handler is recreated it introspects all the micrographs again, which can take a few seconds to complete. The obvious answer is to extend the handler cache. However, we didn't want to do that, because if an introspection fails, that instance of the Gateway would be unable to respond to queries for that part of the graph for the whole cache duration.

To solve this issue we started caching the schemas from the introspection as well. This way, if we receive a successful response to the introspection we can cache that schema, but if it fails we know that the introspection will be retried the next time the handler cache expires. We configured the schema cache to be 5 minutes.

If a micrograph is unhealthy, and we are unable to introspect it, we will wait at most 1 minute before trying again. If the micrograph is healthy we don’t do unnecessary introspections but schema updates will appear within at most 6 minutes (5 minutes schema cache + 1 minute handler cache).
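A sketch of that second cache layer; getMicrographSchema and the cache shape below are illustrative, not the production code.

import fetch from 'node-fetch';
import { HttpLink } from 'apollo-link-http';
import { introspectSchema, makeRemoteExecutableSchema } from 'graphql-tools';
import { GraphQLSchema } from 'graphql';

const SCHEMA_TTL_MS = 5 * 60 * 1000; // successful introspections are reused for 5 minutes

const schemaCache = new Map<string, { schema: GraphQLSchema; fetchedAt: number }>();

// Returns undefined when a micrograph cannot be introspected and we have no cached
// copy; the Gateway then builds the merged schema without it, and the introspection
// is retried when the handler cache (1 minute) next expires
async function getMicrographSchema(uri: string): Promise<GraphQLSchema | undefined> {
  const cached = schemaCache.get(uri);
  if (cached && Date.now() - cached.fetchedAt < SCHEMA_TTL_MS) return cached.schema;
  try {
    const link = new HttpLink({ uri, fetch: fetch as any });
    const schema = makeRemoteExecutableSchema({ schema: await introspectSchema(link), link });
    schemaCache.set(uri, { schema, fetchedAt: Date.now() });
    return schema;
  } catch {
    return cached?.schema; // fall back to a stale schema if we ever had one
  }
}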

Conclusion

This version of the Gateway has been running in production for the last ~18 months without any major issues. Micrographs sometimes become unavailable but the Gateway self-heals as soon as they come back up and it doesn’t require any intervention from us. Teams have successfully been able to work in isolation and the Gateway has not had to be changed due to a requirement in a micrograph.

Overall the stitched Gateway has been a success, but now it is time to move forward with federation to enable us to take advantage of the improvements made. We will share more about this in the next part.
