Glassdoor’s Journey to GraphQL Federation

Sankha Pathak
Glassdoor Engineering Blog
7 min read · Feb 20, 2020

Glassdoor’s GraphQL journey, and how we federated our graph using Apollo Federation and GraphQL Manager.

At Glassdoor, our GraphQL journey started in early 2017, around the time we started designing our new product, Know Your Worth. Since then the GraphQL stack at Glassdoor has evolved significantly, and the goal of this post is to highlight some of the key milestones we have hit and share some key insights into our federated graph.

What is federation? It is the act of breaking down a large monolithic data graph into smaller, domain-driven subgraphs that can reference and build on each other’s schemas. They are built following the principles listed at https://principledgraphql.com/
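As a rough illustration (a hedged sketch using Apollo Federation’s v1 SDL, with hypothetical Employer and Review types rather than our real schema), one subgraph owns an entity and another extends it:

const { gql } = require('apollo-server');

// Content subgraph: owns the Employer entity and declares its key.
const contentTypeDefs = gql`
  type Employer @key(fields: "id") {
    id: ID!
    name: String!
  }
`;

// Reviews subgraph: extends Employer with fields from its own domain.
const reviewsTypeDefs = gql`
  extend type Employer @key(fields: "id") {
    id: ID! @external
    reviews: [Review!]!
  }

  type Review {
    id: ID!
    rating: Int!
  }
`;

A gateway composes these subgraph schemas into one graph that clients query through a single endpoint.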

Foot in the door

Content Graph: With the Know Your Worth (KYW) project we launched our first GraphQL service, the “Content Graph”, to serve employer-related data to all our upstream layers. This was a Java-based Spring Boot service that we built on graphql-spring-boot-starter 3.5.0 with the graphql-java 3.0 that was available at the time. We were able to roll out the content graph with basic caching and batch support. In reality, we spent much of the next year porting features from our old Spring-injected service layers to the content graph, fully committed to our path forward. Over time we followed this pattern to roll out many more GraphQL service layers, and these new instances often talked to databases directly over an ORM layer.

Rising complexity

Initially, most of our front-end (SSR, CSR) layers talked directly to the content GraphQL layer to serve content types such as reviews, salary data, and other related entities. Our caching architecture had matured, our performance metrics met our SLAs, and with the increased adoption of GraphQL, multiple GraphQL services were now in production, each serving its own data set.

So when we started defining our new “Collections” product, which was to allow our users to save any data set that we surface on our site and build a collection of heterogeneous content, we had a new problem to solve: how do we serve data to a collections service through multiple GraphQL services?

The Collections team considered designing a GraphQL collections service with a REST endpoint storing the core collection entities and essential metadata, with UI layers routing to these APIs via our API Gateway. The downside was that none of this data would come hydrated for the UI layers to consume, so an intermediate aggregation service would need to call multiple services to hydrate it. This left us with two questions to address:

Should we treat the collections service only as a metadata store and then the client can call downstream services to hydrate the information?

This conforms to our core domain-driven design principles, but what about performance on the client side? We calculated that the number of calls from the client would be on the order of 6–10 per collections page visit, and that number would keep growing with every new collection module that showed up. We concluded this design needed to be improved.

To reduce the number of calls, could we build APIs inside the collections service that talk to the services that hydrate this data?

This would solve the performance problem, but over time, whenever an entity changes in a downstream service, the collections service has to change with it, and the maintenance effort becomes a concern.

As more teams adopted GraphQL, making multiple parallel calls to subgraphs to hydrate entities became an overhead on the client side, in our case Node SSR apps. As we brainstormed options to remove this overhead, the idea we kept coming back to was:

If only there were a way to join queries from multiple entity-type graph services, we could resolve the performance overhead and get better maintainability.

We were solving for keeping upstream calls to a minimum: Node app clients that depend on a single entity type call the respective graph service directly, while clients that depend on heterogeneous entity types call a service that joins the queries by calling the multiple graph services wired to serve each entity type.
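To make the idea concrete, here is the kind of single joined query we wanted a client to be able to send to one endpoint instead of fanning out itself; the Collection, Employer, and JobListing fields are illustrative, not our production schema:

const gql = require('graphql-tag');

// One query against the joined graph; each fragment is resolved by the
// subgraph that owns that entity type.
const COLLECTION_PAGE_QUERY = gql`
  query CollectionPage($id: ID!) {
    collection(id: $id) {
      name
      items {
        ... on Employer {
          name
          overallRating
        }
        ... on JobListing {
          title
          location
        }
      }
    }
  }
`;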

Ready to go big

Graph Gateway: Given all the conversations we had and the vision of where we wanted to head, including our growing trust in the Apollo codebase across our upstream React code, we started developing a proof of concept with Apollo Server. The idea was to delve deeper into the federation world and see whether the two problems above could be solved. Thanks to our guild culture, we formed a team of enthusiasts from multiple teams and followed several reference architectures, including principledgraphql.com, to develop a graph gateway. While developing the gateway itself was the easier part, getting the end-to-end setup to meet our performance SLAs required additional cycles. We started by defining a serviceList and extended ApolloGateway to support falling back to a cached schema on composition failure, so that server startup does not fail.
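A minimal sketch of that serviceList-based gateway, assuming illustrative internal service URLs (our cached-schema fallback is custom behavior and is only hinted at in the comments, not shown):

const { ApolloServer } = require('apollo-server');
const { ApolloGateway } = require('@apollo/gateway');

// Static list of subgraphs; the gateway fetches and composes their schemas
// at startup. Our extension cached the last successfully composed schema so
// that a composition failure here would not take down the gateway.
const gateway = new ApolloGateway({
  serviceList: [
    { name: 'content', url: 'https://content-graph.internal/graphql' },
    { name: 'collections', url: 'https://collections-graph.internal/graphql' },
    { name: 'jobs', url: 'https://jobs-graph.internal/graphql' },
  ],
});

const server = new ApolloServer({ gateway, subscriptions: false });

server.listen().then(({ url }) => console.log(`Gateway ready at ${url}`));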

Collections Graph: The collections use case was well suited for launching our first NodeJS + Apollo GraphQL server graph instance, talking to an underlying Postgres DB through Knex/ObjectionJS. Once our first lean version launched and worked, we moved on to upgrades for scale, monitoring, and performance.
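A stripped-down sketch of what such a federation-compliant subgraph looks like, assuming a hypothetical collections table and a simplified schema (all names are illustrative):

const { ApolloServer, gql } = require('apollo-server');
const { buildFederatedSchema } = require('@apollo/federation');
const { Model } = require('objection');
const Knex = require('knex');

// Wire Objection.js to a Postgres connection via Knex.
Model.knex(Knex({ client: 'pg', connection: process.env.DATABASE_URL }));

class Collection extends Model {
  static get tableName() { return 'collections'; }
}

const typeDefs = gql`
  type Collection @key(fields: "id") {
    id: ID!
    name: String!
  }

  type Query {
    collection(id: ID!): Collection
  }
`;

const resolvers = {
  Query: {
    collection: (_, { id }) => Collection.query().findById(id),
  },
  Collection: {
    // Lets the gateway resolve Collection references coming from other subgraphs.
    __resolveReference: (ref) => Collection.query().findById(ref.id),
  },
};

new ApolloServer({ schema: buildFederatedSchema([{ typeDefs, resolvers }]) })
  .listen({ port: 4001 });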

More compliance

Federation-Compliant Content & Jobs Graph: The early graph layer built back in 2017 (the content graph) ran on an older version of graphql-java that was not federation compliant. One approach was to upgrade the graphql-java version, but that path involved substantial risk, so we developed an annotation layer that enabled federation support with the older version of graphql-java. All you had to do was add an annotation called GraphQLFederatedReferenceResolver on top of your entity resolver, and your main GraphQLQueryResolver would extend GraphQLFederationQueryResolver to add the method signatures for _entities and _service, both of which ApolloServer/Federation needs to make a downstream graph service federation compatible (the shape of the _entities query is sketched after this paragraph). We also launched a bare-bones, Java-based Jobs Graph for internal use and made it federation compliant following the same steps. This enabled us to test the whole federation architecture end to end and get an initial read on performance. At this point, our stack looked close to the diagram below: the red subgraphs were Java based, and the others were NodeJS based and were being used to test the collections feature on mobile native with a small subset of traffic.
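For reference, this is the shape of the representations query the gateway sends to a subgraph’s _entities field, which is the contract the annotation layer had to satisfy; the Employer type and its fields are illustrative:

const gql = require('graphql-tag');

// The gateway passes entity keys as "representations" and asks the owning
// subgraph to hydrate the requested fields.
const ENTITIES_QUERY = gql`
  query ($representations: [_Any!]!) {
    _entities(representations: $representations) {
      ... on Employer {
        id
        name
      }
    }
  }
`;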

Upgrades on Content & Jobs Graph: Later we upgraded from v4.1.0 of graphql-java to v13.0 dropping the need for the stop-gap compatibility annotation mentioned above.

Operationalizing

Apollo GraphQL Manager: To launch at full scale, it was important that we had monitoring in place. Our Grafana dashboards worked well for monitoring trace metrics of single-entity graphs, but for joined queries across the graph we needed a more robust monitoring setup, something that gave us a holistic picture and could identify and alert us on sudden spikes in error rates. This is where what we had learned about GraphQL Manager at the Apollo Summit came in handy. We tried the setup in our QA and stage environments before we started the licensing discussion with Apollo for an enterprise plan.

Managed Federation via GQM: Once we purchased the Apollo license, we partnered with the Apollo team on the features needed at our scale. Until then we had been using the serviceList model supported by apollo-server out of the box, which prevents the server from starting up when schema composition across subgraphs fails at startup. We had added the ability to reuse a cached schema from a previous successful startup so that queries that could still resolve against it kept working. While this was great, it silently delayed identification of schema composition issues, and it took a monumental effort of coordinating with teams, along with many custom alerts, to make sure things were fine. Because this approach was masking issues up front, we realized that remotely managed federation was the way to go. It is a supported tool available even in Apollo’s free tier that lets developers find composition issues before pushing a schema change.

Add the engine key when your server starts up:

const server = new ApolloServer({
  gateway,
  context,
  engine: { apiKey: 'KEY', schemaTag: env },
});

Starting with the least dependent subgraph, perform a service push:

apollo service:push --serviceName=collections --serviceURL=https://{collectionsLB}-${env}/graphql --variant=${env} --endpoint https://{collectionsLB}-${env}/graphql

This pushes the collections schema by calling the _service and _entities queries on the specified endpoint. Add the other subgraphs as follows:

apollo service:push --serviceName=content --serviceURL=https://{contentLB}-${env}/graphql --variant=${env} --endpoint https://{contentLB}-${env}/graphql

apollo service:push --serviceName=jobs --serviceURL=https://{jobsLB}-${env}/graphql --variant=${env} --endpoint https://{jobsLB}-${env}/graphql

Start the gateway. If there are composition errors, you will not be able to hit /graphql or /playground on the gateway; the errors reported make it easy to find which subgraph is causing issues. If everything goes well, running apollo service:list --variant=qa will list the subgraphs registered for that variant.

Future Thoughts

We spent a lot of time hardening our security layers, making them better suited to deal with graph-specific attack vectors, but more details on that in another story.

Additionally, we intend to test out and evaluate the benefits of federation-jvm. We have heard a lot about it from Apollo engineers and want to give it a try. Our goal is to be able to use newer specification annotations like CacheControl and to standardize federation traces across the stack.
