Evolving the Graph

Jon Wong
Aug 28, 2019

This is a written version of a talk I gave recently at GraphQL Conf 2019. If you would prefer to view the video version, here it is:

Evolving the Graph, GraphQL Conf 2019. Slides

For those of you who prefer to read and want a little more background, here’s a more detailed version of the talk, complete with links and all that fun stuff. Enjoy!

Here’s an overview of GraphQL at Coursera, some of the technical decisions we made early on, and a reflection on the impact of those decisions over the last three years. I hope that you can learn a lesson or two from our journey and apply it to your own GraphQL story!

The engineering organization at Coursera has been using GraphQL in production for more than three years — it powers both of our mobile clients as well as our web client, and serves millions of requests a day! But three years ago, the landscape supporting GraphQL was very different from that of today, and some of the more obvious choices around how to architect a GraphQL system were a lot less obvious. And as we re-evaluate the advantages of GraphQL and what it brings to our stack, we thought it’d be interesting to take a look at where we came from, and which decisions have stood the test of time (and which did not).

Why We First Started Using GraphQL

Three years ago, Brennan Saeta and Bryan Kane sought to find a solution to one of the more frustrating issues Coursera was dealing with at the time: centralizing all the REST APIs that our organization was creating for simple and easy consumption. We needed a technology that would let clients be selective about exactly what data they needed so we could reduce the total set of data being sent over the wire, and GraphQL fit the bill. Watch that story here and read about it on the Apollo GraphQL blog here.

To summarize: Our first take on GraphQL looks a lot like what is today referred to as a GraphQL Gateway, a thin service where the GraphQL schema and core business logic do not live in the GraphQL service itself, but are rather composed of underlying services. Furthermore, we opted for a data-first approach to GraphQL, meaning the unified GraphQL schema being exposed from our gateway was a direct proxy of the underlying services, with next-to-zero modification of the underlying service graph.

Given the relative infancy of the GraphQL ecosystem, it was clear that this approach was an outlier — many companies at the time were choosing a schema-first approach, where the GraphQL schema was often a slightly-to-heavily modified version of their underlying service graph. And quite often this GraphQL service had a lot of business logic in it.

The Key GraphQL Decisions

In the course of evaluating technical decisions made in the past, people tend to quickly judge those decisions in the context of today. When we view them in the context in which they were made, we can see that those decisions were completely justifiable. I’m sharing that context with you today so you too can understand why we made these decisions. The following were key decisions we made in the process of designing our GraphQL solution, with the intent of addressing the original context (proxying REST resources) as well as optimizing for the sustained growth of our GraphQL schema.

In today’s GraphQL ecosystem, federation commonly refers to a combination of both request federation and schema federation. That is, a federated GraphQL server tends to be a single GraphQL service that composes multiple GraphQL schemas together and delegates the execution of each relevant part of the graph to the underlying GraphQL service that owns it.

Request federation via REST

At the time, we barely had the vocabulary to think of this as federation, much less deal with more than one schema, so our gateway technically did only request federation: We took our own homegrown schema definition language, stitched that together ourselves, and whenever a GraphQL request came into the service, we’d farm it out to REST.
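To make "farming a request out to REST" a bit more concrete, here is a minimal sketch in Python. It is not our actual gateway code (that lives in Scala on top of Naptime and Sangria); the base URL, the `resource.vN` URL layout, and the `rest_url` and `resolve` names are all assumptions made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical base URL and URL layout, used only for illustration.
REST_BASE = "https://api.example.com/api"


def rest_url(resource, version, ids, fields):
    """Map one namespaced GraphQL selection onto a single REST call."""
    # safe="," keeps the comma-separated lists readable in the query string.
    query = urlencode({"ids": ",".join(ids), "fields": ",".join(fields)}, safe=",")
    return f"{REST_BASE}/{resource}.v{version}?{query}"


def resolve(selection, fetch):
    """A single generic resolver that holds no business logic of its own.

    `fetch` is any callable that takes a URL and returns parsed JSON,
    which keeps this sketch free of real network calls.
    """
    url = rest_url(selection["resource"], selection["version"],
                   selection["ids"], selection["fields"])
    return fetch(url)


# Usage with a stubbed-out fetch function instead of a real HTTP client.
selection = {"resource": "courses", "version": 1,
             "ids": ["abc"], "fields": ["name", "slug"]}
print(resolve(selection, fetch=lambda url: {"requested": url}))
```

The point of the sketch is that the gateway only translates: the selected fields become a `fields` parameter, and everything else is left to the service that owns the resource.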

As many GraphQL enthusiasts will tell you, there are two camps when it comes to building out a GraphQL server: SDL-first and code-first. The difference is roughly whether you edit GraphQL SDL to change your schema, or you edit something else (code) from which the GraphQL schema is eventually generated.

Code-first schema generation with Sangria

At Coursera, we had already invested a lot of time and effort into a unified REST framework called Naptime that gave us a great place to start with generating our schema, so in conjunction with the wonderful tooling provided by Sangria, we chose to do code-first schema generation.
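As a rough illustration of what code-first generation means, the sketch below derives a GraphQL type definition from a plain Python dataclass. This is not how Naptime or Sangria work internally; the `Course` model, its fields, and the `to_sdl` helper are hypothetical, and only a handful of scalar types are handled.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union, get_args, get_origin, get_type_hints

# Only a few scalars are mapped; a real generator would cover far more.
SCALARS = {str: "String", int: "Int", float: "Float", bool: "Boolean"}


def graphql_type(py_type):
    """Optional[X] becomes a nullable GraphQL type, anything else is non-null."""
    if get_origin(py_type) is Union and type(None) in get_args(py_type):
        inner = next(a for a in get_args(py_type) if a is not type(None))
        return SCALARS[inner]
    return SCALARS[py_type] + "!"


def to_sdl(model):
    """Generate a GraphQL type definition from a dataclass."""
    hints = get_type_hints(model)
    lines = [f"type {model.__name__} {{"]
    for f in fields(model):
        lines.append(f"  {f.name}: {graphql_type(hints[f.name])}")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical model standing in for an existing REST resource definition.
@dataclass
class Course:
    id: str
    name: str
    description: Optional[str]
    workload_hours: int


print(to_sdl(Course))
```

The schema author never writes SDL here: changing the model changes the generated schema, which is exactly why this approach made it so cheap for existing REST resources to show up in GraphQL.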

To fully realize our vision of a unified GraphQL API for all of our REST endpoints, we needed to figure out how we were going to “stitch” or “compose” these endpoints in a scalable, straightforward way. We had a lot of choices here, including multiple iterations on how we should name a GraphQL field that came from a specific REST endpoint.

Schema stitching with namespaces in GraphQL

In the end, we picked the one that most directly matched our REST endpoints: namespaces. This meant that every service included in our GraphQL schema occupied its own top-level root field, and the nested GraphQL hierarchy matched the hierarchy of our REST APIs.
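Here is a small sketch of what namespacing by resource could look like. The naming rule (`CoursesV1` and so on), the resource list, and the `stitch` helper are assumptions for illustration rather than our exact conventions.

```python
def root_field_name(resource, version):
    """Hypothetical naming rule: ('courses', 1) -> 'CoursesV1'."""
    return f"{resource[0].upper()}{resource[1:]}V{version}"


def stitch(resources):
    """Give every (resource, version) pair its own top-level root field.

    `resources` is a list of (name, version, operations) tuples. Because
    the version is part of the namespace, two versions of the same API
    can never collide and can be served side by side.
    """
    root = {}
    for name, version, operations in resources:
        field = root_field_name(name, version)
        if field in root:
            raise ValueError(f"duplicate namespace: {field}")
        root[field] = list(operations)
    return root


# Hypothetical resources; the operation names are illustrative only.
schema_root = stitch([
    ("courses", 1, ["get", "multiGet"]),
    ("courses", 2, ["get"]),
    ("partners", 1, ["get"]),
])
print(list(schema_root))  # ['CoursesV1', 'CoursesV2', 'PartnersV1']
```

The trade-off is visible even in this toy version: collisions are impossible by construction, but the root of the schema grows by one field for every resource and version that is added.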

Three Years Later…

After a couple of rounds of funding and millions upon millions of new Coursera learners, it’s important to take a look at the decisions we made in the name of “growth” to see whether they were durable decisions. In addition, we’ll cover a few things that we didn’t “decide” on but were important factors in our GraphQL story nonetheless.

Our use of “federation” by delegating service requests to our underlying REST APIs kept our gateway incredibly thin; so thin, in fact, that service developers barely had to think about it at all. From a runtime perspective, we were able to support requests across our entire API surface area with exactly one resolver because we had standardized all of our APIs. In this case, service developers were not writing resolvers at all.

Additionally, this strategy kept our ownership story really sleek. Teams retained ownership over their individual services and had no need to interact with the GraphQL Gateway at large.

One large downside with the technical implementation behind this, however, was that it was pretty difficult to preview incoming schema-related changes that affected the unified GraphQL schema. Unfortunately, this meant that we had to preview changes when they got to production… which was incredibly problematic. We could have built tooling to preview this earlier on in the development process, but because of the homegrown nature of the Gateway, it was definitely a non-trivial issue.
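For a sense of what such preview tooling might have looked like, here is a minimal sketch that compares two snapshots of a schema before a change ships. We did not build this; the snapshot format (type name to field name to field type) and the `diff_schemas` helper are assumptions, and the check is deliberately conservative in treating every type change as breaking.

```python
def diff_schemas(current, proposed):
    """List the differences between two schema snapshots.

    Each snapshot maps type name -> {field name -> field type}. Removals
    and type changes are flagged as breaking, since existing clients may
    already depend on them; additions are reported as safe.
    """
    changes = []
    for type_name, type_fields in current.items():
        new_fields = proposed.get(type_name)
        if new_fields is None:
            changes.append(f"BREAKING: type {type_name} removed")
            continue
        for field, field_type in type_fields.items():
            if field not in new_fields:
                changes.append(f"BREAKING: {type_name}.{field} removed")
            elif new_fields[field] != field_type:
                changes.append(
                    f"BREAKING: {type_name}.{field} changed "
                    f"{field_type} -> {new_fields[field]}"
                )
    for type_name, type_fields in proposed.items():
        added = sorted(set(type_fields) - set(current.get(type_name, {})))
        for field in added:
            changes.append(f"added: {type_name}.{field}")
    return changes


# Hypothetical snapshots taken before and after a service change.
current = {"Course": {"id": "String!", "name": "String!"}}
proposed = {"Course": {"id": "String!", "name": "String", "slug": "String!"}}
for change in diff_schemas(current, proposed):
    print(change)
```

Run in continuous integration against the production snapshot, a check along these lines would surface schema changes at review time instead of after deployment.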

Code-first schema generation turned out to be an efficient way to achieve our original goal of reflecting our REST APIs into the GraphQL schema. As I mentioned earlier, this greatly reduced the amount of effort needed to participate in our unified schema, and very little had to change in the life of a backend engineer to get automatic GraphQL support.

However, this had its downsides as well. From a technical standpoint, it was really simple to get new things into the GraphQL schema, but we never really had a strong opinion about whether these new things should be in the schema in the first place. Again, given the context of the project, this explicitly wasn’t a goal, but after three years of unbounded growth, our GraphQL schema was bursting at the seams: We have more than 7,000 types, and more than 650 root types. One of the benefits of GraphQL is discoverability, but with so many entities, it was really difficult to figure out which APIs existed already, and without proper governance of the schema, these APIs were not particularly well documented, either.

Using the namespaces approach for schema stitching actually turned out to be a pretty solid decision. These namespaces allowed us to even fathom handling more than 600 distinct REST APIs with virtually zero overhead from dealing with collisions or backwards-compatibility — if there were any issues, a new API could be created and we’d serve both versions for posterity.

However, when it came down to having to make changes to fields for actual product requirements (e.g., locking down fields or changing types to be more complex), we didn’t have a workflow in place to effectively adapt the graph, which definitely caused some hiccups when it came to serving long-running clients like our mobile applications.

In the process of building out the GraphQL server, we were tackling adoption on the client in parallel. This was an unexpectedly good feature of the GraphQL adoption process, as the velocity of feature development in mainstream clients like Relay and Apollo helped create a really solid developer experience for all client developers.

Our Big Realization

In the midst of re-evaluating a lot of these technical decisions and how they’ve affected our organization, we made a pretty huge realization that completely flipped how we thought about the problem. It came down to a foundational element in the context of the problem we were solving in the beginning:

Is a data-first schema the best use of GraphQL?

Ultimately, the answer is nuanced: a data-first schema got our schema where it is today, and has served us, our product, and our learners really well. However, in the grand scheme of things, we felt at odds with one of the core tenets of GraphQL: product-centricity. Taken straight from the GraphQL introduction blog post:

GraphQL is unapologetically driven by the requirements of views and the front-end engineers that write them. We start with their way of thinking and requirements and build the language and runtime necessary to enable that.

The examples the blogs use, versus what we have at Coursera

This ultimately underscores some of the hesitance and internal conflict we had: schema-driven development is one of the most powerful workflows that GraphQL supports, and documentation, talks, and blog posts all reflected that, while our implementation was ultimately data-first. The most difficult cases were those where the requirements of the product just barely mismatched the capabilities of the API, causing the product developer to resort to making business logic-related decisions in the client rather than pushing them back into the server.

What’s Next for GraphQL @ Coursera

We’ve realized that alignment with the best practices around schema-driven development is what we want and need at Coursera. That is a hefty decision to make: our data-first schema is accomplishing the task it originally set out to do, but it is not delivering on the promise of GraphQL, and as a result we’re starting from scratch on what GraphQL looks like at Coursera. We have a few driving principles here, and there’ll be many blog posts to come as we navigate this journey.

When it comes to our schema, we’re switching our focus to optimize for quality over quantity — not everything needs to make it into the GraphQL schema, and when it does, we should be really thoughtful about how it does.

When engineers work together with one another, GraphQL will actually be the language that they communicate in, along with the technology. API contracts are schema-first and are derived directly from the products they support, and both sides can work and meet in the middle.

And above all, the schema becomes the unifying layer for all engineers. Our GraphQL gateway can and should be something that all engineers have a say in, and the tooling and workflows will reflect that.

Pulling It All Together

Despite our best judgement in building our first iteration of GraphQL, despite wonderful decisions that have fulfilled the requirements of the problem for many years, despite features that have really supported the growth of our GraphQL schema, none of them really addressed the core evolution of our team; it isn’t the graph that’s evolving, it’s the team. The people involved grew their understanding of the technology to see how it could better serve our needs, and ultimately, that has resulted in us making a different choice from that of our predecessors.

Thank you to Gago Frigerio for editing this talk, Brennan Saeta and Bryan Kane for being the driving force behind GraphQL at Coursera, and the Developer Experience team for helping to drive the future of GraphQL at Coursera. If the work above sounds really awesome and you want to work with some state-of-the-art GraphQL to help advance the future of education, come work with us!
