How we built Globoplay’s API Gateway using GraphQL

I know you don't hear this every day, but I had never worked on a project where I would choose GraphQL over classic REST APIs to abstract the client/server communication. I mean, it is unusual to hear that because nowadays the internet has tons of well-written posts that describe how magical GraphQL is when used to solve any kind of problem.
Also, there is another thing that makes this statement uncommon: how fancy the projects that use GraphQL in production look. C’mon, they look fancy.

To be honest, I always suspected this was not a real-world thing. It just seemed it would not work in production when you have thousands, even millions, of users opening your product every single day. And, as in our case, it feels very uncomfortable to think that you can no longer rely on the old-fashioned per-resource HTTP cache.

Yeah, it does feel uncomfortable. At first, anyway.
However, as we started to understand the changes we would need to make to our architecture (to our mindset too!) and embrace them in order to deliver the product the stakeholders were expecting to see, things started to look simpler and more comfortable.

But, before we go on, I believe some context would help you understand where our product was and how we changed our minds.

Globoplay is a video streaming platform created and developed by Grupo Globo, which is the largest media and communication group in Brazil and Latin America.

It is what is called an OTT (over-the-top) product. Currently supported only in Brazil, it started with content coming from TV Globo (free-to-air television), allowing users to watch programs that had just been broadcast by TV Globo (that’s called catch-up). As time went by, we started to offer exclusive national content as well as international series and movies through web, TV, iOS and Android clients.

That’s enough business, bro…

In mid-2018, we had two backends for frontends (BFF) doing very similar tasks: one for web, and another for iOS, Android and TV. Although they should have differed only in the peculiarities of each client, we ended up duplicating lots of services, infra and, most important of all, effort.

As much as I love the “backend for frontend” idea (and how cool it sounds), we could not keep the current architecture. Not only because of the reasons I just mentioned, but also because each BFF was serving slightly different content to its clients while the business team started to ask for something new: ubiquity among all clients.

This is it! We needed to unify our duplicate infra and support different clients with different needs in order to provide a ubiquitous experience to users.

Now you are probably wondering: why did you have duplicate infra?
Well, yeah… It makes no sense and the answer could be as simple as Homer Simpson's: It was like that when I got here.

Just kidding. Although that is true, it does not really answer the question. The reason we had duplicate infra is that the teams did not work together. They were different teams with different processes and goals. Only after some time did we realize that they should work together and, as soon as we consolidated them into one team, we started to see the benefits of doing so.

GraphQL as the API Gateway

I remember the day I spent some hours staring at the window thinking about all the functional requirements we had. To nobody’s surprise, the more I reviewed everything we needed to support, the more GraphQL started to make sense.

I mean, it fits some of our requirements like a glove:

  • While TVs need a big program poster, mobiles need a small one.
  • We need to show exactly the same video duration among all clients.
  • Video duration should be formatted following the same pattern / template in all interfaces.
  • TVs should provide detailed information about each program, but iOS and Android could show only a poster + program title.

Of course the requirements were not exactly these, but you get the idea.
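To make those requirements a bit more concrete, here is a minimal sketch of how a single schema could serve them. Everything in it is an assumption made for illustration: the field names (`poster`, `formattedDuration`), the `scale` argument, the shape of the program object and the duration pattern are hypothetical, not Globoplay's actual schema.

```javascript
// Hypothetical queries: each client asks the same schema only for what it needs.
const tvQuery = `
  query {
    program(id: "123") {
      title
      description
      poster(scale: LARGE)
      formattedDuration
    }
  }
`;

const mobileQuery = `
  query {
    program(id: "123") {
      title
      poster(scale: SMALL)
    }
  }
`;

// Formatting on the server keeps the duration identical on every client.
// The "1h 05min" pattern is an assumption made for this sketch.
function formatDuration(totalSeconds) {
  const totalMinutes = Math.floor(totalSeconds / 60);
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  if (hours === 0) return `${minutes}min`;
  return `${hours}h ${String(minutes).padStart(2, "0")}min`;
}

// Hypothetical resolvers for the Program type.
const programResolvers = {
  Program: {
    formattedDuration: (program) => formatDuration(program.durationInSeconds),
    poster: (program, { scale }) => program.posters[scale],
  },
};
```

The point of the design is that the size of the poster is a decision of the client, while the formatting of the duration is a decision of the server, so it cannot drift between interfaces.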

If we had decided to go with a backend for each frontend, we would easily end up with four backends, and that is without even considering that we have multiple types of TVs (which need different kinds of support), game consoles, TV boxes, etc.
Also, can you imagine how big the payload would be if we decided to go with a single REST API supporting all these clients?

At that time, I was very excited about GraphQL!

Given the scenario we had, we were indeed very excited about what GraphQL could give us as an API Gateway: it would help us consolidate our infra, give clients the flexibility to access different variations of the same data and, last but not least, serve as the living documentation of the schema we were going to provide to clients.
It would all have been good if we did not have two small but uncomfortable concerns: performance and caching.

Of course we had more concerns (dozens, to be honest), but those two were the major ones. To try to address them, we decided to start a GraphQL proof of concept (POC) called Jarvis.

We chose to give our GraphQL POC the same name as Tony Stark's AI assistant for a list of reasons:

  • It knows everything that is running behind it.
  • Iron Man just needs to ask for the information: Jarvis will provide it.
  • And finally: It is freakin’ awesome!

"Started out, J.A.R.V.I.S. was just a natural language UI. He runs more of the business than anyone besides Pepper." (Tony Stark to Bruce Banner)

Summing up, the idea was to use Jarvis as the API Gateway for all Globoplay clients so each app could ask for the exact piece of data needed to build the user experience.

Once the POC started, we did some research on the most used GraphQL implementations and we found Apollo GraphQL among the top results.

If you want to start with GraphQL but you've never read about Apollo, I strongly recommend looking at their docs.

Apollo GraphQL is a GraphQL implementation that helps development teams deliver data from multiple services, including REST APIs and databases. It also includes two open-source libraries for the client and server, developer tooling, analytics, tutorials and so on. After some time reading their docs, we decided to proceed with Apollo Server as Jarvis' core.

Implementing the GraphQL API

As I said before, I know there are lots of great GraphQL tutorials around and maybe I don't need to create another one. However, I will describe some of the features we tested during the POC that helped us create a production-ready version of Jarvis, such as data sources, query cost validation and application/HTTP caching.

Apollo Server is a straightforward GraphQL implementation that wraps a Node.js server. It takes no longer than 5 minutes to start a server:
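A minimal sketch of what such a starting point could look like is below. The `Program` fields are hypothetical, and the block deliberately avoids importing anything so it runs on its own; the commented lines at the end show roughly how the two values would be handed to Apollo Server, assuming version 2 of the `apollo-server` package.

```javascript
// Schema in GraphQL SDL. The Program fields are hypothetical.
const typeDefs = `
  type Program {
    id: ID!
    title: String
  }

  type Query {
    program(id: ID!): Program
  }
`;

// Resolver map: one function per field that needs custom logic.
// For now the `program` root query returns an empty object,
// so there is no real data behind it yet.
const resolvers = {
  Query: {
    program: (_parent, _args) => ({}),
  },
};

// With `apollo-server` installed (version 2 assumed), the server would be
// started roughly like this, with the schema wrapped in the `gql` tag:
//   const { ApolloServer, gql } = require("apollo-server");
//   const server = new ApolloServer({ typeDefs: gql(typeDefs), resolvers });
//   server.listen().then(({ url }) => console.log(`Jarvis ready at ${url}`));
```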

As you can see in the resolver, the program root query always returns an empty response. To make it return some data, we need to add support for a data source.

Apollo data sources are classes that encapsulate fetching data from a given service with built-in support for caching, deduplication, and error handling. It is so cool that we can worry only about writing the code that is specific to interacting with our backend and Apollo Server takes care of the rest. As part of this example, we implemented a REST Data Source responsible for fetching data from an API called Programs.
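To illustrate the idea behind a data source, here is a hand-rolled sketch. It is not Apollo's `RESTDataSource`: it only mimics the pattern of hiding fetching, request deduplication and error handling behind one class. The class name, the base URL, the `/programs/:id` path and the injected `fetchJson` function are all assumptions made for this example.

```javascript
// Hand-rolled illustration, NOT Apollo's RESTDataSource.
class ProgramsDataSource {
  // `fetchJson` is injected so the sketch does not depend on any HTTP
  // library; it must take a URL and return a Promise of parsed JSON.
  constructor(fetchJson, baseURL = "https://programs.example.com") {
    this.fetchJson = fetchJson;
    this.baseURL = baseURL; // hypothetical address of the Programs API
    this.inFlight = new Map(); // URL -> pending Promise
  }

  getProgram(id) {
    const url = `${this.baseURL}/programs/${encodeURIComponent(id)}`;

    // Deduplication: callers asking for the same URL share one request.
    if (this.inFlight.has(url)) return this.inFlight.get(url);

    const request = Promise.resolve(this.fetchJson(url))
      .catch((error) => {
        // Error handling lives in one place instead of in every resolver.
        throw new Error(`Programs API request failed: ${error.message}`);
      })
      .finally(() => this.inFlight.delete(url)); // forget it once settled

    this.inFlight.set(url, request);
    return request;
  }
}
```

With Apollo itself, the usual approach is to extend `RESTDataSource` from the `apollo-datasource-rest` package, set `this.baseURL` and call `this.get(...)`, which also brings the built-in caching that this sketch leaves out.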

In order to make it work, we need to tweak our server file to use the data source we just created:
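As a rough sketch of that change: the resolver stops returning an empty object and reads the data source from the resolver context instead. The stand-in `programsAPI` object and its made-up title exist only to keep the block runnable on its own; the commented lines show roughly how Apollo Server 2 receives data sources through its `dataSources` option.

```javascript
// Stand-in for the Programs data source; the title is made up.
const programsAPI = {
  getProgram: (id) => ({ id, title: "Hypothetical program" }),
};

// The resolver now reads the data source from its third argument (the context).
const resolvers = {
  Query: {
    program: (_parent, { id }, { dataSources }) =>
      dataSources.programs.getProgram(id),
  },
};

// Apollo Server 2 builds `context.dataSources` from a server option, roughly:
//   new ApolloServer({
//     typeDefs,
//     resolvers,
//     dataSources: () => ({ programs: new ProgramsAPI() }),
//   });
// Here the context is built by hand to show what the resolver receives.
const context = { dataSources: { programs: programsAPI } };
const result = resolvers.Query.program(null, { id: "42" }, context);
```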

That's just great! With a couple of lines of code, we started to provide data from another source through our GraphQL API. You should now be able to open Playground and run some queries.

As we evolved our GraphQL schema and more clients started to query data from it, we needed to think about preventing clients from creating expensive queries that could make our server spend too much time answering, which would eventually affect its load and performance.

There are some good articles on how to secure your GraphQL API from malicious queries. I strongly recommend reading this one and implementing a simple complexity cost validation rule to understand how these types of validations work.
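To give a feeling for how such a rule works, here is a deliberately simplified sketch. It assumes a hand-written tree instead of a parsed GraphQL document, charges one point per field and multiplies the cost of nested fields by the requested page size. The `first` argument, the field names and the budget of 100 are all hypothetical.

```javascript
// Hypothetical, simplified query shape: each field may carry a `first`
// argument (page size) and nested `fields`. A real rule would walk the
// parsed GraphQL document instead of this hand-written tree.
function queryCost(fields) {
  return fields.reduce((total, field) => {
    const multiplier = field.first ?? 1; // lists cost once per item requested
    const childCost = queryCost(field.fields ?? []);
    return total + 1 + multiplier * childCost;
  }, 0);
}

const MAX_COST = 100; // arbitrary budget chosen for this sketch

// Rejects the query before execution when it is over budget.
function assertAffordable(fields) {
  const cost = queryCost(fields);
  if (cost > MAX_COST) {
    throw new Error(`Query cost ${cost} exceeds the limit of ${MAX_COST}`);
  }
  return cost;
}

// programs(first: 50) { title episodes(first: 20) { title } }
const expensiveQuery = [
  {
    name: "programs",
    first: 50,
    fields: [
      { name: "title" },
      { name: "episodes", first: 20, fields: [{ name: "title" }] },
    ],
  },
];

// program { title }
const cheapQuery = [{ name: "program", fields: [{ name: "title" }] }];
```

Nested lists are what make the cost explode: the cheap query costs 2, while the nested one costs 1101 and would be rejected.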

Before you go ahead and spend a ton of time implementing query cost analysis, understand your schema and be certain you need this kind of security.

At this point we have a minimal working version of a GraphQL API almost ready to go to production. Before doing so, we need to think about features that keep our service stable in production regardless of occasional spikes in connected clients.

Cache to the rescue!

In my opinion, the caching strategy is a big factor in this case and it must be planned according to the scenario that the API will face in production. You can decide to cache your query results at the application level (in-memory or Redis), at the HTTP level (CDN, load balancer), or both!

Again, that really depends on your case.

For application-level caching, my suggestion is to start with a simple in-memory cache local to each instance (which is supported by default by Apollo Data Sources) and move to a remote Redis instance only when you need to.
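To show what "simple" means here, this is a minimal sketch of an in-memory cache with a time-to-live per entry. It is not Apollo's implementation: the class name and the injectable `now` clock are assumptions made for illustration, and the methods are synchronous for brevity.

```javascript
// Minimal in-memory cache with a time-to-live per entry. `now` is injectable
// so expiry can be exercised without waiting; it defaults to Date.now.
class InMemoryTtlCache {
  constructor(now = Date.now) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlSeconds) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }
}
```

Because callers only depend on `get` and `set`, the storage behind them can later be swapped for a remote store such as Redis without touching the resolvers. The trade-off of staying in memory is that every server instance warms up its own copy.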

For HTTP-level caching, I recommend reading articles like this one that talk about Automatic Persisted Queries and CDN caching.
The idea is that you can add cache directives to your schema to define the max age for each type or field. Once a query is executed, a Cache-Control header is added to the server response (following the lowest max-age defined among the fields that were requested in the query). Then, any layer between the server and the client can use the value from the header to cache the results. You can also define a default max age for the entire schema:
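The "lowest max-age wins" rule can be sketched in a few lines. This is a simplification and not Apollo's actual algorithm: it assumes every field without a hint falls back to a schema-wide default, and the default of 5 seconds, the function names and the `no-store` fallback are choices made for this example. In Apollo Server 2 the hints themselves come from the `@cacheControl(maxAge: ...)` directive in the schema and the default from the `cacheControl: { defaultMaxAge: ... }` server option.

```javascript
const DEFAULT_MAX_AGE = 5; // seconds; hypothetical schema-wide default

// `fieldMaxAges` lists the max-age hint of every field in the query;
// `undefined` means the field has no hint and falls back to the default.
function overallMaxAge(fieldMaxAges, defaultMaxAge = DEFAULT_MAX_AGE) {
  const resolved = fieldMaxAges.map((age) => age ?? defaultMaxAge);
  return resolved.length === 0 ? defaultMaxAge : Math.min(...resolved);
}

// Builds the header value that a CDN or load balancer would see.
function cacheControlHeader(fieldMaxAges) {
  const maxAge = overallMaxAge(fieldMaxAges);
  // In this sketch a max-age of 0 means the response must not be cached.
  return maxAge > 0 ? `max-age=${maxAge}, public` : "no-store";
}
```

So a query touching one field cached for 240 seconds and another cached for 60 seconds is cacheable for 60 seconds as a whole, and a single uncacheable field makes the whole response uncacheable.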

Once we completed these steps in the Jarvis POC, we focused on evaluating the monitoring tools we would use in production, but I will explain that part in upcoming articles.

Spoiler alert: Apollo Engine is great…

Wrapping up

As you can imagine, Jarvis is doing a great job in production.
We were able to solve most of the concerns we found when we designed our architecture including performance, caching and development experience.

Of course it took us some time to understand all the changes we were making to the way the teams were used to reading and writing data. And I believe it was necessary given the requirements.

Now the teams are enjoying the fact that they work with an API Gateway that allows them to ask for whatever they need to do their job.
The business team is excited because we provided support for the ubiquity among all clients that they were asking for.
My boss is happy because it worked, and that's important!

BTW, the complete GraphQL API tutorial can be found on GitHub, enjoy ❤️!