Scaling GraphQL at PayPal

Published in The PayPal Technology Blog, Oct 30, 2019

This post is part of a series of best practices and observations we have made while building GraphQL APIs at PayPal.

A year ago, we wrote “GraphQL: A success story for PayPal Checkout” which covers our journey from REST to Batch REST to GraphQL. A lot has changed since then! This post covers everything we learned while scaling out GraphQL at PayPal and will serve as a guide for deploying GraphQL at your company.

A year ago, there were a handful of products using GraphQL. While we had success in PayPal Checkout, there was no infrastructure, tools, training or support. Despite those gaps, GraphQL still took off like a rocket 🚀. As of this writing, we have 50 different products using GraphQL!

At PayPal, GraphQL grew from 3 to 52 products in 2 years

Adoption was fast and we’re still recovering from it. As with other technology shifts, scaling in the Enterprise isn’t about horizontal scaling or paying a lot of money for servers or cloud compute. Scaling people, tooling and processes is the most challenging part.

Looking inward

Before deploying GraphQL, it’s important to look inward at your own company and reflect on who you are, what got you there and what your strengths and weaknesses are.

At PayPal, our products are built using JavaScript: React for the front-end and Node.js for the back-end. Further down the stack are hundreds of Java REST services and some C++ SOAP-like services.

PayPal was one of the first companies using Node at scale and helped kick off the “Node in the Enterprise” brand with events like NodeDay and partnering with The Node Firm and NodeSource.

PayPal migrated to Node between 2013 and 2015. The migration transformed the entire company and improved the way we build and ship products. Before the migration, when PayPal was a single monolithic C++ app, a simple content change would take weeks to deploy. Now, developers can experiment, iterate and push new product features in minutes.

That didn’t happen overnight and wasn’t by accident. Before Node took off at PayPal, Bill Scott pushed a vision of building products using LeanUX, where iterating and learning are key. In LeanUX, UI bits are experimental and disposable. If an experiment doesn’t perform well, keep iterating! Node.js was key to that success.

In 2014, we launched Kraken, a set of libraries on top of Express to “give your application some arms” with configurable middleware, security defaults, a dust.js renderer, and content localization. Since then, most of Kraken’s arms have disappeared. We still use configurable middleware and security defaults, but anything related to web applications has changed since 2014. Now, teams are building client-side React apps and applications are bundles of static files deployed to a CDN, rather than code running on fleets of servers.

For APIs, we use the BFF (backend-for-frontend) pattern. Although BFF APIs are very specialized and allow developers to iterate, they are tightly coupled to their UIs and are not very re-usable. The result is that we have a lot of teams iterating and building the same kinds of things over and over! Building a BFF API isn’t trivial either. Often, they contain a lot of orchestration logic where you need to grab data from 5, 10, 15 different services, normalize those responses, throw out 95% of it, and map, filter, and sort that data into what you wanted 🙄. Writing this code isn’t a good use of our developers’ time. This tight coupling and lack of re-use is a problem for us.

Can GraphQL help? Yes!

In a previous post, we wrote about our journey from REST to Batch REST to GraphQL. We found that it enables the best of both worlds: developers can iterate quickly, API and UI are loosely coupled, and code re-use increases.

Do a similar exercise with your company. Before jumping on the GraphQL hype train 🚂, spend time understanding your past and how you evolved, assess your strengths and weaknesses, and consider how GraphQL might be able to help.

Resetting expectations

Developer experience > Performance

Before jumping into a checklist of production-ready items, we should reset expectations. When you are first introduced to GraphQL, it’s common to think the primary benefit is transferring less data over the wire, which allows for faster, higher-performing apps.

Large companies will layer GraphQL on top of existing REST services. The result is your GraphQL query will only be as fast as your slowest REST service. GraphQL allows you to fetch everything you need in a single round trip. If there is chattiness between your client and server, you can reduce round trips and reduce latency. But, this isn’t a promise that you can make for everyone.

After a while, you will realize that the benefits of developer experience and flexibility are much more compelling than performance.

GraphQL is friendly to humans. With GraphQL, developers think in terms of fields, not endpoints, domains or complex joins. Developers can traverse the graph to pick out the user’s first name, 60x60 profile photo, primary shipping address, and credit cards without having to invoke 6–10 different services. New hires love it. If you know what JSON looks like, drop the double quotes and commas and you are able to query a GraphQL API. Your UI developers love it because it’s product-centric, declarative and has a rich toolset that reduces integration friction and improves confidence.
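To make the "fields, not endpoints" idea concrete, here is a minimal sketch of such a query next to the shape of its response. The field names, arguments, and sample data are hypothetical and are not taken from PayPal's actual schema.

```typescript
// Hypothetical query: every field and argument name here is an illustrative assumption.
const PROFILE_QUERY = `
  query Profile {
    user {
      firstName
      profilePhoto(width: 60, height: 60) {
        url
      }
      primaryShippingAddress {
        line1
        city
        postalCode
      }
      creditCards {
        brand
        last4
      }
    }
  }
`;

// The response mirrors the query: same names and nesting, with values filled in.
interface ProfileData {
  user: {
    firstName: string;
    profilePhoto: { url: string };
    primaryShippingAddress: { line1: string; city: string; postalCode: string };
    creditCards: Array<{ brand: string; last4: string }>;
  };
}

// Made-up data showing what a server could return for the query above.
const exampleData: ProfileData = {
  user: {
    firstName: "Ada",
    profilePhoto: { url: "https://example.com/photos/ada-60x60.png" },
    primaryShippingAddress: { line1: "1 Main St", city: "San Jose", postalCode: "95131" },
    creditCards: [{ brand: "VISA", last4: "4242" }],
  },
};
```

The query reads like the JSON it returns with the values removed, which is a large part of why it is approachable for new hires.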

As you pitch GraphQL to leaders and teams in your company, focus on developer experience, productivity, and flexibility. Otherwise, you might disappoint them.

GraphQL at the edge of your stack

At PayPal, our core services are developed by separate back-end teams and our product teams build BFF APIs that orchestrate data from those underlying core services. At first, we thought “GraphQL should go everywhere!” We thought the problems of over-fetching and under-fetching would go away, and that you would have visibility into which use cases use which fields and be able to trace them all the way back to the database. Architects loved the idea. But we couldn’t stop development across the company to migrate to GraphQL.

After a lot of experimentation and reflection, we found that GraphQL shines most at the edge of our stack. GraphQL is product-centric and should be influenced (or developed) by your product teams. GraphQL schemas should be developed design-first with a product team’s input. They shouldn’t be designed in isolation by a back-end developer. GraphQL does orchestration very well. For this reason, GraphQL is best at the edge of your stack and can work in tandem with REST.

Our friend (and ex-PayPalian) Trevor Livingston had a similar observation working with GraphQL at Expedia.

https://twitter.com/tlivings/status/1114536415902216203

Enabling GraphQL at your company

Okay, it’s time to make this real! This section is a checklist for enabling GraphQL at your company.

First things first, you need to create a foundation for product teams to stand on. GraphQL is new and exciting with lots of open questions and opinions. Hardly anything is set in stone. You have a lot of choices to make!

Who are your API developers and consumers?
Who is going to contribute? What languages do they know?
How do they build APIs today? Are they specialized or general-purpose?
Are you going to use existing frameworks and tools? Which ones?
Do you need to add anything company-specific? Authentication, authorization, retries, circuit breaker, custom HTTP status codes, error handling?
How will you enforce standards?
How will you handle errors?

Creating a foundation

At PayPal, GraphQL has taken off with product teams who contribute to BFF APIs and UIs. We use Node.js for our GraphQL APIs. Like many other companies, we use Apollo’s open-source libraries and tools. Apollo has a dedicated team building and maintaining those tools, so they are top-notch with great documentation. We use apollo-server and sprinkle in our PayPal-specific production-readiness, such as logging and instrumentation, auth, error handling and rate-limiting.

If you have any unique requirements, create modules and plugins that can be used with open source libraries, like Apollo. Don’t create deep abstractions or hide complexity from developers. Keep it Google-able!

Remember that GraphQL is still an API. You will want to make sure that you have sufficient logging, retries, circuit breaker patterns, rate limiting, and query complexity checks.
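As one illustration of a query complexity check, the sketch below rejects queries that nest too deeply. It is a deliberately naive version that counts braces in the query text; the function names and the limit of 5 are assumptions for this example, and a production check would more likely inspect the parsed query instead.

```typescript
// Naive sketch: measures selection-set nesting by counting braces.
// Assumption: a production check would walk the parsed query AST instead,
// which also accounts for fragments, comments, and block strings.
function queryDepth(query: string): number {
  let depth = 0;
  let maxDepth = 0;
  let parens = 0; // > 0 while inside field arguments, e.g. user(filter: { ... })
  let inString = false;

  for (let i = 0; i < query.length; i++) {
    const ch = query[i];
    if (ch === '"' && query[i - 1] !== "\\") {
      inString = !inString;
    } else if (inString) {
      continue; // ignore braces and parentheses inside string values
    } else if (ch === "(") {
      parens += 1;
    } else if (ch === ")") {
      parens -= 1;
    } else if (ch === "{" && parens === 0) {
      depth += 1;
      maxDepth = Math.max(maxDepth, depth);
    } else if (ch === "}" && parens === 0) {
      depth -= 1;
    }
  }
  return maxDepth;
}

const MAX_DEPTH = 5; // hypothetical limit; tune it to your own schema

// Reject a query before executing it if it nests too deeply.
function assertWithinDepth(query: string, limit: number = MAX_DEPTH): void {
  const depth = queryDepth(query);
  if (depth > limit) {
    throw new Error(`Query depth ${depth} exceeds the limit of ${limit}`);
  }
}
```

Depth is only one proxy for cost. A fuller check would also weigh list sizes and expensive fields.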

Scaling knowledge

Ensure that architects and API designers are on board with GraphQL to help you scale out design reviews and enforce standards. More than likely, they have designed REST APIs for years. GraphQL is different. At first, you might feel some resistance. Push through it. You need to spend time with them to outline the differences and challenge them to approach API design differently: no versioning, embracing the graph rather than using IDs to create relations, no HATEOAS links. By getting architects on board, you won’t be the only subject matter expert and it will bring legitimacy to GraphQL.

Next, how are you going to scale learning at your company? GraphQL is changing rapidly. Any training materials that you create will be out of date and have maintenance costs. Instead, curate external resources and seek help from training companies like Moon Highway for excellent GraphQL classes!

Setting standards

After your architects are on board, you will want to set some design standards. Create a document and reference it everywhere. Use tools like graphql-schema-linter to enforce naming conventions.

Some examples include:

All fields must have a description or comment
Type names, LikeThis. Field names, likeThis. Enum values, LIKE_THIS
Use enums when possible
Deprecated fields must have a reason
No collection or list suffixes. (Ex: Use cards, instead of cardList)
Prefer input types with mutations
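The casing and suffix rules above are simple enough to express as a few regular expressions. The sketch below is only an illustration of the rules themselves, with hypothetical names throughout. It is not how graphql-schema-linter is implemented, and in practice you would configure that tool rather than write your own.

```typescript
// Simplified readings of the naming conventions listed above.
const PASCAL_CASE = /^[A-Z][A-Za-z0-9]*$/; // Type names: LikeThis
const CAMEL_CASE = /^[a-z][A-Za-z0-9]*$/; // Field names: likeThis
const UPPER_SNAKE = /^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$/; // Enum values: LIKE_THIS
const LIST_SUFFIX = /(List|Collection)$/; // cardList should be cards

type NameKind = "type" | "field" | "enumValue";

// Returns a list of human-readable problems; an empty list means the name passes.
function checkName(kind: NameKind, name: string): string[] {
  const problems: string[] = [];
  if (kind === "type" && !PASCAL_CASE.test(name)) {
    problems.push(`Type "${name}" should be PascalCase`);
  }
  if (kind === "field") {
    if (!CAMEL_CASE.test(name)) {
      problems.push(`Field "${name}" should be camelCase`);
    }
    if (LIST_SUFFIX.test(name)) {
      problems.push(`Field "${name}" should drop its list suffix`);
    }
  }
  if (kind === "enumValue" && !UPPER_SNAKE.test(name)) {
    problems.push(`Enum value "${name}" should be UPPER_SNAKE_CASE`);
  }
  return problems;
}
```

For example, `checkName("field", "cardList")` reports the list suffix, while `checkName("field", "cards")` returns an empty list.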

Then, you will need to make choices around pagination. What’s your preferred way to surface lists in your schema? Cursor-based pagination?
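If you do pick cursor-based pagination, one common shape is the connection pattern from the Relay Cursor Connections specification (`edges`, `node`, `cursor`, `pageInfo`). The sketch below paginates an in-memory array that way. It is an assumption for illustration, not a description of PayPal's choice, and the plain-text `offset:N` cursor stands in for the opaque, encoded cursors a real API would typically return.

```typescript
interface Edge<T> {
  cursor: string;
  node: T;
}

interface PageInfo {
  hasNextPage: boolean;
  endCursor: string | null;
}

interface Connection<T> {
  edges: Edge<T>[];
  pageInfo: PageInfo;
}

// Cursors should be opaque to clients; here they simply wrap the array offset.
const encodeCursor = (offset: number): string => `offset:${offset}`;

const decodeCursor = (cursor: string): number => {
  const offset = parseInt(cursor.slice("offset:".length), 10);
  if (cursor.indexOf("offset:") !== 0 || isNaN(offset)) {
    throw new Error(`Invalid cursor: ${cursor}`);
  }
  return offset;
};

// Returns up to `first` items that come after the item identified by `after`.
function paginate<T>(items: T[], first: number, after?: string): Connection<T> {
  const start = after === undefined ? 0 : decodeCursor(after) + 1;
  const slice = items.slice(start, start + first);
  const edges = slice.map((node, i) => ({ cursor: encodeCursor(start + i), node }));
  const last = edges[edges.length - 1];
  return {
    edges,
    pageInfo: {
      hasNextPage: start + slice.length < items.length,
      endCursor: last ? last.cursor : null,
    },
  };
}
```

A client asks for the first page with `paginate(cards, 2)` and passes the returned `endCursor` as `after` to fetch the next one.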

How will you surface errors? As of this writing, there is no settled approach to error handling in GraphQL, and many options are in use in the wild:

Using the default errors Array
Extending errors with custom properties
An errors field in your schema
Union types

At PayPal, we chose to extend errors with custom properties. We liked that it’s still spec-compliant and allows us to add error classifications and other metadata to our errors when we need to. We felt that the other options weren’t approachable and allowed for errors to go unnoticed. We will provide more details in a future Medium post.
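As a rough sketch of what extending errors can look like, the example below assumes the custom properties live under the `extensions` key that the GraphQL specification reserves on each error entry. The classification values and the `retryable` flag are hypothetical, not PayPal's actual metadata.

```typescript
// Shape of one entry in the response's top-level `errors` array.
interface GraphQLErrorEntry {
  message: string;
  path?: Array<string | number>;
  locations?: Array<{ line: number; column: number }>;
  extensions?: Record<string, unknown>;
}

// Hypothetical classification values, invented for this example.
type ErrorClassification = "VALIDATION" | "AUTHORIZATION" | "UPSTREAM_FAILURE";

// Builds an error entry that carries extra metadata alongside the message.
function toErrorEntry(
  message: string,
  path: Array<string | number>,
  classification: ErrorClassification,
  retryable: boolean
): GraphQLErrorEntry {
  return { message, path, extensions: { classification, retryable } };
}

// Clients can branch on the metadata instead of parsing message strings.
function isRetryable(error: GraphQLErrorEntry): boolean {
  return error.extensions !== undefined && error.extensions["retryable"] === true;
}

const exampleError = toErrorEntry(
  "Could not load credit cards",
  ["user", "creditCards"],
  "UPSTREAM_FAILURE",
  true
);
```

Because the entry is still a standard error object, clients and tools that only read `message` and `path` keep working.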

Authentication/Authorization

How will you protect your schema?

At first, we protected our entire schema, then realized that we have many different types of users with different privileges. Next, we created a higher-order auth function that you could wrap a resolver with. Finally, we realized that creating a custom auth directive is the best way to protect your schema.

For example, directives on the schema can declare that if a query contains the user field, the user must be logged in, and that creditCards needs 2 additional privileges.

Because it’s a directive, it’s visible in the schema, not buried in code. API designers and architects who might not necessarily know JavaScript can help with reviews.
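A minimal sketch of what such a directive could look like follows. The `@auth` directive, its `requires` argument, and the privilege names are hypothetical, and the wiring that attaches the check to resolvers in a GraphQL server is omitted.

```typescript
// Hypothetical schema: the @auth directive, its `requires` argument, and the
// privilege names are illustrative, not PayPal's actual definitions.
const AUTH_TYPE_DEFS = `
  directive @auth(requires: [Privilege!] = []) on FIELD_DEFINITION

  enum Privilege {
    VIEW_CARDS
    MANAGE_WALLET
  }

  type Query {
    user: User @auth
  }

  type User {
    firstName: String
    creditCards: [CreditCard!] @auth(requires: [VIEW_CARDS, MANAGE_WALLET])
  }

  type CreditCard {
    last4: String
  }
`;

interface Viewer {
  loggedIn: boolean;
  privileges: string[];
}

// The check a directive implementation could run before the field's resolver:
// the viewer must be logged in and hold every privilege listed in `requires`.
function isAllowed(viewer: Viewer, requires: string[] = []): boolean {
  return viewer.loggedIn && requires.every((p) => viewer.privileges.indexOf(p) !== -1);
}
```

The access rules sit next to the fields they protect, so a reviewer can audit them by reading the schema alone.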

Gain superpowers with great tooling

Previously, we wrote “GraphQL: Instrumenting your API and unlocking superpowers”, which explains how GraphQL has a unique advantage over REST: you gain increased visibility into how your API is being used, give clients extra confidence as they integrate with your API, and protect against breaking changes.

For starters, you can pipe latency, errors and usage data over to your company’s graphing tools. Here’s an example Grafana dashboard:

From Marc-Andre Giroux’s talk on Continuous Evolution of Schemas

Because you know which clients request which fields, you can proactively inform them when you deprecate something they use. At PayPal, we found that by being proactive and delivering changes in small bites, product teams are more likely to migrate sooner. Large programs and large migrations with complex planning processes can be daunting and disruptive to a developer’s workflow.
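As a sketch of how field-level usage data can drive deprecation notices, the example below assumes instrumentation already produces one record per client and field. The record shape, client names, and field names are all hypothetical.

```typescript
// Hypothetical usage record, e.g. aggregated from request instrumentation.
interface FieldUsage {
  client: string; // name of the calling application
  field: string; // "Type.field"
  count: number; // requests in the reporting window
}

// Groups deprecated fields by the clients that still request them.
function clientsToNotify(
  usage: FieldUsage[],
  deprecatedFields: string[]
): Record<string, string[]> {
  const result: Record<string, string[]> = {};
  usage.forEach((u) => {
    if (deprecatedFields.indexOf(u.field) === -1 || u.count === 0) {
      return; // field is not deprecated, or the client no longer uses it
    }
    const fields = result[u.client] || (result[u.client] = []);
    if (fields.indexOf(u.field) === -1) {
      fields.push(u.field);
    }
  });
  return result;
}

const exampleUsage: FieldUsage[] = [
  { client: "checkout-web", field: "User.cardList", count: 1200 },
  { client: "checkout-web", field: "User.firstName", count: 5000 },
  { client: "wallet-app", field: "User.cardList", count: 0 },
];

const toNotify = clientsToNotify(exampleUsage, ["User.cardList"]);
```

Here only `checkout-web` would be contacted, since `wallet-app` has already stopped requesting the deprecated field.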

Other tools that we recommend include: graphql-playground for testing out queries in dev-mode, graphql-schema-linter for enforcing schema naming conventions, eslint-plugin-graphql for linting your client queries against a schema, graphql-doctor for PR status checks.

Buy vs. Build

At PayPal, we recently started using Apollo Platform and feedback from product teams has been great!

Apollo Graph Manager provides instrumentation tools that show deep insights at a field level, give confidence against breaking changes, support whitelisted queries, and ease client integration by providing linting and in-line SLA timings for every field.

Apollo Graph Manager isn’t free. Could we have built this ourselves? Maybe, but it wouldn’t be as polished. We don’t have a GraphQL infrastructure team. If we did, we wouldn’t want to wait for 12–18 months to build out something comparable and have to maintain it. We want to deliver on the promise of GraphQL now! As a buy vs. build decision, Apollo Platform is a clear buy. We recommend that you consider it too.

Investing in GraphQL

If your company is using GraphQL, you should get involved and invest in GraphQL’s success. It’s important to network with other companies, share battle stories and bring back learnings to your company. You should join the GraphQL Foundation, which includes companies like PayPal, Facebook, Twitter, AWS, Intuit, and New York Times. Join groups like GraphQL Contributors Day or local meetups, or get involved in the GraphQL working group or make a specification proposal.

Challenges scaling GraphQL in the Enterprise

GraphQL is great so far, but it won’t be awesome until we solve a few problems with how the graph is assembled and how we measure success.

Many repositories

Before 2012, PayPal was a C++ monorepo. Since then, we have spawned thousands of GitHub repositories across many domains and product teams, making up layers of services and applications. For companies like Facebook and GitHub that have monorepos, sharing isn’t much of a problem.

GraphQL is tricky with many repositories. You can’t reference or link between remote types that don’t exist in your local filesystem without some custom trickery like stubbing out remote types. It isn’t easy for developers to discover or re-use types defined in another service.

Assembling your graph

Users want to see a single, cohesive graph that they can traverse without thinking about the many services behind it or having to hit each of them to get the data they need. In reality, assembling a single graph is difficult.

One solution is schema stitching, where a gateway consumes schemas from underlying GraphQL APIs and surfaces a single schema to the developer. Incoming queries are delegated to the underlying APIs. Marc-Andre Giroux wrote an excellent post on the challenges with schema stitching. With schema stitching, the gateway has glue code responsible for maintaining the relationships between types and ensuring subqueries are executed correctly. The glue code is problematic when the gateway owner doesn’t know the relationships between these types and when it’s unrealistic for product teams to own this infrastructure.

Apollo Platform’s solution is a federated development paradigm that uses custom directives to link together types in a declarative way. It eliminates the need for glue code in the gateway, sets a proper separation of concerns and allows you to extend types that you don’t own in local development. It will be interesting to see if companies will adopt it and how federation will influence the GraphQL specification.
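To give a feel for the declarative linking, here is a small sketch of two service schemas using the `@key` and `@external` directives from Apollo Federation as documented around the time of this post. The services, types, and fields are hypothetical, and the gateway setup is omitted.

```typescript
// Owned by a hypothetical accounts service: declares User and how to identify it.
const ACCOUNTS_SCHEMA = `
  type User @key(fields: "id") {
    id: ID!
    firstName: String
  }
`;

// Owned by a hypothetical payments service: extends a type it does not own.
// `@external` marks `id` as defined elsewhere; it is referenced only as the key.
const PAYMENTS_SCHEMA = `
  extend type User @key(fields: "id") {
    id: ID! @external
    creditCards: [CreditCard!]
  }

  type CreditCard {
    last4: String
  }
`;
```

Each team declares its part of `User` in its own repository, and the relationship between the two services is expressed in the schemas rather than in gateway glue code.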

What if building one graph is unrealistic? At PayPal, we optimize for quick iteration and continuous learning. Many developers aren’t incentivized to slow down, gain consensus and be a part of a larger whole. It’s a double-edged sword of our culture. If one graph isn’t possible, how else can we re-use work? Another option is local modules with options like GraphQL Modules and graphql-component. With local modules, you can pick the types and fields that you need and all code is running in the same process. Many issues with schema stitching and federation go away. But, we have a large number of similar endpoints and our server footprint isn’t reduced. Is this okay?

This is the most difficult problem that we have with GraphQL.

Measuring success

As stated earlier, the most compelling benefits of GraphQL are developer experience, productivity, and flexibility. At times, performance can be a benefit, but it’s not a promise you can keep.

How do you measure developer experience? How can the measurement be objective and not based on back-of-the-napkin math? How do you measure developer productivity? By comparing time to market between a project that uses GraphQL and one that doesn’t?

This continues to be a challenge for us. We talk to developers and they tell us life is so much better, but how do we quantify that to our leadership?

Like these challenges?

We’re hiring! 👋 If you would like to work on front-end infrastructure, GraphQL or React at PayPal, see our job openings!

Some of us are speaking at GraphQL Summit this week. If you are attending, we would love to share ideas and chat about GraphQL! 😊