As our team and the number of features and services we develop and maintain grew, it became increasingly difficult to scale the suite of services that provided traditional REST APIs for our applications. Some APIs were written in PHP and others in Node.js, with little overlap in technology choices between services.
We ended up in a position that looked something like this:
An architecture like this meant that it was hard to:
- Have a consistent, unified way to deliver newly added data and functionality for both cross-team internal use and for our users.
- Provide that data or functionality in an unopinionated form, uninfluenced by specific consumers on the roadmap.
- Allow the new information to be connected to any and all existing data without needing to explicitly connect to its data source, load the data and define and implement the relationship.
The volume of data required to run a finance-focused business is already immense. The multiplier on overall complexity, once you start developing new ways to bring life to that data in the form of new functionality, services and tools, grows very quickly. We inevitably ran into a situation where teams were maintaining many smaller services to provide slices of data or functionality, but there was no way to access everything consistently or in one place. At a high level, these were some of the issues we were running into as a result:
- Ownership of services was often ambiguous as team composition and priorities changed throughout the year. Parts of our overall system would become neglected and fall behind on maintenance and alignment with other systems.
- Every service needed to access, load and provide any peripheral data based on how consumers were planning to use it. For example, a service responsible for providing content (news, updates, alerts) might also need to be able to provide data about which companies that content was written for and various data items related to that company, rather than being able to focus purely on providing the content and abstract itself from other parts of the overarching system. This creates a situation where a) your service is designed around a particular use-case or consumer and b) your service becomes dependent on extraneous information, which increases the risk of it being broken by changes to the database schema, changes to the way that data is accessed, etc.
- Consumers of these services, such as feature teams developing new functionality for our users, faced a frustrating journey of learning about and integrating multiple independent systems, each with its own authentication patterns, API styles and so on. This added a lot of overhead for multiple engineers. The original developers of each system had to set time aside to walk through the service they were responsible for, stay available to answer questions while the feature was being developed, and potentially jump into the codebase to fix bugs or make adjustments for the new use-case. All of this reduced their ability to contribute to the objectives of the team they were actually part of.
- Reconciling runtime information like logs and performance metrics was tedious. It was difficult to keep track of which services were performing poorly, and to update them without unknowingly degrading the performance of another service.
GraphQL + Apollo Federation
First Iteration: Exploration
The team wanted to explore GraphQL and Apollo Federation to see how they could work for a subset of simpler services that already existed and verify that this would be a good potential path forward.
The first iteration was developed as a single Node.js application using the Apollo Server libraries and related guides. Federation was implemented from the beginning, with a handful of individual subgraphs and a gateway, but these services were all defined within a single TypeScript project. That made it easy to unintentionally (or intentionally, if you were in a rush or just lazy) share implementation across them. Limited existing knowledge of GraphQL and federated architecture meant that these issues were hard for the team to catch consistently in review, and the project became messy very quickly. Worse, because everything lived in one place and could access sibling services directly, our misuse of the technology was inadvertently moving us towards a monolithic data access layer, the exact opposite of what we wanted to achieve.
Despite these problems, we were successful in migrating a small amount of functionality to our new graph, alongside a completely new service for serving news content, which was used in a production application a few months later with no significant incidents.
Second Iteration: Evolving with New Knowledge
Over the next 12 months, the GraphQL project became the go-to for several teams when it came to scoping out and developing a new system. The existing services were gradually expanded to reach the same level of data availability as our legacy APIs, and additional subgraphs were created to provide the backend for new features being developed for our users. This accelerated adoption exposed the shortcomings of the initial implementation: we needed to go back and rework it significantly before it could be a viable, scalable long-term foundation for building features. The changes that followed were:
- The introduction of a monorepo structure using yarn workspaces to properly isolate services from one another. This forces each service to fully own its implementation and prevents lazy or unintentional code-sharing. It also makes it much easier for teams to take ownership of the parts of the graph that relate to their objectives, since their contributions won’t have unintended effects on other services.
- Proper implementation of a federated graph, using references for entities owned by external services rather than reimplementing them or importing them directly from that external service. For example, the relationship between news content and companies is implemented in the content subgraph, but the company entity itself is owned by the company subgraph. The gateway then reconciles all of the subgraphs so that a company entity exposes its news content alongside all of the fields defined on the original company entity. Future services can simply define additional fields for their own domain on existing external entities, without needing to modify code in the service that owns them. This is extremely powerful, as it allows the graph to grow naturally without accidentally affecting existing services or teams.
- Adoption of Nest.js for the gateway and subgraphs, which brings consistency, a strong community and an exceptional framework to develop new services. Nest.js is also our framework of choice for existing Node.js based services and applications, so it aligns at a broader engineering team level as well.
- Breakdown of the complete GraphQL service into distinct deployments, one for the gateway and one for each service. This allows us to scale, deploy and monitor services independently. Services that deal with user-facing functionality can be much beefier than services that deal with internal utility. Each service also fully manages its own connection to resources it needs, such as databases and in-memory storage like Redis.
- Introduction of Apollo Studio, which provides a tonne of useful runtime metrics and contribution features.
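To make the entity-reference approach above more concrete, here is a minimal sketch of how a company subgraph and a content subgraph might split ownership of a Company entity. The type names, fields, sample data and Federation 2 syntax are all assumptions for illustration and are not taken from our actual schema. In a real service, each pair of type definitions and resolvers would typically be passed to `buildSubgraphSchema` from `@apollo/subgraph`; the sketch leaves that out so it stays dependency-free.

```typescript
// Illustrative types and data only; none of these names come from the real graph.
interface CompanyRef { id: string }
interface Company extends CompanyRef { name: string }
interface Article { id: string; title: string; companyId: string }

const companies: Company[] = [{ id: "c1", name: "Example Corp" }];
const articles: Article[] = [
  { id: "a1", title: "Example Corp reports earnings", companyId: "c1" },
  { id: "a2", title: "Unrelated story", companyId: "c2" },
];

// Company subgraph: owns the Company entity and its core fields.
const companyTypeDefs = `
  extend schema
    @link(url: "https://specs.apollo.dev/federation/v2.0", import: ["@key"])

  type Company @key(fields: "id") {
    id: ID!
    name: String!
  }

  type Query {
    company(id: ID!): Company
  }
`;

const companyResolvers = {
  Query: {
    company: (_parent: unknown, args: { id: string }): Company | undefined =>
      companies.find((c) => c.id === args.id),
  },
  Company: {
    // Invoked when the gateway asks this subgraph for a Company
    // identified only by its key.
    __resolveReference: (ref: CompanyRef): Company | undefined =>
      companies.find((c) => c.id === ref.id),
  },
};

// Content subgraph: contributes a `news` field to Company without
// loading or redefining any other company data.
const contentTypeDefs = `
  extend schema
    @link(url: "https://specs.apollo.dev/federation/v2.0", import: ["@key"])

  type Article {
    id: ID!
    title: String!
  }

  type Company @key(fields: "id") {
    id: ID!
    news: [Article!]!
  }
`;

const contentResolvers = {
  Company: {
    // The parent here only needs the key field (`id`).
    news: (company: CompanyRef): Article[] =>
      articles.filter((a) => a.companyId === company.id),
  },
};
```

Note that the content subgraph never touches the company data store: it only needs the company's `id` to answer for `news`, which is what keeps the two services independent.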
The new architecture looks something like this, much cleaner and easier to understand than before:
Problems We Solved
GraphQL and Apollo Federation allowed us to solve a bunch of the original problems we were facing:
- Feature teams have a consistent experience when they need to access some new data or functionality. They can jump into Studio or manually download the SDL to see fully documented updates to the graph. They can use the same entrypoint, credentials and interface for fetching data or invoking mutations.
- Teams working on new services are able to expand existing entities without needing to explicitly load those entities into their service. If the underlying storage method for those entities changes, external services don’t need to concern themselves with it. If there is a bug in some functionality that is accessible through your subgraph, patching and redeploying the subgraph that owns that functionality will fix the problem across the entire graph.
- Although services still need properly managed ownership to not fall behind on maintenance and improvements, being a part of a whole means they are more likely to be looked after and not forgotten about.
- Because each service is a discrete application with its own build process, deployment, scaling parameters and runtime data, it doesn’t face the same risk as a monolithic API service where eventually you hit a wall trying to scale it properly as a single enormous deployment. This also means that overall risk is reduced during deployments since the rest of the graph will be unaffected if the deployment of an individual slice is buggy or straight-up broken.
- Though we have runtime metrics and logs for the individual services, tooling such as Apollo Studio provides a reconciled view so that you can monitor the performance and health of the entire graph as a single entity.
- Services can deal with providing their own domain-specific data and functionality without forcing responses into a specific shape for consumers. The consumer can pick and choose what fields they care about and what related data they need, which is facilitated by lightweight resolver declarations that reference entities owned by external subgraphs.
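As a rough illustration of the consumer side, the sketch below builds a single request that selects only the fields a feature needs, even though those fields may be served by different subgraphs. The gateway URL, the bearer-token header, the query shape and the field names are all hypothetical; they stand in for whatever entrypoint and credentials a real consumer would be given.

```typescript
// Hypothetical gateway endpoint; not a real URL from our infrastructure.
const GATEWAY_URL = "https://graph.example.com/graphql";

interface GraphQLRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// The consumer names only the fields it cares about. It does not need to know
// which subgraph serves `name` and which serves `news`.
const COMPANY_WITH_NEWS = `
  query CompanyWithNews($id: ID!) {
    company(id: $id) {
      name
      news {
        title
      }
    }
  }
`;

// Builds a standard GraphQL-over-HTTP POST request.
// Bearer-token auth is an assumption made for this sketch.
function buildRequest(
  query: string,
  variables: Record<string, unknown>,
  token: string,
): GraphQLRequest {
  return {
    url: GATEWAY_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({ query, variables }),
    },
  };
}

const companyRequest = buildRequest(COMPANY_WITH_NEWS, { id: "c1" }, "example-token");
// In an application this would be sent with:
//   fetch(companyRequest.url, companyRequest.init)
```

Every feature team would use the same request shape against the same entrypoint, and only the query text changes from one use-case to the next.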
Moving to GraphQL and a federated architecture has been very rewarding and exciting for the team so far, though not without its own challenges, particularly when it comes to maintaining consistently high performance. The reduced overhead of needing to manually load additional non-domain data, having plenty of good quality boilerplate services to work from and having a consistent development approach means we can create and iterate on new backend functionality much faster than before. On top of that, rolling out new functionality is lower risk than contributing to an existing application and having to go through the process of deploying the entire thing.
For us, the next steps are:
- Encourage the wider team to build new features as federated services that can be merged into the graph.
- Introduce higher-performance data sources, such as DynamoDB, for data that is read often and written rarely. Because each service manages its own connections to the resources it needs, these can be swapped in easily and improve the performance of the entire graph.
- Improve monitoring to detect poorly performing fields on the graph and alert the team to them, so they can be surgically dealt with.