Our Journey in Adopting Federated GraphQL at SSENSE
Over the last 6 months, SSENSE developed a federated gateway to our presentational micro-services (commonly referred to as the back-of-the-front-end). During this period, we also planned and migrated a single code path on the website to use the newly minted gateway. This article gives a brief look into how and why we chose this pattern to evolve our consumer applications at SSENSE.
The Decision to Federate
SSENSE runs on a rich micro-service architecture, with many teams in many different domains. Here is a quick visual representation of the number of services needed to render our product listing page:
The complexity of our data, in concert with our average website traffic (astronomical, sometimes), makes the case for an equally complex architecture. Our data has a high variability of cache TTL (for example, inventory availability needs to be refreshed faster than static text content), and some domains have specific ownership requirements (for example, some devs, though not the canonical owner of product data, own part of the product data).
A few years ago, we decided that these two factors warranted discrete services for certain teams and data, and decomposed our monolithic API into many micro-services. Initially, this was a huge improvement over our legacy set-up: teams were shipping faster, data was less coupled, and objectives were clear. Eventually, though, as we hired more teams, those teams worked on features that required coupling of data (for example, personalization needing both user and product data). Since we didn’t want the business logic living on the client, we built several “aggregation” services.
Here is where things started to get messy:
Clearly, the direction the teams were headed was not going to scale — the organization was growing fast, and maintaining a 360-degree view of our data was becoming difficult. Feature delivery, too, was getting tough — it seemed that to meet any business objective, we needed to rope in another team or work on a foreign service to get the data we wanted on the client. As the organization scaled both in size and ambition, we were increasing our number of hoops to jump through.
In the summer of 2020, we decided to reimagine the front-end API. This time, though, we set out to solve problems around ownership and velocity, rather than complexity. After a lot of back-and-forth, we decided to go deep on Apollo’s new federation specification. For the uninitiated, a federated API exposes one large schema that is composed from many small schemas (or “subgraphs”) underneath. What struck us about federation was that it added data ownership in a way that made sense for our organization (teams provide data to the graph, and don’t have to worry about duplicating or aggregating data outside of their scope). The federation layer also gives us 360-degree observability on our APIs (something that was becoming blurry in the status quo).
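To make the idea of "one schema, many owners" concrete, here is a toy sketch in TypeScript. It is not how Apollo Federation is implemented, and it does not use any Apollo API; the `Subgraph` shape, the `catalog` and `inventory` services, and the `queryProduct` function are all hypothetical names chosen for illustration. It only shows the core idea: each subgraph owns some fields of an entity, and the gateway calls just the owners of the fields a client asked for, then merges the results.

```typescript
// Toy model of federation. All names here are hypothetical.
type Fields = Record<string, unknown>;

interface Subgraph {
  name: string;
  // Field names this subgraph owns on the Product entity.
  owns: string[];
  // Resolve the owned fields for a given product id.
  resolve(id: string): Fields;
}

const catalog: Subgraph = {
  name: "catalog",
  owns: ["name", "brand"],
  resolve: (id) => ({ name: `Product ${id}`, brand: "Example Brand" }),
};

const inventory: Subgraph = {
  name: "inventory",
  owns: ["inStock"],
  resolve: () => ({ inStock: true }),
};

// The gateway works out which subgraphs own the requested fields,
// calls only those, and merges their partial results into one object.
function queryProduct(subgraphs: Subgraph[], id: string, requested: string[]): Fields {
  const result: Fields = { id };
  for (const subgraph of subgraphs) {
    const wanted = subgraph.owns.filter((field) => requested.includes(field));
    if (wanted.length === 0) continue; // this subgraph owns nothing we asked for
    const partial = subgraph.resolve(id);
    for (const field of wanted) {
      result[field] = partial[field];
    }
  }
  return result;
}

// The client asks one question; two teams' services answer it.
const product = queryProduct([catalog, inventory], "42", ["name", "inStock"]);
console.log(product);
```

In this sketch, the client never needs to know that `name` and `inStock` come from different teams, which is the ownership property described above.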
As a proof-of-concept, we decided to migrate our Product Listing Page (PLP), which includes the latest arrivals, search, product collections, and brand/category pages, to the federation. The PLP has a unique position in our stack: it has minimal data requirements (it’s mostly powered by two downstream services), but is one of our most complex code paths (thousands of lines of business logic). This migration would require a full client-side rewrite of the page, the installation of two GraphQL APIs on our downstream services, as well as the introduction of the new federated gateway (a.k.a. a lot of work!).
We wanted to use this new pattern as a symbolic fresh start. The core tenets driving our decision-making, each covered below, were data availability, consolidation, and reusability.
The previous architecture had gotten us really far, but as mentioned above, teams were feeling growing pains. We wanted this new architecture to be intentional, and have a strong set of principles to help keep development smooth.
One of the main principles that drove our decision-making was data availability — a client should be able to get any data from the gateway in an obvious way. Having this in mind generated a lot of good discussion around where certain logic should live. For example: in the status quo, we had business logic for “retries” on empty search results (when receiving an empty search result, we retry the request with less specific filters). Because of the data availability principle, we instead developed a “fallback” mechanism where the client could specify an alternative query should there be no hits on the first attempt. This decision had a big impact on performance, and turned what was potentially five network calls, each with a call to ElasticSearch, into one network call with potentially five ElasticSearch calls.
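The fallback mechanism can be sketched roughly as follows. This is a minimal illustration, not the actual SSENSE schema or resolver: `SearchInput`, `searchWithFallback`, and the injected `SearchFn` (a stand-in for the call to the search backend) are all assumed names. The point it shows is that the client sends its primary query and its less specific alternatives in one request, and the loop over the search backend happens server-side.

```typescript
// Hypothetical shapes: the real schema and resolver are not shown in this article.
interface SearchQuery {
  term: string;
  filters: Record<string, string>;
}

interface SearchInput {
  primary: SearchQuery;
  // Less specific alternatives, tried in order if earlier queries return no hits.
  fallbacks: SearchQuery[];
}

interface SearchOutcome {
  hits: string[];
  // Index of the query that produced the hits (0 = primary), or -1 if none did.
  matchedQuery: number;
}

// Stand-in for a call to the search backend.
type SearchFn = (query: SearchQuery) => Promise<string[]>;

// Runs inside the API: one network call from the client,
// up to 1 + fallbacks.length calls to the search backend.
async function searchWithFallback(input: SearchInput, search: SearchFn): Promise<SearchOutcome> {
  const attempts = [input.primary, ...input.fallbacks];
  let index = 0;
  for (const query of attempts) {
    const hits = await search(query);
    if (hits.length > 0) return { hits, matchedQuery: index };
    index++;
  }
  return { hits: [], matchedQuery: -1 };
}

// Fake backend for illustration: only the unfiltered query returns hits.
const fakeSearch: SearchFn = async (query) =>
  Object.keys(query.filters).length === 0 ? ["sku-1", "sku-2"] : [];

const input: SearchInput = {
  primary: { term: "boots", filters: { color: "red", size: "42" } },
  fallbacks: [
    { term: "boots", filters: { color: "red" } },
    { term: "boots", filters: {} },
  ],
};

searchWithFallback(input, fakeSearch).then((outcome) => console.log(outcome));
```

Returning `matchedQuery` is a design choice of this sketch: it lets the client tell the user that the results come from a broader search than the one they asked for.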
Historically, we had fallen into the bad habit of over-fetching data, so another principle revolves around consolidation: anything that is available on the API should be available in one query. In our status quo, we fetch data from a one-size-fits-all `/products` endpoint. Because of the many needs this one endpoint fulfills, the payload is huge and the code to handle it is over-engineered (we were modeling the product payload not once, not twice, but nine times on a single call between ElasticSearch and the view layer). With our new API, we were able to remove a ton of modeling code, and with the federation we were able to get anything we wanted in a single query. In the new flow, we fetch twice and model once.
Last but not least, our final principle was reusability. As mentioned above, the status quo data usage led to a lot of coupling, which in turn produced a few really bizarre code paths. For example, the mobile and desktop views of our product listing page, despite having nearly the same data and functional requirements, are rendered from two different code paths. Going forward, we didn’t want to repeat these kinds of scenarios; in the federated state of things, mobile and desktop share one code path (writing this is cathartic).
In our experience, making your data available in a reusable way and fetchable with one query is a solid foundation for a federated graph.
Not Better Performance, but Faster Delivery
Developer experience was our north star. We wanted to enable our front-end teams to build more things, faster. Performance has to maintain a reasonable level, of course, and our objective was to match the status quo. We knew the removal of caching layers and the addition of the new federated gateway were going to add some latency. However, I’m happy to report that cleaning up the client side code and optimizing the server code paths leveled this out.
We met our goals for developer experience (or so it seems; we need more time to really assess this, but it’s looking good so far!). Previously, working on the product listing page was very complex and time-consuming. There were plenty of edge cases that led to several failed rounds of QA before shipping, and “I’m going to need another pair of eyes on this” was the unofficial tagline of code review. On our first post-federation feature, a new-to-the-company dev shipped an absolutely massive change to our navigation in two days (!). This is a big win, and an early confirmation of the developer experience value of the federation.
At SSENSE, we’re observability heavy: having logs and metrics helps with rapid debugging and minimizing downtime. However, much like the above concerns around ownership and quality, our once-great monitoring also went through the pains of organizational growth. As we added teams and more shared scopes, dashboards quickly multiplied and became hard to tell apart; it was normal to ask “which response time dashboard is the good one again?”.
So, inasmuch as the federation is meant to unify our downstream APIs, it also unifies how we observe our downstream APIs. Distributed tracing through the federation gives us a ton of real-time information (we have had many midnight revelations and learnings from performance-optimization delirium that I’ll save for another post).
We planned the federation knowing that we’d be centralizing our monitoring, so many decisions were made with observability in mind. We set up tracing/alerting/monitoring at the top level, with shared request IDs to link logs and traces. This will let us, once more services are migrated, use the federation as a command centre and source of truth for the health and performance of the stack.
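As a rough illustration of what "shared request IDs" means in practice, here is a small TypeScript sketch. It is an assumption-laden toy, not our production setup: the `x-request-id` header name, the in-memory `logs` array, and the `handleGatewayRequest` and `handleSubgraphRequest` functions are all hypothetical. It shows the gateway minting (or reusing) an ID, forwarding it to every downstream call, and every log line carrying it.

```typescript
// Hypothetical header name; the article does not say which one is used.
const REQUEST_ID_HEADER = "x-request-id";

type HeaderMap = Record<string, string | undefined>;

interface LogEntry {
  requestId: string;
  service: string;
  message: string;
}

// In-memory stand-in for a log pipeline.
const logs: LogEntry[] = [];

// Reuse the caller's ID when present so the whole chain shares one ID;
// otherwise mint a new one at the edge (the gateway).
function ensureRequestId(incoming: HeaderMap): string {
  return incoming[REQUEST_ID_HEADER] ?? `req-${Math.random().toString(36).slice(2, 10)}`;
}

function log(requestId: string, service: string, message: string): void {
  logs.push({ requestId, service, message });
}

// A downstream service reads the ID from the headers it receives.
function handleSubgraphRequest(service: string, headers: HeaderMap): void {
  log(ensureRequestId(headers), service, "resolved fields");
}

// The gateway stamps every downstream call with the same ID.
function handleGatewayRequest(incoming: HeaderMap, subgraphs: string[]): string {
  const requestId = ensureRequestId(incoming);
  log(requestId, "gateway", "received query");
  for (const service of subgraphs) {
    handleSubgraphRequest(service, { [REQUEST_ID_HEADER]: requestId });
  }
  return requestId;
}

const requestId = handleGatewayRequest({}, ["catalog", "inventory"]);
// Filtering by the ID reconstructs the request's path through the stack.
console.log(logs.filter((entry) => entry.requestId === requestId));
```

Because the ID is assigned once at the top level, a single filter on it pulls together the gateway entry and every downstream entry for the same request.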
In summary, developing the federated gateway at SSENSE has been, and continues to be, a huge learning experience. This article has been a high-level look at the what and why behind our decision to federate. Stay tuned for some deeper and more technical articles on GraphQL and Apollo Federation.
Thanks for stopping by, and be sure to smash that clap.