Empowering Agility: How an API Gateway Helped Us Move Faster and Decouple from Our Monolithic Past

Rudy
StashAway Engineering
6 min read · May 4, 2023

As a follow-up to our previous article on “Wrangling a big ball of mud”, this piece covers the next step in the evolution of our architecture. The tl;dr from the previous article is that we transitioned a legacy monolith called the app-server from an unloved tangle of twine to a much healthier modular monolith.

The next big task we had to grapple with was getting the app-server to a place where squads could decouple from it and start splitting core domains out.

So sit back, relax and enjoy the read!

Context

As described in the article linked at the top, the app-server was wearing multiple hats in our architecture:

  • Core service owning multiple bounded contexts
  • Entry point for all clients (web, mobile, admin, API)

Because the app-server was the entry point for all traffic, there were some second-order effects: it performed all the necessary authn and authz dances with the security components, and it became a pass-through proxy for other downstream services that needed to be client facing. 🙄

overloaded app-server wearing too many hats

As a result, all squads were still tightly coupled to the app-server, and we could not code freeze the monolith even if we wanted to. Add the fact that services were being spun up frequently to support new product development, and this quickly became a high-leverage bottleneck for us to tackle.

Objective

We wanted to introduce an API gateway in front of the app-server. It would intercept all public traffic and take over the responsibilities of a single entry point: proxying to other services, authentication and authorization. It would be a horizontal abstraction in our architecture, decoupled from all business domains.

Established Patterns

There are some established and well documented patterns on having such layers in your architecture (see: API Gateway). There are also some useful patterns on how to tease apart application code. I’m not going to bore you with the details here but if you are keen, you can read more about Strangler Fig and Branch by Abstraction.

The basic idea we were going to apply is to intercept all calls to the old functionality and start moving traffic towards the newer functionality, eventually being able to delete the older code.
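
To make the interception idea concrete, here is a minimal sketch of Strangler Fig style routing. It is not our gateway code: the `StranglerRouter` class, the handler functions and the `/portfolios` path are hypothetical names made up for illustration. Everything falls through to the legacy handler until a path prefix is explicitly migrated.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def legacy_app_server(path: str) -> str:
    # Hypothetical stand-in for the monolith handling a request.
    return f"app-server handled {path}"


def new_portfolio_service(path: str) -> str:
    # Hypothetical stand-in for a service that was split out.
    return f"portfolio-service handled {path}"


class StranglerRouter:
    """Sends migrated path prefixes to new handlers, everything else to the legacy one."""

    def __init__(self, fallback: Handler) -> None:
        self._fallback = fallback
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, prefix: str, handler: Handler) -> None:
        # Register a prefix whose traffic should now go to the new functionality.
        self._migrated[prefix] = handler

    def handle(self, path: str) -> str:
        # The longest matching migrated prefix wins; otherwise use the legacy handler.
        matches = [p for p in self._migrated if path.startswith(p)]
        if matches:
            return self._migrated[max(matches, key=len)](path)
        return self._fallback(path)


router = StranglerRouter(fallback=legacy_app_server)
print(router.handle("/portfolios/1"))  # still served by the legacy handler
router.migrate("/portfolios", new_portfolio_service)
print(router.handle("/portfolios/1"))  # now served by the new handler
```

Once every prefix has been migrated, the fallback receives no traffic and the old code can be deleted.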

Fundamental Requirements

There are many cross-functional requirements when it comes to a topic like an API gateway, but the core must-haves for us were:

  • Eliminate the coupling caused by the app-server acting as a pass-through proxy
  • Orchestrate authn and authz of requests
  • Perform GraphQL federation
  • Make it developer friendly by simplifying the integration points to be a simple config

In addition, we were clear upfront about what it was not meant to do. We did not want this new service to be used for inter-service communication (east-west traffic).
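
As a rough illustration of what "integration as a simple config" can look like, the sketch below resolves a request against a declarative route table and runs the authn and authz checks before picking an upstream. The `Route` fields, the token table, the role names and the upstream URLs are all assumptions invented for this example; they are not the actual config format of our gateway.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str                                   # path prefix owned by a service
    upstream: str                                 # where matching traffic is proxied
    requires_auth: bool = True                    # authn needed before proxying?
    allowed_roles: FrozenSet[str] = frozenset()   # empty set = any authenticated role


@dataclass(frozen=True)
class Request:
    path: str
    token: Optional[str] = None


# Hypothetical token-to-role lookup standing in for the real security components.
TOKENS = {"tok-customer": "customer", "tok-admin": "admin"}

# Hypothetical route table: this is the only thing a squad would need to edit.
ROUTES = [
    Route("/health", "http://app-server", requires_auth=False),
    Route("/admin", "http://admin-service", allowed_roles=frozenset({"admin"})),
    Route("/", "http://app-server"),
]


def resolve(request: Request, routes: Sequence[Route]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, upstream) for a request; upstream is None when rejected."""
    candidates = [r for r in routes if request.path.startswith(r.prefix)]
    if not candidates:
        return 404, None
    route = max(candidates, key=lambda r: len(r.prefix))  # most specific route wins
    if route.requires_auth:
        role = TOKENS.get(request.token or "")
        if role is None:
            return 401, None  # authn failed: unknown or missing token
        if route.allowed_roles and role not in route.allowed_roles:
            return 403, None  # authz failed: role not permitted on this route
    return 200, route.upstream


print(resolve(Request("/admin/users", "tok-admin"), ROUTES))
```

Keeping the route table free of business logic is what lets the gateway stay a horizontal layer rather than another place where domain code accumulates.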

Build vs. Buy

Ah, the age-old question. We should probably preface this by stating that our context might not apply to you, so YMMV 😉. To quote Martin Fowler:

Frodo said in Lord of the Rings, “Go not to the Elves for counsel, for they will say both no and yes.” While I’m not claiming any immortal knowledge, I certainly understand their answer that advice is often a dangerous gift. If you’re reading this to make architectural decisions for your project, you know far more about your project than I do.

When deciding whether to hand-roll a self-hosted solution or buy a COTS product from the many vendors out there (AWS API Gateway, Kong, etc.), we put the decision through our internal ADR rigour. Some of the juicy details are shared below.

Spoiler alert: we decided to build, but if the context behind the decision interests you, then please read on.

Why build, you may wonder. Well, there was little to no friction for us to get started: almost no dependencies, and the devs in charge had full control. We could spin up a simple new service and leverage our already battle-tested CI/CD infrastructure.

To illustrate the point, we got the new service into production within 4 days of the first commit. It was nothing but a transparent proxy to the app-server at the time, but it gave us a quick feedback loop: unblock the path to production first, then iterate.

Purchasing a COTS solution, by contrast, would have meant delays, dependencies (cross-team, product selection, platform team, etc.), vendor vetting, testing multiple products and paying for features we don’t need (retries, circuit breakers, etc., which Istio already provides).

We also needed to execute custom code. Some products support this via plugins, but that brings a new set of challenges: testability, propagating changes up the value stream to production via automated CI/CD, and the overhead of introducing a new language such as Golang or Lua to our stack.

An API gateway is a “big bet” which is hard to move away from once a decision is made. This meant we wanted better clarity on the features required before we chose a winner.

Note: we did consider Istio, but requirements like GraphQL federation and execution of custom code were also in view at the time, so we came down on the side of building something.

Just like that, we were off to the races!

The Plan

This is not the exact order in which we ended up executing it, but it gives you a flavour of how we planned the transition:

  • Start with a transparent proxy that simply intercepts the traffic and passes it forward
  • Take over authentication, authorization and proxying traffic that was destined for other services (the traffic which was previously purely proxied via the app-server)
  • Take over authn and authz for traffic destined for the monolith app-server
  • Apply a staggered approach: use weights and feature toggles to gradually shift traffic
staggered plan of action
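
The weight-plus-toggle idea from the plan can be sketched in a few lines. In our case the weights were applied by Istio rather than in application code, so treat the snippet below purely as an illustration of the mechanism: the function names and the choice of hashing a request ID into 100 buckets are assumptions made for this example.

```python
import hashlib


def bucket(request_id: str) -> int:
    """Map a request ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_gateway(request_id: str, gateway_weight: int, toggle_enabled: bool) -> bool:
    """Decide whether a request takes the new path.

    gateway_weight is the percentage (0..100) of traffic to shift.
    The feature toggle acts as a kill switch: when off, nothing is shifted.
    """
    if not toggle_enabled:
        return False
    return bucket(request_id) < gateway_weight


# With a weight of 10, roughly one request in ten takes the new path,
# and the same request ID always gets the same answer.
shifted = sum(use_gateway(f"req-{i}", 10, True) for i in range(10_000))
print(f"{shifted / 100:.1f}% of sample requests shifted")
```

Hashing instead of picking randomly keeps the decision stable per ID, which makes anomalies easier to reproduce while a change is soaking.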

Sounds simple? 😅 It rarely is. There were some surprises and unexpected behaviours which we did not test for upfront, but we managed to get everything done smoothly without any impact to customers. 👍

What’s been happening since?

Once we had our API gateway in place, squads were freely creating new services and making their client-facing traffic available via a simple config. As of this writing, we have 17 services integrated with the API gateway in production. 👏

Barring some security patches and critical changes, the app-server code is either frozen or shrinking as code that has moved out gets deleted. 🎉 Some squads were spurred to migrate their domain out of the app-server monolith and into a well-encapsulated service that is more aligned with DDD principles.

One more noteworthy change that happened was an internal strategic decision to move away from GraphQL in our stack. So the code that was being migrated out of the app-server was also being rewritten to use REST instead. This also meant that the federation feature we had built in the API gateway could be retired and the code was deleted (hurray for deletion of code!).

The agility of having control to do what’s needed in a changing landscape of requirements really paid off. 😃

Lessons Learned

  • Always use feature toggles, even when you think the change is trivial. The cost of going back to clean them up later is worth the investment
  • Start by building the observability you will need to manage the service. Continuously refine it as you go, and spotting issues in production will become second nature
  • Let changes soak in an environment and use observability to monitor changes against “the normal”
  • On observability again, make the most of the tooling (e.g. add custom tags on spans) to give yourself the information necessary to take action in the event of an issue
  • Make integration and testing simple. We made a script for teams to run which will generate the required config and test suites automatically. Less friction = more dev ❤️
  • Stagger your approach. We used Istio to apply weights on the inbound traffic so we could minimise the risk to our valuable customer traffic. Starting with 10% of traffic, we let it soak, monitored for anomalies, then slowly opened the flood gates
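
The "soak, compare against the normal, then open up" loop from the lessons above can be written down as a small sketch. Only the 10% starting point comes from our rollout; the later weight steps, the error-rate metric, the tolerance and the `observe` callback are hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence


def run_rollout(
    steps: Sequence[int],
    baseline_error_rate: float,
    observe: Callable[[int], float],
    tolerance: float = 0.005,
) -> List[int]:
    """Walk through traffic weights, rolling back to 0 on the first anomaly.

    observe(weight) should let the change soak at that weight and return the
    error rate seen during the soak. Returns the sequence of weights applied.
    """
    applied: List[int] = []
    for weight in steps:
        applied.append(weight)
        observed = observe(weight)
        if observed > baseline_error_rate + tolerance:
            applied.append(0)  # anomaly against "the normal": shift traffic back
            break
    return applied


# Healthy rollout: the error rate stays at the baseline at every step.
print(run_rollout([10, 25, 50, 100], 0.01, lambda w: 0.01))

# Unhealthy rollout: errors spike once half the traffic is shifted.
print(run_rollout([10, 25, 50, 100], 0.01, lambda w: 0.05 if w >= 50 else 0.01))
```

In practice the comparison would draw on whatever the observability stack already reports, and a person watching the dashboards may well make the final call on each step.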

Conclusion

So there you have it: a simple abstraction like placing an interface in front of the monolith is all you need to begin the journey of breaking up your tightly coupled monolith. We have certainly gained a lot of leverage by making this change, and we hope this article provides you with some tools to guide your decision too. 😄
