A strategic evolution of our core systems using microgateways

Engineers at Macquarie
Macquarie Engineering Blog
Jul 31, 2023


By Ciaran Chan, VP of Engineering

Macquarie’s Banking and Financial Services (BFS) group has been on a journey to migrate all banking technology onto the public cloud. Along the way we have transformed many of our capabilities into cloud-native microservices, migrated our core systems including commercial-off-the-shelf (COTS) and re-engineered the full platform stack using an infrastructure as code philosophy.

The completion of one chapter is the beginning of another. We now have the building blocks to bring more innovation to increase reliability and velocity for our backend systems.

Current limitations and use cases

As much as our core systems have invaluable and rich functionality, there were some inherent limitations that we were eager to solve.

Our business and digital traffic volumes have been growing at a significant rate, and we increasingly encounter situations that call for more sophisticated API traffic management capabilities, such as rate limiting, circuit breakers, distributed tracing and modern auth, for our internal east-west traffic. Historically, these cross-cutting capabilities were embedded into specific applications and implemented in inconsistent ways, which prevented us from managing our traffic effectively at scale.

Resiliency through circuit breakers and rate limiting

Being on the cloud does not make us immune to issues or failures in our network or core backends. Under such scenarios we want to fail fast and safe to avoid cascading impacts to other services: extended timeouts in backends can often lead to resource exhaustion in calling apps.
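To illustrate the fail-fast behaviour described above, here is a minimal circuit breaker sketch. This is purely illustrative (our production implementation is Envoy's built-in circuit breaking, not hand-rolled code); the class name, thresholds and error types are assumptions for the example.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after a run of failures, fails fast
    while open, then allows a trial request after a cool-down period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of tying up resources on a sick backend.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, let one trial request through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

Failing fast while the breaker is open is what contains the blast radius: callers get an immediate error rather than holding connections and threads against a backend that is already struggling.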

Another use case is rate limiting to prevent core systems from being overloaded due to unforeseen issues internally within our ecosystem or unexpected spikes in traffic.
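A common way to express this kind of limit is a token bucket, sketched below. Again, this is an illustration of the concept rather than our gateway's actual implementation, and the parameter names are assumptions.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: each request consumes a token; tokens
    refill at a fixed rate, and an empty bucket means the request is shed."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: shed load to protect the backend
```

The burst capacity absorbs short spikes while the refill rate caps sustained throughput, which is exactly the protection a core system needs against unexpected traffic surges.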

With these safety mechanisms we can improve our overall ecosystem resiliency by containing issues and reducing the blast radius whilst our engineers work to resolve the root cause. We have the flexibility to trigger them automatically at pre-determined thresholds, or to enable them manually within minutes as needed.

Improved observability

As our ecosystem became more distributed, it became increasingly challenging to triage and diagnose issues. Our cloud-native microservices had already adopted distributed tracing; however, this was not available in many of the core backend systems (especially COTS).

The microgateway enables improved observability through standardised logging and diagnostic information, including distributed trace IDs and other cross-cutting data such as client duration (response time to the client) and payload size, which have proven invaluable in helping us investigate issues.
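Conceptually, the gateway wraps each backend call to propagate a trace ID and emit a standardised log line. The sketch below shows the idea; the header name, log fields and `forward` callable are all hypothetical, not the fields our platform actually emits.

```python
import json
import time
import uuid

TRACE_HEADER = "x-trace-id"  # illustrative header name only

def handle(request_headers, body, forward):
    """Wrap a backend call with the cross-cutting telemetry described above:
    a propagated trace ID, client duration and payload sizes."""
    # Reuse the caller's trace ID if present, otherwise start a new trace,
    # so even COTS backends without native tracing appear in one trace.
    trace_id = request_headers.get(TRACE_HEADER) or uuid.uuid4().hex
    start = time.monotonic()
    response = forward({**request_headers, TRACE_HEADER: trace_id}, body)
    log_line = {
        "trace_id": trace_id,
        "client_duration_ms": round((time.monotonic() - start) * 1000, 2),
        "request_bytes": len(body),
        "response_bytes": len(response),
    }
    print(json.dumps(log_line))  # standardised line for the log analytics platform
    return response
```

Because the gateway sits in front of every backend, each hop gets the same log shape and the same trace ID, which is what makes cross-system transaction tracing possible without modifying the backends themselves.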

Adopting modern auth methods

We are increasingly moving towards modern auth methods that are not always supported by our core backend systems. The microgateway allows us to implement auth using JSON Web Tokens (JWTs) or mutual Transport Layer Security (mTLS), providing a modern and consistent API experience and enabling fine-grained authorisation.
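The essence of gateway-side JWT auth is verifying the token's signature and expiry before the request reaches the backend. The stdlib-only sketch below uses HS256 with a shared secret purely to keep the example self-contained; a production gateway would normally verify RS256/ES256 tokens against keys published by the identity provider.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url_decode(segment: str) -> bytes:
    # Restore the padding that JWT's base64url encoding strips.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def verify_jwt_hs256(token: str, secret: bytes) -> dict:
    """Verify an HS256-signed JWT and return its claims, rejecting
    bad signatures and expired tokens."""
    header_b64, payload_b64, sig_b64 = token.split(".")
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(payload_b64))
    if claims.get("exp", float("inf")) < time.time():
        raise ValueError("token expired")
    return claims  # e.g. 'sub' can drive fine-grained authorisation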

Blue/Green and canary routing

Whilst many cloud load balancers offer this functionality, what we found invaluable is having this capability in a microgateway external to the full backend stack (which may have its own internal load balancer). Leveraging this has allowed us to achieve zero-downtime major backend upgrades with quick and safe rollbacks.
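At its core, canary routing is weighted random selection between backend stacks. A minimal sketch of the idea (backend names and weights are invented for illustration):

```python
import random

def pick_backend(routes, rand=random.random):
    """Weighted routing as used for canary releases: send e.g. 5% of
    traffic to the new stack and the remainder to the current one.
    `routes` is a list of (backend, weight) pairs."""
    total = sum(weight for _, weight in routes)
    point = rand() * total
    for backend, weight in routes:
        point -= weight
        if point < 0:
            return backend
    return routes[-1][0]  # guard against floating-point edge cases

# Example: shift 5% of traffic to the upgraded ("green") stack.
routes = [("blue-stack", 95), ("green-stack", 5)]
```

Rolling back is then just a config change that sets the new stack's weight to zero, which is why this approach makes major upgrades both fast to progress and safe to reverse.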

Our solution

With a high-reliability and lean mindset, we identified our key initial requirements:

  • Cloud-neutral to support our multi-cloud environment.
  • Prefer CNCF open-source components that are mature and proven.
  • Deployable in minutes using a GitOps-based DevOps workflow — consistent with our microservices development flow, to ensure a great developer experience and productivity.
  • Interoperable with our diverse backends — we have many different core systems with similar needs, so a re-usable solution was the best approach.
  • Integration with our enterprise platforms for secrets management, identity management and log analytics is essential.
  • Self-serviceable ownership — we organise our engineering teams by capabilities, and it is important to ensure we are empowered to maintain our stacks.

This diagram outlines our overall solution:

We decided on Solo.io’s enterprise-supported product, Gloo Platform, that is based on the open-source Envoy service proxy. This is deployed on a hyperscale PaaS Kubernetes cluster as the foundation for our microgateway platform. In recent years Envoy has gained momentum as one of the most popular, performant and lightweight proxies capable of being deployed in different configurations. Envoy is also the default service proxy in the Istio service mesh. We chose the Gloo Platform to get the benefits of vendor support along with enterprise features such as Lightweight Directory Access Protocol (LDAP) integration and Web Application Firewall (WAF) functionality that are accelerating our build-out of this platform.

A great developer experience and improved productivity through a GitOps workflow are also important to us. Given our deployment on Kubernetes, a continuous delivery and workflow toolkit using Argo CD and Argo Workflows was a natural fit. With this, we built out a platform that abstracts all the underlying complex moving parts to provide a simple declarative GitOps developer experience featuring self-service ownership by different domain/backend teams. It is as simple as writing or updating the YAML file, getting pull-request approval, and automated deployment within minutes.

Production success

Our philosophy is that value is delivered only when software is live and in use.

Our first implementation for one of our core systems is processing more than 14 million requests a day. The microgateway latency for our starting use case is approximately 2–3ms and well within our tolerance given the benefits. It has been highly stable with no production incidents or outages.

We have also leveraged the advanced routing capabilities to enable zero downtime for major backend upgrades with a fast rollback mechanism that would otherwise have been difficult to achieve.

The improved telemetry is now providing our engineers with deep observability to help trace transactions, saving hours of precious time during production incidents.

Client app perspective

Notwithstanding the importance of backend capabilities, a holistic approach to fault tolerance and resiliency also means thinking from the perspective of the client app invoking the APIs. In live production systems, a typical strategy for dealing with transient faults is to retry the request. However, a naive retry algorithm is not optimal, as the retries themselves may contribute to spikes in requests that overload the system. Our teams are leveraging techniques such as retries with exponential backoff to minimise cascading failures across our ecosystem.
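The retry-with-backoff pattern described above can be sketched as follows. This is a generic illustration, not our teams' actual client library; the "full jitter" randomisation is one common variant, and the parameter defaults are assumptions.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry a transient failure with exponential backoff and full jitter,
    so synchronised retries don't amplify a spike into an outage."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Double the ceiling each attempt, then randomise ("full jitter")
            # so many clients don't all retry in lock-step.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter matters as much as the backoff: without it, thousands of clients that failed at the same instant would all retry at the same instant, recreating the very spike that caused the failure.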

Forward thinking: An evolutionary architecture

Our success so far gives us confidence that we have a solid platform on which to enable microgateway capabilities more broadly for our internal east-west traffic. Our engineers have developed a microgateway-as-a-service platform to democratise these capabilities within our ecosystem so that other teams can leverage them.

We have commenced the journey to replace our older API access gateways with these microgateways. What used to take two weeks to configure and deploy, can now be done in minutes thereby providing teams with better agility to deliver awesome features for our clients.

The microgateway decouples consumers from backends in a way that allows our systems to evolve their architecture and to modularise and break up over time. The architecture is also flexible enough to let each backend adopt the microgateway at a pace we are comfortable with, rather than taking a big-bang approach.

We are only just getting started and our engineers are excited by the possibilities and opportunities ahead of us.

Meanwhile we are always looking for ways to continuously improve our technology and solve interesting engineering problems for our business. Please visit our careers page if you are interested in being part of this journey with us.

Finally, I would like to recognise and give a massive thanks to our integration platform engineers who contributed to turning this vision into reality. Watch out for a future blog from the team with more engineering details of our microGateway as a Service (we are calling it mGaaS).
