Zero-Downtime API Gateway Cloud Migration

Budi Pangestu · Inside Bukalapak · Feb 16, 2021

2020 was a big year for the Bukalapak Technology team: it was the year we finally migrated our whole system from on-premise to the cloud (yes, we are one of the cool kids now :D)! It has been a wild and exciting journey, given that:

  1. We had 500+ microservices
  2. Most of our microservices were not developed with cloud infrastructure in mind

I will focus on the second challenge in this article, using the zero-downtime migration of Bukalapak's main API-gateway as a case study. Without further ado, let's jump right in!

A look into our API gateway architecture

At Bukalapak, we adopted a microservice architecture and mostly use RESTful APIs for interactions between clients and servers (which reside under api.bukalapak.com). Therefore the need for an API-gateway was prominent (more on why). We decided to build our API-gateway in Golang rather than use a well-known open-source solution such as vanilla NGINX or Kong, for several reasons:

  • Easier config management using YAML compared to NGINX's rigid, structured .conf files
  • We needed caching logic custom-tailored to our needs
  • We found Golang to be a more modern language, hence easier to use and maintain
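For illustration, here is what a route entry in such a YAML config might look like; the fields and names below are hypothetical examples, not our actual schema:

```yaml
# Hypothetical gateway route entry (illustrative field names only)
routes:
  - pattern: /products/*
    upstream: product-service.internal:8080
    cache:
      enabled: true
      ttl: 60s
    auth: required
```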
a glimpse into the overall architecture

In short, our API-gateway flow looks like this:

  1. Users request APIs through api.bukalapak.com and are served by a CDN (content delivery network)
  2. The CDN forwards the request to a centralized NGINX load balancer, which handles all external traffic, and then to the API-gateway
  3. The API-gateway handles authentication by checking a shared Redis cluster
  4. The API-gateway either fetches the response from the Memcached instance, or proxies the request to the corresponding microservice based on the request pattern
  5. Communication between on-premise and cloud goes through a dedicated interconnect with ~20ms latency
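To make the flow concrete, here is a minimal Go sketch of steps 3–4 (authenticate, then serve from cache or proxy); the in-memory maps stand in for the real Redis and Memcached instances, and all names and values are made up:

```go
package main

import "fmt"

// Hypothetical in-memory stand-ins for the shared Redis cluster (auth
// tokens) and the Memcached instance (response cache); the real gateway
// talks to those systems over the network.
var (
	authTokens    = map[string]bool{"valid-token": true}
	responseCache = map[string]string{"/products/1": `{"id":1}`}
)

// handle mirrors steps 3-4 of the flow: authenticate against the token
// store, then serve from cache or fall through to the upstream service.
func handle(token, path string) (string, error) {
	if !authTokens[token] {
		return "", fmt.Errorf("401 unauthorized")
	}
	if body, ok := responseCache[path]; ok {
		return body, nil // cache hit: no proxying needed
	}
	return proxyToMicroservice(path), nil
}

// proxyToMicroservice is a placeholder for the reverse-proxy step.
func proxyToMicroservice(path string) string {
	return fmt.Sprintf(`{"proxied":%q}`, path)
}

func main() {
	body, _ := handle("valid-token", "/products/1")
	fmt.Println(body) // served from the cache stand-in
}
```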

Now that you're familiar with our API-gateway, let's get into the main topic

Cloud migration journey: the planning

In our initial migration plan discussion meeting, we identified 3 main challenges:

  1. API-gateway needed to route requests to both on-prem and cloud microservices, and we wanted to skip interconnect latency overhead completely
  2. The API-gateway's Memcached was also used by our monolith, making them tightly coupled. Given that cloud migration is scary enough as it is, we wanted to migrate one service at a time
  3. Last but not least, since the API-gateway is a critical system, downtime (or even hiccups) would massively compromise our users' experience (which we deem top priority) and was not acceptable

Hence a simple lift-and-shift migration was not an option, and we decided to re-engineer and (happily) tackle the challenges!

Problem A: Interconnect overhead latency

Since we wanted to skip the interconnect latency entirely, dual deployment was the only solution we could think of. However, routing a user's request to either the on-prem or the cloud deployment was tricky. The initial idea was to differentiate at the DNS level, so api.bukalapak.com would route to on-prem while api2.bukalapak.com would route to the new cloud deployment. It would look like this:

But after some thought, this idea was not workable because it would require adjusting logic in the client applications, and most of our traffic comes from mobile apps (Android/iOS) for which we still support older versions

CDN Edge Workers come to the rescue!

Popular CDN providers tend to offer an Edge Workers capability (e.g., Akamai, Cloudflare) that lets us run code at the edge. Therefore, we decided to deploy our simple routing logic onto the edge worker; the script would look like this:
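(The actual worker runs JavaScript on the CDN edge; for illustration, the routing decision itself can be sketched in Go like this, with made-up path prefixes and origin names:)

```go
package main

import (
	"fmt"
	"strings"
)

// migratedPrefixes lists URL prefixes whose backing microservices have
// already moved to the cloud; the prefixes here are made up.
var migratedPrefixes = []string{"/carts", "/payments"}

// pickOrigin returns the origin the edge worker should forward to: the
// cloud deployment for migrated paths, on-prem for everything else.
func pickOrigin(path string) string {
	for _, p := range migratedPrefixes {
		if strings.HasPrefix(path, p) {
			return "cloud-gateway.internal"
		}
	}
	return "onprem-gateway.internal"
}

func main() {
	fmt.Println(pickOrigin("/carts/42")) // routed to cloud
	fmt.Println(pickOrigin("/users/1"))  // routed to on-prem
}
```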

It checks the request URL pattern and routes accordingly to on-prem or cloud. This approach helped us achieve zero latency overhead!

the dual deployment architecture

Problem B: Tangled cache problem

The solution to this problem was fairly straightforward but simplified our migration a lot. We agreed that the response cache should be the API-gateway's responsibility, so we moved the monolith's direct access to Memcached behind an API call to the API-gateway; that way, any migration-related adjustment would only take place in the API-gateway. The new flow looks like this:

Note how we double-write any cache update made on-prem to the cloud, which warms up the cloud Memcached naturally (pretty neat, eh?), since cached resources written by the monolith could also benefit the microservices in the cloud (I know it's complex; maybe more on that in future write-ups :D)
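A minimal Go sketch of the double-write idea, with in-memory maps standing in for the two Memcached deployments (the real gateway writes to both over the network):

```go
package main

import "fmt"

// In-memory stand-ins for the on-prem and cloud Memcached deployments.
var (
	onPremCache = map[string]string{}
	cloudCache  = map[string]string{}
)

// writeCache double-writes: every cache update made on-prem is mirrored
// to the cloud Memcached, warming it up ahead of the cutover.
func writeCache(key, value string) {
	onPremCache[key] = value
	cloudCache[key] = value
}

func main() {
	writeCache("/products/1", `{"id":1}`)
	fmt.Println(cloudCache["/products/1"]) // already present in the cloud cache
}
```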

multi-zonal Memcached configuration

On top of that, we also took advantage of the cloud's multi-zonal deployment feature, which improves the Memcached cluster's availability

Final thoughts

the final architecture

This cloud migration journey was really fun, a bumpy ride, and an important experience for our team and for Bukalapak as a whole

We successfully migrated our API-gateway with zero downtime, completely transparently to our users

It is by no means a perfect solution and there is plenty of room for improvement, for example:

  • An event-driven approach to the cache problem would be cleaner and would remove the monolith's dependency on the API-gateway completely
  • While our simple edge worker routing logic gets the job done, it becomes harder to read as new patterns are added. Cleaner routing logic using a trie would improve maintainability
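As a sketch of that last idea, a segment-based path trie in Go might look like this (the origin names are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// node is one path segment in the trie; a non-empty origin marks the
// end of a registered pattern.
type node struct {
	children map[string]*node
	origin   string
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert registers a path pattern (e.g. "/carts") with an origin.
func (n *node) insert(path, origin string) {
	cur := n
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		if cur.children[seg] == nil {
			cur.children[seg] = newNode()
		}
		cur = cur.children[seg]
	}
	cur.origin = origin
}

// lookup walks the trie segment by segment and returns the origin of
// the longest matching prefix, falling back to the given default.
func (n *node) lookup(path, fallback string) string {
	cur, best := n, fallback
	for _, seg := range strings.Split(strings.Trim(path, "/"), "/") {
		next := cur.children[seg]
		if next == nil {
			break
		}
		if next.origin != "" {
			best = next.origin
		}
		cur = next
	}
	return best
}

func main() {
	root := newNode()
	root.insert("/carts", "cloud")
	fmt.Println(root.lookup("/carts/42", "onprem")) // longest prefix match wins
	fmt.Println(root.lookup("/users/1", "onprem"))  // falls back to default
}
```

Lookups then cost one map access per path segment instead of a linear scan over every registered prefix.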

My mentor once told me, "if you didn't bash your 6-month-old code, then you didn't grow enough"


Software Engineer at GoPay | ex-Bukalapak Associate Software Architect