BREW: The Multi-Tenancy Project

Matthew Chng
ShopBack Tech Blog
11 min read · Jun 25, 2024

Better Engineering Weeks (BREW)

Something’s… brewing 😉🍲

Since Q3 2022, ShopBack has dedicated two weeks each quarter to our Engineering team for Better Engineering Weeks (BREW). This initiative aims to tackle high-impact projects that cross team boundaries, driving significant progress in engineering investment areas and fostering greater collaboration across the organization.

The Whys:

  1. Enhancing Engineering Life: We strive to make our infrastructure more stable, operable, and cost-efficient;
  2. Planning for the Future: We aim to proactively plan for our technological future;
  3. Building Collaborative Teams: We believe in fostering collaboration by working together on impactful projects.

In this series, ShopBack Engineers will reflect on key BREW projects that have since become core initiatives, transforming how we work and how the Group operates — and it all started… with a little brew ☕ #OwntheProblem

More about me

Hi, I’m Matthew, and I joined the ShopBack Engineering team over 8 years ago in 2015. Back then, we were processing under 2K customer transactions per day in Singapore, Malaysia and the Philippines. I witnessed ShopBack’s growth in product offerings and customer base, and our expansion into nine new markets. I was also involved in the evolution of our teams, processes and technology stack to keep pace with the rapid growth in the scale of our business and everything that came with it.

Today, we drive over half a million transactions daily and power over US$4 billion in annual sales for 20,000 online and in-store partners across 12 markets, and we are raring to leave our mark ⚡ beyond the Asia-Pacific region (we’re now in Germany too).

In this post, I’d like to share a brief look at a recently completed major engineering project that I drove. We believe it sets us up for ShopBack’s next chapter of growth, in which we will scale further, faster, and more efficiently.

Here’s a diagram summarising the evolution of how we serve each market from our backend service stack at the end of each period.

How ShopBack BE clusters served the markets, 2014–2024

Multi-Tenancy: How it all began

ShopBack’s backend stack has evolved over the past eight years to tackle various challenges faced by a fast-growing startup aiming to be the best-in-class in every market it entered — and to do so in the quickest time possible.

By the end of 2023, ShopBack was serving 11 markets, each with its own stack, hosted across 5 AWS regions. Over 200 Engineers were building applications within their teams with much autonomy and speed. This organic growth, and lack of standardisation, quickly resulted in our stack becoming more complex to manage as the ShopBack team expanded and the number of product offerings grew.

Over the last few years, the following issues became increasingly glaring:

  1. Difficulty in centralizing maintenance due to the fragmentation of tools used.
  2. Spawning a stack for a new market was very slow, as standardized set-up processes were non-existent. Each team had to set up their services from scratch, and knowledge gaps caused by team changes over time and missing documentation certainly did not help.
  3. Each microservice required extra resources to ensure there was enough headroom for traffic spikes. This resulted in a large amount of provisioned resources remaining idle most of the time, which was very costly.

By the end of 2022, during the period of BREW, it became clear that these issues needed to be addressed before they began to hold back ShopBack’s growth.

Work on what we termed The Multi-Tenancy Project began with support from across ShopBack’s leadership. It became an organization-wide major initiative — with engineering and commercial leaders emphasizing its importance and the need for it to be done quickly, and more importantly, done well. The Multi-Tenancy Project was highlighted at nearly every Town Hall during the period of work.

The Problem we were Solving

The project primarily aimed to reduce complexity in three key areas, presenting significant near-term opportunities that would bring immediate and substantial value to the business:

  1. Reduce the time required to launch new markets
  2. Reduce dedicated infrastructure hosting costs
  3. Keep the growth of maintenance costs low

The multi-tenant aspect of the project focused on two major architectural goals that the Engineering team had to bring to life:

  1. Serving multiple markets from a common stack
  2. Serving multiple applications from a few common Aurora PostgreSQL clusters managed centrally

Serving multiple markets from a common stack

This meant that all applications had to be updated to handle requests from more than just a single market. It gave us the flexibility to add new markets to existing stacks, or to spawn a new stack at a datacentre with better latency for a new market (a sketch of what this looks like at the request level follows the list below).

  • Launching a new market in an existing stack would hence be faster, simply because the stack was already up and proven.
  • It also created opportunities to reduce the cost of the resources set aside as idle headroom, since that headroom could now be shared across markets.
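
To make this concrete, here is a minimal, hypothetical sketch of market-aware request handling in a shared deployment. It is not ShopBack’s actual service code; the X-SB-Market header, the market list, and the config shape are all assumptions made for illustration.

```python
# Sketch: one deployment serving several markets by resolving per-market
# configuration at request time. Header name and config are illustrative.
from flask import Flask, abort, g, jsonify, request

app = Flask(__name__)

# Hypothetical per-market configuration living inside one shared stack.
MARKET_CONFIG = {
    "SG": {"currency": "SGD", "schema": "cashback_sg"},
    "MY": {"currency": "MYR", "schema": "cashback_my"},
    "DE": {"currency": "EUR", "schema": "cashback_de"},
}

@app.before_request
def resolve_market():
    market = request.headers.get("X-SB-Market", "").upper()
    if market not in MARKET_CONFIG:
        abort(400, description="unknown or missing market")
    # Stash the resolved config for downstream handlers.
    g.market_config = MARKET_CONFIG[market]

@app.get("/cashback/summary")
def cashback_summary():
    cfg = g.market_config
    # A real handler would query the market's data; we just echo the config.
    return jsonify(currency=cfg["currency"], schema=cfg["schema"])
```

With this shape, adding a market to an existing stack becomes largely a matter of configuration and data provisioning rather than standing up a whole new set of services.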

Serving multiple applications from a few common Aurora PostgreSQL clusters managed centrally

We wanted to move away from each team having its own dedicated databases built on different technologies. This standardization (sketched briefly after the list below) would allow us to:

  • Centralize database infrastructure optimization efforts: A smaller, centralized infrastructure team could implement improvements more rapidly.
  • Centralize scaling strategies: Managing fewer configurations meant the infrastructure team could easily build tools to improve scaling efficiency for all, realizing cost savings much more quickly.
  • Utilize shared idle capacity: Serving multiple markets from the same stack amplified the cost savings from shared databases.
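
As a rough illustration of this goal, the sketch below shows how multiple applications (and markets) might share one centrally managed Aurora PostgreSQL cluster by confining each application to its own schema. The endpoint, schema naming, and credential handling are placeholders, not our actual setup.

```python
# Sketch: many applications sharing one Aurora PostgreSQL cluster, each
# scoped to its own schema. Endpoint, names, and env vars are hypothetical.
import os

import psycopg2


def connect_for_app(app_name: str, market: str):
    """Open a connection confined to one application's schema.

    Setting search_path keeps each application's tables isolated inside the
    shared cluster, while the infra team tunes and scales the cluster once
    for everyone.
    """
    schema = f"{app_name}_{market}".lower()  # e.g. "cashback_sg"
    return psycopg2.connect(
        host=os.environ["SHARED_PG_WRITER_ENDPOINT"],
        dbname="shared",
        user=os.environ["PG_USER"],
        password=os.environ["PG_PASSWORD"],
        options=f"-c search_path={schema}",
    )


# Example usage:
# with connect_for_app("cashback", "sg") as conn, conn.cursor() as cur:
#     cur.execute("SELECT count(*) FROM transactions")
```

Whether isolation is per schema, per database, or per set of roles is a design choice; the point is that the cluster itself is provisioned, monitored, and scaled in one place.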

Together with the Engineering leadership team, I planned, guided and drove the project to a successful completion, ensuring all our Engineering teams remained aligned with the project’s objectives and delivered the necessary changes promptly.

Our Biggest Challenges

I will share three major areas of concern we had and how we faced them.

1. Preventing Teams from Waiting on Each Other

No microservice in our architecture exists in isolation; they all depend on one another to deliver value to our ShopBack members (users / customers). To thoroughly test a change in one service, we must at some point call its dependencies or serve its dependents and check the results.

Our situation required all our services to be changed and tested together in a new stack, which posed a major challenge: teams would eventually require services from other teams, and those other teams might not yet be ready.

Deadlines were tight and every minute, hour, or day made a big difference. We needed a way for teams to work asynchronously to avoid slowing things down.

I worked with the infra team and came up with a creative solution relying on Istio’s Virtual Service feature in our Kubernetes stacks. We enabled teams to control traffic flow between the old development stack and the new one by deploying configurable Virtual Services to help proxy requests between the Kubernetes clusters. This allowed new services to receive traffic from dependents in the old development stack if the dependents’ owners had not yet made their services available in the new stack.

Similarly, Virtual Services were created by default to proxy traffic back to the old development stack for all services not yet deployed into the new stack by their owners. New services calling dependencies in the new stack would therefore, by default, receive responses from the old development stack if those dependencies were not yet ready.
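
The exact manifests aren’t reproduced here, but the shape of the idea is roughly the following: a default VirtualService that sends in-mesh traffic for a not-yet-migrated service back to the old development cluster. The service name, namespace, and hostnames below are hypothetical, and in practice the external host also has to be registered with the mesh (for example via a ServiceEntry).

```python
# Sketch: a default Istio VirtualService that proxies calls for a service
# not yet deployed in the new cluster back to the old development stack.
# Names and hosts are illustrative, not our real configuration.
from kubernetes import client, config

OLD_STACK_HOST = "orders.legacy-dev.internal"  # hypothetical route to the old cluster

virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "orders-fallback", "namespace": "dev"},
    "spec": {
        # Calls addressed to the in-mesh hostname of the missing service...
        "hosts": ["orders.dev.svc.cluster.local"],
        "http": [{
            "route": [{
                # ...are proxied to the old development cluster until the
                # owning team deploys "orders" here and replaces this rule.
                "destination": {"host": OLD_STACK_HOST},
            }],
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io",
    version="v1beta1",
    namespace="dev",
    plural="virtualservices",
    body=virtual_service,
)
```

Because teams controlled these rules themselves, traffic could be flipped to the new cluster service by service as owners became ready, without anyone waiting on anyone else.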

This solution worked beautifully and enabled our teams to work asynchronously, playing a huge role in accelerating the development experience for our Engineers.

2. Ensuring the Resulting Architecture Continues to Operate Correctly (and as well as before)

With so many changes going into all our microservices at the same time, it was difficult to be confident that the correctness of our system as a whole was not affected. We needed to establish a way to level up our confidence in this area so we could eventually land the changes and go live.

The plan we executed involved having multiple deployment environments and levels of testing. We had a development environment where the initial build and testing occurred. This environment had the aforementioned feature that allowed each team to test their services without needing the owners of their dependents or dependencies to be ready. Once applications were working in this environment, they were also deployed and tested in the staging and production environments.

We tested our applications in the new production environment twice: the first time without internet connectivity, but with live data, so that external side effects would not persist if correctness issues surfaced; and a second time where we were literally testing live in production.

We also had multiple levels and rounds of testing. Basic development practices involving run-of-the-mill testing such as automated unit, integration, and regression tests continued. We also carried out many additional manual tests for this project, first in our staging environment, and then in the production environment.

For production environment testing, we enlisted the help of all other departments in the company. We were also dogfooding our live application in each market. In every Town Hall during that period, leaders emphasized the importance of all ShopBackers partaking in dogfooding efforts. Joining the dogfooding programme became part of every new ShopBacker’s onboarding, regardless of their department. It was a heavily prioritized organization-wide project from start to finish.

Our go-live plan across our 12 markets (by then, we had extended our services into New Zealand) was intentionally staggered, allowing us to become more confident as each market was released.

These few paragraphs do not do justice to the amount of effort that went into surfacing issues in the new version of our backend stack before our customers could encounter them.

3. Deploying the Change Safely with Minimal Disruption to our Members

Finally, we had to ensure all teams were aligned on the go-live strategy. A balance had to be struck between being overly cautious with too many preparations and delivering the project in a timely manner.

Different teams had different risk appetites, and each team was afforded their own consideration and go-live plan. We were pragmatic in our approach, managing each known risk in a calculated fashion. If something was less likely to happen, we were more likely to accept a more complex and messy resolution. This helped us maintain focus and avoid becoming paralyzed in the process.

A big part of the success in this area was due to the streaming data migration we set up to help migrate live data between the old stack and the new one. Remember we also performed a major consolidation of our database technology onto Aurora PostgreSQL? (One of the two major architectural goals we had to bring to life.) Performing a backup and restore was not an option due to the sheer size of our databases and the length of downtime required. Instead, we relied on capturing data changes on the fly and built a streaming migration pipeline to transform the data for replication into our new databases. With this in place, we were able to successfully perform zero-downtime cutover of traffic from our old stack to the new one eleven times — once for each market that went live.
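
Our actual pipeline isn’t shown here, but a minimal sketch of the idea, assuming a Debezium-style change event stream on Kafka feeding upserts into the new Aurora PostgreSQL cluster, could look like this (the topic, table, columns, and hosts are invented for illustration):

```python
# Sketch of a change-data-capture replication loop: consume change events
# from the old database and upsert them into the new Aurora PostgreSQL
# cluster. Topic, table, columns, and hosts are hypothetical.
import json
import os

import psycopg2
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "cdc.legacy.transactions",  # hypothetical CDC topic for one table
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    enable_auto_commit=False,
)

conn = psycopg2.connect(
    host=os.environ["NEW_AURORA_WRITER"],
    dbname="shared",
    user="migrator",
    password=os.environ["PG_PASSWORD"],
)

UPSERT = """
    INSERT INTO transactions (id, member_id, amount_cents, status)
    VALUES (%(id)s, %(member_id)s, %(amount_cents)s, %(status)s)
    ON CONFLICT (id) DO UPDATE SET
        member_id = EXCLUDED.member_id,
        amount_cents = EXCLUDED.amount_cents,
        status = EXCLUDED.status
"""

for message in consumer:
    row = message.value["after"]       # Debezium-style event: new row state
    if row is None:                    # deletes, schema changes and transforms
        continue                       # are omitted from this sketch
    with conn, conn.cursor() as cur:   # commit per event; batch in real life
        cur.execute(UPSERT, row)
    consumer.commit()                  # acknowledge only once the row is durable
```

A real pipeline also has to backfill historical rows, preserve ordering, and transform data into the new schema (as ours did); the cutover then comes down to letting replication catch up and switching traffic.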

In fact, we designed our system so both the new and old stacks could be used simultaneously by different users. ShopBackers could dogfood the new stack with live data, while our ShopBack members were still served by the old stack. We also had our new stack serving some markets first, while other markets came onboard to the same stack a few days or weeks later.
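
Conceptually, the split between dogfooders and members is just another routing rule at the edge. The illustration below (the header name and hostnames are made up, and this is not our actual edge configuration) sketches one way to express it as an Istio VirtualService:

```python
# Sketch: send internal dogfooding traffic to the new stack while members
# stay on the old one. Header and hostnames are illustrative only.
import yaml  # PyYAML, used here just to print the manifest

dogfood_split = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "edge-dogfood-split", "namespace": "prod"},
    "spec": {
        "hosts": ["api.example-shopback.internal"],
        "http": [
            {   # Requests flagged as internal dogfooding traffic...
                "match": [{"headers": {"x-sb-dogfood": {"exact": "true"}}}],
                "route": [{"destination": {"host": "gateway.new-stack.internal"}}],
            },
            {   # ...everyone else continues to be served by the old stack.
                "route": [{"destination": {"host": "gateway.old-stack.internal"}}],
            },
        ],
    },
}

print(yaml.safe_dump(dogfood_split, sort_keys=False))
```

The same mechanism supports the staggered rollout: a market’s traffic simply keeps pointing at the old stack until its cutover.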

A sense of Pride

Although there were many unknowns and worries from the beginning and along the way, the Engineering team of over 200 ShopBackers was united and able to deliver the changes without any major incidents. Many team members stepped up to bridge gaps they identified and led in their own ways. Much was asked of all Engineers, pushing them every single day to meet deadlines and make things work in some way or another — and many rose to the challenge. It was a true testament to grit and demonstrated how the team held true to ShopBack’s “Can’t is not an Option” value. Now, we can all own this success.

Can’t is not an option

We have already started reaping the benefits of the work done to consolidate and standardize our database technology. The infra team can now keep optimizing how we scale, which has long been a major pain point in our history. It has been a long time coming, but I can now see a future where we will eventually be able to fully rely on automation to cover this, freeing us to focus on developing our product.

A big win: we were also able to extend our services into New Zealand in just 3 weeks, whereas in the past, it took 3 months to launch a new market. This feat would never have been possible without the completion of the Multi-Tenancy project.

Finally, regarding the estimated cost savings for the project: they were a stretch goal at the back of my mind going into the project, not something I focused my attention on the entire time. In the end, I was glad when we hit the target, and even more so when we exceeded the projections.

Future work

Among other things, we were aware of the glaring loss of isolation when we set out on this journey, and we have introduced new processes and new software features to manage such risks in our operations and infrastructure. These concerns were known from the beginning and are conscious trade-offs that the team decided to take on so that we could realize the gains we set out to achieve. Work doesn’t stop in our ever-evolving ShopBack product as we strive to ensure we’re relevant for our customers while continually improving our engineering practice.

❗️ Interested in what else we work on?
Follow us (ShopBack Tech blog | ShopBack LinkedIn page) now to get further insights on what our tech teams do!

❗❗️ Or… interested in us? (definitely, we hope 😂)
Check out here how you might be able to fit into ShopBack — we’re excited to start our journey together!
