Breaking the Monolith
Modular redesign of Agoda.com
This article tells the story of the ongoing modular redesign of the Agoda website, as it stood at the time of writing. The first part describes the development process at Agoda, and the second part is a high-level overview of the modular architecture.
I am Vlad, an Engineering Technical Lead in Agoda’s Frontend department. I work primarily on the modular redesign, which includes migrating traffic from Windows to Linux and migrating the codebase from ASP.NET Framework to .NET Core.
Let’s dive in!
How We Work
Agoda's front-facing web applications use the .NET stack together with modern frontend technologies such as React, Redux, Webpack, and GraphQL.
The server side is mainly concerned with APIs and with returning minimal landing-page HTML. The client side handles the UI of the website (desktop and mobile) and the WebViews for native apps.
For the Desktop website and a small number of WebView pages, all frontend teams work together in a single repository that holds the codebase of one ASP.NET Framework monolith application. We move fast using GitHub Flow, and every day developers open dozens of pull requests against the main branch. We release a new version of this monolith application several times a day.
The mobile version of Agoda's website is a Single-Page Application, which we call MSPA for short. We develop MSPA in a separate repository. The majority of MSPA requests are handled by a standalone API called Gateway.
When a developer has finished their changes, the new release candidate needs to pass a set of tests (we will talk more about these tests later). Our CI/CD uses a “canary” deployment process: we deploy to one cluster first, and if all is good, we deploy to all data centers.
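The canary process described above can be sketched roughly as follows. This is only an illustration, not Agoda's actual pipeline: the `Deploy` and `HealthCheck` callbacks and the target names are hypothetical, and a real rollout would be asynchronous and driven by monitoring data.

```typescript
// Hypothetical callbacks: how a deployment and a health check are actually
// performed is outside the scope of this sketch.
type Deploy = (target: string) => void;
type HealthCheck = (target: string) => boolean;

// Deploys to the canary cluster first and continues to the remaining
// targets only if the canary is healthy. Returns the targets deployed to.
function canaryRollout(
  canary: string,
  remaining: string[],
  deploy: Deploy,
  isHealthy: HealthCheck
): string[] {
  const deployed: string[] = [];
  deploy(canary);
  deployed.push(canary);
  if (!isHealthy(canary)) {
    // Stop here: the release candidate never reaches the other targets.
    return deployed;
  }
  for (const target of remaining) {
    deploy(target);
    deployed.push(target);
  }
  return deployed;
}
```

The point of the design is that a bad release candidate is stopped after a single cluster instead of reaching every data center.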
All new business logic (including bug fixes and refactoring) is covered by experimentation, known as A/B testing. We monitor and measure everything: booking rate, exceptions, latency, hits, and much more.
We need to wait for enough data to compare the two variants. If B performs better than A (B wins), we can “take” the code changes. Work on a particular story is complete only when the code of the A variant has been removed from the codebase.
If you want to know more about A/B experimentation at Agoda, please read the article “How to Fail Like a Boss at Agoda” by Max “Bear” Mahasak Pijittum, an ex-Agoda Software Engineer and my good friend, who shared his thoughts on this topic.
Now you know how Agoda’s Frontend Department works; let us discuss how we measure success and what is driving us to make such a significant change.
How Do We Measure Success?
Agoda is a data-driven company where data is the single source of truth. For example, the B variant of each experiment must prove that it delivers better metrics than variant A. To measure our development process, we use three main metrics: DevFeedback, Leadtime, and CI Success Rate.
Imagine this workflow: a developer has made code changes and wants to be sure that the application is still stable enough to be deployed to production. The time it takes to answer this question is called Development Feedback (DevFeedback). Here is what happens during this time: build the server side, build the client side, run unit tests and Jest tests, and run Feature Tests.
Feature Tests: Selenium tests on mocked data, with no data from external systems such as databases or APIs.
What is Leadtime? Leadtime is the time from the moment a pull request starts its trip through CI until it is merged into the main branch of the GitHub repo. In CI we additionally run Integration Tests.
Integration Tests: Selenium tests on real data. We check application behavior with actual data and all other system dependencies.
Formula of Success
The CI Success Rate shows the percentage of PRs that pass all CI stages and end up as a ready-to-deploy release candidate. We keep the CI Success Rate under control: when it is lower than it should be, we improve our tests.
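To make the definitions concrete, here is a minimal sketch of how Leadtime and CI Success Rate could be computed from CI records. The `PullRequestRun` shape and its field names are assumptions made for illustration; they are not taken from Agoda's tooling.

```typescript
// Hypothetical record of one pull request's trip through CI.
interface PullRequestRun {
  id: number;
  ciStartedAt: Date;        // when the PR entered CI
  mergedAt: Date | null;    // null if the PR never reached the main branch
  passedAllStages: boolean; // true if every CI stage succeeded
}

// Percentage of PRs that passed every CI stage (0 for an empty list).
function ciSuccessRate(runs: PullRequestRun[]): number {
  if (runs.length === 0) {
    return 0;
  }
  const passed = runs.filter((run) => run.passedAllStages).length;
  return (passed / runs.length) * 100;
}

// Leadtime in minutes for one PR, or null if it was never merged.
function leadtimeMinutes(run: PullRequestRun): number | null {
  if (run.mergedAt === null) {
    return null;
  }
  return (run.mergedAt.getTime() - run.ciStartedAt.getTime()) / 60000;
}
```

In practice such numbers would be aggregated over a time window (for example per day or per week) so that trends are visible.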
DevFeedback tests are cheap and fast, and we control how fast they are, but they have a disadvantage: they cannot cover everything. So we add more Integration Tests, which are slow and complex but protect us from real trouble. If a bug somehow leaks to production, we cover the affected business logic with tests, and it is much better if unit or feature tests alone are enough for it. We should always keep the balance.
Website modularization aims to solve existing technical problems and significantly improve our current metrics. Let’s discuss the current architecture in terms of problems and solutions.
Key Principles of New Design
Eight years ago, when a small number of teams worked on the Agoda website, the monolith application caused no issues. In the last four years, the Agoda website has grown rapidly in terms of services and features: Hotels, Homes, Flights, Packages. Today the Agoda website is in the hands of 20 different frontend teams.
Problem: Monolith Website Application
When so many teams work on different things in the same place, each change may impact the others. An especially critical area is the request pipeline, where any minor change may significantly impact the performance of every request.
Solution: Domain Isolation
Isolate domains. One team handles one “product domain” and controls its codebase. This gives the team freedom to choose its internal design and code conventions.
Problem: Cross-repository Development
The structure of the repositories impacts the development process and slows it down. A new feature for a single page on Desktop and Mobile requires development in two different repos, which doubles your “time to market”.
Solution: Full-Stack Repository
Full-stack repository segregation: in each repository, a team has the server-side codebase for its pages and APIs and the client-side codebase for all the Desktop, Mobile, and WebView pages it owns. A domain repository should also contain all required tests: unit tests, feature tests, and integration tests for domain and cross-domain use cases.
Problem: Test all, Deploy all
Sharing one CI/CD process across 20 teams can also be a bottleneck. For example, when we have a CI/CD environment issue, the whole website deployment is stuck until we fix it.
In the monolith application, we run the tests of the whole application even for a minor code change. A “flaky” test slows down the deployment process until the owning team stabilizes it.
Solution: End-to-end Ownership
Each team should be responsible for the development and deployment of its product, using a standalone CI/CD with full test coverage, including tests across systems.
As you can see, all technical proposals lead to breaking the monolith and recombining the repositories. In the last chapter, we will talk about our strategy and tactics for approaching the Modular Architecture.
Migration to a new architecture takes time. From the business side we have two requirements: start using Linux servers as soon as possible, and keep adding new features to the website. Therefore, the migration to the new architecture is not a revolution but an evolution.
The first system component to mention, before we start talking about the website itself, is the WebGate proxy. This proxy sits in front of all frontend systems. Originally, WebGate was built for common needs, but it has become a cornerstone of the new design. In terms of website modularization, WebGate does two important things:
- Route traffic to a particular downstream. The website application is one of the downstreams. We can add new applications and manage traffic at the WebGate level using A/B experimentation: variant A sends a request to the Windows servers and the old website; variant B sends a request to the Linux servers and the new website.
- Request enrichment. WebGate adds useful information to a request in the form of headers. For example, device detection happens at the WebGate level, and every downstream reads the result from a header, so we do not need to implement the same logic in every application.
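The two responsibilities above can be illustrated with a small sketch. It is not WebGate's implementation: the hash-based bucketing, the upstream names, the `x-device-type` header, and the user-agent check are all assumptions chosen to keep the example short.

```typescript
type Variant = "A" | "B";

interface IncomingRequest {
  path: string;
  userId: string;
  headers: Record<string, string>;
}

interface RoutedRequest {
  upstream: string;
  headers: Record<string, string>;
}

// Maps a user id to a stable bucket in the range 0..99 (simple string hash,
// used here only for illustration).
function bucketOf(userId: string): number {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

// Users whose bucket falls below percentB get variant B.
function chooseVariant(userId: string, percentB: number): Variant {
  return bucketOf(userId) < percentB ? "B" : "A";
}

// Deliberately naive device detection based on the User-Agent header.
function detectDevice(userAgent: string): "mobile" | "desktop" {
  return /Mobi|Android|iPhone/i.test(userAgent) ? "mobile" : "desktop";
}

// Picks a downstream by experiment variant and enriches the request headers.
function routeRequest(req: IncomingRequest, percentB: number): RoutedRequest {
  const variant = chooseVariant(req.userId, percentB);
  const device = detectDevice(req.headers["user-agent"] || "");
  return {
    // Hypothetical upstream names: A is the old website, B is the new one.
    upstream: variant === "B" ? "linux-new-website" : "windows-old-website",
    headers: { ...req.headers, "x-device-type": device },
  };
}
```

Because the bucket is derived from the user id, the same user consistently lands on the same variant, and raising `percentB` gradually shifts traffic to the new website.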
WebGate centralization is great. It eliminates many bugs caused by distributed logic and improves visibility. WebGate is a critical component, so we cannot afford to lose it even for a millisecond. Thanks to our very high uptime standards, WebGate is always in great shape.
We started with a Proof of Concept (POC) phase, in which we could build a new application as quickly as possible. Using the same Website repository, we migrated one page (City Page) from ASP.NET Framework to .NET Core.
Move fast, fail fast.
Once we had the first working page, we immediately sent the first user traffic to it. In September 2020 we had only 1% of the website’s traffic on Linux servers. By the end of the year we had 40%, and we keep working on it.
More and more teams are migrating their code to the .NET Core infrastructure, building more and more standalone Modules. During the migration we refactored a lot of code and applied various .NET Core optimizations, which you can read about in a related article written by Ilya Nemtsev. As of this writing, we have 12 Modules. Speaking of Modules, now is a good time to go into detail.
A Domain Module, Product Module, or simply a Module is an independent unit of the distributed website. A Module is a web app that builds one website page together with its backend-for-frontend API. You can start one Module on its own or combine several Modules and run them together.
What is a Module, and what are its design principles?
- A Module belongs to one owner team
- A Module has one or several Controllers for page rendering or API calls
- A Module controls its external dependencies
- A Module controls its appsettings
- A Module does not depend on any other Module
- A Module depends on the Bootstrap NuGet package
What is the Bootstrap NuGet package? The Bootstrap library makes your Module a part of one “distributed” website. Let’s talk about what the Bootstrap NuGet package is and why it is so important.
Bootstrap NuGet package
Bootstrap is a NuGet package that provides “core” functionality to every Module. Here is a list of Bootstrap's responsibilities:
- Host. Bootstrap is responsible for website startup, for managing connections to external systems, and for registering your application in the website's distributed network.
- Middleware. Bootstrap adds request pipeline Middleware for page rendering and API calls. The latency of the pipeline is 13ms (p99) for pages and 4ms (p99) for API calls. This time is used to execute common website logic and to prepare Context-specific data for Controllers.
- Context-specific data. The Bootstrap middleware provides Module Controllers with an object that holds context-related information about the request, user, page, pricing, and more.
- Core Services. Bootstrap speeds up the development process by providing a set of common services such as Experiment Manager, Service Discovery, Data Access Provider, Logs, Measurement, CMS, and many others.
- Client-side Commons. Bootstrap provides a Razor Layout for building Desktop/Tablet, Mobile, and WebView pages with the same Header and Footer. The layout has a dynamic dependency on a common CDN bundle of React components for the Header and Footer.
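The relationship between the Bootstrap middleware and a Module Controller can be pictured with the sketch below. The real Bootstrap is a .NET NuGet package; this TypeScript version only illustrates the idea of preparing a context object once and handing it to the controller. The `RequestContext` fields, the header names, and the `cityPage` controller are hypothetical.

```typescript
// Hypothetical shape of the context object; field names are illustrative.
interface RequestContext {
  requestPath: string;
  deviceType: string;
  userId: string | null;
}

interface RawRequest {
  path: string;
  headers: Record<string, string>;
}

// A Module controller receives the prepared context, not the raw request.
type Controller = (ctx: RequestContext) => string;

// Wraps a controller so that the common logic runs once, before the
// controller, and the controller only sees the prepared context.
function withBootstrapContext(
  controller: Controller
): (req: RawRequest) => string {
  return (req) => {
    const ctx: RequestContext = {
      requestPath: req.path,
      // Assumed headers, e.g. set by an upstream proxy.
      deviceType: req.headers["x-device-type"] || "desktop",
      userId: req.headers["x-user-id"] || null,
    };
    return controller(ctx);
  };
}

// Example Module controller (hypothetical).
const cityPage: Controller = (ctx) =>
  `City page for ${ctx.deviceType} at ${ctx.requestPath}`;

const cityPageHandler = withBootstrapContext(cityPage);
```

The benefit of this shape is that every Module gets the same context without re-implementing the common logic, which is the role the article assigns to the Bootstrap middleware.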
For the client side, we use React, and this is the last piece of the modular architecture: a micro-frontend approach using an npm package and a bundle.
A quick recap: every domain repository has a standalone client-side project in which the team creates web pages using React and Redux. The pages are different, but we still want to make sure that the user experience is the same across all of them. The common parts of all pages are the Header and Footer. To provide this common functionality to all Modules, we built an npm package.
When we debated how to deliver the client-side commons, we settled on the micro-frontend approach:
- The Header and Footer React components are developed in a separate repository, packed into an npm package, and published to the npm Artifactory. Additionally, we deploy them as a bundle to the CDN.
- The product team uses the HeaderFooter npm package in its client-side project and then develops website pages. Here is the main trick: when the team creates the bundle for its client side, it excludes this npm dependency from the resulting bundle.
- The Module uses the Razor Layout from the Bootstrap NuGet package, where we already inject a reference to the CDN bundle. We can dynamically change the version of the bundle using Consul. In the browser, the page meets the Header and Footer from the bundle, and it works like a charm.
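One way to exclude a dependency from a bundle, as described in the steps above, is Webpack's `externals` option, which maps an import to a global variable that is expected to exist at runtime. The sketch below assumes that option is used; the package name `@example/header-footer`, the global `HeaderFooter`, and the CDN URL layout are hypothetical and not taken from Agoda's setup.

```typescript
// Minimal local type so the sketch compiles without webpack installed;
// a real project would use webpack's own Configuration type.
interface BundleConfig {
  entry: string;
  output: { filename: string };
  externals: Record<string, string>;
}

// The package is imported in page code as usual, but webpack leaves it out
// of the page bundle and resolves it to a global variable at runtime.
// "@example/header-footer" and "HeaderFooter" are hypothetical names.
const pageBundleConfig: BundleConfig = {
  entry: "./src/index.tsx",
  output: { filename: "page.bundle.js" },
  externals: {
    "@example/header-footer": "HeaderFooter",
  },
};

// Builds the script URL that a shared layout could reference. In the article
// the version is switched dynamically via Consul; here it is a parameter.
function headerFooterScriptUrl(cdnBase: string, version: string): string {
  return `${cdnBase}/header-footer/${version}/bundle.js`;
}
```

With this split, changing the version passed to `headerFooterScriptUrl` is enough to roll out a new Header/Footer; the page bundles themselves do not need to be rebuilt.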
The micro-frontend approach gives us incredible agility. When we release a new version of the npm package, we do not need to update and deploy each website application. We let teams work at their own pace. This flexibility allows us to work on the Header/Footer as an independent piece and deploy it to all website Modules in one click. We can still follow our favorite practice: move fast, fail fast.
While this is all for now, our journey is not finished yet. At the time of writing, we are still working on the Traffic Migration phase. I hope that next time I can share the results of the whole redesign: where we were wrong and where we made the right choice.
Thank you very much for reading.