From Monolith to Autonomous Services and Teams

Mahfudh Junaryanto
Published in cloudstory
Apr 19, 2022 · 7 min read

As part of a digital transformation, you may have migrated your app from a privately managed on-prem data centre to a public cloud like AWS. A lift-and-shift migration is barely enough, but it is the critical first step in the exciting journey ahead. The next step is to empower your team with the tools, architecture, and structure that allow them to be highly productive and agile, so they can move fast.

Team Structure: Autonomous Team

An autonomous team is a small team (a typical two-pizza team) that is responsible for end-to-end delivery and fully accountable for a subset of the metrics your organisation has set out. Let’s say your North Star metric is Time in App, the amount of time your users spend in your app in a given month. Every section or subsystem in your app then needs to define its own metric that supports the North Star metric. A subsystem can be the Landing Page, Search, Product, or Order service. Below are example metrics for each service:

  • Landing page → Lower bounce rate
  • Search service → Higher click rate on search result
  • Product service → Higher conversion rate (adding item to shopping cart)
  • Order service → Higher conversion rate (cart to payment completion)

An autonomous team takes ownership of one of these services and runs at full speed to continuously improve its metric. While the team cannot control the absolute input values, it can control the conversion rate. Given the autonomy to experiment and run build-measure-learn cycles, the team will be fired up and fully motivated to improve its metric.

Architecture: Autonomous Services

Having an autonomous team is not enough if it depends on other teams. A team is responsible for one microservice, but that microservice depends on other microservices. This is a typical microservice problem known as the death star: an architecture characterised by microservices with complex dependencies on one another. Dependencies create tension and can defeat a team that has been working hard to improve its metric. We need to empower the autonomous team with an autonomous service: a service that has total independence.

With those key concepts in place, let’s dive into a real use case: turning a monolith into an Autonomous Service architecture.

Current Architecture

There are many versions of monolith architecture, but a very typical scenario is a 3-tier web application, as shown below:

Current Architecture

In this example, Liferay Portal, a product that was hugely popular a decade ago, sits at the heart of the system. To change the system, be it functionality or UI, you need to use its plugin system, and everything must be deployed together. It is very hard to make changes, and even harder to ensure those changes won’t break the rest of the functionality.

Transformation Strategy

We need to break the system into Autonomous Services, but breaking up a system that has been working fine for years is no small undertaking. Instead, we can pick the functions that have a major impact on the North Star metric and focus on refactoring just those into Autonomous Services. Some low-impact functions will still be served from the legacy app, and we make old and new work together using the Strangler Fig pattern, sketched after the figure below.

Transformation Strategy
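To make the Strangler Fig split concrete, here is a minimal, hypothetical sketch of how CloudFront path-based routing can carve the high-impact functions out of the legacy app. The path patterns and origin IDs are invented for illustration, and a real CloudFront cache behaviour needs more fields (cache policy, viewer protocol policy, and so on):

```python
# Hypothetical, abridged CloudFront behaviour config for the Strangler Fig
# split. Only the routing-relevant fields are shown; a real cache behaviour
# requires more settings (cache policy, viewer protocol policy, etc.).

# High-impact functions are carved out and routed to the new services.
cache_behaviors = [
    {"PathPattern": "/search/*",   "TargetOriginId": "search-service-api"},
    {"PathPattern": "/products/*", "TargetOriginId": "product-service-api"},
    {"PathPattern": "/orders/*",   "TargetOriginId": "order-service-api"},
]

# Everything else falls through to the legacy Liferay app, untouched.
default_cache_behavior = {"TargetOriginId": "legacy-liferay-alb"}
```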

Target Architecture

After identifying the high-impact functions, let’s turn them into Autonomous Services and make them work in harmony with the legacy app, as shown below:

Autonomous service lives in harmony with Legacy app

While I am not going to elaborate on the various patterns in use (CQRS, Event Sourcing, Backend for Frontend, Micro frontend, etc.), some decisions worth highlighting make this architecture a good fit for an Autonomous Service:

  • Decouple services with an Event Hub → services are connected in a decoupled manner through events delivered and received via the Event Hub
  • Data replication across services → to maintain full independence, each service keeps the data it owns, plus a read-only copy of other services’ data optimised for viewing
  • Wrap the legacy app with the Service Gateway pattern, turning the legacy app into “just another microservice”
  • Build materialised views as S3 objects to make reads super scalable and cheap. Instead of serving each request by hitting Lambda and DynamoDB, we “build the view” by generating the necessary object as JSON and storing it in S3 as early as possible (when the change event is received), so subsequent requests for the object are served from low-cost S3 storage and CloudFront caching, as in the sketch below
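As an illustration of the last point, here is a minimal sketch of a view-builder Lambda. The bucket name, event fields, and key scheme are assumptions made up for this example; the point is that the view is rendered once, when the change event arrives, rather than on every read:

```python
import json

import boto3

s3 = boto3.client("s3")

VIEW_BUCKET = "my-materialised-views"  # hypothetical bucket name


def handler(event, context):
    """Build the materialised view as soon as a change event arrives.

    Triggered by an EventBridge rule on a (hypothetical) ProductUpdated
    event. Instead of serving reads through Lambda + DynamoDB, we render
    the view once, store it as JSON in S3, and let CloudFront cache it.
    """
    detail = event["detail"]          # the domain event payload
    product_id = detail["productId"]  # assumed event field

    # Render whatever shape the front end needs to display the product.
    view = {
        "id": product_id,
        "name": detail["name"],
        "price": detail["price"],
    }

    s3.put_object(
        Bucket=VIEW_BUCKET,
        Key=f"views/products/{product_id}.json",
        Body=json.dumps(view),
        ContentType="application/json",
    )
```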

The Migration

With all the services in place, let’s turn our attention to migrating to the new architecture. We specifically have to deal with:

  • Hydrating the Event Hub with data from the legacy app
  • Coordinating the switch from old to new architecture in a seamless manner

DMS as a Migration Tool

AWS Database Migration Service (DMS) is the immediate solution that comes to mind. Migrating to another relational database or to DynamoDB would have been smooth sailing with schema conversion. But our event-first Autonomous Services pose a challenge, for a few reasons:

  • Each autonomous service has its own way of storing data and turning events into materialised views; mapping to every model and view for every service would be laborious
  • The Event Lake would still be empty by the time the migration completes, since the Event Hub and Event Lake are not involved in the migration
  • With the Event Lake empty, any new service introduced in the future would require rerunning the migration just for that service

There is a better way, though: getting the Event Hub and Event Lake involved in the migration process. DMS recently introduced the ability to set Kinesis Data Streams as a target. By having DMS send the full load and CDC to a Kinesis Data Stream, the Event Hub, powered by Amazon EventBridge, can process the stream and forward the events to the downstream autonomous services. We then activate event archival and replay to allow seeding of data into future services, as sketched below. Problem solved.
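As a rough sketch of the archival and replay part, the snippet below archives everything flowing through the bus and, later, replays the history into the bus so a newly introduced service can build its own models and views. The bus and archive names/ARNs are hypothetical:

```python
from datetime import datetime, timezone

import boto3

events = boto3.client("events")

# Hypothetical ARNs for illustration.
BUS_ARN = "arn:aws:events:ap-southeast-1:111122223333:event-bus/live-bus"
ARCHIVE_ARN = "arn:aws:events:ap-southeast-1:111122223333:archive/live-bus-archive"

# 1. Archive every event flowing through the bus (done once, up front).
events.create_archive(
    ArchiveName="live-bus-archive",
    EventSourceArn=BUS_ARN,
    RetentionDays=0,  # 0 = retain indefinitely
)

# 2. Later, when a brand-new service is introduced, replay the archived
#    history into the bus to seed the new service with past events.
events.start_replay(
    ReplayName="seed-new-service",
    EventSourceArn=ARCHIVE_ARN,
    EventStartTime=datetime(2022, 1, 1, tzinfo=timezone.utc),
    EventEndTime=datetime.now(timezone.utc),
    Destination={"Arn": BUS_ARN},
)
```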

But some questions remain: what is the appropriate size of DMS instance to provision, and what are the optimal parallelism and commit sizes? And when and how do we switch to the new architecture and trigger the change in data flow? Specifically, the switch requires:

  • Updating the direction of the data flow between services
  • Splitting front-end traffic between (1) legacy and (2) new service endpoints
  • Bringing maintenance mode up and down automatically

To solve this, we need orchestration, and AWS Step Functions is a good candidate.

Step Functions and Lambda as Migration Tools

As mentioned earlier, DMS alone is not enough for the end-to-end migration process. Let’s now explore Lambda, instead of DMS, as the tool to move data from the legacy app to the new services. Lambda has advantages over DMS for the following reasons:

  • There is no need to think about provisioning and sizing the DMS service
  • Coordinating and utilising DMS for a one-time activity may be overkill, as the team needs to learn how to use it

Migration Process

Before we dive into the solution, here are some requirements and design decisions:

  • Set up two new event buses in EventBridge: (1) a Migration Bus and (2) a Live Bus. The Migration Bus is used during migration and has a one-way stream flow, while the Live Bus is the actual bus used by the system after migration, with the real stream flow to the respective services
  • CloudFront is used to split traffic between the legacy and new services to implement the Strangler Fig pattern. All origins need to be defined before the process starts
  • Maintenance mode is implemented by blocking traffic at the ALB level, redirecting all traffic to a predefined maintenance page (see the sketch after this list)
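For the maintenance-mode point, here is a minimal sketch of flipping an ALB listener into maintenance mode and back with boto3. The listener ARN and maintenance host are hypothetical, and the original listener actions would be captured before switching so they can be restored afterwards:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Hypothetical listener ARN for the legacy app's ALB.
LISTENER_ARN = (
    "arn:aws:elasticloadbalancing:ap-southeast-1:111122223333:"
    "listener/app/legacy-alb/0123456789abcdef/0123456789abcdef"
)

# Redirect every request to a predefined maintenance page.
MAINTENANCE_ACTIONS = [{
    "Type": "redirect",
    "RedirectConfig": {
        "Protocol": "HTTPS",
        "Host": "maintenance.example.com",  # hypothetical page
        "Port": "443",
        "Path": "/",
        "Query": "",
        "StatusCode": "HTTP_302",
    },
}]


def set_maintenance(enabled: bool, normal_actions: list) -> None:
    """Flip the ALB listener between maintenance mode and normal traffic."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=MAINTENANCE_ACTIONS if enabled else normal_actions,
    )
```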

The state diagram below shows how we can use Lambda as a reliable migration solution, with Step Functions orchestrating the whole process:

  • Migration Start is a Lambda invoke step responsible for enabling the rules on the Migration Bus
  • Run Maintenance is a Lambda invoke step responsible for activating maintenance mode by updating rules on the ALB
  • Migration Controller is a Lambda invoke step responsible for generating batches, or query ranges, for each table. It outputs an array of table and start-index pairs to be used by the subsequent step
  • Query and Emit is a Lambda invoke step that queries the database and emits events to EventBridge, driven by the output of the Migration Controller. Each invocation receives a table name and a query start index, uses them to query the database, and fetches at most 10K records. The Lambda is called inside a Map state to allow parallelism; if the amount of data to migrate is huge, the Map state’s MaxConcurrency setting needs to be set to prevent spawning a massive number of Lambdas that would hurt database performance. A sketch of this step follows the list
  • Switch Over is a Lambda invoke step that will:
    — Disable the rules on the Migration Bus
    — Enable the rules on the Live Bus
    — Configure the traffic split at CloudFront
    — Wait until the CloudFront distribution is ready (fully propagated)
  • Migration End is a Lambda invoke step that removes maintenance mode by updating the rule on the ALB
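To make the Query and Emit step concrete, here is a minimal sketch of its Lambda handler. The query_legacy_db helper, event names, and bus name are hypothetical; the one real constraint reflected here is that PutEvents accepts at most 10 entries per call:

```python
import json

import boto3

events = boto3.client("events")

BATCH_SIZE = 10_000    # each invocation migrates at most 10K records
PUT_EVENTS_LIMIT = 10  # PutEvents accepts at most 10 entries per call


def query_legacy_db(table, start_index, limit):
    """Hypothetical helper: in a real migration this would page through
    the legacy database, e.g. SELECT * FROM <table> with limit/offset."""
    raise NotImplementedError


def handler(event, context):
    """The 'Query and Emit' step, invoked inside the Map state with one
    {table, start_index} pair produced by the Migration Controller."""
    table = event["table"]
    start = event["start_index"]

    rows = query_legacy_db(table, start, BATCH_SIZE)

    # Emit each row as a domain event onto the Migration Bus,
    # respecting the 10-entry PutEvents batch limit.
    for i in range(0, len(rows), PUT_EVENTS_LIMIT):
        entries = [
            {
                "EventBusName": "migration-bus",  # assumed bus name
                "Source": "legacy.migration",
                "DetailType": f"{table}.Migrated",
                "Detail": json.dumps(row),
            }
            for row in rows[i:i + PUT_EVENTS_LIMIT]
        ]
        events.put_events(Entries=entries)

    # Report progress so the state machine can schedule the next batch.
    return {"table": table, "migrated": len(rows), "next_index": start + len(rows)}
```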

That’s it! Many details have been omitted for brevity and clarity. Feel free to reach out for a deep dive into the actual implementation.
