Monolith to Microservices: Frequently asked questions.

Published in Babylon Engineering, Feb 27, 2020

Monolith to Microservices, Photo by The AIRDEEz on Unsplash

At Babylon, we have been replatforming to replace our original Ruby-on-Rails monolith with microservices written in many different languages, the most common being Java. Over time, there have been frequently asked questions, lessons learnt and internal reflection over whether we are doing the right thing and in the right way. In this post I aim to share what we have learnt so far, even with the replatforming still ongoing.

My name is Will Poynter and I provide engineering leadership to the Platform: Foundation tribe here at Babylon. (If you are unsure what a tribe is, I recommend you read this article.) The Platform: Foundation tribe is one of the largest owners of domains that were once, or still are, delivered through Babylon’s monolith.

What is replatforming?

Putting it bluntly, replatforming means replacing the underlying platform while preserving the overall functionality. This could be moving a blog from Wix to WordPress, or moving a data warehouse from Amazon Redshift to Google BigQuery. Replatforming in our case is moving from monolith to microservices.

Majestic Monolith, photo by Sven Scheuermeier on Unsplash

Why have we decided to use microservices instead of a majestic monolith?

Both patterns are entirely valid and each has different advantages and disadvantages. One advantage of microservices that is compelling for Babylon is the ability to use many different programming languages and frameworks. Relying on a single skillset can be difficult and can limit the growth of the company. Over Babylon’s early years, the number of great Ruby engineers Babylon could hire was directly linked to the number of great Ruby engineers in London and Kigali.

Using a microservice architecture allows Babylon to develop different domains and microservices using many different languages, e.g. Java, Go, Ruby, etc. This provides us with better business continuity and less reliance on a single programming language.

Another important reason to replatform from monolith to microservices is that our existing architecture was already a mix of a monolith and microservices. Going back to 2017, our monolith provided the telemedicine and platform functionality, while Babylon’s Symptom Checker, Assistant and (back then still in development) Healthcheck products consisted of a couple of dozen microservices.

This meant Babylon had three options:

Consolidate all of our products into a single monolith
Support both monolithic and microservices architectures
Break up the monolith into microservices

Option 1 — Consolidate into a single monolith

There are a lot of reasons why this is not a good idea. One of the strongest and most compelling reasons is that Python is a much better language for creating AI products than Ruby.

Option 2 — Support both monolithic and microservices architectures

Although this is possible, it is more expensive and makes it more difficult for our orchestration and infrastructure to scale. A lot of effort has been invested in tools like Kubernetes and Istio. If we wanted to scale our platform to handle both microservices and a monolith, we would also have to invest a lot of effort in supporting a monolithic architecture. This would be expensive and would require two different sets of expertise and principles.

Option 3 — Break up the monolith into microservices

Having a microservices-only approach meant that we would only need to support one architecture and would be able to use whichever language or framework is most appropriate for each task. As we continue to grow and scale Babylon, this approach seemed the most suitable. This is therefore the option we decided to take.

What’s the plan?

Babylon’s monolith is a complex and still very active system, receiving the most requests of any service at Babylon. In addition to being highly active, our monolith was once responsible for every platform domain and all clinical care domains. It did everything from rendering PDFs of sick notes to emailing password resets and handling eCommerce orders.

In addition to the mission of replatforming from monolith to microservices, Babylon also needs to keep moving: closing deals, deploying to new countries and advancing its products to get us closer to providing an affordable and accessible health service to every person on Earth.

Big Bang

All of these factors make a big bang change a very risky approach. Not only would we need to build out all of the existing domains into our new platform, with all of the subtleties in how they work understood and replicated, but we would also have to build the new platform to converge on the evolving functionality of our monolith. Even worse, we would have to wait until every single part was done before we could use any part of it. Famously, big bang changes tend to take much longer than predicted as the goalposts keep moving and therefore run significantly over-budget.

Lastly, when they go wrong, they go wrong in a big, and sometimes public, way. Most of you in the UK will remember what happened to TSB in 2018 when they moved from using the Lloyds Banking Group platform to the Sabadell Proteo platform. The fallout from their big bang change has cost the company over £360m, affected 1.9m customers and led to the planned closure of over 80 branches in 2020. [₁][₂] Big bang.

Incremental

A much safer option, and the one we decided to use, is to replatform one domain at a time. This has just as much parallelisation capability as big bang, but doesn’t require it. We can do 1 domain at a time. Or 2 at a time. Or 10 at a time. It’s up to us and how much resource we have to commit to replatforming.
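One way to picture "one domain at a time" is a routing table that sends each domain either to the monolith or to its new service, and that can be flipped independently per domain. The sketch below is only an illustration of that idea, not a description of Babylon's actual routing layer; the class, the domain names and the service names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

MONOLITH = "monolith"


@dataclass
class DomainRouter:
    """Decides, per domain, whether traffic goes to the monolith or a new service."""

    # Hypothetical mapping: domain name -> microservice that now owns it.
    migrated: Dict[str, str] = field(default_factory=dict)

    def migrate(self, domain: str, service: str) -> None:
        """Switch a single domain over to its new microservice."""
        self.migrated[domain] = service

    def rollback(self, domain: str) -> None:
        """Send a domain back to the monolith if the migration misbehaves."""
        self.migrated.pop(domain, None)

    def backend_for(self, domain: str) -> str:
        """Anything not explicitly migrated is still served by the monolith."""
        return self.migrated.get(domain, MONOLITH)


router = DomainRouter()
router.migrate("member-profile", "profile-service")  # illustrative names only
print(router.backend_for("member-profile"))    # profile-service
print(router.backend_for("ecommerce-orders"))  # monolith
```

Because each domain has its own entry, migrating 1, 2 or 10 domains in parallel is just a matter of how many entries change at once, and a single domain can be rolled back without touching the others.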

However, in order to replatform domains separately, you need to:

Know what the domains are
Disentangle them from each other

And with that, we begin to answer another big question.

Why does it take Ruby engineers to replace a Ruby monolith?

The development of our Ruby-on-Rails monolith started in 2013 and it was built rapidly to meet the app launch in April 2014. It has since grown and grown to meet additional client requirements and feature requests. For 4 years our monolith grew, and during this time it was never designed to be dismantled. Although some excellent engineering has gone into our monolith, it was always with the view of meeting the requirements quickly. This not only led to a lot of unintentional coupling but also created another problem: a severe lack of documentation.

Documentation: the thing that all engineers want and few engineers want to spend time writing. The first couple of years of development at Babylon predate Babylon’s use of Jira, meaning the oldest, most entrenched elements can be complete mysteries as to why they behave as they do or how they should behave. This leads to a lot of time spent on investigations, spikes and bug fixes, all of which require Ruby skill.

Once the domain is understood and stabilised/corrected, the next step is to decouple it from the other domains within the monolith. For example, think of the Babylon Member Profile, consisting of name, email etc. We want to replatform this into a new service that deals with this data in a much more flexible way. So we built a new service and migrated the data… but! The rest of the monolith code refers to the user profile data all over the place, and all of these references now need to be updated to point to the new microservice. This requires extensive reworking in Ruby, and it must be completed before the microservice can be fully used.
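To make the decoupling step more concrete, here is a minimal sketch of one common way to do it: route every profile read through a single interface, so that callers no longer care whether the data lives in the monolith's own tables or behind the new service. The real work described here is in Ruby; this Python version is only an illustration, and every name in it (ProfileStore, MonolithProfileStore, ServiceProfileStore, greeting, the fetch callable standing in for an HTTP client) is an assumption made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class MemberProfile:
    member_id: str
    name: str
    email: str


class ProfileStore(ABC):
    """The single seam through which the rest of the code reads profile data."""

    @abstractmethod
    def get(self, member_id: str) -> MemberProfile: ...


class MonolithProfileStore(ProfileStore):
    """Reads from the monolith's own storage (an in-memory dict stands in here)."""

    def __init__(self, rows: Dict[str, dict]) -> None:
        self._rows = rows

    def get(self, member_id: str) -> MemberProfile:
        row = self._rows[member_id]
        return MemberProfile(member_id, row["name"], row["email"])


class ServiceProfileStore(ProfileStore):
    """Reads from the new microservice via a caller-supplied fetch function."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch  # hypothetical stand-in for a real service client

    def get(self, member_id: str) -> MemberProfile:
        payload = self._fetch(member_id)
        return MemberProfile(member_id, payload["name"], payload["email"])


def greeting(store: ProfileStore, member_id: str) -> str:
    """Example caller: depends only on ProfileStore, not on where the data lives."""
    return f"Hello, {store.get(member_id).name}"


rows = {"m-1": {"name": "Ada", "email": "ada@example.com"}}
old_store = MonolithProfileStore(rows)
new_store = ServiceProfileStore(lambda member_id: rows[member_id])  # fake service call
print(greeting(old_store, "m-1"))  # Hello, Ada
print(greeting(new_store, "m-1"))  # Hello, Ada
```

The expensive part is not writing the new store but finding and rewriting every scattered reference so that it goes through the seam, which is why this stage needs engineers fluent in the monolith's language.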

Roughly speaking, the technical debt is in Ruby; therefore, it has to be paid off in Ruby.

How do we track progress?

It is important to track your progress for a plethora of reasons. One reason is that this is even more than a marathon; it’s closer to an Ironman or a Spine Race, the peak of endurance exercise. Performing such a big feat can lead to motivation issues and calls of “we are not moving forward”, so it is important to be able to show how much you are moving and to set expectations of how much further there is to go. Additionally, it is a vital tool for communicating to the rest of the business which domains are available on the new shiny platform and which are still only partially replatformed or not even started. This helps the business and the replatforming effort to build a roadmap. It also reduces confusion about how long the project has taken or how much longer it will take.

Firstly, we learnt that what works well is one tracker, one owner and many contributors. You need a single place to go to find progress. This keeps the progress tracking in a standard format and improves the integrity of the tracking.

Secondly, you need someone to own the tracking. That person is accountable for keeping the tracker up to date and should be able to explain how best to use it and how the tracking will be performed.

And finally, each owner of a domain being replatformed submits their analysis of the progress they have made in their area.

At Babylon, we defined two metrics to describe our replatforming progress: replatformed percentage and adoption percentage. Replatformed percentage is the proportion of the monolith’s functionality (that we are keeping) that has been rebuilt into microservices, and adoption percentage is the proportion of clients (including the monolith) that are now using the new platform as the source of truth.

Replatformed percentage vs Adoption percentage

Are we getting rid of Ruby or just our Ruby monolith?

Probably neither. We are only replacing what we need to. We will continue to replatform our monolith until we reach a point of diminishing returns. There will be a point where our once Ruby monolith has become a large microservice responsible for several relatively simple low-churn domains with all other domains extracted. At this point, it may not be a priority to continue replatforming as the remaining domains are not causing us issues.

Even more importantly, we are not getting rid of all Ruby. Ruby, like all programming languages, has its strengths and its weaknesses. There is no technical reason to remove Ruby from the Babylon ecosphere. However, Babylon today has around 550 engineers, of whom only 41 are Ruby engineers, i.e. they make up around 7% of our coding capacity. In a balanced and stable approach to development, only around 7% of our systems would then be in Ruby. Although it is very difficult to ascertain the exact proportion of our systems that are currently written in Ruby, we can be pretty sure it is more than 7%. We do not want to remove Ruby as a language, but to reduce its usage until it is in line with the capability we have. Once our monolith is only another microservice, it will be easier to determine how many Ruby microservices we are able to support.

At Babylon we believe it’s possible to put an accessible and affordable health service in the hands of every person on Earth. The technology we use is at the heart and soul of that mission. Babylon is a Healthcare Platform, currently providing online consultations via in-app video and phone calls, AI-assisted Triage and Predictive Health Assistance.

If you are interested in helping us build the future of healthcare, please find our open roles here.

References:

[₁] https://www.bbc.co.uk/news/business-50543665

[₂] https://www.theguardian.com/business/2019/nov/22/tsb-struck-by-new-it-glitch-just-months-after-330m-meltdown