Breaking up with a Monolith

Troy Leland Shields · Published in Weave Lab · Mar 8, 2019 · 12 min read

Lesson 0: When is it time to call it quits?

Microservices are all the rage. It’s easy to get swept up in the excitement and decide that some state-of-the-art, industry “best” practice is right for your company simply because there are so many blog posts about how it worked for Netflix or Weave or whatever. This alone turns out not to be a good reason to do something.

If you are neither rapidly scaling your customer base nor rapidly scaling your engineering team, then you should think twice before rapidly scaling the complexity of your system by chopping it up into more services.

That being said, there are many good reasons to refactor a monolith into microservices. Taking periodic inventory of your existing services to make sure you are prepared to meet your company’s needs is well-advised.

Just because your company uses Kubernetes or has many backend projects (each with nice short descriptions in their README.md) does not mean you are free from any hidden monoliths waiting to cause issues for your development efforts. And what might be micro-enough today might need a chisel and hammer tomorrow as you experience growth and product changes.

So how do you know when you have a monolith that needs to be divided asunder?

It might be time for some rearchitecting if one of your services has some of the following qualities:

  1. No clear owner; or many teams care about it and touch it
  2. A myriad of external dependencies or a plethora of upstream dependents
  3. Cannot run multiple instances of it (because it is stateful, etc.)
  4. Difficult to test certain aspects of it
  5. Deployment makes you ask coworkers to pray for you (even for small changes)

Breaking up at Weave

Sometime in 2017 our fearless CTO, Clint Berry, admonished each development team to make sure that their products and services would be able to scale to 4x Weave’s customer base by 2019, with no service interruptions.

I took stock of the projects for which we were responsible and found one crucial service that would likely not survive Weave’s growth in 2018.

Weave is an integrated-communications platform for small businesses (such as dentists), and one of our main selling points is automated messaging (e.g., appointment reminders). The service I was worried about was in charge of scheduling and sending all automated communications, including everything from texts for appointments to emails for birthdays.

At first glance, this service would not necessarily seem to be monolithic in nature. For one, it had a nice, succinct job description: “schedule and send automated messages.” It didn’t even actually deliver any text messages or emails; it passed them off to other microservices to do that.

However, there were several red flags with this service. It executed several batch jobs, which meant it was not horizontally scalable. Testing even small changes to the service was difficult and complex. It “knew too much,” with a lot of external dependencies. And finally, deploying this service absolutely made me nervous, sometimes requiring a developer to wake up at 3:00 AM just to verify that everything was working alright.

The service had worked wonderfully up until now, but it was time to break it apart or suffer many stressful days in the coming year.

Lesson 1: Someone needs to own the migration

Migrating from a monolith to microservices is likely to be a non-trivial task that gets underestimated by everyone involved. There needs to be one designated person to fight for the work to be prioritized against user stories that more obviously drive value for the business.

It would have been a mistake to assume that our monolith would organically evolve into well-designed microservices through the magic of good intentions.

In reality we were committing to complete several smaller migrations. These migrations took many months (and in some cases are literally still happening today, more than a year later). We created a plan and moved pieces out one at a time, as it made sense.

It doesn’t need to become everyone’s top priority to get this migration done immediately, but it does need to be someone’s priority to ensure that forward progress is maintained and balanced against other business needs.

Lesson 2: Make it easy

Provisioning, Deployment, & Databases

As of a few years ago, all of Weave’s backend services were running on VMs. Creating a new microservice would require some back and forth between dev and Ops to provision a new VM with the necessary dependencies. Deploying services was a bit of a pain — rolling back even more so. As a result, Weave had maybe a dozen services.

I wouldn’t recommend getting too serious about migrating to a microservices architecture until developers can easily provision their own compute resources for a service, can deploy safely without complication, and can store their own data somewhere safe and replicated.

Kubernetes solves most of those issues for Weave and makes life really easy when we need to create a new service as part of a migration.

Isolate, Separate, Refactor, Rewrite

As we were breaking up with our monolith we were tempted to take the opportunity to do a complete rewrite of all business logic from the ground up.

While there are bound to be cases where this is the right course of action, we generally opted to hold off on any major changes until after we had stood up the new microservices and migrated all traffic to them. There were enough changes with the migration itself that we didn’t want to complicate matters more than we had to.

Our migration process was to first logically isolate pieces of the monolith that should be moved into their own service. This sometimes meant doing a little refactoring in the monolith in preparation for the actual migration.

From there, we physically separated that functionality into its own service (often just copying and pasting the code). Then we slowly migrated customers to use the new microservice.

Once we were confident the microservice was functioning as intended, we were free to refactor or rewrite as necessary. The migration really was not complete until we had gone back and done this step, but we waited until we were comfortable that the migration was going smoothly before making any big changes to business logic.
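The post doesn’t show code for the “isolate” step, but one common way to cut that seam is to hide the candidate functionality behind an interface inside the monolith, so the in-process implementation can later be swapped for a client of the new service. A minimal Go sketch, with entirely hypothetical names (this is not Weave’s actual code):

```go
package reminders

import (
	"context"
	"fmt"
)

// Scheduler is the seam cut into the monolith: callers depend on this
// interface instead of reaching into the batch-job code directly.
type Scheduler interface {
	ScheduleReminder(ctx context.Context, appointmentID string) error
}

// localScheduler wraps the existing in-process logic, unchanged.
type localScheduler struct{}

func (localScheduler) ScheduleReminder(ctx context.Context, appointmentID string) error {
	// ... the original monolith logic lives here ...
	fmt.Println("scheduled locally:", appointmentID)
	return nil
}

// remoteScheduler forwards the same call to the new microservice,
// e.g. via a generated gRPC client. Once all traffic is migrated,
// localScheduler (and the code behind it) can be deleted.
type remoteScheduler struct {
	call func(ctx context.Context, appointmentID string) error // stands in for the real client
}

func (r remoteScheduler) ScheduleReminder(ctx context.Context, appointmentID string) error {
	return r.call(ctx, appointmentID)
}
```

With the seam in place, the physical separation is mostly copy-and-paste, and swapping which implementation handles a given customer becomes a wiring decision rather than a rewrite.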

Tracing

As the number of our services grows at Weave, the difficulty of debugging issues grows as well. Pinpointing a failure is difficult when a request might depend on a handful of services for a response. Which service screwed up?


Request-tracing (using a tool like Jaeger) drastically improves our ability to investigate and test our microservices. We can see a request in its entirety rather than just in a local context. Jaeger is one of the single greatest tools that Weave has for its developers. Traditional logs pale in comparison.

I would hate to work in a microservices architecture without some type of tracing available to help deal with the complexity of a distributed system. I know that tracing is true.
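Our actual instrumentation isn’t shown in this post, but with OpenTracing-style APIs (the way Jaeger was commonly wired into Go services around this time) per-request spans look roughly like this hypothetical handler:

```go
package delivery

import (
	"context"

	"github.com/opentracing/opentracing-go"
)

// sendReminder is a made-up handler; its span joins the incoming request's
// trace so the whole request shows up as one timeline in Jaeger.
func sendReminder(ctx context.Context, appointmentID string) error {
	span, ctx := opentracing.StartSpanFromContext(ctx, "sendReminder")
	defer span.Finish()

	span.SetTag("appointment.id", appointmentID)

	// Downstream calls made with this ctx (for example, gRPC calls through a
	// tracing interceptor) become child spans in the same trace.
	return deliver(ctx, appointmentID)
}

func deliver(ctx context.Context, appointmentID string) error {
	span, _ := opentracing.StartSpanFromContext(ctx, "deliver")
	defer span.Finish()
	// ... hand the message off to the SMS/email microservice ...
	return nil
}
```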

Lesson 3: Know when you succeed

A migration from a monolith to microservices might seem simple, but it can be fraught with issues.

Think about it: you are expecting your current developers to completely rearchitect all the legacy work that your past developers created. There’s a level of intimate knowledge required that may have left your company years ago.

Sure, in many cases the migration might be as simple as copying and pasting from here to there, but surprises will most definitely come up. Because of these unknowns, it is important to know when your migration is succeeding and can move forward, and when it is failing and needs to move backward.

This means that your migration towards a microservices architecture for any given monolith needs to start long before you create clean and sparkly microservice repos.

Tests & Test Plans

Since we decided not to do any major refactoring, we were able to port all of our unit tests into the new services we created, giving us confidence that we hadn’t made any major blunders. This, of course, required that we had tests in place to start with. Unit tests are a boon to safe deploys, so if a particularly hairy piece of code was missing tests then we would often write them into the monolith before starting a migration.
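Our real tests aren’t reproduced here, but the kind of test that ports over cleanly is one that pins down a business rule rather than an implementation detail. A small, hypothetical table-driven example in Go:

```go
package reminders

import (
	"testing"
	"time"
)

// reminderSendTime is a stand-in for real business logic under test:
// send the reminder 24 hours before the appointment.
func reminderSendTime(appointment time.Time) time.Time {
	return appointment.Add(-24 * time.Hour)
}

// Because the test asserts on behavior, the same test passes against the
// monolith's code and against the copy that lands in the new service.
func TestReminderSendTime(t *testing.T) {
	cases := []struct {
		name        string
		appointment time.Time
		want        time.Time
	}{
		{
			name:        "one day before",
			appointment: time.Date(2019, 3, 8, 9, 0, 0, 0, time.UTC),
			want:        time.Date(2019, 3, 7, 9, 0, 0, 0, time.UTC),
		},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := reminderSendTime(tc.appointment); !got.Equal(tc.want) {
				t.Errorf("reminderSendTime(%v) = %v, want %v", tc.appointment, got, tc.want)
			}
		})
	}
}
```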

Integration tests depend less on implementation and will verify more of the behavior of the refactored system, so those would have been even better. Unfortunately integration tests are a lot like unicorns: completely fictional.

Without automated integration tests we had to resort to manually testing as we migrated code to microservices. This was unfortunate and difficult, but still helped us track towards success.

Metrics

Another critical tool for tracking success was our ability to collect metrics using Prometheus and Graphite. As we migrated customers we relied on our metrics to know that nothing major or unexpected had changed. Just as a unit test will verify that some actual result matches an expected outcome for a single function, metrics can accomplish the same for your entire system.

In order for this to be effective, we had to (1) have metrics in the first place and (2) make sure our new metrics would be comparable in a useful way to our existing metrics. Note that in this case having correct metrics is actually less important than having comparable metrics.

Unfortunately, this is a trailing indicator of any problems that might be occurring, but we still derived confidence from our metrics as we rolled changes out to more and more customers.
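We don’t reproduce our dashboards here, but “comparable metrics” mostly comes down to counting the same events with the same names and labels on both code paths, so the old and new series can sit on one graph. A rough sketch with the Prometheus Go client (the metric and label names are made up):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// remindersSent counts outgoing reminders, labeled by which code path sent
// them. Because the metric and labels are identical for both paths, the
// legacy and migrated series are directly comparable side by side.
var remindersSent = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "reminders_sent_total",
		Help: "Automated reminders sent, by delivery path and channel.",
	},
	[]string{"path", "channel"}, // path = "monolith" or "microservice"
)

// recordSend is called from both the legacy and the migrated code paths.
func recordSend(path, channel string) {
	remindersSent.WithLabelValues(path, channel).Inc()
}
```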

Lesson 4: Take appropriate risks and always have a rollback plan

Deploying is inherently risky — even for a small, seemingly obvious change — even when the code has great test coverage.

Refactoring a monolith into microservices is especially risky because there are large swaths of changes happening at one time, often involving legacy code that current developers may not fully understand.

The good news is that it is generally possible to leave the original monolith largely untouched and in-place until you are ready to swing some traffic to the shiny new microservices. Your monolith can continue doing its job while you work behind the scenes to cut its legs out from under it.

Eventually, after you’ve written, tested, and deployed your new microservices you are going to have to take the plunge and do the risky thing: put live traffic on it.

Assess the level of risk you’re willing to take to determine the best strategy forward, and balance that against how feasible each of those strategies is to implement.

Part of assessing each of these strategies requires considering the rollback process should something go wrong. As discussed previously, it’s extremely important to know when you’ve succeeded, and it is just as important to be able to reverse course when you’ve realized something has failed. If you ever find your team suggesting that you “burn the ships” in reference to a migration, then I’d recommend finding a less risky approach.

As our team worked to refactor the automated-messages monolith we considered all of the following strategies in some form or another.

Move all traffic

For our particular problem, moving all of the traffic through the new microservices would have been easy, but came with a lot of risk. If something went wrong and we sent the incorrect appointment reminders there would be a permanent record, in the form of SMSs, that proved how badly we had screwed up. (For example, if thousands of appointment reminders got sent at 4:30 AM then that would be less than ideal — good thing that never happened 😬 🤦‍♂️).

Had we been dealing with a simple CRUD application we may have just deployed it one evening for a few minutes to see if it worked out. The rollback strategy in that case is simple: if it doesn’t work then swing traffic back to the monolith.

Load-balance a fraction of the traffic

This approach might have worked, but there were two issues with it. For one, Weave didn’t necessarily have a framework for accomplishing this easily between our services. (We communicate using gRPC, and load-balancing has been a bit of an afterthought.) Furthermore, we needed more control over which requests would use the new services so that we could track success on a per-customer basis rather than leaving it to random chance for each request.

Feature Flags

Our team was able to leverage feature flags (or toggles) easily because Weave already had a framework in place for beta testing. We were able to turn on a feature flag that routed a specific customer’s traffic through the new services, giving us confidence that we could move forward safely with more and more customers, but still roll back easily if there were any negative results.
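Weave’s flag framework isn’t shown here, but per-customer routing like this typically reduces to a single branch at the call site, wrapped so the rest of the code doesn’t care which path was taken. A hypothetical Go sketch:

```go
package routing

import "context"

// FlagChecker abstracts whatever feature-flag/beta framework is in place
// (a hypothetical interface, not Weave's actual API).
type FlagChecker interface {
	Enabled(ctx context.Context, flag, customerID string) bool
}

// Scheduler is the operation being migrated.
type Scheduler interface {
	ScheduleReminder(ctx context.Context, customerID, appointmentID string) error
}

// flaggedScheduler routes a customer's traffic to the new microservice only
// when the flag is on for that customer; turning the flag off is the rollback.
type flaggedScheduler struct {
	flags    FlagChecker
	legacy   Scheduler // the monolith's code path
	migrated Scheduler // client of the new microservice
}

func (s flaggedScheduler) ScheduleReminder(ctx context.Context, customerID, appointmentID string) error {
	// The flag name is made up for illustration.
	if s.flags.Enabled(ctx, "automessaging-v2", customerID) {
		return s.migrated.ScheduleReminder(ctx, customerID, appointmentID)
	}
	return s.legacy.ScheduleReminder(ctx, customerID, appointmentID)
}
```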

Shadow Reads

The last strategy we considered was to implement shadow reads for data. The idea is to use both the trusted, legacy monolith and the rearchitected microservices to service a request, then to compare the results to make sure the new services are working correctly. If there is anything unexpected, we log the error and use the legacy’s “calculation” as the truth instead.
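As a sketch of what that can look like (none of this is Weave’s actual code), the comparison and fallback fit in a small wrapper:

```go
package shadow

import (
	"context"
	"log"
	"reflect"
)

// Schedule is a placeholder for whatever the calculation produces.
type Schedule struct {
	// ... fields describing when and what to send ...
}

type calcFunc func(ctx context.Context, customerID string) (Schedule, error)

// shadowRead serves every request from the trusted legacy code while also
// running the new microservice's calculation and logging any disagreement.
// The legacy result is always returned, so rollback is implicit.
func shadowRead(legacy, migrated calcFunc) calcFunc {
	return func(ctx context.Context, customerID string) (Schedule, error) {
		trusted, err := legacy(ctx, customerID)
		if err != nil {
			return trusted, err
		}
		if candidate, cerr := migrated(ctx, customerID); cerr != nil {
			log.Printf("shadow read failed for %s: %v", customerID, cerr)
		} else if !reflect.DeepEqual(trusted, candidate) {
			log.Printf("shadow read mismatch for %s: legacy=%+v new=%+v", customerID, trusted, candidate)
		}
		return trusted, nil
	}
}
```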

The rollback strategy is baked right into the code. We can test our entire customer base at once (instead of the slow roll required by feature flags). The downside is that implementing this is more work and therefore comes with more inherent risk.

We used this strategy effectively a few times, but generally opted to maintain more control by using feature flags.

Lesson 5: Avoid overcorrecting

I mentioned that Weave used to deploy services on VMs, but we made life easier by switching to Kubernetes. This has made spinning up a new service so easy that Weave’s devs will do it on a whim, which is why we now have hundreds of services to monitor and maintain. We may have made it a little too easy.

While I would largely defend where Weave is settling with its microservice architecture, there is always a risk of over-correcting from your monolith and ending up with more services than is prudent. If you’re not careful, a simple change like adding a field to some data model might require 6–10 pull requests to various projects. Plus, each new service adds complexity for all new developers who come to your company.

I think this is a lesson Weave is still learning, and we will see how it plays out. It’s probably important to realize, however, that it’s OK if you break apart a monolith and end up with 3 smaller, more well-defined monoliths; that just might be a better approach for your company than 25 truly nano-services.

Lesson 6: Clean up after yourself

The last lesson is an important one: the migration is not complete until you clean up after yourself.

I mentioned that our migration for the automated messaging service at Weave took place over the course of many months. So, as the number of microservices grew, the codebase of the original monolith (supposedly) shrank.

Unfortunately, I wasn’t always as diligent at cleaning up after myself as I should have been, which often led to confusion. Other team members and I sometimes found it difficult to debug an issue when the same or similar code seemed to exist in two places.

In at least one situation, a new developer on my team did a couple of days’ worth of work on code that had been dead for about a month. I had to confess that I had never gotten around to deleting the old code and had therefore caused the confusion.

We learned some lessons the hard way, but overall we had a good experience smashing some monoliths at Weave into more maintainable microservices.

Please clap.
