An experience of breaking the software monolith
Startups grow, and their engineering needs change along with that growth. The product that was built to support the core features and to have a meaningful business impact now needs to process an ever-increasing volume of API requests, asynchronous message consumption, event streaming, and so on. For the company to keep growing, the focus turns to scalability and to the maintainability of the engineering workflow. For our company, this process involved starting to break the monolith.
In the early days, we had a single application (affectionately called main and composed of almost 300k lines of code) that contained the core of our business, and all the engineers worked on it. As time passed, we started distributing new features and contexts across new applications, but a substantial part of our daily work was still on the monolith, with the usual problems of large codebases (long build/deploy times, degraded database performance, etc.).
This made it clear that we needed to start removing some contexts from this codebase. The goal was that, over time, more code would be removed from it than added; once that point is reached, the breakdown becomes a more organic process and teams naturally prefer to implement their new features in separate applications. I had the fortune of participating in the monolith breakdown team, and in this text I’ll explain some of the steps we went through while removing one business context from the monolith.
Define the breakdown strategy
The objective here is for the new service to be small enough to be developed by a small team and to be easily tested, following the Single Responsibility Principle (valuing cohesion and a small set of strongly related functions), while being big enough that most new and changed requirements in that business domain affect only this single service. Following the Common Closure Principle, code that changes for the same reason should live in the same place.
For this step, there are basically two schools of thought:
- Decomposition by business capability: services are defined around business capabilities (something a business does in order to generate value). In this case, the focus is on the operations rather than on the domain objects, so we often end up with multiple services that manipulate the same objects, but for different business capabilities.
- Decomposition by subdomain: in this approach, we use Domain-Driven Design to separate the business into multiple subdomains, each becoming a single service. In this case, what drives the breakdown is the concepts, not the operations performed on them.
For the case described in this article, we used the decomposition by subdomain approach, since it better suits the microservices architecture we currently have.
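To make the difference between the two approaches concrete, here is a purely illustrative sketch; the service and method names are hypothetical and do not come from our codebase.

```python
# Decomposition by business capability: services are named after operations,
# and more than one of them manipulates the same Contract concept.
class ContractSigningService:          # capability: getting a contract signed
    def request_signatures(self, contract_id: str) -> None: ...

class ContractBillingService:          # capability: charging for a signed contract
    def issue_first_invoice(self, contract_id: str) -> None: ...


# Decomposition by subdomain: one service owns the Signature concept end to end,
# and every operation on signatures lives inside that single bounded context.
class SignatureService:
    def request(self, contract_id: str) -> None: ...
    def confirm(self, signature_id: str) -> None: ...
    def cancel(self, signature_id: str) -> None: ...
```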
Understand the domain deeply
Before getting into coding and starting to remove classes and drop tables, we need to define what will be (and, most importantly, what will not be) removed. Even small contexts communicate with others, and it’s a tough call to know where to make the incision. Knowing the domain deeply is the key to making this decision. So, the first thing we did was to investigate the entities/tables related to the context we wanted to remove, paying special attention to where this context appeared in other ones (through foreign keys, for example).
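As an example of this kind of investigation, the sketch below lists every foreign key in the monolith database that points at the tables of the context being extracted. It is a minimal sketch assuming a PostgreSQL database; the table names are hypothetical.

```python
import psycopg2

# Hypothetical tables belonging to the context being extracted
CONTEXT_TABLES = ("contract_signature", "signature_request")

QUERY = """
    SELECT tc.table_name   AS referencing_table,
           kcu.column_name AS referencing_column,
           ccu.table_name  AS referenced_table
      FROM information_schema.table_constraints tc
      JOIN information_schema.key_column_usage kcu
        ON tc.constraint_name = kcu.constraint_name
      JOIN information_schema.constraint_column_usage ccu
        ON tc.constraint_name = ccu.constraint_name
     WHERE tc.constraint_type = 'FOREIGN KEY'
       AND ccu.table_name IN %s
"""

def list_incoming_foreign_keys(dsn: str) -> None:
    """Print every FK in the monolith schema that references the context tables."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY, (CONTEXT_TABLES,))
        for referencing_table, column, referenced_table in cur.fetchall():
            print(f"{referencing_table}.{column} -> {referenced_table}")

if __name__ == "__main__":
    list_incoming_foreign_keys("postgresql://localhost/monolith")
```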
After having a better understanding of which context/data would be migrated to the new service, we started implementing this domain in the new codebase, improving it to solve the problems we were already aware of in this context. This way, we were not only separating the code but also improving its functionality, making it more flexible and reliable.
In our first experience with monolith breakdown here at QuintoAndar, we decided to extract the contract signature domain into a new service. Although it is a small domain compared to others like houses and users, it was not a trivial job to determine where the ties should be cut, which relationships we would keep, and so on, since domains are seldom completely separate from each other. This is the most important part of the whole process: the decisions taken here determine whether the extraction succeeds with no coupling left behind, or whether the coupling remains and the complexity actually increases, since the flow now involves remote calls between services. So it’s fundamental to take the time to get this right.
(Re)Implement the main features
After having the foundation of the new application (the domain), we started implementing the business flows related to this context in the new service. Here, a significant part of the legacy code could be reused, with the changes needed to fix known problems and the refactorings necessary to bring this code up to the engineering standards we have today (standards change over a company’s lifetime, so it’s important to pay attention to this detail when reusing old code).
Progressive release, while still supporting the current flows in the monolith
The next step, after adding all the required business functionality to the new microservice, is to actually start routing the production flows through this new code. Did we do everything at once, in an all-or-nothing strategy?
Not at all! We value our peace of mind and our jobs, so we did this using the well-established Canary Deployment strategy (aka progressive rollout). With this approach, we roll out the new flow incrementally, starting with a few users and increasing the share until everything passes through the new code. The image below illustrates the process.
By not changing the whole structure at once, we gave ourselves time to validate that everything was working properly and minimized the impact on users when something needed to be fixed. Although we have a lot of automated tests and ran a few bug bash sessions with our quality analyst to guarantee that the flow also worked from the user’s perspective, there are always things we only discover once the system is live. By starting slowly before committing for good, we avoided a lot of the trouble of changing the whole architecture of a feature.
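As a rough illustration of how the traffic split can be controlled, here is a minimal sketch of a deterministic, percentage-based rollout flag; the function name and the way users are bucketed are assumptions, not our actual feature-flag tooling.

```python
import hashlib

ROLLOUT_PERCENTAGE = 5  # start small and raise it as confidence grows

def use_new_signature_service(user_id: str) -> bool:
    """Route a stable slice of users to the new service, the rest to the monolith."""
    # Hashing gives every user a stable bucket from 0 to 99, so the same user
    # always lands on the same side of the split while the percentage is fixed.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENTAGE
```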
Since in this step we have two services performing the same flow and the domain ties have not been completely cut yet, it may be necessary to inform the monolith about data processed by the new flow. Taking our case as an example, the Contract entity in the monolith needed to know about the signatures processed by the new service, so we had to send the events back to the monolith. This was done using Apache Nifi as a CDC tool to listen to the database events; our new microservice then enriched each event with the info the monolith needed and finally sent it back. With this approach, we guaranteed data consistency. The diagram below illustrates this flow.
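For illustration, a minimal sketch of the enrichment step is shown below, assuming the CDC events land on a Kafka topic and the monolith exposes an HTTP endpoint for them; the topic name, payload fields, and endpoint are hypothetical.

```python
import json

import requests
from kafka import KafkaConsumer  # kafka-python

# Hypothetical CDC topic fed by the change-data-capture pipeline
consumer = KafkaConsumer(
    "signature-db-changes",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Enrich the raw CDC event with the fields the monolith's Contract entity needs
    enriched = {
        "contract_id": change["contract_id"],
        "signature_status": change["status"],
        "signed_at": change.get("signed_at"),
    }
    # Send the enriched event back to the monolith so its data stays consistent
    requests.post(
        "https://monolith.internal/api/signature-events",
        json=enriched,
        timeout=5,
    )
```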
Migrate the data
After everything was validated and the full release was done, it was time to tie up the loose ends, so this section and the next cover this cleanup.
The first thing was to migrate the old data from the monolith database to the new service (where it really belongs). This step may also involve dropping foreign key relationships between the tables that remain in the monolith and the ones being removed.
Again, this process was done using Apache Nifi (our company’s standard CDC/event generation tool) to read the old signature data, enrich it where necessary, and send it to the new service’s database.
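For readers without a CDC tool in place, the backfill can also be pictured as a plain batch script. The sketch below assumes direct read access to both PostgreSQL databases; the table and column names are hypothetical, and our actual migration ran through Nifi.

```python
import psycopg2

BATCH_SIZE = 1_000

def backfill_signatures(monolith_dsn: str, service_dsn: str) -> None:
    """Copy legacy signature rows from the monolith into the new service's database."""
    with psycopg2.connect(monolith_dsn) as src, psycopg2.connect(service_dsn) as dst:
        # A named (server-side) cursor streams the legacy rows in batches
        with src.cursor(name="legacy_signatures") as read, dst.cursor() as write:
            read.itersize = BATCH_SIZE
            read.execute(
                "SELECT id, contract_id, status, signed_at FROM contract_signature"
            )
            for row in read:
                # Keeping the legacy id makes re-running the script idempotent
                write.execute(
                    """
                    INSERT INTO signature (legacy_id, contract_id, status, signed_at)
                    VALUES (%s, %s, %s, %s)
                    ON CONFLICT (legacy_id) DO NOTHING
                    """,
                    row,
                )
```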
Remove the legacy code from the monolith
This is my favorite step: removing legacy code! Together with the data migration step just described, it’s time to remove the monolith functionalities that were migrated to the new service.
This is an important step, although it is easily forgotten in the ever-changing software development environment, since its value is not so clear at first glance (it does not directly impact any feature or functionality of the system). But keeping dead code in a codebase greatly disturbs the developer workflow, increasing build/deploy times and making the ramp-up harder for engineers who are not aware that the code is no longer used (they will keep wondering: “This must be here for a reason”). So it’s important to remove this code for good before moving on to the next projects.
Conclusion
Our experience described in this article was the first time we faced the monolith breakdown in a more structured way as a company, with a team fully focused on understanding the domains and their ties, and on discovering the difficulties inherent to this process.
We started by removing a straightforward domain from the monolith, and yet we discovered that the process has many pitfalls and points of attention that must be handled with care in order to have a calm and successful rollout. Since we are dealing with data, losing events due to poor architecture or implementation can be extremely harmful to the company’s health.
But this is a challenge we must accept as a top-notch technology company that has already earned its place in the market; the monolith architecture no longer suits our needs in terms of development and performance, so the monolith breakdown is a topic we will keep discussing a lot in the future.
And I hope this article and our experience so far act as a starting point for these discussions!
If you liked what we did and want to be a part of our amazing team, you can find all our job openings here.