A Journey From Blocking Monolith To Non-Blocking Microservices
Microservices as an architectural style is here to stay, and its adoption is picking up steam. Long gone are the days of heated debate between advocates and sceptics. The same goes for reactive systems. I am not going to discuss the benefits or trade-offs (yes, there are a few) of following these approaches; many brilliant folks have already done that. This post is purely meant to describe the process that led to the redesign and rewrite of one of my client’s flagship systems, from the perspective of its designer.
The system in question is rather basic: its main purpose is to allow users and B2B clients to search for records given a set of criteria. Nothing fancy.
It only manages the data it can search for, apart from a simple user model. It does not own the data; the data belongs to partner organisations and is uploaded daily to an FTP server. A nightly batch process is responsible for basic data validation, transformation and indexation.
The system also allows users to save search preferences and features a simple administration module, providing a few index maintenance operations and system parameterisation.
The legacy design, from 2009, did not take into account the enormous growth, in both index size and usage, experienced over the years. Moreover, both data and usage are expected to grow even more (most likely double) in the near future. Further work performed on the system did not address fundamental design issues and, if anything, only added complexity. One can’t really blame the original designers; back in the day this was business as usual. Plus, waterfall and off-site development had their share of the blame as well.
The system was originally designed as a typical 3-tier J2EE web application. The stack is based on Spring MVC and EJBs, deployed to a stateful JBoss cluster. The index is based on Solr 2 with master-slave replication, and user preferences are stored in a MySQL database. Due to its blocking, synchronous nature, a node occasionally fails during traffic peaks and must be restarted. And we’re not even talking massive scale here: a few hundred requests per second are enough to bring a node down and seriously affect the system’s responsiveness.
The user interface also needs a complete overhaul. Based on JSPs (with plenty of embedded logic), adding new features or improving usability is both complex and costly. Not to mention it screams nineties all over. Ewww!
For those of us who are not lucky enough to be constantly exposed to the forefront of technological progress in our daily jobs (and sadly, I believe that is the majority of us software engineers), we must embrace any opportunity to put into practice the knowledge and wisdom acquired from the kind ground-breakers, advocates and contributors who care to share best practices and proven patterns.
So when asked by the Product Owner to think of ways to add a package of new functionality, and in the process address a set of much needed performance improvements, the thought immediately popped into my mind:
It’s time for a complete rewrite!
I had to present my case carefully; rewrites are not well regarded at this client (a Public Sector organisation). I suspect it has to do with a mix of budget/timeline considerations and a more obscure sense of “admission of guilt” that a system in its current form is no longer fit for purpose (if it ever was).
At this point I feel I have to provide some context. As with most Enterprise environments, the organisation has a current Reference Architecture. And it mandates that, whenever it makes sense, new developments must follow the holy trinity of DDD-Reactive-Microservices. Also, the entire support infrastructure is already set up and put in place: discovery and configuration services, fully managed data stores as a service, CI pipelines, containers, etc. Some microservices are already running in production and some others are on their way. All this helped build the case.
The catch was that, for the PO, this was not considered a “new development”, but rather an improvement to an existing capability. What finally convinced them was an enterprise-wide UI/UX rebranding initiative that would be enormously costly to retrofit onto the legacy front end. So when, at their request, I laid down some high-level numbers and benefits/drawbacks for both approaches (refactor vs rewrite), it became clear to all what the right choice would be (I admit, the comparison might have been a bit biased, lol).
The new design is highly constrained by the aforementioned Reference Architecture. Apart from the already discussed holy trinity, it also mandates:
- Use of open source and standards
- Cloud native
- Java back ends
- Most middleware and framework choices
- Platform services and Infrastructure (DCs, DR, network zones, etc.)
Reading this list, one might rightfully ask: “what’s left to design?”. Well, even though the boundaries are a bit tight, one can still have some fun! And again, this is business as usual in an Enterprise; plus, I actively contribute to the Reference too, so that should count :).
In particular, we will introduce the event-driven architecture pattern. Even though the scope of this system is just to maintain a consistent domain index (events triggered by Partner organisations), other systems may also be “interested” in reacting to these events. Once the event store is in place, others may subscribe to it.
Another aspect the new design addresses is the need to stay responsive under any load, making individual components independently scalable (the system is actually read intensive).
As with many other Enterprises out there, we follow a small variant of the infamous Water-Scrum-Fall pattern: a great deal of planning-analysis-design up front (unfortunately required when contracting 3rd parties), a few actual development iterations in the middle and a final testing-acceptance-deployment phase.
We settled on a phased strangling of the monolith. Initially, a brand new UI will be produced, hitting a brand new API (which will in fact front the monolith’s legacy search API). And before anyone asks: to tackle the existing instability issues, this first phase will already provide back-pressure and circuit breaker mechanisms. UI master data will also be moved to the configuration service. This is ongoing as I write.
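To illustrate the circuit breaker half of that mechanism: the real implementation relies on Hystrix, but the idea can be sketched in a few lines of plain Java. The threshold and cool-down values below are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Minimal circuit breaker: trips OPEN after a run of failures, so callers
// fail fast instead of piling blocked requests onto a struggling node, and
// lets a single trial request through once a cool-down period has elapsed.
public class CircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    public CircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized boolean allowRequest() {
        if (state == State.OPEN && Instant.now().isAfter(openedAt.plus(coolDown))) {
            state = State.HALF_OPEN;   // cool-down over: allow one trial request
        }
        return state != State.OPEN;
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;          // downstream recovered, resume normal flow
    }

    public synchronized void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;        // trip the breaker
            openedAt = Instant.now();
        }
    }
}
```

Hystrix adds fallbacks, metrics and thread-pool isolation on top of this basic state machine, which is why we use it rather than rolling our own.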
Later, in a second phase, we will shift identity and access management to where it belongs (the enterprise portal’s User Service), provide a new administration dashboard and completely replace the search/indexing engines.
The new looks
The first phase is all about complying with the new corporate identity and providing a fresh user experience. Usability is improved, benefiting from extensive study of the behaviour of similar, cutting edge, online search engines.
Anyway, the new UI Client is going to be a React app (served from a Spring Boot app), leveraging Flux’s unidirectional data flow, Redux’s state container and HTTP/2’s efficiency. Presentation components build on top of an off-the-shelf component library, and action creators dispatch promises to the new back-end REST API.
The Edge Service, also a Spring Boot app equipped with Netflix OSS (Zuul, Hystrix, etc.), must be capable of routing both new UI and legacy clients’ requests to the Search Service and the Monolith respectively, and master data queries to Consul. It will also provide load balancing, resilience and analytics capabilities.
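To make the routing concrete, a Zuul route table along these lines could live in the Edge Service’s application.yml; every path and service id below is an assumption for illustration, not the actual configuration:

```yaml
# Illustrative Zuul routes (application.yml); names and paths are made up
zuul:
  routes:
    search:
      path: /api/search/**
      serviceId: search-service   # new redesigned search API
    legacy:
      path: /legacy/**
      serviceId: monolith         # existing B2B clients keep working
    master-data:
      path: /api/master-data/**
      serviceId: config-service   # UI master data moved to the configuration service
```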
And finally, the Search Service will act as an API adapter: it will expose the new, redesigned search API, translate incoming requests and route them to the Monolith, and transform the results before returning them to the UI Client. All of this will be done in a non-blocking fashion, courtesy of Reactor.
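The production service builds this flow on Reactor; as a rough sketch of the same adapt-call-transform idea using only the JDK’s async HttpClient (the query translation and result format below are invented for illustration):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

// Sketch of the adapter: translate a new-API request into the legacy query
// format, call the Monolith asynchronously (no thread blocks waiting on it),
// then transform the result for the UI Client.
public class SearchAdapter {
    private final HttpClient client = HttpClient.newHttpClient();
    private final String monolithBase; // e.g. "http://monolith/search" -- an assumption

    public SearchAdapter(String monolithBase) { this.monolithBase = monolithBase; }

    // Hypothetical translation to the legacy API's parameters.
    String toLegacyQuery(String term, int page) {
        return "?q=" + term + "&start=" + (page * 20);
    }

    // Hypothetical transformation of the legacy response body.
    String toNewFormat(String legacyBody) {
        return "{\"results\":" + legacyBody + "}";
    }

    public CompletableFuture<String> search(String term, int page) {
        HttpRequest req = HttpRequest.newBuilder(
                URI.create(monolithBase + toLegacyQuery(term, page))).build();
        return client.sendAsync(req, HttpResponse.BodyHandlers.ofString())
                     .thenApply(HttpResponse::body)
                     .thenApply(this::toNewFormat); // transform before returning
    }
}
```

Reactor replaces the `CompletableFuture` chain with `Mono`/`Flux` pipelines, adding back-pressure and richer composition operators.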
At this stage, the current Monolith’s cluster will be reinforced with more standby nodes, just in case. Also, system administration functions will still be handled by the Monolith’s legacy UI.
The new engine
The second and last phase will be, broadly speaking, slightly more complex. The main tasks it sets out to accomplish are:
- Finally strangle the monolith, splitting its search and indexation functionalities into separate components, following the CQRS pattern;
- Implement Event Sourcing as the engine of entity state;
- Redesign the index and replace the search engine.
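To make the Event Sourcing item concrete: the current state of an entity is never stored directly, it is rebuilt by replaying the stream of events recorded for it. A minimal sketch, with entirely hypothetical domain events for an indexed record:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Event sourcing in a nutshell: state is the left fold of the event stream.
public class RecordState {
    // Hypothetical domain events; the real model will be richer.
    interface Event {}
    record Created(String id) implements Event {}
    record FieldUpdated(String field, String value) implements Event {}
    record Deleted() implements Event {}

    final Map<String, String> fields = new HashMap<>();
    boolean deleted = false;

    // Rebuild state by applying every event in order.
    static RecordState replay(List<Event> events) {
        RecordState state = new RecordState();
        for (Event e : events) {
            if (e instanceof FieldUpdated u) {
                state.fields.put(u.field(), u.value()); // later events win
            } else if (e instanceof Deleted) {
                state.deleted = true;
            } // Created carries no state beyond the record's existence
        }
        return state;
    }
}
```

This is exactly what lets other systems subscribe to the event store later: the full history, not just the latest snapshot, is the source of truth.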
In this phase the Search Service will no longer need to adapt UI Client requests, but it will have to expose the legacy API as well. The reason is backward compatibility: external clients cannot be forced to upgrade. Its usage will be monitored, though, and a plan to phase out the old API will be put in place, notifying external clients well in advance.
The new Index Service, also a Spring Boot app, will no longer update the index directly as the Monolith currently does; instead, it will record a log of events in a Kafka cluster. It will still batch-process data uploaded to the FTP server, though no longer on a schedule but rather any time new data is available. A brand new REST API will also be exposed to allow Partners who wish to do so to provide real-time index updates, either in batch or atomically.
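The property that makes the Kafka log viable here is per-partition ordering: keying events by record id guarantees that each record’s history is consumed in the order it was written. A toy in-memory stand-in for the topic, just to show the partitioning behaviour (partition counts and payloads are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for a partitioned Kafka topic: events keyed by record id
// always land on the same partition, preserving per-record ordering.
public class EventLog {
    private final List<List<String>> partitions;

    public EventLog(int partitionCount) {
        partitions = new ArrayList<>();
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new ArrayList<>());
        }
    }

    // Append an event; the key's hash decides the partition.
    public int append(String recordId, String payload) {
        int p = Math.floorMod(recordId.hashCode(), partitions.size());
        partitions.get(p).add(payload);
        return p; // same record id, same partition, every time
    }

    public List<String> partition(int p) {
        return partitions.get(p);
    }
}
```

Kafka’s actual partitioner (murmur2 over the key bytes) differs in the hash, but the keying contract is the same.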
The new index will leverage the client’s successful experience with Elastic Stack. A Logstash pipeline will act as the event processor, consuming domain events, applying necessary transformations and finally updating the Elasticsearch index with the processed documents.
Apart from the natural synergy between Elasticsearch and Logstash, and the benefit of handling JSON documents end to end, the main reason behind the Elastic Stack choice is the ability to customise document routing, effectively splitting documents among shards for optimal search performance without incurring hotspots.
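Wired together, the pipeline could look roughly like the following Logstash configuration; the topic, field and host names are all assumptions for illustration:

```
# Illustrative Logstash pipeline: Kafka events in, routed documents out
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["record-events"]
    codec => "json"
  }
}
filter {
  mutate { remove_field => ["@version"] }   # plus the real transformations
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "records"
    document_id => "%{record_id}"
    routing => "%{partner_id}"   # custom routing: co-locate a partner's documents
  }
}
```

Routing on a field like the partner id means a search scoped to one partner can be served from a single shard instead of fanning out to all of them.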
The Edge Service will, at this point, implement authentication filtering, configured at the route level. Unauthorised requests will be redirected to the enterprise portal’s User Service for login/registration. This service will be extended to manage portal and system level user preferences, backed by a MongoDB.
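The route-level authentication decision amounts to something like the sketch below; the route list, token check and login URL are all placeholders, and the real filter will run inside Zuul rather than standalone:

```java
import java.util.Optional;
import java.util.Set;

// Sketch of edge authentication: requests to protected routes without a
// token are redirected to the portal's User Service for login/registration.
public class AuthFilter {
    // Hypothetical protected route prefixes and login URL.
    static final Set<String> PROTECTED_ROUTES = Set.of("/api/admin", "/api/preferences");
    static final String LOGIN_URL = "https://portal.example/login";

    /** Returns a redirect target if the caller must authenticate first. */
    static Optional<String> check(String path, String token) {
        boolean isProtected = PROTECTED_ROUTES.stream().anyMatch(path::startsWith);
        if (isProtected && (token == null || token.isBlank())) {
            return Optional.of(LOGIN_URL + "?return_to=" + path);
        }
        return Optional.empty(); // proceed to the routed service
    }
}
```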
The last piece of the puzzle is a new administration dashboard, again a React app, adding new features and functionalities to those already being provided by the Monolith, especially around monitoring and reporting.
It has been an exciting undertaking, from convincing the Product Owner (and, by extension, the Programme Board) to jump on board, to crafting the high-level design, to having it approved by key stakeholders.
It goes without saying that the scope, approach and milestones carry a good amount of risk. Having at least 3 different providers involved in the process doesn’t make it any better. But the prospective benefits make it worth the effort. And besides, considering that we’re in the enterprisy world, in the public sector… I can’t ask for more. Really.
I hope that this was entertaining, perhaps even enlightening. I would like to take the opportunity to thank all the anonymous heroes (and the not so anonymous too) of the open source community, online learning platforms and, especially, the corporations advancing the cause of great free software. Without them, none of this would have been possible.