How we modernise legacy systems at CSIT

From Monolith to Microservices

Terence TBK · Published in CSIT tech blog · Jul 29, 2022 · 8 min read


Legacy systems are outdated software and/or hardware that are still in use. These systems tend to be faulty and have compatibility issues with newer systems. Although they can still be used according to their original design, they are difficult to enhance.

With rapid advancements in technology, companies must deal with their legacy systems to keep up, as these systems hinder the adoption of modern technologies.

Monoliths in legacy systems


A monolith system is a large application that contains code written by multiple developers over the course of many years. The code is often poorly maintained. Some of these developers may have left the development team or the company, resulting in knowledge gaps. The code is hard to refactor due to these knowledge gaps, coupled with the complexity of modifying a system that is always in use in production (i.e. changing the wheels of a moving car).

Costs of having legacy systems:

  • Higher overheads in maintenance and support;
  • Lower efficiency.

Dealing with monoliths in legacy systems

Since the monolith system was still required (until we could replace all of its functions entirely), we decided to take a more manageable approach and break it up into smaller services using microservices.

What are microservices?

Microservices are independent services, each performing a specific business function, that together make up a larger system. These services are meant to be lightweight and easier to implement.

Benefits of microservices:

  • Each service is independently scalable;
  • Services have smaller code bases that are easier to maintain and test;
  • Problems are isolated to a single service, enabling quicker troubleshooting.

As an analogy, let’s think of a restaurant as a system. A restaurant has many staff: waiters, cooks, dishwashers, a manager, and so on. Each staff member is like a microservice that does a specific part of the work for the restaurant, with the restaurant being the encompassing larger system.

If customers’ orders are not being taken down fast enough, more waiters can be hired. Similarly, more instances of a microservice can be added when the load gets higher.

Now, imagine a monolith as an entire restaurant with a fixed set of staff and no way to individually change the number of waiters, cooks, etc. The only way to take orders more quickly is to pay for a whole new outlet, even if all you needed was one more waiter! Scaling monolithic systems is very inefficient.

Microservice architectures

Imagining how a restaurant might look with a microservices architecture.

When adopting the microservices approach, there are two main microservice architectures to consider. Each has pros and cons that fit different use cases:

  • Orchestration, as the name implies, requires an orchestrator to actively control the work done by each service.
  • Choreography takes a less strict approach by allowing each service to carry out its work independently.

Microservice Orchestration

Going back to the restaurant analogy, the restaurant’s manager takes up the role of the orchestrator. Every staff member waits for the manager’s instructions before doing any work.

First, the manager will tell the waiters to take orders from customers. The waiters will return to the manager with the order slips. Next, the manager takes the order slips, and instructs the cooks to begin cooking the food.

Once the food is ready, the cooks hand over the food to the manager. The manager then hands the food over to the waiters and instructs them to serve the food to customers. Finally, the waiters go back to the manager to report that the food has been served, and the cycle repeats.

This relies heavily on the orchestrator/manager and is very tightly-coupled.
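To make this concrete, here is a minimal sketch of what an orchestrator could look like in code, staying within the restaurant analogy. The class and method names (RestaurantOrchestrator, WaiterService, CookService) are hypothetical and chosen only to mirror the analogy; this is not taken from our actual system.

```java
// Hypothetical orchestrator: all workflow logic lives in this one class.
interface WaiterService {
    String takeOrder();            // returns an order slip
    void serve(String dish);
}

interface CookService {
    String cook(String orderSlip); // returns the prepared dish
}

public class RestaurantOrchestrator {
    private final WaiterService waiter;
    private final CookService cook;

    public RestaurantOrchestrator(WaiterService waiter, CookService cook) {
        this.waiter = waiter;
        this.cook = cook;
    }

    // The orchestrator drives every step; the services only act when told to.
    public void handleCustomer() {
        String orderSlip = waiter.takeOrder(); // manager tells the waiter to take the order
        String dish = cook.cook(orderSlip);    // manager passes the slip to the cook
        waiter.serve(dish);                    // manager tells the waiter to serve the dish
    }
}
```

Notice that the entire workflow lives in handleCustomer(): if the flow changes, the orchestrator must change with it.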

Microservice Choreography

Applying the analogy further, Choreography is a more loosely coupled approach than Orchestration. Staff can independently pick up work to do on their own, without needing to wait for the manager to assign them. It will seem familiar to you as it is how a real-life restaurant would run.

Staff will carry out work based on events. For instance, the first event occurs when a customer enters the restaurant. This event triggers the waiter to take the customer’s order. Once orders are taken, the order slips are queued up in a ticket clipper for the cooks to process. Cooks will proceed to cook the food on the order slip at the front of the queue.

Once the food is ready, the cooks will place the food on the collection counter. The waiters then pick up the food and serve it to customers. Finally, when the customers are done with their meal, the waiters will hand over the dishes to the dishwashers.

This architecture is loosely-coupled and event driven, where the first event is the entry of a customer into the restaurant. The next few events are triggered by the waiter, then the cook, and back to the waiter. Staff can work independently, and in parallel, without needing the manager to control everything.
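For contrast, here is a minimal choreography sketch under the same analogy. In this toy version, in-memory queues stand in for the event broker (the ticket clipper and the collection counter); in a real system each method would live in its own service and react to events from a broker such as Kafka or RabbitMQ. All names here are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical choreography sketch: queues play the role of the event broker,
// and each "service" reacts to events on its own, with no manager in the loop.
public class RestaurantChoreography {

    static final BlockingQueue<String> orderSlips = new LinkedBlockingQueue<>(); // ticket clipper
    static final BlockingQueue<String> counter = new LinkedBlockingQueue<>();    // collection counter

    // Waiter reacts to a "customer arrived" event and publishes an order slip.
    static void waiterTakesOrder(String customer) throws InterruptedException {
        orderSlips.put("order for " + customer);
    }

    // Cook consumes order slips from the queue and publishes dishes, independently.
    static void cookWorks() throws InterruptedException {
        String slip = orderSlips.take();
        counter.put("dish (" + slip + ")");
    }

    // Waiter consumes dishes from the counter and serves them, independently.
    static void waiterServes() throws InterruptedException {
        System.out.println("Served: " + counter.take());
    }

    public static void main(String[] args) throws InterruptedException {
        waiterTakesOrder("table 5"); // event: customer enters
        cookWorks();                 // event: order slip queued
        waiterServes();              // event: dish ready at the counter
    }
}
```

No single class owns the end-to-end flow; each step is triggered by the event left behind by the previous one.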

Wondering where the manager went? Probably on vacation, freed from all that micro-managing.

Comparison

Orchestration vs. Choreography

Selecting the right architecture

It’s not about one being better than the other; it all depends on your needs.


In our case, the Choreography architecture was chosen for these reasons:

  • Decentralised Workflow
  • Independent Scalability
  • Ease of Extensibility

Decentralised Workflow

The main disadvantage of Orchestration is its reliance on a single orchestrator. The orchestrator embeds all the processing logic in itself, becoming the owner of the entire logical workflow for the system (a centralised workflow). With Choreography, that logic is spread across the services themselves, so no single component becomes a bottleneck or a single point of failure.

Independent Scalability

Independent scalability requires processing logic to be spread across microservices (a decentralised workflow). In restaurant terms, this means hiring more employees for specific roles instead of opening entirely new outlets.

Ease of Extensibility

Extensibility is the ability to add more functionality to the system while not (or minimally) affecting other parts of the system. This is important for us because our systems are highly interconnected with other systems. We need to keep improving and adding features.

For example, a restaurant’s food supply chain team could use order data to optimise ingredient purchasing based on the popularity of specific menu items.

Comparison for adding a new workflow

With the Orchestration architecture, the manager (orchestrator) has to juggle the new workflow on top of his earlier duties. This adds complexity to the orchestrator, and it affects the existing workflow directly.

Fortunately, with Choreography, the supply chain team can simply consume events from the event broker to keep track of the orders on their own. This does not affect other parts of the system.
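As a rough illustration of what such a consumer could look like on a Spring + Kafka stack, the sketch below subscribes to an assumed "orders" topic using its own consumer group, so existing producers and consumers are left untouched. The topic, group, and class names are invented for this example and are not taken from our system; it assumes the spring-kafka dependency is on the classpath.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

// Hypothetical new consumer: the supply chain service subscribes to the existing
// order events with its own consumer group, without changing any other service.
@Service
public class SupplyChainListener {

    // "orders" and "supply-chain" are illustrative names, not our real topic/group.
    @KafkaListener(topics = "orders", groupId = "supply-chain")
    public void onOrderEvent(String orderEvent) {
        updateIngredientForecast(orderEvent);
    }

    private void updateIngredientForecast(String orderEvent) {
        // Aggregate the popularity of menu items and adjust purchasing plans here.
    }
}
```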

Our experience with Microservice Choreography

Tech stack for our Microservice Choreography-based system

First, a quick introduction to our tech stack. Our microservices are mainly written using the Spring framework and deployed on Kubernetes as Docker containers. NiFi enables us to do extract, transform, and load (ETL). MongoDB and MinIO are our datastores, while Kafka and RabbitMQ are used as message brokers. Finally, the Elasticsearch + Fluentd + Kibana (EFK) stack is used for Observability and Monitoring.

Problems started to surface

Alas, no architecture is perfect.

A large set of microservices leads to:

  • Difficulty in finding root cause of errors
  • Difficulty in monitoring whether every service is running correctly
  • Difficulty in tracking progress
  • Difficulty in recovering from failure

Solving the problems

1. Building Resiliency with Chaos Engineering

Chaos Engineering is about conducting experiments on a system to ensure that it can withstand unexpected scenarios that are thrown at it. These experiments are conducted during something known as GameDay.

My colleague Rain Chua has written an awesome article on how we conduct GameDay here at CSIT: Improving Operational Resiliency through GameDay.

2. Enabling detection of errors/failures through extensive Observability

Key components of Observability

In Rain’s GameDay article, he touched on the need for extensive Observability to know the impact of the Chaos Engineering experiments being conducted.

A detailed look at how we implemented Observability will follow in a separate article.

3. Recovering from errors/failures using automation & customised tools

Observability enables the detection of errors, but it does not fix them automatically. We categorised our errors into two types: simple errors and errors that require manual intervention.

For simple errors (such as those caused by API request timeouts), we created automated retry mechanisms. For those that needed manual intervention, we built tools to routinise the recovery process and reduce the manual effort involved.
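As a simplified sketch of the retry idea (not our actual tooling), a small helper like the one below retries a call a few times with a delay before giving up and surfacing the error for manual handling:

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical retry helper for "simple" errors such as API request timeouts:
// retry a few times with a fixed delay before escalating to manual intervention.
public final class SimpleRetry {

    public static <T> T withRetry(Callable<T> call, int maxAttempts, Duration delay) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                       // e.g. a timeout from a downstream API
                Thread.sleep(delay.toMillis()); // back off before the next attempt
            }
        }
        throw last; // still failing after all attempts: surface for manual handling
    }
}
```

A flaky downstream call could then be wrapped as SimpleRetry.withRetry(() -> client.fetchOrder(id), 3, Duration.ofSeconds(2)), where client and fetchOrder stand in for whichever API is timing out.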

We also built customised tools to provide a consolidated view of all errors and the relevant information for debugging. Imagine having to repeatedly type commands into a terminal to gather information across the different platforms (MongoDB, Kafka, Kibana) every time you troubleshoot!

More details to follow in a future article. So, follow the CSIT Blog for updates!

Final thoughts

  • Microservices architecture may not be suitable for all projects.
  • Selecting an architecture should be based on the needs of the project.
  • Always expect new problems to arise, and be ready to adapt to them.

About CSIT’s Tech Environment

CSIT uses technology to enable and advance national security in Singapore. Due to the highly classified nature of our work, the environment has to be air-gapped. Unfortunately, this prevents us from enjoying the ease of using SaaS/IaaS/PaaS products (e.g. AWS, Azure, GCP, Datadog, Dynatrace) that are readily available on the internet.

This means that development and deployment are done on networks that are not connected to the internet. Thus, all platforms, such as Kubernetes, had to be set up on-premises.

But fret not! Despite not being able to use internet-connected services, CSIT has a Cloud Infrastructure and Services department that provides the core infrastructure, allowing developers to focus on software development.

My colleague Mark Lee has written a great primer on this in What it’s like to work as a Software Engineer (Infrastructure) at CSIT.
