The Evolution of a Legacy System

Storkey · Published in ELMO Software · Nov 19, 2023


[Image: The cloud]

In the ever-evolving landscape of technology, where innovation races ahead at breakneck speed, legacy software systems and cloud architectures often find themselves stagnating. We have a Legacy System that was once a pioneer of its era at the organisation, but has since become the bane of the DevOps team's existence. In this situation, teams often opt for a complete rebuild; however, with small incremental evolutions, we can breathe new life into this outdated relic and transform it into a modern powerhouse.

First Evolution — Charmander

We were running a legacy integrations system (we called it LT-Cron) to run a set of `Jobs` for a number of clients. LT-Cron was built off a system of hourly cron jobs that would trigger a bash script, which in turn would run the application (hence the name). LT-Cron would be triggered by a Linux cron job at the first minute of every hour and would go through the list of clients one by one; if a client had a `Job` that needed to be run, the system would run that job, and only after completing it would it move on to the next client in the list. Like most legacy applications, this app was highly volatile, prone to errors and had poor logging. This inevitably meant that a lot of time each week was spent supporting and debugging this legacy app. Oh, and I almost forgot to mention it also required a large, dedicated EC2 instance to run.
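For the curious, the trigger itself was nothing fancier than a standard crontab entry. A minimal sketch, assuming a hypothetical wrapper script and log path rather than the real ones:

# Run the LT-Cron wrapper at the first minute of every hour (paths are hypothetical)
1 * * * * /opt/lt-cron/run.sh >> /var/log/lt-cron.log 2>&1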

Obviously, this application had some flaws and architectural problems that needed to be rectified:
1. The sequential nature of the app meant that if there was a critical error for client 2 of 100, the remaining 98 clients' `Jobs` would not be triggered at all. = BAD
2. If one client's `Job` triggered for hour `x` takes a long time, all other `Jobs` are delayed until it completes, again due to the sequential nature of the system. = BAD
3. Using the Linux crontab system to run/trigger a critical application is not ideal.
4. It runs on a dedicated EC2 instance.
5. It is deployed via AWS OpsWorks.
6. It has poor logging.

We, the DevOps team, were tasked with rectifying this application: making it more stable and reliable without changing its inner workings. Thus the evolution begins…

Second Evolution — Charmeleon

As DevOps engineers, our first step in making an application more reliable is to move it onto a more reliable architecture. So what's the one tool that every DevOps engineer throws at any problem they have? You guessed it: KUBERNETES!

We moved LT-Cron to Kubernetes for a number of reasons:
1. To use the Kubernetes CronJob workload type to trigger LT-Cron more reliably (see the sketch after this list).
2. To move to containers and the organisation's more advanced CI/CD toolset, which ensures regular security scans on workloads and all-round better software release practices.
3. To take advantage of the already-running Kubernetes environment and remove a static, 24/7 EC2 instance, saving precious $'s.
4. Our Kubernetes environment automatically captures STDOUT logs and ships them to our central Elastic stack.
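To make the first point concrete, here is a minimal sketch of what a Kubernetes CronJob wrapping LT-Cron could look like. The name, namespace, image and even the schedule are illustrative assumptions, not our actual manifest:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: lt-cron              # hypothetical name
  namespace: integrations    # hypothetical namespace
spec:
  schedule: "1 * * * *"      # first minute of every hour, as before
  concurrencyPolicy: Forbid  # never start a new run while the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: lt-cron
              image: registry.example.com/lt-cron:latest  # hypothetical image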

So the new architecture now looks like this.

Now nothing here has really changed other than where LT-Cron runs. However, we did make the system more reliable, increase visibility tenfold, increase security and move the application into the 21st century. The remaining flaws of this system are:

1. The sequential nature of the app means that if there is a critical error for client 2 of 100, the remaining 98 clients' `Jobs` are not triggered at all. = BAD
2. If one client's `Job` triggered for hour `x` takes a long time, all other `Jobs` are delayed until it completes, due to the sequential nature of the system. = BAD

This is a big one, and it caused lots of issues for the team on a regular basis. We were constantly having to create a CronJob with a `while true; do sleep 10; done` command so that we could `kubectl exec` into the container and re-run specific integrations that were missed. Fixing this became a top priority for the team.

Third Evolution — Charizard

The final evolution of this application involved changing the entrypoint of the containers and rearchitecting it into a pub/sub architecture. Up to this point, the LT-Cron application entrypoint would start up, read from a list of clients and sequentially run the `Jobs` for each client, one by one. In this evolution, we changed this up quite a bit. Now the entrypoint of the Kubernetes CronJob, `sqs_publisher.py`, takes the same list of clients but publishes a message per client into an AWS SQS queue. The message attributes look like this:

{
  "client": "client1",   // client name for the job
  "currHour": "15"       // 24-hour time
}

client — tells the LT-Cron container which client to run for
currHour — tells the LT-Cron container which hour the job is intended to run for (used by the application logic and the definition of jobs)
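Here is a minimal sketch of what the publisher side could look like using boto3. The queue URL, environment variable, client list and function names are assumptions for illustration, not the contents of the real `sqs_publisher.py`:

import os
from datetime import datetime, timezone

import boto3

# Hypothetical queue URL, supplied via an environment variable in this sketch
QUEUE_URL = os.environ["LT_CRON_QUEUE_URL"]


def publish_jobs(clients):
    """Publish one message per client for the current hour."""
    sqs = boto3.client("sqs")
    curr_hour = datetime.now(timezone.utc).strftime("%H")  # 24-hour time
    for client in clients:
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=f"LT-Cron job for {client} at hour {curr_hour}",
            MessageAttributes={
                "client": {"DataType": "String", "StringValue": client},
                "currHour": {"DataType": "String", "StringValue": curr_hour},
            },
        )


if __name__ == "__main__":
    # Hypothetical client list; the real system reads this from configuration
    publish_jobs(["client1", "client2", "client3"])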

We now also have a new Kubernetes CRD introduced into this architecture: Keda ScaledJobs. Keda ScaledJobs allow us to scale pods on things like the number of messages in an AWS SQS queue. With this, we scale one LT-Cron Kubernetes Job per message in the SQS queue. The entrypoint command for these Jobs is `sqs_consumer.py`, which pulls a single message off the SQS queue, extracts the client name and hour from the message attributes, and triggers a modified version of the original LT-Cron entrypoint to run the `Job` for the specified client and the specified hour.
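For illustration, a hedged sketch of what a Keda ScaledJob for this setup could look like. The queue URL, region, image and replica limits are placeholders, not our production values:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: lt-cron-consumer       # hypothetical name
spec:
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: lt-cron
            image: registry.example.com/lt-cron:latest  # hypothetical image
            command: ["python", "sqs_consumer.py"]
  pollingInterval: 30          # seconds between queue checks
  maxReplicaCount: 50          # upper bound on parallel Jobs
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-southeast-2.amazonaws.com/123456789012/lt-cron  # hypothetical queue
        queueLength: "1"       # one Job per message
        awsRegion: ap-southeast-2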

Side note — we included the hour in the message to ensure that even if the system gets clogged up by larger, long-running jobs, the message can be consumed at any time but still run the client's `Job` for the intended hour.
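And a matching sketch of the consumer side, again using boto3 and again with assumed names rather than the real `sqs_consumer.py`:

import os

import boto3

QUEUE_URL = os.environ["LT_CRON_QUEUE_URL"]  # hypothetical configuration


def run_lt_cron(client, hour):
    # Placeholder for the modified original LT-Cron entrypoint,
    # which runs the `Job` for the given client and hour.
    print(f"Running LT-Cron job for {client} at hour {hour}")


def consume_one():
    """Pull a single message off the queue and run the job it describes."""
    sqs = boto3.client("sqs")
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        MessageAttributeNames=["All"],
    )
    for msg in resp.get("Messages", []):
        attrs = msg["MessageAttributes"]
        client = attrs["client"]["StringValue"]
        hour = attrs["currHour"]["StringValue"]
        run_lt_cron(client, hour)
        # Only delete the message once the job has completed successfully,
        # so failures are retried and eventually land in the dead-letter queue.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    consume_one()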

These changes also had some massive benefits for the overall system:
1. Parallelism for LT-Cron — the system is no longer sequential.
2. A built-in retry mechanism via the SQS dead-letter queue (sketched after this list).
3. Scalability — we no longer have to worry about the number of clients we can progress through per hour.
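As a rough illustration of the second point, an SQS redrive policy of this general shape (with a hypothetical DLQ ARN and retry count) is what moves repeatedly failed messages to a dead-letter queue, where they can be inspected and replayed:

// Hypothetical RedrivePolicy attribute on the main LT-Cron queue
{
  "deadLetterTargetArn": "arn:aws:sqs:ap-southeast-2:123456789012:lt-cron-dlq",
  "maxReceiveCount": 3
}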

The key to this new architecture is that the LT-Cron system is now fault-tolerant. A single critical `Job` failure will not stop the system from even attempting the other scheduled jobs, and on top of this, because LT-Cron now runs jobs in parallel, long-running jobs no longer clog up the system.

Here is the new architecture of LT-Cron.

What’s next — Gigantamax Charizard?

Technology is ever-changing and always evolving, so this is not the end for LT-Cron, but in my humble opinion, it no longer deserves to be called a "Legacy System".
