How We Achieved Real-time Metrics From A Legacy System Without Changing A Single Line Of Code

Fernando Alvarez
Published in SSENSE-TECH
7 min read · Jul 23, 2021


Real-time, or near real-time, statistics are extremely important for any business. Everyone needs that level of observability in order to react quickly to the unexpected and keep feedback loops for improvement short. SSENSE is no exception to the rule. We rely on our business metrics to measure our success and to create plans that meet our goals and expectations. Recently, my team faced a couple of observability problems with one of these business metrics:

  • No Source of Truth. Sources were diverse and VERY unreliable.
  • Long Feedback Loops. Data was only ready after a few days and was managed by another team, so we did not get feedback in time to react quickly and create a countermeasure.

My team needed this precious information almost immediately in order to shorten this feedback loop. The source of truth for the information was only present in one of our legacy microservices. This article tells the story of how we got this information without writing a single line of code in that service, using only Change Data Capture (CDC) and AWS Database Migration Service (DMS).

The Problem

As I mentioned, the team and I received a requirement from the business to obtain a business metric in near real-time, in order to react as fast as possible and reduce the feedback loop. The source of truth for this metric was a legacy microservice that the team owned. Sounds easy, right? Just push the metric that you need and you can call it a day! Well… (sadly) it is not that easy. This particular microservice has a couple of the characteristics of a problematic legacy service:

  • Hard to maintain. Adding any new feature to this service requires either a complete refactor or changes across multiple files in the project.
  • Brittle Tests. The test suite in this project is extremely unreliable, and any change to the code also requires major changes to the tests.
  • The original maintainers no longer work on the project. Everyone hates that person who wrote a piece of code that no one can understand, right?

As you may guess at this point, pushing this metric from the legacy microservice itself was not an option or, at least, it was the very last one to consider. We needed to get that precious piece of data and find a way to do it with a minimal number of changes to the service. So what did we do?

Our Path to the Solution

We only knew one thing for a fact: the source of truth for the metric was a PostgreSQL instance used by the microservice. Our first draft of the solution would have a job that pulls that information from the database at a fixed interval and puts it somewhere like a queue or an event stream. We found an article from the AWS Database Blog called “Stream changes from Amazon RDS for PostgreSQL using Amazon Kinesis Data Streams and AWS Lambda” that looked like exactly what we needed.

What the article proposed is using Change Data Capture (CDC) on our database, and fortunately PostgreSQL has a feature dedicated to it called Logical Replication, which in ssense (pun intended) is “a replication technique based on a Publisher-Subscriber mechanism to replicate data objects and their changes”. The pub/sub nature of this feature fits perfectly with our event-driven needs. The process is simple and begins by replicating all changes from the source table that we indicate. A scheduled Lambda job then takes each of those changes, transforms it into a common format such as JSON, and pushes it to an Amazon Kinesis stream.

At the end of this pipeline, a listener will receive those changes in the form of events and extract the metric we need.
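
To make the approach concrete, here is a rough sketch of what that first-draft polling job could have looked like. This is not code we shipped: the slot name, the wal2json output plugin, and the environment variables are illustrative assumptions.

# Rough sketch of the first-draft idea: a scheduled Lambda drains a PostgreSQL
# logical replication slot and forwards each change to a Kinesis stream.
import os

import boto3
import psycopg2

kinesis = boto3.client("kinesis")


def handler(event, context):
    # Assumes the slot was created beforehand with:
    #   SELECT pg_create_logical_replication_slot('metrics_slot', 'wal2json');
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            # Consume up to 100 pending changes from the slot.
            cur.execute(
                "SELECT lsn, xid, data "
                "FROM pg_logical_slot_get_changes('metrics_slot', NULL, 100);"
            )
            for lsn, xid, data in cur.fetchall():
                # wal2json already emits JSON, so each change is pushed as-is.
                kinesis.put_record(
                    StreamName=os.environ["STREAM_NAME"],
                    Data=data.encode("utf-8"),
                    PartitionKey=str(xid),
                )
        conn.commit()
    finally:
        conn.close()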

First Revision of the Design

This sounds good, right? As much as it solved the problem, there were a couple of things that we didn’t like:

  1. We had to schedule downtime for our database (which, sadly, proved to be inevitable).
  2. We would need to manage a new service only to pull these changes and push them to AWS. Managing a new service, even a Lambda, means writing new code, maintaining it, and fixing bugs.
  3. By nature, Lambda and SQL databases are not a good combination: managing a database connection pool is especially hard in this context, and it is easy to exhaust our database connections.
  4. Using a database directly is not a good integration pattern between services due to the tight coupling it creates.

This solution was good enough, but still not the best fit for the problem. We went back to the whiteboard and, after some discussions, we were introduced to an AWS product that essentially does all of this for us: Database Migration Service (DMS).

DMS is an AWS service that easily integrates with databases as a source to push changes to a target, which for us would be Kinesis; a rough sketch of this wiring follows the list below. This product fits our needs well since:

  1. We don’t really need to manage any new code in order to get the changes from PostgreSQL. This greatly reduces the risk as the legacy microservice is not affected.
  2. We just focus on what we do with the CDC event, which in our case is to process the business metric and push it to a dashboard.
  3. It decouples us from a lot of the specifics of the legacy microservice.
  4. We can extend this approach to other needs inside the company.
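
We declared all of these resources with the Serverless Framework, but to give an idea of the moving parts, this is roughly what the same wiring looks like when expressed directly against the DMS API. Every identifier, ARN, and table name below is a placeholder.

import json

import boto3

dms = boto3.client("dms")

# Target endpoint: DMS writes CDC events to our Kinesis stream as JSON.
target = dms.create_endpoint(
    EndpointIdentifier="legacy-cdc-kinesis-target",
    EndpointType="target",
    EngineName="kinesis",
    KinesisSettings={
        "StreamArn": "arn:aws:kinesis:us-east-1:123456789012:stream/legacy-cdc",
        "MessageFormat": "json",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-kinesis-role",
    },
)

# CDC-only task: replicate ongoing changes from the table(s) we care about.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-orders",
            "object-locator": {"schema-name": "public", "table-name": "orders"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="legacy-cdc-to-kinesis",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint/source-postgres",
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep/legacy-cdc-instance",
    MigrationType="cdc",
    TableMappings=json.dumps(table_mappings),
)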

After implementing it with the Serverless Framework, our final architecture looks like this, simple yet powerful:

Final Design
  • We created a DMS Replication Instance to temporarily hold our CDC events.
  • We created a DMS Migration Task that reads all CDC events in real-time from our legacy database and redirects them to a Kinesis Stream.
  • We attached a Lambda that listens to this stream and reacts to new events as they arrive. The actual content of the Lambda is out of the scope of this article, but a minimal sketch of such a consumer follows below.
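
The production Lambda itself is out of scope, but for illustration, a minimal consumer that decodes DMS-formatted Kinesis records might look like the following. The data/metadata envelope is the standard JSON message format DMS uses for a Kinesis target; the table name and the metric handling are placeholders.

import base64
import json


def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        metadata = payload.get("metadata", {})
        row = payload.get("data", {})

        # Only react to actual row changes on the table we care about.
        if metadata.get("record-type") != "data":
            continue
        if metadata.get("table-name") != "orders":  # placeholder table name
            continue

        operation = metadata.get("operation")  # insert / update / delete
        # Placeholder: derive the business metric from the changed row and
        # push it to a dashboard, a metrics API, CloudWatch, etc.
        print(json.dumps({"operation": operation, "row": row}))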

Operational Concerns and Hiccups

“With great power comes great responsibility” said Uncle Ben to Peter Parker, and the same can be said about this new service. New moving parts in your organization eventually introduce new operational concerns, such as observability of the new services, cost, and resource optimization.

When we released this, we put a couple of alarms in place (sketched after the list) to let us know about:

  • Provisioned throughput being exceeded on our Kinesis Stream. If this happens, messages will start to pile up in the stream and potentially delay consumption.
  • The Kinesis Stream no longer receiving items (e.g. PutRecord = 0 for at least 10 minutes).
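
We declared these alarms as infrastructure-as-code, but expressed with boto3 they look roughly like this. The stream name, SNS topic, periods, and thresholds are illustrative assumptions, and the metrics are the standard AWS/Kinesis stream metrics.

import boto3

cloudwatch = boto3.client("cloudwatch")
STREAM = [{"Name": "StreamName", "Value": "legacy-cdc"}]
ALARM_TOPIC = "arn:aws:sns:us-east-1:123456789012:cdc-alarms"

# 1. Writes are being throttled: provisioned throughput exceeded at least once.
cloudwatch.put_metric_alarm(
    AlarmName="legacy-cdc-throughput-exceeded",
    Namespace="AWS/Kinesis",
    MetricName="WriteProvisionedThroughputExceeded",
    Dimensions=STREAM,
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALARM_TOPIC],
)

# 2. The stream stopped receiving records for at least 10 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="legacy-cdc-no-incoming-records",
    Namespace="AWS/Kinesis",
    MetricName="IncomingRecords",
    Dimensions=STREAM,
    Statistic="Sum",
    Period=600,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=[ALARM_TOPIC],
)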

And for a moment, we thought that was enough, and just like the narrator of Arrested Development said: “It wasn’t”. At some point during testing (thankfully not in production) we noticed that our database’s available disk storage was decreasing aggressively, and eventually we ran out of disk space.

Rapid decrease of free storage in our PostgreSQL DB

At the time, a quick fix was to increase the disk storage, but that was just a patch for a larger issue.

Larger scope of the rapid decrease problem.

The source of the issue was how PostgreSQL handles replication: all changes are kept in WAL (write-ahead log) files until every replication node has caught up. This showed up in a metric called Replication Lag.

Replication slot lag visible impact

The solution was simpler than we thought. We listed the replication nodes, also known as replication slots, by using the query:

SELECT * FROM pg_replication_slots;

We found one unused slot and removed it by using the query:

SELECT pg_drop_replication_slot('replication_slot_id');

After that, it was business as usual:

Immediate decrease of the oldest replication slot lag after dropping the slot
After dropping the slot, the free storage immediately went up.

This incident made us realize a couple of points for our production operations:

  • When we target a table for CDC in DMS, make sure not to change the target; and, if we must, make sure to remove the old replication slot.
  • Create alarms on the Oldest Replication Slot Lag and Free Storage Space metrics so we know right away if our DMS task is exhausting our resources (see the sketch below).
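
These alarms follow the same pattern as the Kinesis alarms above, only against the AWS/RDS namespace. The instance identifier, thresholds, and alarm topic are again placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")
DB = [{"Name": "DBInstanceIdentifier", "Value": "legacy-postgres"}]
ALARM_TOPIC = "arn:aws:sns:us-east-1:123456789012:cdc-alarms"

# WAL retained for the slowest replication slot grows past ~5 GB.
cloudwatch.put_metric_alarm(
    AlarmName="legacy-postgres-replication-slot-lag",
    Namespace="AWS/RDS",
    MetricName="OldestReplicationSlotLag",
    Dimensions=DB,
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=5 * 1024 ** 3,  # bytes
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[ALARM_TOPIC],
)

# Free storage drops below ~10 GB.
cloudwatch.put_metric_alarm(
    AlarmName="legacy-postgres-low-free-storage",
    Namespace="AWS/RDS",
    MetricName="FreeStorageSpace",
    Dimensions=DB,
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=10 * 1024 ** 3,  # bytes
    ComparisonOperator="LessThanThreshold",
    AlarmActions=[ALARM_TOPIC],
)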

For cost optimization, we made a couple of decisions:

  • Disable our workflow and remove all of its resources in the testing environment, since the relevant data is always in production.
  • Use Infrastructure as Code (for us, the Serverless Framework) so we can create and destroy these resources whenever testing is needed.
  • Run the replication instance on-demand and one tier below the database it replicates from.

Conclusion

When working with legacy systems, always consider the risk associated with adding functionality, no matter how simple it may seem. Sometimes it is better to extract that functionality into a newer, external system that can react to changes in the legacy system and provide what is expected. This is just one example of how you can deal with these types of systems and, in the end, we all hate legacy microservices, right? For SSENSE, this is one step forward in decoupling from this legacy system and moving towards a more elegant and scalable solution.
