How One Engineering Question Could Improve the Entire Tech Design
A single engineering question can change an entire technical design, and the payoff can include a more robust system, faster delivery, and lower cost. Yet it is often overlooked.
If you are building a data-intensive system, ask this question:
Are real-time updates really vital, or is a delay of X hours acceptable?
Engineering teams often skip it and default to real-time systems because real-time is cool, then convince themselves by assuming that their customers would love to see the data in real time.
In reality, most systems such as product dashboards and scorecards don’t need real-time updates. People are fine waiting minutes, or even hours, for an aggregated view.
For example, Medium, one of the biggest blogging platforms, updates story stats and audience stats every 24 hours. It does not strive to update stats in real time, because no one asked for it.
Ideally, the product team itself should ask this question, since the answer can improve the entire product design.
An Old Story
Early in my career, I built a dashboard that processed a staggering ~3 crore (30 million) messages daily. The messages originated from multiple microservices and were funnelled through Kafka into an aggregator microservice, which used the MongoDB aggregation framework for real-time processing.
The product was successful, serving various stakeholders on a day-to-day basis. However, hindsight has a way of revealing missed opportunities for improvement.
The design above had several problems:
- Processing every update from the various DBs led to temporarily inaccurate results due to out-of-order events.
- The processing and storage layers were tightly coupled. MongoDB handled both, which resulted in high memory and CPU usage.
- MongoDB demanded larger machines because of the heavy aggregation. It had to aggregate ~70 lakh (7 million) records every 5 minutes for a single tenant.
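The out-of-order problem in the first bullet can be illustrated with a minimal sketch. Everything here is hypothetical (the `OrderEvent` type, the field names, and the sample events are not from the original system); it only shows why applying updates in arrival order can leave a dashboard temporarily wrong, while a job that sees the whole window can sort by event time first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    # Hypothetical event shape, used only for illustration.
    order_id: str
    status: str
    event_time: int  # when the change happened at the source (epoch seconds)


def apply_in_arrival_order(events):
    """Streaming-style: overwrite state with whatever arrives last."""
    state = {}
    for event in events:
        state[event.order_id] = event.status
    return state


def apply_in_event_time_order(events):
    """Batch-style: the whole window is available, so sort by event time first."""
    state = {}
    for event in sorted(events, key=lambda e: e.event_time):
        state[event.order_id] = event.status
    return state


# "delivered" happened later but arrived before "shipped".
arrived = [
    OrderEvent("o-1", "created", event_time=100),
    OrderEvent("o-1", "delivered", event_time=300),
    OrderEvent("o-1", "shipped", event_time=200),
]

naive_state = apply_in_arrival_order(arrived)
batch_state = apply_in_event_time_order(arrived)
print(naive_state)  # {'o-1': 'shipped'}   (stale status wins)
print(batch_state)  # {'o-1': 'delivered'} (latest status by event time)
```

A real-time pipeline can handle this too, for example by comparing event timestamps before overwriting, but that is extra logic the streaming path has to carry.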
It was only later in my career that I grasped the weight of this question. Looking back, I couldn’t help but think, “I should have asked this question earlier.” Asking it during the initial design phase could have significantly altered the trajectory of the entire system.
Even though there are ways to build a better real-time system using change data capture (CDC), Kafka Streams, and similar tools, I advocate doing so only when it is necessary, because it increases both delivery time and infrastructure cost.
Batch Processing System Design
Find out what is truly necessary and, where a delay is acceptable, go for a batch processing system.
For example, the system described above could be redesigned as follows:
In the batch processing design, Sqoop extracts data from MySQL tables into HDFS on a schedule. Spark processes, transforms, and aggregates the data, replacing the real-time Mongo aggregation. Apache Airflow orchestrates the periodic Spark jobs. The processed data from Spark is sent to a Kafka queue, consumed by the dashboard service, stored in MongoDB, and then displayed in the UI.
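The flow described above can be sketched in plain Python to make the stages concrete. This is not Sqoop, Spark, Airflow, or Kafka code: each function is a hypothetical stand-in for one stage, and the row fields (`tenant`, `amount`, `event_time`) are assumptions made up for the example.

```python
from collections import defaultdict


def extract(source_rows, since, until):
    """Stand-in for the Sqoop step: pull rows whose event time falls in the batch window."""
    return [row for row in source_rows if since <= row["event_time"] < until]


def aggregate(rows):
    """Stand-in for the Spark step: count messages and sum amounts per tenant."""
    totals = defaultdict(lambda: {"messages": 0, "amount": 0})
    for row in rows:
        bucket = totals[row["tenant"]]
        bucket["messages"] += 1
        bucket["amount"] += row["amount"]
    return dict(totals)


def publish(aggregates, queue):
    """Stand-in for the Kafka step: append one summary message per tenant."""
    for tenant, summary in sorted(aggregates.items()):
        queue.append({"tenant": tenant, **summary})


def run_batch(source_rows, since, until, queue):
    """Stand-in for one scheduled run: the stages execute strictly in order."""
    publish(aggregate(extract(source_rows, since, until)), queue)


rows = [
    {"tenant": "acme", "amount": 10, "event_time": 5},
    {"tenant": "acme", "amount": 20, "event_time": 50},
    {"tenant": "globex", "amount": 7, "event_time": 30},
    {"tenant": "acme", "amount": 99, "event_time": 120},  # belongs to the next window
]
dashboard_queue = []
run_batch(rows, since=0, until=60, queue=dashboard_queue)
print(dashboard_queue)
```

The point of the shape is that aggregation runs once per window over a closed set of rows, and the storage side only ever receives finished summaries.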
This approach uses resources more efficiently because Spark handles large volumes of data in batches, and it removes the tight coupling between the processing and storage layers seen in the real-time design. It also improves data accuracy by avoiding the pitfalls of out-of-order events, since each batch is processed as a complete set.
Because Spark scales out in a batch-oriented framework, it also addresses the earlier problem of real-time MongoDB aggregation demanding heavier machines.
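A rough back-of-the-envelope calculation shows why the refresh interval matters so much for machine size. It takes the ~70 lakh records per run from the story above and assumes, as a simplification, that every run re-scans the same number of records regardless of interval; real workloads will differ, so treat the numbers as illustrative only.

```python
RECORDS_PER_RUN = 70 * 100_000  # ~70 lakh = 7,000,000 records per tenant
MINUTES_PER_DAY = 24 * 60


def records_scanned_per_day(interval_minutes, records_per_run=RECORDS_PER_RUN):
    """Assumes each run re-scans the same number of records (a simplification)."""
    runs_per_day = MINUTES_PER_DAY // interval_minutes
    return runs_per_day * records_per_run


every_5_min = records_scanned_per_day(5)
hourly = records_scanned_per_day(60)
daily = records_scanned_per_day(MINUTES_PER_DAY)

print(f"{every_5_min:,}")  # 2,016,000,000
print(f"{hourly:,}")       # 168,000,000
print(f"{daily:,}")        # 7,000,000
```

Under that assumption, moving from a 5-minute refresh to an hourly one cuts the aggregation work per tenant by a factor of 12, and a daily refresh cuts it by a factor of 288.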
Final Note
Real-time stream processing systems are not inherently problematic, but they are often over-engineering.
Everything is over-engineering when it doesn’t add any value for the customer.
The complexity that real-time systems introduce makes monitoring and error correction harder than in simpler batch systems. Given these trade-offs, it may be prudent to defer such implementations to later quarters so that technology choices align closely with actual customer requirements.
Collaborate closely with the product team and brainstorm together to assess whether a real-time dashboard is genuinely needed.