Improving Data Quality in a Lambda Architecture
Data quality is not free.
Detecting lost messages and automatically recovering them can improve data quality in a lambda architecture. Since we are talking in the context of streaming, we will focus on data in transit rather than data at rest. For the data-in-transit layer, latency (turnaround time), loss rate, and duplication rate are the key operational metrics a data science team can use to discover data quality issues and resolve them.
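As a rough sketch of what these three metrics look like in code, here is one way to compute them for a single time window, given per-message send and receive records. The function name and data shapes are illustrative, not part of any real system:

```python
def transit_metrics(sent, received):
    """Compute loss rate, duplication rate, and mean latency for one window.

    `sent` maps message id -> send timestamp (seconds);
    `received` is a list of (message id, receive timestamp) tuples.
    Shapes and names are illustrative assumptions.
    """
    received_ids = [mid for mid, _ in received]
    unique_received = set(received_ids)

    # Loss rate: fraction of sent messages never seen by the consumer.
    lost = set(sent) - unique_received
    loss_rate = len(lost) / len(sent) if sent else 0.0

    # Duplication rate: fraction of received messages that are repeats.
    duplicates = len(received_ids) - len(unique_received)
    dup_rate = duplicates / len(received_ids) if received_ids else 0.0

    # Latency: receive time minus send time, averaged over deliveries.
    latencies = [ts - sent[mid] for mid, ts in received if mid in sent]
    mean_latency = sum(latencies) / len(latencies) if latencies else 0.0

    return loss_rate, dup_rate, mean_latency
```

Note that this requires per-message identifiers and timestamps, which is exactly what the techniques below set out to capture.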
A data science team needs to proactively ensure data quality; it does not come for free.
Detecting Data Quality Issues
I have seen data science teams in several organizations rely on counts to detect loss. In this method, the team first counts the number of messages sent by the producer within a time frame, then counts the number of messages received by the consumer in the same time frame, and finally compares the received count against the sent count.
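The count-based check boils down to a single subtraction. A minimal sketch (function name and tolerance parameter are illustrative):

```python
def count_based_check(sent_count, received_count, tolerance=0):
    """Naive loss check: compare producer and consumer counts for a window.

    Returns (loss_suspected, apparent_loss). Illustrative only: duplicates
    inflate the received count and consumer offset resets re-deliver
    messages, so the apparent loss can be wrong in either direction.
    """
    apparent_loss = sent_count - received_count
    return apparent_loss > tolerance, apparent_loss
```

A negative `apparent_loss` already hints at the problem: duplicates can mask real loss, so equal counts do not mean no messages were lost.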
This is a rudimentary way of thinking about data loss. Duplicates among the sent messages, the inherent inaccuracy of count aggregation in streaming, and consumers resetting to earlier positions on failure can all distort the numbers. Moreover, even an accurate count only estimates the magnitude of the loss; it provides no insight into which messages were lost, and identifying those is crucial to improving data quality.
Data Trails
Basically, we need to know the trail of the data instead of just the count: the trail of each message as it moves through the pipeline, from ingestion to deposit, needs to be tracked. Let us call each point of interest in the pipeline a trail-head, like the head of a hiking trail. Instead of relying on counts, teams can track the data at each trail-head. To keep the overhead manageable, the team can use sampling methods, with levers to modulate the sampling rate; a 50% sampling rate means every second message is tracked. The data from these trail-heads can then be analyzed to generate the operational metrics. This is far more accurate than counting.
As for recovering lost messages: those metrics can accurately tell us whether a loss exists, but actually fixing the loss is still paramount to improving data quality.
Trackers
One way to do that is with tracking messages. We will need to replicate the existing streaming cluster setup to do so. At predefined time intervals, we send tracking messages through the duplicate streaming cluster while normal messages flow through the normal streaming cluster. When we receive the tracking messages on the other side, we can evaluate, for both clusters, whether messages were lost, duplicated, delayed, or otherwise degraded. For latency, we can compute the latency at any data hub using the timestamps carried in the traces.
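To make this concrete, here is a small sketch of what emitting and evaluating tracking messages for one interval might look like. The message fields (`seq`, `sent_at`, `received_at`) are my own illustrative assumptions, not a real wire format:

```python
import time


def make_tracking_message(seq):
    """Build a tracking message; its timestamp lets any hub compute latency."""
    return {"kind": "tracker", "seq": seq, "sent_at": time.time()}


def evaluate_trackers(sent_seqs, received):
    """Compare tracker sequence numbers sent vs. received in one interval.

    `received` is a list of tracker dicts annotated with a `received_at`
    timestamp at the hub. Returns lost sequence numbers, a duplicate count,
    and per-message latencies.
    """
    received_seqs = [m["seq"] for m in received]
    lost = set(sent_seqs) - set(received_seqs)
    duplicated = len(received_seqs) - len(set(received_seqs))
    latencies = [m["received_at"] - m["sent_at"] for m in received]
    return lost, duplicated, latencies
```

Because the trackers flow on a known schedule with known sequence numbers, any gap, repeat, or timestamp delta is unambiguous evidence about the health of the path they traveled.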
Netflix Inca is modeled on these ideas.
Recommendations
Based on these ideas, a team can better detect data quality failures and address them with a first-principles approach. There may still be scenarios where the technology stack has limitations, such as Kafka broker behavior that can itself cause data quality issues. Beyond that, teams can apply the techniques above at varying levels, depending on their maturity, to improve data quality in a streaming system.
Subscribe to our Acing AI newsletter; I promise not to spam, and it's FREE!
Thanks for reading! 😊 If you enjoyed it, test how many times you can hit 👏 in 5 seconds. It's great cardio for your fingers AND will help other people see the story.