Arjun: Data Veracity Framework at Myntra
The world of Big Data is generally characterized by the big three V's - Volume, Velocity, and Variety. But one more V is getting increasingly spotlighted in the world of big data: Veracity [1].
What is Veracity?
Veracity is a characteristic of big data related to consistency, accuracy, quality, and trustworthiness. Data veracity refers to bias, noise, and abnormality in data. It also covers incomplete data and the presence of errors, outliers, and missing values. [2]
What are the sources of error in data?
- Software or application bugs that transform or miscalculate data.
- Anomalous or ambiguous data.
- Human error.
Why is it important?
Big data is extremely complex, and we are still discovering how to unleash its full potential. Using big data without validating and explaining it is fruitless. Only trustworthy data can add value to any analysis.
Origin of Arjun in Myntra
The Data Platform at Myntra captures data from different sources, relational as well as non-relational, and stores it in a data lake or data warehouse for further analysis. At a minimum, any pipeline that bridges a source and a target must be consistent at both the data level and the schema level, and every component in the pipeline should capture data correctly. Data platforms should also have their own rigorous checks and balances to serve as a source of truth for analytical and transactional data. This solves the problem of data disputes and mismatches that surface only after a long time and then require a lot of manual intervention and developer effort to fix. Here is the story of creating a framework that can increase trust in the consistency, accuracy, and quality of data.
Use-case at Myntra
Data veracity checks are currently run daily for the following use cases:
- Finance reporting use cases: All the data about SKUs, transactions, discounts & revenue reporting need to be verified with utmost precision as it helps build the performance view of the company. The Veracity Framework helps generate a trusted source of data for all core dashboards & reports.
- Supply Chain Operations & Reporting use cases: The precision of our supply chain network also hinges on confirming the correctness of data on every data run, so that operations can take quick & nimble actions on assigning delivery agents, assessing warehouse operations & monitoring the efficiency of forward & backward logistics.
Arjun ensures data correctness by executing queries against the data sources. The results from the sources are compared against an acceptable threshold, which accounts for small, unavoidable differences.
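The threshold-based comparison above can be sketched as a small utility. This is an illustrative sketch, not Arjun's actual code; the class name, method signature, and percentage-based threshold are assumptions.

```java
// Sketch of a threshold-based comparison between a source and a target count.
// The percentage-based threshold is an assumption for illustration.
public class ThresholdCheck {

    /** Returns true when the target count is within thresholdPct percent of the source count. */
    public static boolean withinThreshold(long sourceCount, long targetCount, double thresholdPct) {
        if (sourceCount == 0) {
            return targetCount == 0; // avoid division by zero on empty sources
        }
        double diffPct = 100.0 * Math.abs(sourceCount - targetCount) / sourceCount;
        return diffPct <= thresholdPct;
    }

    public static void main(String[] args) {
        System.out.println(withinThreshold(1_000_000, 999_500, 0.1)); // 0.05% gap: within threshold
        System.out.println(withinThreshold(1_000_000, 900_000, 0.1)); // 10% gap: flagged
    }
}
```

A relative (percentage) threshold is used here because absolute thresholds behave poorly when table sizes vary widely between runs.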
Guiding principles
Any data veracity framework should have the following core properties:
- Ability to connect to different types of data sources.
- Generate a comparison report for analysis.
Building Arjun
Based on the guiding principles, we built Arjun as a service in Java using the Spring Boot framework. The API request body includes the data source types, credentials, and the respective queries or table names. Queries run on the source and the target simultaneously. Once both are complete, a report is created containing the difference between source and target along with the configured threshold. An email with the report is sent to subscribers, and the report is also saved for future analysis.
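The "run both queries simultaneously, then join the results" step described above can be sketched with a standard executor. All names here are illustrative assumptions; Arjun's real service wires this into Spring Boot controllers and actual database connectors.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of running the source and target queries in parallel and
// joining the results for the report. Class and record names are assumed.
public class ParallelVeracityRun {

    record QueryResult(String side, long rowCount) {}

    public static List<QueryResult> runBoth(Callable<QueryResult> sourceQuery,
                                            Callable<QueryResult> targetQuery)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<QueryResult> src = pool.submit(sourceQuery);
            Future<QueryResult> tgt = pool.submit(targetQuery);
            return List.of(src.get(), tgt.get()); // block until both queries finish
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for real "row count" queries against, say, MySQL and Hive.
        List<QueryResult> results = runBoth(
                () -> new QueryResult("source", 1200L),
                () -> new QueryResult("target", 1198L));
        results.forEach(r -> System.out.println(r.side() + ": " + r.rowCount()));
    }
}
```

Submitting both queries before awaiting either is what makes the total latency roughly the slower query's time rather than the sum of the two.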
Capabilities
1. Supports different sources: The framework can connect to various relational data sources like MySQL, Hive, etc.
2. Type of veracity: The framework supports different types of Veracities:
2.1 Count Veracity: In this, a “row count” query for multiple tables or collections is run together. The difference is compared with the threshold. The report highlights all violations.
2.2 Schema Veracity: In this, a “get-schema” query for tables or collections is performed. It returns column names and their respective data types. The report highlights any column-name or data-type mismatches.
2.3 Custom Veracity: In this, a “custom” query consisting of any combination of joins and filters over a collection is provided. Veracity is computed on metrics, grouped by dimensions.
Eg. Suppose we want the number of people living in every city; in this case, “city” is the dimension and the population count is the metric.
We run the same query on both sources and compare the results. Mismatches are highlighted in the report.
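Comparing custom-veracity results amounts to diffing two maps keyed by dimension values. The sketch below assumes each source returns a "dimension value to metric value" map from the same GROUP BY query; the class and method names are illustrative, not Arjun's actual types.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of comparing custom-veracity results keyed by dimension values,
// e.g. city -> number of residents, as returned by the same aggregation
// query on both the source and the target.
public class CustomVeracityCompare {

    /** Returns the dimension values whose metric differs between source and target. */
    public static Map<String, String> mismatches(Map<String, Long> source, Map<String, Long> target) {
        Map<String, String> diffs = new HashMap<>();
        for (Map.Entry<String, Long> e : source.entrySet()) {
            Long t = target.get(e.getKey());
            if (!e.getValue().equals(t)) {
                diffs.put(e.getKey(), e.getValue() + " != " + (t == null ? "missing" : t));
            }
        }
        // Dimension values present only in the target are also mismatches.
        for (String dim : target.keySet()) {
            if (!source.containsKey(dim)) {
                diffs.put(dim, "missing != " + target.get(dim));
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Map<String, Long> source = Map.of("Bengaluru", 84L, "Delhi", 190L);
        Map<String, Long> target = Map.of("Bengaluru", 84L, "Delhi", 188L);
        System.out.println(mismatches(source, target)); // only Delhi differs
    }
}
```

Checking both directions matters: a dimension value that exists on only one side is just as much a veracity violation as a differing metric.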
3. Schedule Type: The framework supports veracities on a scheduled basis:
3.1 Concurrent Veracity: In this, a single API is called which executes queries on both data sources at the same time. The difference is highlighted in the veracity report.
Eg. In a real-time streaming pipeline, data should be in sync between the source and target. In such scenarios, the framework should perform query execution on the source and target at the same time.
3.2 Multifold Veracity: In this, two APIs are executed at different times. So, query execution on one source happens on the first API call and on another source at the second API call. The report gets generated at the time of the second API call.
Eg. In a batch processing pipeline, data arrives at the target with a delay that depends on the batch job schedule. If a batch job runs daily, then a veracity check on the target can run only after the batch job completes. So query execution needs to happen at different times for the source and the target.
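The multifold flow above, where the first API call records one side and the second call completes the comparison, can be sketched as below. The in-memory store and all names are assumptions for illustration; the real service persists runs in a database.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of multifold veracity: the first API call records the source-side
// count; the second call, made after the batch job lands, supplies the
// target-side count and produces the comparison.
public class MultifoldVeracity {

    private final ConcurrentHashMap<String, Long> pendingRuns = new ConcurrentHashMap<>();

    /** First call: remember the source count under a run id. */
    public void recordSource(String runId, long sourceCount) {
        pendingRuns.put(runId, sourceCount);
    }

    /** Second call: compare the stored source count with the target count. */
    public Optional<Long> completeWithTarget(String runId, long targetCount) {
        Long sourceCount = pendingRuns.remove(runId);
        if (sourceCount == null) {
            return Optional.empty(); // no matching first call for this run id
        }
        return Optional.of(Math.abs(sourceCount - targetCount));
    }

    public static void main(String[] args) {
        MultifoldVeracity v = new MultifoldVeracity();
        v.recordSource("orders-2024-01-01", 5_000L);                          // first API call
        System.out.println(v.completeWithTarget("orders-2024-01-01", 4_997L)); // second call
    }
}
```

Keying runs by an id is what lets the two calls arrive hours apart, e.g. before and after a daily batch job.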
4. Storing result: The framework stores the result in a relational database. We can perform further analysis by querying the database. Eg. Comparing the growth of data in a table, Anomaly detection, etc.
5. Reporting: The framework supports different kinds of reporting based on the type of veracity.
In the case of count veracity, the report contains the table name and the total row count from both data sources.
Eg. Perform count veracity over table1, table2, and table3.
In the case of schema veracity, the report contains the table name and the respective schema in both data sources.
Eg. Perform schema veracity over table1 and table2.
In the case of custom veracity, the report should contain Dimension and Metric-based query results.
Eg. Perform custom veracity on table1: get the total number of people living in each city, and the total number of people born in the same month.
6. Multiple dimensions and metrics: The framework supports multiple dimensions and metrics in the execution of custom veracity.
Eg. Perform custom veracity on table1: get the total number of people who lived in the same city and were born in the same month. In this case, city and month are the dimensions (multi-dimension).
Perform custom veracity on table1: get the total number of matches played by a cricketer and the total runs scored by the same cricketer (multi-metric).
7. Tracking Veracity Runs: Every request submitted in the framework is trackable. The user can get information on the number of queries still running in the source and the time taken for each query to complete.
There are other admin APIs that tell us how many veracities are running at a particular time.
8. SLA Monitoring: SLA time is requested by the user at the time of submitting a veracity request. If the time threshold is breached, a monitoring script will send an alert mentioning the long-running queries.
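The SLA check described above can be sketched as a pure time comparison that a monitoring script could run over in-flight queries. The names and signature are assumptions; the real script would read run start times from the tracking store.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the SLA-breach check for long-running veracity queries.
// Passing `now` explicitly keeps the check deterministic and testable.
public class SlaMonitor {

    /** Returns true when a query started at startedAt has exceeded its SLA. */
    public static boolean isBreached(Instant startedAt, Duration sla, Instant now) {
        return Duration.between(startedAt, now).compareTo(sla) > 0;
    }

    public static void main(String[] args) {
        Instant started = Instant.parse("2024-01-01T10:00:00Z");
        Duration sla = Duration.ofMinutes(30);
        System.out.println(isBreached(started, sla, Instant.parse("2024-01-01T10:45:00Z"))); // true
        System.out.println(isBreached(started, sla, Instant.parse("2024-01-01T10:10:00Z"))); // false
    }
}
```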
Non-functional Capabilities
- The framework is highly scalable, both vertically and horizontally.
- Report generation is fast because the queries execute in parallel; the total time mostly depends on query execution time.
- The service is stateless, so it can be scaled across multiple virtual machines behind a load balancer.
Future work
- Currently, this framework supports only relational sources. We can further extend support to non-relational databases and file systems.
- New kinds of veracity support, e.g. sampling veracity, where the framework samples a subset of data from both data sources and compares it in full.
- Further, we can create a user interface on top of this service to have scheduling capabilities.
- We are also looking to extend this framework for anomaly detection in the system.
I’m excited to hear from the rest of the community about how they’ve solved the problem of data correctness, and about any data veracity solutions built in-house or available in the market. Please comment below. Stay tuned for further updates…
Credits: Thanks to Abhinav Dangi for the review and support.