Ensuring Data Integrity: A Guide to Validating Translated Data in ETL Pipelines Amid Data Warehouse Migration
Migrating a data warehouse from Hive to BigQuery offers many advantages, such as scalability, performance, and cost-efficiency. During this transition, ensuring the fidelity and accuracy of data becomes paramount. Robust data validation mechanisms are required to ensure that no regressions have occurred in your newly migrated pipelines. In this blog post, we give a general overview of the validation process for data pipelines during a data warehouse migration.
What is data validation?
Data validation is the process of ensuring that data is accurate, consistent, and meets the predefined standards or requirements. The goal of data validation is to identify and rectify errors or inconsistencies in the data, thereby improving its quality and reliability for use in decision-making, analysis, and other business processes. Given that migrations from Hive to BigQuery can lead to differences in data pipeline outputs, data validation is crucial to ensure that your data remains in the same state after the migration.
Data Validation for Data Warehousing
At a high level, a data warehouse migration generally involves three steps. First, you translate your data pipeline scripts or queries from your original dialect into the target dialect (e.g. HiveQL to BigQuery SQL). From there, you run your original pipelines in parallel with the newly migrated ones. The final step is performing data validation and attending to any errors that are detected in the process.
Given that ETL pipelines often consist of multiple steps and intermediate tables, the first question is: at which stages should I validate my tables? In our experience, the best solution is to validate any table within the current pipeline that is consumed by some external operation, for example, a table that serves as input to a machine learning model or to another pipeline. This saves resources, as validating each individual step within a pipeline leads to additional costs and effort. If you find that a table has errors that may stem from an earlier stage in your pipeline, you can then add additional validation steps at that point.
For data validation, we implemented a tool that builds a query to calculate various comparison values for the source and target tables and then compares the two. There are two approaches one can take here. The first is to leave each table in its current location (e.g. Hive tables stay in Hive, and BigQuery tables stay in BigQuery), query each table separately, and pull the resulting values into a common location, such as pandas DataFrames. The advantage of this approach is that tables can be compared across different platforms; however, with massive tables you can run into memory or computational limitations. The second approach, which we found advantageous, is to upload the original Hive table to BigQuery. This allows us to perform the validation query directly in BigQuery, leveraging its processing power.
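To make the first approach concrete, here is a minimal sketch of pulling comparison values from both systems into pandas DataFrames, assuming PyHive and the BigQuery Python client are available. The connection details and table names (warehouse.users, project.dataset.users) are hypothetical placeholders.

# A minimal sketch of the "common location" approach: pull aggregate
# comparison values from Hive and BigQuery into pandas DataFrames and
# compare them locally. Connection details and table names are
# hypothetical placeholders.
import pandas as pd
from pyhive import hive            # assumption: PyHive is available
from google.cloud import bigquery  # assumption: BigQuery client is available

AGG_SQL = "SELECT COUNT(*) AS row_count, SUM(age) AS age_sum FROM {table}"

# Source side: query Hive and load the aggregate values into a DataFrame.
hive_conn = hive.connect(host="hive-server.example.com", port=10000)
source_df = pd.read_sql(AGG_SQL.format(table="warehouse.users"), hive_conn)

# Target side: run the equivalent query in BigQuery.
bq_client = bigquery.Client()
target_df = (
    bq_client.query(AGG_SQL.format(table="project.dataset.users"))
    .to_dataframe()
)

# Compare the aggregate values column by column.
for column in source_df.columns:
    match = source_df[column].iloc[0] == target_df[column].iloc[0]
    print(f"{column}: {'PASS' if match else 'FAIL'}")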
Types of Data Validation
Once you have both your original and migrated tables ready, it is time to validate them against each other. In this section, we will go over the four main validations that we recommend performing. To exemplify each approach, we will use the following two tables: one containing the (source) data generated by the original pipeline, and the other containing the (target) data after migration.
Source Table
Target Table
1. Table Size Validation
When migrating data from Hive to BigQuery, a simple first step is to validate the size of the tables to ensure all data has been successfully transferred. This involves comparing the row count and data volume (in kB for example) between the Hive and BigQuery tables to detect any discrepancies or data loss during the migration process. Given that our two input tables had the same size and number of rows, both validations passed.
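As a sketch of how this could look once both tables live in BigQuery (following the upload approach described earlier), the row count and byte size can be read directly from the table metadata. The table names project.dataset.users_source and project.dataset.users_target are hypothetical.

# A minimal sketch of table size validation in BigQuery, assuming the
# original Hive table has already been uploaded to BigQuery as
# `project.dataset.users_source`, alongside the migrated
# `project.dataset.users_target`. Table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

def table_stats(table_id: str) -> tuple[int, int]:
    """Return (row_count, size_bytes) from BigQuery table metadata."""
    table = client.get_table(table_id)
    return table.num_rows, table.num_bytes

source_rows, source_bytes = table_stats("project.dataset.users_source")
target_rows, target_bytes = table_stats("project.dataset.users_target")

print("row count:", "PASS" if source_rows == target_rows else "FAIL")
print("data volume:", "PASS" if source_bytes == target_bytes else "FAIL")

Note that byte sizes can differ slightly even when the data matches, for example due to differences in how the data was written, so a small tolerance on the data-volume check may be appropriate.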
2. Schema Validation
Schema validation of tables is important as downstream consumers may expect the data to be of a certain type, and for columns to have particular names. Schema validation is not always a straightforward task, for example when encountering struct columns with case-sensitive sub-field names. Here it is important to evaluate whether such issues are legitimate concerns within the context of your own systems.
In the output below, we see that we have a failing validation when we compare the data type of the “age” column. In our original data this was an integer, whereas in the migrated data it is now a float.
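A schema comparison along these lines could be sketched as follows, again using the hypothetical table names from above. Nested struct sub-fields (including the case-sensitivity issue mentioned earlier) would need a recursive variant of the same idea.

# A minimal sketch of schema validation: compare column names and data
# types of the two BigQuery tables. Table names are hypothetical, and
# nested struct sub-fields are not descended into here.
from google.cloud import bigquery

client = bigquery.Client()

def schema_map(table_id: str) -> dict[str, str]:
    """Map each top-level column name to its data type."""
    return {field.name: field.field_type
            for field in client.get_table(table_id).schema}

source_schema = schema_map("project.dataset.users_source")
target_schema = schema_map("project.dataset.users_target")

# Report a PASS/FAIL per column, including columns missing on either side.
for column in sorted(source_schema.keys() | target_schema.keys()):
    source_type = source_schema.get(column, "<missing>")
    target_type = target_schema.get(column, "<missing>")
    status = "PASS" if source_type == target_type else "FAIL"
    print(f"{column}: {source_type} vs {target_type} -> {status}")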
3. Column Comparisons
Column-level statistics can be useful in determining whether the translated pipeline is generating the same data as your original pipeline. Calculating statistics such as sum, average, min, max, or percentiles can provide insight into the distribution of the data within a given column. The advantage here is that the aggregation of the data into single statistical values allows for fast results that scale well. However, some important considerations need to be kept in mind here. First, not every data type can be easily converted into such metrics. How do you get the sum of a string value? In such cases, the data must first be transformed into some numeric representation upon which these statistics can be calculated. In the case of a string column, one approach could be calculating the length of each string and summing them together. However, such an approach is not robust, as small differences can go undetected.
This is exemplified in the validation results below. Here we see that all three validations passed. However, if we look back at the data in each table, we see that there are indeed discrepancies in both the “name” and “favorite_foods” columns. In the “name” column, the name “Jörg” in our source table has become “J?rg” in our target table; since both values have the same length, the difference goes undetected. In the “favorite_foods” column, the last row has a difference in the second element of the array; since we were taking the minimum array length, the difference again goes undetected. In such cases, row validations are necessary to properly detect the differences.
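As an illustration, the column statistics discussed above (sum of string lengths, min/max, and minimum array length) could be computed directly in BigQuery with a sketch like the following; the table names are the same hypothetical placeholders used earlier. Against the example data, these aggregates would match even though the underlying values differ, which is exactly the blind spot described above.

# A minimal sketch of column-level comparisons computed in BigQuery.
# The statistics mirror those discussed above: sum of string lengths for
# "name", min/max for "age", and minimum array length for "favorite_foods".
from google.cloud import bigquery

client = bigquery.Client()

COLUMN_STATS_SQL = """
SELECT
  SUM(LENGTH(name))                 AS name_length_sum,
  MIN(age)                          AS age_min,
  MAX(age)                          AS age_max,
  MIN(ARRAY_LENGTH(favorite_foods)) AS favorite_foods_min_length
FROM `{table}`
"""

source = list(client.query(
    COLUMN_STATS_SQL.format(table="project.dataset.users_source")).result())[0]
target = list(client.query(
    COLUMN_STATS_SQL.format(table="project.dataset.users_target")).result())[0]

# Compare each aggregate between source and target.
for stat in ("name_length_sum", "age_min", "age_max", "favorite_foods_min_length"):
    status = "PASS" if source[stat] == target[stat] else "FAIL"
    print(f"{stat}: {source[stat]} vs {target[stat]} -> {status}")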
4. Row Comparisons
While column-level comparisons can give a rough picture of your data, row-level comparisons provide a more fine-grained inspection of the data between two tables. Techniques like hashing, concatenation, and direct cell-level comparison help identify inconsistencies in data values. However, this type of comparison is not without its own challenges: matching rows using composite primary keys, managing the order of array elements, and handling nested structures all require solutions to achieve accurate row-level validation results.
Below we see the results of the row validation, where all three cases failed. In the first case, we see the result of a hash comparison on the source and target rows joined where the “name” is “Sally”. Given the difference in the data types of the “age” column, we end up with a failure. Next, we run a hash on the rows joined where the “name” is “Jörg”. However, since there is no matching row with that value in the target table, the join returns null for all of the target columns, which leads to a failure in the comparison. Finally, we perform a cell comparison of the “favorite_foods” column in the rows where the “name” is “Maria”. Here we detect the difference between the two arrays and get a failure in the comparison.
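As a sketch of how such a hash-based row comparison could be expressed, the query below hashes a concatenation of each row's columns on both sides and joins source and target on the primary key (“name” in this example). Table and column names follow the hypothetical example used throughout; arrays are joined into a single string before hashing so that element order matters.

# A minimal sketch of a hash-based row comparison in BigQuery.
# Each row is hashed on the source and target side and the hashes are
# joined on the primary key. A missing target row yields a NULL hash,
# which also counts as a failure. Table names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

ROW_HASH_SQL = """
WITH source_hashes AS (
  SELECT
    name,
    TO_HEX(MD5(CONCAT(
      CAST(age AS STRING), '|',
      ARRAY_TO_STRING(favorite_foods, ',')
    ))) AS row_hash
  FROM `project.dataset.users_source`
),
target_hashes AS (
  SELECT
    name,
    TO_HEX(MD5(CONCAT(
      CAST(age AS STRING), '|',
      ARRAY_TO_STRING(favorite_foods, ',')
    ))) AS row_hash
  FROM `project.dataset.users_target`
)
SELECT
  s.name,
  s.row_hash = t.row_hash AS hashes_match
FROM source_hashes AS s
LEFT JOIN target_hashes AS t USING (name)
"""

for row in client.query(ROW_HASH_SQL).result():
    # hashes_match is NULL when no target row was found; treat that as FAIL.
    status = "PASS" if row["hashes_match"] else "FAIL"
    print(f"{row['name']}: {status}")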
Fixing errors
Once you have run your validation step on a set of tables, you will inevitably find inconsistencies in your migrated data. At this stage, you will need to dig deeper manually to investigate the cause. Here, we found row-based comparisons to be invaluable, as they pinpoint exactly which primary key the offending row has. Inconsistencies can crop up for a multitude of reasons, with the main culprits being translation errors and inherent differences between your original and migrated dialects (e.g. Hive and BigQuery).
Post Migration Monitoring
Once you have cleared up all of the data inconsistencies detected during validation, it is a good idea to keep your validation steps in place for a short while afterward to continue monitoring the fidelity of your data. The recommended length of such a monitoring period will vary depending on the nature of your data pipelines.
Once you feel comfortable with the number of passed validation iterations for a given table, then you can turn off the validation step with a sense of trust that your newly migrated data pipeline will continue producing quality data.
Wrapping Up
So there you have it, a general guideline on how to perform data validation when migrating your data pipelines. In future posts, we will be diving deeper into the details of the most important topics covered here, providing real-world examples and solutions from our experiences migrating data pipelines to Google Cloud’s BigQuery. So if your organization is gearing up for a move to Google Cloud, then stick around to learn from our own experiences so that you can navigate your data warehouse migration with confidence and ease.