Mastering Data Migrations with DataComPy: from Legacy Systems to the Modern Data Stack

Indicium Engineering

by Joao Eccel, Matheus Oliveira, Rulyan Fernandes

Embarking on a data migration journey, Analytics Engineers face the critical challenge of validating refactored data, a task demanding precision and care. This article discusses the vital role DataComPy, a Python package, plays in comparing dataframes and ensuring the consistency, completeness, and accuracy of migrated data. Join us as we explore how DataComPy empowers Analytics Engineers in their quest for reliable data migrations.

Introduction

Are you involved in a data migration project, particularly the critical task of ensuring that your refactored data mirrors the original? As an Analytics Engineer, you’re well aware that this process isn’t just complex; it’s a high-stakes endeavor demanding close attention and care.

Data refactoring is not just about making cosmetic changes to the code base. It is a natural part of development: evolving business needs and requirements sometimes demand rebuilding entire applications from scratch, in more modern environments that enable new functionality. Refactoring aims to improve the quality, structure, performance and reliability of data without changing its fundamental meaning or usage.

Data refactoring brings several challenges and risks: introducing errors or inconsistencies into the refactored data, unintentionally changing the logic or functionality of the solution, or even losing track of the changes made to the data. All of these can have a negative impact on business outcomes.

To ensure that refactoring does not alter the meaning or use of your data, you must validate the data (tables, for example) from both the legacy and new solutions.

This critical step involves thorough comparisons, ensuring that the data holds up in terms of consistency, completeness and accuracy. Imagine migrating data from a legacy system to a cloud-based solution. By comparing the output from both, you can pinpoint and rectify discrepancies before the new system goes to production.

But how do you validate refactored data efficiently and reliably? Which tools should you use? How do you deal with different databases, formats, structures, data types and levels of precision across files? And how do you handle large volumes of data?

In this article you will be introduced to DataComPy, a Python package used to compare two dataframes, interpret the differences between them and ensure the integrity of migrated data. The article details its use, presents a methodology for fully interpreting its results and suggests practical approaches to common validation challenges, so you can confidently adopt DataComPy in your validation workflow.

Figure 1. Representation of the DataComPy application in the validation of migrated data

DataComPy is a Python package that lets you compare two Pandas or Spark dataframes and identify the differences between them.

It can be used, for example, to validate data migrated from a Hadoop system that uses PySpark as its data manipulation language to a Redshift architecture that uses SQL.

To make comparisons, DataComPy first joins the two dataframes. The join uses a column, a set of columns, or the index of each row. For records matched on the join key, the tool compares all the columns and reports the number of records with all columns equal and the number of records with one or more unequal columns. For records without a match on the join key, the tool reports the number of records present only in one dataframe or the other.

The following diagram illustrates how DataComPy compares two dataframes and organizes its report:

Figure 2. Representation of DataComPy’s comparison output

At the end of the comparison, DataComPy provides a detailed report with the results, the differences found and samples of divergent records.

Using DataComPy, you have access to the following functionalities:

  • Detailed report: DataComPy generates a human-readable report containing statistics and samples of the differences found, allowing you to check the integrity and reliability of the data.
  • Definition of absolute and relative tolerance: DataComPy allows you to adjust the absolute and relative tolerance levels for numerical columns, and compare the values even when the data types are not the same.
  • Versatility for exploring the results: DataComPy allows you to export the results of the comparison, both equal and unequal records, so you can evaluate the data in detail and seek explanations for the results obtained.

How to use DataComPy?

To use DataComPy, you will need to install it and import it into a notebook.

— In your terminal, install DataComPy:

pip install datacompy

— Import DataComPy into your Python code using the following command (note that the module name is lowercase):

import datacompy

— Read the files you want to compare and store them in two pandas dataframes (calling them df1 and df2, for example)

To read CSV files, you can use pandas’ read_csv method, which returns a dataframe from a file or URL. If your CSV files are called legacy_system_data.csv and new_system_data.csv and are in the same folder as your code, consider the code below:

import pandas as pd
df1 = pd.read_csv("legacy_system_data.csv")
df2 = pd.read_csv("new_system_data.csv")

— Create a variable (results, for example) and store the results of the comparison in it.

results = datacompy.Compare(
    df1,                # first dataframe to compare
    df2,                # second dataframe to compare
    join_columns=None,  # (string or list, optional) column(s) used to join
    on_index=False,     # (bool, optional) join on the index instead of columns
    abs_tol=0,          # (float, optional) absolute tolerance
    rel_tol=0,          # (float, optional) relative tolerance
    df1_name='df1',     # (string, optional) custom name for df1
    df2_name='df2'      # (string, optional) custom name for df2
)

Details of the parameters:

  • df1: dataframe where legacy data will be stored
  • df2: dataframe where the new data will be stored
  • join_columns: a list or string with the names of the columns that will be used to join the two dataframes.
  • on_index: a boolean that, when True, joins the dataframes on their index instead of on columns.
  • abs_tol: a numeric value representing the absolute tolerance for the numeric columns. For example, if the desired tolerance is plus or minus 10, use the value of 10 as a parameter.
  • rel_tol: a numeric value representing the relative tolerance for the numeric columns. If the desired tolerance is 5%, use the value 0.05 as a parameter.
  • df1_name: an optional name for the first dataframe
  • df2_name: an optional name for the second dataframe

There are more parameters available in the package’s documentation, such as ignore_spaces (strip whitespace from string columns before comparing), ignore_case (ignore case differences in string columns) and cast_column_names_lower (convert column names to lowercase). For didactic purposes, we only present here the ones we consider most important for your initial use.
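
As a quick sketch (the two small dataframes below are invented for illustration), enabling ignore_spaces and ignore_case makes values that differ only in case or surrounding whitespace count as equal:

import pandas as pd
import datacompy

# Two small dataframes whose only differences are case and stray spaces
df_a = pd.DataFrame({"customer_id": [1, 2], "city": ["Florianopolis", "Sao Paulo"]})
df_b = pd.DataFrame({"customer_id": [1, 2], "city": ["FLORIANOPOLIS ", " sao paulo"]})

comparison = datacompy.Compare(
    df_a,
    df_b,
    join_columns="customer_id",
    ignore_spaces=True,  # strip surrounding whitespace from string columns
    ignore_case=True     # compare string columns case-insensitively
)
print(comparison.report())  # the city column is reported as fully matching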

— Finally, print the results of the comparison

print(results.report())

In addition to the comparison report, DataComPy stores important information in attributes and allows you to access them to further interpret the comparison.

The main attributes are:

  • df1_unq_rows: contains all the records from the first dataframe that were not matched in the column(s) used for the join.
  • intersect_rows: contains all the records that match the index or join key. The column(s) used as the key remain with their original name, and the other columns are given a suffix (“_df1” or “_df2”, depending on the dataframe). In addition, for each column compared, a column is created with the suffix “_match”, which indicates whether the record is the same (True) or different (False) in the comparison.
  • df2_unq_rows: contains all the records from the second dataframe that were not matched in the column(s) used for the join.

To print them, you can use:

print(results.df1_unq_rows)
print(results.intersect_rows)
print(results.df2_unq_rows)

You can also export these results as CSV files and take them to other analysis tools.
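
For example, using pandas’ to_csv (the file names below are just illustrative):

# Export the comparison outputs for further analysis in other tools
results.df1_unq_rows.to_csv("only_in_df1.csv", index=False)
results.df2_unq_rows.to_csv("only_in_df2.csv", index=False)
results.intersect_rows.to_csv("matched_rows.csv", index=False)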

DataComPy in action: a practical example

We will now explore a practical example of using DataComPy.

Considering the two dataframes df_legacy and df_new below:

Figure 3. dataframes df_legacy and df_new

After installing and importing DataComPy, you should run:

# Comparing the two dataframes
results = datacompy.Compare(
    df_legacy,
    df_new,
    join_columns=['product_id', 'date'],
    rel_tol=0.00,
    abs_tol=0.00,
    df1_name='legacy',
    df2_name='new'
)

Using the fields product_id and date as the join key, DataComPy joins the two dataframes and compares the remaining columns. In this case, DataComPy will compare the columns of the following green rows, as they match on the join key.

Figure 4. Indication of which columns are compared in rows that match the join key (green rows)

The product_nm column, as it appears only in df_legacy, will not be compared.

The rows that do not match the join key are not compared, and they are classified as unique to each dataframe.

Next, the result should be printed.

# Printing the result
print(results.report())

The printed result contains the following information:

Figure 5. Detailed DataComPy report (part 1)
Figure 6. Detailed DataComPy report (part 2)

Finally, you can analyze the complete set of unique rows in each dataframe, as well as the rows in the intersection.

Figure 7. Unique rows in each dataframe and rows in the intersection

Analyzing DataComPy Reports: A Guide for Analytics Engineers

When analyzing the results, it’s crucial to note that, beyond identical records, those unique to either dataframe 1 or dataframe 2 are equally important.

A substantial number of mismatches on the join key can be attributed to character differences between records (for example, one record contains words with special symbols and the other does not), but also to an erroneous data transformation. Thus, special attention should be given to the join key (or to the order of the dataframes, when using the index), as incorrect configurations may exclude records from the comparison and lead to inadequate conclusions.

Considering the many possibilities for interpreting the results, we propose a 5-step analysis methodology for a comprehensive understanding of the DataComPy report:

Figure 8. 5-step analysis methodology to interpret the DataComPy report

Step 1: Analyze the number of rows in each dataframe

This initial step provides a general indication of the similarity between your dataframes.

Step 2: Analyze the number of columns in each dataframe

The number of columns is usually checked when setting up the dataframes for comparison, and it tends to be the same in both dataframes. Even so, it is worth verifying it when interpreting the results.

Step 3: Define the columns to join

This step deepens the understanding of the dataframe similarity. It involves analyzing the number of records that had a “match” in the join key (number of rows in common).

Strive for a join key with as many matches as possible. It might be necessary to run this step several times before moving to step 4.

Step 4: Analyze the similarity of the compared columns

Evaluate the similarity between the records matched on the join key. Aim for the number of rows with equal values to match the number of rows in common.

For numerical columns, you can set tolerances to get a better idea of how divergent the values are. For example, when setting a relative tolerance of 1% for a numerical column, if the comparison results indicate that all the rows are equal, you can conclude that the differences are within this threshold.

Additionally, address details like case sensitivity or spacing discrepancies in columns by analyzing rows with a match status of ‘False’ in the results.intersect_rows data.
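
As an illustration, assuming a compared column named price (hypothetical) and the join key from the earlier example, the mismatched rows can be isolated directly from results.intersect_rows:

# Rows where the (hypothetical) price column did not match,
# keeping the join key and both versions of the value side by side
mismatches = results.intersect_rows.loc[
    ~results.intersect_rows["price_match"],
    ["product_id", "date", "price_df1", "price_df2"]
]
print(mismatches)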

Step 5: Analyze rows without a match in the join key

Evaluate the cause of the mismatch. Possibilities include differences in source data, encoding issues, the presence of special characters (for example, one dataframe has words with special characters and the other does not), or an erroneous data transformation.
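
If special characters are the suspected cause, one possible approach is to normalize the text and rerun the comparison. The sketch below assumes a hypothetical string join key called customer_nm and is only an illustration of the idea:

import unicodedata

def strip_accents(text: str) -> str:
    # Remove accents, e.g. "São Paulo" -> "Sao Paulo"
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

# Inspect a sample of the unmatched rows on each side
print(results.df1_unq_rows.head())
print(results.df2_unq_rows.head())

# Normalize the hypothetical customer_nm key on both sides (assumes no missing
# values in the column), then build a new Compare object and check the report again
df_legacy["customer_nm"] = df_legacy["customer_nm"].map(strip_accents)
df_new["customer_nm"] = df_new["customer_nm"].map(strip_accents)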

Implementing this methodology not only enables the validation of migrated data but also allows for the identification of potential causes of differences.

Additionally, besides understanding the tool’s output, it is highly recommended to establish a reliable validation framework for the whole team. A standardized framework for all validations can increase productivity, eliminate guesswork and improve service quality.

Practical approaches for dealing with large volumes of data

When working with DataComPy, you might face a common challenge: the volume of data is too large to fit in your computer’s RAM. Initially, you might consider using cloud resources and a Python environment for processing. However, there are efficient ways to manage this on your local machine using data summarization techniques and sampling.

Summarizing Data

A practical method involves grouping data by key attributes, and then summarizing important metrics with functions. This approach reduces both processing time and memory usage. For example, when comparing dataframes of a company’s sales data, the dataframes can be grouped by attributes (like year and month, product, and category), summarizing metrics (like quantity, price, total price, and discount) with functions (like sum, average, minimum, and maximum).

Figure 9. Sales table with indication of possible columns to group by and summarize

After summarization, use DataComPy to compare the grouped dataframes, but be aware that summarizing data can lead to loss of detail. It’s important to balance this approach with detailed tests on specific groups when necessary.
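
As a rough sketch, assuming the sales dataframes have year_month, product_id, quantity and total_price columns (the names are illustrative), the grouped comparison could look like this:

import datacompy

# Group each sales dataframe by the same keys and summarize the metrics,
# so the comparison runs on a much smaller, aggregated dataset
agg_spec = {"quantity": "sum", "total_price": "sum"}
df_legacy_grouped = df_legacy.groupby(["year_month", "product_id"], as_index=False).agg(agg_spec)
df_new_grouped = df_new.groupby(["year_month", "product_id"], as_index=False).agg(agg_spec)

results_grouped = datacompy.Compare(
    df_legacy_grouped,
    df_new_grouped,
    join_columns=["year_month", "product_id"],
    df1_name="legacy_grouped",
    df2_name="new_grouped"
)
print(results_grouped.report())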

Sampling Data

Another strategy is to analyze samples from different periods. For instance, if comparing data spanning from 2018 to 2023, select random months from each year for a representative comparison. This sampling approach can provide insights without overwhelming your system with the entire dataset. You will need to judge whether the results obtained from the sample are good enough to validate the entire dataset.
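
A possible sketch, assuming both dataframes have a date column, that each year contains at least two distinct months, and reusing the join key from the earlier example:

import pandas as pd
import datacompy

# Compare only a few randomly chosen months per year instead of the full history
months_legacy = pd.to_datetime(df_legacy["date"]).dt.to_period("M")
months_new = pd.to_datetime(df_new["date"]).dt.to_period("M")

# Draw two random months from each year (seeded so the sample is reproducible)
unique_months = months_legacy.drop_duplicates()
sampled_months = unique_months.groupby(unique_months.dt.year).sample(n=2, random_state=42)

results_sample = datacompy.Compare(
    df_legacy[months_legacy.isin(sampled_months)],
    df_new[months_new.isin(sampled_months)],
    join_columns=["product_id", "date"],
    df1_name="legacy_sample",
    df2_name="new_sample"
)
print(results_sample.report())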

Practical approach for handling asynchronous data

In scenarios where it is not possible to ensure that the data sources are synchronized, differences in the comparison are expected.

To mitigate these differences, you can adopt DataComPy’s relative or absolute tolerances. They make it possible to establish limits of acceptable variation, minimizing the impact of factors beyond the validator’s control (for example, asynchronism between sources). They can also be handy for grouped data, where rounding differences can occur.

However, be careful: setting the tolerance too high might overlook critical discrepancies, while setting it too low might flag minor differences that exist for reasons beyond your control. Always align the adoption of tolerance limits with the business team, as they are the ones who can tell what is acceptable based on the use and characteristics of each data solution they rely on.
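
As a sketch, reusing the example dataframes from earlier (the tolerance values below are purely illustrative and should come from the business team):

import datacompy

# Re-run the comparison with agreed-upon tolerances for numeric columns
results_tolerant = datacompy.Compare(
    df_legacy,
    df_new,
    join_columns=["product_id", "date"],
    abs_tol=0.5,   # accept absolute differences of up to 0.5
    rel_tol=0.01,  # accept relative differences of up to 1%
    df1_name="legacy",
    df2_name="new"
)
print(results_tolerant.report())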

Conclusion

DataComPy is a powerful and flexible tool that can be integrated into your data validation workflow. It makes it possible to interpret the differences between two dataframes and helps to ensure data quality and consistency as well as improved reliability in data migration projects. It is especially useful for Analytics Engineers who need to refactor data from a legacy system into the Modern Data Stack.

In this article, we demonstrated the applicability of DataComPy for validating data migrations between two systems. We presented its detailed report and how it supports interpreting the similarity between dataframes. We also highlighted other important aspects, such as the 5-step methodology for interpreting the report and practical approaches for using it with large datasets and asynchronous data. With all that, DataComPy proves itself an important tool in data validation.

However, DataComPy is not the only tool available for this purpose. There are other tools that can be used to compare migrated data, such as dbt’s audit_helper package, which contains numerous useful macros for comparing dbt models.

In this context, how does DataComPy compare to dbt’s audit_helper package? What are the advantages and disadvantages of each tool? How do you choose the best tool for your use case? Well, that’s for a future article…
