Comparison of Two Data Sets using Python

Darshanaawalgaonkar · Published in Analytics Vidhya · Feb 18, 2021 · 5 min read

For anyone working in an analytical role, comparing two data sets is a day-to-day activity, whether to prove that changes made do not impact the rest of the data in a file (typically called "regression testing") or to understand the differences between two files or data sets.

When the data is small, Excel comes to the rescue. But what if your dataset is huge, and you have to perform comparisons frequently? In that case you need something more powerful than Excel or Excel macros.

There are many file comparison tools available on the market, such as Beyond Compare.

In this article, we will explore how to compare two large files/datasets efficiently while producing a meaningful summary, using the Python library datacompy.

datacompy is a package for comparing two DataFrames. It originally started as a replacement for SAS's PROC COMPARE for pandas DataFrames, and offers more functionality than just pandas.DataFrame.equals(pandas.DataFrame).
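For context, pandas' built-in equality check is all-or-nothing. A minimal illustration (the dataframes here are made up just for this example):

import pandas as pd

df_a = pd.DataFrame({'acct_id': [1234], 'dollar_amt': [123.45]})
df_b = pd.DataFrame({'acct_id': [1234], 'dollar_amt': [123.40]})

# Returns a single True/False with no detail on what differs
print(df_a.equals(df_b))  # False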

Let's see how we can make use of this library.

Installing datacompy

pip install datacompy

Details:

datacompy takes two dataframes as input and produces a human-readable report containing statistics that let us know the similarities and dissimilarities between the two dataframes.

It will try to join the two dataframes either on a list of join columns or on their indexes.

Column-wise comparisons attempt to match values even when dtypes don't match. So if, for example, you have a column with decimal.Decimal values in one dataframe and an identically named column with float64 values in another, it will tell you that the dtypes are different but will still try to compare the values.
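As a quick sketch of both behaviours (the frames and values below are invented for illustration):

import decimal
import pandas as pd
import datacompy

a = pd.DataFrame({'id': [1, 2], 'amt': [decimal.Decimal('1.0'), decimal.Decimal('2.5')]})
b = pd.DataFrame({'id': [1, 2], 'amt': [1.0, 2.5]})

# Join on a column...
cmp_cols = datacompy.Compare(a, b, join_columns='id')
# ...or on the index instead
cmp_idx = datacompy.Compare(a, b, on_index=True)

# The report flags the dtype difference but still compares the values
print(cmp_cols.report())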

Example :

Let's say you have two CSV files.

Load the files into two dataframes:
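Assuming the two files are named csv1.csv and csv2.csv (placeholder names):

import pandas as pd

df1 = pd.read_csv('csv1.csv')  # the "original" file
df2 = pd.read_csv('csv2.csv')  # the "new" file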

df1

[Dataframe 1: contents of CSV 1]

df2

[Dataframe 2: contents of CSV 2]

Import datacompy and compare the two dataframes:

import datacompy

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  # you can also specify a list of columns
    abs_tol=0.0001,          # absolute tolerance for numeric comparisons
    rel_tol=0,               # relative tolerance
    df1_name='original',
    df2_name='new')

Generate the output (in the form of a report):

print(compare.report())

Output :

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        5     7
1       new        4     6

Column Summary
--------------

Number of columns in common: 4
Number of columns in original but not in new: 1
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: acct_id
Any duplicates on match values: Yes
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 5
Number of rows in original but not in new: 2
Number of rows in new but not in original: 1

Number of rows with some compared columns unequal: 3
Number of rows with all compared columns equal: 2

Column Comparison
-----------------

Number of columns compared with some values unequal: 3
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 5

Columns with Unequal Values or Types
------------------------------------

       Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0  dollar_amt        float64   float64          1      0.05            0
2        name         object    object          2      0.00            0
1         qty        float64   float64          2      3.00            1

Sample Rows with Unequal Values
-------------------------------

   acct_id  dollar_amt (original)  dollar_amt (new)
0     1234                 123.45             123.4

   acct_id  qty (original)  qty (new)
5     1238             8.0      5.00
3     1237             NaN    345.12

   acct_id  name (original)            name (new)
0     1234   George Maharis  George Michael Bluth
3     1237       Bob Loblaw         Robert Loblaw

Sample Rows Only in original (First 10 Columns)
-----------------------------------------------

   acct_id  dollar_amt              name   qty  date_fld
6     1239        1.05  Loose Seal Bluth  29.0  1/1/2017
4     1237   123456.00        Bob Loblaw   NaN  1/1/2017

Sample Rows Only in new (First 10 Columns)
------------------------------------------

   acct_id  dollar_amt              name    qty
7     1240        1.05  Loose Seal Bluth  111.0

The details of the output are self-explanatory.

What is happening behind the scenes:

  1. In the above example, we join the two dataframes on a matching column (acct_id). We can also pass on_index=True instead of join_columns to join on the index.
  2. Compare.matches() is a Boolean function: it returns True if the dataframes match and False otherwise (see the sketch after this section).
  3. We can pass ignore_extra_columns=True to ignore non-matching columns instead of returning False (overlapping columns are still checked).
  4. By default, DataComPy returns True only on a 100% match. We can tweak this by setting abs_tol and rel_tol to non-zero values, which lets us specify how much deviation between numeric values can be tolerated.
  5. We can also fetch the intersection of the two sets and the unequal rows from df1 or df2 using the attributes below:

intersect_rows, df1_unq_rows, df2_unq_rows

df1_unq_rows — all rows from df1 that could not be matched in df2:

print(compare.df1_unq_rows)

intersect_rows — the matched rows, with each compared column from both dataframes side by side plus a True/False flag showing whether the values are equal:

print(compare.intersect_rows)

df1_unq_columns() — gives the columns present in df1 but not in df2. In our example, date_fld is not present in df2, hence it is returned:

print(compare.df1_unq_columns())
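Putting points 2 to 4 together, a minimal sketch reusing the compare object built above:

# True only when columns, rows, and all compared values match
print(compare.matches())  # False in our example

# Ignore the extra date_fld column; overlapping columns are still checked,
# so this still returns False here because some values differ
print(compare.matches(ignore_extra_columns=True))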

Handling duplicate values:

The duplicate matching is somewhat naïve when it comes to picking which rows to match when there are duplicates. datacompy sorts by the other fields before generating a temporary ID, then matches directly on that ID. If there are many duplicates, you may need to join on more columns or handle them separately.
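For example, a composite key can disambiguate the matches (using name as a second join column here is just an illustration):

compare = datacompy.Compare(df1, df2, join_columns=['acct_id', 'name'])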

Limitations

  1. The dataframes you're comparing have to fit in memory. Compared with SAS PROC COMPARE, which can operate on datasets that are on disk, this can be a constraint if you're working with very large dataframes. (I have compared two datasets with 150,000+ records each, and it works smoothly.)

To sum up:

DataComPy is a fairly useful library if you want to quickly compare two datasets. It also allows you to cater for minor differences between the datasets and provides a detailed summary of the two dataframes.

I hope you found this article useful.
