Data Stability with Python: How to Catch Even the Smallest Changes

Grigorii Osipov
Published in datamindedbe
Aug 1, 2024

Slight changes can mean a whole lot of resource usage

As a data engineer, it is nearly always the safest option to run data pipelines every X minutes. This allows you to sleep well at night, knowing you haven’t missed any updates coming from upstream.

Unfortunately, this approach doesn’t hold up well once you account for the large load on infrastructure and the bill that comes at the end of the month. Quite often, pipelines only really need to run when there have been changes upstream. Additionally, blindly re-running them means we never actually know when the real upstream updates occurred. One way to see if two files are identical is to compare their hashes, which can be done with a simple function like the one below:

import hashlib

def calculate_file_hash_chunked(file_path: str) -> str:
    """
    Calculate the hash of a file by reading it in chunks.
    MD5 is used because its shorter digest is easier to read.

    :param file_path: The path to the file.
    :return: The hash of the file as a hexadecimal string.
    """
    hash_obj = hashlib.md5()
    with open(file_path, 'rb') as file:
        for chunk in iter(lambda: file.read(4096), b""):
            hash_obj.update(chunk)
    return hash_obj.hexdigest()

A hash is generally an alphanumeric string computed from an input by a deterministic algorithm, which is one of the reasons it can be used for comparison. Deterministic in this context means there are no outside inputs and no randomness: the result depends only on the input provided. Here you can see an example of a “shorter” MD5 hash: `938c2cc0dcc05f2b68c4287040cfcf71`.
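That determinism is easy to verify with the standard library: hashing the same bytes twice always yields the same digest, while even a tiny change to the input produces a completely different one (a minimal sketch; the input strings are made up for illustration):

```python
import hashlib

# The same input always produces the same digest...
digest_a = hashlib.md5(b"hello world").hexdigest()
digest_b = hashlib.md5(b"hello world").hexdigest()
print(digest_a == digest_b)  # True

# ...while even a one-byte change yields an entirely different digest.
digest_c = hashlib.md5(b"hello world!").hexdigest()
print(digest_a == digest_c)  # False
```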

However, this approach may not be consistent enough, as we may not be aware of how the files are generated upstream. A single row moving up or down produces a completely different hash, which makes it difficult to tell whether the data inside has changed. What we truly want to know is whether the data has changed, not the file itself.
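To see the problem concretely, consider two CSV payloads holding exactly the same rows in a different order (a small sketch; the file contents are invented for illustration):

```python
import hashlib

# Same three rows, different order.
csv_a = b"id,name\n1,foo\n2,bar\n3,baz\n"
csv_b = b"id,name\n3,baz\n1,foo\n2,bar\n"

hash_a = hashlib.md5(csv_a).hexdigest()
hash_b = hashlib.md5(csv_b).hexdigest()

# The data is identical, yet the file hashes disagree.
print(hash_a == hash_b)  # False
```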

Did the data change or are the rows simply rearranged?

Pandas

To solve this, a simple pandas DataFrame method is usually sufficient to identify changes. If we load the data into a DataFrame, sort the columns, and then sort all the rows, hashing the result lets us say with confidence whether the data has changed.

A word should be said about the stability of the sort, since obtaining the same result on every run is essential. A stable sort is one that preserves the relative order of elements with identical keys. However, we work on the assumption that only the data inside the DataFrame matters and that the rows may simply be rearranged. Because we sort by every column at once, ties can only occur between fully identical rows, so the stability of the sorting algorithm does not affect the result and does not need to be considered.
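This claim can be checked directly: sorting by every column with pandas’ default quicksort and with an explicitly stable algorithm yields the same DataFrame, since stability only matters for ties and tied rows are identical here (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'A': [True, True, False], 'B': [3, 1, 2]})

# Sorting by every column pins down a total order on the rows,
# so the algorithm's stability cannot influence the outcome.
unstable = df.sort_values(by=['A', 'B'], kind='quicksort').reset_index(drop=True)
stable = df.sort_values(by=['A', 'B'], kind='stable').reset_index(drop=True)

print(unstable.equals(stable))  # True
```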

import pandas as pd

def sort_dataframe(
    df: pd.DataFrame,
    sort_rows: bool = True,
    sort_columns: bool = True,
) -> pd.DataFrame:
    """
    Sort the DataFrame by columns and/or rows based on the provided options.

    :param df: The input DataFrame.
    :param sort_rows: Whether to sort the rows. Default is True.
    :param sort_columns: Whether to sort the columns. Default is True.
    :return: The sorted DataFrame.
    """
    if sort_columns:
        df = df.reindex(sorted(df.columns), axis=1)
    if sort_rows:
        df = df.sort_values(by=list(df.columns)).reset_index(drop=True)

    return df

This already allows us to easily compare two differently ordered DataFrames, as seen below:

data1 = {
    'A': [True, False, True],
    'B': [1, 2, 3],
    'C': ['foo', 'bar', 'baz']
}
data2 = {
    'C': ['baz', 'foo', 'bar'],
    'B': [3, 1, 2],
    'A': [True, True, False]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

if df1.equals(df2):
    print("DataFrames are equal before sorting.")

df1_sorted = sort_dataframe(df1)
df2_sorted = sort_dataframe(df2)

print(df2_sorted)

if df1_sorted.equals(df2_sorted):
    print("DataFrames are equal after sorting.")

------
A B C
0 False 2 bar
1 True 1 foo
2 True 3 baz

DataFrames are equal after sorting.

As seen above, the rows are ordered by the boolean values in column A first, with column B becoming the deciding factor between the two rows that hold the same value in column A.

Quicksort is pandas’ default sorting method

Hashing

However, getting access to the previous DataFrame is often tedious, time-consuming, and inefficient. What makes more sense is simply hashing the DataFrame as a whole and saving the result somewhere it can be easily fetched.

import hashlib
import io
import pandas as pd

def hash_pandas_dataframe(df: pd.DataFrame) -> str:
    """Serialize the DataFrame to Feather in memory and hash the bytes."""
    buffer = io.BytesIO()
    df.to_feather(buffer)
    buffer.seek(0)
    feather_bytes = buffer.read()
    return hashlib.md5(feather_bytes).hexdigest()

The example above serializes the DataFrame directly into memory in the Feather format, which is deterministic, and hashes the resulting bytes. If the hashes of two separate files (or in this case, DataFrames) are the same, it means that either the two are identical or a hash collision has occurred. Given that we are trying to determine whether two objects are identical, we can discount the far less likely collision case ( ± 1 in 10²⁹ ) and conclude that the two datasets are identical.
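Putting the pieces together, a change check sorts both DataFrames and compares their hashes. The sketch below substitutes a CSV serialization for Feather so it runs without the optional pyarrow dependency; the `hash_dataframe_csv` helper and the sample data are invented for illustration, and the Feather-based function above slots in the same way:

```python
import hashlib
import pandas as pd

def hash_dataframe_csv(df: pd.DataFrame) -> str:
    # CSV serialization avoids the pyarrow dependency that Feather needs;
    # sorting columns and rows beforehand is what makes the hash stable.
    return hashlib.md5(df.to_csv(index=False).encode()).hexdigest()

def sort_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df = df.reindex(sorted(df.columns), axis=1)
    return df.sort_values(by=list(df.columns)).reset_index(drop=True)

old = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
new = pd.DataFrame({'B': ['y', 'x'], 'A': [2, 1]})  # same data, rearranged

if hash_dataframe_csv(sort_dataframe(old)) == hash_dataframe_csv(sort_dataframe(new)):
    print("No upstream change detected; skipping the pipeline run.")
```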

This approach works for other formats such as Parquet if they are also read into Pandas, as the core processes of sorting and hashing remain the same.

Stay tuned for the next entry on hashing data files, where different ways to hash are highlighted and the superior speed of Polars is demonstrated.
