Hash Your Way to Data Science Glory!

Martin Hillebrand
Teradata
15 min read · Apr 4, 2024

Intro

Everything in the world has a half-life, a concept that resonates deeply within the realms of both science and technology. This principle was underscored by many of my university professors, who advised us to prioritize understanding core concepts over chasing the latest tech trends, given their transient nature. One such enduring concept is hashing, whose significance in data science workflows I’ve only fully appreciated years after my exams.

Hashing lies at the heart of Teradata technology, particularly its capacity for massively parallel processing (MPP). This capability hinges on efficient data access and retrieval, powered by a robust hashing function. While the mechanics of hashing might remain behind the scenes for most users, gaining an understanding of how it works can be incredibly beneficial.
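To make the idea concrete, here is a conceptual sketch in plain Python of how hash-based distribution of rows across parallel units works in principle. It is only an illustration, not Teradata's actual implementation; the MD5 hash and the four units are assumptions chosen for the example.

# Conceptual sketch: spread rows across parallel units by hashing a key.
import hashlib

def unit_for_key(key: str, num_units: int = 4) -> int:
    # hash the key, read the digest as an integer, and map it to a unit
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_units

for key in ["row-1", "row-2", "row-3", "row-4", "row-5"]:
    print(key, "-> unit", unit_for_key(key))

Because the assignment depends only on the key, each unit can locate its rows without asking any of the others.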

In this blog post, I’ll demonstrate through four use cases how hashing can be a game-changer in the workflow of a data scientist. These four use cases are:

  1. Pseudonymizing a categorical feature
  2. Splitting data into random subsets for training, evaluation, and testing
  3. Encoding a categorical feature with an unknown number of values into buckets
  4. Encoding a categorical feature with a known number of values without collisions

Before that, let’s start with the basics. What is hashing and what are the qualities of a hashing function?

About Hashing

In simple terms, a hash function is a special mathematical algorithm that takes input data of any size and produces a fixed-size string of characters, which typically looks like a random sequence of letters and numbers. Think of it as a unique digital fingerprint for the data. No matter how large or small the input is, the hash function generates a fixed-length output. For example, whether you’re hashing a single word like “hello” or an entire book, the output (hash) will be of a consistent length.
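As a small illustration of that fixed-length property, here is a sketch using Python's standard hashlib module as a stand-in for a generic hash function (this is not Teradata's HASHROW, which we will meet below):

# Inputs of very different sizes produce digests of identical length.
import hashlib

short_text = "hello"
long_text = "hello " * 10_000  # roughly a book chapter's worth of text

print(hashlib.sha256(short_text.encode()).hexdigest())  # 64 hex characters
print(hashlib.sha256(long_text.encode()).hexdigest())   # also 64 hex characters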

Teradata’s efficiency relies heavily on its hashing function. Let’s highlight some key properties of the Teradata hash function:

  • Deterministic: The same input will always produce the same output.
  • Fast computation: It’s quick to calculate the hash for any given data, so insertion and retrieval are fast.
  • Non-invertible: It’s practically impossible to reverse the process, meaning you can’t easily figure out the original input from the hash output.
  • Collision-resistant: It’s extremely unlikely (though not impossible) that two different inputs will produce the same output hash. This depends on the length of the output token. When converted to an integer, the result of the HASHROW function can take on over 4 billion distinct values (4,294,967,295 hash codes, to be precise).
  • Uniform: When the input is unique, such as a primary key, the hash values are spread evenly across the output space, which you can exploit by reducing them to buckets with the modulo operator (see the quick check right after this list).
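Here is that quick check, a sketch with hashlib once again standing in for HASHROW, demonstrating both determinism and the roughly uniform spread you get when unique keys are reduced to buckets with the modulo operator:

# Determinism and rough uniformity, with hashlib standing in for HASHROW.
import hashlib
from collections import Counter

def bucket(key, num_buckets=6):
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % num_buckets

# deterministic: the same key always lands in the same bucket
assert bucket("row-42") == bucket("row-42")

# roughly uniform: 60,000 unique keys spread almost evenly over 6 buckets
counts = Counter(bucket(i) for i in range(60_000))
print(sorted(counts.items()))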

But let’s not stay purely theoretical; this is much easier to grasp with a demo data set. For our demonstration, we will rely on the Adult dataset, also known as the “Census Income” dataset, from the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/2/adult). It comprises 48,842 instances with 14 features, aimed at predicting whether an individual’s income exceeds $50,000 per year based on census data. The dataset includes a mix of categorical and integer feature types, covering demographic attributes such as age, work class, education, marital status, occupation, relationship, race, and sex.

To start with, we can use the ClearScape Analytics Experience demo environment (sign up for free here: https://www.teradata.com/experience), which comes with a Jupyter Hub environment and a Vantage Cloud instance. It is a perfect sandbox that allows you to explore more than 80 real-world use cases across multiple industries and analytic functions within a fully functional Vantage environment.

First, let’s create a database connection and populate a table in Vantage with the census income data set.


# connecting to Vantage (host, username, and password are assumed to be defined)
import teradataml as tdml
eng = tdml.create_context(host=host, username=username, password=password)

# !pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# the census income dataset has id 2,
# see here for more info: https://archive.ics.uci.edu/dataset/2/adult
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes); copy to avoid modifying the fetched frame in place
df = adult.data.features.copy()
df["income"] = adult.data.targets.income.str.replace(".", "", regex=False)  # some cleaning needed
df.columns = [c.replace("-", "_") for c in df.columns]
df = df.reset_index().rename(columns={'index': 'row_id'})

# upload dataframe to Vantage
tdml.copy_to_sql(df, "census_income", if_exists='replace',
                 primary_index="row_id")
[Screenshot: the census_income table with columns row_id, age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, native_country, and income; each row represents one individual.]

Use Case 1: “Hidden by Hash”

In exploring how hash functions benefit data science, we start with pseudonymizing categorical variables. This technique is essential for protecting data privacy. By using hash functions, we transform sensitive details into pseudonymized form. This protects personal and confidential information while still allowing us to carry out meaningful data analysis.

Let’s consider an example where we want to combine three categorical variables — relationship, race, and sex — into one pseudonymized variable using hashing. Here’s the code:

DF = tdml.DataFrame("census_income")

from sqlalchemy import func as f

DF_encrypted = (
    DF
    .select(["row_id", "relationship", "race", "sex"])
    .assign(demographic_encrypted=
        f.abs(
            f.from_bytes(
                f.hashrow(DF.relationship.expression,
                          DF.race.expression,
                          DF.sex.expression),
                "base10"
            ).cast(type_=tdml.INTEGER)
        )
    )
)

Let’s dive into the code, breaking it down to understand each step:

A handy tool in teradataml is the show_query() function. It can be attached to any DataFrame expression, allowing us to peek at the resulting SQL query. In our case, here's what it looks like:

[Screenshot: DF_encrypted.show_query() printing the generated SQL SELECT statement.]
[Screenshot: a sample of DF_encrypted with the columns row_id, relationship, race, sex, and demographic_encrypted.]

If the hashing process avoids any collisions, it creates a consistent mapping. This means if a row has the same values across the three selected columns for hashing, the resulting hashed value will be the same as well. The key benefit here is that hashing obscures the original clear text values, which might often be sufficient for privacy purposes. However, it’s worth noting that if someone is familiar with the original categories’ (multivariate) distributions, they could attempt to backtrack to the original values.
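To verify that the hashing really did avoid collisions on our data, a quick sanity check is to compare the number of distinct input combinations with the number of distinct hash values. This is a sketch that pulls the relevant columns into pandas for simplicity and reuses the DF_encrypted DataFrame from above.

# Sanity check for collisions: if the count of distinct (relationship, race, sex)
# combinations equals the count of distinct hash values, the mapping is
# one-to-one on this data set.
check_df = DF_encrypted.to_pandas()

n_combos = check_df[["relationship", "race", "sex"]].drop_duplicates().shape[0]
n_hashes = check_df["demographic_encrypted"].nunique()
print(f"distinct combinations: {n_combos}, distinct hash values: {n_hashes}")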

Moreover, should the model ever be exposed, it becomes ineffective without knowledge of the specific characteristics of the hash function used. This adds an extra layer of security, as the model’s utility is closely tied to the unique properties of the hashing technique employed.

Use Case 2: “Precision in Partition”

As we move to our second exploration of hash functions in data science, we turn our attention to effectively dividing datasets into training, validation, and test sets. By applying a hash function to a unique primary key for this division, we achieve not only incredible efficiency but also a level of reproducibility and consistency that enhances data analysis projects. This technique smoothly generates distinct subsets of data. Thanks to the predictable behavior of hash functions, we can ensure that each piece of data consistently finds its way into the same subset, allowing for accurate comparisons and solid evaluations of model performance. Here’s the breakdown:

  1. Hashing Identifiers: Start by calculating a hash value for each record’s unique identifier. This could be a singular ID, a mix of different fields, or any attribute that uniquely defines a record.
  2. Determining the Split: Transform the hash value into a small numerical range (for instance, by applying modulo 6 to it). Then, assign the record to the training, evaluation, or test set based on the resulting value. For instance:
  • Assign records with a value of 0 to the test set (making up 16.7%).
  • Assign records with a value of 1 to the validation set (also 16.7%).
  • Assign records with values from 2 to 5 to the training set (comprising 66.7%).

Now, applying this to the census dataset:

  1. Hashing Identifiers: We’re in luck here; row_id serves as our primary key.
  2. Determining the Split: We’ll allocate two-thirds of our data to training and one-sixth to both validation and testing. This involves taking the modulo 6 of our integer hash value to ensure even distribution.

Let’s see how swiftly we can put this into action, opting for SQL this time for a bit of variety.

DF_fold = tdml.DataFrame.from_query(
"""
SELECT
    -- create 6 equally sized buckets
    MOD(
        ABS(CAST(from_bytes(hashrow(row_id), 'base10') AS INTEGER)),
        6) AS rowid_hashbin,
    -- assign to folds as per requirement
    CASE rowid_hashbin
        WHEN 0 THEN 'test'
        WHEN 1 THEN 'evaluate'
        ELSE 'train'
    END AS fold,
    t.*
FROM
    census_income t
""")
[Screenshot: a sample of the DF_fold DataFrame, with visible columns including row_id, age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, and sex, extending beyond the frame to the right.]

Let’s check to make sure our data splits are fair, meaning they don’t have uneven distributions of the target labels. This step highlights the flexibility of teradataml, which seamlessly blends SQL and pandas-style syntax for an intuitive workflow. Given that our aggregated DataFrame has just 6 rows, we’ll move it over to pandas for visualization. This allows us to take a closer look and ensure our model training is based on balanced and unbiased data.

import matplotlib.pyplot as plt

# aggregate in-database, then move the tiny result (6 rows) over to pandas
DF_fold_counts = DF_fold.select(["fold", "income", "row_id"]
                                ).groupby(["fold", "income"]).count()
df_fold_counts_pd = DF_fold_counts.to_pandas().reset_index()

pivot_df = df_fold_counts_pd.pivot(index='fold', columns='income',
                                   values='count_row_id').fillna(0)
ax = pivot_df.plot(kind='bar', stacked=True, figsize=(10, 6))

# annotate each bar segment with its row count
for bar in ax.patches:
    x = bar.get_x() + bar.get_width() / 2
    y = bar.get_height() / 2 + bar.get_y()
    value = int(bar.get_height())
    ax.text(x, y, str(value), ha='center')

plt.title('Distribution of Income Groups by Fold')
plt.xlabel('Fold')
plt.ylabel('Number of Rows')
plt.xticks(rotation=45)
plt.legend(title='Income')
plt.tight_layout()
plt.show()
[Bar chart: “Distribution of Income Groups by Fold”, with stacked bars for the evaluate, test, and train folds, each split into the two income groups.]

This looks great! The chart displayed paints a clear picture of our data split, confirming that we’ve met our goals for both the size of the splits and the evenness of the distribution. It shows the three subsets — train, test, and evaluate — each with a proportional mix of income categories, both <=50K and >50K. The ‘train’ fold is the largest, as intended, with the ‘test’ and ‘evaluate’ folds being smaller yet similar in size to each other. The balance across these folds suggests that our hash function has done its job well, assigning data points to each subset in a way that mirrors the overall composition of our dataset.

Use Case 3: “Italy, Peru, and Portugal unified through hashing”

In our journey through the practical uses of hash functions in data science, let’s look at a challenge that often comes up with categorical data, like the native_country column in the census income dataset. This column has 43 different countries, and sometimes new ones appear that weren't seen during the model's training phase.

Other common methods for dealing with categories, like label encoding and one-hot encoding, have some drawbacks. Label encoding requires a fixed-size lookup table. One-hot encoding creates a new column for each category, which can make our dataset much bigger and harder to work with, especially when new categories show up.
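To make these drawbacks tangible, here is a quick sketch using the pandas frame df we uploaded earlier (the names are reused from the code above):

# One-hot encoding: one new column per distinct country, so the data set grows wide.
import pandas as pd

one_hot = pd.get_dummies(df["native_country"], prefix="country")
print(one_hot.shape)

# Label encoding: a lookup table this size must be stored and kept in sync,
# and it has no slot for countries that first appear after training.
label_lookup = {c: i for i, c in enumerate(df["native_country"].dropna().unique())}
print(len(label_lookup))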

Hashing provides a clever way around these issues. It lets us put many categories into a smaller number of groups, even if that means some different categories end up in the same group. This is okay because it keeps our dataset manageable and our models flexible, able to handle new categories without needing a complete overhaul. For example, countries like Italy, Peru, and Portugal might all end up in the same group, but this simplicity helps us keep our model running fast and smoothly. Let’s see how using hashing this way can make our models more straightforward and ready for whatever new data comes their way.

Our census income dataset contains several categorical variables, and the one that stands out as a candidate for feature hashing is the native_country column. Currently, there are 43 distinct countries. In the future, during model deployment, there could be countries that were not seen during training, and the worst outcome would be for our algorithm to fail.

For a start, we accept collisions and use only 10 buckets derived from hashing, which leads to 4.3 countries per bucket on average.

In practical situations, you’ll likely need to apply hash-encoding to more than just a single variable. So, the next step is to craft code that can handle this efficiently. We’ve learned that to achieve our transformation, we need to link together several functions. Fortunately, we can embody the spirit of good software practice — specifically, the DRY principle (Don’t Repeat Yourself) — by designing a function that generates these derived columns for us.

def get_feature_hashbucket(column_name, num_buckets=10):
    # hash the column, read it as a positive integer, and reduce it to buckets
    return f.abs(f.from_bytes(f.hashrow(DF[column_name].expression), "base10"
                 ).cast(type_=tdml.INTEGER)) % num_buckets


columns_to_encode = ["relationship", "race", "sex", "native_country"]
my_kwargs = {f"{colname}_encoded": get_feature_hashbucket(colname, 10)
             for colname in columns_to_encode}


DF_hashbin = (DF
              .select(["row_id"] + columns_to_encode)
              .assign(**my_kwargs))
[Screenshot: a sample of DF_hashbin with the columns row_id, relationship, race, sex, native_country, native_country_encoded, race_encoded, relationship_encoded, and sex_encoded.]

It looks like our method is taking shape — the initial results are in, and they’re promising! The sample output doesn’t reveal any collisions, but that might be due to certain values being more prevalent than others. To get a clearer view, we’ll need to aggregate the table.

DF_collisions = tdml.DataFrame.from_query(
"""
SELECT
    native_country_hashbin,
    COUNT(native_country) AS no_countries_bin,
    TRIM(TRAILING ' ' FROM (XMLAGG(TRIM(native_country) || ','
        ORDER BY native_country) (VARCHAR(1000)))) AS countries_list
FROM (
    SELECT DISTINCT
        native_country,
        MOD(ABS(CAST(from_bytes(hashrow(native_country), 'base10') AS INTEGER)), 10) AS native_country_hashbin
    FROM
        census_income t
) t
GROUP BY native_country_hashbin
""")
[Screenshot: DF_collisions.sort(‘native_country_hashbin’) with the columns native_country_hashbin, no_countries_bin, and countries_list, showing how many countries fell into each hash bin and which ones they are.]

We’ve run into collisions, which isn’t surprising. As we saw, Italy, Peru, and Portugal all share hashbucket number 2. Before we pivot to our next use case, let’s address a significant point: the choice of how many hash buckets to use.

If you’re working with a modest number of bins and categories, you’ll probably want to examine any collisions to decide if they’re acceptable. If they’re not, consider increasing your bucket count. Whether this extra step is worth it depends on how much it could speed up data preparation against your specific use case needs.

Think of the number of hash buckets for each feature as a dial you can turn in your data science process — it’s essentially a hyperparameter you can tune!

When it comes to best practices for setting the size of hash buckets, it’s all about the context and balancing act between performance and computational demands. More buckets mean fewer collisions but a larger feature space, which can bump up memory and processing requirements. On the flip side, a hash space that’s too snug could lead to collisions that mask important details. A good rule of thumb is to begin with a hash space around ten times the size of the number of unique values you expect in your variable. From there, you can tweak as needed, based on real-world results and the computational power at your disposal. The sweet spot for hash bucket size is where you minimize information loss from collisions without unnecessary growth in dimensionality.
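If you want to put rough numbers on that trade-off, the standard balls-into-bins expectation helps. The sketch below (not part of the original notebook) estimates how many distinct categories you would expect to lose to collisions for a given bucket count, here for the 43 countries:

# Expected number of categories lost to collisions when k categories are
# hashed uniformly into n buckets: k - n * (1 - (1 - 1/n) ** k).
def expected_category_loss(num_categories, num_buckets):
    expected_occupied = num_buckets * (1 - (1 - 1 / num_buckets) ** num_categories)
    return num_categories - expected_occupied

for n in (10, 50, 100, 250, 430):
    print(f"{n:>4} buckets -> ~{expected_category_loss(43, n):.1f} categories lost")

The expected loss drops from dozens of categories at 10 buckets to only a couple once you reach the ten-times rule of thumb.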

Use Case 4: “Serving faster with Matcha Green Tea Salt”

We’ve already seen that finding the right way to turn categories into numbers that our models can understand can be a challenge. Label encoding and one-hot encoding are common choices, but they’re not perfect. They can struggle with a lot of different categories, either by needing a big table to keep track of them all (label encoding) or by making our dataset huge with too many columns (one-hot encoding). Plus, they don’t handle new, unseen categories very well.

But as the saying goes, there’s no free lunch — this applies to hash encoding too. However, sometimes we really need the best of both worlds: a way to encode features efficiently without mixing up the categories we already know.

We’ve mentioned a simple solution: use more buckets. But there’s another clever technique — adding a “salt” to our categories. No, not the kind you cook with, but a special addition to our data that mixes things up. By attaching extra text to our category values, we create a kind of chaos that changes how they’re sorted into buckets. Why does this help? Because with the right amount of shuffling, we can avoid mixing up our known categories in the same bucket.
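Conceptually, salting looks like this. The sketch uses hashlib as a stand-in for HASHROW, so the concrete bucket numbers will differ from what Teradata produces; the point is only that a different salt yields a different grouping of the same categories.

# Appending a salt to the value before hashing reshuffles the bucket assignment.
import hashlib

def salted_bucket(value, salt="", num_buckets=10):
    return int(hashlib.md5((value + salt).encode()).hexdigest(), 16) % num_buckets

for country in ("Italy", "Peru", "Portugal"):
    print(country,
          salted_bucket(country),             # unsalted bucket
          salted_bucket(country, "SeaSalt"))  # bucket after salting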

To get this right, we need to understand a bit of probability theory — don’t worry, it’s not as scary as it sounds. Think about the birthday paradox, which shows us how likely it is for people in a group to share a birthday. It’s a bit like our categories and buckets: the chance of two categories ending up in the same bucket (a “collision”) depends on how we shuffle them and how many buckets we have. With the right adjustments, we can keep our known categories from colliding, making our data easier to work with and our models more accurate. Let’s explore how this technique can help us manage our categories more effectively, even when they’re numerous or new ones show up.

First up, let’s wrap our heads around how likely it is for categories to end up in the same bucket, a.k.a. a collision. The Python function below figures out this probability for us, and the graph shows what happens with different numbers of categories and buckets.

from math import factorial

def category_collision(num_buckets, num_categories):
    """Probability that at least two categories land in the same hash bucket
    (the birthday-paradox formula)."""
    if num_buckets < num_categories:
        return 1.0
    else:
        return 1.0 - (factorial(num_buckets)
                      / (factorial(num_buckets - num_categories)
                         * (num_buckets ** num_categories)))
[Line chart: “Birthday Paradox applied to Category Collision Probability”, plotting the probability of a category collision against the number of hash buckets, with one curve per number of categories (10 to 50). The probability falls as the number of buckets grows, and fewer categories need fewer buckets.]

In our case, we’re zooming in on the scenario with 43 categories because that’s how many different native_countries we have in our dataset. Suppose we decide on 250 as our magic number of buckets. According to the graph (and the math behind it), if we do a simple hash with these 250 buckets, there's a 97.84% chance we'll see at least one overlap.

But what if we don’t settle for just one try? What if we experiment with 100 different ways to assign these buckets by mixing in 100 unique “salts”? This strategy boosts our odds of finding at least one collision-free assignment to 88.71%. Sounds like a plan, right? So, let’s experiment with a full hundred different salts.
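Both probabilities fall straight out of the category_collision function defined above; the 88.71% figure assumes the 100 salted assignments behave independently of each other.

# 43 countries hashed into 250 buckets with a single hash ...
p_single = category_collision(num_buckets=250, num_categories=43)
print(f"P(at least one collision with a single hash): {p_single:.2%}")  # ~97.84%

# ... versus the chance that at least one of 100 independently salted
# assignments is collision-free.
p_salted = 1 - p_single ** 100
print(f"P(some salt is collision-free): {p_salted:.2%}")  # ~88.71%

And here is our list of one hundred salts: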

salts = ['TableSalt',  'SeaSalt',  'HimalayanPinkSalt',  'KosherSalt',  'CelticSeaSalt',  'FleurdeSel',  'BlackSaltKalaNamak',  'RedHawaiianSalt',  'BlackHawaiianSalt', 
'SmokedSalt', 'FlakeSalt', 'SelGris', 'EpsomSalt', 'DeadSeaSalt', 'BolivianRoseSalt', 'PersianBlueSalt', 'AlaeaSalt', 'MaldonSalt', 'MurrayRiverSalt',
'CyprusBlackLavaSalt', 'DanishSmokedSalt', 'ChardonnayOakSmokedSalt', 'HawaiianBambooJadeSalt', 'SicilianSeaSalt', 'PeruvianPinkSalt', 'SelMelange',
'ApplewoodSmokedSalt', 'CherrywoodSmokedSalt', 'VanillaBeanSalt', 'SzechuanPepperSalt', 'LemonFlakeSalt', 'VintageMerlotSalt', 'GhostPepperSalt',
'LavenderRosemarySalt', 'MatchaGreenTeaSalt', 'TruffleSalt', 'PorciniMushroomSalt', 'GarlicSalt', 'OnionSalt', 'CelerySalt', 'HabaneroSalt',
'EspressoSalt', 'CinnamonSpiceSalt', 'IndianBlackSalt', 'BlueCheeseSalt', 'HickorySalt', 'AlderwoodSmokedSalt', 'AnchoChileSalt', 'BasilSalt',
'ChiliLimeSalt', 'ChocolateSalt', 'CoconutGulaJawaSalt', 'CuminSalt', 'CurrySalt', 'FennelSalt', 'GingerSalt', 'HerbesdeProvenceSalt', 'JalapenoSalt',
'LimeSalt', 'MapleSalt', 'OrangeSalt', 'RoseSalt', 'SaffronSalt', 'SageSalt', 'SrirachaSalt', 'SumacSalt', 'TurmericSalt', 'WasabiSalt',
'WhiskeySmokedSalt', 'WineSalt', 'YuzuSalt', "ZaatarSalt", 'SmokedApplewoodSalt', 'BeechwoodSmokedSalt', 'NorwegianSeaSalt', 'BrittanySeaSalt',
'CornishSeaSalt', 'IcelandicSeaSalt', 'KoreanBambooSalt', 'MalaysianPyramidSalt', 'MexicanSeaSalt', 'NewZealandSeaSalt', 'PortugueseSeaSalt',
'SouthAfricanSeaSalt', 'SpanishSeaSalt', 'ThaiFleurdeSel', 'VikingSmokedSalt', 'WelshSeaSalt', 'YakimaApplewoodSmokedSalt', 'OakSmokedSalt',
'PinkPeppercornSalt', 'LemonHerbSalt', 'ChipotleSalt', 'BourbonBarrelSmokedSalt', 'AguniSeaSalt', 'AmabitoNoMoshioSeaweedSalt',
'BlackTruffleSeaSalt', 'CaviarSalt', 'HarvestSalt', 'HawaiianRedAlaeaSalt', 'ItalianBlackTruffleSalt', 'JapaneseMatchaSalt',
'OliveSalt', 'PumpkinSpiceSalt', 'RosemarySalt', 'ShiitakeMushroomSalt', 'SicilianWhiteSalt', 'TibetanRoseSalt']

Next, we’ll set up a temporary table listing all the distinct countries. We’ll tweak our earlier function to consider our chosen “salt” by tacking it onto the end of each country name. Then, we’ll run a check to see if we’ve managed to dodge any collisions with our new, salt-enhanced hashing method.

tdml.execute_sql("""
CREATE MULTISET VOLATILE TABLE countries_t AS
    (SELECT native_country FROM census_income GROUP BY native_country)
WITH DATA NO PRIMARY INDEX
ON COMMIT PRESERVE ROWS
""")

DF_countries = tdml.DataFrame("countries_t")

def get_feature_hashbucket_salted(column_name, num_buckets=10, salt=""):
    # append the salt to the column value before hashing, then bucket as before
    return f.abs(f.from_bytes(f.hashrow(
                     f.concat(DF[column_name].expression, salt)),
                 "base10").cast(type_=tdml.INTEGER)) % num_buckets

# one derived column per salt, each with 250 buckets
my_kwargs = {f"native_country_{salt}": get_feature_hashbucket_salted("native_country", 250, salt)
             for salt in salts}

DF_countries_hashbucket = (DF_countries
                           .assign(**my_kwargs))
[Screenshot: DF_countries_hashbucket converted to pandas, with .nunique().sort_values().tail(10) listing salted columns (native_country entries appended with a salt name) and the number of distinct hash buckets each one occupies.]

Great news: we’ve got options on the table! Just like picking between table salt, kosher salt, matcha green tea salt, olive salt, or fennel salt to flavor our dishes, we can choose our “salt” for hashing to get that perfect, collision-free categorical encoding. And the best part? We don’t need a massive number of buckets to make it happen. It’s all about your preference now, like choosing the right seasoning for your meal.
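One way to pick such a salt programmatically is sketched below: pull the small result set into pandas and keep the salted columns whose number of distinct bucket values equals the number of countries. It continues from the DF_countries_hashbucket DataFrame above.

# Find the salts that give a collision-free encoding: their columns have as
# many distinct bucket values as there are countries.
buckets_pd = DF_countries_hashbucket.to_pandas()
n_countries = buckets_pd.shape[0]  # one row per distinct country

salted_cols = [c for c in buckets_pd.columns if c.startswith("native_country_")]
unique_buckets = buckets_pd[salted_cols].nunique()

collision_free_salts = unique_buckets[unique_buckets == n_countries].index.tolist()
print(len(collision_free_salts), collision_free_salts[:5])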

Sure, it might sound like extra steps to take, but it’s absolutely worth it when you’re aiming to fine-tune your model or speed things up in production, especially when there are strict performance requirements to meet. Think of it as the secret ingredient that could give your model the edge it needs, ensuring it runs smoothly and quickly, just when you need it to.

Summary

Hashing might not be a familiar concept to everyone, but understanding and leveraging it can be a real game-changer — if you know how to use it and when. In this blog post, we’ve taken a deep dive into how hashing works and why it’s so important, especially when dealing with huge amounts of data in Teradata Vantage. We explored four key use cases: pseudonymizing data to protect privacy, splitting data sets for model training, and two ways of encoding categorical data to make it easier for machines to understand.

We started with the basics, showing how hashing turns any input into a fixed-size string, a bit like giving every piece of data its own unique fingerprint. This process is crucial for handling data quickly and safely. From there, we saw how hashing helps keep personal information private, ensures data is divided fairly for machine learning, and simplifies complex data into a format that’s easy to work with, even introducing a clever “salt” trick to avoid mixing up different pieces of data.

Overall, we have shown that while hashing might seem a bit technical or obscure, it’s actually a powerful tool in data science. It can make big data tasks more manageable, secure, and efficient, proving its value across a range of scenarios. So, the next time you’re working with data, consider how hashing might help you achieve your goals more effectively.

Don’t hesitate to try it out yourself by signing up for ClearScape Analytics Experience for free: https://www.teradata.com/experience.

All the code of this blog post can be found here: https://github.com/martinhillebrand/ClearscapeCookbook/blob/main/05_Hashing/hashyourway-notebook.ipynb

Feedback and Questions

Share your thoughts, feedback, and ideas in the comments below, and explore resources available on the Teradata Developer Portal and Teradata Developer Community.

About the author

Martin is a Senior Data Scientist at Teradata. With his blend of psychology and computer science, our author brings an interdisciplinary perspective to the field of data science. Currently based in Munich, Germany, Martin has integrated his curiosity about human cognition with technical skills in data analysis. His journey includes work with big data sets in psychology and beyond, as well as conducting cognitive research with great apes. Connect with Martin on LinkedIn.
