Hashing Pandas DataFrame Column with Nonce
Your dataset can commonly contain sensitive data in one or more columns. For example, user IDs, patient numbers, or license numbers. Here I share how to create a new column containing hashed strings based on the clear-text strings of the other column of Pandas DataFrame.
Here are the properties of my code:
- It is designed to be unconvertible back to the original clear-text data, even for the data owner. Technically it is possible if you keep the nonces (the random numbers) used to generate the hashes. But if you simply, intentionally or accidentally, re-run the code again, the random numbers will be completely different, and identifying the clear-text from the hash would be very difficult.
- It is designed to generate hashes such that the hash based on the same original clear text will be the same. That is, nonces generated for the same clear-text string are the same. Therefore, although the records are kept anonymous, readers know which rows are from the same (hashed) name/id/thing/type. For example, if the hashed column is hospital’s patient IDs, we can tell which records belongs to the same patient (without knowing who).
This can be useful when you wish to conceal sensitive data before releasing the dataset to public or third party.
The Data
To make it practical, I create a DataFrame (DF) from a real .csv
file, containing several columns. Here I use the results of English football leagues of 2014/2015 season, E0.csv
, which can be downloaded here.
I strip the columns down to only few and create a new hash column based on the column named HomeTeam
(i.e., I suppose HomeTeam
is our sensitive data). In case you are curious, here are descriptions of some columns of the data.
Hashing and Nonce
In short, Hashes can be created by entering a clear text as a parameter to a hash function. The same clear text would generate the same hash value from the same hash function.
Suppose you have a document with hashed ID numbers. If you happen to know someone’s ID, together with the hash algorithm (there are only a handful of the well-known ones), it is easy to identify the hashed records of the known ID.
This is why we need nonces. Nonces are some random strings or numbers added to the original text. With nonces appended or mixed to the original clear-text string as an input to the hash function, knowing only the ID would not be enough to identify the records anymore.
Algorithm
Here is the algorithm of my code.
- Imports and reads the .csv file as a DF (
d0
). Select only few columns asd1
. - Create a list of the data from the sensitive, clear text column (‘
HomeTeam
’ in this case) - Get a unique list of the clear text. The purpose is to generate the same nonce for the same clear text value.
- Generate 2 nonces for each clear text, and added in front and behind the clear text. For example, a clear text
Liverpool
becomesaaaaaaaLiverpoolbbbbbbbb
. - Create a new DF (as
d2
in the code, just to leave the originald1
intact) and create a new hash column. - Generate hashes in hexadecimals (0–9, a-f, and A-F)
- Create new DF (d3) with the original clear text column removed and replaced with the hash.
The Code
import pandas as pd
import hashlib
import random
import stringd0 = pd.read_csv('./E0.csv')
d1 = d0[['Date','HomeTeam','FTR','HTR','Referee']].dropna()
display(d1.head())# Get a unique list of the clear text, as a List
tmplist = list(set(d1['HomeTeam']))# Add some random characters before and after the team name.
# Structured them in a Dictionary
# Example -- Liverpool -> aaaaaaaLiverpoolbbbbbbbbmapping1 = {i : (''.join(random.choice(string.hexdigits) for i in range(12)))+i+(''.join(random.choice(string.hexdigits) for i in range(12))) for i in tmplist}# Create a DF to leave the original DF intact.
d2 = d1.copy()# Create a new column containing clear_text_Nonce
d2['newname'] = [mapping1[i] for i in d2['HomeTeam']]
display(d2.head())# Hash the clear_text+Nonce string
d2['hash'] = [hashlib.sha1(str.encode(str(i))).hexdigest() for i in d2['newname']]
display(d2.head())# To proof that the same clear text has the same Hash string.
display(d2[d2['HomeTeam']=='Chelsea'].tail(30))# Create a new DF with the clear-text column removed.
d3 = d2[['Date','hash','FTR','Referee']].rename(columns={'hash':'HomeTeamHash'})
display(d3.head())
The Output
Here is the original DF (d1
) with clear text.
Here is the new DF (d3
) with hashed strings.