PII Data Detection BIO EDA

Neslihan Ozer Yildiz
6 min read · Feb 14, 2024


Related notebook: https://www.kaggle.com/code/neslihanozeryildiz/pii-bio-eda-2

Image by janjf93 from Pixabay

The security of personally identifiable information (PII) has become an increasingly pressing issue in today’s digital age. Organizations working with sensitive data, educational datasets in particular, are taking significant steps toward detecting and removing PII. The Kaggle competition “PII Data Detection” aims to bring together data scientists and researchers to develop solutions in this area.

Kaggle Competition: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data

In this article, we aim to conduct an Exploratory Data Analysis (EDA) for the PII Data Detection Kaggle competition. This analysis will help participants to better understand the dataset and delve deeper into the problem domain.

The “PII Data Detection” competition focuses on detecting and removing personally identifiable information (PII). Participants will strive to develop models that accurately identify and remove PII from the text data provided in the dataset.

During this analysis, we will perform some basic exploratory analyses to help participants gain a better understanding of the dataset. We will cover topics such as the overall description of the dataset, its size and structure, distribution of labels, the ratio of PII data, basic statistics of text data, and visualizations including PII data distributions.

EDA

import json
import pandas as pd

# Load the competition's training data: a list of annotated documents.
train = json.load(open('/kaggle/input/pii-detection-removal-from-educational-data/train.json'))

We begin by loading the training dataset provided by the competition from the ‘train.json’ file, located in the directory ‘/kaggle/input/pii-detection-removal-from-educational-data’. The result is stored in the ‘train’ variable.
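Before going further, it can help to glance at what a single record looks like. Here is a minimal sketch; only the ‘tokens’ and ‘labels’ fields are used later in this article, so treat any other field names printed by ‘keys()’ as dataset-specific details rather than assumptions made here.

print(len(train), "documents")
print(train[0].keys())       # fields available in one record
print(train[0]['tokens'][:10])
print(train[0]['labels'][:10])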

labels = []
tokens = []
for i in train:
    # Keep only the tokens whose tag is not 'O', i.e. tokens that belong to a PII entity.
    labels.extend([j for j in i['labels'] if j != 'O'])
    tokens.extend([k for j, k in zip(i['labels'], i['tokens']) if j != 'O'])

Next, we create two empty lists, ‘labels’ and ‘tokens’, which will hold the labels and tokens, respectively. We then iterate over each item in the ‘train’ dataset and, for each item, over its pairs of labels and tokens. If the label is not ‘O’ (i.e., the token is part of a PII entity), we append the label to the ‘labels’ list and the corresponding token to the ‘tokens’ list.

data = pd.DataFrame({'labels':labels,'tokens':tokens})
data.head()

We use the ‘labels’ and ‘tokens’ lists to create a Pandas DataFrame named ‘data’. This DataFrame contains two columns, ‘labels’ and ‘tokens’, which hold the PII labels and their corresponding tokens, respectively. The ‘head()’ function displays the first few rows of the DataFrame. These steps cover the basic procedure of loading, processing, and initially inspecting our dataset.

Label Distribution

We can see the distribution of labels in the dataset in the following graph:

Label Distribution
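The chart itself is not reproduced here. A minimal sketch that would produce a similar bar chart from the ‘data’ frame (assuming matplotlib is available, as it is on Kaggle) might look like this:

import matplotlib.pyplot as plt

# Count how often each non-'O' label occurs and draw a bar chart.
label_counts = data['labels'].value_counts()
label_counts.plot(kind='bar', figsize=(10, 4), title='Label Distribution')
plt.xlabel('label')
plt.ylabel('count')
plt.tight_layout()
plt.show()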

Distribution of B and I Labels

This competition uses BIO tags, a tagging scheme commonly used in natural language processing tasks such as Named Entity Recognition (NER).
- B (Beginning): marks the first token of an entity.
- I (Inside): marks a token that continues an entity started by a B tag.
- O (Outside): marks tokens that are not part of any entity.

Distribution of B and I Labels
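This chart is also not reproduced here; a small sketch that would compute the same split, assuming the BIO prefix is everything before the first ‘-’ in a tag (e.g. ‘B-NAME_STUDENT’ → ‘B’):

# 'B-NAME_STUDENT' -> 'B', 'I-NAME_STUDENT' -> 'I'
bio_prefix = data['labels'].str.split('-').str[0]
print(bio_prefix.value_counts())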

The consecutive sequences of ‘B’ and ‘I’ tags (i.e., multi-token entities) are then merged, and their tag prefixes are stored in a DataFrame column named ‘BIO’. The length of each merged sequence is calculated and a unique ID is assigned to it; two new columns, ‘lengths’ and ‘ids’, holding these values are added to the original ‘data’ DataFrame.
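The notebook’s exact implementation is not shown in this article. The following is one possible way to perform this merging, using the column names ‘BIO’, ‘lengths’, and ‘ids’ mentioned above; the ‘length_data’ frame used in the code further below is assumed to be the resulting per-entity table:

# Give each B-started run of tags a running id, then aggregate the runs.
data['BIO'] = data['labels'].str.split('-').str[0]            # 'B' or 'I'
data['ids'] = data['labels'].str.startswith('B-').cumsum()    # new id at every B tag
data['lengths'] = data.groupby('ids')['tokens'].transform('size')

# One row per merged entity (label of the run, joined tokens, token count).
length_data = (data.groupby('ids')
                   .agg(labels=('labels', 'first'),
                        tokens=('tokens', lambda t: ' '.join(t)),
                        lengths=('tokens', 'size'))
                   .reset_index())
length_data.head()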

Token Lengths

Longest token length: 11
Shortest token length: 1

Most Frequently Used Tokens
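A one-line sketch of how such a frequency view could be produced from the ‘data’ frame:

# The 20 most frequently occurring PII tokens in the training data.
print(data['tokens'].value_counts().head(20))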

Average Token Counts

Average number of tokens over the dataset before merging the B and I tags:

def calculate_token_count(document_tokens):
    return len(document_tokens)

document_token_counts = data['tokens'].apply(lambda x: calculate_token_count(x))

average_token_count = document_token_counts.mean()

print("Average Number of Tokens in Documents:", average_token_count)
Average Number of Tokens in Documents: 7.617378605330413

Average number of tokens over the dataset after merging the B and I tags:

def calculate_token_count(document_tokens):
    return len(document_tokens)

document_token_counts = length_data['tokens'].apply(lambda x: calculate_token_count(x))

average_token_count = document_token_counts.mean()

print("Average Number of Tokens in Documents:", average_token_count)
Average Number of Tokens in Documents: 13.696762141967621

length_data

After combining tags B and I, the token lengths are as follows:

Longest Tokens and Character Counts

Charts (not reproduced here) show the character lengths of the longest entries per label: the 20 longest names; the 5 longest URLs, emails, ID numbers, and usernames; and the 10 longest phone numbers and addresses.
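A rough sketch of how the underlying tables could be built, assuming the merged ‘length_data’ frame from the earlier sketch:

# Character length of every merged entity, then the longest entries per label.
length_data['char_len'] = length_data['tokens'].str.len()
for label, group in length_data.groupby('labels'):
    top = group.nlargest(5, 'char_len')[['tokens', 'char_len']]
    print(f"\n{label} - longest entries:")
    print(top.to_string(index=False))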

The analysis of the longest tokens and their character counts for each label provided valuable insights into the characteristics of our text dataset. Examining the longest tokens for each label demonstrates the diversity and complexity of the texts in our dataset.

This analysis also offers important clues for text processing models, particularly in determining the maximum token lengths that our model should consider when processing text data. By doing so, we can enable our model to learn more effectively and enhance its performance.
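As a rough illustration of that point (not part of the original notebook), one could check how well a candidate maximum length covers the documents; the value 512 below is just an assumed example, not a recommendation from the competition:

# How many documents would fit entirely within an assumed max length of 512 tokens?
doc_lengths = pd.Series([len(d['tokens']) for d in train])
print("Longest document:", doc_lengths.max(), "tokens")
print("Share of documents within 512 tokens:", f"{(doc_lengths <= 512).mean():.1%}")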

In conclusion, analyzing the longest tokens and their character counts deepens our understanding of the text dataset and supports better decisions both in future analyses and in the modeling process.

Thank you for your time and attention!
