Using unsupervised clustering techniques to extract first/given names from single field name clusters in consumer/mobile applications

R.K. Hari Krishna
Givelify Engineering
10 min read · Oct 21, 2023


I recently worked on what turned out to be an unexpected data-science/data-engineering problem: extracting first names (or an appropriate addressable form of expression) for more personalized communication when engaging with Givelify’s donors. The problem stems from how donors enter their names in our Giving Apps. We have a single, unrestricted field for donors to articulate their names, a design choice intended to minimize user friction and allow a broader spectrum of expression. However, this user-centric design inadvertently magnifies an already intricate challenge. When it comes to names, the variations are manifold: we encounter titles, hyphenated components, initials, nicknames, culturally specific names comprising multiple parts, and generational titles, each adding a layer to the intricate tapestry of name structures. This rich diversity in name representations demanded a meticulous, thoughtful approach to extracting meaningful, personalized identifiers for enhanced engagement with our donors.

What do you call yourself?

Name Structure Analysis: A Fun Exploration of Various Clustering Techniques

Names may seem trivial and straightforward, with most seemingly possessing identifiable first and last names. However, this simplicity is often a misconception. Consider an entry ‘Mr. J. Doe’; should we address them as ‘J’, ‘J. Doe’, or ‘Mr. Doe’ in our correspondence? Or think about a name like ‘Dr. Aarav Singh Chaudhary’; how do we discern between possible multiple last, middle, or compound first names and decide the most respectful and accurate form of address? Rigid, hardcoded rules fall short when navigating this multifaceted landscape of names from various cultures and structures. Therefore, I decided to employ clustering algorithms to discern underlying patterns in how users input their names and to address the inherent complexities and diversities.

Laying the Groundwork: Feature Identification and Normalization

Identifying Features

The first step in any machine learning endeavor is identifying the features. A feature in machine learning is an individual measurable property or characteristic of the observed phenomenon. It serves as a piece of the puzzle that the algorithm uses to learn a pattern. In our scenario, we analyzed every name in our dataset and extracted various features. The chosen features were the number of words, the count of special characters or punctuation, the count of emojis, and the presence of titles or other special markers. Each feature provides a different perspective on the structure of the name, allowing the model to discern underlying patterns in the data. Feature selection is subjective (as it should be), so keep playing around with it to see what fits your needs. If you have trouble thinking of features, pull a random sample of names and see what patterns you can spot; for example, “titles” or “hyphenated names” can each be a feature.


def extract_features(names):
    features = []

    for name in names:
        # Number of words
        word_count = len(name.split())
        # Count of special characters or punctuation
        punct_count = sum(1 for char in name if not char.isalnum() and not char.isspace())
        # Count of emojis
        emoji_count = count_emojis(name)
        # Name length
        name_length = len(name)
        # Presence of a title (e.g., Dr., Rev.)
        religious_title = has_religious_title(name)
        features.append([word_count, punct_count, emoji_count, name_length, religious_title])

    return np.array(features)

def count_emojis(text):
    # Match characters in the supplementary Unicode planes, where most emojis live
    emoji_pattern = re.compile(r'[\U00010000-\U0010ffff]')
    emojis = emoji_pattern.findall(text)
    return len(emojis)
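The has_religious_title helper referenced above isn’t shown in the post; a minimal sketch, assuming a small hand-maintained title list (the list below is illustrative, not our production list), could look like this:

# Sketch only: the title list is an assumption for illustration
TITLES = {"rev", "rev.", "pastor", "dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}

def has_religious_title(name):
    # Returns 1 if the first token looks like a title, 0 otherwise
    tokens = name.split()
    first_token = tokens[0].lower() if tokens else ""
    return 1 if first_token in TITLES else 0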

Normalizing Features

Once the features are identified and extracted, it is crucial to normalize them. Normalization is a technique used to bring all the features to a standard scale. This process is vital because, without it, the scale of one feature might overshadow the others, skewing the learning process and potentially leading to incorrect insights or predictions. By ensuring that each feature contributes equally to the learning process, normalization allows for a more balanced and accurate model, enabling it to make better generalizations about the data it has not seen before.

You can normalize your features in a single step as shown below; keep a reference to the fitted scaler so new names can be transformed the same way later:

# StandardScaler rescales each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

Choosing the Right Tool: Algorithm Selection

Selecting the correct clustering algorithm was pivotal to navigating the vast and varied landscape of name structures. Initially, we wanted insight into the diversity and complexity of the names in our dataset, so we needed an unsupervised algorithm capable of uncovering subtle, underlying patterns and structures within the data. We chose HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a robust and versatile algorithm. One of its significant advantages is that it doesn’t require us to determine the number of clusters beforehand. We also have a large dataset, and HDBSCAN handled it better than DBSCAN in our experience, though DBSCAN remains a reasonable alternative. It excels at discovering groups or ‘clusters’ in the data even when they differ in size, shape, or arrangement, which makes it especially well suited to the diverse and unpredictable nature of name structures.

For us, that initial HDBSCAN run was invaluable in providing a preliminary glimpse into the myriad ways users input their names, laying the groundwork for us to refine the features we wanted to use and to target the clustering in subsequent iterations. However, depending on the nature of your data and project goals, other algorithms, such as K-Means or DBSCAN, might also be worthy contenders, each bringing its own set of strengths, limitations, and assumptions to the table. I chose HDBSCAN over K-Means because I did not know the number of clusters beforehand. If you have a strong sense of the number of clusters (say, if your dataset is not as varied), K-Means might be a good algorithm to go with. Understanding the intricacies of your chosen algorithm and ensuring it aligns well with your data characteristics and project objectives is crucial.

# Clustering using HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=400, min_samples=50, alpha=1.5)
clusters = clusterer.fit_predict(scaled_features)
  • min_cluster_size: the smallest grouping you are willing to call a cluster; the larger this value, the more points end up in the noise cluster.
  • min_samples: how many points are required to form a dense region. Think of this as the core of your cluster; it controls how conservative your clusters are.
  • alpha: another scaling parameter that determines how conservative the clustering is; in effect, it influences whether a set of points is treated as one cluster or two. Alpha defaults to 1.0.

Tweak these values to find the settings that work best for your data; a small parameter sweep like the sketch below can help.
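As a starting point for that tuning, here is a small parameter-sweep sketch. It is not part of the original workflow; it relies on HDBSCAN’s built-in relative validity score (a DBCV-style measure available when gen_min_span_tree=True), and the candidate values below are placeholders, not the values we used:

# Sketch of a parameter sweep; candidate values are illustrative
best_score, best_params = -1.0, None
for min_cluster_size in (200, 400, 800):
    for min_samples in (25, 50, 100):
        candidate = hdbscan.HDBSCAN(
            min_cluster_size=min_cluster_size,
            min_samples=min_samples,
            gen_min_span_tree=True,  # required for relative_validity_
        )
        candidate.fit(scaled_features)
        score = candidate.relative_validity_  # DBCV-style relative validity
        if score > best_score:
            best_score, best_params = score, (min_cluster_size, min_samples)

print(best_params, best_score)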

Discoveries and Insights: Varieties and Patterns

Our exploration unearthed more than 50 distinct types of name structures and variations. These ranged from the conventional ‘First Name, Last Name’ combinations to one-word names, titles with last names like Dr. Doe, single initials, honorifics paired with single initials like Mr. X/Y, and even names interspersed with emojis. We also encountered hyphenated last names, names with generational titles, and culturally rich names with multiple components. It was enlightening to witness the multitude of ways users represented themselves, reflecting our user base’s diversity and richness.

Assigning labels and taking action based on clusters

Now take time to look at the various clusters and discern the pattern in each one. Based on that, you can come up with labels. These are the labels we assigned to our clusters, along with a sample name for each:


cluster_label_samples = {
    "First Name Last Name Cluster": "Michael Burnham",
    "Title First Name Last Name Cluster": "Admiral John Adama",
    "Title First Name Cluster": "Commander Tuvok",
    "Single Name Cluster": "Starbuck",
    "Title First Name and an Emoji Cluster": "Dr. Baltar 🌌",
    "Single Name and an Emoji Cluster": "Athena 💡",
    "Title First Name Abbreviation Cluster": "Admiral H.",
    "Abbreviated First Name and Abbreviated Last Name Cluster": "H K",
    "Abbreviated First Name and Full Last Name Cluster": "M Burnham",
    "First Name and Abbreviated Last Name Cluster": "Michael B."
}

We built a function that looks at the attributes of each cluster’s core (the exemplars returned by HDBSCAN) and compares them against the sample data point we keep for each cluster label. If the average similarity between the exemplars and a label’s sample data point exceeds an acceptable threshold, we assign that label to the cluster.

def assign_labels_to_clusters(clusterer, cluster_label_samples, scaler):
    # Create a dictionary to store the labels assigned to each cluster
    cluster_labels = {}

    # Assign the noise cluster directly
    cluster_labels[-1] = "Noise"

    # Pre-compute the feature vector for each label's sample name,
    # scaled the same way as the features the clusterer was fit on
    label_features = {
        label: scaler.transform(extract_features([sample_name]))[0]
        for label, sample_name in cluster_label_samples.items()
    }

    for cluster_idx, exemplars in enumerate(clusterer.exemplars_):
        label_similarities = {label: 0.0 for label in cluster_label_samples}

        for exemplar in exemplars:
            for label, sample_features in label_features.items():
                # Compute cosine similarity between the exemplar and the label's sample
                similarity = cosine_similarity([exemplar], [sample_features])[0][0]

                # Sum up the similarities for each label
                label_similarities[label] += similarity

        # Get the label with the maximum average similarity
        max_avg_similarity = max(label_similarities.values()) / len(exemplars)
        assigned_label = max(label_similarities, key=label_similarities.get)

        # If nothing matched, mark the cluster as "Unknown"
        if max_avg_similarity <= 0:
            assigned_label = "Unknown"

        # Assign the label to the cluster
        cluster_labels[cluster_idx] = assigned_label

    return cluster_labels

Now every name is in a cluster (including the noise cluster) and each cluster has a label; from that, we could easily decide what to do for the various types of names/clusters.
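As an illustration of that last step, the label-to-action mapping can be as simple as a set of extraction rules keyed by cluster label. The rules below are a sketch using the labels from the table above, not our exact production logic:

# Sketch of label-driven handling; the rules are illustrative assumptions
def extract_addressable_name(name, cluster_label):
    # Keep tokens that contain at least one letter (drops emoji-only tokens)
    words = [w for w in name.split() if any(c.isalpha() for c in w)]

    if not words:
        return None
    if cluster_label in ("First Name Last Name Cluster", "Single Name Cluster"):
        return words[0]
    if cluster_label in ("Title First Name Cluster",
                         "Title First Name Last Name Cluster",
                         "Title First Name and an Emoji Cluster"):
        return words[1] if len(words) > 1 else None  # skip the leading title
    # Noise and abbreviation-heavy clusters: skip personalization
    return None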

Adding New Names as they come along

When introducing a new name, it is crucial to determine which cluster (or noise bucket) it falls under promptly. For this immediate assignment, we deploy a distance-based algorithm. This algorithm calculates the ‘distance’ between the new name and each existing cluster based on our predetermined feature set, considering factors like the number of words, count of special characters, emojis, and the presence of titles. If the calculated distance is within a defined threshold, the new name is assigned to the closest matching cluster, ensuring the consistency and reliability of our clusters.

However, this technique can lead to potential inaccuracies. Hence, to maintain the accuracy and relevance of our clusters, our system performs a comprehensive re-clustering of names every night. It allows for the swift incorporation of new data while maintaining the integrity and accuracy of our clusters through detailed, periodic analysis. This dynamic, two-fold approach enables us to continually optimize our user engagement, catering to the evolving tapestry of names with accuracy and precision.

Through this iterative, evolving process, we maintain high standards of accuracy and relevance in our name clusters and ensure our system’s readiness to adapt to new, unforeseen patterns in name structures, solidifying our commitment to personalized and respectful user experiences.

def assign_new_data_point(new_name, scaler, scaled_features, clusters, threshold=0.5):
    # Extract features for the new name and scale them with the already-fitted scaler
    new_features = extract_features([new_name])
    scaled_new_features = scaler.transform(new_features)

    # For simplicity, we're using Euclidean distance here
    # Calculate distances between the new data point and all existing points
    distances = np.linalg.norm(scaled_features - scaled_new_features, axis=1)

    # If the minimum distance is below the threshold, assign to the corresponding cluster
    if np.min(distances) < threshold:
        return clusters[np.argmin(distances)]

    # Otherwise, label as noise
    return -1
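A quick usage sketch, assuming scaler, scaled_features, clusters, and cluster_labels come from the earlier steps (the example name is made up):

# Hypothetical usage with the objects created earlier in this post
new_name = "Lt. Cmdr. Una"  # example input, not from our dataset
assigned_cluster = assign_new_data_point(new_name, scaler, scaled_features, clusters)
print(assigned_cluster, cluster_labels.get(assigned_cluster, "Noise"))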

Training, Testing, and Validation: Ensuring Reliability

To validate our algorithm, we leveraged ChatGPT to generate creative test data sets populated with names and titles from the Star Trek universe. This playful approach allowed us to rigorously test our model and optimize its performance metrics, ensuring reliable, high-quality clusters.

training_names_dataset =[
"Captain Kirk", "Mr. Spock", "Dr. McCoy", "Lt. Uhura", "Commander Data",
"Lt. Cmdr. Riker", "Captain Picard", "Ensign Chekov", "Nurse Chapel",
"Captain Janeway", "Seven of Nine", "Gul Dukat", "Ambassador Sarek",
"Q 🌌", "Commander Worf ⚔️", "🚀 Warp Speed Scotty 🚀", "Morn",
"Counselor Troi", "Odo", "Dr. Crusher", "Quark 💰", "Captain Sisko",
"Lt. Cmdr. La Forge", "Ensign Harry Kim", "Admiral Nechayev",
"Subcommander T'Pol", "Lieutenant Yar", "Cadet Wesley Crusher",
"Guinan 🍸", "Ensign Ro Laren", "Commander Kira", "Grand Nagus Zek 💎",
"Rom 💡", "Lt. Ezri Dax", "Captain Archer", "T'Kuvma", "Saru",
"Michael Burnham", "Dr. Phlox", "Lieutenant Paris", "Admiral Adama",
"Commander Tuvok", "Starbuck", "Dr. Baltar 🌌", "Chief Tyrol ⚔️",
"🚀 Apollo 🚀", "Six", "Captain Adama", "Athena 💡", "Roslin",
"Commander Apollo", "Lt. Helo", "Colonel Tigh 🍸", "Boomer 💎"
]
test_name_dataset_with_labels = {
"Ronda Paris": "First Name and Last Name Cluster",
"Admiral Adama": "Title with Single Name Cluster",
"Commander Tuvok": "Title with Single Name Cluster",
"Starbuck": "Single Name Cluster",
"Dr. Baltar 🌌": "Title with Single Name and Emoji Cluster",
"Chief Tyrol ⚔️": "Title with Single Name and Emoji Cluster",
"🚀 Apollo 🚀": " Single Name with Emoji Cluster",
"Six": "Single Name Cluster",
"Captain Adama": "Title with Single Name Cluster",
"Athena 💡": "Single Name with Emoji Cluster",
"Roslin": "Single Name Cluster",
"Jeffery Apollo": "First Name and Last Name Cluster",
"Lt. Helo": "itle with Single Name Cluster",
"Colonel Tigh 🍸": "Title First Name Emoji Cluster",
"Boomer": "Single Name Cluster"
}

How did we test (and verify that names are labeled properly)?

We used `pytest` not only for unit and functional testing but also to validate the clustering and labeling algorithms. We built tests to ensure the following (a minimal sketch follows the list):

  • At least a few clusters are formed.
  • A noise cluster is formed.
  • Labels are assigned to the correct clusters.
  • When a new name is added, it is assigned to the right cluster.
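Here is a minimal pytest sketch of the first two checks. It reuses extract_features and training_names_dataset from above; the HDBSCAN parameters and expectations are illustrative, not our actual test suite:

# Sketch of the cluster-formation checks; parameters are illustrative
import hdbscan
import pytest
from sklearn.preprocessing import StandardScaler

@pytest.fixture
def fitted_clusterer():
    features = extract_features(training_names_dataset)
    scaled = StandardScaler().fit_transform(features)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=3, min_samples=2)
    clusterer.fit(scaled)
    return clusterer

def test_some_clusters_are_formed(fitted_clusterer):
    # Ignoring noise (-1), we expect at least a couple of real clusters
    assert len(set(fitted_clusterer.labels_) - {-1}) >= 2

def test_noise_cluster_is_formed(fitted_clusterer):
    # With this small, varied dataset, some names should land in noise
    assert -1 in fitted_clusterer.labels_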

Final Thoughts, Lessons Learned, Next Steps:

Extracting names for better personalization is not a binary problem. Even with unsupervised machine learning techniques, I could not identify a personalized name for every one of our donors. However, I felt justified that machine learning was the right approach, even for what might seem like a rules-based problem.

Furthermore, as you look to internationalize your products, using unsupervised techniques will help uncover latent opportunities to truly personalize your customers’ experience. Utilizing these machine learning techniques in more products and features is the way to go. It is lightweight and gets the job done!

Other Information to help get started:

These are the libraries that I imported:

import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
import emoji
import pandas as pd
import re

The following packages had to be installed:

pip install emoji 
pip install hdbscan
pip install numpy
pip install pandas

P.S. I would love to hear feedback, other techniques and methods used to solve a similar problem space, or whether you have applied this technique in other areas.

R.K. Hari Krishna
Givelify Engineering

VP of Technology at Givelify; Electrical Engineer; Tinkerer; Technology with purpose; Advocate for inclusive engineering culture