Building Blocks for AI Part 2: Clustering and Classification

5 min read · Nov 16, 2023

In the last post, I talked about some core areas needed for setting up your AI/ML roadmap like vectorization, similarity detection, and sentiment analysis. Two additional aspects that will form the core of any AI/ML roadmap are clustering and classification.

Clustering involves organizing data into meaningful groups based on logical criteria. For instance, when dealing with movie data encompassing genres, descriptions, and actors, clustering facilitates the grouping of similar movies. This clustering can be leveraged to recommend related movies to end users. Another practical application is employing clustering algorithms on user data to enable targeted promotions and marketing strategies for specific customer segments.

Classification, on the other hand, is a supervised learning technique designed to assign predefined class labels to input data. The objective is to learn a mapping from input features to specific classes or categories. Straightforward examples include classifying emails as spam or non-spam and categorizing loan applications as low-risk, medium-risk, or high-risk. Image classification is another common use case, where images are assigned to predefined classes.

Clustering

Clustering is an unsupervised learning technique that groups similar data points into clusters based on their inherent patterns or similarities. Clustering does not require labeled data for training. It identifies patterns and similarities in the data without predefined categories.

Popular clustering algorithms include K-means, hierarchical clustering, and DBSCAN. K-means partitions the data into K clusters by minimizing the sum of squared distances between each data point and the centroid of its assigned cluster. Hierarchical clustering builds a tree-like hierarchy of clusters by repeatedly merging or splitting them. DBSCAN is a density-based clustering algorithm that groups together data points lying close to each other in dense regions and treats points in sparse regions as noise.
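To make the difference between the three algorithms concrete, here is a minimal sketch that runs each of them on the same toy data using scikit-learn. The points and the DBSCAN parameters (eps, min_samples) are made up for illustration and are not part of the insurance example used later in this post.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two tight groups of points plus one isolated point (made-up data).
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
    [9.0, 0.0],
])

# K-means and hierarchical clustering are told how many clusters to produce.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(points)

# DBSCAN instead works from density; eps and min_samples are illustrative values.
dbscan_labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

print("K-means:     ", kmeans_labels)
print("Hierarchical:", hier_labels)
print("DBSCAN:      ", dbscan_labels)  # a label of -1 marks a point treated as noise
```

K-means and hierarchical clustering must place every point in one of the two requested clusters, while DBSCAN is free to leave the isolated point unassigned.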

Let’s take a closer look at the K-means clustering approach and code.

1. Place K points randomly on the graph. These will serve as the initial center points, or centroids, of the K clusters.
2. Assign each item (point) on the graph to one of the K clusters, based on its closeness to the cluster's centroid (the centroid at the shortest distance wins).
3. When all items have been assigned to a cluster, recalculate the centroid of each cluster as the mean of the items assigned to it (the point that minimizes the sum of squared distances to those items).
4. Repeat steps 2 and 3 until there is no scope for improvement (the centroids no longer move).

https://en.wikipedia.org/wiki/K-means_clustering
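The four steps above can be sketched directly in NumPy. This is a minimal illustration and not the scikit-learn implementation used later; as an assumption for simplicity, it picks K existing data points as the starting centroids instead of arbitrary random positions, and the function name and toy points are made up for the example.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two clearly separated groups of made-up points.
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
                   [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On data this well separated, the loop settles after a few iterations; on real data the result can depend on the starting centroids, which is why library implementations usually run several initializations.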

Sample data

Age,Income_Group,Gender,Marital_Status
30,Medium,Female,Married
48,High,Female,Married
27,Low,Male,Single
35,Medium,Female,Single
42,High,Female,Married
59,High,Male,Married
65,Low,Male,Single
52,Medium,Female,Single
45,Medium,Female,Married
26,Low,Male,Married
58,Medium,Male,Single
50,Low,Female,Single

Code

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# Step 1: Read Data from CSV
data = pd.read_csv('insurance_leads.csv')
# Step 2: Data Preprocessing (One-Hot Encoding for Categorical Features)
data = pd.get_dummies(data, columns=["Income_Group", "Gender", "Marital_Status"], drop_first=True, dtype=int)
# Scale the features so that Age (a much larger number than the 0/1 columns) does not dominate the distances
scaled_features = StandardScaler().fit_transform(data)
# Step 3: Choose the Number of Clusters
k = 4  # You can adjust this value based on your analysis or objectives
# Step 4: Apply K-Means Clustering (n_init set explicitly so results do not depend on the scikit-learn version)
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
data['Cluster'] = kmeans.fit_predict(scaled_features)
# Step 5: Map numeric cluster labels to meaningful labels
cluster_labels = {
    0: "Cluster A",
    1: "Cluster B",
    2: "Cluster C",
    3: "Cluster D"
}
data['Cluster_Label'] = data['Cluster'].map(cluster_labels)
# Step 6: Print the resulting DataFrame with cluster labels
print(data)
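The value of k above is picked by hand. One common way to sanity-check that choice is to compare silhouette scores across a few candidate values; the sketch below does this on the sample rows shown earlier. The range of k values tried is an arbitrary assumption, and with only 12 rows the scores should be read as illustrative, not conclusive.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# The same sample rows shown above, inlined so the snippet runs on its own.
data = pd.DataFrame({
    "Age": [30, 48, 27, 35, 42, 59, 65, 52, 45, 26, 58, 50],
    "Income_Group": ["Medium", "High", "Low", "Medium", "High", "High",
                     "Low", "Medium", "Medium", "Low", "Medium", "Low"],
    "Gender": ["Female", "Female", "Male", "Female", "Female", "Male",
               "Male", "Female", "Female", "Male", "Male", "Female"],
    "Marital_Status": ["Married", "Married", "Single", "Single", "Married", "Married",
                       "Single", "Single", "Married", "Married", "Single", "Single"],
})
features = pd.get_dummies(data, columns=["Income_Group", "Gender", "Marital_Status"],
                          drop_first=True, dtype=int)
scaled = StandardScaler().fit_transform(features)

# Try a few candidate values of k and record the silhouette score for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
best_k = max(scores, key=scores.get)
print("Highest silhouette score at k =", best_k)
```

A silhouette score closer to 1 suggests tighter, better-separated clusters, but the business meaning of the segments still matters when settling on k.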

Classification

Classification is a supervised learning approach that requires a labeled dataset for training, where each data point is associated with a known class or category. Classification algorithms include decision trees, random forests, support vector machines, logistic regression, and neural networks, among others.

Decision trees are a popular choice for classification tasks. They work by recursively splitting the dataset into subsets, at each step choosing the feature that best separates the classes (typically measured by a criterion such as Gini impurity or entropy). The result is a tree-like structure where each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf holds a class label (see the illustration linked below).

https://towardsdatascience.com/understand-decision-tree-classifier-8a7497d4c5b3
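To see the node, branch, and leaf structure in text form, here is a small sketch that fits a shallow tree and prints it with scikit-learn's export_text. The rows reuse lead scores and times from the sample data later in this post; the shortened column name Time_Spent and the max_depth value are choices made only for this illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A few rows with two numeric features and a binary label (1 = converted).
X = pd.DataFrame({
    "Lead_Score": [80, 65, 45, 70, 60, 75, 55, 50],
    "Time_Spent": [10, 15, 5, 20, 8, 18, 12, 7],
})
y = [1, 1, 0, 1, 0, 1, 0, 0]

# Keep the tree shallow so the printed structure stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each indented line is a test on a feature; "class:" lines are the leaves.
tree_text = export_text(tree, feature_names=list(X.columns))
print(tree_text)
```

Note that scikit-learn splits numeric features on a threshold (for example, "less than or equal to some value" versus "greater"), so each node has two branches instead of one branch per possible value.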

Random forests are an ensemble learning method based on decision trees: they train a collection of trees, each on a random sample of the data and features, and combine their predictions by majority vote. A support vector machine (SVM) is, in its basic form, a binary classifier that finds the hyperplane separating data points of different classes with the maximum margin. Logistic regression models the probability that a data point belongs to a particular class as a function of the input features.
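As a rough side-by-side, the sketch below fits all three models on the same synthetic dataset with scikit-learn. The data comes from make_classification and the hyperparameters are defaults or arbitrary picks, so the accuracies it prints say nothing about how these models would rank on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

# All three share the same fit/score interface.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracies[name]:.2f}")

# Logistic regression also exposes a class probability for each sample.
probabilities = models["Logistic regression"].predict_proba(X_test[:3])
print(probabilities)
```

The shared interface makes it cheap to try several classifiers on the same problem before committing to one.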

Deep learning models, such as convolutional neural networks (CNNs) for image classification and recurrent neural networks (RNNs) for sequential data, have achieved state-of-the-art performance in many classification tasks.

Let's work through an example. Say we have leads data with a lead source, a lead score, the time spent by the salesperson, and the final result (whether or not the lead was converted to a sale). Based on this, we want to build a decision tree-based model that predicts whether a future lead is likely to convert.

Sample data

Lead_Source,Lead_Score,Time_Spent (minutes),Conversion
Website,80,10,1
Referral,65,15,1
Social Media,45,5,0
Email Campaign,70,20,1
Website,60,8,0
Referral,75,18,1
Email Campaign,55,12,0
Social Media,50,7,0
Email Campaign,40,14,0
Website,85,9,1

Code

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset from the CSV file
data = pd.read_csv('lead_conversion_data.csv')

# Features (X) and target variable (y)
X = data[["Lead_Source", "Lead_Score", "Time_Spent (minutes)"]]
y = data["Conversion"]

# Convert lead source to numerical values using one-hot encoding
X = pd.get_dummies(X, columns=["Lead_Source"], drop_first=True)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree classifier (fixed random_state for reproducible results)
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Evaluate the classifier's performance
accuracy = accuracy_score(y_test, y_pred)
# Pass labels explicitly: with a test set this small, one class may be absent,
# and the report would otherwise fail to match it against target_names
report = classification_report(
    y_test, y_pred, labels=[0, 1],
    target_names=["No Conversion", "Conversion"], zero_division=0
)

# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:\n")
print(report)
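The code above evaluates the model but does not yet score a future lead, which was the stated goal. Here is a minimal sketch of that last step, with the sample rows inlined so it runs on its own. The new lead's values are invented for the example, and the tree is trained on all ten rows, which is an assumption made to keep the snippet short.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The sample rows from above, inlined so the snippet runs on its own.
data = pd.DataFrame({
    "Lead_Source": ["Website", "Referral", "Social Media", "Email Campaign", "Website",
                    "Referral", "Email Campaign", "Social Media", "Email Campaign", "Website"],
    "Lead_Score": [80, 65, 45, 70, 60, 75, 55, 50, 40, 85],
    "Time_Spent (minutes)": [10, 15, 5, 20, 8, 18, 12, 7, 14, 9],
    "Conversion": [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})
X = pd.get_dummies(data.drop(columns="Conversion"), columns=["Lead_Source"],
                   drop_first=True, dtype=int)
y = data["Conversion"]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# A hypothetical new lead (values invented for this example).
new_lead = pd.DataFrame({
    "Lead_Source": ["Referral"],
    "Lead_Score": [78],
    "Time_Spent (minutes)": [16],
})

# Encode it, then align its columns with the training columns; dummy columns
# that do not occur in this single row are filled with 0.
new_X = pd.get_dummies(new_lead, columns=["Lead_Source"], dtype=int)
new_X = new_X.reindex(columns=X.columns, fill_value=0)

prediction = clf.predict(new_X)[0]
print("Predicted conversion:", prediction)
```

The reindex step matters because one-hot encoding a single row produces only the columns for the categories present in that row, while the model expects exactly the columns it was trained on.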

Conclusion

Clustering and classification are two core tools in machine learning for organizing data. Clustering, an unsupervised learning approach, groups records by similarity without needing labels. This is particularly useful for segmenting users for targeted marketing campaigns, grouping similar items as the foundation of a recommendation engine, and organizing similar products.

Classification, a supervised learning technique, assigns records to predefined categories. For instance, it can determine whether a transaction is fraudulent or legitimate, predict machinery failures for preventive maintenance, predict diseases, and categorize sentiment.

Together, clustering and classification are two powerful tools for data analysis and informed decision-making.

All the code examples shared in this post are available in the following code repository.

https://github.com/kamalmeetsingh/MLExamples

Written by Kamalmeet Singh