Body Performance Project-2.4

Unsupervised Learning: Clustering Analysis

Daniel Chiebuka Ihenacho

In the previous post, we explored exploratory data analysis (EDA); now we continue our routine with unsupervised learning: clustering analysis.

What is unsupervised learning?

This is simply a machine learning paradigm that deals with unlabelled data. It is primarily aimed at knowledge extraction and the discovery of hidden structures or patterns within the data, allowing it to group data points, reduce dimensionality, or perform other tasks without prior knowledge of what the output should look like.

Unsupervised learning can be used to uncover hidden patterns via clustering algorithms such as K-Means, and the resulting cluster labels can then be appended back onto the dataset for further machine learning tasks or exploratory data analysis.
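To make that concrete, here is a minimal sketch, on toy data of my own (not from this project), of fitting K-Means on unlabelled points and appending the discovered labels:

# A minimal sketch: K-Means on unlabelled toy data, with the discovered
# cluster labels appended for further EDA or modelling (illustrative only)
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

toy = pd.DataFrame(np.random.rand(100, 2), columns=["x1", "x2"])
toy["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy)
toy.head()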

Kinds of unsupervised learning

  1. Unsupervised transformations: These are algorithms that transform the existing data into a new representation that may be easier for humans and/or other machine learning algorithms to work with than the original one. A common task is dimensionality reduction using PCA (Principal Component Analysis); see the sketch after this list.
  2. Clustering algorithms: These partition the data into distinct clusters/groups whose members are similar to each other. Some common clustering algorithms are K-Means, Hierarchical Clustering, and DBSCAN.
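For the first kind, a minimal sketch of PCA as an unsupervised transformation, again on toy data of my own:

# A minimal sketch of an unsupervised transformation: projecting toy data
# onto its first two principal components (illustrative only)
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)  # 100 samples, 5 features
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)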

Applications of unsupervised learning

  • Customer segmentation
  • Image compression
  • Anomaly detection
  • Recommendation systems
  • Data visualisation
  • Genomic data analysis

Now that you know a bit about unsupervised learning, let’s get into the project. The dataset and code can be found in my repo. This dataset is a continuation from the previous article, so you won’t need to do any data cleaning.

Here’s our clustering analysis process:

  1. Load the data
  2. Identify features that require normalisation and normalise them
  3. Apply the clustering algorithm: K-Means
  4. Evaluate the number of clusters
  5. Create a pipeline
  6. Visualise the clusters
  7. Append the cluster labels to the normalised dataset
# Load the data
import pandas as pd

df = pd.read_csv("classification_dataset.csv")
df.sample(random_state=42, n=5)
Sampled data
to_be_transformed1 = ["weight_kg", "body_fat_percent", "grip_force",
                      "sit_ups_counts", "diastolic", "systolic"]

to_be_transformed2 = ["age", "bmi"]

# Convert the column names to positional indexes for the ColumnTransformer
columns = df.columns.to_list()
indexes1 = [columns.index(column) for column in to_be_transformed1]
indexes2 = [columns.index(column) for column in to_be_transformed2]

The columns identified above require some form of normalisation.

# Preprocessing algorithms
import numpy as np
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Clustering algorithms
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer


def log_transform(x):
    # log1p computes log(1 + x), which is safe for zero values
    return np.log1p(x)


columns_to_scale = df.copy()  # data
standard_scaler = StandardScaler()  # scaler
function_transformer = FunctionTransformer(log_transform)  # log transform for skewed features

column_transformer = ColumnTransformer(
    transformers=[
        ('standard_scaling', standard_scaler, indexes1),
        ('functional_transformer', function_transformer, indexes2)
    ],
    remainder='passthrough'
)

scaled = column_transformer.fit_transform(columns_to_scale)
scaled[:5]  # a raw NumPy array, not quite readable
Scaled data

Having transformed the dataset, the resulting array isn’t quite readable, so we convert it into a Pandas DataFrame for better readability. Kindly refer to the code and image below, which address this.

# Converting the scaled data into a DataFrame. With remainder='passthrough',
# the ColumnTransformer outputs the transformed columns first and the
# passthrough columns last, so the column names must be reordered to match.
ordered_columns = to_be_transformed1 + to_be_transformed2
ordered_columns += [c for c in columns if c not in ordered_columns]
scaled_frame = pd.DataFrame(scaled, columns=ordered_columns)
scaled_frame.sample(random_state=42, n=5)
Scaled data converted to a DataFrame

Clustering Evaluation

To evaluate the clustering, the elbow curve is employed. The elbow method is a technique for determining the optimal number of clusters for the K-means algorithm, which requires us to specify the number of clusters beforehand and can therefore be challenging to configure. Using the Yellowbrick package, we can easily inspect a range of cluster counts and determine which number would suit the dataset best.

As seen from the elbow curve below, five clusters would be the optimal number for grouping our data into distinct categories.

km = KMeans(random_state=42)
visualiser = KElbowVisualizer(km, k=(2, 10), metric='distortion')
visualiser.fit(scaled)
visualiser.show()
K-means evaluation using Elbow Curve
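The SilhouetteVisualizer imported earlier can serve as a complementary check on the elbow’s suggestion; a minimal sketch, assuming the same scaled array and the five clusters suggested above:

# A complementary check with the SilhouetteVisualizer imported earlier,
# assuming the elbow's suggestion of five clusters
sil_model = KMeans(n_clusters=5, random_state=42)
sil_visualiser = SilhouetteVisualizer(sil_model, colors='yellowbrick')
sil_visualiser.fit(scaled)
sil_visualiser.show()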

What is a Pipeline?

Quoting from Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido:

Pipeline is a class that allows gluing together multiple processing steps into a single scikit-learn estimator. The Pipeline class itself has fit, predict and score methods and behaves just like any other model in scikit-learn.

Pipelines offer a seamless and convenient way of connecting each step of a machine learning process. Once the machine learning process has been tested and approved, you can embed most of the steps into a pipeline.

As seen below, having figured out the number of clusters, the column transformer and K-means algorithm have been tied together and embedded into a single pipeline.

# Bringing scaling and clustering together
from sklearn.pipeline import Pipeline

my_pipe_line = Pipeline(
    [
        ("my_column_transformer", column_transformer),
        ("my_cluster", KMeans(random_state=42, n_clusters=5))
    ]
)

result = my_pipe_line.fit(columns_to_scale)
my_cluster = result.named_steps['my_cluster']
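Since the fitted pipeline behaves like any other scikit-learn estimator, it can also assign clusters in a single call; here we simply re-apply it to the training data as a quick demonstration:

# predict runs scaling and cluster assignment in one call
predicted_clusters = my_pipe_line.predict(columns_to_scale)
print(predicted_clusters[:10])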


import seaborn as sns
import matplotlib.pyplot as plt


def cluster_plots(x: int, y: int, pipeline_cluster: KMeans):
    # Improved version for the pipeline. rc, font_label and font_title are
    # styling objects assumed to be defined earlier in this series.
    sns.set_theme(rc=rc, style='whitegrid', palette='bright')

    # data points
    sns.scatterplot(x=scaled_frame.iloc[:, x], y=scaled_frame.iloc[:, y],
                    hue=pipeline_cluster.labels_, palette='bright')

    # cluster centres
    sns.scatterplot(x=pipeline_cluster.cluster_centers_[:, x],
                    y=pipeline_cluster.cluster_centers_[:, y],
                    marker='X', s=80, label="centroids", color='red')

    # styling
    plt.xlabel(f"{scaled_frame.columns[x]}", fontdict=font_label)
    plt.ylabel(f"{scaled_frame.columns[y]}", fontdict=font_label)
    plt.title("Cluster plots", fontdict=font_title)


cluster_plots(0, 1, my_cluster)
2D visualisation of clusters

We can also visualise the clusters in 3D with the use of Plotly, as shown below:
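Here is a minimal sketch of one way to produce such a plot with plotly.express; the three axis columns are illustrative choices, not prescribed by the analysis:

# A minimal sketch of a 3D cluster plot with plotly.express; the three
# axis columns are illustrative choices
import plotly.express as px

fig = px.scatter_3d(
    scaled_frame,
    x='weight_kg', y='grip_force', z='sit_ups_counts',
    color=my_cluster.labels_.astype(str),  # cluster labels as categories
    opacity=0.7,
)
fig.update_traces(marker=dict(size=4))
fig.show()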

3D visualisation of clusters
# Combining clusters with the scaled data
cluster = scaled_frame.copy()
cluster['clusters'] = my_cluster.labels_

# Cast the label-like columns to categorical types
cluster['clusters'] = cluster['clusters'].astype('category')
cluster['encoded_class'] = cluster['encoded_class'].astype('int8').astype('category')
cluster['gender'] = cluster['gender'].astype('int8').astype('category')

cluster.sample(n=5, random_state=42)
Appending the cluster labels back into the scaled DataFrame

Conclusion

With the use of the K-means algorithm, hidden patterns have been discovered in the dataset, and cluster labels have been assigned to data points that are similar to each other.

Recommendations

  • Try using a different clustering algorithm
  • Try combining normalisation and PCA with a clustering algorithm; a sketch follows this list
  • Try appending the cluster labels to the original dataset and performing some exploratory data analysis; box plots could give you insights for further observations.
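For the second recommendation, a hedged starting point, reusing the column transformer from earlier with an assumed choice of two principal components:

# A sketch of the second recommendation: normalisation, PCA and K-Means
# chained in one pipeline (n_components=2 is an assumed choice)
from sklearn.decomposition import PCA

pca_pipe = Pipeline(
    [
        ("my_column_transformer", column_transformer),
        ("pca", PCA(n_components=2)),
        ("my_cluster", KMeans(random_state=42, n_clusters=5))
    ]
)
pca_labels = pca_pipe.fit(columns_to_scale).named_steps["my_cluster"].labels_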
