Body Performance Project-2.4

Unsupervised Learning: Clustering Analysis

Daniel Chiebuka Ihenacho

In the previous post, we explored exploratory data analysis (EDA); now we continue our routine with unsupervised learning: clustering analysis.

What is unsupervised learning?

This is simply a machine learning paradigm that deals with unlabelled data. It is primarily aimed at knowledge extraction and the discovery of hidden structures or patterns within the data, allowing it to group data points, reduce dimensionality, or perform other tasks without prior knowledge of what the output should look like.

Unsupervised learning can be used to uncover hidden patterns via clustering algorithms such as K-Means, and the resulting cluster labels can then be appended back onto the dataset for further machine learning tasks or exploratory data analysis.
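To make that concrete, here is a minimal sketch, on toy data of my own (not from this project), of fitting K-Means on unlabelled points and appending the discovered labels:

# A minimal sketch: K-Means on unlabelled toy data, with the discovered
# cluster labels appended for further EDA or modelling (illustrative only)
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

toy = pd.DataFrame(np.random.rand(100, 2), columns=["x1", "x2"])
toy["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy)
toy.head()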

Kinds of unsupervised learning

  1. Unsupervised transformations: These are algorithms that transform the existing data into a new representation that may be easier for humans and/or other machine learning algorithms to work with than the original one. A common task is dimensionality reduction using PCA (Principal Component Analysis); see the sketch after this list.
  2. Clustering algorithms: These partition the data into distinct clusters/groups whose members are similar to each other. Some common clustering algorithms are K-Means, Hierarchical Clustering, and DBSCAN.
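For the first kind, a minimal sketch of PCA as an unsupervised transformation, again on toy data of my own:

# A minimal sketch of an unsupervised transformation: projecting toy data
# onto its first two principal components (illustrative only)
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)  # 100 samples, 5 features
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)  # (100, 2)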

Applications of unsupervised learning

  • Customer segmentation
  • Image compression
  • Anomaly detection
  • Recommendation systems
  • Data visualisation
  • Genomic data analysis

Now that you know a bit about unsupervised learning, let’s get into the project. The dataset and code can be found in my repo. This dataset is a continuation from the previous article, so you won’t need to do any data cleaning.

Here’s our clustering analysis process:

  1. Load the data
  2. Identify features that require normalisation and normalise them
  3. Apply the clustering algorithm: K-Means
  4. Evaluate the number of clusters
  5. Create a pipeline
  6. Visualise the clusters
  7. Append the cluster labels to the normalised dataset
# Load the data
import pandas as pd

df = pd.read_csv("classification_dataset.csv")
df.sample(random_state=42, n=5)
Sampled data
to_be_transformed1 = ["weight_kg", "body_fat_percent", "grip_force",
                      "sit_ups_counts", "diastolic", "systolic"]

to_be_transformed2 = ["age", "bmi"]

# Convert the column names to positional indexes for the ColumnTransformer
columns = df.columns.to_list()
indexes1 = [columns.index(column) for column in to_be_transformed1]
indexes2 = [columns.index(column) for column in to_be_transformed2]

The columns identified above require some form of normalisation.

# Preprocessing algorithms
import numpy as np
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Clustering algorithms
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer


def log_transform(x):
    # log1p computes log(1 + x), which is safe for zero values
    return np.log1p(x)


columns_to_scale = df.copy()  # data
standard_scaler = StandardScaler()  # scaler
function_transformer = FunctionTransformer(log_transform)  # log transform for skewed features

column_transformer = ColumnTransformer(
    transformers=[
        ('standard_scaling', standard_scaler, indexes1),
        ('functional_transformer', function_transformer, indexes2)
    ],
    remainder='passthrough'
)

scaled = column_transformer.fit_transform(columns_to_scale)
scaled[:5]  # a raw NumPy array, not quite readable
Scaled data

Having transformed the dataset, the resulting array isn’t quite readable, so we convert it into a Pandas DataFrame for better readability. Kindly refer to the code and image below, which address this.

# Converting the scaled data into a DataFrame. With remainder='passthrough',
# the ColumnTransformer outputs the transformed columns first and the
# passthrough columns last, so the column names must be reordered to match.
ordered_columns = to_be_transformed1 + to_be_transformed2
ordered_columns += [c for c in columns if c not in ordered_columns]
scaled_frame = pd.DataFrame(scaled, columns=ordered_columns)
scaled_frame.sample(random_state=42, n=5)
Scaled data converted to a DataFrame

Clustering Evaluation

To evaluate the clustering, the elbow curve is employed. The elbow method is a technique for determining the optimal number of clusters for the K-means algorithm, which requires us to specify the number of clusters beforehand and can therefore be challenging to configure. Using the Yellowbrick package, we can easily inspect a range of cluster counts and determine which number would suit the dataset best.

As seen from the elbow curve below, five clusters would be the optimal number for grouping our data into distinct categories.

km = KMeans(random_state=42)
visualiser = KElbowVisualizer(km, k=(2, 10), metric='distortion')
visualiser.fit(scaled)
visualiser.show()
K-means evaluation using Elbow Curve
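The SilhouetteVisualizer imported earlier can serve as a complementary check on the elbow’s suggestion; a minimal sketch, assuming the same scaled array and the five clusters suggested above:

# A complementary check with the SilhouetteVisualizer imported earlier,
# assuming the elbow's suggestion of five clusters
sil_model = KMeans(n_clusters=5, random_state=42)
sil_visualiser = SilhouetteVisualizer(sil_model, colors='yellowbrick')
sil_visualiser.fit(scaled)
sil_visualiser.show()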

What is a Pipeline?

Quoting from Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido:

Pipeline is a class that allows gluing together multiple processing steps into a single scikit-learn estimator. The Pipeline class itself has fit, predict and score methods and behaves just like any other model in scikit-learn.

Pipelines offer a seamless and convenient way of connecting each step of a machine learning process. Once the machine learning process has been tested and approved, you can embed most of the steps into a pipeline.

As seen below, having figured out the number of clusters, the column transformer and K-means algorithm have been tied together and embedded into a single pipeline.

# Bringing scaling and clustering together
from sklearn.pipeline import Pipeline

my_pipe_line = Pipeline(
    [
        ("my_column_transformer", column_transformer),
        ("my_cluster", KMeans(random_state=42, n_clusters=5))
    ]
)

result = my_pipe_line.fit(columns_to_scale)
my_cluster = result.named_steps['my_cluster']
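Since the fitted pipeline behaves like any other scikit-learn estimator, it can also assign clusters in a single call; here we simply re-apply it to the training data as a quick demonstration:

# predict runs scaling and cluster assignment in one call
predicted_clusters = my_pipe_line.predict(columns_to_scale)
print(predicted_clusters[:10])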


import seaborn as sns
import matplotlib.pyplot as plt


def cluster_plots(x: int, y: int, pipeline_cluster: KMeans):
    # Improved version for the pipeline. rc, font_label and font_title are
    # styling objects assumed to be defined earlier in this series.
    sns.set_theme(rc=rc, style='whitegrid', palette='bright')

    # data points
    sns.scatterplot(x=scaled_frame.iloc[:, x], y=scaled_frame.iloc[:, y],
                    hue=pipeline_cluster.labels_, palette='bright')

    # cluster centres
    sns.scatterplot(x=pipeline_cluster.cluster_centers_[:, x],
                    y=pipeline_cluster.cluster_centers_[:, y],
                    marker='X', s=80, label="centroids", color='red')

    # styling
    plt.xlabel(f"{scaled_frame.columns[x]}", fontdict=font_label)
    plt.ylabel(f"{scaled_frame.columns[y]}", fontdict=font_label)
    plt.title("Cluster plots", fontdict=font_title)


cluster_plots(0, 1, my_cluster)
2D visualisation of clusters

We can also visualise the clusters in 3D with the use of Plotly, as shown below:
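Here is a minimal sketch of one way to produce such a plot with plotly.express; the three axis columns are illustrative choices, not prescribed by the analysis:

# A minimal sketch of a 3D cluster plot with plotly.express; the three
# axis columns are illustrative choices
import plotly.express as px

fig = px.scatter_3d(
    scaled_frame,
    x='weight_kg', y='grip_force', z='sit_ups_counts',
    color=my_cluster.labels_.astype(str),  # cluster labels as categories
    opacity=0.7,
)
fig.update_traces(marker=dict(size=4))
fig.show()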

3D visualisation of clusters
# Combining clusters with the scaled data
cluster = scaled_frame.copy()
cluster['clusters'] = my_cluster.labels_

# Cast the label-like columns to categorical types
cluster['clusters'] = cluster['clusters'].astype('category')
cluster['encoded_class'] = cluster['encoded_class'].astype('int8').astype('category')
cluster['gender'] = cluster['gender'].astype('int8').astype('category')

cluster.sample(n=5, random_state=42)
Appending the cluster labels back into the scaled DataFrame

Conclusion

With the use of the K-means algorithm, hidden patterns have been discovered in the dataset, and cluster labels have been assigned to data points that are similar to each other.

Recommendations

  • Try using a different clustering algorithm
  • Try combining normalisation and PCA with a clustering algorithm; a sketch follows this list
  • Try appending the cluster labels to the original dataset and performing some exploratory data analysis; box plots could give you insights for further observations.
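For the second recommendation, a hedged starting point, reusing the column transformer from earlier with an assumed choice of two principal components:

# A sketch of the second recommendation: normalisation, PCA and K-Means
# chained in one pipeline (n_components=2 is an assumed choice)
from sklearn.decomposition import PCA

pca_pipe = Pipeline(
    [
        ("my_column_transformer", column_transformer),
        ("pca", PCA(n_components=2)),
        ("my_cluster", KMeans(random_state=42, n_clusters=5))
    ]
)
pca_labels = pca_pipe.fit(columns_to_scale).named_steps["my_cluster"].labels_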
