A Beginner's Guide to Unsupervised Learning

Mathanraj Sharma · Published in Analytics Vidhya · Aug 6, 2019 · 5 min read

Whenever someone talks about machine learning, they will start by telling you that supervised learning, unsupervised learning, and reinforcement learning are its main broad categories.

But what is really behind the technique called unsupervised learning? Come, let us see with some examples.

Most of the time we build models to predict or forecast something. That type of technique is well known as supervised learning: we already know the labels and the patterns in the data.

But unsupervised learning is a bit different: we train our models to find the hidden patterns in the data, so that unseen items can be labeled in the future based on what was learned. We do this

  1. without a specific prediction task in mind,
  2. and sometimes to perform dimensionality reduction on large data sets.

Let me walk you through an example by using a “Fish Measurement” data set. This data set consists of:

  • species of fish
  • the weight of the fish
  • length1 (length of the fish from the nose to the beginning of the tail, in centimeters)
  • length2 (length of the fish from the nose to the notch of the tail, in centimeters)
  • length3 (length of the fish from the nose to the end of the tail, in centimeters)
  • height (maximum height of the fish, in centimeters)
  • width (maximum width of the fish, in centimeters)

The task here is to cluster the fish into their correct species. Unfortunately, we have no idea how these features are correlated, so labeling the fish with the correct species ourselves is a hard task. So what can we do?

This is where unsupervised learning comes in handy. There are many well-established algorithms such as K-Means clustering, hierarchical clustering, and DBSCAN, and we can also build our own clustering models with neural networks if needed. In this article I am not going to explain those algorithms; for simplicity, we will use K-Means clustering in our example.
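The code below assumes the measurements are loaded into a pandas DataFrame called samples (the six numeric columns) and a Series called species (the true labels, kept aside only to check our work). Here is a minimal loading sketch; the file name fish.csv and the column name species are assumptions, so adjust them to match your copy of the data.

import pandas as pd

# Hypothetical loading step: fish.csv is assumed to contain a 'species'
# column plus the six numeric measurements described above.
fish = pd.read_csv('fish.csv')

species = fish['species']                 # kept aside for evaluation only
samples = fish.drop(columns=['species'])  # weight, length1-3, height, width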

Identifying the Possible Number of Clusters in Data

Before we train our model, we should know how many different species (clusters) we are going to label. So how can we figure that out? One way is to get an idea from the use case itself: if we are going to cluster apples and oranges based on some attributes, we know there are 2 clusters. Likewise, we can get some idea from stakeholders. But a more systematic way is to use the inertia. Inertia is the sum of squared distances from each sample to its closest cluster center, so the smaller the inertia, the denser the clusters.

First, we cluster the data with different numbers of clusters and plot number of clusters vs. inertia.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    # Fit model to samples
    model.fit(samples)
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)

# Plot ks vs. inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
Inertia Graph

As we can see, the inertia gets lower and lower as the number of clusters increases. So what will be the best number of clusters?

  • A good clustering should have tight (low-inertia) clusters,
  • but not too many clusters.
  • A simple rule of thumb is to find the elbow of the graph.

So, in our fish clustering case, we pick 4 clusters: the point where the inertia curve stops dropping steeply.
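If you prefer numbers over eyeballing the plot, a small sketch like the one below (reusing the ks and inertias computed above) prints how much each extra cluster reduces the inertia; the improvement shrinks sharply once we pass the elbow. This is only an illustrative heuristic; inspecting the plot by eye is the usual approach.

import numpy as np

# How much does each extra cluster reduce the inertia?
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print(f"{k - 1} -> {k} clusters: inertia drops by {drop:.1f}")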

Checking the Quality of Clusters

Now that we know the number of clusters, let's build a model and visualize the result.

model = KMeans(n_clusters=4)
model.fit(samples)
labels = model.predict(samples)

# Let's see how the species are clustered based on weight and height
xs = samples.iloc[:, 0]  # weight column
ys = samples.iloc[:, 4]  # height column
plt.scatter(xs, ys, c=labels, alpha=0.5)

# Mark the centroid (mean) of each cluster
centroids = model.cluster_centers_
centroids_x = centroids[:, 0]
centroids_y = centroids[:, 4]
plt.scatter(centroids_x, centroids_y, marker='D', s=50)
plt.show()
model-1

Visually, the data seem perfectly clustered. The diamond-shaped markers are the center points (means) of each cluster. But remember, our eyes can cheat us!

Let us double-check the quality by counting how the actual species are distributed across the predicted clusters. A simple way to do this is with the pandas crosstab method.

# Create a DataFrame of the predicted labels and the species they actually were
df = pd.DataFrame({'labels': labels, 'species': species})

# Do a crosstab evaluation to verify the quality of our clustering
ct = pd.crosstab(df['labels'], df['species'])
print(ct)
Crosstab View of Labels and Species

As we can see, the clustering is not perfect. Some of the Bream fish are clustered under labels 0, 1, and 2, and the other species are also wrongly split across two or more clusters (labels).

So what might be the problem here? Yes, you are correct: the features are measured on different scales, so their means and variances differ widely, and that distorts the distance-based clustering. So how can we solve this and improve the model?
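You can confirm this quickly by looking at the spread of each feature; a small sketch (assuming samples is the DataFrame of measurements used above):

# Typically the weight column spans a much larger numeric range than the
# centimeter-based measurements, which lets it dominate the Euclidean
# distances that K-Means relies on.
print(samples.describe().loc[['mean', 'std']])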

Feature Scaling and Normalization

These are two different techniques used to solve this issue.

1 — Standardization transforms each feature so that it has a mean of 0 and a standard deviation of 1, bringing every feature onto the standard scale.

2 — Normalization, on the other hand, rescales all feature values to lie between 0 and 1.
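To make the difference concrete, here is a tiny sketch contrasting scikit-learn's StandardScaler and MinMaxScaler on a single toy column (the toy values are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

toy = np.array([[2.0], [4.0], [6.0], [8.0]])  # one made-up feature

print(StandardScaler().fit_transform(toy).ravel())  # mean 0, std 1
print(MinMaxScaler().fit_transform(toy).ravel())    # rescaled into [0, 1]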

In our example, we are going to standardize the data and see how the cluster quality changes. There are many utilities available in scikit-learn to do this; we will use the StandardScaler class to achieve our goal.

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

scaler = StandardScaler()
kmeans = KMeans(n_clusters=4)
# Standardize the features, then cluster, in a single pipeline
pipeline = Pipeline([('Scaler', scaler), ('KMeans', kmeans)])
pipeline.fit(samples)
labels2 = pipeline.predict(samples)
df2 = pd.DataFrame({'labels': labels2, 'species': species})
ct2 = pd.crosstab(df2['labels'], df2['species'])
print(ct2)
Crosstab View after Standardization

Hooray, the clusters now look much better than before. We can still improve the result by adding more data, tuning the hyperparameters of KMeans, and trying other feature-scaling techniques. I have covered a lot in this article, so get your hands dirty and try all these concepts out yourself.
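For instance, one possible way to compare different KMeans settings is the silhouette score from scikit-learn (higher means better-separated clusters). The sketch below is only an illustration; the n_init values and random_state are arbitrary choices:

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Score a couple of KMeans configurations on the standardized data.
scaled = StandardScaler().fit_transform(samples)
for n_init in (10, 50):
    km = KMeans(n_clusters=4, n_init=n_init, random_state=0)
    candidate_labels = km.fit_predict(scaled)
    print(n_init, silhouette_score(scaled, candidate_labels))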

Oops, don’t forget to give a star for the Github repo.
