Unsupervised learning with K-means
The Portuguese version of this article is available in Aprendizado não supervisionado com K-means
This article is about unsupervised learning, the machine learning methods which uses unlabeled data. In unsupervised learning there is a technique called Clustering used for cluster data that have similar characteristics. Sometimes these techniques can categorize data to enable the use for supervised learning. The grouping of data can be calculated in different ways but here I explain the K-means algorithm, a partitional clustering.
This is one of the most simple clustering methods. It tries to split data in K (a predefined number) clusters, according to the distance of each data point to something called centroid. A centroid is a sort of prototype for the cluster. In most techniques, a random data point is chosen to define a centroid’s initial coordinates, so that the position of this data point and the new centroid is the same. This is repeated until there are K centroids. Then each instance of data is attributed to the nearest centroid. In the next iterations the centroids positions are calculated through the average distance between all points attributed to that centroid in the last iteration. The algorithm ends when the centroids positions don’t change or the distance change is smaller than a pre-defined threshold.
To creating clusters with K-means in Python is very simple too. Let's see the code:
But, how to choose the correct number of clusters? The answer for this is subjective, there are a lot of methods for calculating the optimal number of clusters and two of this methods I'll explain bellow, Elbow and Average Silhouette.
In this method you have to run your clustering algorithm with some values, for example,1 up to 10 clusters. Compute the cost function, the within-cluster sum of squares, and plot in a graph. The optimal number of clusters is achieved when the addition of a new cluster doesn't significantly increases the cost function. This usually happens on the 'elbow' of the line.
The Silhouette Analysis measures how well a point fits into a cluster. In this method a graph is plotted measuring how close the points in some cluster are to points of the other clusters. When the Silhouette coefficient is next to +1 that means that the points are far way to the other clusters and when it is next to 0 it means that the points are very close or intersecting other clusters.
To calculate the Silhouette coefficient we need to define the mean distance of a point to all other points in its cluster (a(i)) and also define the mean distance to all other points in the closest cluster (b(i)). So, the Silhouette coefficient is:
s(i) = (b(i) - a(i)) / max(b(i), a(i))
To visualize a Silhouette Analysis, it's better to look to entire cluster. For this we can calculate the Mean Silhouette score taking the mean of all examples in the dataset.
Now, let's see the K-means and the method to evaluate the number of clusters in a real dataset. The data used here is about the Brazilian customers e-commerce consume. The code below normalizes the data and plots the graph:
So, let's try to use the k-means algorithm with 5 clusters:
So, this clustering is a bit strange, isn't? Let's evaluate this dataset with Elbow method:
The graph shows us that this dataset doesn't have an apparent "elbow". Maybe the correct number of clusters are between 4 and 6. But maybe the dataset that we built can't be splitted in this way… Let's see the Silhouette coefficient for some values of the number of clusters.
The output of this algorithm is below. The graphs are separated by amount of clusters. The left graphs show us the Silhouette score and the ones on the right has clusters visualization.
For n_clusters = 2 The average silhouette_score is : 0.24604273339845253
For n_clusters = 3 The average silhouette_score is : 0.20670607133321856
For n_clusters = 4 The average silhouette_score is : 0.188444914764597
For n_clusters = 5 The average silhouette_score is : 0.19090629903451375
For n_clusters = 6 The average silhouette_score is : 0.18047082769472864
As you can see, the Silhouette score about the clusters is not so good in any scenario that we plot. The best to do here is to rebuild the dataset with other features. Maybe delete a few of them can improve the results.
Unsupervised learning using Cluster Analysis can be very easy to produce and at the same time very useful and important to a lot of knowledge areas, for example, what segments of customers a company have, to know better ways to advertising themselves or to create better products for each of their segments, besides that, another example is to know about some species in biology, the researchers can cluster the animals, or cells, or "something" according to their characteristics.
It's important to remember that K-means is not the only technique for clustering data. There are other useful methods like Hierarchical Clustering, which is another easy to learn algorithm, or Expectation-maximization, a widely used method by Data Scientists.
The entire code of this article's experiences are in this Github repository.