Data mining with k-means clustering

--

Data mining is a process of analyzing and discovering hidden knowledge from large amounts of data. It provides the tools that enable organizations to extract useful information from their data assets.

Data mining techniques are used to uncover correlations between multiple attributes in large databases and interactions between various entities, augmenting the existing database schema.

One data mining method is clustering, and in particular K-means clustering, which is the topic of this article.

Clustering is an important family of methods for data analysis.

Clustering is an unsupervised machine learning technique. It divides a large data set into groups so that members of each group are very similar to one another but different from members of other groups. Because it is unsupervised, it needs no target variable: the groups are derived from the data alone, without labels.

This article discusses some of the most common clustering algorithms and approaches used by analysts and suggests when each might be the better choice.

Business applications of clustering

Clustering is used in many industries and fields. The most common business applications include:

  • segmentation of customers
  • categorization of inventory
  • segmentation of images
  • categorization of texts
  • recommender engines
  • detection of anomalies and fraud

The K-means algorithm is one of the most popular clustering algorithms out there, and its ease of implementation, along with its efficiency, makes it a great choice for beginners.

Of course, when we get into more advanced algorithms, their higher performance guarantees make them more attractive, but K-means is still a good place to start.

K-means algorithm

The K-means algorithm is an unsupervised machine learning algorithm that searches for the best division of the data items into a specified number of clusters, K, such that each data item belongs to exactly one cluster.

The K-means algorithm tries to distribute the data points so that the items within a cluster are relatively close to each other and as distant as possible from the items in other clusters.
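One common way to make this goal precise is the within-cluster sum of squared distances, which K-means tries to minimize. This is the standard textbook formulation rather than something stated above, and the symbols are introduced here only for illustration: C_k is the set of items in cluster k, mu_k is its centroid, and x_i is a data item.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

For a fixed data set, making J small (tight clusters) is equivalent to pushing the cluster centroids far apart, which matches the "close within, distant between" description.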

The algorithm for computing the final set of clusters consists of several steps:

1. choose the number of clusters (parameter K)

2. select K points at random as the initial centroids

3. assign each data item to the nearest cluster, using the distance from the data item to the cluster's centroid as the criterion

4. compute the new coordinates of each cluster's centroid as the average of the coordinates of the cluster's members

5. repeat steps 3 and 4 until the cluster memberships no longer change
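The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters, the Euclidean distance, the choice of data points as initial centroids, and the toy data are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the cluster memberships no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the average of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data (made up for illustration): two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)  # step 1: K is chosen by the caller
print(centroids, labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random initial centroids.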

An example of the clustering computation, adapted from our company blog, is available at https://www.alpha-quantum.com/blog/k-means-clustering/k-means-clustering-from-scratch/.

Drawbacks of the K-means algorithm

Although the K-means algorithm is relatively straightforward to implement and quite fast, it has drawbacks that are important to know.

One is that the results are to some degree sensitive to the initialization of the centroids. The algorithm is also not guaranteed to reach the global optimum; it may stop at a local one.

Another drawback is the implicit assumption of the K-means algorithm that the data in each cluster are distributed spherically around the cluster's centre. This can lead to poor results when the data contain non-spherical clusters or outliers.
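The effect of initialization can be seen on a tiny, made-up data set. The sketch below runs the same assign-and-update loop from two different sets of starting centroids and compares the resulting within-cluster sum of squared distances (SSE). The helper name `lloyd`, the four points, and the starting centroids are assumptions chosen for illustration.

```python
import math


def lloyd(points, centroids, max_iter=100):
    """Run k-means from the given initial centroids; return (labels, sse)."""
    centroids = list(centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # memberships stable: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squared distances for the final clustering.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, sse


# Four corners of a wide, flat rectangle (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Start A: one centroid on each short side of the rectangle.
labels_a, sse_a = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
# Start B: both centroids on the same short side.
labels_b, sse_b = lloyd(points, [(0.0, 0.0), (0.0, 1.0)])

print(labels_a, sse_a)
print(labels_b, sse_b)
```

Start A groups the points into left and right pairs with an SSE of 1.0. Start B converges to top and bottom pairs with an SSE of 100.0. Both runs are stable, so the algorithm stops in each case, but only the first is the global optimum.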
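The sensitivity to outliers follows from the fact that a centroid is an average. The small sketch below, using made-up points and a hypothetical helper named `centroid`, shows how a single distant point drags the centroid away from the rest of the cluster.

```python
import statistics


def centroid(pts):
    """Coordinate-wise mean of a list of points, as used in the update step."""
    return tuple(statistics.fmean(c) for c in zip(*pts))


# A tight cluster around (1, 1) and one far-away point (illustrative values).
cluster = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.0, 1.0)]
outlier = (21.0, 21.0)

clean = centroid(cluster)
skewed = centroid(cluster + [outlier])

print(clean)
print(skewed)
```

Without the outlier the centroid sits at about (1, 1). With it, the centroid moves to about (5, 5), a location where none of the actual cluster members lie.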

We recently had a case like this while working on 800,000 domains taken from the offline database.

Another useful case is for determining web site categories.

A further drawback is that the number of clusters must be provided in advance. The method cannot learn this value from the data.

The K-means algorithm is nevertheless an important tool among machine learning models, applicable in a wide range of cases. We are currently working on using it to find similar stores and search stores by category.

K-means can help answer many questions, e.g. which product category a given product belongs to.

In our next article, the second part on the topic of k-means clustering, we will discuss these drawbacks of the K-means algorithm and how to address them.

--
