CodeX
Published in

CodeX

K-Means Clustering

Photo by Billy Huynh on Unsplash

We can apply k-means clustering to partition data points into k different groups. Along with the data, the number of clusters “k” is an input to the algorithm.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process. It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each dataset belongs to only one group that has similar properties.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and insights from the given data. It can be compared to learning which takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using an unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.

The goal of unsupervised learning is to find the underlying structure of the dataset, group that data according to similarities, and represent that dataset in a compressed format.

  • Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm is never trained upon the given dataset, which means it does not have any idea about the features of the dataset. The task of the unsupervised learning algorithm is to identify the image features on their own. Unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to similarities between images.

The unsupervised learning algorithm can be further categorized into two types of problems:

Clustering: Clustering is a method of grouping the objects into clusters such that objects with the most similarities remain in a group and have fewer or no similarities with the objects of another group. Cluster analysis finds the commonalities between the data objects and categorizes them as per the presence and absence of those commonalities.

Association: An association rule is an unsupervised learning method that is used for finding the relationships between variables in the large database. It determines the set of items that occur together in the dataset. Association rule makes marketing strategy more effective. Such as people who buy X items (suppose a bread) are also tend to purchase Y (Butter/Jam) items. A typical example of the Association rule is Market Basket Analysis.

Below is the list of some popular unsupervised learning

  • K-means clustering
  • KNN (k-nearest neighbors)
  • Hierarchical clustering
  • Anomaly detection
  • Neural Networks
  • Principal Component Analysis
  • Independent Component Analysis
  • Apriori algorithm
  • Singular value decomposition

K-Means Clustering

The term “k-means” was first used by James MacQueen in 1967 as part of his paper on “some methods for classification and analysis of multivariate observations”. The standard algorithm was also used in bell labs as part of a technique in pulse code modulation in 1957. it was also published in 1965 by E. W. Forgy and typically is also known as the Lloyd-Forgy method.

k-means allows us to cluster the data into different groups and a convenient way to discover the categories of groups in the unlabeled dataset on its own without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats the process until it does not find the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best value for K center points or centroids by an iterative process.
  • Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create a cluster.

Hence each cluster has data points with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step 0: Get the dataset

Step 1: Select the number K to decide the number of clusters.

Step 2: Select random K points or centroids. (It can be other from the input dataset).

Step 3: Assign each data point to their closest centroid, which will form the predefined K clusters.

Step 4: Calculate the variance and place a new centroid of each cluster.

Step 5: Repeat the third steps, which means reassign each datapoint to the new closest centroid of each cluster.

Step 6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step 7: The model is ready.

Not knowing the value of K

There is no way of knowing the number of clusters with K-means. So what you can do is start with K=1. Then increase the value of K (up to a certain upper limit). Usually, the variance (the summation of the square of the distance from the “owner” center for each point) will decrease rapidly. After a certain point, it will decrease slowly. When you see such behavior, you know you’ve overshot the K-value.

Bad initial guess

If your initial guess is bad, you cannot expect the algorithm to work well.

The best way is to run K-means on several random initial guesses. Then, pick the final centers which have the least variance.

Another trick is to pick centers in a certain manner:

  1. Place the first center on a data point
  2. Place the second center on a data point that is farthest from the first
  3. Place the third center on a data point that is farthest from the first and second
  4. So on.

Where can we apply K-means?

K-means algorithm can typically be applied to data that has a smaller number of dimensions, is numeric, and is continuous. Think of a scenario in which we want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such a scenario.

Here are some of the interesting use cases of the K-means,

1. Document classification: Cluster documents in multiple categories based on tags, topics, and the content of the documents. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarities in document groups.

2. Delivery store optimization: Optimize the process of good delivery using truck drones by using a combination of k-means to find the optimal number of launch locations and a genetic algorithm to solve the truck route as a traveling salesman problem.

3. Identifying crime localities: With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into a crime-prone area within a city or a locality. With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into a crime-prone area within a city or a locality.

4. Customer/Market Segmentation: Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. The classification would help the company target specific clusters of customers for specific campaigns.

5. Fantasy league stat analysis: Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a crucial role to play here. As an interesting exercise, if you would like to create a fantasy draft team and like to identify similar players based on player stats, k-means can be a convenient option.

6. Insurance fraud detection: Machine learning has a critical role to play in fraud detection and has many applications in automobile, healthcare, and insurance fraud detection. Utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that state fraudulent patterns. Since insurance fraud can have a multi-million dollar impact on a company, the ability to detect frauds is crucial.

7. Rideshare data analysis: The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is helpful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.

8. Cyber-profiling criminals: Cyber profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene.

9. Call record detail analysis: A call detail record is a piece of information captured by telecom companies during the call, SMS, and internet activity of a customer. this information provides greater insights about the customer’s needs when used with customer demographics. We can cluster customer activities for 24 hours by using the unsupervised k-means clustering algorithm. It is used to understand segments of customers with respect to their usage by hours.

10. Automatic clustering of IT alerts: Large enterprise IT infrastructure technology components such as network, storage, or database generate large volumes of alert messages. because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Gursimar Singh

Gursimar Singh

Author @ freeCodeCamp | DevOps Consultant @ OmniTiim | DevOps | Cloud Computing | Data Science and more