K-Means
Clustering Algorithm For Machine Learning
K-Means clustering is one of the core ideas of machine learning. In the more familiar setting, a machine learns from data containing both “x” and “y”, where “y” is the target value that the model later learns to predict. But there are use cases where no “y” value is known in advance.
For instance, imagine your mother sends you to buy vegetables from the market and tells you which shops she usually buys from. In that case, where you buy the vegetables depends on the shops she has already visited.
Now suppose you have moved to a new city. Your mother again sends you to buy vegetables, this time with no previous data. You now choose by analysing the shops yourself, checking their hygiene, prices, and so on, and grouping them: shops with good prices, shops with fresh green vegetables, and so on.
In both cases, buying vegetables is “x”. In the first case we already know the previous output “y”, so we can use traditional (supervised) machine learning methods, depending on the dataset. In the second case we know only “x” and nothing else; this is where K-Means clustering comes into play.
Before proceeding, let's be clear about the two cases that commonly occur in data mining:
Supervised Learning (SL): the ML model is trained using a set of inputs (predictors) and desired outputs (targets).
Unsupervised Learning (UL): UL is used when the target is not known and the objective is to infer patterns or trends in the data that can inform a decision.
Let us look more closely at UL and K-Means, along with some of its use cases.
What is K-Means Clustering?
Technically, K-Means clustering is an algorithm for unsupervised machine learning problems, where the dataset contains only input variables and no known outputs.
To analyse such datasets, we divide the data into multiple homogeneous groups so that each group, whose members share similar properties, can be studied on its own. This division is based on the features you are looking at in your data.
For the example above, we may categorize the shops by price, hygiene level, etc. Notice that each shop belongs to exactly one subgroup. These categories are known as clusters. K-Means is an approach that lets the computer categorize the data for us. The number of clusters you want to divide the data into is the value of K. If K = 2, we have two subgroups.
How the K-means algorithm works
So far we know where and why we use the concept of clusters. K-Means is the algorithm used to create these clusters.
The process begins by randomly selecting some data points to serve as the initial centroids. The number of centroids equals the number of clusters you wish to divide your data into. These are the initial cluster centers.
The next step is to calculate the distance from each data point in the dataset to each of the chosen centroids. We then assign every data point to its nearest centroid. Next, we take the mean of each cluster as its new centroid and repeat the above process until the assignments stop changing.
Since the resulting clusters depend on the random starting points and may not be the best possible, we repeat the whole procedure with different random starting points and choose the best clustering attempt, typically the one with the lowest total variation within the clusters.
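The steps above can be sketched in plain Python. This is only a minimal illustration, assuming Euclidean distance and points stored as tuples; the function names (`assign`, `update`, `kmeans_once`, `kmeans`), the number of restarts, and the sample points are all made up for this example.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:  # empty cluster: keep the old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids


def kmeans_once(points, k, rng, max_iter=100):
    """One run of K-Means from a random start."""
    centroids = rng.sample(points, k)  # k data points picked as initial centroids
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    # Total squared distance of every point to its own centroid.
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


def kmeans(points, k, restarts=10, seed=0):
    """Repeat K-Means from several random starts and keep the tightest result."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[2])


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids, wcss = kmeans(points, k=2)
print(labels, centroids, round(wcss, 3))
```

The "best attempt" here is taken to be the run with the smallest sum of squared distances between points and their centroids; other criteria are possible.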
How to decide the value of K?
The next obvious question is what decides the value of K. For some datasets it may be clear how many clusters are required, but for others it may not be. In such situations, we try K = 1, 2, and so on. K = 1 is the worst-case scenario, with the largest total variation within clusters; we quantify each larger K by comparing its variation with that of the previous value.
Hence, “K” can be treated as a hyperparameter that is adjusted to obtain the best possible result. Plotting the within-cluster variation against K and picking the point where the reduction levels off, the “elbow” of the curve, is known as the elbow method, a way to choose the optimum number of clusters.
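To make the elbow idea concrete, here is a rough sketch that computes the within-cluster sum of squares (WCSS) for several values of K. It assumes a compact, fixed-iteration K-Means written only for this illustration; `best_wcss`, its parameters, and the nine sample points are hypothetical.

```python
import math
import random


def best_wcss(points, k, restarts=20, iters=50, seed=0):
    """Lowest within-cluster sum of squares found over several random starts."""
    rng = random.Random(seed)
    best = math.inf
    for _ in range(restarts):
        centroids = rng.sample(points, k)
        for _ in range(iters):
            # Assignment step: group points by nearest centroid.
            groups = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
                groups[nearest].append(p)
            # Update step: each centroid becomes the mean of its group
            # (an empty group keeps its previous centroid).
            centroids = [
                tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                for j, g in enumerate(groups)
            ]
        wcss = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
        best = min(best, wcss)
    return best


# Made-up points arranged in three tight groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]
curve = {k: best_wcss(points, k) for k in range(1, 7)}
for k, w in curve.items():
    print(k, round(w, 2))
```

On data like this, the WCSS should fall sharply up to K = 3 and only slightly afterwards, which is the "elbow" one looks for. On real data the bend is often less clear, so the choice of K remains a judgment call.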
Advantages of K-Means Clustering :
K-Means offers several advantages for unsupervised data mining:
- Among the faster algorithms for clustering data.
- Easy to implement and apply.
- Tends to produce tight clusters, particularly when the clusters are roughly spherical.
- Scales to large datasets, since the cost of each iteration grows linearly with the number of data points.
- Offers effective ways to initialize the centroids.
- Successfully used in domains such as market segmentation, computer vision, fraud detection, etc.
K-Means and its use-cases
K-Means clustering can be used in many fields. Creating clusters helps with fraud detection, document classification, criminal detection, customer segmentation, etc.
K-Means for document classification:
This algorithm can help organize documents by grouping them into separate clusters, which is also referred to as document clustering. Unfortunately, the algorithm cannot work on raw text directly, so each text document is first converted into a vector of numbers, which can be done using Term Frequency-Inverse Document Frequency (TF-IDF).
After preprocessing the available dataset, the K-Means clustering algorithm can be applied to the vectors. It creates clusters based on the similarity of the documents.
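As a small illustration of the text-to-vector step, the sketch below computes the plain tf × log(N / df) form of TF-IDF by hand. It assumes whitespace tokenization and non-empty documents; libraries typically use smoothed or normalized variants, so their numbers will differ. The function names and the four sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a TF-IDF vector over the shared vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[term] / len(tokens)) * math.log(n_docs / df[term])
            for term in vocab
        ])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [
    "cheap fresh vegetables market",
    "fresh vegetables low price",
    "telecom call record billing",
    "call record telecom fraud",
]
vocab, vectors = tfidf_vectors(docs)
print(round(cosine(vectors[0], vectors[1]), 3))  # two vegetable documents
print(round(cosine(vectors[0], vectors[2]), 3))  # vegetable vs. telecom document
```

Documents that share informative terms end up with similar vectors, and these numeric vectors are what a clustering algorithm such as K-Means would then group.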
K-Means for CDR analysis.
CDR stands for Call Detail Record. It is the record gathered by telecom providers, and it contains details of the calls, SMS messages, and internet services used by customers.
It holds the details of a phone call placed through the telephone exchange, including an automated record of the length of each call. CDRs are created by the telephone exchanges’ billing systems. The CDRs are saved by the transmitter exchange until the call ends. This information gives deeper insight into the client's needs.
Most telecom corporations use CDR data for impostor (fraud) detection by clustering user profiles, for reducing customer churn based on usage activity, and for targeting profitable customers using RFM (recency, frequency, monetary) analysis.
Once the data has been gathered by the API system, it is retrieved by the data analysts. The data is preprocessed by applying various optimization and error detection techniques, along with Exploratory Data Analysis (EDA), the method of analyzing the data visually. EDA involves outlier detection, exception detection, and missing value detection.
K-Means is used to create clusters of records that share certain similarities. It is applied to “total activity” and “activity hours” to find the usage pattern with respect to the activity hours. The elbow method is used to find an optimal number of clusters for K-Means.
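One way to picture the “total activity” and “activity hours” features is the small aggregation below. It is only a sketch under assumed conventions: the record layout (hour of day, calls, SMS, internet sessions), the function name `hourly_activity`, and the sample rows are invented for illustration and do not come from any real CDR dataset.

```python
from collections import defaultdict


def hourly_activity(records):
    """Sum calls, SMS and internet sessions per hour of day.

    Returns a sorted list of (hour, total_activity) pairs, the kind of
    two-column feature table a clustering step could work on.
    """
    totals = defaultdict(int)
    for hour, calls, sms, internet in records:
        totals[hour] += calls + sms + internet
    return sorted(totals.items())


# Made-up CDR rows: (hour of day, calls, sms, internet sessions).
records = [
    (2, 1, 0, 3), (2, 0, 1, 1),
    (9, 5, 2, 8), (9, 4, 1, 6),
    (20, 7, 3, 12), (20, 6, 4, 10),
]
features = hourly_activity(records)
print(features)
```

Because hours and activity counts live on different scales, it is usually sensible to rescale such features before clustering them.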
This reveals the pattern the data follows over the given time period. It is used to understand customer segments with respect to their usage by hour. For example, a customer segment with high activity may generate more revenue, while customer segments with high activity in the night hours might be fraudulent ones.
Using this clustering tool, you can discover which clusters bring the most traffic to the telecom network in terms of total activity. Similarly, you can add more information, such as square grid and country code, to understand which square grids are likely to create more revenue and more traffic for the telecom network, and to target high-value customers based on their geo-location.
Implementation
Let us begin the implementation of the K-Means algorithm by applying it to customer segmentation. Customer analysis is a major part of market management. It gives insight into customer behavior: likes, dislikes, satisfaction, loyalty, etc. To find patterns in the customer data we use the unsupervised learning method, K-Means.
In this implementation, we take a dataset that contains the satisfaction level and loyalty of customers. The data is first preprocessed, which includes visual analysis of the data using libraries. Then the K-Means algorithm is used to create clusters. The entire code is available here.
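For readers who want a feel for what such a pipeline might look like, here is a minimal sketch. It is not the code linked above: it assumes scikit-learn and NumPy are installed, uses six made-up rows in place of the real dataset, and picks K = 2 arbitrarily.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up rows: [satisfaction score, loyalty score]; not the article's dataset.
X = np.array([
    [2, 1.0], [3, 1.5], [2, 2.0],   # less satisfied, less loyal customers
    [9, 8.0], [8, 9.0], [9, 9.5],   # more satisfied, more loyal customers
])

# Put both features on the same scale so neither dominates the distances.
X_scaled = StandardScaler().fit_transform(X)

# n_init=10 repeats K-Means from 10 random starts and keeps the best run.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)

print(labels)                  # cluster index assigned to each customer
print(model.cluster_centers_)  # centroids in the scaled feature space
print(model.inertia_)          # within-cluster sum of squares
```

In practice the value of K would be chosen by comparing `inertia_` across several values rather than fixed up front.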
K-Means clustering can thus be used in many IT domains, for example to analyze incoming traffic to a website or to detect patterns in a dataset. There are other clustering methods as well, but K-Means remains one of the simplest and most widely used among them.
Don’t forget to follow The Lean Programmer Publication for more such articles, and subscribe to our newsletter tinyletter.com/TheLeanProgrammer