# Unsupervised Learning using KMeans Clustering

**Can you group similar customers in the churn data? Hmm... maybe, but similar in what way?**

When choosing a telecommunication service provider, customers usually have many choices. They can choose any service provider and may move away from the current provider. The percentage of customers moving out and disconnecting the service is known as “churn”. It is very important to reduce churn for business growth and customer retention. If the churn is high, the business will continually be in search of new customers without a stable customer base. The performance of the business will be very unpredictable. Businesses try to keep the customers satisfied, to retain them as long as possible. However, in the real world, the customer churn can be as high as 25% annually in the telecommunication industry. Also, the cost of acquiring a new customer is 10 times more than the cost to retain an existing customer. This poses a serious challenge to business owners.

We will apply unsupervised learning methods to customer churn data to group the customers into clusters. The machine learning algorithm will determine these clusters and decide which customer lands in which group.

Data used for this analysis was obtained from Kaggle. The data ranges from demographic information to the types of services being provided. Using this data, clusters or groups of customers can be identified. We will then try to understand whether those groups can be tied to real-world characteristics and figure out in what way the customers in a particular group are similar, which leads the algorithm to group them together.

Note: Data for this analysis is available at https://github.com/microbhai/CustomerChurnAnalysis/blob/master/telecommunication_customer_churn.csv

Code available at: https://github.com/microbhai/CustomerChurnAnalysis/tree/master/UnsupervisedLearningClusteringKMeans

**Unsupervised Learning**

Unsupervised learning models are usually used on data when data scientists don’t have prior knowledge about the categories in the data. Any information pertaining to how the data can be grouped is not present. It is just data with several rows and columns with numerical values, but no information about what rows belong to what group, if any.

Based on the description of the customer Churn data, we know that our data is perfectly labeled. And we have prior knowledge about different categories in the data. For example, we know the customers who churned out, we know the customers who have fiber-optic internet, we know the customers who opted for streaming services for TV and Movies, customers who have tablets, customers who pay their bills by checks, etc. Our data is as labeled as it can be. Although the data provided to us is labeled, in this exercise we will be exploring an unsupervised learning mechanism to figure out clusters in the data.

Using unsupervised learning, solely based on how data points align themselves by the virtue of their values for different fields, we will determine clusters or groups in the data, if any. We may not know the reason behind the clusters and any properties associated with them until we do further analysis. However, the unsupervised learning model will give us multiple clusters and how our data points are labeled into those clusters.

The overall scope of the analysis can be summarized as:

1. Creation of an unsupervised learning model using the available customer churn data to label the data into multiple clusters

2. Analyze the clusters to determine if they align with any customer churn behavior

**KMeans**

Based on the problem summary, we need to implement an unsupervised learning algorithm and use it to label unlabeled customer churn data. One of the most popular clustering techniques is “KMeans”, which is frequently used to solve clustering problems. In the section below, we will look at how the KMeans algorithm works. Let’s start with a visual example to understand the basics. If the data is 2-dimensional, it is easy to plot it on charts and visually identify whether certain points can be grouped or labeled as clusters based on the distance between them. In this scenario, there could be a few points clustered together (near each other) with some space around them and the next cluster located nearby. So, based on proximity, we can visually label the points.

The images below show the visual grouping of data points in a 2-dimensional space.

KMeans is a popular unsupervised clustering algorithm designed to group data into clusters and label data points. It is widely used in applications such as market segmentation, document clustering, image segmentation, and image compression. The algorithm is based on the distance between points in N-dimensional space and relies on finding a centroid for each of the clusters we want to identify in the data. It shuffles the data and randomly chooses as many points as the number of clusters to be identified, assigning them as cluster centers. Afterward, it iteratively calculates the distance between these cluster centers (or centroids) and the data points to group the data points into clusters.

With each iteration, it reassigns the centroids by taking the mean of the points assigned to clusters and progressively calculates a better representation of cluster centers until no further improvement can be made in choosing cluster centers. Once the model is created, and labels are assigned to the existing points, the unsupervised learning model can then be used to label new data points, based on their distance from the centroids of the identified clusters.
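The assign-then-update loop described above can be sketched by hand. The following toy example (hypothetical data, not the churn data set) performs a single KMeans iteration with NumPy: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points.

```python
import numpy as np

# Toy 2-D data: two tight groups of points
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # initial cluster centers

# Assignment step: label each point with its nearest centroid
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Update step: move each centroid to the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(centroids)  # [[0.1  0.05] [5.05 4.95]]
```

A full implementation repeats these two steps until the centroids stop moving (or a maximum iteration count is reached).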

For explanation and demonstration, I will generate data for clustering and apply the KMeans algorithm to label clusters and their centroids in 2D and 3D space. We will generate 50 data points such that the points inherently belong to 3 clusters. Then we will use KMeans to identify 4 clusters in the data and find their centroids. We will plot the results in 2D and 3D space for visual presentation. In the charts below, the red points represent the cluster centers.
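A minimal sketch of this demonstration, using scikit-learn's `make_blobs` to generate the 3-cluster data and `KMeans` to fit 4 clusters (the random seed is an arbitrary choice):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 50 points drawn from 3 inherent clusters, as described above
X, _ = make_blobs(n_samples=50, centers=3, n_features=2, random_state=42)

# Ask KMeans for 4 clusters and label every point
km = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = km.fit_predict(X)

# One centroid per requested cluster, in 2-D
print(km.cluster_centers_.shape)  # (4, 2)
```

For the plots, each point can be drawn with `matplotlib.pyplot.scatter` colored by its label, with `km.cluster_centers_` overlaid in red.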

As can be seen, we started with a bunch of random points in 2D/3D space as input data. The data was biased to have 3 inherent clusters; I did this for demonstration. After running KMeans, we ended up getting our random points labeled into 4 different clusters. Each of those 4 clusters has its own center, marked with a red dot on the plots. We can visually see the centers being identified and the data points plotted in different colors to mark the cluster labels. However, in higher-dimensional space, such visualization is not possible. The concept, computation logic, and algorithm remain the same, but they work over many more dimensions than can be plotted on a chart.

**Data Preparation**

Data preparation starts with an understanding of the available data for analysis. Customer churn data has 40 fields. One of the important tasks is to determine which fields can be used for KMeans analysis. There are categorical data fields like Marital, Gender, etc., and continuous numeric data fields like Tenure, Age, etc. Some of these fields may not be important for the analysis, such as customer ID, interaction, and UID (which are related to customer service interactions). Other unimportant fields are the Latitude and Longitude of the customer and the case order (used as a serial number). We will also check the data for null values; if any are found, they will need to be handled appropriately.

**Categorical Variables**

In our Customer Churn data, a large number of columns are categorical, such as gender, internet service, and phone service. These don’t provide much value when it comes to the KMeans algorithm. Since KMeans is based on the distance between points, categorical data doesn’t fit the model well. We can convert the categorical data into 0s and 1s, but that won’t add any meaningful value to an algorithm which works on the distance between points. Hence, we will drop all the categorical data fields from our analysis.
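With pandas, dropping the categorical fields can be done by selecting only the numeric columns. The miniature DataFrame below is a hypothetical stand-in for the churn data (the real column names may differ):

```python
import pandas as pd

# Hypothetical miniature of the churn data: two categorical, two numeric fields
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "InternetService": ["Fiber Optic", "DSL", "None"],
    "Tenure": [6.8, 1.2, 15.7],
    "MonthlyCharge": [171.4, 242.6, 159.9],
})

# Keep only the continuous numeric fields for the KMeans analysis
numeric_df = df.select_dtypes(include="number")
print(list(numeric_df.columns))  # ['Tenure', 'MonthlyCharge']
```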

**Normalization**

The KMeans algorithm is based on distance, and we need to normalize the values of the data fields before these distances are computed. Normalization brings the field values for different dimensions within a similar range. For example, if the monthly payment for telecom service is between 100 and 400, the age of the customers is between 10 and 90, bandwidth consumed per month is between 100 and 700, and income is between 300 and 260000, these fields are at widely different ranges of values, and an attempt to compute the distance between such points will give distorted results. If we normalize the data and bring it to a similar range of values, the computation of distance makes much more sense in this high-dimensional space, and the corresponding KMeans clustering will be more reliable and usable. The accuracy of prediction increases when points are plotted using similarly scaled dimensions instead of dimensions that vary by orders of magnitude.

**Model Creation**

We have the following variables in customer churn data after the initial data cleanup. All categorical variables have been removed.

Continuous Numeric variables:

‘Population’, ‘Children’, ‘Age’, ‘Income’, ‘Outage_sec_perweek’, ‘Email’, ‘Contacts’, ‘Yearly_equip_failure’, ‘Tenure’, ‘MonthlyCharge’, ‘Bandwidth_GB_Year’

In supervised learning models, the usual practice is to take the available data and split it into training and testing sets. The model is trained on the training set, then predictions are made on the test set. Afterward, to check the accuracy of the prediction, the predicted values and actual values on the testing set are compared. If possible, tuning of the model is executed to increase the accuracy of prediction.

Contrary to supervised learning models, in unsupervised clustering models there are usually no labels present in the data. It is the algorithm that labels the data with cluster labels, so there is nothing to compare against for determining accuracy. The step of splitting the data into training and testing sets and comparing predictions to assess accuracy therefore doesn’t apply, which makes the task of clustering relatively simple. However, the basis of clustering is the distance between points in N-dimensional space, and it may be hard to characterize these clusters in a real-world sense. For example, if we feed the customer data to the KMeans algorithm and get the clusters labeled, the real-world significance of the customers clustered together may be hard to determine. It will be hard to take action based on these clusters without understanding the characteristics associated with them.

Luckily, our customer Churn data is not unlabeled. Even though we have dropped the label columns for the analysis, we can use them to understand how customers in the created clusters may be related and whether these clusters mean something with respect to real-world categorization.

To cluster this data using the KMeans clustering algorithm:

1. Choose a range for the number of clusters (k)

2. Create a model for each choice of the number of clusters

3. Compute the inertia for each of the models

4. Create a plot of inertia values against the number of clusters

5. Determine the right number of clusters using the Elbow method

6. For the chosen number of clusters create the KMeans model and label the data

7. Additionally, run exploratory analysis to check if the cluster labels assigned by KMeans correspond to the existing labels in the data (the labels which existed but were dropped for the purpose of the KMeans analysis, i.e., the categorical variables)
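Steps 1 through 3 can be sketched as a loop over candidate values of k. Here, `make_blobs` stands in for the scaled churn features (in the real analysis, X would be the normalized numeric columns loaded from the CSV):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Stand-in for the scaled churn features: 200 points with 2 inherent clusters
X, _ = make_blobs(n_samples=200, centers=2, n_features=4, random_state=42)

# Steps 1–3: fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

# Step 4 would plot k against inertias[k] (e.g., with matplotlib)
# and step 5 reads the elbow off that curve
```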

**Analysis of the Model**

We chose a range of 1 to 10 for the number of clusters. For each of these, we created a KMeans model, and for each of the models, we computed the inertia. The inertia of a model is the within-cluster sum of squared distances to the cluster centers. Inertia is at its maximum when the number of clusters is one (all the points grouped under a single cluster, with no clustering of the data into separate groups). Inertia is 0 when we have as many clusters as data points, such that each data point gets its own cluster and the sum of squared distances within each cluster is 0.

Between the maximum and minimum values of inertia, we can compute the inertia for each intermediate number of clusters and plot them on a chart. This chart shows how inertia drops as the number of clusters increases. At some point on the chart, increasing the number of clusters starts having a progressively smaller impact on the drop in inertia. At this point, we see an elbow-like shape forming in the curve. This is what determines the optimum number of clusters that KMeans can find with the data points provided for fitting the model.

The k vs inertia plot for our customer churn data is provided below.

As can be seen from the plot, the elbow-like shape occurs at k=2. This means that KMeans is optimally able to find 2 clusters in the data. We can find more clusters, but the drop in inertia is most significant from k=1 to k=2, and it goes down gradually afterward.

**Model Summary and Implications**

The elbow method showed us that the KMeans algorithm suggests 2 distinctive clusters in the data. We can find the cluster centers and inertia values using the model. We can also find the cluster labels for the various points in the data. Considering that we are dealing with unsupervised machine learning here, we don’t know what these 2 clusters correspond to. They have been found by the algorithm based on the proximity of the points, and the number of clusters was determined by the drop in inertia with respect to the number of clusters. How do they relate to customers in the real world?

This is where domain knowledge of the data can shine some light. Knowing the data and the background, we understand that the data belongs to customer churn records. If KMeans tells us that there are 2 distinct clusters in the data, it is intuitive to try to correlate these 2 clusters with customer churn: a cluster of customers who churned out, and another cluster of customers who stayed with the telecom company. To verify whether these clusters correlate with customer churn, we will have to use the “Churn” data field that was previously dropped from the data. Taking the labels as predicted by KMeans, we can add them back to the original data set and run some queries to check how the label values correlate with the Churn field.

The KMeans labels have values of 0 and 1. A quick query on the data shows that there are 2650 customers who have a Churn value of Yes. These are the customers who chose to change telecom providers and moved out of the company. Another query tells us that 2366 customers have a Churn value of Yes along with a KMeans cluster label of 1. This shows that the majority of the customers who churned out of the company were labeled under cluster 1 by the KMeans clustering algorithm, a strong correspondence between cluster 1 and churn behavior. With this correspondence established, we can treat the labels as predictions of customer churn behavior: a label value of 0 is the prediction that the customer will stay with the company, and label 1 is the prediction that the customer will move out. The Churn column values of Yes or No are the actual values to compare against to come up with a prediction accuracy.
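With pandas, this kind of check can be run by joining the KMeans labels back to the data and cross-tabulating them against the Churn column. The DataFrame below is a hypothetical miniature of that joined result, not the real churn data:

```python
import pandas as pd

# Hypothetical miniature of the joined result: KMeans label next to Churn
df = pd.DataFrame({
    "Churn":        ["Yes", "Yes", "Yes", "No", "No", "No"],
    "kmeans_label": [1, 1, 0, 0, 0, 1],
})

# Cross-tabulate cluster label against the actual churn flag
print(pd.crosstab(df["kmeans_label"], df["Churn"]))

# Count of churned customers that landed in cluster 1
churned_in_cluster_1 = len(df[(df["Churn"] == "Yes") & (df["kmeans_label"] == 1)])
```

On the real data, the same queries produce the 2650 and 2366 counts quoted above.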

Using sklearn.metrics functions, we find that KMeans is able to accurately predict 70.81% of the churn behavior. If the focus of the analysis is the customers who moved out of the current service provider, KMeans correctly identified 2366 out of 2650. This gives us a churn prediction accuracy of 89.28%.
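The two figures correspond to two different sklearn.metrics functions: overall agreement is `accuracy_score`, while the share of actual churners placed in cluster 1 is `recall_score`. A sketch with short hypothetical vectors (1 = Yes, i.e., churned):

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical vectors: actual churn (1 = Yes) vs. KMeans cluster label
actual  = [1, 1, 1, 1, 0, 0, 0, 0]
cluster = [1, 1, 1, 0, 0, 0, 1, 1]

overall = accuracy_score(actual, cluster)   # agreement over all customers
churn_only = recall_score(actual, cluster)  # share of churners placed in cluster 1
print(overall, churn_only)  # 0.625 0.75
```

Note that KMeans may assign the churn-heavy cluster the label 0 instead of 1 on a different run; the labels would then need to be flipped before computing these metrics.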

The KMeans clustering analysis of customer churn data shows that this unsupervised learning method, with no prior knowledge about the data, was able to identify 2 distinct clusters. One of the clusters closely corresponded with the customers who left the company and changed their service provider. The other cluster corresponded largely to those customers who chose to stay with the company. Even without any churn-related information provided, the data told its own story when passed through this unsupervised machine learning algorithm. As such, this algorithm can be used for predictive analysis on customer churn data.

**Limitations of KMeans**

While KMeans is intuitive, easy to implement, and computationally faster compared to some other clustering mechanisms, it comes with some limitations. The KMeans clustering technique is an unsupervised learning mechanism (no prior labeling of the data). It identifies the clusters in the data based on the distance of points from each other. One important limitation of this method is that it doesn’t make use of categorical variables. We know from previous analysis on the customer churn data that some of the categorical variables have been very important in predicting churn behavior, such as StreamingTV, StreamingMovies, Contract, etc. All that information is lost when using KMeans, as categorical variables plotted as 1s and 0s in N-dimensional space provide little valuable information. Understandably enough, KMeans never promised to create clusters based on customer churn behavior.

As the number of dimensions in the data increases, the ratio of the standard deviation to the mean of the pairwise distances decreases. This makes KMeans less effective at distinguishing between data points and, consequently, at assigning cluster labels.

Also, centroid positions can be heavily skewed by a few extreme outliers. Additionally, KMeans makes assumptions about the data:

1. Clusters are spatially grouped, i.e., each cluster can be confined within a sphere

2. Clusters are of a similar size

If any of the assumptions are violated, KMeans fails to create the right clusters.

The 2 images above show incorrect clustering (shown by color) by KMeans because the points are not spatially grouped.

Code available at: https://github.com/microbhai/CustomerChurnAnalysis/tree/master/UnsupervisedLearningClusteringKMeans