K-MEANS CLUSTERING USING ELBOW METHOD
K-means is an unsupervised algorithm: it uses no target (label) variable and works only on the features
· It will just find patterns in the data
· It starts from a random initial state in which each data point is provisionally assigned to one of K clusters
· Then it moves the centroid of each cluster to the mean of its points and reassigns every point to its nearest centroid
· This process continues until the within-cluster variation can't be reduced any further
· The within-cluster variation is calculated as the sum of squared Euclidean distances between the data points and their respective cluster centroids
· We are going to load the Iris data set using scikit-learn
· The data set contains 150 entries with 4 feature columns and 1 dependent variable (the species label, used only to compare the results)
· Using the elbow method and inertia, which is the sum of squared distances of the samples to their closest cluster center, we will try to find the optimal number of clusters required to segment the observations.
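The loop described above can be sketched in plain NumPy. This is a minimal illustration, not the code used in the notebook: the function name `kmeans`, its parameters, and the toy data are assumptions made for the example, and scikit-learn's `KMeans` uses a smarter initialization (k-means++) by default.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means sketch: returns (labels, centroids, inertia)."""
    rng = np.random.default_rng(seed)
    # Use k distinct data points, chosen at random, as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the variation can't drop further
        centroids = new_centroids
    # Final assignment and inertia (sum of squared distances to the closest centroid).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, inertia

# Toy data for illustration only: two well-separated blobs (not the Iris data).
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids, inertia = kmeans(X_toy, k=2)
print(centroids.round(2), round(float(inertia), 2))
```

On data like this the two centroids should settle near the centers of the two blobs; on real data the result can depend on the random starting point.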
Source of the data:
Properties of the data:
· It has 150 entries with 1 dependent column and 4 feature columns
How to run this Application:
· Install Anaconda on the local machine
· Create a virtual environment for the project so that package version updates elsewhere do not affect the project we are currently working on.
· Open the Anaconda prompt
· To create a virtual environment: conda create -n envname python=3.8
· This creates a virtual environment named ‘envname’
· To activate the environment: conda activate envname
· Go to the project directory and install the required libraries:
§ conda install pip
§ pip install -r requirements.txt
· This requirements.txt contains:
· We can edit requirements.txt to add new libraries or update their versions, then rerun the pip command above to install them
· Finally, at the command line, run ‘jupyter notebook’ to launch the Jupyter IDE where we build the model
Inside the IDE:
· Import Iris dataset
· Visualize the data using Matplotlib and Seaborn to understand the patterns
· Find the Optimal K value using Inertia and Elbow Method
· Create a model that can cluster the observations in our data
· Compare the results.
Proof of concept (POC) for libraries and packages:
POC for Iris data:
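A minimal sketch of loading the data, assuming the notebook uses scikit-learn's bundled copy through `load_iris` (the variable names `X` and `y` are illustrative):

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # shape (150, 4): the four feature columns
y = iris.target  # shape (150,): species label, kept only to compare results

print(X.shape, y.shape)
print(iris.feature_names)
print(iris.target_names)
```

Only `X` is passed to the clustering model; `y` is set aside for the comparison step.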
POC for Visualization:
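One possible way to look at the data before clustering is a pair of scatter plots colored by species. This sketch uses Matplotlib only; the choice of feature pairs and the styling are assumptions, and a Seaborn pair plot would serve the same purpose.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# Left: sepal length vs. sepal width; right: petal length vs. petal width.
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    scatter = ax.scatter(X[:, i], X[:, j], c=y, cmap="viridis", s=20)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])

# One legend entry per species, built from the last scatter plot.
handles, _ = scatter.legend_elements()
axes[1].legend(handles, list(iris.target_names), title="species")
fig.tight_layout()
# In Jupyter the figure renders at the end of the cell; elsewhere call plt.show().
```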
POC for Model Building:
Building a model with 2 clusters.
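A sketch of this step with scikit-learn's `KMeans`. The `n_init` and `random_state` values are assumptions chosen to make the run repeatable, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit K-means with K=2 and get one cluster id per observation.
km2 = KMeans(n_clusters=2, n_init=10, random_state=42)
labels2 = km2.fit_predict(X)

print(np.bincount(labels2))            # number of points in each cluster
print(km2.cluster_centers_.round(2))   # one centroid per row, four features each
print(round(km2.inertia_, 2))          # sum of squared distances to the centroids
```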
Plotting the clusters with their centroids.
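A possible version of this plot, assuming the two petal features are used for the axes (the original figure may have used a different pair):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

fig, ax = plt.subplots(figsize=(6, 4))
# Points colored by assigned cluster (columns 2 and 3 are petal length and width).
ax.scatter(X[:, 2], X[:, 3], c=km.labels_, cmap="viridis", s=20)
# Centroids drawn on top as large red crosses.
ax.scatter(km.cluster_centers_[:, 2], km.cluster_centers_[:, 3],
           c="red", marker="X", s=200, label="centroids")
ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.legend()
```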
Building a model with 4 clusters.
Building a model with 5 clusters.
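The same call can be repeated for other values of K. The sketch below fits K=4 and K=5 and reports how many points land in each cluster; the `sizes` dictionary is only an illustrative helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

sizes = {}
for k in (4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # Count the points assigned to each of the k clusters.
    sizes[k] = np.bincount(km.labels_, minlength=k)
    print(k, sizes[k], round(km.inertia_, 2))
```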
· Now let’s use the concept of inertia, which is the sum of the squared distances of samples to their closest cluster center.
· As K grows, each cluster holds fewer points, all lying closer to their centroid, so the inertia keeps decreasing; a low inertia on its own therefore does not indicate a good K
· Now we will apply the elbow method to the Iris dataset. The elbow method lets us pick a suitable number of clusters: plot the inertia against K and choose the K at the ‘elbow’, after which adding more clusters no longer reduces the inertia substantially.
· We already know the expected answer is 3, as there are 3 unique classes of Iris flowers
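To make the definition of inertia concrete, the sketch below recomputes it by hand from a fitted model and compares it with the `inertia_` attribute that scikit-learn reports. The `n_init` and `random_state` settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Look up each sample's own centroid, then sum the squared Euclidean distances.
own_centroid = km.cluster_centers_[km.labels_]
manual_inertia = ((X - own_centroid) ** 2).sum()

print(manual_inertia, km.inertia_)  # the two values should agree
```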
Elbow method :
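A minimal sketch of the elbow plot, assuming K is scanned from 1 to 10 (the range used in the notebook is not stated):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

ks = list(range(1, 11))
# Fit one model per K and record its inertia.
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(ks, inertias, marker="o")
ax.set_xlabel("number of clusters K")
ax.set_ylabel("inertia")
ax.set_title("Elbow method")
```

The K to pick is the point where the curve bends: the drop in inertia is steep before it and shallow after it.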
Since we already know the number of clusters is 3, let's fit the model with K=3 and verify the resulting clusters against the known labels.
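One way to do this check is a contingency table of true species against cluster ids, plus the adjusted Rand index as a single agreement score. Both are assumptions about how to verify the result; the notebook may compare the labels differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, y = iris.data, iris.target
km3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Contingency table: rows are true species, columns are cluster ids.
table = np.zeros((3, 3), dtype=int)
np.add.at(table, (y, km3.labels_), 1)
print(table)

# Adjusted Rand index: 1.0 means the clusters match the species exactly.
ari = adjusted_rand_score(y, km3.labels_)
print(round(ari, 3))
```

Cluster ids are arbitrary, so cluster 0 need not correspond to the first species; read the table by looking for the dominant entry in each row.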
We can now draw scatter plots of the different feature values, with the centroids marked.
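A sketch of such plots for K=3, assuming "different values" means different feature pairs (sepal and petal measurements here):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
km3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
centers = km3.cluster_centers_

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# One panel per feature pair: (sepal length, sepal width) and (petal length, petal width).
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    ax.scatter(X[:, i], X[:, j], c=km3.labels_, cmap="viridis", s=20)
    ax.scatter(centers[:, i], centers[:, j], c="red", marker="X", s=200, label="centroids")
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
    ax.legend()
fig.tight_layout()
```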
· Higher K values give a finer segmentation, but when each cluster contains very few points the clusters start to reflect noise in the data rather than real structure, so the model overfits
· So, with K=3, at the elbow of the distortion/inertia curve, we can segment the data into 3 different clusters with low segmentation error and without adding unnecessary clusters
References: scikit-learn, Matplotlib.