
K-MEANS CLUSTERING USING ELBOW METHOD

K-means is an unsupervised algorithm, as it has no target variable to predict

· It simply finds patterns in the data

· It starts by randomly assigning each data point to a cluster

· It then moves the centroid of each cluster to the mean of its points and reassigns every point to its nearest centroid

· This process continues until the cluster variation within the data can’t be reduced any further

· The cluster variation is calculated as the sum of squared Euclidean distances between the data points and their respective cluster centroids

· We are going to load the Iris data using scikit-learn

· The dataset contains 150 entries with 4 feature columns and 1 dependent variable, the species (kept only to compare the results)

· Using the elbow method and inertia, which is the sum of squared distances of the samples to their closest cluster center (written out below), we will try to find the optimal number of clusters required to segment the observations.
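Written out, the inertia that K-means minimises is the total squared distance of every point to the centroid of its own cluster (K, C_j and μ_j below are the usual textbook symbols, not names from the code in this article):

```latex
\text{inertia} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```

where K is the number of clusters, C_j is the set of points assigned to cluster j, and μ_j is its centroid.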

Source of the data:

Link: https://www.kaggle.com/uciml/iris

Properties of the data:

· It has 150 entries with 1 dependent column and 4 feature columns

How to run this Application:

· Install Anaconda on the local Machine

· Create a virtual environment for the project, so that version updates elsewhere on the machine will not affect the project we are currently working on.

· Open Anaconda prompt:

· To create a virtual environment: conda create -n envname python=3.8

· This creates a virtual environment named ‘envname’

· To activate the environment: conda activate envname

· Go to the project directory and Install the required libraries:

§ conda install pip

§ pip install -r requirements.txt

· This requirements.txt contains:

§ jupyter==1.0.0

§ lxml==4.4.1

§ matplotlib==3.1.1

§ pandas==0.25.2

§ Pillow==6.2.1

§ scikit-learn==0.21.3

§ seaborn==0.9.0

· We can edit the .txt file to add new libraries or newer versions, and rerun the command above to install them automatically

· Finally, run ‘jupyter notebook’ at the command line to launch the Jupyter IDE, where we build the model

Inside the IDE:

· Import Iris dataset

· Visualize the data using Matplotlib and Seaborn to understand the patterns

· Find the Optimal K value using Inertia and Elbow Method

· Create a model that can cluster the observations in our data

· Compare the results.

Proof of concept (POC) for libraries and packages:

POC for Iris data:
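The original code screenshot is not reproduced here; a minimal sketch of loading the Iris data with scikit-learn into a pandas DataFrame (the variable names are my own choices) could look like this:

```python
from sklearn.datasets import load_iris
import pandas as pd

# Load the built-in Iris data and put it into a DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Keep the true species only to compare against the clusters later
df["species"] = iris.target

print(df.shape)   # (150, 5)
print(df.head())
```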

EDA:
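A few typical EDA calls, assuming the `df` built in the loading cell above:

```python
# Basic structure and summary statistics
df.info()
print(df.describe())

# Class balance of the dependent column (50 rows per species)
print(df["species"].value_counts())

# Check for missing values
print(df.isnull().sum())
```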

POC for Visualization:

DistPlot:
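A distribution plot of one feature with seaborn, again assuming the `df` from the loading cell (seaborn 0.9.0, as pinned in requirements.txt, still ships `distplot`; newer versions replace it with `histplot`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of sepal length across all 150 observations
sns.distplot(df["sepal length (cm)"], bins=20)
plt.title("Distribution of sepal length")
plt.show()
```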

Box plot:
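And a box plot of all four feature columns on one axis, using the same `df`:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of the four feature columns side by side
sns.boxplot(data=df.drop(columns="species"))
plt.xticks(rotation=45)
plt.title("Box plot of Iris features")
plt.show()
```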

POC for Model Building:

Building a model with 2 clusters.
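A sketch of the 2-cluster model (variable names are my own):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # the four numeric features

# Fit K-means with 2 clusters; random_state fixes the initialisation
km2 = KMeans(n_clusters=2, random_state=42)
labels2 = km2.fit_predict(X)

print(km2.cluster_centers_)  # one centroid per cluster, 4 coordinates each
print(km2.inertia_)          # sum of squared distances to the closest centroid
```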

Plotting the clusters with their centroids.
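Continuing from the cell above, the 2-cluster assignment can be drawn on the sepal measurements with the centroids marked on top:

```python
import matplotlib.pyplot as plt

# Sepal length vs sepal width, coloured by the 2-cluster assignment
plt.scatter(X[:, 0], X[:, 1], c=labels2, cmap="viridis", s=30)
plt.scatter(km2.cluster_centers_[:, 0], km2.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="Centroids")
plt.xlabel("Sepal length (cm)")
plt.ylabel("Sepal width (cm)")
plt.legend()
plt.show()
```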

Building a model with 4 clusters.

Building a model with 5 clusters.
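The 4- and 5-cluster models follow the same pattern as the 2-cluster one; a compact sketch fits both in a loop and compares their inertia:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Fit the 4- and 5-cluster models and compare their inertia
for k in (4, 5):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    print(f"K={k}: inertia={km.inertia_:.2f}")
```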

· Now let’s use the concept of inertia, which is the sum of the squared distances of samples to their closest cluster center.

· If the value of K is very large, the number of points within each cluster will be small and hence the inertia will be low

· Now we will apply the elbow method to the Iris dataset. The elbow method allows us to pick the optimum number of clusters.

· Although we already know the answer is 3, as there are 3 unique classes of Iris flowers

Elbow method:
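A sketch of the elbow curve, computing the inertia for K = 1 to 10 and plotting it so the bend is visible:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

X = load_iris().data
k_values = range(1, 11)
inertias = []

# Fit one model per K and record its inertia
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```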

Since we already know the optimal number of clusters is 3, let’s fit the model with K=3 and verify the resulting clusters and labels.
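One way to verify the labels is to cross-tabulate them against the true species; note the cluster numbers are arbitrary, so only the grouping matters, not the label values:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
km3 = KMeans(n_clusters=3, random_state=42)
labels3 = km3.fit_predict(iris.data)

# Cross-tabulate cluster labels against the true species
print(pd.crosstab(pd.Series(iris.target, name="species"),
                  pd.Series(labels3, name="cluster")))
```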

We can now plot a scatter of the clustered points together with their centroids.
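Continuing from the cell above, this is the same kind of plot as in the 2-cluster POC, now on the petal measurements with the three centroids marked:

```python
import matplotlib.pyplot as plt

X = iris.data

# Petal length vs petal width (columns 2 and 3), coloured by cluster
plt.scatter(X[:, 2], X[:, 3], c=labels3, cmap="viridis", s=30)
plt.scatter(km3.cluster_centers_[:, 2], km3.cluster_centers_[:, 3],
            c="red", marker="X", s=200, label="Centroids")
plt.xlabel("Petal length (cm)")
plt.ylabel("Petal width (cm)")
plt.legend()
plt.show()
```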

· We can always tighten the segmentation by choosing higher K values, but if each cluster ends up containing only a few points, the clusters start to reflect noise rather than the real variation in the data, and the segmentation loses its meaning

· So, with K=3 we obtain an optimal distortion/inertia, and we can segment the data into 3 different clusters with minimal error in segmentation

References: scikit-learn, Matplotlib.
