K-means Clustering in Python

Step-by-step follow along | Data Series | Episode 8.2

Mazen Ahmed
Nov 26, 2020 · 4 min read

An explanation of the K-means clustering algorithm: Episode 8.1

Please consider watching this video if any section of this article is unclear.

Video Link

How to set up your programming environment is covered at the start of Episode 4.3.

You can view and use the code and data used in this episode here: Link

Objective

Place the following data, taken from iris plants, into clusters to see whether we can identify the different species given their petal width and sepal length:

[Image credit: Wikimedia Commons, https://commons.wikimedia.org/wiki/Main_Page]

Importing and exploring our Data

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# read data into variable Iris_data
Iris_data = pd.read_csv(r"D:\ProjectData\Iris.csv")
#display first few rows of data
Iris_data.head()
[Output: the first five rows of Iris_data]
  • Identifying the species of plants in our dataset
# See species of plants
Iris_data.Species.unique()
[Output: array of the three species: 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
  • Store the selected data (sepal length and petal width) in variable X
X = Iris_data[["SepalLengthCm","PetalWidthCm"]]
# Display shape of data (no. rows, no. columns)
X.shape
[Output: (150, 2)]

Plotting our Data

We will now plot our data according to species. This can be done using the scatterplot function from the seaborn library. Note that our data is labelled here, which will not always be the case in clustering problems.

sns.scatterplot(data = Iris_data, x = "SepalLengthCm", y = "PetalWidthCm", hue = Iris_data.Species, palette = "coolwarm_r")
[Plot: sepal length against petal width, coloured by species]

Implementing K-means Algorithm

# Perform K-means algorithm
from sklearn.cluster import KMeans
X = Iris_data[["SepalLengthCm","PetalWidthCm"]]
km = KMeans(n_clusters=3, n_init = 3, init = "random", random_state = 42)
km.fit(X)
y_kmeans = km.predict(X)
y_kmeans
  • y_kmeans is an array indicating which cluster each data point belongs to; a quick way to inspect the assignments is shown below.
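For example (a quick sanity check, not strictly needed for the walkthrough), we can count how many points were assigned to each cluster. The exact counts depend on the random initialisation:

# Count how many data points were assigned to each cluster
# (counts may vary slightly with a different random_state)
labels, counts = np.unique(y_kmeans, return_counts=True)
print(dict(zip(labels, counts)))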

Plotting our Clusters and Centroids

  • To plot our clusters we will use the same scatter-plot code as before, simply changing the hue to y_kmeans, and then plot the centre of each cluster.
# Plot clusters - this is done by colour coding the data points according to which cluster the data point belongs to
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm", hue= y_kmeans, palette = "coolwarm_r")
centers = km.cluster_centers_
# Plot centers
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha = 0.6);
plt.xlabel("SepalLengthCm")
plt.ylabel("PetalWidthCm")
[Plot: data points coloured by cluster, with cluster centres marked in black]
  • We can see above that our k-means clustering algorithm has produced 3 clusters fairly similar to our previous plot. We can now use these clusters and centroids to make predictions for new flower data. Comparing clusters 0, 1 and 2 to our previous plot:

Cluster 0 most likely refers to Iris-versicolor
Cluster 1 most likely refers to Iris-setosa
Cluster 2 most likely refers to Iris-virginica
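One way to verify this mapping (a short optional check; the cluster numbering depends on the random_state used when fitting) is to cross-tabulate the true species labels against the cluster assignments:

# Rows: true species, columns: assigned cluster
# (cluster numbers are arbitrary and depend on initialisation)
pd.crosstab(Iris_data.Species, y_kmeans)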

Making Predictions

The clusters and centroids produced by our k-means algorithm can be used to place the petal width and sepal length measurements of any new flower into a cluster, essentially giving us a prediction of the flower type.

Let us say, for example, we recorded a flower to have a petal width of 0.8cm and a sepal length of 4.5cm. What type is this flower?

Using our model:

# New measurements in the same column order as X: [sepal length, petal width]
new_data = [[4.5, 0.8]]
y_pred = km.predict(new_data)
y_pred
[Output: array([1])]

We expect this flower to belong to cluster 1, our middle cluster, which when comparing our two plots most likely corresponds to the species Iris-setosa.
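If we would rather get the prediction back as a species name, one option is a small lookup dictionary built from the mapping above (a convenience sketch; the cluster numbers are specific to the random_state=42 model fitted earlier):

# Map cluster numbers to the species they most likely represent
# (this mapping only holds for the model fitted with random_state=42)
cluster_to_species = {0: "Iris-versicolor", 1: "Iris-setosa", 2: "Iris-virginica"}
print(cluster_to_species[y_pred[0]])  # Iris-setosa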

Selecting number of clusters K

The Elbow Method

To evaluate the performance of our k-means algorithm we can look at the inertia, or objective function value. This is the sum of squared distances from each data point to its assigned cluster centroid.
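To make this concrete, inertia can be recomputed by hand from the 3-cluster model we fitted earlier (a short sketch; the result should match km.inertia_ up to floating-point error):

# Sum of squared distances from each point to its assigned centroid
X_arr = X.to_numpy()
assigned_centers = km.cluster_centers_[y_kmeans]
manual_inertia = ((X_arr - assigned_centers) ** 2).sum()
print(manual_inertia, km.inertia_)  # the two values should agree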

By looking at the inertia for different numbers of clusters (K):

# Fit K-means for a range of k and record the inertia of each fit
inertia = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X)
    inertia.append(km.inertia_)
plt.plot(K, inertia, marker="x")
plt.xlabel('k')
plt.xticks(np.arange(1, 15))
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
[Plot: inertia against number of clusters k, showing an elbow at k = 3]

The “elbow” of the above graph gives a good choice for the number of clusters in our data. This is the point beyond which inertia decreases only roughly linearly, which in this case is k = 3. This helpfully matches the number of iris species.
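As a complementary check (not part of the elbow method itself), scikit-learn's silhouette score can also guide the choice of k: higher scores indicate better-separated clusters, and the score is only defined for k >= 2:

# Silhouette score for a range of candidate k values
from sklearn.metrics import silhouette_score
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))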

Prev Episode _______ Next Episode

If you have any questions please leave them below!
