K-means Clustering in Python
An explanation of the K-means clustering algorithm: Episode 8.1
Please consider watching this video if any section of this article is unclear.
You can view and use the code and data used in this episode here: Link
Place the following data taken from iris plants into clusters to see if we can identify different plants given their petal width and sepal length:
Importing and exploring our Data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# read data into variable Iris_data
Iris_data = pd.read_csv(r"D:\ProjectData\Iris.csv")

# display first few rows of data
Iris_data.head()
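If the CSV file is not available at that local path, a rough fallback (an assumption on my part, not part of the original walkthrough) is to build an equivalent DataFrame from scikit-learn's built-in iris dataset, renaming its columns to match the ones used in this article:

```python
# Sketch: rebuild an Iris_data DataFrame from scikit-learn's bundled
# iris dataset; the column names below are chosen to match the CSV's.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
Iris_data = iris.frame.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
})
# Map the integer target (0, 1, 2) to species names like "Iris-setosa"
Iris_data["Species"] = Iris_data["target"].map(
    lambda i: "Iris-" + iris.target_names[i]
)
Iris_data = Iris_data.drop(columns="target")
print(Iris_data.shape)  # 150 rows, 5 columns
```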
- Identifying the species of plants in our dataset:

# See species of plants
Iris_data["Species"].unique()

- Store the selected data, sepal length and petal width, in variable X:

X = Iris_data[["SepalLengthCm","PetalWidthCm"]]

# Display shape of data (no. rows, no. columns)
X.shape
Plotting our Data
We will now plot our data according to species. This can be done using the scatterplot function from the seaborn library. In this case our data is labelled, which may not always be the case.
sns.scatterplot(data = Iris_data, x = "SepalLengthCm", y = "PetalWidthCm", hue = Iris_data.Species, palette = "coolwarm_r")
Implementing K-means Algorithm
# Perform K-means algorithm
from sklearn.cluster import KMeans

X = Iris_data[["SepalLengthCm","PetalWidthCm"]]
km = KMeans(n_clusters=3, n_init = 3, init = "random", random_state = 42)
y_kmeans = km.fit_predict(X)
- y_kmeans gives an array of values which show which cluster each data point belongs to.
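As a quick sanity check, we can inspect those labels and count how many points landed in each cluster. The snippet below is a self-contained sketch that re-fits the model from scikit-learn's built-in iris data (so the column names differ from the CSV's):

```python
# Sketch: inspect the cluster labels produced by k-means.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
X = iris.frame[["sepal length (cm)", "petal width (cm)"]].to_numpy()

km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
y_kmeans = km.fit_predict(X)

print(y_kmeans[:10])              # one label (0, 1 or 2) per data point
labels, counts = np.unique(y_kmeans, return_counts=True)
print(dict(zip(labels, counts)))  # number of points in each cluster
```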
Plotting our Clusters and Centroids
- To plot our clusters we will use the same code as the scatter plot above, simply changing the hue to y_kmeans, and then plot the centre of each cluster.
# Plot clusters - this is done by colour coding the data points according to which cluster the data point belongs to
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm", hue= y_kmeans, palette = "coolwarm_r")
# Plot centers
centers = km.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha = 0.6);
- We can see above that our k-means clustering algorithm has produced 3 clusters fairly similar to our previous plot. We can now use these clusters and centroids to make predictions for new flower data. Comparing clusters 0, 1 and 2 to our previous plot:
Cluster 0 most likely refers to Iris-versicolor
Cluster 1 most likely refers to Iris-setosa
Cluster 2 most likely refers to Iris-virginica
The clusters and centroids produced by our k-means algorithm can be used to place any new petal width and sepal length measurements from new flowers into a cluster, essentially giving us a prediction of the flower type.
Let us say for example we recorded a flower to have a petal width of 0.8cm and sepal length of 4.8cm — what type is this flower?
Using our model:
new_data = [[4.8, 0.8]]  # [sepal length, petal width]
y_pred = km.predict(new_data)
We expect this flower to belong to cluster 1, our middle cluster, which when comparing our two plots most likely belongs to the species Iris-setosa.
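Under the hood, predict simply assigns the new point to its nearest centroid. The following self-contained sketch (re-fitting the model from scikit-learn's built-in iris data) verifies this with a manual Euclidean-distance computation:

```python
# Sketch: km.predict picks the centroid closest to the new point; we can
# reproduce that with np.linalg.norm and np.argmin.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
X = iris.frame[["sepal length (cm)", "petal width (cm)"]].to_numpy()
km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42).fit(X)

new_point = np.array([4.8, 0.8])  # [sepal length, petal width]
dists = np.linalg.norm(km.cluster_centers_ - new_point, axis=1)
manual_label = int(np.argmin(dists))
print(manual_label == km.predict([new_point])[0])  # True
```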
Selecting number of clusters K
The Elbow Method
To evaluate the performance of our k-means algorithm we can look at the inertia, or objective function value. This is the sum of squared distances between each data point and its cluster centroid.
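To make that definition concrete, the sketch below (self-contained, using scikit-learn's built-in iris data rather than the CSV) recomputes inertia by hand and compares it with the fitted model's inertia_ attribute:

```python
# Sketch: inertia = sum over all points of the squared distance to the
# centroid of the cluster that point was assigned to.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
X = iris.frame[["sepal length (cm)", "petal width (cm)"]].to_numpy()
km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42).fit(X)

diffs = X - km.cluster_centers_[km.labels_]     # each point minus its centroid
manual_inertia = np.sum(diffs ** 2)
print(np.isclose(manual_inertia, km.inertia_))  # True
```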
By looking at different Inertia values for different numbers of clusters (K):
inertia = []
K = range(1, 15)

for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X)
    inertia.append(km.inertia_)

plt.plot(K, inertia, marker= "x")
The “elbow” of the above graph gives the optimum number of clusters for our data. This is the point after which inertia begins to decrease roughly linearly, which in this case is k = 3. This helpfully matches our number of iris species.
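If we want to pick the elbow programmatically rather than by eye, one common heuristic (a sketch, not a scikit-learn API) is to choose the k whose point on the inertia curve lies furthest from the straight line joining the curve's endpoints:

```python
# Sketch: max-distance-from-chord heuristic for locating the elbow.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
X = iris.frame[["sepal length (cm)", "petal width (cm)"]].to_numpy()

K = range(1, 15)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in K]

pts = np.column_stack([list(K), inertia])
start, end = pts[0], pts[-1]
line = (end - start) / np.linalg.norm(end - start)  # unit chord direction
vecs = pts - start
# perpendicular distance of each curve point from the chord (2-D cross product)
dists = np.abs(vecs[:, 0] * line[1] - vecs[:, 1] * line[0])
elbow_k = int(pts[np.argmax(dists), 0])
print(elbow_k)
```

Because inertia values dwarf the k values in magnitude, rescaling both axes before measuring distances can change the answer; treat this as a rough cross-check on the visual elbow, not a replacement for it.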