# K-means Clustering in Python

## Step-by-step follow along | Data Series | Episode 8.2

Nov 26, 2020 · 4 min read

An explanation of the K-means clustering algorithm: Episode 8.1

How to set up your programming environment is covered at the start of Episode 4.3

You can view and use the code and data used in this episode here: Link

# Objective

Place the following data, taken from iris plants, into clusters to see whether we can identify different plants given their petal width and sepal length.

# Importing and exploring our Data

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Read data into variable Iris_data (raw string avoids backslash escapes)
Iris_data = pd.read_csv(r"D:\ProjectData\Iris.csv")

# Display first few rows of data
Iris_data.head()
```
• Identifying the species of plants in our dataset

```python
# See species of plants
Iris_data.Species.unique()
```
• Store selected data: sepal length and petal width into variable X

```python
X = Iris_data[["SepalLengthCm", "PetalWidthCm"]]

# Display shape of data (no. rows, no. columns)
X.shape
```

# Plotting our Data

We will now plot our data according to species; this can be done using the scatterplot function from the seaborn library. In this case our data is labelled, which may not always be the case.

```python
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm",
                hue=Iris_data.Species, palette="coolwarm_r")
```

# Implementing K-means Algorithm

```python
# Perform K-means algorithm
from sklearn.cluster import KMeans

X = Iris_data[["SepalLengthCm", "PetalWidthCm"]]
km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
km.fit(X)
y_kmeans = km.predict(X)
y_kmeans
```
• y_kmeans gives an array of values which show which cluster each data point belongs to.
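As a quick sanity check on those labels, we can count how many points land in each cluster. A minimal sketch using scikit-learn's bundled iris data, so it runs without the local CSV (columns 0 and 3 of that array correspond to sepal length and petal width):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Bundled iris data; columns 0 and 3 are sepal length (cm) and petal width (cm)
X = load_iris().data[:, [0, 3]]

km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
labels = km.fit_predict(X)

# Count how many flowers fall in each cluster
clusters, counts = np.unique(labels, return_counts=True)
print(dict(zip(clusters.tolist(), counts.tolist())))
```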

## Plotting our Clusters and Centroids

• To plot our clusters we will use the same code for the scatter plot before, but simply change the hue to y_kmeans and plot the centres of each cluster.

```python
# Plot clusters - colour-code the data points according to
# which cluster each data point belongs to
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm",
                hue=y_kmeans, palette="coolwarm_r")

# Plot cluster centres
centers = km.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="black", s=200, alpha=0.6)
plt.xlabel("SepalLengthCm")
plt.ylabel("PetalWidthCm")
```
• We can see above that our k-means clustering algorithm has produced three clusters fairly similar to our previous plot. We can now use these clusters and centroids to make predictions for new flower data. Comparing clusters 0, 1 and 2 to our previous plot:

Cluster 0 most likely refers to Iris-versicolor
Cluster 1 most likely refers to Iris-setosa
Cluster 2 most likely refers to Iris-virginica
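This mapping can also be checked programmatically by cross-tabulating the true species labels against the cluster assignments. A sketch using scikit-learn's bundled iris data; note the exact cluster numbering depends on the random initialisation, so it may differ from the list above:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data[:, [0, 3]]  # sepal length (cm), petal width (cm)

km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
labels = km.fit_predict(X)

# Rows: true species; columns: cluster label. Each cluster should be
# dominated by one species if the clustering matches the taxonomy.
species = pd.Series(iris.target_names[iris.target], name="Species")
ct = pd.crosstab(species, pd.Series(labels, name="Cluster"))
print(ct)
```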

# Making Predictions

The clusters and centroids produced by our k-means algorithm can be used to place any new petal-width and sepal-length measurements collected from new flowers into a cluster, essentially giving us a prediction of the flower type.

Let us say for example we recorded a flower to have a petal width of 0.8cm and sepal length of 4.8cm — what type is this flower?

Using our model:

```python
# New flower: sepal length 4.8 cm, petal width 0.8 cm
new_data = pd.DataFrame([[4.8, 0.8]], columns=["SepalLengthCm", "PetalWidthCm"])
y_pred = km.predict(new_data)
y_pred
```

We expect this flower to belong to cluster 1, our middle cluster, which, comparing our two plots, most likely corresponds to the species Iris-setosa.

# Selecting number of clusters K

## The Elbow Method

To evaluate the performance of our k-means algorithm we can look at the inertia, or objective function value. This is the sum of squared distances from each data point to its cluster centroid.
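That definition can be verified directly: summing the squared distances from each point to its assigned centroid should reproduce `km.inertia_`. A sketch on scikit-learn's bundled iris data (since the CSV path used above is local):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, [0, 3]]  # sepal length (cm), petal width (cm)

km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
labels = km.fit_predict(X)

# Sum of squared distances from each point to its assigned cluster centroid
manual_inertia = ((X - km.cluster_centers_[labels]) ** 2).sum()

print(manual_inertia, km.inertia_)  # the two values should match
```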

By looking at different Inertia values for different numbers of clusters (K):

```python
inertia = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X)
    inertia.append(km.inertia_)

plt.plot(K, inertia, marker="x")
plt.xlabel("k")
plt.xticks(np.arange(15))
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
```

The “elbow” of the above graph gives the optimum number of clusters for our data: the point after which inertia decreases roughly linearly, which in this case is k = 3. This helpfully matches our number of Iris species.
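Reading the elbow off a plot is subjective. One rough programmatic proxy, not part of the original article, is to pick the k with the sharpest bend in the inertia curve, i.e. the maximum second difference. A sketch on the bundled iris data (this heuristic can disagree with a visual reading, especially on noisy data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data[:, [0, 3]]  # sepal length (cm), petal width (cm)

K = range(1, 10)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
           for k in K]

# Second difference of inertia: large values mark a sharp bend (the "elbow")
second_diff = np.diff(inertia, n=2)
elbow_k = K[int(np.argmax(second_diff)) + 1]  # +1 because diff shortens the array
print("Candidate elbow at k =", elbow_k)
```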
