# K-means Clustering in Python

## Step-by-step follow along | Data Series | Episode 8.2

An explanation of the K-means clustering algorithm: Episode 8.1

Please consider watching this video if any section of this article is unclear.

Instructions for setting up your programming environment can be found at the start of **Episode 4.3**.

You can view and use the **code** and **data** used in this episode here: **Link**

# Objective

Place data taken from iris plants into clusters to see if we can identify different plants given their **petal width** and **sepal length**.

# Importing and exploring our Data

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Read data into variable Iris_data (raw string avoids backslash escapes)
Iris_data = pd.read_csv(r"D:\ProjectData\Iris.csv")

# Display first few rows of data
Iris_data.head()
```

- Identifying the species of plants in our dataset

```python
# See species of plants
Iris_data.Species.unique()
```

- Store the selected data, **sepal length** and **petal width**, in variable X:

```python
X = Iris_data[["SepalLengthCm", "PetalWidthCm"]]

# Display shape of data (no. rows, no. columns)
X.shape
```

# Plotting our Data

We will now plot our data according to species using the scatterplot function from the **seaborn** library. Note that our data here is labelled, which will not always be the case.

```python
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm",
                hue=Iris_data.Species, palette="coolwarm_r")
```

# Implementing the K-means Algorithm

```python
# Perform K-means algorithm
from sklearn.cluster import KMeans

X = Iris_data[["SepalLengthCm", "PetalWidthCm"]]

# n_init=3: run K-means 3 times from different random starting centroids
# and keep the best result; init="random" chooses those starts randomly
km = KMeans(n_clusters=3, n_init=3, init="random", random_state=42)
km.fit(X)

# Assign each data point to its nearest cluster centre
y_kmeans = km.predict(X)
y_kmeans
```

- `y_kmeans` is an array of values showing which cluster each data point belongs to.
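As a quick sanity check (a minimal sketch, assuming the imports and the `km` fit above), we can count how many points landed in each cluster:

```python
# Count how many points fall into each of the three clusters
labels, counts = np.unique(y_kmeans, return_counts=True)
print(dict(zip(labels, counts)))
```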

## Plotting our Clusters and Centroids

- To plot our clusters we will reuse the scatter plot code from before, simply changing the **hue** to `y_kmeans`, and then plot the centre of each cluster.

```python
# Plot clusters - colour-code the data points according to
# the cluster each point belongs to
sns.scatterplot(data=Iris_data, x="SepalLengthCm", y="PetalWidthCm",
                hue=y_kmeans, palette="coolwarm_r")

# Plot centres
centers = km.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c="black", s=200, alpha=0.6)

plt.xlabel("SepalLengthCm")
plt.ylabel("PetalWidthCm")
```

- We can see above that our k-means clustering algorithm has produced 3 clusters fairly similar to our previous plot. We can now use the clusters and centroids produced to make predictions for new flower data. Comparing clusters 0, 1 and 2 to our previous plot (a quick cross-check follows the list below):

- **Cluster 0** most likely refers to **Iris-versicolor**
- **Cluster 1** most likely refers to **Iris-setosa**
- **Cluster 2** most likely refers to **Iris-virginica**
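Since this dataset happens to be labelled, we can sanity-check this mapping with a cross-tabulation (a minimal sketch, assuming the `Iris_data` and `y_kmeans` objects above); each row shows how one species is distributed across the clusters:

```python
# Cross-tabulate true species against assigned cluster labels
print(pd.crosstab(Iris_data.Species, y_kmeans, colnames=["Cluster"]))
```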

# Making Predictions

The clusters and centroids produced by our k-means algorithm can be used to place any new petal width and sepal length data collected from new flowers into a cluster, essentially giving us a prediction of the flower type.

Let us say, for example, we recorded a flower with a petal width of 0.8 cm and a sepal length of 4.5 cm: what type is this flower?

Using our **model**:

```python
# New flower: sepal length 4.5 cm, petal width 0.8 cm
new_data = [[4.5, 0.8]]

y_pred = km.predict(new_data)
y_pred
```

We expect this flower to belong to cluster 1, our middle cluster, which, comparing our two plots, most likely corresponds to the species **Iris-setosa**.
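If we want the species name rather than the cluster number, a small lookup table works. Note that `cluster_to_species` is a hypothetical helper based on the cluster-to-species matching above, and the cluster numbering can change with a different `random_state`:

```python
# Hypothetical mapping, inferred from comparing the two plots above;
# cluster numbering may differ between runs with a different random_state
cluster_to_species = {0: "Iris-versicolor", 1: "Iris-setosa", 2: "Iris-virginica"}
print(cluster_to_species[y_pred[0]])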

# Selecting number of clusters K

**The Elbow Method**

To evaluate the performance of our k-means algorithm we can look at the inertia, or objective function value. This is the sum of squared distances between each data point and its cluster centroid.
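To make that definition concrete, here is a small check (a sketch, assuming the `km`, `X` and `y_kmeans` objects from earlier) that recomputes the sum of squared distances by hand and compares it to `km.inertia_`:

```python
# Distance from each point to its assigned cluster centre
dists = np.linalg.norm(X.to_numpy() - km.cluster_centers_[y_kmeans], axis=1)

# Should match the inertia reported by scikit-learn
print(np.allclose((dists ** 2).sum(), km.inertia_))  # expected: True
```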

By looking at the inertia values for different numbers of clusters (K):

```python
inertia = []
K = range(1, 15)

# Fit K-means for each value of k and record the inertia
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(X)
    inertia.append(km.inertia_)

plt.plot(K, inertia, marker="x")
plt.xlabel("k")
plt.xticks(np.arange(15))
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()
```

The “elbow” of the above graph gives the optimum number of clusters for our data. This is the point before the decrease in inertia becomes roughly linear, which in this case is **k = 3**. This helpfully matches our number of Iris species.
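Reading the elbow off the chart works well here. If you wanted to pick it programmatically, one rough heuristic (an assumption of this sketch, not a scikit-learn feature) is to choose the k where the inertia curve bends most sharply, i.e. where its second difference is largest:

```python
# Second difference of the inertia curve: large values mean a sharp bend.
# inertia[i] corresponds to k = i + 1, and each diff shifts the index by one,
# hence the + 2 below.
second_diff = np.diff(inertia, n=2)
elbow_k = int(np.argmax(second_diff)) + 2
print(elbow_k)  # for this data we would expect a value near 3
```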