Practical Approach to KMeans Clustering — Python and Why Scaling is Important!

Ajay n Jain
Published in Analytics Vidhya
7 min read · Nov 8, 2019

Learnt K-Means clustering and now want to apply it to real-life problems?
Applied a clustering algorithm but not satisfied with the results?
Get started with the easiest dataset you’ll ever see, and learn how scaling affects clustering and how a small change in the data can produce totally different clusters!

Image by arielrobin from Pixabay

Prerequisites

  1. You should know what clustering is.
  2. You should know the KMeans algorithm.
  3. Basics of Python

Let’s get into Coding

Step 1: Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Read the dataset

movies = pd.read_csv('movies.csv')

Step 3: Understand the dataset

# View the first five rows of the dataset
movies.head()
# If you want to view first n rows of the dataset:
# movies.head(n=number)
First five rows of the dataset
#View summary of the data
movies.info()
Movie Information

We see that there are a total of 10 rows and 6 features (columns).
Two of the features are of integer type and four are of object type.
Every column shows 10 non-null entries, and since there are 10 rows in total, there are no null values in the data.
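
The same checks can also be done programmatically. Here is a minimal sketch, assuming a small made-up DataFrame that only borrows the article’s column names (the rows are placeholders, not the contents of movies.csv):

```python
import pandas as pd

# Hypothetical stand-in for movies.csv: made-up rows, same column names
movies = pd.DataFrame({
    "Id": [1, 2, 3],
    "MovName": ["Movie A", "Movie B", "Movie C"],
    "Actor": ["Actor X", "Actor X", "Actor Y"],
    "Director": ["Director P", "Director Q", "Director Q"],
    "Genre": ["Drama", "Action", "Drama"],
    "Year": [2005, 2012, 2019],
})

# Number of rows and columns
rows, cols = movies.shape

# Total count of missing values across all columns
total_nulls = int(movies.isnull().sum().sum())

# Split the columns into numeric and non-numeric ones
numeric_cols = movies.select_dtypes(include="number").columns.tolist()
text_cols = movies.select_dtypes(exclude="number").columns.tolist()

print(rows, cols, total_nulls)
print(numeric_cols, text_cols)
```

On the real dataset, the same calls should agree with what movies.info() reports.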

We can delete the ‘Id’ feature as it does not provide us any relevant information.

del movies['Id']

Let’s view the complete data as there are only 10 rows.

#View the complete data
display(movies)
Complete movie data

This is a small dataset, and we can see that many movies share the same actors or directors. We can use groupby to learn more about the data.

#Grouping the movies by actors and displaying them
actors = movies.groupby('Actor')
for actor, rows in actors:
    display(actor, rows)
Actor groupby result

We see that there are four actors in total: Christian Bale, Hugh Jackman, Joaquin Phoenix and Tom Cruise.
Christian Bale has the highest number of movies in the dataset (5), Joaquin Phoenix and Tom Cruise have two movies each, and Hugh Jackman has a single movie.
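
If you only need the counts and not the grouped rows, value_counts gives them directly. A small sketch on made-up placeholder rows (only the 'Actor' column name is taken from the article):

```python
import pandas as pd

# Made-up placeholder rows, not the article's dataset
movies = pd.DataFrame({
    "MovName": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Actor": ["Actor X", "Actor X", "Actor Y", "Actor X"],
})

# Number of movies per actor, sorted from most to fewest
movies_per_actor = movies["Actor"].value_counts()

# Number of distinct actors
n_actors = movies["Actor"].nunique()

print(movies_per_actor)
print(n_actors)
```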

Let us now group by Director

#Grouping the movies by directors and displaying them
directors = movies.groupby('Director')
for director, rows in directors:
    display(director, rows)
Director groupby result

There are six directors. Christopher Nolan and David O Russell have only worked with Christian Bale, whereas James Mangold has worked with two actors, Hugh Jackman and Tom Cruise.

Let us now group by Genre

#Grouping the movies by genre and displaying them
genres = movies.groupby('Genre')
for genre, rows in genres:
    display(genre, rows)
Genre groupby result

There are 3 unique genres. Only Tom Cruise has acted in Action movies, while Christian Bale and Joaquin Phoenix have each acted in both Comic-Book and Drama movies.
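
A cross-tabulation is another way to see which actor appears in which genre at a glance. A rough sketch, again on made-up placeholder rows rather than the real movies.csv:

```python
import pandas as pd

# Made-up placeholder rows; only the column names follow the article
movies = pd.DataFrame({
    "Actor": ["Actor X", "Actor X", "Actor Y", "Actor Z"],
    "Genre": ["Drama", "Comic-Book", "Drama", "Action"],
})

# Rows are genres, columns are actors, cells are movie counts
genre_by_actor = pd.crosstab(movies["Genre"], movies["Actor"])

print(genre_by_actor)
```

A zero in a cell means that actor has no movie in that genre.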

Step 4: Preprocessing

Before modeling, we have to convert the data to a numeric format, as KMeans does not work with categorical variables.
Hence we have to create dummy variables for Actor, Director and Genre.
We don’t create dummy variables for the movie name because, like Id, it is an identifier that is not useful for clustering, so we delete it too.

Note: It is recommended to make a copy of the original dataset and modify the copy rather than the original.

#Creating dummy variables
actor_dummy = pd.get_dummies(movies['Actor'])
director_dummy = pd.get_dummies(movies['Director'])
genre_dummy = pd.get_dummies(movies['Genre'])
#Concatenating the dummy variables to the original dataset
movie_dummy_set = pd.concat([movies, actor_dummy,
                             director_dummy, genre_dummy], axis=1)
#Deleting categorical variable from the dummy set
del movie_dummy_set['MovName']
del movie_dummy_set['Actor']
del movie_dummy_set['Director']
del movie_dummy_set['Genre']
Data after Preprocessing
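
Following the note about working on a copy, here is one possible variant of this step. It is a sketch on made-up placeholder rows, and it lets get_dummies encode several columns in one call. Note that this variant prefixes each dummy column with its source column name, which differs from the unprefixed columns produced above.

```python
import pandas as pd

# Made-up placeholder rows; only the column names follow the article
movies = pd.DataFrame({
    "MovName": ["Movie A", "Movie B", "Movie C"],
    "Actor": ["Actor X", "Actor X", "Actor Y"],
    "Genre": ["Drama", "Action", "Drama"],
    "Year": [2005, 2012, 2019],
})

# Work on a copy so the in-place delete below does not touch 'movies'
movie_dummy_set = movies.copy()
del movie_dummy_set["MovName"]

# Encode the categorical columns; they are replaced by 0/1 dummy columns
movie_dummy_set = pd.get_dummies(
    movie_dummy_set, columns=["Actor", "Genre"], dtype=int
)

print(movie_dummy_set.columns.tolist())
print(movies.columns.tolist())  # the original frame still has every column
```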

Step 5: Modeling

Let’s start with importing the library required for modeling

#Importing KMeans
from sklearn.cluster import KMeans

Let k be equal to 2 i.e. we want two clusters for the data.

#Modeling
kmeans = KMeans(2)
kfit = kmeans.fit(movie_dummy_set)
identified_clusters = kfit.predict(movie_dummy_set)

We have stored the identified clusters in a new variable, and we will add it to a copy of the original dataset.
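
As a side note, fitting and predicting on the same data can be done in one call with fit_predict, and passing random_state makes the run repeatable. A small sketch on a made-up 2D array (the points and the parameter values are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up points forming two well-separated groups
X = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1],
])

# n_init and random_state are set explicitly so repeated runs agree
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict fits the model and returns the label of each row
labels = kmeans.fit_predict(X)

# The same labels are also available on the fitted model
same_as_attribute = np.array_equal(labels, kmeans.labels_)

print(labels, same_as_attribute)
```

Keep in mind that the label numbers themselves (0 or 1) are arbitrary. Only the grouping matters.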

#Appending the identified clusters to the original data
clustered_data = movies.copy()
clustered_data['Cluster'] = identified_clusters
#Viewing the data with clusters
display(clustered_data.sort_values(by='Cluster'))
Data with Two Clusters

Before going ahead, analyze the table and try to figure out which feature(s) the clustering was based on.

If you figured it out, great! Let’s go ahead and plot the clusters.

#Plotting the graph
plt.xlabel('Year')
plt.ylabel('Cluster')
plt.scatter(clustered_data['Year'],clustered_data['Cluster'],c=clustered_data['Cluster'])
Year vs Cluster Plot

We see that the clusters are based on Year. Cluster 0 contains movies that were released before 2012 and Cluster 1 contains movies that were released after 2012.

But why was it based on Year and not on the other features, Actor, Director and Genre?
Did we forget something? I think yes, we forgot to scale the data.

Remember that all the data is numeric and the values in Year are very large in magnitude. They are in the 2000s, whereas the values in the other features are either 0 or 1, as they are dummies.

This is likely why the clustering was based on Year: KMeans groups points by Euclidean distance, so a feature whose values differ by several years contributes far more to the distance than dummy features that differ by at most 1.
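
To make this concrete, here is a small worked sketch with three made-up rows (the values are assumptions, not rows from the dataset). Each row has a Year and two actor dummies. Before scaling, the movie with the same actor but a distant year looks far away. After min-max scaling of Year, it becomes the nearer one.

```python
import numpy as np

# Columns: Year, Actor_X, Actor_Y (made-up values for illustration)
a = np.array([2005.0, 1.0, 0.0])  # Actor X, 2005
b = np.array([2006.0, 0.0, 1.0])  # Actor Y, 2006
c = np.array([2019.0, 1.0, 0.0])  # Actor X, 2019

# Unscaled Euclidean distances: the Year gap dominates
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Min-max scale only the Year column to [0, 1]; the dummies already are
data = np.vstack([a, b, c])
year = data[:, 0]
data[:, 0] = (year - year.min()) / (year.max() - year.min())
a_s, b_s, c_s = data

# Scaled distances: sharing an actor now matters more than the year
scaled_ab = np.linalg.norm(a_s - b_s)
scaled_ac = np.linalg.norm(a_s - c_s)

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```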

Let’s Scale the data and then apply KMeans.

#Importing the library
from sklearn import preprocessing
#Passing the values of the dataset to Min-Max-Scaler
movies_values = movie_dummy_set.values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(movies_values)
movies_scaled = pd.DataFrame(x_scaled,
columns=movie_dummy_set.columns)
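
For reference, min-max scaling maps each column to the range [0, 1] using (x - min) / (max - min), computed per column. The sketch below checks this by hand against MinMaxScaler on a tiny made-up array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values: a Year-like column and a 0/1 dummy column
X = np.array([
    [2005.0, 0.0],
    [2012.0, 1.0],
    [2019.0, 0.0],
])

# Manual min-max scaling, column by column
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))
```

Dummy columns that already contain both 0 and 1 are left unchanged, so in practice only Year is affected.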

We have scaled the data and stored it in the ‘movies_scaled’ variable. Let’s apply KMeans with k = 2 and see whether we get different results or the same.

#Modeling
kmeans = KMeans(2)
kfit = kmeans.fit(movies_scaled)
identified_clusters_scaled = kfit.predict(movies_scaled)
#Appending the identified clusters to the dataframe
clustered_data_scaled = movies.copy()
clustered_data_scaled['Cluster'] = identified_clusters_scaled
display(clustered_data_scaled.sort_values(by='Cluster'))
Data with Two Clusters after Scaling

The results are surprising, aren’t they? The clusters are now based on Actor: Cluster 0 contains the movies in which Christian Bale has starred, whereas Cluster 1 contains all the other movies.

Let’s see what happens when the number of clusters is 3, i.e. k = 3.
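
Trying other values of k only requires refitting the model with a different number of clusters. Below is a minimal sketch of that loop on made-up placeholder rows, so the labels it prints will not match the tables shown in this article:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Made-up placeholder rows; only the column names follow the article
movies = pd.DataFrame({
    "MovName": ["A", "B", "C", "D", "E", "F"],
    "Actor": ["Actor X", "Actor X", "Actor X", "Actor Y", "Actor Y", "Actor Z"],
    "Genre": ["Drama", "Comic-Book", "Comic-Book", "Action", "Action", "Drama"],
    "Year": [2005, 2008, 2012, 2010, 2018, 2019],
})

# Drop the identifier, encode the categoricals, then scale to [0, 1]
features = pd.get_dummies(
    movies.drop(columns=["MovName"]), columns=["Actor", "Genre"], dtype=float
)
scaled = MinMaxScaler().fit_transform(features)

# Refit KMeans for each k and keep the labels
labels_by_k = {}
for k in (2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels_by_k[k] = model.fit_predict(scaled)

for k, labels in labels_by_k.items():
    print(k, labels)
```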

Data with Three Clusters

Again the clusters are based on Actor: Cluster 0 contains movies starring Tom Cruise, Cluster 1 contains movies starring Christian Bale and Cluster 2 contains the other movies.

Let’s see what happens when the number of clusters is 4, i.e. k = 4.

Data with Four Clusters

Well, the clusters are now based on two features, Actor and Director.
Cluster 0 contains movies starring Tom Cruise. Cluster 1 contains movies starring Christian Bale and directed by David O Russell. Cluster 2 contains movies starring Joaquin Phoenix and Hugh Jackman, and Cluster 3 also contains movies starring Christian Bale, but directed by Christopher Nolan.

Step 6: Conclusion

We saw how scaling affected the clustering. Before scaling, the two clusters were based on Year, but after scaling, the clusters were based on Actor.
We also saw how changing the number of clusters changes the grouping of the data, and it is up to the data scientist to decide the number of clusters.

You might have several questions, such as what the right number of clusters is or how to decide the ‘k’ value. One approach for deciding the ‘k’ value is the Elbow Method.
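
The idea behind the Elbow Method is to fit KMeans for several values of k, record the within-cluster sum of squares (exposed by scikit-learn as inertia_), and pick the k after which the improvement flattens out. Here is a rough sketch on synthetic data with three obvious groups (the data and the range of k are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit KMeans for k = 1..5 and record the inertia for each k
ks = list(range(1, 6))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting ks against inertias with plt.plot should show a sharp bend at k = 3 for this synthetic data. On real data the bend is often less clear-cut.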

Another question you might have is why the clustering was mostly based on the Actor feature and not Genre. There were 3 genres in the dataset, so when k was 3, it would seem to make more sense to cluster by Genre rather than by Actor. Well, the answer is that it depends on the data too.

Let’s look at an example of how clustering depends on the data.

I made just one change in the dataset: I changed the Year of Batman Begins from 2005 to 2014. (Yes, it’s wrong, but the result is surprising!)

When Batman Begins was released in 2014

We see that the clustering is now based on Genre! Changing just a single value in the dataset leads to a major change in the clusters.

Imagine a real-life dataset with a large number of rows and columns and lots of missing values. Missing values can be filled using various techniques.
Different techniques can lead to different clusters!
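
As a tiny illustration, two common filling strategies already produce different values for the same gap. The sketch below uses a made-up Year column with one missing entry. Whether the clusters actually change depends on the rest of the data.

```python
import numpy as np
import pandas as pd

# Made-up Year column with one missing value
years = pd.Series([2005, 2006, 2008, np.nan, 2019], name="Year")

# Strategy 1: fill with the mean of the known values
filled_mean = years.fillna(years.mean())

# Strategy 2: fill with the median of the known values
filled_median = years.fillna(years.median())

print(filled_mean[3], filled_median[3])
```

Since a single changed Year was enough to change the clusters in the example above, a gap like this between the two filled values could plausibly do the same.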

Try out different numbers of clusters and different algorithms, and see how the clusters change.

Repository Link: https://github.com/njain9104/Movies-Clustering

Last Step: End
