Machine Learning: Hierarchical and K-Means Clustering with Python

Published in Analytics Vidhya · 17 min read · Jan 2, 2022

A Data Science approach covering the most commonly used models to perform cluster analysis and customer segmentation in Python.

The project

We are about to take off on a Python journey: we’ll explore a data set and dive deep into segmenting a customer database, pre-processing the features along the way to achieve the best possible results.

Under the Unsupervised Learning umbrella, we’ll perform Hierarchical and K-Means Clustering to identify the different customer segments that exist in our client’s database.
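To make the two techniques concrete before touching the real data, here is a minimal sketch on synthetic points. Everything in it is an assumption for illustration: the two artificial groups, the choice of two clusters, the Ward linkage and the variable names are ours, and not necessarily the settings used on the client’s data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features: two well-separated groups
# of 20 points each (illustrative only, not the article's data set).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# Put all features on the same scale so no single feature dominates
# the distance computations.
X_scaled = StandardScaler().fit_transform(X)

# Hierarchical (agglomerative) clustering: build the merge tree with
# Ward linkage, then cut it so that at most two clusters remain.
merge_tree = linkage(X_scaled, method="ward")
hier_labels = fcluster(merge_tree, t=2, criterion="maxclust")

# K-Means with the same (assumed) number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X_scaled)

print("Hierarchical labels:", hier_labels)
print("K-Means labels:     ", kmeans_labels)
```

The two methods number their clusters differently (here `fcluster` starts at 1, K-Means at 0), so what matters is which points end up together, not the label values themselves.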

Although the data set doesn’t have any missing values to deal with, we first need to pre-process the data by encoding all variables numerically, i.e., replacing all categorical features with numbers, before we can build the segmentation models. The first step, though, is to get to know the data.
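As a rough idea of what that check and encoding can look like, here is a small sketch with pandas. The toy table, its column names and its values are hypothetical and only stand in for the real data set, and integer category codes are just one possible encoding.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only,
# not the article's actual data set.
df = pd.DataFrame({
    "ID": [100000001, 100000002, 100000003, 100000004],
    "Sex": ["male", "female", "female", "male"],
    "Education": ["high school", "university", "high school", "graduate"],
    "Age": [67, 22, 49, 45],
})

# Get to know the data: count missing values per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Columns we assume to be categorical in this toy example.
categorical_cols = ["Sex", "Education"]

# Replace each categorical column with integer codes. Categories are
# numbered in alphabetical order, so the codes carry no real ranking.
encoded = df.copy()
for col in categorical_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Because the codes follow alphabetical order, a feature with a natural ranking (such as education level) may deserve an explicit mapping instead, so that the numbers reflect that ranking.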

About the data set

The data set contains information about customers, collected through the loyalty card they use at checkout when purchasing in a physical store. For this article we won’t identify the company, its products or the store, and the data set is restricted to 2000 individuals, each identified only by a unique ID code number to protect the customers’ privacy.

Written by Gonçalo Guimarães Gomes