Machine Learning: Hierarchical and K-Means Clustering with Python

Gonçalo Guimarães Gomes
Published in Analytics Vidhya · 17 min read · Jan 2, 2022


A Data Science approach covering the most commonly used models to perform cluster analysis and customer segmentation in Python.

Image by Gonçalo Guimarães Gomes

The project

We are about to set off on a Python journey: exploring a data set, pre-processing its features accordingly, and segmenting a customer database to achieve the best possible results.

Under the Unsupervised Learning umbrella, we’ll perform Hierarchical and K-Means Clustering to identify the different customer segments that exist in our client’s database.
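To make the two techniques concrete before we touch the real data, here is a minimal sketch that runs both on synthetic data. Everything in it is an assumption made for illustration: the data comes from scikit-learn's `make_blobs` rather than the customer data set, and the choices of four clusters, Ward linkage and standardization are placeholders, not necessarily the settings used later in this article.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features (assumption: NOT the article's data set).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)

# Standardize so that no single feature dominates the distance computation.
X_scaled = StandardScaler().fit_transform(X)

# Hierarchical (agglomerative) clustering: build the merge tree with Ward linkage,
# then cut it so that at most 4 flat clusters remain (labels start at 1).
Z = linkage(X_scaled, method="ward")
hier_labels = fcluster(Z, t=4, criterion="maxclust")

# K-Means with the same (assumed) number of clusters (labels start at 0).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X_scaled)

# Cluster sizes found by each method.
print("Hierarchical cluster sizes:", np.bincount(hier_labels)[1:])
print("K-Means cluster sizes:     ", np.bincount(kmeans_labels))
```

Hierarchical clustering does not need the number of clusters up front: the tree `Z` can be cut at any level afterwards. K-Means needs `n_clusters` fixed before fitting, which is one reason the two methods are often used together.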

Although the data set has no missing values to deal with, we first need to pre-process the data by encoding all variables numerically, i.e., replacing all categorical features with numbers, so that we can build the segmentation models. Before any of that, the first thing is to get to know the data.
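As a rough illustration of that pre-processing step, the sketch below checks for missing values and replaces categorical features with numbers using pandas. The column names, category labels and values are hypothetical and chosen only for the example; they are not taken from the actual data set.

```python
import pandas as pd

# Hypothetical toy records (assumption: column names and values are illustrative only).
df = pd.DataFrame({
    "ID": [100000001, 100000002, 100000003, 100000004],
    "Sex": ["male", "female", "female", "male"],
    "Education": ["high school", "university", "basic", "high school"],
    "Income": [124670, 150773, 89210, 171565],
})

# Confirm there are no missing values before encoding.
missing_total = int(df.isna().sum().sum())
print("Missing values:", missing_total)

# Binary feature: map each label to 0 or 1.
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Ordinal feature: state the category order explicitly so the integer codes
# follow it (basic=0, high school=1, university=2).
education_order = ["basic", "high school", "university"]
df["Education"] = pd.Categorical(
    df["Education"], categories=education_order, ordered=True
).codes

# Every column should now be numeric and ready for the clustering models.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes)
print(df)
print("All numeric:", all_numeric)
```

Spelling out the mapping and the category order, rather than letting a library infer them, keeps the codes reproducible and makes the resulting segments easier to interpret.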

About the data set

The data set contains information about customers, collected through the loyalty card they use at checkout when purchasing in a physical store. For this article, we will not identify the company, the products, or the store. The data set is restricted to 2,000 individuals, each identified by a unique ID code number to protect the customers’ privacy.


Gonçalo Guimarães Gomes
Analytics Vidhya

Portuguese Digital Marketing Analyst. Data-oriented, fully involved, and passionate about Analytics and Data Science. https://www.linkedin.com/in/goncaloggomes