Clustering on Mixed Data Types in Python

Ryan Kemmer
Published in Analytics Vidhya
5 min read · Jan 25, 2021

Image by the Author

During my first ever data science internship, I was given a seemingly simple task: find clusters within a dataset. Given my basic knowledge of clustering algorithms like K-Means, DBSCAN, and GMM, I thought I could easily get this done. However, as I took a closer look at the dataset, I realized it contained a mixture of categorical and continuous data, and many of the common clustering methods I knew would not easily work.

Categorical data consists of discrete categories that commonly do not have any clear order or relationship to each other. This data might look like “Android” or “iOS”.

Continuous data consists of real numbers that can take any value. This data might look like “3.14159” or “43”.
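To make the distinction concrete, here is a minimal sketch of how the two kinds of columns might be told apart in pandas. The column names `os` and `time_spent` and the values are hypothetical, invented only for illustration; they are not taken from the dataset used in this post.

```python
import pandas as pd

# Hypothetical toy records; column names and values are illustrative only.
df = pd.DataFrame({
    "os": ["Android", "iOS", "iOS", "Android"],   # categorical
    "time_spent": [3.14159, 43.0, 12.5, 7.25],    # continuous
})

# Mark the text column as categorical so its type is explicit.
df["os"] = df["os"].astype("category")

# Split the column names by data type.
categorical_cols = df.select_dtypes(include="category").columns.tolist()
continuous_cols = df.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(continuous_cols)
```

Knowing which columns fall into which group is usually the first step, since the two groups need different treatment before any distance between rows can be computed.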

Many datasets contain a mixture of categorical and continuous data. However, it is not obvious how to cluster datasets with mixed data types, because most common clustering algorithms rely on a numeric distance between points. So how do we cluster data that has both categorical and continuous features? Let's take a look at two simple ways to approach this problem using Python.
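As a rough illustration of why mixed types need special handling, the sketch below hand-rolls a Gower-style dissimilarity: continuous features contribute a range-scaled absolute difference, and categorical features contribute 0 if they match and 1 if they do not. This is only one possible way to compare mixed records, and not necessarily one of the two approaches covered in this post. The function name `mixed_distance`, the feature names, and the values are all assumptions made for the example.

```python
def mixed_distance(a, b, ranges):
    """Average per-feature dissimilarity between two records (dicts).

    `ranges` maps each continuous feature name to its (nonzero) value range;
    any feature not listed in `ranges` is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in ranges:
            # Continuous: absolute difference scaled to [0, 1] by the range.
            total += abs(a[key] - b[key]) / ranges[key]
        else:
            # Categorical: 0 if the categories match, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical customer records.
x = {"os": "Android", "time_spent": 10.0}
y = {"os": "iOS", "time_spent": 30.0}
z = {"os": "Android", "time_spent": 30.0}

# Assumed range of time_spent across the whole (imaginary) dataset.
ranges = {"time_spent": 40.0}

d_xy = mixed_distance(x, y, ranges)
d_xz = mixed_distance(x, z, ranges)
print(d_xy, d_xz)
```

Here `x` and `z` share the same operating system, so they end up closer together than `x` and `y`, even though the difference in `time_spent` is identical in both pairs.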

Dataset

In this post, I am going to cluster a small dataset I created that has a mixture of categorical and continuous features. The fake data represents customer records that might be used to understand the customers of an e-commerce website or app. Our fake…
