KMeans With Wine Qualities

Dominic Graziano
INST414: Data Science Techniques
2 min read · Dec 2, 2023

Data and Collection

This project clusters red and white wine datasets by alcohol percentage and fixed acidity. From a business perspective, a seller could pick a popular wine, find the cluster it belongs to, and advertise the other wines in that cluster in the hope of selling more to consumers. I found the dataset on Kaggle, which provides separate CSV files for red and white wine, and the corresponding code is in a Jupyter Notebook. In the notebook I used Pandas, Matplotlib, and Scikit-Learn’s StandardScaler and KMeans in Python to reach these insights.

Data Cleaning

The dataset had already been cleaned, so the main step I took was to scale the values with Scikit-Learn’s StandardScaler. I applied the scaling to the features I selected for visualization.
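As a rough illustration of the scaling step, here is a minimal sketch. The rows are made-up stand-in values, not the Kaggle data, and the column names `alcohol` and `fixed acidity` are assumptions about how the CSV files label these features.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in rows; the real notebook reads the Kaggle CSV files.
wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.8, 13.0],
    "fixed acidity": [7.4, 7.8, 11.2, 6.3, 8.1, 7.0],
})

# Assumed column names for the two features used in the article.
features = ["alcohol", "fixed acidity"]

# Rescale each selected column to mean 0 and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(wine[features])

print(scaled.mean(axis=0))  # per-column means, close to 0
print(scaled.std(axis=0))   # per-column standard deviations, close to 1
```

Scaling matters here because KMeans relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clusters.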

Analysis

Before visualizing the clusters, I needed to choose the number of clusters, so I used the elbow method, which suggested 5 clusters for each dataset. I then used Scikit-Learn to fit KMeans on the red and white wine data separately and added a column to each dataset with the corresponding cluster id. From there I plotted the clusters with Matplotlib, giving each cluster a different color.
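The elbow method and the cluster assignment could look roughly like the sketch below. It runs on synthetic stand-in data, not the actual wine files, and the column names, the range of k values, `n_init`, and `random_state` are all assumptions chosen for illustration. Only the final choice of 5 clusters comes from the analysis described above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real notebook loads one CSV per wine type.
rng = np.random.default_rng(0)
wine = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 300),
    "fixed acidity": rng.normal(7.5, 1.5, 300),
})
scaled = StandardScaler().fit_transform(wine[["alcohol", "fixed acidity"]])

# Elbow method: record the inertia (within-cluster sum of squares) for each k.
inertias = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scaled)
    inertias[k] = model.inertia_

# Plotting inertias against k and looking for the bend gives the elbow.
# Here k = 5 is simply taken from the article, not derived from this fake data.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
wine["cluster"] = kmeans.fit_predict(scaled)

print(wine["cluster"].value_counts().sort_index())
```

The same steps would be repeated for the red and white datasets separately, each with its own scaler and model.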

Red Wine Clusters
White Wine Clusters

Limitations

Overall, there are limitations in choosing these two attributes for clustering, as a number of other attributes could have been chosen. Wines in the same cluster won’t necessarily taste the same or be similar products; the clustering only tells us they are close in fixed acidity and alcohol percentage. I also find the visuals quite crowded, and it might have been better to plot a sample rather than all of the data.
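One way to try the sampling idea is sketched below. This is not what the notebook does: the data and cluster ids are random placeholders, and the 20% sample fraction, column names, and color map are assumptions picked for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder data with made-up cluster ids standing in for the KMeans output.
rng = np.random.default_rng(1)
wine = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.2, 2000),
    "fixed acidity": rng.normal(7.5, 1.5, 2000),
    "cluster": rng.integers(0, 5, 2000),
})

# Plot a random 20% of the rows to reduce overplotting.
subset = wine.sample(frac=0.2, random_state=0)

fig, ax = plt.subplots()
ax.scatter(
    subset["alcohol"],
    subset["fixed acidity"],
    c=subset["cluster"],  # one color per cluster id
    cmap="tab10",
    s=12,
)
ax.set_xlabel("alcohol")
ax.set_ylabel("fixed acidity")
ax.set_title("Sampled wine clusters")
```

In a notebook the figure would display inline; in a script, `fig.savefig(...)` writes it to a file. Fitting the clusters on the full data and sampling only for the plot keeps the cluster assignments unchanged.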

Github Link
