Clumps of Wine

Teja Tammali
INST414: Data Science Techniques
5 min read · May 12, 2022

It is no surprise that different types of wine can produce hundreds of varied results, especially when you factor in temperature, fermentation duration, storage conditions, etc. To see the different clusters of wine, I used a dataset from the University of California, Irvine’s machine learning repository. This particular dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wine. For this dataset, the attributes are:

  • Alcohol
  • Malic Acid
  • Ash
  • Alcalinity of ash
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline

From this particular dataset, the non-obvious insight I wanted to extract was a way to section the wines off into different clusters, picking out the attributes that link similar wines within each cluster. This insight could help inform winemakers about consistency, i.e., how to reliably make wine the certain way they want.

The initial step is to read the data from the CSV file, which holds the 13 attributes and their values. From the code below, we can conclude that all the data types are numerical, either float or int.

data.info()
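For reference, here is a minimal loading sketch. The original analysis reads a local CSV; as an assumption for reproducibility, this version pulls the same UCI wine data through sklearn's bundled copy instead:

```python
import pandas as pd
from sklearn.datasets import load_wine

# load_wine() ships the same UCI wine data used in this post
wine = load_wine(as_frame=True)
data = wine.frame.drop(columns="target")  # keep only the 13 chemical attributes

data.info()  # all 13 columns come back numeric, 178 rows, no nulls
```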

After getting the data types of the 13 attributes, we can check the skewness of the dataset by simply typing the code below. To keep it simple, the results are printed out as a table instead of visualizations.

data.skew()

In order to check for outliers, we can create boxplots for easy answers. For example, the attribute color_intensity was found to have a couple of outliers beyond 10 units.
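The boxplot check can be sketched like this (again assuming sklearn's copy of the data; the column name color_intensity matches the attribute list above):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

data = load_wine(as_frame=True).frame.drop(columns="target")

# one boxplot per attribute would work too; here we focus on color_intensity
data.boxplot(column="color_intensity")
plt.title("color_intensity")
plt.savefig("color_intensity_box.png")

# the points the boxplot flags as outliers sit above 10 units
print((data["color_intensity"] > 10).sum())
```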

Since we want to look for clusters, we are going to use the K-means algorithm. To find the proper k value and to form the data clusters, I used sklearn as the framework for my analysis, importing StandardScaler to remove the mean and scale each feature to unit variance. In addition to StandardScaler, I also used PCA (principal component analysis). PCA is a technique used to emphasize variation and bring out strong patterns in a dataset; simply put, it makes data easier to explore and visualize. Since the wine dataset is not 2D, we gain a lot from PCA because it can remove noise by reducing a large number of features to just a couple of principal components. Essentially, there are two principal components we are looking for. The first is computed so that it explains the greatest amount of variance in the original features. The second is orthogonal to the first and explains the greatest amount of the variance left over after the first principal component. That being said, we want to find the variance explained by 2 principal components, which was found to be 55.41%.
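Here is roughly what that scaling-plus-PCA pipeline looks like in sklearn; the explained-variance sum should come out near the 55.41% reported above:

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_wine().data                          # 178 wines x 13 attributes
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)           # project onto the top 2 components

# PC1 explains the most variance, PC2 the most of what remains
print(pca.explained_variance_ratio_.sum())    # ~0.5541
```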

Next comes the clustering part. We need to find the proper K value to use, so to do that, we will use the Silhouette Score technique and the Elbow analysis.
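One way to run both checks at once is to sweep k and record the inertia (for the elbow plot) and the silhouette score, roughly like this:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_wine().data))

inertias, silhouettes = [], []
ks = range(2, 9)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias.append(km.inertia_)                        # input for the elbow plot
    silhouettes.append(silhouette_score(X_pca, km.labels_))

# the k with the highest silhouette score is the best candidate
best_k = list(ks)[silhouettes.index(max(silhouettes))]
print(best_k)
```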

As we can see from the Silhouette Score, the best value for clustering is K = 3. With our K value chosen, we can now make a clustered data graph. Two sets of points are scattered: the wines projected onto the 2 PCA components, and the centroid of each cluster in PCA space. A centroid is the mean of each variable, a point in n-dimensional space. The centroids are marked with an X, while the three clustered areas are color-coded as seen below.
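The clustered graph described above can be reproduced along these lines (colors per cluster, an X per centroid; the specific colormap and marker sizes are my own choices):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_pca = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(load_wine().data))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_pca)

# color each wine by its assigned cluster
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=km.labels_, cmap="viridis", s=20)
# mark each cluster centroid (the mean of its points) with an X
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            marker="X", s=200, c="red")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.savefig("wine_clusters.png")
```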

As mentioned in the introduction, there are three different types of wine, made from grapes grown from three different cultivars. That being said, the three clusters in the graph above represent those three cultivars. I find it super interesting how different these three cultivars are from each other, even though they are grown in the same area. One small change in a factor can produce the slightest, unique taste in a wine.

One major bug I encountered while coding this project was figuring out how to graph the silhouette scores. I had never done this before, so it was also the most challenging part. I knew I had to combine sklearn with matplotlib in order to use plt. A key piece I was missing, though, was that I had to set the plotting range to the length of the silhouette variable. There weren’t really any limitations from this dataset, as it was already clean, with proper continuous variables and no categorical ones. My biggest takeaway is how cluster analysis can be done on a multi-dimensional dataset. There are just so many data points that can go into describing and showing the value of the story behind the data.
