Classifying wine type by clustering

Abigail Chen
INST414: Data Science Techniques
May 12, 2022

Most college students are approaching 21, the age at which they can legally drink alcohol, so I wanted to see whether it would be feasible to determine the type of a wine from its data.

My data is the wine recognition dataset that ships with scikit-learn, a Python machine learning library. Although most of the information is very new to me, I wanted to try clustering this existing data to see whether the analysis would turn up anything interesting.

First, I looked carefully at the main variables in this data, which are the following columns:

  • Alcohol
  • Malic acid
  • Ash
  • Alcalinity of ash
  • Magnesium
  • Total phenols
  • Flavanoids
  • Nonflavanoid phenols
  • Proanthocyanins
  • Color intensity
  • Hue
  • OD280/OD315 of diluted wines
  • Proline
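For anyone who wants to look at the same columns, here is a minimal sketch of loading the data. It assumes scikit-learn and pandas are installed and uses scikit-learn's load_wine function; the variable names features and labels are just illustrative.

```python
from sklearn.datasets import load_wine

# as_frame=True returns the measurements as a pandas DataFrame.
wine = load_wine(as_frame=True)
features = wine.data    # one row per wine sample, one column per variable
labels = wine.target    # the known class of each sample

print(features.shape)
print(list(features.columns))
```

The column names printed by scikit-learn are lowercase with underscores (for example malic_acid), but they correspond to the list above.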

After some searching, I decided that the first four might play an important role in determining the wine category. To get a better sense of the data, I drew several sets of seaborn graphs so that I could compare the charts before and after clustering.

Next, I needed to choose a suitable K value (the number of clusters) for my analysis. To do this, I used the Elbow method. At first I did not quite understand the principle behind it, until I searched and found that the name literally refers to an elbow: the bend in the curve of cost against K. The method is suited to relatively small K values. When the selected K is smaller than the true number of clusters, the cost drops sharply each time K increases by one. When the selected K is larger than the true number, each further increase in K changes the cost only slightly. The right K therefore sits at the turning point, which looks like an elbow.
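The idea can be sketched in a few lines. This example assumes k-means as the clustering algorithm and uses scikit-learn's KMeans, whose inertia_ attribute (the within-cluster sum of squared distances) plays the role of the cost. Standardizing the columns, the range of K from 1 to 10, and the n_init and random_state settings are illustrative choices, not details taken from this post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X = load_wine().data
# Standardize so that large-valued columns such as proline do not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

costs = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    costs[k] = model.inertia_  # within-cluster sum of squared distances

# Print each cost and how far it dropped compared with the previous k.
previous = None
for k, cost in costs.items():
    drop = "" if previous is None else f"  (drop: {previous - cost:.1f})"
    print(f"k={k:2d}  cost={cost:8.1f}{drop}")
    previous = cost
```

Reading down the printed drops, the elbow is the K after which the drops become small compared with the earlier ones.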

[Figure: elbow graph of K versus the cost function]

I drew a graph of K versus the cost function, as shown above. The cost drops quickly at the beginning and starts to flatten out at the elbow, so the K at the elbow is taken as the value to use: K=3. Not every problem can be settled by drawing an elbow graph, but I was fortunate that this one let me settle on K=3 quickly.
Unfortunately, I ran into difficulties with the K-value calculations, and I must admit that I was unable to finish this step of the clustering because I was too optimistic about how much time I would have for the final at the end of the semester.
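
For reference, the unfinished step could look roughly like the sketch below. It is not a result from this post: it assumes k-means via scikit-learn's KMeans with K=3 on standardized columns, and the n_init and random_state values are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

# K=3 comes from the elbow graph; the other settings are illustrative choices.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Cross-tabulate clusters against the known classes. Cluster numbers are
# arbitrary, so look at how rows line up with columns, not at matching numbers.
table = pd.crosstab(wine.target, cluster_ids, rownames=["class"], colnames=["cluster"])
print(table)
```

If most of each row falls into a single column, the clusters line up well with the known wine categories.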

The limitation of my analysis is that I didn't finish the final clustering step, which left the whole analysis half done. But I did realize from this assignment, and from the class as a whole, that I should take every assignment seriously, even when it is very forgiving in terms of due dates. The closer I got to the end of the semester, the harder my stress and low energy made it to complete this assignment. I learned this lesson deeply.
As a Python class, this class was definitely a huge challenge, and waiting until the last days to do the work taught me a huge lesson. But I'm still glad that I at least learned the Elbow method.
