Can a Machine Learning Model Taste Wine?

Aleksandra Rojek
5 min read · Apr 6, 2020


Since “drinking a small glass of wine a day is good for your long-term health” (Daily Mail), and we obviously want to live long and healthy lives, why don’t we use data to optimise our choice?

This article will answer the following questions:

1. Where does quality wine come from?

2. The more expensive the wine, the better its quality: truth or myth?

3. How to choose good quality wine wisely?

4. Can a predictive model identify the quality of a wine?

To explore these questions, I used the Wine Reviews dataset, which consists of around 130k wine reviews, each with a description, country of origin, variety, price and score.

1. Does quality wine need Mediterranean weather?

To find out where quality wines come from, I calculated the average wine rating for each of the countries in the dataset.

Judging by the average score, wines from England are in the lead, with India, Austria and Germany close behind. At the other end, countries like Ukraine, Egypt and Peru produce the lowest-rated wines.

These somewhat surprising results might be due to sample size. The dataset contains only 63 wines (0.05%) from England and 8 wines (0.007%) from India, compared with 3034 wines (2.5%) from Austria and 1992 wines (1.7%) from Germany.
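To make the sample-size caveat concrete, here is a minimal sketch of how a per-country average can be reported next to the number of wines behind it. The rows are made-up toy data, not values from the Wine Reviews dataset, and `country_summary` is a hypothetical helper name. The actual analysis in the repository may well be done differently, for example with a pandas `groupby`.

```python
from collections import defaultdict
from statistics import mean

# Toy rows standing in for the real dataset: (country, points).
# These values are invented for illustration only.
reviews = [
    ("England", 92), ("England", 91),
    ("US", 88), ("US", 90), ("US", 85), ("US", 93),
    ("Austria", 90), ("Austria", 91), ("Austria", 89),
]

def country_summary(rows):
    """Return {country: (mean_points, n_wines, share_of_dataset)}."""
    by_country = defaultdict(list)
    for country, points in rows:
        by_country[country].append(points)
    total = len(rows)
    return {
        country: (mean(scores), len(scores), len(scores) / total)
        for country, scores in by_country.items()
    }

summary = country_summary(reviews)

# Rank by mean score, highest first, and show the sample size next to it.
for country, (avg, n, share) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{country:8s} mean={avg:.2f} n={n} share={share:.1%}")
```

Printing the count next to the mean makes it obvious when a top-ranked country rests on only a handful of reviews.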

Let’s get a better view of the quality of wine by country.

From here we can see that even though England has the highest mean, the maximum score achieved by an English wine is only 96. We can also note that the countries whose wines reached the maximum score are Italy, Portugal, US, France and Australia.

Based on maximum scores excluding outliers, the US, Germany and Austria are in the lead with 98 points. From the above graph, we can also see that the US has the widest spread of all countries. This is expected, since wines from this country make up around 40% of the dataset.

2. Does the price of the wine depend on its quality?

The prices in the dataset range from $4 to $3300 per bottle. That makes me wonder whether price is really an indicator of quality. To find out, I grouped the wines by price range and plotted the distribution of scores for each group:

Overall, the average score for each group increases as the price increases, so in the majority of cases the price does reflect the wine’s quality. However, most of the categories contain outliers: we can find very good wines in the lower price ranges, and poor-quality wines even when spending over $100 per bottle! So be aware, price is not the only indicator of a wine’s quality!
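The grouping behind this comparison could be sketched roughly as follows. The bin edges are an assumption: only the ranges [40,60) and [60,100) and the "over $100" group are named in this article, so the lower edges here are guesses. The wines are made-up toy data, and `price_bin` is a hypothetical helper.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean

# Assumed bin edges; only [40,60), [60,100) and "over $100" appear in the article.
EDGES = [0, 20, 40, 60, 100]

def price_bin(price):
    """Return a left-closed interval label such as '[40,60)' for a price."""
    if price < 0:
        raise ValueError("price must be non-negative")
    i = bisect_right(EDGES, price) - 1
    upper = EDGES[i + 1] if i + 1 < len(EDGES) else "inf"
    return f"[{EDGES[i]},{upper})"

# Toy (price, points) pairs, invented for illustration only.
toy_wines = [(10, 86), (15, 84), (45, 89), (55, 91), (80, 92), (150, 88)]

scores_by_bin = defaultdict(list)
for price, points in toy_wines:
    scores_by_bin[price_bin(price)].append(points)

# Average score per price range; a box plot would use the full lists instead.
mean_by_bin = {label: mean(scores) for label, scores in scores_by_bin.items()}
for label, avg in mean_by_bin.items():
    print(label, avg)
```

Keeping the full list of scores per bin, rather than only the mean, is what lets a plot show the outliers in each price range.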

The best way to demonstrate this is to take a closer look at the $4 and $3300 bottles:

Let’s think twice before spending an average monthly salary on a bottle of fermented grapes!

3. How can I choose wine based on data?

In the previous section, we concluded that there is a link between price and a wine’s quality, but that price is not the only indicator. Now we will try to find out what else can help us choose the best wine. To do that, let’s first look at the correlation between a few fields in our dataset:

Working our way through the plot, we can see that there is a correlation between price and the number of points given. On the other hand, it seems that the vintage of a wine does not indicate its quality. One interesting fact we can draw from the graph, however, is that the length of the description seems to be related to the score a wine achieves. Let’s have a closer look:

Did you know that the more descriptive the bottle is, the bigger the chance that it’s a good quality wine?
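As a rough illustration of how such a relationship can be measured, the sketch below computes a Pearson correlation between description length and points. It assumes description length is counted in characters, which the article does not state. The reviews are invented toy examples, so the resulting number says nothing about the real dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy (description, points) pairs, invented for illustration only.
toy_reviews = [
    ("Fruity.", 84),
    ("Ripe cherry with soft tannins.", 88),
    ("Layers of blackberry, cedar and spice lead to a long, polished finish.", 93),
]

# Assumption: description length is measured in characters.
lengths = [len(description) for description, _ in toy_reviews]
points = [score for _, score in toy_reviews]

r = pearson(lengths, points)
print(f"correlation between description length and points: {r:.2f}")
```

A coefficient close to 1 means longer descriptions tend to go with higher scores. It does not by itself show that one causes the other.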

4. Do you need to try the wine to see if it’s a top-shelf candidate?

Have you ever bought and opened a bottle of wine just to find out it’s barely good enough to cook with? Did you spend loads of money on wines you didn’t even like? Yup, that makes two of us.

So what if there was an algorithm that could help you assess the quality of wine? You’d never have to drink bad wine again. Sounds good, right?

Well, I have attempted to build a simple classification model that assesses whether a wine is above or below a given score threshold, in this case the average score. The model uses various features to make the decision:

· Price range

· Variety

· Country of origin

· Description length

· Vintage

· Winery

· Description of the wine

Some of the features were numeric, so they could be fed to the model as they were. Other features were categorical, so creating dummy variables was necessary to include them in the model. Finally, the description of the wine is a string; to use it in the model, I tokenized it, removed punctuation and stopwords, lemmatized it and created vectors, the so-called bag of words.
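The preparation steps above can be sketched in plain Python as follows. This is only an illustration under several assumptions: the stopword list is a tiny hand-picked set, lemmatization is left out because it needs an external NLP library, and the helper names (`bag_of_words`, `one_hot`, `vectorize`) are hypothetical. The actual project may rely on dedicated libraries for each step.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "with", "is", "this", "in"}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens (drops punctuation), remove stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def one_hot(value, categories):
    """Dummy-variable encoding of a single categorical value."""
    return [1 if value == c else 0 for c in categories]

def vectorize(counts, vocabulary):
    """Turn token counts into a fixed-length vector ordered by the vocabulary."""
    return [counts[word] for word in vocabulary]

# Toy descriptions, invented for illustration only.
descriptions = [
    "Ripe cherry and a hint of oak.",
    "Crisp, with cherry and citrus.",
]
bags = [bag_of_words(d) for d in descriptions]
vocabulary = sorted(set().union(*bags))
vectors = [vectorize(bag, vocabulary) for bag in bags]

countries = ["France", "Italy", "US"]
country_dummies = one_hot("Italy", countries)

print(vocabulary)
print(vectors)
print(country_dummies)
```

The description vector and the dummy variables can then be concatenated with the numeric features to form one row of model input.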

The result of this model should be compared to a baseline of 53%, since 47% of the wines score above the average. (A model that predicts that none of the wines is above the threshold will achieve 53% accuracy.)
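The baseline arithmetic can be checked with a short sketch. It assumes "above average" means strictly greater than the mean score, which the article does not spell out, and both function names are hypothetical. The toy scores are invented; only the 47/53 split comes from the article.

```python
from statistics import mean

def label_above_average(points):
    """Label each wine True if its score is strictly above the mean score."""
    threshold = mean(points)
    return [p > threshold for p in points]

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return max(positives, negatives) / len(labels)

# Toy scores, invented for illustration only.
toy_points = [85, 86, 87, 88, 95]
toy_labels = label_above_average(toy_points)

# Class balance quoted in the article: 47% above average, 53% not.
article_labels = [True] * 47 + [False] * 53
baseline = majority_baseline_accuracy(article_labels)
print(f"baseline accuracy: {baseline:.0%}")
```

Any classifier worth keeping has to beat this number, because it can be reached without looking at a single feature.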

We can see that the model scored much better than a well-informed guess.

The 3 most important features for the model were:

· Description length

· Price range [40,60)

· Price range [60,100)

Numerous measures were taken to clean the data, including imputing missing values, cleaning the text and, where necessary, removing observations. This dataset has the potential to test out numerous ML models. Next steps could involve implementing a multiclass model where the exact number of points (or a points range) is predicted. Deep learning could also be used to predict the wine variety based on the description.

To conclude, as Simon Hoggart states: “life is too short to drink bad wine”, so use ML to assess its quality.

Github repository:
https://github.com/Aleksandra-Rojek-p/wineReviewAnalysis
