How to Choose the Best Bottle of Wine With Scikit-Learn and a Bit of NLTK?

Anastasia Lysenko
Published in The Startup · 5 min read · Oct 29, 2020

Have you ever found yourself in front of a wine shelf trying to figure out which bottle goes well with Dungeons and Dragons? Or have you ever wondered if anyone’s going to notice that you’ve spent $5 on a wine for the family dinner? Well, a sommelier could help, but unfortunately they rarely exist in grocery stores nowadays.

A sommelier (/ˈsɒməljeɪ/ or /sʌməlˈjeɪ/; French pronunciation: ​[sɔməlje]), or wine steward, is a trained and knowledgeable wine professional, normally working in fine restaurants, who specializes in all aspects of wine service as well as wine and food pairing.

Don’t worry, I’m here to solve your problem!

What if we could define a model that would give us a list of taste descriptors based on ‘year_made’, ‘country’ and some other features of a bottle? Or maybe you’d simply like to know if a bottle is good or bad.

STEP 1. Get to know your data

After a couple of hours of internet surfing I found a proper dataset on Kaggle. It includes 130k rows of wine reviews and descriptions. I converted the ‘points’ column into a binary 0/1 ‘recommend’ column, which is my first target.
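
A rough sketch of that step, assuming the Kaggle wine reviews CSV; the 90-point cut-off is my own choice, since the article doesn’t state the threshold it used:

```python
import pandas as pd

# Assumption: the Kaggle "Wine Reviews" file with ~130k rows and a 'points' column.
df = pd.read_csv("winemag-data-130k-v2.csv")

# Hypothetical cut-off: treat 90+ points as a recommendation -- the article
# doesn't state the exact threshold it used.
df["recommend"] = (df["points"] >= 90).astype(int)

print(df["recommend"].value_counts(normalize=True))
```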

STEP 2. NLP

Now we need to take a closer look at the ‘description’ column to understand whether we can extract any taste descriptors from it. To do so, let’s prepare the dataset for NLP.

NLTK is an awesome library which has all kinds of cool tools for Natural Language Processing.

Preparation steps:

  • punctuation removal
  • tokenisation
  • stop-words removal
  • part of speech tagging

I am not going to go deep into any of these steps here, but you can find all of the necessary information here; a minimal sketch of the whole preparation pipeline is shown below.
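
Here is a minimal sketch of those four preparation steps with NLTK; the exact pipeline used in the project may differ:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads for the tokenizer, stop-word list and POS tagger.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")

stop_words = set(stopwords.words("english"))

def prepare(text):
    # 1. punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. tokenisation (lower-casing along the way)
    tokens = word_tokenize(text.lower())
    # 3. stop-words removal
    tokens = [t for t in tokens if t not in stop_words]
    # 4. part-of-speech tagging
    return nltk.pos_tag(tokens)

print(prepare("Aromas include tropical fruit, broom, brimstone and dried herb."))
```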

STEP 3. Extract meaning

I assumed that taste descriptors would be either adjectives or nouns, so I extracted and lemmatised those and made a list of the most common words.
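
A sketch of that extraction, building on the prepare() helper and df from the sketches above; the JJ/NN tag filter and lemmatiser choices are standard NLTK, the rest is my own shorthand:

```python
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
nltk.download("omw-1.4")
lemmatizer = WordNetLemmatizer()

def taste_candidates(tagged_tokens):
    """Keep adjectives (JJ*) and nouns (NN*) and lemmatise them."""
    words = []
    for word, tag in tagged_tokens:
        if tag.startswith("JJ"):
            words.append(lemmatizer.lemmatize(word, pos="a"))
        elif tag.startswith("NN"):
            words.append(lemmatizer.lemmatize(word, pos="n"))
    return words

# Count candidates across all descriptions, reusing prepare() from the previous sketch.
counter = Counter()
for desc in df["description"]:
    counter.update(taste_candidates(prepare(desc)))

print(counter.most_common(50))
```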

After filtering those by hand I finally got a more or less good-looking list.

STEP 4. Prepare target

For the sake of the experiment I decided to divide these words into 4 groups of 6 descriptors each (the 6th descriptor in each group is ‘other’, for all NaN values) to try and compare 4 different models.
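
A sketch of how one such target column could be built; the five descriptor words below are placeholders, since the article’s hand-curated lists aren’t shown:

```python
# Hypothetical descriptor group -- the article's real word lists were curated by hand
# and aren't shown, so these five words are only placeholders.
fruit_group = ["cherry", "apple", "citrus", "berry", "plum"]

def label_description(description, group):
    """Return the first descriptor from the group found in the text, else 'other'."""
    text = description.lower()
    for word in group:
        if word in text:
            return word
    return "other"

# One of the four target columns; the other three groups would be built the same way.
df["taste_fruit"] = df["description"].apply(label_description, group=fruit_group)
print(df["taste_fruit"].value_counts())
```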

STEP 5. Feature engineering

This part is more of an art than a science, so we can get as creative as we like. But do not forget about the bias/variance trade-off.

Some ideas that I have implemented:

  • extract more meaningful words and one-hot encode them
  • extract the year the wine was made from the ‘title’ column

To keep things simpler I also reduced the cardinality of the ‘country’, ‘region’ and ‘variety’ columns.
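
A sketch of both ideas; the column names ‘title’ and ‘region_1’ follow the Kaggle dataset, and the top-30 cut-off for cardinality reduction is my own assumption:

```python
# Pull a four-digit vintage year out of the 'title' column, e.g. "... 2013 Pinot Noir ...".
df["year_made"] = (
    df["title"].str.extract(r"\b(19\d{2}|20\d{2})\b", expand=False).astype(float)
)

# Reduce cardinality: keep the most frequent categories and lump the rest into 'other'.
# The top-30 cut-off is an assumption, not a value from the article.
def reduce_cardinality(series, top_n=30):
    top = series.value_counts().nlargest(top_n).index
    return series.where(series.isin(top), "other")

for col in ["country", "region_1", "variety"]:  # 'region_1' is the dataset's region column
    df[col] = reduce_cardinality(df[col])
```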

STEP 6. Build taste model

6.1 Quadratic discriminant analysis

We have a relatively large number of observations, so I decided: why not try QDA and see what happens?

For this model I one-hot encoded all of the categorical variables, imputed and scaled the data, and also selected the top 50 features with SelectKBest.
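
A sketch of that pipeline; only k=50 comes from the text, while the f_classif scoring function, the exact feature columns and the ‘taste_fruit’ target are my assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = ["country", "region_1", "variety"]
numeric = ["price", "year_made"]

preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        # dense output because QDA cannot work with a sparse matrix
        ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), categorical),
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
])

qda_model = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(f_classif, k=50)),   # top 50 features, as in the article
    ("qda", QuadraticDiscriminantAnalysis()),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[categorical + numeric], df["taste_fruit"], random_state=42
)
qda_model.fit(X_train, y_train)
print("accuracy:", qda_model.score(X_test, y_test))
```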

And here are the results:

As you can see, the model couldn’t even beat the baseline.

I would not actually recommend QDA or LDA for a large dataset with many features: QDA has to estimate a separate covariance matrix for each class, and with more features those estimates get noisier, making it harder for the model to distinguish between classes.

6.2 Logistic Regression

Logistic regression is still not that computationally expensive, yet for classification it usually performs much better than other regression models.
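
A minimal sketch, reusing the preprocessing pipeline and train/test split from the QDA section and swapping in LogisticRegression (max_iter is my own choice):

```python
from sklearn.linear_model import LogisticRegression

# Same preprocessing and feature selection as the QDA sketch; only the estimator changes.
logreg_model = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(f_classif, k=50)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

logreg_model.fit(X_train, y_train)
print("accuracy:", logreg_model.score(X_test, y_test))
```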

In terms of preparing the data for the model I followed the same steps as for the QDA. And here are my results:

The accuracy score increases drastically, and precision and recall are much higher on average.

6.3 Ridge Classifier

Let’s see if we can increase our model’s performance by adding an alpha parameter. The preprocessing steps are the same.

This one is great for datasets where the number of features is larger than the number of observations, or where the number of observations is just too low. With the extra alpha (regularisation) parameter the model generalises much better.
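
A sketch with RidgeClassifier, again reusing the pipeline from the QDA section; the article doesn’t say which alpha was used, so the default of 1.0 is shown:

```python
from sklearn.linear_model import RidgeClassifier

# alpha is the L2 regularisation strength; the article doesn't state the value used,
# so the default of 1.0 is shown here.
ridge_model = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(f_classif, k=50)),
    ("ridge", RidgeClassifier(alpha=1.0)),
])

ridge_model.fit(X_train, y_train)
print("accuracy:", ridge_model.score(X_test, y_test))
```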

6.4 Random Forest Classifier

Finally we get to the tree-based models. These are worth trying when the patterns in the data are non-linear. Tree models are good at capturing peculiarities, so it is quite easy to overtrain them.

I decided to go with the OrdinalEncoder for this one. It is much faster on larger datasets, and for a tree model it doesn’t really matter whether the values are on the same scale or not. A randomised search over hyper-parameters might also help you increase the performance.
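
A sketch of that setup, reusing the columns and imports from the earlier sketches; OrdinalEncoder and the randomised search come from the text, but the search space below is hypothetical:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import OrdinalEncoder

# Trees don't care about feature scale, so OrdinalEncoder is enough for the categoricals.
rf_preprocess = ColumnTransformer([
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ]), categorical),
    ("num", SimpleImputer(strategy="median"), numeric),
])

rf_model = Pipeline([
    ("prep", rf_preprocess),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Hypothetical search space -- the article doesn't list the grid it actually used.
search = RandomizedSearchCV(
    rf_model,
    param_distributions={
        "rf__n_estimators": [100, 300, 500],
        "rf__max_depth": [None, 10, 20],
        "rf__min_samples_leaf": [1, 5, 10],
    },
    n_iter=10,
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
print("best CV accuracy:", search.best_score_)
```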

And voilà, we have increased the model’s performance one more time.

STEP 7. Is the wine good?

How can one talk about good predictive models without ever trying XGBoost? So it’s time to build one.

XGBoost builds its ensemble very gradually, each new tree correcting the mistakes of the previous ones, which lets it chip away at the errors that a Random Forest is not able to reduce.
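
A sketch of the binary ‘recommend’ model; the features and ordinal-encoded preprocessing are reused from the Random Forest sketch, and the hyper-parameter values are placeholders:

```python
from xgboost import XGBClassifier

# Target: the binary 'recommend' column from STEP 1; features and preprocessing
# come from the Random Forest sketch. Hyper-parameter values are placeholders.
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    df[categorical + numeric], df["recommend"], random_state=42
)

xgb_model = Pipeline([
    ("prep", rf_preprocess),
    ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.1, random_state=42)),
])

xgb_model.fit(X_train_r, y_train_r)
print("accuracy:", xgb_model.score(X_test_r, y_test_r))
```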

Even without tuning hyper-parameters the result looks great. But be careful: XGBoost is computationally much more expensive, and on a weaker machine it might be a better idea to go for a less complicated model.

So I think our ‘sommelier’ model is done and it is time to go and check how it actually works :)

Code is available here
