How to avoid bad wines without any particular knowledge of the subject?

Choosing a wine seems extremely difficult for someone with no particular knowledge of the subject: the number of varieties you can find is enormous, and studying all of them would take more than a lifetime.
Is there a way to minimize the chances of picking a bad wine? Is there anything that can guide us toward something decent, or even excellent?
That is what we will try to figure out.
To do so, we use a wine review database extract from Kaggle (link) and follow the CRISP-DM method.
Which countries produce the most varieties of wine?
First, we need to evaluate the number of varieties and the share of varieties produced by each country, in order to get familiar with what we are most likely to see in wine shops.
If we plot the proportion of wine varieties produced by each country in the database on a world map, we obtain the following result:

- The United States leads in the number of different varieties in this database, followed by France and Italy. Together, they produce about 72% of the total number of wine varieties.
- Medium producers such as Chile, Argentina, South Africa, Spain, Germany, Austria, Australia, and New Zealand represent about 26% of the different varieties of wine.
- The remaining producing countries represent less than 2% of the wine varieties.
- Most of the wine producers are located in Europe.
If you go to a wine shop, you will most likely have to choose between wines from big and medium producers.
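The shares listed above can be sketched as follows. This is only one plausible way to compute them, not necessarily how the figures in this post were obtained: the sample rows, the `variety_share_by_country` helper, and the choice to count distinct (country, variety) pairs are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (country, variety) pairs standing in for the review data.
reviews = [
    ("US", "Pinot Noir"),
    ("US", "Zinfandel"),
    ("US", "Pinot Noir"),
    ("France", "Chardonnay"),
    ("France", "Gamay"),
    ("Italy", "Sangiovese"),
]


def variety_share_by_country(rows):
    """Return each country's share of the distinct (country, variety) pairs."""
    varieties = defaultdict(set)
    for country, variety in rows:
        varieties[country].add(variety)  # duplicates collapse in the set
    total = sum(len(v) for v in varieties.values())
    return {country: len(v) / total for country, v in varieties.items()}


shares = variety_share_by_country(reviews)
for country, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {share:.0%}")
```

Counting distinct pairs rather than raw reviews keeps a country with many reviews of the same variety from dominating the share.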
What are the most used adjectives to describe a wine?
The first thing you will probably try to decode is the description of the wine. How is a wine described? And is the description really representative of its quality?
Let’s first see which adjectives are most used to describe a wine.
The result is shown in the following word cloud. Only the top 200 adjectives are shown.

How representative is the description of the quality of the wine? Can we trust it?
First, we classify the wines into bad, average, and good categories.

The points attributed to the wines follow a normal distribution and range from 80 to 100 points. We split the wines into the following categories using the mean of the distribution and its variance:
- Bad: wines with points below the variance band around the mean
- Average: wines with points within the variance band around the mean
- Good: wines with points above the variance band around the mean
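The split above can be sketched in a few lines. The sketch assumes the band is one standard deviation on each side of the mean, which is our reading of "within the variance" and may differ from the exact thresholds used for this study; the `categorize` helper and the sample points are hypothetical.

```python
from statistics import mean, pstdev


def categorize(points_list):
    """Label each score 'bad', 'average' or 'good' relative to mean +/- one std."""
    mu = mean(points_list)
    sigma = pstdev(points_list)  # population standard deviation (an assumption)
    low, high = mu - sigma, mu + sigma
    labels = []
    for p in points_list:
        if p < low:
            labels.append("bad")
        elif p > high:
            labels.append("good")
        else:
            labels.append("average")
    return labels


# Hypothetical scores in the 80-100 range.
points = [80, 84, 86, 88, 88, 90, 92, 96]
labels = categorize(points)
for p, label in zip(points, labels):
    print(p, label)
```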
Now that we have our 3 categories, we look at the most used adjectives for describing the wines in each category. To do that, we used Natural Language Processing tools on the joined descriptions of the wines in each category. The results are shown in the following word clouds.
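As a rough idea of the counting step, here is a minimal sketch. It does not reproduce the NLP tooling used for this study: instead of a part-of-speech tagger it relies on a small hand-written adjective list, and the `ADJECTIVES` set, the `adjective_counts` helper, and the sample descriptions are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical adjective list; a real pipeline would use a part-of-speech tagger.
ADJECTIVES = {"rich", "ripe", "fine", "great", "black", "simple", "soft", "thin"}


def adjective_counts(descriptions, adjectives=ADJECTIVES):
    """Count how often each known adjective appears in the joined descriptions."""
    text = " ".join(descriptions).lower()
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if w in adjectives)


# Hypothetical descriptions grouped by category.
by_category = {
    "good": ["Rich and ripe, with fine tannins.", "A great, rich wine."],
    "bad": ["Thin and simple.", "Soft, simple fruit."],
}

for category, descriptions in by_category.items():
    print(category, adjective_counts(descriptions).most_common(3))
```

The resulting counts are the kind of frequency table a word cloud is drawn from.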



The description of the wine is representative of its quality: the words appear at different frequencies in each category. For example, adjectives such as “black”, “rich”, “great”, “fine”, and “ripe” appear more frequently in good wines than in bad ones. Comparing the adjectives in a description against these 3 word clouds can already give a good guess of its category among bad, average, and good.
Is the length of the description a good indicator of the quality of the wine?
We have seen that the words used are a good indicator. But what if we simply look at the length of the description?

The average description length rises linearly with the points. It is a good indicator of the quality of the wine! So when we are in a hurry, a first check could simply be the length of the description!
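The averaging behind this observation can be sketched as below. The length is measured in characters here, which is an assumption (it could equally be counted in words), and the sample reviews and the `mean_length_by_points` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (points, description) pairs.
reviews = [
    (82, "Thin and simple."),
    (82, "Soft fruit, short finish."),
    (90, "Ripe black cherry with fine tannins and a long finish."),
    (95, "Rich, layered and complex, with great depth, fine tannins and a very long finish."),
]


def mean_length_by_points(rows):
    """Average description length (in characters) for each points value."""
    lengths = defaultdict(list)
    for points, description in rows:
        lengths[points].append(len(description))
    return {p: mean(v) for p, v in sorted(lengths.items())}


length_by_points = mean_length_by_points(reviews)
for p, avg_len in length_by_points.items():
    print(p, avg_len)
```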
How well can we predict the rating of a wine from its price and description length?
For this part, scikit-learn was used to fit a linear model of the quality of the wine, with the price and the description length as variables:
points = a * price + b * description_length
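To make clear what fitting such a model involves, here is a dependency-free sketch. It is not the scikit-learn code used for this study: it solves the least-squares normal equations by hand, and it adds an intercept term `c` that the formula above does not show (scikit-learn's `LinearRegression` fits one by default). The coefficient of determination (R²) is included as one common way to summarise how much of the variation such a model explains. The sample rows and both helper functions are hypothetical.

```python
def fit_linear(rows):
    """Least-squares fit of points = a*price + b*length + c via the normal equations."""
    n = 3
    # Augmented matrix [X^T X | X^T y] for the features (price, length, 1).
    m = [[0.0] * (n + 1) for _ in range(n)]
    for price, length, points in rows:
        x = (price, length, 1.0)
        for i in range(n):
            for j in range(n):
                m[i][j] += x[i] * x[j]
            m[i][n] += x[i] * points
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]  # [a, b, c]


def r_squared(rows, coeffs):
    """Coefficient of determination of the fitted model on the given rows."""
    a, b, c = coeffs
    ys = [points for _, _, points in rows]
    y_mean = sum(ys) / len(ys)
    ss_res = sum((points - (a * price + b * length + c)) ** 2
                 for price, length, points in rows)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical (price, description_length, points) rows.
sample = [
    (10, 120, 84), (15, 200, 87), (25, 180, 88),
    (40, 260, 91), (60, 240, 90), (120, 320, 94),
]
coeffs = fit_linear(sample)
print("a, b, c =", coeffs)
print("R^2 =", r_squared(sample, coeffs))
```

An R² of 1 would mean the two variables explain the points perfectly; values well below 1 mean much of the variation is left unexplained.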
We obtain a linear regression coefficient of 0.38, which means that the quality of the wine is far from being perfectly explained by a linear model that takes only the price and the description length as variables. We have seen earlier that the relation between description length and quality is linear on average. But if we look at the relation between price and quality, we get the following boxplot:

This plot clearly shows that the quality is not linear with the price. Furthermore, the y-axis of the plot is on a logarithmic scale. That is one of the reasons why our model does not fit well. One could use a non-linear model and categorical variables, such as the country or the variety of the wine, to get a more accurate model.
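One simple way to see the effect of the logarithmic scale is to compare how strongly the points correlate with the raw price and with the logarithm of the price. This was not part of the study; it is a small illustration on hypothetical data where each doubling of the price adds two points, and the `pearson` helper is ours.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: +2 points every time the price doubles.
prices = [10, 20, 40, 80, 160, 320]
points = [84, 86, 88, 90, 92, 94]

raw_corr = pearson(prices, points)
log_corr = pearson([math.log(p) for p in prices], points)
print("correlation with price:     ", round(raw_corr, 3))
print("correlation with log(price):", round(log_corr, 3))
```

On data shaped like this, the relation is much closer to a straight line once the price is log-transformed, which is one cheap thing to try before moving to a fully non-linear model.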
Conclusion
The major producers of wine varieties in the database are the United States, France, and Italy, which together account for about 72% of the varieties.
The adjectives used in the description and the description length are good indicators of wine quality. Moreover, the average description length scales linearly with the quality of the wine.
Even though, on average, the higher the price, the better the wine, the quality does not increase linearly with the price: prices increase drastically for good wines. This is one reason why a linear regression model is not very effective at predicting the quality of the wine.
Acknowledgement
The code corresponding to this study is available on GitHub.