Predicting Expensive Wine Grape Varieties

Brenner Swenson
4 min read · Oct 7, 2018

I’ve always been into wine and have wondered which factors make certain wines more expensive than others: points, winery, country of origin, and so on.

I found a great data set on Kaggle with over 200k unique wines, scraped from WineEnthusiast. It includes each wine’s country, description, points, price, province, region, title, variety, and winery. Here is the data set if you’re curious.

I mainly wanted to focus on the more expensive wines, so in addition to cleaning the data and encoding missing values, I omitted any wines from the data set that had a price lower than $75.
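As a rough illustration of that filtering step, here is a minimal pandas sketch. The rows are made up, the `PRICE_FLOOR` name is hypothetical, and filling missing countries with "Unknown" is only one assumed way of encoding missing values, not necessarily what I did in the actual notebook.

```python
import pandas as pd

# Tiny made-up sample standing in for the Kaggle data (illustrative values only).
wines = pd.DataFrame({
    "country": ["US", None, "France", "Italy", "Spain"],
    "variety": ["Pinot Noir", "Chardonnay", "Merlot", "Nebbiolo", "Tinta de Toro"],
    "points": [92, 95, 88, 94, 93],
    "price": [80.0, 250.0, 20.0, None, 110.0],
})

PRICE_FLOOR = 75  # hypothetical name for the $75 cutoff described above

# A wine without a price cannot be compared against the cutoff, so drop it.
priced = wines.dropna(subset=["price"])

# Assumed encoding: label missing countries explicitly instead of leaving NaN.
priced = priced.fillna({"country": "Unknown"})

# Omit every wine priced below the cutoff.
expensive = priced[priced["price"] >= PRICE_FLOOR].reset_index(drop=True)
print(expensive)
```

The `>=` comparison keeps wines priced at exactly $75, which matches "omitted any wines that had a price lower than $75".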

Here are three questions that I ultimately want to answer:

1. On average, which type of grape is most expensive?

2. Which country produces the highest proportion of expensive wines?

3. Which characteristic is most important when classifying the type of grape?

Let’s take a look at the first question: On average, which type of grape is most expensive? To calculate stats for each grape variety on its own, I grouped the data points by variety and then calculated summary statistics for each group.
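The grouping can be sketched with a pandas `groupby`. The data below is made up, and apart from `avg_price` the output column names are placeholders chosen for this example.

```python
import pandas as pd

# Made-up rows; only the column names mirror the data set.
wines = pd.DataFrame({
    "variety": ["Pinot Noir", "Pinot Noir", "Pinot Noir",
                "Nebbiolo", "Nebbiolo", "Tinta de Toro"],
    "points": [92, 94, 90, 95, 93, 96],
    "price": [80.0, 100.0, 90.0, 150.0, 250.0, 300.0],
})

# One row per variety: mean and standard deviation of points and price,
# the median price, and how many wines of that variety appear.
summary = (
    wines.groupby("variety")
    .agg(
        avg_points=("points", "mean"),
        std_points=("points", "std"),
        avg_price=("price", "mean"),
        median_price=("price", "median"),
        std_price=("price", "std"),
        count=("price", "size"),
    )
    .sort_values("median_price", ascending=False)
)
print(summary)
```

A variety that appears only once has an undefined (NaN) standard deviation and identical mean and median, which is worth keeping in mind when reading the top of a table sorted this way.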

The standard deviation and mean of each grape’s points and prices. The rightmost column shows the number of times each variety occurred in the data set.

It looks like the most expensive wines aren’t very well-represented, which makes sense. The top five varieties above, ranked by median price, don’t tell us much: with so few data points each, their mean is very close to the median, if not the same. The Tinta de Toro variety, on the other hand, has a lot more data points and is consistently priced high. I would conclude that this variety of grape reliably produces wines on the more expensive side.

The second question: Which country produces the highest proportion of expensive wines? I took a similar approach here to the first question, but instead of grouping by grape, I grouped by country of origin. The results are below.
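Here is a minimal sketch of the country-level version, again on made-up data. The `share` column (each country's fraction of the rows in the filtered data set) is an assumption about how to express "proportion"; it is not a column from the original table.

```python
import pandas as pd

# Made-up expensive wines (all priced at $75 or more).
expensive = pd.DataFrame({
    "country": ["US"] * 5 + ["France"] * 3 + ["Italy"] * 2,
    "price": [80.0, 90.0, 100.0, 110.0, 120.0,   # US
              80.0, 90.0, 400.0,                 # France
              95.0, 105.0],                      # Italy
})

by_country = expensive.groupby("country").agg(
    avg_price=("price", "mean"),
    median_price=("price", "median"),
    std_price=("price", "std"),
    count=("price", "size"),
)

# Each country's share of all expensive wines in the sample.
by_country["share"] = by_country["count"] / by_country["count"].sum()
by_country = by_country.sort_values("share", ascending=False)
print(by_country)
```

In this toy sample a single $400 bottle pulls France's mean well above its median and inflates its standard deviation, which is the kind of pattern to look for in the real table.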

The standard deviation and mean of each country’s points and prices. The rightmost column shows the number of times each country occurred in the data set.

Interestingly enough, I thought France would be the highest contributor to the expensive wine market, but the US has almost double France’s representation in the data set. France’s avg_price also sits much further from its median than is the case for the US and Italy.

France’s price standard deviation is 3.5x that of the US, meaning that French wines span a much wider range of prices. (This confirms my preconceptions.)

We haven’t yet looked at the distribution of the data. Below is a plot of price against points that shows the trend between the two. Along each axis you can see the distribution of that individual variable. The right-hand distribution (price) shows a large spike at the lower end of the price range, which makes sense. The points distribution appears roughly normal around a central mean. A linear regression line is also plotted over the points.
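The regression line in a plot like this is an ordinary least-squares fit of price on points. As a sketch, the same kind of line can be computed directly with NumPy; the five data points below are invented purely to show the mechanics.

```python
import numpy as np
import pandas as pd

# Invented points/price pairs, for illustration only.
df = pd.DataFrame({
    "points": [88, 90, 92, 94, 96],
    "price": [80.0, 85.0, 110.0, 120.0, 180.0],
})

# Degree-1 polynomial fit: price is approximately slope * points + intercept.
slope, intercept = np.polyfit(df["points"], df["price"], 1)

# Pearson correlation between the two series.
r = df["points"].corr(df["price"])

print(f"price ~ {slope:.2f} * points + {intercept:.2f} (r = {r:.2f})")
```

The slope reads as the average price change per additional point, within the range of the data the line was fitted on.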

A plot of points vs price, with each respective distribution

Now let’s take a look at question 3: Which characteristic is most important when classifying the type of grape?

I used a random forest classifier after converting all of the text data to numeric values, since the model can only work with numbers. After optimizing the model using various methods and training it on 80% of the data, I obtained an accuracy of nearly 75% on the held-out testing data.
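A stripped-down version of that pipeline might look like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic, the text columns are encoded with scikit-learn's `OrdinalEncoder` (one of several possible encodings), and no hyperparameter tuning is shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic wines: each variety gets an invented typical price plus noise.
wines = pd.DataFrame({
    "country": rng.choice(["US", "France", "Italy"], size=n),
    "winery": rng.choice(["A", "B", "C", "D"], size=n),
    "points": rng.integers(85, 100, size=n),
    "variety": rng.choice(["Pinot Noir", "Nebbiolo", "Chardonnay"], size=n),
})
typical_price = wines["variety"].map(
    {"Pinot Noir": 90.0, "Nebbiolo": 150.0, "Chardonnay": 220.0}
)
wines["price"] = typical_price + rng.normal(0, 10, size=n)

# Convert the text columns to numeric codes so the model can use them.
text_cols = ["country", "winery"]
X = wines[["points", "price"]].copy()
X[text_cols] = OrdinalEncoder().fit_transform(wines[text_cols])
y = wines["variety"]

# Train on 80% of the rows and keep 20% aside for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The synthetic prices are deliberately well separated by variety, so the accuracy here says nothing about what to expect on the real data.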

The top 10 most important features (characteristics) of a particular wine when classifying its grape variety

From the graph above, we can see that price is the feature with the highest importance when it comes to classifying the grape variety, closely followed by points. We observed in the jointplot above that the two are closely related. This makes sense, as very expensive wines tend to be made from rarer varieties.
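A ranking like this can be read off a fitted scikit-learn random forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which price is constructed to be the strongest signal; `winery_code` is a hypothetical noise column, and the numbers it produces are unrelated to the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300

# Synthetic target and features (invented values).
variety = rng.choice(["Pinot Noir", "Nebbiolo", "Chardonnay"], size=n)
typical_price = pd.Series(variety).map(
    {"Pinot Noir": 90.0, "Nebbiolo": 150.0, "Chardonnay": 220.0}
)
price = typical_price + rng.normal(0, 10, size=n)

X = pd.DataFrame({
    "price": price,
    # Points loosely follow price, plus noise.
    "points": (88 + price / 40 + rng.normal(0, 2, size=n)).round(),
    # Pure noise, unrelated to variety.
    "winery_code": rng.integers(0, 20, size=n),
})

model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, variety)

# Pair each importance with its column name and keep the ten largest.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
    .head(10)
)
print(importances)
```

The importances are normalized to sum to 1 across all features, so each value can be read as a relative share.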

All of the data and my analyses can be found on my GitHub here.
