Predicting Expensive Wine Grape Varieties
I’ve always been into wine and have wondered what factors make certain wines more expensive than others, e.g. points, winery, or country of origin.
I found a great data set on Kaggle with over 200k unique wines. The data was scraped from WineEnthusiast. The data includes a wine’s country, description, points, price, province, region, title, variety, and winery. Here is the data set if you’re curious.
I mainly wanted to focus on the more expensive wines, so in addition to cleaning the data and encoding missing values, I omitted any wines from the data set that had a price lower than $75.
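As a rough illustration of that filtering step, here is a minimal pandas sketch. The sample rows are made up, the column names are taken from the field list above, and the "Unknown" placeholder for missing text is my assumption rather than the exact encoding used in the original analysis.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the fields listed above.
wines = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["US", None, "France", "Italy"],
    "price": [80.0, 120.0, None, 20.0],
    "points": [92, 95, 90, 85],
})

PRICE_FLOOR = 75  # threshold from the text: drop wines priced below $75

# Drop rows with no price (they cannot be compared against the threshold),
# then fill missing text fields with a placeholder (assumed encoding).
cleaned = wines.dropna(subset=["price"]).fillna({"country": "Unknown"})

# Keep only the expensive wines.
expensive = cleaned[cleaned["price"] >= PRICE_FLOOR].reset_index(drop=True)
print(expensive)
```

Only wines A and B survive here: C has no price and D is below the threshold.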
Here are three questions that I ultimately want to answer:
1. On average, which type of grape is most expensive?
2. Which country produces the highest proportion of expensive wines?
3. Which characteristic is most important when classifying the type of grape?
Let’s take a look at the first question: On average, which type of grape is most expensive? To be able to calculate stats on each grape variety on its own, I grouped all data points together based on their grape variety and then calculated some summary statistics.
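The grouping step can be sketched roughly like this with pandas. The data below is invented for illustration, and the output column names other than avg_price (count, median_price, std_price) are my own labels, not necessarily the ones used in the original notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the filtered (>= $75) data set.
wines = pd.DataFrame({
    "variety": ["Tinta de Toro", "Tinta de Toro", "Tinta de Toro",
                "Pinot Noir", "Pinot Noir", "Rare Blend"],
    "price": [90.0, 110.0, 100.0, 80.0, 120.0, 300.0],
})

# One row per grape variety, with summary statistics on price.
variety_stats = (
    wines.groupby("variety")["price"]
    .agg(count="count", avg_price="mean", median_price="median", std_price="std")
    .sort_values("median_price", ascending=False)
)
print(variety_stats)
```

Keeping the count next to the mean and median makes it easy to spot varieties whose high ranking rests on only one or two bottles.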
It looks like the most expensive wines aren’t very well-represented, which makes sense. The top five varieties above, ranked by median price, don’t tell us much: with so few data points behind each one, the mean is very close to the median, if not the same. The Tinta de Toro variety, by contrast, has many more data points and is consistently priced highly. I would conclude that this variety of grape reliably produces wines on the more expensive side.
The second question: Which country produces the highest proportion of expensive wines? I took a similar approach here to the first question, but instead of grouping by grape, I grouped by country of origin. The results are below.
Interestingly enough, I expected France to be the highest contributor to the expensive wine market, but the US has almost double France’s representation in the dataset. Compared to the US and Italy, France’s avg_price is much further from its median.
France’s price standard deviation is 3.5x that of the US, meaning that French wines span a much wider range of prices. (This affirms my preconceptions.)
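For the country comparison, a sketch along the following lines would produce each country's share of the expensive wines along with the gap between mean and median price. The numbers are invented, and the share and mean_minus_median columns are hypothetical names I introduced for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the filtered (>= $75) data set.
wines = pd.DataFrame({
    "country": ["US"] * 4 + ["France"] * 3 + ["Italy"],
    "price": [80.0, 90.0, 100.0, 110.0, 80.0, 90.0, 400.0, 95.0],
})

country_stats = wines.groupby("country")["price"].agg(
    count="count", avg_price="mean", median_price="median", std_price="std"
)

# Share of all expensive wines that come from each country.
country_stats["share"] = country_stats["count"] / len(wines)

# A large gap between mean and median suggests a few very pricey bottles
# are pulling the average up.
country_stats["mean_minus_median"] = (
    country_stats["avg_price"] - country_stats["median_price"]
)

country_stats = country_stats.sort_values("share", ascending=False)
print(country_stats)
```

In this toy data, France's single $400 bottle pushes its mean far above its median and inflates its standard deviation, which is the same pattern described above.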
We haven’t yet looked at the distribution of the data. Below is a plot of price against points that shows the trend between the two. Along each axis you can see the distribution of that individual series. The right-hand distribution (price) shows a large spike at the lower end of the price range, which makes sense. The points distribution looks roughly normal around a central mean. A linear regression line is also plotted over the data.
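If you want the numbers behind a regression line like that one, here is a small NumPy sketch. It does not reproduce the plot itself; it only fits a straight line of price on points and computes their correlation, using made-up values.

```python
import numpy as np

# Hypothetical (points, price) pairs for a handful of expensive wines.
points = np.array([88, 90, 92, 94, 96], dtype=float)
price = np.array([76, 85, 95, 130, 200], dtype=float)

# Least-squares straight line: price ~ slope * points + intercept.
slope, intercept = np.polyfit(points, price, deg=1)

# Pearson correlation between the two series.
r = np.corrcoef(points, price)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

A positive slope means higher-rated wines tend to cost more, and r says how tightly the data follow that line.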
Now let’s take a look at question 3: Which characteristic is most important when classifying the type of grape?
I used a random forest classifier after converting all of the text data to numeric values so that the model could use every feature. After tuning the model using various methods and training it on 80% of the data, I obtained an accuracy of nearly 75% on my testing data.
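Here is a minimal scikit-learn sketch of that workflow, under several assumptions: the data is synthetic, OrdinalEncoder stands in for whatever text-to-numeric conversion was actually used, and no hyperparameter tuning is shown. It is meant to show the shape of the pipeline, not to reproduce the 75% figure.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data: 40 wines, variety loosely tied to country.
countries = ["US", "France", "Italy", "Spain"]
variety_by_country = {
    "US": "Cabernet Sauvignon",
    "France": "Pinot Noir",
    "Italy": "Nebbiolo",
    "Spain": "Tinta de Toro",
}
rows = []
for i in range(40):
    country = countries[i % 4]
    rows.append({
        "country": country,
        "price": 75.0 + 10 * (i % 4) + i,
        "points": 88 + i % 7,
        "variety": variety_by_country[country],
    })
wines = pd.DataFrame(rows)

# Convert the text feature to numbers (assumed encoding scheme).
X = wines[["country", "price", "points"]].copy()
X["country"] = OrdinalEncoder().fit_transform(X[["country"]]).ravel()
y = wines["variety"]

# 80% of the rows for training, 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

The encoder is fitted on the full table before splitting, which is fine for a toy example; on real data you may prefer to fit it on the training rows only.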
From the graph above, we can see that price is the feature with the highest importance when it comes to classifying the grape variety, closely followed by points. We saw in the jointplot above that these two features are closely related. This makes sense, as very expensive wines tend to come from rarer varieties.
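A ranking like the one in the graph can be read from a fitted random forest through scikit-learn's feature_importances_ attribute. The sketch below uses a tiny made-up feature table (country_code is a hypothetical encoded column), so the ordering it prints says nothing about the real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, already-numeric features for nine wines.
X = pd.DataFrame({
    "price": [80, 85, 90, 150, 160, 170, 300, 320, 340],
    "points": [88, 90, 89, 92, 91, 93, 95, 96, 94],
    "country_code": [0, 1, 0, 1, 2, 0, 2, 1, 2],
})
y = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]  # made-up variety labels

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances)
```

These impurity-based scores tend to favor numeric features with many distinct values, such as price, so they are best read as a rough guide rather than a definitive ranking.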
All of the data and my analyses can be found on my GitHub here.