The war of the (wine) worlds

Kenneth Bollen
6 min read · Jan 28, 2018


A debate often heard among wine enthusiasts is whether the vineyards of the so-called “new world” have started to match their more traditional counterparts in their ability to produce good quality wine. The debate stems from differing philosophies of wine making. The “old world”, comprising the western European wine regions, has for centuries produced wine from selected regions to be consumed as a daily beverage, with emphasis on its ability to pair well with meals, while the likes of the U.S.A., South Africa and Australia have adopted newer, more scientific methods to meet the tastes of contemporary wine drinkers. However, in an age of globalisation and sharing of best practices, can one really make a distinction in preference between these two worlds?

To test this theory I turned to one of the biggest wine retailers in the U.K., Majestic Wine, and scraped data on over 700 wines from its website.

Data Collection

To collect the data I used Python’s go-to web scraping package, Beautiful Soup, and wrote a simple for-loop to collect data on each wine’s region, grape and type.

Collecting URLs

Using the website’s h3 headers, I looked through the “Browse Wines” section to select all the different types of wine and their respective URLs, which were appended to a list called “wines”. I then looped through that list of URLs and searched their <a> tags for hyperlinks (href) to wines, skipping any links that had already been gathered in the first loop. The same process was repeated to collect the URLs for wine countries and regions.
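A minimal sketch of this two-pass collection, assuming Beautiful Soup with the built-in html.parser. The sample markup, the /wines/ URL prefix and the collect_wine_urls helper are illustrative assumptions, not the retailer’s real page structure or the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the "Browse Wines" markup (not the real site).
SAMPLE_HTML = """
<div>
  <h3><a href="/wines/red">Red Wine</a></h3>
  <h3><a href="/wines/white">White Wine</a></h3>
  <ul>
    <li><a href="/wines/red">Red Wine</a></li>
    <li><a href="/wines/rose">Rose Wine</a></li>
    <li><a href="/help">Help</a></li>
  </ul>
</div>
"""


def collect_wine_urls(html):
    """Collect wine URLs from h3 headers first, then from other <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    wines = []

    # First pass: links that sit inside <h3> headers.
    for header in soup.find_all("h3"):
        link = header.find("a", href=True)
        if link is not None:
            wines.append(link["href"])

    # Second pass: any other wine link, skipping ones already gathered.
    for link in soup.find_all("a", href=True):
        href = link["href"]
        if href.startswith("/wines/") and href not in wines:
            wines.append(href)

    return wines


wines = collect_wine_urls(SAMPLE_HTML)
```

On a live site the HTML would come from an HTTP request rather than a string, and the filter on the href would need to match the site’s actual URL scheme.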

Each type of wine has a number of webpages cataloguing the different products, so I needed a piece of code to determine how many webpages to loop through to collect all the data. The elements with the class “button button--small” contain the page numbers. Therefore, by applying the max() function to those numbers, I could determine the last webpage for the loop range.

Finding the range of web pages to loop through
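One way to sketch this step, assuming the page-number links carry a class such as button--small and that some of those buttons hold non-numeric text like “Next”. The markup and the last_page helper are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup; the real page may differ.
SAMPLE_PAGINATION = """
<div class="pagination">
  <a class="button button--small" href="?page=1">1</a>
  <a class="button button--small" href="?page=2">2</a>
  <a class="button button--small" href="?page=7">7</a>
  <a class="button button--small" href="?page=2">Next</a>
</div>
"""


def last_page(html):
    """Return the highest page number shown in the pagination buttons."""
    soup = BeautifulSoup(html, "html.parser")
    buttons = soup.find_all(class_="button--small")
    labels = [button.get_text(strip=True) for button in buttons]
    # Keep only numeric labels, ignoring buttons such as "Next".
    numbers = [int(label) for label in labels if label.isdigit()]
    # Fall back to a single page if no numbered buttons are present.
    return max(numbers, default=1)


final_page = last_page(SAMPLE_PAGINATION)
# Page numbers to visit, inclusive of the last page.
pages = list(range(1, final_page + 1))
```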

Data Preparation

Some of the data cleansing issues I faced related to extraneous information in the price fields, which meant the prices had to be split from the rest of the text. This was handled as follows:

Cleaning price data
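A small sketch of this kind of price cleaning, using only the standard library. The raw strings and the extract_price helper are made-up examples of what a scraped price field might contain, not the actual scraped text.

```python
import re

# A pound sign, optional whitespace, then digits with optional thousands
# separators and optional pence.
PRICE_PATTERN = re.compile(r"£\s*(\d[\d,]*(?:\.\d{1,2})?)")


def extract_price(raw):
    """Return the first price in the text as a float, or None if absent."""
    match = PRICE_PATTERN.search(raw)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


# Hypothetical raw strings with extraneous text around the price.
raw_prices = ["£8.99 Mix Six price", "  £10 per bottle", "Out of stock"]
clean_prices = [extract_price(raw) for raw in raw_prices]
```

Returning None for rows without a price keeps them visible, so they can be dropped or inspected later instead of failing the whole loop.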

Once the data was cleansed, I began the process of creating dictionaries for all the data and merging them together to create one data frame with pandas. The dataset can be found in the following Kaggle repository: https://www.kaggle.com/kennethbollen/majestic-wine-data
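As an illustration of the dictionary-to-data-frame step, here is one possible shape for it, assuming each dictionary is keyed by wine name. The wine names, column names and values are invented for the example.

```python
import pandas as pd

# Hypothetical dictionaries, each keyed by wine name.
prices = {"Wine A": 8.99, "Wine B": 12.50}
countries = {"Wine A": "Chile", "Wine B": "France"}
ratings = {"Wine A": 91, "Wine B": 85}

# A dict of dicts becomes one column per outer key, aligned on the
# inner keys (the wine names), which form the index.
df = pd.DataFrame({"price": prices, "country": countries, "rating": ratings})
df.index.name = "wine"
```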

Data Analysis

The data collected provided me with a comprehensive look at each wine’s price, region and, most interestingly, the percentage of people who would buy that bottle of wine again, which I will use as a proxy for the quality of the wine. An initial look at the wines from different countries relative to their ratings provided the following:

Swarmplot of Country by Ratings
Average Price (£) and Ratings

As seen from the table, Australia scored the highest in terms of the number of people who would buy a wine from that region again at 92%, while Italy scored the lowest at 84%, despite being the second most expensive region on average. This challenges the notion that the price of a wine is an indication of its quality. Moving closer to our initial question, I categorised countries into new world and old world: Australia, Argentina, New Zealand, South Africa and Chile were grouped into the new world, and Spain, Portugal, France and Italy were grouped into the old world.

Swarmplot of Rating by ‘World’
Average Price (£) and Score by ‘World’
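The grouping into worlds and the averages per world could be sketched with pandas as follows. The country lists come from the text above, while the data frame rows and the column names are made-up placeholders.

```python
import pandas as pd

NEW_WORLD = {"Australia", "Argentina", "New Zealand", "South Africa", "Chile"}
OLD_WORLD = {"Spain", "Portugal", "France", "Italy"}


def classify_world(country):
    """Map a country to its 'world' label."""
    if country in NEW_WORLD:
        return "new world"
    if country in OLD_WORLD:
        return "old world"
    return "other"


# Invented rows purely for illustration; not the scraped data.
df = pd.DataFrame({
    "country": ["Australia", "Italy", "Chile", "France"],
    "price": [9.0, 14.0, 8.0, 12.0],
    "rating": [92, 84, 90, 86],
})

df["world"] = df["country"].map(classify_world)
# Average price and rating per world.
summary = df.groupby("world")[["price", "rating"]].mean()
```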

Interestingly, on average 89% of buyers would buy a wine from the new world again, which is 2% higher than for wines from the old world. However, is this difference statistically significant?

H0 (null): µ(new world) = µ(old world). The country of origin makes no difference to the favourability of the wine.

H1 (alternative): µ(new world) ≠ µ(old world). The country of origin has an impact on the favourability of the wine.

To set up the hypothesis test, I use a significance level of 5% with an unpaired two-sample t-test, as the data meets the conditions of being independent and normally distributed.

Data for country’s rating score is normally distributed

Using Python’s SciPy stats package, I computed a p-value of 2.3%, which falls below the significance level of 5%. This allows me to reject the null hypothesis and conclude that a wine’s country of origin has an impact on whether someone will enjoy that glass of wine.

SciPy Stats Package
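For readers who want to reproduce the shape of this test, here is a minimal sketch using scipy.stats.ttest_ind. The two samples are synthetic draws generated for the example, so the resulting p-value will not match the 2.3% reported above.

```python
import numpy as np
from scipy import stats

# Synthetic ratings for illustration only; not the Majestic Wine data.
rng = np.random.default_rng(0)
new_world = rng.normal(loc=89, scale=5, size=200)
old_world = rng.normal(loc=87, scale=5, size=200)

# Unpaired two-sample t-test. By default ttest_ind assumes equal
# variances; passing equal_var=False would run Welch's t-test instead.
t_stat, p_value = stats.ttest_ind(new_world, old_world)

alpha = 0.05  # significance level
reject_null = p_value < alpha
```

Whether equal variances is a fair assumption depends on the data, so comparing the two sample variances before choosing between the two variants is a reasonable extra check.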

Wine connoisseurs will, however, argue that the average person’s palate isn’t sophisticated enough to discern the quality of the wine producer. This may be true, in which case a further look into industry body awards and ratings could provide more insight into this debate.

Predictive Analysis

Lastly, I wanted to find out how well the data allowed me to predict whether someone would like a particular glass of wine based on the country, price and grape. The target variable here is customer satisfaction, with 1 indicating favourable and 0 indicating unfavourable, which makes this a classification problem.

The data-set contained data on various scales, so my first step was to transform it onto a standard scale. My approach was to create dummy variables and convert the data into binary values, which allows us to answer the question of favourable or unfavourable.

Creating dummy variables
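A sketch of what this preparation could look like with pandas.get_dummies. The rows, the column names and especially the FAVOURABLE_CUTOFF value are assumptions made for the example; the text does not state which cutoff separates favourable from unfavourable wines.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "country": ["Chile", "France", "Chile", "Italy"],
    "grape": ["Merlot", "Pinot Noir", "Malbec", "Merlot"],
    "price": [8.0, 12.0, 9.5, 14.0],
    "rating": [91, 86, 78, 84],
})

# Hypothetical cutoff: ratings at or above it count as favourable (1).
FAVOURABLE_CUTOFF = 85

# One-hot encode the categorical features as 0/1 columns.
X = pd.get_dummies(
    df[["country", "grape", "price"]],
    columns=["country", "grape"],
    dtype=int,
)
# Min-max scale price so it shares the 0 to 1 range of the dummies.
price_range = X["price"].max() - X["price"].min()
X["price"] = (X["price"] - X["price"].min()) / price_range

# Binary target: 1 for favourable, 0 for unfavourable.
y = (df["rating"] >= FAVOURABLE_CUTOFF).astype(int)
```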

I chose the Support Vector Classification model for its ability to handle moderate to large data-sets with many features. I split the data-set into 3 sets, using the validation set to tune the parameters of the model before training it.

Splitting the data into 3 sets
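One common way to get three sets is to call scikit-learn’s train_test_split twice. The sketch below assumes a 60/20/20 split and stratified sampling, neither of which is stated in the text, and uses random placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and an imbalanced binary target.
rng = np.random.default_rng(0)
X = rng.random((100, 4))
y = np.array([1] * 80 + [0] * 20)

# First split: hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
# Second split: 25% of the remaining 80% becomes the validation set,
# leaving 60% of the original data for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)
```

Stratifying keeps the share of favourable and unfavourable wines roughly equal across the three sets, which matters when one class is rare.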

I then created a parameter grid of C and gamma values for the GridSearchCV function to search through and find the optimal parameters to train the model with.

Searching for parameters
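A hedged sketch of such a search with scikit-learn. The grid values, the RBF kernel and the synthetic data from make_classification are all assumptions for the example. Note that GridSearchCV scores each parameter combination by cross-validation on the data it is given, rather than on a single fixed validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced data standing in for the wine features.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Candidate values for C and gamma (illustrative choices).
param_grid = {
    "C": [0.01, 0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# 5-fold cross-validated search over every combination in the grid.
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

best_params = grid.best_params_
test_score = grid.score(X_test, y_test)  # accuracy of the refit best model
```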

This process produced a model with a training score of 0.93 and a test score of 0.9, indicating a robust model. However, as the confusion matrix shows, the model picks up significantly more True Positives (116) than True Negatives (10).

Confusion Matrix

This is largely because we’re working with an imbalanced data-set where we have more data on consumers that have rated favourable wines than unfavourable wines. To counter this bias, I want to increase the model’s recall of class 0 (unfavourable wines) at the cost of the model’s precision.

Classification Report
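To make the confusion matrix and the per-class report concrete, here is a toy example with scikit-learn. The labels are invented to mimic an imbalanced problem and do not reproduce the counts quoted above.

```python
from sklearn.metrics import classification_report, confusion_matrix, recall_score

# Toy labels: mostly class 1 (favourable), few class 0 (unfavourable).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes, ordered [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Recall of class 0: the share of unfavourable wines the model catches.
recall_class_0 = recall_score(y_true, y_pred, pos_label=0)

# Precision, recall and F1 for each class as a printable string.
report = classification_report(y_true, y_pred)
```

Here the model finds only one of the three class 0 examples, the same pattern of weak recall on the minority class that the article describes.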

To help me determine the ideal precision-recall trade-off, I plotted the various thresholds and their corresponding precision and recall scores.

Precision-Recall Curve
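The values behind such a curve can be computed with scikit-learn’s precision_recall_curve. This sketch assumes an SVC scored through decision_function and uses synthetic data; plotting is left out so the example stays self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced data standing in for the wine features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

# Signed distance to the decision boundary; predict() thresholds it at 0.
scores = model.decision_function(X_test)

# Precision and recall for class 1 at each candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
```

The precision and recall arrays hold one more entry than thresholds, because scikit-learn appends a final point with precision 1 and recall 0 that has no threshold of its own.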

With the zero threshold sitting at the far left of the curve, and my goal being to increase the recall of class 0, I have scope to increase the threshold before the cost in recall becomes detrimental.

Increasing threshold
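Raising the threshold amounts to comparing the model’s decision scores against a value above zero instead of using the default cut at zero. The scores, labels, the 0.5 threshold and both helper functions below are made-up illustrations of the idea.

```python
import numpy as np


def predict_with_threshold(scores, threshold):
    """Label a sample 1 only if its decision score exceeds the threshold."""
    return (np.asarray(scores) > threshold).astype(int)


def recall_for_class(y_true, y_pred, label):
    """Share of samples of the given class that were predicted correctly."""
    mask = y_true == label
    return float((y_pred[mask] == label).mean())


# Toy decision scores and true labels for illustration.
scores = np.array([-1.2, -0.1, 0.2, 0.4, 0.9, 1.5])
y_true = np.array([0, 1, 0, 0, 1, 1])

default_pred = predict_with_threshold(scores, 0.0)  # default cut at 0
raised_pred = predict_with_threshold(scores, 0.5)   # hypothetical higher cut

recall_0_default = recall_for_class(y_true, default_pred, 0)
recall_0_raised = recall_for_class(y_true, raised_pred, 0)
```

In this toy case the higher threshold moves two borderline samples into class 0, lifting the recall of class 0 from one third to one.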

This provided me with the best model for predicting user satisfaction given the imbalanced dataset.

Analysis found on GitHub: https://github.com/kennethbollen/wine_price
