Wine Review PT3 — EDA— ML

Nelson Punch

Published in

Software-Dev-Explore

5 min read · Jul 9, 2021

Introduction

EDA (Exploratory Data Analysis)

Exploring the dataset at hand is an important step in machine learning because it helps me understand the data. Machine learning also involves data science.

I can use Pandas and Matplotlib to help me understand the dataset much better.

Here I focus on exploring points and price from the dataset.

Dataset

Kaggle

We will use the winemag-data-130k-v2.csv dataset for machine learning.
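As a starting point, here is a minimal sketch of loading only the two columns explored in this part. The function name `load_points_and_price` is hypothetical, and the path is an assumption: it should point to wherever the CSV from Kaggle was saved.

```python
import pandas as pd


def load_points_and_price(path: str) -> pd.DataFrame:
    """Read only the points and price columns from the wine review CSV."""
    # usecols keeps memory low by skipping the text-heavy columns.
    return pd.read_csv(path, usecols=["points", "price"])
```

Calling `load_points_and_price("winemag-data-130k-v2.csv")` should return a DataFrame with the two columns; rows with a missing price show up as NaN.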

Source Code

Code in Google Colab

Task

Definition of points and price
Distribution and Standard deviation of points and price
Relationship between points and price
Correlation between points and price

Definition of points and price

From the Kaggle dataset page, I can see the definitions of points and price:

Points: The number of points WineEnthusiast rated the wine on a scale of 1–100
Price: The cost for a bottle of the wine

Distribution and Standard deviation of points and price

I would like to see the distribution and standard deviation of points and price in a graph.

Distribution of data: The frequency of every possible value, e.g. the frequency of point 1, the frequency of point 2
Standard deviation of data: A measure of how spread out the data is around its mean
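To make the two definitions concrete, here is a small sketch on a made-up sample. The values in `points` are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical toy sample of point values, not real data.
points = pd.Series([85, 87, 87, 88, 88, 88, 90, 91], name="points")

# Distribution: how many times each distinct value occurs.
frequency = points.value_counts().sort_index()

# Standard deviation: spread of the values around their mean.
# pandas computes the sample standard deviation (ddof=1) by default.
mean = points.mean()
std = points.std()

print(frequency)
print(f"mean={mean:.2f}, std={std:.2f}")
```

The frequency table is exactly what a histogram draws as bars, and the standard deviation says how far a typical value sits from the mean.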

To create a graph that represents the distribution of points and price, I use a histogram; matplotlib provides a hist method for this.

points

I find the mean and standard deviation of points, then use matplotlib to plot a histogram with lines for the mean and the standard deviation.
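This step could be sketched roughly as follows. The function name `plot_histogram_with_spread`, the bin count, and the colors are my assumptions, and the min and max lines are assumed to sit one standard deviation below and above the mean.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histogram_with_spread(values: pd.Series, title: str, bins: int = 20):
    """Plot a histogram with lines at mean - std, mean, and mean + std."""
    values = values.dropna()
    mean = values.mean()
    std = values.std()

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.hist(values, bins=bins, edgecolor="black")

    # Vertical lines for the mean and one standard deviation on each side.
    ax.axvline(mean, color="red", label=f"mean = {mean:.2f}")
    ax.axvline(mean - std, color="orange", linestyle="--",
               label=f"min = {mean - std:.2f}")
    ax.axvline(mean + std, color="green", linestyle="--",
               label=f"max = {mean + std:.2f}")

    ax.set_title(title)
    ax.set_xlabel(values.name or "value")
    ax.set_ylabel("frequency")
    ax.legend()
    return ax
```

With the dataset loaded into a DataFrame `df`, something like `plot_histogram_with_spread(df["points"], "Distribution of points")` followed by `plt.show()` draws the chart. The same function works for `df["price"]`.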

We can see that most wines earned points in the range 80 ~ 100, compared to the full scale of 1 ~ 100. If we consider any wine with points above 60 or 70 to be good, then most wines in the dataset are pretty decent.

The graph shows that a lot of wines earned points between 85 ~ 94, while only a few earned points above 94 or between 80 ~ 84.

The min and max lines mark the standard deviation below and above the mean. They also tell me that most of the data falls within this range.

Now I would like to see this in a box plot.

I use seaborn for the box plot.
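A minimal sketch of the box plot, assuming the data is in a DataFrame with a `points` column. The helper name `plot_box` and the figure size are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_box(df: pd.DataFrame, column: str):
    """Draw a horizontal box plot for one numeric column."""
    fig, ax = plt.subplots(figsize=(10, 3))
    # Passing the column as x gives a horizontal box; seaborn skips NaN values.
    sns.boxplot(data=df, x=column, ax=ax)
    ax.set_title(f"Box plot of {column}")
    return ax
```

Calling `plot_box(df, "points")` and then `plt.show()` displays the box, the median line inside it, the whiskers, and any outliers as individual markers.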

In the box plot, it is much clearer that most of the data is between 85 ~ 91. Only two data points are considered outliers, and they are still within the scale range 1 ~ 100. The distribution also looks slightly skewed (the vertical median line in the box sits a bit left of center). Despite the skew, the graph has already told me what I need to know about the points data.

price

As with points, we can repeat the same steps for the price data.

This graph of price clearly shows that the distribution is significantly skewed. A distribution tends to be skewed when there are many outliers in the dataset. From the plot, I can see that many wines fall in the price range 4 ~ 200 and only a few are priced above 200.

Box plot

The box plot tells me that a fair number of wines are priced above 500. I can’t believe that the price of a bottle of wine can reach above 3000 or even 4000.
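To put numbers on what the box plot shows, outliers can be listed with the 1.5 × IQR rule, which is the default whisker rule in matplotlib and seaborn box plots. This is only a sketch: the function name `iqr_outliers` and the toy prices below are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values lying outside the box plot whisker limits."""
    values = values.dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]


# Hypothetical prices: one bottle is far more expensive than the rest.
toy_prices = pd.Series([10, 12, 14, 15, 16, 18, 20, 500], name="price")
print(iqr_outliers(toy_prices))
```

On the real data, `len(iqr_outliers(df["price"]))` would give the number of wines drawn as outlier markers, and `(df["price"] > 500).sum()` would count the bottles priced above 500.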

Relationship between points and price

I would like to find out whether there is any relationship between points and price. For example, the more points a wine earns, the higher the price it sells for.

First I retrieve only the points and price columns from the entire dataset, group them by points, and then find the average price for each points value. In table form it looks like this.
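The grouping step might look like the sketch below. The function name `average_price_by_points` and the output column name `avg_price` are my own choices; only `points` and `price` come from the dataset.

```python
import pandas as pd


def average_price_by_points(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per points value with the mean price for that value."""
    return (
        df[["points", "price"]]
        .dropna(subset=["price"])          # ignore wines without a price
        .groupby("points", as_index=False)["price"]
        .mean()
        .rename(columns={"price": "avg_price"})
    )
```

The result has one row per distinct points value, sorted by points, which is the table form needed for the next plot.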

I create a plot of points against price.

Here I can see a pattern: as points go up, price goes up. Between points 80 ~ 94 the price rises gradually; in contrast, it rises dramatically from 95 ~ 100.
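A rough sketch of such a plot, assuming a DataFrame with `points` and `price` columns. The function name `plot_average_price` and the styling are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_average_price(df: pd.DataFrame):
    """Plot the mean price for each points value as a line chart."""
    # Mean price per points value; NaN prices are skipped by mean().
    avg = df.groupby("points")["price"].mean()

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(avg.index, avg.values, marker="o")
    ax.set_title("Average price by points")
    ax.set_xlabel("points")
    ax.set_ylabel("average price")
    return ax
```

After `plot_average_price(df)` and `plt.show()`, each marker is one points value and its height is the average price of wines with that score.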

This graph also shows that there is a positive correlation between points and price.

Positive correlation: A relationship between two variables in which both variables move in tandem. When one variable goes up, the other goes up, and vice versa
Negative correlation: A relationship between two variables in which one variable increases as the other decreases, and vice versa

points and price in scatter chart

I would like to plot points and price in a scatter chart.

With a scatter chart I can see each data point clearly.
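One way to draw this, again assuming a DataFrame with `points` and `price` columns. The function name `plot_points_vs_price`, the marker size, and the transparency are my assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_points_vs_price(df: pd.DataFrame):
    """Scatter every wine with points on the x axis and price on the y axis."""
    # Rows without a price cannot be placed on the chart, so drop them.
    data = df[["points", "price"]].dropna()

    fig, ax = plt.subplots(figsize=(10, 5))
    # A low alpha keeps dense regions readable when many dots overlap.
    ax.scatter(data["points"], data["price"], s=10, alpha=0.3)
    ax.set_title("Points vs price")
    ax.set_xlabel("points")
    ax.set_ylabel("price")
    return ax
```

Unlike the averaged line plot, every wine gets its own dot here, so the very expensive bottles stand out individually.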

Correlation between points and price

I know there is a positive correlation between points and price, but is there a way to measure correlation? Yes: the correlation coefficient.

A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between two variables.

What do these values mean? They all assume values in the range from −1 to +1, where ±1 indicates the strongest possible agreement and 0 the strongest possible disagreement.

In simple words, the value lies between -1 and 1, where 1 means the strongest positive correlation, -1 means the strongest negative correlation, and 0 means no correlation.

From the table, points to points has 1, the strongest relationship, while points to price has 0.4, somewhere between no relationship and a medium one.
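A correlation table like this can be produced with pandas, whose `DataFrame.corr` uses the Pearson coefficient by default. The sketch below also computes Pearson's r by hand to show what the number measures. The helper `pearson_r` and the toy values are hypothetical and will not reproduce the 0.4 from the real dataset.

```python
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5


# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "points": [80, 85, 90, 95, 100],
    "price": [10.0, 20.0, 25.0, 60.0, 200.0],
})

matrix = toy[["points", "price"]].corr()   # Pearson by default
r = pearson_r(toy["points"], toy["price"])

print(matrix)
print(f"manual r = {r:.3f}")
```

The diagonal of the matrix is always 1 because every column is perfectly correlated with itself. On the real data, rows with a missing price need to be dropped before using the manual helper; `corr` already ignores them.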

Let’s see it as a heatmap (in matrix form).
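A minimal sketch of the heatmap with seaborn, assuming a DataFrame with `points` and `price` columns. The function name `plot_correlation_heatmap` and the color map are my choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df: pd.DataFrame):
    """Draw the points/price correlation matrix as an annotated heatmap."""
    corr = df[["points", "price"]].corr()

    fig, ax = plt.subplots(figsize=(5, 4))
    # Fix the color scale to [-1, 1] so colors are comparable across datasets.
    sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1,
                cmap="coolwarm", ax=ax)
    ax.set_title("Correlation between points and price")
    return ax
```

Each cell shows the coefficient for one pair of columns, so the matrix is symmetric and the diagonal cells are always 1.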

Conclusion

To understand points and price in the dataset, I can use pandas, matplotlib, and seaborn. I have seen how the points and price data are spread and what their distributions look like. In addition, I found that there is a slight positive correlation between points and price.

Part 4