Wine Review PT2 — Data cleaning — ML

Nelson Punch
Software-Dev-Explore
7 min read · Jun 13, 2021
Photo by Edgar Chaparro on Unsplash

Introduction

In this part of the Wine Review journey, I am going to download the dataset from Kaggle, clean it, and fill in missing values.

First I inspect the dataset to find out detailed information about it; based on what I learn from that inspection, I can clean the data and fill in the missing values.

In the real world, a dataset doesn't always contain the data we expect. You might have a dataset where values are missing or blank in certain rows and columns, or where the data type of a particular column is not what we are looking for. Before we can feed our data into a machine learning model, we must understand the data we have and then process or transform it.

Dataset

Kaggle

We will use the winemag-data-130k-v2.csv dataset for machine learning.

Source Code

Code in Google Colab

Tasks

  • Inspect the dataset
  • Remove rows or columns if necessary
  • Fill missing values
  • Detect outliers

Inspecting dataset

First we need to import the libraries we are going to use: Pandas, NumPy, Matplotlib and Seaborn.
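The imports might look like this (the aliases are simply the usual conventions):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns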

Load our dataset

DATA_DIR is a variable pointing to the directory where the data file is located.
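A minimal sketch of the loading step; the DATA_DIR value and file path are assumptions, so adjust them for your own environment (for example a Google Drive mount in Colab):

import os

# Assumed location of the downloaded Kaggle CSV
DATA_DIR = 'data'

# The Kaggle CSV ships with an unnamed index column, hence index_col=0
wine_df = pd.read_csv(os.path.join(DATA_DIR, 'winemag-data-130k-v2.csv'), index_col=0)
wine_df.head()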

Overview dataset

To get a general understanding of our dataset, I can use Pandas' info() method:

wine_df.info()

From the output we can see that in the dataset/DataFrame:

  • Rows/entries: 129971
  • Columns: 13
  • Column types: mostly object; the points column is int64 and the price column is float64

Drop columns

A dataset might contain data that is negligible for our purpose. The goal of Wine Review is to train a machine to understand the taste of each wine, so that the machine learning model can later predict wine titles related to a new description. The description data is therefore quite significant for our machine learning, and title and variety (the type of grape) are relatively important as well. The rest of the data we can simply ignore.

Later I am going to do EDA (Exploratory Data Analysis), so some of the data is worth keeping for that purpose.

I decide to drop columns:

  • designation
  • region_1
  • region_2
  • taster_twitter_handle
  • taster_name

Define a function to drop the unwanted columns.
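A sketch of what such a function could look like (the name drop_columns is mine, not necessarily the one used in the notebook):

def drop_columns(df, columns):
    """Return a copy of the DataFrame with the unwanted columns removed."""
    return df.drop(columns=columns)

wine_df = drop_columns(wine_df, [
    'designation', 'region_1', 'region_2',
    'taster_twitter_handle', 'taster_name',
])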

Unwanted columns are removed.

Inspecting missing values

A dataset is not always perfect and ready to be used. Now I am going to check whether there are any missing values, so I can then decide to either fill them in or drop the rows.

wine_df.isna().sum()

isna() returns a DataFrame of the same shape filled with boolean values indicating whether each cell is missing or not.

sum() then returns an integer for each column, representing how many rows in that column have no value.

E.g. 63 rows have a missing value in the country column.

I decide to:

  • Drop the rows where variety is missing
  • Fill in missing values for the rest of the columns

Why drop the rows where variety is missing? Variety is the type of grape. Each wine is made from a certain grape, so when a sommelier blind-tastes a wine, they are tasting a certain grape and writing the description. This also indicates that the input feature is description and the label is variety.

Drop the rows with missing variety.
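One way to do this, assuming the DataFrame is still called wine_df, is Pandas' dropna() with a subset:

# Keep only the rows where the label column 'variety' is present
wine_df = wine_df.dropna(subset=['variety'])
wine_df['variety'].isna().sum()  # should now be 0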

The row is gone. Excellent.

Fill missing value

To fill missing values I have two options:

  1. Pandas
  2. Imputer from Scikit-Learn

I eventually choose Scikit-Learn, but let's look at both.

Pandas

Define a function for filling missing data

Pandas' fillna() enables me to fill all missing values in the rows under a certain column. See the fillna documentation.

I pass a key/value dictionary to fillna:

  • Key: the column whose rows have missing values
  • Value: the value I want to use to fill those missing entries

For price I fill missing values with mean() (add all prices together and divide by the number of rows).

The rest of the columns are text based, so I fill them with the string 'Missing'.
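A rough sketch of such a function, assuming the remaining columns with gaps are price, country and province per the isna() counts (fill_missing_pandas is a hypothetical name; add more text columns to the dictionary as needed):

def fill_missing_pandas(df):
    """Fill missing values: column mean for price, 'Missing' for text columns."""
    return df.fillna({
        'price': df['price'].mean(),
        'country': 'Missing',
        'province': 'Missing',
    })

wine_df = fill_missing_pandas(wine_df)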

At the end I make sure that the price and points columns have the right types (float and integer respectively).

df['price'] = df['price'].apply(lambda x: round(x, 0))

I use the Pandas Series apply() method, which lets me loop over each value in the price column, and provide a lambda function to round each value to a whole number.

Scikit-Learn

Define a function for filling missing data.

SimpleImputer is an object for imputation.

Imputation: In statistics, imputation is the process of replacing missing data with substituted values

For the strategy parameter I use constant for text columns and mean for numeric columns such as price.

ColumnTransformer applies a transformer to each column.

ColumnTransformer: Applies transformers to columns of an array or pandas DataFrame

I define an imputer for each column. Order is important: ColumnTransformer takes a list of tuples, each of the form (name of imputer, imputer to use, columns to apply it to). Since I don't want to mess up the column order of the original DataFrame, I define an imputer for every column.

By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through

Because ColumnTransformer returns a 2-D array/matrix when the data is transformed with fit_transform(), I need to convert the result back into a DataFrame.
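Roughly, the per-column setup could look like the sketch below; the column list is an assumption based on what remains after the drop step, not a copy of the notebook:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# One (name, imputer, columns) tuple per column, listed in the original
# DataFrame order so the transformed array keeps that layout.
columns = ['country', 'description', 'points', 'price',
           'province', 'title', 'variety', 'winery']

def make_imputer(col):
    # mean for numeric columns, the constant 'Missing' for text columns
    if col in ('points', 'price'):
        return SimpleImputer(strategy='mean')
    return SimpleImputer(strategy='constant', fill_value='Missing')

ct = ColumnTransformer(
    [(f'{col}_imputer', make_imputer(col), [col]) for col in columns],
    remainder='passthrough',
)

# fit_transform() returns a plain 2-D array, so wrap it back into a DataFrame
wine_filled_df = pd.DataFrame(ct.fit_transform(wine_df), columns=columns)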

At the end, just as with the Pandas approach, I make sure that the price and points columns get their numeric types back (float and integer respectively), rounding price with apply() and a lambda:

df['price'] = df['price'].apply(lambda x: round(x, 0))

This is not an ideal way to do it with Scikit-Learn.

The better way to do it

Instead of defining an imputer for each column, here I define one imputer for the object-type (text) columns and one for the numerical (non-object) columns. It's less code and more flexible. The only drawback is that the column order is not the same as in the original DataFrame, but that's not a big deal.
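A sketch of this grouped-by-dtype approach (the variable names are mine):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Select columns by dtype instead of naming each one
text_cols = wine_df.select_dtypes(include='object').columns.tolist()
numeric_cols = wine_df.select_dtypes(exclude='object').columns.tolist()

ct = ColumnTransformer([
    ('text_imputer', SimpleImputer(strategy='constant', fill_value='Missing'), text_cols),
    ('numeric_imputer', SimpleImputer(strategy='mean'), numeric_cols),
])

# Text columns come first, numeric columns last — not the original order
wine_filled_df = pd.DataFrame(ct.fit_transform(wine_df),
                              columns=text_cols + numeric_cols)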

I choose Scikit-Learn

No missing values at all. Nice

Outliers for points and price

Wikipedia

Outlier: In statistics, an outlier is a data point that differs significantly from other observations.

points and price are numbers, and as a human being I am a visual animal. I can't tell whether a data point differs significantly from the others by looking at a list of numbers or a data table.

I can tell the difference in a graphical representation.

Points

I use Matplotlib's subplots to create a figure and an ax (a sub-figure inside the main figure). I then use Seaborn's boxplot to draw a boxplot.
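A minimal sketch of the plotting code, assuming the imputed DataFrame from the earlier sketches is called wine_filled_df:

# Cast to float in case imputation left the column as object dtype
fig, ax = plt.subplots(figsize=(10, 4))
sns.boxplot(x=wine_filled_df['points'].astype(float), ax=ax)
ax.set_title('points')
plt.show()

Swapping points for price in the same call produces the boxplot discussed in the Price section below.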

What is a boxplot?

It looks like we have two outliers on the right side.

From Kaggle we can see the description for points:

The number of points WineEnthusiast rated the wine on a scale of 1–100 (though they say they only post reviews for wines that score 80 or higher)

The points scale is 1–100, and these two outliers are within that range, so I conclude they are not outliers.

Price

Again it looks like there are a lot of outliers, but a bottle of wine can legitimately cost several thousand (4,000 or even 5,000). I conclude they are not outliers either.

Conclusion

Most of the time the dataset we are given is messy and not what we expect. We have tools that help us inspect the dataset, understand it in detail, and even locate missing values.

Importantly, we have to decide which columns/features are essential for training the model. To find out which features to keep or remove, we have to understand what problem we are solving. E.g. if we want to predict a company's future stock price, then the stock price history in the dataset becomes an important feature.

If an important feature is missing a value in some row, we can either remove the row or fill in the missing value. We don't want to feed data with missing values to our machine learning model: garbage in, garbage out.

Sometimes a data point differs significantly from the others. By plotting the data points graphically, we can spot outliers immediately and then decide whether or not to remove them from the dataset.

Part 3
