Origin of wine part 2

Published in

Software-Dev-Explore

4 min read · Nov 2, 2023

Introduction

Data is essential for training a model, and collecting it is the first step before building and training one. Luckily, the data collection had already been done for me.

With insufficient data, a model won’t predict well because it cannot extract patterns from the data. Fortunately, I have about 150k samples in the dataset, which should be sufficient for the model to learn from.

Data preprocessing is an essential step before data is fed into a model for training. It deals with problems in the data such as missing values, text data, noisy data, and unrelated variables.

Code

Notebook with code

Load dataset

First, I load the dataset with Pandas and display a few samples so I can get a clear idea of the data.
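A minimal sketch of this loading step follows. The CSV text is a small invented stand-in for the real file, and it uses only the column names mentioned in this post, so it is not the full set of 10 columns. With the real dataset, the `io.StringIO` object would be replaced by the path to the CSV file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; all values are invented.
csv_text = """country,description,designation,points,price,province,region_1,region_2
US,Ripe cherry and oak.,Reserve,96,235.0,California,Napa Valley,Napa
Spain,Dark fruit with spice.,,96,110.0,Northern Spain,Toro,
US,Bright citrus notes.,Special Selected,96,90.0,California,Knights Valley,Sonoma
"""

# With the real dataset this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(io.StringIO(csv_text))

# Display the first few samples and the (rows, columns) shape.
print(df.head())
print(df.shape)
```

Empty fields in the CSV are read as NaN, which is how missing values show up in the displayed samples.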

The dataset has 10 columns. The description column can serve as the input to the model. In the displayed samples, I notice that samples 1 and 4 have NaN in the region_2 column, which indicates a missing value. I suspect there are more NaN values in the dataset.

Data cleaning

The dataset needs to be cleaned before it is fed into the model for training. From the above, I know I need to deal with missing values. In addition, the region is split across two columns, region_1 and region_2, and I would like it in a single column. I also want to know how many missing values the dataset contains.

A summary of the DataFrame gives me a different way to inspect the dataset.

First, there are 150930 entries in total, also known as samples. Second, any column whose Non-Null Count is not equal to 150930 has missing values. Finally, points is int64, price is float64, and the remaining columns are of type object, which here means text.
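The Non-Null Count column described above is what `DataFrame.info()` prints, so the summary can be sketched as below. The toy frame and its values are invented for illustration. The `incomplete` list is an extra convenience assumed here, not something from the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; the real dataset has 150930 rows.
df = pd.DataFrame({
    "country": ["US", "Spain", None],
    "points": [96, 95, 90],
    "price": [235.0, np.nan, 20.0],
    "region_2": ["Napa", None, None],
})

# Prints the entry count, per-column Non-Null Count, and dtypes.
df.info()

# Columns whose non-null count is below the number of rows have missing values.
non_null = df.count()
incomplete = non_null[non_null < len(df)].index.tolist()
print(incomplete)
```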

Another way is to count the number of missing values in each column.

A better visualization of the missing values.
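One common way to get these per-column counts is `isnull().sum()`. This is a sketch on an invented toy frame, not necessarily the exact call used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values.
df = pd.DataFrame({
    "country": ["US", None, "France"],
    "price": [235.0, np.nan, 20.0],
    "region_2": [None, None, "Sonoma"],
    "points": [96, 95, 90],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```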

Drop samples

In my opinion, both country and province are essential for people looking for a wine’s origin. Without those two pieces of information, I wouldn’t know where to go. From the image above, there are 5 missing values in both country and province, so I decide to drop those samples.

I collect all samples that need to be dropped.

Then drop them from dataset.

Make sure they are dropped.

Yes, they are gone.
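The three steps above (collect, drop, verify) could look roughly like this. The toy frame is invented, and the sketch assumes a row should be dropped when either country or province is missing. In the dataset described here, both are missing in the same rows.

```python
import pandas as pd

# Toy frame with invented values; row 1 lacks both country and province.
df = pd.DataFrame({
    "country": ["US", None, "France", "Italy"],
    "province": ["California", None, "Bordeaux", "Tuscany"],
    "points": [96, 88, 92, 90],
})

# Collect the index labels of the samples that need to be dropped.
missing_origin = df["country"].isnull() | df["province"].isnull()
rows_to_drop = df[missing_origin].index

# Drop them from the dataset.
df = df.drop(index=rows_to_drop)

# Make sure they are dropped: no missing country or province should remain.
still_missing = int(df[["country", "province"]].isnull().sum().sum())
print(still_missing)
```

`df.dropna(subset=["country", "province"])` would give the same result in one call. Collecting the index first makes it easy to inspect the rows before removing them.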

Fill missing values

Missing values in designation, region_1 and region_2 can be filled with Unknown. For price, I would like to fill missing values with the median price.

For designation, region_1 and region_2.
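A sketch of filling the three text columns with a placeholder, using an invented toy frame. The `text_columns` list is a name introduced here for illustration.

```python
import pandas as pd

# Toy frame with invented values.
df = pd.DataFrame({
    "designation": ["Reserve", None, "Estate"],
    "region_1": ["Napa Valley", "Toro", None],
    "region_2": ["Napa", None, None],
})

# Replace every missing value in the selected text columns with "Unknown".
text_columns = ["designation", "region_1", "region_2"]
df[text_columns] = df[text_columns].fillna("Unknown")
print(df)
```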

For price.
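A sketch of the median fill on invented prices. `Series.median()` skips NaN by default, so the median is computed from the known prices only.

```python
import numpy as np
import pandas as pd

# Toy frame with invented prices; one price is missing.
df = pd.DataFrame({"price": [10.0, np.nan, 30.0, 50.0]})

# Median of the known prices (NaN is ignored by default).
median_price = df["price"].median()

# Fill the missing price with that median.
df["price"] = df["price"].fillna(median_price)
print(df)
```

The median is less sensitive than the mean to a few very expensive bottles, which is a common reason to prefer it for price-like columns.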

Make sure there are no missing values
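One way to run this final check is a small helper that lists any column still containing missing values. `columns_with_missing` is a hypothetical helper written for this sketch, and the frame holds invented, already-cleaned values.

```python
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> list:
    """Return the names of columns that still contain missing values."""
    counts = frame.isnull().sum()
    return counts[counts > 0].index.tolist()


# Toy frame standing in for the cleaned dataset (invented values).
df = pd.DataFrame({
    "country": ["US", "France"],
    "designation": ["Reserve", "Unknown"],
    "price": [235.0, 30.0],
})

# An empty list means the dataset has no missing values left.
remaining = columns_with_missing(df)
print(remaining)
```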

Conclusion

I inspected the dataset, dropped samples, and filled missing values to make sure the entire dataset is clean. However, this cleaned dataset is not yet ready for training the model.

The description column is the input to the model, but it is text, and the model accepts only numerical data, so the text needs to be transformed into numerical data. Moreover, the descriptions contain numbers, symbols, and punctuation, which might affect model performance.

Before transforming the text into numerical data, I need to use Natural Language Processing methods to preprocess the descriptions.

part 3