Wine Review PT8 — Prepare dataset — ML

Nelson Punch
Software-Dev-Explore
5 min read · Sep 27, 2021

Introduction

In part 7, we saw that we have to preprocess the data before training our model. We will define a function that loads the dataset and does the following:

  1. Remove duplicated data, fill in missing data, and remove the samples associated with any variety (grape) that has too few samples for training.
  2. Natural language processing
  3. Encode the target (variety/grape), also known as the label

In addition, we will define another function to handle the imbalanced dataset and perform data augmentation (text augmentation, in our case).

Dataset

Kaggle

We will use the winemag-data-130k-v2.csv dataset for machine learning.

Source Code

Code in Google Colab

Task

  • Load and preprocess the dataset
  • Handle the imbalanced dataset and apply text augmentation
  • Plot word vectors (optional)

Imbalanced dataset

First, we need to see whether there is an imbalanced-dataset issue.

We can see that the maximum number of samples for a variety is 13,272, compared to the minimum of 1.
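As a minimal sketch of this check (using a tiny hypothetical DataFrame in place of the real CSV), `value_counts` gives the per-variety sample counts:

```python
import pandas as pd

# Tiny hypothetical stand-in for winemag-data-130k-v2.csv
df = pd.DataFrame({"variety": ["Pinot Noir"] * 5 + ["Riesling"] * 2 + ["Gamay"]})

# value_counts() sorts the varieties by sample count, descending
counts = df["variety"].value_counts()
print(counts.max(), counts.min())  # prints "5 1": largest vs. smallest class
```

On the real dataset, the same two numbers are 13,272 and 1.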

Remove samples

Define a function to remove the samples associated with any variety whose number of samples is below a defined threshold.

The function takes a DataFrame, col_name (a column name), and a threshold. In our case, the column name variety will be passed as col_name.

In the function, we first count the number of samples for each variety and compare each count to the threshold, in order to find all the varieties whose associated samples should be removed.

vc_cut_off is a set of key/value pairs, where the key is the name of a variety and the value is the number of samples associated with that variety.

Once we have the variety names and their sample counts, we find the indices of those samples by matching on the variety name and gather these indices into an array.

Finally, we use this array of indices to remove the samples from the DataFrame. Of course, we also need to reset the DataFrame's index.
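Putting those steps together, a sketch of such a function might look like this (the name `remove_rare_classes` and the toy data are my own; the notebook's actual code may differ):

```python
import pandas as pd

def remove_rare_classes(df, col_name, threshold):
    """Drop every row whose value in `col_name` occurs fewer than `threshold` times."""
    vc = df[col_name].value_counts()
    # Key/value pairs: class name -> sample count, for classes below the threshold
    vc_cut_off = vc[vc < threshold]
    # Gather the indices of all rows belonging to those rare classes
    drop_idx = df.index[df[col_name].isin(vc_cut_off.index)]
    # Remove them and reset the index so it stays contiguous
    return df.drop(drop_idx).reset_index(drop=True)

# Toy usage: "Gamay" has only one sample, so it is removed at threshold 2
df = pd.DataFrame({"variety": ["Pinot Noir"] * 3 + ["Gamay"],
                   "points": [90, 91, 92, 88]})
trimmed = remove_rare_classes(df, "variety", threshold=2)
```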

Data/Text augmentation

The purpose of this function is to produce synthetic data so that the dataset is no longer imbalanced.

First, install and import the necessary modules.

Define a function for text augmentation.

Here we use the synonym method.

If you need different text augmentation methods, refer here.
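To illustrate the idea of synonym replacement (the post likely uses a WordNet-backed library such as nlpaug; this sketch substitutes a tiny hand-made synonym table, and the function name is my own):

```python
import random

# Tiny hypothetical synonym table; a real implementation would draw
# synonyms from WordNet (e.g. via nlpaug's SynonymAug)
SYNONYMS = {
    "bright": ["vivid", "crisp"],
    "fruit": ["berry"],
}

def synonym_augment(text, p=0.5, rng=None):
    """Replace each known word with a random synonym with probability p."""
    rng = rng or random.Random(0)
    words = []
    for w in text.split():
        if w in SYNONYMS and rng.random() < p:
            words.append(rng.choice(SYNONYMS[w]))
        else:
            words.append(w)  # unknown words pass through unchanged
    return " ".join(words)
```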

Next, define a function that handles the imbalanced dataset, using the function above for text augmentation.

The function takes two parameters:

  • text_feature: an array of texts (used for text augmentation)
  • target: an array of targets (labels; variety in our case), one for each text in text_feature

It returns the mapped texts and targets (varieties) as a tuple of lists.

First, we create a DataFrame from the provided parameters and extract the target column, then find the maximum number of samples associated with any variety; this is the number of samples every class is raised to when producing synthetic samples.

Second, for each class (variety), we produce the right number of synthetic samples. Before the second for loop, we work out how many synthetic samples to produce and gather all the existing samples that will be used for text augmentation. Inside the second for loop, we pick a sample at random, duplicate it, and apply text augmentation. Last, we append all of these synthetic samples back to the DataFrame.

Finally, reset the DataFrame's index.
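The whole procedure can be sketched as follows; `shuffle_augment` is a stand-in for the real synonym-based augmentation, and both function names are my own:

```python
import random
import pandas as pd

def shuffle_augment(text, rng):
    # Placeholder augmentation (shuffles word order); the article uses
    # synonym-based text augmentation here instead
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def balance_with_augmentation(text_feature, target, seed=0):
    """Oversample every minority class up to the majority class size."""
    rng = random.Random(seed)
    df = pd.DataFrame({"text": text_feature, "target": target})
    counts = df["target"].value_counts()
    max_count = counts.max()                # every class is raised to this size
    new_rows = []
    for cls, count in counts.items():
        pool = df.loc[df["target"] == cls, "text"].tolist()
        for _ in range(max_count - count):  # how many synthetic samples to make
            sample = rng.choice(pool)       # duplicate a random existing sample
            new_rows.append({"text": shuffle_augment(sample, rng), "target": cls})
    if new_rows:
        df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
    return list(df["text"]), list(df["target"])
```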

We will not use this function while loading the dataset; instead, it will be called after loading the dataset and before transforming the dataset into numerical form.

Plot word vector (optional)

To see the word vectors in a 2D graph.
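One common way to do this is to project the vectors down to two dimensions, for example with PCA (a sketch with random stand-in vectors; the notebook may use a different projection method):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical word vectors: one 5-dimensional embedding per word
rng = np.random.default_rng(0)
words = ["cherry", "oak", "citrus", "tannin"]
vectors = rng.normal(size=(len(words), 5))

# Project down to 2 dimensions; each row of `coords` is an (x, y) point
# that can be scattered with matplotlib and labelled with its word
coords = PCA(n_components=2).fit_transform(vectors)
```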

Loading dataset and preprocessing

Define a function for loading and preprocessing the dataset. The function can save the preprocessed dataset, and it can do the NLP step with either lemmatisation or stemming.

At the very beginning, we check whether a saved dataset already exists and return it if so; otherwise, we load the CSV file.

In Referenced dataframe, we:

  • Remove duplicated data
  • Drop samples whose description or variety is missing
  • Remove all samples whose associated variety has insufficient samples
  • Fill in the remaining missing data

This is the DataFrame containing all the information we can refer back to after the model's prediction.

In Transformed dataframe, we drop the columns that will not be features for training our model; in our case, the only feature is description. We then do natural language processing on description:

  • Remove numbers and punctuation
  • Lemmatisation or stemming
  • Remove stop words
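A simplified sketch of this cleaning step (the notebook presumably uses NLTK's stemmer or lemmatiser and stop-word list; here both are replaced by naive stand-ins):

```python
import re

STOP_WORDS = {"the", "a", "and", "of", "with"}  # tiny illustrative subset

def naive_stem(word):
    # Crude stand-in for NLTK's PorterStemmer / WordNetLemmatizer
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_description(text):
    """Lower-case, strip digits/punctuation, drop stop words, and stem."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters only
    return " ".join(naive_stem(w) for w in text.split() if w not in STOP_WORDS)
```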

In addition, I also use a keyword extractor here to pull the important keywords out of the text. We can use either Rake or Gensim. However, this step is optional.

Rake

Gensim

Last, we replace every description with its processed version and count the frequency of each term across all processed descriptions.
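Counting term frequencies over the processed descriptions can be done with `collections.Counter`, for example:

```python
from collections import Counter

# Two already-processed descriptions (toy examples)
descriptions = ["aroma cherri spic", "aroma oak tannin"]

# Frequency of every term across the whole processed corpus
term_freq = Counter(w for d in descriptions for w in d.split())
```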

This DataFrame includes the processed descriptions, which are our features.

In Processing labels (y), we need to encode our target/label (variety, in our case) into numerical form. Here we use LabelEncoder, which encodes text into integers.
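For example, with scikit-learn's LabelEncoder (toy variety names of my own):

```python
from sklearn.preprocessing import LabelEncoder

varieties = ["Pinot Noir", "Riesling", "Pinot Noir", "Gamay"]
encoder = LabelEncoder()
y = encoder.fit_transform(varieties)  # classes are sorted alphabetically

# inverse_transform maps the integers back to variety names, which is how
# a model's integer prediction can be turned back into a grape variety
```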

We add the encoded labels to the referenced DataFrame, so that later we can use the integer from our model's prediction to retrieve all the information from the DataFrame.

Finally, we pack all the data into a dictionary and save it to a certain location.
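A sketch of this save-and-reload step using `pickle` (the dictionary contents, file name, and location are placeholders):

```python
import os
import pickle
import tempfile

# Hypothetical packed dataset; the real dictionary holds the DataFrames too
data = {"X": ["aroma cherri"], "y": [0], "classes": ["Gamay"]}

# Save once; later runs can load the pickle instead of re-preprocessing
path = os.path.join(tempfile.gettempdir(), "wine_dataset.pkl")  # placeholder path
with open(path, "wb") as f:
    pickle.dump(data, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)
```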

With this function, we can simply load the dataset into memory for later use.

Note that this function only prepares and preprocesses the dataset; it does not handle the imbalanced dataset or transform the features into numerical form.

Conclusion

The functions we have written take care of loading the dataset, preprocessing it, and handling imbalance.

To recall what we have done:

  • A function that produces synthetic samples if the dataset is imbalanced; moreover, it uses text augmentation to produce the synthetic samples in a sensible way (not from word vectors, but by transforming the words themselves)
  • A function that removes the samples associated with any variety whose number of samples is below a threshold
  • A function that loads and processes the dataset, using NLP to process description and encoding variety as integers

Preprocessing the dataset is an essential and substantial part of machine learning. The right strategy varies depending on the kind of data we have; there is not only a wide range of methods to choose from, but also a number of different techniques involved.

Experimentation is important, as each method involved in preprocessing the dataset can affect the performance of the machine learning model.

Fortunately, we have wrapped everything in functions, so we can swap, delete, or reorder them.

Next, we are going to find out which estimator (model) fits our problem.
