Wine Review PT7 — Preprocessing dataset — Strategy — ML

Nelson Punch
Software-Dev-Explore
6 min read · Sep 18, 2021

Introduction

Preparing and preprocessing a dataset is a big part of any project. We can’t just feed the raw data we have into a machine learning model and tell it to learn patterns from the data. Instead, we must prepare and preprocess the raw data before feeding it to our model for training.

More often than not you will find missing values in a dataset, so we must prepare our data beforehand. Data comes in many different forms, such as images, voice and text, and preprocessing is how we make sure the data is compatible with the form of input the machine learning model expects.

Always remember that a machine learning model only accepts numerical input, hence preprocessing the dataset is essential.

There are some problems we might encounter before training our model.

First, data duplication. We need to drop duplicated rows and keep only one sample of each.

Secondly, missing data. We fill in the missing values of a sample where possible, or we simply drop the sample.

Third, imbalanced datasets. More samples mean we can train our model to recognise the patterns in the data much better than with only one or two samples. In a classification problem, it is a good idea to have sufficient samples for each category when training a model, otherwise the model might be biased toward certain categories.

Finally, transforming data. In our case we need to transform our data from text into numbers, which is also known as text feature extraction.

Optionally, we can incorporate a technique called data augmentation. Data augmentation is widely used when training neural networks with images as input data. It simply flips, zooms, scales, shears and rotates images to produce a variety of new images.
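For images, a minimal sketch of that idea might look like this; using torchvision is my assumption here, since the article does not name a library:

```python
from PIL import Image
from torchvision import transforms

# Chain a few random transforms to produce varied copies of one image.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                    # flip
    transforms.RandomRotation(degrees=15),                # rotate
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # zoom/scale
])

image = Image.new("RGB", (256, 256))           # stand-in for a real photo
variants = [augment(image) for _ in range(5)]  # five augmented variants
```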

Data duplication and filling missing data

Removing duplicated data is very straightforward, especially with Pandas.
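For example, a minimal sketch (the file and column names are assumptions based on the 130k wine review dataset used in this series):

```python
import pandas as pd

df = pd.read_csv("winemag-data-130k-v2.csv")  # assumed file name

# Drop duplicated rows, keeping the first occurrence of each.
df = df.drop_duplicates(keep="first")

# Or deduplicate on a single column, e.g. the review text.
df = df.drop_duplicates(subset=["description"], keep="first")
```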

Filling missing data we have already done earlier, where we made two different functions: one with Pandas and another with Scikit-learn.
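As a reminder, both approaches look roughly like this (the “price” column is an assumed example):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Pandas: fill missing values in a column with that column's mean.
df["price"] = df["price"].fillna(df["price"].mean())

# Scikit-learn: SimpleImputer does the same and can be reused in a pipeline.
imputer = SimpleImputer(strategy="mean")
df[["price"]] = imputer.fit_transform(df[["price"]])
```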

Transform data

The dataset is text, which is not valid numerical input for a machine learning model, so we have to turn it into numerical form.

There are two techniques to transform text data into numerical form: Tf-idf and Word embeddings.

Regardless of which technique you end up with, both produce vectors as their result. In other words, a vector is used to represent a word in a text or a sentence. For example, a vector like this

[0.58149261, 0.38149261 , 0.81355169]

may represent the word “car”. This is Text Feature Extraction.

We know a three-dimensional vector is a point in 3D space measured from the origin ([0.0, 0.0, 0.0]), hence a set of words plotted in 3D space looks like a scatter of points, like stars in a galaxy.

Tf-idf and Word embeddings are different. Tf-idf means term frequency times inverse document frequency, whereas Word embeddings are learned representations of text in which words with similar meaning have a similar representation. For example, with Word embeddings the words “man” and “boy” have similar vectors.

Compare Tf-idf with Word embeddings
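A minimal sketch of both techniques; the libraries and the pretrained GloVe model are my own choices here, not necessarily what we will use later in the series:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["smoke acidity bitter", "sweet fruit creamy"]

# Tf-idf: each document becomes a sparse vector of term weights.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())

# Word embeddings: each word maps to a dense vector, and words with
# similar meaning land close together.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # downloads the model on first use
print(glove.similarity("man", "boy"))  # relatively high
print(glove.similarity("man", "car"))  # lower
```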

Imbalanced dataset

An imbalanced dataset happens when the samples are not uniformly distributed across the categories, i.e. the number of samples per category differs by orders of magnitude. For instance, 100 samples for category A compared to 1 sample for category B.

An imbalanced dataset causes our model to make biased predictions. In the classic fraud-detection example, if 99% of the samples are “No Fraud”, the model can simply predict “No Fraud” every time and still be right 99% of the time. We certainly don’t want this to happen.

To solve this problem, we can introduce two techniques, Under-sampling and Over-sampling. Under-sampling reduces samples in the dataset to reach a balanced dataset, and Over-sampling increases them.

  • Under-sampling: reduce samples
  • Over-sampling: increase samples by creating synthetic samples

There is a set of algorithms in each domain. For over-sampling we have SMOTE, ADASYN, BorderlineSMOTE and more. For under-sampling we have EditedNearestNeighbours, AllKNN, TomekLinks and more.

You can even combine over-sampling and under-sampling. There is a library, imbalanced-learn, to help us do this; see its documentation for more detail.
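A minimal sketch with imbalanced-learn, using toy data in place of our real features and labels:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

# Toy imbalanced data standing in for our feature matrix and grape labels.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

X_over, y_over = SMOTE().fit_resample(X, y)                      # synthetic minority samples
X_under, y_under = EditedNearestNeighbours().fit_resample(X, y)  # remove noisy majority samples
X_both, y_both = SMOTEENN().fit_resample(X, y)                   # over- then under-sampling
print(Counter(y_over), Counter(y_under), Counter(y_both))
```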

Our goal is to make sure we have a similar number of samples in each class; in addition, the new samples we produce must not be duplicates but synthetic.

Problems that I have encountered:

  1. Excessive memory usage leads to a program crash. There are 130k samples in our dataset, and over-sampling will produce even more samples for the minority classes, so the dataset might exceed memory capacity.
  2. Producing unexpected synthetic samples. For instance, take a sample with the description “smoke acidity bitter” describing grape A, and another sample with the description “sweet fruit creamy” describing grape B. After over-sampling, a new synthetic sample might read “smoke acidity bitter sweet fruit creamy” and be labelled grape A. In fact, that synthetic sample describes both grape A and grape B. Oops, that is not the result we want.

That means the imbalanced-learn library is not going to help us much here.

Solution

I researched solutions for a while and came up with a strategy to tackle the problem, although it is not the best.

Excessive memory usage

We are going to remove from the dataset the classes (grapes) which have insufficient samples for training our model. For example, remove all classes (grapes) where the number of samples is under 1000. The reason is that our model is not going to learn well on minority classes (grapes) anyway, and dropping them keeps the dataset small enough to fit in memory.
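A minimal sketch with Pandas; the column name “variety” for the grape is an assumption based on the wine review dataset:

```python
# Assumed: df is the wine review DataFrame and "variety" holds the grape name.
counts = df["variety"].value_counts()
keep = counts[counts >= 1000].index  # classes with at least 1000 samples
df = df[df["variety"].isin(keep)]    # drop all minority classes
```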

Producing unexpected synthetic sample

Data augmentation comes to the rescue. I found two ideas for doing data augmentation (in our case, text augmentation).

They must be done before turning text into vectors.

  1. Language translation. You can translate text from one language to another and then translate it back to the original language. In this process some of the words change but keep their meaning, so we can create a new sample from the new text: a different sample that still describes the particular grape.
  2. nlpaug. This library can help us do text augmentation. The words in the text are augmented (replaced by different words without losing their meaning) by applying nlpaug. A sketch of both ideas follows this list.
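A minimal sketch of both ideas with nlpaug; the translation model names are examples from nlpaug’s documentation, and SynonymAug additionally needs NLTK’s WordNet data downloaded:

```python
import nlpaug.augmenter.word as naw

# Idea 1: back-translation (English -> German -> English).
back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)
print(back_translation.augment("sweet fruit creamy"))

# Idea 2: replace words with WordNet synonyms of the same meaning.
synonym = naw.SynonymAug(aug_src="wordnet")
print(synonym.augment("smoke acidity bitter"))
```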

Conclusion

There might be more steps needed when preprocessing a dataset; it depends on the dataset you have. Regardless of the dataset, it is necessary to transform it into numerical form as input for a machine learning model.

It requires planning and experiments to figure out a preprocessing strategy. In our case: data duplication, filling missing data, the imbalanced dataset and transforming data.

Imbalanced dataset. It involves over-sampling, under-sampling or both. Due to the problems we faced, we came up with a strategy consisting of removing the classes (grapes) which have insufficient samples, plus data augmentation (text augmentation).

Transform data. Turn text into numerical data as input for machine learning. There are two options for us, either Tf-idf or Word embeddings, and we know they both extract features from text and turn them into vectors as representations.

Last, any of the techniques involved in preprocessing the dataset will have an impact on the performance of our model.

Next, we will implement them in part 8.
