Origin of wine part 3

Published in

Software-Dev-Explore

3 min read · Nov 2, 2023

Introduction

To turn text descriptions into numerical data, I can use TfidfVectorizer, a feature extraction technique that transforms the descriptions in our dataset from text into numbers, which I can then feed into a machine learning model for training.

Before doing feature extraction, I need to apply natural language processing to the descriptions. The main reason is to improve information retrieval from the text data. It also reduces noise in the text and increases the overall performance of the machine learning model. Here I remove numbers and punctuation from the text, then remove stopwords, followed by either stemming or lemmatizing the words.

Code

Notebook with code

Natural Language Processing

All the natural language processing functionality is enclosed in one class, which is a custom transformer for transforming the dataset. There are 2 reasons to do so.

Usability and cleaner code
Later I will use Pipeline from Scikit-Learn to train the model.

Here is an article on how to create a custom transformer from scratch.

I used NLTK for the natural language processing tasks. The complete code is below.

Remove numbers and punctuation

Lines 79 ~ 96 are where numbers and punctuation are removed.

For each description, it simply replaces numbers and punctuation with white space by using a regex. It then splits the entire text into an array of words and rejoins them, which collapses any repeated spaces.
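The step described above can be sketched as follows. This is a minimal illustration rather than the notebook's code: the function name and the exact regex pattern are assumptions.

```python
import re

# Matches punctuation (anything that is neither a word character nor
# whitespace) as well as digits and underscores. The pattern is an assumption.
_NUMBERS_AND_PUNCTUATION = re.compile(r"[^\w\s]|[\d_]")

def remove_numbers_and_punctuation(text: str) -> str:
    """Replace numbers and punctuation with spaces, then collapse whitespace."""
    spaced = _NUMBERS_AND_PUNCTUATION.sub(" ", text)
    # split() with no arguments drops empty strings, so rejoining
    # leaves exactly one space between words
    return " ".join(spaced.split())

print(remove_numbers_and_punctuation("Full-bodied, 2010 vintage; 14% alcohol!"))
# Full bodied vintage alcohol
```

Because `\w` is Unicode-aware for Python strings, accented letters such as the é in Rosé are kept.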

Remove stopwords

It is a good idea to eliminate frequently used words from a text because those words often don’t represent the meaning of the text. An example wine description is below.

Full-bodied and complex with spicy cherry aromas and flavors enhanced by a savory and herbal old world style character

Full-bodied, spicy, cherry, aromas, savory, and herbal are the key words in the description. Removing the other words from the description can therefore lead to better information retrieval later.

Lines 99 ~ 114 are where stopwords are removed from each given text.

For each text, it iterates over the words and filters out any word that appears in the stopword list.
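A rough sketch of that filter is shown here. The small stopword set is an assumption made for illustration and stands in for a full list such as NLTK's English stopwords. The function name is also hypothetical.

```python
# Tiny illustrative stopword set (an assumption, not NLTK's actual list)
STOPWORDS = {"a", "an", "and", "by", "in", "is", "of", "the", "with"}

def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Keep only the words that do not appear in the stopword set."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)

description = (
    "Full bodied and complex with spicy cherry aromas and flavors "
    "enhanced by a savory and herbal old world style character"
)
print(remove_stopwords(description))
# Full bodied complex spicy cherry aromas flavors enhanced savory herbal old world style character
```

Using a set rather than a list keeps each membership check fast, which matters when the same filter runs over many descriptions.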

Stemming and Lemmatization

The common idea behind stemming and lemmatization is to reduce a word to its root form. For example, played and playing will both be reduced to play. However, their mechanisms for reducing a word to its root form are different.

See here for the differences between stemming and lemmatization and their pros and cons.

For speed I can choose stemming. For accuracy I can choose lemmatization.

Lines 137 ~ 165 are lemmatization.

Lines 167 ~ 189 are stemming.

They both start with word tokenization. Stemming then simply strips a word down to its root form, whereas lemmatization uses a word’s part of speech in the text to reduce it to its root form.
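To make the difference in mechanism concrete, here is a toy sketch in plain Python. It is not NLTK's implementation: the suffix rules and the lemma dictionary are invented for illustration. The lemmatizer takes a part-of-speech tag, which is also how NLTK's WordNetLemmatizer chooses between candidate root forms.

```python
# Toy suffix rules (an assumption); a real stemmer such as Porter has many more
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word: str) -> str:
    """Chop the first matching suffix, with no knowledge of the language."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy dictionary keyed by (word, part of speech); the entries are hypothetical
LEMMAS = {
    ("played", "verb"): "play",
    ("playing", "verb"): "play",
    ("better", "adj"): "good",
    ("aromas", "noun"): "aroma",
}

def toy_lemmatize(word: str, pos: str) -> str:
    """Look the word up by part of speech; unknown words are left unchanged."""
    return LEMMAS.get((word, pos), word)

print(toy_stem("playing"), toy_lemmatize("playing", "verb"))  # play play
print(toy_stem("glass"), toy_lemmatize("glass", "noun"))      # glas glass
print(toy_stem("better"), toy_lemmatize("better", "adj"))     # better good
```

The rule-based stemmer is cheap but can produce non-words such as glas, while the lookup-based lemmatizer returns real words at the cost of needing a dictionary and a part-of-speech tag.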

Put it together

Lines 60 ~ 76 are where the text transformation happens. The order of transformations applied to a list of texts is as below.

Remove numbers and punctuation
Stem or lemmatize
Remove stopwords
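The three steps, in that order, can be sketched as a class with the fit/transform interface. This is a simplified stand-in for the custom transformer, not the notebook's code: the class name `TextCleaner`, the toy stemmer, the small stopword set, and the lowercasing step are all assumptions.

```python
import re

# Illustrative stopword set (an assumption standing in for a full list)
STOPWORDS = frozenset({"a", "an", "and", "by", "of", "the", "with"})

def toy_stem(word):
    """Stand-in for a real stemmer or lemmatizer: strips a few suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

class TextCleaner:
    """Hypothetical transformer exposing fit and transform."""

    def __init__(self, reduce_word=toy_stem, stopwords=STOPWORDS):
        self.reduce_word = reduce_word
        self.stopwords = stopwords

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns self
        return self

    def transform(self, X):
        return [self._clean(text) for text in X]

    def _clean(self, text):
        # 1. Remove numbers and punctuation
        text = re.sub(r"[^\w\s]|[\d_]", " ", text)
        # Lowercasing is an added assumption so stopword matching ignores case
        words = text.lower().split()
        # 2. Stem or lemmatize
        words = [self.reduce_word(word) for word in words]
        # 3. Remove stopwords
        return " ".join(word for word in words if word not in self.stopwords)

texts = ["Spicy cherry aromas, and 2 flavors!"]
print(TextCleaner().fit(texts).transform(texts))
# ['spicy cherry aroma flavor']
```

For use inside a Scikit-Learn Pipeline, such a class would typically also inherit from `BaseEstimator` and `TransformerMixin`, which supply `get_params`, `set_params`, and `fit_transform`. One consequence of this order is that stopwords are matched against the reduced forms, so the stopword list has to account for whatever the stemmer or lemmatizer outputs.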

Transformation

Now I can transform the description text into standardized text by using the transformer I just created.

Feature extraction

To extract features from the text and turn them into numerical data, I can use TfidfVectorizer.

The data I get from TfidfVectorizer is a sparse matrix in which most elements are zero. 149835x21016 means 149835 samples and 21016 variables (columns) for each sample.

I can also find out the total size of the vocabulary.
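To show where the numbers and the zeros come from, here is a plain-Python sketch of the weighting that TfidfVectorizer applies with its default settings: term count times a smoothed idf, followed by L2 normalization of each row. It is not the library's code. The function name and the three sample documents are made up, and tokenization is simplified to splitting on whitespace.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return the sorted vocabulary and one {word: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {word: count * idf[word] for word, count in counts.items()}
        # L2-normalize the row; "or 1.0" guards against an empty document
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({word: v / norm for word, v in weights.items()})
    return vocab, rows

docs = ["spicy cherry aroma", "cherry flavor", "herbal aroma"]
vocab, rows = tfidf(docs)
print(f"{len(rows)}x{len(vocab)}")  # 3x5
print(vocab)  # ['aroma', 'cherry', 'flavor', 'herbal', 'spicy']
```

Each row stores only the words that occur in that document, so every other column is implicitly zero. With tens of thousands of columns and short descriptions, almost every entry is zero, which is why a sparse matrix is used. In the first document, spicy gets a higher weight than cherry because it appears in fewer documents.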

Conclusion

All the effort in natural language processing is for better feature extraction. Later, these two techniques will be incorporated into the training pipeline.

Now I have a numerical dataset to feed into machine learning, but the dataset does not provide labels. For a machine to learn, we need to provide both training data and labels.

One solution to create labels is to use KMeans from Scikit-Learn to generate labels for the dataset.

part 4