Wine Review PT5 — NLP — ML

Published in

Software-Dev-Explore

3 min read · Sep 9, 2021

Introduction

After part 4, we now need to analyse the descriptions given by sommeliers. Sommeliers compose words into sentences and sentences into paragraphs. We can try to find out which words are used most often in descriptions, both for a particular wine and for a variety (grape). These words will become the features for the machine to learn from.

To find out which words are popular in wine descriptions, we need to use NLP (Natural Language Processing) to analyse each description.

Dataset

Kaggle

We will use the winemag-data-130k-v2.csv dataset for machine learning.

Source Code

Code in Google Colab

Task

Use NLP to analyse descriptions

NLP

NLP stands for Natural Language Processing. It is a field of machine learning concerned with processing human language, and removing stop words is one of its basic techniques.

Take stop words for example: removing stop words eliminates the unnecessary words while keeping the key words in a sentence. Imagine the sentence “It is an apple and it is delicious”. After we remove the stop words, it becomes “apple delicious”.

The words apple and delicious are kept because they carry the meaning of the sentence. As humans, we pick out the same key points when we read it.

The good news is that there is a toolkit called NLTK ready to use, so we don’t need to train a natural language processing model ourselves.

3 steps to process natural language:

Remove numbers and punctuation
Remove stop words
Stem/lemmatise words

Remove numbers and punctuation

Sentences may contain punctuation or digits, so we have to remove them before we move on to stop words.

re is a Python module that, among other things, substitutes text in a given string using regex (regular expressions).

What the hell does (‘(?:\w+))|\\r\\n|\\n|\\r|[^a-zA-Z] mean as a regular expression? To test regular expressions online, we can use regex101.

Notice that the | symbol appears in the regular expression. It means OR. For instance, case1 | case2 matches either case1 or case2.
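To make the OR behaviour concrete, here is a minimal sketch using Python's re module. The pattern and the sample sentence are made up for illustration and are not taken from the original notebook.

```python
import re

# "red|white" matches either the text "red" or the text "white".
colour_pattern = re.compile(r"red|white")

# Illustrative sentence, not from the dataset.
matches = colour_pattern.findall("A red blend and a white blend")
print(matches)
```

Each alternative is tried from left to right at every position in the string, so the order of the alternatives matters when they can overlap.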

(‘(?:\w+)) The outer parentheses form a capture group, and inside the group we have ‘ and (?:\w+). ‘ matches the ‘ symbol itself. \w+ matches one or more word characters, and the (?:…) wrapper is a non-capturing group: it groups the enclosed pattern without capturing it separately. For example, in “I’m fine” the match is ’m, that is, everything after “I” and before the space between “I’m” and “fine”.
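A quick way to check this part of the pattern is to run it on its own. The sketch below assumes a plain ASCII apostrophe (') rather than the typographic quote shown above, so adjust the character if your descriptions use curly quotes.

```python
import re

# Assumption: a straight apostrophe followed by one or more word characters.
apostrophe_suffix = re.compile(r"('(?:\w+))")

# findall returns the content of the single capture group for each match.
found = apostrophe_suffix.findall("I'm fine")

# Substituting a space removes the suffix and leaves the rest of the sentence.
replaced = apostrophe_suffix.sub(" ", "I'm fine")

print(found)
print(replaced)
```

Because the inner group is non-capturing, findall reports only one group per match (the apostrophe plus the letters after it).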

\\r\\n Matches a Return (keyboard Enter) followed by a Newline, together.

\\n Matches only a Newline

\\r Matches only a Return (keyboard Enter)

[^a-zA-Z] Matches any single character that is not a letter from a to z, in either lower or upper case

For the flags parameter, we can refer here

After the substitution, we convert the string to lower case. It is a good idea to keep all characters in lower case before we use them in NLP.
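Putting this step together, a minimal sketch could look like the following. The function name is hypothetical, the pattern is written as a raw string with a straight apostrophe (an assumption about the original code), and collapsing repeated spaces at the end is an extra convenience that the text above does not mention.

```python
import re

# Assumed pattern: apostrophe suffixes, line breaks, and any non-letter character.
NON_LETTER_PATTERN = re.compile(r"('(?:\w+))|\r\n|\n|\r|[^a-zA-Z]")


def remove_numbers_and_punctuation(text: str) -> str:
    """Replace everything that is not a letter with a space, then lower-case."""
    substituted = NON_LETTER_PATTERN.sub(" ", text)
    # split() + join() collapses the runs of spaces left by the substitution.
    return " ".join(substituted.lower().split())


# Illustrative sentence, not from the dataset.
cleaned = remove_numbers_and_punctuation("It's a 2015 Merlot,\r\nripe & juicy!")
print(cleaned)
```

Replacing with a space instead of an empty string keeps neighbouring words from being glued together when a punctuation mark sits between them.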

Remove stop words

Here we need to remove all stop words from the given string.

We first download the stopwords package, which contains all the pre-defined stop words. Secondly, we select the English stop words. Finally, we add our own stop words.

In the function, we split the string into a list of words and then filter out all the words that appear in the stop words list.
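The steps above can be sketched as follows. Both function names are hypothetical, and NLTK is imported inside the loader only so that the filter itself can be tried without NLTK installed. The small stop word set in the example is hand-written for illustration.

```python
from typing import Iterable, Set


def load_stop_words(extra_words: Iterable[str] = ()) -> Set[str]:
    """Download NLTK's stop word lists and return the English ones plus our own."""
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # needs network access the first time
    return set(stopwords.words("english")) | set(extra_words)


def remove_stop_words(text: str, stop_words: Set[str]) -> str:
    """Split the text into words and drop every word found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


# Hand-written stop words for illustration; load_stop_words() returns the real list.
filtered = remove_stop_words(
    "it is an apple and it is delicious", {"it", "is", "an", "and"}
)
print(filtered)
```

With NLTK available, the same filter would be called as remove_stop_words(text, load_stop_words(["wine"])), where "wine" stands in for whatever custom stop words we decide to add.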

Stem/lemmatise words

This is the final step of NLP. So far we have removed numbers and punctuation and then removed all stop words. However, a word can appear in different forms depending on context. For example, in “I play piano” and “I am playing a video game”, play and playing both denote the verb play, but they take two different forms.

Both stemming and lemmatisation are used for word normalisation, but they differ slightly. For the differences, we can refer here.

Stem: better performance and straightforward to implement, but the result is less accurate

Lemmatisation: the result is much better, but it is slower

For Stem

For Lemmatisation

Here I choose lemmatisation.
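As a rough sketch, both options can share one helper that applies a word-level function to every word. All three function names are hypothetical. PorterStemmer and WordNetLemmatizer are the NLTK classes assumed here, and the original notebook may have used different ones.

```python
from typing import Callable


def normalise_words(text: str, normalise: Callable[[str], str]) -> str:
    """Apply a word-level normaliser (stemmer or lemmatiser) to every word."""
    return " ".join(normalise(word) for word in text.split())


def make_stemmer() -> Callable[[str], str]:
    """Return NLTK's Porter stemmer as a plain word -> word function."""
    from nltk.stem import PorterStemmer

    return PorterStemmer().stem


def make_lemmatiser() -> Callable[[str], str]:
    """Return NLTK's WordNet lemmatiser as a plain word -> word function."""
    import nltk
    from nltk.stem import WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # needs network access the first time
    # Note: lemmatize() treats each word as a noun unless a pos argument is given.
    return WordNetLemmatizer().lemmatize
```

With NLTK installed, normalise_words(text, make_lemmatiser()) gives the lemmatised version and normalise_words(text, make_stemmer()) the stemmed one, so switching between the two is a one-line change.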

PS: we can also try stemming/lemmatising before removing stop words.

Analyse descriptions with NLP

Now that our NLP functions are ready, we can start using them on the descriptions.
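To show the idea end to end, here is a simplified, standard-library-only sketch that cleans a couple of descriptions and counts the most common words. The two descriptions and the stop word set are made up for illustration (they are not rows from the dataset), the function names are hypothetical, and the lemmatisation step is left out to keep the example short.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hand-written stop words for illustration only.
SAMPLE_STOP_WORDS = {"the", "and", "with", "a", "of", "is", "this"}


def preprocess(description: str, stop_words: Set[str]) -> List[str]:
    """Keep letters only, lower-case, split into words, drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", description).lower()
    return [word for word in letters_only.split() if word not in stop_words]


def most_common_words(
    descriptions: Iterable[str], stop_words: Set[str], top_n: int = 3
) -> List[Tuple[str, int]]:
    """Count the remaining words across all descriptions."""
    counts: Counter = Counter()
    for description in descriptions:
        counts.update(preprocess(description, stop_words))
    return counts.most_common(top_n)


# Made-up descriptions, not taken from winemag-data-130k-v2.csv.
descriptions = [
    "Aromas of cherry and plum, with a hint of oak.",
    "This wine is ripe with cherry and blackberry fruit.",
]
top_word = most_common_words(descriptions, SAMPLE_STOP_WORDS, top_n=1)
print(top_word)
```

On the real dataset the same counting could be done per wine or per variety, which is how the most frequent words end up as candidate features.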

Conclusion

Even with help from NLTK, it still takes a while to get everything ready for processing natural language.

NLP allows us to extract important information from sentences written by other people. It is a part of machine learning and, much like a human reader, it goes through sentences, understands the context, and extracts the key information.

Part 6