Automated NLP with Prevision.io (Part 1: Naive Bayes Classifier)

Zeineb Ghrib
Prevision.io
6 min read · Nov 10, 2020


In this post we will present some new features integrated into the Prevision.io platform that cover Natural Language Processing.

Textual features are usually trickier and harder to process than linear/categorical features. We need to transform text into a machine-readable format, which requires a lot of pre-processing.

The main challenges in text based processing are the following:

  • Data cleaning: it is not about finding outliers or handling missing values here... texts may contain useless symbols that have to be removed (URLs, HTML tags, emoticons, punctuation) as well as spelling errors, which requires a lot of work 😱
  • It is not so much about how to transform words into numbers as about which is the best way to do it: you could simply map each word to a unique index and be done with it 🙈, but that would not be an efficient representation, and the ML algorithms would end up with disappointing performance.
  • Unlike categorical features, a textual feature can quickly reach a huge vocabulary, so an encoding such as one-hot encoding would be very sparse and inefficient 🐌
  • Word context is meaningful information that needs to be captured. Example:
    “he does not eat sea food” => it is not about not eating food but rather not eating sea food

Introduction

Fortunately, Prevision AutoML makes it easy for us: all the transformations mentioned above are applied automatically, and you can also choose which operations to apply. The textual feature engineering techniques supported by the platform are the following:

  • Statistical-analysis based transformation (TF-IDF): words are mapped to numerical values computed with the tf-idf metric. The platform has integrated fast algorithms that make it possible to keep the full uni-gram and bi-gram tf-idf encoding without having to apply dimensionality reduction.
  • Word embedding transformation: words are projected into a dense vector space where the semantic distance between words is preserved: Prevision trains a word2vec algorithm on the input corpus to generate the corresponding vectors.
  • Sentence embedding: Prevision has integrated BERT-based transformers as pre-trained contextual models that capture word relationships in a bidirectional way. A BERT transformer makes it possible to generate more efficient vectors than word embedding algorithms, since it has a linguistic “representation” of its own. For text classification, these vector representations can be used as input to basic classifiers.

In this article we will dig deeper into the first NLP operator (TF-IDF) followed by a naive Bayes classifier. First we will test it out with Python code on a Kaggle dataset called Real or Not? NLP with Disaster Tweets, then we will launch an AutoML use case with Prevision on the same dataset.

1- DIY method (Do It Yourself):

Step 1: Text cleaning 🧹

Regardless of the EDA step, which can bring out the unclean elements and help us customize the cleaning code, we can apply some basic data cleaning that is recurrent for tweets, such as removing punctuation, HTML tags, URLs and emojis, and correcting spelling...

Below is a Python snippet that can be reproduced in other similar use cases 😉
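A minimal sketch of such a cleaning function, assuming the tweets are loaded from the Kaggle train.csv file with a text column (spelling correction is omitted for brevity):

```python
import re
import string

import pandas as pd


def clean_text(text: str) -> str:
    """Basic tweet cleaning: lower-case, then strip URLs, HTML tags, emojis and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)       # remove URLs
    text = re.sub(r"<.*?>", " ", text)                        # remove HTML tags
    text = text.encode("ascii", "ignore").decode("ascii")     # drop emojis / non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return re.sub(r"\s+", " ", text).strip()                  # normalize whitespace


train = pd.read_csv("train.csv")                  # Kaggle "Real or Not?" training file
train["clean_text"] = train["text"].apply(clean_text)
```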

Step 2: Tokenization:

Split the initial text input into small sub-texts using a 2-gram representation: the unique unigram and bigram tokens define the dataset vocabulary.
Example :

“oh my god there is another earthquake”

=> The uni-gram representation is ['oh', 'my', 'god', 'there', 'is', 'another', 'earthquake']; the bi-gram representation would be ['oh my', 'my god', 'god there', 'there is', ...]
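For illustration, a tiny helper that produces these tokens from a whitespace-tokenized sentence (a sketch; the function name is mine):

```python
def ngrams(text: str, n: int) -> list:
    """Return the list of n-gram tokens of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "oh my god there is another earthquake"
unigrams = ngrams(sentence, 1)             # ['oh', 'my', 'god', 'there', 'is', 'another', 'earthquake']
bigrams = ngrams(sentence, 2)              # ['oh my', 'my god', 'god there', 'there is', ...]
vocabulary = set(unigrams) | set(bigrams)  # unique unigram + bigram tokens
```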

Step 3 🤖: TF-IDF

Once we have converted our text samples into sequences of words, we need to turn these sequences into numerical vectors. One very common text encoding is tf-idf: term frequency - inverse document frequency.

The idea behind this metric is to take into account the occurrence frequency of a given token within a tweet while penalizing common tokens that occur very frequently in all tweets.
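In a DIY setting, the tokenization of step 2 and the tf-idf encoding can be done in one shot with scikit-learn's TfidfVectorizer; a minimal sketch, reusing the hypothetical clean_text column from the cleaning step above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep unigrams and bigrams, as described in step 2
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train["clean_text"])  # sparse tf-idf matrix
y_train = train["target"]                                # 1 = real disaster, 0 = not
```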

Step 4 🧺: Feature Selection or Dimensionality Reduction

Once the text is converted into unigrams and bigrams, the resulting number of tokens will certainly be very large and might slow down model training later. We can either apply dimensionality reduction (using SVD or PCA) or drop the less important tokens using the f_classif or chi2 metrics.

In our example we will use sklearn.feature_selection.SelectKBest to select the top 10K tokens, and we will use f_classif to compute token importance.
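A minimal sketch of that selection step, continuing from the tf-idf matrix above:

```python
from sklearn.feature_selection import SelectKBest, f_classif

TOP_K = 10_000  # keep the 10K most informative tokens

selector = SelectKBest(score_func=f_classif, k=min(TOP_K, X_train.shape[1]))
X_selected = selector.fit_transform(X_train, y_train)
```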

Python code from here

Step 5 : Naive Bayes Classifier

Theory 🤓:
The algorithm is based on Bayes' theorem, combined with the “naive” assumption that each token in the corpus is independent of all the others.

Bayes’ theorem :

P(y | x1, …, xn) = P(y) · P(x1, …, xn | y) / P(x1, …, xn)

Where :

  • y : the target variable
  • x_i : the i-th token (i from 1 to n = 10K tokens)

Using the “naive” assumption, it can be simplified as follows:

P(y | x1, …, xn) ∝ P(y) · ∏ P(xi | y)   (product over i = 1 … n)

As P(x1, …, xn) is a constant term, we can use the following classification rule:

ŷ = argmax over y of P(y) · ∏ P(xi | y)

https://scikit-learn.org/stable/modules/naive_bayes.html

We will test two implementations of the naive Bayes classifier: Gaussian Naive Bayes and Bernoulli Naive Bayes (here is an exhaustive list of scikit-learn naive Bayes classifiers). These algorithms differ in how they compute the likelihood P(xi | y).

Gaussian based Naive Bayes: it assumes that the likelihood is Gaussian:

P(xi | y) = 1 / √(2π σy²) · exp( −(xi − μy)² / (2 σy²) )

Bernoulli based Naive Bayes:

It treats each token as a binary presence/absence feature and estimates the likelihood as the proportion of tweets, within each class, that contain the term.

We will evaluate both models with stratified cross-validation: each CV fold keeps the same class proportions as the original dataset.
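A minimal sketch of this evaluation, continuing from the selected tf-idf features above (GaussianNB needs a dense array, hence the toarray() call):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X_dense = X_selected.toarray()  # GaussianNB does not accept sparse input
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1"]

for name, model in [("Gaussian NB", GaussianNB()), ("Bernoulli NB", BernoulliNB())]:
    scores = cross_validate(model, X_dense, y_train, cv=cv, scoring=scoring)
    means = {metric: round(np.mean(scores[f"test_{metric}"]), 3) for metric in scoring}
    print(name, means)
```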

Let's check out the results:

The performance results are quite similar: the Bernoulli-based classifier has better precision but lower recall, whereas the Gaussian classifier has more balanced performance.

Prevision.io Naive Bayes Classifier:

Let's check out the Prevision AutoML performance; we launched a classification use case using the following settings:

  • Normal profile
  • Only Naive Bayes model
  • Only tf-idf transformation

Please note that only the cleaning step has been applied upstream, before launching the use case: all the steps that follow are FULLY AUTOMATED. The training process took about 2 minutes!

We get the following output:

The model performs better than the Bernoulli naive Bayes classifier but slightly worse than the Gaussian naive Bayes classifier.

But the best asset of the AutoML is the execution speed, plus the fact that you don't have to worry about all the cleaning and pre-processing work.

Conclusion:

In this post I showed you how rough and time-consuming it can be to apply efficient textual pre-processing. You can end up with similar or better performance using an automated tool, which can save you a lot of effort and let you focus on enhancing performance by adding new features.
Furthermore, Prevision allows you to perform additional textual feature engineering such as word2vec and BERT-based transformers, which show amazing performance 😍!
I am about to write another post about these two features, in which I will explain how it can be done with Python code and compare the results with the Prevision solution.

Until then, you can test and combine our feature engineering transformations with different types of models (even blends of models!) to help you carry out your machine learning projects. All you have to do is log in here.
