NLP With Classical Machine Learning
Introduction
In this blog, we will use one of the most well-known datasets for NLP, the IMDB Reviews dataset, for sentiment analysis. The goal of this blog is not just to perform sentiment analysis, but also to investigate the viable classical methods for doing so. We will go from simple models like logistic regression up to boosting models, so let’s get started.
What Is NLP?
Natural language processing (NLP) is the study of linguistics, computer science, and artificial intelligence to develop digital systems that can interpret and act on what people say.
It essentially enables machines, which natively understand only binary (0s and 1s), to analyse human languages such as English.
Natural language understanding (NLU) and natural language generation (NLG) are the two main subsets of NLP. The former translates human language into a machine-readable representation that the system can analyse; NLG then produces a suitable response and returns it to the human user in the same language.
Pre-processing
Pre-processing is critical in any NLP problem because raw text may contain content that is irrelevant to our model. In our case, the movie reviews collected from IMDb contain URLs, HTML tags, and emojis that we do not want in our analysis, so we will remove them. Punctuation marks aren’t required either, so we’ll drop those as well. We will tackle this by creating the following functions.
Removes Punctuations
import re

def remove_punctuations(data):
    # Remove every character that is not a word character or whitespace
    punct_tag = re.compile(r'[^\w\s]')
    data = punct_tag.sub(r'', data)
    return data
Removes HTML syntaxes
def remove_html(data):
    # Strip HTML tags such as <br /> left over from scraping
    html_tag = re.compile(r'<.*?>')
    data = html_tag.sub(r'', data)
    return data
Removes URL data
def remove_url(data):
    # Strip https:// links and www. addresses
    url_clean = re.compile(r"https://\S+|www\.\S+")
    data = url_clean.sub(r'', data)
    return data
Removes Emojis
def remove_emoji(data):
    # Strip emoji and pictograph characters by Unicode range
    emoji_clean = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    data = emoji_clean.sub(r'', data)
    # Also strip any remaining URLs (same pattern as remove_url)
    url_clean = re.compile(r"https://\S+|www\.\S+")
    data = url_clean.sub(r'', data)
    return data
Lemmatization
We will perform lemmatization after cleaning the textual data, so let us first define it. The purpose of lemmatization is to reduce a word’s inflectional forms, and in certain cases derivationally related forms, to a common base form.
As an example: car, cars, car’s, cars’
So the lemmatization process will convert cars, car’s and cars’ to car. The end result of this text mapping will look like this: the boy’s cars are different colors → the boy car be differ color.
Now, the exact form of the lemmatized output may vary, but the process helps to reduce the size of our model’s vocabulary.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = stopwords.words('english')

def lemma_traincorpus(data):
    # Lemmatize each token and rebuild the text, separating words with spaces
    lemmatizer = WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in data.split())
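As a quick sanity check, the cleaning helpers and the lemmatizer above can be chained on a single review. This is a minimal sketch; the sample string below is invented purely for illustration.

# Hypothetical sample review, only to illustrate the order of the cleaning steps
sample = "I loved this movie!!! <br /> Read more at https://example.com"
cleaned = remove_punctuations(remove_emoji(remove_url(remove_html(sample))))
print(lemma_traincorpus(cleaned.lower()))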
TFIDF
To comprehend TFIDF, we must first understand TF, or term frequency. It is the number of times a word appears in a document divided by the total number of words in that document.
For example: consider a document with 100 words; this could be a single sentence or a paragraph. The word cat appears 12 times in the document, so the TF for the word cat is calculated as follows.
TF = 12/100 = 0.12
The inverse document frequency, or IDF, is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word. The TFIDF score is then the product of the two: TFIDF = TF x IDF.
For example: suppose our corpus contains 10,000,000 documents and 3% of them (300,000) contain the word cat. Then
IDF = log(10,000,000/300,000) ≈ 1.52
and TFIDF(cat) = 0.12 x 1.52 ≈ 0.18
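In practice we rarely compute these scores by hand. As a minimal sketch, scikit-learn’s TfidfVectorizer builds the whole document-term matrix; note that the tiny corpus here is made up for illustration, and sklearn uses a smoothed variant of the IDF formula above.

from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus; in this blog the input would be the cleaned IMDb reviews
corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats make good pets"]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))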
Building a classical ML model using TF-IDF
In this section, we will look at some of the most commonly used machine learning models for sentiment analysis, such as logistic regression, random forest, and some boosting models. Before diving into the practical implementation, we will cover the basics of each model.
1. Logistic Regression
- Logistic regression is one of the most widely used machine learning algorithms within the area of supervised learning. It is used to predict a categorical dependent variable from a set of independent variables.
- Because the dependent variable is categorical, the output must be discrete: Yes or No, 0 or 1, True or False, and so on. Rather than returning the exact values 0 and 1, however, the model returns probability values that fall between 0 and 1.
- Logistic regression is quite similar to linear regression; the primary difference is how each method is used. Linear regression is the method of choice for regression problems, while logistic regression is used for classification problems. Instead of fitting a straight regression line, logistic regression fits an “S”-shaped logistic function whose output is bounded between 0 and 1 (see the sketch below).
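To make the “S”-shaped curve concrete, here is a minimal sketch of the logistic (sigmoid) function, which maps any real-valued score to a probability between 0 and 1:

import numpy as np

def sigmoid(z):
    # Squashes any real-valued score z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]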
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

model = LogisticRegression()
model.fit(train_x, train_y)
pred = model.predict(test_x)

print("Evaluate confusion matrix for LR")
print(confusion_matrix(test_y, pred))
print(f"Accuracy Score for LR with C=1.0 = {accuracy_score(test_y, pred)}")
2. Naïve Bayes
The Naïve Bayes method is a supervised learning technique that uses Bayes’ theorem to solve classification problems.
It is mostly used in text classification with large training datasets. The Naïve Bayes classifier is a simple and efficient classification method that helps build fast machine learning models capable of making quick predictions. It is a probabilistic classifier, which means it predicts on the basis of an object’s probability. Spam filtering, sentiment analysis, and article classification are some prominent applications of the Naïve Bayes algorithm.
What exactly is Bayes’ Theorem?
Bayes’ theorem, often known as Bayes’ rule or Bayes’ law, is a mathematical formula used to calculate the probability of a hypothesis given prior knowledge. It is based on conditional probability. The formula for Bayes’ theorem is P(A|B) = P(B|A) x P(A) / P(B), where:
- P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
- P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis is true.
- P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
- P(B) is Marginal Probability: Probability of Evidence.
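As a quick numeric check of the formula, here is a toy calculation; the probabilities are invented purely for illustration.

# Toy numbers, purely illustrative
p_a = 0.4          # P(A): prior probability of the hypothesis
p_b_given_a = 0.9  # P(B|A): likelihood of the evidence if the hypothesis is true
p_b = 0.6          # P(B): marginal probability of the evidence

p_a_given_b = p_b_given_a * p_a / p_b  # Bayes' rule: posterior P(A|B)
print(p_a_given_b)  # ≈ 0.6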
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(train_x, train_y)
pred = model.predict(test_x)

print("Evaluate confusion matrix for NB")
print(confusion_matrix(test_y, pred))
print(f"Accuracy Score for NB = {accuracy_score(test_y, pred)}")
We’ve seen how the baseline models perform on our data; next, we’ll use K-fold cross-validation to obtain more reliable estimates of how well these statistical models generalize.
K-Fold Cross-Validation
In K-fold cross-validation, the input dataset is split into K groups of samples of equal size, called folds. Each learning run trains on K - 1 folds, while the remaining fold is reserved for use as the test set. This strategy is one of the most widely used approaches because it is straightforward to understand and produces less biased results than a single train/test split.
The steps for K-fold cross-validation are:
1. Split the input dataset into K groups.
2. For each group:
- Hold out that group as the test data set.
- Use the remaining groups as the training dataset.
- Fit the model on the training set, then evaluate its performance on the test set.
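A minimal sketch of these steps using scikit-learn’s KFold; the arrays below are dummy placeholders, and the full cross-validation over our TF-IDF features appears later in the post.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # dummy feature matrix with 10 samples
y = np.arange(10) % 2             # dummy binary labels

kfold = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X)):
    # One fold is held out for testing, the other K - 1 folds form the training set
    print(f"Fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")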
3. Decision Tree Classification Algorithm
Decision Tree is a supervised learning approach that can be used for both classification and regression problems, though it is most often employed for classification. It is a tree-structured classifier in which internal nodes test dataset attributes, branches represent decision rules, and each leaf node represents the outcome.
Gini Index
- In the CART (Classification and Regression Tree) technique, the Gini index is a measure of impurity or purity that is used throughout the process of generating a decision tree.
- An attribute with a low Gini index should be preferred over one with a high Gini index.
- The Gini index is used by the CART algorithm to make binary splits, which are the only kind of split this method produces.
- The Gini index is computed as Gini = 1 - (p1² + p2² + ... + pk²), where pj is the proportion of samples belonging to class j at the node.
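A minimal sketch of that computation for the class counts at a node:

def gini_index(class_counts):
    # class_counts: number of samples of each class at a node, e.g. [30, 10]
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_index([30, 10]))  # 1 - (0.75² + 0.25²) = 0.375
print(gini_index([20, 20]))  # maximum impurity for two classes: 0.5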
4. Random Forest
- Random forest is a meta-estimator that fits a number of decision tree classifiers on different sub-samples of the dataset and averages their predictions to increase accuracy and control over-fitting. If bootstrap=True is set, the max_samples parameter determines the size of each sub-sample; otherwise, the whole dataset is used to build each tree.
- When constructing a tree, the best split at each node is found either over all of the input features or over a random subset of size max_features. These two sources of randomness are introduced to reduce the variance of the forest estimator; individual decision trees typically show high variance and tend to overfit.
- The randomness injected into forests yields decision trees whose prediction errors are somewhat decorrelated from one another, so averaging their predictions cancels out part of the error.
- By combining many diverse trees, random forests achieve a lower variance, sometimes at the cost of a small increase in bias. In practice, the variance reduction is often substantial, which ultimately yields a better model (see the parameter sketch below).
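A minimal sketch of the parameters mentioned above; the values here are illustrative and would need tuning on the actual TF-IDF features.

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # number of decision trees in the forest
    bootstrap=True,       # each tree is trained on a bootstrap sub-sample
    max_samples=0.8,      # size of each sub-sample when bootstrap=True
    max_features='sqrt',  # random subset of features considered at each split
    random_state=42,
)
# rf.fit(train_x, train_y); pred = rf.predict(test_x)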
5. Gradient Boosting
Gradient boosting is a key component of ensemble modelling in sklearn. The purpose of ensemble techniques is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability and robustness over a single estimator.
In boosting approaches, base estimators are built sequentially, and each one tries to reduce the bias of the combined estimator. The goal is to merge numerous weak models into a strong ensemble. AdaBoost and Gradient Tree Boosting are two examples.
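As a rough sketch of what these boosting estimators look like in scikit-learn (the hyperparameters are illustrative; the same classifiers appear in the cross-validation loop below):

from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

# Trees are added one after another, each trying to correct the ensemble's current errors
gbt = GradientBoostingClassifier(n_estimators=100, learning_rate=1e-2)
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1e-2)
# e.g. gbt.fit(train_x, train_y); print(gbt.score(test_x, test_y))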
Model Testing Using Cross Validation
We will train and validate all of our models in a single code block by using a for loop in Python. Within this loop, we will iterate over all of the machine learning models mentioned above and compute their cross-validation scores in order to evaluate their performance.
# KFold and cross-validation on the TF-IDF baseline
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (GradientBoostingClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier, BaggingClassifier)

models = []
models.append(('LogisticRegression', LogisticRegression(C=1.0, penalty='l2')))
models.append(('KNearestNeighbors', KNeighborsClassifier()))
models.append(('DecisionTree', DecisionTreeClassifier(criterion='entropy')))
models.append(('GradientBoostClassifier', GradientBoostingClassifier(learning_rate=1e-2, n_estimators=100)))
models.append(('AdaBoostClassifier', AdaBoostClassifier(learning_rate=1e-2, n_estimators=100)))
models.append(('ExtraTreesClassifier', ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2)))
models.append(('BagClassifier', BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)))

model_result = []
scoring = 'accuracy'
print("Statistical Model TFIDF - Baseline Evaluation")
for name, model in models:
    # 10-fold cross-validation on the TF-IDF training features
    kfold = KFold(n_splits=10)
    results = cross_val_score(model, train_x, train_y, cv=kfold, scoring=scoring)
    print("=======================")
    print("Classifier:", name, "has a cross-validation accuracy of", round(results.mean() * 100, 2), "%")
    model_result.append(results.mean())
Conclusion
We observed the impact of applying statistical classifiers to non-semantic TFIDF-vectorized data and compared the accuracy of the various techniques side by side. These statistical models provide a starting benchmark that can be improved further by experimenting with alternative models. This post gives a quick introduction to how a conventional classifier can be used for non-semantic classification; in the next post, we will use semantic embeddings (vectors) with these traditional classifiers.
Then we’ll move on to deep learning (neural network) models such as LSTMs, encoder-decoders, and so on.