An Introduction to NLP: Implementing the aforementioned algorithms

Smit Shah
Published in Code Dementia
Jul 17, 2019 · 3 min read

Hey everyone!! Welcome to the second part of the tutorial series. Here, we’ll start with a basic implementation of the algorithms from the previous article. If you missed the concepts, take a glance back. We’ll be using nltk to implement the algorithms and scikit-learn’s classifiers to test the code. TL;DR - If you just want the code, here you go.

Also, a side note: I only covered some of the most important topics in the previous article. For future reference, POS stands for Part of Speech, i.e. nouns, verbs, adjectives, etc.
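
For example, here’s roughly what nltk’s POS tagger gives you. This is just a quick illustrative snippet; it assumes you’ve already run nltk.download for the punkt and averaged_perceptron_tagger resources.

import nltk
from nltk import word_tokenize

# One-time downloads, if you don't have them already:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(nltk.pos_tag(tokens))
# Something like: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'),
#                  ('fox', 'NN'), ('jumps', 'VBZ'), ...]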

We’ll start by importing a few libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# These two objects are used by the preprocessing functions below.
# (Requires nltk.download('stopwords') and nltk.download('wordnet') once.)
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
Raw Text Data

The above is the raw text data of IMDB movie reviews. We’ll be using the provided dataset of 1,000 reviews. You can get the dataset from my GitHub repo or just google it. We’ll need to preprocess the data and vectorize it. Here, we’ll be using the TfidfVectorizer from scikit-learn. You can also compare its performance against CountVectorizer.
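
In case you want to follow along, here’s one way to load such a file into the raw_data and labels variables used below. This is just a sketch: the file name imdb_labelled.txt and the tab-separated review/label format are assumptions, so adjust them to match your copy of the dataset.

# Sketch: load the reviews, assuming a tab-separated file of
# "<review text>\t<label>" lines (adjust to your copy of the data).
df = pd.read_csv('imdb_labelled.txt', sep='\t', header=None,
                 names=['review', 'label'], quoting=3)
raw_data = df['review'].tolist()   # list of raw review strings
labels = df['label'].values        # 0 = negative, 1 = positive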

def remove_stopwords(text):
    words = word_tokenize(text)
    output_text = []
    for w in words:
        if w not in stop_words:
            output_text.append(w)
    # Re-join and tidy the spacing the tokenizer left around punctuation
    output_text = ' '.join(output_text).replace(' , ', ',').replace(' .', '.').replace(' !', '!')
    output_text = output_text.replace(' ?', '?').replace(' : ', ': ').replace(" '", "'")
    return output_text

def preprocessing_data(raw_data, lemmatizer):
    preprocessed_data = []
    for data in raw_data:
        data = data.lower()
        data = remove_stopwords(data)
        # Lemmatize token by token (lemmatizing the whole string at once does nothing)
        data = ' '.join(lemmatizer.lemmatize(w) for w in data.split())
        # Strip @mentions, URLs and any remaining non-alphanumeric characters
        data = ' '.join(re.sub(r"(@[_A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())
        preprocessed_data.append(data)
    return preprocessed_data

def pos(tokenized_tweet):
    return nltk.pos_tag(tokenized_tweet)

def pos_data(pre_data):
    # Replace every word with its POS tag, so each review becomes a tag sequence
    posData = []
    for d in pre_data:
        pdata = pos(d.split())[:]
        for i in range(len(pdata)):
            pdata[i] = pdata[i][1]
        posData.append(' '.join(pdata))
    return posData
# TF-IDF over word n-grams of the review text
text_vectorizer = TfidfVectorizer(
    tokenizer=None,
    preprocessor=None,
    decode_error='replace',
    stop_words=None,
    ngram_range=(1, 3),
    max_features=10000,
    min_df=5,
    max_df=0.75,
    norm=None
)

# Plain term frequencies over n-grams of POS tags (idf disabled)
pos_vectorizer = TfidfVectorizer(
    tokenizer=None,
    lowercase=False,
    preprocessor=None,
    ngram_range=(1, 3),
    stop_words=None,
    use_idf=False,
    smooth_idf=False,
    norm=None,
    decode_error='replace',
    max_features=5000,
    min_df=5,
    max_df=0.75,
)
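
To wire everything together, run the preprocessing functions on the raw reviews first. This short snippet produces the pre_data and pos_pre_data variables used in the training code below, assuming raw_data was loaded as shown earlier.

# Run the preprocessing pipeline defined above
pre_data = preprocessing_data(raw_data, lemmatizer)   # cleaned review strings
pos_pre_data = pos_data(pre_data)                     # POS-tag sequence per review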

The preprocessed data will look something like this

Preprocessed Data

We’ll be using both logistic regression and a decision tree to compare scores.

pre_data = text_vectorizer.fit_transform(pd.Series(pre_data)).toarray()
pos_data = pos_vectorizer.fit_transform(pd.Series(pos_pre_data)).toarray()
# Concatenate the text features and the POS features column-wise
train_data = np.concatenate((pre_data, pos_data), axis=1)
train_x, test_x, train_y, test_y = train_test_split(train_data, labels)

log_reg = LogisticRegression()
log_reg.fit(train_x, train_y)
print(log_reg.score(test_x, test_y))

dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x, train_y)
print(dec_tree.score(test_x, test_y))

We got a score of 0.75 with logistic regression and 0.67 with the decision tree. That gap makes sense: logistic regression learns a weight for every TF-IDF feature, which suits these high-dimensional vectors, whereas the decision tree has to reach a prediction through a series of single-feature comparisons.
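
If you want to try the CountVectorizer comparison I mentioned earlier, a minimal sketch is to swap the vectorizer and rerun the same split/fit/score steps. Here, cleaned_reviews is a hypothetical name for the preprocessed strings returned by preprocessing_data (grab them before they get overwritten by the TF-IDF array above).

from sklearn.feature_extraction.text import CountVectorizer

# Raw term counts instead of TF-IDF; most options carry over
count_vectorizer = CountVectorizer(ngram_range=(1, 3), max_features=10000,
                                   min_df=5, max_df=0.75)
# cleaned_reviews = output of preprocessing_data (hypothetical name)
count_features = count_vectorizer.fit_transform(cleaned_reviews).toarray()
# ...then train_test_split, fit and score exactly as above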

So, there you have it, guys: your first simple sentiment analysis code. In the next article, I’ll introduce Recurrent Neural Networks and their variants, like LSTMs, GRUs, etc. Until next time. Cheers!
