Text Classification for Beginners in NLP with codes

Mehul Gupta
Data Science in your pocket
11 min read · Jun 8, 2020


I am done with a lot of theoretical posts on various algorithms used in NLP for tokenization, parsing, POS Tagging, etc.

Time for some….

Before starting, I guess a basic understanding of neural networks is required

What I will be covering:

Text Normalization

Word Normalization (Stemmer vs Lemmatizer)

Custom word embeddings using gensim (optional)

Loss function to choose

A sample LSTM using TensorFlow v2

I will try to walk through all the possible steps (including preprocessing). I hope you are aware of the basics of pandas & Python.

  1. Deciding the dataset

Text Classification sounds cool!! but which dataset?

I will be picking up the movie sentiment analysis dataset from Kaggle:

Some points to note:

  1. The dataset has 2 major TSV files: train.tsv.zip & test.tsv.zip

The 3rd file can be ignored

Note: Just as CSV stands for comma-separated values, TSV stands for tab (\t) separated values.

If you have observed, both files are zipped. Hence, we first need to open the zip archives & then read the TSVs using read_csv()

import zipfile
import pandas as pd

train = zipfile.ZipFile('train.tsv.zip')
test = zipfile.ZipFile('test.tsv.zip')
x = pd.read_csv(train.open('train.tsv'), delimiter='\t')
y = pd.read_csv(test.open('test.tsv'), delimiter='\t')
  • zipfile.ZipFile() helps us to extract the content of the zipped files train.tsv & test.tsv.
  • The delimiter='\t' argument passed to pd.read_csv() helps us read a TSV file. If it had been a CSV file, we wouldn't need that argument

As you have observed, x & y represent train & test data (sorry for the poor nomenclature)

Before going ahead, we must have a look at our training dataset:
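A quick peek with pandas (a minimal sketch, assuming x was loaded as above):

print(x.head())                        # first rows: PhraseId, SentenceId, Phrase, Sentiment
print(x['Sentiment'].value_counts())   # how many phrases fall in each of the 5 classes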

PhraseId is a unique key. SentenceId refers to the Sentences from which the phrase is generated. Phrase is the text for which the sentiment has to be predicted.

  1. Sentiment ranges from 0–4, i.e., 5 classes
  2. Many phrases are generated from the same sentence after segmenting the sentence in different ways. As can be observed in the above picture, many phrases have the same SentenceId, i.e., 1, hence taken from the same sentence. They also overlap at times.
  3. According to the description on kaggle, The five classes correspond to negative, somewhat negative, neutral, somewhat positive, positive.

I will be skipping EDA for now !!

The first & foremost step is to decide on the preprocessing steps. Some of the most commonly used ones are:

  1. Normalization: Normalization refers to bringing everything to a canonical (standard) form. For example:

‘I am a hero’ to ‘i am a hero’ or ‘I AM A HERO’.

It is generally required at two levels:

  • Text Normalization: Bringing the entire text as a whole to a canonical form as in the above example. This can include removing punctuations, numbers, etc.
  • Word Normalization: Converting each word used in a canonical form. For eg: ‘likes’ to ‘like’, ‘dancing’ to ‘dance’, etc.

But why?

Consider a case where we get the sentences:

‘The boy is dancing so well !’ & ‘She danced beautifully in the concert’.

Now, if these sentences are fed to an ML model,

It will take ‘dancing’ & ‘danced’ as 2 different words; ‘The’ & ‘the’ also as different words. But if normalized, these words would be taken as one, hence reducing complexity for the model. The lower the complexity, the better the chances your model learns well.

I will be converting each phrase to lower case first:

x['Phrase'] = x['Phrase'].transform(lambda value: value.lower())
y['Phrase'] = y['Phrase'].transform(lambda value: value.lower())

If you wish to remove punctuations, use the below command which will cover both lowering the text & removing punctuations:

import re

x['Phrase'] = x['Phrase'].transform(lambda value: re.sub(r'[^\w\s]', '', value.lower()))
y['Phrase'] = y['Phrase'].transform(lambda value: re.sub(r'[^\w\s]', '', value.lower()))

Here,

re.sub(‘pattern_to_replace’, ’replacement’, ’sentence’) replaces everything matched by the pattern. The expression r’[^\w\s]’ matches any character that is neither a word character nor whitespace, i.e., punctuation. Digits & underscores are kept, since \w matches them.
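A quick check of what the pattern does on a toy string (the sample sentence is just an illustration):

import re

sample = "Wow!! The movie's rating was 9/10, wasn't it?"
print(re.sub(r'[^\w\s]', '', sample.lower()))
# -> 'wow the movies rating was 910 wasnt it'  (punctuation gone, digits kept)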

Leading to:

What's next?

Word Normalization!!

This can be done using Stemming or Lemmatization.

  • Stemming: It uses a rule-based system to bring a word to its canonical form. Like removing ‘ing’ from ‘dancing’ to form ‘danc’ or ‘ticked’ to ‘tick’. As you can see, stemming might not produce a dictionary word all the time after normalization.
  • Lemmatization: It is a more intelligent approach that keeps a dictionary on its side while normalizing words. Hence it will normalize ‘dancing’ to ‘dance’ & not ‘danc’ as stemming does.

So should I always choose Lemmatization over Stemming?

Not really!!

  • Lemmatization is slow. If you have a huge dataset, it won’t be the best choice.
  • Lemmatization also needs some extra information apart from the word itself, like its POS tag. So, for best results, we need to compute POS tags & feed them to the lemmatizer along with the words, as the short sketch below shows.
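To see the difference concretely, here is a small comparison sketch using nltk (you may need nltk.download('wordnet') the first time; expected outputs are in the comments):

from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemm = WordNetLemmatizer()

print(ps.stem('dancing'))                  # 'danc'    -> rule-based, not a dictionary word
print(lemm.lemmatize('dancing'))           # 'dancing' -> treated as a noun by default
print(lemm.lemmatize('dancing', pos='v'))  # 'dance'   -> correct once the verb POS tag is supplied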

For now, I will be using Stemming using PorterStemmer from nltk.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
x['Phrase'] = x['Phrase'].transform(lambda value: ' '.join([ps.stem(word) for word in value.split(' ')]))
y['Phrase'] = y['Phrase'].transform(lambda value: ' '.join([ps.stem(word) for word in value.split(' ')]))

Explanation: the lambda function splits the string on spaces, applies the Porter stemmer to each word via a list comprehension & finally joins the sentence back together using ‘ ’.join()

This much of preprocessing can be enough for now.

Moving ahead,

Have you heard of word embedding?

Note: This step is skippable.

As we know, text can’t be fed directly to a neural network; we need to convert it into a numerical representation. A word embedding is exactly that: a representation of each word as an array of continuous values.

How are they calculated?

Basically, these numerical representations can be calculated in a number of ways, like skip-gram, continuous bag of words (CBOW), GloVe, etc.

But can’t we use One Hot Encoding?

It can be used too, but it won’t yield good results, for a couple of reasons:

  1. For a vocabulary (set of unique words) of, say, 2000 words, each word will be represented by an array of length 2000 using OHE, which is a waste of memory, & such sparse data might not yield good results. Also, training on such data takes ages comparatively.
  2. OHE doesn’t show any relation amongst words that are similar. For example: words like ‘King’ & ‘Queen’ should have numerical representations quite close to each other as they are similar in their sense. Word embeddings have this advantage over OHE & hence yield better results.

The word embeddings produced using Word2Vec are such that the numerical vectors have cosine distance very low for similar words (like king & queen) but high cosine distance between unrelated words (like king & classroom)

I will be using gensim’s Word2Vec in python to get these embeddings

from gensim.models import Word2Vec

all_sentences = list([sentence.split(' ') for sentence in x['Phrase']]) + list([sentence.split(' ') for sentence in y['Phrase']])
all_sentences = [s for s in all_sentences if str(s) != 'nan']

# note: in gensim>=4.0 these parameters are named vector_size & epochs
w2v = Word2Vec(all_sentences, size=128, min_count=1, iter=20)
vector = w2v.wv.vectors

Going line by line

  • Importing Word2Vec from gensim
  • The all_sentences variable stores each sentence from both train & test data, tokenized on spaces. Test data is also included so that, at prediction time, no word is alien to the model.
  • As we dropped punctuation, some of the phrases went NaN (they comprised only punctuation). Hence, dropping them.
  • Using Word2Vec().

Word2Vec: It is a set of related ML models used for generating word embeddings. I will be focussing more on the parameters for now

  • Input:all_sentences variable
  • Size: It refers to the length of each embedding. If we have 1000 unique words in all_sentences & size=128, the output word_embedding will be (1000,128)
  • min_count: the threshold frequency deciding which words get embeddings. If min_count=5, embeddings will be generated only for words having a frequency of at least 5 in the corpus passed. It has been set to 1 as I wished to include every word of the corpus
  • iter: similar to epochs in neural networks

For a better understanding, this post can be referred to.

  • To get the generated embeddings, store w2v.wv.vectors in some variable say vector

The vector variable looks something like the below image with the shape=13906 x 128 where 13906 is the total count of unique tokens & 128 is the dimension of embedding of each token:
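A few quick sanity checks on the trained embeddings (the token ‘movi’, the stemmed form of ‘movie’, is just an assumed example; it raises a KeyError if it isn’t in your vocabulary):

print(w2v.wv.vectors.shape)                 # (vocab_size, 128)
print(w2v.wv['movi'][:5])                   # first 5 values of one embedding
print(w2v.wv.most_similar('movi', topn=5))  # nearest tokens by cosine similarity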

Now, we have word embeddings as well which will be directly used in our neural network as weights for the Embedding layer in Keras.

But, as discussed before, we can’t feed text data to neural networks directly. If the embeddings calculated above will be used in the neural network as weights, how do we convert the text data, i.e., the input, to numerical form? We will be using Keras’s Tokenizer to fulfill this requirement.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

maxlength = 256

t = Tokenizer(split=' ')
t.fit_on_texts(list(x['Phrase']) + list(y['Phrase']))
x['Phrase'] = t.texts_to_sequences(x['Phrase'])
y['Phrase'] = t.texts_to_sequences(y['Phrase'])
train = sequence.pad_sequences(x['Phrase'], maxlen=maxlength)
test = sequence.pad_sequences(y['Phrase'], maxlen=maxlength)
  • Importing Tokenizer & creating an object. split=’ ’ basically means tokens/words in the input are space-separated.
  • fit_on_texts() creates a dictionary mapping each word to a unique number/index. For the inputs ‘look for him’ & ‘he is smart’, it could be dict[‘look’]=1, dict[‘for’]=2, dict[‘him’]=3, dict[‘he’]=4, dict[‘is’]=5, dict[‘smart’]=6
  • texts_to_sequences() substitutes each word with its index from the dictionary created in the previous step & creates a sequence of numbers. If the input is ‘he look smart’, texts_to_sequences() will produce [4,1,6] (refer to the previous point).
  • All sentences don’t have the same number of words, but the inputs (sentences) to a neural network should have the same dimension. Hence, padding (adding 0s) is done to keep the dimension of every sentence equal; pad_sequences() assists in this; the maxlen parameter sets the maximum length of each sequence. For example, if a sequence is 64 tokens long & maxlength=256, 192 0s get added to the sequence. If a sentence with more than 256 tokens is found, it is truncated.
  • The ‘maxlength’ variable can be any number, decided by trial & error. 256 has been completely my choice.

Output after text_to_sequences():

& after padding 0s:

As you can see, 0s get prepended by default. If you wish to append 0s at the end instead, add the parameter padding=’post’. Similarly, if ‘maxlen’ is set to a value lower than the maximum length of a sentence in the corpus, truncating is done from the beginning of the sequence by default, which can be changed with truncating=’post’.
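Here is a toy run of the whole tokenize-and-pad step, mirroring the ‘he look smart’ example above (the exact indices depend on word counts, so treat them as illustrative):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence

toy = Tokenizer(split=' ')
toy.fit_on_texts(['look for him', 'he is smart'])
print(toy.word_index)                                             # e.g. {'look': 1, 'for': 2, 'him': 3, 'he': 4, 'is': 5, 'smart': 6}
seqs = toy.texts_to_sequences(['he look smart'])
print(seqs)                                                       # e.g. [[4, 1, 6]]
print(sequence.pad_sequences(seqs, maxlen=6))                     # 0s prepended by default
print(sequence.pad_sequences(seqs, maxlen=6, padding='post'))     # 0s appended instead
print(sequence.pad_sequences(seqs, maxlen=2, truncating='post'))  # keeps only the first 2 tokens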

Long way to go!!

Assuming you know what cross_entropy loss is, let’s move ahead. If not aware of it, go through this link

If you remember, our target (Sentiment) takes the values 0, 1, 2, 3 & 4. Now, there exist two ways to make them usable:

  1. Either one-hot encode the labels to get labels like [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0] & [0,0,0,0,1] & use ‘categorical_crossentropy’ as the loss function
  2. Don’t do anything & use ‘sparse_categorical_crossentropy’ as the loss function

What is the difference between the 2 loss functions?

Which one to use?

  • Both loss functions can be used for multi-class classification without any difference in performance
  • Though, when we deal with multi-label problems (such as the genres of a movie, which can be multiple), we can’t use sparse_categorical_crossentropy
  • sparse_categorical_crossentropy saves memory, as the labels stay single integers instead of one-hot vectors.

As we are dealing with multi-class classification, I will be using sparse_categorical_crossentropy & hence not encoding labels. If you wish to encode, you can use the commented line. The below code is used for creating a train & validation dataset.

from sklearn.model_selection import train_test_split as tts

label = x['Sentiment']
# from tensorflow.keras.utils import to_categorical
# label = to_categorical(x['Sentiment'], num_classes=5)

xtrain, xvalid, ztrain, zvalid = tts(train, label, train_size=0.8)

Importing important libraries we might use in building our model

Note: Tensorflow 2.0 has been used in the codes

import tensorflow as tf
from tensorflow.keras.layers import LSTM, Embedding, Dropout, Dense, Flatten
from tensorflow.keras.models import Sequential

Now, below is a sample neural network that can give you decent results

model = Sequential()
model.add(Embedding(vector.shape[0], vector.shape[1], weights=[vector], input_length=maxlength))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(5, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Highlighting major layers only :

  1. The Embedding() layer is the 1st layer; it takes as input the padded sequences we generated using pad_sequences(). Understanding the different parameters:
  • 1st parameter (vocab size): the number of unique tokens in the corpus. It is equal to the number of rows in vector (the word embeddings) calculated above
  • 2nd parameter: the output length per token, i.e., 128. This is equal to the number of columns in the vector calculated above.
  • input_length: length of each sequence. If you remember, we set this to 256 (maxlength) during the pad_sequences() call.

Note: It must be noted that calculating word embeddings using Word2Vec isn’t a necessary step, as the Embedding layer can learn such embeddings itself. It has been calculated separately just to improve the results.

If it wasn’t calculated earlier, the vocab size is still the number of unique tokens in the corpus, & the output length can be any number, depending on how long the user wishes the per-token embeddings to be (a minimal sketch follows).
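For reference, a sketch of the same first layer with embeddings learned from scratch (no weights argument; the +1 assumes index 0 is reserved for padding):

from tensorflow.keras.layers import Embedding

vocab_size = len(t.word_index) + 1            # unique tokens seen by the Tokenizer, +1 for the padding index
learned_embedding = Embedding(input_dim=vocab_size,
                              output_dim=128,        # any embedding size you like
                              input_length=maxlength)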

I hope you are aware of the basic knowledge of LSTM & GRU. If not, do explore this.

2. Understanding the LSTM() used,

  • 64 is the number of units
  • Read the below excerpt for understanding the use of return_sequences.
  • Let’s say the LSTM receives an input of shape (batch_size, sequence_length, features). If we don’t set return_sequences=True, the output will have the shape (batch_size, units), but if we do, we obtain an output of shape (batch_size, sequence_length, units).
  • dropout & recurrent_dropout: Try to imagine the structure of an LSTM cell. Regular dropout is applied to the inputs and/or the outputs. Recurrent dropout drops the recurrent connections, i.e., the ones between time steps of the LSTM. Both help in avoiding overfitting
  • Flatten() has been used as the output of the LSTM layer is 3-dimensional, i.e., (batch_size, sequence_length, units). Hence, to convert it into a 2-dimensional structure (batch_size, units) compatible with the following layers, it has been used. There also exist GlobalAveragePooling1D/GlobalMaxPooling1D layers which can be used instead (a sketch follows after this list).
  • The loss function used is sparse_categorical_crossentropy, the optimizer is adam & the metric is accuracy. Many other metrics are also available which can be explored here.
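A sketch of that pooling alternative (a variant only; the model trained below still uses Flatten(), and the imports & variables from the blocks above are assumed):

from tensorflow.keras.layers import GlobalMaxPooling1D

alt_model = Sequential()
alt_model.add(Embedding(vector.shape[0], vector.shape[1], weights=[vector], input_length=maxlength))
alt_model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
alt_model.add(GlobalMaxPooling1D())          # (batch_size, 256, 64) -> (batch_size, 64)
alt_model.add(Dense(5, activation='softmax'))
alt_model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])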

Model summary can be seen below

Explaining the output dimensions:

Note: None always represents the batch size

  • embedding layer: (None, 256, 128): 256 = length of each sequence, 128 corresponds to the length of the embedding of each token
  • lstm: (None, 256, 64): As explained, if return_sequences=True, the output for each token is returned, hence 256 x 64. If return_sequences=False, the output dimension would be (None, 64)
  • Flatten:(None,16384): 256 x 64=16384
  • The rest are self-explanatory.

Beginning with the training:

checkpointer = tf.keras.callbacks.ModelCheckpoint(filepath='weights.best.hdf5',
                                                  verbose=1,
                                                  save_best_only=True)
model.fit(xtrain, ztrain,
          validation_data=(xvalid, zvalid),
          epochs=5, batch_size=128,
          verbose=1, callbacks=[checkpointer])

ModelCheckpoint helps us save the best weights of the network, i.e., the ones giving the best results on the validation set, at a desired location.

And if you see something like this:

Then you are done. Let the model get trained for some epochs & observe the results.
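Once training finishes, a minimal sketch for scoring the test set with the best saved weights (variable names assume the code above):

model.load_weights('weights.best.hdf5')     # reload the checkpointed weights
preds = model.predict(test)                 # shape: (num_test_phrases, 5) class probabilities
y['Sentiment'] = preds.argmax(axis=1)       # pick the most probable class per phrase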

That's all for today!!
