Data Preprocessing and EDA for Natural Language Processing.

It is important to gain enough insight into your data before implementing any machine learning model or performing statistical hypothesis testing. Exploratory data analysis provides that insight.

In this article, we will discuss the relevant Python packages available and the different exploratory tasks one should perform in the natural language processing domain. For this article, I am using the Kaggle IMDB sentiment analysis dataset and the Google Colab platform. So, let's start…

Installing and Importing Packages

We will use the following packages for our analysis. References for the packages are listed at the end.

!pip install texthero
!pip install -q kaggle # to download the dataset
!pip install wordcloud
!pip install -U textblob
!python -m textblob.download_corpora
# importing
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import texthero as hero
from texthero import stopwords
import os
from wordcloud import WordCloud
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob,Word

Reading Data

Download and read the dataset from Kaggle. For now, we consider only 5,000 rows for the exploration.

!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
!unzip 'imdb-dataset-of-50k-movie-reviews.zip' -d kaggle_IMDB
#reading data
file_name='kaggle_IMDB/IMDB Dataset.csv'
#reading dataframe
df= pd.read_csv(file_name, header='infer',nrows=5000)
df.head()
Kaggle IMDB dataset

Data Preprocessing

There are various tasks that need to be performed in data preprocessing, such as lowercasing, removing stop words, removing digits, removing URLs and HTML tags, and many more.

For basic cleaning and lemmatization, I am using the texthero, nltk, and textblob packages, which offer features for text preprocessing as well as data exploration. The code is as follows:

def lemma_per_pos(sent):
    '''Lemmatize each word according to its part-of-speech tag.'''
    t = TextBlob(sent)
    # map Penn Treebank tag prefixes to WordNet POS tags (default: noun)
    t_dict = {"J": 'a', "N": 'n', "V": 'v', "R": 'r'}
    w_n_t = [(w, t_dict.get(p[0], 'n')) for w, p in t.tags]
    lemmatized_list = [w.lemmatize(t) for w, t in w_n_t]
    return " ".join(lemmatized_list)

def df_preprocessing(df, col_name):
    default_stopwords = stopwords.DEFAULT
    # "movie" and "film" are very frequent in movie reviews, so treat them as stop words
    custom_stopwords = default_stopwords.union(set(["movie", "film"]))
    df[col_name] = [text.replace('<br', '') for text in df[col_name]]  # remove leftover <br tags
    df[col_name] = (
        df[col_name]
        .pipe(hero.clean)
        .pipe(hero.remove_html_tags)
        .pipe(hero.remove_brackets)
        .pipe(hero.remove_urls))
    # lemmatization
    df[col_name] = [lemma_per_pos(sent) for sent in df[col_name]]
    df[col_name] = hero.remove_stopwords(df[col_name], custom_stopwords)
    return df

As seen in the preprocessing code, I am removing some very common words such as "movie" and "film". The reason can be seen in the graph below: these are among the most frequent words in the text data, yet they play no role in sentiment analysis. That's why analyzing the data is an important prerequisite for any text processing, and you can later alter the preprocessing steps as per your requirements.

Most common words (generated without removing common words)

Also, words like make/made, films/film, and characters/character are treated as different words. Hence, lemmatization is important to resolve this.

Lemmatisation (lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form. (source: Wikipedia)

There are many different lemmatization packages available. It is always better to compare those techniques to verify whether they achieve what you want. For example, one approach lemmatizes according to the part-of-speech (POS) tag and the other does not; you can clearly see the difference in the small comparison below. It is good practice to check your lemma outputs and use the approach that fits your needs.
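As a quick, minimal sketch of that comparison (reusing the WordNetLemmatizer imported earlier and the lemma_per_pos function defined above, on a made-up sentence):

sample = "the actors made the characters believable in both films"

# without POS: WordNetLemmatizer treats every token as a noun by default,
# so plural nouns are reduced ("actors" -> "actor") but verbs like "made" stay unchanged
without_pos = " ".join(WordNetLemmatizer().lemmatize(w) for w in sample.split())

# with POS: lemma_per_pos tags each word first, so "made" (tagged as a verb) becomes "make"
with_pos = lemma_per_pos(sample)

print("without POS:", without_pos)
print("with POS   :", with_pos)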

And, after removing the most common words and applying lemmatization, the distribution looks like this.

Most common words after stop-word removal, common-word removal, and lemmatization

Sentence level analysis

Text statistics include sentence length distribution, minimum, maximum, and average length.

Let's check the sentence length distribution. The code and output are as follows:

df['len'] = df['review'].str.len()
print('Max length: {}, Min length: {}, Average length: {}'.format(max(df['len']), min(df['len']), df['len'].mean()))
df['len'].hist()
Length distribution

According to the distribution, most of the review lengths are in the range of 0–1000 characters, and the maximum, minimum, and average lengths are 6136, 45, and 757 respectively. Note that this length is measured in characters and includes the spaces between words. If you want the distribution in terms of word counts instead, you can use the code below.

df['len'] = df['review'].str.split().map(lambda x: len(x))
df['len'].hist()

Word level analysis

For word-level analysis, we first join all the reviews into a single text document and then split it into words.

text= ' '.join(t for t in df['review'])
words_list= text.split()

Next, we create a dictionary with each word as a key and its count as the value, and then build a dataframe of words.

word_freq = {}
for word in set(words_list):
    word_freq[word] = words_list.count(word)
# creating a dataframe of words
df_word = pd.DataFrame(word_freq.items(), columns=['word', 'count'])
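Note that words_list.count(word) rescans the whole word list for every unique word, which can get slow on larger samples. A minimal alternative sketch (not part of the original code) using collections.Counter builds the same counts in a single pass:

from collections import Counter

# count every word in one pass over the list
word_freq = Counter(words_list)
df_word = pd.DataFrame(list(word_freq.items()), columns=['word', 'count'])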

After creating the word dataframe, we can now add the length of each word.

df_word['word_len'] = df_word['word'].map(lambda x: len(x))
# sorting values
df_word = df_word.sort_values('count', ascending=False).reset_index(drop=True)
df_word
word dataframe

And the 50 most common words are:

df_top = df_word.head(50)
sns.barplot(x=df_top['count'], y=df_top['word'])
Most common words

Now, the word length distribution.

df_word['word_len'].hist()
Word length distribution

Wait! Something is weird here. There are words with a length of more than 15 characters. We have to verify whether these are actually English words or not.

Let’s check…

df_word[df_word['word_len']==max(df_word['word_len'])]
Non-English word

Sometimes a user just types random characters in a review comment, and we need to find and remove them. We can do that by checking whether each word is actually an English word. I have added that verification to the preprocessing code.

nltk.download('words')
from nltk.corpus import words
setofwords = set(words.words())
# updated line inside lemma_per_pos: keep only recognized English words
lemmatized_list = [w.lemmatize(t) for w, t in w_n_t if w in setofwords]

After resolving the issue, the word length distribution looks like this:

Word length distribution

Now the longest words are shown below, and they seem to be correct English words.
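A minimal sketch to list them (assuming df_word has been rebuilt after re-running the preprocessing with the English-word check):

# ten longest remaining words, sorted by word length
df_word.sort_values('word_len', ascending=False).head(10)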

That's the beauty of analysis: we find issues before facing them in later stages.

Topic Modelling

Which words represent the text corpus? Let's plot them using the wordcloud package in Python.

wordcloud = WordCloud(background_color='white', max_words=100,
                      max_font_size=40,
                      scale=3,
                      random_state=1).generate(text)
plt.axis("off")
plt.imshow(wordcloud)

Finally, you can see the words in the text sample that represent the movie reviews; these are the most common words used when reviewing a movie.

The sentiment of reviews

Let's check how sentiment is distributed in our text data. We will use textblob again for this task.

TextBlob's sentiment property outputs a named tuple of polarity and subjectivity. Polarity lies in the range [-1.0, 1.0], where -1.0 is fully negative, 1.0 is fully positive, and 0.0 is neutral. Subjectivity lies in the range [0.0, 1.0], where 0.0 means highly objective and 1.0 means very subjective.
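As a minimal illustration on a made-up sentence (the exact numbers depend on TextBlob's default PatternAnalyzer and your installed version):

from textblob import TextBlob

blob = TextBlob("The acting was awful, but the soundtrack was wonderful.")
print(blob.sentiment)               # Sentiment(polarity=..., subjectivity=...)
print(blob.sentiment.polarity)      # float in [-1.0, 1.0]
print(blob.sentiment.subjectivity)  # float in [0.0, 1.0]

Now, applying it to one of the reviews in our data: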

df['review'][8]

i.e.

“positive look forward watch bad mistake see truly one bad awful almost every way act song lame country tune less four time cheap nasty boring extreme rarely happy see end thing give score far best performance least make bite effort one”

TextBlob(df['review'][8]).sentiment

As per TextBlob, the given sentence has negative sentiment and is highly subjective, which makes sense if we read the review comment.

Now, let's look at the complete sentiment distribution of the text corpus.

def polarity(x):
    '''Map the TextBlob polarity score to a sentiment class.'''
    if TextBlob(x).sentiment[0] < -0.25:
        return 'Negative'
    if TextBlob(x).sentiment[0] > 0.25:
        return 'Positive'
    return 'Neutral'

df['sentiment'] = df['review'].map(lambda x: polarity(x))
# the labels are categorical, so plot the class counts as a bar chart
df['sentiment'].value_counts().plot(kind='bar')
Sentiment distribution (shown without mapping to classes)

As per the sentiment distribution, most of the review comments are neutral and very few reviews have negative sentiment.

Further exploration

Using TF-IDF and principal component analysis (PCA), we can visualize our data and check how it looks in a 2D scatter plot.

df['pca'] = (df['review'].pipe(hero.tfidf).pipe(hero.pca))
hero.scatterplot(df, 'pca', color='sentiment', title="sentiment")
Scatter plot

Now, we try to find hidden clusters in the text corpus using k-means clustering.

df['kmeans_labels'] = (df['review'].pipe(hero.tfidf)
                       .pipe(hero.kmeans, n_clusters=2)
                       .astype(str))
hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means")
K-means plot

Final Comment

In this article, we discussed different data preprocessing and exploratory data analysis packages and their implementation. Packages like texthero, textblob, and nltk can simplify complex tasks and save time and effort.

For sentiment analysis, I found the above-mentioned steps sufficient. However, there are further analysis tasks such as n-gram distribution, part-of-speech distribution, readability tests, and entity recognition, which can be performed with the mentioned Python packages (for readability tests you can use Textstat); a small sketch of two of these tasks is shown below. Please go through the references for further information.
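As a minimal sketch (assuming words_list from the word-level analysis above is still in memory, and noting that textstat is not installed in the setup section, so it needs a separate install), a bigram distribution with nltk and a readability score with textstat could look like this:

!pip install textstat
from nltk import bigrams, FreqDist
import textstat

# most common word pairs (bigrams) across all reviews
bigram_freq = FreqDist(bigrams(words_list))
print(bigram_freq.most_common(10))

# Flesch reading-ease score of a single (already preprocessed) review; higher means easier to read
print(textstat.flesch_reading_ease(df['review'][8]))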

Hopefully, this article was useful and you learned something new.

Thank you and Happy Analyzing!

References:

  1. https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
  2. https://github.com/jbesomi/texthero
  3. https://textblob.readthedocs.io/en/dev/index.html
  4. https://www.nltk.org/
  5. https://amueller.github.io/word_cloud/
  6. https://pypi.org/project/textstat/
