Predicting the Type of Event Based on Comments Using Natural Language Processing

Aishwariya Gupta · Published in Voice Tech Podcast · Jun 19, 2019

Hey data enthusiasts,

Very recently, I came across a data set (which is confidential) for which I built a machine learning model.

In this blog I am going to share the approach I took for predicting the results. The data set cannot be shared, but I can assure you that this approach can be used for handling an NLP task using Python 3.

The analysis can be found on my GitHub https://github.com/aishwariya96/Predicitng-Event-based-on-comments-using-NLP

1. Reading and getting to know the data set.

The data set can be read using the Pandas library by the following command:

import pandas as pd

train = pd.read_csv('path_of_file')
print(train.columns)   # column names
print(train.shape)     # number of rows and columns
print(train.info())    # column data types and non-null counts

2. Visualize your data set before going ahead.

Your data set can be visualized using the Seaborn or Matplotlib libraries.

It is important to know basic stuff about your column values before going ahead.
Commands like these can be used:

train['column_name'].value_counts()   # count of each unique value

The counts can be displayed in a bar graph using the Seaborn library:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 4))
sns.countplot(x='column_name', data=train)
plt.show()

Bar plot

3. Check the missing values.

It is important to check whether any missing values are present, since they can affect the predictions and the model's accuracy.
Commands like these can be used:

train['column_name'].isnull().sum()
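If missing values do turn up, one simple way to handle them, assuming (as in my case) that comment is the column that matters, is to inspect the whole frame and drop the unusable rows. This is only a sketch, not necessarily what your data requires:

# missing values across the whole data set at once
print(train.isnull().sum())

# one simple option: drop rows whose comment is missing, since they cannot be used for training
train = train.dropna(subset=['comment'])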

4. Data Preprocessing using Natural Language Tool Kit (NLTK)
Since this is an NLP task, I used the NLP libraries listed in my Jupyter notebook.

The toolkit provides stop words (common words that carry almost no meaning on their own), which can be removed from the text. The stop word list can also be updated according to your requirements.
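As a small sketch (the exact stop word list I used is in the notebook), the NLTK stop words can be loaded and extended like this; the extra words below are only placeholders:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                      # one-time download
stop_words = set(stopwords.words('english'))    # default English stop words
stop_words.update(['please', 'thanks'])         # placeholder additions; extend as required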

It is good practice to write functions for these kinds of tasks so that they can be reused whenever required (for this purpose I simply called the same functions again on my test data). This will make your code look beautiful.

Function for getting to know the most frequent words

from nltk import FreqDist

def freq_words(x, terms=30):
    # combine all comments into a single string and split it into words
    all_words = ' '.join([text for text in x])
    all_words = all_words.split()

    # frequency distribution of every word
    fdist = FreqDist(all_words)
    words_df = pd.DataFrame({'word': list(fdist.keys()), 'count': list(fdist.values())})

    # good practice: visualize the frequencies right inside the function
    d = words_df.nlargest(columns="count", n=terms)
    plt.figure(figsize=(20, 5))
    ax = sns.barplot(data=d, x="word", y="count")
    ax.set(ylabel="Count")
    plt.show()
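Calling it is then a one-liner, for example on the comment column:

freq_words(train['comment'], terms=30)   # plot the 30 most frequent words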

Function to remove Stop Words

def remove_stopwords(com):
    # keep only the tokens that are not stop words
    comm_new = " ".join([i for i in com if i not in stop_words])
    return comm_new

# drop very short words (length <= 2) before removing stop words
train['comment'] = train['comment'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 2]))

# remove stop words, then lowercase every comment
comments = [remove_stopwords(r.split()) for r in train['comment']]
comments = [r.lower() for r in comments]

I also removed all words of length 2 or less, because they add no value to the analysis or the model building.
The functions can then be called very easily whenever required.

For further cleaning, Python offers a very useful regular expressions library, which can be imported with import re. Please check my Jupyter notebook for all the expressions I used.
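As a rough illustration only (the exact patterns are in the notebook), typical regex clean-ups for this kind of text look like this:

import re

def clean_text(text):
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)   # keep only letters and whitespace
    text = re.sub(r'\s+', ' ', text)           # collapse repeated whitespace
    return text.strip()

comments = [clean_text(c) for c in comments]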

5. Building the Model

Machine learning models from the Scikit-Learn library can be used. I went ahead with a Logistic Regression model since it gave me 89.34% accuracy on a validation split of my training data. You can also try Naive Bayes or Support Vector Machines for this purpose.
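Before committing to one model, a quick way to compare the three candidates is cross-validation on TF-IDF features. This is only a sketch; the 89.34% above came from my own train/validation split, not from this snippet:

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

X = TfidfVectorizer().fit_transform(train['comment'])
y = train['EVENT_TYPE']

for name, clf in [('Logistic Regression', LogisticRegression()),
                  ('Naive Bayes', MultinomialNB()),
                  ('Linear SVM', LinearSVC())]:
    scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(name, round(scores.mean(), 4))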

These are the essential Sklearn libraries which need to be imported for this model.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

Always remember to train the model only on the columns that are actually needed. In this problem I used only one column, comment, for training.

# Specifying the target variable and the feature which will be trained
train_data = train.comment
test_data = test.comment
target = train.EVENT_TYPE

The target is the variable you want to predict on your test data (here, EVENT_TYPE).

Machine Learning pipelines are very useful and hence I went ahead with building a pipeline.

# Using a Logistic Regression model
# Defining a pipeline for the same
logreg = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression(n_jobs=1, C=1e5)),
                   ])
logreg.fit(train_data, target)
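Once the pipeline is fitted, predicting the event type for the test comments is a single call (a minimal usage sketch):

predicted = logreg.predict(test_data)   # predicted EVENT_TYPE for each test comment
print(predicted[:10])                   # peek at the first few predictions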

6. Insights and Word Cloud

I love visualizing my data; it gives me a better perspective on it. NLTK also provides methods like tokenization and stemming.
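As a tiny sketch of those two steps on a made-up sentence (not the actual data):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('punkt')   # tokenizer models, one-time download
tokens = word_tokenize("markets rallied after the earnings call")
stems = [PorterStemmer().stem(t) for t in tokens]
print(stems)             # stemmed tokens, e.g. 'markets' -> 'market'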

A word cloud is basically a representation that displays the most common words in our text in a cloud format. It is way too cool. We can also give it various colors or change the figure size according to our requirements.
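A minimal sketch of generating one with the wordcloud package (the size and colours here are just my choices):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

all_text = ' '.join(comments)   # the cleaned comments from earlier
wc = WordCloud(width=800, height=400, background_color='black').generate(all_text)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()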

Here is the word cloud I produced:

Word Cloud (background=black)

The inference from this word cloud was that the comments mostly contained financial terms.

Do check out my GitHub repository (link at the top) for the full code, and reach out to me with any doubts by either commenting here or on Twitter (https://twitter.com/_aishwariya_).

Have a great data day ahead!
