Extract the Right Phrase From a Sentence

Jitendra Dash
Published in Analytics Vidhya
17 min read · Jan 25, 2021

A Natural Language Processing Project Using Deep Learning

This is quite a unique project for me, because here I need to extract a phrase from a sentence based on a given sentiment. Previously I have worked on projects where I only had to predict the sentiment of some data, but this project adds an extra step to that task. Let's walk through the whole project.

1. What is the business problem I am solving?

This is the first thing we should ask ourselves, because a clear answer is what lets us achieve our goal.

With all of the tweets circulating every second it is hard to tell whether the sentiment behind a specific tweet will impact a company, or a person’s, brand for being viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important in these times where decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description?

This case study is about capturing the sentiment, or meaning, behind a tweet.

1.1 : The machine learning problem I am solving

We need to pick out the part of the tweet (a word or phrase) that reflects the sentiment.

Example:

Here is a tweet: “my boss is bullying me…”

Sentiment of the tweet: “negative”

Words that make it a negative tweet: “bullying me” (this is what we want to find in this case study)

1.2 : What kind of ML problem is it (classification, regression, or something else)?

The objective is simple: given a text and its sentiment, we have to predict the selected text (a word or phrase taken from the text).

Is it classification or regression? This is a little difficult to answer, but I would say it is a classification task of a different kind, because we have to classify each word of the text as belonging to the selected phrase or not.

1.3 : Where can we find the data?

We can easily get the data from Kaggle. Here is the link:

https://www.kaggle.com/c/tweet-sentiment-extraction

1.4 : Before going ahead, it is good to think about the objective and constraints:

  • No low-latency requirement (but latency should be reasonable)
  • Interpretability is important (this is going to be difficult because we will be working with deep learning models, but we can get there through post analysis of our models)
  • It should pick the right phrase from the text as often as possible
  • We also need to look at how many sentences come in per minute

1.5 : What should my KPI (Key Performance Indicator) be for this task?

As this was a competition on Kaggle, the organizers chose Jaccard similarity as the KPI, and from my point of view it is the right metric to care about.

The Jaccard score is a measure of how similar two sets are. The higher the score, the more similar the two strings. The idea is to find the number of common tokens and divide it by the total number of unique tokens. In mathematical terms: J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the sets of tokens in the two strings.

source : https://www.kaggle.com/parulpandey/eda-and-preprocessing-for-bert
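The metric can be sketched in a few lines of Python. This is a minimal word-level version under stated assumptions: tokens are obtained by lower-casing and splitting on whitespace, and two empty strings are treated as identical (that edge-case choice is mine, not taken from the article).

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    common = a & b
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    return len(common) / (len(a) + len(b) - len(common))


# 2 shared words ("bullying", "me") out of 5 unique words in total
print(jaccard("bullying me", "my boss is bullying me"))
```

A prediction that returns the whole tweet when only two words were selected is penalised for every extra word, which is why the metric rewards tight phrases.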

1.6 : What are the existing approaches to this problem?

People have tried various Transformer models, such as DistilBERT and RoBERTa, to get the best results.

1.7 : My approach to this problem

What I have done in this project is not jump directly to large or complex models.

I have gone from the simplest model to more complex ones, to understand how machine learning and deep learning models behave and what results they give.

2 . Exploratory Data Analysis

EDA is not just about plotting different kinds of graphs that look good. It is about what information we can get from our data, and that should be our focus.

Import the necessary libraries:

2.1 : Reading the data and basic statistics

  • Here I read the train and test data and then display the first 4 rows of the train data
  • There are 4 fields in the training data: textID, the actual text, the selected text, and the sentiment of the text
  • The test data contains just 3 fields: textID, the actual text, and the sentiment of the text
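A minimal sketch of this step with pandas. On the real data the frames would come from something like pd.read_csv("train.csv"); that file name is an assumption, and the tiny in-memory frame, its second tweet, and the helper name describe_shape are made up so the snippet runs without the Kaggle files.

```python
import pandas as pd


def describe_shape(df: pd.DataFrame, name: str) -> None:
    """Print the basic size statistics of a data set."""
    print(f"shape of the {name} data : ", df.shape)
    print(f"Number of data points in the {name} data :", df.shape[0])
    print(f"Number of feature in the {name} data :", df.shape[1])


# Stand-in for the real training file (same four columns, invented rows).
train = pd.DataFrame({
    "textID": ["a1", "b2"],
    "text": ["my boss is bullying me", "have a great day"],
    "selected_text": ["bullying me", "great"],
    "sentiment": ["negative", "positive"],
})

print(train.head(4))   # first rows of the train data
describe_shape(train, "train")
train.info()           # column types and non-null counts
```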

Output :

shape of the training data :  (27481, 4)
Number of data points in the train data : 27481
Number of feature in the train data : 4

Output:

shape of the test data :  (3534, 3)
Number of data points in the test data : 3534
Number of feature in the test data : 3

output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27481 entries, 0 to 27480
Data columns (total 4 columns):
textID 27481 non-null object
text 27480 non-null object
selected_text 27480 non-null object
sentiment 27481 non-null object
dtypes: object(4)
memory usage: 858.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3534 entries, 0 to 3533
Data columns (total 3 columns):
textID 3534 non-null object
text 3534 non-null object
sentiment 3534 non-null object
dtypes: object(3)
memory usage: 82.9+ KB

Observation :

  • We have two data sets, train and test, with 27481 and 3534 data points respectively. The train data has 4 features and the test data has 3.
  • The test data is missing one feature, the selected text (a subset of the text), so that is what we have to predict. The selected_text column is our target column.
  • There is one missing value in the train data, which we will remove in the cleaning section

2.2 : Distribution of text based on sentiment

Observation :

  • The distribution is not uniform
  • The distribution of the train data by sentiment is: neutral (40.5%), positive (28.3%), negative (31.2%)
  • The distribution of the test data by sentiment is: neutral (40.5%), positive (31.2%), negative (28.3%)
  • Note that the positive and negative shares are swapped between the two sets: positive is 28.3% in train and 31.2% in test, while negative is 31.2% in train and 28.3% in test.

2.3 : Check duplicates and null values

(27481, 4)
textID text selected_text sentiment
314 fdb77c3752 NaN NaN neutral
Empty DataFrame
Columns: [textID, text, selected_text, sentiment]
Index: []

Observation :

There was one missing value, which has been removed. No duplicate rows were found.
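A sketch of how such a check might look in pandas. The three-row frame is invented for illustration; only the column names follow the data set.

```python
import pandas as pd

# Invented stand-in for the train data, with one row missing its text.
train = pd.DataFrame({
    "textID": ["a1", "b2", "c3"],
    "text": ["my boss is bullying me", None, "have a great day"],
    "selected_text": ["bullying me", None, "great"],
    "sentiment": ["negative", "neutral", "positive"],
})

print(train.shape)
print(train[train.isnull().any(axis=1)])       # rows with at least one missing value

train = train.dropna().reset_index(drop=True)  # remove the incomplete rows
print(train[train.duplicated()])               # empty DataFrame when nothing is duplicated
```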

2.4 : Data Cleaning

  • convert to lower case
  • remove text in square brackets
  • remove links
  • remove punctuation
  • remove words containing numbers
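The five cleaning steps can be sketched with regular expressions. This is one possible implementation, not necessarily the author's; the function name clean_text and the exact patterns are assumptions.

```python
import re
import string


def clean_text(text: str) -> str:
    """Apply the five cleaning steps listed above, in order."""
    text = str(text).lower()                                        # lower case
    text = re.sub(r"\[.*?\]", "", text)                             # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)               # links
    text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)   # punctuation
    text = re.sub(r"\w*\d\w*", "", text)                            # words containing numbers
    return " ".join(text.split())                                   # collapse leftover whitespace


print(clean_text("Check [this] out: http://t.co/abc GR8 day!!"))
```

Links are removed before punctuation on purpose: once the dots and slashes are stripped, a URL can no longer be recognised.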

2.5 : Creating new features for analysis

  • number of words in the text and in the selected text
  • difference in the number of words between the text and the selected text
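These features take one line each in pandas. The column names num_words_text, num_words_st and diff_words are hypothetical, and the two rows are invented sample data.

```python
import pandas as pd

train = pd.DataFrame({
    "text": ["my boss is bullying me", "have a great day"],
    "selected_text": ["bullying me", "great"],
})

# Word counts via whitespace splitting
train["num_words_text"] = train["text"].str.split().str.len()
train["num_words_st"] = train["selected_text"].str.split().str.len()

# How many words of the tweet were left out of the selected phrase
train["diff_words"] = train["num_words_text"] - train["num_words_st"]

print(train[["num_words_text", "num_words_st", "diff_words"]])
```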

2.6 : Plotting based on word difference

Distribution of the number of words in the text and in the selected text

Observation :

  • We get two right-skewed plots.
  • The text is longer than the selected text, which is expected since the selected text is a subset of the text
  • The two distributions also overlap over a fairly large region

2.7 : Distribution of +ve, -ve and neutral (difference between text and selected text)

Observation :

  • I am unable to interpret the above plot.
  • There is a peak at zero, because for neutral sentiment the difference between the text and the selected text is almost zero in nearly every case.
  • Let's plot only the +ve and -ve word differences

Observation:

  • Both plots are right skewed
  • The two plots almost overlap each other, apart from some small gaps here and there.
  • From this plot I cannot see much difference in word count between +ve and -ve sentiment (based on selected text)

2.8 : Violin plot of word difference

Observation :

  • Here too there is little information to analyse

2.8 Box plot of word difference

Observation :

  • The positive and negative box plots look identical except for the median
  • From this plot too we cannot find any difference between +ve and -ve points

2.9 : Frequency of words (let's build a word cloud)

Observation :

  • In the positive word cloud, words such as good, fun, love, mother, great and amazing occur more often; these indicate a positive tweet
  • In negative tweets we find don’t, cant, miss, sad, sick, sucks, etc.
  • In neutral tweets there is a mix of words that sometimes look negative and sometimes positive, so the sentiment cannot be decided from single words alone; we have to look at the whole sentence.
  • Let's look at the counts, to see how many times these words occur

2.10 : Most Frequent Word :

most frequent word in positive sentiment (text)
most frequent word in positive sentiment (selected text)
most frequent word in negative sentiment (text)
most frequent word in negative sentiment (selected_text)
most frequent word in neutral sentiment (text)
most frequent word in neutral sentiment (selected_text)

Observation :

  • Words like good, mother, happy, great and thanks occur more in positive tweets
  • not, sad, sorry, sick and hate occur more in negative tweets

2.11 : Unique words in tweets (selected text)

positive data :

negative data:

Observation:

  • Here we look at the unique words present in the tweets of each sentiment
  • In positive sentiment we see words such as congratulations, thanks and love, whereas in negative sentiment words like ache, saddest, hated and weak occur more.

2.12 : Capturing more than one word

Observation :

  • Here I have plotted n-grams, i.e. sequences of more than one word
  • The two-grams and three-grams already give some sense of -ve and +ve tweets
  • In positive sentences, words like happy, day and mother occur more often, and in negative ones don’t, not, no and hate occur more.
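Counting n-grams needs nothing beyond the standard library. The sketch below is one way to get the most frequent two-grams or three-grams; the helper name top_ngrams and the sample sentences are invented for illustration.

```python
from collections import Counter


def top_ngrams(texts, n=2, k=3):
    """Return the k most frequent word n-grams across all texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # Slide a window of n words over the sentence
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(k)


sample = ["happy mothers day", "happy mothers day mom", "i do not like it"]
print(top_ngrams(sample, n=2, k=2))
print(top_ngrams(sample, n=3, k=1))
```

Running it separately on the positive and negative subsets gives the counts behind plots like the ones described above.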

3. Machine Learning Models / Deep Learning Models

In this section I have tried different kinds of deep learning models. (I won’t post the full source code here, to keep the article readable; here is the link to the whole source code:)

3.1 : Baseline Model

The baseline model should be simple, so that if any error occurs we can detect it easily, and so that we can compare other models against it later.

The baseline model is a simple bidirectional LSTM. Before going into the architecture of the model, I want to discuss the preprocessing of our text data:

  • Remember that how you feed input data to the model is going to affect the model; in some cases it can make the model hard to analyze, especially in deep learning

I have written code for all the techniques discussed below; please check my GitHub repo.

TEXT PREPROCESSING

  • Get the data and split it into train, test and validation sets (an important step that should never be forgotten in machine learning)
  • Clean the data by removing @, #, &, converting to lower case, removing links, removing null/NaN values, etc.
  • Replace the words that appear in both the text and the selected text with <tok>, and use the result as the target text.

For example:

text : god assignments are stressful ! but its finish…

selected text : stressful

target text : god assignments are <tok> ! but its finished …

TOKENIZATION

  1. Take all the words from the training data.
  2. Create a tensorflow.keras Tokenizer object (here I have set the maximum number of words to 54k) and fit it on all the words from step 1.
  3. Here I have taken the maximum length of a sentence (or vector) to be 35.
  4. Call the tokenizer's texts_to_sequences() function on the training data, which gives the vectors.
  5. In the vocabulary each word is assigned a number, like this: [(‘<tok>’, 1), (‘i’, 2), (‘to’, 3), (‘the’, 4), (‘a’, 5), (‘my’, 6)], stored as a dict. I have given 1 to <tok> intentionally.
  6. Pad the vectors to our max length (which is 35).
  7. Repeat steps 4 and 6 (sequence conversion and padding) for the test data.
  8. For the target data, put 1 where <tok> is present and 0 everywhere else.

Let me give an example.

Let's say the sentence is: god assignments are stressful ! but its finish…

Here the target word is “stressful”.

We create the target text in this way: god assignments are <tok> ! but its finished …

I am putting <tok> in the place of stressful.

The vector will look like this: [0,0,0,1,0,0,0,0,0,0,0,0……0]
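The target construction can be sketched in plain Python. This is a simplified illustration, not the author's Keras pipeline: it assumes whitespace tokenization and zero padding at the end of the vector, and the helper names make_target_text and make_target_vector are hypothetical.

```python
MAX_LEN = 35  # maximum sequence length used in the article


def make_target_text(text: str, selected_text: str) -> str:
    """Replace every word of the text that is in the selected text with <tok>."""
    selected = set(selected_text.split())
    return " ".join("<tok>" if word in selected else word for word in text.split())


def make_target_vector(target_text: str, max_len: int = MAX_LEN) -> list:
    """1 where <tok> is present, 0 elsewhere, padded with zeros up to max_len."""
    labels = [1 if word == "<tok>" else 0 for word in target_text.split()][:max_len]
    return labels + [0] * (max_len - len(labels))  # assumption: padding goes at the end


text = "god assignments are stressful ! but its finished"
target_text = make_target_text(text, "stressful")
print(target_text)
print(make_target_vector(target_text)[:8])
```

Matching on a set of words means a selected word that occurs twice in the tweet is marked in both places, which is a limitation of this simple scheme.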

WORD EMBEDDING with a pretrained GloVe model

A word embedding allows words with similar meanings to have similar representations. For each word we get a real-valued vector from a predefined vector space (I have used GloVe vectors).

  • First download the GloVe vectors from this site: https://nlp.stanford.edu/projects/glove/
  • Create an object for the GloVe vectors
  • Create a dictionary that stores each word and its vector representation, using the GloVe object created above.
  • Stack all the vectors, get their mean and standard deviation, and use them to create a matrix of size (32515, 300), where 32515 is the vocab size
  • Go through the dictionary created in step 3, get each word and its vector, and map it into the matrix created in step 4
  • Now we have our embedded words and their corresponding vectors
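A NumPy sketch of the steps above. The function names load_glove and build_embedding_matrix are hypothetical, the GloVe file is assumed to be the usual plain-text format (one word followed by its values per line), and sizing the matrix as the largest word index plus one is my assumption. The toy 3-dimensional vectors stand in for the real 300-dimensional ones so the snippet runs without the download.

```python
import numpy as np


def load_glove(path):
    """Read a GloVe text file into a {word: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings


def build_embedding_matrix(word_index, embeddings_index, dim=300, seed=0):
    """Random-normal matrix (GloVe mean/std) with known words overwritten by their vectors."""
    all_vecs = np.stack(list(embeddings_index.values()))
    mean, std = all_vecs.mean(), all_vecs.std()
    rng = np.random.default_rng(seed)
    num_rows = max(word_index.values()) + 1  # assumption: row 0 is kept for padding
    matrix = rng.normal(mean, std, size=(num_rows, dim))
    for word, idx in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:  # words missing from GloVe keep their random row
            matrix[idx] = vec
    return matrix


toy_glove = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "sad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
word_index = {"<tok>": 1, "good": 2, "sad": 3, "unseenword": 4}
matrix = build_embedding_matrix(word_index, toy_glove, dim=3)
print(matrix.shape)
```

Initialising unknown rows from the same mean and standard deviation keeps them on the same scale as the pretrained vectors.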

MASKED LOSS

Why do we need to mask our loss?

Because we pad with zeros (remember we have max length = 35), and we have to exclude those padded positions from the predictions while calculating the loss.

  1. First create the loss object; here I have created a BinaryCrossEntropy loss, named “loss_function”.
  2. Create a function to which we will pass the loss object, named “maskedLoss”.
  3. Get a binary representation of the target values; let's call it “mask”.
  4. Call the loss_function created in step 1 and pass it y_train and y_pred.
  5. Now we have the loss tensor. Cast the mask from step 3 to the type of the loss tensor, and name it “mask_type_cast”. It will look like this: [0. 0. 0. 1… 0. 0. 0.]
  6. Multiply it with the loss.
  7. Take the mean over all the values.
  8. Return the loss.

Here is the code
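The original code is not reproduced in this text, so what follows is only a NumPy sketch of the same idea rather than the author's TensorFlow implementation. It assumes padded target positions are marked with -1 (the convention described for the attention model) and the function name masked_bce is hypothetical.

```python
import numpy as np

PAD = -1  # assumption: padded target positions are marked with -1


def masked_bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy in which padded positions contribute zero loss."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    mask = (y_true != PAD).astype(float)        # 1 for real tokens, 0 for padding
    y = np.where(y_true == PAD, 0.0, y_true)    # placeholder label; masked out below
    loss = -(y * np.log(y_pred) + (1.0 - y) * np.log(1.0 - y_pred))  # per-token loss
    return float((loss * mask).mean())          # mean over all positions, as in step 7


y_true = [1, 0, PAD, PAD]
y_pred = [0.9, 0.2, 0.5, 0.99]
print(masked_bce(y_true, y_pred))
```

Taking the mean over all positions follows the steps listed above; dividing by mask.sum() instead would average over real tokens only, which is a common alternative.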

Bi-LSTM Architecture

This is a simple architecture (as it is our baseline model).

Now train the model with the Adam optimizer. Let's look at how the accuracy and loss behave.

Looking at the accuracy plot, there may be slight overfitting, but not much. The loss has a nice shape (as the train loss decreases, the test loss also decreases).

Now let's look at the Jaccard score, which is what we care about:

Jaccard score for train data : 0.5842960053101587
Jaccard score for validation data : 0.594346251251535

OK! It looks good, but I want to know what Jaccard score we are getting for positive, negative and neutral sentiment separately.

Jaccard score for positive data point:

Jac train sccore =  0.3086811397909643
Jac valid sccore = 0.33709575637410566

Jaccard score for negative points

Jac train sccore =  0.3322021538329059
Jac valid sccore = 0.358989438053781

Jaccard score for neutral points:

Jac train sccore =  0.8252107927755609
Jac valid sccore = 0.8593473997409419

OK! We got a decent Jaccard score, but can we improve it? Let's build another model.

3.2 : Attention Model

For the attention model we also tokenize and apply padding.

In the target values, instead of giving 0 to the padded positions, we give

-1 to padded positions

0 to words that are not <tok>

1 to words that are <tok>

like this: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,-1, -1, -1, -1, -1, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 1, 1]

Now we follow the same steps as for our baseline model.

I have taken the baseline model, added an attention layer to it, and then added some dropout and layer normalization so that we don’t overfit.

Architecture of the attention model:

Let's look at the accuracy and loss.

It looks like the model is overfitting, even after adding layer normalization and dropout. Let's look at the Jaccard score.

Jaccard score for train data =  0.5895125412375438
Jaccard score for validation data = 0.5827399630028105

The attention model's overall Jaccard score is about the same as the baseline model's.

Let's look at the Jaccard score for positive, negative and neutral data points:

Jaccard score for positive data points:

Jac train sccore =  0.2848550415162455
Jac valid sccore = 0.30991278847885434

Jaccard score for negative data points:

Jac train sccore =  0.3074721686328846
Jac valid sccore = 0.3312753654139326

Jaccard score for neutral data points:

Jac train sccore =  0.7168542910565285
Jac valid sccore = 0.7384339042299354

3.3 : Transformer (RoBERTa)

Tokenization

First you need to install the Hugging Face transformers library:

! pip install transformers

  • Take the max length of a sentence to be 96 (we could also take 128)
  • Create a byte-level tokenizer. Its encodings expose “ ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing ”.

The byte-level BPE tokenizer is a subword tokenizer: it can break a single word into smaller pieces. For example, “faster” could be split into “fast” and “er”.

  • Create a dict that gives a unique encoded value to positive, negative and neutral. This will help us in the preprocessing phase, where we have to define the start and end tokens.

sentiment_id = {‘positive’: 1313, ‘negative’: 2430, ‘neutral’: 7974}

  • Now let's see how our tokenizer works

I encoded the sentence “this is my first deep learning project” and then decoded it again to check whether the tokenizer is working.

Text preprocessing (encoding) of the train data, attention mask, and defining the start and end tokens

Here I explain this preprocessing technique using an example.
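The original worked example is not reproduced in this text. The sketch below illustrates only the core idea of defining start and end tokens: find which tokens overlap the characters of the selected text. It is a toy under stated assumptions, since whitespace_offsets stands in for the offsets a real byte-level tokenizer would return, all function names are hypothetical, and the +1 shift assumes a single special token such as <s> is placed before the text.

```python
import re

MAX_LEN = 96  # maximum sequence length mentioned in the article


def whitespace_offsets(text):
    """Toy stand-in for tokenizer offsets: one (start, end) character span per word."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def find_token_span(text, selected_text, offsets):
    """Indices of the first and last token that overlap selected_text."""
    start_char = text.find(selected_text)
    if not selected_text or start_char == -1:
        raise ValueError("selected_text must be a non-empty substring of text")
    end_char = start_char + len(selected_text)
    covered = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
    return covered[0], covered[-1]


def one_hot(index, length=MAX_LEN):
    """Vector of zeros with a single 1 at the given position."""
    vec = [0] * length
    vec[index] = 1
    return vec


text = "my boss is bullying me"
first, last = find_token_span(text, "bullying me", whitespace_offsets(text))

# +1 assumes one special token such as <s> comes before the text
start_token = one_hot(first + 1)
end_token = one_hot(last + 1)
print(first, last)
```

The model is then trained to predict these two one-hot vectors, and the predicted phrase is whatever lies between the predicted start and end positions.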

Model Building

Now when we fit / train our model we have to give the

  • encoded text
  • attention mask
  • start token
  • end token
  • token type id

Now train the model.

Jaccard Score on validation data : 0.7045

This is the best Jaccard score we have got so far. Now let's check the Jaccard score for each sentiment.

Jaccard score for positive data points: 0.5661

Jaccard score for negative data points: 0.572

Jaccard score for neutral data points: 0.974

Well, we got quite a good Jaccard score for each sentiment. But one question comes to my mind:

Why are neutral data points getting a high Jaccard score in every model?

If you look carefully, for neutral tweets the length of the text in the train data is almost equal to the length of the selected text, so extracting the phrase was an easy task for neutral points compared with positive and negative points. That is why the Jaccard score of neutral points is high in every model.

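All model scores (validation Jaccard, as reported above)

Model                  Overall   Positive   Negative   Neutral
Baseline Bi-LSTM       0.5943    0.3371     0.3590     0.8593
Bi-LSTM + attention    0.5827    0.3099     0.3313     0.7384
RoBERTa                0.7045    0.5661     0.572      0.974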

Post Analysis

In this section I do a post analysis, where we look at where our models fail.

  1. Let's look at the distribution of the original selected text and the predicted selected text

Observation: we can see that the two distributions are not the same, although ideally they should be, since one is the prediction of the other.

2. Let's check the number of words in the original selected text and in the predicted selected text

Observation

  • There are 4955 rows where the original selected text and the predicted selected text do not have the same number of words.
  • If the word counts of these two columns (predicted and original) are not the same, the result is a lower Jaccard score
  • In some cases the prediction contains the selected text plus a few additional words, which lowers the Jaccard score.
  • In the distribution plot we can see that in most cases the original selected text contains more words than the predicted selected text, which results in a low Jaccard score
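A count like the one above can be reproduced with a short helper. This is a sketch with invented sample phrases; the function name count_length_mismatches is hypothetical.

```python
def count_length_mismatches(originals, predictions):
    """Number of rows whose original and predicted phrases differ in word count."""
    return sum(
        len(original.split()) != len(predicted.split())
        for original, predicted in zip(originals, predictions)
    )


originals = ["bullying me", "great", "stressful"]
predictions = ["bullying me", "a great day", "stressful"]
print(count_length_mismatches(originals, predictions))
```

In the sample, only the second row differs: the prediction wraps the right word in two extra ones, which is exactly the failure mode described in the third bullet.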

3. Let's look at the difference in the number of words between the original selected text and the predicted selected text

  • We can see that the original selected text and the predicted selected text fully overlap (8738 such data points are present where the word lengths are differing)
  • Where the text lengths do not differ, the Jaccard score is also high.

Future Work

Here I have worked on Twitter data, where people talk about all kinds of things and there is no fixed topic. In the future we could choose a specific domain, retrieve customer reviews from it, and assign a sentiment to each review. Based on that sentiment we can then capture the reason behind it: if people dislike something (a -ve review), we can find the reason using the method from this project, and vice versa.

My github and Linkedin Profile

Thank you
