Emotion Classification from Tweets with LSTM, NLTK, and Keras
In this post I'll walk through using an LSTM (a recurrent neural network) to classify and analyse sequential text data.
Full Code Available at : https://github.com/saitejdandge/Sentimental_Analysis_LSTM_Conv1D
Problem Statement
We have to train a model that outputs an emotion for a given input text. Since the output we are trying to predict is a label rather than a continuous number, we can frame this as a classification problem.
We'll break this problem into 3 modules:
1. Data Preparation
1.0 Understanding Data
1.1 Removing punctuations, words that start with ‘@’ and stop words
1.2 Tokenising words / Converting words to indices
1.3 Padding Words
1.4 Building Word Embeddings
1.5 One hot encoding labels
2. Building Model
2.1 Understanding Embedding Layer
2.2 Understanding LSTM Layer
2.3 Understanding Dense Layer
2.4 Adding Activations at each Layer.
2.5 Model Architecture with input and output shapes
3. Training our Model
3.1 Splitting data into training and testing dataset
3.2 Training the network
3.3 Plotting training and testing accuracies
Imports
Let's start by importing the modules we need.
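A likely set of imports for this tutorial (a sketch; the repo's exact import list may differ):

```python
import re

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import LSTM, Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential
```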
1. Data Preparation
1.0 Understanding Data
We have 4,000 tweets, each labelled with one of the sentiments (labels) below:
{ anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry}
What features should we consider as input? (Feature Selection)
This step is called feature selection: we keep only the columns we expect to affect the output. We can ignore the tweet_id and author columns, since the emotional outcome doesn't depend on them.
1.1 Removing punctuations, words that start with ‘@’ and stop words
- Word vectors are sensitive to punctuation and letter case.
- Words that start with "@" are user and page references; they're just usernames and page names, so they add no value to the output.
- Stop words like a, an, the, etc. need to be removed, as they might bias our model's output. We want to concentrate on the key words that we expect to have an impact on the output.
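A minimal cleaning sketch of the steps above (`clean_tweet` is a hypothetical helper, and the stop-word set here is a small sample; NLTK's full stop-word list is much larger):

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to"}  # small sample set

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r"@\w+", "", text)      # drop @user / @page references
    text = re.sub(r"[^a-z\s]", "", text)  # strip punctuation and digits
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_tweet("@John The battery is DEAD!!!"))  # -> "battery dead"
```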
1.2 Tokenising words / Converting words to indices
Now that we have preprocessed the text by removing unnecessary words and normalising the rest, we convert each word into an index. We get the indices by sorting all the words alphabetically and adding 1 (index 0 is reserved for unknown words).
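A sketch of the alphabetical indexing just described (the repo may use Keras' Tokenizer instead; these helper names are illustrative):

```python
# Build a word -> index map from a corpus; index 0 stays reserved
# for unknown words, so known words start at 1.
def build_word_index(texts):
    vocab = sorted({w for t in texts for w in t.split()})
    return {w: i + 1 for i, w in enumerate(vocab)}

def texts_to_indices(texts, word_index):
    return [[word_index.get(w, 0) for w in t.split()] for t in texts]

word_index = build_word_index(["battery dead", "love battery"])
print(word_index)                                   # {'battery': 1, 'dead': 2, 'love': 3}
print(texts_to_indices(["dead cat"], word_index))   # [[2, 0]] -- "cat" is unknown
```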
1.3 Padding Words
We'll pad each input entry to 20 words; if an entry has fewer words, we fill the remaining positions with the unknown-word index.
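A padding sketch (Keras' `pad_sequences` does the same job; this illustrates the idea):

```python
MAX_LEN = 20

# Truncate to MAX_LEN, then fill any remaining slots with the
# unknown-word index 0.
def pad(seq, max_len=MAX_LEN, pad_value=0):
    seq = seq[:max_len]
    return seq + [pad_value] * (max_len - len(seq))

padded = pad([4, 8, 15])
print(len(padded))   # 20
print(padded[:5])    # [4, 8, 15, 0, 0]
```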
1.4 Building Word Embeddings
Word embeddings are vectorised representations of words. Assume we have a space of n dimensions; each word in our dictionary becomes an n-dimensional point in that space. This preserves the relative distances between words and gives our neural network a semantic understanding.
Example: the distance between "kitchen" and "battery" should be large compared to the distance between "kitchen" and "bathroom", since "kitchen" and "bathroom" are closely related (both are rooms).
We'll use the pretrained GloVe 50D model (each word has 50 dimensions) for the word embeddings, and simply transfer its weights instead of training embeddings from scratch. On the whole, this gives our neural network a kickstart.
Download the GloVe 50D word embeddings (glove.6B.50d.txt). We'll add an Embedding layer to our network; given a word index, this layer returns the word vector.
The Embedding layer internally holds an embedding matrix of shape (vocab + 1, embedding dimension). In our case that is (vocab + 1, 50), since we are using GloVe 50D vectors: each word is represented as a 50-dimensional vector, and the rows are ordered by the indices obtained when labelling the words alphabetically.
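A sketch of turning GloVe lines ("word v1 v2 … v50") into an embedding matrix of shape (vocab + 1, dim), with row 0 left as zeros for unknown words. With the real file you would iterate over `open("glove.6B.50d.txt")`; here tiny 3-dimensional toy vectors stand in just to show the shapes:

```python
import numpy as np

def build_embedding_matrix(glove_lines, word_index, dim):
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 = unknown word
    for line in glove_lines:
        parts = line.split()
        word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
        if word in word_index:
            matrix[word_index[word]] = vector
    return matrix

lines = ["kitchen 0.1 0.2 0.3", "bathroom 0.1 0.2 0.4"]
m = build_embedding_matrix(lines, {"kitchen": 1, "bathroom": 2}, dim=3)
print(m.shape)  # (3, 3)
```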
1.5 One Hot Encoding Labels
In machine learning, a one-hot encoding is a group of bits among which the only legal combinations of values are those with a single high (1) bit and all the others low (0).
Labels : { anger, boredom, empty, enthusiasm, fun, happiness, hate, love, neutral, relief, sadness, surprise, worry}
Each unique emotion is assigned to an integer value (Label Encoding).
For example, "anger" is 0, "boredom" is 1, "empty" is 2, and so on, in alphabetical order.
After this, a one-hot encoding can be applied to the integer representation. This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
There are 13 unique labels/emotions and therefore 13 binary variables are needed. A “1” value is placed in the binary variable for the emotion and “0” values for the other emotions.
Example:
- anger: 1000000000000
- boredom: 0100000000000
- empty: 0010000000000
…etc.
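A one-hot encoding sketch for the 13 emotions (Keras' `to_categorical` produces the same result from the integer labels; `one_hot` is an illustrative helper):

```python
# Alphabetical order gives each emotion its integer label.
LABELS = sorted(["anger", "boredom", "empty", "enthusiasm", "fun",
                 "happiness", "hate", "love", "neutral", "relief",
                 "sadness", "surprise", "worry"])

def one_hot(label):
    vec = [0] * len(LABELS)
    vec[LABELS.index(label)] = 1  # single high bit for this emotion
    return vec

print(one_hot("anger"))    # [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("boredom"))  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```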
2. Building Model
2.1 Understanding Embedding Layers (First Layer)
This layer acts as a lookup table: given a word index, it returns the embedded word vector.
In Keras, an Embedding layer can only be used as the first layer of a model.
Our input will be of size (None, 20); None means a variable batch size. Since we padded each input to 20 words in the data-preparation stage, each row holds 20 word indices.
The Embedding layer converts each index to its corresponding vector using the embedding matrix. The output dimension is 50 because we used GloVe 50D in the word-embedding step.
input (None, 20) = >(Embedding Layer) => (None,20,50)
2.2 Understanding LSTM / GRU layers (Hidden Layers)
They fall under the category of recurrent neural networks (RNNs). A recurrent network feeds its output from the previous timestep back in as input for the current timestep.
This gives it an internal memory, which makes it well suited to machine-learning problems that involve sequential data.
Output of Embedding layer will be fed to this LSTM layer.
We'll use an LSTM layer with 100 units. The layer has 100 recurrent cells; this number is a hyperparameter that can be adjusted to the needs and complexity of our data.
Input to the LSTM is shaped (batch_size, timesteps, features) (from the Keras documentation).
We can set the return_sequences argument in the LSTM constructor; there are two scenarios based on its value.
return_sequences = True
The layer's output will include the output from every timestep.
(None, 20,50) = > LSTM(100, return_sequences=True) => (None,20,100)
In the next step, we’ll flatten.
(None, 20,100) = > Flatten => (None,2000)
return_sequences = False
The layer's output will include only the output from the last timestep.
(None, 20, 50) => LSTM(100, return_sequences=False) => (None, 100)
We can go with either scenario depending on our requirement; in the end we'll have an output shape in 2 dimensions, either
output from LSTM layer + Flatten => (None, 2000) if return_sequences=True
or
output from LSTM layer => (None, 100) if return_sequences=False
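The two shapes can be checked directly (assumes TensorFlow/Keras; layer sizes match the article's example):

```python
import tensorflow as tf

x = tf.keras.Input(shape=(20, 50))  # (batch, timesteps, features)
seq_out = tf.keras.layers.LSTM(100, return_sequences=True)(x)    # every timestep
last_out = tf.keras.layers.LSTM(100, return_sequences=False)(x)  # last timestep only
flat = tf.keras.layers.Flatten()(seq_out)

print(seq_out.shape)   # (None, 20, 100)
print(last_out.shape)  # (None, 100)
print(flat.shape)      # (None, 2000)
```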
2.3 Understanding Dense Layer (Last Layers)
We combine everything from the previous layers using Dense (fully connected) layers, reducing the number of output units step by step until we reach (None, labels_count).
(None, 2000) or (None,100)= > Dense(300) => (None,300)
Adding another dense layer
(None,300) => Dense(labels_count) => (None,13)
13 is the number of labels in our problem, i.e. the total number of emotions.
2.4 Adding Activation at each layer
We'll add an activation at each layer to give our model a non-linear understanding: ReLU on the hidden layers and softmax on the last layer.
Softmax turns the last layer's outputs into a probability distribution over the labels for a given input, which helps achieve better results.
After adding this, we get 13 outputs for each input, each lying between 0 and 1. Each output represents the probability of that emotion for the given input, and the one with the highest value is our prediction.
2.5 Model Architecture with input and output shapes
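Putting the pieces together, a sketch of the architecture with the shapes discussed above (assumes TensorFlow/Keras; layer sizes follow the article, while the vocab size and all-zeros weight matrix are stand-ins for the real GloVe matrix built earlier):

```python
import numpy as np
import tensorflow as tf

VOCAB, DIM, MAX_LEN, NUM_LABELS = 10000, 50, 20, 13
embedding_matrix = np.zeros((VOCAB + 1, DIM))  # placeholder for GloVe weights

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(VOCAB + 1, DIM, trainable=False,
                              name="glove_embedding"),         # (None, 20, 50)
    tf.keras.layers.LSTM(100, return_sequences=True),          # (None, 20, 100)
    tf.keras.layers.Flatten(),                                 # (None, 2000)
    tf.keras.layers.Dense(300, activation="relu"),             # (None, 300)
    tf.keras.layers.Dense(NUM_LABELS, activation="softmax"),   # (None, 13)
])
# Transfer the pretrained GloVe weights instead of learning them.
model.get_layer("glove_embedding").set_weights([embedding_matrix])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```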
Additional info: we can add 1D convolution (Conv1D) layers as hidden layers for better results.
3. Training our Model
3.1 Splitting data into training and testing dataset
We need to split our data into two parts: training data and testing data.
We use the training dataset to train our neural network, and the test dataset to provide an unbiased evaluation of the final model fit on the training data.
This helps us find the sweet spot between underfitting and overfitting.
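A split sketch using scikit-learn's helper (the 80/20 ratio and the stand-in arrays are illustrative choices, not necessarily the repo's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # stand-in features (50 samples)
y = np.arange(50)                  # stand-in labels

# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 40 10
```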
3.2 Training the network
Now we define the number of epochs and the checkpoint conditions; these checkpoints will save our model locally whenever there's an improvement.
Let's now start training our model.
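A checkpointing sketch: save the model only when validation accuracy improves (the file name, epoch count, and batch size below are illustrative choices):

```python
import tensorflow as tf

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_accuracy",
    save_best_only=True, verbose=1)

# history = model.fit(X_train, y_train,
#                     validation_data=(X_test, y_test),
#                     epochs=20, batch_size=64,
#                     callbacks=[checkpoint])
```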
3.3 Plotting training and testing accuracies
This will start training; we can now monitor the accuracies and plot them as a graph to understand the results.
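A plotting sketch (assumes matplotlib; the hard-coded lists stand in for `history.history["accuracy"]` and `history.history["val_accuracy"]` returned by `fit()`):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

train_acc = [0.20, 0.30, 0.38, 0.41]  # illustrative numbers
val_acc = [0.25, 0.33, 0.31, 0.35]

fig, ax = plt.subplots()
ax.plot(train_acc, label="train accuracy")
ax.plot(val_acc, label="validation accuracy")
ax.set_xlabel("epoch")
ax.set_ylabel("accuracy")
ax.legend()
fig.savefig("accuracy.png")
```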
As you can see, our training accuracy reached around 40%, while validation accuracy (testing accuracy) fluctuated.
From these results we can say that our model still needs more data to learn the signal behind each emotion; it ended up with a training and testing accuracy of around 50%.
Full code Available at : https://github.com/saitejdandge/Sentimental_Analysis_LSTM_Conv1D