Sentiment Analysis For Bengali News Text Using LSTM

Published in Analytics Vidhya · 5 min read · May 11, 2020

Sentiment analysis is the field that deals with opinions, reactions, and emotions expressed in text. It is widely used in areas such as data mining, web mining, and social media analysis, since opinions are among the most fundamental signals for judging human behaviour. The field is making waves in both research and industry. Sentiments can be positive, negative, or neutral, or they can carry a numerical score expressing the strength of the opinion.

News headlines carry emotions: good, bad, or neutral. Sentiment analysis is used to investigate the human emotions present in textual information. Here I will show a deep learning-based implementation for sentiment analysis of news headlines. The experiments were performed on a Bengali news headline dataset, which demonstrates the applicability and validity of the adopted approach.

Load Dataset

pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis library built for Python. The dataset is stored in an Excel sheet, so we read it with pandas. In Python 3 there is no need to convert the encoding. The re module is used to remove unwanted characters from the Bengali text with regular expressions.

import pandas as pd
import re

df = pd.read_excel("Bengali_News_Headline_Sentiment.xlsx")

The dataset has three attributes, but only the Headline and Sentiment attributes are needed for sentiment analysis.

df.head()

drop() removes the unwanted column; axis=1 indicates that a column, rather than a row, is dropped.

df = df.drop('News Type', axis=1)

Remove Punctuation

Bengali text commonly contains punctuation marks such as ।, the comma, the colon, ‘ ’, and ?. These marks need to be removed from the text before it is tokenized and converted to sequences. A lambda function is applied to each headline to strip them.

# Remove the danda, comma, colon, question mark, and single quotes
df['Headline'] = df['Headline'].apply(lambda x: re.sub(r"[।,:?'‘’]", '', x))
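As a minimal sketch of this cleaning step, the snippet below strips the marks named above from a sample string using only the standard library. The helper name `strip_punctuation` and the exact character set are assumptions for illustration and are not taken from the original notebook.

```python
import re

# Character class with the marks mentioned in the text: danda, comma, colon,
# question mark, and straight/curly single quotes (an illustrative set).
PUNCTUATION = re.compile(r"[।,:?'‘’]")

def strip_punctuation(text: str) -> str:
    """Return the text with the listed punctuation marks removed."""
    return PUNCTUATION.sub("", text)

print(strip_punctuation("খবর: আজ কি হবে?"))
```

Compiling the pattern once avoids re-parsing it for every headline, which may matter on larger datasets.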

Deep Learning Library

The Keras Tokenizer is used to convert the text into tokens, and Keras pad_sequences then pads the resulting integer sequences to a common length. The Keras Sequential model is imported to build the model; it is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

An LSTM (Long Short-Term Memory) network is a kind of recurrent neural network, that is, a neural network that attempts to model time- or sequence-dependent behaviour. Here we build an LSTM network so that the order of the words in the text is taken into account. It uses LSTM cell blocks in place of standard neural network layers. These cells have components called the input gate, the forget gate, and the output gate.
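To make the role of the three gates concrete, here is a rough sketch of a single LSTM time step for scalar inputs, written with the standard library only. It follows the usual textbook formulation; the function names and the weight values are hypothetical, and this is not how Keras implements the layer internally.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    w maps each of 'i', 'f', 'o', 'g' to a (w_x, w_h, bias) tuple.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new information to write
    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much cell state to expose
    g = math.tanh(pre("g"))    # candidate cell value
    c = f * c_prev + i * g     # updated cell state
    h = o * math.tanh(c)       # updated hidden state
    return h, c

# Hypothetical weights, chosen only to make the example runnable.
weights = {"i": (0.5, 0.1, 0.0), "f": (0.3, 0.2, 1.0),
           "o": (0.4, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:    # a toy three-step input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

The cell state `c` is what lets information persist across many steps: the forget gate scales the old value and the input gate scales the new candidate.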

The Dense layer is the regular fully connected neural network layer. It is the most common and frequently used layer.

The Keras Embedding layer can be used for neural networks on text data. It requires the input data to be integer encoded, so that each word is represented by a unique integer.

SpatialDropout1D drops entire 1D feature maps instead of individual elements.

Keras utils provides to_categorical, which converts a class vector (integers) to a binary class matrix.

The scikit-learn model_selection module is used to split the dataset into training and test parts.
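Conceptually, an embedding layer is a lookup table indexed by word ID. The sketch below illustrates that idea in plain Python with toy sizes; the helper names are made up for this example, and a real layer learns the table values during training rather than leaving them random.

```python
import random

def make_embedding(vocab_size: int, dim: int, seed: int = 0):
    """Build a vocab_size x dim table of small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(table, sequence):
    """Replace each integer word index with its row of the table."""
    return [table[idx] for idx in sequence]

table = make_embedding(vocab_size=10, dim=4)   # toy sizes, not 2500 x 64
vectors = embed(table, [0, 3, 3, 7])           # a padded, integer-encoded headline
print(len(vectors), len(vectors[0]))           # number of time steps, vector size
```

Because the same index always maps to the same row, repeated words in a headline share one vector.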
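The difference from ordinary dropout can be sketched in a few lines: one keep-or-drop decision is drawn per channel and reused at every time step. This is an illustrative pure-Python version with a hypothetical function name, assuming the common inverted-dropout scaling of kept values by 1 / (1 - rate).

```python
import random

def spatial_dropout_1d(x, rate, rng):
    """x is a list of time steps, each a list of channel values.

    One keep/drop decision is drawn per channel and reused at every time
    step; kept channels are scaled by 1 / (1 - rate).
    """
    channels = len(x[0])
    keep = [rng.random() >= rate for _ in range(channels)]
    scale = 1.0 / (1.0 - rate)
    return [[v * scale if keep[ch] else 0.0 for ch, v in enumerate(step)]
            for step in x]

rng = random.Random(0)
x = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]   # 3 time steps, 4 channels
for row in spatial_dropout_1d(x, rate=0.4, rng=rng):
    print(row)
```

Every printed row has zeros in the same positions, which is what "dropping an entire feature map" means for sequence data.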
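The conversion from a class vector to a binary class matrix is simple enough to write by hand. The following is a small sketch of the idea, with `to_one_hot` as an illustrative name rather than the Keras function itself.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1."""
    return [[1 if cls == label else 0 for cls in range(num_classes)]
            for label in labels]

print(to_one_hot([0, 1, 1, 0], num_classes=2))
# [[1, 0], [0, 1], [0, 1], [1, 0]]
```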

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, SpatialDropout1D
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

Convert Text to Sequence

For feature extraction, a maximum of 2500 features is used, so the tokenizer keeps only the most frequent words up to that limit. The tokenizer is fitted on the headline text and then converts each headline into a sequence of integers. Finally, the sequences are padded to a common length.

max_fatures = 2500
tokenizer = Tokenizer(num_words=max_fatures, split=' ')
tokenizer.fit_on_texts(df['Headline'].values)
X = tokenizer.texts_to_sequences(df['Headline'].values)
X = pad_sequences(X)

X is an array of padded integer sequences with a 32-bit integer data type.

np.shape(X)
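The text-to-sequence step can be approximated in plain Python to show what happens to each headline. This is a simplified sketch with made-up helper names: it ranks words by frequency, keeps only indices below `num_words`, and left-pads with zeros. It skips the lowercasing and character filtering that the Keras Tokenizer also performs.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split(" ") if word)
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    return [[word_index[w] for w in text.split(" ")
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_left(seqs, value=0):
    """Left-pad every sequence with `value` to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    return [[value] * (maxlen - len(s)) + s for s in seqs]

headlines = ["দাম বাড়ল", "দাম কমল আজ", "আজ দাম বাড়ল"]   # toy data
index = fit_word_index(headlines)
padded = pad_left(to_sequences(headlines, index, num_words=2500))
for row in padded:
    print(row)
```

Index 0 is reserved for padding, which is why word indices start at 1.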

Set Model

The embedding dimension is set to 64. The embedding layer takes the maximum feature count as its vocabulary size, the embedding dimension as its output size, and an input length equal to the padded sequence length. The SpatialDropout1D layer uses a rate of 0.4. An LSTM layer is added with 64 hidden units, a dropout of 0.2, and a recurrent dropout of 0.2. The output layer is a Dense layer of size 2 with the softmax activation function. Softmax converts a real vector to a vector of categorical probabilities; the elements of the output vector are in the range (0, 1) and sum to 1. The loss is computed with 'categorical_crossentropy' and 'adam' is used as the optimizer.

embed_dim = 64

model = Sequential()
model.add(Embedding(max_fatures, embed_dim, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()
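As a sanity check on the model size, the trainable parameter count of each layer can be worked out by hand. The sketch below assumes the standard layer formulas (an LSTM with four weight sets and biases, a Dense layer with biases); the function names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim                  # one vector per word index

def lstm_params(input_dim, units):
    # 4 weight sets (input, forget, output gates and cell candidate), each with
    # an input kernel, a recurrent kernel and a bias vector.
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units               # weights plus biases

total = embedding_params(2500, 64) + lstm_params(64, 64) + dense_params(64, 2)
print(total)  # 193154
```

SpatialDropout1D contributes nothing here because dropout layers have no trainable weights. Most of the parameters sit in the embedding table.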

Train-Test Split

The headline text is the model input and the sentiment is the output: variable X holds the input and Y holds the output. The pandas get_dummies function is used to one-hot encode the sentiment values from the dataset.

For testing, 20% of the data is held out, with random_state set to 42 so that the split is reproducible.

Y = pd.get_dummies(df['Sentiment']).values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20, random_state=42)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)
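The idea behind a seeded hold-out split can be sketched with the standard library. This is only an illustration with a hypothetical helper name: it uses Python's own random generator, so it will not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import math
import random

def split_train_test(X, Y, test_size=0.2, seed=42):
    """Shuffle paired samples and hold out a test fraction."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)             # same seed, same shuffle
    n_test = math.ceil(len(X) * test_size)
    test_idx, train_idx = order[:n_test], order[n_test:]
    pick = lambda data, idx: [data[i] for i in idx]
    return (pick(X, train_idx), pick(X, test_idx),
            pick(Y, train_idx), pick(Y, test_idx))

X_demo = list(range(10))                           # toy inputs
Y_demo = [x % 2 for x in X_demo]                   # toy labels
X_tr, X_te, Y_tr, Y_te = split_train_test(X_demo, Y_demo)
print(len(X_tr), len(X_te))
```

Shuffling indices rather than the data itself keeps every input paired with its label.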

Train Model

The history variable stores the model's performance at each epoch. Training runs for 10 epochs with a batch size of 32, and 10% of the training data is set aside as a validation split.

history = model.fit(X_train, Y_train, epochs=10, batch_size=32, verbose=2, validation_split=0.1)
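The effect of validation_split and batch_size on the amount of work per epoch can be shown with a little arithmetic. The sketch assumes the documented Keras behaviour of taking the validation samples from the end of the data before shuffling; the function name and the sample count of 1000 are hypothetical.

```python
import math

def fit_split_sizes(n_samples, validation_split, batch_size):
    """Return (train size, validation size, batches per epoch)."""
    n_train = int(n_samples * (1.0 - validation_split))  # leading samples train
    n_val = n_samples - n_train                          # trailing samples validate
    return n_train, n_val, math.ceil(n_train / batch_size)

# 1000 is a hypothetical training-set size, not the real dataset size.
print(fit_split_sizes(1000, validation_split=0.1, batch_size=32))
# (900, 100, 29)
```

The last batch of an epoch is smaller than the rest whenever the training size is not a multiple of the batch size.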

After training, the model achieves 97% accuracy on the training data and 64% accuracy on the validation data. The gap between the two suggests that the model is overfitting the training data. The model is then evaluated on the training and test sets with a batch size of 64.

score = model.evaluate(X_train, Y_train, batch_size=64, verbose=2)
print('Train loss:', score[0])
print('Train accuracy:', score[1])

score = model.evaluate(X_test, Y_test, batch_size=64, verbose=2)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
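The two numbers reported by evaluate, loss and accuracy, can be reproduced by hand for one-hot labels. The following sketch uses hypothetical labels and probabilities; the clipping constant `eps` is an assumption added to avoid taking the log of zero.

```python
import math

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class matches the one-hot label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(t * log(p)) over samples."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        total -= sum(ti * math.log(max(pi, eps)) for ti, pi in zip(t, p))
    return total / len(y_true)

# Hypothetical labels and predicted probabilities for four headlines.
y_true = [[1, 0], [0, 1], [0, 1], [1, 0]]
y_prob = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
print(accuracy(y_true, y_prob))                  # 0.75
print(categorical_crossentropy(y_true, y_prob))
```

Accuracy only counts whether the top class is right, while cross-entropy also penalises correct predictions made with low confidence.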

Sentiment Prediction

The user enters a headline, which is converted to a padded sequence for sentiment prediction. If the model predicts class 0, the headline is negative news; otherwise it is positive news.

text = input()
# texts_to_sequences expects a list of texts, so wrap the headline in a list
text = tokenizer.texts_to_sequences([text])
# maxlen must equal the padded length used in training (X.shape[1])
text = pad_sequences(text, maxlen=14, dtype='int32', value=0)
predict = model.predict(text, batch_size=1, verbose=2)[0]
if np.argmax(predict) == 0:
    print("Negative News")
else:
    print("Positive News")
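The final decision is just an argmax over the two softmax probabilities. Here is a small standalone sketch of that step, assuming hypothetical raw scores in place of a real model output; the label order follows the article's convention that index 0 is negative.

```python
import math

def softmax(logits):
    """Exponentiate (shifted by the max for stability) and normalise."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

LABELS = ["Negative News", "Positive News"]   # index 0 / index 1

def label_from_logits(logits):
    """Return the predicted label together with the class probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best], probs

print(label_from_logits([0.2, 1.5]))          # made-up scores
```

Returning the probabilities alongside the label makes it possible to see how confident a prediction is, instead of only its class.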

Github: https://github.com/AbuKaisar24/Machine-Learning-Algorithms-Performance-Measurement-for-Bengali-News-Sentiment-Classification/blob/master/Headline%20Sentiment%20using%20LSTM.ipynb