Deep Learning — Natural Language Processing (Part V-c)

Dejan Jovanovic
Jul 3, 2019 · 6 min read

In the last article we completed the creation of the vocabulary for our dataset, which we will now use to build an NLP model. In this first attempt we will develop a Multilayer Perceptron (MLP) model to classify encoded reviews as either positive or negative in sentiment. The model will be a simple feedforward network built from fully connected layers, called Dense in the Keras deep learning library.

What is a fully connected Dense layer? It is a linear operation in which every input is connected to every output by a weight (so there are n_inputs * n_outputs weights, which can be a lot!), generally followed by a non-linear activation function.

The model will have an input layer whose size equals the number of words in the vocabulary.
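To get a feel for how quickly a fully connected layer grows, here is a tiny illustration; the vocabulary size used is hypothetical until we compute the real one later in the article:

# illustrative only: parameter count of a Dense layer on a bag-of-words input
n_inputs = 10000   # hypothetical vocabulary size
n_units = 50       # hidden units
# one weight per (input, unit) pair plus one bias per unit
print(n_inputs * n_units + n_units)   # 500050 trainable parameters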

First, though, we need to load the previously created vocabulary:

vocabulary_fileName = "vocabulary.txt"

# load file into memory
def load_document(fileName):
    # open the file as read only
    file = open(fileName, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# load the vocabulary
vocabulary = load_document(vocabulary_fileName)
vocabulary = set(vocabulary.split())

Our next step is to load all reviews, cleaning them as they are loaded. The review dataset is split into 90% training data and 10% testing data; reviews whose file names start with 'cv9' form the test set. Here is the snippet of the code for this:

vocabulary_fileName = "vocabulary.txt"


# clean and tokenize
def clean_tokens(document):
    # split document into tokens by white space
    tokens = document.split()
    # punctuation removal
    remove_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [remove_punctuation.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens


# load document, clean it and return line of tokens
def document_to_line(fileName, vocabulary):
    # load the document
    document = load_document(fileName)
    # clean the document
    tokens = clean_tokens(document)
    # filter the tokens by vocabulary
    tokens = [x for x in tokens if x in vocabulary]
    return ' '.join(tokens)


# process all documents in the folder
def process_documents(directory, vocabulary, isTraining):
    lines = list()
    # go over all files in the directory
    for fileName in listdir(directory):
        # reviews whose file names start with 'cv9' form the test set
        if isTraining and fileName.startswith('cv9'):
            continue
        if not isTraining and not fileName.startswith('cv9'):
            continue
        # create the full path of the file to be opened
        path = directory + '/' + fileName
        # load and clean the data
        line = document_to_line(path, vocabulary)
        # add to list
        lines.append(line)
    return lines


# load and clean a dataset
def load_clean_dataset(vocabulary, isTraining):
    # load documents
    negativeFeedback = process_documents('./review_polarity/txt_sentoken/neg', vocabulary, isTraining)
    positiveFeedback = process_documents('./review_polarity/txt_sentoken/pos', vocabulary, isTraining)
    documents = positiveFeedback + negativeFeedback
    # prepare labels; positive documents come first and get label 0,
    # negative documents follow and get label 1
    labels = array([0 for _ in range(len(positiveFeedback))] +
                   [1 for _ in range(len(negativeFeedback))])
    return documents, labels

# load all reviews
training_docs, ytrain = load_clean_dataset(vocabulary, True)
test_docs, ytest = load_clean_dataset(vocabulary, False)
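As a quick sanity check on the split, we can print the sizes of the two sets. The expected counts below assume the standard review_polarity dataset, which contains 1,000 positive and 1,000 negative reviews named cv000 through cv999:

# check the 90/10 split (expected counts assume 1,000 reviews per class)
print(len(training_docs), len(ytrain))   # expected: 1800 1800
print(len(test_docs), len(ytest))        # expected: 200 200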

In this first attempt to build a deep learning sentiment analysis model, we are going to use the MLP described above, with an input layer whose size is equal to the number of words in the vocabulary.

Our model looks as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 50)                1288450
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51
=================================================================
Total params: 1,288,501
Trainable params: 1,288,501
Non-trainable params: 0
_________________________________________________________________
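The parameter counts follow directly from the fully connected structure described earlier: dense_1 has one weight per (input word, hidden unit) pair plus one bias per unit, and dense_2 connects the 50 hidden units to the single output. Working backwards from the summary, the input size in this run is 25,768 (the tokenizer's word index plus the reserved index 0); a quick check:

# where the parameter counts in the summary come from
n_words = 25768              # input size implied by the summary above
print(n_words * 50 + 50)     # 1288450 -> dense_1
print(50 * 1 + 1)            # 51      -> dense_2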

After running the training session we get the following results:

Epoch 1/10
- 5s - loss: 0.4819 - acc: 0.7794
Epoch 2/10
- 3s - loss: 0.0702 - acc: 0.9911
Epoch 3/10
- 3s - loss: 0.0179 - acc: 1.0000
Epoch 4/10
- 3s - loss: 0.0072 - acc: 1.0000
Epoch 5/10
- 3s - loss: 0.0038 - acc: 1.0000
Epoch 6/10
- 3s - loss: 0.0022 - acc: 1.0000
Epoch 7/10
- 3s - loss: 0.0014 - acc: 1.0000
Epoch 8/10
- 3s - loss: 9.5990e-04 - acc: 1.0000
Epoch 9/10
- 3s - loss: 7.0107e-04 - acc: 1.0000
Epoch 10/10
- 3s - loss: 5.3340e-04 - acc: 1.0000
Test Accuracy: 92.50

Testing the model with the test data, we get an accuracy of 92.5%. You may say "very impressive". It may seem so, but when we take this model and test it with some new data, we get the following:

Review: [Best movie ever! It was great, I recommend it.]
Sentiment: POSITIVE (56.636%)
Review: [This is a bad movie.]
Sentiment: NEGATIVE (66.628%)

As you may have noticed, the confidence of the positive prediction is only just over 50%, which is not good at all! We suspected that perhaps the review text that we had provided was just not long enough.

Here is the full source code of the example used today:

from os import listdir
from nltk.corpus import stopwords
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers import Dense
import re
import string

vocabulary_fileName = "vocabulary.txt"

# load file into memory
def load_document(fileName):
    # open the file as read only
    file = open(fileName, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# clean and tokenize
def clean_tokens(document):
    # split document into tokens by white space
    tokens = document.split()
    # punctuation removal
    remove_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [remove_punctuation.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load document, clean it and return line of tokens
def document_to_line(fileName, vocabulary):
    # load the document
    document = load_document(fileName)
    # clean the document
    tokens = clean_tokens(document)
    # filter the tokens by vocabulary
    tokens = [x for x in tokens if x in vocabulary]
    return ' '.join(tokens)

# process all documents in the folder
def process_documents(directory, vocabulary, isTraining):
    lines = list()
    # go over all files in the directory
    for fileName in listdir(directory):
        # reviews whose file names start with 'cv9' form the test set
        if isTraining and fileName.startswith('cv9'):
            continue
        if not isTraining and not fileName.startswith('cv9'):
            continue
        # create the full path of the file to be opened
        path = directory + '/' + fileName
        # load and clean the data
        line = document_to_line(path, vocabulary)
        # add to list
        lines.append(line)
    return lines


# load and clean a dataset
def load_clean_dataset(vocabulary, isTraining):
    # load documents
    negativeFeedback = process_documents(
        './review_polarity/txt_sentoken/neg',
        vocabulary, isTraining)
    positiveFeedback = process_documents(
        './review_polarity/txt_sentoken/pos',
        vocabulary, isTraining)
    documents = positiveFeedback + negativeFeedback
    # prepare labels; positive documents come first and get label 0,
    # negative documents follow and get label 1
    labels = array([0 for _ in range(len(positiveFeedback))] +
                   [1 for _ in range(len(negativeFeedback))])
    return documents, labels

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# create the model
def create_model(n_words):
    # define network
    model = Sequential()
    model.add(Dense(50, input_shape=(n_words,), activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    return model

def predict_sentiment(review, vocabulary, tokenizer, model):
    # clean the review text
    tokens = clean_tokens(review)
    # filter by vocabulary
    tokens = [w for w in tokens if w in vocabulary]
    # convert to line
    line = ' '.join(tokens)
    # encode
    encoded = tokenizer.texts_to_matrix([line], mode='binary')
    # predict sentiment; with the labelling above, an output near 0 means
    # positive and an output near 1 means negative
    prediction = model.predict(encoded, verbose=0)
    percent_negative = prediction[0, 0]
    if round(percent_negative) == 0:
        return (1 - percent_negative), "POSITIVE"
    return percent_negative, "NEGATIVE"

# load the vocabulary
vocabulary = load_document(vocabulary_fileName)
vocabulary = set(vocabulary.split())

# Vocabulary size
print("Vocabulary size: ", len(vocabulary))

# load all reviews
training_docs, ytrain = load_clean_dataset(vocabulary, True)
test_docs, ytest = load_clean_dataset(vocabulary, False)

# create the tokenizer
tokenizer = create_tokenizer(training_docs)
# encode data
Xtrain = tokenizer.texts_to_matrix(training_docs, mode='binary')
Xtest = tokenizer.texts_to_matrix(test_docs, mode='binary')

# define the model
n_words = Xtest.shape[1]

print("n_words-", n_words)

model = create_model(n_words)
# fit the network
model.fit(Xtrain, ytrain, epochs=10, verbose=2)

# evaluate
loss, acc = model.evaluate(Xtest, ytest, verbose=0)
print('Test Accuracy: %.2f' % (acc*100))

# test positive text
text = 'Best movie ever! It was great, I recommend it.'
percent, sentiment = predict_sentiment(text, vocabulary, tokenizer, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))
# test negative text
text = 'This is a bad movie.'
percent, sentiment = predict_sentiment(text, vocabulary, tokenizer, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment, percent*100))
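If the binary encoding step looks opaque, a toy example may help. Tokenizer.texts_to_matrix with mode='binary' turns each document into a fixed-length vector that has a 1 in the column of every word present in the document (column 0 is reserved by Keras and stays 0). The two-document corpus below is made up purely for illustration:

from keras.preprocessing.text import Tokenizer

# toy illustration of the binary bag-of-words encoding used above
toy_docs = ['great movie', 'bad movie']
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_docs)
print(toy_tokenizer.word_index)
# e.g. {'movie': 1, 'great': 2, 'bad': 3}
print(toy_tokenizer.texts_to_matrix(toy_docs, mode='binary'))
# [[0. 1. 1. 0.]
#  [0. 1. 0. 1.]]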

Summary

Hope you enjoyed this reading. We have seen that the simple MLP model does not provide satisfactory results. It looks good on the test data, but once you start using it with real data, accuracy really drops. In our next story we will continue exploring different models, techniques and network architectures in order to improve the accuracy of the sentiment analysis model for movie reviews.


