Dejan Jovanovic
Jul 17 · 6 min read

Working with Text

Text is one of the most prevalent forms of sequence data. For deep learning models to work with natural language, you must first prepare a statistical representation of the language. Deep learning for natural language processing is pattern recognition applied to words, sentences and paragraphs. Raw text must be vectorized before it can be used; this can be done in several ways:

  1. Segment text into words and transform each word into a vector.
  2. Segment text into characters and transform each character into a vector.
  3. Extract n-grams of words or characters and transform each n-gram into a vector.
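
As a minimal sketch of the three options above, in plain Python (the toy sentence and the choice of character 3-grams are just for illustration); each resulting token would then be turned into a vector:

text = "this movie was great"

# 1. word-level tokens
words = text.split()                                    # ['this', 'movie', 'was', 'great']

# 2. character-level tokens
chars = list(text.replace(" ", ""))                     # ['t', 'h', 'i', 's', 'm', ...]

# 3. character 3-grams
ngrams = [text[i:i + 3] for i in range(len(text) - 2)]  # ['thi', 'his', 'is ', ...]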

In the previous example of text vectorization I used one-hot encoding of tokens. In this second attempt at building a sentiment analysis engine for movie reviews, I am going to use word embeddings of tokens. A word embedding is a dense word vector. While the vectors obtained through one-hot encoding are binary, sparse, and very high dimensional, word embeddings are low-dimensional floating point vectors. Simply put, word embedding is a way of representing text where each word in the vocabulary is mapped to a dense, real-valued vector in a space of much lower dimension than the vocabulary size.
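
To make the contrast concrete, here is a tiny sketch with a made-up three-word vocabulary; the embedding values are invented purely for illustration:

vocabulary = ['movie', 'great', 'bad']   # toy vocabulary of 3 words

# one-hot encoding: binary, sparse, one dimension per vocabulary word
one_hot_great = [0.0, 1.0, 0.0]

# word embedding: dense, low-dimensional floating point values learned during training
# (the numbers below are made up purely for illustration)
embedding_great = [0.12, -0.48, 0.33, 0.91]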

Implementing word embedding in Keras is very easy.

from keras.layers import Embedding
embedding_layer = Embedding(25767, 100)

The Embedding layer can be understood as a dictionary that maps integer indices to dense vectors. It takes integers as input, looks them up in an internal table and returns the associated vectors. The layer outputs a three-dimensional floating point tensor of shape (samples, sequence length, embedding dimensionality).
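
A minimal sketch of those shapes, reusing the layer sizes above with an arbitrary sequence length of 10:

from keras.models import Sequential
from keras.layers import Embedding
from numpy import random

model = Sequential()
model.add(Embedding(25767, 100, input_length=10))
model.compile('rmsprop', 'mse')

# a batch of 32 sequences, each made of 10 integer word indices
batch = random.randint(0, 25767, size=(32, 10))
print(model.predict(batch).shape)  # (32, 10, 100) -> (samples, sequence length, embedding dimensionality)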

The first part of the problem, building a vocabulary, stays the same as earlier in this series. Since I am using the same dataset as last time, I reuse the same vocabulary.txt. The dataset is divided into training data and test data, with 90% used for training and 10% for testing.

# training vs testing split
training_split = 0.9
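
As a quick illustration of that split (the file names here are hypothetical):

from math import floor

files = ['review_%03d.txt' % i for i in range(1000)]  # hypothetical list of 1,000 review files
split_index = floor(len(files) * training_split)      # floor(1000 * 0.9) = 900
training, testing = files[:split_index], files[split_index:]
print(len(training), len(testing))                    # 900 100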

For the neural network model I am this time going to use a convolutional neural network (CNN) with an embedding layer. This is a binary classification problem.

# create the model
def create_model(vocabulary_size, max_length):
    # define network
    model = Sequential()
    model.add(Embedding(vocabulary_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    return model
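
As a rough sketch of how the tensor shapes flow through this network, using the vocabulary size (25,579) and maximum review length (1,152) from the run reported below; the exact layer names in the printout depend on the Keras version:

model = create_model(vocabulary_size=25579, max_length=1152)
for layer in model.layers:
    print(layer.name, layer.output_shape)
# expected shapes:
#   Embedding    -> (None, 1152, 100)  one 100-dimensional vector per word
#   Conv1D       -> (None, 1145, 32)   1152 - 8 + 1 windows of 8 words, 32 filters
#   MaxPooling1D -> (None, 572, 32)    pooling halves the sequence length
#   Flatten      -> (None, 18304)      572 * 32
#   Dense        -> (None, 10)
#   Dense        -> (None, 1)          sigmoid output used as the sentiment probability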

Despite these changes, the accuracy of the model has not improved significantly. My assumption is that the review dataset is simply not rich enough to measure sentiment against. Here is the result of my run:

-------------------------------------------------
Test accuracy: 81.00%
Vocabulary size: 25579
Maximum length: 1152
-------------------------------------------------
Review: [Best movie ever! It was great, I recommend it.]
Sentiment: NEGATIVE (63.633%)
Review: [This is a bad movie.]
Sentiment: NEGATIVE (63.633%)

Below is the complete code of the solution:

from os import listdir
from os import path
from nltk.corpus import stopwords
from numpy import array
from numpy import random
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from math import floor
from random import shuffle
import re
import string

seed = 7
random.seed(seed)

vocabulary_fileName = "vocabulary.txt"

# training vs testing split
training_split = 0.9

# load file into memory
def load_document(fileName):
    # open the file as read only
    file = open(fileName, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# get list of files from specified directory
def get_file_list_from_dir(datadirectory):
    # get list of all files in the specified directory
    all_files = listdir(path.abspath(datadirectory))
    # make sure that we only get our review files ending with .txt
    data_files = list(filter(lambda file: file.endswith('.txt'), all_files))
    return data_files

# split list on training and testing list
def get_training_and_testing_set(file_list):
    split_index = floor(len(file_list) * training_split)
    training = file_list[:split_index]
    testing = file_list[split_index:]
    return training, testing

# clean and tokenize
def clean_tokens(document):
    # split document into tokens by white space
    tokens = document.split()
    # punctuation removal
    remove_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [remove_punctuation.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# load document, clean it and return line of tokens
def document_to_line(fileName, vocabulary):
    # load the document
    document = load_document(fileName)
    # clean the document
    tokens = clean_tokens(document)
    # filter the tokens by vocabulary
    tokens = [x for x in tokens if x in vocabulary]
    return ' '.join(tokens)

# process all documents in the folder
def process_documents(directory, fileList, vocabulary):
    lines = list()
    # go over all files in the directory
    for fileName in fileList:
        # create the full path of the file to be opened
        filePath = directory + '/' + fileName
        # load and clean the data
        line = document_to_line(filePath, vocabulary)
        # add to list
        lines.append(line)
    return lines

# load and clean a dataset
def load_clean_dataset(vocabulary):
    # get positive feedback file list
    positiveFileList = get_file_list_from_dir('./review_polarity/txt_sentoken/pos')
    # get negative feedback file list
    negativeFileList = get_file_list_from_dir('./review_polarity/txt_sentoken/neg')
    # shuffle files
    shuffle(positiveFileList)
    shuffle(negativeFileList)

    # get training and testing file lists
    posTraining, posTesting = get_training_and_testing_set(positiveFileList)
    negTraining, negTesting = get_training_and_testing_set(negativeFileList)

    # load documents
    negativeFeedbackTraining = process_documents('./review_polarity/txt_sentoken/neg', negTraining, vocabulary)
    positiveFeedbackTraining = process_documents('./review_polarity/txt_sentoken/pos', posTraining, vocabulary)
    negativeFeedbackTesting = process_documents('./review_polarity/txt_sentoken/neg', negTesting, vocabulary)
    positiveFeedbackTesting = process_documents('./review_polarity/txt_sentoken/pos', posTesting, vocabulary)
    # keep the document order aligned with the labels below: negative first, then positive
    trainingDocuments = negativeFeedbackTraining + positiveFeedbackTraining
    testingDocuments = negativeFeedbackTesting + positiveFeedbackTesting
    # prepare labels (0 = negative, 1 = positive)
    trainingLabels = array([0 for _ in range(len(negativeFeedbackTraining))] +
                           [1 for _ in range(len(positiveFeedbackTraining))])
    testingLabels = array([0 for _ in range(len(negativeFeedbackTesting))] +
                          [1 for _ in range(len(positiveFeedbackTesting))])
    return trainingDocuments, trainingLabels, testingDocuments, testingLabels

# fit a tokenizer
def create_tokenizer(lines):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# integer encode and pad documents
def encode_docs(tokenizer, max_length, documents):
    # integer encode
    encoded = tokenizer.texts_to_sequences(documents)
    # pad sequences
    padded = pad_sequences(encoded, maxlen=max_length, padding='post')
    return padded

# create the model
def create_model(vocabulary_size, max_length):
    # define network
    model = Sequential()
    model.add(Embedding(vocabulary_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile network
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    return model

# turn the document into clean tokens
def clean_doc(doc, vocabulary):
    # split doc into tokens by white space
    tokens = doc.split()
    # prepare regular expression for filtering
    re_punc = re.compile('[%s]' % re.escape(string.punctuation))
    # remove punctuation from each word
    tokens = [re_punc.sub('', w) for w in tokens]
    # filter out tokens that are not part of vocabulary
    tokens = [word for word in tokens if word in vocabulary]
    tokens = ' '.join(tokens)
    return tokens

def predict_sentiment(review, vocabulary, tokenizer, max_length, model):
    # clean review
    line = clean_doc(review, vocabulary)
    # integer encode and pad
    padded = encode_docs(tokenizer, max_length, [line])
    # predict sentiment (probability of the positive class, label 1)
    what = model.predict(padded, verbose=0)
    percent_pos = what[0, 0]
    if round(percent_pos) == 0:
        return (1 - percent_pos), "NEGATIVE"
    return percent_pos, "POSITIVE"

# load the vocabulary
vocabulary = load_document(vocabulary_fileName)
vocabulary = set(vocabulary.split())

# load all reviews
training_docs, ytrain, test_docs, ytest = load_clean_dataset(vocabulary)

# calculate maximum sequence length
max_length = max([len(s.split()) for s in training_docs])

# create the tokenizer
tokenizer = create_tokenizer(training_docs)

# Vocabulary size
vocabulary_size = len(tokenizer.word_index)+1

# encode data
Xtrain = encode_docs(tokenizer, max_length, training_docs)
Xtest = encode_docs(tokenizer, max_length, test_docs)

model = create_model(vocabulary_size, max_length)
model.fit(Xtrain, ytrain, epochs=10, verbose=2)

# evaluate the model
loss, accuracy = model.evaluate(Xtest, ytest, verbose=0)
print("-------------------------------------------------")
print('Test accuracy: %.2f' % (accuracy*100) + '%')
print("Vocabulary size: ", vocabulary_size)
print('Maximum length: %d' % max_length)
print("-------------------------------------------------")

# test positive text
text = 'Best movie ever! It was great, I recommend it.'
percent, sentiment = predict_sentiment(text, vocabulary, tokenizer,
max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment,
percent*100))
# test negative text
text = 'This is a bad movie.'
percent, sentiment = predict_sentiment(text, vocabulary, tokenizer,
max_length, model)
print('Review: [%s]\nSentiment: %s (%.3f%%)' % (text, sentiment,
percent*100))

Summary

I hope you enjoyed this read. My exploration of deep learning network configurations for NLP continues. In my next story I will again change how the vocabulary is handled and use n-gram encoding with a CNN configuration.

References

  1. Deep Learning with Python, By Francois Chollet, ISBN 9781617294433
  2. Develop Deep Learning Models on Theano and TensorFlow Using Keras, By Jason Brownlee
  3. Deep Learning, By Ian Goodfellow, Yoshua Bengio and Aaron Courville, ISBN 9780262035613
  4. Neural Networks and Learning Machines, By Simon Haykin, ISBN 9780131471399

NCB consists of a group of engineers and specialists with extensive technology and business backgrounds, united by a passion for innovation, professional development and building high-quality software products. Technology has the capacity of bringing to life revolutionary ideas that can change and better the world compared to the way we know it.

info@ncb.global

NewCryptoBlock

Technology in Action

