Deep Learning — NLP (Part V-b)

Dejan Jovanovic
Published in aihive · Jun 5, 2019

Continuing from the previous story, in this post we are going to walk through an example of text preparation for sentiment analysis on a movie review dataset. The sample dataset can be found at: http://www.cs.cornell.edu/people/pabo/movie-review-data/review_polarity.tar.gz

The dataset is split into positive and negative reviews; all positive reviews are stored in the pos directory and all negative reviews in the neg directory. Each review is stored in a separate file.

First, we need to split the data into training data and testing data prior to any data preparation. This means that any knowledge in the test set that could help us better prepare the data is unavailable during data preparation and model training. In this example we will use 10% of the data for testing and 90% for training.
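In this dataset the review files are named with a cvNNN prefix (cv000 through cv999), so a simple way to hold out the last 10% is to treat every file whose name starts with cv9 as test data. Here is a minimal sketch of that convention (the sample filenames and the is_test_file helper are only illustrative):

# illustrative sample filenames following the cvNNN_xxxxx.txt pattern
sample_files = ['cv000_29416.txt', 'cv524_24886.txt', 'cv900_10331.txt', 'cv999_13106.txt']

def is_test_file(fileName):
    # files cv900..cv999 (the last 10%) are reserved for testing
    return fileName.startswith('cv9')

for fileName in sample_files:
    print(fileName, '->', 'test' if is_test_file(fileName) else 'train')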

Hence, we need to execute the following steps.

As our first step we need to load all the reviews into memory. The following code is responsible for loading all training files stored in a specified directory; in our case this means that 90% of the files will be loaded:

from os import listdir

# load training documents only from a directory
def process_documents_for_vocabulary(directory, vocabulary):
    # walk through all files in the directory
    i = 0
    print("Total number of files = %s" % len(listdir(directory)))
    for fileName in listdir(directory):
        # skip test files (cv9*) and non-text files
        if fileName.startswith('cv9') or not fileName.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + fileName
        i += 1
        print("File number = ", i, " - ", path)
        # add to vocabulary
        add_document_to_vocabulary(path, vocabulary)

We skip files that start with cv9 (these are reserved for testing) or that do not have a .txt extension. All remaining files are loaded into the vocabulary.

# load a document into memory
def load_document(fileName):
    # open the file as read only
    file = open(fileName, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# add a document to the vocabulary
def add_document_to_vocabulary(fileName, vocabulary):
    # load document
    document = load_document(fileName)
    # clean document
    tokens = clean_tokens(document)
    # update counts
    vocabulary.update(tokens)
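The vocabulary here is a collections.Counter, so each call to vocabulary.update(tokens) simply increments per-word counts across documents. A minimal sketch of that behaviour, using made-up tokens:

from collections import Counter

# hypothetical tokens from two cleaned documents
vocabulary = Counter()
vocabulary.update(['movie', 'great', 'plot', 'great'])
vocabulary.update(['movie', 'boring', 'plot'])

print(vocabulary.most_common(3))
# e.g. [('movie', 2), ('great', 2), ('plot', 2)]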

The next step is data cleaning. The following operations will be performed:

  1. Remove any punctuation from words
  2. Remove tokens that are just punctuation
  3. Remove tokens that contain numbers
  4. Remove tokens that have only one character
  5. Remove tokens that carry little meaning (stop words), such as ‘or’ and ‘and’

And here is the code example that does exactly that:

from nltk.corpus import stopwords
import re
import string

# clean and tokenize
def clean_tokens(document):
    # split document into tokens by white space
    tokens = document.split()
    # punctuation removal
    remove_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [remove_punctuation.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens
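As a quick illustration, here is roughly what clean_tokens does to a made-up review snippet (the exact output depends on the NLTK stop word list):

sample = "This movie was great - 10 out of 10, really great!"
print(clean_tokens(sample))
# roughly: ['This', 'movie', 'great', 'really', 'great']

Note that the tokens are not lowercased, so a capitalised word such as ‘This’ survives the lowercase stop word list.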

The vocabulary could end up being huge. Part of preparing for sentiment analysis involves defining and tailoring the vocabulary of words supported by the model. One can decide to support all of the words that occur in the text, or discard some based on their number of occurrences. The final set can then be saved to a file for later use.

# constants
minimumOccurence = 2
# keep only tokens that occur at least minimumOccurence times
tokens = [i for i, j in vocabulary.items() if j >= minimumOccurence]

Once the vocabulary is created, the data is saved to the file system.

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

That is all we need to do in order to create our vocabulary. Notice that we have not used the test data for the vocabulary. Here is the complete code for creating the vocabulary:

from os import listdir
from nltk.corpus import stopwords
from collections import Counter
import re
import string

# constants
vocabularyFileName = 'vocabulary.txt'
negativeDirectory = 'review_polarity/txt_sentoken/neg'
positiveDirectory = 'review_polarity/txt_sentoken/pos'
minimumOccurence = 2

# save list to file
def save_list(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load a document into memory
def load_document(fileName):
    # open the file as read only
    file = open(fileName, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# clean and tokenize
def clean_tokens(document):
    # split document into tokens by white space
    tokens = document.split()
    # punctuation removal
    remove_punctuation = re.compile('[%s]' % re.escape(string.punctuation))
    tokens = [remove_punctuation.sub('', w) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]
    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]
    return tokens

# add a document to the vocabulary
def add_document_to_vocabulary(fileName, vocabulary):
    # load document
    document = load_document(fileName)
    # clean document
    tokens = clean_tokens(document)
    # update counts
    vocabulary.update(tokens)

# load training documents only from a directory
def process_documents_for_vocabulary(directory, vocabulary):
    # walk through all files in the directory
    i = 0
    print("Total number of files = %s" % len(listdir(directory)))
    for fileName in listdir(directory):
        # skip test files (cv9*) and non-text files
        if fileName.startswith('cv9') or not fileName.endswith('.txt'):
            continue
        # create the full path of the file to open
        path = directory + '/' + fileName
        i += 1
        print("File number = ", i, " - ", path)
        # add to vocabulary
        add_document_to_vocabulary(path, vocabulary)

# define vocabulary
vocabulary = Counter()
# add all documents to vocabulary
process_documents_for_vocabulary(negativeDirectory, vocabulary)
process_documents_for_vocabulary(positiveDirectory, vocabulary)
# vocabulary size
print("Vocabulary size: ", len(vocabulary))
# keep only tokens that occur at least minimumOccurence times
tokens = [i for i, j in vocabulary.items() if j >= minimumOccurence]
# save vocabulary to a file for later use
save_list(tokens, vocabularyFileName)

### End of first step ###

### Report ###
print('*********************************************')
print('Report')
print('---------------------------------------------')
# print the size of the full vocabulary
print("Vocabulary size: ", len(vocabulary))
# how many tokens remain after filtering
print("Reduced vocabulary size: ", len(tokens))

As a final result we get the following:

Report
---------------------------------------------
Vocabulary size:  44276
Reduced vocabulary size:  25767

The file “vocabulary.txt” has been saved to the file system.
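For the next step it will be convenient to read this vocabulary back in. A minimal sketch, reusing the load_document helper defined above, could look like this:

# load the saved vocabulary back into a set for fast membership tests
vocabularyFileName = 'vocabulary.txt'
vocabulary = set(load_document(vocabularyFileName).split())
print("Loaded vocabulary size: ", len(vocabulary))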

Summary

Hope you enjoyed this reading. Now that we have the vocabulary prepared, our next step is going to be tokenizing the reviews against this vocabulary and building a Deep Learning model for sentiment analysis. We will explore these topics in the next story.

