In the next set of topics we will dive into different approaches to solve the "hello world" problem of NLP: sentiment analysis.
The code for this implementation is at https://github.com/iolucas/nlpython/blob/master/blog/sentiment-analysis-analysis/neural-networks.ipynb
We will use two machine learning libraries:
- scikit-learn to create one-hot vectors from our text and split the dataset into train, test and validation sets;
- TensorFlow to create the neural network and train it.
Our dataset is composed of movie reviews and labels telling whether the review is negative or positive. Let’s load the dataset:
The reviews file is a little big, so it is in zip format. Let's extract it:
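A minimal way to do the extraction is with Python's standard zipfile module. To keep this sketch runnable on its own, it first creates a tiny stand-in archive; in the real project you would already have the dataset's zip file next to the notebook (the file names here are placeholders):

```python
import zipfile

# Stand-in archive so the sketch runs on its own; in practice
# the dataset's zip file would already exist on disk.
with zipfile.ZipFile('reviews.zip', 'w') as zf:
    zf.writestr('reviews.txt', 'great movie\nterrible plot\n')

# Extract every member of the archive into the current directory
with zipfile.ZipFile('reviews.zip') as zf:
    zf.extractall()
```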
Now that we have the reviews.txt and labels.txt files, we load them to the memory:
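Loading the two files into memory can be as simple as reading each one and splitting on newlines, so that reviews[i] lines up with labels[i]. The tiny files written below are stand-ins so the sketch runs on its own:

```python
# Stand-in files; the real reviews.txt / labels.txt come from the dataset.
with open('reviews.txt', 'w') as f:
    f.write('great movie\nterrible plot\n')
with open('labels.txt', 'w') as f:
    f.write('positive\nnegative\n')

# One review per line, one label per line, aligned by index
with open('reviews.txt') as f:
    reviews = f.read().splitlines()
with open('labels.txt') as f:
    labels = f.read().splitlines()
```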
Next we transform our review inputs into binary vectors with the help of a scikit-learn vectorizer class:
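One class that produces this kind of binary word-presence vector is scikit-learn's CountVectorizer with binary=True (whether the notebook uses exactly this class is an assumption; the sample reviews below are stand-ins):

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = ['great great movie', 'terrible plot']  # stand-in reviews

# binary=True marks each word as present (1) or absent (0),
# instead of counting occurrences
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(reviews).toarray()
```

Each row of X is one review; each column corresponds to one word of the learned vocabulary.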
After that we split the data into training and test sets with the train_test_split function. We then split the test set in half to generate a validation set:
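A sketch of this double split (the 20% held-out fraction and the random seed are assumptions, and the arrays are stand-ins for the vectorized reviews and labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)   # stand-in feature vectors
y = np.array([0, 1] * 10)          # stand-in labels

# Hold out 20% of the data, then split that held-out part in half:
# one half becomes the test set, the other the validation set.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)
```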
We then define two functions: label2bool, to convert the string label into a binary vector of two elements, and get_batch, a generator that yields parts of the dataset on each iteration:
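These two helpers could look like the following (the exact encoding, [1, 0] for "positive" and [0, 1] for "negative", is an assumption):

```python
def label2bool(label):
    # 'positive' -> [1, 0], everything else -> [0, 1]  (assumed encoding)
    return [1, 0] if label == 'positive' else [0, 1]

def get_batch(data, labels, batch_size):
    # Yield successive (data, labels) slices of size batch_size;
    # the last batch may be smaller than batch_size.
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size], labels[start:start + batch_size]
```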
TensorFlow connects expressions in structures called graphs. We first clear any existing graph, then get the vocabulary length and declare placeholders that will be used to feed in our text data and labels:
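Using the TF1-style API (accessed through tf.compat.v1 on newer TensorFlow versions, which is an assumption; the original notebook targets TF1 directly), this step might look like:

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

vocab_len = 1000  # stand-in; in practice, the size of the vectorizer's vocabulary

tf.reset_default_graph()  # clear any existing graph

# First dimension is None so we can feed batches of any size
inputs = tf.placeholder(tf.float32, [None, vocab_len], name='inputs')
targets = tf.placeholder(tf.float32, [None, 2], name='targets')
```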
This post does not intend to be a TensorFlow tutorial; for more details visit https://www.tensorflow.org/get_started/
We then create our neural network:
- h1 is the hidden layer that receives the text word vectors as input;
- logits is the final layer that receives h1 as input;
- output is the result of applying the sigmoid function to the logits;
- loss is the loss expression to calculate the current error of the neural network;
- optimizer is the expression to adjust the weights of the neural network in order to reduce the loss expression;
- correct_pred and accuracy are used to calculate the current accuracy of the neural network, ranging from 0 to 1.
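Putting those pieces together, a sketch of such a network in the TF1-style API (layer sizes, the ReLU activation, the initializers and the Adam optimizer are assumptions; the notebook's exact choices may differ):

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

vocab_len, hidden_units = 1000, 10  # stand-in sizes

tf.reset_default_graph()
inputs = tf.placeholder(tf.float32, [None, vocab_len])
targets = tf.placeholder(tf.float32, [None, 2])

# h1: hidden layer fed with the text word vectors
W1 = tf.Variable(tf.truncated_normal([vocab_len, hidden_units], stddev=0.1))
b1 = tf.Variable(tf.zeros([hidden_units]))
h1 = tf.nn.relu(tf.matmul(inputs, W1) + b1)

# logits: final layer fed with h1, one value per class
W2 = tf.Variable(tf.truncated_normal([hidden_units, 2], stddev=0.1))
b2 = tf.Variable(tf.zeros([2]))
logits = tf.matmul(h1, W2) + b2

# output: sigmoid applied to the logits
output = tf.nn.sigmoid(logits)

# loss: current error of the network
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))

# optimizer: adjusts the weights to reduce the loss
optimizer = tf.train.AdamOptimizer().minimize(loss)

# correct_pred / accuracy: fraction of correct predictions, from 0 to 1
correct_pred = tf.equal(tf.argmax(output, 1), tf.argmax(targets, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
```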
We then train the network, periodically printing its current accuracy and loss:
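The training loop could be sketched as below. Everything here is a scaled-down stand-in so it runs end to end: random toy data, a single linear layer instead of the full network, an inlined minimal get_batch, and assumed values for the learning rate, batch size and number of epochs:

```python
import numpy as np
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Tiny stand-in problem so the loop runs end to end
vocab_len = 8
X = np.random.rand(32, vocab_len).astype(np.float32)
y = np.eye(2)[np.random.randint(0, 2, 32)].astype(np.float32)

tf.reset_default_graph()
inputs = tf.placeholder(tf.float32, [None, vocab_len])
targets = tf.placeholder(tf.float32, [None, 2])

# Single linear layer as a stand-in for the full network above
W = tf.Variable(tf.zeros([vocab_len, 2]))
b = tf.Variable(tf.zeros([2]))
logits = tf.matmul(inputs, W) + b
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=targets, logits=logits))
optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
accuracy = tf.reduce_mean(tf.cast(
    tf.equal(tf.argmax(logits, 1), tf.argmax(targets, 1)), tf.float32))

def get_batch(data, labels, batch_size):
    # Minimal batch generator, same idea as the helper defined earlier
    for s in range(0, len(data), batch_size):
        yield data[s:s + batch_size], labels[s:s + batch_size]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(5):
        for batch_x, batch_y in get_batch(X, y, 8):
            sess.run(optimizer, {inputs: batch_x, targets: batch_y})
        # Periodically print the current loss and accuracy
        l, a = sess.run([loss, accuracy], {inputs: X, targets: y})
        print('epoch', epoch, 'loss', l, 'accuracy', a)
```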
With this network we got an accuracy of 90%! With more data and a bigger network we could improve this result even further!