The Common Approach to Binary Classification

The most generic way to set up your deep learning models to categorize movie reviews

Jake Batsuuri
Computronium Blog
90 min read · May 29, 2020


Here are the topics of study in this article:

  • Input & output
  • Encoding & decoding
  • Validating
  • Predicting
  • Regularization

The General Approach

The general approach to building deep learning algorithms is to first know what data you have and in what format, then figure out what your goal is with the data. This step is what we call the Input and Output Step. At this stage, we should also partition the input examples into a Training Set and a Testing Set.

At this 30'000-foot view, we know in general terms whether we have a classification or a regression problem. With this information we figure out how to transform our input data into a viable input for the neural network. This stage we call the Encode and Decode Step, also known in the real world as the Preprocessing Step.

We know the rough architecture of our neural network, but in the Architecture Step we commit to the layer architecture specifically: based on how complicated the input data and the task are, you will use more or fewer layers.

The next step is the Components Step, when you will choose the loss function, the metrics and the optimizer. The metrics are what you will use later to decide when to stop training, which is how we regularize here.

And finally, the most important step: training. In the Training Step, we choose a rough number of epochs to iterate over, as well as the batch size. This step may also be called the Fitting Step, because you are fitting the model to the data.

In the last step, the Evaluation Step, you evaluate all the metrics and stop the fitting process right before it over-fits. The model freezes its weights at this stage, and in theory it is optimally positioned to give high accuracy on unseen data.

These steps are quite simple, but if it all still seems a bit abstract, the following is a simple example to solidify your understanding.

The Input and Output Step

We will be using the IMDB movie data set to do binary classification. The data set contains 50'000 movie reviews that are pretty polar: there are no reviews that are neutral or hard to guess. Some are very positive reviews and the others are very negative. The entire set is divided equally into the 2 types, 25'000 positive reviews and 25'000 negative.

Now we partition the data into the 2 sets: Training Set and Testing Set.

The Training Set consists of 50% positive reviews and 50% negative reviews, totaling 25'000, while the Testing Set also consists of 50% positive and 50% negative reviews, totaling 25'000, exactly mirroring the other set.

So let’s import our data set from keras.datasets so that we can put it into training (examples + labels) and testing (examples + labels) while taking only the top 10'000 most frequently occurring words in the entire review data set.
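Something like this should do it. I'm using the standalone keras package here; with TensorFlow 2, tensorflow.keras works the same way:

```python
from keras.datasets import imdb

# Keep only the 10'000 most frequently occurring words in the data set
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

print(train_data.shape)   # (25000,) training reviews
print(test_data.shape)    # (25000,) testing reviews
```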

The stack for this project is a Jupyter Notebook, Keras, and a TensorFlow backend.

Side Note

This is a simple natural language processing project. Given a bunch of text, we are trying to have our computer understand whether it's a positive review or a negative review. But our model doesn't understand the meaning of the word "well" in "well directed" or "bad" in "bad acting". It simply learns to associate the words with labels. As such, we don't give the words as strings to the model; we encode them as vectors that can be manipulated with linear transformations. This kind of encoding is called one-hot encoding. We find the highest frequency words in the entire set, rank them by frequency, and take only the top x number of words, in this case 10'000.

Generally we encode a word as [0 0 0 1 … 0 0]; this may be the word "bad", while the word "good" might be another unique one-hot configuration, say [0 1 0 0 … 0 0]. We generally wanna keep the size of this vector small, because our computers have processing limitations and we don't want training to take too long.

Side Note 2

I recommend you set up a TensorFlow Jupyter Docker container on your machine for your projects. Another option, if you want to prototype quickly, is to use Google Colaboratory. It's a ready-to-use, no-config notebook that runs on a Compute Engine instance with generous RAM and storage.

Side Notes Are Over

The next thing to encode, of course, are the labels. Our deep learning model isn't gonna magically turn numbers into strings like "positive review" or "negative review". The labels are encoded as 1 for positive and 0 for negative as well.

The IMDB data set also comes with a word-index dictionary that maps each word string to its integer index. You can use it to decode the integer sequences back into English.
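As a rough sketch of how that decoding might look: imdb.get_word_index() is the Keras helper, and the offset of 3 accounts for the reserved indices for padding, start-of-sequence, and unknown words, which is also why the decoded review below starts with a '?':

```python
word_index = imdb.get_word_index()   # maps words -> integer indices
reverse_word_index = {value: key for (key, value) in word_index.items()}

# Indices 0, 1 and 2 are reserved for padding, start-of-sequence and unknown,
# so we shift by 3 and fall back to '?' for anything we can't decode.
decoded_review = ' '.join(
    reverse_word_index.get(i - 3, '?') for i in train_data[0]
)
print(decoded_review)
```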

Based on the task and the data, we now know that we will be classifying text into 2 groups, a classification problem. We roughly know that our input vectors will go into our neural network and come out as either a 1 or a 0.

Encoding and Decoding Step

A typical review looks something like this, in text form:

? this film was just brilliant casting location scenery story direction everyone’s really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy’s that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don’t you think the whole story was so lovely because it was true and was someone’s life after all that was shared with us all

When we consider that this is just a single review and encode the entire review into an array of integers, we get a long list of numbers, one index per word, ranked by frequency.

Now, if we wanna express our series of words as a matrix, or 2D tensor, we use this function:
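A sketch of what that function might look like; the name vectorize_sequences is mine, and the default dimension of 10'000 matches the vocabulary size we chose earlier:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set the indices of the words that appear in this review to 1
        results[i, sequence] = 1.
    return results
```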

Here we turn the array of integers into a tensor.

  • One-hot encode your lists to turn them into vectors of 0s and 1s
  • Then you can use a Dense layer, which is able to handle float vectors
  • Define the function that vectorizes the input
  • The function creates an all-zero matrix of shape (len(sequences), dimension)
  • Then it sets the specific indices of results[i] to 1
  • If we remove the truncation on our print function, we can print the full vectorized x_train[0] and x_test[0]: mostly 0s, with 1s at the indices of the words that appear in each review
  • We also want to vectorize our labels
  • This converts the labels from [int] to [float32], as shown in the sketch after this list
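Putting those bullet points into code, the preprocessing might look roughly like this, reusing the vectorize_sequences sketch from above:

```python
# Vectorize the reviews: each becomes a 10'000-dimensional vector of 0s and 1s
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the labels, from [int] to [float32]
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
```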

Sorry for spamming you with all those zeroes and ones, but I think printing out just 1 element of the array illustrates how insane and impossible it would be for a person to do this by hand. Computers are truly a miracle. Our inputs and outputs are encoded now.

Architecture Step

So far we have input vectors and label vectors. Our setup will work best with Dense layers using relu activation functions. How do we know that?

  • Dense signifies that all the layers are fully connected

Remember that a single Dense layer implements the operation output = relu(dot(input, W) + b).

  • Where 16 is the number of hidden units in the layer
  • Having 16 units means the weight matrix W will have the shape (input_dimension, 16)
  • The dot product with W will project the input data onto a 16-dimensional representation space, then you add the bias and run it through the activation function (sketched in code after this list)
  • A hidden unit is a dimension in the representation space of the layer.
  • You can think of the dimensionality of the representation space as “how much freedom you’re allowing the network to have when learning internal representations”
  • The more hidden units you have, the more complex the representations your network can learn, at the expense of compute time. So generally, you wanna pick a dimensionality that is just enough for the complexity of the data.
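To make those bullet points concrete, here is a rough numpy sketch of what one such layer computes. The shapes and random values are illustrative only:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.)

# Illustrative shapes: a batch of 2 reviews, each a 10'000-dim vector,
# projected onto a 16-dimensional representation space.
x = np.random.randint(0, 2, size=(2, 10000)).astype('float32')
W = np.random.randn(10000, 16).astype('float32') * 0.01
b = np.zeros(16, dtype='float32')

output = relu(np.dot(x, W) + b)
print(output.shape)   # (2, 16)
```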

Now that we know what a single layer does, how do we choose how many layers to stack together?

In general, neural networks are just chained layers. Chained in the sense that the layers form a simple sequence, like a linked list, and the data flows through these objects one after another.

Since these are simple linear structures, the only question is how many layers to stack together. Often you will find even a single layer can start to work.

With deeper networks, you can make the layers smaller and use fewer parameters. However, the optimization gets more difficult.

The optimal architecture is a matter of experimentation and evaluation on a validation set.

In summary, in the Architecture Step, you wanna cover 3 main areas of concern:

  • How many layers to use, in other words how deep
  • How many hidden units to choose for each layer, in other words how wide
  • What the output layer should be, so it converts the network's final representation into a usable format

For now we’re using 2 intermediate layers with 16 hidden units each. The third layer will output the scalar value predicting the sentiment of the current review.

Our code looks like this so far.
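Roughly, as a sketch using the Keras Sequential API, with the two 16-unit relu layers and the sigmoid output described here:

```python
from keras import models, layers

model = models.Sequential()
# Two intermediate layers with 16 hidden units each, using relu
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
# Final layer outputs a single probability between 0 and 1 with a sigmoid
model.add(layers.Dense(1, activation='sigmoid'))
```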

The hidden layers will use relu, but the final layer will use a sigmoid to produce a probability value.

The Components Step

The Loss Function

We use a binary_crossentropy loss. You could also use mean_squared_error, but cross entropy is usually best for outputs that are probabilities; we tend to use mean squared error for regression problems. The binary in binary_crossentropy is for binary classification, while categorical_crossentropy is used for multi-class classification problems.

Cross entropy comes to us from Information Theory; it is a measure of the distance between probability distributions.
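For a true label y in {0, 1} and a predicted probability p, binary cross entropy is -(y·log(p) + (1-y)·log(1-p)), averaged over the batch. Here is a rough numpy version, just for intuition; Keras computes this for us:

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Clip predictions to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident correct predictions give a small loss,
# confident wrong predictions give a large one.
print(binary_crossentropy(np.array([1., 0.]), np.array([0.9, 0.1])))  # ~0.105
print(binary_crossentropy(np.array([1., 0.]), np.array([0.1, 0.9])))  # ~2.303
```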

The Optimizer

The loss function produces a loss score, which the optimizer uses to determine how the network's layer weights will be updated.

For our project we will use rmsprop as the optimizer. We can also pass in custom loss or metric functions, or even configure the parameters of the optimizer.

The Metrics

The metrics are the last piece of the puzzle before we start. We use the metrics to decide when to stop training and regularize, which you will see next. Our code looks like this so far.
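As a sketch, compiling the model with the components chosen above (rmsprop, binary_crossentropy, and accuracy as the metric to monitor):

```python
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```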

Training Step

Out of the training data, we will set apart 10'000 samples to use as a validation set.
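Roughly like this, slicing the first 10'000 vectorized samples off as the validation set:

```python
# Hold out the first 10'000 samples for validation
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

y_val = y_train[:10000]
partial_y_train = y_train[10000:]
```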

Then we put everything together into this nice piece of code and watch it train.
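Something like the following. Note that the 20 epochs and batch size of 512 are my illustrative choices here; the point is to train long enough that we can watch over-fitting appear in the next step.

```python
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,        # assumed; enough to see over-fitting
                    batch_size=512,   # assumed batch size
                    validation_data=(x_val, y_val))
```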

Our output should show the per-epoch loss and accuracy for both the training and validation data.

Evaluation Step

The above print is a little disgusting and difficult to analyze so let’s graph this baby.

Note that the call to model.fit() returns a History object. This object has a member history, which is a dictionary containing data about everything that happened during training.
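For instance, a quick way to peek inside it:

```python
history_dict = history.history
print(history_dict.keys())
```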

Which outputs the dictionary keys: loss, accuracy, val_loss and val_accuracy, each holding one value per epoch.

Let’s try plotting this with matplotlib, and see what we get.
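A plotting sketch along those lines, using only matplotlib; the key names match the metrics above (older Keras versions call them acc and val_acc):

```python
import matplotlib.pyplot as plt

loss = history_dict['loss']
val_loss = history_dict['val_loss']
acc = history_dict['accuracy']
val_acc = history_dict['val_accuracy']
epochs = range(1, len(loss) + 1)

# Training and validation loss per epoch
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Training and validation accuracy per epoch
plt.clf()
plt.plot(epochs, acc, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
```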

Which gives us these nice graphs with the loss and accuracy per epoch for the training and validation sets.

As you can see, the training loss decreases with every epoch, and the training accuracy increases with every epoch. That's good. But we plot the validation curves to see when the validation loss is at its lowest. After about the 4th epoch we keep getting better and better at predicting our training data, but this no longer translates to new, unseen data.

So to prevent this over-fitting, we stop our training at the 4th epoch. This gives us the best generalized, well-fitted model our neural network can offer. To do this, you have to compile and train the model from scratch again, with 4 epochs this time.
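A sketch of that retraining, rebuilding the same model and stopping at 4 epochs:

```python
# Rebuild the same architecture from scratch
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Stop at 4 epochs, right before the validation loss starts climbing
model.fit(partial_x_train, partial_y_train,
          epochs=4,
          batch_size=512,
          validation_data=(x_val, y_val))
```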

This gives us the result.

You can graph this too, but the values we are looking for are accuracy and val_accuracy. Our accuracy on the training set is 93% and on the validation set it is 89%.

Using The Model

After having trained a network, you’ll want to use it in a practical setting. You can generate the likelihood of reviews being positive by using the predict method.

Here we plot the predictions for the first 100 test reviews.
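A sketch of that prediction and plot; slicing to the first 100 values is just for readability:

```python
import matplotlib.pyplot as plt

# Probability of each test review being positive, one value per review
predictions = model.predict(x_test)

# Plot the first 100 predicted probabilities
plt.plot(range(100), predictions[:100], 'bo')
plt.xlabel('Test review index')
plt.ylabel('Predicted probability of a positive review')
plt.show()
```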

We see a fairly polar distribution, which is good, since values closer to 0 are negative reviews and values closer to 1 are positive reviews.

Just to give it a human look, here's the third movie review from the test set, which got a 0.74, a positive-ish score. While 0.99 would be a strong result, values in the range of 0.2 to 0.8 are not to be trusted, because the sigmoid saturates near 0 and 1 and is steep through the middle, so mid-range outputs are low-confidence guesses.

? i generally love this type of movie however this time i found myself wanting to kick the screen since i can’t do that i will just complain about it this was absolutely idiotic the things that happen with the dead kids are very cool but the alive people are absolute idiots i am a grown man pretty big and i can defend myself well however i would not do half the stuff the little girl does in this movie also the mother in this movie is reckless with her children to the point of neglect i wish i wasn’t so angry about her and her actions because i would have otherwise enjoyed the flick what a number she was take my advise and fast forward through everything you see her do until the end also is anyone else getting sick of watching movies that are filmed so dark anymore one can hardly see what is being filmed as an audience we are ? involved with the actions on the screen so then why the hell can’t we have night vision

My guess is that this review has phrases like "i generally love" and "otherwise enjoyed the flick", and our model isn't smart enough to pick up on the subtleties of natural language, or on sarcasm and all those things.

In the upcoming articles, we will explore topics such as meaning, context and logical consequence to be able to build better natural language processing models.


Up Next…

Coming up next is the architectural design of neural networks. If you would like me to write another article explaining a topic in-depth, please leave a comment.

For the table of contents and more content click here.

