Multinomial Logistic Regression In a Nutshell

Logistic Regression on the Fashion MNIST Dataset

Introduction

Logistic regression is one of the most frequently used models in classification problems. It can accurately predict the probability of a person having a certain disease, the probability of a person getting a ticket when speeding, or the probability of a sports team winning a game. Notice that these examples are binary, meaning the logistic regression has only two possible outcomes: a “Yes” or a “No”. We call this binary logistic regression.

There is another type of logistic regression that can predict more than two outcomes. This is multinomial (multiclass) logistic regression (MLR).

In this tutorial, we will not be using any external package functions to build our model. Instead, we will be building a multinomial logistic regression model from scratch, only using numpy and seemingly complex mathematics. Don’t fret, I will explain the math in the simplest form possible.

Michael Scott from The Office

Prerequisites

Before we dive into the definition of multinomial logistic regression, I assume that you are familiar with the concept of binary logistic regression. If not, check out this video for binary logistic regression. You should also know about the overall structure of a neural network. If not, check out this article for more.

Multinomial Logistic Regression… and More

To learn about multinomial logistic regression, let's first remind ourselves of the components of a binary logistic regression model.

In binary logistic regression, we have:

  • Sigmoid function, which maps a real-valued input to the range 0 to 1
  • Maximum likelihood estimation (MLE), which maximizes the probability of the data
  • Gradient descent, which searches for the parameter values that optimize the MLE objective

In multinomial logistic regression, we have:

  • Softmax function, which turns all the inputs into positive values and maps them to the range 0 to 1 so that they sum to 1
  • Cross-entropy loss function, which measures how far the predicted probability vectors are from the one-hot encoded Y (response) vectors; minimizing it pushes the predictions toward the correct labels
  • Stochastic gradient descent, which is gradient descent computed on randomly sampled training examples rather than the full dataset

MLR shares its steps with binary logistic regression; the only difference is the function used at each step. In binary logistic regression, the Sigmoid function is used because the problem is binary. In MLR, we use the Softmax function because the problem is no longer binary: Softmax distributes a probability across every output node. And because the activation function changes, the loss function changes too, since the loss function depends on the activation function.
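
As a quick illustration of the two activation functions, here is a minimal numpy sketch (separate from the full model we build below):

```python
import numpy as np

def sigmoid(z):
    # Binary case: squashes one real-valued score into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Multiclass case: turns a vector of scores into probabilities that sum to 1
    z = z - np.max(z)          # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

print(sigmoid(2.0))                          # ~0.88: probability of the positive class
print(softmax(np.array([2.0, 1.0, 0.1])))    # ~[0.66, 0.24, 0.10]: one probability per class
```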

Lastly, we use stochastic gradient descent rather than regular gradient descent because our data has many features and many examples; computing the full gradient at every step would take too much computation. We will come back to this topic later when we implement stochastic gradient descent in our code. Now that we know MLR in words, let's see what MLR looks like visually.

Figure 1: Structure of multinomial logistic regression

Does the graph above look familiar? It should! MLR shares a similar structure with neural networks. In fact, MLR follows the structure of a perceptron, and a multi-layer perceptron is a neural network. There are three commonly used kinds of neural network. A feedforward neural network (ANN) is a network in which the connections between nodes proceed only in the forward direction. A recurrent neural network (RNN) also feeds information back into itself, repeating parts of the computation over the sequence. A convolutional neural network (CNN) extracts important features with filters (convolutions) as the data passes through the network (Figure 2).

Compared with multi-layer neural networks, MLR is much easier to implement because it is simple: it requires only a single layer, which means far fewer calculations. However, since MLR is a less complex model, its accuracy will not be as high as that of neural network models.

Figure 2: ANN (left), RNN (middle), and CNN (right)

Now that we understand multinomial logistic regression, let's apply our knowledge. We'll build the MLR by following the structure shown in the graph above (Figure 1).

Data

Our data will be the Fashion MNIST dataset from Kaggle. The dataset is stored as a DataFrame with 60,000 rows, where each row represents an image. The DataFrame has 785 columns: the first column holds the label of the image (Figure 3.2), and the remaining 784 columns contain the pixel values of each training image (Figure 3.1). Each pixel value ranges from 0 to 255, representing its grayscale intensity.
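
If you want to follow along, a minimal sketch of loading the Kaggle CSV with pandas might look like this (the file and column names follow the Kaggle download and are assumptions; adjust the path to wherever you saved the data):

```python
import pandas as pd

# Load the Kaggle Fashion MNIST training CSV (file name is an assumption; adjust the path)
df = pd.read_csv("fashion-mnist_train.csv")

print(df.shape)               # (60000, 785): 1 label column + 784 pixel columns
print(df["label"].nunique())  # 10 distinct classes
```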

Figure 3.1. Sample image of a shirt from the training set.
Figure 3.2. Labels

Task:

  • Split the DataFrame into DataFrame X and DataFrame Y
  • Convert DataFrame X to an array
  • One-hot encode the Y values and convert DataFrame Y to an array

We use a one-hot encoder to transform the original Y values into one-hot encoded Y values because our predicted values are probabilities. I will explain this in the next step.

Figure 4. Given X- and Y-values and desired X- and Y-values
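
Continuing from the loading sketch above, these three steps could look roughly like this (I also scale the pixels to [0, 1], a common but optional preprocessing step):

```python
import numpy as np

# Split the DataFrame into the labels (Y) and the pixel features (X)
Y = df["label"].values                    # shape (60000,)
X = df.drop(columns=["label"]).values     # shape (60000, 784)
X = X / 255.0                             # optional: scale pixel values to [0, 1]

# One-hot encode Y: row i has a 1 in column Y[i] and 0 everywhere else
num_classes = 10
Y_onehot = np.zeros((Y.shape[0], num_classes))
Y_onehot[np.arange(Y.shape[0]), Y] = 1
```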

Score & Softmax

Task:

  • Compute the score values
  • Define an activation function
  • Run the activation function to turn the scores into probabilities

Looking at Figure 1, the next step is to compute the dot product between the vector of features and the vector of weights. Our initial weight vector will be an array of zeros because we do not yet have any better values. Don't worry, the weights will keep updating as the loss function is minimized. The dot product is called the score. This score is the deciding factor that predicts whether our image is a T-shirt/top, a dress, or a coat.

Figure 5. Softmax function. Photo credit to Wiki Commons

Before we use the scores to predict the label, we have two problems. Remember that we one-hot encoded our Y values because our predicted values are probabilities? Our current scores are not probabilities: they contain negative values and do not fall within the range 0 to 1. So we need to apply the Softmax function to normalize the scores. This exponential normalization converts our scores into positive values and turns them into probabilities (Figure 5).

In the array of probability values for the possible classes, the argmax of the probabilities gives the predicted Y value. For example, in an array of 10 probabilities, if the 5th element has the highest probability, then the image is labeled as a coat, since the 5th entry in the Y labels is Coat (Figure 3.2).

Figure 6. Score and Softmax functions in Python
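
Figure 6 is an image; continuing from the arrays prepared above, a sketch of what the score and Softmax functions might look like in numpy is:

```python
def score(X, W):
    # Dot product between the feature matrix and the weight matrix: one score per class
    return X @ W                                    # shape (n_samples, 10)

def softmax_rows(scores):
    # Row-wise Softmax: each row becomes a probability distribution over the 10 classes
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract the max for stability
    exp_s = np.exp(shifted)
    return exp_s / exp_s.sum(axis=1, keepdims=True)

W = np.zeros((784, 10))                             # initial weights are all zeros
probs = softmax_rows(score(X, W))                   # shape (60000, 10)
preds = probs.argmax(axis=1)                        # predicted label = argmax of each row
```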

Gradient Descent & Loss Function

Task:

  • Define a gradient function
  • Define a loss function
  • Optimize the loss function

After the Softmax function computes the probability values in the first iteration, there is no guarantee that the argmax matches the correct Y value. We need to iterate multiple times until we are confident in our argmax. To quantify how far off our predictions are, we need to set up a loss function. We will use the cross-entropy loss.

Figure 7. Cross-entropy loss in Python
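
Figure 7 is also an image; one possible numpy version of the cross-entropy loss, assuming one-hot targets and the row-wise probabilities from the Softmax sketch above, is:

```python
def cross_entropy(Y_onehot, probs, eps=1e-12):
    # Average negative log-probability assigned to the correct class
    probs = np.clip(probs, eps, 1.0)          # avoid log(0)
    return -np.sum(Y_onehot * np.log(probs)) / Y_onehot.shape[0]
```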

The way to maximize correctness is to minimize the cross-entropy loss. To do that, we will apply gradient descent; specifically, we will use stochastic gradient descent. Stochastic gradient descent follows the same idea as regular gradient descent. The term “stochastic” means random: the gradient is computed on a randomly selected sample of training examples. Instead of taking the gradient over the entire dataset, we only calculate the gradient for that sample. The purpose of stochastic gradient descent is to cut down the computation per update and save time.

Figure 8. Gradient descent and stochastic gradient descent formulas
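
In standard textbook notation (not necessarily the exact notation used in Figure 8), the two update rules are, with $w$ the weights, $\eta$ the learning rate, $L$ the loss over the full dataset, and $L_i$ the loss on a single randomly sampled example:

```latex
% Gradient descent: step against the gradient of the loss over the full dataset
w \leftarrow w - \eta \, \nabla_{w} L(w)

% Stochastic gradient descent: same step, but the gradient uses one randomly sampled example i
w \leftarrow w - \eta \, \nabla_{w} L_{i}(w)
```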

In order to achieve randomness, we shuffle the order of the X array (a permutation). Each time we sample an image from the shuffled array, we compute the stochastic gradient and update the weights. The updated weights are then used to push the loss function toward its minimum. One full pass over the training data is called an epoch. Typically, more epochs lead to better results since there is more training involved. However, too many epochs lead to overfitting. Choosing a good number of epochs depends on the loss values; there is an article that talks about how to choose a good number of epochs here.
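
A sketch of a single stochastic update, following the description above (the gradient formula is the standard one for Softmax with cross-entropy; the learning rate is an illustrative value, and the full shuffled training loop appears in the next section):

```python
def sgd_step(W, x, y_onehot, lr=0.01):
    # Forward pass for one sampled image: scores -> probabilities
    probs = softmax_rows(score(x[None, :], W))[0]   # shape (10,)
    # Gradient of the cross-entropy loss w.r.t. W for this single example
    grad = np.outer(x, probs - y_onehot)            # shape (784, 10)
    # Move the weights a small step against the gradient
    return W - lr * grad
```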

Train & Test

Task:

  • Define a training set and a test set
  • Train our samples
  • Visualize our loss values

Now that we have optimized the loss function, let's test our model on our data. Our training sample has 60,000 images; we will split 80% of the images into the train set and the other 20% into the test set, following the familiar 80/20 (Pareto) rule of thumb. While we fit the model, let's keep track of the loss values to see whether the model is working correctly.
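
A sketch of the split and the training loop, recording the loss after every epoch (the split ratio follows the text; the number of epochs is illustrative, and the learning rate defaults to the value from the earlier sketch):

```python
# 80/20 train/test split
n = X.shape[0]
idx = np.random.permutation(n)
train_idx, test_idx = idx[:int(0.8 * n)], idx[int(0.8 * n):]
X_train, Y_train = X[train_idx], Y_onehot[train_idx]
X_test, Y_test = X[test_idx], Y[test_idx]          # keep integer labels for accuracy later

W = np.zeros((784, 10))
losses = []
for epoch in range(5):                             # illustrative number of epochs
    # Shuffle the training images and update the weights one sampled image at a time
    for i in np.random.permutation(X_train.shape[0]):
        W = sgd_step(W, X_train[i], Y_train[i])
    # Track the training loss after each epoch
    loss = cross_entropy(Y_train, softmax_rows(score(X_train, W)))
    losses.append(loss)
    print(f"epoch {epoch + 1}: loss {loss:.4f}")
```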

Figure 9. Losses after iterations

We can clearly see that the value of the loss function decreases substantially at first, and that's because the initial predicted probabilities are nowhere close to the target values, so the loss starts far from its minimum. As we get closer to the minimum, the error becomes smaller because the predicted probabilities are getting more and more accurate.

Accuracy

After fitting the model on the training set, let’s see the result for our test set predictions.
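
A sketch of scoring the test set: take the argmax of each row of predicted probabilities and compare it with the true labels:

```python
test_probs = softmax_rows(score(X_test, W))        # shape (n_test, 10)
test_preds = test_probs.argmax(axis=1)             # predicted class per test image
accuracy = (test_preds == Y_test).mean()
print(f"Test accuracy: {accuracy:.2%}")
```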

Figure 10. Prediction result

It looks like our accuracy is about 85% (Figure 11), which is not so bad.

Figure 11. Accuracy Score

Challenge

Now that we have our initial predictions, see if you can improve the accuracy by adjusting the parameters of the model or adding more features. Tuning parameters such as the learning rate and the number of epochs is a good place to start. I've attached the code and datasets for you to play around with. Enjoy!
