ML from Scratch: Multinomial Logistic Regression
Your Complete Guide to Multinomial Logistic Regression, a.k.a. Softmax Regression
When it comes to real-world machine learning, a large share of problems are classification problems: based on the available set of features, the model tries to predict which of a given set of categories (discrete possible outcomes) the target variable belongs to. In this article, we are going to look at Multinomial Logistic Regression, one of the classic supervised machine learning algorithms for multi-class classification, i.e., predicting an outcome for the target variable when there are more than two possible discrete classes.
This is a project-based guide in which we will code an MLR model from scratch while understanding the mathematics that allows the model to make predictions.
For the project, we will be working on the famous UCI Cleveland Heart Disease dataset. We will create an ML model from scratch that uses multinomial logistic regression to predict the severity of heart disease in a patient.
Multinomial Logistic Regression Basics
Before we start working on the actual project, let us first familiarize ourselves with the basic idea behind MLR: what it is, what it does, and how it operates.
What exactly is Multinomial Logistic Regression?
You can think of multinomial logistic regression as logistic regression (more specifically, binary logistic regression) on steroids. While binary logistic regression can only predict binary outcomes (e.g., yes or no, spam or not spam, 0 or 1), MLR can predict one out of k possible outcomes, where k can be any number of discrete classes.
How does MLR work?
Multinomial logistic regression is a statistical classification algorithm: once we feed it a set of features, the model performs a series of mathematical operations that map the input values to a vector of values forming a probability distribution over the target classes.
- The input that we give to the model is a feature vector X, containing features x1, x2, …, xn.
- The output we get is a probability vector Y, containing probabilities y1, y2, …, yk for the k target classes.
- Here, y1 + y2 + … + yk = 1, since the total probability of all the possible outcomes in a system is always 1.
Finally, the outcome with the highest probability will be the predicted outcome for the given feature set.
Now the question is, how exactly does the MLR function convert feature sets to probability values? We will try to understand this while working on our project.
Importing Project Dependencies
Before we begin working on the project, let us first import all the necessary modules and packages.
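A minimal set of imports that would cover this project (illustrative, not the author's original notebook cell):

```python
# NumPy for the linear algebra, pandas for loading and cleaning the data
import numpy as np
import pandas as pd
```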
Now, we will import the dataset. According to the data source, the dataset does not have column names. So we will set the header attribute as None and then we will manually set the column names as per the information available on the source.
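A sketch of this step; the file path is illustrative, while the column names follow the UCI documentation for the processed Cleveland file:

```python
# The raw Cleveland file ships without a header row, hence header=None
column_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']
df = pd.read_csv('processed.cleveland.data', header=None, names=column_names)
```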
Understanding the Data
Now that we have imported the dataset, let us try to understand what each of these columns denotes.
Here, the num column is our target variable, with the values ranging from 0 (no disease present) to 4 (high chances of heart disease).
Now that we know exactly what our dataset represents, let us move on to the next step.
Data Preprocessing
Now, let us analyze the data and see if it needs any cleaning or modifications. As the first step of our data preprocessing, we will check if there are any null values that need to be dealt with.
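A minimal sketch of such a check; in this dataset the missing entries are marked with the character '?' rather than NaN, so a plain isnull() check is complemented with a count of '?' values:

```python
# Standard NaN check plus a count of the '?' placeholders used by this dataset
print(df.isnull().sum())
print(df.isin(['?']).sum())
```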
As we can see, the columns ca and thal have 4 and 2 ‘?’ values respectively. These are null values that we need to deal with. Since both these columns consist of categorical values, we will replace the null values with the median of the respective columns. We will also type-cast the two columns to ‘float64’ values.
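One way this cleaning step might look (a sketch, not the original notebook code):

```python
for col in ['ca', 'thal']:
    # Compute the median over the valid (non-'?') entries only
    median_value = df.loc[df[col] != '?', col].astype('float64').median()
    # Replace the '?' placeholders with the median and cast the column to float64
    df[col] = df[col].replace('?', median_value).astype('float64')
```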
Now that we have cleaned our data, let us have a look at the statistical analysis of our dataset.
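For example:

```python
# Summary statistics (count, mean, std, min/max, quartiles) for every column
print(df.describe())
```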
Upon observing the data, we can see that the features sit on very different scales, so the data needs to be scaled. The main reason we are scaling our data is that we will be using Stochastic Gradient Descent to optimize our model parameters, and scaling can significantly improve the speed and accuracy of the optimizer.
Here, we will use standard scaling in order to standardize the data. Standardization typically means rescaling data to have a mean of 0 and a standard deviation of 1 (unit variance). The following is the mathematical formula for standard scaling.
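$$x_{scaled} = \frac{x - \mu}{\sigma}$$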
Here,
- μ = Mean of all the values within a column
- σ = Standard deviation of the column
Now, let us implement this within our code. The first step is to split the dataset into target and feature arrays.
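A sketch of the split, assuming the cleaned DataFrame is called df:

```python
# 'num' is the target column; every other column is a feature
X = df.drop('num', axis=1).values.astype('float64')   # feature matrix, shape (303, 13)
y = df['num'].values.astype('float64')                # target vector, shape (303,)
```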
Now, let us define the function for standard scaling.
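A minimal version of such a function might look like this:

```python
def standard_scale(features):
    """Rescale every column to zero mean and unit variance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma
```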
We will now perform standardization on our feature set.
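For example:

```python
X = standard_scale(X)

# Sanity check: every column should now have mean ~0 and standard deviation ~1
print(X.mean(axis=0).round(2))
print(X.std(axis=0).round(2))
```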
As we can see, the standard deviation for each feature column is now 1, as expected of the standard scaling. We have successfully standardized our feature set.
With this, we have completed the data wrangling process. With the data cleaned and standardized, let us now start working on our model.
Formulating the Model from Scratch
As we saw earlier, the MLR model takes a vector of features as the input and then, on the basis of the features, computes the probabilities for the possible outcomes. So how exactly does the MLR model do that? Let us find out in this section, where we will code an MLR model.
The multinomial regression function consists of two functional layers-
- Linear prediction function (a.k.a. logit layer)
- Softmax function (a.k.a. softmax layer)
First, let us see what the linear predictor function does. Given below is the formula for the linear prediction function.
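$$f(k, \mathbf{x}) = b_k + w_{k,1}x_1 + w_{k,2}x_2 + \dots + w_{k,m}x_m$$

Here, b_k is the bias (intercept) for class k and w_{k,j} are the weights for class k, computed for each of the k possible classes.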
If you observe carefully, this is similar to the function that we use for the linear regression model. What it basically does is map a score for each possible outcome of our target variable onto the range (-∞, +∞).
This is somewhat similar to the log odds (logit function), which maps the odds of an event onto the range (-∞, +∞). Hence, the linear predictor function is also known as the logit function.
Now, we will see the code for the linear predictor function.
Step 1- Creating random weights and biases for our model (Since we have 5 possible target outcomes and 13 features, k = 5 and m = 13).
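A sketch of this step (the seed is purely illustrative, to make the run reproducible):

```python
np.random.seed(42)

k, m = 5, 13                # 5 possible target classes, 13 features
W = np.random.rand(k, m)    # weight matrix, one row of weights per class
b = np.random.rand(k)       # one bias per class
```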
Step 2- Defining the linear predictor function.
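A minimal version of the logit layer might look like this:

```python
def linear_predictor(X, W, b):
    """Logit layer: one raw score per class for every sample."""
    return X @ W.T + b      # shape (n_samples, k)
```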
Now, let us test the function for our features matrix. The final output should be a 303 x 5 matrix since we have 303 feature sets in our dataset and 5 possible outcomes for our target variable.
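For example:

```python
logits = linear_predictor(X, W, b)
print(logits.shape)   # expected output: (303, 5)
```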
As we can see, the function worked just fine. Now to the next step, converting logit scores to probability values. This is where the softmax function comes into the picture. Given below is the formula for the softmax algorithm.
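$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$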
What the softmax function does is normalize the logit scores for each possible outcome in such a way that the normalized outputs follow a probability distribution.
In layman’s terms, the softmax function converts logit scores of the possible outcomes of a feature set to probability values.
Now, let us define the softmax function for our model.
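A minimal sketch of the softmax layer:

```python
def softmax(logits):
    """Softmax layer: turn each row of logit scores into probabilities."""
    # Subtracting the row-wise maximum keeps the exponentials numerically stable
    exp_scores = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```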
Now that we have defined our softmax function as well, let us combine these two functions into a single multinomial function for our model.
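For example:

```python
def multinomial_logistic_regression(X, W, b):
    """Full model: logit layer followed by the softmax layer."""
    return softmax(linear_predictor(X, W, b))
```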
Now, let us run the multinomial logistic regression function on our feature set.
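A sketch of this step (variable names are illustrative):

```python
probabilities = multinomial_logistic_regression(X, W, b)

# The predicted class for each patient is the one with the highest probability
predictions = np.argmax(probabilities, axis=1)
```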
Let us now check the accuracy of our model. Since the weights and biases were randomly generated, we can’t expect our model to be very accurate at the moment.
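A minimal accuracy check might look like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return np.mean(y_true == y_pred)

print(accuracy(y, predictions))
```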
As we can see, the initial model accuracy is only about 16%, which is far too poor to consider using this model for real-life heart disease predictions.
As a result, we need to optimize our model parameters in order to improve its accuracy.
Model Optimization
Before we proceed any further towards optimizing our model, we should first split our dataset into training and test sets. Training and testing on the same dataset is considered bad practice, as it gives a misleadingly optimistic picture of your model's real-world performance.
Let us define a train_test_split function for splitting our dataset into training and testing data. We will then run it on our dataset.
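A minimal sketch of such a split, assuming a 20% test share (the seed and split ratio are illustrative):

```python
def train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle the samples and carve off a held-out test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```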
Now that we have separate training and testing datasets, we will keep the testing dataset aside and only use it for testing purposes. All the training and optimization will be performed on the training dataset.
Now we are just one step away from optimizing our model. Before we jump to optimization, we have a few questions to answer. What exactly is the criterion on the basis of which we are planning to optimize our model? What exactly is even the purpose of optimization?
The answer: we want to optimize the model in order to reduce the information loss it generates. Since the criterion for optimization is information loss, we need to define a loss function for our model.
For the multinomial regression function, we generally use the cross-entropy loss function. Given below is the formula for the cross-entropy loss function.
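$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{k} y_{ij}\,\log(p_{ij})$$

Here, N is the number of samples, y_ij is 1 if sample i actually belongs to class j (and 0 otherwise), and p_ij is the probability the model assigns to class j for sample i.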
Let us now define our cross-entropy loss function.
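A minimal sketch, together with a small one-hot encoding helper (both names are illustrative):

```python
def one_hot(y, k):
    """Encode integer class labels as one-hot row vectors."""
    encoded = np.zeros((len(y), k))
    encoded[np.arange(len(y)), y.astype(int)] = 1
    return encoded

def cross_entropy_loss(y_true, probabilities):
    """Average negative log-probability assigned to the true classes."""
    y_encoded = one_hot(y_true, probabilities.shape[1])
    # The small epsilon guards against log(0)
    return -np.mean(np.sum(y_encoded * np.log(probabilities + 1e-12), axis=1))
```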
Now that we have defined our loss function, we will finally define our optimizer algorithm. Given below is the formula on which the Stochastic Gradient Descent operates.
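$$\theta_{t+1} = \theta_t - \eta\,\nabla_\theta L(\theta_t)$$

Here, θ stands for the model parameters (the weights and biases), η is the learning rate, and ∇L is the gradient of the loss with respect to the parameters.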
For more details on this, please refer to this source. Given below is the code for the SGD algorithm.
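Below is a minimal sketch of such an optimizer, assuming the one_hot and multinomial_logistic_regression helpers defined above. It updates the parameters one randomly chosen training sample at a time, using the fact that for softmax with cross-entropy the gradient of the loss with respect to the logits is simply (predicted probabilities minus the one-hot target):

```python
def sgd(X, y, W, b, learning_rate=0.1, epochs=10000, seed=0):
    """Stochastic gradient descent for the softmax + cross-entropy model."""
    rng = np.random.default_rng(seed)
    k = W.shape[0]
    for _ in range(epochs):
        # Pick one random training sample per update (the "stochastic" part)
        i = rng.integers(len(X))
        x_i = X[i:i + 1]                    # shape (1, m)
        y_i = one_hot(y[i:i + 1], k)        # shape (1, k)

        probs = multinomial_logistic_regression(x_i, W, b)

        # For softmax + cross-entropy, dL/dlogits = probabilities - one-hot target
        error = probs - y_i                 # shape (1, k)
        grad_W = error.T @ x_i              # shape (k, m)
        grad_b = error.ravel()              # shape (k,)

        W = W - learning_rate * grad_W
        b = b - learning_rate * grad_b
    return W, b
```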
Now that we have the optimizer function ready, we will run it for our model.
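For example (the learning rate and epoch count are arbitrary illustrative values):

```python
W, b = sgd(X_train, y_train, W, b, learning_rate=0.1, epochs=10000)
```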
We have reached the last step of our project now.
“We are in the endgame!”
We will now test our multinomial logistic regression model with the updated weights and biases that we obtained by running the optimizer function.
NOTE- The test will be conducted on the test dataset and not the training dataset.
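A sketch of the final evaluation, reusing the helpers defined earlier:

```python
test_probabilities = multinomial_logistic_regression(X_test, W, b)
test_predictions = np.argmax(test_probabilities, axis=1)
print(accuracy(y_test, test_predictions))
```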
As we can see, our model showed around 67% accuracy on the test data. While this is a significant improvement over the initial 16% that we got, there’s still enough room for improvement.
You can further improve the accuracy by playing around with the hyperparameters (learning rate, training epochs, etc.) or by trying the process with a different scaling or optimizer algorithm. Remember, the more you experiment, the more you learn!
With this, we come to the end of our project.
Here are the key takeaways-
- Coding a multinomial logistic regression model from scratch.
- The mathematics involved in an MLR model.
For more fun projects like this one, check out my profile.
I am just a novice in the field of Machine Learning and Data Science so any suggestions and criticism will really help me improve.
Hit that follow and stay tuned for more ML stuff!
Link to GitHub repo for the dataset and Jupyter notebook-
References
- Multinomial Logistic Regression — Wikipedia
- Logit Function — Wikipedia
- Softmax Function — Wikipedia
- Cross-Entropy Function- MachineLearningMechanic.com