Multinomial Logistic Regression
Discussing the Logistic Regression concept and implementing a logistic model for a simple real-world classification problem
Hi, folks!
This article covers one of the most useful regression-based models for classification problems.
Before deep-diving into concepts, let's start with a usual question!
Imagine you decide to go on a picnic. The first question you will ask yourself is: how is the weather? If it isn't suitable, it will make for an awful day!
Your answer could be Sunny, Windy, Showers, and so on; we know the answer comes from a limited set of categories, not an unbounded real number.
But how did you answer? You used many parameters, such as wind speed, humidity, the season, and your past experience of that place!
We could say you reach a specific answer by weighing each parameter's importance to sudden weather changes when forecasting.
Now you are playing the role of a machine learning model for weather prediction! But how can a Machine be taught this style of thinking? Nothing does it better than Logistic Regression.
First, let's take a close look at the concept.
What is Logistic Regression?
Assuming you know the fundamentals of Regression, we could say Logistic Regression is an advanced form of Linear Regression, since it builds on several ideas from it.
What is different? Why not use Linear?
Linear Regression freely fits the data on a graph, matching every X to a corresponding Y, and its output is unbounded! But as you saw, we don't have infinitely many states! Logistic Regression provides a mechanism for bounding the outcome between two acceptable values, 1 and 0; these values stand for Yes and No.
Logistic Regression uses the Sigmoid function to squeeze values into this range, producing an S-shaped curve.
What is Sigmoid Function?
The sigmoid is one of the most famous functions in statistics and Machine Learning; it maps real numbers into the range between 0 and 1 (or −1 and 1, for some variants). Statistics students know it as the inverse of the logit function.
Its output is predictable at the extremes: as x goes to +∞ the output approaches 1, and as x goes to −∞ the output approaches 0.
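In symbols, the sigmoid and its limiting behavior are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad \lim_{x \to +\infty} \sigma(x) = 1,
\qquad \lim_{x \to -\infty} \sigma(x) = 0
```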
Now that we know enough about what happens behind the scenes, let's step further and find out how to use this theory on real-world problems.
Logistic Regression Implementation
I'll use Python and a Jupyter notebook for the implementation.
First of all, we should take a glimpse at the Dataset, which is accessible from here.
The Dataset looks like the image below:
A first look reveals a few points:
- the Dataset has 14 columns, 13 of which are features
- the last column holds three class labels, so we know it's a multinomial problem
- the feature values are not on the same scale and need tuning
Fetch data from the Dataset and parse it into a Dataframe
In this project, we use two famous libraries, Pandas and NumPy. NumPy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays, while Pandas handles tabular data.
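The loading code itself appears as a screenshot in the original article; a minimal sketch of the idea might look like this (the column names here are made up for illustration, and the real Wine dataset has 13 feature columns plus a label):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: feature columns plus a label column.
data = pd.DataFrame({
    "alcohol":    [14.2, 13.2, 12.4],
    "malic_acid": [1.71, 1.78, 0.94],
    "label":      [1, 1, 2],
})

features = data.iloc[:, :-1].to_numpy()  # every column except the last
labels = data.iloc[:, -1].to_numpy()     # the last column holds the class
```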
Data Splitting
Now we need to split the data into Training and Testing sets, because the Model trains and tunes its parameters (thetas) using the features and Label of each record, and then the Test set (which contains unseen data) is used to evaluate its preciseness. Following this description, we will use 70% of the Dataset for the training phase and 30% for the Test set!
Common ratios are 70% train / 30% test and 80% train / 20% test.
To implement this splitter function, I've used an effortless approach:
In the second line, we get the total record count from the Dataset and calculate a 70% share, which comes to about 144 × 0.7 ≈ 101 records.
Then, in the third line, we create an array of True and False values the size of the whole dataset, where True marks a record for the Training set and False marks one for the Testing set. Now we have an array containing 70% True and 30% False values.
The data should be split randomly, and we don't know whether the current Dataset is ordered in some specific way! So let's shuffle the generated True/False array to make sure records are collected randomly.
I did this using the shuffle function.
Then we pair each True/False value with the corresponding record in the Dataset.
Finally, by filtering the paired data, we get two separate lists: the Train and Test sets.
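Since the splitter code is shown only as an image in the original, here is a hypothetical sketch of the approach described above (the function name and the `seed` parameter are my own additions):

```python
import numpy as np

def train_test_split(data, train_ratio=0.7, seed=0):
    """Split rows into Train/Test using a shuffled True/False mask."""
    n = len(data)
    n_train = int(n * train_ratio)          # e.g. 144 records -> ~101 for training
    mask = np.array([True] * n_train + [False] * (n - n_train))
    np.random.default_rng(seed).shuffle(mask)  # randomize which rows are Train
    return data[mask], data[~mask]

rows = np.arange(10).reshape(5, 2)          # 5 toy records
train, test = train_test_split(rows)
```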
Extract Features and Labels from Sets
To make access to the data more straightforward, I've decided to separate the features from the labels and put each into its own array.
The training set will then look like this:
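A tiny illustration of that separation, assuming the label sits in the last column (the numbers are made up):

```python
import numpy as np

# Hypothetical layout: the last column is the class label.
train_set = np.array([
    [14.2, 1.71, 1],
    [13.2, 1.78, 1],
    [12.4, 0.94, 2],
])

x_train = train_set[:, :-1]  # feature columns only
y_train = train_set[:, -1]   # label column only
```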
Scale feature values into the desired range
As we mentioned in the dataset tips, the feature values need to be on the same scale. We can do this by normalizing them: subtract each feature's mean and divide by its standard deviation, which puts all the features on a comparable scale.
Following the formula on the right, the implementation needs two NumPy functions, numpy.mean() and numpy.std(), which compute the mean and standard deviation of each feature column.
First, I defined a normalize function that takes the features as input and casts all values to float to prevent type errors.
Then I applied this scaling to both the training set and the test set features.
As shown in the table above, all features are now on a comparable scale.
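A sketch of what the normalize function described above might look like (the exact signature in the article's screenshot may differ):

```python
import numpy as np

def normalize(features):
    """Standardize each column: subtract its mean, divide by its std."""
    features = features.astype(float)   # cast to float to avoid type surprises
    mean = features.mean(axis=0)        # per-feature mean
    std = features.std(axis=0)          # per-feature standard deviation
    return (features - mean) / std

x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_scaled = normalize(x)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the others just because of its units.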
Every theta needs a corresponding feature
If you remember the Linear Regression hypothesis function, we set x₀ equal to 1 to make the matrix calculations more convenient.
We do the same here, but with a slightly different hypothesis formula:
We need to insert a column of ones as the first column of the feature matrix.
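One way to do that with NumPy, as a small illustrative snippet:

```python
import numpy as np

x = np.array([[0.5, 1.2], [0.1, 0.7]])
# Prepend a column of ones so that theta_0 acts as the intercept term.
x_with_bias = np.insert(x, 0, 1.0, axis=1)
```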
Touch the Core
Now it's time to talk about the main part of the process!
Before jumping into implementation, let's talk about the concept and understand what's going on around it.
Recall that in Linear Regression we used the Cost Function (loss function) J(θ) to represent the optimization objective, and gradient descent to minimize it.
Fortunately, we take the same approach in Logistic Regression! But you may ask why we don't reuse the Linear Regression cost function for these problems; the answer is that it would produce many local minima!
As the graph shows, there are many local minima, and the curve is not convex! That makes it hard to find the best point: when you think you're at the global minimum, a slightly better point may exist, but you're stuck in your limited view :)
So to solve this problem we must use another cost function, one tailored to Logistic Regression, and that's nothing but:
And if we compress these two cases into one function, it looks like this:
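Written out, the two cases and their compressed single-line form are:

```latex
\mathrm{Cost}\big(h_\theta(x), y\big) =
\begin{cases}
  -\log\big(h_\theta(x)\big)     & \text{if } y = 1 \\
  -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big)
      + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
```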
Well, now it's time to use Gradient Descent to minimize the cost function:
There is an α just before the derivative part; that's nothing but the learning rate, which I expect you already know.
As you can see, we need the derivative of the cost function, which is calculated like this:
Finally, we have:
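In LaTeX, the derivative and the resulting update rule are:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}
  \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}
```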
Theoretically, the mission is accomplished; now let's get our hands dirty:
There is no pre-built sigmoid function, and we use it more than anything else, so the first function that needs to be written is the sigmoid, which we talked about before.
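A minimal sketch of such a sigmoid, written with NumPy:

```python
import numpy as np

def sigmoid(z):
    """Squeeze any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```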
The second most frequently used function is the Logistic Regression hypothesis function:
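A hypothetical version of that hypothesis function (the names are my own; the article's screenshot may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x, thetas):
    """h(x) = sigmoid(x @ thetas): probability of belonging to the class."""
    return sigmoid(x @ thetas)

x = np.array([[1.0, 2.0], [1.0, -2.0]])  # first column is the bias term
thetas = np.array([0.0, 1.0])
probs = hypothesis(x, thetas)
```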
OK, now it's time for gradient descent to come up:
Wow, that's a lot of complex code compared to the others! 🤯
Let's make it as easy as possible with a line-by-line description.
input parameters :
- x: Train Set features as 2d matrix
- y: the corresponding class label for features set
- alpha: learning rate
- iter: iteration count used to stop the algorithm
In the first line, we get the feature count to know how many thetas are needed. Then, in the second line, we convert all Labels into a unique set using pure Python, which works like this:
Then, in the third line, we create an empty array to hold the thetas.
The first loop is where Multinomial and Binary Logistic Regression differ. To make it easier to understand, let's check it out:
In binary classification, we have only two classes! When you train your Model, you have just two types of data: triangle and square.
When you tune the thetas, since there are only two types, a sample either belongs to the triangle class or, otherwise, to the square class. This comparison happens in one step, so you don't need to compare the classes to each other one by one!
So the code for Binary Logistic Regression will be something like this:
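Since that code appears only as an image in the original, here is a hedged sketch of what the binary version might look like (the function name is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_gradient_descent(x, y, alpha=0.1, iters=1000):
    """Binary case: one theta vector; y already holds 0/1 labels."""
    thetas = np.zeros(x.shape[1])
    m = len(y)
    for _ in range(iters):
        h = sigmoid(x @ thetas)                # predicted probabilities
        thetas -= alpha * (x.T @ (h - y)) / m  # one gradient descent step
    return thetas

# Toy separable data: the first column is the bias term.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
thetas = binary_gradient_descent(x, y)
```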
But in multinomial classification, we have more than two class labels, which makes the process harder. Let's start with a simple graphical example:
We know Logistic Regression is inherently binary, so we should reshape the problem to match that working style! We group the classes into binary form: triangles become one class, and all the other types together become the other class. With this approach we have a binary problem, but keep in mind it must be repeated for each class; i.e., in the next step, squares will be one class and the two other types together will be the other.
With this method in mind, we now know the reason for the first loop! It performs this grouping for each class and tunes the thetas for that specific class!
It also initializes the thetas for that class and then distinguishes that class's Y values from the others.
Oops! There is a little function that wasn't mentioned before:
This distinction happens in the function above, which filters Y for a specific class label; i.e., the selected class's value becomes 1 and all others become 0.
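That filtering step can be sketched in one line with NumPy (the variable names are my own):

```python
import numpy as np

y = np.array([1, 2, 3, 1, 2])
# Treat class 2 as the "positive" class (1) and everything else as 0.
y_filtered = np.where(y == 2, 1, 0)
```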
From here on, everything is just like Binary Logistic Regression.
The function above is exactly an implementation of:
At the end of the function, we return the thetas and the class labels for the next step.
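Putting the whole walkthrough together, here is a self-contained sketch of a one-vs-rest gradient descent (all the names are my own; the article's screenshot may differ in the details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """One-vs-rest multinomial logistic regression (hypothetical sketch).

    x: (m, n) feature matrix whose first column is already all ones.
    y: (m,) class labels; may contain more than two classes.
    Returns one theta vector per class, plus the class labels.
    """
    n_features = x.shape[1]
    labels = sorted(set(y.tolist()))             # unique class labels
    all_thetas = []
    for label in labels:
        y_bin = np.where(y == label, 1.0, 0.0)   # current class vs the rest
        thetas = np.zeros(n_features)
        m = len(y_bin)
        for _ in range(iters):
            h = sigmoid(x @ thetas)
            gradient = (x.T @ (h - y_bin)) / m   # derivative of the cost
            thetas -= alpha * gradient           # gradient descent step
        all_thetas.append(thetas)
    return np.array(all_thetas), labels

# Tiny separable example: the class depends on the sign of the feature.
x = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
thetas, labels = gradient_descent(x, y)
```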
Let’s run our Logistic Regression:
Evaluate your Training
After all these steps, let's take the last one and evaluate the model's prediction accuracy; we'll first try it on the training data!
We expect it to be close to 100%:
Now on to the Test set (this one is more reliable, since the training score may hide an overfitting problem):
98% is high accuracy for a model, and that sounds good.
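Accuracy itself is just the fraction of correct predictions; a tiny illustration with made-up values:

```python
import numpy as np

preds = np.array([0, 1, 2, 2])        # hypothetical model predictions
truth = np.array([0, 1, 2, 1])        # hypothetical true labels
accuracy = (preds == truth).mean()    # fraction of correct predictions
```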
But what is the predict() function?
As you can see, we use the predict function to apply X (the features) to the hypothesis function with the trained thetas; predict looks like this:
Note: the @ operator performs matrix multiplication between two matrices.
After applying the arguments to the Logistic Regression hypothesis function, we use argmax to detect which class's output is the maximum, and assign that class label.
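A hedged sketch of such a predict function (the thetas and labels below are made-up values for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, thetas, labels):
    """Score x against every class's thetas and pick the most probable label."""
    probs = sigmoid(x @ thetas.T)            # @ is the matrix product
    return [labels[i] for i in np.argmax(probs, axis=1)]

x = np.array([[1.0, -2.0], [1.0, 2.0]])      # first column is the bias term
thetas = np.array([[0.0, -1.0],              # hypothetical trained thetas
                   [0.0, 1.0]])              # one row per class
labels = ["A", "B"]
preds = predict(x, thetas, labels)
```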
Feel free to contact me if you need to:
Email: dalvandsina@yahoo.com
Github: www.github.com/sinadalvand