Multinomial Logistic Regression
Discussing the Logistic Regression concept and implementing a logistic model for a simple real-world classification problem
Hi, folks!
This article covers one of the most useful regression-based models for classification problems.
Before deep-diving into concepts, let's start with a usual question!
Imagine you decide to go on a picnic. The first question you will ask yourself is: how is the weather? If it isn't suitable, it will make for an awful day!
Your answer could be Sunny, Windy, Showers, and so on; we know the answer comes from a limited set of categories, not an unbounded real number.
But how did you answer? You used many parameters, such as wind speed, humidity, the season, and your past experience of that place!
We could say you reach a specific answer by weighing each parameter's importance to sudden weather changes when forecasting.
Now you are playing the role of a machine learning model for weather prediction! But how can a Machine be taught this style of thinking? Nothing does it better than Logistic Regression.
First, let's take a close look at the concept.
What is Logistic Regression?
Assuming you know the fundamentals of Regression, we could say Logistic Regression is an advanced form of Linear Regression, since it builds on several ideas from it.
What is different? Why not use Linear?
Linear Regression freely fits the data on a graph, matching every X to a corresponding Y, and its output is unbounded! But as you saw, we don't have infinitely many states! Logistic Regression provides a mechanism for bounding the outcome between two acceptable values, 1 and 0; these values stand for Yes and No.
Logistic Regression uses the Sigmoid function to squeeze values into this range, producing an S-shaped curve.
What is Sigmoid Function?
The sigmoid is one of the most famous functions in statistics and Machine Learning; it maps real numbers into the range between 0 and 1 (or −1 and 1, for some variants). Statistics students know it as the inverse of the logit function.
Its output is predictable at the extremes: as x goes to +∞ the output approaches 1, and as x goes to −∞ the output approaches 0.
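In symbols, the sigmoid and its limiting behavior are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad \lim_{x \to +\infty} \sigma(x) = 1,
\qquad \lim_{x \to -\infty} \sigma(x) = 0
```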
Now that we know enough about what happens behind the scenes, let's step further and find out how to use this theory on real-world problems.
Logistic Regression Implementation
I'll use Python and a Jupyter notebook for the implementation.
First of all, we should take a glimpse at the Dataset, which is accessible from here.
The Dataset looks like the image below:
A first look reveals a few points:
- the Dataset has 14 columns, 13 of which are features
- the last column holds three class labels, so we know it's a multinomial problem
- the feature values are not on the same scale and need tuning
Fetch data from the Dataset and parse it into a Dataframe
In this project, we use two famous libraries, Pandas and NumPy. NumPy adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays, while Pandas handles tabular data.
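The loading code itself appears as a screenshot in the original article; a minimal sketch of the idea might look like this (the column names here are made up for illustration, and the real Wine dataset has 13 feature columns plus a label):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: feature columns plus a label column.
data = pd.DataFrame({
    "alcohol":    [14.2, 13.2, 12.4],
    "malic_acid": [1.71, 1.78, 0.94],
    "label":      [1, 1, 2],
})

features = data.iloc[:, :-1].to_numpy()  # every column except the last
labels = data.iloc[:, -1].to_numpy()     # the last column holds the class
```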
Data Splitting
Now we need to split the data into Training and Testing sets, because the Model trains and tunes its parameters (thetas) using the features and Label of each record, and then the Test set (which contains unseen data) is used to evaluate its preciseness. Following this description, we will use 70% of the Dataset for the training phase and 30% for the Test set!
Common ratios are 70% train / 30% test and 80% train / 20% test.
To implement this splitter function, I've used an effortless approach:
In the second line, we get the total record count from the Dataset and calculate a 70% share, which comes to about 144 × 0.7 ≈ 101 records.
Then, in the third line, we create an array of True and False values the size of the whole dataset, where True marks a record for the Training set and False marks one for the Testing set. Now we have an array containing 70% True and 30% False values.
The data should be split randomly, and we don't know whether the current Dataset is ordered in some specific way! So let's shuffle the generated True/False array to make sure records are collected randomly.
I did this using the shuffle function.
Then we pair each True/False value with the corresponding record in the Dataset.
Finally, by filtering the paired data, we get two separate lists: the Train and Test sets.
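Since the splitter code is shown only as an image in the original, here is a hypothetical sketch of the approach described above (the function name and the `seed` parameter are my own additions):

```python
import numpy as np

def train_test_split(data, train_ratio=0.7, seed=0):
    """Split rows into Train/Test using a shuffled True/False mask."""
    n = len(data)
    n_train = int(n * train_ratio)          # e.g. 144 records -> ~101 for training
    mask = np.array([True] * n_train + [False] * (n - n_train))
    np.random.default_rng(seed).shuffle(mask)  # randomize which rows are Train
    return data[mask], data[~mask]

rows = np.arange(10).reshape(5, 2)          # 5 toy records
train, test = train_test_split(rows)
```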
Extract Features and Labels from Sets
To make access to the data more straightforward, I've decided to separate the features from the labels and put each into its own array.
The training set will then look like this:
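A tiny illustration of that separation, assuming the label sits in the last column (the numbers are made up):

```python
import numpy as np

# Hypothetical layout: the last column is the class label.
train_set = np.array([
    [14.2, 1.71, 1],
    [13.2, 1.78, 1],
    [12.4, 0.94, 2],
])

x_train = train_set[:, :-1]  # feature columns only
y_train = train_set[:, -1]   # label column only
```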
Scale feature values into the desired range
As we mentioned in the dataset tips, the feature values need to be on the same scale. We can do this by normalizing them: subtract each feature's mean and divide by its standard deviation, which puts all the features on a comparable scale.
Following the formula on the right, the implementation needs two NumPy functions, numpy.mean() and numpy.std(), which compute the mean and standard deviation of each feature column.
First, I defined a normalize function that takes the features as input and casts all values to float to prevent type errors.
Then I applied this scaling to both the training set and the test set features.
As shown in the table above, all features are now on a comparable scale.
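A sketch of what the normalize function described above might look like (the exact signature in the article's screenshot may differ):

```python
import numpy as np

def normalize(features):
    """Standardize each column: subtract its mean, divide by its std."""
    features = features.astype(float)   # cast to float to avoid type surprises
    mean = features.mean(axis=0)        # per-feature mean
    std = features.std(axis=0)          # per-feature standard deviation
    return (features - mean) / std

x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_scaled = normalize(x)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates the others just because of its units.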
Every theta needs a corresponding feature
If you remember the Linear Regression hypothesis function, we set x₀ equal to 1 to make the matrix calculations more convenient.
We do the same here, but with a slightly different hypothesis formula:
We need to insert a column of ones as the first column of the feature matrix.
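One way to do that with NumPy, as a small illustrative snippet:

```python
import numpy as np

x = np.array([[0.5, 1.2], [0.1, 0.7]])
# Prepend a column of ones so that theta_0 acts as the intercept term.
x_with_bias = np.insert(x, 0, 1.0, axis=1)
```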
Touch the Core
Now it's time to talk about the main part of the process!
Before jumping into implementation, let's talk about the concept and understand what's going on around it.
Recall that in Linear Regression we used the Cost Function (loss function) J(θ) to represent the optimization objective, and gradient descent to minimize it.
Fortunately, we take the same approach in Logistic Regression! But you may ask why we don't reuse the Linear Regression cost function for these problems; the answer is that it would produce many local minima!
As the graph shows, there are many local minima, and the curve is not convex! That makes it hard to find the best point: when you think you're at the global minimum, a slightly better point may exist, but you're stuck in your limited view :)
So to solve this problem we must use another cost function, one tailored to Logistic Regression, and that's nothing but:
And if we compress these two cases into one function, it looks like this:
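Written out, the two cases and their compressed single-line form are:

```latex
\mathrm{Cost}\big(h_\theta(x), y\big) =
\begin{cases}
  -\log\big(h_\theta(x)\big)     & \text{if } y = 1 \\
  -\log\big(1 - h_\theta(x)\big) & \text{if } y = 0
\end{cases}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m}
  \Big[ y^{(i)} \log\big(h_\theta(x^{(i)})\big)
      + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]
```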
Well, now it's time to use Gradient Descent to minimize the cost function:
There is an α just before the derivative part; that's nothing but the learning rate, which I expect you already know.
As you can see, we need the derivative of the cost function, which is calculated like this:
Finally, we have:
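In LaTeX, the derivative and the resulting update rule are:

```latex
\frac{\partial J(\theta)}{\partial \theta_j}
  = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}

\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m}
  \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}
```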
Theoretically, the mission is accomplished; now let's get our hands dirty:
There is no pre-built sigmoid function, and we use it more than anything else, so the first function that needs to be written is the sigmoid, which we talked about before.
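A minimal sketch of such a sigmoid, written with NumPy:

```python
import numpy as np

def sigmoid(z):
    """Squeeze any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```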
The second most frequently used function is the Logistic Regression hypothesis function:
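A hypothetical version of that hypothesis function (the names are my own; the article's screenshot may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(x, thetas):
    """h(x) = sigmoid(x @ thetas): probability of belonging to the class."""
    return sigmoid(x @ thetas)

x = np.array([[1.0, 2.0], [1.0, -2.0]])  # first column is the bias term
thetas = np.array([0.0, 1.0])
probs = hypothesis(x, thetas)
```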
OK, now it's time for gradient descent to come up:
Wow, that's a lot of complex code compared to the others! 🤯
Let's make it as easy as possible with a line-by-line description.
input parameters :
- x: Train Set features as 2d matrix
- y: the corresponding class label for features set
- alpha: learning rate
- iter: iteration count used to stop the algorithm
In the first line, we get the feature count to know how many thetas are needed. Then, in the second line, we convert all Labels into a unique set using pure Python, which works like this:
Then, in the third line, we create an empty array to hold the thetas.
The first loop is where Multinomial and Binary Logistic Regression differ. To make it easier to understand, let's check it out:
In binary classification, we have only two classes! When you train your Model, you have just two types of data: triangle and square.
When you tune the thetas, since there are only two types, a sample either belongs to the triangle class or, otherwise, to the square class. This comparison happens in one step, so you don't need to compare the classes to each other one by one!
So the code for Binary Logistic Regression will be something like this:
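Since that code appears only as an image in the original, here is a hedged sketch of what the binary version might look like (the function name is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_gradient_descent(x, y, alpha=0.1, iters=1000):
    """Binary case: one theta vector; y already holds 0/1 labels."""
    thetas = np.zeros(x.shape[1])
    m = len(y)
    for _ in range(iters):
        h = sigmoid(x @ thetas)                # predicted probabilities
        thetas -= alpha * (x.T @ (h - y)) / m  # one gradient descent step
    return thetas

# Toy separable data: the first column is the bias term.
x = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
thetas = binary_gradient_descent(x, y)
```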
But in multinomial classification, we have more than two class labels, which makes the process harder. Let's start with a simple graphical example:
We know Logistic Regression is inherently binary, so we should reshape the problem to match that working style! We group the classes into binary form: triangles become one class, and all the other types together become the other class. With this approach we have a binary problem, but keep in mind it must be repeated for each class; i.e., in the next step, squares will be one class and the two other types together will be the other.
With this method in mind, we now know the reason for the first loop! It performs this grouping for each class and tunes the thetas for that specific class!
It also initializes the thetas for that class and then distinguishes that class's Y values from the others.
Oops! There is a little function that wasn't mentioned before:
This distinction happens in the function above, which filters Y for a specific class label; i.e., the selected class's value becomes 1 and all others become 0.
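That filtering step can be sketched in one line with NumPy (the variable names are my own):

```python
import numpy as np

y = np.array([1, 2, 3, 1, 2])
# Treat class 2 as the "positive" class (1) and everything else as 0.
y_filtered = np.where(y == 2, 1, 0)
```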
From here on, everything is just like Binary Logistic Regression.
The function above is exactly an implementation of:
At the end of the function, we return the thetas and the class labels for the next step.
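Putting the whole walkthrough together, here is a self-contained sketch of a one-vs-rest gradient descent (all the names are my own; the article's screenshot may differ in the details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(x, y, alpha=0.1, iters=1000):
    """One-vs-rest multinomial logistic regression (hypothetical sketch).

    x: (m, n) feature matrix whose first column is already all ones.
    y: (m,) class labels; may contain more than two classes.
    Returns one theta vector per class, plus the class labels.
    """
    n_features = x.shape[1]
    labels = sorted(set(y.tolist()))             # unique class labels
    all_thetas = []
    for label in labels:
        y_bin = np.where(y == label, 1.0, 0.0)   # current class vs the rest
        thetas = np.zeros(n_features)
        m = len(y_bin)
        for _ in range(iters):
            h = sigmoid(x @ thetas)
            gradient = (x.T @ (h - y_bin)) / m   # derivative of the cost
            thetas -= alpha * gradient           # gradient descent step
        all_thetas.append(thetas)
    return np.array(all_thetas), labels

# Tiny separable example: the class depends on the sign of the feature.
x = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 1.5], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
thetas, labels = gradient_descent(x, y)
```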
Let’s run our Logistic Regression:
Evaluate your Training
After all these steps, let's take the last one and evaluate the model's prediction accuracy; we'll first try it on the training data!
We expect it to be close to 100%:
Now on to the Test set (this one is more reliable, since the training score may hide an overfitting problem):
98% is high accuracy for a model, and that sounds good.
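Accuracy itself is just the fraction of correct predictions; a tiny illustration with made-up values:

```python
import numpy as np

preds = np.array([0, 1, 2, 2])        # hypothetical model predictions
truth = np.array([0, 1, 2, 1])        # hypothetical true labels
accuracy = (preds == truth).mean()    # fraction of correct predictions
```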
But what is the predict() function?
As you can see, we use the predict function to apply X (the features) to the hypothesis function with the trained thetas; predict looks like this:
Note: the @ operator performs matrix multiplication between two matrices.
After applying the arguments to the Logistic Regression hypothesis function, we use argmax to detect which class's output is the maximum, and assign that class label.
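A hedged sketch of such a predict function (the thetas and labels below are made-up values for illustration only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, thetas, labels):
    """Score x against every class's thetas and pick the most probable label."""
    probs = sigmoid(x @ thetas.T)            # @ is the matrix product
    return [labels[i] for i in np.argmax(probs, axis=1)]

x = np.array([[1.0, -2.0], [1.0, 2.0]])      # first column is the bias term
thetas = np.array([[0.0, -1.0],              # hypothetical trained thetas
                   [0.0, 1.0]])              # one row per class
labels = ["A", "B"]
preds = predict(x, thetas, labels)
```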
Feel free to contact me if you need to:
Email: dalvandsina@yahoo.com
Github: www.github.com/sinadalvand