What is Softmax Regression?

Preethi Thakur
Oct 3, 2022


Softmax Regression

Softmax regression (or multinomial logistic regression) is a generalization of logistic regression to the case where we want to handle multiple classes.

In logistic regression we assume that the labels are binary, i.e., y(i)∈{0,1}, and the classifier distinguishes between just those two classes. Softmax regression allows us to handle y(i)∈{1,…,K}, where K is the number of classes.

When we give an instance x to the Softmax Regression classifier, it first computes a score sk(x) for each class k, then estimates the probability of each class by applying the softmax function to the scores. The softmax score sk(x) for class k is computed as follows:

sk(x) = (θ(k))ᵀ x

Softmax score for class k

Each class k has its own dedicated parameter vector θ(k).
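To make this concrete, here is a minimal sketch in plain NumPy (the parameter vector and the instance are made-up numbers): computing the score of one class is just a dot product.

import numpy as np

theta_k = np.array([0.5, -1.2, 2.0])  # hypothetical parameter vector for class k
x = np.array([1.0, 3.0, 0.5])         # feature vector of one instance
s_k = theta_k @ x                     # score sk(x) = (θ(k))ᵀ x
print(s_k)                            # -2.1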

After computing the scores of every class for the instance x, we estimate the probability pk that the instance belongs to class k by applying the softmax function to the scores:

σ(s(x))k = exp(sk(x)) / Σj exp(sj(x)),  where the sum runs over j = 1, …, K

Softmax function
  • K is the number of classes.
  • s(x) is a vector containing the scores of each class for the instance x.
  • σ(s(x))k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance.

From the above equation we can see that the softmax function computes the exponential of every score, then normalizes them by dividing by the sum of all the exponentials.
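As a minimal sketch of the softmax function, assuming plain NumPy and made-up scores (the max-subtraction is a standard numerical-stability trick, not part of the equation):

import numpy as np

def softmax(scores):
    # exponentiate every score, then normalize by the sum of the exponentials
    exps = np.exp(scores - np.max(scores))  # subtracting the max avoids overflow
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])  # made-up scores s(x), one per class
probs = softmax(scores)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0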

Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability, which is simply the class with the highest score.

In softmax regression, the estimated probabilities of all classes sum to 1.

ŷ = argmaxk σ(s(x))k = argmaxk sk(x) = argmaxk ((θ(k))ᵀ x)

Softmax regression classifier prediction

The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability σ(s(x))k.
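Continuing the NumPy sketch above (with the same made-up scores), the prediction is just an argmax over the estimated probabilities. Since exp is monotonic, taking the argmax over the raw scores gives the same class:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])             # made-up scores, one per class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax probabilities
y_hat = np.argmax(probs)                       # class with the highest probability
print(y_hat)                                   # 0
print(np.argmax(scores) == y_hat)              # True: same class either way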

Note: The Softmax Regression classifier predicts only one class at a time, i.e., it is a multi-class model, not a multi-output model. So it should be used only with mutually exclusive classes, such as different species of flowers (for example, the iris dataset).

Cost Function

In training, the objective is to have a model that estimates a high probability for the target class and low probabilities for the other classes. Keeping this objective in mind, we minimize a cost function called cross entropy, which penalizes the model when it estimates a low probability for the target class.

Cross entropy is a measure of how well a set of estimated class probabilities matches the target classes, so we can say that it measures the performance of the model.

The softmax cost function is similar to the logistic regression cost function, except that we now sum over the K different possible values of the class label. The cost function is as follows:

J(Θ) = −(1/m) Σi Σk yk(i) log(pk(i)),  where i runs over the m training instances and k over the K classes

Cross Entropy Cost Function

yk(i) is the target probability that instance i belongs to class k; it is equal to 1 or 0 depending on whether the instance belongs to that class or not.

When there are just two classes (K = 2), this cost function is equivalent to the logistic regression cost function (log loss).
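As a hedged NumPy sketch of this cost function (the one-hot targets and probabilities are made-up numbers, and the small epsilon guarding against log(0) is not part of the equation):

import numpy as np

def cross_entropy(Y, P):
    # Y: (m, K) one-hot targets yk(i); P: (m, K) estimated probabilities pk(i)
    m = len(Y)
    return -np.sum(Y * np.log(P + 1e-12)) / m

Y = np.array([[1, 0, 0],   # instance 0 belongs to class 0
              [0, 1, 0]])  # instance 1 belongs to class 1
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2]])
print(cross_entropy(Y, P))  # ~0.525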

We cannot solve for the minimum of J(Θ) analytically, so as usual we resort to an iterative optimization algorithm, Gradient Descent. We take derivatives of the cost function, in which the probabilities pk come from applying the softmax function to the scores:

pk = σ(s(x))k = exp(sk(x)) / Σj exp(sj(x))

Softmax Regression

After taking derivatives of the above, the gradient vector of the cost function with regard to θ(k) is given by:

∇θ(k) J(Θ) = (1/m) Σi (pk(i) − yk(i)) x(i)

Cross Entropy Gradient Vector for class k

∇θ(k)J(Θ) is a vector whose jth element is the partial derivative of J(Θ) with respect to the jth element of θ(k).

This way we can compute the gradient vector for every class, then use Gradient Descent to find the parameter matrix Θ that minimizes the cost function J(Θ).
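Putting the gradient to work, here is a bare-bones batch Gradient Descent sketch, assuming NumPy, one-hot targets, and an arbitrary learning rate and iteration count (a bias term could be handled by appending a column of 1s to X):

import numpy as np

def softmax(S):
    exps = np.exp(S - S.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def train_softmax_regression(X, Y, eta=0.1, n_iters=1000):
    # X: (m, n) training instances; Y: (m, K) one-hot targets
    m, n = X.shape
    K = Y.shape[1]
    Theta = np.zeros((n, K))        # parameter matrix, one column θ(k) per class
    for _ in range(n_iters):
        P = softmax(X @ Theta)      # (m, K) estimated probabilities
        grad = X.T @ (P - Y) / m    # gradient vectors for all classes at once
        Theta -= eta * grad         # one Gradient Descent step
    return Theta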

Decision Boundary

Let’s consider the iris dataset. When we use Softmax Regression to classify the iris flowers into all three classes, we get the following classification.

Softmax Regression decision boundaries

We can observe the resulting decision boundaries for the three classes, represented by the background colors; the decision boundaries between any two classes are linear. The curved lines represent the probabilities for the Iris versicolor class (e.g., the line labeled 0.450 represents the 45% probability boundary). Notice that the model can predict a class even when its estimated probability is below 50%. For example, at the point where all the decision boundaries meet, all classes have an equal estimated probability of about 33%.
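In scikit-learn this is a one-liner: LogisticRegression performs softmax regression on multi-class targets (in older versions you had to pass multi_class="multinomial"). A sketch of this iris setup, assuming petal length and petal width as the two features and an illustrative value of C:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target       # three classes: 0, 1, 2

softmax_reg = LogisticRegression(C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5.0, 2.0]]))        # predicted class
print(softmax_reg.predict_proba([[5.0, 2.0]]))  # class probabilities, sum to 1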

Click here to learn about other logistic regression methods (analysis) and for a brief introduction to Logistic Regression.
