Understanding Logistic Regression — Part 1: Maths

Mei Leng
6 min read · Oct 6, 2020

Logistic regression (LR) is a probabilistic binary classifier. It finds applications across a wide range of fields, and its importance can hardly be overstated. Given a set of explanatory X-variables as input, logistic regression builds a mathematical equation that outputs the probability that the y-variable belongs to Class 1 rather than Class 0.

I learned logistic regression many years ago and thought I understood it well (yeah, nothing special, just another classifier built on a linear equation), but only after I started to work on credit score models did I realize that it takes more effort to master. The logic behind it is simple, but it is never easy to use well in practice, especially with all the complications of real-world data. That is why I decided to start this series of articles: to make sure that I truly understand this algorithm.

Part 1: Maths

Part 2: Feature selection and multicollinearity

Part 3: How to interpret the coefficients

Part 4: Applications in credit risk score

Math Alert!

This part is heavy on equations and math, but hopefully it provides a more profound understanding of the algorithm.

Where does the “logistic” come from?

LR solves a binary classification problem, where we aim to predict how likely our observations are to come from Class 1 rather than Class 0. That is, we have a set of observations from a system whose output is either 1 or 0. Denoting the observations as {𝒙₁, …, 𝒙_N ⎮ 𝒙ᵢ ∈ ℝᵈ} and the outputs as {𝑦₁, …, 𝑦_N ⎮ 𝑦ᵢ ∈ {0, 1}}, with each observation 𝒙ᵢ corresponding to the output 𝑦ᵢ, we would like to build a classifier that takes any unseen observation 𝒙ⱼ and predicts the probability that 𝑦ⱼ = 1.

This is a statistical inference problem. Denoting 𝑝 = ℙ(𝑌 = 1), we can write the probability distribution as:
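f(yᵢ; p) = p^yᵢ (1 − p)^(1 − yᵢ),   yᵢ ∈ {0, 1},

which equals p when yᵢ = 1 and 1 − p when yᵢ = 0.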

To find the probability value 𝑝, we seek to maximize the log-likelihood function:
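𝓛(p) = Σᵢ log f(yᵢ; p) = Σᵢ [ yᵢ log p + (1 − yᵢ) log(1 − p) ].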

Note: the above takes the discriminative point of view on classification modelling; the same problem can also be framed as generative classification, which assigns Class 1/0 to an observation by modelling both the class-conditional likelihood and the prior.

The maximum occurs where the gradient is zero. Setting ▽𝓛 = 0, we have
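d𝓛/dp = Σᵢ yᵢ / p − Σᵢ (1 − yᵢ) / (1 − p) = 0,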

and it yields:
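p̂ = (1/N) Σᵢ yᵢ,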

which basically says that the chance of any future observation coming from Class 1 depends only on how frequently Class 1 occurred in the past. This is a rather useless conclusion, as it carries no information about the observations 𝒙ᵢ themselves. So, how should we make use of our observations within the maximum-likelihood framework?

How about assuming that the probability depends linearly on the observation values? We can define the probability of each observation as a linear function of all its feature values plus a constant bias term, that is,
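pᵢ = β₀ + β₁ x_{i,1} + β₂ x_{i,2} + … + β_d x_{i,d} = 𝜷ᵀ x̄ᵢ,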

where the i-th observation 𝒙ᵢ ∈ ℝᵈ has x_{i,j} as its j-th dimension (i.e., its j-th feature), and x̄ᵢ is 𝒙ᵢ augmented with a constant term 1. But this value can in theory span from −∞ to +∞, which violates the axioms of probability.

Figure 1. Logistic Function

To obtain a proper probability, we pass the linear sum through a logistic function, which squashes it into the range [0, 1]. The logistic function has the shape shown in Figure 1 above: it is nearly linear around 0, while large positive or negative values get squashed towards 1 or 0. This leads to the following probability representation:
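σ(z) = 1 / (1 + e^(−z)),   pᵢ = ℙ(yᵢ = 1 ⎮ 𝒙ᵢ) = σ(𝜷ᵀ x̄ᵢ) = 1 / (1 + exp(−𝜷ᵀ x̄ᵢ)).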

Substituting this equation into the log-likelihood function and solving for the linear coefficients {𝛽ᵢ}, we arrive at the formulation of logistic regression, that is,
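𝜷̂ = argmax_𝜷 𝓛(𝜷),   with 𝓛(𝜷) = Σᵢ [ yᵢ log σ(𝜷ᵀ x̄ᵢ) + (1 − yᵢ) log(1 − σ(𝜷ᵀ x̄ᵢ)) ],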

where the coefficient vector 𝛽 has dimension d+1.

Another important perspective on the logistic relationship between the probability and the observation values is that,
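log( pᵢ / (1 − pᵢ) ) = 𝜷ᵀ x̄ᵢ = β₀ + β₁ x_{i,1} + … + β_d x_{i,d},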

the left-hand side of the equation is the log odds, so the equation states that we represent the log odds as a linear combination of all feature values. The log odds and this linear representation enable a clean way to express the effect of each X-variable and allow easy updates with new data. They are closely tied to the interpretability of logistic regression, and we will see more on this in Part 3.

How to find 𝛽 ?

The linear coefficient vector 𝛽 is the solution of the equation ▽𝓛 = 0. However, the log-likelihood function 𝓛 is non-linear with respect to 𝛽, and there is no closed-form solution. How should we approach it then? Gradient descent! Of course 🤣 ! (In fact, logistic regression can be thought of as a one-layer neural network with a single neuron.)

Denoting
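pᵢ = σ(𝜷ᵀ x̄ᵢ) = 1 / (1 + exp(−𝜷ᵀ x̄ᵢ)),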

and observing that
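dσ(z)/dz = σ(z) (1 − σ(z)),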

we have the gradient,
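▽𝓛(𝜷) = Σᵢ ( yᵢ − pᵢ ) x̄ᵢ.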

Using the basic gradient descent formula (applied to the negative log-likelihood, i.e., gradient ascent on 𝓛), and denoting the learning rate as 𝜆, we can estimate the coefficients 𝜷 with the update rule,
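𝜷^(t+1) = 𝜷^(t) + 𝜆 ▽𝓛(𝜷^(t)) = 𝜷^(t) + 𝜆 Σᵢ ( yᵢ − σ(𝜷^(t)ᵀ x̄ᵢ) ) x̄ᵢ,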

and we iterate from t=0 till convergence.
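A minimal sketch of this update loop in plain NumPy (illustrative only — the function name fit_logistic_gd, the learning rate and the iteration count are arbitrary choices here, not a reference implementation):

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=10000):
    """Maximum-likelihood fit via gradient ascent on the log-likelihood.

    X: (N, d) feature matrix, y: (N,) array of 0/1 labels.
    """
    N, d = X.shape
    X_bar = np.hstack([np.ones((N, 1)), X])   # augment with the constant term 1
    beta = np.zeros(d + 1)                    # beta[0] is the bias
    for _ in range(n_iter):
        p = sigmoid(X_bar @ beta)             # predicted P(y=1) for each observation
        grad = X_bar.T @ (y - p)              # gradient of the log-likelihood
        beta += lr * grad / N                 # ascent step (dividing by N just rescales lr)
    return beta
```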

We can avoid the learning-rate hyper-parameter by using Newton's method, where the learning rate is replaced by the inverse of the Hessian of the log-likelihood function. To further sidestep the matrix inversion that the Hessian introduces, more advanced optimization techniques, such as the conjugate gradient method, can be used; this is essentially how sklearn.linear_model.LogisticRegression() from the sklearn package works in practice (e.g., its newton-cg and lbfgs solvers).
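Concretely, the Hessian of the log-likelihood is

▽²𝓛(𝜷) = − Σᵢ pᵢ (1 − pᵢ) x̄ᵢ x̄ᵢᵀ,

and the Newton update 𝜷^(t+1) = 𝜷^(t) − [▽²𝓛(𝜷^(t))]⁻¹ ▽𝓛(𝜷^(t)) is the classic iteratively re-weighted least squares (IRLS) scheme for logistic regression.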

The objective function in the sklearn implementation

In the previous formulation, we set the y-variable to be either 1 or 0. In practice, there is an alternative convention, y ∈ {−1, +1}, and the LogisticRegression class in sklearn performs this conversion internally.

The starting point is the same: we seek to maximize the log-likelihood function (or, equivalently, to minimize its negative):
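𝓛(𝜷) = Σᵢ log f(yᵢ; 𝜷),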

and the difference lies in how we model the probability function 𝑓(yᵢ; p). With y ∈ {−1, +1}, we have,
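ℙ(yᵢ = +1 ⎮ 𝒙ᵢ) = σ(𝜷ᵀ x̄ᵢ)   and   ℙ(yᵢ = −1 ⎮ 𝒙ᵢ) = 1 − σ(𝜷ᵀ x̄ᵢ) = σ(−𝜷ᵀ x̄ᵢ),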

and we can write this in a compact form as,
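f(yᵢ; 𝜷) = σ( yᵢ 𝜷ᵀ x̄ᵢ ) = 1 / (1 + exp(−yᵢ 𝜷ᵀ x̄ᵢ)),

using the symmetry 1 − σ(z) = σ(−z).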

Substituting this function into the log-likelihood function, we have,
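−𝓛(𝜷) = Σᵢ log( 1 + exp(−yᵢ 𝜷ᵀ x̄ᵢ) ),

which is the data term of the objective that sklearn minimizes; with the default L2 penalty, the scikit-learn user guide writes it (for weights 𝒘 and intercept c) as essentially

min over 𝒘, c of   ½ 𝒘ᵀ𝒘 + C Σᵢ log( 1 + exp(−yᵢ (𝒙ᵢᵀ𝒘 + c)) ),

where C controls the strength of regularization (larger C means a weaker penalty).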

We can then proceed to find 𝜷 following the same approach. The resulting gradient has a slightly different form, but the derivation is simpler and cleaner.
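As a quick sanity check, the hand-rolled gradient ascent above and sklearn should land on nearly the same coefficients. This is just a sketch: the simulated data, the true coefficients, and the large C (which makes sklearn's L2 penalty negligible) are arbitrary choices, and fit_logistic_gd refers to the NumPy sketch earlier in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# simulate a small binary classification problem
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
beta_true = np.array([-0.5, 2.0, -1.0])                     # [bias, beta_1, beta_2]
p = 1.0 / (1.0 + np.exp(-(beta_true[0] + X @ beta_true[1:])))
y = rng.binomial(1, p)

# gradient-ascent fit from the earlier sketch
beta_gd = fit_logistic_gd(X, y, lr=0.5, n_iter=20000)

# a very large C makes the L2 penalty negligible, approximating the pure MLE
clf = LogisticRegression(C=1e6).fit(X, y)
beta_sk = np.concatenate([clf.intercept_, clf.coef_.ravel()])

print("gradient ascent:", np.round(beta_gd, 3))
print("sklearn        :", np.round(beta_sk, 3))
```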
