Generalized Linear Models

The mathematics and logic behind Poisson regression and other generalized linear models.

Tej Sukhatme
Aug 25, 2020

Overview

Both the documentation and the code were heavily inspired by pyGLMnet.

The term generalized linear model (GLM) refers to a larger class of models. In these models, the response variable yᵢ follows an exponential family distribution with mean μᵢ, which is a nonlinear function of xᵢᵀβ. Some would call these models “nonlinear” because μᵢ is often a nonlinear function of the covariates, but we consider GLMs to be linear because the covariates affect the distribution of yᵢ only through the linear combination xᵢᵀβ.
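
In symbols, using the notation that appears later in this post (with q the non-linearity, i.e. the inverse of the link function, and the intercept β₀ written separately), the setup is roughly:

y_i \sim \text{ExpFamily}(\mu_i), \qquad \mu_i = q(\beta_0 + x_i^\top \beta)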

There are three components to any GLM:

Random Component: the probability distribution of the response variable.

Systematic Component: the explanatory variables in the model, more specifically their linear combination, which forms the linear predictor.

Link Function: the link between the systematic and random components. It specifies how the expected value of the response relates to the linear predictor η built from the explanatory variables.

Assumptions:

  • The data is independently distributed.
  • The dependent variable assumes a distribution from an exponential family (e.g. binomial, Poisson, multinomial, normal,…)
  • A GLM assumes a linear relationship between the explanatory variables and the linear predictor η, which the (inverse) link function then transforms, possibly non-linearly, into the mean μ of the exponential family distribution.
  • The independent variables can even be power terms or other nonlinear transformations of the original independent variables.
  • Errors need to be independent.
  • It uses maximum likelihood estimation (MLE) and thus relies on large-sample approximations.

Advantages of GLMs over traditional regression

  • We do not need to transform the response Y to have a normal distribution.
  • The choice of link function is separate from the choice of random component, so we have more flexibility in modeling (though typically one uses an exponential family distribution with its canonical link function).
  • If the link produces additive effects, then we do not need constant variance.
  • The models are fitted via maximum likelihood estimation, so the estimators inherit the optimal (asymptotic) properties of MLE.

In this project I will be focusing on a particular variant of GLM: Poisson regression. It is the kind of regression used in the original research paper by McIver and Brownstein.

GLM with elastic net penalty

In the elastic-net-regularized generalized linear model (GLM), we want to solve the following convex optimization problem:
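
In the notation of the sections below, where L(β₀, β) is the model log-likelihood and Pₐ(β) is the elastic net penalty:

\min_{\beta_0, \beta} \; J(\beta_0, \beta) = -L(\beta_0, \beta) + \lambda \, P_\alpha(\beta)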

We will go through the Poisson case and show how we optimize the cost function.

Poisson GLM

For the Poisson GLM,

λᵢ is the rate parameter of an inhomogeneous linear-nonlinear Poisson (LNP) process with instantaneous mean given by:
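
\lambda_i = \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\right) = q(\beta_0 + x_i^\top \beta)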

where xᵢ ∈ Rᵖˣ¹, i = 1, 2, …, n are the observed independent variables (predictors), and β₀ ∈ R¹ˣ¹, β ∈ Rᵖˣ¹ are the linear coefficients. The rate parameter λᵢ is also known as the conditional intensity function, conditioned on (β₀, β), and q(z) = exp(z) is the non-linearity.

Poisson Log-likelihood

The likelihood of observing the spike count yᵢ under the Poisson likelihood function with inhomogeneous rate λᵢ is given by:
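
P(y = y_i) = \frac{e^{-\lambda_i} \, \lambda_i^{y_i}}{y_i!}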

The log-likelihood is given by:
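
L(\beta_0, \beta) = \sum_i \left\{ y_i \log(\lambda_i) - \lambda_i - \log(y_i!) \right\}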

We are interested in maximizing the log-likelihood with respect to β₀ and β. Thus, we can drop the factorial term:
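
L(\beta_0, \beta) = \sum_i \left\{ y_i \log(\lambda_i) - \lambda_i \right\} = \sum_i \left\{ y_i \log q(\beta_0 + x_i^\top \beta) - q(\beta_0 + x_i^\top \beta) \right\}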

Elastic net penalty

The elastic net penalty is given by:
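
P_\alpha(\beta) = \frac{1 - \alpha}{2} \, \|\beta\|_2^2 + \alpha \, \|\beta\|_1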

When α = 0 the penalized model is known as ridge regression, and when α = 1 it is known as the LASSO. Note that we do not penalize the bias term β₀.

Objective function

We minimize the objective function:
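
J(\beta_0, \beta) = -L(\beta_0, \beta) + \lambda \, P_\alpha(\beta)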

where L(β₀, β) is the Poisson log-likelihood, Pₐ(β) is the elastic net penalty term, and λ and α are the regularization parameters.

Gradient descent

To calculate the gradients of the cost function with respect to β₀ and β, let’s plug in the definitions of the log-likelihood and penalty terms from above.
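
J(\beta_0, \beta) = \sum_i \left\{ q(\beta_0 + x_i^\top \beta) - y_i \log q(\beta_0 + x_i^\top \beta) \right\} + \lambda \left( \frac{1 - \alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right)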

Since we will apply coordinate descent, let’s rewrite this cost in terms of each scalar parameter βⱼ.
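
J = \sum_i \left\{ q\left(\beta_0 + \sum_j \beta_j x_{ij}\right) - y_i \log q\left(\beta_0 + \sum_j \beta_j x_{ij}\right) \right\} + \lambda \left( \frac{1 - \alpha}{2} \sum_j \beta_j^2 + \alpha \sum_j |\beta_j| \right)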

Let’s take the derivatives of some big expressions using the chain rule. Define:
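
z_i = \beta_0 + \sum_j \beta_j x_{ij}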

For the non-linearity in the first term
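
\frac{\partial}{\partial \beta_j} \, q(z_i) = \dot{q}(z_i) \, x_{ij}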

For the non-linearity in the second term
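
\frac{\partial}{\partial \beta_j} \, \log q(z_i) = \frac{\dot{q}(z_i)}{q(z_i)} \, x_{ij}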

where q̇(z) happens to be the sigmoid function (this is the case when the softplus q(z) = log(1 + exp(z)) is used as the non-linearity in place of the plain exponential, whose derivative would simply be exp(z) again):
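
\dot{q}(z) = \sigma(z) = \frac{1}{1 + e^{-z}}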

Putting it all together (and keeping only the smooth part of the penalty, since the ℓ₁ term is not differentiable at zero and is typically handled separately, e.g. with a soft-thresholding step), we have:
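
\frac{\partial J}{\partial \beta_0} = \sum_i \left\{ \dot{q}(z_i) - y_i \, \frac{\dot{q}(z_i)}{q(z_i)} \right\}

\frac{\partial J}{\partial \beta_j} = \sum_i \left\{ \dot{q}(z_i) - y_i \, \frac{\dot{q}(z_i)}{q(z_i)} \right\} x_{ij} + \lambda (1 - \alpha) \, \beta_j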

This is what gives us the gradient for our loss function, which I will be coding in the next weekly blog post.
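
To make the gradient above concrete, here is a minimal NumPy sketch. This is my own illustration rather than pyGLMnet’s actual implementation: the function names are mine, it assumes the softplus non-linearity q(z) = log(1 + exp(z)) (so that q̇ is the sigmoid), and it only differentiates the smooth ℓ₂ part of the penalty.

```python
import numpy as np

def softplus(z):
    # q(z) = log(1 + exp(z)), computed in a numerically stable way
    return np.log1p(np.exp(-np.abs(z))) + np.maximum(z, 0)

def sigmoid(z):
    # q'(z), the derivative of the softplus
    return 1.0 / (1.0 + np.exp(-z))

def poisson_glm_gradient(beta0, beta, X, y, reg_lambda, alpha):
    """Gradient of J(beta0, beta) = -L(beta0, beta) + lambda * P_alpha(beta).

    Only the smooth l2 piece of the elastic net penalty is differentiated here;
    the l1 piece would be handled separately (e.g. by soft-thresholding).
    """
    z = beta0 + X @ beta                 # linear predictor z_i, shape (n,)
    q = softplus(z)                      # rate lambda_i = q(z_i)
    dq = sigmoid(z)                      # q'(z_i)
    common = dq - y * dq / q             # shared factor in both partial derivatives
    grad_beta0 = np.sum(common)
    grad_beta = X.T @ common + reg_lambda * (1.0 - alpha) * beta
    return grad_beta0, grad_beta
```

A full solver would wrap this in a batch gradient or cyclical coordinate descent loop and add the ℓ₁ handling on top.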
