How to: Poisson Regression Model + Python Implementation

Ximena Sandoval
LCC-Unison
Dec 23, 2020 · 6 min read

Hello there! In this, my first post, I'll attempt to work through almost the entire derivation of the Poisson regression model, which was also coursework for my Pattern Recognition class, so don't try this at home.

In this post we will cover:

  1. What is the Poisson Regression?
  2. The Poisson Distribution as part of the Exponential Family.
  3. Maximum Log-Likelihood method.
  4. Differentiating the loss function.
  5. The matrix representation.
  6. A simple implementation.
  7. Awesome references.

1. What is the Poisson Regression?

We already know about linear regression, which helps us answer questions like “How much will a house with these characteristics cost?”, and logistic regression, which models the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick.

But what happens when the questions are “How many customers will come today?”, “How many people are in line at the grocery store?”, “How many…”? One way to answer these counting questions is the Poisson regression model.

Poisson regression is used to model count data. For this, we assume the response variable Y has a Poisson distribution, and that the logarithm of its expected value can be modeled by a linear combination of unknown parameters.
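
In symbols, using the notation we will stick with for the rest of the post (a weight vector w and a bias b), the model is

\log \mathbb{E}[Y \mid x] = w^T x + b, \qquad \text{or equivalently} \qquad \mathbb{E}[Y \mid x] = e^{w^T x + b}.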

2. The Poisson Distribution as part of the Exponential Family

Alright, first things first: can we express the Poisson distribution in the exponential family form? Let's find out.
First, if we have a random variable X that belongs to the exponential family,

X \sim \text{ExponentialFamily}(\eta),

we know we can express its probability function as

p(x; \eta) = b(x) \exp\!\left(\eta^T T(x) - a(\eta)\right), \qquad (1)

where \eta is the natural parameter, T(x) the sufficient statistic, a(\eta) the log-partition function and b(x) the base measure.

Okay, now let’s do this, assume we have a random variable Y that is distributed as a Poisson with a parameter λ

with its probability function being

We know that

and thus

But wait a minute: equation (2) has the same form as equation (1), with

b(y) = \frac{1}{y!}, \qquad \eta = \log \lambda, \qquad T(y) = y, \qquad a(\eta) = \lambda = e^{\eta},

and thus the Poisson distribution is a member of the exponential family. Finally we can say

\text{Poisson}(\lambda) \in \text{Exponential Family}. \qquad (3)
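
Just to convince ourselves numerically, here is a quick sketch (assuming scipy is available; the values of λ and y are arbitrary) checking that the rewritten form (2) really is the Poisson probability function:

import numpy as np
from math import factorial
from scipy.stats import poisson

lam, y = 3.5, 2
pmf = poisson.pmf(y, lam)                                        # standard Poisson pmf
exp_family = (1 / factorial(y)) * np.exp(y * np.log(lam) - lam)  # the form in (2)
print(np.isclose(pmf, exp_family))  # True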

3. Maximum Log-Likelihood method

Okay! Now, assume we have a random variable Y such that

Y \mid x \sim \text{Poisson}(\lambda),

and by (3) we know we can express its probability function as

p(y; \eta) = \frac{1}{y!} \exp\!\left(y\eta - e^{\eta}\right).

Recall that the natural parameter is

\eta = \log \lambda,

and we know from the GLM assumptions that \eta is linear in the features,

\eta = w^T x + b,

and thus

\lambda = \mathbb{E}[Y \mid x] = e^{w^T x + b}.

Alright, now, if we have a set of m observations

\left\{\left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(m)}, y^{(m)}\right)\right\},

where we assume each y^{(i)} is i.i.d., we are looking to maximize the following probability function:

P\!\left(\vec{y} \mid X; w, b\right) = \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; w, b\right).

Now we use our secret, not-so-secret weapon, the maximum likelihood method,

(w^*, b^*) = \arg\max_{w, b} \prod_{i=1}^{m} \frac{1}{y^{(i)}!} \exp\!\left(y^{(i)}\left(w^T x^{(i)} + b\right) - e^{w^T x^{(i)} + b}\right),

and, taking logarithms, the maximum log-likelihood method,

(w^*, b^*) = \arg\max_{w, b} \sum_{i=1}^{m} \left[\log\frac{1}{y^{(i)}!} + y^{(i)}\left(w^T x^{(i)} + b\right) - e^{w^T x^{(i)} + b}\right].

Since \log(1/y^{(i)}!) does not depend on w^* or b^*, we can drop it,

(w^*, b^*) = \arg\max_{w, b} \sum_{i=1}^{m} \left[y^{(i)}\left(w^T x^{(i)} + b\right) - e^{w^T x^{(i)} + b}\right],

and since we would like to minimize instead of maximize, we flip the sign,

(w^*, b^*) = \arg\min_{w, b} \sum_{i=1}^{m} \left[e^{w^T x^{(i)} + b} - y^{(i)}\left(w^T x^{(i)} + b\right)\right].

But we know the model's prediction for the i-th observation is

\hat{y}^{(i)} = e^{w^T x^{(i)} + b}, \qquad \text{so} \qquad w^T x^{(i)} + b = \log \hat{y}^{(i)},

and thus the objective becomes

\sum_{i=1}^{m} \left[\hat{y}^{(i)} - y^{(i)} \log \hat{y}^{(i)}\right],

and because we want an average over the m observations, we finally obtain the following loss function:

\mathcal{L}(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left[\hat{y}^{(i)} - y^{(i)} \log \hat{y}^{(i)}\right].
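
To get a feel for the numbers (the values here are made up): a single observation with prediction \hat{y} = 2 and true count y = 3 contributes 2 - 3\log 2 \approx -0.08 to the sum. Note that the loss can be negative, since we dropped the constant \log(y!) term; only differences in the loss matter, not its absolute value.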

4. Differentiating the loss function

Now that we have a loss function, we want its value to be as low as possible, so we do what we learned in high school and differentiate it to find the minimum (in practice we will not set the gradient to zero analytically, but feed it to gradient descent). First we differentiate with respect to our weight vector w:

\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left[\frac{\partial \hat{y}^{(i)}}{\partial w} - y^{(i)} \frac{\partial \log \hat{y}^{(i)}}{\partial w}\right].

Since \hat{y}^{(i)} = e^{w^T x^{(i)} + b}, the chain rule gives

\frac{\partial \hat{y}^{(i)}}{\partial w} = \hat{y}^{(i)} x^{(i)} \qquad \text{and} \qquad \frac{\partial \log \hat{y}^{(i)}}{\partial w} = x^{(i)},

and thus

\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x^{(i)}.

And we do the same for b:

\frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right).
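
If you want to double-check these formulas, here is a minimal sketch (the data, shapes and parameter values are made up for illustration) comparing the analytic gradient against a finite-difference approximation of the loss:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 3))
w, b = rng.normal(size=3) * 0.1, 0.2
y = rng.poisson(np.exp(x @ w + b)).astype(float)

def L(w, b):
    # The loss we just derived: mean of y_hat - y * log(y_hat)
    y_hat = np.exp(x @ w + b)
    return (y_hat - y * np.log(y_hat)).mean()

# Analytic gradient with respect to w, from the formula above
y_hat = np.exp(x @ w + b)
dw = x.T @ (y_hat - y) / len(y)

# Finite-difference approximation for the first coordinate of w
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
print(np.isclose(dw[0], (L(w + e0, b) - L(w - e0, b)) / (2 * eps)))  # True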

5. The matrix representation

We’re almost there! Now, we know we can stack the information of our m observations into a design matrix X and a response vector \vec{y},

X = \begin{pmatrix} \left(x^{(1)}\right)^T \\ \vdots \\ \left(x^{(m)}\right)^T \end{pmatrix} \in \mathbb{R}^{m \times n}, \qquad \vec{y} = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{pmatrix},

and we also have a weight vector w and a bias vector \vec{b} (the scalar b broadcast to m entries).

Now let's define a vector Z such that

Z = Xw + \vec{b},

and a function \sigma as the element-wise exponential,

\sigma(Z) = e^{Z},

and now we can compute the predictions for all observations at once,

\hat{y} = \sigma(Z) = \sigma(Xw + \vec{b}),

where we declare a vector A such that

A = \hat{y} - \vec{y} = \sigma(Xw + \vec{b}) - \vec{y},

and from the gradients we derived above we obtain their vectorized form:

\frac{\partial \mathcal{L}}{\partial w} = \frac{1}{m} X^T A, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} A_i.
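
To see that the vectorized expressions compute the same thing as the per-observation sums, here is a small sketch (the shapes and values are made up):

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = rng.poisson(1.0, size=100).astype(float)
w, b = np.zeros(4), 0.0

# Vectorized form: A = sigma(Xw + b) - y, then grad_w = X^T A / m
A = np.exp(X @ w + b) - y
grad_w_vec = X.T @ A / len(y)

# Explicit loop over observations, summing (y_hat_i - y_i) * x_i
grad_w_loop = sum((np.exp(X[i] @ w + b) - y[i]) * X[i] for i in range(len(y))) / len(y)

print(np.allclose(grad_w_vec, grad_w_loop))  # True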

6. A simple implementation

Alright! Now it’s time to code these results. You can check out the Jupyter Notebook to see the full setup and implementation, but here I’ll leave the important parts :)

We know that we will need the loss function, so let’s start with it

import numpy as np

def loss(x, y, w, b):
    # Poisson regression loss: mean of y_hat - y * log(y_hat), with y_hat = exp(Xw + b)
    y_hat = np.exp(x @ w + b)
    error = (y_hat - np.log(y_hat) * y).mean()
    return error
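
As a sanity check, this loss should match the mean negative Poisson log-likelihood up to the constant \log(y!) term we dropped during the derivation. A quick sketch (reusing the loss function and numpy import from above, assuming scipy is available; the data is made up):

from scipy.stats import poisson
from scipy.special import gammaln

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 3))
w, b = rng.normal(size=3), 0.3
y = rng.poisson(np.exp(x @ w + b))

nll = -poisson.logpmf(y, np.exp(x @ w + b)).mean()  # full negative log-likelihood
dropped = gammaln(y + 1).mean()                     # the log(y!) constant we dropped
print(np.isclose(loss(x, y, w, b), nll - dropped))  # True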

Done! Remember that we will also need to calculate the gradient with respect to w and with respect to b. Down below I’ll show you that implementation, followed by the gradient_descent algorithm.

def grad(x, y, w, b):
    # Gradients of the loss with respect to w and b
    M, n = x.shape
    y_hat = np.exp(x @ w + b)
    dw = (x.T @ (y_hat - y)) / M
    db = (y_hat - y).mean()
    return dw, db

def gradient_descent(x, y, w_0, b_0, alpha, num_iter):
    # Plain batch gradient descent, recording the loss at every iteration
    w, b = w_0.copy(), b_0
    hist = np.zeros(num_iter)

    for i in range(num_iter):
        dw, db = grad(x, y, w, b)
        w -= alpha * dw
        b -= alpha * db
        hist[i] = loss(x, y, w, b)

    return w, b, hist

We used these functions with a generated dataset and computed the fitted parameters, with the loss decreasing over the iterations as expected.
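
If you want to reproduce something similar, here is a minimal sketch of such a setup (the dataset, true parameters, learning rate and iteration count below are made up for illustration, not taken from the original notebook; it reuses the functions and numpy import defined above):

# Hypothetical synthetic data: 500 observations, 2 features
rng = np.random.default_rng(0)
M, n = 500, 2
x = rng.normal(size=(M, n))
true_w, true_b = np.array([0.5, -0.25]), 1.0
y = rng.poisson(np.exp(x @ true_w + true_b)).astype(float)

# Fit with the functions defined above
w_0, b_0 = np.zeros(n), 0.0
w, b, hist = gradient_descent(x, y, w_0, b_0, alpha=0.05, num_iter=2000)

print(w, b)               # should land close to true_w and true_b
print(hist[0], hist[-1])  # the loss should have decreased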

And that is pretty much it, thank you for coming to my TED Talk

Ximena Sandoval
LCC-Unison

Cloud Platform Engineer at Encora. Currently learning about CI/CD, Kubernetes and watching Studio Ghibli movies✨