How to: Poisson Regression Model + Python Implementation
Hello there! As my first post I’ll be attempting to make almost the whole inference of the Poisson regression model, which was also a course work for my class of Pattern recognition, so don’t try this at home.
In this post we will cover:
- What is the Poisson Regression?
- The Poisson Distribution as part of the Exponential Family.
- Maximum Log-Likelihood method.
- Derivating the loss function.
- The matricial representation.
- A simple implementation.
- Awesome references.
1. What is the Poisson Regression?
We already know about the Linear Regression, which helps us answer questions like “How much will a house with these characteristics cost?”. Or the Logistic Regression, which is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick.
But, what happens when the questions are “How many costumers will come today?”, “How many people are in line at the grocery store?”, “How many…”, and one way to answer these questions is using the Poisson Regression Model.
Poisson Regression is used to model count data. For this, we assume the response variable Y has a Poisson Distribution, and assumes the logarithm of its expected value can be modeled by a linear combinations of unknown parameters.
2. The Poisson Distribution as part of the Exponential Family
Alright, first things first, can we express the Poisson distribution on the Exponential Family form? Well, we can find that out.
First, if we have a random variable X such that
we know we can express its probability function as the following
Okay, now let’s do this, assume we have a random variable Y that is distributed as a Poisson with a parameter λ
with its probability function being
We know that
and thus
but wait a minute, the equation (2) has the same form as the equation (1)
and thus
Finally we can say
3. Maximum Log-Likelihood method
Okay! Now, assume we have a random variable Y such as
and for (3) we now we can express its probability function as
Recall
and we do know for GLM that
and thus
Alright, now if we have a set of m observations
where we assume for each y^(i) being i.i.d. and we are looking to maximize the following probability function
Now we use our secret not so secret weapon, the Maximum Likelihood method
and for the Maximum Log-Likelihood method
since log(1/y!) does not depend on w∗ or b∗
and since we would like to minimize instead of maximize
but we know
and thus
and because we want an average
finally obtaining the following loss function
4. Derivating the loss function
Now that we have a loss function, we want it to have a value as low as possible, so we do what we learned on high school and derivate to find the minimum. First we do it respect our vector of features w and make it equal to 0
and for (5) and (6) we have
and thus
and we do the same for b
5. The matricial representation
We’re almost there! Now, we know we can represent the information of our observations as the following
and we also have a feature vector w and a bias vector b⃗
Now let’s define a matrix Z such that
and a function σ as the following
and now we do
where we declare a matrix A that
and we now for (7) and (8)
6. A simple implementation
Alright! Now it’s time to code these results, you can check out the Jupyter Notebook too see the full setup and implementation, but here I’ll leave the important parts :)
We know that we will need the loss function, so let’s start with it
def loss(x, y, w, b):
y_hat = np.exp(x @ w + b)
error = (y_hat - np.log(y_hat) * y).mean()
return error
Done! Let’s remember we also will need to calculate the gradient with respect w and also respect with b, down bellow I’ll show you the implementation following with the gradient_descent algorithm
def grad(x, y, w, b):
M, n = x.shape
y_hat = np.exp(x @ w + b)
dw = (x.T @ (y_hat - y)) / M
db = (y_hat - y).mean()
return dw, dbdef gradient_descent(x, y, w_0, b_0, alpha, num_iter):
w, b = w_0.copy(), b_0
hist = np.zeros(num_iter)
M, n = x.shape
for iter in range(num_iter):
dw, db = grad(x, y, w, b)
w -= alpha * dw
b -= alpha * db
hist[iter] = loss(x, y, w, b)
return w, b, hist
We used these functions with a generated dataset and we calculate the following results (with the loss function decreasing as we can see)
And that is pretty much it, thank you for coming to my TED Talk