Intro to Poisson Regression

Poisson Distribution

The Poisson distribution is a discrete probability distribution for the number of events that occur randomly in a given interval of time. It is a limiting form of the binomial distribution in which n becomes very large and p becomes very small while the mean np stays fixed (that is, the number of trials is very large while the probability of the outcome under observation on any single trial is small).
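The limiting relationship with the binomial distribution can be checked numerically. The snippet below is a minimal sketch using only the standard library; the values λ = 2 and k = 3 are arbitrary choices made for illustration and do not come from the article.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0  # illustrative mean, chosen arbitrarily
k = 3      # illustrative count, chosen arbitrarily

# Hold the mean n * p = lam fixed while n grows and p shrinks.
for n in (10, 100, 10_000):
    p = lam / n
    gap = abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
    print(f"n={n:>6}  p={p:.4f}  |binomial - poisson| = {gap:.6f}")
```

As n grows, the printed gap should shrink towards zero, which is the sense in which the Poisson distribution is the limit of the binomial.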

Definition of Poisson Distribution

X = the number of events in a given interval

λ = mean number of events per interval

The probability of observing x events in a given interval is given by

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$

Poisson Distribution problem 1

Births in a hospital occur randomly at an average rate of 1.8 births per hour. What is the probability of observing 4 births in a given hour at the hospital?

Let X = Number of births in a given hour

(i) Events occur randomly

(ii) Mean rate λ = 1.8
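Under these two conditions X can be treated as Poisson with λ = 1.8, so the answer is the Poisson probability at x = 4. A minimal sketch of the calculation, using only the standard library (the function name `poisson_pmf` is just an illustrative helper):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

# Probability of exactly 4 births in an hour when the mean rate is 1.8 per hour.
p_four_births = poisson_pmf(4, 1.8)
print(round(p_four_births, 4))  # 0.0723
```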

Poisson Distribution problem 2

What is the probability of observing more than or equal to 2 births in a given hour at the hospital?
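Summing the probabilities for 2, 3, 4, … directly would need infinitely many terms, so it is easier to use the complement: P(X ≥ 2) = 1 − P(X = 0) − P(X = 1). A minimal sketch with the same λ = 1.8 (the helper name is illustrative):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 1.8  # mean births per hour, from the problem statement

# Complement rule: subtract the probabilities of 0 and 1 births from 1.
p_at_least_two = 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_at_least_two, 3))  # 0.537
```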

Solutions can be found on pages 11 and 13 of these slides: http://www.stats.ox.ac.uk/~marchini/teaching/L5/L5.slides.pdf

Poisson Regression Model

Poisson regression is a form of regression analysis used to model count data. It is appropriate when the conditional distribution of the response Y (a count) given the explanatory variables is expected to be Poisson.

The Poisson regression model is written in terms of the mean response. We assume that there exists a function that relates the mean of the response to a linear predictor.

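First we map our k-dimensional input (x1, x2, …, xk) to a real number through a linear predictor, writing the coefficients as β0, β1, …, βk:

$$\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

Then we apply a second transformation so that the result is always positive (as the mean of a Poisson distribution has to be > 0):

$$\mu = e^{\eta}$$

The Poisson regression model:

$$Y \mid x_1, \ldots, x_k \sim \text{Poisson}(\mu), \qquad \log \mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$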
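To make the two transformations concrete, here is a minimal sketch in plain Python. The coefficient values and the function names are hypothetical and chosen only for illustration; in practice the coefficients are estimated from data.

```python
import math

def linear_predictor(x, beta0, beta):
    """Map a k-dimensional input to a real number: beta0 + sum_j beta[j] * x[j]."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))

def poisson_mean(x, beta0, beta):
    """Exponentiate the linear predictor so the mean is always positive."""
    return math.exp(linear_predictor(x, beta0, beta))

# Hypothetical coefficients for a model with two inputs.
beta0, beta = -1.0, [0.5, -2.0]

mu_a = poisson_mean([1.0, 3.0], beta0, beta)  # linear predictor is negative here
mu_b = poisson_mean([2.0, 3.0], beta0, beta)  # first input increased by one unit

print(mu_a > 0)     # the mean stays positive even for a negative linear predictor
print(mu_b / mu_a)  # equals exp(0.5): a unit change multiplies the mean
```

One consequence of the exponential transformation is that coefficients act multiplicatively: increasing an input by one unit multiplies the expected count by the exponential of its coefficient.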

Example in Python using Statsmodels

Credit: data and code from https://github.com/mahat/PoissonRegression

Here we would like to predict the number of awards received by students using the data with the structure below:

Here’s the summary of the data

After fitting the data into Poisson regression, we get the following results and graph:

Limitations of Poisson Regression Model

• Heterogeneity in the data: more than one process may be generating the data. For example, the data might unknowingly have been collected from more than one group of people
• Overdispersion: the variance of the observed counts is larger than the model assumes (a Poisson model assumes the mean and the variance are equal)
• In Statsmodels, the ratio Pearson chi2 / Df Residuals is approximately 1 if the data are drawn from a Poisson distribution and the sample is large enough. For observed data, a ratio greater than 1 suggests overdispersion, while a ratio less than 1 suggests underdispersion. If the data are overdispersed, an alternative such as a negative binomial model is usually considered; a zero-inflated model is appropriate when the overdispersion comes from an excess of zeros
• For our Python example above, Pearson chi2 / Df Residuals is 1.08, which is close to 1
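The dispersion ratio in the list above can also be computed by hand, which makes clear what it measures: the Pearson chi-squared statistic sums (observed − fitted)² / fitted over all observations, and the residual degrees of freedom are the number of observations minus the number of estimated parameters. The sketch below uses made-up toy numbers purely for illustration; the function name is hypothetical.

```python
def pearson_dispersion(observed, fitted_means, n_params):
    """Pearson chi2 divided by residual degrees of freedom for a Poisson fit."""
    if len(observed) != len(fitted_means):
        raise ValueError("observed and fitted_means must have the same length")
    # For a Poisson model the variance equals the mean, so each squared
    # residual is scaled by the fitted mean.
    pearson_chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted_means))
    df_resid = len(observed) - n_params
    return pearson_chi2 / df_resid

# Hypothetical toy values, not from the awards data.
observed = [0, 1, 3, 2, 6]
fitted_means = [0.5, 1.0, 2.0, 2.5, 4.0]

ratio = pearson_dispersion(observed, fitted_means, n_params=2)
print(ratio)
```

With so few observations the ratio is very noisy; the comparison against 1 is only meaningful for reasonably large samples.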