Intro to Poisson Regression
The Poisson distribution is a discrete probability distribution for the number of events that occur randomly in a given interval of time. It is a limiting form of the binomial distribution in which n becomes very large and p becomes very small while the mean np stays fixed (that is, the number of trials is very large while the probability of the outcome under observation on any single trial is small).
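The limiting relationship with the binomial distribution can be checked numerically. The sketch below is only an illustration: the mean of 2.0, the count k = 3, and the helper names binomial_pmf and poisson_pmf are assumptions chosen for the example, not values from this article.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count is lam."""
    return exp(-lam) * lam**k / factorial(k)

lam = 2.0  # assumed mean number of events, for illustration only
k = 3      # assumed count whose probability we compare

# Hold the mean n * p fixed at lam while n grows and p shrinks.
for n in (10, 100, 10_000):
    p = lam / n
    gap = abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
    print(f"n={n:>6}  binomial={binomial_pmf(k, n, p):.6f}  "
          f"poisson={poisson_pmf(k, lam):.6f}  gap={gap:.6f}")
```

The gap between the two probabilities shrinks as n grows, which is what the limiting argument predicts.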
Definition of Poisson Distribution
X = the number of events in a given interval
λ = mean number of events per interval
The probability of observing x events in a given interval is given by
$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$
Poisson Distribution problem 1
Births in a hospital occur randomly at an average rate of 1.8 births per hour. What is the probability of observing 4 births in a given hour at the hospital?
Let X = Number of births in a given hour
(i) Events occur randomly
(ii) Mean rate λ = 1.8
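A minimal sketch of the calculation for this problem, using the Poisson probability mass function with λ = 1.8 and x = 4. The helper name poisson_pmf is an assumption made for this example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

# Problem 1: average of 1.8 births per hour, exactly 4 births in an hour.
p_four_births = poisson_pmf(4, 1.8)
print(round(p_four_births, 4))  # prints 0.0723
```

So there is roughly a 7% chance of seeing exactly 4 births in a given hour.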
Poisson Distribution problem 2
What is the probability of observing 2 or more births in a given hour at the hospital?
Solutions can be found on pages 11 and 13 of these slides: http://www.stats.ox.ac.uk/~marchini/teaching/L5/L5.slides.pdf
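For the second problem it is easier to work with the complement: P(X ≥ 2) = 1 − P(X = 0) − P(X = 1). The following is a minimal sketch of that calculation with λ = 1.8; the helper name poisson_pmf is an assumption made for this example.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 1.8  # average births per hour, from the problem statement

# Subtract the probabilities of 0 and 1 births from 1.
p_two_or_more = 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)
print(round(p_two_or_more, 4))  # prints 0.5372
```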
Poisson Regression Model
Poisson regression is a form of regression analysis used to model count data. It is appropriate when the conditional distribution of the response Y (a count) given the observed predictors is expected to be Poisson.
The Poisson regression model is written in terms of the mean response. We assume that there exists a function that relates the mean of the response to a linear predictor.
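First we map our k-dimensional input (x1, x2, …, xk) to a real number through a linear predictor:
$$\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
Then we apply a second transformation that yields only positive values (as our mean value has to be positive):
$$\mu = e^{\eta}$$
The Poisson regression model:
$$Y \mid x_1, \ldots, x_k \sim \text{Poisson}(\mu), \qquad \log \mu = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$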
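The two transformations can be sketched in a few lines of Python. This assumes the usual log link, where the mean is the exponential of an intercept plus a weighted sum of the inputs. The coefficient values and the function names linear_predictor and poisson_mean are hypothetical, chosen only to show that the mean stays positive even when the linear predictor is negative.

```python
from math import exp

def linear_predictor(intercept, coefs, xs):
    """Map a k-dimensional input to a real number (may be negative)."""
    return intercept + sum(b * x for b, x in zip(coefs, xs))

def poisson_mean(intercept, coefs, xs):
    """Apply the exponential so the mean is always strictly positive."""
    return exp(linear_predictor(intercept, coefs, xs))

# Hypothetical coefficients for a model with k = 2 inputs.
intercept, coefs = -1.0, [0.5, -2.0]

for xs in ([0.0, 0.0], [2.0, 0.0], [0.0, 3.0]):
    eta = linear_predictor(intercept, coefs, xs)
    mu = poisson_mean(intercept, coefs, xs)
    print(f"x={xs}  linear predictor={eta:+.2f}  mean={mu:.4f}")
```

Because the mean is an exponential of the linear predictor, increasing an input by one unit multiplies the expected count by the exponential of its coefficient, rather than adding to it.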
Example in Python using Statsmodels
Credit data and code from: https://github.com/mahat/PoissonRegression
Here we would like to predict the number of awards received by students using the data with the structure below:
Here’s the summary of the data
After fitting a Poisson regression model to the data, we get the following results and graph:
Limitations of Poisson Regression Model
- Heterogeneity in the data: more than one process may be generating the data. For example, the data might unknowingly have been collected from more than one group of people.
- Overdispersion: the observed variance of the response is larger than the model assumes (a Poisson model assumes that the mean and the variance are equal).
- In Statsmodels, the ratio Pearson chi2 / Df Residuals is approximately 1 if the data are drawn from a Poisson distribution with sufficient samples. For observed data, a ratio greater than 1 implies overdispersion, while a ratio less than 1 implies underdispersion. If the data are overdispersed, a negative binomial model is a common alternative; if the overdispersion comes from an excess of zero counts, a zero-inflated model may be more appropriate.
- For our Python example above, Pearson chi2 / Df Residuals is 1.08, which is close to 1.