Poisson Regression

Estimation, Interpretations and Diagnostics

Last week: Logistic Regression

Logistic regression: definition
Estimation and Interpretation for different type of covariates
Inference: confidence intervals and tests
Fitted values and residuals
Diagnostics: overdispersion

This week: Poisson Regression

Poisson regression is a regression analysis designed for modelling count data.

The response variable represents the number of times an event occurs in a given timeframe or even in a space such as a geographic unit. For example,

the number of customer arrivals
disease occurrences
traffic accidents

The response

Scroll down to see full content

Poisson is a common distribution to model a random variable that takes discrete non-negative integer values (i.e., 0, 1, 2,…)

\[Y_i|\mathbf{X}_i \sim \text{Poisson}(\lambda_i),\] where \(\lambda_i\) is the conditional mean.

A particularity of the Poisson distribution is that its mean equals its variance:

\[\lambda_i = E[Y_i|\mathbf{X}_i] = Var[Y_i|\mathbf{X}_i]\]

Thus, any factor that affects the mean will also affect the variance.

This fact could be a potential drawback for using a Poisson regression model.

Poisson regression can also be used to model rates of events but we won’t cover that case in this course.

Dataset

In this lecture, we’ll use the Bikeshare dataset from the ISL package discussed in the ISL book

The dataset contains information on the hourly users of a bike sharing program in Washington DC, for different months, hours, weather conditions, and membership types (see ?Bikeshare)


The Bikeshare dataset has information about 8645 users

season	mnth	day	hr	holiday	weekday	workingday	weathersit	temp	atemp	hum	windspeed	casual	registered	bikers
1	March	62	10	0	4	1	clear	0.18	0.1818	0.29	0.1940	7	49	56
4	Nov	327	1	0	3	1	cloudy/misty	0.48	0.4697	0.94	0.2985	0	8	8

A range problem

Since counts are nonnegative integers, if we use linear regression, we might have a range problem by predicting negative counts.

Instead, we can use a function of the linear component with a positive range, for example, an exponential function.

this is not the only function but it’s commonly used due to its simplicity and interpretation

Poisson Regression

Scroll down to see full content

The Poisson regression is defined as:

\[E[Y_i|\mathbf{X}] = \lambda_i = e^{\beta_0 + \beta_1X_{i,1} + \ldots + \beta_pX_{i,p}} \quad\quad [\text{mean}]\] or equivalently,

\[\log(\lambda_i) = \beta_0 + \beta_1X_{i,1} + \ldots + \beta_pX_{i,p} \quad\quad [\text{log-mean}]\]

the log function links the conditional expectation with the linear component in a natural way in the GLM family so it’s called the canonical link.

We model the log of the mean counts with a linear component!!

then all that we learned for MLR can be used for this function of the conditional expectation!!

Note that there is no error term! we are modeling a function of the conditional expectation

Poisson: a generalized linear model

The Poisson regression also belongs to the generalized linear models (GLMs) family.

In R, we can use the glm() function again but setting family = poisson to calculate the maximum likelihood estimator of the coefficients.

glm() models, by default, the log of the mean counts, which is the “canonical link” function but other functions can be used

Fitting a Poisson Regression

Scroll down to see full content

The interpretation of the coefficients will be very simular to those of the logistic regression case.

So this time let’s fit a model with various types of covariates at once.

From all the covariates available, let’s consider

workingday: an indicator variable that equals 1 if it is neither a weekend nor a holiday
temp: the normalized temperature, in Celsius
windspeed: the normalized windspeed
season: a qualitative variable that takes four values for each season

Additional information about the variables is available here

Factors vs numeric

Note that R will create dummy variables of factors, not of numerical variables.

sometimes numerical values are used for qualitative variables, e.g., season.

# A tibble: 2 × 5
  season workingday  temp windspeed bikers
   <dbl>      <dbl> <dbl>     <dbl>  <dbl>
1      3          1  0.64     0.164     65
2      2          1  0.72     0.328    155

You will need to adjust that before fitting the model

Let’s fit a Poisson model to the data as is.

summary(glm(formula = bikers ~ ., data = bikeshare, 
                              family = 'poisson'))


Call:
glm(formula = bikers ~ ., family = "poisson", data = bikeshare)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.411649   0.004158  820.39   <2e-16 ***
season       0.123221   0.000969  127.16   <2e-16 ***
workingday  -0.028971   0.001940  -14.93   <2e-16 ***
temp         2.050466   0.004919  416.82   <2e-16 ***
windspeed    0.822224   0.007361  111.70   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  805356  on 8640  degrees of freedom
AIC: 858404

Number of Fisher Scoring iterations: 5

Note that workingday and season were fit as numerical variables, and both should be treated as categorical variables!!

R can’t distinguish continuous from categorical variables if the variable is numeric.

Factors are required for qualitative variables!

bikeshare <- bikeshare %>%
             mutate(season = as.factor(season),
                    workingday = as.factor(workingday))


Call:
glm(formula = bikers ~ ., family = "poisson", data = bikeshare)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.345358   0.003981  840.38   <2e-16 ***
season2      0.062916   0.003741   16.82   <2e-16 ***
season3     -0.141267   0.004320  -32.70   <2e-16 ***
season4      0.379421   0.003357  113.01   <2e-16 ***
workingday1 -0.036254   0.001942  -18.67   <2e-16 ***
temp         2.688047   0.007435  361.54   <2e-16 ***
windspeed    0.741449   0.007377  100.50   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  783300  on 8638  degrees of freedom
AIC: 836352

Number of Fisher Scoring iterations: 5

log-mean vs mean counts

As before, we can interpret the “raw” results corresponding to log-mean counts and are interpretated as for MLR

\[\begin{aligned} \log(E[\text{bikers}|\mathbf{X}]) = & \beta_0 + \beta_1\text{S2} + \beta_2\text{S3} + \beta_3\text{S4} \ + \\ &+ \beta_4\text{W1} + \beta_5\text{temp} + \beta_6\text{wind} \end{aligned}\]

for simplicity, \(S\) = season, \(W\) = workingday, and wind = windspeed, and \(\mathbf{X}\) = all covariates

Interpretation: log-mean

Scroll down to see full content

term	estimate	std.error	statistic
(Intercept)	3.345	0.004	840.384
season2	0.063	0.004	16.818
season3	-0.141	0.004	-32.699
season4	0.379	0.003	113.008
workingday1	-0.036	0.002	-18.672
temp	2.688	0.007	361.541
windspeed	0.741	0.007	100.505

iClicker: true or false

Keeping other variables in the model constant, at any value, the estimated average bike usage decreases 0.036 in working days

Keeping other variables in the model constant, at any value, a unit increase in the normalized temperature is associated with an estimated increase in log mean bike usage of 2.688.

Interpretation: log-mean

The coefficients can be interpreted in terms of the log-means:

dummy variables \(\hat{\beta}_1 = 0.063\): the estimated log-mean usage in season 2 is 0.063 larger than that in season 1 (reference), keeping all other variables in the model constant at any value
continuous variables \(\hat{\beta}_5 = 2.688\): an increase of 1 unit in temp is associated with an estimated increase in the log-mean usage of 2.688, keeping all other variables in the model constant at any value

same as for MLR but with response “log-mean counts”!

Keeping constant …

\[ \log \lambda_{\texttt{temp}} = \beta_0 + \beta_5 \overbrace{\texttt{T}}^{{\texttt{temp}}} + \overbrace{\beta_1\text{S2} + \beta_2\text{S3} + \beta_3\text{S4} + \beta_4\text{W1} + \beta_6\text{wind}}^{\smash{\text{Constant}}} \] \[ \log \lambda_{\texttt{temp + 1}} = \beta_0 + \beta_5 \overbrace{(\texttt{T} + 1)}^{{\texttt{temp} + 1}} + \overbrace{\beta_1\text{S2} + \beta_2\text{S3} + \beta_3\text{S4} + \beta_4\text{W1} + \beta_6\text{wind}}^{\smash{\text{Constant}}} \]

We take the difference between both equations as:

\[\begin{align*} \log \lambda_{\texttt{temp + 1}} - \log \lambda_{\texttt{temp}} &= \beta_5 (\texttt{T} + 1) - \beta_5\texttt{T} \\ &= \beta_5 \end{align*}\]

Thus, we interpret the estimated coefficients as difference in estimated log-mean counts

Interpretation: mean counts

Scroll down to see full content

But by logarithm property for a ratio:

\[\begin{align*} \log \frac{\lambda_{\texttt{temp + 1}} }{\lambda_{\texttt{temp}}} &= \log \lambda_{\texttt{temp + 1}} - \log \lambda_{\texttt{temp}} \\ &= \beta_5 \end{align*}\]

Finally, we have to exponentiate the previous equation:

\[ \frac{\lambda_{\texttt{temp + 1}} }{\lambda_{\texttt{temp}}} = e^{\beta_5} \]

This expression indicates that an increase of 1 unit in a continuous covariate is associated with a mean count change by a factor of \(e^{\beta_5}\) (multiplicative change).

\[\lambda_{\texttt{temp + 1}}= e^{\beta_5}\lambda_{\texttt{temp}}\]

… which gives a more natural way of interpreting the estimated coefficients of a Poisson regression

Using `tidy`: `exponentiate = TRUE`

#Coefficients for the mean counts
tidy(bikers_poisson, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()

term	estimate	std.error	statistic
(Intercept)	28.371	0.004	840.384
season2	1.065	0.004	16.818
season3	0.868	0.004	-32.699
season4	1.461	0.003	113.008
workingday1	0.964	0.002	-18.672
temp	14.703	0.007	361.541
windspeed	2.099	0.007	100.505

Scroll down to see full content

dummy variables \(e^{\hat{\beta}_1} = 1.065\): the estimated average usage in season 2 is 1.065 times that in season 1 that in season 1 (reference), keeping all other variables in the model constant at any value
- or … increases by a factor of 1.065 relative to …
- or … increases by \((1.065 - 1) \times 100\%=6.5\%\) relative to …
continuous variables \(e^{\hat{\beta}_5} = 14.703\): a one-unit increase in temp is associated with an estimated average usage that is 14.703 times larger, holding all other variables constant.

iClicker:

dummy variables S3 \(e^{\hat{\beta}_2} = -0.141 = 0.868\):

A: a change in season from season 1 to season 3 is associated with a change in mean bike usage by a factor of 0.868, keeping all other variables in the model constant at any value

B: only 86.8% as many people will use bikes in season 3 relative to season season 1, keeping all other variables in the model constant at any value

C: a change in season from season 1 to season 3 is associated with a decrease of \(13.2\%\) in mean bike usage, keeping all other variables in the model constant at any value

—> D: all of the above are correct

Again, 4 Poisson regressions in 1!! For the additive model:

\[ \log \lambda = \beta_0 + \beta_1\text{S2} + \beta_2\text{S3} + \beta_3\text{S4} + \beta_4\text{W1} + \beta_5 \texttt{temp} + \beta_6\text{wind} \]

\[\begin{equation*} \log \lambda = \begin{cases} \beta_0 + \beta_4\text{W1} +\beta_5 \texttt{temp} + \beta_6\text{wind} \quad &\text{if } \text{S2} = \text{S3} = \text{S4} = 0\\ \\ (\beta_0 + \beta_1) + \beta_4\text{W1} +\beta_5 \texttt{temp} + \beta_6\text{wind} \quad &\text{if } \text{S2} = 1, \ \text{S3} = \text{S4} = 0\\ \\ (\beta_0 + \beta_2) + \beta_4\text{W1} +\beta_5 \texttt{temp} + \beta_6\text{wind} \quad &\text{if } \text{S3} = 1, \ \text{S2} = \text{S4} = 0\\ \\ (\beta_0 + \beta_3) + \beta_4\text{W1} +\beta_5 \texttt{temp} + \beta_6\text{wind} \quad &\text{if } \text{S4} = 1, \ \text{S2} = \text{S3} = 0\\ \end{cases} \end{equation*}\]

(optional but useful) try to write these in exponential form

Model with interaction

For simplicity, let’s look at a model with only workingday and temp but interacting

we are thinking that the “effect” of temperature will be different for leisure days compared to working days.

\[ \log \lambda = \gamma_0 + \gamma_1\text{W1} + \gamma_2 \texttt{temp} + \gamma_3\text{W1} \times \texttt{temp} \]

Again, 2 Poisson regressions in 1 with different slopes!

\[ \log \lambda = \gamma_0 + \gamma_1\text{W1} + \gamma_2 \texttt{temp} + \gamma_3\text{W1} \times \texttt{temp} \]

For the log-mean counts:

\[\begin{equation*} \log \lambda = \begin{cases} \gamma_0 + \gamma_2 \texttt{temp} \quad &\text{if } \text{W1} = 0\\ (\gamma_0 + \gamma_1) + (\gamma_2 + \gamma_3) \texttt{temp} \quad &\text{if } \text{W1} = 1 \end{cases} \end{equation*}\]

different slope (\(\gamma_2\) vs (\(\gamma_2 + \gamma_3\))) , different intercepts (\(\gamma_0\) vs (\(\gamma_0 + \gamma_1\)))!

For the mean counts:

\[\begin{equation*} \lambda = \begin{cases} e^{\gamma_0} e^{\gamma_2 \texttt{temp}} \quad &\text{if } \text{W1} = 0\\ e^{(\gamma_0 + \gamma_1)}e^{(\gamma_2 + \gamma_3) \texttt{temp}} \quad &\text{if } \text{W1} = 1 \end{cases} \end{equation*}\]

in R

bikers_poisson_int <- glm(bikers~workingday*temp,
                          data = bikeshare,
                          family = poisson)

tidy(bikers_poisson_int, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()

term	estimate	std.error	statistic
(Intercept)	34.343	0.005	687.857
workingday1	1.481	0.006	63.450
temp	14.723	0.008	318.215
workingday1:temp	0.483	0.010	-71.201

Interpretation (for mean counts)

In a model with interactions, the expected change in one variable depends on the value or levels the other variable

We CAN’T say “keeping the temp value constant” or “for working days or leisure days”

for leisure days, a 1 unit increase in temperature is associated with an estimated average bike usage 14.723 times higher.

Interaction term

For the “effect” of temperature for working days (i.e., workingday = 1), we multiply (instead of summing) the estimated coefficient of temp by that of the interaction term:

From the tidy table, with exponentiate = TRUE.

\[ e^{\hat{\gamma}_2} = 14.723, \ e^{\hat{\gamma}_3} = 0.483\]

We need to multiply (instead of summing):

\[\begin{equation*} e^{\hat{\gamma}_2 +\hat{\gamma}_3} = e^{\hat{\beta}_2} \ \ e^{\hat{\beta}_3} = 14.723 \times 0.483 = 7.11 \end{equation*}\]

In working days, a unit increase in temperature is associated with an estimated increase of average usage by a factor of 7.11.

In working days, a unit increase in temperature is associated with an estimated average usage 7.11 times higher.

Inference

Inference for Poisson regression is the same as that of Logistic regression. They are based on theorectical results from the CLT.

if the sample size is large, the Normal distribution is used to obtain boudaries of confidence intervals and \(p\)-values

\[ \frac{\hat{\beta}_j}{\hat{\text{SE}}(\hat{\beta}_j)} \sim N(0, 1) \]

This is called the Wald’s statistic

Scroll down to see full content

The confidence intervals can be calculated for both the log-mean and the mean counts.

But again, R will do all this for you!

tidy(bikers_poisson, 
     conf.int = TRUE, conf.level = 0.9) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()

term	estimate	std.error	statistic	conf.low	conf.high
(Intercept)	3.345	0.004	840.384	3.339	3.352
season2	0.063	0.004	16.818	0.057	0.069
season3	-0.141	0.004	-32.699	-0.148	-0.134
season4	0.379	0.003	113.008	0.374	0.385
workingday1	-0.036	0.002	-18.672	-0.039	-0.033
temp	2.688	0.007	361.541	2.676	2.700
windspeed	0.741	0.007	100.505	0.729	0.754

Interpretations: see examples in worksheet and tutorial

Residuals

The Poisson regression has the same problems as the Logistic Regression:

responses are discrete (at least more than 2 values now)
the variance is not constant

\[\text{resid}_{\text{raw}}= y_i - \hat{\lambda}_i, \quad\quad \text{resid}_{\text{pearson}}= \frac{y_i - \hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}}\]

Residual plot

plot(bikers_poisson,1)

Overdispersion

The variance of a Poisson random variable equals the mean, which in practice may not be observed.

Poisson regression usually exhibits overdispersion.

An easy way to check is to fit a model using the quasilikelihood method and check if the dispersion parameter is very different from 1.

in R

bikers_quasipoisson <- glm(bikers ~ .,
                          data = bikeshare,
                          family = quasipoisson)

summary(bikers_quasipoisson)


Call:
glm(formula = bikers ~ ., family = quasipoisson, data = bikeshare)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.34536    0.03789  88.286  < 2e-16 ***
season2      0.06292    0.03561   1.767 0.077302 .  
season3     -0.14127    0.04112  -3.435 0.000595 ***
season4      0.37942    0.03196  11.872  < 2e-16 ***
workingday1 -0.03625    0.01848  -1.962 0.049849 *  
temp         2.68805    0.07077  37.981  < 2e-16 ***
windspeed    0.74145    0.07022  10.558  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 90.60984)

    Null deviance: 1052921  on 8644  degrees of freedom
Residual deviance:  783300  on 8638  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Summary

Scroll down to see full content

A count response can be modelled using a Poisson Regression
The (conditional) expectation of a count response is the (conditional) mean counts.
The (conditional) expectation of a count response equals the (conditional) variance
A LR can not be used to model the conditional expectation of a count response since its range extends beyond positive values
Instead, one can model a function of the conditional mean counts. A common choice in Poisson regression is to use the log function

Summary (cont.)

The interpretation of the coefficients depends on the type of variables and the form of the model:

The raw estimated coefficients (of the log-mean count model) are interpreted as:

estimated log-mean counts of a reference group
difference of estimated log-mean counts of a treatment vs a control group
association between unit change in the input and changes in estimated log-mean counts

Summary (cont.)

The estiamted exponentiated coefficients (of the mean count model) are interpreted as:

estimated averages counts of a reference group
ratio of estimated averages counts of a treatment vs a control group
multiplicative changes in estimated averages counts per unit change in the input

Summary (cont.)

The estimated Poisson model can be used to make inference using the Wald’s satistic: confidence intervals and tests
There are different type of fitted values (log-mean counts or counts?) and residuals
Overdispersion is a potential problem in Poisson regression. Using quasilikelihood approach usual helps.

Poisson Regression

Last week: Logistic Regression

This week: Poisson Regression

The response

Dataset

A range problem

Poisson Regression

Poisson: a generalized linear model

Fitting a Poisson Regression

Factors vs numeric

log-mean vs mean counts

Interpretation: log-mean

iClicker: true or false

Interpretation: log-mean

Keeping constant …

Interpretation: mean counts

Using tidy: exponentiate = TRUE

iClicker:

Model with interaction

in R

Interpretation (for mean counts)

Interaction term

Inference

Residuals

Residual plot

Overdispersion

in R

Summary

Summary (cont.)

Summary (cont.)

Summary (cont.)

Using `tidy`: `exponentiate = TRUE`