Logistic Regression

Estimation and Interpretations

In this lecture

  • Logistic regression: definition

  • Estimation

  • Interpretation for different types of covariates

  • Inference: confidence intervals and tests

Logistic Regression


Logistic regression is a powerful statistical method used for modelling a binary (and binomial) response variable.


For example, we can use a logistic regression to

  1. compare the presence of bacteria between groups taking a new drug and a placebo, respectively.
    • Response: present or not present
  2. predict whether or not a customer will default on a loan given their income and demographic variables.
    • Response: default or not default
  3. know how GPA, ACT score, and number of AP classes taken are associated with the probability of getting accepted into a particular university.
    • Response: accepted or not accepted

The response


To fit a logistic regression in R, the response needs to be a numeric variable (0 and 1) or a factor with two levels (R stores factors as integers internally).


For example, a numerical binary response \(Y_i\) that flags the successes (S) has the form:

\[ Y_i = \begin{cases} 1 \; \; \; \; \mbox{if the $i$th observation is S},\\ 0 \; \; \; \; \mbox{otherwise.} \end{cases} \]

In R, you can use the following code to convert a string response into a numerical one (if_else() comes from the dplyr package):

y_response <- if_else(y == "success", 1, 0)
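A minimal self-contained sketch of the conversion, using a hypothetical vector y (base R's ifelse() works the same way without loading dplyr):

```r
# Hypothetical string response
y <- c("success", "failure", "success")

# Numeric 0/1 coding (base R equivalent of dplyr::if_else)
y_response <- ifelse(y == "success", 1, 0)

# A two-level factor also works as a response for glm();
# the first level is treated as "failure" (0)
y_factor <- factor(y, levels = c("failure", "success"))

y_response  # 1 0 1
```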

Dataset

In this lecture, we’ll use the Titanic dataset, which contains survival and demographic information about the passengers.

  • You can find details about the columns by typing ?titan.

The titanic dataset has information about 891 passengers. Two example rows:

|passenger_id |survived |passenger_class |name                                  |sex    |age |sibsp |parch |ticket_number |fare   |cabin |embarked |
|-------------|---------|----------------|--------------------------------------|-------|----|------|------|--------------|-------|------|---------|
|775          |1        |2               |Hocking, Mrs. Elizabeth (Eliza Needs) |female |54  |1     |3     |29105         |23.000 |NA    |S        |
|549          |0        |3               |Goldsmith, Mr. Frank John             |male   |33  |1     |1     |363291        |20.525 |NA    |S        |

Linear Regression: a range problem

Let’s start by using a linear regression (LR) to explore whether there is any association between fare and survival.

A function with range \((0,1)\)

With a linear regression, fitted values may be larger than 1 or smaller than 0!

When the response is binary: \(E[Y|X] = P[Y=1|X]\). We are modeling a probability!!

Let’s use a function of the linear component with range \((0,1)\):

\[ E\left[\ \text{survived}_i\ |\ \text{fare}_i\ \right]= P\left[\ \text{survived}_i = 1\ |\ \text{fare}_i\ \right] = \frac{e^{\beta_0 + \beta_1\times \text{fare}_i}}{1+ e^{\beta_0 + \beta_1\times \text{fare}_i}} \]

Note that this ratio is always between 0 and 1!
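A quick numerical check of this claim, using hypothetical values of the linear component (base R's plogis() computes exactly this ratio):

```r
# e^x / (1 + e^x) stays strictly between 0 and 1 for any finite x
x <- c(-10, -1, 0, 1, 10)      # hypothetical linear-component values
p <- exp(x) / (1 + exp(x))

stopifnot(all(p > 0 & p < 1))
all.equal(p, plogis(x))        # plogis() is R's built-in inverse logit: TRUE
```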

The logit function

Using some algebra you can get an alternative formulation:

\[ P\left[\ \text{survived}_i=1\ |\ \text{fare}_i\ \right] = p_i = \frac{e^{\beta_0 + \beta_1\times \ \text{fare}_i}}{1+ e^{\beta_0 + \beta_1\times \ \text{fare}_i}} \qquad [\text{probability}] \]

can also be written as

\[ \underbrace{\log\left(\frac{p_i}{1 - p_i}\right)}_{\textbf{the logit function}} = \beta_0 + \beta_1 \ \text{fare}_i \qquad [\text{log-odds}] \]
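In base R, qlogis() is the logit and plogis() its inverse, so you can verify numerically that the two formulations are the same model:

```r
p <- 0.73                     # a hypothetical probability
log_odds <- log(p / (1 - p))  # the logit computed by hand

stopifnot(isTRUE(all.equal(log_odds, qlogis(p))))  # same function
stopifnot(isTRUE(all.equal(plogis(log_odds), p)))  # inverting recovers p
```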

The Logistic Regression

We model the logit of the probability with a linear component!!

Then everything we learned for MLR can be used for this function of the conditional expectation!!

Thus, this model is known as a Logistic Regression

\[ \log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_p X_{pi} \]

Note that there is no error term! We are modeling a function of the conditional expectation.

Logistic: a generalized linear model

The logistic regression belongs to a family of models called “generalized linear models”, aka GLMs.


In R, we can use the function glm(), which estimates these models, but we need to specify which member of the family we are interested in.

For logistic regression: family = binomial

In general, there is no closed-form expression for the MLE. Thus, an iterative algorithm is needed (see the number of Fisher Scoring iterations in the summary output).

The MLR also belongs to this family and you can also use glm() to estimate it.

It is the default family: family = gaussian


Note that glm() calculates maximum likelihood estimators (MLE) and lm() calculates least squares estimators.

For Normal random errors, both methods give the same estimates!
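A sketch on simulated data (not the titanic data) confirming this claim:

```r
# For a Normal response, glm(family = gaussian) and lm() give the same estimates
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian)

all.equal(coef(fit_lm), coef(fit_glm))  # TRUE
```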

You can get more information about this function using ?glm

Logit: log-odds

glm() models, by default, the logit of the probabilities, usually referred to as the “canonical link” function

links the conditional expectation with a linear component

Note that the logit of the probabilities is the log of the odds of success (more details are given in the math appendix)

\[ \text{logit}(p) = \log{\underbrace{\left(\frac{p}{1-p}\right)}_{\textbf{odds}}} = \log(\text{odds}) \]

Fitting a Logistic Regression

Case 1: one continuous covariate


\[ \log\left(\frac{P(\left.\text{survived}_i = 1 \right| \text{fare}_i)}{1-P(\left.\text{survived}_i = 1 \right| \text{fare}_i)}\right) = \beta_0 + \beta_1 \ \text{fare}_i \]

model_titanic_logistic <- glm(formula = survived ~ fare, data = titan, 
                              family = 'binomial')

summary(model_titanic_logistic)

Call:
glm(formula = survived ~ fare, family = "binomial", data = titan)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.941330   0.095129  -9.895  < 2e-16 ***
fare         0.015197   0.002232   6.810 9.79e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.7  on 890  degrees of freedom
Residual deviance: 1117.6  on 889  degrees of freedom
AIC: 1121.6

Number of Fisher Scoring iterations: 4

Interpretation: log-odds

The coefficients can be interpreted in terms of the log-odds (logit link):

  • Intercept \(\hat{\beta}_0 = -0.9413\): the estimated log-odds of surviving with a paid fare of 0 dollars is -0.9413.

  • Slope \(\hat{\beta}_1 = 0.0152\): an increase of 1 dollar in the fare paid is associated with an estimated increase in the log-odds of surviving of 0.0152.

same as for MLR but with response “log-odds of success”!

Important

Note the word “estimated” in the interpretation.

Interpretation: odds

A more natural way of interpreting the coefficients of a logistic regression is based on odds of success

Using basic algebra, the model can also be written as:

\[ \frac{p_i}{1 - p_i}= e^{\beta_0 + \beta_1 \ \text{fare}_i} = e^{\beta_0} e^{\beta_1 \ \text{fare}_i} \qquad [\text{odds}] \]

We need to exponentiate the coefficients to interpret them as changes in the odds. Find more details in the math appendix.

Exponentiating the coefficients

#Coefficients for the odds
tidy(model_titanic_logistic, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |    0.390|     0.095|    -9.895|       0|
|fare        |    1.015|     0.002|     6.810|       0|


Slope: \(e^{\hat{\beta}_1} = e^{0.0152} = 1.015\): we estimate that an increase of 1 dollar in the fare paid is associated with an increase of the odds of surviving by a factor of 1.015 (or with an increase in the odds of surviving of 1.5%).
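You can reproduce the exponentiated table with base R alone via exp(coef(...)). A sketch on simulated data, since the titan data frame isn’t loaded in this snippet (the fare values and coefficients here are hypothetical):

```r
set.seed(42)
fare <- runif(500, 0, 100)                       # hypothetical fares
surv <- rbinom(500, 1, plogis(-1 + 0.02 * fare)) # hypothetical survival

fit <- glm(surv ~ fare, family = binomial)

# Odds-scale estimates, matching tidy(fit, exponentiate = TRUE)
exp(coef(fit))
```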

Case 2: one categorical variable


\[ \log\left(\frac{P(\left.\text{survived}_i = 1 \right| X_\text{male,i})}{1-P(\left.\text{survived}_i = 1 \right| X_\text{male,i})}\right) = \beta_0 + \beta_1 X_\text{male,i} \]


where \(X_\text{male,i} = 1\) if passenger \(i\) is registered as a male and \(X_\text{male,i} = 0\) if registered as female.

Again, 2 logistic regressions in 1!!

\[\begin{equation*} \text{logit}(p_\text{survival}) = \beta_0 + \beta_1\times X_\text{male} = \begin{cases} \beta_0 \quad\quad\quad \text{if } X_\text{male} = 0\\ \beta_0 + \beta_1 \quad \text{ if } X_\text{male} = 1 \end{cases} \end{equation*}\]

Therefore, for the odds of surviving:

\[\begin{equation*} \frac{p_\text{survival}}{1-p_\text{survival}} = \begin{cases} e^{\beta_0} \quad\quad\quad \text{if } X_\text{male} = 0\\ e^{\beta_0 + \beta_1} \quad\quad \text{if } X_\text{male} = 1 \end{cases} \end{equation*}\]

find more details in the math appendix
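With a single binary covariate the model is saturated, so the exponentiated coefficients reproduce the sample odds and odds ratio exactly. A sketch on simulated data (the slide's actual model uses sex from titan; group sizes and survival rates here are hypothetical):

```r
set.seed(7)
male <- rbinom(300, 1, 0.5)
y    <- rbinom(300, 1, ifelse(male == 1, 0.2, 0.7))  # hypothetical rates

fit <- glm(y ~ male, family = binomial)

odds_f <- mean(y[male == 0]) / (1 - mean(y[male == 0]))  # reference group
odds_m <- mean(y[male == 1]) / (1 - mean(y[male == 1]))

# exp(beta0_hat) = odds in reference group; exp(beta1_hat) = odds ratio
all.equal(unname(exp(coef(fit))), c(odds_f, odds_m / odds_f),
          tolerance = 1e-6)  # TRUE
```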

in R


model_titanic_logistic_sex <- glm(survived ~ sex, data = titan, family = binomial) 

tidy(model_titanic_logistic_sex) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |    1.057|     0.129|     8.191|       0|
|sexmale     |   -2.514|     0.167|   -15.036|       0|


tidy(model_titanic_logistic_sex, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |    2.877|     0.129|     8.191|       0|
|sexmale     |    0.081|     0.167|   -15.036|       0|


Reference level: female, note the dummy variable name sexmale

Interpretation (in terms of odds)


  • Intercept: \(e^{\hat{\beta}_0} = e^{1.057} = 2.8765\), a female passenger has estimated odds of surviving equal to 2.8765 (i.e., the proportion of female survivors relative to the proportion of female deaths in the sample)

  • Slope: \(e^{\hat{\beta}_1} = e^{-2.514} = 0.08097\), the estimated odds of surviving for a male passenger are 0.08097 times the estimated odds for a female passenger, corresponding to a decrease of \(91.90\%\) (note \((0.08097-1) \times 100 = - 91.90\)).

    • Or: the estimated odds of surviving for a female passenger are multiplied by a factor of 0.08097 for a male passenger.

Odds ratio: note that \(e^{\hat{\beta}_1} = 0.08097 = \frac{\text{odds}_\text{male}}{\text{odds}_\text{female}}\)

  • Turning to the odds of failure: for a male passenger, the estimated odds of dying are \(e^{2.514} = 12.3542\) times those of a female passenger.

More details about these equivalences are given in the math appendix

Case 3: one categorical and one numerical covariate

Let’s fit a logistic regression using sex and fare (omitting the subscript \(i\) for simplicity)


The additive model

\[ \log\left(\frac{P(\left.\text{surv} = 1 \right| \text{fare}, X_\text{male})}{1-P(\left.\text{surv} = 1 \right| \text{fare}, X_\text{male})}\right) = \beta_0 + \beta_1 X_\text{male} + \beta_2\times\text{fare} \]

Again, 2 logistic regressions in 1 with equal slope!

\[\begin{equation*} \text{logit}(p_\text{survival}) = \beta_0 + \beta_1\times X_\text{male} + \beta_2 \times \text{fare} \end{equation*}\]

\[\begin{equation*} \log\left(\frac{p_s}{1-p_s}\right) = \begin{cases} \beta_0 + \beta_2 \times \text{fare}, \quad\quad\quad \text{if } X_\text{male} = 0\\ \beta_0 + \beta_1 + \beta_2 \times \text{fare}, \quad \text{ if } X_\text{male} = 1 \end{cases} \end{equation*}\]

same slope (\(\beta_2\)), different intercepts (\(\beta_0\) vs (\(\beta_0 + \beta_1\)))!

For the odds of surviving:

\[\begin{equation*} \frac{p_\text{survival}}{1-p_\text{survival}} = \begin{cases} e^{\beta_0}e^{\beta_2\text{fare}}, \quad\quad\quad \text{if } X_\text{male} = 0\\ e^{\beta_0 + \beta_1}e^{\beta_2 \text{fare}}, \quad\quad \text{if } X_\text{male} = 1 \end{cases} \end{equation*}\]

The curves won’t look “parallel”, even if the linear components have the same slopes

\[\begin{equation*} p_i = \begin{cases} \frac{e^{\beta_0 + \beta_2\times \ \text{fare}_i}}{1+ e^{\beta_0 + \beta_2\times \ \text{fare}_i}} \quad\quad\quad \text{if } X_\text{male, i} = 0\\ \\ \frac{e^{\beta_0 + \beta_1 + \beta_2\times \ \text{fare}_i}}{1+ e^{\beta_0 + \beta_1 + \beta_2\times \ \text{fare}_i}} \quad\quad\ \text{if } X_\text{male, i} = 1 \end{cases} \end{equation*}\]
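Plugging estimates into plogis() gives the two fitted probability curves. The coefficient values below are the (rounded) estimates from the additive fit shown in the following slides, so treat the output as approximate:

```r
# Rounded estimates from the additive model: beta0, beta1 (sexmale), beta2 (fare)
b0 <- 0.647; b1 <- -2.423; b2 <- 0.011

fare <- 50                               # an illustrative fare value
p_female <- plogis(b0 + b2 * fare)       # X_male = 0
p_male   <- plogis(b0 + b1 + b2 * fare)  # X_male = 1

c(female = p_female, male = p_male)      # same slope, very different curves
```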

in R


model_titanic_logistic_additive <- glm(survived ~ sex + fare, data = titan,
                                       family = binomial) 

tidy(model_titanic_logistic_additive) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |    0.647|     0.149|     4.358|       0|
|sexmale     |   -2.423|     0.171|   -14.208|       0|
|fare        |    0.011|     0.002|     4.886|       0|


tidy(model_titanic_logistic_additive, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value|
|:-----------|--------:|---------:|---------:|-------:|
|(Intercept) |    1.910|     0.149|     4.358|       0|
|sexmale     |    0.089|     0.171|   -14.208|       0|
|fare        |    1.011|     0.002|     4.886|       0|

Interpretation (for odds ratios)


Keeping the fare value constant (at any value):

  • the estimated odds of surviving for a male passenger are \(e^{-2.42276} = 0.0887\) times those of a female passenger.

  • equivalently, the estimated odds of surviving for male passengers are only \(8.87\%\) of those for female passengers.

  • or in other words, the estimated odds of dying for a male passenger are \(e^{2.42276} = 11.2769\) times those of a female passenger

Note that we can “keep fare constant” since this is an additive model. More details in the math appendix.

Interpretation (cont.)


Keeping the sex constant (or for either male or female, or regardless of sex):

  • increasing the fare by 1 dollar is associated with an increase of the estimated odds of surviving by a factor of \(1.0113\).

  • increasing the fare by 1 dollar is associated with an increase of the estimated odds of surviving of \((1.0113-1)\times 100\%=1.13\%\).

Case 4: one categorical and one numerical covariate


Model with interaction

\[ \log\left(\frac{p_s}{1-p_s}\right) = \beta_0 + \beta_1 X_\text{male} + \beta_2\times\text{fare} + \beta_3\times X_\text{male}\times\text{fare} \]

Again, 2 logistic regressions in 1 with different slopes!

\[\begin{equation*} \text{logit}(p_s) = \beta_0 + \beta_1\times X_\text{male} + \beta_2 \times \text{fare} + \beta_3\times X_\text{male}\times\text{fare} \end{equation*}\]


\[\begin{equation*} \log\left(\frac{p_s}{1-p_s}\right) = \begin{cases} \beta_0 + \beta_2 \times \text{fare}, \quad\quad\quad\quad\quad\quad &\text{if } X_\text{male} = 0\\ (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \times \text{fare}, \quad &\text{if } X_\text{male} = 1 \end{cases} \end{equation*}\]

different slope (\(\beta_2\) vs (\(\beta_2 + \beta_3\))) , different intercepts (\(\beta_0\) vs (\(\beta_0 + \beta_1\)))!

For the odds of surviving:

\[\begin{equation*} \frac{p_\text{survival}}{1-p_\text{survival}} = \begin{cases} e^{\beta_0}e^{\beta_2\text{fare}}, \quad\quad\quad\quad &\text{if } X_\text{male} = 0\\ e^{(\beta_0 + \beta_1)}e^{(\beta_2 +\beta_3) \text{fare}}, \quad\quad &\text{if } X_\text{male} = 1 \end{cases} \end{equation*}\]

\[\begin{equation*} p_i = \begin{cases} \frac{e^{\beta_0 + \beta_2\times \ \text{fare}_i}}{1+ e^{\beta_0 + \beta_2\times \ \text{fare}_i}} \quad\quad\quad &\text{if } X_\text{male, i} = 0\\ \\ \frac{e^{(\beta_0 + \beta_1) + (\beta_2 + \beta_3) \times \ \text{fare}_i}}{1+ e^{(\beta_0 + \beta_1) + (\beta_2 + \beta_3) \times \ \text{fare}_i}} \quad\quad\ &\text{if } X_\text{male, i} = 1 \end{cases} \end{equation*}\]

in R


model_titanic_logistic_int <- glm(survived ~ sex * fare, data = titan,
                                       family = binomial) 

tidy(model_titanic_logistic_int) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term         | estimate| std.error| statistic| p.value|
|:------------|--------:|---------:|---------:|-------:|
|(Intercept)  |    0.408|     0.190|     2.150|   0.032|
|sexmale      |   -2.099|     0.230|    -9.116|   0.000|
|fare         |    0.020|     0.005|     3.701|   0.000|
|sexmale:fare |   -0.012|     0.006|    -1.958|   0.050|


tidy(model_titanic_logistic_int, exponentiate = TRUE) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term         | estimate| std.error| statistic| p.value|
|:------------|--------:|---------:|---------:|-------:|
|(Intercept)  |    1.504|     0.190|     2.150|   0.032|
|sexmale      |    0.123|     0.230|    -9.116|   0.000|
|fare         |    1.020|     0.005|     3.701|   0.000|
|sexmale:fare |    0.988|     0.006|    -1.958|   0.050|

Interpretation (for odds ratios)


In a model with interactions, the expected change associated with one variable depends on the value or level of the other variable.

We CAN’T say “keeping the fare value constant” or “for either male or female”

  • for female passengers, increasing the fare by 1 dollar is associated with an estimated increase of the odds of surviving by a factor of 1.020.

  • for female passengers, increasing the fare by 1 dollar is associated with an estimated increase of the odds of surviving of \(2\%\).

Interaction term

The interpretation of the interaction term becomes a bit complicated.

Instead, you can compute the two estimated curves and interpret them accordingly.

From the tidy table (with exponentiate = FALSE or default), we get

\[\hat{\beta}_0 = 0.408, \ \hat{\beta}_1 = - 2.099, \ \hat{\beta}_2 = 0.020, \ \hat{\beta}_3 = - 0.012\]

Then,

\[\begin{equation*} \hat{\beta}_0 = 0.408 \\ \hat{\beta}_2 = 0.020 \\ \\ \hat{\beta}_0 + \hat{\beta}_1 = 0.408 - 2.099 = -1.691 \\ \hat{\beta}_2 +\hat{\beta}_3 = 0.020 - 0.012 = 0.008 \end{equation*}\]

\[\begin{equation*} \frac{p_i}{1-p_i} = \begin{cases} e^{0.408 + 0.020 \times \ \text{fare}_i} \quad\quad\quad &\text{if } X_\text{male, i} = 0\\ \\ e^{-1.691 + 0.008 \times \ \text{fare}_i} \quad\quad\ &\text{if } X_\text{male, i} = 1 \end{cases} \end{equation*}\]

  • for male passengers, increasing the fare by 1 dollar is associated with an estimated increase of the odds of surviving by a factor of \(e^{0.008} = 1.008\).

  • for male passengers, increasing the fare by 1 dollar is associated with an estimated increase in the odds of surviving of \(0.8\%\).
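The bookkeeping above can be sketched directly in R (coefficient values copied from the tidy() table on this slide):

```r
b0 <- 0.408; b1 <- -2.099; b2 <- 0.020; b3 <- -0.012

female <- c(intercept = b0,      slope = b2)       # X_male = 0
male   <- c(intercept = b0 + b1, slope = b2 + b3)  # X_male = 1

male  # intercept -1.691, slope 0.008
```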

Using exponentiated coefficients

Note that you can also recover both curves using the exponentiated coefficients, from the tidy table, with exponentiate = TRUE.

\[e^{\hat{\beta}_0} = 1.504, \ e^{\hat{\beta}_1} = 0.123, \ e^{\hat{\beta}_2} = 1.020, \ e^{\hat{\beta}_3} = 0.988\]

We need to multiply (instead of summing):

\[\begin{equation*} e^{\hat{\beta}_0 + \hat{\beta}_1} = e^{-1.691} = 0.184\\ e^{\hat{\beta}_0} \ e^{\hat{\beta}_1} = 1.504 \times 0.123 = 0.185\\ e^{\hat{\beta}_2 +\hat{\beta}_3} = e^{0.008} = 1.008\\ e^{\hat{\beta}_2} \ \ e^{\hat{\beta}_3} = 1.020 \times 0.988 = 1.008 \end{equation*}\]

which are equivalent, up to rounding errors
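The small discrepancy (0.184 vs. 0.185) comes purely from rounding the table entries; with unrounded numbers the identity \(e^{a+b} = e^a e^b\) holds exactly:

```r
b0 <- 0.408; b1 <- -2.099   # raw (log-odds) estimates from the tidy() table

# Exact identity: exponentiating the sum equals the product of exponentials
stopifnot(isTRUE(all.equal(exp(b0 + b1), exp(b0) * exp(b1))))

exp(b0 + b1)   # ~0.184, the odds-scale intercept for male passengers
```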

Inference

Inference for GLMs, and logistic regression in particular, is based on theoretical results derived from the CLT.

If the sample size is large, the Normal distribution is used to obtain the boundaries of confidence intervals and \(p\)-values:

\[ \frac{\hat{\beta}_j - \beta_j}{\hat{\text{SE}}(\hat{\beta}_j)} \; \overset{\text{approx.}}{\sim} \; N(0, 1) \]

Evaluated at \(\beta_j = 0\), this ratio is called the Wald statistic.

The confidence intervals of level \((1-\alpha)\times 100\%\) for the coefficients of the log-odds model (canonical link) are

\[ CI\left(\hat{\beta}_j, 1-\alpha\right) = \hat{\beta}_j \pm z_{\alpha/2}\hat{\text{SE}}(\hat{\beta}_j) \]


You can also use the Wald statistic to test \(H_0: \beta_j = 0\) vs \(H_1: \beta_j \neq 0\)


But again, R will do all this for you!
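If you want to see the formula in action, confint.default() computes exactly these Wald intervals; here is a sketch on simulated data (the response and covariate are hypothetical):

```r
set.seed(123)
x <- rnorm(400)
y <- rbinom(400, 1, plogis(0.5 + x))   # hypothetical binary response

fit <- glm(y ~ x, family = binomial)

est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# 90% Wald CI: estimate +/- z_{0.05} * SE
manual <- cbind(est - qnorm(0.95) * se, est + qnorm(0.95) * se)

all.equal(unname(manual), unname(confint.default(fit, level = 0.90)))  # TRUE
```

Note that broom’s tidy(conf.int = TRUE) calls confint(), which for GLMs uses profile-likelihood intervals, so its endpoints can differ slightly from these Wald intervals.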

in R

tidy(model_titanic_logistic_additive, 
     conf.int = TRUE, conf.level = 0.9) %>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |    0.647|     0.149|     4.358|       0|    0.405|     0.894|
|sexmale     |   -2.423|     0.171|   -14.208|       0|   -2.707|    -2.145|
|fare        |    0.011|     0.002|     4.886|       0|    0.008|     0.015|

For fare: \(p < 0.001\), indicating strong evidence against the null hypothesis of no association between fare and the log-odds of surviving, for both female and male passengers.

Inference for odds of success


You can also get confidence intervals for the exponentiated coefficients

tidy(model_titanic_logistic_additive, 
     conf.int = TRUE, conf.level = 0.9, exponentiate = TRUE)%>%
  mutate_if(is.numeric, round, 3) %>%
  knitr::kable()
|term        | estimate| std.error| statistic| p.value| conf.low| conf.high|
|:-----------|--------:|---------:|---------:|-------:|--------:|---------:|
|(Intercept) |    1.910|     0.149|     4.358|       0|    1.499|     2.444|
|sexmale     |    0.089|     0.171|   -14.208|       0|    0.067|     0.117|
|fare        |    1.011|     0.002|     4.886|       0|    1.008|     1.015|


We can be \(90\%\) confident that an increase of 1 dollar in fare is associated with a \(0.8\%\) to \(1.5\%\) increase in the odds of surviving for either male or female passengers.

since \((1.008-1)\times 100\%= 0.8\%\) and \((1.015-1)\times 100\%= 1.5\%\)

Summary

  • The (conditional) expectation of a binary response is the (conditional) probability of success.

  • An MLR cannot be used to model the conditional expectation of a binary response since its fitted values can fall outside the interval \((0,1)\)

  • Instead, one can model a function of the conditional probability. A common choice in logistic regression is to use the logit function (logarithm of odds)

Summary (cont.)


The interpretation of the coefficients depends on the type of variables and the form of the model:

The raw coefficients (of the log-odds model) are interpreted as:

  • log-odds of a reference group
  • difference of log-odds of a treatment vs a control group
  • changes in log-odds per unit change in the input

The exponentiated coefficients (of the odds model) are interpreted as:

  • odds of a reference group
  • odds ratio of a treatment vs a control group
  • multiplicative changes in odds per unit change in the input

Important

Once you have an estimated model, remember to interpret results as “estimated” values.

Summary (cont.)

  • The estimated logistic model can be used to make inference about the population coefficients using the Wald statistic: confidence intervals and tests

  • In the next lecture we’ll look at

    • fitted values
    • residuals
    • diagnostics