Multiple Linear Regression

Categorical Variables

Photo by Erik Eastman on Unsplash

Last week

LR: definition

LR can be used to study the relationship between a continuous response variable and input variables of different types
LR models the conditional expectation of the response given the input variables
LR coefficients are unknown (but not random)
We use a random sample to estimate the LR coefficients
Least Squares is a method that can be used to estimate the LR coefficients using data
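As a quick reminder of what Least Squares does, here is a minimal sketch in base R. The data are simulated (the variables x and y and the "true" coefficients 2 and 0.5 are made up for illustration). It solves the normal equations by hand and compares the result with lm().

```r
# Simulated data: hypothetical, only for illustration
set.seed(301)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x)

# Least squares estimate: solves the normal equations (X'X) b = X'y
beta_hat_manual <- solve(t(X) %*% X, t(X) %*% y)

# Same estimate from lm()
beta_hat_lm <- coef(lm(y ~ x))

cbind(manual = as.vector(beta_hat_manual), lm = unname(beta_hat_lm))
```

The two columns should agree up to numerical precision; lm() uses a more stable QR decomposition internally rather than the normal equations.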

SLR: definition, estimation and inference

Simple linear regression (SLR): a LR with only one input variable
We interpreted and visualized the coefficients of a SLR
We learned two different ways to make inference (hypothesis tests and CIs) in the context of SLR:
- using theoretical results or using bootstrapping
We used computer scripts for estimation and inference tasks in a SLR analysis

This week

Multiple Linear Regression (MLR) to study the association between a continuous response and many input variables of different types!!

Heads up: Multiple Linear Regression is not the same as Multivariate Linear Regression (a regression with a multivariate response variable!)

1. Categorical input variables with 2 or more levels
2. Additive MLR: with different types of input variables
3. MLR with interaction terms (between continuous and categorical input variables)

1. Categorical input variables

In STAT 201:

a categorical variable creates groups (levels)
you study how parameters of the distribution of a continuous variable differ across these groups

For example:

does the average size of the donations depend on the variation of a website?
does the average cancer mortality vary across states?

If a variable is categorical, can we include it in a LR?

In STAT 301

Does the distribution of a continuous variable differ across groups?
How does the (linear) relation between variables differ across groups?

Photo by Tomas Sobek on Unsplash

Distribution of cancer mortality

Case 1: 1 categorical variable, 2 levels

Let’s start by comparing the average cancer mortality in 2 states: Washington vs Indiana.

Can we use a LR??!!

The linear regression equation is:

\[Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i,\]

the response \(Y_i\) is TARGET_deathRate

But \(X_i\) is not numeric: Indiana vs Washington

There’s not a line in this case!!

Using LR

We can still use a “linear regression”!! How??

Trick: we use an auxiliary numeric variable to represent the levels of a categorical variable: a dummy variable

lm() uses the “reference-treatment” parametrization (as a default) to include factors in LR

each dummy variable defines 2 groups: a “reference” and a “treatment” group

lm() creates these dummy variables automatically, as long as the input variable (in our case state) is a factor!!

\[X_i = \left\{ \begin{array}{ll} 1 & \text{if county is in "Washington"};\\ 0 & \text{otherwise}\end{array} \right.\]
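To see the dummy variable that lm() would build, one option is to inspect the design matrix with model.matrix(). This is only a sketch: the data frame toy below is a hypothetical five-county example, not the course dataset.

```r
# Toy data: hypothetical counties, not the course dataset
toy <- data.frame(
  state = factor(c("Indiana", "Washington", "Indiana", "Washington", "Indiana"))
)

# model.matrix() builds the design matrix that lm() would use:
# a column of ones plus one 0/1 dummy column for the non-reference level
X <- model.matrix(~ state, data = toy)
X
```

The column stateWashington equals 1 for the Washington rows and 0 for the Indiana rows, matching the definition of \(X_i\) above.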

Dummy variables

It is a numerical variable with values \(0\) (“reference”) or \(1\) (“treatment”).

For example, if the \(i\)th county in your data is in Washington, then \(X_i = 1\)

R chooses the reference level by alphabetical order, but you can change that!
Naming convention: the name of the variable followed by the non-reference level

in our example: stateWashington
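One way to change the reference level is base R's relevel(). The sketch below uses a hypothetical toy data frame and shows how the dummy name changes with the reference level.

```r
# Toy data: hypothetical, not the course dataset
toy <- data.frame(
  state = factor(c("Indiana", "Washington", "Indiana", "Washington"))
)

levels(toy$state)  # the first level listed is the reference

# Default (alphabetical) reference: Indiana
default_names <- colnames(model.matrix(~ state, data = toy))

# Make Washington the reference instead
toy_releveled <- toy
toy_releveled$state <- relevel(toy$state, ref = "Washington")
releveled_names <- colnames(model.matrix(~ state, data = toy_releveled))

default_names
releveled_names
```

With Indiana as the reference the dummy is named stateWashington; after releveling it is named stateIndiana, and the sign of its coefficient in a fitted model would flip.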

Packages and data

Categorical variables as factors

Important

Make sure categorical variables are coded as factors before fitting a regression. If the levels are stored as numbers (for example 1, 2, 3), R will treat the variable as numerical and fit a single slope.
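A minimal sketch of what can go wrong, assuming a hypothetical 3-level category stored as the numbers 1, 2, 3 (the variable region_code and the simulated response y are made up for illustration):

```r
set.seed(301)
# Hypothetical example: a 3-level category stored as the numbers 1, 2, 3
region_code <- rep(1:3, each = 20)
y <- c(10, 30, 15)[region_code] + rnorm(60)

# Numeric coding: lm() fits a single slope for region_code
fit_numeric <- lm(y ~ region_code)

# Factor coding: lm() creates one dummy variable per non-reference level
fit_factor <- lm(y ~ factor(region_code))

length(coef(fit_numeric))  # 2: intercept and one slope
length(coef(fit_factor))   # 3: intercept and two dummy coefficients
```

The numeric fit forces the group means onto a straight line in the codes, which is rarely what we want for a categorical variable.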

Code

(WI_data_LR <- tidy(lm(TARGET_deathRate ~ state,
                  data = WI_cancer_data))  %>%
              mutate_if(is.numeric, round, 3))

# A tibble: 2 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)      188.12      1.825   103.07        0
2 stateWashington  -21.628     3.345    -6.466       0

Where’s the line?

Note that \(\beta_0\) and \(\beta_1\) are not intercepts and slopes (even if R calls them that way)!

What are \(\beta_0\) and \(\beta_1\)?

\[Y_i = \beta_0 + \beta_1 \times X_i + \varepsilon_i\]

For counties in Indiana,

\(X_i=0\), then \(Y_i = \beta_0 + \beta_1 \times 0 + \varepsilon_i\). So we get,

\[Y_i = \beta_0 + \varepsilon_i\]

For counties in Washington,

\(X_i=1\), then \(Y_i = \beta_0 + \beta_1 \times 1 + \varepsilon_i\). So we get,

\[Y_i = \beta_0 + \beta_1 + \varepsilon_i\]

Interpretation of the “intercept”

For counties in Indiana,

\(X_i=0\) and \(Y_i = \beta_0 + \varepsilon_i\), then

\[E[Y_i|X_i=0] = \beta_0\]

Important

\(\beta_0\) is the mean of the response for the reference level of the input variable.

for example, the mean cancer mortality per capita in Indiana

Interpretation of the “slope”

For counties in Washington,

\(X_i=1\) and \(Y_i = \beta_0 + \beta_1 + \varepsilon_i\), then

\[E[Y_i|X_i=1] = \beta_0 + \beta_1\]

Note that

\[\beta_1 = E[Y_i|X_i=1] - E[Y_i|X_i=0]\]

Important

\(\beta_1\) is the difference of means of the response between levels, not the mean of the response in the other level!

for example, the mean cancer mortality in Washington minus the mean cancer mortality in Indiana.

The estimated coefficients

The estimated \(\hat{\beta}_0 = 188.121\) is the sample average (sample mean) mortality per capita in Indiana!!

It is the sample version of the conditional expectation (mean of the reference group)

The estimated \(\hat{\beta}_1 = -21.628\) is the difference between the average mortality per capita in Washington and the average mortality per capita in Indiana.

It is the sample version of the difference of the conditional expectations (or group means)
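These two facts can be checked numerically. The sketch below uses simulated mortality rates (the data frame toy and its numbers are hypothetical, not the course dataset) and compares the lm() coefficients with the sample group means.

```r
set.seed(301)
# Simulated mortality rates: hypothetical numbers, not the course dataset
toy <- data.frame(
  state = factor(rep(c("Indiana", "Washington"), times = c(30, 20))),
  rate  = c(rnorm(30, mean = 190, sd = 20), rnorm(20, mean = 165, sd = 20))
)

fit <- lm(rate ~ state, data = toy)

# Sample means of each group, computed directly
mean_indiana    <- mean(toy$rate[toy$state == "Indiana"])
mean_washington <- mean(toy$rate[toy$state == "Washington"])

coef(fit)                                          # intercept and dummy coefficient
c(mean_indiana, mean_washington - mean_indiana)    # same two numbers
```

The intercept matches the sample mean of the reference group, and the coefficient of stateWashington matches the difference of sample means.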

Deja vu

Note that you have done this type of analysis in STAT 201!!

t.test(TARGET_deathRate ~ state, data = WI_cancer_data, var.equal = TRUE)


    Two Sample t-test

data:  TARGET_deathRate by state
t = 6.4659, df = 129, p-value = 1.896e-09
alternative hypothesis: true difference in means between group Indiana and group Washington is not equal to 0
95 percent confidence interval:
 15.01026 28.24643
sample estimates:
   mean in group Indiana mean in group Washington 
                188.1207                 166.4923

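This is not a coincidence! lm() is computing the same (equal-variance) t-test!! The statistic has the same magnitude (6.466); only the sign differs, because t.test() takes Indiana minus Washington while the lm() coefficient is Washington minus Indiana.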
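The equivalence can be verified on any two-group dataset. Here is a sketch with simulated data (the data frame toy is hypothetical, not the course dataset) comparing the pooled-variance t-test with the test of the dummy coefficient in lm().

```r
set.seed(301)
# Simulated mortality rates: hypothetical numbers, not the course dataset
toy <- data.frame(
  state = factor(rep(c("Indiana", "Washington"), times = c(30, 20))),
  rate  = c(rnorm(30, mean = 190, sd = 20), rnorm(20, mean = 165, sd = 20))
)

# Pooled-variance two-sample t-test (first level minus second: Indiana - Washington)
tt <- t.test(rate ~ state, data = toy, var.equal = TRUE)

# Row of the lm() coefficient table for the dummy (Washington - Indiana)
lm_row <- summary(lm(rate ~ state, data = toy))$coefficients["stateWashington", ]

# Same magnitude, opposite sign
c(t_test = unname(tt$statistic), lm = unname(lm_row["t value"]))
# Same p-value
c(t_test = tt$p.value, lm = unname(lm_row["Pr(>|t|)"]))
```

Note that the match relies on var.equal = TRUE; the default Welch test in t.test() does not pool the variances and would generally give a slightly different statistic.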

And more than 2 levels?

If the categorical variable has more levels we need additional dummy variables!

Important

Each dummy variable compares one group to the reference group.

For example, suppose there are 3 states: “Indiana”, “Washington” and “Kansas”

2 treatments, 1 reference

\[\text{stateWashington}_i = \left\{ \begin{array}{ll} 1 & \text{if county is in Washington};\\ 0 & \text{otherwise}\end{array} \right.\]

\[\text{stateKansas}_i= \left\{ \begin{array}{ll} 1 & \text{if county is in Kansas};\\ 0 & \text{otherwise}\end{array} \right.\]

Indiana is the reference with both dummy variables equal to 0.
We need two dummy variables for 3 levels: 2 “treatment” levels compared to one reference level
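The two dummy variables can be inspected with model.matrix(). The sketch below uses a hypothetical six-county data frame, not the course dataset.

```r
# Toy data: hypothetical counties, not the course dataset
toy <- data.frame(
  state = factor(c("Indiana", "Kansas", "Washington",
                   "Kansas", "Indiana", "Washington"))
)

# Design matrix: intercept plus one dummy per non-reference level
X <- model.matrix(~ state, data = toy)
X

# Number of dummy variables = number of levels - 1
nlevels(toy$state) - 1
```

Rows for Indiana (the reference) have both dummy columns equal to 0. R lists the dummies in alphabetical order (stateKansas before stateWashington), so it is safer to read coefficients by name than by position.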

Mathematically

\[Y_i = \beta_0 + \beta_1 \times \text{stateWashington}_{i} + \beta_2 \times \text{stateKansas}_{i} + \varepsilon_i\]

For counties in Indiana,

\(\text{stateWashington}_{i} = \text{stateKansas}_{i} = 0\), then \[Y_i = \beta_0 + \beta_1 \times 0 + \beta_2 \times 0 + \varepsilon_i\]

So we get,

\[Y_i = \beta_0 + \varepsilon_i\]

\[E[Y_i|\text{Indiana}]= \beta_0 \]

For counties in Washington,

\(\text{stateWashington}_{i} = 1\) and \(\text{stateKansas}_{i} = 0\), then

\[Y_i = \beta_0 + \beta_1 \times 1 + \beta_2 \times 0 + \varepsilon_i\]

So we get,

\[Y_i = \beta_0 + \beta_1 + \varepsilon_i\]

\[E[Y_i|\text{Washington}]= \beta_0 + \beta_1\]

For counties in Kansas,

\(\text{stateWashington}_{i} = 0\) and \(\text{stateKansas}_{i} = 1\), then

\[Y_i = \beta_0 + \beta_1 \times 0 + \beta_2 \times 1 + \varepsilon_i\]

So we get,

\[Y_i = \beta_0 + \beta_2 + \varepsilon_i\]

\[E[Y_i|\text{Kansas}]= \beta_0 + \beta_2\]

Interpretation

\(\beta_0\) is the mean of the response for the reference level of the input variable
- the mean cancer mortality in Indiana
\(\beta_1\) is the difference between the mean of the response for the first treatment level and the reference level
- the mean cancer mortality in Washington minus the mean cancer mortality in Indiana
\(\beta_2\) is the difference between the mean of the response for the second treatment level and the reference level
- the mean cancer mortality in Kansas minus the mean cancer mortality in Indiana
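Going the other way, the group means can be recovered from the coefficients by adding each dummy coefficient to the intercept. A sketch with simulated data (the data frame toy and its numbers are hypothetical, not the course dataset):

```r
set.seed(301)
# Simulated mortality rates for three states: hypothetical numbers
toy <- data.frame(
  state = factor(rep(c("Indiana", "Kansas", "Washington"), each = 25)),
  rate  = c(rnorm(25, 190, 20), rnorm(25, 168, 20), rnorm(25, 166, 20))
)

fit <- lm(rate ~ state, data = toy)
b <- coef(fit)

# Group means implied by the model: reference mean, then reference + difference
implied_means <- c(
  Indiana    = unname(b["(Intercept)"]),
  Kansas     = unname(b["(Intercept)"] + b["stateKansas"]),
  Washington = unname(b["(Intercept)"] + b["stateWashington"])
)

# Sample means computed directly, as a plain named vector
sample_means <- sapply(split(toy$rate, toy$state), mean)

rbind(implied_means, sample_means)
```

The two rows should coincide, which is the sample version of \(E[Y_i|\text{Kansas}] = \beta_0 + \beta_2\) and its analogues.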

Numerically

(WIK_data_LR <- tidy(lm(TARGET_deathRate ~ state,
                  data = WIK_cancer_data))  %>% mutate_if(is.numeric, round, 2))

# A tibble: 3 × 5
  term            estimate std.error statistic p.value
  <chr>              <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)       188.12      2.29     82.13       0
2 stateKansas       -20.29      3.16     -6.42       0
3 stateWashington   -21.63      4.2      -5.15       0

WIK_cancer_data %>%
  group_by(state) %>%
  summarise(mean_mortality = round(mean(TARGET_deathRate, na.rm = TRUE), 2))

# A tibble: 3 × 2
  state      mean_mortality
  <fct>               <dbl>
1 Indiana            188.12
2 Kansas             167.83
3 Washington         166.49

Key takeaways

Mean cancer mortality in Washington is below that in Indiana. We know that because the estimate of the “slope” (the coefficient of stateWashington) is negative
The coefficient for categorical variables:
- Intercept: represents the mean response for the reference
- for the dummy variable (“slope”): represents difference of the mean response between the treatment and the reference
Dummy name: name of the factor followed by a level
How many dummy variables: k - 1, where k is the number of levels
Terminology: the levels of a factor are also called categories or groups
