Course Review
| Model | Response | Distribution | Parameter | Linear |
|---|---|---|---|---|
| MLR | continuous | Normal | mean of response | identity |
| Logistic | binary | Bernoulli | probability of S | log-odds |
| Poisson | counts | Poisson | mean count | log |
mean(house price) = \(\beta_0\) + \(\beta_1\)size + \(\beta_2\)basementY
log(odds of default) = \(\beta_0\) + \(\beta_1\)balance + \(\beta_2\)income + \(\beta_3\)studentY
log(mean(number of bikes)) = \(\beta_0\) + \(\beta_1\)temperature
These functions are used to interpret the regression coefficients!!!
| Model | continuous covariate |
|---|---|
| MLR | an increase of 1 unit in the covariate is associated with an estimated increase/decrease of \(|\hat{\beta}_1|\) in the average response |
| Logistic | an increase of 1 unit in the covariate is associated with an estimated increase/decrease in the log odds of success by \(|\hat{\beta}_1|\), or a change in the odds of success by a factor of \(e^{\hat{\beta}_1}\) |
| Poisson | an increase of 1 unit in the covariate is associated with an estimated increase/decrease in the log average counts by \(|\hat{\beta}_1|\), or or a change in the mean counts by a factor of \(e^{\hat{\beta}_1}\) |
Scroll down to see full content
Scroll down to see full content
estimate or estimated (these are not population quantities, they depend on the sample)
associated with (if the data comes from an observational study causation can not be established)
“by a factor of” or “times” or “percent” if estimated coefficients are exponentiated
“holding other variables constant at any value” if the model is additive and has more variables, otherwise don’t!
check units!!
for additive models: coefficients are interpreted holding other variables constant at any value
for models with interactions: interpretations depend on levels or values of other variables, not at any value, not regardless of other variables!
Warning
Interaction terms do not model correlations between covariates!!
We use interactions when the association between a covariate and the response depends on another covariate(s)
or “keeping the spending in TV and newspaper advertising constant at any value”
Table from ISLR
To be completed
We need the sampling distribution (distribution of the estimators of the regression coefficients, \(\hat{\beta}_j\))
For the 3 models, we usually use a Normal approximation of the sampling distribution (details beyond this course)
glance()Scroll down to see full content
The null hypothesis: \(H_0: \beta_j = 0\)
check the significance level
check the alternative hypothesis
the interpretion depends on the meaning of the coefficient
interpretations are for true coefficients, not estimates
We have statistically significant evidence (p-value < .001) that the mean number of fires is positively associated with temperature.
Remember that we never know what the true population parameters are! At a significance level:
or
Recall that once the data is collected, the intervals are not random and we interpret them in terms of confidence, not probabilities!
We are 95% confident that each additional degree in temperature is associated with an increase in the mean number of fires between 5% and 10%.
We can also use bootstrapping to compute CIs
In worksheet_03 and tutorial_03 we used simulations to study:
Normality: only for MLR, not needed for estimation or large sample apprximations but the linear model will be a good fit if the assumption holds.
Homoscedasticity: only for MLR, the spread (or variance) of the errors in a model is the same across all levels of the covariate(s).
Confounding Factor: a variable related with a covariate and the response, and it can make it look like there’s an association between them, even if there isn’t.
Multicollinearity: correlation between covariates.
Independence: we assume that observations are independent of each other
Scroll down to see full content
Predictions are values of the response computed with the estimated model for fixed values of the covariates:
Residuals is the difference between the observed response and the fitted values
Fitted values and residuals can be computed for the 3 models: MLR, Logistic and Poisson
However: recall that different quantities can be predicted with logistic (log-odds, odds, probabilities) and Poisson (log-counts, counts). Check code for different versions!
In Logistic and Poisson, the variance of each observation depends on the covariates (not constant), so residuals are adjusted (e.g., Pearson, deviance). Check code for different versions!
See tutorial_04 and tutorial_05
Many evaluation metrics used to evaluate MLR measure how far \(y\) is from \(\hat{y}\):
Residuals Sum of Squares: RSS = \(\sum_{i=1}^n(y_i - \hat{y}_i)^2\)
Residuals Standard Error: RSE = \(\sqrt{\frac{1}{n-p-1} \text{RSS}}\)
Mean Squared Error: MSE = \(\frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2\)
Coefficient of Determination, \(R^2 = 1 - \frac{RSS}{TSS}\)
Adjusted \(R^2\): \(\text{adj}R^2 = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}\)
For more details see Model Evaluation
For Logistic and Poisson regression, we measure the distance between \(y\) and \(\hat{y}\) using the deviance.
The deviance generalizes the Residual Sum of Squares (RSS) defined for MLR.
Image from Notes for Predictive Modeling
For more details see Model Evaluation: Goodness of Fit for Logistic and Poisson
We can compare nested models testing the (simultaneous) significance of additional coefficients in the larger model
reduced: price ~ size
full: price ~ size + distance + years_since_built
Note
The full model contains all the variables in the reduced one plus some new variables (in bold fonts)
The null hypothesis states that the coefficients of both variables, distance and years_since_built, are equal to zero
Suppose that the reduced model has \(q\) covariates and the full model has the same \(q\) plus \(s\) additional ones.
Is the full model significantly different from the reduced one??
\[H_0: \beta_{q+1} = \beta_{q+2} = \ldots = \beta_{q+s} =0\] For MLR: we can use an \(F\)-test to test this hypothesis (review worksheet_06 and tutorial_06)
For Logistic and Poisson: we use can use an \(\chi^2\)-test to test this hypothesis (review worksheet_07)
Note
In all cases you use anova(reduced, full) in R to compute the test
For Logistic and Poisson, you need to specify test = "Chisq"
Recall special cases: \(F\)-test vs \(t\)-test, reduced = intercept-only
For MLR: the RSS is used to add/remove variables (i.e., to compare models of the same size) and different measures can be used to compare models of different sizes (e.g., adj\(R^2\), AIC, BIC, \(C_p\))
For Logistic and Poisson: the AIC or BIC can be used to compare models
Note
We distinguished between the selection of generative models and predictive models, using different evalution metrics.
Review worksheet_07 for generative MLR models and tutorial_07 for generative MLR models
regsubsets() function from the leaps package in R can only be used for linear regression
stepAIC function from the MASS package in R can be used for the 3 models but is based on AIC and BIC only
Smoothly shrink the estimated coefficients. They minimize the RSS but subject to a bound on the size of the coefficients.
Ridge uses an \(L_2-\)norm to measure the size of the coefficients \[\lVert \beta \rVert_2^2 = \sum_{j = 1}^{p} \beta_j^2\]
Lasso uses an \(L_1-\)norm to measure the size of the coefficients \[\lVert \beta \rVert_1 = \sum_{j = 1}^{p} |\beta_j|\]
The bound (or equivalently, level of penalty \(\lambda\)) is usually selected by cross-validation aiming to minimize the test MSE
With enough penalization, LASSO shrinks all estimated coefficients to zero. A way to select variables!
Ridge will never reach a value of zero and thus can not be used to select variables (see tutorial_08)
This “shrinkage” process biases the estimated coefficients to favor prediction performance (see worksheet_08)
In an inference problem, we can re-fit the model using the selected variables postLASSO (see worksheet_08)
More details in Model Selection: Regularization Methods
Regardless if you are focused on inference or prediction, we need to use an independent dataset after model selection
one set to select, another set for inference (estimated coefficients, p-values, goodness-of-fit, etc)
if you check multicollinearity and select variables based on high GVIF, you need another set for inference. Make all selections using one set, test on the other one
this problem is known as the post-inference problem
in worksheet_08 we used a simulation study to show that the inference results computed on reused data are not valid!! but using an independent set gives valid results
one set to select, another set to predict
don’t test many models on the test data!! Only one model is tested on the test data
use cross-validation to choose models based on training data
or, divide the data in 3 parts: training, validation, testing. Use the validation test to compare models, move only one to the testing phase!
Logistic Regression can be used to build predict models
Classifier: the predicted probability can be used to predict the actual response, a class
We cannot use the Mean Squared Error as in MLR, alternative evaluation metrics are used:
Misclassification rate
Sensitivity and specificity
Precision and recall
the confusion matrix summarizes different types of errors made by the classifier.
the function confusionMatrix() from the package caret computes the confusion matrix and other evaluation metrics for classifiers
to define a classifier from a Logistic model, we need to set a threshold on the predicted probability to define the classes
using different thresholds and the predicted probabilites of a logistic regression, we can calculate an ROC and its AUC
perfect prediction results in an AUC of 1; the higher the AUC, the better predictive performance!!
regularized methods (e.g., LASSO and Ridge) can be used to build strong predictive logistic models
Tip
review worksheet_09 for definition of a classifier and metrics
review tutorial_09 for regularized methods for logistic regression
Confidence Intervals for Prediction (CIP): account for the uncertainty given by the estimated MLR to predict the conditional expectation of the response
Prediction intervals (PI): account for the uncertainty given by the estimated MLR to predict the actual response, i.e, the conditional expectation of the response plus the error that generates the data!
PIs are wider than CIPs, since they account for additional sources of variation. Both intervals are centered at the fitted value!
interpretations are conditional on the values of the covariates! (see worksheet_10 and tutorial_10)
More details in Prediction Uncertainty
© 2025 Gabriela Cohen Freue – Material Licensed under CC By-SA 4.0