Subset Selection and Diagnostics

Stepwise Regression

Which variables should you drop from your model? This is a common question in all modeling, but especially in logistic regression. In this section we will cover a popular variable selection technique - stepwise regression. This isn’t the only possible technique, but it will be the primary focus here.

We will be going back to the Ames, Iowa dataset for exploring these techniques.

Stepwise regression techniques include three common methods:

  1. Forward Selection
  2. Backward Selection
  3. Stepwise Selection

These techniques add or remove (depending on the technique) one variable at a time from your regression model to try to “improve” the model. There are a variety of selection criteria that can be used to decide whether to add or remove variables from a logistic regression. Two common approaches are to use either p-values or an information criterion such as AIC or BIC.

P-values are falling out of popularity primarily because people often use the 0.05 significance level without any regard to sample size. While 0.05 is a reasonable significance level for a sample size around 50, the level should be adjusted as the sample size changes.

However, it can be shown mathematically that using the AIC/BIC criteria to add or remove variables in stepwise selection (an increasingly popular approach) is equivalent to using p-values in likelihood ratio tests. AIC is calculated as follows:

\[ AIC = -2 \log(L) + 2p \] where \(L\) is the likelihood function and \(p\) is the number of parameters estimated in the model. Let’s compare two models - one with \(p\) parameters and one with \(p+1\) parameters. Assuming the additional variable lowers the AIC, we can see the following relationship:

\[\begin{aligned} AIC_{p+1} &< AIC_p \\ -2 \log(L_{p+1}) + 2(p+1) &< -2 \log(L_p) + 2p \\ 2 &< 2(\log(L_{p+1}) - \log(L_p)) \\ \end{aligned}\]

The right-hand side of this inequality is the likelihood ratio test statistic, which follows a \(\chi^2_1\) distribution. So the implied significance level from this LRT is the following:

\[ P(\chi^2_1 > 2) = 0.1573 \]

This means that the AIC selection method for stepwise selection is mathematically the same as a p-value stepwise selection technique with a significance level (\(\alpha\) level) of 0.1573.
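As a quick check, that implied significance level can be computed directly from the \(\chi^2_1\) distribution; here is a minimal sketch using scipy:

```python
from scipy.stats import chi2

# P(chi-squared with 1 df exceeds 2): the significance level implied by AIC-based selection
alpha_aic = chi2.sf(2, df=1)
print(round(alpha_aic, 4))  # 0.1573
```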

The same math can be applied to the BIC selection technique as well. The BIC calculation is the following:

\[ BIC = -2 \log(L) + p \times \log(n) \]

Working through the same algebra, \(BIC_{p+1} < BIC_p\) reduces to \(\log(n) < 2(\log(L_{p+1}) - \log(L_p))\), so the comparison is again a likelihood ratio test statistic against a \(\chi^2_1\) distribution - only now the cutoff is \(\log(n)\) instead of 2. So we know the significance level from this LRT is the following:

\[ P(\chi^2_1 > \log(n)) \]

Notice how the significance level now changes with the sample size through the \(\log(n)\) term, which is exactly what is recommended for any selection technique using p-values.
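To see how quickly this implied significance level tightens with sample size, here is a minimal sketch using scipy (the sample sizes are just illustrative):

```python
import numpy as np
from scipy.stats import chi2

# Significance level implied by BIC-based selection: the LRT cutoff is log(n) instead of 2
for n in [50, 200, 1000, 5000]:
    alpha_bic = chi2.sf(np.log(n), df=1)
    print(f"n = {n:>5}: implied alpha = {alpha_bic:.4f}")
```

At a sample size around 50 the implied level is close to 0.05, and it steadily shrinks as the sample grows.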

With that all understood, let’s evaluate the three common stepwise regression approaches.

Stepwise Selection

Here we will work through stepwise selection. In stepwise selection, the initial model is empty (it contains no variables, only the intercept). Each candidate variable is then tried to see if it is significant based on AIC/BIC or on a p-value at a specified significance level. The most significant variable (or the one that reduces the AIC/BIC the most) is then added to the model. Next, all variables currently in the model (at this point only one) are re-tested to see if they are still significant (or whether dropping them would improve the AIC/BIC); any that are not are removed. The remaining candidate variables are then tested again and the next most impactful variable (by p-value or AIC/BIC, depending on the approach) is added to the model. This process repeats until either no more significant variables are available to add or the same variable keeps being added and then removed from the model based on AIC/BIC or p-value (depending on the approach).

Let’s look at this approach using all of our software!
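While the exact commands differ across software, the underlying algorithm is straightforward. Here is a minimal Python sketch using pandas and statsmodels - not the routine from any particular package - where the data frame `ames`, the response column, and the candidate column names in the commented example are placeholders:

```python
import pandas as pd
import statsmodels.api as sm

def aic_of(selected, y, data):
    """AIC of a logistic regression with an intercept plus the selected columns."""
    X = pd.DataFrame({"const": 1.0}, index=data.index)
    for col in selected:
        X[col] = data[col]
    return sm.Logit(y, X).fit(disp=0).aic

def stepwise_aic(y, candidates, data):
    """Bidirectional (stepwise) selection: add the variable that lowers the AIC
    the most, then re-check whether any variable already in the model should be dropped."""
    selected, best_aic = [], aic_of([], y, data)
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each candidate not yet in the model
        adds = [(aic_of(selected + [c], y, data), c)
                for c in candidates if c not in selected]
        if adds and min(adds)[0] < best_aic:
            best_aic, best_var = min(adds)
            selected.append(best_var)
            improved = True
        # Backward step: try dropping each variable currently in the model
        drops = [(aic_of([v for v in selected if v != c], y, data), c)
                 for c in selected]
        if drops and min(drops)[0] < best_aic:
            best_aic, worst_var = min(drops)
            selected.remove(worst_var)
            improved = True
    return selected, best_aic

# Hypothetical usage (column names are placeholders, not necessarily the Ames variables used here):
# selected, aic = stepwise_aic(ames["bonus"], ["Gr_Liv_Area", "Fireplaces", "Garage_Area"], ames)
```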

Backward Selection

Here we will work through backward selection. In backward selection, the initial model is full (it contains all of the variables, plus the intercept). Each variable is then tested to see if it is significant based on AIC/BIC or on a p-value at a specified significance level. The least significant variable (or the one whose removal improves the AIC/BIC the most) is then dropped from the model. The remaining variables are then tested again and the next least significant variable is dropped. This process repeats until there are no more insignificant variables to drop, or until the AIC/BIC no longer improves with the deletion of another variable.

Let’s look at this approach using all of our software!
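Continuing the same Python sketch (and reusing the hypothetical `aic_of` helper defined in the stepwise example above), backward elimination simply reverses the direction:

```python
def backward_aic(y, candidates, data):
    """Backward elimination: start with every candidate in the model, then
    repeatedly drop the variable whose removal lowers the AIC the most."""
    selected = list(candidates)
    best_aic = aic_of(selected, y, data)
    while selected:
        drops = [(aic_of([v for v in selected if v != c], y, data), c)
                 for c in selected]
        drop_aic, drop_var = min(drops)
        if drop_aic < best_aic:
            selected.remove(drop_var)
            best_aic = drop_aic
        else:
            break  # no deletion improves the AIC, so stop
    return selected, best_aic
```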

Forward with Interactions

Here we will work through forward selection. In forward selection, the initial model is empty (it contains no variables, only the intercept). Each candidate variable is then tried to see if it is significant based on AIC/BIC or on a p-value at a specified significance level. The most significant variable (or the one that reduces the AIC/BIC the most) is then added to the model. The remaining candidate variables are then tested again and the next most impactful variable (by p-value or AIC/BIC, depending on the approach) is added. This process repeats until no more significant variables are available to add to the model based on AIC/BIC or p-value (depending on the approach). This approach is the same as stepwise selection without the additional check at each step for possible removal.

Forward selection is the least used of these techniques because stepwise selection does everything forward selection does with the added benefit of dropping variables that become insignificant. The main use for forward selection is to test higher-order terms and interactions in models.

Let’s look at this approach using all of our software!
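In the same Python sketch (again reusing the hypothetical `aic_of` helper from the stepwise example), forward selection is the stepwise routine without the removal step, and interaction columns can simply be added to the candidate pool:

```python
def forward_aic(y, candidates, data):
    """Forward selection: keep adding the single candidate that lowers the AIC
    the most; stop when no addition improves the AIC (no removal step)."""
    selected, best_aic = [], aic_of([], y, data)
    remaining = list(candidates)
    while remaining:
        adds = [(aic_of(selected + [c], y, data), c) for c in remaining]
        add_aic, add_var = min(adds)
        if add_aic < best_aic:
            selected.append(add_var)
            remaining.remove(add_var)
            best_aic = add_aic
        else:
            break
    return selected, best_aic

# To screen higher-order terms, just add interaction columns to the candidate
# pool (hypothetical column names for illustration):
# ames["liv_x_fireplaces"] = ames["Gr_Liv_Area"] * ames["Fireplaces"]
# forward_aic(ames["bonus"], ["Gr_Liv_Area", "Fireplaces", "liv_x_fireplaces"], ames)
```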

Diagnostic Plots

Linear regression models contain residuals with properties that are very useful for model diagnostics. However, what is a residual in a logistic regression model? Since we don’t have actual probabilities to compare our predicted probabilities against, residuals are not as clearly defined. Instead we have pseudo “residuals” in logistic regression that we can explore further. Two examples of this are deviance residuals and Pearson residuals.

Deviance is a measure of how far a fitted model is from the fully saturated model. The fully saturated model is a model that predicts our data perfectly by essentially overfitting to it - a parameter for each unique combination of inputs. This makes the saturated model impractical to use, but good for comparison. The deviance is essentially our “error” from this “perfect” model. Fitting a logistic regression by maximum likelihood is equivalent to minimizing the deviance, which is the sum of the squared deviance residuals. Deviance residuals therefore tell us how much each observation contributes to the overall deviance.

Pearson residuals, on the other hand, tell us how much each observation contributes to the Pearson chi-squared statistic for the overall model.

Other ways of measuring an observation’s influence on the logistic regression model are DFBetas and Cook’s D. Similar to their interpretation in linear regression, these two calculations tell us how each observation changes the estimation of each parameter individually (DFBeta) or how each observation changes the estimation of all the parameters holistically (Cook’s D).

Let’s see how to get all of these from our software!
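One way to pull these quantities out in Python is through statsmodels, fitting the logistic regression as a binomial GLM. This is a minimal sketch on synthetic data; the attribute names assume a recent statsmodels version, and in practice the design matrix and outcome would come from the Ames model:

```python
import numpy as np
import statsmodels.api as sm

# Small synthetic example so the diagnostics run end to end
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
true_beta = np.array([-0.5, 1.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

# Fit the logistic regression as a binomial GLM so the residual/influence helpers are available
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

dev_resid = fit.resid_deviance     # deviance residuals
pear_resid = fit.resid_pearson     # Pearson residuals

infl = fit.get_influence()
cooks_d = infl.cooks_distance[0]   # Cook's D for each observation
dfbetas = infl.dfbetas             # one row per observation, one column per coefficient
```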

Calibration Curves

Another evaluation/diagnostic for logistic regression is the calibration curve. The calibration curve is a goodness-of-fit measure for logistic regression. Calibration measures how well predicted probabilities agree with the actual observed frequency of outcomes (estimated across the data set, typically by binning or smoothing the predictions). These curves can help detect bias in your model - whether predictions are consistently too high or too low.

If the curve is above the diagonal line, this indicates the model is predicting lower probabilities than actually observed. The opposite is true if the curve is below the diagonal line.

This is best used on larger samples since we are estimating the observed proportion of events from the data. In smaller samples this relationship is extrapolated out from the center of the data and may not reflect the truth as accurately.

Let’s look at creating these in all of our software!
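As one example, scikit-learn’s `calibration_curve` bins the predictions and computes the observed event rate within each bin. This is a minimal sketch on synthetic stand-ins for the observed outcomes and predicted probabilities:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Synthetic stand-ins: in practice y_true is the observed 0/1 outcome and
# y_prob the predicted probabilities from the fitted logistic regression
rng = np.random.default_rng(0)
y_prob = rng.uniform(size=500)
y_true = rng.binomial(1, y_prob)

# Observed event rate vs. average predicted probability within each bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed proportion of events")
plt.legend()
plt.show()
```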