Preparing the Data

Previously, we looked at a binary logistic regression with only a couple of variables from our Ames housing data predicting bonus eligibility. However, to explore our data at scale we need to run many \(\chi^2\) tests, as we learned in the Categorical Data Analysis section of the course deck, to explore relationships between our categorical target variable and categorical predictors. We also need to evaluate relationships between the categorical target variable and continuous predictor variables. To evaluate those relationships we could run either individual ANOVA calculations between each continuous variable and the target variable or simple logistic regressions. The ANOVA calculations would be quicker because they have a closed-form solution and do not need iterative maximum likelihood estimation.
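As a rough illustration of this kind of screening, the sketch below loops over every predictor and runs a \(\chi^2\) test for categorical columns and a one-way ANOVA for numeric columns. The function name `screen_predictors`, the toy data, and its column names are hypothetical stand-ins for the Ames housing data, and treating every numeric column as continuous is a simplifying assumption.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

def screen_predictors(df, target):
    """Return one p-value per predictor, sorted from smallest to largest."""
    rows = []
    for col in df.columns.drop(target):
        if pd.api.types.is_numeric_dtype(df[col]):
            # Continuous predictor: one-way ANOVA across the target groups
            groups = [g[col].dropna().to_numpy() for _, g in df.groupby(target)]
            p_value = f_oneway(*groups).pvalue
            test = "anova"
        else:
            # Categorical predictor: chi-square test of independence
            table = pd.crosstab(df[col], df[target])
            p_value = chi2_contingency(table)[1]
            test = "chi2"
        rows.append({"variable": col, "test": test, "p_value": p_value})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

# Hypothetical toy data (NOT the Ames data): two related predictors, one unrelated
rng = np.random.default_rng(0)
n = 400
bonus = rng.integers(0, 2, n)
toy = pd.DataFrame({
    "bonus": bonus,
    "living_area": 1500 + 600 * bonus + rng.normal(0, 300, n),
    "central_air": np.where(rng.random(n) < 0.3 + 0.5 * bonus, "Y", "N"),
    "noise": rng.normal(0, 1, n),
})
screening = screen_predictors(toy, "bonus")
print(screening)
```

The resulting table can then be filtered at whatever significance cutoff you prefer before moving on to model building.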

Let’s run these variable explorations in each software!

Quasi-Complete Separation

In the previous section of the code deck we talked about looking at quasi-complete separation. To make sure that our data doesn’t have variables with quasi-complete separation concerns, we can write a function that checks each categorical variable and then combines any problem categories with the reference category for that variable.
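A minimal sketch of such a check is shown here. It flags any category that has a zero cell when cross-tabulated against the target and then recodes those categories. The helper names and toy data are hypothetical, and using the most frequent unflagged level as the "reference" category is an assumption made only for illustration.

```python
import pandas as pd

def find_separated_levels(df, target):
    """Map each categorical column to the levels with a zero cell against the target."""
    flagged = {}
    for col in df.columns.drop(target):
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # only categorical predictors are checked
        table = pd.crosstab(df[col], df[target])
        levels = table.index[(table == 0).any(axis=1)].tolist()
        if levels:
            flagged[col] = levels
    return flagged

def collapse_into_reference(series, levels, reference=None):
    """Recode flagged levels to the reference level.

    Assumption: if no reference is given, use the most frequent unflagged level.
    """
    if reference is None:
        reference = series[~series.isin(levels)].mode().iloc[0]
    return series.where(~series.isin(levels), reference)

# Hypothetical toy data: every "C" row has bonus = 1, so "C" has a zero cell
toy = pd.DataFrame({
    "bonus":  [0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    "garage": ["A", "A", "B", "B", "A", "A", "B", "C", "C", "A"],
})
flagged = find_separated_levels(toy, "bonus")
fixed = collapse_into_reference(toy["garage"], flagged["garage"])
print(flagged)
print(fixed.value_counts())
```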

Let’s see how to do this in each software!

Stepwise Regression

Which variables should you drop from your model? This is a common question for all modeling, including logistic regression. In this section we will cover a popular variable selection technique - stepwise regression. This isn’t the only possible technique, but will be the primary focus here.

Stepwise regression techniques involve the three common methods:

  1. Forward Selection

  2. Backward Selection

  3. Stepwise Selection

These techniques add or remove (depending on the technique) one variable at a time from your regression model to try and “improve” the model. There are a variety of different selection criteria to use to add or remove variables from a logistic regression which will be covered in more detail in the model assessment part of the code deck.

The specific details of each of these are covered in full in the Model Building portion of this code deck. Let’s revisit what stepwise selection is doing.

Stepwise Selection

Let’s revisit the stepwise selection approach with cross-validation specifically.

Stepwise Variable Selection with CV

In the first step of the stepwise selection process, we create models such that each model has exactly one predictor variable and calculate a model metric for each model. However, we do this within each cross-validation training set. For 10-fold cross-validation, that would be 10 separate first steps that each create models with only one variable. Next, we average the model metric across all cross-validation training sets to see which variable is best. For example, the 5th variable might be the best on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated at every step until a final model is selected where adding any further variable does not improve the model based on the cross-validation model metrics. Remember, the difference between forward and stepwise selection is that stepwise selection also re-evaluates each variable already in the model at each step to see if deleting that variable improves the model metric.
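The loop described above might be sketched as follows. This is an illustration under several assumptions rather than the approach any particular software takes: it uses scikit-learn, cross-validated AUC as the model metric, a large `C` to approximate an unpenalized logistic regression, and simulated data from `make_classification`. The names `cv_score` and `stepwise_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, cols, cv=10):
    """Average cross-validated AUC for a logistic regression on the given columns."""
    model = LogisticRegression(C=1e6, max_iter=1000)  # large C: nearly unpenalized
    return cross_val_score(model, X[:, cols], y, cv=cv, scoring="roc_auc").mean()

def stepwise_cv(X, y, cv=10, tol=1e-4):
    selected, best = [], 0.5  # 0.5 is the AUC of an intercept-only model
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each variable not yet in the model
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: cv_score(X, y, selected + [j], cv) for j in candidates}
        if scores:
            j_best = max(scores, key=scores.get)
            if scores[j_best] > best + tol:
                selected.append(j_best)
                best = scores[j_best]
                improved = True
        # Removal check: try deleting each variable already in the model
        if len(selected) > 2:
            drops = {j: cv_score(X, y, [k for k in selected if k != j], cv)
                     for j in selected}
            j_drop = max(drops, key=drops.get)
            if drops[j_drop] > best + tol:
                selected.remove(j_drop)
                best = drops[j_drop]
                improved = True
    return selected, best

# Simulated data: with shuffle=False the informative columns are 0 and 1
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
selected, best = stepwise_cv(X, y)
print("selected columns:", selected, "CV AUC:", round(best, 3))
```

Because every accepted addition or removal must raise the averaged metric by more than `tol`, the loop is guaranteed to stop.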

Let’s see this in each software!

Backward Selection

Let’s also revisit the backward selection approach with cross-validation.

Backward Variable Elimination with CV

In the first step of the backward selection process, we start from the full model and create models such that each model has exactly one predictor variable removed from it, then calculate a model metric for each model. However, we do this within each cross-validation training set. For 10-fold cross-validation, that would be 10 separate first steps that each create models with only one variable removed. Next, we average the model metric across all training sets to see which variable is worst. For example, the second variable might be the worst on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated across all of the steps until a final model is selected where removing any further variables does not improve the model based on the cross-validation model metrics.
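A comparable sketch for backward elimination is given below, again assuming scikit-learn, cross-validated AUC as the metric, a large `C` to approximate an unpenalized fit, and simulated data. The helper names `cv_score` and `backward_cv` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, cols, cv=10):
    """Average cross-validated AUC for a logistic regression on the given columns."""
    model = LogisticRegression(C=1e6, max_iter=1000)  # large C: nearly unpenalized
    return cross_val_score(model, X[:, cols], y, cv=cv, scoring="roc_auc").mean()

def backward_cv(X, y, cv=10, tol=0.0):
    selected = list(range(X.shape[1]))  # start from the full model
    best = cv_score(X, y, selected, cv)
    while len(selected) > 1:
        # Score every model that has exactly one variable removed
        drops = {j: cv_score(X, y, [k for k in selected if k != j], cv)
                 for j in selected}
        j_worst = max(drops, key=drops.get)  # removal that helps the most
        if drops[j_worst] <= best + tol:
            break  # no removal improves the averaged metric
        selected.remove(j_worst)
        best = drops[j_worst]
    return selected, best

# Simulated data: with shuffle=False the informative columns are 0 and 1
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
full_score = cv_score(X, y, list(range(X.shape[1])))
kept, final_score = backward_cv(X, y)
print("kept columns:", kept, "CV AUC:", round(final_score, 3))
```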

Let’s see this in each software!

Forward Selection with Interactions

Here we will work through forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see how much it improves the model based on a model metric. The most important variable (based on that metric) is then added to the model. Then the remaining variables are tested again and the next most impactful variable (based on the same model metric) is added to the model. This process repeats until either no more variables are available to add or no remaining variable improves the model based on the metric. This approach is the same as stepwise selection without the additional check at each step for possible removal.

Forward selection is the least used technique because stepwise selection does the same as forward selection with the added benefit of dropping insignificant variables. The main use for forward selection is to test higher order terms and interactions in models. You can start with your model determined from either stepwise or backward selection and then try to add only interactions between variables from there.

The code for this is not shown here, but it is a simple change to the code above to perform this approach.

Calibration Curves

Another evaluation/diagnostic for logistic regression is the calibration curve. The calibration curve is a goodness-of-fit measure for logistic regression. Calibration measures how well predicted probabilities agree with actual frequency counts of outcomes (estimated linearly across the data set). These curves can help detect if predictions are consistently too high or low in your model.

With predicted probabilities on the horizontal axis and observed proportions on the vertical axis, a curve above the diagonal line indicates the model is predicting lower probabilities than actually observed. The opposite is true if the curve is below the diagonal line.

This is best used on larger samples since we are calculating the observed proportion of events in the data. In smaller samples this relationship is extrapolated out from the center and may not reflect the truth as accurately.
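To make the idea concrete, here is a small sketch that uses a simple binned calibration table instead of a smoothed curve: predictions are grouped into equal-width bins, and the mean predicted probability in each bin is compared with the observed event rate. The function name `calibration_table` and the simulated data are hypothetical.

```python
import numpy as np

def calibration_table(y, p_hat, n_bins=10):
    """Mean predicted probability and observed event rate within equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index 0..n_bins-1; a prediction of exactly 1.0 falls in the last bin
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    predicted, observed = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():  # skip empty bins
            predicted.append(p_hat[in_bin].mean())
            observed.append(y[in_bin].mean())
    return np.array(predicted), np.array(observed)

# Simulated outcomes whose true event probabilities are known
rng = np.random.default_rng(1)
p_true = rng.random(20000)
y = (rng.random(20000) < p_true).astype(int)

# A well-calibrated model: observed rates track the predictions (near the diagonal)
pred_ok, obs_ok = calibration_table(y, p_true)

# A model that predicts too low: observed rates sit above the predictions
pred_low, obs_low = calibration_table(y, 0.7 * p_true)

print(np.round(obs_ok - pred_ok, 3))
print(np.round(obs_low - pred_low, 3))
```

Plotting `observed` against `predicted` with a 45-degree reference line gives the binned version of the calibration curve.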

Let’s look at creating these in each software!

Diagnostic Plots

Linear regression models contain residuals with properties that are very useful for model diagnostics. However, what is a residual in a logistic regression model? Since we don’t have actual probabilities to compare our predicted probabilities against, residuals are not as clearly defined. Instead we have pseudo “residuals” in logistic regression that we can explore further. Two examples of this are deviance residuals and Pearson residuals.

Deviance is a measure of how far a fitted model is from the fully saturated model. The fully saturated model is a model that predicts our data perfectly by essentially overfitting to it - a variable for each unique combination of inputs. This makes this model impractical for use, but good for comparison. The deviance is essentially our “error” from this “perfect” model. Logistic regression minimizes the deviance, which equals the sum of the squared deviance residuals. Deviance residuals tell us how much each observation contributes to the deviance.

Pearson residuals, on the other hand, tell us how much each observation contributes to the Pearson chi-squared statistic for the overall model.
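For a binary outcome, both kinds of residual can be computed directly from the observed outcomes and the predicted probabilities. The sketch below uses the standard formulas on a handful of made-up values (the arrays `y` and `p` are illustrative, not model output) and checks that the squared deviance residuals add up to the deviance.

```python
import numpy as np

def deviance_residuals(y, p):
    """sign(y - p) * sqrt(-2 * log-likelihood contribution of each observation)."""
    loglik_i = y * np.log(p) + (1 - y) * np.log(1 - p)
    return np.sign(y - p) * np.sqrt(-2 * loglik_i)

def pearson_residuals(y, p):
    """Raw residual scaled by the binomial standard deviation."""
    return (y - p) / np.sqrt(p * (1 - p))

# Made-up outcomes and predicted probabilities for illustration
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.8, 0.3, 0.6, 0.9, 0.1])

d = deviance_residuals(y, p)
r = pearson_residuals(y, p)

# For 0/1 data the saturated model has log-likelihood 0,
# so the deviance is -2 times the model log-likelihood
deviance = np.sum(d ** 2)
pearson_chi2 = np.sum(r ** 2)
print(np.round(d, 3), round(deviance, 3))
print(np.round(r, 3), round(pearson_chi2, 3))
```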

Other forms of measuring an observation’s influence on the logistic regression model are DFBetas and Cook’s D. Similar to their interpretation in linear regression, these two calculations tell us how each observation changes the estimation of each parameter individually (DFBeta) or how each observation changes the estimation of all the parameters holistically (Cook’s D).
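One way to see what these influence measures capture is to compute them by hand. The sketch below is only an illustration under stated assumptions: it fits a logistic regression with a plain Newton-Raphson loop in NumPy, obtains unscaled DFBetas by brute-force refitting with each observation left out, and uses the common one-step approximation for Cook’s D based on Pearson residuals and hat values. Statistical packages typically report scaled, approximated versions, so their numbers may differ. All names and the simulated data are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Maximum likelihood fit by Newton-Raphson (X must include an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: (X'WX)^{-1} X'(y - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

# Simulated data: intercept plus one continuous predictor
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (rng.random(n) < 1 / (1 + np.exp(-(-0.5 + x)))).astype(float)

beta_full = fit_logistic(X, y)

# Unscaled DFBeta: change in each coefficient when observation i is deleted
dfbeta = np.empty((n, X.shape[1]))
for i in range(n):
    keep = np.arange(n) != i
    dfbeta[i] = beta_full - fit_logistic(X[keep], y[keep])

# One-step Cook's D from Pearson residuals and hat values
p = 1 / (1 + np.exp(-X @ beta_full))
W = p * (1 - p)
XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
hat = W * np.einsum("ij,jk,ik->i", X, XtWX_inv, X)
pearson = (y - p) / np.sqrt(W)
k = X.shape[1]
cooks_d = pearson ** 2 * hat / (k * (1 - hat) ** 2)

print("most influential observation:", int(np.argmax(cooks_d)))
print("its DFBetas:", np.round(dfbeta[np.argmax(cooks_d)], 4))
```

DFBeta gives one value per observation per coefficient, while Cook’s D summarizes the effect on all coefficients in a single number per observation.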

Let’s see how to get all of these from each software!