Subset Selection

Preparing the Data

Previously, we looked at a binary logistic regression with only a couple of variables from our Ames housing data predicting bonus eligibility. However, to explore our data at scale we need to run many \(\chi^2\) tests, like those we learned in the Categorical Data Analysis section of the code deck, to explore relationships between our categorical target variable and categorical predictors. We also need to evaluate relationships between the categorical target variable and continuous predictor variables. To evaluate those relationships we could run individual ANOVA calculations between each continuous variable and the target variable, or simple logistic regressions. The ANOVA calculations would be quicker because they do not require the iterative optimization used in maximum likelihood estimation.

Let’s run these variable explorations in each software!

Python
R

Before we can put the predictor variables in our functions for automated screening we need to convert our categorical predictor variables into dummy coded variables. Essentially, we need to one-hot encode our variables to make each category have a corresponding binary flag variable. Using the get_dummies function from Pandas we can do this easily. First, we must remove the target variable from our training data with the drop function. Just to make sure that all variables are of the same type, after converting the variables to binary flags we use the astype(float) function on all our predictor variables. Lastly, we call the predictor variables dataframe X and our target variable y.

Code

train['bonus'] = (train['SalePrice'] > 175000).astype(int)

predictors = train.drop(columns=['SalePrice', 'bonus'])
predictors = pd.get_dummies(predictors, drop_first=True)
predictors = predictors.astype(float)

predictors = predictors.drop(['GarageType_Missing', 
                              'GarageQual_Missing',
                              'GarageCond_Missing'], axis=1)

X = predictors
y = train['bonus']
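To make the effect of the drop_first option concrete, here is a minimal sketch on a made-up data frame. The toy object and its values are hypothetical and are not taken from the Ames data; only get_dummies and astype mirror the code above.

```python
import pandas as pd

# Hypothetical toy data: one categorical and one numeric predictor
toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "IR2", "Reg"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# drop_first=True removes the first category (alphabetically) of each
# categorical variable, which then acts as the reference category
encoded = pd.get_dummies(toy, drop_first=True).astype(float)

print(encoded.columns.tolist())
print(encoded)
```

Here IR1 has no flag of its own, so a row with zeros in both LotShape flags belongs to the IR1 reference category.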

Once our data is in this format we can use the SelectKBest, chi2, and f_classif functions from the popular scikit-learn package, specifically sklearn.feature_selection. Inside SelectKBest we will use the f_classif and chi2 functions as our scoring functions for continuous and categorical predictors, respectively.

First we will explore the categorical predictor variables. However, since we have already dummy coded all of our categorical variables, we can isolate them by keeping the variables with only two unique values, using the nunique function inside of a list comprehension.

Once we isolate the categorical variables into their own X_cat object we can use the SelectKBest function with the score_func = chi2 option. The k = 'all' option doesn’t eliminate any variables, but instead calculates the p-value for each variable. That way we can be the ones to decide which variables we want to keep. We then input our X_cat and y objects into the fit function on the SelectKBest object. From there we just save the column names (columns), the test statistics (scores_), and p-values (pvalues_) into a single dataframe. From here we can remove the variables with a p-value above our 0.009 cut-off by isolating the specific variables that meet that requirement.

Code

from sklearn.feature_selection import SelectKBest, chi2, f_classif


# Separate categorical (dummy) vs. continuous features
categorical_features = [col for col in X.columns if X[col].nunique() == 2]
continuous_features = [col for col in X.columns if X[col].nunique() > 2]

X_cat = X[categorical_features]
X_cont = X[continuous_features]

# Fit SelectKBest for Categorical Variables
selector = SelectKBest(score_func=chi2, k='all')  # 'all' keeps all features for scoring
selector.fit(X_cat, y)

SelectKBest(k='all', score_func=<function chi2 at 0x32a21a7a0>)


Code


# Create a DataFrame with feature names, Chi2-scores, and p-values
scores_cat_df = pd.DataFrame({
    'Feature': X_cat.columns,
    'Chi2_score': selector.scores_,
    'p_value': selector.pvalues_
})

# Filter for features with p-value < 0.009
selected_cat_features = scores_cat_df[scores_cat_df['p_value'] < 0.009]['Feature']

Next we will explore the continuous predictor variables we obtained from the previous code.

Once we isolate the continuous variables into their own X_cont object we can use the SelectKBest function with the score_func = f_classif option. The k = 'all' option doesn’t eliminate any variables, but instead calculates the p-value for each variable. That way we can be the ones to decide which variables we want to keep. We then input our X_cont and y objects into the fit function on the SelectKBest object. From there we just save the column names (columns), the test statistics (scores_), and p-values (pvalues_) into a single dataframe. From here we can remove the variables with a p-value above our 0.009 cut-off by isolating the specific variables that meet that requirement. We then select only those variables from our original X object to get our reduced variable list called X_reduced.

Code

# Fit SelectKBest for Continuous Variables
selector = SelectKBest(score_func=f_classif, k='all')  # 'all' keeps all features for scoring
selector.fit(X_cont, y)

SelectKBest(k='all')

Code

# Create a DataFrame with feature names, F-scores, and p-values
scores_cont_df = pd.DataFrame({
    'Feature': X_cont.columns,
    'F_score': selector.scores_,
    'p_value': selector.pvalues_
})

# Filter for features with p-value < 0.009
selected_cont_features = scores_cont_df[scores_cont_df['p_value'] < 0.009]['Feature']

# Create a new DataFrame with only those selected columns
X_reduced = X[selected_cat_features.tolist() + selected_cont_features.tolist()]

X_reduced.head()

      GarageYrBlt_was_missing  LotShape_IR2  ...  OpenPorchSF  EnclosedPorch
618                       0.0           0.0  ...        108.0            0.0
792                       0.0           0.0  ...        130.0            0.0
483                       0.0           0.0  ...        125.0            0.0
1012                      0.0           0.0  ...          0.0          112.0
108                       1.0           0.0  ...          0.0          144.0

[5 rows x 56 columns]
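The F statistic that f_classif reports for each continuous predictor is the one-way ANOVA F test described at the start of this section. As a minimal check, assuming synthetic data (x_demo and y_demo are made up for illustration), the sketch below compares f_classif with scipy's f_oneway for a single predictor and a binary target.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.feature_selection import f_classif

# Hypothetical data: a binary target and one continuous predictor
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 50)
x_demo = rng.normal(loc=y_demo * 0.8, scale=1.0)

# scikit-learn expects a 2-D feature matrix and returns arrays of F and p
F_sklearn, p_sklearn = f_classif(x_demo.reshape(-1, 1), y_demo)

# One-way ANOVA comparing the predictor across the two target groups
F_scipy, p_scipy = f_oneway(x_demo[y_demo == 0], x_demo[y_demo == 1])

print(F_sklearn[0], F_scipy)
print(p_sklearn[0], p_scipy)
```

The two pairs of numbers should agree up to floating point error, which is why the screening runs quickly: each variable needs only group means and variances, not an iterative model fit.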

Before we can put the predictor variables in our functions for automated screening we need to convert our categorical variables into dummy coded variables. Essentially, we need to one-hot encode our variables to make each category have a corresponding binary flag variable. Using the model.matrix function we can do this easily. First, we must remove the target variable from our training data with the select function. Lastly, we call the predictor variables dataframe X and our target variable y.

Code

library(broom)
library(dplyr)

train$bonus <- as.integer(train$SalePrice > 175000)

predictors <- train %>%
  select(-SalePrice, -bonus)

# Create dummy variables; the `- 1` removes the intercept column, so the first
# factor keeps all of its levels while every later factor drops its first level
predictors_dummies <- model.matrix(~ . -1, data = predictors)

predictors_dummies <- as.data.frame(predictors_dummies)

predictors_dummies <- predictors_dummies %>%
  select(-GarageTypeMissing, -GarageQualMissing, -GarageCondMissing)

# Final features and target
X <- predictors_dummies
y <- train$bonus

Code

names(X)[names(X) == "`1stFlrSF`"] <- "FirstFlrSF"
names(X)[names(X) == "`2ndFlrSF`"] <- "SecondFlrSF"
names(X)[names(X) == "`3SsnPorch`"] <- "ThreeStoryPorch"

Once our data is in this format we can use the lapply function. Inside lapply we will use the aov and chisq.test functions as our scoring functions for continuous and categorical predictors respectively.

First we will explore the categorical predictor variables. However, since we have already dummy coded all of our categorical variables, to isolate just the categorical variables we can isolate the variables with only two unique values using the length and unique functions inside of an sapply function.

Once our data is in this format we can use the lapply function with our own function that calculates an individual \(\chi^2\) test for each variable using the chisq.test function. From those individual tests we use the p.value and statistic elements to extract the p-value and test statistic information. This approach doesn’t eliminate any variables, but instead calculates the p-value for each variable. That way we can be the ones to decide which variables we want to keep. From here we can remove the variables with a p-value above our 0.009 cut-off by isolating the specific variables that meet that requirement.

Code

# Separate categorical and continuous features
categorical_features <- names(X)[sapply(X, function(col) length(unique(col)) == 2)]
continuous_features  <- names(X)[sapply(X, function(col) length(unique(col)) > 2)]

X_cat  <- X[, categorical_features, drop = FALSE]
X_cont <- X[, continuous_features, drop = FALSE]

# ---- Chi-squared test for categorical features ----
chi2_results <- lapply(names(X_cat), function(col) {
  tbl <- table(X_cat[[col]], y)
  test <- suppressWarnings(chisq.test(tbl))
  data.frame(
    Feature = col,
    Chi2_score = test$statistic,
    p_value = test$p.value
  )
})

chi2_df <- do.call(rbind, chi2_results)

# Select features with p-value < 0.009
selected_cat_features <- chi2_df %>%
  filter(p_value < 0.009) %>%
  pull(Feature)

Next we will explore the continuous predictor variables we obtained from the previous code.

Once we isolate the continuous variables into their own X_cont object we can use the lapply function with the aov function within it. Similar to above, we will create our own function where we calculate individual ANOVA F tests with the aov function. From those individual ANOVAs we use the summary function to extract the p-value and test statistic information. This approach doesn’t eliminate any variables, but instead calculates the p-value for each variable. That way we can be the ones to decide which variables we want to keep. From here we can remove the variables with a p-value above our 0.009 cut-off by isolating the specific variables that meet that requirement. We then select only those variables from our original X object to get our reduced variable list called X_reduced.

Code

# ---- ANOVA F-test for continuous features ----
anova_results <- lapply(names(X_cont), function(col) {
  model <- aov(X_cont[[col]] ~ as.factor(y))
  test <- summary(model)[[1]]
  data.frame(
    Feature = col,
    F_score = test$`F value`[1],
    p_value = test$`Pr(>F)`[1]
  )
})

anova_df <- do.call(rbind, anova_results)

# Select features with p-value < 0.009
selected_cont_features <- anova_df %>%
  filter(p_value < 0.009) %>%
  pull(Feature)

# ---- Combine selected features ----
selected_features <- c(selected_cat_features, selected_cont_features)

# Create reduced feature set
X_reduced <- X[, selected_features, drop = FALSE]

# View first few rows
head(X_reduced, n = 5)

     LotShapeIR1 LotShapeIR2 LotShapeReg LotConfigCulDSac Condition1Feedr
618            0           0           1                0               0
792            1           0           0                1               0
483            0           0           1                0               0
1012           0           0           1                0               0
108            0           0           1                0               0
     Condition1Norm BldgType2fmCon BldgTypeDuplex BldgTypeTwnhs
618               1              0              0             0
792               1              0              0             0
483               1              0              0             1
1012              1              0              0             0
108               0              0              0             0
     HouseStyle1.5Unf HouseStyle2Story HouseStyleSFoyer HouseStyleSLvl
618                 0                0                0              0
792                 0                1                0              0
483                 0                0                0              0
1012                0                1                0              0
108                 0                0                0              0
     ExterQualGd ExterQualTA FoundationCBlock FoundationPConc FoundationSlab
618            0           0                0               1              0
792            1           0                0               1              0
483            0           1                0               1              0
1012           0           1                0               1              0
108            0           1                1               0              0
     BsmtQualFa BsmtQualGd BsmtQualMissing BsmtQualTA HeatingQCGd HeatingQCTA
618           0          0               0          0           0           0
792           0          1               0          0           0           0
483           0          0               0          0           0           0
1012          0          0               0          1           0           1
108           0          0               0          1           0           1
     CentralAirY KitchenQualFa KitchenQualGd KitchenQualTA FireplaceQuGd
618            1             0             1             0             1
792            1             0             0             1             0
483            1             0             0             1             0
1012           1             0             1             0             0
108            0             1             0             0             0
     FireplaceQuMissing FireplaceQuPo FireplaceQuTA GarageTypeAttchd
618                   0             0             0                1
792                   0             0             1                1
483                   1             0             0                1
1012                  0             0             1                0
108                   1             0             0                0
     GarageTypeBuiltIn GarageTypeDetchd GarageQualFa GarageQualTA GarageCondFa
618                  0                0            0            1            0
792                  0                0            0            1            0
483                  0                0            0            1            0
1012                 0                1            0            1            0
108                  0                0            0            0            0
     GarageCondTA PavedDriveP PavedDriveY GarageYrBlt_was_missing LotFrontage
618             1           0           1                       0          90
792             1           0           1                       0          92
483             1           0           1                       0          32
1012            1           0           1                       0          55
108             0           0           0                       1          85
     LotArea OverallQual YearBuilt YearRemodAdd TotalBsmtSF FirstFlrSF
618    11694           9      2007         2007        1822       1828
792     9920           7      1996         1997        1117       1127
483     4500           6      1998         1998        1216       1216
1012   10592           6      1923         1996         602        900
108     8500           5      1919         2005         793        997
     SecondFlrSF GrLivArea FullBath HalfBath BedroomAbvGr TotRmsAbvGrd
618            0      1828        2        0            3            9
792          886      2013        2        1            3            8
483            0      1216        2        0            2            5
1012         602      1502        1        1            3            7
108          520      1517        2        0            3            7
     Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
618           1        2007          3        774          0         108
792           1        1997          2        455        180         130
483           0        1998          2        402          0         125
1012          2        1923          1        180         96           0
108           0        1981          0          0          0           0
     EnclosedPorch
618              0
792              0
483              0
1012           112
108            144

Quasi-Complete Separation

In the previous section of the code deck we talked about looking at quasi-complete separation. To make sure that our data doesn’t have variables with quasi-complete separation concerns, we can write a function to check our data and then combine those categories with the reference categories for the variable.

Let’s see how to do this in each software!

Python
R

In Python it is a simple task to write a function. We will call our function check_quasi_complete_separation with two inputs - our predictor variables and our target variable. Inside this function we build a cross-tabulation of each predictor against the target with the crosstab function from pandas and look for any cells with a count of 0 using the any function. If any cell has a count of 0, which would lead to quasi-complete separation, we print the table and record the variable.

Code

def check_quasi_complete_separation(X, y):
    """
    Checks each categorical predictor in X for quasi-complete separation with respect to binary target y.
    
    Parameters:
    - X: pd.DataFrame of predictors (categorical variables)
    - y: pd.Series of binary target variable (e.g., 0/1 or True/False)
    
    Returns:
    - List of variable names that exhibit quasi-complete separation
    """
    problematic_vars = []

    for col in X.columns:
        ct = pd.crosstab(X[col], y)

        # Check if any category (row) has a zero in any outcome class
        if (ct == 0).any(axis=1).any():
            print(f"⚠️ Quasi-complete separation detected in '{col}'")
            print(ct)
            print()
            problematic_vars.append(col)

    return problematic_vars

Let’s now put our actual data in the function. We only want to do this check on categorical predictor variables.

Code

X_cat_reduced = X_reduced[selected_cat_features.tolist()]

problem_vars = check_quasi_complete_separation(X_cat_reduced, y)

⚠️ Quasi-complete separation detected in 'HouseStyle_1.5Unf'
bonus                0    1
HouseStyle_1.5Unf          
0.0                610  472
1.0                 13    0

⚠️ Quasi-complete separation detected in 'Foundation_Slab'
bonus              0    1
Foundation_Slab          
0.0              605  472
1.0               18    0

⚠️ Quasi-complete separation detected in 'BsmtQual_Missing'
bonus               0    1
BsmtQual_Missing          
0.0               598  472
1.0                25    0

⚠️ Quasi-complete separation detected in 'FireplaceQu_Po'
bonus             0    1
FireplaceQu_Po          
0.0             611  472
1.0              12    0

Looks like we have four dummy variables with quasi-complete separation. Let’s remove these variables from our dataset before building our model. Dropping a dummy variable folds its category into the reference category for that variable.

Code

X_reduced = X_reduced.drop(problem_vars, axis = 1)
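To see why a zero cell is a problem, here is a minimal arithmetic sketch using the counts from the HouseStyle_1.5Unf cross-tabulation printed above. It only computes the sample odds by hand (no model is fit), and the variable names are illustrative.

```python
import numpy as np

# Counts copied from the HouseStyle_1.5Unf cross-tabulation above
# rows: dummy = 0, dummy = 1; columns: bonus = 0, bonus = 1
table = np.array([[610, 472],
                  [13, 0]], dtype=float)

# Silence the divide-by-zero warning that log(0) would otherwise raise
with np.errstate(divide="ignore"):
    odds = table[:, 1] / table[:, 0]                    # odds of a bonus in each row
    log_odds_ratio = np.log(odds[1]) - np.log(odds[0])  # sample log odds ratio

print(odds)
print(log_odds_ratio)
```

Because no home in the flagged category received a bonus, the sample odds for that category are zero and the log odds ratio is negative infinity. The maximum likelihood estimate for that coefficient therefore has no finite value, which is the problem that removing the variable avoids.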

In R it is a simple task to write a function. We will call our function check_quasi_complete_separation with two inputs - our predictor variables and our target variable. Inside this function we build a cross-tabulation of each predictor against the target with the table function and look for any cells with a count of 0 using the any and apply functions. If any cell has a count of 0, which would lead to quasi-complete separation, we print the table and record the variable.

Code

check_quasi_complete_separation <- function(X, y) {
  # X: data.frame of categorical predictors
  # y: binary target (factor or 0/1 vector)
  
  problematic_vars <- c()
  
  for (colname in names(X)) {
    ct <- table(X[[colname]], y)
    
    # Check if any row has a zero in any column (i.e., any category lacks one class)
    has_zero <- apply(ct, 1, function(row) any(row == 0))
    
    if (any(has_zero)) {
      cat("⚠️ Quasi-complete separation detected in '", colname, "'\n", sep = "")
      print(ct)
      cat("\n")
      problematic_vars <- c(problematic_vars, colname)
    }
  }
  
  return(problematic_vars)
}

Let’s now put our actual data in the function. We only want to do this check on categorical predictor variables.

Code

X_cat_reduced <- X_reduced[selected_cat_features]

problem_vars <- check_quasi_complete_separation(X_cat_reduced, y)

⚠️ Quasi-complete separation detected in 'HouseStyle1.5Unf'
   y
      0   1
  0 610 472
  1  13   0

⚠️ Quasi-complete separation detected in 'FoundationSlab'
   y
      0   1
  0 605 472
  1  18   0

⚠️ Quasi-complete separation detected in 'BsmtQualMissing'
   y
      0   1
  0 598 472
  1  25   0

⚠️ Quasi-complete separation detected in 'FireplaceQuPo'
   y
      0   1
  0 611 472
  1  12   0

Looks like we have four dummy variables with quasi-complete separation. Let’s remove these variables from our dataset before building our model. Dropping a dummy variable folds its category into the reference category for that variable.

Code

X_reduced <- X_reduced[ , !(names(X_reduced) %in% problem_vars)]

Stepwise Regression

Which variables should you drop from your model? This is a common question for all modeling, including logistic regression. In this section we will cover a popular variable selection technique - stepwise regression. This isn’t the only possible technique, but will be the primary focus here.

Stepwise regression techniques involve three common methods:

Forward Selection
Backward Selection
Stepwise Selection

These techniques add or remove (depending on the technique) one variable at a time from your regression model to try and “improve” the model. There are a variety of different selection criteria to use to add or remove variables from a logistic regression which will be covered in more detail in the model assessment part of the code deck.

The specific details of each of these are covered in full in the Model Building portion of this code deck. Let’s revisit what stepwise selection is doing.

Stepwise Selection

Let’s revisit the stepwise selection approach with cross-validation specifically.

In the first step of the stepwise selection process, we create models such that each model has exactly one predictor variable and calculate a model metric for each model. However, we do this across each cross-validation training set. For 10-fold cross-validation, that would be 10 separate first steps that create models with only one variable. Next, we average the model metric across all cross-validation training sets to see which is the best variable. For example, the 5th variable is now the best on average across all of the cross-validation datasets based on our model metric, not just the training set overall.

This same process is repeated across all of the steps until a final model is selected where adding any further variables does not improve the model based on the cross-validation model metrics. Remember, the difference between forward and stepwise selection is that we are also evaluating each variable added to the model at each step to see if deleting that variable improves the model metric.
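The first step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (X_demo and y_demo come from make_classification, not the Ames data), the metric is the area under the ROC curve, and LogisticRegression is left at its default settings, which include a mild L2 penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 6 candidate predictors, 2 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

model = LogisticRegression(max_iter=1000)

# First step: one single-variable model per candidate, scored on each of the
# 10 cross-validation folds and then averaged
mean_auc = {}
for j in range(X_demo.shape[1]):
    fold_scores = cross_val_score(model, X_demo[:, [j]], y_demo,
                                  cv=10, scoring="roc_auc")
    mean_auc[j] = fold_scores.mean()

# The candidate with the best average metric enters the model first
best = max(mean_auc, key=mean_auc.get)
print(mean_auc)
print("First variable added:", best)
```

Later steps repeat the same loop with the chosen variable held in the model, and stepwise selection additionally checks whether dropping any variable already in the model improves the averaged metric.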

Let’s see this in each software!

Python
R

To gain the needed functionality for cross-validation recursive feature addition or subtraction, we need to use the mlxtend package on top of scikit-learn. Specifically, we will use the SequentialFeatureSelector function from mlxtend.feature_selection. We will need to create our logistic regression object with the LogisticRegression function from scikit-learn, not the Logit function from statsmodels.api. We need the penalty = None option to perform traditional logistic regression with no additional penalties (discussed in the Regularization section of the code deck). This object will be an input to our SFS function. We also need an additional step to improve the speed of the feature selection - scaling the data. This will help improve the speed of the maximum likelihood optimization. To do this we will use the StandardScaler function from the sklearn.preprocessing module. We will take the reduced data frame of the predictors, X_reduced, and input it into the StandardScaler object with the fit_transform function. This function unfortunately doesn’t produce a data frame output, so we have to turn the result back into a data frame with the pandas DataFrame function, using the columns option to define the same columns as we had with the reduced dataset.

Now we can define our SFS function. In the k_features option we can put any integer value to set a certain number of features, the "best" option to pick the number of features that provides the best value of the model metric, or the "parsimonious" option to pick the number of features within one standard error of the true best model to try and make the model simpler. The forward = True option makes the function perform forward selection. The floating = True option means that once a variable is added it is evaluated at each step to be dropped out of the model - stepwise selection. The scoring option allows you to select your model metric. The options for a binary target variable are the accuracy, precision, recall, F1 score, area under the ROC curve (roc_auc), negative log loss, and negative Brier score. These will be covered in much further detail in the next section of the code deck on Model Assessment. The reason the last two are negative is that the function always tries to maximize the scoring metric, while smaller values of log loss and Brier score are better. The cv option allows you to pick the number of cross-validation folds. Lastly, we just use the fit function with our predictor variables and target variable to fit the model. The selected variables are then printed below.

Code

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_reduced)
X_scaled_df = pd.DataFrame(X_scaled, columns = X_reduced.columns)

logr = LogisticRegression(max_iter = 1000, solver = 'newton-cg', penalty = None) 

# Stepwise selection (forward = True with floating = True) to find best subset of features
sfs = SFS(logr,
          k_features = "best", 
          forward = True,
          floating = True,
          scoring = 'roc_auc',
          cv = 10)

sfs = sfs.fit(X_scaled_df, y)

# Get selected feature names
selected_features = list(sfs.k_feature_names_)
print("Selected features:", selected_features)

Selected features: ['LotConfig_CulDSac', 'BldgType_Duplex', 'BldgType_Twnhs', 'BsmtQual_Gd', 'HeatingQC_TA', 'KitchenQual_TA', 'GarageType_Attchd', 'GarageType_BuiltIn', 'GarageType_Detchd', 'LotArea', 'OverallQual', 'YearBuilt', 'YearRemodAdd', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'HalfBath', 'Fireplaces', 'GarageArea', 'OpenPorchSF']

Now we have variables determined by stepwise selection with cross-validation.

To gain the needed functionality for cross-validation recursive feature addition or subtraction, we need to use the caret package. Specifically, we will use the train function from the caret package along with the MASS package. The train function cannot handle our target variable and predictor variables being in separate objects. Therefore, we first use cbind to combine the target variable as a column with the predictor variables dataframe. Now we can put the new dataframe into caret with the typical formula structure we have used previously. We also need to make our target variable a factor object using the factor function for the package to fit the correct model.

The method = "glmStepAIC" option uses the MASS package and its stepwise selection capabilities. You can think of caret as more of a wrapper function that puts a cross-validation spin on other functions.

The glmStepAIC method, which performs stepwise selection, uses likelihood-based metrics to select variables. Two of the most common ones are AIC and BIC (also referred to as SBC).

The AIC, or Akaike Information Criterion, was developed by statistician Hirotugu Akaike in the 1970s and, for a logistic regression, is defined by:

\[ AIC = -2 \log(Likelihood) + 2(p + 1) \]

In this case, “Likelihood” is the likelihood of the data under the model and \(2(p+1)\) is the “penalty”, where \(p\) is the number of variables in the model. A smaller AIC indicates a better model.

BIC, or the Bayesian Information Criterion (also called SBC, the Schwarz Bayesian Criterion), was first developed by Gideon E. Schwarz, also in the 1970s, and is defined by:

\[ BIC = -2 \log(Likelihood) + (p+1)\log (n) \]

In this case, “Likelihood” is again the likelihood of the data and \((p+1)\log(n)\) is the “penalty”, where \(p\) is the number of variables in the model and \(n\) is the sample size. A smaller BIC indicates a better model. This is how variables are selected at each step in the stepwise selection.

However, the metric option allows you to select the model metric that is compared across the cross-validation training sets. The trControl = trainControl(method = "cv", number = 10) option allows you to pick the number of cross-validation folds. The metrics available for a binary target variable are controlled with the summaryFunction option in the trainControl function. With prSummary, which is used below, we can pick the F1 score, precision, recall, and area under the precision-recall curve (labeled AUC). Other summary functions provide even more possible metrics, such as the area under the ROC curve from twoClassSummary. We also have to specify the family = "binomial" option to build a logistic regression and the direction = "both" option to specify we want stepwise selection. The results element will display our final performance metric and the summary function will show a summary of the final model built.

Code

library(caret)
library(MASS)
library(MLmetrics)

set.seed(1234)

df <- cbind(y, X_reduced)

df$y <- factor(df$y, levels = c(0, 1), labels = c("no", "yes"))

step_model <- caret::train(y ~ ., data = df,
                    method = "glmStepAIC", 
                    family = "binomial",
                    direction = "both",
                    metric = "AUC",
                    trControl = trainControl(method = "cv", 
                                             number = 10,
                                             classProbs = TRUE,
                                             summaryFunction = prSummary), 
                    trace = FALSE)

step_model$results

  parameter       AUC Precision    Recall         F      AUCSD PrecisionSD
1      none 0.9443919 0.9133088 0.9181516 0.9152649 0.04018161  0.03848537
    RecallSD        FSD
1 0.03427796 0.02931454

Code

summary(step_model$finalModel)


Call:
NULL

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.829e+01  2.575e+01  -1.099 0.271965    
LotShapeIR1              3.589e+00  1.441e+00   2.490 0.012775 *  
LotShapeIR2              3.719e+00  1.567e+00   2.373 0.017642 *  
LotShapeReg              2.861e+00  1.442e+00   1.984 0.047212 *  
Condition1Norm           6.660e-01  4.088e-01   1.629 0.103320    
BldgTypeDuplex          -2.165e+00  1.172e+00  -1.848 0.064629 .  
BldgTypeTwnhs           -1.757e+00  7.910e-01  -2.221 0.026375 *  
BsmtQualGd               7.694e-01  3.401e-01   2.262 0.023674 *  
HeatingQCTA             -7.535e-01  3.573e-01  -2.109 0.034971 *  
CentralAirY              3.279e+00  1.624e+00   2.019 0.043465 *  
KitchenQualFa           -3.935e+00  2.127e+00  -1.850 0.064264 .  
KitchenQualTA           -1.001e+00  3.309e-01  -3.025 0.002489 ** 
GarageTypeAttchd         3.692e+00  1.056e+00   3.495 0.000475 ***
GarageTypeBuiltIn        2.879e+00  1.184e+00   2.431 0.015039 *  
GarageTypeDetchd         2.038e+00  1.088e+00   1.873 0.061054 .  
GarageYrBlt_was_missing  4.434e+00  2.513e+00   1.764 0.077675 .  
LotFrontage             -1.139e-02  6.904e-03  -1.649 0.099108 .  
LotArea                  8.795e-05  3.013e-05   2.919 0.003507 ** 
OverallQual              8.097e-01  1.842e-01   4.394 1.11e-05 ***
YearRemodAdd             1.979e-02  1.143e-02   1.731 0.083365 .  
TotalBsmtSF              2.040e-03  7.857e-04   2.596 0.009423 ** 
FirstFlrSF              -6.783e-03  3.278e-03  -2.069 0.038516 *  
SecondFlrSF             -6.037e-03  3.180e-03  -1.898 0.057635 .  
GrLivArea                8.314e-03  3.170e-03   2.623 0.008713 ** 
FullBath                 9.963e-01  3.674e-01   2.712 0.006694 ** 
HalfBath                 6.481e-01  3.656e-01   1.773 0.076267 .  
Fireplaces               5.496e-01  2.586e-01   2.126 0.033533 *  
GarageYrBlt             -1.835e-02  1.007e-02  -1.822 0.068529 .  
GarageCars               7.120e-01  5.016e-01   1.420 0.155744    
GarageArea               3.164e-03  1.580e-03   2.002 0.045236 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1497.1  on 1094  degrees of freedom
Residual deviance:  419.7  on 1065  degrees of freedom
AIC: 479.7

Number of Fisher Scoring iterations: 8

Now we have variables determined by stepwise selection with cross-validation.

Backward Selection

Let’s also revisit the backward selection approach with cross-validation.

In the first step of the backward selection process, we create models such that each model has exactly one predictor variable removed from it and calculate a model metric for each model. However, we do this across each cross-validation training set. For 10-fold cross-validation, that would be 10 separate first steps that create models removing only one variable. Next, we average the model metric across all training sets to see which variable is the worst. For example, the second variable might now be the worst on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated across all of the steps until a final model is selected where removing any further variables does not improve the model based on the cross-validation model metrics.
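The first step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (from make_classification, not the Ames data), the names X_demo, y_demo, cv_auc, and backward_step are made up for this example, and scikit-learn's default LogisticRegression (which applies a mild L2 penalty) stands in for the model used elsewhere in these notes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data (assumption: 6 predictors, 3 informative)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, n_informative=3,
                                     n_redundant=0, shuffle=False, random_state=0)

def cv_auc(cols):
    """Mean 10-fold cross-validated AUC for a model using only the columns in cols."""
    model = LogisticRegression(max_iter=1000)  # default L2 penalty, an assumption
    return cross_val_score(model, X_demo[:, cols], y_demo,
                           cv=10, scoring="roc_auc").mean()

def backward_step(current):
    """Drop each variable in turn and keep the reduced set with the best average metric."""
    scores = {j: cv_auc([c for c in current if c != j]) for j in current}
    drop = max(scores, key=scores.get)  # removing this variable hurts the least
    return [c for c in current if c != drop], scores[drop], scores

current = list(range(X_demo.shape[1]))
full_score = cv_auc(current)
reduced, reduced_score, all_scores = backward_step(current)

print("Full model AUC:", round(full_score, 4))
print("Columns kept after one backward step:", reduced)
print("AUC after one backward step:", round(reduced_score, 4))
```

Repeating backward_step until the averaged metric stops improving gives the full procedure; the SFS function used below handles that loop for us.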

Let’s see this in each software!

Python
R

Now we can define our SFS function. In the k_features option we can put any integer value to set a certain number of features, the "best" option to pick the number of features that provides the best value of the model metric, or the "parsimonious" option to pick the smallest number of features whose metric is within one standard error of the best model to try and make the model simpler. The forward = False and floating = False options make the function perform backward selection. The scoring option allows you to select your model metric. The options for a binary target variable are the accuracy, precision, recall, F1 score, area under the ROC curve (roc_auc), negative log loss, and negative Brier score. These will be covered in much further detail in the next section of the code deck on Model Assessment. The last two are negative because the function always tries to maximize the scoring metric, and for log loss and Brier score smaller values are better. The cv option allows you to pick the number of cross-validation folds. Lastly, we just use the fit function with our predictor variables and target variable to fit the model. The variables are then printed below.

Code

from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_reduced)
X_scaled_df = pd.DataFrame(X_scaled, columns = X_reduced.columns)

logr = LogisticRegression(max_iter = 1000, solver = 'newton-cg', penalty = None) 

# Backward selection to find best subset of features
sfs = SFS(logr,
          k_features = "best", 
          forward = False,
          floating = False,
          scoring = 'roc_auc',
          cv = 10)

sfs = sfs.fit(X_scaled_df, y)

# Get selected feature names
selected_features = list(sfs.k_feature_names_)
print("Selected features:", selected_features)

Selected features: ['LotShape_Reg', 'LotConfig_CulDSac', 'BsmtQual_Gd', 'HeatingQC_TA', 'KitchenQual_TA', 'FireplaceQu_Missing', 'FireplaceQu_TA', 'GarageType_Attchd', 'GarageType_BuiltIn', 'GarageType_Detchd', 'LotArea', 'OverallQual', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'FullBath', 'HalfBath', 'GarageArea']

Now we have variables determined by backward selection using cross-validation. Let’s see this final model using its variables in the traditional Logit function from statsmodels.

Code

import statsmodels.api as sm

X_selected = X_reduced[selected_features]

# Build Logistic Regression
model = sm.Logit(y, X_selected)
result = model.fit()

Optimization terminated successfully.
         Current function value: 0.265205
         Iterations 8

Code

print(result.summary())

                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  bonus   No. Observations:                 1095
Model:                          Logit   Df Residuals:                     1076
Method:                           MLE   Df Model:                           18
Date:                Sat, 15 Nov 2025   Pseudo R-squ.:                  0.6121
Time:                        11:03:39   Log-Likelihood:                -290.40
converged:                       True   LL-Null:                       -748.55
Covariance Type:            nonrobust   LLR p-value:                5.213e-183
=======================================================================================
                          coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------
LotShape_Reg           -1.2212      0.237     -5.144      0.000      -1.686      -0.756
LotConfig_CulDSac       0.3179      0.468      0.679      0.497      -0.600       1.236
BsmtQual_Gd             0.7811      0.255      3.066      0.002       0.282       1.280
HeatingQC_TA           -1.1433      0.278     -4.115      0.000      -1.688      -0.599
KitchenQual_TA         -2.1934      0.241     -9.115      0.000      -2.665      -1.722
FireplaceQu_Missing    -1.9489      0.271     -7.202      0.000      -2.479      -1.419
FireplaceQu_TA         -0.4320      0.300     -1.439      0.150      -1.020       0.156
GarageType_Attchd      -0.7679      0.446     -1.722      0.085      -1.642       0.106
GarageType_BuiltIn     -1.7048      0.690     -2.471      0.013      -3.057      -0.353
GarageType_Detchd      -2.9137      0.483     -6.029      0.000      -3.861      -1.966
LotArea              2.955e-05   2.32e-05      1.274      0.203   -1.59e-05     7.5e-05
OverallQual            -0.1825      0.109     -1.670      0.095      -0.397       0.032
TotalBsmtSF             0.0016      0.001      2.837      0.005       0.001       0.003
1stFlrSF               -0.0036      0.002     -1.463      0.144      -0.008       0.001
2ndFlrSF               -0.0015      0.002     -0.639      0.523      -0.006       0.003
GrLivArea               0.0030      0.002      1.267      0.205      -0.002       0.008
FullBath                0.8250      0.288      2.867      0.004       0.261       1.389
HalfBath                0.3305      0.277      1.194      0.233      -0.212       0.873
GarageArea              0.0028      0.001      3.730      0.000       0.001       0.004
=======================================================================================

Most of the variables are similar to the ones we had from stepwise selection with some slight variations.

The method = "glmStepAIC" option uses the MASS package and its backward selection capabilities. You can think of caret as more of a wrapper function that puts a cross-validation spin on other functions. The glmStepAIC method, which performs recursive feature elimination such as backward selection, uses likelihood-based metrics to select variables just like we discussed with the stepwise selection above.

However, the metric option allows you to select the model metric that is compared across cross-validation training sets to pick the best model. The trControl = trainControl(method = "cv", number = 10) option allows you to pick the number of cross-validation folds. The metrics available for a binary target variable are controlled with the summaryFunction option in the trainControl function. With summaryFunction = prSummary we can pick the F1 score, precision, recall, or the area under the precision-recall curve (which caret labels AUC). Using other options in the summaryFunction, such as twoClassSummary for the area under the ROC curve, provides even more possible metrics. We also have to specify the family = "binomial" option to build a logistic regression and the direction = "backward" option to specify we want backward selection. The results object will display our final performance metric and the summary function will show a summary of the final model built.

Code

library(caret)
library(MASS)

set.seed(1234)

df <- cbind(y, X_reduced)

df$y <- factor(df$y, levels = c(0, 1), labels = c("no", "yes"))

back_model <- caret::train(y ~ ., data = df,
                    method = "glmStepAIC", 
                    family = "binomial",
                    direction = "backward",
                    metric = "AUC",
                    trControl = trainControl(method = "cv", 
                                             number = 10,
                                             classProbs = TRUE,
                                             summaryFunction = prSummary), 
                    trace = FALSE)

back_model$results

  parameter       AUC Precision    Recall         F     AUCSD PrecisionSD
1      none 0.9444182 0.9133088 0.9181516 0.9152649 0.0401954  0.03848537
    RecallSD        FSD
1 0.03427796 0.02931454

Code

summary(back_model$finalModel)


Call:
NULL

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.829e+01  2.575e+01  -1.099 0.271965    
LotShapeIR1              3.589e+00  1.441e+00   2.490 0.012775 *  
LotShapeIR2              3.719e+00  1.567e+00   2.373 0.017642 *  
LotShapeReg              2.861e+00  1.442e+00   1.984 0.047212 *  
Condition1Norm           6.660e-01  4.088e-01   1.629 0.103320    
BldgTypeDuplex          -2.165e+00  1.172e+00  -1.848 0.064629 .  
BldgTypeTwnhs           -1.757e+00  7.910e-01  -2.221 0.026375 *  
BsmtQualGd               7.694e-01  3.401e-01   2.262 0.023674 *  
HeatingQCTA             -7.535e-01  3.573e-01  -2.109 0.034971 *  
CentralAirY              3.279e+00  1.624e+00   2.019 0.043465 *  
KitchenQualFa           -3.935e+00  2.127e+00  -1.850 0.064264 .  
KitchenQualTA           -1.001e+00  3.309e-01  -3.025 0.002489 ** 
GarageTypeAttchd         3.692e+00  1.056e+00   3.495 0.000475 ***
GarageTypeBuiltIn        2.879e+00  1.184e+00   2.431 0.015039 *  
GarageTypeDetchd         2.038e+00  1.088e+00   1.873 0.061054 .  
GarageYrBlt_was_missing  4.434e+00  2.513e+00   1.764 0.077675 .  
LotFrontage             -1.139e-02  6.904e-03  -1.649 0.099108 .  
LotArea                  8.795e-05  3.013e-05   2.919 0.003507 ** 
OverallQual              8.097e-01  1.842e-01   4.394 1.11e-05 ***
YearRemodAdd             1.979e-02  1.143e-02   1.731 0.083365 .  
TotalBsmtSF              2.040e-03  7.857e-04   2.596 0.009423 ** 
FirstFlrSF              -6.783e-03  3.278e-03  -2.069 0.038516 *  
SecondFlrSF             -6.037e-03  3.180e-03  -1.898 0.057635 .  
GrLivArea                8.314e-03  3.170e-03   2.623 0.008713 ** 
FullBath                 9.963e-01  3.674e-01   2.712 0.006694 ** 
HalfBath                 6.481e-01  3.656e-01   1.773 0.076267 .  
Fireplaces               5.496e-01  2.586e-01   2.126 0.033533 *  
GarageYrBlt             -1.835e-02  1.007e-02  -1.822 0.068529 .  
GarageCars               7.120e-01  5.016e-01   1.420 0.155744    
GarageArea               3.164e-03  1.580e-03   2.002 0.045236 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1497.1  on 1094  degrees of freedom
Residual deviance:  419.7  on 1065  degrees of freedom
AIC: 479.7

Number of Fisher Scoring iterations: 8

Now we have variables determined by backward selection using cross-validation. Most of the variables are similar to the ones we had from stepwise selection with some slight variations.

Forward Selection with Interactions

Here we will work through forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see how much it improves a model based on a model metric. The most important variable (based on that metric) is then added to the model. Then the remaining variables are again tested and the next most impactful variable (based on the same model metric) is added to the model. This process repeats until no remaining variable improves the model based on the metric or no more variables are available to add. This approach is the same as stepwise selection without the additional check at each step for possible removal.

Forward selection is the least used technique because stepwise selection does the same as forward selection with the added benefit of dropping insignificant variables. The main use for forward selection is to test higher order terms and interactions in models. You can start with your model determined from either stepwise or backward selection and then try to add only interactions between variables from there.

The code for this is not shown here, but it is a simple change to the code above to perform this approach.
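As a rough illustration of the idea, the sketch below keeps a set of main effects fixed and uses forward selection to add only pairwise interactions. Everything here is an assumption made for the example: the data is simulated (with a true x0*x1 interaction built in), the helper cv_auc is hypothetical, and scikit-learn's default LogisticRegression with cross-validated AUC stands in for whichever model and metric you prefer.

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 600
X_main = rng.normal(size=(n, 3))  # hypothetical main effects x0, x1, x2

# Simulated truth (made up for illustration): includes an x0*x1 interaction
logit = 0.8 * X_main[:, 0] - 0.5 * X_main[:, 2] + 1.5 * X_main[:, 0] * X_main[:, 1]
y_sim = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def cv_auc(X):
    """Mean 10-fold cross-validated AUC for a logistic regression on X."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y_sim, cv=10, scoring="roc_auc").mean()

# Candidate terms: every pairwise product of the main effects
candidates = {pair: X_main[:, pair[0]] * X_main[:, pair[1]]
              for pair in combinations(range(3), 2)}

selected = []                 # interactions added so far
X_current = X_main.copy()     # main effects always stay in the model
best_score = cv_auc(X_current)

while candidates:
    # Score the model with each remaining interaction added on its own
    trial = {pair: cv_auc(np.column_stack([X_current, col]))
             for pair, col in candidates.items()}
    best_pair = max(trial, key=trial.get)
    if trial[best_pair] <= best_score:
        break                 # no interaction improves the metric, so stop
    best_score = trial[best_pair]
    X_current = np.column_stack([X_current, candidates.pop(best_pair)])
    selected.append(best_pair)

print("Interactions added (column index pairs):", selected)
print("Cross-validated AUC:", round(best_score, 4))
```

Because the main effects never leave the model, this only answers the question of whether any interaction is worth adding on top of the variables already chosen.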

Calibration Curves

Another evaluation/diagnostic for logistic regression is the calibration curve. The calibration curve is a goodness-of-fit measure for logistic regression. Calibration measures how well predicted probabilities agree with actual frequency counts of outcomes (estimated linearly across the data set). These curves can help detect if predictions are consistently too high or low in your model.

If the curve is above the diagonal line, this indicates the model is predicting lower probabilities than actually observed. The opposite is true if the curve is below the diagonal line.

This is best used on larger samples since we are calculating the observed proportion of events in the data. In smaller samples this relationship is extrapolated out from the center and may not as accurately reflect the truth.
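To make the "above the diagonal" reading concrete, here is a small simulated sketch. The data and names (true_prob, pred_low) are invented for illustration: we simulate outcomes from known probabilities and then score them with a deliberately under-predicting "model", grouping observations into 10 equal-sized bins by predicted probability.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
true_prob = rng.uniform(0.05, 0.95, size=n)   # simulated true event probabilities
y_sim = rng.binomial(1, true_prob)            # simulated 0/1 outcomes
pred_low = 0.7 * true_prob                    # predictions that are always too low

# Sort by predicted probability and split into 10 equal-sized groups
order = np.argsort(pred_low)
bins = np.array_split(order, 10)

mean_pred = np.array([pred_low[b].mean() for b in bins])  # x-axis of the curve
obs_frac = np.array([y_sim[b].mean() for b in bins])      # y-axis of the curve

for p, o in zip(mean_pred, obs_frac):
    print(f"mean predicted = {p:.3f}   observed fraction = {o:.3f}")
```

In every bin the observed fraction is larger than the mean predicted probability, so the plotted points would sit above the diagonal, which is exactly the signature of a model that predicts probabilities that are too low.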

Let’s look at creating these in each software!

Python
R

To build a calibration curve in Python we can use the calibration_curve function inside of the sklearn.calibration module. The inputs to this function are the target variable, bonus, and our predicted probabilities from our data. To get these predicted values we can use the predict function on our model object called result. The input for this predict function is the dataset we want to score. We will calculate the predicted probabilities of our training dataset. The model we are using comes from the backward selection from above. The other options in the calibration_curve function are the number of bins used to compute the goodness-of-fit (here we use 10) and the strategy for this calculation. We will use the strategy = 'quantile' option. This option breaks our data into 10 groups with equal numbers of observations instead of just splitting the probabilities into groups of 0-0.1, 0.1-0.2, etc.

The calibration_curve function produces two outputs - the fraction of positives in each bin (the true probability) and the average predicted probability of each bin. These should be close to each other for the model to be well calibrated. The rest of the code is just used to plot this calibration curve.

Code

from sklearn.calibration import calibration_curve

train['pred_prob'] = result.predict(X_selected)

# Compute calibration curve
prob_true, prob_pred = calibration_curve(train['bonus'], train['pred_prob'], 
                                         n_bins = 10, strategy = 'quantile')

plt.figure(figsize = (6, 6))
plt.plot(prob_pred, prob_true, marker = 'o', label = 'Calibration curve')
plt.plot([0, 1], [0, 1], linestyle = '--', color = 'gray', label = 'Perfectly calibrated')
plt.xlabel('Predicted probability')
plt.ylabel('Observed frequency')
plt.title('Calibration Curve')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

Since the calibration curve line is consistently along the straight line, we do not notice any significant number of over or under predictions.

R produces a calibration curve using the givitiCalibrationBelt function from the givitiR package. The inputs to this function are o = and e =. These are the observed target and expected (predicted) target respectively. To get these predicted values we can use the predict function on our model object called back_model. We will calculate the predicted probabilities of our training dataset. The model we are using comes from the backward selection from above. We place our actual target variable in the o = option and the predictions from our logistic regression model in the e = option. Since the model is being compared to training data the devel = "internal" option is specified. The maxDeg = option sets the maximum degree being tested for the curve.

Code

library(givitiR)

cali.curve <- givitiCalibrationBelt(o = train$bonus,
                                    e = predict(back_model, type = "prob")[, "yes"],
                                    devel = "internal",
                                    maxDeg = 5)
plot(cali.curve, main = "Bonus Eligibility Model Calibration Curve",
                 xlab = "Predicted Probability",
                 ylab = "Observed Bonus Eligibility")

$m
[1] 2

$p.value
[1] 5.706546e-13

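When the diagonal line is contained in the confidence belt for our calibration curve, we do not notice any significant number of over or under predictions. The p.value printed above tests the hypothesis of perfect calibration. A value this small indicates that the belt departs from the diagonal somewhere along its range, so the plot should be checked to see where the model over or under predicts.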

Diagnostic Plots

Linear regression models contain residuals with properties that are very useful for model diagnostics. However, what is a residual in a logistic regression model? Since we don’t have actual probabilities to compare our predicted probabilities against, residuals are not as clearly defined. Instead we have pseudo “residuals” in logistic regression that we can explore further. Two examples of this are deviance residuals and Pearson residuals.

Deviance is a measure of how far a fitted model is from the fully saturated model. The fully saturated model is a model that predicts our data perfectly by essentially overfitting to it - a variable for each unique combination of inputs. This makes this model impractical for use, but good for comparison. The deviance is essentially our “error” from this “perfect” model. Logistic regression minimizes the deviance, which is the sum of the squared deviance residuals. Deviance residuals tell us how much each observation contributes to the deviance.

Pearson residuals on the other hand tell us how much each observation changes the Pearson Chi-squared test for the overall model.
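Both kinds of residuals are simple to compute by hand once we have fitted probabilities. The sketch below uses the standard formulas for a binary target on a handful of made-up outcomes and fitted probabilities (y_obs and p_hat are hypothetical values, not output from the models above).

```python
import numpy as np

y_obs = np.array([1, 0, 1, 1, 0, 0])               # hypothetical observed outcomes
p_hat = np.array([0.9, 0.2, 0.6, 0.3, 0.1, 0.7])   # hypothetical fitted probabilities

# Pearson residual: raw residual scaled by the binomial standard deviation
pearson = (y_obs - p_hat) / np.sqrt(p_hat * (1 - p_hat))

# Deviance residual: signed square root of each observation's share of the deviance
loglik_i = y_obs * np.log(p_hat) + (1 - y_obs) * np.log(1 - p_hat)
deviance_resid = np.sign(y_obs - p_hat) * np.sqrt(-2 * loglik_i)

deviance = np.sum(deviance_resid ** 2)   # total deviance of the model
pearson_chi2 = np.sum(pearson ** 2)      # Pearson Chi-squared statistic

print("Pearson residuals: ", np.round(pearson, 3))
print("Deviance residuals:", np.round(deviance_resid, 3))
print("Deviance:", round(deviance, 4), " Pearson Chi-squared:", round(pearson_chi2, 4))
```

For a 0/1 target the saturated model has a log-likelihood of zero, so the squared deviance residuals add up to -2 times the model log-likelihood. The fourth observation (an event with a predicted probability of only 0.3) and the sixth (a non-event predicted at 0.7) produce the largest residuals of either type.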

Other forms of measuring an observation’s influence on the logistic regression model are DFBetas and Cook’s D. Similar to their interpretation in linear regression, these two calculations tell us how each observation changes the estimation of each parameter individually (DFBeta) or how each observation changes the estimation of all the parameters holistically (Cook’s D).
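The idea behind DFBeta can be shown by brute force: refit the model without each observation and record how far each coefficient moves. The sketch below is an illustration only, using simulated data with one planted unusual point and a small hypothetical Newton-Raphson fitter (fit_logit). It reports raw coefficient changes rather than the standardized values that software usually prints.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        beta = beta + np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y_sim = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))

# Plant one unusual observation: a very large x value paired with a non-event
x = np.append(x, 6.0)
y_sim = np.append(y_sim, 0)
X_design = np.column_stack([np.ones(n + 1), x])

beta_full = fit_logit(X_design, y_sim)

# Leave each observation out in turn and record the change in every coefficient
changes = np.array([
    beta_full - fit_logit(np.delete(X_design, i, axis=0), np.delete(y_sim, i))
    for i in range(n + 1)
])

most_influential = int(np.argmax(np.abs(changes[:, 1])))
print("Full-data coefficients:", np.round(beta_full, 3))
print("Observation that moves the slope the most:", most_influential)
print("Its change in (intercept, slope):", np.round(changes[most_influential], 3))
```

The planted observation (index 200) moves the slope far more than any other point, and the change is negative because it drags the slope downward. Cook’s D summarizes the same kind of leave-one-out change across all coefficients in a single number per observation.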

Let’s see how to get all of these from each software!

Python
R

Python has some wonderful diagnostic plots to show us these residuals. The statsmodels results object also produces a list of these measures of influence, as well as many more, with the get_influence function. Below only the first 5 observations are shown using the head function, but this is calculated for each of the observations.

Code

influence = result.get_influence()
influence.summary_frame().head()

      dfb_LotShape_Reg  dfb_LotConfig_CulDSac  ...  hat_diag  dffits_internal
618           0.001394              -0.000536  ...  0.011392         0.033168
792          -0.004675               0.034775  ...  0.022432         0.046861
483          -0.001798              -0.000019  ...  0.005935        -0.016490
1012         -0.002585               0.000627  ...  0.008387        -0.015657
108          -0.011089              -0.001819  ...  0.031001        -0.069958

[5 rows x 23 columns]

The output above has all of the needed information. We can get both Cook’s D and DFBetas from the above summary as well as many other metrics. The following code takes the DFBeta values and the Cook’s D values and plots them. Only one of the DFBetas plots is actually shown.

Code

dfbetas = influence.dfbetas

n_obs, n_params = dfbetas.shape
param_names = result.params.index

# Rule-of-thumb threshold
threshold = 2 / np.sqrt(n_obs)

for i in range(n_params):
    plt.figure(figsize=(8, 4))
    plt.stem(dfbetas[:, i])
    plt.axhline(threshold, color='red', linestyle='--')
    plt.axhline(-threshold, color='red', linestyle='--')
    plt.title(f'DFBETAs for {param_names[i]}')
    plt.xlabel('Observation index')
    plt.ylabel('DFBETA')
    plt.grid(True)
    plt.show()

Code

cooks_d = influence.cooks_distance[0]

n = len(cooks_d)
threshold = 4 / n

plt.figure(figsize=(10, 4))
plt.stem(cooks_d)
plt.axhline(y=threshold, color='red', linestyle='--', label=f'Threshold (4/n = {threshold:.4f})')
plt.title("Cook's Distance for Each Observation")
plt.xlabel('Observation Index')
plt.ylabel("Cook's Distance")
plt.legend()
plt.grid(True)
plt.show()

R has some wonderful diagnostic plots to show us these residuals. R also produces a list of these measures of influence as well as many more with the influence.measures function. Below only the first 6 observations are shown using the head function, but this is calculated for each of the observations. The 4th plot in the plot function on the logistic regression model object is the Cook’s D plot as shown below. The dfbetasPlots function from the car package produces the DFBetas plots, but only one is shown here.

Code

# Names of the terms kept in the final stepwise model (drop the intercept)
final_vars <- names(coef(step_model$finalModel))
final_vars <- setdiff(final_vars, "(Intercept)")

final_formula <- as.formula(
  paste("y_numeric ~", paste(final_vars, collapse = " + "))
)

# df$y is the "no"/"yes" factor created earlier; convert it back to 0/1
y_numeric <- ifelse(df$y == "yes", 1, 0)

df_numeric <- cbind(y_numeric, X_reduced)
glm_model <- glm(final_formula, data = df_numeric, family = binomial)

Code

library(car)

head(influence.measures(glm_model)$infmat)

           dfb.1_     dfb.LSIR1     dfb.LSIR2      dfb.LtSR     dfb.Cn1N
618   0.015724964 -0.0035293571 -0.0035244529 -0.0082365314 -0.008260630
792   0.040170979 -0.0113591507 -0.0041924780 -0.0063413362 -0.011182814
483   0.031470229  0.0008038286  0.0008136612  0.0003369815  0.003654716
1012  0.004464856  0.0065276831  0.0048173579  0.0032384375 -0.017327523
108   0.091486553 -0.0132434158 -0.0104033147 -0.0157872554  0.071736233
268  -0.041082470 -0.0082654343 -0.0122342112 -0.0093866515 -0.022605192
         dfb.BlTD     dfb.BlTT    dfb.BsQG     dfb.HQCT     dfb.CnAY
618   0.003823034 -0.003017949  0.04779644 -0.004407631  0.005767980
792   0.002688373 -0.002598157 -0.02137890  0.021728862  0.004880911
483   0.008578889 -0.181905121  0.06304639  0.031124117  0.011809885
1012 -0.009657070  0.017710698  0.01553803 -0.068986945 -0.006855639
108   0.047015474  0.005611461  0.02174849 -0.070640996  0.066340431
268  -0.024805079  0.002260695 -0.01928329  0.062132678 -0.008673654
         dfb.KtQF     dfb.KQTA     dfb.GrTA      dfb.GTBI     dfb.GrTD
618  -0.003576444 -0.003950049  0.001938647  0.0049119547  0.001150408
792  -0.027041142 -0.070965854 -0.001169930  0.0165020766 -0.003246826
483  -0.030719760 -0.060294625 -0.009663235 -0.0002539929  0.007771911
1012  0.030170652  0.066506558  0.009359262  0.0087434411 -0.013082491
108  -0.140376763 -0.033918325  0.005308475  0.0093510059  0.002785853
268  -0.001694621 -0.015870328 -0.018982829 -0.0210811625 -0.028307844
         dfb.GYB_      dfb.LtFr     dfb.LtAr     dfb.OvrQ     dfb.YrRA
618  -0.006510558 -0.0126217021 -0.002617192 -0.044131369 -0.003863425
792   0.011971869 -0.0350505111  0.008844833 -0.005096166 -0.021900220
483   0.018552932  0.0050227695 -0.004100301  0.009333191 -0.020878212
1012 -0.001247621  0.0059726934  0.002783668 -0.009284811 -0.056616772
108  -0.033949617 -0.0228811480 -0.007740826 -0.003475239 -0.104206184
268  -0.023245236 -0.0007353801  0.017506551 -0.014745335  0.025129963
         dfb.TBSF     dfb.FFSF      dfb.SFSF     dfb.GrLA      dfb.FllB
618  -0.004482460 -0.001587842  0.0002449179  0.003336817 -0.0034190999
792  -0.007865567 -0.003098593 -0.0085425422  0.003889688 -0.0002456674
483  -0.016492161  0.005206902  0.0066358069 -0.005289135 -0.0278649502
1012  0.018072458 -0.007606967 -0.0044642080  0.008859637  0.0051801915
108  -0.005743172 -0.038980119 -0.0408581252  0.042772928 -0.0354789010
268   0.014786951  0.009660006  0.0066213975  0.003860661 -0.0324167864
         dfb.HlfB      dfb.Frpl     dfb.GrYB     dfb.GrgC    dfb.GrgA
618   0.013610510 -2.044599e-03 -0.013052806 -0.031194736  0.02158494
792  -0.006554767 -9.985445e-03 -0.026067783 -0.005122175  0.02380698
483   0.015351624  1.651893e-02 -0.018190304 -0.038898639  0.03822403
1012 -0.037566352 -7.824507e-02  0.051609194  0.004954137  0.01902323
108   0.005989452  9.488752e-05 -0.008987809  0.003946025  0.01144701
268  -0.008693574 -6.590942e-02  0.028594980  0.170093752 -0.21027297
          dffit    cov.r       cook.d        hat
618  -0.1099082 1.012711 1.200916e-15 0.01212100
792  -0.1147833 1.013779 1.309813e-15 0.01319148
483  -0.2254170 1.049145 5.051394e-15 0.04740717
1012 -0.1812011 1.032562 3.264121e-15 0.03165517
108  -0.2724070 1.069957 7.376759e-15 0.06648537
268  -0.2417695 1.056033 5.810829e-15 0.05380444

Code

plot(glm_model, 4)

Code

dfbetasPlots(glm_model, terms = "GrLivArea", id.n = 5,
             col = ifelse(glm_model$y == 1, "red", "blue"))
