One of the common concerns and questions in any model development is determining how “good” a model is or how well it performs. There are many different factors that determine this and most of them depend on the goals for the model. There are typically two different purposes for modeling - estimation and prediction. Estimation quantifies the expected change in our target variable associated with some relationship or change in the predictor variables. Prediction on the other hand is focused on predicting new target observations. However, these goals are rarely seen in isolation as most people desire a blend of these goals for their models. This section will cover many of the popular metrics for model assessment.
The first thing to remember about model assessment is that a model is only “good” in context with another model. All of these model metrics are truly model comparisons. Is an accuracy of 80% good? It depends! If the previous model has an accuracy of 90%, then no, the new model is not good. However, if the previous model has an accuracy of 70%, then yes, the new model is good. Although we will show many of the calculations, at no place will we say that you must meet a certain threshold for your models to be considered “good” because these metrics are designed for comparison.
Some common model metrics are based on likelihood calculations. Likelihood calculations are limited to more statistically based models, as more complicated machine learning models don’t have likelihood representations. Three common logistic regression metrics based on likelihood are the following:
AIC
BIC
Generalized (McFadden) \(R^2\)
Without going into too much mathematical detail, the AIC is a crude, large sample approximation of leave-one-out cross validation. The BIC on the other hand favors a smaller model than the AIC as it penalizes model complexity more. In both AIC and BIC, lower values are better. However, there is no set amount by which one model’s value must be lower for that model to be considered meaningfully better. Neither the AIC nor the BIC is necessarily better than the other; however, they may not always agree on the “best” model.
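For reference, with \(k\) estimated parameters, \(n\) observations, and maximized log-likelihood \(L\), the two criteria are calculated as:

$$ AIC = -2\ln(L) + 2k $$

$$ BIC = -2\ln(L) + k\ln(n) $$

Since \(\ln(n) > 2\) once \(n\) is larger than about 7, the BIC’s penalty on each additional parameter is larger, which is why it tends to favor smaller models.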
There are a number of “pseudo”-\(R^2\) metrics for logistic regression. Here, higher values indicate a better model. The Generalized (McFadden) \(R^2\) is a metric to measure how much better a model (in terms of likelihood) is compared to the intercept only model. Therefore, we compare two models with this to see which one is “more better” than the intercept compared to the other. Essentially, they are both compared to the same baseline so whichever beats that baseline by more is the model we want. Even though it is bounded between 0 and 1, it does not have the proportion of variance explained interpretation that \(R^2\) has in linear regression.
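Concretely, McFadden’s \(R^2\) compares the log-likelihood of the fitted model to that of the intercept-only model:

$$ R^2_{McF} = 1 - \frac{\ln(L_{model})}{\ln(L_{intercept})} $$

A value of 0 means the model is no better than the intercept-only model, while values closer to 1 mean the model’s likelihood is much higher than that baseline.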
We will be going back to the Ames, Iowa housing data set to see how well our model performed after the subset selection in the previous section. We will use the results from the backward selection model.
Let’s see how to calculate the McFadden \(R^2\) in each software!
Python provides a lot of these metrics by default in the output of the summary function on our Logit model objects. However, we can call them separately through the aic, bic_llf, and prsquared attributes as well if needed. Let’s examine the \(R^2\) output from our Logit model summary function output.
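Code
import statsmodels.api as sm

# Build Logistic Regression
X_selected = sm.add_constant(X_selected)
model = sm.Logit(y, X_selected)
result = model.fit()
print(result.summary())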
From the output above we can see the McFadden \(R^2\) value as 0.7007. This just shows that our model is noticeably better than the intercept model. However, if we were to build another logistic regression model on the same data, we could compare them to see which had the higher value of McFadden’s \(R^2\).
R provides some of these metrics by default in the output of the summary function on our logistic regression objects. However, we can call them separately through the AIC, BIC, and pR2 functions as well. Note the pR2 function comes from the pscl package in R and provides more than just the McFadden \(R^2\) value.
Code
names(X_selected)[names(X_selected) == "1stFlrSF"] <- "FirstFlrSF"
names(X_selected)[names(X_selected) == "2ndFlrSF"] <- "SecondFlrSF"

logit_model <- glm(y ~ ., data = X_selected,
                   family = binomial(link = "logit"),
                   control = glm.control(maxit = 100, epsilon = 1e-8))
summary(logit_model)

library(pscl)
pR2(logit_model)
From the output above we can see the McFadden \(R^2\) value as 0.7007. This just shows that our model is noticeably better than the intercept model. However, if we were to build another logistic regression model on the same data, we could compare them to see which had the higher value of McFadden’s \(R^2\).
Probability Metrics
Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.
In this section we will focus on the probability metrics. Since we are predicting the probability of an event, we want our model to assign higher predicted probabilities to events and lower predicted probabilities to non-events.
Rank-Order Statistics
Rank-order statistics measure how well a model orders the predicted probabilities. Three common quantities that summarize this ordering are concordance, discordance, and ties. In these metrics, every possible combination of an event and a non-event is compared (1 event vs. 1 non-event; a 1 vs. a 0). A concordant pair is a pair where the event has a higher predicted probability than the non-event - the model got the rank correct. A discordant pair is a pair where the event has a lower predicted probability than the non-event - the model got the rank wrong. A tied pair is a pair where the event and non-event have the same predicted probability - the model isn’t sure how to distinguish between them. Models with higher concordance are considered better. The interpretation of concordance is that, across all possible event and non-event combinations, the model assigned the higher predicted probability to the observation with the event concordance% of the time.
There are a host of other metrics that are based on these rank-order statistics such as the \(c\)-statistic, Somer’s D, and Kendall’s \(\tau_\alpha\). The calculations for these are as follows:

$$ c = Concordance + \frac{1}{2}\times Tied $$

$$ D_{xy} = 2c - 1 $$

$$ \tau_\alpha = \frac{Concordant - Discordant}{0.5 \times n \times (n-1)} $$

With all of these, higher values of concordant pairs result in higher values of these metrics.
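To make the pairwise definition concrete, a minimal sketch of these calculations is shown below. It assumes a NumPy array of 0/1 targets y and predicted probabilities p_hat, and it is only an illustration of the definitions, not how the functions used later compute them internally.

Code
import numpy as np

def rank_order_stats(y, p_hat):
    # Every event is compared against every non-event
    p_event = p_hat[y == 1]
    p_nonevent = p_hat[y == 0]
    diff = p_event[:, None] - p_nonevent[None, :]

    n_pairs = diff.size
    concordant = np.sum(diff > 0) / n_pairs   # event ranked above non-event
    discordant = np.sum(diff < 0) / n_pairs   # event ranked below non-event
    tied = np.sum(diff == 0) / n_pairs        # same predicted probability

    c_stat = concordant + 0.5 * tied          # c-statistic (AUC)
    somers_d = 2 * c_stat - 1                 # Somer's D
    return concordant, discordant, tied, c_stat, somers_d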
Although not provided immediately in the model summary, Python can easily provide a couple of the metrics above using the roc_auc_score function from the sklearn.metrics package. The area under the ROC curve (AUC) is equivalent to the \(c\)-statistic mentioned above. From there we can calculate the Somer’s D statistic ourselves.
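Code
from sklearn.metrics import roc_auc_score

train['p_hat'] = result.predict()
auc = roc_auc_score(y, train['p_hat'])
print("C-statistic (AUC):", auc)

somer_d = 2*auc - 1
print("Somer's D:", somer_d)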
As we can see from the output, our model’s \(c\)-statistic is rather high at 0.978 which leads to a high value of Somer’s D of 0.957. Just like with other model metrics, we cannot say whether these are “good” values of these metrics as they are meant for comparison. If these values are higher than the same values from another model, then this model is better than the other model.
Although not provided immediately in the model summary, R can easily provide a couple of the metrics above with the somers2 function from the Hmisc package.
Code
library(Hmisc)

train$p_hat <- predict(logit_model, type = "response")
somers2(train$p_hat, train$bonus)
C Dxy n Missing
0.9784599 0.9569198 1095.0000000 0.0000000
As we can see from the output, our model’s \(c\)-statistic is rather high at 0.978 which leads to a high value of Somer’s D of 0.957. Just like with other model metrics, we cannot say whether these are “good” values of these metrics as they are meant for comparison. If these values are higher than the same values from another model, then this model is better than the other model.
Classification Metrics
Logistic regression is a model for predicting the probability of an event, not the occurrence of an event. Logistic regression can be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.
In this section we will focus on the classification metrics. We want a model to correctly classify events and non-events. Classification forces the model to predict either an event or non-event for each observation based on the predicted probability for that observation. For example, \(\hat{y}_i = 1\) if \(\hat{p}_i > 0.5\). These probability boundaries are called cut-offs or thresholds. However, strict classification-based measures completely discard any information about the actual quality of the model’s predicted probabilities.
Many of the metrics around classification try to balance different pieces of the classification table (also called the confusion matrix). An example of one is shown below.
Classification Table Example
Let’s examine the different pieces of the classification table that people jointly focus on.
Sensitivity & Specificity
Sensitivity is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, the proportion of the time you correctly predicted an event. This is also called the true positive rate. This is also just another name for recall.
Example Calculation of Sensitivity
This is balanced typically with the specificity. Specificity is the proportion of times you were able to predict a non-event in the group of actual non-events. Of the actual non-events, the proportion of the time you correctly predicted non-event. This is also called the true negative rate.
Example Calculation of Specificity
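In terms of the classification table cells - true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP) - these two quantities are:

$$ Sensitivity = \frac{TP}{TP + FN} \qquad Specificity = \frac{TN}{TN + FP} $$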
These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum sensitivity you can just predict that every observation is an event; however, this would drop your specificity to 0. The reverse is also true. Those who focus on sensitivity and specificity want balance in each. One measure for the optimal cut-off from a model is the Youden’s Index (or Youden’s J Statistic). This is easily calculated as \(J = sensitivity + specificity - 1\). The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.
Python produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The crosstab function creates the classification table for us after using the map function to define our cut-off at 0.5.
Code
import pandas as pd

train['pred'] = train['p_hat'].map(lambda x: 1 if x > 0.5 else 0)
pd.crosstab(train['bonus'], train['pred'])
pred 0 1
bonus
0 582 41
1 42 430
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in Python. However, the roc_curve function can do this for us. The inputs for this function are the target variable first, followed by the predicted probabilities. We save the output of this function into three objects: fpr (the false positive rate, or \(1-specificity\)), tpr (the true positive rate), and thresholds (the corresponding cut-offs that produce those values). From there we combine these variables into a single dataframe, calculate the Youden Index as the difference between the TPR and FPR, sort by this Youden Index value, and print the observations.
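Code
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(train['bonus'], train['p_hat'])
data = {'TPR': tpr, 'FPR': fpr, 'Cut-off': thresholds, 'Youden': tpr - fpr}
youden = pd.DataFrame(data)
youden.sort_values(by = ['Youden'], ascending = False)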
We can see that the highest Youden J statistic had a value of 0.8676. This took place at a cut-off of 0.318. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event.
Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operating Characteristic curve, commonly known as the ROC curve. The ROC curve plots the sensitivity (true positive rate) against \(1-specificity\) (false positive rate) at every cut-off. The “best” ROC curve is the one that reaches to the upper left hand side of the plot as that would imply that our model has both high levels of sensitivity and specificity. The worst ROC curve is represented by the diagonal line in the plot since that would imply our model is as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called AUC) as a metric summarizing the curve itself. The math won’t be shown here, but the AUC is equal to the \(c\)-statistic in the Rank-order statistics section. Isn’t math fun!?!?
Python easily produces ROC curves using matplotlib.pyplot (imported as plt) along with the fpr and tpr objects we previously calculated in the code above. Plotting tpr against fpr gives the ROC curve. The roc_auc_score function provides us with the AUC value for the ROC curve.
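Code
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

auc = roc_auc_score(train['bonus'], train['p_hat'])

plt.cla()
plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()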
We can also see that the area under our ROC curve is 0.98. Similar to other metrics, we cannot say whether this is a “good” value of AUC, only if it is better or worse than another model’s AUC.
R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The table function creates the classification table for us after using the ifelse function to define our cut-off at 0.5.
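Code
library(tidyverse)

train <- train %>%
  mutate(pred = ifelse(p_hat > 0.5, 1, 0))

table(train$bonus, train$pred)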
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in R. However, the measureit function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The measure option allows you to define additional measures to calculate at each cut-off. In the code below we ask for accuracy (ACC), sensitivity (SENS), and specificity (SPEC). From there we combine these variables into a single dataframe and print the observations with the head function.
Code
library(ROCit)

logit_meas <- measureit(train$p_hat, train$bonus,
                        measure = c("ACC", "SENS", "SPEC"))
youden_table <- data.frame(Cutoff = logit_meas$Cutoff,
                           Sens = logit_meas$SENS,
                           Spec = logit_meas$SPEC)
head(youden_table, n = 10)
We could calculate the Youden Index by hand and then rank the new dataframe by this value; however, the rocit function gives this to us automatically in the next piece of code.
Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operating Characteristic curve, commonly known as the ROC curve. The ROC curve plots the sensitivity (true positive rate) against \(1-specificity\) (false positive rate) at every cut-off. The “best” ROC curve is the one that reaches to the upper left hand side of the plot as that would imply that our model has both high levels of sensitivity and specificity. The worst ROC curve is represented by the diagonal line in the plot since that would imply our model is as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called AUC) as a metric summarizing the curve itself. The math won’t be shown here, but the AUC is equal to the \(c\)-statistic in the Rank-order statistics section. Isn’t math fun!?!?
R easily produces ROC curves from a variety of functions. A popular, newer function is the rocit function. Using the plot function on the rocit object gives the ROC curve. Calling the $optimal element of the plot of the rocit object gives the value of the Youden’s Index (value in the output) along with the respective cut-off that corresponds to that maximum Youden value. The summary function on the rocit object will report the AUC value. We can also get confidence intervals around our AUC values (ciAUC function) and ROC curves (ciROC function).
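Code
logit_roc <- rocit(train$p_hat, train$bonus)

plot(logit_roc)
plot(logit_roc)$optimal
summary(logit_roc)
ciAUC(logit_roc, level = 0.99)
plot(ciROC(logit_roc))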
We can see that the highest Youden J statistic had a value of 0.8676. This took place at a cut-off of 0.318. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. We can also see that the area under our ROC curve is 0.98. Similar to other metrics, we cannot say whether this is a “good” value of AUC, only if it is better or worse than another model’s AUC.
Another common function is the performance function that produces many more plots than the ROC curve. Here the ROC curve is obtained by plotting the true positive rate by the false positive rate using the measure = "tpr" and x.measure = "fpr" options. The AUC is also obtained from the performance function by calling the measure = "auc" option.
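Code
library(ROCR)

pred <- prediction(train$p_hat, factor(train$bonus))
perf <- performance(pred, measure = "sens", x.measure = "fpr")

plot(perf, lwd = 3, colorize = FALSE, colorkey = FALSE)
abline(a = 0, b = 1, lty = 3)

performance(pred, measure = "auc")@y.values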
K-S Statistic
One of the most popular metrics for classification models in the finance and banking industry is the KS statistic. The two sample KS statistic can determine if there is a difference between two cumulative distribution functions. The two cumulative distribution functions of interest to us are the predicted probability distribution functions for the event and non-event target group. The KS \(D\) statistic is the maximum distance between these two curves - calculated by the maximum difference between the true positive and false positive rates, \(D = \max_{depth}{(TPR - FPR)} = \max_{depth}{(Sensitivity + Specificity - 1)}\). Notice, this is the same as maximizing the Youden Index.
The optimal cut-off for determining predicted events and non-events would be at the point where this \(D\) statistic (Youden Index) is maximized.
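Because the KS statistic is just a two-sample test comparing the two predicted probability distributions, essentially the same \(D\) statistic can also be obtained directly from a standard two-sample KS test. A minimal sketch, assuming a dataframe train with the 0/1 target bonus and predicted probabilities p_hat:

Code
from scipy.stats import ks_2samp

# Predicted probability distributions for events and non-events
p_event = train.loc[train['bonus'] == 1, 'p_hat']
p_nonevent = train.loc[train['bonus'] == 0, 'p_hat']

# The D statistic is the maximum distance between the two empirical CDFs
ks_result = ks_2samp(p_event, p_nonevent)
print(ks_result.statistic)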
Mathematically, the KS-statistic and the Youden’s Index are the same. From the output above we can see that the maximum value of the Youden Index is 0.8676. This is the value of the KS \(D\) statistic. The cut-off where this is maximized is the same as Youden at 0.318. The only thing we are doing in the code below is calculating the cumulative probability functions that the KS function is originally based off of as well as plotting the value of the Youden Index across the probability functions. The maximum distance between the probability functions is the place where Youden’s Index is maximized.
plt.title("KS Plot (TPR vs. FPR)")plt.grid(True)plt.axvline(x=ks_cutoff, linestyle='--', color='red', label=f'KS = {ks_val:.2f}')plt.legend()plt.show()
As we saw in the previous section, the optimal cut-off according to the KS-statistic would be at 0.318. Therefore, according to the KS statistic at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event.
Using the same rocit object from the section on sensitivity and specificity (here called logit_roc) we can also calculate the KS statistic and plot the two cumulative distribution functions it represents. The ksplot function will plot the two cumulative distribution functions as well as highlight the cut-off (or threshold) where they are most separated. This point corresponds to the \(D\) statistic mentioned above as well as the Youden’s Index. By calling the KS stat and KS Cutoff elements from this KS plot, we can get the value of the statistic as well as the optimal cut-off where it occurs.
Code
ksplot(logit_roc)
Code
ksplot(logit_roc)$`KS stat`
[1] 0.8676103
Code
ksplot(logit_roc)$`KS Cutoff`
[1] 0.3178316
As we saw in the previous section, the optimal cut-off according to the KS-statistic would be at 0.318. Therefore, according to the KS statistic at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. The KS \(D\) statistic is reported as 0.8676 which is equal to the Youden’s Index value.
Another way to calculate this is by hand using the performance function we saw in the previous section as well. Using the measure = "tpr" and x.measure = "fpr" options, we can calculate the true positive rate and false positive rate across all our predictions. From there we can just use the max function to calculate the value of maximum difference between the two - the KS statistic. Finding the cut-off at this point is a little trickier with some of the needed R functions, but we essentially search the alpha values (here the cut-offs) for the point where the KS statistic is maximized.
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
KS <- max(perf@y.values[[1]] - perf@x.values[[1]])
cutoffAtKS <- unlist(perf@alpha.values)[which.max(perf@y.values[[1]] - perf@x.values[[1]])]
print(c(KS, cutoffAtKS))

plot(x = unlist(perf@alpha.values), y = (1 - unlist(perf@y.values)),
     type = "l", main = "K-S Plot (EDF)",
     xlab = 'Cut-off', ylab = "Proportion", col = "red")
lines(x = unlist(perf@alpha.values), y = (1 - unlist(perf@x.values)), col = "blue")
From the output we can see the KS \(D\) statistic at 0.8676. The predicted probability that this occurs at (the optimal cut-off) is defined at 0.318 as we previously saw.
Precision & Recall
Precision and recall are another way to view a classification table from a model. Recall is the proportion of times you were able to predict an event in the group of actual events. Of the actual events, the proportion of the time you correctly predicted an event. This is also called the true positive rate. This is also just another name for sensitivity.
Example Recall Calculation
This is balanced here with the precision. Precision is the proportion of predicted events that were actually events. Of the predicted events, the proportion of the time they actually were events. This is also called the positive predictive value. Precision is growing in popularity as a balance to recall/sensitivity, as compared to specificity.
Example Precision Calculation
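In terms of the classification table cells, these two quantities are:

$$ Precision = \frac{TP}{TP + FP} \qquad Recall = \frac{TP}{TP + FN} $$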
These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum recall you can just predict all events; however, this would drop your precision. Those who focus on precision and recall want balance in each. One measure for the optimal cut-off from a model is the F1 Score. This is calculated as the following:

$$ F_1 = 2\times \frac{precision \times recall}{precision + recall} $$

The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.
Python produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in Python using the precision_score, recall_score, and f1_score functions. The inputs for each function are the target variable first, followed by the predicted classes (the events and non-events implied by a given cut-off). We loop through many cut-off values to find the optimal F1-score.
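Code
from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np

precision = np.array([])
recall = np.array([])
f1score = np.array([])

for y in range(100):
    train['pred'] = train['p_hat'].map(lambda x: 1 if x > y/100 else 0)
    value_p = precision_score(train['bonus'], train['pred'])
    precision = np.append(precision, value_p)
    value_r = recall_score(train['bonus'], train['pred'])
    recall = np.append(recall, value_r)
    value_f = f1_score(train['bonus'], train['pred'])
    f1score = np.append(f1score, value_f)

data = {'Precision': precision, 'Recall': recall, 'Cut-off': range(100), 'F1': f1score}
f1_s = pd.DataFrame(data)
f1_s.sort_values(by = ['F1'], ascending = False)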
We can see that the highest F1 score had a value of 0.921. This took place at a cut-off of 0.32. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.32. In other words, if our model predicts a probability above 0.32 then we should call this an event. Any predicted probability below 0.32 should be called a non-event. This matches up closely with the Youden’s Index from above. This is not always the case. We just got lucky in this example.
Another common calculation using the precision of a model is the model’s lift. The lift of a model is simply calculated as the ratio of precision to the population proportion of the event - \(Lift = PPV/\pi_1\). The interpretation of lift is really nice for explanation. Let’s imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).
Python doesn’t have an easy, built-in function for plotting lift, but this isn’t a hard thing to calculate ourselves. The created function below will plot both the lift and gain plots for our data.
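A minimal sketch of such a function is shown below; the name plot_lift_and_gains and its inner workings are illustrative assumptions rather than the exact implementation behind the plots discussed next, but it follows the same logic of ranking observations by predicted probability.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_lift_and_gains(y_true, p_hat):
    """Plot cumulative gains (left) and cumulative lift (right) by depth - illustrative sketch."""
    df = pd.DataFrame({'y': np.asarray(y_true), 'p': np.asarray(p_hat)})
    df = df.sort_values('p', ascending=False).reset_index(drop=True)

    n = len(df)
    depth = (np.arange(n) + 1) / n                  # proportion of data used so far
    cum_events = df['y'].cumsum()
    gains = cum_events / df['y'].sum()              # cumulative capture rate
    lift = (cum_events / (np.arange(n) + 1)) / df['y'].mean()   # cumulative PPV / pi_1

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].plot(depth, gains, label='Model')
    axes[0].plot([0, 1], [0, 1], linestyle='--', color='gray', label='Random')
    axes[0].set_xlabel('Depth')
    axes[0].set_ylabel('Cumulative capture rate')
    axes[0].set_title('Cumulative Gains')
    axes[0].legend()

    axes[1].plot(depth, lift)
    axes[1].axhline(1, linestyle='--', color='gray')
    axes[1].set_xlabel('Depth')
    axes[1].set_ylabel('Cumulative lift')
    axes[1].set_title('Cumulative Lift')
    plt.tight_layout()
    plt.show()

plot_lift_and_gains(train['bonus'], train['p_hat'])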
The right-hand, cumulative lift chart displays how good you are up to that point in the data. Based on our output above, the top 10% of our predictions produce bonus eligible homes around 2.5 times the amount we would have gotten by random selection. The left-hand plot is the cumulative capture rate plot (also called cumulative gains). This tells us how much of the target 1’s you captured with your model. The diagonal line would be random, so the further above the line the better the model. For the first 10% of our predictions, we were able to capture a little over 20% of all bonus eligible homes. In fact, we have captured nearly all of the bonus eligible homes in the top 60% of our predictions.
R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in R. However, the measureit function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The measure option allows you to define additional measures to calculate at each cut-off. In the code below we ask for precision (PREC), sensitivity (SENS), and F1-score (FSCR). From there we combine these variables into a single dataframe and print the observations with the print function.
We can see that the highest F1 score had a value of 0.9215. This took place at a cut-off of 0.318. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. This matches up with the Youden’s Index from above. This is not always the case. We just got lucky in this example.
Another common calculation using the precision of a model is the model’s lift. The lift of a model is simply calculated as the ratio of Precision to the population proportion of the event - \(Lift = PPV/\pi_1\). The interpretation of lift is really nice for explanation. Let’s imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).
Again, we can use the rocit object (called logit_roc) from earlier. The gainstable function breaks the data down into 10 groups (or buckets) ranked from highest predicted probability to lowest. We can use the plot function on this new object along with the type option to get a variety of useful plots. If you want more than 10 buckets for your data you can always use the ngroup option to specify how many you want.
Let’s examine the output above. In the first table with the data split into 10 buckets, let’s examine the first row. Here we have 110 observations (1/10 of our data, or a depth of 0.1). Remember, these observations have been ranked by predicted probability so these observations have the highest probability of being a 1 according to our model. In these 110 observations, 109 of them had the response (target value of 1) which is a response rate of 0.991. Our original data had a total response rate (proportion of 1’s) of only 0.431. This means that we did 2.299 (=0.991 / 0.431) times better than random with our top 10% of customers. Another way to think about this would be that if we were to randomly pick 10% of our data, we would have only expected to see 47 responses (target values of 1). Our best 10% had 109 responses. Again, this ratio is a value of 2.299. The table continues this calculation for each of the buckets of 10% of our data.
The lift chart displays how good that bucket is alone, while the cumulative lift chart (more popular one) displays how good you are up to that point. The cumulative lift and lift charts are the first plot displayed. The second plot is the response rate and cumulative response rate plot. Each point divided by the horizontal line at 0.431 (population response rate) gives the lift value in the first chart. The last chart is the cumulative capture rate plot. This tells us how much of the target 1’s you captured with your model. The diagonal line would be random, so the further above the line the better the model.
Another way to calculate this is with the performance function in R. This can easily calculate and plot the lift chart for us using the measure = "lift" and x.measure = "rpp" options. This plots the lift vs. the rate of positive predictions.
A common place to evaluate lift is at the population proportion. In our example above, the population proportion is approximately 0.431. At that point, we have a lift of approximately 2. In other words, if we were to pick the top 43.1% of homes identified by our model, we would be 2 times as likely to find a bonus eligible home as compared to randomly selecting from the population. This shows the value in our model in an interpretable way.
Accuracy & Error
Accuracy and error rate are the metrics people typically think of first when measuring the ability of a logistic regression model. Accuracy is essentially the percentage of events and non-events that were predicted correctly.
Example Accuracy Calculation
The error rate is simply the opposite of this - the percentage of observations predicted incorrectly.
Example Error Calculation
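In terms of the classification table cells, with \(n\) total observations, these are:

$$ Accuracy = \frac{TP + TN}{n} \qquad Error = 1 - Accuracy = \frac{FP + FN}{n} $$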
However, caution should be used with these metrics as they can easily be fooled if you focus on them alone. If your data has 10% events and 90% non-events, you can easily have a 90% accurate model by guessing non-events for every observation. Instead, a model with lower accuracy might be better if it can predict both events and non-events. These numbers are still great to report! They are just not the best way to decide which model is better.
Python produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in Python using the accuracy_score function. The inputs for the function are the target variable first, followed by the predicted classes at a given cut-off. We loop through many cut-off values to find the optimal accuracy.
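A minimal sketch of this loop, following the same pattern as the F1-score loop above (and not necessarily the exact code behind the output discussed next), is:

Code
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd

accuracy = np.array([])
for c in range(100):
    train['pred'] = train['p_hat'].map(lambda x: 1 if x > c/100 else 0)
    accuracy = np.append(accuracy, accuracy_score(train['bonus'], train['pred']))

acc_df = pd.DataFrame({'Cut-off': np.arange(100)/100, 'Accuracy': accuracy})
acc_df.sort_values(by=['Accuracy'], ascending=False).head()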
From the output we can see the accuracy is maximized at 92.97%. The predicted probability that this occurs at (the optimal cut-off) is defined as 0.41. In other words, if our model predicts a probability above 0.41 then we should call this an event. Any predicted probability below 0.41 should be called a non-event, according to the accuracy metric.
There is more to model building than simply maximizing overall classification accuracy. These are good numbers to report, but not necessarily to choose models on.
R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.
We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a for loop in R. However, the measureit function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The measure option allows you to define additional measures to calculate at each cut-off. In the code below we ask for accuracy (ACC) and F1-score (FSCR). From there we combine these variables into a single dataframe and print the observations with the print function.
From the output we can see the accuracy is maximized at 92.97%. The predicted probability that this occurs at (the optimal cut-off) is defined as 0.456. In other words, if our model predicts a probability above 0.456 then we should call this an event. Any predicted probability below 0.456 should be called a non-event, according to the accuracy metric.
There is more to model building than simply maximizing overall classification accuracy. These are good numbers to report, but not necessarily to choose models on.
Optimal Cut-off
Classification is a decision that sits on top of, and is separate from, the statistical model. Although logistic regression tends to work well for classification, it is a probability model and does not directly output events and non-events.
Classification assumes the cost of each misclassification is the same, which is rarely true. It is always better to consider the costs of false positives and false negatives when choosing cut-offs for classification. The previous sections described many ways to determine “optimal” cut-offs when costs are either not known or not needed. However, if costs are known, they should drive the cut-off decision rather than modeling metrics that do not account for cost.
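As an illustration, a minimal sketch of a cost-driven cut-off search is shown below; the function name and the per-error costs are hypothetical placeholders, not values from this example.

Code
import numpy as np

def cost_optimal_cutoff(y_true, p_hat, cost_fp=1.0, cost_fn=5.0):
    # cost_fp and cost_fn are hypothetical per-observation costs of each error type
    y_true = np.asarray(y_true)
    p_hat = np.asarray(p_hat)

    best_cutoff, best_cost = None, np.inf
    for cutoff in np.linspace(0.01, 0.99, 99):
        pred = (p_hat > cutoff).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))   # false positives at this cut-off
        fn = np.sum((pred == 0) & (y_true == 1))   # false negatives at this cut-off
        total_cost = cost_fp * fp + cost_fn * fn
        if total_cost < best_cost:
            best_cutoff, best_cost = cutoff, total_cost
    return best_cutoff, best_cost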
---title: "Model Assessment"format: html: code-fold: show code-tools: trueengine: knitreditor: visual---```{r}#| include: false#| warning: false#| error: false#| message: falselibrary(reticulate)reticulate::use_condaenv("wfu_fall_ml", required =TRUE)``````{python}#| include: false#| warning: false#| error: false#| message: falsedef check_quasi_complete_separation(X, y):""" Checks each categorical predictor in X for quasi-complete separation with respect to binary target y. Parameters: - X: pd.DataFrame of predictors (categorical variables) - y: pd.Series of binary target variable (e.g., 0/1 or True/False) Returns: - List of variable names that exhibit quasi-complete separation """ problematic_vars = []for col in X.columns: ct = pd.crosstab(X[col], y)# Check if any category (row) has a zero in any outcome classif (ct ==0).any(axis=1).any():print(f"⚠️ Quasi-complete separation detected in '{col}'")print(ct)print() problematic_vars.append(col)return problematic_vars``````{python}#| include: false#| warning: false#| error: false#| message: falsefrom sklearn.datasets import fetch_openmlimport pandas as pdfrom sklearn.feature_selection import SelectKBest, f_classifimport numpy as npimport matplotlib.pyplot as plt# Load Ames Housing datadata = fetch_openml(name="house_prices", as_frame=True)ames = data.frame# Remove Business Logic Variablesames = ames.drop(['Id', 'MSSubClass','Functional','MSZoning','Neighborhood', 'LandSlope','Condition2','OverallCond','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','MasVnrArea','ExterCond','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','Electrical','LowQualFinSF','BsmtFullBath','BsmtHalfBath','KitchenAbvGr','GarageFinish','SaleType','SaleCondition'], axis=1)# Remove Missingness Variablesames = ames.drop(['PoolQC', 'MiscFeature','Alley', 'Fence'], axis=1)# Impute Missingness for Categorical Variablescat_cols = ames.select_dtypes(include=['object', 'category']).columnsames[cat_cols] = ames[cat_cols].fillna('Missing')# Remove Low Variability Variablesames = ames.drop(['Street', 'Utilities','Heating'], axis=1)# Train / Test Splitfrom sklearn.model_selection import train_test_splittrain, test = train_test_split(ames, test_size =0.25, random_state =1234)# Impute Missing for Continuous Variablesnum_cols = train.select_dtypes(include='number').columnsfor col in num_cols:if train[col].isnull().any():# Create missing flag column train[f'{col}_was_missing'] = train[col].isnull().astype(int)# Impute with median median = train[col].median() train[col] = train[col].fillna(median)train['bonus'] = (train['SalePrice'] >175000).astype(int)predictors = train.drop(columns=['SalePrice', 'bonus'])predictors = pd.get_dummies(predictors, drop_first=True)predictors = predictors.astype(float)predictors = predictors.drop(['GarageType_Missing', 'GarageQual_Missing','GarageCond_Missing'], axis=1)X = predictorsy = train['bonus']from sklearn.feature_selection import SelectKBest, chi2, f_classif# Separate categorical (dummy) vs. 
continuous featurescategorical_features = [col for col in X.columns if X[col].nunique() ==2]continuous_features = [col for col in X.columns if X[col].nunique() >2]X_cat = X[categorical_features]X_cont = X[continuous_features]# Fit SelectKBest for Categorical Variablesselector = SelectKBest(score_func=chi2, k='all') # 'all' keeps all features for scoringselector.fit(X_cat, y)# Create a DataFrame with feature names, Chi2-scores, and p-valuesscores_cat_df = pd.DataFrame({'Feature': X_cat.columns,'Chi2_score': selector.scores_,'p_value': selector.pvalues_})# Filter for features with p-value < 0.009selected_cat_features = scores_cat_df[scores_cat_df['p_value'] <0.009]['Feature']# Fit SelectKBest for Continous Variablesselector = SelectKBest(score_func=f_classif, k='all') # 'all' keeps all features for scoringselector.fit(X_cont, y)# Create a DataFrame with feature names, F-scores, and p-valuesscores_cont_df = pd.DataFrame({'Feature': X_cont.columns,'F_score': selector.scores_,'p_value': selector.pvalues_})# Filter for features with p-value < 0.009selected_cont_features = scores_cont_df[scores_cont_df['p_value'] <0.009]['Feature']# Create a new DataFrame with only those selected columnsX_reduced = X[selected_cat_features.tolist() + selected_cont_features.tolist()]X_cat_reduced = X_reduced[selected_cat_features.tolist()]problem_vars = check_quasi_complete_separation(X_cat_reduced, y)X_reduced = X_reduced.drop(problem_vars, axis =1)from mlxtend.feature_selection import SequentialFeatureSelector as SFSfrom sklearn.linear_model import LogisticRegressionfrom sklearn.preprocessing import StandardScalerscaler = StandardScaler()X_scaled = scaler.fit_transform(X_reduced)X_scaled_df = pd.DataFrame(X_scaled, columns = X_reduced.columns)logr = LogisticRegression(max_iter =1000, solver ='newton-cg', penalty =None) # Backward selection to find best subset of featuressfs = SFS(logr, k_features ="best", forward =False, floating =False, scoring ='roc_auc', cv =10)sfs = sfs.fit(X_scaled_df, y)# Get selected feature namesselected_features =list(sfs.k_feature_names_)X_selected = X_reduced[selected_features]``````{r}#| include: false#| warning: false#| error: false#| message: falsetrain = py$traintest = py$testX_selected = py$X_selectedy = py$y```# Comparing ModelsOne of the common concerns and questions in any model development is determining how "good" a model is or how well it performs. There are many different factors that determine this and most of them depend on the goals for the model. There are typically two different purposes for modeling - **estimation** and **prediction**. Estimation quantifies the expected change in our target variable associated with some relationship or change in the predictor variables. Prediction on the other hand is focused on predicting new target observations. However, these goals are rarely seen in isolation as most people desire a blend of these goals for their models. This section will cover many of the popular metrics for model assessment.The first thing to remember about model assessment is that a model is only "good" in context with another model. All of these model metrics are truly model comparisons. Is an accuracy of 80% good? Depends! If the previous model used has an accuracy of 90%, then no the new model is not good. However, if the previous model used has an accuracy of 70%, then yes the model is good. 
Although we will show many of the calculations, at no place will we say that you must meet a certain threshold for your models to be considered "good" because these metrics are designed for comparison.Some common model metrics are based on likelihood calculations. Likelihood calculations are limited to more statistical based models as more complicated machine learning models don't have likelihood representations. Three common logistic regression metrics based on likelihood are the following:1. AIC2. BIC3. Generalized (McFadden) $R^2$Without going into too much mathematical detail, the AIC is a crude, large sample approximation of leave-one-out cross validation. The BIC on the other hand favors a smaller model than the AIC as it penalizes model complexity more. In both AIC and BIC, lower values are better. However, there is no amount of lower that is better enough. Neither the AIC or BIC is necessarily better than the other however they may not always agree on the "best" model.There are a number of "pseudo"-$R^2$ metrics for logistic regression. Here, higher values indicate a better model. The Generalized (McFadden) $R^2$ is a metric to measure how much better a model (in terms of likelihood) is compared to the intercept only model. Therefore, we compare two models with this to see which one is "more better" than the intercept compared to the other. Essentially, they are both compared to the same baseline so whichever beats that baseline by more is the model we want. Even though it is bounded between 0 and 1, there is no interpretation to this metric like we had in linear regression.We will be going back to the Ames, Iowa housing data set to see how well our model performed after subset selection in the previous seciton. We will use the results from the backward selection model.Let's see how to calculate the McFadden $R^2$ in each software!::: {.panel-tabset .nav-pills}## PythonPython provides a lot of these metrics by default in the output of the `summary` function on our `Logit` model objects. However, we can call them separately through the `aic`, `bic_llf`, and `pseudo_rquared` functions as well if needed. Let's examine the $R^2$ output from our `Logit` model `summary` function output.```{python}#| warning: false#| error: false#| message: falseimport statsmodels.api as sm# Build Logistic RegressionX_selected = sm.add_constant(X_selected)model = sm.Logit(y, X_selected)result = model.fit()print(result.summary())```From the output above we can see the McFadden $R^2$ value as 0.7007. This just shows that our model is noticeably better than the intercept model. However, if we were to build another logistic regression model on the same data, we could compare them to see which had the higher value of McFadden's $R^2$.## RR provides some of these metrics by default in the output of the `summary` function on our logistic regression objects. However, we can call them separately through the `AIC`, `BIC`, and `pR2` functions as well. Note the `pR2` function comes from the `pscl` package in R and provides more than just the McFadden $R^2$ value.```{r}#| warning: false#| error: false#| message: falsenames(X_selected)[names(X_selected) =="1stFlrSF"] <-"FirstFlrSF"names(X_selected)[names(X_selected) =="2ndFlrSF"] <-"SecondFlrSF"logit_model <-glm(y ~ ., data = X_selected, family =binomial(link ="logit"),control =glm.control(maxit =100, epsilon =1e-8))summary(logit_model)library(pscl)pR2(logit_model)```From the output above we can see the McFadden $R^2$ value as 0.7007. 
This just shows that our model is noticeably better than the intercept model. However, if we were to build another logistic regression model on the same data, we could compare them to see which had the higher value of McFadden's $R^2$.:::# Probability MetricsLogistic regression is a model for predicting the **probability of an event**, not the occurrence of an event. Logistic regression **can** be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.In this section we will focus on the probability metrics. Since we are predicting the probability of an event, we want our model to assign higher predicted probabilities to events and lower predicted probabilities to non-events.## Rank-Order StatisticsRank-order statistics measure how well a model orders the predicted probabilities. Three common metrics that summarize things together are concordance, discordance, and ties. In these metrics every single combination of an event and non-event are compared against each other (1 event vs. 1 non-event; a 1 vs. a 0). A concordant pair is a pair with the event having a higher predicted probability than the non-event - the model got the rank correct. A discordant pair is a pair with the event having a lower predicted probability than the non-event - the model got the rank wrong. A tied pair is a pair where the event and non-event have the same predicted probability - the model isn't sure how to distinguish between them. Models with higher concordance are considered better. The interpretation on concordance is that for all possible event and non-event combinations, the model assigned the higher predicted probability to the observation with the event *concordance%* of the time.There are a host of other metrics that are based on these rank-statistics such as the $c$-statistic, Somer's D, and Kendall's $\tau_\alpha$. The calculations for these are as follows:$$ c = Concordance + 1/2\times Tied $$$$ D_{xy} = 2c - 1 $$$$ \tau_\alpha = \frac{Condorant - discordant}{0.5*n*(n-1)} $$With all of these, higher values of concordant pairs result in higher values of these metrics.Let's see how to calculate these in each software!::: {.panel-tabset .nav-pills}## PythonAlthough not provided immediately in the model summary, Python can easily provide a couple of the metrics above using the `roc_auc_score` function from the `sklearn.metrics` package. The area under the ROC curve (AUC) is equivalent to the $c$-statistic mentioned above. From there we can calculate the Somer's D statistic ourselves.```{python}#| warning: false#| error: false#| message: falsefrom sklearn.metrics import roc_auc_scoretrain['p_hat'] = result.predict()auc = roc_auc_score(y, train['p_hat'])print("C-statistic (AUC):", auc)somer_d =2* auc -1print("Somer's D:", somer_d)```As we can see from the output, our model's $c$-statistic is rather high at 0.978 which leads to a high value of Somer's D of 0.957. Just like with other model metrics, we cannot say whether these are "good" values of these metrics as they are meant for comparison. 
If these values are higher than the same values from another model, then this model is better than the other model.## RAlthough not provided immediately in the model summary, R can easily provide a couple of the metrics above with the and `somers2` function.```{r}#| warning: false#| error: false#| message: falselibrary(Hmisc)train$p_hat <-predict(logit_model, type ="response")somers2(train$p_hat, train$bonus)```As we can see from the output, our model's $c$-statistic is rather high at 0.978 which leads to a high value of Somer's D of 0.957. Just like with other model metrics, we cannot say whether these are "good" values of these metrics as they are meant for comparison. If these values are higher than the same values from another model, then this model is better than the other model.:::# Classification MetricsLogistic regression is a model for predicting the **probability of an event**, not the occurrence of an event. Logistic regression **can** be used for classification as well. Good models should reflect both good metrics on probability and classification, but the importance of one over the other depends on the problem.In this section we will focus on the classification metrics. We want a model to correctly classify events and non-events. Classification forces the model to predict either an event or non-event for each observation based on the predicted probability for that observation. For example, $\hat{y}_1 = 1$ if $\hat{p}_i > 0.5$. These are called cut-offs or thresholds. However, strict classification-based measures completely discard any information about the actual quality of the model's predicted probabilities.Many of the metrics around classification try to balance different pieces of the **classification table** (also called the **confusion matrix**). An example of one is shown below.{fig-align="center" width="5in"}Let's examine the different pieces of the classification table that people jointly focus on.## Sensitivity & SpecificitySensitivity is the proportion of times you were able to predict an event in the groups of actual events. Of the actual events, the proportion of the time you correctly predicted an event. This is also called the **true positive rate**. This is also just another name for recall.{fig-align="center" width="6in"}This is balanced typically with the specificity. Specificity is the proportion of times you were able to predict a non-event in the group of actual non-events. Of the actual non-events, the proportion of the time you correctly predicted non-event. This is also called the **true negative rate**.{fig-align="center" width="6in"}These offset each other in a model. One could easily maximize one of these at the cost of the other. To get maximum sensitivity you can just predict every observations is an event, however this would drop your specificity to 0. The reverse is also true. Those who focus on sensitivity and specificity want balance in each. One measure for the optimal cut-off from a model is the **Youden's Index** (or Youden's J Statistic). This is easily calculated as $J = sensitivity + specificity - 1$. The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.Let's see how to do this in each software!::: {.panel-tabset .nav-pills}## PythonPython produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. 
The `crosstab` function creates the classification table for us after using the `map` function to define our cut-off at 0.5.```{python}#| warning: false#| error: false#| message: falseimport pandas as pdtrain['pred'] = train['p_hat'].map(lambda x: 1if x >0.5else0)pd.crosstab(train['bonus'], train['pred'])```We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in Python. However, the `roc_curve` function can do this for us. The inputs for this function are the target variable first, followed by the predicted probabilities. We are saving the output of this function into three objects `fpr` (the false positive rate or $1-specificity$), the `tpr` (the true positive rate), and the corresponding threshold to get those values. From there we combine these variables into a single dataframe. We then calculate the Youden Index as the difference between the TPR and FPR. From there we sort by this Youden Index value and print the observations.```{python}#| warning: false#| error: false#| message: falsefrom sklearn.metrics import roc_curve, aucfpr, tpr, thresholds = roc_curve(train['bonus'], train['p_hat'])data = {'TPR': tpr, 'FPR': fpr, 'Cut-off': thresholds, 'Youden': tpr-fpr}youden = pd.DataFrame(data)youden.sort_values(by = ['Youden'], ascending =False)```We can see that the highest Youden J statistic had a value of 0.8676. This took place at a cut-off of 0.318. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event.Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operator Characteristic curve. Commonly known as the ROC curve. The ROC curve plots the balance of sensitivity vs. specificity. The "best" ROC curve is the one that reaches to the upper left hand side of the plot as that would imply that our model has both high levels of sensitivity and specificity. The worst ROC curve is represented by the diagonal line in the plot since that would imply our model is as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called AUC) as a metric summarizing the curve itself. The math won't be shown here, but the AUC is equal to the $c$-statistic in the Rank-order statistics section. Isn't math fun!?!?Python easily produces ROC curves using the `matplotlib.pyplot` function using the `fpr` and `tpr` objects we previously calculated in the code above. Using the `plt` function on this object gives the ROC curve. The `roc_auc_score` function provides us with the AUC value for the ROC curve.```{python}#| warning: false#| error: false#| message: falsefrom sklearn.metrics import roc_curve, roc_auc_scoreimport matplotlib.pyplot as pltauc = roc_auc_score(train['bonus'], train['p_hat'])plt.cla()plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")plt.plot([0, 1], [0, 1], linestyle="--", color="gray") # chance lineplt.xlabel("False Positive Rate")plt.ylabel("True Positive Rate")plt.title("ROC Curve")plt.legend()plt.grid(True)plt.show()```We can also see that the area under our ROC curve is 0.98. 
Similar to other metrics, we cannot say whether this is a "good" value of AUC, only if it is better or worse than another model's AUC.## RR produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities. The `table` function creates the classification table for us after using the `ifelse` function to define our cut-off at 0.5.```{r}#| warning: false#| error: false#| message: falselibrary(tidyverse)train <- train %>%mutate(pred =ifelse(p_hat >0.5, 1, 0))table(train$bonus, train$pred)```We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in R. However, the `measureit` function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The `measure` option allows you to define additional measures to calculate at each cut-off. In the code below we ask for accuracy (`ACC`), sensitivity (`SENS`), and specificity (`SPEC`). From there we combine these variables into a single dataframe and print the observations with the `head` function.```{r}#| warning: false#| error: false#| message: falselibrary(ROCit)logit_meas <-measureit(train$p_hat, train$bonus, measure =c("ACC", "SENS", "SPEC"))youden_table <-data.frame(Cutoff = logit_meas$Cutoff, Sens = logit_meas$SENS, Spec = logit_meas$SPEC)head(youden_table, n =10)```We could calculate the Youden Index by hand and then rank the new dataframe by this value, however, `rocit` function will gives this to us automatically in the next piece of code.Another commonly used visual for the balance of sensitivity and specificity across all of the cut-offs is the Receiver Operator Characteristic curve. Commonly known as the ROC curve. The ROC curve plots the balance of sensitivity vs. specificity. The "best" ROC curve is the one that reaches to the upper left hand side of the plot as that would imply that our model has both high levels of sensitivity and specificity. The worst ROC curve is represented by the diagonal line in the plot since that would imply our model is as good as randomly assigning events and non-events to our observations. This leads some to calculate the area under the ROC curve (typically called AUC) as a metric summarizing the curve itself. The math won't be shown here, but the AUC is equal to the $c$-statistic in the Rank-order statistics section. Isn't math fun!?!?R easily produces ROC curves from a variety of functions. A popular, new function is the `rocit` function. Using the `plot` function on the `rocit` object gives the ROC curve. By calling the `$optimal` element of the plot of the `rocit` object, it gives the value of the Youden's Index (`value` in the output) along with the respective cut-off that corresponds to that maximum Youden value. The `summary` function on the `rocit` object will report the AUC value. We can also get confidence intervals around our AUC values (`ciAUC` function) and ROC curves (`ciROC` function).```{r}#| warning: false#| error: false#| message: falselogit_roc <-rocit(train$p_hat, train$bonus)plot(logit_roc)plot(logit_roc)$optimalsummary(logit_roc)ciAUC(logit_roc, level =0.99)plot(ciROC(logit_roc))```We can see that the highest Youden J statistic had a value of 0.8676. This took place at a cut-off of 0.318. Therefore, according to the Youden Index at least, the optimal cut-off for our model is 0.318. 
In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. We can also see that the area under our ROC curve is 0.98. Similar to other metrics, we cannot say whether this is a "good" value of AUC, only if it is better or worse than another model's AUC.Another common function is the `performance` function that produces many more plots than the ROC curve. Here the ROC curve is obtained by plotting the true positive rate by the false positive rate using the `measure = "tpr"` and `"x.measure = "fpr"` options. The AUC is also obtained from the `performance` function by calling the `measure = "auc"` option.```{r}#| warning: false#| error: false#| message: falselibrary(ROCR)pred <-prediction(train$p_hat, factor(train$bonus))perf <-performance(pred, measure ="sens", x.measure ="fpr")plot(perf, lwd =3, colorize =FALSE, colorkey =FALSE)abline(a =0, b =1, lty =3)performance(pred, measure ="auc")@y.values```:::## K-S StatisticOne of the most popular metrics for classification models in the finance and banking industry is the KS statistic. The two sample KS statistic can determine if there is a difference between two cumulative distribution functions. The two cumulative distribution functions of interest to us are the predicted probability distribution functions for the event and non-event target group. The KS $D$ statistic is the maximum distance between these two curves - calculated by the maximum difference between the true positive and false positive rates, $D = \max_{depth}{(TPR - FPR)} = \max_{depth}{(Sensitivity + Specificity - 1)}$. Notice, this is the same as maximizing the Youden Index.The optimal cut-off for determining predicted events and non-events would be at the point where this $D$ statistic (Youden Index) is maximized.Let's see how to do this in each software!::: {.panel-tabset .nav-pills}## PythonMathematically, the KS-statistic and the Youden's Index are the same. From the output above we can see that the maximum value of the Youden Index is 0.8676. This is the value of the KS $D$ statistic. The cut-off where this is maximized is the same as Youden at 0.318. The only thing we are doing in the code below is calculating the cumulative probability functions that the KS function is originally based off of as well as plotting the value of the Youden Index across the probability functions. The maximum distance between the probability functions is the place where Youden's Index is maximized.```{python}#| warning: false#| error: false#| message: falsefrom sklearn.metrics import roc_curveimport seaborn as snsfpr, tpr, thresholds = roc_curve(train['bonus'], train['p_hat'])# Create the Youden DataFrameyouden = pd.DataFrame({'Cut-off': thresholds,'TPR': tpr,'FPR': fpr,'Youden': tpr - fpr})# Sort by Cut-off and renameyouden = youden.sort_values(by='Cut-off', ascending=True)ks_stat = youden.rename(columns={'TPR': 'PR_T', 'FPR': 'PR_F'})ks_stat = ks_stat.melt(id_vars='Cut-off', var_name='PR', value_name='value')ks_val = (youden['TPR'] - youden['FPR']).max()ks_cutoff = youden.loc[(youden['TPR'] - youden['FPR']).idxmax(), 'Cut-off']# Plotplt.cla()sns.lineplot(x='Cut-off', y='value', hue='PR', data=ks_stat)plt.xlim(1, 0)plt.title("KS Plot (TPR vs. FPR)")plt.grid(True)plt.axvline(x=ks_cutoff, linestyle='--', color='red', label=f'KS = {ks_val:.2f}')plt.legend()plt.show()```As we saw in the previous section, the optimal cut-off according to the KS-statistic would be at 0.318. 
Therefore, according to the KS statistic at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event.

## R

Using the same `rocit` object from the section on sensitivity and specificity (here called `logit_roc`), we can also calculate the KS statistic and plot the two cumulative distribution functions it represents. The `ksplot` function will plot the two cumulative distribution functions as well as highlight the cut-off (or threshold) where they are most separated. This point corresponds to the $D$ statistic mentioned above as well as Youden's Index. By calling the `KS stat` and `KS Cutoff` elements from this KS plot, we can get the value of the KS statistic as well as the cut-off where it occurs.

```{r}
#| warning: false
#| error: false
#| message: false

ksplot(logit_roc)

ksplot(logit_roc)$`KS stat`
ksplot(logit_roc)$`KS Cutoff`
```

As we saw in the previous section, the optimal cut-off according to the KS statistic is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. The KS $D$ statistic is reported as 0.8676, which is equal to the Youden's Index value.

Another way to calculate this is by hand using the `performance` function we saw in the previous section. Using the `measure = "tpr"` and `x.measure = "fpr"` options, we can calculate the true positive rate and false positive rate across all of our predictions. From there we can just use the `max` function to calculate the maximum difference between the two - the KS statistic. Finding the cut-off at this point is a little trickier with some of the needed R functions, but we essentially search the alpha values (here, the cut-offs) for the point where the KS statistic is maximized.

```{r}
#| warning: false
#| error: false
#| message: false

perf <- performance(pred, measure = "tpr", x.measure = "fpr")

KS <- max(perf@y.values[[1]] - perf@x.values[[1]])
cutoffAtKS <- unlist(perf@alpha.values)[which.max(perf@y.values[[1]] - perf@x.values[[1]])]
print(c(KS, cutoffAtKS))

plot(x = unlist(perf@alpha.values), y = (1 - unlist(perf@y.values)),
     type = "l", main = "K-S Plot (EDF)",
     xlab = 'Cut-off', ylab = "Proportion",
     col = "red")
lines(x = unlist(perf@alpha.values), y = (1 - unlist(perf@x.values)), col = "blue")
```

From the output we can see the KS $D$ statistic is 0.8676. The predicted probability where this occurs (the optimal cut-off) is 0.318, as we previously saw.

:::

## Precision & Recall

Precision and recall are another way to view a classification table from a model. Recall is the proportion of actual events that you predicted as events. Of the actual events, it is the proportion of the time you correctly predicted an event. This is also called the **true positive rate**, which is just another name for sensitivity.

$$ Recall = Sensitivity = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} $$

This is balanced with precision. Precision is the proportion of predicted events that actually were events. Of the predicted events, it is the proportion of the time they actually were events. This is also called the **positive predictive value**. Precision is growing in popularity as a balance to recall/sensitivity as compared to specificity.

$$ Precision = PPV = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$

These two metrics offset each other in a model. One could easily maximize one of them at the cost of the other. To get maximum recall you could just predict an event for every observation; however, this would drop your precision. Those who focus on precision and recall want balance in each.

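To make the trade-off concrete, here is a minimal sketch (assuming the `train` data frame with the 0/1 `bonus` target and `p_hat` predictions used throughout this section) of the extreme case where we predict every observation as an event:

```{r}
#| warning: false
#| error: false
#| message: false

# Predict an event for every observation: recall is perfect,
# but precision collapses to the overall event rate.
# Assumes train$bonus is coded 0/1 as in the rest of this section.
all_events <- rep(1, nrow(train))

recall <- sum(all_events == 1 & train$bonus == 1) / sum(train$bonus == 1)
precision <- sum(all_events == 1 & train$bonus == 1) / sum(all_events == 1)

c(Recall = recall, Precision = precision)
```

Every actual event is captured (recall of 1), but precision falls to the overall event rate (about 0.431 in our data), which is why we want a measure that balances the two.
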
One measure for the optimal cut-off from a model is the **F1 Score**. This is calculated as the following:

$$ F_1 = 2\times \frac{precision \times recall}{precision + recall} $$

The optimal cut-off for determining predicted events and non-events would be at the point where this is maximized.

Let's see how to do this in each software!

::: {.panel-tabset .nav-pills}

## Python

Python produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.

We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in Python using the `precision_score`, `recall_score`, and `f1_score` functions. The inputs for each function are the target variable first, followed by the predicted classes at a given cut-off. We loop through many cut-off values to find the optimal F1-score.

```{python}
#| warning: false
#| error: false
#| message: false

from sklearn.metrics import precision_score, recall_score, f1_score
import numpy as np
import pandas as pd

precision = np.array([])
recall = np.array([])
f1score = np.array([])

for y in range(100):
    train['pred'] = train['p_hat'].map(lambda x: 1 if x > y/100 else 0)
    value_p = precision_score(train['bonus'], train['pred'])
    precision = np.append(precision, value_p)
    value_r = recall_score(train['bonus'], train['pred'])
    recall = np.append(recall, value_r)
    value_f = f1_score(train['bonus'], train['pred'])
    f1score = np.append(f1score, value_f)

# Cut-offs are stored in probability units (0.00 to 0.99)
data = {'Precision': precision, 'Recall': recall, 'Cut-off': np.arange(100) / 100, 'F1': f1score}
f1_s = pd.DataFrame(data)

f1_s.sort_values(by = ['F1'], ascending = False)
```

We can see that the highest F1 score had a value of 0.921. This took place at a cut-off of 0.32. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.32. In other words, if our model predicts a probability above 0.32 then we should call this an event. Any predicted probability below 0.32 should be called a non-event. This matches up closely with the Youden's Index from above. This is **not** always the case. We just got lucky in this example.

Another common calculation using the precision of a model is the model's **lift**. The lift of a model is simply calculated as the ratio of precision to the population proportion of the event - $Lift = PPV/\pi_1$. The interpretation of lift is really nice for explanation. Let's imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).

Python doesn't have an easy, built-in function for plotting lift, but this isn't a hard thing to calculate ourselves. The function created below will plot both the lift and gains charts for our data.

```{python}
#| warning: false
#| error: false
#| message: false

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_lift_and_gains(y_true, y_proba, n_bins=10):
    """
    Plot Lift and Cumulative Gains curves.
    Parameters:
    - y_true: array-like, true binary labels (0/1)
    - y_proba: array-like, predicted probabilities for the positive class
    - n_bins: number of bins/deciles to split data

    Returns:
    - None (plots the curves)
    """
    df = pd.DataFrame({
        'y_true': y_true,
        'y_proba': y_proba
    })

    # Sort descending by predicted probability
    df = df.sort_values(by='y_proba', ascending=False).reset_index(drop=True)

    # Add cumulative counts
    df['cum_total'] = np.arange(1, len(df) + 1)
    df['cum_positives'] = df['y_true'].cumsum()

    total_positives = df['y_true'].sum()
    total_samples = len(df)

    # Cumulative Gains: % positives captured vs % samples
    df['cum_gains'] = df['cum_positives'] / total_positives

    # Lift: (cumulative gains) / (cumulative % of sample)
    df['cum_lift'] = df['cum_gains'] / (df['cum_total'] / total_samples)

    # Sample points for plotting (deciles)
    cutoffs = np.linspace(0, total_samples, n_bins + 1, dtype=int)
    cutoffs = cutoffs[cutoffs > 0]  # remove zero
    plot_points = df.loc[cutoffs - 1, ['cum_total', 'cum_gains', 'cum_lift']].copy()
    plot_points['percent_samples'] = plot_points['cum_total'] / total_samples * 100

    # Plot
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # Plot Cumulative Gains
    axes[0].plot(plot_points['percent_samples'], plot_points['cum_gains'],
                 marker='o', color='blue', label='Cumulative Gains')
    axes[0].plot([0, 100], [0, 1], linestyle='--', color='blue', alpha=0.5, label='Random Gains')
    axes[0].set_xlabel('Percent of Sample')
    axes[0].set_ylabel('Cumulative Gains')
    axes[0].set_title('Cumulative Gains Curve')
    axes[0].set_ylim(0, 1.05)
    axes[0].grid(True)
    axes[0].legend()

    # Plot Lift
    axes[1].plot(plot_points['percent_samples'], plot_points['cum_lift'],
                 marker='o', color='red', label='Lift')
    axes[1].axhline(1, linestyle='--', color='red', alpha=0.5, label='Random Lift')
    axes[1].set_xlabel('Percent of Sample')
    axes[1].set_ylabel('Cumulative Lift')
    axes[1].set_title('Cumulative Lift Curve')
    axes[1].set_ylim(0, plot_points['cum_lift'].max() * 1.1)
    axes[1].grid(True)
    axes[1].legend()

    plt.tight_layout()
    plt.show()

# Example:
plot_lift_and_gains(train['bonus'], train['p_hat'])
```

The right-hand, cumulative lift chart displays how good you are up to that point in the data. Based on our output above, the top 10% of our predictions produce bonus eligible homes at around 2.3 times the rate we would have gotten by random selection. The left-hand plot is the cumulative capture rate plot (also called cumulative gains). This tells us how much of the target 1's you captured with your model. The diagonal line would be random, so the further above the line the better the model. For the first 10% of our predictions, we were able to capture a little over 20% of all bonus eligible homes. In fact, we have captured nearly all of the bonus eligible homes in the top 60% of our predictions.

## R

R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.

We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in R. However, the `measureit` function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The `measure` option allows you to define additional measures to calculate at each cut-off. In the code below we ask for precision (`PREC`), recall (`REC`), and F1-score (`FSCR`).
From there we combine these variables into a single dataframe and print the top observations with the `head` function.

```{r}
#| warning: false
#| error: false
#| message: false

logit_meas <- measureit(train$p_hat, train$bonus, measure = c("PREC", "REC", "FSCR"))
summary(logit_meas)

fscore_table <- data.frame(Cutoff = logit_meas$Cutoff, FScore = logit_meas$FSCR)

head(arrange(fscore_table, desc(FScore)), n = 10)
```

We can see that the highest F1 score had a value of 0.9215. This took place at a cut-off of 0.318. Therefore, according to the F1 score at least, the optimal cut-off for our model is 0.318. In other words, if our model predicts a probability above 0.318 then we should call this an event. Any predicted probability below 0.318 should be called a non-event. This matches up with the Youden's Index from above. This is **not** always the case. We just got lucky in this example.

Another common calculation using the precision of a model is the model's **lift**. The lift of a model is simply calculated as the ratio of precision to the population proportion of the event - $Lift = PPV/\pi_1$. The interpretation of lift is really nice for explanation. Let's imagine that your lift was 3 and your population proportion of events was 0.2. This means that in the top 20% of your customers, your model predicted 3 times the events as compared to you selecting people at random. Sometimes people plot lift charts where they plot the precision at all the different values of the population proportion (called depth).

Again, we can use the `rocit` object (called `logit_roc`) from earlier. The `gainstable` function breaks the data down into 10 groups (or buckets) ranked from highest predicted probability to lowest. We can use the `plot` function on this new object along with the `type` option to get a variety of useful plots. If you want more than 10 buckets for your data you can always use the `ngroup` option to specify how many you want.

```{r}
#| warning: false
#| error: false
#| message: false

logit_lift <- gainstable(logit_roc)
print(logit_lift)

plot(logit_lift, type = 1)
plot(logit_lift, type = 2)
plot(logit_lift, type = 3)

logit_lift <- gainstable(logit_roc, ngroup = 15)
print(logit_lift)
```

Let's examine the output above. In the first table, with the data split into 10 buckets, let's look at the first row. Here we have 110 observations (1/10 of our data, or a depth of 0.1). Remember, these observations have been ranked by predicted probability, so these are the observations with the highest probability of being a 1 according to our model. Of these 110 observations, 109 had the response (target value of 1), which is a response rate of 0.991. Our original data had a total response rate (proportion of 1's) of only 0.431. This means that we did 2.299 (= 0.991 / 0.431) times better than random with our top 10% of homes. Another way to think about this would be that if we were to randomly pick 10% of our data, we would have only expected to see about 47 responses (target values of 1). Our **best** 10% had 109 responses. Again, this ratio is a value of 2.299. The table continues this calculation for each of the buckets of 10% of our data.

The lift chart displays how good that bucket is alone, while the cumulative lift chart (the more popular one) displays how good you are up to that point. The cumulative lift and lift charts are the first plot displayed. The second plot is the response rate and cumulative response rate plot. Each point divided by the horizontal line at 0.431 (the population response rate) gives the lift value in the first chart. The last chart is the cumulative capture rate plot. This tells us how much of the target 1's you captured with your model. The diagonal line would be random, so the further above the line the better the model.

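To build intuition for what `gainstable` is doing, here is a minimal by-hand sketch of the top-decile lift, assuming the same `train` data frame with `p_hat` and the 0/1 `bonus` target used throughout:

```{r}
#| warning: false
#| error: false
#| message: false

# Rank homes by predicted probability, keep the top 10%, and compare that
# bucket's response rate to the overall response rate.
# Assumes train$bonus is coded 0/1.
train %>%
  arrange(desc(p_hat)) %>%
  slice_head(prop = 0.1) %>%
  summarise(bucket_rate = mean(bonus)) %>%
  mutate(overall_rate = mean(train$bonus),
         lift = bucket_rate / overall_rate)
```

The resulting lift should be close to the 2.299 reported in the first row of the gains table above.
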
Another way to calculate this is with the `performance` function in R. This can easily calculate and plot the lift chart for us using the `measure = "lift"` and `x.measure = "rpp"` options. This plots the lift vs. the rate of positive predictions.

```{r}
#| warning: false
#| error: false
#| message: false

perf <- performance(pred, measure = "lift", x.measure = "rpp")

plot(perf, lwd = 3, colorize = TRUE, colorkey = TRUE,
     colorize.palette = rev(gray.colors(256)),
     main = "Lift Chart for Training Data")
abline(h = 1, lty = 3)
```

A common place to evaluate lift is at the population proportion. In our example above, the population proportion is approximately 0.431. At that point, we have a lift of approximately 2. In other words, if we were to pick the top 43.1% of homes identified by our model, we would be 2 times as likely to find a bonus eligible home as compared to randomly selecting from the population. This shows the value of our model in an interpretable way.

:::

## Accuracy & Error

Accuracy and error rate are the metrics people typically think of first when measuring the ability of a logistic regression model. Accuracy is essentially the percentage of events and non-events that were predicted correctly.

$$ Accuracy = \frac{\text{true positives} + \text{true negatives}}{\text{total observations}} $$

The error rate is the opposite of this.

$$ Error = 1 - Accuracy $$

However, caution should be used with these metrics as they can easily fool you if you focus *only* on them. If your data has 10% events and 90% non-events, you can easily have a 90% accurate model by guessing non-event for every observation. Instead, having less accuracy might be better if we can predict both events and non-events well. These numbers are still great to report! They are just not the best for deciding which model is best.

Let's see how to do this in each software!

::: {.panel-tabset .nav-pills}

## Python

Python produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.

We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in Python using the `accuracy_score` function. The inputs for the function are the target variable first, followed by the predicted classes at a given cut-off. We loop through many cut-off values to find the optimal accuracy.

```{python}
#| warning: false
#| error: false
#| message: false

from sklearn.metrics import accuracy_score

accuracy = np.array([])

for y in range(100):
    train['pred'] = train['p_hat'].map(lambda x: 1 if x > y/100 else 0)
    value_a = accuracy_score(train['bonus'], train['pred'])
    accuracy = np.append(accuracy, value_a)

# Cut-offs are stored in probability units (0.00 to 0.99)
data = {'Accuracy': accuracy, 'Cut-off': np.arange(100) / 100}
acc_s = pd.DataFrame(data)

acc_s.sort_values(by = ['Accuracy'], ascending = False)
```

From the output we can see the accuracy is maximized at 92.97%. The cut-off where this occurs is 0.41. In other words, if our model predicts a probability above 0.41 then we should call this an event. Any predicted probability below 0.41 should be called a non-event, according to the accuracy metric.

There is more to model building than simply maximizing overall classification accuracy.
These are good numbers to report, but not necessarily to choose models on.

## R

R produces classification tables for any cut-off that you determine. Remember that a cut-off is simply the point where you decide to predict an event and non-event from your predicted probabilities.

We want to look at all classification tables for all values of cut-offs between 0 and 1. We can easily loop through this calculation with a `for` loop in R. However, the `measureit` function can do this for us. The inputs for this function are the predicted probabilities first, followed by the target variable. The `measure` option allows you to define additional measures to calculate at each cut-off. In the code below we ask for accuracy (`ACC`) and F1-score (`FSCR`). From there we combine these variables into a single dataframe and print the top observations with the `head` function.

```{r}
#| warning: false
#| error: false
#| message: false

logit_meas <- measureit(train$p_hat, train$bonus, measure = c("ACC", "FSCR"))
summary(logit_meas)

acc_table <- data.frame(Cutoff = logit_meas$Cutoff, Acc = logit_meas$ACC)

head(arrange(acc_table, desc(Acc)), n = 10)
```

From the output we can see the accuracy is maximized at 92.97%. The cut-off where this occurs is 0.456. In other words, if our model predicts a probability above 0.456 then we should call this an event. Any predicted probability below 0.456 should be called a non-event, according to the accuracy metric.

There is more to model building than simply maximizing overall classification accuracy. These are good numbers to report, but not necessarily to choose models on.

:::

# Optimal Cut-off

Classification is a decision that is separate from the statistical modeling itself. Although logistic regression tends to work well for classification, it is a probability model and does not output events and non-events on its own.

Classification also assumes the cost of each misclassified observation is the same, which is rarely true. It is always better to consider the costs of false positives and false negatives when choosing a cut-off for classification. The previous sections cover many ways to determine an "optimal" cut-off when costs are either not known or not needed. However, if costs are known, they should drive the cut-off decision rather than modeling metrics that do not account for cost.

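To sketch what a cost-driven choice might look like, suppose (purely for illustration; these costs are not from our data) that a false negative is five times as costly as a false positive. We could then search for the cut-off that minimizes the total cost of errors on the training data:

```{r}
#| warning: false
#| error: false
#| message: false

# Hypothetical costs -- replace with real business costs when they are known
cost_fp <- 1   # cost of calling a non-event an event
cost_fn <- 5   # cost of calling an event a non-event

cutoffs <- seq(0.01, 0.99, by = 0.01)

total_cost <- sapply(cutoffs, function(co) {
  pred <- ifelse(train$p_hat > co, 1, 0)
  fp <- sum(pred == 1 & train$bonus == 0)
  fn <- sum(pred == 0 & train$bonus == 1)
  fp * cost_fp + fn * cost_fn
})

# Cut-off with the lowest total cost
cutoffs[which.min(total_cost)]
```

Because the two types of errors are weighted unequally, the cut-off chosen this way will generally differ from the Youden, F1, or accuracy-based cut-offs above.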