Credit Score Modeling

Introduction to Credit Scoring

Credit scoring is best summed up by this quote from David Edelman who at the time was the Credit Director of the Royal Bank of Scotland:

Credit scoring is “one of the oldest applications of data mining, because it is one of the earliest uses of data to predict consumer behavior.”

A credit scoring model is a statistical model that assigns a risk value to prospective or existing credit accounts. Typically, we think of credit scoring in the context of loans, where we are trying to determine the likelihood that an individual will default on a loan.

Scorecards are a common way of displaying the patterns found in a binary response model. Typically, the models that underlie scorecards are logistic regression models. The main benefit of scorecards is their clear and intuitive presentation of the regression coefficients from a model. Scorecards are typically thought of in a credit modeling framework, but they are not limited to that setting; they are also used in fraud detection, healthcare, and marketing.

Credit Scorecards

Credit scorecards, much like your FICO score, are statistical risk models put into a special format designed for ease of interpretation. They are used to make strategic decisions such as accepting or rejecting applicants and deciding when to raise a credit line, among other decisions. The credit scorecard format is very popular and successful in the consumer credit world for three primary reasons:

  1. People at all levels within an organization generally find it easy to understand and use.
  2. Regulatory agencies are accustomed to credit risk models presented in this fashion.
  3. Credit scorecards are straightforward to implement and monitor over time.

Let’s examine a simple example of a scorecard to see these benefits. Below is a simple scorecard built from a three-variable logistic regression model trying to predict default on a loan. The three variables are miss, which represents the number of months since the last missed payment for the applicant, home, which represents whether an applicant owns or rents their home, and income, which is the income bracket for the applicant.

Example Scorecard


Imagine we had an applicant who last missed a payment 32 months ago, who owned their home, and who had a salary of $30,000. They would have a score of 525 (120 + 225 + 180). Let’s assume our cut-off for giving a loan was a score of 500. If that were the case, the applicant would be given the loan. Now imagine we had another applicant who last missed a payment 22 months ago and owned their home, but only had an income of $8,000. They would have a score of 445 (100 + 225 + 120). They would not be given a loan.

This is extremely easy for anyone to use and implement in any computing system or database. The person making the loan decision has easy cut-offs and variable groupings to bucket an applicant into. They can also much more easily let an applicant know why they were rejected for a loan. Our second applicant had an income level that was in the lowest point bin for that variable. This was also true for their months since last missed payment variable. These are the reasons they were rejected for a loan.

This ease of interpretation protects the consumer as it is their right to ask why they were rejected for a loan. This is why regulators appreciate the format of scorecards so much in the credit world.

Discrete vs. Continuous Time

Credit scoring tries to understand the probability of default for a customer (or business). However, default depends on time for its definition: when a customer or business will default is just as valuable to know as whether they will. How we incorporate time into the evaluation of credit scoring is therefore important.

Accounting for time is typically broken down into two approaches:

  1. Discrete time
  2. Continuous time

Discrete time evaluates binary decisions on predetermined intervals of time. For example, are you going to default in the next 30, 60, or 90 days? Each of these intervals has a separate binary credit scoring model. This approach is very popular in consumer credit scoring because people don’t actually care about the exact day of default as much as the number of missed payments. Knowing that a default happened at exactly 72 days isn’t needed as long as we know the consumer defaulted between 60 and 90 days. Used together, these models can piece together windows of time in which it is believed a consumer will default.

Continuous time evaluates the probability of default as it changes over continuous points in time. Instead of a series of binary classification models, survival analysis models are used for this approach as they can predict the exact day of default. This is more important when credit scoring businesses, where we want to determine the exact time a business may declare bankruptcy and default on a loan. The approach is starting to gain more popularity in consumer credit modeling as well, to better estimate the amount of capital to keep on hand when consumers default at specific times rather than within windows of time.

Data Description

The first thing that we need to do is load up all of the needed libraries in R that we will be using in these course notes. This isn’t needed for the SAS sections of the code.
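A minimal sketch of that loading step, assuming the smbinning package that is used throughout the R sections below; the full list of packages in the original notes may differ.

library(smbinning)    # binning, WOE/IV tables, and scorecard performance metrics
# library(scorecard)  # alternative binning package mentioned later in these notes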

We will be using the auto loan data to build a credit scorecard for applicants for an auto loan. This credit scorecard predicts the likelihood of default for these applicants. There are actually two data sets we will use. The first is a data set on 5,837 people who were ultimately given auto loans. The variables in the accepts data set are the following:

Auto Loan Accepts Data Dictionary
Variable Description
Age_oldest_tr Age of oldest trade
App_id Application ID
Bad Good/Bad loan
Bankruptcy Bankruptcy (1) or not (0)
Bureau_score Bureau score
Down_pyt Amount of down payment on vehicle
Loan_amt Amount of loan
Loan_term How many months vehicle was financed
Ltv Loan to value
MSRP Manufacturer suggested retail price
Purch_price Purchase price of vehicle
Purpose Lease or loan
Rev_util Revolving utilization (balance/credit limit)
Tot_derog Total number of derogatory trades (go past due)
Tot_income Applicant’s income
Tot_open_tr Number of open trades
Tot_rev_debt Total revolving debt
Tot_rev_line Total revolving line
Tot_rev_tr Total revolving trades
Tot_tr Total number of trades
Used_ind Used car indicator
Weight Weight variable

The accepts data set has actually been oversampled for us because the event of defaulting on the auto loans is only 5% in the population. Our sample has closer to a 20% default rate.

The second data set is 4,233 applicants who were ultimately rejected for a loan. We have the same information on these applicants except for the target variable of whether they defaulted since they were never given a loan. These individuals are still key to our analysis. Sampling bias occurs when we only use people who were given loans to make decisions on individuals who apply for loans. In order to correct for this bias, we perform reject inference on the individuals who applied, but were not given loans.

We will deal with this data set near the end of our credit scoring model process.

Data Collection and Cleaning

Defining the Target

When dealing with credit scoring data, the first major hurdle is to define the target variable of default. This might be harder than initially expected. When does someone actually default? Do you wait for the loan to be charged-off by the bank? There were probably plenty of signs before then that the customer would stop paying on their loan.

People used to always use 90 days past due (DPD) as the typical definition of default. If a customer goes 90 days past their due date for a payment on a loan, they would be considered in default. Now, default ranges between 90 and 180 days past due depending on the type of loan, business sector, and country regulations. For example, in the United States, 180 days past due is the default standard on mortgage loans.

Predictor Variables

Selecting predictor variables in credit scoring models also takes care. Credit scoring models desire variables that have a high amount of predictability for default, but that isn’t the only criterion. Since regulators will be checking these models, they have to have variables that are easily interpretable from a business standpoint. The variables also must be reliably and easily collected both now and in the future since you don’t want to make a mistake in a loan decision. They must also be evaluated ethically to ensure that the bank is being fair and equitable to all of its customers.

Feature engineering is an important part of the process for developing good predictor variables. In fact, good feature engineering can replace the need for more advanced modeling techniques that lack the interpretation needed for a good credit scorecard model. Features may be created based on business reasoning, such as the loan to value ratio, the expense to income ratio, or the credit line utilization across time. Variable clustering may also be used to omit variables that are highly dependent on each other.

Sampling

When it comes to sample size, there are no hard-and-fast rules on how much data is needed for building credit scoring models. The FDIC suggests that samples “normally include at least 1,000 good, 1,000 bad, and 750 rejected applicants.” However, sample size really depends on the overall size of the portfolio, the number of predictor variables being planned for the model, and the number of defaults in the data.

Sampling must also be characteristic of the population to which the scorecard will be applied. For example, if the scorecard is to be applied in the subprime lending program, then we must use a sample that captures the characteristics of the subprime population targeted. Here are the two main steps for sampling for credit scoring models:

  1. Gather data for accounts opened during a specific time frame.
  2. Monitor the performance of these accounts for another specific length of time to determine if they were good or bad.

This approach raises natural concerns. Accounts that are opened more recently are more similar to accounts that will be opened in the near future, so we don’t want to go too far back in time to sample. However, we want to minimize the chances of misclassifying the performance of an account, so we need to monitor the accounts long enough to let them fail. Banks develop cohort graphs to help them determine how long a typical customer takes to default on a loan. Essentially, they watch customer accounts and their default rates; when these default rates level off, a majority of the customers who will ever default have already done so. This relies on the empirical observation that customers who default typically do so early in the life of a loan. From these cohort charts come the concepts of sample and performance windows.

For example, let’s imagine the typical amount of time a customer takes to default on a loan is 14 months. If our analysis is to be performed in March of this year, we will select our sample from 12-16 months back. This gives us an average of 14 months for our performance window. An example of this is shown below.

12-16 Month Performance Window


Now that our data is set, we can move into truly preparing our variables for modeling.

Variable Grouping and Selection

Feature creation and selection is one of the most important pieces to any modeling process. It is no different for credit score modeling. Before selecting the variables, we need to transform them. Specifically in credit score modeling, we need to take our continuous variables and bin them into categorical versions.

Variable Grouping

Scorecards end up containing only bins within each variable. There are two primary objectives when deciding how to bin the variables:

  1. Eliminate weak variables or those that do not conform to good business logic.
  2. Group the strongest variables’ attribute levels (values) in order to produce a model in the scorecard format.

Binning continuous variables helps simplify analysis. We no longer need to explain coefficients that imply some notion of constant effect or linearity; instead we just compare categories. This process of binning also models non-linearities in an easily interpretable way, so we are not restricted to the linearity of continuous variables that some models assume. Outliers are also easily accounted for as they are typically contained within the smallest or largest bin. Lastly, missing values are no longer a problem and do not need imputation. Missing values can get their own bin, making all observations available to be modeled.

There are a variety of different approaches to statistically bin variables. We will focus on the two most popular ones here:

  1. Prebinning and Grouping of Bins
  2. Conditional Inference Trees

The first is prebinning the variables followed by grouping of these bins. Imagine you had a variable whose range is from 3 to 63. This approach would first break the variable into quantile bins. Software packages typically use anywhere from 20 to 100 equally sized quantiles for this initial step. From there, we use chi-square tests to compare each adjacent pair of bins. If the bins are statistically the same with respect to the target variable using two-by-two contingency table chi-square tests (Mantel-Haenszel for example), then we combine the bins. We repeat this process until no more adjacent pairs of bins can be statistically combined. Below is a visual of this process.

Prebinning and Grouping of Bins


The second common approach is through conditional inference trees. These are a variation on the common CART decision tree. CART methods for decision trees have an inherent bias - variables with more levels are more likely to be split on when using the Gini and entropy criteria. Conditional inference trees add an extra step to this process. They evaluate which variable is most significant first, then evaluate the best split of that continuous variable through the chi-square decision tree approach on that specific variable only, not all variables. They repeat this process until no more significant variables are left to be split. How does this apply to binning though? When binning a continuous variable, we are predicting the target variable using only our one continuous variable in the conditional inference tree. It evaluates whether the variable is significant at predicting the target variable. If so, it finds the most statistically significant split using chi-square tests between each value of the continuous variable, comparing the two groups formed by each candidate split. After finding the most significant split you have two intervals - one below the split and one above. The process repeats itself until the algorithm can no longer find significant splits, leading to the definition of your bins. Below is a visual of this process.

Conditional Inference Tree Bins


Cut-offs (or cut points) from the decision tree algorithms might be rather rough. Sometimes we override the automatically generated cut points to more closely conform to business rules. These overrides might make the bins suboptimal, but hopefully not too much to impact the analysis.

Overrides of Bins Example


Imagine a similar scenario for linear regression. Suppose you had two models, with the first model having an \(R^2_A = 0.8\) and the second model having an \(R^2_A = 0.78\). However, the second model made more intuitive business sense than the first. You would probably choose the second model, willing to sacrifice a small amount of predictive power for a model that made more intuitive sense. The same can be thought of when slightly altering the bins from the two approaches described above.

Let’s see how each of our software packages approaches binning continuous variables!

R

The R package that you choose will determine the technique that is used for the binning of the continuous variables. The scorecard package more closely aligns with the SAS approach of prebinning the variable and combining the bins. The smbinning package as shown below uses the conditional inference tree approach.

The smbinning function inside the smbinning package is the primary function to bin continuous variables. Our data set has a variable bad that flags when an observation has a default. However, the smbinning function needs a variable that defines the people in our data set that did not have the event - those who did not default. Below we create this new good variable in our data set.
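A sketch of that step, applied to the full accepts data (which matches the counts printed below); good is simply the flipped bad flag.

# Create the non-event flag: good = 1 when the applicant did not default
accepts$good <- abs(accepts$bad - 1)
table(accepts$good)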

## 
##    0    1 
## 1196 4641

We also need to make the categorical variables in our data set into factor variables in R so the function will not automatically assume they are numeric just because they have numerical values. We can do this with the as.factor function.

Before any binning is done, we need to split our data into training and testing because the binning evaluates relationships between the target variable and the predictor variables. This is easily done in R using the sample function to sample row numbers. The size = option identifies the number of observations to be sampled. It was set as 75% of the number of rows in the dataset.

From there we can just identify the sampled rows as the training set and the remaining rows as testing set. The set.seed function is used to replicate the results.
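A sketch of that split, assuming the accepts data frame holds the accepted applicants; the seed value is arbitrary.

set.seed(12345)                                         # for reproducibility
train_id <- sample(seq_len(nrow(accepts)),
                   size = floor(0.75 * nrow(accepts)))  # 75% of the rows

train <- accepts[train_id, ]                            # sampled rows become training
test  <- accepts[-train_id, ]                           # remaining rows become testing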

Now we are ready to bin our variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option which defines the data frame for your data, the y = option that defines the target variable by name, and the x = option that defines the predictor variable to be binned by name.
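A sketch of the call that produces the output below; the object name result is arbitrary.

result <- smbinning(df = train, y = "good", x = "bureau_score")

result$ivtable   # summary of the bins
result$cuts      # the actual cut points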

##   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
## 1   <= 603    223     112    111       223        112       111 0.0509   0.5022
## 2   <= 662   1056     678    378      1279        790       489 0.2413   0.6420
## 3   <= 699    939     754    185      2218       1544       674 0.2145   0.8030
## 4   <= 717    514     440     74      2732       1984       748 0.1174   0.8560
## 5   <= 765    899     824     75      3631       2808       823 0.2054   0.9166
## 6    > 765    513     498     15      4144       3306       838 0.1172   0.9708
## 7  Missing    233     153     80      4377       3459       918 0.0532   0.6567
## 8    Total   4377    3459    918        NA         NA        NA 1.0000   0.7903
##   BadRate    Odds LnOdds     WoE     IV
## 1  0.4978  1.0090 0.0090 -1.3176 0.1167
## 2  0.3580  1.7937 0.5843 -0.7423 0.1602
## 3  0.1970  4.0757 1.4050  0.0785 0.0013
## 4  0.1440  5.9459 1.7827  0.4562 0.0213
## 5  0.0834 10.9867 2.3967  1.0701 0.1675
## 6  0.0292 33.2000 3.5025  2.1760 0.2777
## 7  0.3433  1.9125 0.6484 -0.6781 0.0291
## 8  0.2097  3.7680 1.3265  0.0000 0.7738
## [1] 603 662 699 717 765

The ivtable element contains a summary of the splits as well as some information regarding each split. Working from left to right, the columns represent the number of observations in each bin, the number of goods (non-defaulters) and bads (defaulters) in each bin, as well as the cumulative versions of all of the above. Next comes the percentage of observations that are in the bin as well as the good rate and bad rate within the bin. Finally, the table lists the odds, natural log of the odds, weight of evidence (WoE), and information value component. These last few are explained in the next section below.

The cut element in the smbinning object contains a vector of the actual split points for the bins.

The smbinning.plot function will make bar plots of some of the above metrics for each bin. Specifically, we plot the percentage of observations that are in each bin as well as the good rate and bad rate within each bin using the option = "dist", option = "goodrate", and option = "badrate" options respectively. The sub = option makes a subtitle for each plot.
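A sketch of those three plots for the bureau_score binning object created above; the subtitle text is arbitrary.

smbinning.plot(result, option = "dist",     sub = "Bureau Score")  # bin distribution
smbinning.plot(result, option = "goodrate", sub = "Bureau Score")  # good rate per bin
smbinning.plot(result, option = "badrate",  sub = "Bureau Score")  # bad rate per bin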

SAS

SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. However, before any binning is done, we need to split our data into training and testing because the binning evaluates relationships between the target variable and the predictor variables. This is easily done in SAS using the SURVEYSELECT procedure to randomly select observations. The data = option identifies the data we are interested in splitting. The method = srs option specifies that we want simple random sampling to split our data into training and testing. The out = option names the dataset that has the flagged observations. The samprate = 0.75 option specifies the proportion of observations to be sampled; it was set to 75%. The outall option is key to keep both the training and testing observations in the final data set.

From there we can just split the data using a DATA step. We create both the train and valid datasets using the if and else statements to put the selected observations (from PROC SURVEYSELECT) into the training set and the others in the testing set.

Now we are ready to bin our variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the DATA = option, which defines your data set, and the METHOD = TREE option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 100 option specifies that SAS can split the variable into at most 100 levels. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) that we will discuss in the next section.

PROC BINNING Results


Working from left to right, the columns represent the bin number, the lower and upper bounds for the bin, the width of the bin, the number of observations in each bin, and some summary statistics for each bin (mean, standard deviation, minimum, and maximum).

Weight of Evidence (WOE)

Weight of evidence (WOE) measures the strength of the attributes (bins) of a variable in separating events and non-events in a binary target variable. In credit scoring, that implies separating bad and good accounts respectively.

Weight of evidence is based on comparing the proportion of goods to bads at each bin level and is calculated as follows for each bin within a variable:

\[ WOE_i = \log(\frac{Dist.Good_i}{Dist.Bad_i}) \]

The distribution of goods for each bin is the number of goods in that bin divided by the total number of goods across all bins. The distribution of bads for each bin is the number of bads in that bin divided by the total number of bads across all bins. An example is shown below:

WOE Calculation Example


WOE summarizes the separation between events and non-events (bads and goods) as shown in the following table:

WOE Calculation Example


For WOE we are looking for big differences in WOE between bins. Ideally, we would like to see monotonic increases for variables that have ordered bins. This isn’t always required as long as the WOE pattern across the bins makes business sense. However, if a variable’s bins go back and forth between positive and negative WOE values, then the variable typically has trouble separating goods and bads. Graphically, the WOE values for all the bins in the bureau_score variable look as follows:

A WOE of approximately zero implies the percentage of non-events (goods) is approximately equal to the percentage of events (bads), so that bin doesn’t do a good job of separating events and non-events. A positive WOE value implies the bin identifies observations that are non-events (goods), while a negative WOE value implies the bin identifies observations that are events (bads).

One quick side note. WOE values can take a value of infinity or negative infinity when quasi-complete separation exists in a category (zero events or non-events). Some people adjust the WOE calculation to include a small smoothing parameter to make the numerator or denominator of the WOE calculation not equal to zero.
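To make the smoothing idea concrete, here is a small base-R sketch with hypothetical bin counts where the last bin has zero bads; the 0.5 constant is one common (but arbitrary) choice of smoothing parameter.

# Hypothetical bin counts; the last bin has zero bads (quasi-complete separation)
goods <- c(110, 680, 750, 440, 820, 500)
bads  <- c(110, 380, 185,  75,  75,   0)

# Unadjusted WOE: the last bin is +Inf because its distribution of bads is zero
woe_raw <- log((goods / sum(goods)) / (bads / sum(bads)))

# Smoothed WOE: add a small constant to every bin's counts so neither
# distribution can be exactly zero
woe_adj <- log(((goods + 0.5) / sum(goods + 0.5)) /
               ((bads  + 0.5) / sum(bads  + 0.5)))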

Let’s see how to get the weight of evidence values in each of our software packages!

R

The smbinning function inside the smbinning package is the primary function to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option which defines the data frame for your data, the y = option that defines the target variable by name, and the x = option that defines the predictor variable to be binned by name.

As we previously saw, the ivtable element contains a summary of the splits as well as some information regarding each split including weight of evidence.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
1   <= 603    223     112    111       223        112       111 0.0509   0.5022
2   <= 662   1056     678    378      1279        790       489 0.2413   0.6420
3   <= 699    939     754    185      2218       1544       674 0.2145   0.8030
4   <= 717    514     440     74      2732       1984       748 0.1174   0.8560
5   <= 765    899     824     75      3631       2808       823 0.2054   0.9166
6    > 765    513     498     15      4144       3306       838 0.1172   0.9708
7  Missing    233     153     80      4377       3459       918 0.0532   0.6567
8    Total   4377    3459    918        NA         NA        NA 1.0000   0.7903
  BadRate    Odds LnOdds     WoE     IV
1  0.4978  1.0090 0.0090 -1.3176 0.1167
2  0.3580  1.7937 0.5843 -0.7423 0.1602
3  0.1970  4.0757 1.4050  0.0785 0.0013
4  0.1440  5.9459 1.7827  0.4562 0.0213
5  0.0834 10.9867 2.3967  1.0701 0.1675
6  0.0292 33.2000 3.5025  2.1760 0.2777
7  0.3433  1.9125 0.6484 -0.6781 0.0291
8  0.2097  3.7680 1.3265  0.0000 0.7738

The weight of evidence values are listed in the WoE column and are the same values as shown above. We can easily plot the WOE values using the smbinning.plot function with the option = "WoE" option. The resulting plot is the same as the WOE plot above.

You can get weight of evidence values for a factor variable as well without needing to rebin the values. This is done using the smbinning.factor function on the purpose variable as shown below:
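A sketch of that call, which produces the table below; the object name is arbitrary.

result_purpose <- smbinning.factor(df = train, y = "good", x = "purpose")
result_purpose$ivtable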

   Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec
1 = 'LEASE'   1466    1149    317      1466       1149       317 0.3349
2  = 'LOAN'   2911    2310    601      4377       3459       918 0.6651
3   Missing      0       0      0      4377       3459       918 0.0000
4     Total   4377    3459    918        NA         NA        NA 1.0000
  GoodRate BadRate   Odds LnOdds     WoE     IV
1   0.7838  0.2162 3.6246 1.2877 -0.0388 0.0005
2   0.7935  0.2065 3.8436 1.3464  0.0199 0.0003
3      NaN     NaN    NaN    NaN     NaN    NaN
4   0.7903  0.2097 3.7680 1.3265  0.0000 0.0008

SAS

SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the DATA = option, which defines your data set, and the METHOD = TREE option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 100 option specifies that SAS can split the variable into at most 100 levels. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) as we need to output the specific points of the splits for the bins using the BinDetails = option and the number of bins using the VarTransInfo = option.

From there we use the DATA step to create a MACRO variable called numbin in SAS to define the number of bins using the CALL SYMPUT functionality. To get a dataset that contains the values where the bins separate we need to use PROC SQL where we select the variable Max from the bincuts dataset created from PROC BINNING. We place the values of the Max variable into a MACRO variable called cuts.

Lastly, we reuse PROC BINNING. We can calculate the WOE values using the woe option. However, the woe option can only be used when the bins are defined by the user, which is why we needed the optimal bins calculated first before getting the WOE values. The only difference for the second instance of PROC BINNING is defining the value of the target variable that is an event in the TARGET statement using the event = option.

PROC BINNING Results


You can get weight of evidence values for a factor variable as well without needing to rebin the values. This is done by calculating the WOE values ourselves using the TABULATE and TRANSPOSE procedures on the purpose variable as shown below:

PROC SMBINNING Results


Information Value

Weight of evidence summarizes the individual categories or bins of a variable. However, we need a measure of how well all the categories in a variable do at separating the events from non-events. That is what information value (IV) is for. IV uses the WOE from each category as a piece of its calculation:

\[ IV = \sum_{i=1}^L (Dist.Good_i - Dist.Bad_i)\times \log(\frac{Dist.Good_i}{Dist.Bad_i}) \]

In credit modeling, IV is used in some instances to actually select which variables belong in the model. Here are some typical IV ranges for determining the strength of a predictor variable at predicting the target variable:

  • \(0 \le IV < 0.02\) - Not a predictor
  • \(0.02 \le IV < 0.1\) - Weak predictor
  • \(0.1 \le IV < 0.25\) - Moderate (medium) predictor
  • \(0.25 \le IV\) - Strong predictor

Variables with information values greater than 0.1 are typically used in credit modeling.

Some resources will say that IV values greater than 0.5 might signal over-predicting of the target. In other words, maybe the variable is too strong of a predictor because of how that variable was originally constructed. For example, if all previous loan decisions have been made only on bureau score, then of course that variable would be highly predictive and possibly the only significant variable. In these situations, good practice is to build one model with only bureau score and another model without bureau score but with other important factors. We then ensemble these models together.

Let’s see how to get the information values for our variables in each of our softwares!

R

The smbinning function inside the smbinning package is the primary function to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the smbinning function. The three main inputs to the smbinning function are the df = option which defines the data frame for your data, the y = option that defines the target variable by name, and the x = option that defines the predictor variable to be binned by name.

As we previously saw, the ivtable element contains a summary of the splits as well as some information regarding each split including information value.

  Cutpoint CntRec CntGood CntBad CntCumRec CntCumGood CntCumBad PctRec GoodRate
1   <= 603    223     112    111       223        112       111 0.0509   0.5022
2   <= 662   1056     678    378      1279        790       489 0.2413   0.6420
3   <= 699    939     754    185      2218       1544       674 0.2145   0.8030
4   <= 717    514     440     74      2732       1984       748 0.1174   0.8560
5   <= 765    899     824     75      3631       2808       823 0.2054   0.9166
6    > 765    513     498     15      4144       3306       838 0.1172   0.9708
7  Missing    233     153     80      4377       3459       918 0.0532   0.6567
8    Total   4377    3459    918        NA         NA        NA 1.0000   0.7903
  BadRate    Odds LnOdds     WoE     IV
1  0.4978  1.0090 0.0090 -1.3176 0.1167
2  0.3580  1.7937 0.5843 -0.7423 0.1602
3  0.1970  4.0757 1.4050  0.0785 0.0013
4  0.1440  5.9459 1.7827  0.4562 0.0213
5  0.0834 10.9867 2.3967  1.0701 0.1675
6  0.0292 33.2000 3.5025  2.1760 0.2777
7  0.3433  1.9125 0.6484 -0.6781 0.0291
8  0.2097  3.7680 1.3265  0.0000 0.7738

The information value is listed in the IV column. The IV numbers in the rows for the individual bins are the IV components contributed by each category, and the final row is their sum, which is the overall IV for the variable.

Another way to view the information value for every variable in the dataset is to use the smbinning.sumiv function. The only two inputs to this function are the data = option where you define the dataset and the y = option to define the target variable. The function then calculates the IV for each variable in the dataset with the target variable.

To view these information values for each variable we can just print out a table of the results by calling the object by name. We can also use the smbinning.sumiv.plot function on the object to view them in a plot:
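A sketch of those two steps; the object name iv_summary is arbitrary.

iv_summary <- smbinning.sumiv(train, y = "good")  # IV for every variable vs. good

iv_summary                        # printed table of IV values (shown below)
smbinning.sumiv.plot(iv_summary)  # bar chart of the same information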

            Char     IV               Process
11  bureau_score 0.7738    Numeric binning OK
9   tot_rev_line 0.3987    Numeric binning OK
10      rev_util 0.3007    Numeric binning OK
5  age_oldest_tr 0.2512    Numeric binning OK
3      tot_derog 0.2443    Numeric binning OK
18           ltv 0.1456    Numeric binning OK
4         tot_tr 0.1304    Numeric binning OK
14      down_pyt 0.0848    Numeric binning OK
8   tot_rev_debt 0.0782    Numeric binning OK
19    tot_income 0.0512    Numeric binning OK
16     loan_term 0.0496    Numeric binning OK
13          msrp 0.0360    Numeric binning OK
12   purch_price 0.0204    Numeric binning OK
20      used_ind 0.0183     Factor binning OK
1     bankruptcy 0.0009     Factor binning OK
15       purpose 0.0008     Factor binning OK
2         app_id     NA No significant splits
6    tot_open_tr     NA No significant splits
7     tot_rev_tr     NA No significant splits
17      loan_amt     NA No significant splits
21           bad     NA    Uniques values < 5
22        weight     NA    Uniques values < 5

As we can see from the output above, the strong predictors of default are bureau_score, tot_rev_line, and rev_util. The moderate or medium predictors are age_oldest_tr, tot_derog, ltv, and tot_tr. These would be the variables typically used in credit modeling due to having IV scores above 0.1.

SAS

SAS takes the approach of prebinning and then combining the bins statistically to bin continuous variables. Let’s go through an example of binning the bureau_score variable using the BINNING procedure. The main options on the PROC BINNING statement are the DATA = option, which defines your data set, and the METHOD = TREE option, which specifies the tree-based approach to binning. The initbin = 100 option specifies how many initial bins to split the variable into. The maxnbins = 100 option specifies that SAS can split the variable into at most 100 levels. The TARGET statement defines the target variable. The INPUT statement defines the variable we are binning, bureau_score. The ODS statement is used for the calculation of weight of evidence (WoE) as we need to output the specific points of the splits for the bins using the BinDetails = option and the number of bins using the VarTransInfo = option.

From there we use the DATA step to create a MACRO variable called numbin in SAS to define the number of bins using the CALL SYMPUT functionality. To get a dataset that contains the values where the bins separate we need to use PROC SQL where we select the variable Max from the bincuts dataset created from PROC BINNING. We place the values of the Max variable into a MACRO variable called cuts.

Lastly, we reuse PROC BINNING. We can calculate the WOE values and information value using the woe option. However, the woe option can only be used when the bins are defined by the user, which is why we needed the optimal bins calculated first before getting the WOE values. The only difference for the second instance of PROC BINNING is defining the value of the target variable that is an event in the TARGET statement using the event = option.

PROC BINNING Results


Gini Statistic

The Gini statistic is an optional technique that tries to answer the same question as information value - which variables are strong enough to enter the scorecard model. Since information value is more in line with weight of evidence calculations it is used much more often in practice.

The Gini statistic ranges between 0 and 100, where bigger values are better. A majority of the time the Gini statistic and IV will agree on variable importance, but they might differ on borderline cases. This more complicated statistic is calculated as follows:

\[ Gini = \left(1 - \frac{2\sum_{i=2}^L\left(n_{i,event} \times \sum_{j=1}^{i-1}n_{j,non-event}\right)+\sum_{i=1}^L\left(n_{i,event} \times n_{i,non-event}\right)}{N_{event}\times N_{non-event}}\right) \times 100 \]
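As a sketch, the statistic can be computed directly from ordered bin-level counts; the function name and inputs are hypothetical and simply follow the formula above.

# n_event / n_nonevent: counts of events and non-events per (ordered) bin
gini_stat <- function(n_event, n_nonevent) {
  L <- length(n_event)
  cum_nonevent <- cumsum(n_nonevent)
  # sum over bins 2..L of n_event[i] times the non-events in bins 1..(i-1)
  cross_term  <- sum(n_event[-1] * cum_nonevent[-L])
  within_term <- sum(n_event * n_nonevent)
  (1 - (2 * cross_term + within_term) /
         (sum(n_event) * sum(n_nonevent))) * 100
}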

Scorecard Creation

Now that we have transformed our variables for modeling, we can start with the process of building our model. In building credit models, we first build an initial credit scoring model. From there we will incorporate our rejected customers through reject inference to build our full, final credit score model.

Initial Scorecard Creation

In each of the models that we build we must take the following three steps:

  1. Build the model
  2. Evaluate the model
  3. Convert the model to a scorecard

Building the Model

The scorecard is typically based on a logistic regression model:

\[ logit(p) = \log(\frac{p}{1-p}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k \]

In the above equation, \(p\) is the probability of default given the inputs in the model. However, instead of using the original variables for the model, credit scoring models and their complementing scorecards are based on binned variables as their foundation. Instead of treating the binned variables as categorical, the values of the bins are replaced by the WOE values for the categories:
WOE Values for Bureau Score


In other words, if a person has a bureau score of 705, then their observation takes the value of 0.46 as seen in the table above. These inputs are still treated as continuous even though they only take a limited number of values. Additionally, the variables are all on the same scale. You can think of the transformation we performed as scaling all the variables based on their predictive ability for the target variable.

Since all the variables are on the same scale, the \(\beta\) coefficients from the logistic regression model now serve as variable importance measures. These coefficients are actually the only thing we desire to gain from the logistic regression as they help define the point schemes for the scorecard.

Let’s see how to build our model in our software!

R

Since most credit models are built off variables with information values of at least 0.1, the following R code takes all the continuous variables in your dataset, uses the smbinning function to bin the variables, and stores all the results in a list called result_all_sig only for variables with \(IV \ge 0.1\).
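A sketch of that idea, assuming the continuous predictors of interest are listed by name; smbinning returns a character message instead of a list when it cannot bin a variable, which is why the is.list check is there.

num_vars <- c("tot_derog", "tot_tr", "age_oldest_tr", "tot_open_tr", "tot_rev_tr",
              "tot_rev_debt", "tot_rev_line", "rev_util", "bureau_score",
              "purch_price", "msrp", "down_pyt", "loan_term", "loan_amt",
              "ltv", "tot_income")      # assumed list of continuous inputs

result_all_sig <- list()
for (v in num_vars) {
  check_res <- smbinning(df = train, y = "good", x = v)
  if (is.list(check_res) && check_res$iv >= 0.1) {  # keep only variables with IV >= 0.1
    result_all_sig[[v]] <- check_res
  }
}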

The smbinning.gen function will create binned, factor variables in R based on the results from the smbinning function. The df = option defines the dataset. The ivout = option defines the specific results for the variable of interest, as this function can only be applied to one variable at a time. The chrname = option defines the name of the new binned variable in your dataset. The following example uses the bureau_score variable.
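A sketch of that call; the new column name bureau_score_bin is an assumption for illustration.

train <- smbinning.gen(df = train, ivout = result_all_sig$bureau_score,
                       chrname = "bureau_score_bin")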

To create a variable that has WOE values instead of the binned values from the smbinning function you need to use a loop. The following R code is an example for bureau_score that takes the newly created binned variable from smbinning.gen and creates a new variable where the WOE values are used for each observation instead of the binned values.
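One possible sketch of that idea, which maps the raw bureau_score values to their WOE through the stored cut points rather than looping over the generated factor labels; missing values receive the WOE of the Missing bin.

iv_tab   <- result_all_sig$bureau_score$ivtable
cuts     <- result_all_sig$bureau_score$cuts            # e.g. 603 662 699 717 765
woe_bins <- iv_tab$WoE[1:(length(cuts) + 1)]            # WOE of each non-missing bin
woe_miss <- iv_tab$WoE[iv_tab$Cutpoint == "Missing"]    # WOE of the Missing bin

# Bins are of the form (cut[i-1], cut[i]], so left-open intervals are used
bin_index <- findInterval(train$bureau_score, cuts, left.open = TRUE) + 1
train$bureau_score_WOE <- ifelse(is.na(train$bureau_score),
                                 woe_miss, woe_bins[bin_index])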

We would then repeat the process for all of our variables in the dataset. The following R code loops through all the variables in the result_all_sig object and does just that.

Now that we have created our variables that are replaced with their WOE values, we can build our logistic regression model. The glm function in R will provide us the ability to model binary logistic regressions. Similar to most modeling functions in R, you can specify a model formula. The family = binomial(link = "logit") option is there to specify that we are building a logistic model. Generalized linear models (GLMs) are a general class of models; logistic regression is the special case where the link function is the logit function. Use the summary function to look at the necessary output.
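A sketch of that model fit, matching the Call shown in the output below; the object name initial_score is arbitrary, and the weight variable adjusts for the oversampled defaults.

initial_score <- glm(bad ~ tot_derog_WOE + tot_tr_WOE + age_oldest_tr_WOE +
                       tot_rev_line_WOE + rev_util_WOE + bureau_score_WOE + ltv_WOE,
                     data = train, weights = train$weight, family = "binomial")

summary(initial_score)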


Call:
glm(formula = bad ~ tot_derog_WOE + tot_tr_WOE + age_oldest_tr_WOE + 
    tot_rev_line_WOE + rev_util_WOE + bureau_score_WOE + ltv_WOE, 
    family = "binomial", data = train, weights = train$weight)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6982  -0.7675  -0.4499  -0.1756   3.3564  

Coefficients:
                  Estimate Std. Error z value             Pr(>|z|)    
(Intercept)       -2.90953    0.03942 -73.813 < 0.0000000000000002 ***
tot_derog_WOE     -0.10407    0.08130  -1.280               0.2005    
tot_tr_WOE        -0.03890    0.13225  -0.294               0.7687    
age_oldest_tr_WOE -0.38879    0.09640  -4.033            0.0000551 ***
tot_rev_line_WOE  -0.33535    0.08381  -4.001            0.0000631 ***
rev_util_WOE      -0.18548    0.08020  -2.313               0.0207 *  
bureau_score_WOE  -0.81875    0.05659 -14.468 < 0.0000000000000002 ***
ltv_WOE           -0.93623    0.10052  -9.314 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7182.6  on 4376  degrees of freedom
Residual deviance: 6363.4  on 4369  degrees of freedom
AIC: 6470.9

Number of Fisher Scoring iterations: 6

Let’s examine the output above. Scanning down the output, you can see the actual logistic regression equation itself for each of the variables. Again, credit models are typically built with all variables that have information values of at least 0.1 regardless of their significance in the model. However, at a reasonable significance level (we used a 0.005 significance level for this analysis based on the sample size) it appears that the variables tot_derog_WOE, tot_tr_WOE, and rev_util_WOE are not significant. You can easily perform variable selection based on other metrics like BIC, significance level, etc.

Model Evaluation

Credit models are evaluated like most classification models. Overall model performance is typically evaluated on the area under the ROC curve as well as the K-S statistic.

Let’s see how to perform this in our software!

R

Luckily, the smbinning package has great functionality for evaluating model performance. The smbinning.metrics function provides many summary statistics and plots to evaluate our models. First, we must get the predictions from our model by creating a new variable pred in our dataset from the fitted.values element of our glm model object. This new pred variable is one of the inputs of the smbinning.metrics function. The dataset = option defines our dataset. The prediction = option is where we define the variable in the dataset with the predictions from our model. The actualclass = option defines the target variable from our dataset. The report = 1 option prints out a report with a variety of summary statistics as shown below:
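A sketch of that call on the training data, which produces the report below.

train$pred <- initial_score$fitted.values   # predicted probability of default

smbinning.metrics(dataset = train, prediction = "pred",
                  actualclass = "bad", report = 1)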


  Overall Performance Metrics 
  -------------------------------------------------- 
                    KS : 0.4063 (Good)
                   AUC : 0.7638 (Fair)

  Classification Matrix 
  -------------------------------------------------- 
           Cutoff (>=) : 0.0617 (Optimal)
   True Positives (TP) : 653
  False Positives (FP) : 1055
  False Negatives (FN) : 265
   True Negatives (TN) : 2404
   Total Positives (P) : 918
   Total Negatives (N) : 3459

  Business/Performance Metrics 
  -------------------------------------------------- 
      %Records>=Cutoff : 0.3902
             Good Rate : 0.3823 (Vs 0.2097 Overall)
              Bad Rate : 0.6177 (Vs 0.7903 Overall)
        Accuracy (ACC) : 0.6984
     Sensitivity (TPR) : 0.7113
 False Neg. Rate (FNR) : 0.2887
 False Pos. Rate (FPR) : 0.3050
     Specificity (TNR) : 0.6950
       Precision (PPV) : 0.3823
  False Discovery Rate : 0.6177
    False Omision Rate : 0.0993
  Inv. Precision (NPV) : 0.9007

  Note: 0 rows deleted due to missing data.

The report provides multiple pieces of model evaluation. At the top it provides the KS and AUC metrics for the model. Next, the report summarizes metrics from the classification matrix. At the top of this section it provides the optimal cut-off level based on the Youden J statistic. At this cut-off it provides the number of true positives, false positives, true negatives, false negatives, total positives, and total negatives. The last section of the report provides many business performance metrics such as sensitivity, specificity, precision, and many more as seen above.

By using the plot = option in the smbinning.metrics function you can plot either the KS plot or ROC curve.

We can perform the same evaluation of our initial model on the testing dataset as well. We need to create our WOE variables in our testing dataset which is easy to do with the smbinning.gen function on the test dataset. Remember, we are just scoring the test dataset so we do not want to build new bins, just create the same ones from our training in the test set. By using the same looping process as above we can create our variables. We then use the predict function on the test dataset to get the predictions. The same smbinning.metrics function is used to graph and report metrics for the testing set predictions.
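A sketch of the scoring and evaluation on the test set, assuming the *_WOE variables have already been created in test with the training bins.

test$pred <- predict(initial_score, newdata = test, type = "response")

smbinning.metrics(dataset = test, prediction = "pred",
                  actualclass = "bad", report = 1)            # report shown below
smbinning.metrics(dataset = test, prediction = "pred",
                  actualclass = "bad", report = 0, plot = "ks")  # K-S plot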


  Overall Performance Metrics 
  -------------------------------------------------- 
                    KS : 0.4589 (Good)
                   AUC : 0.7798 (Fair)

  Classification Matrix 
  -------------------------------------------------- 
           Cutoff (>=) : 0.0577 (Optimal)
   True Positives (TP) : 216
  False Positives (FP) : 376
  False Negatives (FN) : 62
   True Negatives (TN) : 806
   Total Positives (P) : 278
   Total Negatives (N) : 1182

  Business/Performance Metrics 
  -------------------------------------------------- 
      %Records>=Cutoff : 0.4055
             Good Rate : 0.3649 (Vs 0.1904 Overall)
              Bad Rate : 0.6351 (Vs 0.8096 Overall)
        Accuracy (ACC) : 0.7000
     Sensitivity (TPR) : 0.7770
 False Neg. Rate (FNR) : 0.2230
 False Pos. Rate (FPR) : 0.3181
     Specificity (TNR) : 0.6819
       Precision (PPV) : 0.3649
  False Discovery Rate : 0.6351
    False Omision Rate : 0.0714
  Inv. Precision (NPV) : 0.9286

  Note: 0 rows deleted due to missing data.

Scaling the Scorecard

The last step of the credit modeling process is building the scorecard itself. To create the scorecard we need to relate the predicted odds from our logistic regression model to the scorecard. The relationship between the odds and scores is represented by a linear function:

\[ Score = Offset + Factor \times \log(odds) \]

All that we need to define is the amount of points to double the odds (called PDO) and the corresponding scorecard points. From there we have the following extra equation:

\[ Score + PDO = Offset + Factor \times \log(2 \times odds) \]

Through some basic algebra, the solution to the \(Factor\) and \(Offset\) is shown to be:

\[ Factor = \frac{PDO}{\log(2)} \]

\[ Offset = Score - Factor \times \log(odds) \]

For example, suppose a scorecard were scaled so that the developer wanted odds of 50:1 at 600 points with PDO = 20. Through the above equations we calculate \(Factor = 28.85\) and \(Offset = 487.12\). Therefore, the corresponding score for each predicted odds from the logistic regression model is calculated as:

\[ Score = 487.12 + 28.85\times \log(odds) \]

For this example, we would then calculate the score for each individual in our dataset. Notice how the above equation has the \(\log(odds)\) which is the prediction from a logistic regression model \(\log(odds) = \hat{\beta}_0 + \hat{\beta}_1 x_1 \cdots\). This is one of the reasons logistic regression is still very popular in the credit modeling world.
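A quick sketch verifying the arithmetic in this example (50:1 odds at 600 points, PDO of 20).

pdo          <- 20
target_score <- 600
target_odds  <- 50

fact <- pdo / log(2)                             # Factor = 28.8539
os   <- target_score - fact * log(target_odds)   # Offset = 487.1229

# Score for any predicted log(odds) from the logistic regression:
# score <- os + fact * log_odds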

The next step in the scorecard is to allocate the scorecard points to each of the categories (bins) in each of the variables. The points that are allocated to the \(i^{th}\) bin of variable \(j\) are computed as follows:

\[ Points_{i,j} = -(WOE_{i,j} \times \hat{\beta}_j + \frac{\hat{\beta}_0}{L}) \times Factor + \frac{Offset}{L} \]

The \(WOE_{i,j}\) is the weight of evidence of the \(i^{th}\) bin of variable \(j\). The coefficient of the variable \(j\), \(\hat{\beta}_j\), as well as the intercept \(\hat{\beta}_0\), come from the logistic regression model. \(L\) is the number of variables in the model. With the \(Factor\) and \(Offset\) defined above as well as the bureau_score coefficient of \(-0.81875\) and intercept of \(-2.90953\) we calculate the points for each category of bureau_score as follows:

WOE Values for Bureau Score


The other variables in the dataset would go through a similar process to build out the full scorecard.

Let’s see how to do this in our software!

R

Since we have variables with WOE values for each variable in the dataset, allocating the points to each category of the variable is easy to do. We just use a for loop to perform the above calculations for each variable:
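A sketch of that loop, assuming the WOE inputs end in _WOE and the fitted glm object is called initial_score as above; Factor and Offset use the 600-point, 50:1, PDO-20 scaling from the previous section.

pdo <- 20; target_score <- 600; target_odds <- 50
fact <- pdo / log(2)
os   <- target_score - fact * log(target_odds)

woe_vars <- names(coef(initial_score))[-1]          # the *_WOE inputs
beta0    <- coef(initial_score)["(Intercept)"]
L        <- length(woe_vars)

# Allocate points per variable: Points = -(WOE*beta + beta0/L)*Factor + Offset/L
for (v in woe_vars) {
  points_name <- sub("_WOE$", "_points", v)
  train[[points_name]] <- -(train[[v]] * coef(initial_score)[v] + beta0 / L) * fact + os / L
}

# Total scorecard score is the sum of the points across all variables
train$Score <- rowSums(train[, sub("_WOE$", "_points", woe_vars)])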

The above code also calculates a Score variable for our training data set that sums up all the points for each observation. We can do this same process for the testing dataset as well.

Lastly, we can view the distribution of all of the scorecard values for each observation in our dataset across both the training and testing datasets. This gives us a visual of the range of values to expect from our scorecard model.

Reject Inference

The previous scorecard that we have built is a behavioral scorecard because it models the behavior of current customers. However, it doesn’t fully capture the effects of applicants because it is only based on current customers who had approved applications. We still have customers who were rejected for loans. Reject inference is the process of inferring the status of the rejected applicants based on the accepted applicants model (behavioral model) in an attempt to use their information to build a scorecard that is representative of the entire applicant population. Reject inference is about solving sample bias so that the development sample is similar to the population to which the scorecard will be applied. Scorecards using reject inference are referred to as application scorecards since they more closely mimic the “through-the-door” population. Reject inference also helps comply with regulatory requirements like the ones provided by the FDIC and Basel Accords.

There are three common techniques for reject inference:

  1. Hard Cut-off Augmentation
  2. Parceling Augmentation
  3. Fuzzy Augmentation

Hard Cut-off Augmentation

The hard cut-off augmentation essentially scores all the rejected individuals using the behavioral scorecard model and infers whether the rejected individuals would have defaulted based on some predetermined cut-off score. The following are the steps to perform the hard cut-off augmentation method for reject inference:

  1. Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
  2. Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
  3. If the rejected applicants to accepted applicants ratio doesn’t match the population ratio, then create weighted cases for the rejected applicants. Similar to rare event modeling in classification models, we want to adjust the number of sampled rejects in comparison to our sampled accepts to accurately reflect the number of rejects in comparison to accepts from the population.
  4. Set a cut-off score level above which an applicant is deemed a non-defaulter and below which an applicant is deemed a defaulter.
  5. Add inferred defaulters and non-defaulters with known defaulters and non-defaulters and rebuild the scorecard.

Let’s see how to do this in our software!

R

Before scoring our rejected applicants, we need to perform the same data transformations on the reject dataset as we did on the accepts dataset. The following R code generates the same bins for the rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant’s scorecard score.

Next, we just use the predict function to score the reject dataset. The first input is the model object from our behavioral model. The newdata = option defines the reject dataset that we need to score. The type = option defines that we want the predicted probability of default for each observation in the reject dataset. The next two lines of code create a bad and good variable in the rejects dataset based on the optimal cut-off defined in the previous section. The next few lines calculate the new weight for the observations in our data set accounting both for the rare event sampling as well as the accepts to rejects ratio in the population. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model.
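A sketch of those steps, assuming the transformed rejects data is called rejects_scored; the 0.0617 cut-off is the optimal training cut-off reported earlier, and the weight assignment is left as a placeholder since it depends on the population accepts-to-rejects ratio.

rejects_scored$pred <- predict(initial_score, newdata = rejects_scored,
                               type = "response")

# Hard cut-off: infer default status from the optimal training cut-off
rejects_scored$bad  <- as.numeric(rejects_scored$pred > 0.0617)
rejects_scored$good <- 1 - rejects_scored$bad

# Placeholder weight; the real value corrects for the oversampled defaults and the
# population accepts-to-rejects ratio (step 3 above)
rejects_scored$weight <- 1

# Combine inferred rejects with the accepted applicants to rebuild the scorecard
comb_hard <- rbind(accepts, rejects_scored[, names(accepts)])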

Parceling Augmentation

The parceling augmentation essentially scores all the rejected individuals using the behavioral scorecard model. However, instead of using a single cut-off, the parceling method splits the predicted scores into buckets (or parcels). The observations in these buckets are randomly assigned to default or non-default based on the bucket’s rate of default in the accepts sample. The following are the steps to perform the parceling augmentation method for reject inference:

  1. Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
  2. Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
  3. If the rejected applicants to accepted applicants ratio doesn’t match the population ratio, then create weighted cases for the rejected applicants. Similar to rare event modeling in classification models, we want to adjust the number of sampled rejects in comparison to our sampled accepts to accurately reflect the number of rejects in comparison to accepts from the population.
  4. Define score ranges manually or automatically with simple bucketing.
  5. The inferred default status of the rejected applicants will be assigned randomly and proportionally to the number of defaulters and non-defaulters in the accepted sample within each score range.
  6. (OPTIONAL) If desired, apply an event rate increase factor to the probability of default for each bucket to increase the proportion of defaulters among the rejects.
  7. Add the inferred defaulters and non-defaulters back in with the known defaulters and non-defaulters and rebuild the scorecard.

The chart below goes through an example for a bucket between the scores of 655 and 665.

Parceling Augmentation Example


Let’s see how to do this in our software!

R

Before scoring our rejected applicants, we need to perform the same data transformations on the reject dataset as we did on the accepts dataset. The R code in the hard cut-off section generates the same bins for the rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant’s scorecard score.

Next, we use the seq function to create buckets between 500 and 725 in steps of 25. We then use the cut function to split the scored observations from each of the accepts and rejects datasets into these buckets. We then use the table function to calculate the default rate of the accepts dataset in each bucket. We apply the optional event rate increase of 25% from step 6 above. Next, we loop through each bucket and randomly assign defaulters and non-defaulters based on the accepts default rate in that bucket and the added adjustment. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model.
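A sketch of those steps, assuming accepts_scored and rejects_scored are the accepts and rejects data with the scorecard Score already attached; the seed is only for reproducibility of the random assignment.

parc_breaks <- seq(500, 725, 25)

accepts_scored$Score_parc <- cut(accepts_scored$Score, breaks = parc_breaks)
rejects_scored$Score_parc <- cut(rejects_scored$Score, breaks = parc_breaks)

# Default rate of the accepts within each bucket, bumped by the optional 25% increase
bucket_tab <- table(accepts_scored$Score_parc, accepts_scored$bad)
parc_rate  <- bucket_tab[, "1"] / rowSums(bucket_tab)
bump       <- 1.25

# Randomly assign inferred default status to the rejects, bucket by bucket
set.seed(12345)
rejects_scored$bad <- 0
for (i in seq_along(parc_rate)) {
  in_bucket <- which(as.numeric(rejects_scored$Score_parc) == i)
  if (length(in_bucket) == 0 || is.nan(parc_rate[i])) next   # skip empty buckets
  rejects_scored$bad[in_bucket] <- rbinom(length(in_bucket), size = 1,
                                          prob = min(parc_rate[i] * bump, 1))
}
rejects_scored$good <- 1 - rejects_scored$bad

table(accepts_scored$Score_parc, accepts_scored$bad)   # accepts by bucket (first table)
table(rejects_scored$Score_parc, rejects_scored$bad)   # inferred rejects (second table)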

           
               0    1
  (500,525]   46   66
  (525,550]  630  478
  (550,575] 1166  410
  (575,600] 1161  178
  (600,625]  840   52
  (625,650]  555   12
  (650,675]  243    0
  (675,700]    0    0
  (700,725]    0    0
           
              0   1
  (500,525]   5  14
  (525,550] 345 392
  (550,575] 976 489
  (575,600] 848 156
  (600,625] 543  39
  (625,650] 295  12
  (650,675] 119   0
  (675,700]   0   0
  (700,725]   0   0
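
Below is a minimal sketch of the parceling logic just described, assuming the accepts and rejects have already been scored; the object names (accepts_scored, rejects_scored), the use of rbinom for the random assignment, and the uncapped 25% bump are illustrative choices rather than the exact code used here.

# Parceling augmentation -- a sketch; object names are placeholders
breaks <- seq(500, 725, by = 25)
accepts_scored$bucket <- cut(accepts_scored$Score, breaks = breaks)
rejects_scored$bucket <- cut(rejects_scored$Score, breaks = breaks)

# Default rate of the accepts in each bucket, with the optional 25% event-rate bump
bucket_tab <- table(accepts_scored$bucket, accepts_scored$bad)
bad_rate   <- pmin(1, (bucket_tab[, "1"] / rowSums(bucket_tab)) * 1.25)

# Randomly assign inferred status to the rejects in proportion to each bucket's bad rate
rejects_scored$bad <- NA
for (b in levels(rejects_scored$bucket)) {
  idx <- which(rejects_scored$bucket == b)
  rejects_scored$bad[idx] <- rbinom(length(idx), size = 1, prob = bad_rate[b])
}

# Combine with the accepted applicants and rebuild the scorecard
comb_parc <- rbind(accepts_scored[, names(rejects_scored)], rejects_scored)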

Fuzzy Augmentation

The fuzzy augmentation essentially scores all the rejected individuals using the behavioral scorecard model. It then creates two observations for each observation in the reject dataset. One observation is assigned as a defaulter, while the other is assigned as a non-defaulter. These observations are then weighted based on the probability of default from the behavioral scorecard. The following are the steps to perform the fuzzy augmentation method for reject inference:

  1. Build a behavioral scorecard model using the known defaulters and non-defaulters from the accepted applicants.
  2. Score the rejected applications with the behavioral scorecard model to obtain each rejected applicant’s probability of default and their score on the scorecard model.
  3. Do not assign a reject to a default or non-default status. Instead, create two weighted cases for each rejected applicant using the probability of default and the probability of non-default, respectively.
  4. Multiply the probability of default and the probability of non-default by the user-specified rejection rate to form frequency variables.
  5. For each rejected applicant, create two observations - one observation has a frequency variable (rejection rate \(\times\) probability of default) and a target class of default; the other observation has a frequency variable (rejection rate \(\times\) probability of non-default) and a target class of non-default.
  6. Add inferred defaulters and non-defaulters back in with the known defaulters and non-defaulters and rebuild the scorecard.

Let’s see how to do this in our software!

R

Before scoring our rejected applicants, we need to perform the same data transformations on the reject dataset as we did on the accepts dataset. The R code in the hard cut-off section generates the same bins for our rejects dataset variables that we had in the accepts dataset so we can score these new observations. It also calculates each applicant’s scorecard score.

Next, we use the predict function to score the reject dataset. The first input is the model object from our behavioral model. The newdata = option defines the reject dataset that we need to score. The type = option specifies that we want the predicted probability of default for each observation in the reject dataset. The next two lines of code create a non-defaulter and a defaulter version of the reject dataset. In the non-defaulter version of the rejects dataset, the target variable is set to non-default for all observations and the weight is calculated as described above. The opposite is done for the defaulter version of the rejects dataset. Lastly, we combine our newly inferred rejected observations with our original accepted applicants for rebuilding our credit scoring model. A sketch of this process is shown below.
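
The snippet below is a minimal sketch of the fuzzy augmentation steps, not the exact code used here; behavior_model, rejects_woe, accepts_woe, and reject_rate are placeholders.

# Fuzzy augmentation -- a sketch; object names and reject_rate are placeholders
p_bad <- predict(behavior_model, newdata = rejects_woe, type = "response")

# Non-defaulter copy of every reject: weight = rejection rate x P(non-default)
rejects_good        <- rejects_woe
rejects_good$bad    <- 0
rejects_good$weight <- reject_rate * (1 - p_bad)

# Defaulter copy of every reject: weight = rejection rate x P(default)
rejects_bad         <- rejects_woe
rejects_bad$bad     <- 1
rejects_bad$weight  <- reject_rate * p_bad

# Stack both weighted copies with the accepted applicants and rebuild the scorecard
comb_fuzzy <- rbind(accepts_woe[, names(rejects_good)], rejects_good, rejects_bad)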

Other Reject Inference Approaches

Although the three approaches above are the most widely accepted and used, other approaches have been proposed in the industry.

  1. Assign all rejects to default. This approach is only valid if the current process for accepting loans is extremely good at determining who would and would not default and has a very high acceptance rate (97% or higher). It is easy, but not recommended because of the bias it can introduce.
  2. Randomly assign rejects in the same proportion of defaulters and non-defaulters as reflected in the accepted applicant dataset. The problem is that this approach implies our rejected applicants are no different from our accepted applicants, which would make our current acceptance process essentially random and ineffective.
  3. Similar in-house model on different data. If the rejected applicants have other loans with the institution, you could use their default probability from the other product’s behavioral scoring model.
  4. Approve all applicants for a certain period of time. Although this approach is unbiased in theory, it is rather impractical in reality as it would likely not pass regulatory scrutiny.
  5. Clustering algorithms (unsupervised learning) to group rejected applicants and accepted applicants into clusters. The rejects are then randomly assigned a default status based on the default rate within their cluster. This is a similar approach to parceling.

Final Scorecard

Now that we have built the initial scorecard and accounted for our reject inference problem, we can move on to building the final application scorecard.

Building the Final Scorecard

The mechanics of building the final scorecard model are identical to the initial scorecard creation, except that the analysis is performed after reject inference.

Let’s see how to do this in our software!

R

The first line of code is a placeholder for whichever type of reject inference you used from above. Here, we are using the dataset obtained after using the parceling augmentation method of reject inference.

Next, we go through all of the normal steps of model building; a short sketch of the point allocation in step 7 follows the list.

  1. Separate into training and testing datasets
  2. Evaluate variables based on information value
  3. Bin the variables with \(IV \ge 0.1\)
  4. Transform the variables into weight of evidence representations
  5. Build the logistic regression model
  6. Evaluate the regression model on training and testing datasets
  7. Allocate the points for the scorecard
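
The sketch below illustrates one common (Siddiqi-style) way of carrying out step 7, scaling the WOE logistic regression coefficients into scorecard points. The scaling choices (600 points at 20:1 odds, 50 points to double the odds) and the example bin WOE value are illustrative assumptions rather than the values used in this document; the intercept, coefficient, and number of predictors in the example are taken from the fitted model output further below.

# Step 7 sketch: one common points allocation (assumes the model predicts default)
pdo    <- 50                      # points to double the odds (illustrative)
fct    <- pdo / log(2)            # scaling factor
offset <- 600 - fct * log(20)     # 600 points at 20:1 good:bad odds (illustrative)

# Points for bin j of variable i, spreading the intercept across the n predictors
points_bin <- function(beta_i, woe_ij, alpha, n) {
  round(-(beta_i * woe_ij + alpha / n) * fct + offset / n)
}

# Example using the fitted model below: intercept -2.82, bureau_score_WOE
# coefficient -0.80, 7 predictors, and a hypothetical bin WOE of 0.45
points_bin(beta_i = -0.80, woe_ij = 0.45, alpha = -2.82, n = 7)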

Only certain pieces of the output are shown to keep the output at a minimum.

 

 


Call:
glm(formula = bad ~ tot_tr_WOE + tot_derog_WOE + age_oldest_tr_WOE + 
    tot_rev_line_WOE + rev_util_WOE + bureau_score_WOE + ltv_WOE, 
    family = "binomial", data = train_comb, weights = train_comb$weight_ar)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1674  -0.9320  -0.5407  -0.1769   4.3996  

Coefficients:
                  Estimate Std. Error  z value             Pr(>|z|)    
(Intercept)       -2.82279    0.02445 -115.442 < 0.0000000000000002 ***
tot_tr_WOE         0.02077    0.08722    0.238                0.812    
tot_derog_WOE     -0.08350    0.06033   -1.384                0.166    
age_oldest_tr_WOE -0.42120    0.05744   -7.333    0.000000000000225 ***
tot_rev_line_WOE  -0.40164    0.05411   -7.422    0.000000000000115 ***
rev_util_WOE      -0.21023    0.04947   -4.250    0.000021397754860 ***
bureau_score_WOE  -0.79630    0.03797  -20.971 < 0.0000000000000002 ***
ltv_WOE           -0.91261    0.07763  -11.755 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 18052  on 7551  degrees of freedom
Residual deviance: 16105  on 7544  degrees of freedom
AIC: 17543

Number of Fisher Scoring iterations: 5

  Overall Performance Metrics 
  -------------------------------------------------- 
                    KS : 0.3934 (Fair)
                   AUC : 0.7555 (Fair)

  Classification Matrix 
  -------------------------------------------------- 
           Cutoff (>=) : 0.0591 (Optimal)
   True Positives (TP) : 1309
  False Positives (FP) : 2115
  False Negatives (FN) : 421
   True Negatives (TN) : 3707
   Total Positives (P) : 1730
   Total Negatives (N) : 5822

  Business/Performance Metrics 
  -------------------------------------------------- 
      %Records>=Cutoff : 0.4534
             Good Rate : 0.3823 (Vs 0.2291 Overall)
              Bad Rate : 0.6177 (Vs 0.7709 Overall)
        Accuracy (ACC) : 0.6642
     Sensitivity (TPR) : 0.7566
 False Neg. Rate (FNR) : 0.2434
 False Pos. Rate (FPR) : 0.3633
     Specificity (TNR) : 0.6367
       Precision (PPV) : 0.3823
  False Discovery Rate : 0.6177
    False Omision Rate : 0.1020
  Inv. Precision (NPV) : 0.8980

  Note: 0 rows deleted due to missing data.


  Overall Performance Metrics 
  -------------------------------------------------- 
                    KS : 0.3892 (Fair)
                   AUC : 0.7495 (Fair)

  Classification Matrix 
  -------------------------------------------------- 
           Cutoff (>=) : 0.0609 (Optimal)
   True Positives (TP) : 427
  False Positives (FP) : 707
  False Negatives (FN) : 141
   True Negatives (TN) : 1243
   Total Positives (P) : 568
   Total Negatives (N) : 1950

  Business/Performance Metrics 
  -------------------------------------------------- 
      %Records>=Cutoff : 0.4504
             Good Rate : 0.3765 (Vs 0.2256 Overall)
              Bad Rate : 0.6235 (Vs 0.7744 Overall)
        Accuracy (ACC) : 0.6632
     Sensitivity (TPR) : 0.7518
 False Neg. Rate (FNR) : 0.2482
 False Pos. Rate (FPR) : 0.3626
     Specificity (TNR) : 0.6374
       Precision (PPV) : 0.3765
  False Discovery Rate : 0.6235
    False Omision Rate : 0.1019
  Inv. Precision (NPV) : 0.8981

  Note: 0 rows deleted due to missing data.

Business Evaluation and Cut-off Selection

The above models have been evaluated with statistical metrics like KS and AUC. However, it is good to look at more business-oriented evaluations of the model as well. One of the primary ways to do this is to create a default decile plot. A default decile plot buckets the predicted scores from the scorecard into 10 equally sized buckets (each containing the same number of applicants). From there, we evaluate the true default rate of the individuals in each bucket. If our model does a good job of separating the defaulters from the non-defaulters, we should see a consistent drop in default rate as the scores get higher, as seen below:

[518,538] (538,547] (547,555] (555,562] (562,571] (571,582] (582,594] (594,609] 
    18.87     12.47     10.18      7.99      6.16      3.95      3.42      2.06 
(609,631] (631,682] 
     1.18      0.40 

The next step is to decide on a decision cut-off value for the scorecard. Above this cut-off, an applicant is approved for a loan, while below it they are rejected. The new scorecard should be better than the previous method in terms of at least one of the following:

  • Lower default rate for the same approval rate
  • Higher approval rate for the same default rate
  • Highest amount of profit available

To address the first two points above, we plot the acceptance rate by the default rate across different levels of cut-off (values of the scorecard) and compare. The interactive plot below shows an example of this:

We can move our cursor along the plots to see what the default rate and acceptance rate combination is at any scorecard cut-off. We can see that with our current acceptance rate of 70%, we have a lower default rate of approximately 2.9% at a cut-off score of 559. We could also use a cut-off of 537 to keep our default rate close to 5%, but raise our acceptance rate to 93.4%.

We could also balance the above choices with profit. Every time we make a correct decision and give a loan to an individual who pays us back, we make approximately $1,200 in profit on average. However, for every mistake we make where a customer defaults, we lose $50,000 on average. The number of people who receive loans and then default is controlled by the cut-off. Similar to the plot above, we can plot the acceptance rate by the average profit across different levels of cut-off (values of the scorecard) and compare. The interactive plot below shows an example of this:

We can move our cursor along the plots to see what the profit and acceptance rate combination is at any scorecard cut-off. We can see that with our current acceptance rate of 70%, we are barely making a profit at a cut-off score of 559. We could also use a cut-off of 596 to maximize our profit, but our acceptance rate falls to roughly 33%.

From these two plots we can make a more informed decision on the cut-off for our model. A good strategy might be to use two cut-offs. The first would be 596: an applicant scoring 596 or above is automatically accepted to ensure we capture profitable customers. Below a cut-off of 537, the applicant is automatically rejected due to a risk of default beyond the bank’s comfort. The range in the middle can be sent to a team for further investigation into whether the loan is worth making.

Let’s see how to do this in our software!

R

There are a variety of ways to create the plots described in the previous section. The following code generates the plots you have seen above:

# Decile Default Plot #
# (saturation() and scalefac() below come from the 'shades' package)

# Bucket the combined scored accepts into score deciles
cutpoints <- quantile(accepts_scored_comb$Score, probs = seq(0, 1, 0.10))
accepts_scored_comb$Score.QBin <- cut(accepts_scored_comb$Score, breaks = cutpoints, include.lowest = TRUE)

# Weighted default rate (%) per decile; the 4.75 factor is the sampling weight
# applied to the non-defaulters (bad == 0) to reflect the population
counts <- table(accepts_scored_comb$Score.QBin, accepts_scored_comb$bad)
Default.QBin.pop <- round(counts[, 2]/(counts[, 2] + counts[, 1]*4.75)*100, 2)

print(Default.QBin.pop)

barplot(Default.QBin.pop, 
        main = "Default Decile Plot", 
        xlab = "Deciles of Scorecard",
        ylab = "Default Rate (%)", ylim = c(0, 20),
        col = saturation(heat.colors, scalefac(0.8))(10))
abline(h = 5, lwd = 2, lty = "dashed")   # reference line at the current 5% default rate
text(11, 6, "Current = 5.00%")

# Calculations of Default, Acceptance Rate, and Profit by Cut-off Score #
# For every integer cut-off, the loop below computes the weighted default rate,
# the acceptance rate, and the total profit among applicants scoring at or above
# that cut-off (4.75 is again the non-defaulter sampling weight)
def <- NULL
acc <- NULL
prof <- NULL
score <- NULL

cost <- 50000     # average loss on a defaulted loan
profit <- 1500    # average profit on a loan that is paid back
for(i in min(floor(train_comb$Score)):max(floor(train_comb$Score))){
  score[i - min(floor(train_comb$Score)) + 1] <- i
  def[i - min(floor(train_comb$Score)) + 1] <- 100*sum(train_comb$bad[which(train_comb$Score >= i)])/(length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)]))
  acc[i - min(floor(train_comb$Score)) + 1] <- 100*(length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)]))/(length(train_comb$bad[which(train_comb$bad == 1)]) + 4.75*length(train_comb$bad[which(train_comb$bad == 0)]))
  prof[i - min(floor(train_comb$Score)) + 1] <- length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 1)])*(-cost) + 4.75*length(train_comb$bad[which(train_comb$Score >= i & train_comb$bad == 0)])*profit
}

plot_data <- data.frame(def, acc, prof, score)

# Plot of Acceptance Rate by Default Rate #
# (plot_ly() and the %>% pipe below come from the 'plotly' package)
ay1 <- list(
  title = "Default Rate (%)",
  range = c(0, 10)
)
ay2 <- list(
  tickfont = list(),
  range = c(0, 100),
  overlaying = "y",
  side = "right",
  title = "Acceptance Rate (%)"
)
fig <- plot_ly()
fig <- fig %>% add_lines(x = ~score, y = ~def, name = "Default Rate (%)")
fig <- fig %>% add_lines(x = ~score, y = ~acc, name = "Acceptance Rate (%)", yaxis = "y2")
fig <- fig %>% layout(
  title = "Default Rate by Acceptance Across Score", yaxis = ay1, yaxis2 = ay2,
  xaxis = list(title="Scorecard Value"),
  legend = list(x = 1.2, y = 0.8)
)

fig

# Plot of Acceptance Rate by Profit #
ay1 <- list(
  title = "Profit ($)",
  showline = FALSE,
  showgrid = FALSE
)
ay2 <- list(
  tickfont = list(),
  range = c(0, 100),
  overlaying = "y",
  side = "right",
  title = "Acceptance Rate (%)"
)
fig <- plot_ly()
fig <- fig %>% add_lines(x = ~score, y = ~prof, name = "Profit ($)")
fig <- fig %>% add_lines(x = ~score, y = ~acc, name = "Acceptance Rate (%)", yaxis = "y2")
fig <- fig %>% layout(
  title = "Profit by Acceptance Across Score", yaxis = ay1, yaxis2 = ay2,
  xaxis = list(title="Scorecard Value"),
  legend = list(x = 1.2, y = 0.8)
)

fig
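
Finally, here is a small illustration (not part of the code above) of the two cut-off strategy discussed earlier, using the 596 and 537 cut-offs as examples:

# Two cut-off decision rule (illustrative cut-offs from the discussion above)
loan_decision <- function(score, accept_cut = 596, reject_cut = 537) {
  ifelse(score >= accept_cut, "Auto-Accept",
         ifelse(score < reject_cut, "Auto-Reject", "Manual Review"))
}

table(loan_decision(train_comb$Score))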

Credit Scoring Model Extensions

With the growth of machine learning models, the credit modeling industry is adapting its process for building scorecards. Below are some extensions that have been proposed in the literature.

One extension to scorecard modeling is a multi-stage approach. Decision trees (and most tree-based algorithms) have the benefit of including interactions at every split of the tree. However, this also makes interpretation a little harder for scorecards, as you would have scorecard points changing at every branch of the tree. In the multi-stage approach, the first stage is to build a decision tree on the dataset to obtain an initial couple of layers of splits. The second stage is to build logistic regression based scorecards within each of the limited number of initial segments from the decision tree. The interpretation is then within a split (sub-group) of the dataset, as sketched below.
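
As a rough illustration of this multi-stage idea (a sketch, not a method prescribed here), the code below grows a shallow tree with the rpart package and then fits a separate logistic regression within each resulting segment; the data frame train_woe and its bad column are assumed placeholders.

library(rpart)

# Stage 1: a shallow tree (two layers of splits) to define interaction-based segments
seg_tree <- rpart(bad ~ ., data = train_woe, method = "class",
                  control = rpart.control(maxdepth = 2, cp = 0))

# Each training observation's terminal node defines its segment
train_woe$segment <- seg_tree$where

# Stage 2: a separate logistic regression scorecard within each segment
seg_models <- lapply(split(train_woe, train_woe$segment), function(seg_data) {
  glm(bad ~ . - segment, family = "binomial", data = seg_data)
})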

For credit modeling, model interpretation is key. This makes the use of machine learning algorithms hard to pass by regulators. Scorecard layers on top of machine learning algorithms help drive the interpretation of the algorithm; however, regulators are still hesitant. That doesn’t mean we cannot use these techniques. They are still very valuable for internal comparison and variable selection. For example, you could build a neural network, tree-based algorithm, boosting algorithm, etc. to see if that model is statistically different from the logistic regression scorecard. If so, we can examine which variables differ between the models and whether the scorecard could capture them through some kind of feature engineering. However, empirical studies have shown that scorecards built on logistic regression models with good feature engineering for the variables are still hard to outperform, even with advanced machine learning models. A challenger-model comparison is sketched below.
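
As one hedged example of this challenger-model idea, the sketch below fits a gradient boosting model with the xgboost package and compares its ROC curve to the scorecard’s logistic regression using a DeLong test from the pROC package; the object names (train_raw, test_raw, predictor_cols, scorecard_glm, test_woe) are assumptions, not objects defined in this document.

library(xgboost)
library(pROC)

# Challenger model: gradient boosting on the raw (non-WOE) predictors
X <- as.matrix(train_raw[, predictor_cols])        # placeholder predictor matrix
xgb_mod <- xgboost(data = X, label = train_raw$bad,
                   nrounds = 200, objective = "binary:logistic", verbose = 0)

# Scores from both models on the test set
p_xgb <- predict(xgb_mod, as.matrix(test_raw[, predictor_cols]))
p_glm <- predict(scorecard_glm, newdata = test_woe, type = "response")

# Compare ROC curves; a non-significant DeLong test suggests the scorecard
# is not being meaningfully outperformed by the boosted model
roc_xgb <- roc(test_raw$bad, p_xgb)
roc_glm <- roc(test_raw$bad, p_glm)
roc.test(roc_xgb, roc_glm, method = "delong")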