Decision Tree

Linear models and regression-based modeling take many different forms and variations to try to predict a target variable. Tree-based modeling is a different approach. A decision tree is a supervised learning algorithm built by recursively splitting the data into successively purer subsets in a structure that resembles a decision-making process.

Data is split based on groupings of predictor variable values. Although algorithms exist that can do multi-way splits, the focus here will be on binary splits - just splitting the data into two groups. These trees are useful for predicting either continuous or categorical target variables.

Let’s look at an example with a continuous target variable. We are going to predict a target variable y with one predictor variable X using a decision tree algorithm.

We must split the data into 2 pieces based on the values of X. Here are all of the possible splits:

Let’s split the data at the first split point on the left hand side - an X value of 1.5. That would give us two groups - A and B.

The predictions from group A would be the average of the values of our target variable y from group A, \(\hat{y} = 2\). The predictions of group B would be the average of y from group B, \(\hat{y} = 5.5\).

Splitting

The main question is how we choose the best split. Splitting is done based on some condition. The classification and regression tree (CART) approach to decision trees uses mean square error (MSE) to decide the best split for continuous target variables and measurements of purity like Gini or entropy for classification target variables. The chi-square automatic interaction detector (CHAID) approach to decision trees uses \(\chi^2\) tests and p-values to determine the best split.

Let’s continue the above example with the CART approach to decision trees. We will make the best split based on MSE. We want a split that will minimize the MSE for our predictions. Since we have the predictions for the first split above, let’s calculate the MSE for each group:

\[ MSE_A = \frac{(2-2)^2}{1} = 0 \]

\[ MSE_B = \frac{(4-5.5)^2 + (5-5.5)^2 + (6-5.5)^2 + (7-5.5)^2}{4} = 1.25 \]

Now, let’s combine this into the overall MSE:

\[ MSE_{1.5} = 0.2 \times 0 + 0.8 \times 1.25 = 1 \]

Now let’s move the split to the next possible location, 2.5:

Now our predictions for group A and B are \(3\) and \(6\) respectively. Let’s again calculate the MSE for each group and the overall MSE for the split at 2.5:

\[ MSE_A = \frac{(2-3)^2 + (4-3)^2}{2} = 1 \]

\[ MSE_B = \frac{(5-6)^2 + (6-6)^2 + (7-6)^2}{3} = 0.67 \]

\[ MSE_{2.5} = 0.4 \times 1 + 0.6 \times 0.67 = 0.8 \]

This split has a lower overall MSE than the first split. In fact, if we were to continue this process for every split we would see that this split is the best of all possible binary splits.
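The search over every possible split can be sketched in a few lines of plain Python. The data values (X = 1 through 5 with y = 2, 4, 5, 6, 7) are inferred from the worked example above:

```python
# Exhaustive search for the best binary split on the toy data
# (X = 1..5, y = 2, 4, 5, 6, 7, inferred from the worked example).

def mse(values):
    """Mean squared error of a group around its own mean (its prediction)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_scores(X, y):
    """Try every midpoint between adjacent X values; return {split: weighted MSE}."""
    n = len(y)
    results = {}
    for i in range(1, n):
        split = (X[i - 1] + X[i]) / 2
        left = [yv for xv, yv in zip(X, y) if xv < split]
        right = [yv for xv, yv in zip(X, y) if xv > split]
        results[split] = len(left) / n * mse(left) + len(right) / n * mse(right)
    return results

X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 6, 7]
scores = split_scores(X, y)
# scores[1.5] == 1.0 and scores[2.5] == 0.8, matching the hand calculations;
# 2.5 has the lowest overall MSE of all possible splits
```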

Now we have our best decision tree model if we were to limit ourselves to splitting our data only once. Since our predictions are just the average of each group, we have a step function for our model as we can see below:

DecisionTreeRegressor(max_depth=1)

Split Threshold: 2.5
Left Node Prediction: 3.0
Right Node Prediction: 6.0

Let’s see how to build a simple CART decision tree in each software!
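In Python, a minimal scikit-learn sketch that reproduces the one-split output shown above could look like the following (the toy data is assumed from the worked example):

```python
# A one-split ("stump") regression tree on the toy data from the example.
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [4], [5]]
y = [2, 4, 5, 6, 7]

stump = DecisionTreeRegressor(max_depth=1)  # max_depth=1 allows exactly one split
stump.fit(X, y)

print(stump.tree_.threshold[0])   # split threshold: 2.5
print(stump.predict([[1], [5]]))  # group predictions: 3 and 6
```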

The exact same process would occur if we wanted to go 2 levels deep into our decision tree. Each piece of the decision tree would have the same process to see if additional splits within each of group A or group B would help improve the model’s MSE. The predictions for a 2 level decision tree would be a more complicated step function as seen below:

DecisionTreeRegressor(max_depth=2, random_state=42)

Let’s see how to do this in each software!

Attributes

One of the major reasons why decision trees are such a popular algorithm is that they are interpretable. With a series of binary decisions, you can easily follow why a prediction is the way that it is. You can also see if that specific split increases or decreases a prediction. These binary decisions are also very easy to implement into a system that scores customers.

Decision tree models are also more complicated than one might realize from an initial look at them. Decision trees and the step functions they produce for predictions are not limited to linear associations. The prediction for each final split is just the average of all of the observations in that group, so the pattern could easily be nonlinear without impacting the predictions from the decision tree. Each decision tree is also full of interactions between variables. Since each subsequent split after the first is made within a group, different groups after a split can be impacted differently by different variables. For example, a categorical variable X2 might be important for splitting values of X below 2.5, but not values above 2.5.

That previous point about variable splits impacting the decisions of variable splits after them implies that decision trees are greedy algorithms. A greedy algorithm makes the locally optimal choice at each step with the hope that these local decisions lead to a globally optimal solution. However, the decision tree only picks the next best option and doesn’t go back to reconsider earlier choices after looking at later decision points. This means there is no guarantee of an optimal solution.

Tuning / Optimizing Decision Trees

With multiple variables, every variable is evaluated at every split to find the “best” single split across all possible splits across all possible variables. This includes categorical variables that have been dummy coded, where each dummy coded variable is evaluated separately.

With all of these variables and possible splits we have to worry about over-fitting our data. To prevent over-fitting of our data there are some common factors in a CART decision tree we can tune:

  • Max depth of tree - how many split levels overall in the tree

  • Minimum samples in split - how many observations must be in a node to even try to split

  • Minimum samples in leaf - how many observations must remain in each resulting leaf for a split to be allowed

When it comes to tuning machine learning models like decision trees there are many options for tuning. Cross-validation should be included in each of these approaches. Two of the most common tuning approaches are:

  • Grid search

  • Bayesian optimization

Grid search algorithms look at a fixed grid of values for each parameter you can tune - called hyperparameters. Every possible combination of values is evaluated, whether good or bad, because the search doesn’t learn from previous results. This works well for small samples with a limited number of variables, but it can be time consuming.
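A minimal sketch of a grid search with cross-validation in scikit-learn; the dataset and the grid values below are made up for illustration, not a recommendation:

```python
# Grid search over two decision tree hyperparameters with 5-fold CV.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X = [[i] for i in range(30)]
y = [i % 7 for i in range(30)]  # arbitrary nonlinear pattern

param_grid = {
    "max_depth": [1, 2, 3, 4],      # how deep the tree may grow
    "min_samples_leaf": [1, 2, 5],  # smallest allowed leaf
}
grid = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5)
grid.fit(X, y)  # every combination (4 x 3 = 12) is tried

print(grid.best_params_)
```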

Bayesian optimization, on the other hand, is typically less computationally expensive than a grid search approach. Bayesian optimization starts with random values for each hyperparameter you need to tune. It then fits a probabilistic model (a type of sequential model-based optimization) that “learns” the relationship between the hyperparameters and performance. This process estimates which hyperparameters are likely to produce good results and points the search in the “correct” direction for hyperparameter values. Each step in the optimization essentially “learns” from the previous combinations of hyperparameters. This approach is much more valuable for large samples with large numbers of variables.

We will continue to use our Ames housing dataset where we are predicting the sale price of a home. Even with tree-based algorithms, we still need to take the same steps to clean our data and do initial variable screening with business logic, low variability, missingness, and single variable comparison to the target. Machine learning models don’t protect you to the point where you can just throw in horrible data!

Let’s tune these models in each software!

Variable Importance

Most machine learning models are not interpretable in the classical sense - as one predictor variable increases, the target variable tends to ___. This is because the relationships are not linear. The relationships are more complicated than a linear relationship, so the interpretations are as well. Decision trees, however, are still interpretable based on the decision framework they are built with.

Although decision trees have an interpretable framework, we don’t have p-values like we do in statistical regression modeling to help us understand variable importance. Variable importance is a way of telling which variables “mean more” to the model than other variables.

In decision trees, the metric for variable importance is the MDI - mean decrease in impurity. Impurity has a different meaning depending on the type of target variable. In regression based trees with continuous target variables, the impurity metric is typically the variance of the residuals which is calculated by MSE. For classification based trees with a categorical target variable, the impurity metric is typically the Gini metric.

The MSE calculation has been used many times before in this code deck, however, the Gini impurity metric is new. Let’s look at its calculation:

\[ Gini = 1 - \sum_{i=1}^C p_i^2 \]

The calculation is quite simple. For a node, it squares the proportion of each class of the target variable, sums those squared proportions, and subtracts the sum from 1. This measure of impurity is a measure of disorder, so we want lower values. Lower Gini impurity implies that the child nodes are “purer.”
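The Gini calculation can be sketched directly from the formula above, using a made-up node of labels:

```python
# Gini impurity for a node: 1 minus the sum of squared class proportions.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

print(gini(["a"] * 4 + ["b"] * 6))  # mixed node: 1 - (0.4^2 + 0.6^2) = 0.48
print(gini(["a"] * 10))             # pure node: 0.0
```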

Every time a feature is used to split a node, the software measures how much that split reduces the impurity. These reductions are accumulated across all trees for each feature (variable). These reductions are then normalized to sum to 1. The one problem with this approach is that it tends to overemphasize variables with many unique values because there are more places to split the variable.

Let’s calculate variable importance in each software!

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring

Scoring is the process of applying a fitted model to input data to generate outputs like predicted values, probabilities, classifications, or scores.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data - plugging the new data into the equation and recording the output. This means that our data must be in the same format as the data put into the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be taken with the new data you are about to score.
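One common way to guarantee that the same transformations are applied at scoring time is to bundle them with the model. A minimal scikit-learn sketch, with made-up data and a simple median imputation step as the stand-in preprocessing:

```python
# Bundle imputation and the model so scoring reuses the training transformations.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

train_X = [[1.0], [2.0], [float("nan")], [4.0], [5.0]]
train_y = [2, 4, 5, 6, 7]

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # learned on training data only
    ("tree", DecisionTreeRegressor(max_depth=2, random_state=0)),
])
model.fit(train_X, train_y)

# "Scoring" new data: plug it into the fitted pipeline - no re-training.
new_X = [[3.0], [float("nan")]]  # missing value gets the *training* median
preds = model.predict(new_X)
```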

For this problem we will score our test dataset that we have previously set aside as we were building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. This would no longer make the test dataset an honest assessment of how good your model is actually doing. That also means that we should NOT just build hundreds of iterations of our same model to compare to the test dataset. That would essentially be doing the same thing as going back and rebuilding the model as you would be letting the test dataset decide on your final model.

Let’s score our data in each software!

Summary

In summary, decision tree models are interpretable, decision based framework models. Some of the advantages of using decision trees:

  • Computationally fast

  • Still interpretable due to decision framework output

  • Easy to implement

  • Variable importance provided

There are some disadvantages though:

  • Typically, not as predictive as more advanced models

  • Greedy algorithm means every decision is completely based on the previous decision

  • Bias towards features with more levels

Bagging

Before understanding the concept of bagging, we need to know what a bootstrap sample is. A bootstrap sample is a random sample of your data, taken with replacement, that is the same size as your original data set. By random chance, some of the observations will not be sampled. These observations are called out-of-bag observations. Three example bootstrap samples of 10 observations labelled 1 through 10 are listed below:

Bootstrap Sample Examples

Mathematically, a bootstrap sample will contain approximately 63% of the observations in the data set. The sample size of the bootstrap sample is the same as the original data set, just some observations are repeated. Bootstrap samples are used in simulation to help develop confidence intervals of complicated forecasting techniques. Bootstrap samples are also used to ensemble models using different training data sets - called bagging.
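The 63% figure can be checked with a quick simulation; the sample size and seed below are arbitrary. The chance an observation appears at least once in a bootstrap sample is \(1 - (1 - 1/n)^n\), which approaches \(1 - 1/e \approx 0.632\):

```python
# Simulate one bootstrap sample and measure the in-bag fraction.
import random

rng = random.Random(42)
n = 10_000
observations = range(n)

bootstrap = [rng.choice(observations) for _ in range(n)]  # sample n with replacement
in_bag_fraction = len(set(bootstrap)) / n

print(round(in_bag_fraction, 3))  # close to 0.632; the rest are out-of-bag
```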

Bootstrap aggregating (bagging) is where you take k bootstrap samples and create a model using each of the bootstrap samples as the training data set. This will build k different models. We will ensemble these k different models together.

Let’s work through an example to see the potential value. The following 10 observations have the values of X and Y shown in the following table:

We will build a decision stump (a decision tree with only one split). From building a decision stump, the best accuracy is obtained by making the split at either 3.5 or 7.5. Either of these splits would lead to an accuracy of predicting Y at 70%. For example, let’s use the split at 3.5. That would mean we think everything above 3.5 is a 0 and everything below 3.5 is a 1. We would get observations 1, 2, and 3 all correct. However, since everything above 3.5 is considered a 0, we would only get observations 4, 5, 6, and 7 correct on that piece - missing observations 8, 9, and 10.
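The stump search can be sketched in a few lines; the X and Y values are assumed from the table described above:

```python
# Exhaustive decision stump search on the bagging example data.
X = list(range(1, 11))
Y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

def stump_accuracy(split, left_label):
    """Accuracy predicting left_label below the split and its opposite above."""
    preds = [left_label if x < split else 1 - left_label for x in X]
    return sum(p == y for p, y in zip(preds, Y)) / len(Y)

best = max(
    ((split, label) for split in [s + 0.5 for s in range(1, 10)] for label in (0, 1)),
    key=lambda s: stump_accuracy(*s),
)
print(best, stump_accuracy(*best))  # best single stump is 70% accurate
```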

To try and make this prediction better we will do the following:

  1. Take 10 bootstrap samples
  2. Build a decision stump for each
  3. Aggregate these rules into a voting ensemble
  4. Test the performance of the voting ensemble on the whole dataset

The following is the 10 bootstrap samples with their optimal splits. Remember, these bootstrap samples will not contain all of the observations in the original data set.

Some of these samples contain only one value of the target variable and so the predictions are the same for that bootstrap sample. However, we will use the optimal cut-offs from each of those samples to predict 1’s and 0’s for the original data set as shown in the table below:

Summary of Bootstrap Sample Predictions on Training Data

The table above has one row for each of the predictions from the 10 bootstrap samples. We can average these predictions of 1’s and 0’s together for each observation to get the average row near the bottom.

Let’s take a cut-off of 0.25 from our average values from the 1 and 0 predictions from each bootstrap sample. That would mean anything above 0.25 in the average row would be predicted as a 1 while everything below would be predicted as a 0. Based on those predictions (the Pred. row above) we get a perfect prediction compared to the true values of Y from our original data set.
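The steps above can be sketched as follows. The bootstrap samples here are randomly drawn, so the individual stumps (and the final ensemble accuracy) will differ from the particular samples shown in the table:

```python
# Bagging decision stumps: bootstrap, fit a stump each, average, cut at 0.25.
import random

rng = random.Random(7)
X = list(range(1, 11))
Y = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]

def fit_stump(xs, ys):
    """Best (split, left_label) by training accuracy on one sample."""
    candidates = [(s + 0.5, lab) for s in range(1, 10) for lab in (0, 1)]
    def acc(c):
        split, lab = c
        return sum((lab if x < split else 1 - lab) == y for x, y in zip(xs, ys))
    return max(candidates, key=acc)

stumps = []
for _ in range(10):                                  # 1. ten bootstrap samples
    idx = [rng.randrange(10) for _ in range(10)]
    stumps.append(fit_stump([X[i] for i in idx], [Y[i] for i in idx]))  # 2. a stump each

# 3-4. average the 0/1 votes per observation, then apply the 0.25 cut-off
avg = [sum(lab if x < split else 1 - lab for split, lab in stumps) / 10 for x in X]
ensemble = [1 if a > 0.25 else 0 for a in avg]
accuracy = sum(p == y for p, y in zip(ensemble, Y)) / 10
```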

In summary, bagging improves generalization error on models with high variance (simple tree-based models, for example). If the base classifier is stable (not suffering from high variance), then bagging can actually make predictions worse! Bagging does not focus on particular observations in the training data set (unlike boosting, which is covered below).

Random Forests

Random forests are ensembles of decision trees (similar to the bagging example from above). Ensembles of decision trees work best when they find different patterns in the data. However, bagging tends to create trees that pick up the same pattern.

Random forests get around this correlation between trees by not only using bootstrapped samples, but also by using subsets of variables for each split and growing unpruned decision trees. For these unpruned trees, a classification tree is grown all the way until each observation is left in its own leaf, while regression trees continue until 5 observations are left per leaf. The results from all of these trees are ensembled together into one voting system. There are many choices of parameters to tune:

  1. Number of trees in the ensemble
  2. Number of variables for each split
  3. Depth of the tree (defaults to unpruned)

Building a Random Forest

Let’s see random forests in each software!

Tuning / Optimizing Random Forests

There are a few things we can tune with a random forest. One is the number of trees used to build the model. Another thing to tune is the number of variables considered for each split - called mtry. By default, \(mtry = \sqrt{p}\), with \(p\) being the number of total variables. We can use cross-validation to tune these values. Just like with decision trees, we can use either a grid search approach or Bayesian optimization.
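In scikit-learn, mtry corresponds to the `max_features` argument. A minimal sketch with made-up data; note that mtry is set explicitly here rather than relying on the library default:

```python
# Random forest with mtry = sqrt(p) via max_features="sqrt".
from sklearn.ensemble import RandomForestRegressor

X = [[i, i % 3, (i * 7) % 5] for i in range(30)]  # p = 3 made-up predictors
y = [i + (i % 3) * 2 for i in range(30)]

rf = RandomForestRegressor(
    n_estimators=100,      # number of trees in the ensemble
    max_features="sqrt",   # variables considered at each split (mtry)
    random_state=0,
)
rf.fit(X, y)
print(rf.predict([[10, 1, 0]]))
```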

Let’s see how to do this in each software!

Variable Importance

Most machine learning models are not interpretable in the classical sense - as one predictor variable increases, the target variable tends to ___. This is because the relationships are not linear. The relationships are more complicated than a linear relationship, so the interpretations are as well. Random forests are an aggregation of many decision trees which makes this general interpretation difficult. We cannot view the tree like we can with a simple decision tree. Again, we will have to use our metrics for variable importance like we did with decision trees.

Just like in decision trees, the metric for variable importance in random forests is the MDI - mean decrease in impurity. Impurity has a different meaning depending on the type of target variable. In regression based trees with continuous target variables, the impurity metric is typically the variance of the residuals which is calculated by MSE. For classification based trees with a categorical target variable, the impurity metric is typically the Gini metric.

Every time a feature is used to split a node, the software measures how much that split reduces the impurity. These reductions are accumulated across all trees for each feature (variable). These reductions are then normalized to sum to 1. The one problem with this approach is that it tends to overemphasize variables with many unique values because there are more places to split the variable.

Let’s calculate variable importance in each software!

Variable Selection

Another thing to tune would be which variables to include in the random forest. By default, random forests use all the variables since results are averaged across all the trees used to build the model. There are a couple of ways to perform variable selection for random forests:

  • Many permutations of including/excluding variables

  • Compare variables to random variable

The first approach is rather straightforward but time consuming because you have to build models over and over again, taking a variable out each time. The second approach is much easier to try. In that second approach we will create a completely random variable and put it in the model. We will then look at the variable importance of all the variables. The variables that rank below the random variable should be considered for removal.

Let’s see this in each software!

“Interpretability”

Most machine learning models are not interpretable in the classical sense - as one predictor variable increases, the target variable tends to ___. This is because the relationships are not linear. The relationships are more complicated than a linear relationship, so the interpretations are as well. Similar to GAM’s we can get a general idea of an overall pattern for a predictor variable compared to a target variable - partial dependence plots.

These plots will be covered in much more detail in the final section of this code deck under Model Agnostic Interpretability.

Partial dependence plot of GrLivArea vs. SalePrice

This nonlinear and complex relationship between GrLivArea and SalePrice is similar to the plot we saw earlier with the GAMs. This shouldn’t be too surprising. Both sets of algorithms are trying to relate these two variables together, just in different ways.

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data - plugging the new data into the equation and recording the output. This means that our data must be in the same format as the data put into the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be taken with the new data you are about to score.

For this problem we will score our test dataset that we have previously set aside as we were building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. This would no longer make the test dataset an honest assessment of how good your model is actually doing. That also means that we should NOT just build hundreds of iterations of our same model to compare to the test dataset. That would essentially be doing the same thing as going back and rebuilding the model as you would be letting the test dataset decide on your final model.

Let’s score our data in each software!

Summary

In summary, random forest models are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using random forests:

  • Computationally fast (handles thousands of variables)

  • Trees trained simultaneously

  • Accurate classification model

  • Variable importance available

  • Missing data is OK

There are some disadvantages though:

  • “Interpretability” is a little different than the classical sense

  • There are parameters that need to be tuned to make the model the best

Boosting

Boosting is similar to bagging in the sense that we will draw a sample of observations from a data set with replacement. However, unlike bagging, observations are not sampled uniformly at random. Boosting assigns weights to each training observation and uses the weights as a sampling distribution. Observations with higher weights are more likely to be drawn in the next sample. These weights are changed adaptively in each round: observations that are harder to classify get higher weights for the next sample, increasing the chance that they are selected so the next model has a better chance of predicting them correctly. An example of the difference in weights for a boosted sample as compared to a bagging sample is the following:

With bagging we are only trying to create variability in the models by using training data set variation. The ensemble model is built simultaneously. However, in boosting, observations with higher sampling probability were harder to predict accurately. We want to improve the predictions from the model sequentially.

Let’s use the same example we used in bagging. The following 10 observations have the values of X and Y shown in the following table:

We will build a decision stump (a decision tree with only one split). From building a decision stump, the best accuracy is obtained by making the split at either 3.5 or 7.5. Either of these splits would lead to an accuracy of predicting Y at 70%. For example, let’s use the split at 3.5. That would mean we think everything above 3.5 is a 0 and everything below 3.5 is a 1. We would get observations 1, 2, and 3 all correct. However, since everything above 3.5 is considered a 0, we would only get observations 4, 5, 6, and 7 correct on that half - missing observations 8, 9, and 10.

To try and make this prediction better we will do the following:

  1. Take 3 bootstrap samples

  2. Build a decision stump for each

  3. At each step we will adjust the weights based on the previous step’s errors

The following is the first 3 bootstrap samples with their optimal splits. Remember, these bootstrap samples will not contain all of the observations in the original data set.

Boosting Samples with Weights

The first sample has the same weight for each observation. In that bootstrap sample the best split occurs at 7.5. However, when predicting the original observations, we would incorrectly predict observations 1, 2, and 3. Therefore, in the second sample we will overweight observations 1, 2, and 3 to help us predict those observations better (correct for the previous mistakes). In the second bootstrap sample we have observations 1, 2, and 3 over-represented by design. However, this leads to incorrect predictions in observations 4 through 7. This will lead to us over-weighting those observations as seen in the third row of the weights table. Observations 8, 9, and 10 have always been correctly predicted so their weight is the smallest, while the other observations have higher weights.

The original boosted sample ensemble was called AdaBoost. Unlike bagging, boosted ensembles usually weight the votes of each classifier by a function of their accuracy. If the classifier gets the higher weighted observations wrong, it has a higher error rate. More accurate classifiers get a higher weight in the prediction. In simple terms, more accurate guesses are more important. We will not detail this algorithm here as there are more common advancements that have come since.

Gradient Boosting

Some more recent algorithms are moving away from the direct boosting approach applied to bootstrap samples. However, the main idea of these algorithms is still rooted in the notion of finding where you made mistakes and trying to improve on those mistakes.

One of the original algorithms to go with this approach is the gradient boosted machine (GBM). The idea behind the GBM is to build a simple model to predict the target (much like a decision tree or even decision stump):

\[ y = f_1(x) + \varepsilon_1 \]

That simple model has an error of \(\varepsilon_1\). The next step is to try to predict that error with another simple model:

\[ \varepsilon_1 = f_2(x) + \varepsilon_2 \]

This model has an error of \(\varepsilon_2\). We can continue to add model after model, each one predicting the error (residuals) from the previous round:

\[ y = f_1(x) + f_2(x) + \cdots + f_k(x) + \varepsilon_k \]

We could continue this process until the error is really small, but that would lead to over-fitting. To protect against over-fitting, gradient boosting builds in regularization through parameters to tune. One of those parameters is \(\eta\), the weight applied to the error models:

\[ y = f_1(x) + \eta \times f_2(x) + \eta \times f_3(x) + \cdots + \eta \times f_k(x) + \varepsilon_k \]

Smaller values of \(\eta\) lead to less over-fitting. Other things to tune include the number of trees to use in the prediction (more trees typically lead to more over-fitting), and other parameters added over the years - \(\lambda\), \(\gamma\), \(L_2\), etc.

Gradient boosting as defined above yields an additive ensemble model. There is no voting or averaging of individual models as the models are built sequentially, not simultaneously. The predictions from these models are added together to get the final prediction. One of the big keys to gradient boosting is using weak learners. Weak learners are simple models. Shallow decision / regression trees are the best. Each of these models would make poor predictions on their own, but the additive fashion of the ensemble provides really good predictions.
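The additive process above can be sketched in plain Python with regression stumps as the weak learners; the toy data is assumed from the earlier decision tree example:

```python
# Gradient boosting by hand: each round fits a stump to the current residuals
# and adds eta times its prediction, y = f1 + eta*f2 + eta*f3 + ...
X = [1, 2, 3, 4, 5]
y = [2, 4, 5, 6, 7]
eta = 0.5

def fit_stump(xs, residuals):
    """Best single split by squared error; returns a prediction function."""
    best = None
    for split in [1.5, 2.5, 3.5, 4.5]:
        left = [r for x, r in zip(xs, residuals) if x < split]
        right = [r for x, r in zip(xs, residuals) if x >= split]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    _, split, lm, rm = best
    return lambda x: lm if x < split else rm

pred = [0.0] * len(X)
mses = []
for k in range(20):
    resid = [yi - pi for yi, pi in zip(y, pred)]   # what is still unexplained
    f = fit_stump(X, resid)                        # weak learner on the residuals
    weight = 1.0 if k == 0 else eta                # first model unweighted, as above
    pred = [pi + weight * f(xi) for pi, xi in zip(pred, X)]
    mses.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
# mses shrinks round over round as the residuals are predicted away
```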

The following plot is a series of ensembles of more and more weak learners to try and predict a complicated pattern:

These models are optimized to some form of a loss function. For example, linear regression looks at minimizing the sum of squared errors - its loss function. The sum of squared errors from many predictions would aggregate into a single number - the loss of the whole model. Gradient boosting is much the same, but not limited to sum of squared errors for the loss function.

How does the gradient boosting model find this optimal loss function value? It uses gradient descent. Gradient descent is a method that iteratively updates parameters in order to minimize a loss function (the sum of squared errors for example) by moving in the direction of the steepest descent as seen in the figure below:

Gradient Descent on Loss Function

This function minimizes the loss function by taking iteratively smaller steps until it finds the minimum. The step size is updated at each step to help with the minimization. The step size is updated with the learning rate. Without the learning rate, we might take steps too big and miss the minimum or too small and take a long time to optimize.
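A minimal sketch of gradient descent on a simple convex loss, \(L(w) = (w-3)^2\) with gradient \(2(w-3)\), showing the role of the learning rate:

```python
# Gradient descent on L(w) = (w - 3)^2, which has its minimum at w = 3.
def gradient_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        gradient = 2 * (w - 3)            # slope of the loss at the current point
        w = w - learning_rate * gradient  # step in the direction of steepest descent
    return w

print(gradient_descent(start=0.0, learning_rate=0.1, steps=100))  # converges near 3
print(gradient_descent(start=0.0, learning_rate=1.1, steps=5))    # step too big: diverges
```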

Not all functions are convex - some have local minima or plateaus that make finding the global minimum difficult. Stochastic gradient descent attempts to solve this problem by randomly sampling a fraction of the training observations for each tree in the ensemble - not all trees contain all of the observations. This makes the algorithm faster and more reliable, but it may not always find the true overall minimum.

Grid search for these algorithms is very time consuming because of the time it takes to build these models, since they are built sequentially. To speed up this process, we typically set the tuning parameters to their defaults and optimize the number of trees in the GBM. We then fix the tree count and start to tune the other parameters. Then we go back and iteratively tune back and forth until we find an optimal solution.

There are many different adaptations to the gradient boosting approach:

  • XGBoost

  • LightGBM

  • CatBoost

  • etc.

XGBoost

Extreme gradient boosting (XGBoost) is one of the most popular versions of these algorithms. XGBoost has some great advantages over the traditional GBM:

  1. Additional regularization parameters to prevent over-fitting (problem of more things to tune).
  2. Settings to stop model assessment when adding more trees isn’t helpful.
  3. Supports parallel processing, but still must be trained sequentially.
  4. Variety of loss functions.
  5. Allows generalized linear models as well as tree-based models (all still weak learners though).
  6. Implemented in R, Python, Julia, Scala, Java, C++.

Let’s see how to build this in each software!

Tuning / Optimizing XGBoost Models

There are many things we can tune with an XGBoost model. Here are some common factors people typically tune in an XGBoost model:

  • Number of trees

  • Max depth of a tree (pruning)

  • Learning rate, \(\eta\)

  • Subsample percentage

We can use cross-validation to tune these values. Just like with the previous models, we can use either a grid search approach or Bayesian optimization.

Let’s see how to do this in each software!

Variable Importance

Most machine learning models are not interpretable in the classical sense - as one predictor variable increases, the target variable tends to ___. This is because the relationships are not linear. The relationships are more complicated than a linear relationship, so the interpretations are as well. XGBoost models are an aggregation of many decision trees which makes this general interpretation difficult. We cannot view the tree like we can with a simple decision tree. Again, we will have to use similar metrics for variable importance like we did with decision trees.

In XGBoost models, the typical variable importance metric is gain. In regression trees with continuous target variables, gain is the average reduction in the loss function (MSE). In classification trees with a categorical target variable, gain is the same average reduction in the loss function, but with the loss measured by log-loss.

Every time a feature is used to split a node, the software measures how much that split reduces the loss function - the gain. These gains are accumulated across all trees for each feature (variable) and then normalized to sum to 1. The one problem with this approach is that it tends to overemphasize variables with many unique values, because those variables offer more candidate split points.
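The accumulate-and-normalize step can be shown with a few lines of plain Python. The per-split gain values and feature names below are made up purely for illustration:

```python
# Hypothetical (feature, gain) pairs recorded at each split across all trees.
split_gains = [
    ("GarageArea", 120.0),
    ("OverallQual", 300.0),
    ("GarageArea", 80.0),
    ("YearBuilt", 50.0),
    ("OverallQual", 150.0),
]

# Accumulate gains per feature across every tree in the ensemble.
totals = {}
for feature, gain in split_gains:
    totals[feature] = totals.get(feature, 0.0) + gain

# Normalize so the importances sum to 1.
grand_total = sum(totals.values())
importance = {f: g / grand_total for f, g in totals.items()}
print(importance)
```

Note how `OverallQual` dominates simply because it accumulated the largest total gain, regardless of how many times it was used.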

XGBoost actually provides three built-in measures of variable importance:

  1. Gain - as defined above
  2. Coverage - measures the relative number of observations influenced by the variable
  3. Frequency - percentage of splits in the whole ensemble that use this variable

Let’s calculate variable importance in each software!

Variable Selection

Another thing to tune would be which variables to include in the XGBoost model. By default, an XGBoost model can end up using all of the variables, since each tree in the ensemble is free to split on any of them. There are a couple of ways to perform variable selection for XGBoost models:

  • Many permutations of including/excluding variables

  • Compare variables to random variable

The first approach is rather straightforward but time consuming, because you have to rebuild the model over and over again, taking a variable out each time. The second approach is much easier to try. In that second approach, we create a completely random variable and put it in the model. We then look at the variable importance of all the variables. Any variable that ranks below the random variable should be considered for removal.
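The random-variable approach can be sketched as below. This is an illustrative sketch on simulated data, with `GradientBoostingRegressor` standing in for an XGBoost model and generic feature names:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=42)

# Append a completely random column that cannot be related to the target.
X_aug = np.column_stack([X, rng.normal(size=X.shape[0])])
names = ["x0", "x1", "x2", "x3", "random"]

model = GradientBoostingRegressor(random_state=42).fit(X_aug, y)
random_importance = model.feature_importances_[-1]

# Any real feature whose importance falls below the random column
# is a candidate for removal.
candidates = [
    name
    for name, imp in zip(names, model.feature_importances_)
    if imp < random_importance and name != "random"
]
print(random_importance, candidates)
```

Because importances vary from fit to fit, this check is usually repeated across several random seeds before actually dropping a variable.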

Let’s see this in each software!

“Interpretability”

As discussed in the Variable Importance section, most machine learning models are not interpretable in the classical sense - as one predictor variable increases, the target variable tends to ___. The relationships are more complicated than a linear relationship, so the interpretations are as well. However, similar to random forests, we can get a general idea of the overall pattern relating a predictor variable to the target variable - partial dependence plots.

These plots will be covered in much more detail in the final section of this code deck under Model Agnostic Interpretability.

[Partial dependence plot: predicted SalePrice across values of GarageArea]

This nonlinear and complex relationship between GarageArea and SalePrice is similar to the plot we saw earlier with the GAMs and random forests. This shouldn’t be too surprising. All of these algorithms are trying to relate these two variables together, just in different ways.

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data - plugging the new data into the equation and recording the output. This means that our data must be in the same format as the data put into the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be taken with the new data you are about to score.
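One way to guarantee that new data goes through the exact same preparation steps is to bundle the preprocessing and the model together. The sketch below assumes a scikit-learn `Pipeline` with hypothetical toy data; the same idea applies whatever preprocessing your actual model used:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny training set with a missing value the pipeline must learn to impute.
X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
y_train = np.array([1.0, 2.0, 3.0, 4.0])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # same imputation at scoring time
    ("scale", StandardScaler()),                 # same transformation at scoring time
    ("model", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])
pipe.fit(X_train, y_train)

# Scoring: new data flows through the identical preprocessing, then the model.
X_new = np.array([[2.5, np.nan], [3.5, 7.0]])
preds = pipe.predict(X_new)
print(preds)
```

Nothing is re-fit at scoring time: the imputation means and scaling statistics learned from the training data are simply applied to the new rows before prediction.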

For this problem we will score our test dataset that we have previously set aside as we were building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. This would no longer make the test dataset an honest assessment of how good your model is actually doing. That also means that we should NOT just build hundreds of iterations of our same model to compare to the test dataset. That would essentially be doing the same thing as going back and rebuilding the model as you would be letting the test dataset decide on your final model.

Let’s score our data in each software!

Summary

In summary, XGBoost models are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using XGBoost models:

  • Very accurate

  • Tend to outperform random forests when properly trained and tuned.

  • Variable importance provided

There are some disadvantages though:

  • “Interpretability” is a little different than the classical sense

  • Computationally slower than random forests due to sequentially building trees

  • More tuning parameters than random forests

  • Harder to optimize

  • More sensitive to tuning of parameters