Explainable Boosting Machine

The explainable boosting machine (EBM) is a newer algorithm that aims for predictive power close to that of XGBoost while keeping the kind of interpretability that generalized additive models (GAMs) offer.

We covered GAMs in a previous section, but as a reminder, GAMs provide a general framework for adding non-linear functions together in place of the typical linear structure. The structure of a GAM is the following:

\[ y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon \]

The \(f_i(x_i)\) functions are complex, nonlinear functions of the individual predictor variables. GAMs add these complex, yet individual, functions together. This lets the model capture many complex relationships and potentially predict the target variable better.

EBMs use tree-based machine learning techniques (bagging and gradient boosting) to build these individual pieces, \(f_i (x_i)\), before adding them together in the GAM. To do this, they build out the same gradient boosting machine (GBM) structure but only use one variable at a time. The idea behind the GBM is to build a simple model to predict the target (much like a small decision tree or even a decision stump):

\[ y = g_1(x_1) + \varepsilon_1 \]

However, in the EBM that simple model is built only off of one variable, say \(x_1\). That simple model has an error of \(\varepsilon_1\). The next step is to try to predict that error with another simple model only using one other variable, say \(x_2\):

\[ \varepsilon_1 = g_2(x_2) + \varepsilon_2 \]

This model has an error of \(\varepsilon_2\). We can continue to add model after model, each one predicting the error (residuals) from the previous round, but only ever using one variable:

\[ y = g_1(x_1) + g_2(x_2) + \cdots + g_k(x_k) + \varepsilon_k \]

We will continue this process until we use all of the variables. We will then repeat this process in a round robin format, but still looking at the residuals from the previous set of models:

\[ y = g_1(x_1) + g_2(x_2) + \cdots + g_k(x_k) + \varepsilon_k \]

\[ \varepsilon_k = g_{1^*}(x_1) + g_{2^*}(x_2) + \cdots + g_{k^*}(x_k) + \varepsilon_{k^*} \]

\[ \varepsilon_{k^{*}} = g_{1^{**}}(x_1) + g_{2^{**}}(x_2) + \cdots + g_{k^{**}}(x_k) + \varepsilon_{k^{**}} \]

\[ \vdots \]

This round-robin cycle is repeated for thousands of iterations. As with the many trees in a random forest, the large number of iterations helps the model find all of the signal. We apply a small learning rate to each of the models (each tree's contribution to the running residual is scaled by a small number) so that the order of the variables does not matter. For each variable, say \(x_1\), we then aggregate all of the small models that contain only that variable, \(g_1(x_1)\), \(g_{1^*}(x_1)\), \(g_{1^{**}}(x_1)\), etc., to form our idea of the overall relationship between \(x_1\) and \(y\), our \(f_1(x_1)\) for our GAM:

\[ y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon \]

We will then repeat this process for all of the variables in the model. Essentially, we are developing each of the pieces of the GAM (the individual variable relationships) and then adding them together to form our overall model.
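The round-robin procedure described above can be sketched from scratch. The following is a minimal illustration, not the implementation used by any EBM library: it uses NumPy only, fits a single-split stump per variable per round, and leaves out the bagging, binning, and early stopping that a production implementation would typically include. The function names (`fit_stump`, `fit_round_robin`, `shape_function`, `predict`), the synthetic data, and the settings `n_rounds=500` and `learning_rate=0.05` are all assumptions made for this example.

```python
import numpy as np


def fit_stump(x, r):
    """Least-squares decision stump: one split on x, predicting residual r."""
    order = np.argsort(x)
    xs, rs = x[order], r[order]
    n = len(xs)
    left_sum = np.cumsum(rs)[:-1]      # sum of residuals left of each split
    left_n = np.arange(1, n)           # number of points left of each split
    right_sum = rs.sum() - left_sum
    right_n = n - left_n
    # Maximizing this quantity is equivalent to minimizing squared error.
    gain = left_sum**2 / left_n + right_sum**2 / right_n
    valid = xs[1:] > xs[:-1]           # cannot split between tied values
    if not valid.any():
        return np.inf, rs.mean(), rs.mean()
    i = int(np.argmax(np.where(valid, gain, -np.inf)))
    threshold = (xs[i] + xs[i + 1]) / 2
    return threshold, left_sum[i] / left_n[i], right_sum[i] / right_n[i]


def fit_round_robin(X, y, n_rounds=500, learning_rate=0.05):
    """Cycle through the variables, fitting one stump per variable to the residual."""
    n, k = X.shape
    intercept = y.mean()
    residual = y - intercept
    stumps = [[] for _ in range(k)]    # stumps[j] holds every g_j built for x_j
    for _ in range(n_rounds):
        for j in range(k):
            t, left, right = fit_stump(X[:, j], residual)
            left, right = learning_rate * left, learning_rate * right
            stumps[j].append((t, left, right))
            # The next model only sees what is still unexplained.
            residual = residual - np.where(X[:, j] <= t, left, right)
    return intercept, stumps


def shape_function(stumps_j, x):
    """f_j(x_j): the sum of all small models built on variable j."""
    out = np.zeros(len(x))
    for t, left, right in stumps_j:
        out += np.where(x <= t, left, right)
    return out


def predict(intercept, stumps, X):
    """GAM prediction: intercept plus the sum of the per-variable shape functions."""
    return intercept + sum(shape_function(s, X[:, j]) for j, s in enumerate(stumps))


# Synthetic data (an assumption for illustration): x3 carries no signal.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 3))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 400)

intercept, stumps = fit_round_robin(X, y)
pred = predict(intercept, stumps, X)
rmse = float(np.sqrt(np.mean((y - pred) ** 2)))
spread = [float(np.ptp(shape_function(stumps[j], X[:, j]))) for j in range(3)]
print("training RMSE:", round(rmse, 3))
print("range of each f_j:", [round(s, 3) for s in spread])
```

Because every stump uses exactly one variable, summing the stumps for a given variable gives that variable's shape function directly, which is what makes the final model readable as a GAM.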

These individually developed relationships make the interpretability similar to that of traditional GAMs, but with the predictive power of the more advanced machine learning boosting algorithms.

Let’s see this in each software!

Variable Importance / Interpretations

Now we have an EBM model built. We can use some of the functionality of that model to try and interpret the variables. Remember the GAM structure of the EBM described above: because each variable has its own individually developed relationship with the target, we can inspect those relationships one variable at a time, much like in a traditional GAM.

Let’s see this in each software!

Variable Selection

Another thing to tune is which variables to include in the EBM. By default, an EBM uses all the variables, since each variable gets its own set of trees that are aggregated into the model. There are a couple of ways to perform variable selection for EBMs:

  • Many permutations of including/excluding variables

  • Compare variables to random variable

The first approach is rather straightforward but time consuming because you have to build models over and over again, taking a variable out each time. The second approach is much easier to try. In that second approach we create a completely random variable and put it in the model. We then look at the variable importance of all the variables. The variables that rank below the random variable should be considered for removal.

Let’s see how to do this in each software!

Local Interpretability

Another valuable piece of output from the Python functionality of EBM models is local interpretability, in addition to the global interpretability we saw earlier. Python’s functionality automatically plots these importances, while R can only give the raw importance values. Here we will just go over the Python functionality.

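# Explain the model's prediction for the first observation only
ebm_local = ebm_ames.explain_local(X_reduced[:1], y[:1])

# Build the plotly figure for that observation (index 0 of the explained rows) and display it
plotly_fig = ebm_local.visualize(0)
plotly_fig.show()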

The plot above breaks down the model’s prediction for that one specific observation into the components that contribute to it, showing how each variable impacts the prediction. For example, the overall quality of the home being a 9 out of 10 contributes $40,701.35 to the predicted price of $315,000.

More details on these kinds of calculations are listed in the Model Agnostic Interpretability section of the notes.

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data: plugging the new data into the equation and recording the output. This means that the new data must be in the same format as the data used to build the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same steps must be applied to the new data you are about to score.

For this problem we will score the test dataset that we previously set aside while building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. Doing so would mean the test dataset is no longer an honest assessment of how well your model is actually doing. That also means that we should NOT just build hundreds of iterations of the same model to compare on the test dataset. That would essentially be the same thing as going back and rebuilding the model, because you would be letting the test dataset decide on your final model.

Let’s score our data in Python!

Summary

In summary, EBMs are good models to use for prediction and keep interpretability similar to that of GAMs. Some of the advantages of using EBMs:

  • Early results show they are powerful at predicting (almost to the level of XGBoost).

  • Interpretation available due to GAM nature (individually estimated relationships)

There are some disadvantages though:

  • Computationally slower than random forests due to sequentially building trees

  • More tuning parameters than random forests

  • Limited implementations across software platforms