Generalized Additive Models

General Structure

Generalized Additive Models (GAMs) provide a general framework for adding non-linear functions of the predictors together, instead of the strictly linear structure of linear regression. GAMs can be used for either continuous or categorical target variables. The structure of a GAM is the following:

\[ y = \beta_0 + f_1(x_1) + \cdots + f_p(x_p) + \varepsilon \]

The \(f_i(x_i)\) functions are complex, nonlinear functions of the predictor variables. GAMs add these complex, but individual, functions together. This allows the model to capture many complex relationships, which can potentially predict your target variable better. We will examine a few different forms of GAMs below.

Piecewise Linear Regression

The slope of the linear relationship between a predictor variable and a target variable can change over different values of the predictor variable. The typical straight-line model \(\hat{y} = \beta_0 + \beta_1x_1\) will not be a good fit for this type of data.

Here is an example of a data set that exhibits this behavior - the compressive strength of concrete and the proportion of water mixed with the cement. The compressive strength decreases at a much faster rate for batches with a water/cement ratio greater than 70%.

Linear Regression

Piecewise Linear Regression

If you were to fit a linear regression as in the first image above, it wouldn’t represent the data very well. However, this data is perfect for a piecewise linear regression. A piecewise linear regression is a model with different straight-line relationships over different intervals of the predictor variable. The following piecewise linear regression has two slopes:

\[ y = \beta_0 + \beta_1x_1 + \beta_2(x_1-k)x_2 + \varepsilon \]

The \(k\) value in the equation above is called the knot value for \(x_1\). The \(x_2\) variable is defined as a value of 1 when \(x_1 > k\) and a value of 0 when \(x_1 \le k\). With \(x_2\) defined this way, when \(x_1 \le k\), the equation becomes \(y = \beta_0 + \beta_1x_1 + \varepsilon\). When \(x_1 > k\), the equation gets a new intercept and slope: \(y = (\beta_0 - k\beta_2) + (\beta_1 + \beta_2)x_1 + \varepsilon\).
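As a rough sketch of this model, here is a fit by ordinary least squares on simulated data (the knot at 70 mirrors the water/cement example above, but the data and coefficients are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data with a slope change at x1 = 70 (illustrative knot value)
x1 = rng.uniform(40, 100, 200)
k = 70.0
y = 5.0 - 0.02 * x1 - 0.08 * np.maximum(x1 - k, 0) + rng.normal(0, 0.1, 200)

# x2 is the indicator that x1 exceeds the knot
x2 = (x1 > k).astype(float)

# Design matrix for y = b0 + b1*x1 + b2*(x1 - k)*x2
X = np.column_stack([np.ones_like(x1), x1, (x1 - k) * x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # estimates should be close to the simulated coefficients
```

Note that \((x_1 - k)x_2\) is exactly the “hinge” term \(\max(x_1 - k, 0)\), which is the same building block the MARS algorithm below uses.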

Let’s see this in each software!

The piecewise linear regression equation can be extended to have as many pieces as you want. An example with three lines (two knots) is as follows:

\[ y = \beta_0 +\beta_1x_1 + \beta_2(x_1-k_1)x_2 + \beta_3(x_1-k_2)x_3 + \varepsilon \]

One of the problems with this structure is that we have to define the knot values ourselves. The next set of models can help do that for us!

MARS (and EARTH)

Multivariate adaptive regression splines (MARS) is a non-parametric technique that keeps the additive, linear form of the model but allows nonlinearities and interactions between variables. Essentially, MARS uses a piecewise regression approach to split the data into pieces and then fits either linear or nonlinear patterns within each piece.

MARS first looks for the point in the range of a predictor \(x_i\) where two linear functions, one on either side of the point, provide the smallest squared error (by linear regression).
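This knot search can be sketched as a simple scan over candidate knot locations, keeping the one whose two-piece fit has the least squared error (the candidate grid and simulated data here are illustrative, not the exact MARS search):

```python
import numpy as np

def best_knot(x, y, candidates):
    """Scan candidate knots; for each, fit a two-piece linear model via a
    hinge term and return the knot with the smallest sum of squared errors."""
    best_k, best_sse = None, np.inf
    for k in candidates:
        X = np.column_stack([np.ones_like(x), x, np.maximum(x - k, 0)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = np.sum((y - X @ beta) ** 2)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 400)
y = 2 * x - 3 * np.maximum(x - 6, 0) + rng.normal(0, 0.3, 400)

k_hat = best_knot(x, y, candidates=np.linspace(1, 9, 81))
print(k_hat)  # should land near the true knot at 6
```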

The algorithm continues on each piece of the piecewise function until many knots are found. Take the example below that is more complicated than just splitting the data with one knot.

This will eventually overfit your data. However, the algorithm then works backwards to “prune” (or remove) the knots that do not contribute significantly to out-of-sample accuracy. This out-of-sample accuracy is estimated using the generalized cross-validation (GCV) procedure - a computational shortcut that approximates leave-one-out cross-validation. The algorithm does this for all of the variables in the data set and combines the outcomes together.
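One common form of the GCV criterion (the exact effective-parameter accounting varies by implementation) is \(\text{GCV} = \frac{RSS/n}{(1 - M/n)^2}\), where \(M\) counts the basis functions plus a per-knot penalty. A minimal sketch, with illustrative numbers:

```python
import numpy as np

def gcv(y, y_hat, n_basis, n_knots, penalty=3.0):
    """Generalized cross-validation score. The effective number of
    parameters inflates the basis-function count by a per-knot penalty
    (values around 2-3 are typical for MARS; 3.0 here is illustrative)."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    m = n_basis + penalty * n_knots  # effective number of parameters
    return (rss / n) / (1 - m / n) ** 2

# Identical fitted values but different model sizes: more terms -> worse GCV,
# which is what drives the backward pruning step
rng = np.random.default_rng(3)
y = rng.normal(0, 1, 100)
y_hat = y + rng.normal(0, 0.5, 100)

score_small = gcv(y, y_hat, n_basis=3, n_knots=1)
score_big = gcv(y, y_hat, n_basis=9, n_knots=4)
print(score_small, score_big)
```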

The actual MARS algorithm is trademarked by Salford Systems, so the common implementation in most software is instead enhanced adaptive regression through hinges - called EARTH.

Let’s see how to do this in each software!

Interpretability of relationships between predictor variables and the target variable starts to get more complicated with the EARTH (or MARS) algorithm. You can plot the relationship as we see above, but those relationships can still be rather complicated and hard to explain to a client.

Smoothing

Generalized additive models can be made up of any non-parametric function of the predictor variables. Another popular technique is to use smoothing functions so the piecewise linear regressions are not so jagged. The following are different types of smoothing functions:

  • LOESS (localized regression)

  • Smoothing splines & regression splines

LOESS

Locally estimated scatterplot smoothing (LOESS) is a popular smoothing technique. The idea of LOESS is to perform weighted linear regression in small windows of a scatterplot of data between two variables. This weighted linear regression is done around each point as the window moves from the low end of the scatterplot values to the high end. An example is shown below:

LOESS Regression

The predictions of these regression lines in each window are connected together to form the smoothed curve through the scatterplot as shown above.
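A minimal LOESS sketch on simulated data (the window size, the tricube weight function, and the data itself are illustrative choices, not a full production implementation):

```python
import numpy as np

def loess(x, y, span=0.3):
    """Minimal LOESS sketch: at each point, fit a weighted straight line
    using the nearest `span` fraction of the data with tricube weights."""
    n = len(x)
    k = max(int(span * n), 2)
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                       # k nearest neighbours
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
        X = np.column_stack([np.ones(k), x[idx]])
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 2 * np.pi, 150))
y = np.sin(x) + rng.normal(0, 0.2, 150)
smooth = loess(x, y, span=0.2)
```

The fitted curve should track the underlying sine wave much more closely than the raw noisy points do.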

Smoothing Splines

Smoothing splines take a different approach as compared to LOESS. Smoothing splines place a knot at every single observation for the piecewise regression, which by itself would lead to overfitting. A penalty parameter is used to counterbalance the “wiggle” of the spline curve.

Smoothing splines try to find the function \(s(x_i)\) that optimally fits \(x_i\) to the target variable through the following equation:

\[ \min_s \sum_{i=1}^n (y_i - s(x_i))^2 + \lambda\int s''(t)^2 \, dt \]

By thinking of \(s(x_i)\) as a prediction of \(y\), the first half of the equation is the sum of squared errors of your model. The second half applies the \(\lambda\) penalty to the integral of the squared second derivative of the smoothing function. Conceptually, the second derivative is the “slope of slopes,” which is large when the curve has a lot of “wiggle”. The optimal value of the \(\lambda\) penalty is estimated with another approximation of leave-one-out cross-validation.
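As a sketch of this trade-off, SciPy’s `UnivariateSpline` exposes a smoothing factor `s` (SciPy’s own parameterization, not \(\lambda\) itself, but it plays the same role: small `s` permits wiggle, large `s` forces smoothness). The data here are simulated:

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

# Small s -> the spline chases the noise (lots of wiggle)
wiggly = UnivariateSpline(x, y, s=1.0)

# Larger s -> a smoother curve; here s is set near the expected
# residual sum of squares of the true underlying function
smooth = UnivariateSpline(x, y, s=len(x) * 0.3 ** 2)
```

The wiggly fit will have a smaller in-sample residual sum of squares, while the smoother fit trades some of that in-sample accuracy for less curvature.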

These splines allow curves to be fit to your data instead of just piecewise lines. Take the same data shown in the section on piecewise regression, but now fit with cubic splines.

Regression splines are just a computationally nicer version of smoothing splines, so they will not be covered in detail here.

Let’s see how to do GAMs with splines in each software!

Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring

Scoring is the process of applying a fitted model to input data to generate outputs such as predicted values, probabilities, classifications, or scores.

Scoring data does not mean that we are re-running the model/algorithm on this new data. It just means that we are asking the model for predictions on this new data - plugging the new data into the equation and recording the output. This means that our data must be in the same format as the data put into the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be taken with the new data you are about to score.
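A toy sketch of this idea: the imputation value is learned from the training data and then re-applied, unchanged, to the new data being scored (all numbers here are made up, and the model is a simple straight line for illustration):

```python
import numpy as np

# Training data with a missing value; the model is a toy y = b0 + b1*x
x_train = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y_train = np.array([2.1, 4.0, 5.9, 8.1, 9.9])

train_mean = np.nanmean(x_train)  # imputation value learned from training data
x_filled = np.where(np.isnan(x_train), train_mean, x_train)

X = np.column_stack([np.ones_like(x_filled), x_filled])
beta, *_ = np.linalg.lstsq(X, y_train, rcond=None)

# Scoring new data: apply the SAME imputation value learned above,
# then plug the prepared data into the fitted equation
x_new = np.array([3.0, np.nan, 6.0])
x_new_filled = np.where(np.isnan(x_new), train_mean, x_new)
preds = beta[0] + beta[1] * x_new_filled
print(preds)
```

The key point is that `train_mean` is not recomputed from `x_new` - the new data passes through the exact same preparation pipeline the training data did.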

For this problem we will score our test dataset that we previously set aside while building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset. That would no longer make the test dataset an honest assessment of how well your model is actually doing. It also means that we should NOT build hundreds of iterations of our same model to compare against the test dataset. That would essentially be the same thing as going back and rebuilding the model, because you would be letting the test dataset decide on your final model.

Let’s score our data in each software!

Summary

In summary, GAMs are good models to use for prediction, but explanation becomes more difficult and complex. Some of the advantages of using GAMs:

  • Allows nonlinear relationships without trying out many transformations manually

  • Improved predictions

  • Limited “interpretation” still available

  • Computationally fast for small numbers of variables

There are some disadvantages though:

  • Interactions are possible, but computationally intensive

  • Not good for large numbers of variables, so prescreening is needed

  • Multicollinearity is still a problem