Regularized Regression

Regularization

Linear regression is a great initial approach to model building. In fact, in the realm of statistical models, linear regression (calculated by ordinary least squares) is the best linear unbiased estimator. The two key pieces of that statement are “best” and “unbiased.”

What does it mean to be unbiased? Each of the sample coefficients (\(\hat{\beta}\)’s) in the regression model is an estimate of a true coefficient. These sample coefficients have sampling distributions - specifically, normally distributed sampling distributions. The mean of the sampling distribution of \(\hat{\beta}_j\) is the true (unknown) coefficient \(\beta_j\). This is what makes the estimator unbiased.

What does it mean to be best? If the assumptions of ordinary least squares are fully met, then the sampling distributions of the coefficients in the model have the minimum variance among all unbiased estimators.

These two things combined seem like what we want in a model - estimating what we want (unbiased) and doing so with the minimum amount of variation (best among the unbiased). Again, these properties rely on the assumptions of linear regression holding true. When they do not, regularized regression offers a different approach to building the model.

As the number of variables in a linear regression model increases, the chance of having a model that meets all of the assumptions starts to diminish. Multicollinearity can also pose a large problem in bigger regression models. The coefficients of a linear regression vary widely in the presence of multicollinearity, and these variations lead to over-fitting of the regression model. More formally, these models have higher variance than desired. In those scenarios, moving out of the realm of unbiased estimates may lower the variance of the model, even though the model is no longer unbiased as described above. We wouldn’t want to be too biased, but a small degree of bias might improve the model’s fit overall and even lead to better predictions, at the cost of some interpretability.
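To see the multicollinearity problem concretely, here is a small sketch on hypothetical simulated data (not the text’s example) showing how OLS coefficients swing wildly from sample to sample when two predictors are nearly identical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical demo: x2 is almost an exact copy of x1, so the two
# predictors are nearly collinear. Refit OLS on 200 fresh samples
# and watch the coefficient estimates swing wildly.
rng = np.random.default_rng(6)
coefs = []
for _ in range(200):
    x1 = rng.normal(size=50)
    x2 = x1 + rng.normal(scale=0.01, size=50)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(scale=0.5, size=50)  # true coefficients: 1 and 1
    coefs.append(LinearRegression().fit(X, y).coef_)
coefs = np.array(coefs)

# The sample-to-sample spread of each coefficient is enormous even
# though the true coefficients are both just 1
print(np.round(coefs.std(axis=0), 1))
```

This is exactly the "higher variance than desired" problem that a small amount of bias from regularization can tame.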

Another potential problem for linear regression arises when we have more variables than observations in our data set. This is a common situation in genetic modeling. In this scenario, the ordinary least squares approach yields infinitely many solutions instead of a unique one. Unfortunately, most of these solutions over-fit the problem at hand anyway.

Regularized (or penalized or shrinkage) regression techniques can alleviate these problems. Regularized regression puts constraints on the estimated coefficients in our model and shrinks these estimates toward zero. This reduces the variation in the coefficients (improving the variance of the model), but at the cost of biasing the coefficients. The specific constraints placed on the regression define the three common approaches - ridge regression, LASSO, and elastic nets.

Penalties in Models

In ordinary least squares linear regression, we minimize the sum of the squared residuals (or errors).

\[ min(\sum_{i=1}^n(y_i - \hat{y}_i)^2) = min(SSE) \]

In regularized regression, however, we add a penalty term to the \(SSE\) as follows:

\[ min(\sum_{i=1}^n(y_i - \hat{y}_i)^2 + Penalty) = min(SSE + Penalty) \]

As mentioned above, the penalties we choose constrain the estimated coefficients in our model and shrink these estimates toward zero. Different penalties have different effects on the estimated coefficients. Two common approaches to adding penalties are the ridge and LASSO approaches. The elastic net approach is a combination of the two. Let’s explore each of these in further detail.

Ridge Regression

Ridge regression adds what is commonly referred to as an “\(L_2\)” penalty:

\[ min(\sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p \hat{\beta}^2_j) = min(SSE + \lambda \sum_{j=1}^p \hat{\beta}^2_j) \]

This penalty is controlled by the tuning parameter \(\lambda\). If \(\lambda = 0\), then we have typical OLS linear regression. However, as \(\lambda \rightarrow \infty\), the coefficients in the model shrink toward zero. This makes intuitive sense: since the estimated coefficients, the \(\hat{\beta}_j\)’s, are the only things changing to minimize this equation, as \(\lambda \rightarrow \infty\) the equation is best minimized by forcing the coefficients to be smaller and smaller. We will see how to optimize this penalty term later in this section.
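A minimal sketch of this shrinkage, using scikit-learn on simulated data (not the Ames data); scikit-learn’s `alpha` parameter plays the role of \(\lambda\):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical simulated data: 100 observations, 3 predictors,
# true coefficients (3, -2, 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

# As the penalty grows, every coefficient shrinks toward zero
# but never reaches it exactly
coefs = {alpha: Ridge(alpha=alpha).fit(X, y).coef_
         for alpha in [0.01, 1, 100, 10000]}
for alpha, coef in coefs.items():
    print(alpha, np.round(coef, 3))
```

Even at the largest penalty the coefficients are tiny but still nonzero, which previews the contrast with LASSO below.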

Here is a visual of the impact of adding more bias on the coefficients from our Ames housing model using a ridge regression:

[Plot: ridge regression coefficient paths for the Ames housing model as the penalty increases.]

Notice in the above plot that as the bias (represented by the x-axis) gets larger, the coefficients get closer to 0. However, those coefficients approach 0 asymptotically without ever equaling it. This is different from the next approach!

LASSO

Least absolute shrinkage and selection operator (LASSO) regression adds what is commonly referred to as an “\(L_1\)” penalty:

\[ min(\sum_{i=1}^n(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^p |\hat{\beta}_j|) = min(SSE + \lambda \sum_{j=1}^p |\hat{\beta}_j|) \]

As with ridge regression, this penalty is controlled by the tuning parameter \(\lambda\). If \(\lambda = 0\), then we have typical OLS linear regression, and as \(\lambda \rightarrow \infty\), the coefficients in the model shrink toward zero. We will see how to optimize this penalty term later in this section.

However, unlike ridge regression, where the coefficient estimates approach zero asymptotically, in LASSO regression the coefficients can actually equal zero. This may not be intuitive from the penalty terms themselves; it becomes easier to see from the solutions for the coefficient estimates. Without going into too much mathematical detail, those solutions come from taking the derivative of the minimization (objective) function, setting it equal to zero, and solving for the estimated coefficients. In OLS regression, the coefficient estimates can be shown to equal the following (in matrix form):

\[ \hat{\beta} = (X^TX)^{-1}X^TY \]

This changes in the presence of penalty terms. For ridge regression, the solution becomes the following:

\[ \hat{\beta} = (X^TX + \lambda I)^{-1}X^TY \]

No finite value of \(\lambda\) can force the coefficients to be exactly zero on its own. Therefore, unless the data itself makes a coefficient zero, the penalty term can only push the estimated coefficients toward zero asymptotically as \(\lambda \rightarrow \infty\).
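The ridge closed-form solution above can be checked numerically; here is a sketch on simulated data. Note that scikit-learn’s `Ridge` matches this formula when no intercept is fit, since it does not penalize the intercept:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical simulated data
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

lam = 5.0
# Closed-form ridge solution: (X'X + lambda*I)^(-1) X'y
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# scikit-learn's estimate agrees with the closed form
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_hat, sk.coef_))
```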

The LASSO solution is not as easy to write down in closed form. In this scenario, however, there are penalty values that force some estimated coefficients to equal exactly zero. This has a real benefit: LASSO also functions as a variable selection technique.
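A minimal sketch of this variable-selection behavior, using scikit-learn’s `Lasso` on simulated data where only the first two of five predictors actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical simulated data: only the first two predictors
# have nonzero true coefficients
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# A moderate L1 penalty forces the irrelevant coefficients to
# exactly zero, effectively removing those variables from the model
lasso = Lasso(alpha=0.5).fit(X, y)
print(np.round(lasso.coef_, 3))
```

The zeroed-out coefficients are exact zeros, not just small numbers, which is precisely what ridge regression cannot do.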

Here is a visual of the impact of adding more bias on the coefficients from our Ames housing model using a LASSO regression:

[Plot: LASSO coefficient paths for the Ames housing model as the penalty increases.]

Notice in the above plot that as the bias (represented by the x-axis) gets larger, the coefficients get closer to 0. However, unlike ridge regression, there comes a point where a coefficient gets zeroed out. Once that happens, the model is no longer impacted by that variable, which essentially removes the variable from the model.

Elastic Net

Which approach is better, ridge or LASSO? Both have advantages and disadvantages. LASSO performs variable selection, while ridge keeps all variables in the model. However, reducing the number of variables might hurt the model’s predictive error. Also, when two variables are highly correlated, which one LASSO chooses to zero out is essentially arbitrary with respect to the context of the problem.

Elastic nets were designed to take advantage of both penalty approaches. In elastic nets, we are using both penalties in the minimization:

\[ min(SSE + \lambda_1 \sum_{j=1}^p |\hat{\beta}_j| + \lambda_2 \sum_{j=1}^p \hat{\beta}^2_j) \]

Both R and Python take a slightly different approach to the elastic net implementation with the following:

\[ min(SSE + \alpha[ \rho \sum_{j=1}^p |\hat{\beta}_j| + (1-\rho) \sum_{j=1}^p \hat{\beta}^2_j]) \]

They still have one overall penalty, \(\alpha\); the \(\rho\) parameter balances between the two penalty terms. Setting \(\rho = 1\) gives LASSO, \(\rho = 0\) gives ridge, and any value strictly between zero and one gives an elastic net.
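A sketch with scikit-learn’s `ElasticNet`, whose `l1_ratio` parameter plays the role of \(\rho\). (Note that scikit-learn’s exact objective also rescales the SSE by \(1/(2n)\), so its constants differ slightly from the formula above.) The data here are simulated, not the Ames data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical simulated data: only the first two predictors matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 0.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# l1_ratio = 1.0 is pure LASSO, 0.0 is pure ridge;
# 0.5 blends the two penalties equally
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 3))
```

The L1 part of the blended penalty still zeroes out the irrelevant coefficients, while the L2 part keeps the shrinkage paths smoother.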

Here is a visual of the impact of adding more bias on the coefficients from our Ames housing model using an elastic net regression:

[Plot: elastic net coefficient paths for the Ames housing model as the penalty increases.]

Notice in the above plot that as the bias (represented by the x-axis) gets larger, the coefficients get closer to 0. Elastic net models blend together the penalties from both the ridge and LASSO regressions. That means variables can get zeroed out (their coefficients set exactly to zero) just like in LASSO, but the paths are slightly smoother, as in ridge.

Optimizing Penalties

In regularized regression models there is still a fear of overfitting, so we need to select the best value for the tuning parameter in our penalties, \(\lambda\). We don’t want \(\lambda\) so small that the model still overfits the training data, nor so large that the added bias keeps the model from capturing the overall pattern in the data.

As discussed in previous sections, cross-validation is a common approach to prevent overfitting when tuning a model parameter - here \(\lambda\). As a brief refresher, the concept of cross-validation is to:

  • Split the training data into multiple pieces

  • Build the model on the majority of the pieces

  • Evaluate the model on the remaining piece

  • Repeat the process, switching out which pieces are used for building and evaluation

Both ridge and LASSO regression have only one parameter to tune - the bias penalty. Elastic net regression, however, has two parameters to tune: the bias penalty as well as the balance between the ridge and LASSO penalties. Because ridge and LASSO are special cases of the elastic net, it is often easier to start by optimizing the elastic net model. If ridge or LASSO appears to be the best option within the elastic net optimization, we can then fit that model on its own.
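A sketch of this tuning process with scikit-learn’s `ElasticNetCV`, which cross-validates over both a grid of penalty strengths (chosen automatically) and a user-supplied grid of ridge/LASSO mixes; the data are simulated:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Hypothetical simulated data
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X @ np.array([4.0, -3.0, 1.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation over both tuning parameters at once:
# the penalty strength (alpha) and the ridge/LASSO mix (l1_ratio)
enet_cv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5).fit(X, y)
print(enet_cv.alpha_, enet_cv.l1_ratio_)
```

If the selected `l1_ratio_` lands at 1.0, cross-validation is telling us a pure LASSO fits best; a value near 0 would point toward ridge.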

Let’s see how to build these models in each software!

Comparing Predictions

Models are nothing but potentially complicated formulas or rules. Once we determine the optimal model, we can score any new observations we want with the equation.

Scoring

Scoring is the process of applying a fitted model to input data to generate outputs such as predicted values, probabilities, classifications, or scores.

Scoring data does not mean that we are re-running the model/algorithm on the new data. It just means we are asking the model for predictions on that data - plugging the new data into the equation and recording the output. This means the new data must be in the same format as the data used to build the model. Therefore, if you created any new variables, made any variable transformations, or imputed missing values, the same process must be applied to the new data you are about to score.
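One way to guarantee that new data receives the same transformations as the training data is to bundle the preprocessing and the model together. A sketch using a scikit-learn pipeline on hypothetical simulated data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Hypothetical simulated training data
rng = np.random.default_rng(5)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# The pipeline stores the scaling learned from the training data,
# so scoring applies the identical transformation to new data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)

# "Scoring" new observations: plug them into the fitted pipeline
X_new = rng.normal(size=(5, 3))
predictions = pipe.predict(X_new)
print(predictions.shape)
```

Calling `predict` never refits anything; it only pushes the new rows through the stored transformations and the fitted equation.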

For this problem we will score our test dataset, which we previously set aside while building our model. The test dataset is for comparing final models and reporting final model metrics. Make sure that you do NOT go back and rebuild your model after you score the test dataset; doing so would mean the test dataset no longer provides an honest assessment of how well your model is actually doing. That also means we should NOT build hundreds of iterations of the same model to compare against the test dataset. That would essentially be the same as going back and rebuilding the model, since you would be letting the test dataset decide on your final model.

Let’s score our data in each software!