Review of Modeling

This code deck assumes some basic knowledge of linear regression. Let’s review the key concepts:

  • Simple Linear Regression: one predictor variable for a continuous target variable.

  • Multiple Linear Regression: two or more predictor variables (continuous or categorical) for a continuous target variable.

  • Ordinary Least Squares: the primary method of estimating a linear regression, where we find the regression line that minimizes the sum of the squared residuals (differences between predictions and true values).

Train and Test Split

For this analysis we are going to use the popular Ames housing dataset. This dataset contains information on home values for a sample of nearly 1,500 houses in Ames, Iowa in the early 2000s. In the previous section, the data were already reduced by removing variables based on business logic, missingness, and low variability.

Before exploring any relationships between predictor variables and the target variable SalePrice, we need to split our data set into training and testing pieces. Because models are prone to discovering small, spurious patterns in the data used to create them (the training data), we set aside the testing data to get a clear view of how they might perform on new data that the models have never seen before.

Code
from sklearn.model_selection import train_test_split

train, test = train_test_split(ames, test_size=0.25, random_state=1234)

This split is done randomly; to make sure we can replicate the random split, we set the random_state option. Splitting at random helps the testing data set provide an honest assessment of how the model will perform in a real-world setting. A visual representation of this is below:

Train and Test Splitting

Exploring the Data

Once we have our data split into training and testing datasets, we can now start to explore the relationships in the training data.

Missing Values - Continuous Variables

Before we start fully exploring our data, we can now finish handling our missing values. We should not impute missing values for continuous predictor variables using the whole dataset, because that would use the mean/median of the entire variable, not just the training data, leaking information from the test set into the model. Any imputation statistics must be computed on the training data alone and then applied to the testing data. This is especially important if you want to impute missing values of a continuous variable using other variables in the dataset.
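As a sketch of this train-only imputation idea, the example below uses sklearn’s `SimpleImputer` on synthetic stand-in data (the actual ames columns are not shown here). The imputer is fit on the training split only and then applied to both splits:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Synthetic stand-in for one continuous predictor split into train/test.
rng = np.random.default_rng(1234)
train_X = rng.normal(180_000, 40_000, size=(100, 1))
test_X = rng.normal(180_000, 40_000, size=(20, 1))
train_X[rng.choice(100, size=5, replace=False), 0] = np.nan
test_X[0, 0] = np.nan

# Fit the imputer on the TRAINING data only, then apply that same
# training median to both splits -- never refit on the test data.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_X)
test_filled = imputer.transform(test_X)
```

The `transform` call fills the test split with the median learned from the training data, so no information from the test set leaks into the imputation.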

Let’s do this in each software!

Initial Univariate Relationships

Once we have data that is ready to explore, we can quickly examine individual predictor variable relationships with our target variable. With a small list of variables, this can be done by looking over scatterplots between your continuous predictor variables and your continuous target, and boxplots for categorical predictor variables with your continuous target. However, when the list of variables starts to get large, this becomes a burden. In these scenarios we can use automated techniques to quickly evaluate basic relationships between our predictors and our target. A quick word of caution: any time you use automated techniques to go through large lists of variables, there is a chance that a couple of variables with really complicated relationships might slip through these simple screenings. We are hoping that the large number of variables will overcome this problem and leave us with plenty still to pick from. That is why, with smaller numbers of variables, visual exploration is always preferred.

The approach we will take is to look at univariate statistical tests between one predictor variable at a time and our target variable. We will screen out variables with p-values that are too high. Initial p-value selection is a common approach to remove variables with very low linear predictive power for our continuous target variable. The cut-off for these p-values is a critical question. A lot of introductory statistical tests use a cut-off of 0.05; anything below that would be considered statistically significant and worth moving on to the next stage of model building. However, this is a common misconception. The original author of the p-value hypothesis-testing approach used 0.05 for sample sizes around 50. It has been shown since that larger sample sizes need far lower cut-offs (referred to as \(\alpha\)-levels) to be reasonable. These \(\alpha\)-levels correspond to the probability of making a Type I error - in our scenario, falsely declaring a variable significant when it really is not.

In a paper from 1994, Raftery provides more reasonable \(\alpha\)-levels based on the sample size of the training dataset:

As we can see, an \(\alpha\)-level of 0.05 is good for sample sizes of about 50 for finding weak relationships. For the sample size of our data, a cut-off of about 0.009 is much more reasonable. We still want to find anything with at least a weak relationship with our target since we don’t want to be too restrictive this early on.

The p-value we use comes from the F-test of a simple linear regression. We will fit each individual variable in a simple linear regression with our target variable and record the F-statistic’s p-value. The F-statistic is calculated as:

\[ F = \frac{MSR}{MSE} \]

where \(MSR\) refers to the mean square from the regression model and \(MSE\) refers to the mean square from the errors. More formally, \(MSR\) is based on the variance that is explained by the regression model. This is calculated by comparing the predictions from the regression model, \(\hat{y}_i\), to the overall average of the target, \(\bar{y}\):

\[ SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 \]

This \(SSR\) is then divided by how many variables we have in the model, \(p\). In our scenario, \(p=1\). This can be thought of as the “average” of the variance explained per variable in the model.

This is then compared to the \(MSE\). The \(MSE\) is the variance that is unexplained from the model. This is calculated by comparing the true values of our target variable, \(y_i\), to our regression model’s predictions, \(\hat{y}_i\):

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

This \(SSE\) is then divided by the degrees of freedom in our error: \(n-p-1\). This can be thought of as the “average” amount of the remaining, unexplained variance per observation left over from our model.

Combining these together gives us our original F statistic:

\[ F = \frac{MSR}{MSE} = \frac{SSR/p}{SSE/(n-p-1)} \]

This ratio compares the amount of variation our model was able to explain (per variable) to how much variation is still left over (per observation). The more our model can explain, the larger this ratio becomes, which then makes our p-value smaller.
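In Python, sklearn’s `f_regression` computes exactly this univariate F-statistic and its p-value for every predictor at once. A small sketch on synthetic stand-in data (the actual ames variables are not shown here):

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(1234)
n = 500
X = rng.normal(size=(n, 3))
# Only the first column is linearly related to the target.
y = 5 * X[:, 0] + rng.normal(size=n)

# One simple linear regression per column: F = MSR/MSE with
# p = 1 regression df and n - 2 error df for each univariate test.
F, p_values = f_regression(X, y)

# Screen with a sample-size-aware alpha level rather than 0.05.
alpha = 0.009
keep = p_values < alpha
```

Variables whose p-values fall below the chosen \(\alpha\)-level survive the screen and move on to model building.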

Let’s look at these automated approaches in each software!

Model Building & Automatic Selection

Now we have a reduced list of variables from our data exploration. Before diving into model building concepts let’s build an initial model at our current state.

Initial Linear Regression

Now that we have a shorter variable list, let’s build an initial multiple linear regression model:

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \cdots + \hat{\beta}_p x_{p,i} \]

Once we take a look at this initial linear regression, we will evaluate which variables we want to keep.
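A minimal sketch of fitting such a model with sklearn’s `LinearRegression`, using synthetic stand-in data with known coefficients (the coefficient p-value tables shown in each software typically come from other tools, such as statsmodels or R’s `lm`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: two predictors with known true coefficients.
rng = np.random.default_rng(1234)
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# OLS fit of y-hat = b0 + b1*x1 + b2*x2
model = LinearRegression().fit(X, y)
preds = model.predict(X)
```

The fitted `intercept_` and `coef_` attributes are the \(\hat{\beta}\) estimates from the equation above.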

Let’s see how to do this in each of our software!

Model Building Concepts

Just throwing all of our variables in a model, even if the list is reduced like we did above, is not the best strategy for model building. When it comes to model building there are a couple of things we need to consider - model metrics and selection algorithms.

Model metrics are evaluation calculations on models that we typically use to “select” variables for the model. Selection algorithms are automated techniques that quickly evaluate variables based on some selection criteria or model metric. There are two common types of algorithm groupings:

  1. Stepwise Selection - step through decisions to build a model without evaluating every possible combination.

  2. All-regression Selection - trying out all possible combinations of variables.

However, in the world of machine learning, there are many aspects of a model that we need to “tune” or adjust to make our models more predictive. We cannot do this on the training data alone, as it would lead to over-fitting the training data (predicting it too well) and a model that no longer generalizes. Take the following plot:

The red line is overfitted to the dataset and picks up too much of the unimportant pattern. The orange dotted line is underfit as it does not pick up enough of the pattern. The light blue, solid line is fit well to the dataset as it picks up the general pattern while not overfitting to the dataset.

To help with over-fitting, we could split the training data set again into a training and validation data set where the validation data set is what we “tune” our models on. The downside of this approach is that we will tend to start over-fitting the validation data set.

One approach to dealing with this is to use cross-validation. The idea of cross-validation is to divide your data set into k equally sized groups - called folds or samples. You leave one of the groups aside while building a model on the rest. You repeat this process of leaving one of the k folds out while building a model on the rest until each of the k folds has been left out once. We can evaluate a model by averaging the models’ effectiveness across all of the iterations. This process is highlighted below:

10-fold Cross-Validation Example
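The k-fold process described above can be sketched with sklearn’s `KFold` and `cross_val_score`, here on synthetic data with (negative) MSE as the model metric:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1234)
X = rng.normal(size=(300, 2))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(size=300)

# Each of the 10 folds is held out once while the model is fit on
# the remaining 9; the held-out MSE is recorded for each iteration.
cv = KFold(n_splits=10, shuffle=True, random_state=1234)
scores = cross_val_score(LinearRegression(), X, y,
                         cv=cv, scoring="neg_mean_squared_error")
avg_mse = -scores.mean()  # average held-out MSE across the 10 folds
```

sklearn reports *negative* MSE so that larger is always better; flipping the sign recovers the average held-out MSE used to compare models.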

We will look at model building first without the use of cross-validation, but then with it.

Model Metrics

Many different model metrics are used to help determine the validity of a model. Here are only SOME of the popular ones:

  • MAE (Mean Absolute Error) - the average of the absolute differences your predictions are from the truth.

  • MAPE (Mean Absolute Percentage Error) - the average of the absolute percentage differences your predictions are from the truth.

  • MSE (Mean Squared Error) - the average of the squared differences your predictions are from the truth; also, the estimate of the variance of your errors from your model.

  • RMSE (Root Mean Squared Error) - the square root of the MSE; also, the standard deviation of your errors from your model.

In all of the above metrics, lower values represent a model that better predicts the target variable. However, these metrics might not always agree on which model is the best one.

Scale Dependent and Symmetric

Three of the above metrics are dependent on scale. Those metrics are the MAE, MSE, and RMSE:

Mean Absolute Error (MAE):

\[ MAE = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]

Mean Squared Error (MSE):

\[ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Root Mean Squared Error (RMSE):

\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2} \]

These metrics are all scale dependent and symmetric. Scale dependent metrics can only be compared across data with similar values. For example, we cannot use MAE to compare a model trying to predict the price of a car with another model trying to predict the GDP of a country. An average error of $100,000 might be good for a model predicting GDP, but not for a model predicting the price of a car. A symmetric metric has the same meaning whether the prediction is above or below the actual value. For example, an error of $10,000 above a prediction counts the same as an error of $10,000 below.
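A quick sketch of these three metrics with sklearn, using small made-up house prices for illustration (RMSE is simply the square root of `mean_squared_error`):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 330_000.0])

mae = mean_absolute_error(y_true, y_pred)   # (10k + 10k + 30k) / 3
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # back on the scale of y
```

Note how MSE punishes the single $30,000 miss much more heavily than MAE does, which is why the two metrics can rank models differently.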

Scale Independent and Asymmetric

The MAPE metric does not depend on scale; however, it is not a symmetric metric.

Mean Absolute Percentage Error (MAPE):

\[ MAPE = \frac{1}{n} \sum_{i=1}^n \left| \frac{y_i - \hat{y}_i}{y_i} \right| \]

This final metric is both scale independent and asymmetric. Scale independent metrics can be compared across data with any values. For example, we can use MAPE to compare a model trying to predict the price of a car with another model trying to predict the GDP of a country. An average percentage error of 4% would be a worse model in context than a model with an average percentage error of 2.5%. An asymmetric metric has a different meaning if the prediction is above the actual value than if it is below. For example, a 5% error above is not the same as a 5% error below.
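sklearn also provides MAPE directly (in versions 0.24 and later, and reported as a proportion rather than a percentage); a tiny sketch:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

y_true = np.array([100_000.0, 200_000.0])
y_pred = np.array([110_000.0, 190_000.0])

# |10k/100k| and |10k/200k| average to (0.10 + 0.05) / 2 = 0.075
mape = mean_absolute_percentage_error(y_true, y_pred)
```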

Automatic Selection / Stepwise Regression

Which variables should you drop from your model? This is a common question for all modeling, but especially linear regression. In this section we will cover a popular variable selection technique - stepwise regression. This isn’t the only possible technique, but will be the primary focus here.

Stepwise regression techniques involve three common methods:

  1. Forward Selection

  2. Backward Selection

  3. Stepwise Selection

These techniques add or remove (depending on the technique) one variable at a time from your regression model to try and “improve” the model. There are a variety of different selection criteria to use to add or remove variables from a linear regression.

Forward Selection

Let’s work through the process of forward selection. In forward selection, the initial model is empty (contains no variables, only the intercept). Each variable is then tried to see if it adds value based on a model metric. The variable that adds the most value is then added to the model. Then the remaining variables are again tested and the next most impactful variable is added to the model. This process repeats until either no more variables are available or no other variables improve the model metric.

Take a look at the picture below:

Forward Feature Addition

In the above example, after all of the variables were tested in models by themselves with the target, the 5th variable was deemed the best by the model metric. That variable is now in the model from this point forward. Next, all two variable models were built with every other variable added one at a time with the fifth variable. The combination of the 5th and 9th variables was deemed as the best combination from these options. Those two variables are now in the model going forward. This process stopped when no more variables improved the model metric on the overall model.

Forward selection is the least used technique because stepwise selection does everything forward selection does with an added benefit, as discussed below.

Backward Selection

In backward selection, the initial model is full (contains all variables, including the intercept). Each variable is then tried to see if it is worth keeping based on a specific model metric. The least worthy variable (or the one that hinders the model the most) is then dropped from the model. Then the remaining variables are again tested and the next least important variable is dropped from the model.

Take a look at the picture below:

Backward Selection Example

In the above example, the 2nd variable was deemed as the worst variable because the model improved the most (based on a model metric) with its deletion. Therefore, the 2nd variable is dropped from the model and cannot come back into the model. Next, each variable is again evaluated to see how the model improves with that variable’s deletion. The 11th variable is now deemed the worst and dropped. This process repeats until the model no longer improves with the deletion of another variable.
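sklearn has no built-in p-value-based backward elimination, so the sketch below hand-rolls the process with numpy and scipy: fit OLS, find the least significant variable, drop it if its p-value exceeds the \(\alpha\)-level, and repeat. The data are synthetic, with only the first two columns truly related to the target.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Fit OLS with an intercept; return a p-value per predictor."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    mse = resid @ resid / (n - p - 1)              # error mean square
    cov = mse * np.linalg.inv(Xd.T @ Xd)
    t = beta / np.sqrt(np.diag(cov))
    pvals = 2 * stats.t.sf(np.abs(t), df=n - p - 1)
    return pvals[1:]                               # drop the intercept's p

def backward_select(X, y, alpha=0.009):
    """Drop the least significant variable until all pass alpha."""
    cols = list(range(X.shape[1]))
    while cols:
        pvals = ols_pvalues(X[:, cols], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= alpha:
            break                                  # everything left passes
        cols.pop(worst)                            # drop worst; refit next loop
    return cols

# Synthetic data: only the first two columns drive the target.
rng = np.random.default_rng(1234)
X = rng.normal(size=(500, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=500)
selected = backward_select(X, y)
```

Note that a dropped variable never re-enters the model, which is exactly the limitation stepwise selection addresses.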

Backward selection is one of the most popular approaches to automatic model selection.

Stepwise Selection

In stepwise selection, the initial model is empty (contains no variables, only the intercept), just as in forward selection. Each variable is added one at a time to evaluate all single-variable models, and the variable that improves the model metric the most is added to the model. Up to this point, the process is the same as forward selection. Next, every variable currently in the model (here, only one) is tested to see if it is still helpful by dropping each one at a time - similar to a single backward selection step. Any variable that hinders the model is dropped. Then the remaining variables are tested again and the next most impactful variable is added to the model. This process repeats until no addition or deletion of a variable improves the model metric.

Take a look at the picture below:

Stepwise Feature Selection

In the above picture, the process starts off very similarly to forward selection until we get to step 5. In step 5, the 6th variable is dropped from the model. In forward selection, once the 6th variable was added it could not be dropped. Here, however, that variable was re-evaluated to see if it still helped the model metric by being in the model. Since the model metric improved with its deletion, the variable was dropped.
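There is no stepwise routine built into sklearn either, so here is a hand-rolled sketch (a simplification, using AIC as the model metric, which is one common choice but not the only one): each iteration tries a forward (add) step and then a backward (drop) step, keeping any move that lowers the AIC. The data are synthetic, with only the first two columns related to the target.

```python
import numpy as np

def aic(cols, X, y):
    """AIC of an OLS fit using the given columns (plus an intercept)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    return n * np.log(rss / n) + 2 * (len(cols) + 2)  # + intercept, sigma

def stepwise(X, y):
    """Greedy add-then-drop search keeping any move that lowers AIC."""
    chosen, rest = [], list(range(X.shape[1]))
    best = aic(chosen, X, y)
    improved = True
    while improved:
        improved = False
        # Forward step: try adding each remaining variable.
        adds = [(aic(chosen + [j], X, y), j) for j in rest]
        if adds and min(adds)[0] < best:
            best, j = min(adds)
            chosen.append(j); rest.remove(j); improved = True
        # Backward step: try dropping each variable already in the model.
        drops = [(aic([k for k in chosen if k != j], X, y), j) for j in chosen]
        if drops and min(drops)[0] < best:
            best, j = min(drops)
            chosen.remove(j); rest.append(j); improved = True
    return sorted(chosen)

# Synthetic data: only the first two columns drive the target.
rng = np.random.default_rng(1234)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=400)
selected = stepwise(X, y)
```

Because each accepted move strictly lowers the AIC, the loop cannot revisit a previous model and is guaranteed to terminate.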

All of the above techniques can be built with and without cross-validation. Both approaches are discussed in the following sections.

No Cross-Validation

We will start without the use of cross-validation, focusing only on backward selection.

Let’s see how this works in each software!

There are better options for doing forward, backward, and stepwise selection algorithms using cross-validation, which is discussed in the next section.

Cross-Validation

Instead of evaluating each step of the stepwise regression techniques on the training data alone, the machine learning approach evaluates each step on the cross-validation data sets. The model with the best average MSE (for example) across the folds is the one that moves on to the next step of the selection algorithm.

Backward Selection

Let’s revisit the backward selection from above under this premise.

Backward Variable Elimination with CV

In the first step of the backward selection process, we create models such that each model has exactly one predictor variable removed, and calculate a model metric for each model. However, we do this across each cross-validation training set. For 10-fold cross-validation, that would be 10 separate versions of this first step, each creating models that remove only one variable. Next, we average the model metric across all folds to see which variable is worst. For example, the second variable is now the worst on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated across all of the steps until a final model is selected where removing any further variables does not improve the model based on the cross-validation model metrics.
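sklearn’s `SequentialFeatureSelector` follows this cross-validated backward process closely, with one caveat: you must tell it how many variables to keep (or a score tolerance) rather than letting p-values decide. A sketch on synthetic data:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of five columns drive the target.
rng = np.random.default_rng(1234)
X = rng.normal(size=(300, 5))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)

# Backward elimination: starting from all 5 variables, repeatedly drop
# the variable whose removal gives the best average 10-fold CV score.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    direction="backward",
    n_features_to_select=2,            # stop once 2 variables remain
    scoring="neg_mean_squared_error",
    cv=10,
)
sfs.fit(X, y)
kept = np.flatnonzero(sfs.get_support())
```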

Let’s see this in each software!

Forward Selection

Let’s revisit the forward selection approach with cross-validation.

Forward Variable Selection with CV

In the first step of the forward selection process, we create models such that each model has exactly one predictor variable, and calculate a model metric for each model. However, we do this across each cross-validation training set. For 10-fold cross-validation, that would be 10 separate versions of this first step, each creating models with only one variable. Next, we average the model metric across all folds to see which variable is best. For example, the 5th variable is now the best on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated across all of the steps until a final model is selected where adding any further variables does not improve the model based on the cross-validation model metrics.
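The same `SequentialFeatureSelector` sketch works for cross-validated forward selection by switching the `direction` argument (again, the number of variables to keep must be specified up front):

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of five columns drive the target.
rng = np.random.default_rng(1234)
X = rng.normal(size=(300, 5))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=300)

# Forward selection: starting from an empty model, repeatedly add the
# variable that improves the average 10-fold CV score the most.
sfs = SequentialFeatureSelector(
    LinearRegression(),
    direction="forward",
    n_features_to_select=2,            # stop once 2 variables are added
    scoring="neg_mean_squared_error",
    cv=10,
)
sfs.fit(X, y)
kept = np.flatnonzero(sfs.get_support())
```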

Let’s see this in each software!

Stepwise Selection

Let’s revisit the stepwise selection approach with cross-validation.

Stepwise Variable Selection with CV

In the first step of the stepwise selection process, we create models such that each model has exactly one predictor variable, and calculate a model metric for each model. However, we do this across each cross-validation training set. For 10-fold cross-validation, that would be 10 separate versions of this first step, each creating models with only one variable. Next, we average the model metric across all folds to see which variable is best. For example, the 5th variable is now the best on average across all of the cross-validation datasets based on our model metric, not just on the training set overall.

This same process is repeated across all of the steps until a final model is selected where adding any further variables does not improve the model based on the cross-validation model metrics. Remember, the difference between forward and stepwise selection is that we are also evaluating each variable added to the model at each step to see if deleting that variable improves the model metric.

Let’s see this in each software!

Cautions

There are some cautions that come with automatic selection techniques.

The first caution is that these techniques will not always agree with each other. If you explore the variable outputs above, you will notice that they do not agree on which variables should be included in the final model. While most of the variables are the same, there are some subtle differences. That puts the responsibility back in the hands of the model builder to understand the business context of the problem and decide which model might be best.

The second caution is that none of these techniques evaluate the models they build at each step. Linear regression models have assumptions and other diagnostics that should be checked, and doing so at every step of the above algorithms, across all of the models that were fit, would be too time consuming. Therefore, we acknowledge that these techniques are not perfect. The algorithms are here to help make large lists of variables more manageable, not to be the final model decision maker.

Thirdly, these algorithms are greedy algorithms. Greedy algorithms make the locally optimal choice at each step with the hope that it leads to the globally optimal solution. Remember, once the first step in the above algorithms was chosen, it influenced the subsequent models that were evaluated. Not every possible combination of variables was tested.

Next Steps

What we do with the results from this also differs depending on your background - the statistical/classical approach and the machine learning approach.

The more statistical and classical view would use validation to evaluate which model is “best” at each step of the process. The final model from the process contains the variables you want to keep in your model from that point going forward. To score a final test data set, you would combine the training and validation sets and then build a model with only the variables in the final step of the selection process. The coefficients on these variables might vary slightly because of the combining of the training and validation data sets.

The more machine learning / computer science approach is slightly different. You still use the validation to evaluate which model is “best” at each step in the process. However, instead of taking the variables in the final step of the process as the final variables to use, we look at the number of variables in the final step of the process as what has been optimized. To score a final test data set, you would combine the training and validation sets and then build a model with the same number of variables in the final process, but the variables don’t need to be the same as the ones in the final step of the process. The variables might be different because of the combining of the training and validation data sets.

Those further comparisons are not discussed here.