Residual Analysis

Linear regression models have key assumptions:

  1. The mean of the target is accurately modeled by a linear function of the predictors.

  2. The random error term, \(\varepsilon\), is assumed to have a constant variance, \(\sigma^2\).

  3. The random error term, \(\varepsilon\), is assumed to have a normal distribution with a mean of 0.

  4. The errors are independent of each other.

Sometimes, people state a 5th assumption:

  1. No perfect collinearity

To start, we will deal with the first four assumptions as they are the primary assumptions of linear regression. Notice how each of the first four assumptions deal with the errors in some capacity. Assumptions 2 - 4 directly mention assumptions. The first assumption around linearity deals with a model not being fit well which results in patterns in our errors.

Let’s look at those first four assumptions visually:

Visual Representation of Linear Regression Assumptions

Notice how we have a relationship between the predictor variable and target variable. That relationship is not perfect and has some errors. However, those errors follow normal distributions at every point along the regression line. These errors all have the same spread, or variance, along the whole regression line as well.

We do not actually view the true errors of our model, \(\varepsilon_i\), in practice.

\[ y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i \]

\[ \varepsilon_i = y_i - (\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}) \]

We do not observe actual errors because we don’t know the actual \(\beta\) coefficients. Instead, we have an estimate of the error - called a residual.

\[ \hat{\varepsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \cdots + \hat{\beta}_k x_{k,i}) \]

\[ \hat{\varepsilon}_i = y_i - \hat{y}_i \]

These residuals are just the difference between the true value of our target at each observation and the predicted values from our model of those observations. The following is a visual representation of that concept:

Residuals vs. True Errors from Regression Model

Let’s see how to get these residuals from each software!

Now that we have our residuals we can evaluate our assumptions.

Linearity / Lack-of-Fit

One of the assumptions of linear regression assumes that the expected value of the target variable is accurately modeled by a linear function of the predictor variables. We want our models to best represent the real relationship between our predictors and the target variable. If this is true, then we would expect our residual plots to be random scatter of observations (in other words, all of the “signal” was correctly captured in the model and there is just noise left over). If the residuals are not randomly scattered, then they are revealing potential misspecification (or lack-of-fit) of the model.

Here are the steps for detecting lack-of-fit in a model:

  1. Plot residuals against the predicted values of the target variable.

  2. Plot the partial residuals against each predictor variable.

  3. Look for patterns in these plots:

    • Trends

    • Changes in variation

    • Isolated, extreme observations

Partial residuals measure the effect of an individual variable \(x_i\) after accounting for all of the other predictor variables. The partial residuals for the \(j^{th}\) variable, \(x_j\), are defined as:

\[ e_i^* = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \cdots + \hat{\beta}_{j-1} x_{j-1,i} + \hat{\beta}_{j+1} x_{j+1,i} + \cdots + \hat{\beta}_k x_{k,i}) \]

\[ e_i^* = \hat{\varepsilon}_i + \hat\beta_j x_j \]

Essentially, the variable \(x_j\) as well as its coefficient has been removed from our calculation. That will allow us to see how that variable impacts the relationship. If we see any non-linear patterns that would be a sign that more signal remains.

Let’s look at a plot of our residuals and partial residuals using each software!

Detecting Unequal Variance

One of the assumptions of linear regression is that the variance of the probability distribution of \(\varepsilon_i\) (usually denoted as \(\sigma^2\)) is constant. This property is called homoscedasticity. The breaking of this assumption is called heteroscedasticity. We want a model that is not more likely to be wrong by wider margins depending on if the prediction is a high or low prediction. We want regression models to have the same spread of errors no matter if the prediction is high or low.

One of the most common examples of when the assumption of homoscedasticity is broken is the following pattern to our residuals:

Example of Heteroscedastic Residuals

When looking at our residual plot there appears to be a similar pattern where we have low spread for low predictions and higher spread for higher predictions:

Visualizations are difficult to interpret and that interpretation is subjective at times. Formal statistical tests for heteroscedasticity put more of a scientific feel on the question of heteroscedasticity. One of the most popular test for heteroscedasticity is the Breusch-Pagan test. This test has the null hypothesis of homoscedasticity - the assumption of constant variance being met. The alternative hypothesis is heteroscedasticity - the assumption failing. This is a hypothesis test where we want a large (insignificant) p-value.

Let’s see how to calculate this test in each software!

Based on the above tests there is a statistical problem with heteroscedasticity. The next step is to fix that problem.

One possibility is to transform your target variable with a variance stabilizing transformation. A variance stabilizing transformation is a mathematical transformation that converts a heteroscedastic model to a homoscedastic one. One of the most commonly used transformations to fix variance problems is the natural log.

Let’s transform our variables and explore the new model residuals in each software!

Normality

One of the assumptions of regression is that the probability distribution of \(\varepsilon_i\) is Normal. This is one of the hardest assumptions to meet in practice. However, if the assumption is not met on a small scale (symmetric, but not Normal, might be enough) then the results are not altered very much.

There are two common techniques to visually check for Normality of the error distribution.

  1. Histograms of the residuals

  2. Normal probability plots (Q-Q plots) of the residuals

Personally, I do not trust looking at histograms of data. Data is hard to check Normality with just histograms. This is because Normal distributions have specific properties that are hard to see visually. Although a histogram with symmetric that is in the shape of a bell-curve might be relatively easy to see, the thickness of the tails of a Normal distribution is hard to visualize and know for certain if it is Normal.

Q-Q plots are a far superior visual representation of Normality. These Normal probability plots compare the distribution of the residuals against expected quantiles from a Normal distribution with the same mean and standard deviation as the residuals. if the residuals are approximately equal to their expected points on the Normal distribution, then a straight, diagonal line is formed. Departures from a straight line are signs of the assumption not being met. The plot below shows when the assumption of Normality is being met as well as the common patterns from when it is not met:

Variety of Shapes of Q-Q Plots

If the Q-Q plot has a quadratic shape to it, that means that there is skewness in the data. The direction of the concavity of the quadratic curve shows which direction the skewness occurs. If you see an “S”-like or cubic shape to the Q-Q plot, then you have a kurtosis (thickness of tails) problem with your distribution. One of the deviations means your residuals have wider tails than a Normal distribution (Leptokurtic) and the other a more flat distribution (Platykurtic).

Let’s build histograms and Q-Q plots in each software!

Those Q-Q plots above show our residuals are not Normally distributed. However, visual inspection can be both difficult and subjective. There, we can add statistical tests for Normality to put more of a scientific feel on the question of Normality. There are many tests for Normality, but two of the most popular are the following:

  1. Shapiro-Wilk (better for small to medium sample sizes < 2,000)

  2. Anderson-Darling (better for larger sample sizes)

Both of these hypothesis test have the same null hypothesis - Normality. Since the alternative hypothesis for both tests is the distribution is not Normal, then we would prefer a larger p-value and these tests to not be statistically significant.

Let’s calculate both of these tests in each software!

We have ample evidence at this point to say that our residuals are definitely not Normally distributed.

One possible solution for having residuals that don’t follow a normal distribution is to transform the target variable. Similar to what we did for stabilizing variance for heteroscedasticity, when Normality is broken a transformation might help solve it.

A common statistical transformation is the Box-Cox transformation:

\[ y^` = \begin{cases}\frac{y^\lambda-1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(y) & \text{if } \lambda = 0\end{cases} \]

The value of \(\lambda\) is optimized to best fit the data and typically done with the help of computers. Here are some common transformations based on their respective values of \(\lambda\):

  • Close to 1: No transformation

  • Close to 0: Natural log

  • Close to 0.5: Square root

  • Less than 0: Inverse

Let’s run the Box-Cox transformation in each software and try to fix our lack of Normality problem!

Although not perfect, these plots look like they can get a strong majority of our data to be relatively normal. Since we needed the natural log transformation to help with heteroscedasticity already, it would be a logical choice to take that same transformation as a good solution to our Normality problem. There is still some skewness, but we have yet to evaluate outliers which we will do in subsequent sections.

Independence of Errors

Independence of errors typically is not a problem with a linear regression model built without using time series data. Cross-sectional data is collect across different individuals at a specific point in time. This is what we have with our housing dataset. Time series data on the other hand is collected for one individual (typically) across consecutive points in time. An example of this would be watching the average price of homes over time.

Regression models of time series data can lead to problems in the modeling process. The value of a time series at the point in time \(t\) is often related to the value at the point in time \(t+1\). For example, the average price of homes today is highly correlated to what that average price of homes will be tomorrow, but much less correlated to what the average price of homes will be in a decade. Errors are now correlated with each other which underestimates \(\beta\) coefficients.

The Durbin-Watson test is a populat statistical test to see if there is possible correlation across observations in our residuals. The null hypothesis for this test is no residual correlation, while the alternative is that there is residual correlation. The Durbin-Waston \(d\) test statistic for residual correlation is the following:

\[ d = \frac{\sum_{t=2}^n (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^n \hat\varepsilon^2} \]

The numerator in the above test statistic is the difference between errors at successive points in time. The \(d\) statistic has the following properties:

  • \(0 \le d \le 4\)

  • \(d \approx 2\): uncorrelated

  • \(d<2\): positively correlated

  • \(d>2\): negatively correlated

The sampling distribution of \(d\) is very complex, so no direct cut-off can be calculated for the statistic. This makes p-value calculations difficult and approximate.

Typically, we would not need to even run this test as our data is not recorded over time and therefore, we wouldn’t need to evaluate if it has correlation over time. However, just for completeness, let’s see how to do this in each software!

Outliers and Influential Observations

We have seen through previous looks through the residuals of our data that there are observations in our data that seem to be out of line with a majority of the data.

Observations with residuals that are extremely large are called outliers. When we say extremely large, we typically mean 3 standard deviations away from the mean. Remember, the mean of the residuals is 0. Our standardized residuals that we calculated earlier help us with this evaluation:

\[ \hat{\varepsilon}_i^* = \frac{\hat{\varepsilon}_i}{\sqrt{MSE}} = \frac{(y_i - \hat y_i)}{s} \]

Instead of just calculating standardized residuals, some people also calculate studentized residuals:

\[ \hat{\varepsilon}_i^{**} = \frac{\hat{\varepsilon}_i}{\sqrt{MSE \times (1-h_i)}} = \frac{(y_i - \hat y_i)}{s \sqrt{1-h_i}} \]

These residuals are not just adjusted for scale, but also for the possibility of their influence, here measured with leverage, \(h_i\). Similar to standardized residuals, the typical value of 3 or more represents large studentized residuals.

The leverage of an observation, \(h_i\) is the influence of that particular observation on the respective predicted value. In other words, how do the respective values of the predictor variables for the \(i^{th}\) observation affect the prediction \(\hat y_i\). The equation for leverage is not show here due to its complication. We will just rely on our software.

Leverage is one way to measure the impact an observation has on the regression analysis. Any observations with large impacts on the regression analysis are called influence observations. The following leverage values would imply an observation has an extremely large influence on the model:

\[ h_i > \frac{2(k+1)}{n} \]

Cook’s D (distance) is another way to determine is an observation is an influential observation. Cook’s D measure the influence of each observation on the estimated \(\beta\) coefficients overall.

\[ D_i = \frac{(y_i - \hat y_i)^2}{(k+1)MSE}(\frac{h_i}{(1-h_i)^2}) \]

Observations with large values of Cook’s D are considered influential. Large values of Cook’s D are the following:

\[ D_i > \frac{4}{n} \]

Let’s see how to calculate each of these in each software!

Multicollinearity

Multicollinearity Occurs when two or more of the predictor variables in a regression model are correlated with each other. If two inputs are correlated, they could be bringing similar information to the prediction of the target variable. To be a problem the correlation must be high, which is good because it is difficult to find predictor variables that are not correlated with each other.

High correlation between multiple predictor variables could lead to the following problems:

  • Errors in the calculation of the parameter estimates

  • Errors in the calculation of the standard errors

  • Results that are counterintuitive (parameters estimates with opposite signs)

There are some easy signs for the presence of severe multicollinearity in a regression model:

  • Incorrect signs of coefficients on predictor variables

  • Extreme differences in coefficients of predictor variables after the addition (or deletion) of another predictor variable

  • Switches in significance of predictor variables

A more formal way to evaluate if there are problems with multicollinearity is the variance inflation factor (VIF). The VIF is the amount of inflation of the standard error of the parameter estimates due to multicollinearity. Recall the equation for standard errors of parameter estimates in multiple linear regression:

\[ s_{\hat \beta_j}^2 = \frac{MSE}{(1-R_j^2)\times \sum_{i=1}^n (x_{j,i} - \bar x_j)} = \frac{1}{(1-R_j^2)} \times \frac{MSE}{\sum_{i=1}^n (x_{j,i} - \bar x_j)} \]

The value of \((1-R^2_j)\) is called the tolerance. The inverse of the tolerance, \(\frac{1}{(1-R_j^2)}\), is the VIF. In linear regression, when the VIF is greater than 10, we have a problem of multicollinearity.

The solutions to high multicollinearity are the following:

  1. Drop one of the correlated variables

  2. Avoid making inferences about the parameter estimates

  3. Biased regression techniques

The easiest solution to multicollinearity is the first one - dropping one of the variables causing the problem. If there are two variables that are highly correlated with each other we do not want to drop them both. They have made it this far in the analysis because they provide valuable information. However, they are just providing a severe overlap of information. Dropping all but one of these variables usually solves the multicollinearity problem.

The second solution is not as helpful, especially in the business world where we can’t just ignore what a variable’s stated impact is on the regression model. The third solution, biased regression techniques, will be covered in a subsequent section of this code deck.

Let’s calculate the VIF in each software!