Diagnostics

Residual Analysis

Linear regression models have key assumptions:

The mean of the target is accurately modeled by a linear function of the predictors.
The random error term, \(\varepsilon\), is assumed to have a constant variance, \(\sigma^2\).
The random error term, \(\varepsilon\), is assumed to have a normal distribution with a mean of 0.
The errors are independent of each other.

Sometimes, people state a 5th assumption:

No perfect collinearity
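
Perfect collinearity means that one column of the design matrix is an exact linear combination of other columns. The following is a minimal sketch of how that can be detected, assuming synthetic data and a hypothetical indicator column (flag) that is entered into the design matrix twice. None of these names come from the housing data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
flag = (rng.uniform(size=n) < 0.1).astype(float)  # hypothetical 0/1 indicator

# Design matrix: intercept, x1, and the same indicator entered twice
X = np.column_stack([np.ones(n), x1, flag, flag])
y = 2.0 + 1.5 * x1 + 4.0 * flag + rng.normal(scale=0.5, size=n)

# A rank smaller than the number of columns signals perfect collinearity
rank = np.linalg.matrix_rank(X)
print(f"columns = {X.shape[1]}, rank = {rank}")

# lstsq returns the minimum-norm solution, which splits the indicator's
# effect evenly across the two identical columns
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))
```

With duplicated columns the individual coefficients are not uniquely determined; only their sum is. That is why this condition is usually listed as a requirement for fitting the model at all rather than as an assumption about the errors.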

To start, we will deal with the first four assumptions, as they are the primary assumptions of linear regression. Notice how each of the first four assumptions deals with the errors in some capacity. Assumptions 2 - 4 directly mention the errors. The first assumption, linearity, concerns a model that does not fit well, which shows up as patterns in our errors.

Let’s look at those first four assumptions visually:

Visual Representation of Linear Regression Assumptions

Notice how we have a relationship between the predictor variable and target variable. That relationship is not perfect and has some errors. However, those errors follow normal distributions at every point along the regression line. These errors all have the same spread, or variance, along the whole regression line as well.

We do not actually view the true errors of our model, \(\varepsilon_i\), in practice.

\[ y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_i \]

\[ \varepsilon_i = y_i - (\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}) \]

We do not observe actual errors because we don’t know the actual \(\beta\) coefficients. Instead, we have an estimate of the error - called a residual.

\[ \hat{\varepsilon}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \cdots + \hat{\beta}_k x_{k,i}) \]

\[ \hat{\varepsilon}_i = y_i - \hat{y}_i \]

These residuals are just the difference between the true value of our target at each observation and the predicted values from our model of those observations. The following is a visual representation of that concept:

Residuals vs. True Errors from Regression Model
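
The distinction between true errors and residuals can also be sketched numerically. The example below assumes synthetic data where we pick the true \(\beta\) coefficients ourselves, which is the only reason the true errors are available for comparison. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, size=n)
eps = rng.normal(scale=2.0, size=n)   # true errors (never observed in practice)
y = 3.0 + 1.5 * x + eps               # true coefficients: 3.0 and 1.5

# Estimate the coefficients by least squares
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residuals: observed target minus predicted target
y_hat = X @ beta_hat
resid = y - y_hat

print("estimated coefficients:", np.round(beta_hat, 3))
print("largest gap between residual and true error:", np.abs(resid - eps).max())
print("mean of residuals:", resid.mean())
```

The residuals track the true errors closely but are not identical to them, because the estimated coefficients differ from the true ones.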

Let’s see how to get these residuals from each software!

Python
R

We are going to use the final variable list from our cross-validated stepwise regression from the previous section. These selected variables will go into an object called X_selected. We then use the add_constant function from the statsmodels.api package to add a column of 1’s for the intercept term. Lastly, using a combination of the OLS and fit functions, we have our linear regression model. The results are printed out with the summary function.

Code

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm
from scipy import stats

selected_features = ['OverallQual', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'GarageYrBlt_was_missing', 'LotShape_IR2', 'LotShape_Reg', 'LandContour_HLS', 'LotConfig_CulDSac', 'Condition1_Norm', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'HouseStyle_2Story', 'ExterQual_Fa', 'ExterQual_TA', 'Foundation_CBlock', 'Foundation_PConc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_Missing', 'BsmtQual_TA', 'HeatingQC_Gd', 'HeatingQC_TA', 'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'GarageType_Missing', 'GarageQual_Fa', 'GarageQual_Missing', 'GarageQual_TA', 'GarageCond_Missing']

# Subset the DataFrame to selected features and build model
X_selected = X_reduced[selected_features].copy()
X_selected = sm.add_constant(X_selected)
model = sm.OLS(y, X_selected).fit()

# Summary with p-values, coefficients, R², etc.
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.822
Model:                            OLS   Adj. R-squared:                  0.816
Method:                 Least Squares   F-statistic:                     143.9
Date:                Sat, 15 Nov 2025   Prob (F-statistic):               0.00
Time:                        10:53:35   Log-Likelihood:                -12997.
No. Observations:                1095   AIC:                         2.606e+04
Df Residuals:                    1060   BIC:                         2.624e+04
Df Model:                          34                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                   -2.175e+05   2.19e+05     -0.994      0.321   -6.47e+05    2.12e+05
OverallQual              1.279e+04   1492.668      8.570      0.000    9863.423    1.57e+04
YearBuilt                 237.3298     83.924      2.828      0.005      72.654     402.006
YearRemodAdd              112.2428     79.515      1.412      0.158     -43.782     268.268
GrLivArea                  54.3861      3.237     16.803      0.000      48.035      60.737
Fireplaces               6131.4549   2075.915      2.954      0.003    2058.086    1.02e+04
GarageYrBlt              -211.6405     86.888     -2.436      0.015    -382.133     -41.148
GarageCars               1.887e+04   2517.699      7.496      0.000    1.39e+04    2.38e+04
WoodDeckSF                 37.5885      8.972      4.189      0.000      19.983      55.194
ScreenPorch                53.0124     19.141      2.770      0.006      15.453      90.572
GarageYrBlt_was_missing  3359.4116   2749.123      1.222      0.222   -2034.930    8753.753
LotShape_IR2             1.773e+04   6606.161      2.683      0.007    4763.931    3.07e+04
LotShape_Reg            -3726.8188   2504.164     -1.488      0.137   -8640.500    1186.863
LandContour_HLS          6549.2627   6311.098      1.038      0.300   -5834.403    1.89e+04
LotConfig_CulDSac        1.515e+04   4843.966      3.128      0.002    5647.515    2.47e+04
Condition1_Norm          1.364e+04   3226.084      4.228      0.000    7308.414       2e+04
BldgType_2fmCon         -8757.2057   7789.961     -1.124      0.261    -2.4e+04    6528.290
BldgType_Duplex         -1.841e+04   6747.458     -2.729      0.006   -3.17e+04   -5173.315
BldgType_Twnhs          -1.994e+04   6748.013     -2.956      0.003   -3.32e+04   -6703.182
HouseStyle_2Story       -1.202e+04   2835.845     -4.238      0.000   -1.76e+04   -6453.927
ExterQual_Fa            -1.418e+04    1.2e+04     -1.178      0.239   -3.78e+04    9439.891
ExterQual_TA            -5418.1438   3911.919     -1.385      0.166   -1.31e+04    2257.840
Foundation_CBlock        1.326e+04   4536.907      2.922      0.004    4355.636    2.22e+04
Foundation_PConc         1.241e+04   5240.455      2.367      0.018    2122.670    2.27e+04
BsmtQual_Fa             -4.417e+04   9709.360     -4.550      0.000   -6.32e+04   -2.51e+04
BsmtQual_Gd             -3.986e+04   4952.091     -8.049      0.000   -4.96e+04   -3.01e+04
BsmtQual_Missing        -4.134e+04   1.01e+04     -4.113      0.000   -6.11e+04   -2.16e+04
BsmtQual_TA             -3.968e+04   6254.691     -6.344      0.000    -5.2e+04   -2.74e+04
HeatingQC_Gd            -4449.9818   3219.601     -1.382      0.167   -1.08e+04    1867.533
HeatingQC_TA            -6010.6850   3068.190     -1.959      0.050    -1.2e+04       9.730
KitchenQual_Fa          -4.501e+04   9372.851     -4.802      0.000   -6.34e+04   -2.66e+04
KitchenQual_Gd          -3.471e+04   5149.388     -6.740      0.000   -4.48e+04   -2.46e+04
KitchenQual_TA          -4.216e+04   6016.914     -7.007      0.000    -5.4e+04   -3.04e+04
GarageType_Missing       3359.4116   2749.123      1.222      0.222   -2034.930    8753.753
GarageQual_Fa           -2.272e+04   1.07e+04     -2.130      0.033   -4.37e+04   -1787.834
GarageQual_Missing       3359.4116   2749.123      1.222      0.222   -2034.930    8753.753
GarageQual_TA           -1.489e+04   8958.221     -1.662      0.097   -3.25e+04    2691.881
GarageCond_Missing       3359.4116   2749.123      1.222      0.222   -2034.930    8753.753
==============================================================================
Omnibus:                      375.214   Durbin-Watson:                   2.046
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            35278.304
Skew:                          -0.575   Prob(JB):                         0.00
Kurtosis:                      30.783   Cond. No.                     1.00e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.54e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

The results above are there for comparison with later adjustments to the model if needed. To do residual analysis we need to gather the predicted values and residuals from our model. The fittedvalues attribute of the fitted model gives us predicted values from our training data. The resid attribute gives us the corresponding residuals.

Later calculations will require our residuals to be standardized. Here we are calculating the standardized residuals by dividing the residuals by the model’s \(RMSE\).

\[ \hat{\varepsilon}_i^* = \frac{\hat{\varepsilon}_i}{\sqrt{MSE}} \]

Recall from the previous section that the \(RMSE\) is the estimated standard deviation of the residuals. To get the model’s \(MSE\) we use the mse_resid attribute of the fitted model. From there we just calculate its square root using the numpy package’s sqrt function.

Code

fitted = model.fittedvalues
residuals = model.resid
standardized_residuals = residuals / np.sqrt(model.mse_resid)
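
To make the calculation concrete, here is a minimal sketch of the same standardization done by hand with numpy on synthetic data. It assumes the \(MSE\) is the residual sum of squares divided by the residual degrees of freedom (observations minus estimated coefficients), which is how statsmodels documents mse_resid. The data and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 150, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=3.0, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# MSE = residual sum of squares / residual degrees of freedom
df_resid = n - X.shape[1]
mse = np.sum(resid**2) / df_resid
rmse = np.sqrt(mse)

# Standardized residuals: residuals measured in units of the RMSE
std_resid = resid / rmse

print(f"RMSE = {rmse:.3f}")
print(f"share of |standardized residuals| above 2: {(np.abs(std_resid) > 2).mean():.3f}")
```

After standardization the residuals are unit-free, so a value of 2 means the observation sits two estimated standard deviations from its prediction regardless of the scale of the target.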

In R, we use the same final variable list from our cross-validated stepwise regression from the previous section. To make the results consistent between Python and R, we will just use the variables flagged by Python here. These selected variables will go into an object called X_selected. We then use the lm function with the formula y ~ ., which regresses the target on every column of X_selected and includes the intercept term automatically. The results are printed out with the summary function.

Code

X_reduced <- py$X_reduced

selected_features <- c('OverallQual', 'YearBuilt', 'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'GarageYrBlt_was_missing', 'LotShape_IR2', 'LotShape_Reg', 'LandContour_HLS', 'LotConfig_CulDSac', 'Condition1_Norm', 'BldgType_2fmCon', 'BldgType_Duplex', 'BldgType_Twnhs', 'HouseStyle_2Story', 'ExterQual_Fa', 'ExterQual_TA', 'Foundation_CBlock', 'Foundation_PConc', 'BsmtQual_Fa', 'BsmtQual_Gd', 'BsmtQual_Missing', 'BsmtQual_TA', 'HeatingQC_Gd', 'HeatingQC_TA', 'KitchenQual_Fa', 'KitchenQual_Gd', 'KitchenQual_TA', 'GarageType_Missing', 'GarageQual_Fa', 'GarageQual_Missing', 'GarageQual_TA', 'GarageCond_Missing')

X_selected <- X_reduced[, selected_features]

# Build the linear regression
model <- lm(y ~ ., data = X_selected)

# Print the model summary with p-values, coefficients, R², etc.
summary(model)


Call:
lm(formula = y ~ ., data = X_selected)

Residuals:
    Min      1Q  Median      3Q     Max 
-383968  -15461    -513   14931  252273 

Coefficients: (3 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -2.175e+05  2.188e+05  -0.994  0.32063    
OverallQual              1.279e+04  1.493e+03   8.570  < 2e-16 ***
YearBuilt                2.373e+02  8.392e+01   2.828  0.00477 ** 
YearRemodAdd             1.122e+02  7.952e+01   1.412  0.15836    
GrLivArea                5.439e+01  3.237e+00  16.803  < 2e-16 ***
Fireplaces               6.131e+03  2.076e+03   2.954  0.00321 ** 
GarageYrBlt             -2.116e+02  8.689e+01  -2.436  0.01502 *  
GarageCars               1.887e+04  2.518e+03   7.496 1.39e-13 ***
WoodDeckSF               3.759e+01  8.972e+00   4.189 3.03e-05 ***
ScreenPorch              5.301e+01  1.914e+01   2.770  0.00571 ** 
GarageYrBlt_was_missing  1.344e+04  1.100e+04   1.222  0.22198    
LotShape_IR2             1.773e+04  6.606e+03   2.683  0.00740 ** 
LotShape_Reg            -3.727e+03  2.504e+03  -1.488  0.13698    
LandContour_HLS          6.549e+03  6.311e+03   1.038  0.29963    
LotConfig_CulDSac        1.515e+04  4.844e+03   3.128  0.00181 ** 
Condition1_Norm          1.364e+04  3.226e+03   4.228 2.57e-05 ***
BldgType_2fmCon         -8.757e+03  7.790e+03  -1.124  0.26120    
BldgType_Duplex         -1.841e+04  6.747e+03  -2.729  0.00646 ** 
BldgType_Twnhs          -1.994e+04  6.748e+03  -2.956  0.00319 ** 
HouseStyle_2Story       -1.202e+04  2.836e+03  -4.238 2.45e-05 ***
ExterQual_Fa            -1.418e+04  1.204e+04  -1.178  0.23906    
ExterQual_TA            -5.418e+03  3.912e+03  -1.385  0.16633    
Foundation_CBlock        1.326e+04  4.537e+03   2.922  0.00355 ** 
Foundation_PConc         1.241e+04  5.240e+03   2.367  0.01810 *  
BsmtQual_Fa             -4.417e+04  9.709e+03  -4.550 6.00e-06 ***
BsmtQual_Gd             -3.986e+04  4.952e+03  -8.049 2.23e-15 ***
BsmtQual_Missing        -4.134e+04  1.005e+04  -4.113 4.21e-05 ***
BsmtQual_TA             -3.968e+04  6.255e+03  -6.344 3.32e-10 ***
HeatingQC_Gd            -4.450e+03  3.220e+03  -1.382  0.16722    
HeatingQC_TA            -6.011e+03  3.068e+03  -1.959  0.05037 .  
KitchenQual_Fa          -4.501e+04  9.373e+03  -4.802 1.79e-06 ***
KitchenQual_Gd          -3.471e+04  5.149e+03  -6.740 2.59e-11 ***
KitchenQual_TA          -4.216e+04  6.017e+03  -7.007 4.32e-12 ***
GarageType_Missing              NA         NA      NA       NA    
GarageQual_Fa           -2.272e+04  1.067e+04  -2.130  0.03342 *  
GarageQual_Missing              NA         NA      NA       NA    
GarageQual_TA           -1.489e+04  8.958e+03  -1.662  0.09687 .  
GarageCond_Missing              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35130 on 1060 degrees of freedom
Multiple R-squared:  0.8219,    Adjusted R-squared:  0.8162 
F-statistic: 143.9 on 34 and 1060 DF,  p-value: < 2.2e-16

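Later calculations will require our residuals to be standardized. The basic calculation divides the residuals by the model’s \(RMSE\), as shown below. In R, the rstandard function computes standardized residuals for us; it also adjusts each residual for its leverage, so its values can differ slightly from this simple ratio.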

\[ \hat{\varepsilon}_i^* = \frac{\hat{\varepsilon}_i}{\sqrt{MSE}} \]

Code

fitted <- fitted(model)
residuals <- residuals(model)
standardized_residuals <- rstandard(model)
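
For reference, rstandard divides each residual by \(\sqrt{MSE\,(1 - h_{ii})}\), where \(h_{ii}\) is the leverage of observation \(i\) (the \(i^{th}\) diagonal element of the hat matrix). The following is a minimal sketch in Python comparing that scaling with the simple \(RMSE\) scaling, assuming synthetic data. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
s = np.sqrt(np.sum(resid**2) / (n - X.shape[1]))  # RMSE

# Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

simple = resid / s                            # residual / RMSE
leverage_adj = resid / (s * np.sqrt(1 - h))   # leverage-adjusted version

print("largest leverage:", round(h.max(), 3))
print("largest difference between the two scalings:",
      round(np.abs(leverage_adj - simple).max(), 4))
```

For most observations the two versions are nearly identical. They differ most for high-leverage points, whose raw residuals tend to be pulled toward zero by the fit.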

Now that we have our residuals we can evaluate our assumptions.

Linearity / Lack-of-Fit

One of the assumptions of linear regression is that the expected value of the target variable is accurately modeled by a linear function of the predictor variables. We want our models to best represent the real relationship between our predictors and the target variable. If this is true, then we would expect our residual plots to be a random scatter of observations (in other words, all of the “signal” was correctly captured in the model and there is just noise left over). If the residuals are not randomly scattered, then they are revealing potential misspecification (or lack-of-fit) of the model.

Here are the steps for detecting lack-of-fit in a model:

Plot residuals against the predicted values of the target variable.
Plot the partial residuals against each predictor variable.
Look for patterns in these plots:
- Trends
- Changes in variation
- Isolated, extreme observations
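
As a rough numeric stand-in for the plots described in the steps above, the sketch below assumes synthetic data whose true relationship is quadratic. It fits a straight line and then a model with a squared term, and averages the residuals within thirds of the fitted values. The helper binned_residual_means is hypothetical and written only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
x = rng.uniform(-3, 3, size=n)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=1.0, size=n)  # true curve is quadratic

def binned_residual_means(X, y, bins=3):
    """Fit least squares, sort residuals by fitted value, average within bins."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    resid = y - fitted
    order = np.argsort(fitted)
    return np.array([chunk.mean() for chunk in np.array_split(resid[order], bins)])

X_line = np.column_stack([np.ones(n), x])         # misspecified: straight line only
X_quad = np.column_stack([np.ones(n), x, x**2])   # includes the missing signal

means_line = binned_residual_means(X_line, y)
means_quad = binned_residual_means(X_quad, y)

print("straight-line fit:", np.round(means_line, 2))
print("with squared term:", np.round(means_quad, 2))
```

In the misspecified model the average residual is positive at both ends of the fitted values and negative in the middle, which is the kind of trend a smoothed line on a residual plot would reveal. Once the squared term is included, the binned averages sit close to zero.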

Partial residuals measure the effect of an individual variable \(x_j\) after accounting for all of the other predictor variables. The partial residuals for the \(j^{th}\) variable, \(x_j\), are defined as:

\[ e_i^* = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \cdots + \hat{\beta}_{j-1} x_{j-1,i} + \hat{\beta}_{j+1} x_{j+1,i} + \cdots + \hat{\beta}_k x_{k,i}) \]

\[ e_i^* = \hat{\varepsilon}_i + \hat{\beta}_j x_{j,i} \]

Essentially, the variable \(x_j\) as well as its coefficient has been removed from our calculation. That will allow us to see how that variable impacts the relationship. If we see any non-linear patterns that would be a sign that more signal remains.
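
The two expressions for the partial residuals are algebraically the same, which can be checked with a short sketch. It assumes synthetic data with three predictors and picks column index j = 2 arbitrarily. The names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 120
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([2.0, 1.0, -0.5, 0.3]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

j = 2  # column of the predictor of interest (column 0 is the intercept)
others = [c for c in range(X.shape[1]) if c != j]

# Definition 1: target minus the fit from every term except x_j
partial_a = y - X[:, others] @ beta_hat[others]

# Definition 2: ordinary residual plus the x_j term added back
partial_b = resid + beta_hat[j] * X[:, j]

print("definitions agree:", np.allclose(partial_a, partial_b))

# Slope of the partial residuals against x_j
slope = np.polyfit(X[:, j], partial_b, 1)[0]
print("slope:", round(slope, 4), " coefficient:", round(beta_hat[j], 4))
```

Because the least squares line through the partial residuals has a slope equal to the estimated coefficient, any curvature around that line in a partial residual plot points to a non-linear effect of \(x_j\) that the model has missed.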

Let’s look at a plot of our residuals and partial residuals using each software!

Python
R

We previously calculated our predicted values and residuals in Python. Now we need to plot them. We will use the matplotlib and seaborn packages to do so. With the seaborn package’s residplot function we pass the predicted values (fitted values) and the residuals to the x and y options respectively. The lowess = True option fits a smoothed curve through the data points. This can help us see if there is any remaining pattern. The last few lines add a horizontal line at our residuals’ average value of 0 (axhline) and add a title and labels to the plot.

Code

plt.figure(figsize = (6, 4))
sns.residplot(x = fitted, y = residuals, lowess = True, line_kws = {'color': 'red'})
plt.axhline(0, linestyle = '--', color = 'black')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Linearity: Residuals vs Fitted")
plt.show()

From the plot above we can see a few things. Immediately we notice some points that seem to be far away from the main cloud of data. The estimated line through our data points has a very slight curve. A curve this subtle might not be enough to say there is a nonlinear pattern, especially with some of those extreme observations potentially skewing the curve. The points also seem to be closer together for lower predicted values and more spread apart for higher predicted values. This might break the equal variance assumption.

Now let’s explore our partial residuals. There is no easy function in Python to build these for us so we will have to build them by hand. First, we use the exog_names attribute of the underlying model (model.model.exog_names) to gather a list of our variable names from the model. Notice how we skip the first element (start at index 1 instead of 0) to skip the intercept. Next, we need to gather our three pieces for the calculation of partial residuals - the residuals themselves (resid), the \(\beta\) coefficient (params), and the variable values (the column of model.model.exog that matches the variable name). From there we can easily calculate the partial residuals based on the simplified calculation from above. Lastly, we need to plot these partial residuals against the variable we are using with the regplot function from seaborn. We use a for loop to loop through all of the variables in our variable list.

The code below will plot all of the variables from our model, but we will only show a couple with our output.

Code

# Get list of predictors (exclude intercept)
predictors = model.model.exog_names[1:] 

# Loop through predictors and generate partial residual plots
for var in predictors:
    residuals = model.resid
    beta = model.params[var]
    x_var = model.model.exog[:, model.model.exog_names.index(var)]
    partial_residuals = residuals + beta * x_var

    # Plot
    plt.figure(figsize=(6, 4))
    sns.regplot(x = x_var, y = partial_residuals, lowess = True, line_kws = {'color': 'red'})
    plt.axhline(0, linestyle = '--', color = 'black')
    plt.xlabel(var)
    plt.ylabel("Partial Residuals")
    plt.title(f"Partial Residual Plot for {var}")
    plt.tight_layout()
    plt.show()

We previously calculated our predicted values and residuals in R. Now we need to plot them. We will use the ggplot2 and ggthemes packages to do so. First, our data needs to be in one dataframe. Using the data.frame function, we combine our predicted / fitted values and the residuals into one dataframe. With the ggplot2 package’s ggplot function we pass the dataframe along with the predicted values (fitted values) and the residuals as the x and y aesthetics respectively. Inside the geom_smooth function, the method = "loess" option fits a smoothed curve through the data points. This can help us see if there is any remaining pattern. The last few layers add a horizontal line at our residuals’ average value of 0 (geom_hline) and add a title and labels to the plot along with a theme.

Code

library(ggplot2)
library(ggthemes)  

resid_data <- data.frame(Fitted = fitted, Residuals = residuals)

# Create the residual plot
ggplot(resid_data, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Linearity: Residuals vs Fitted"
  ) +
  theme_classic()

Now let’s explore our partial residuals. To do this we just use the crPlot function from the car package inside of a for loop. The for loop will loop through the crPlot function using all of the variable names obtained from the coef and names functions above it. Notice how we exclude the first element from the coefficient names since we do not want the intercept.

The code below will plot all of the variables from our model, but we will only show a couple with our output.

Code

library(car)

# Get predictor names (excluding intercept)
predictors <- names(coef(model))[-1]

# Loop over each predictor and plot
for (pred in predictors) {
  crPlot(model, variable = pred,
         main = paste("Partial Residual Plot for", pred))
}

Detecting Unequal Variance

One of the assumptions of linear regression is that the variance of the probability distribution of \(\varepsilon_i\) (usually denoted as \(\sigma^2\)) is constant. This property is called homoscedasticity. The breaking of this assumption is called heteroscedasticity. We want a model that is not more likely to be wrong by wider margins depending on if the prediction is a high or low prediction. We want regression models to have the same spread of errors no matter if the prediction is high or low.

One of the most common examples of when the assumption of homoscedasticity is broken is the following pattern to our residuals:

When looking at our residual plot there appears to be a similar pattern where we have low spread for low predictions and higher spread for higher predictions:

Visualizations are difficult to interpret and that interpretation is subjective at times. Formal statistical tests for heteroscedasticity give a more objective answer to the question. One of the most popular tests for heteroscedasticity is the Breusch-Pagan test. This test has the null hypothesis of homoscedasticity - the assumption of constant variance being met. The alternative hypothesis is heteroscedasticity - the assumption failing. This is a hypothesis test where we want a large (insignificant) p-value.
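
To show the idea behind the test, here is a minimal sketch of the studentized (Koenker) form of the Breusch-Pagan statistic computed by hand. It assumes synthetic data with a single predictor whose error spread grows with the predictor. The squared residuals are regressed on the predictors, and the statistic is the number of observations times the \(R^2\) of that auxiliary regression. With one predictor it is compared to a chi-square distribution with 1 degree of freedom, whose upper tail can be written with the complementary error function.

```python
import math
import numpy as np

rng = np.random.default_rng(5)
n = 500
x = rng.uniform(1, 10, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)  # error spread grows with x

# Step 1: fit the original regression and keep the residuals
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Step 2: auxiliary regression of squared residuals on the same predictors
e2 = resid**2
gamma_hat, *_ = np.linalg.lstsq(X, e2, rcond=None)
aux_fit = X @ gamma_hat
r2_aux = 1 - np.sum((e2 - aux_fit)**2) / np.sum((e2 - e2.mean())**2)

# Step 3: test statistic and chi-square(1) upper-tail p-value
lm_stat = n * r2_aux
p_value = math.erfc(math.sqrt(lm_stat / 2))

print(f"LM statistic = {lm_stat:.2f}, p-value = {p_value:.3g}")
```

If the predictors explain none of the variation in the squared residuals, the auxiliary \(R^2\) is near zero and the p-value is large. A small p-value means the size of the residuals depends on the predictors, which is evidence of heteroscedasticity.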

Let’s see how to calculate this test in each software!

Python
R

All we need to run the Breusch-Pagan test in Python is the het_breuschpagan function from the statsmodels.stats.diagnostic module. The inputs to that function are the residuals we calculated earlier as well as the values of each of the variables from the model (obtained from the model.model.exog attribute). From there we are just printing out the p-value from the test.

Code

from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(residuals, model.model.exog)

print(f"Breusch-Pagan test: p-value = {bp_test[1]:.3f}")

Breusch-Pagan test: p-value = 0.000

That p-value is extremely small and would be significant at any reasonable \(\alpha\)-level.

All we need to run the Breusch-Pagan test in R is the bptest function from the lmtest package. The only input to that function is the model that we built previously and saved as model.

Code

library(lmtest)

bptest(model)


    studentized Breusch-Pagan test

data:  model
BP = 318.18, df = 34, p-value < 2.2e-16

That p-value is extremely small and would be significant at any reasonable \(\alpha\)-level.

Based on the above tests there is a statistical problem with heteroscedasticity. The next step is to fix that problem.

One possibility is to transform your target variable with a variance stabilizing transformation. A variance stabilizing transformation is a mathematical transformation that converts a heteroscedastic model to a homoscedastic one. One of the most commonly used transformations to fix variance problems is the natural log.
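
A minimal sketch of why the natural log can stabilize variance, assuming synthetic data where the errors are multiplicative (each error scales with the size of the target). The helper spread_ratio is hypothetical and compares the residual standard deviation in the upper half of the fitted values with the lower half.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 2000
x = rng.uniform(0, 2, size=n)
y = np.exp(1.0 + 1.0 * x + rng.normal(scale=0.3, size=n))  # multiplicative errors

X = np.column_stack([np.ones(n), x])

def spread_ratio(target):
    """Residual SD for high fitted values divided by residual SD for low fitted values."""
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    fitted = X @ beta
    resid = target - fitted
    upper = fitted > np.median(fitted)
    return resid[upper].std() / resid[~upper].std()

ratio_raw = spread_ratio(y)          # model on the original scale
ratio_log = spread_ratio(np.log(y))  # model on the natural log scale

print(f"spread ratio, original target: {ratio_raw:.2f}")
print(f"spread ratio, log target:      {ratio_log:.2f}")
```

A ratio near 1 indicates similar spread for low and high predictions. On the original scale the spread is clearly larger for high predictions, while on the log scale the multiplicative errors become additive with roughly constant variance.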

Let’s transform our target variable and explore the new model residuals in each software!

Python
R

The log function from the numpy package transforms our target variable. We save this new variable as the object y2 instead. We now fit the same OLS regression using this new natural log of SalePrice. The results are printed below, but the main focus is on the new residuals.

Code

y2 = np.log(train['SalePrice'])

model2 = sm.OLS(y2, X_selected).fit()

# Summary with p-values, coefficients, R², etc.
print(model2.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.858
Model:                            OLS   Adj. R-squared:                  0.854
Method:                 Least Squares   F-statistic:                     189.0
Date:                Sat, 15 Nov 2025   Prob (F-statistic):               0.00
Time:                        10:53:36   Log-Likelihood:                 520.79
No. Observations:                1095   AIC:                            -971.6
Df Residuals:                    1060   BIC:                            -796.6
Df Model:                          34                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                       6.4015      0.952      6.723      0.000       4.533       8.270
OverallQual                 0.0742      0.006     11.420      0.000       0.061       0.087
YearBuilt                   0.0014      0.000      3.780      0.000       0.001       0.002
YearRemodAdd                0.0018      0.000      5.166      0.000       0.001       0.002
GrLivArea                   0.0002   1.41e-05     17.202      0.000       0.000       0.000
Fireplaces                  0.0532      0.009      5.886      0.000       0.035       0.071
GarageYrBlt                -0.0008      0.000     -2.040      0.042      -0.002   -2.92e-05
GarageCars                  0.0771      0.011      7.040      0.000       0.056       0.099
WoodDeckSF                  0.0002    3.9e-05      4.497      0.000    9.89e-05       0.000
ScreenPorch                 0.0003   8.33e-05      4.097      0.000       0.000       0.001
GarageYrBlt_was_missing    -0.0081      0.012     -0.675      0.500      -0.032       0.015
LotShape_IR2                0.0558      0.029      1.940      0.053      -0.001       0.112
LotShape_Reg               -0.0213      0.011     -1.960      0.050      -0.043    2.93e-05
LandContour_HLS             0.0123      0.027      0.447      0.655      -0.042       0.066
LotConfig_CulDSac           0.0556      0.021      2.639      0.008       0.014       0.097
Condition1_Norm             0.0546      0.014      3.887      0.000       0.027       0.082
BldgType_2fmCon             0.0210      0.034      0.619      0.536      -0.046       0.087
BldgType_Duplex            -0.0285      0.029     -0.972      0.331      -0.086       0.029
BldgType_Twnhs             -0.1598      0.029     -5.443      0.000      -0.217      -0.102
HouseStyle_2Story          -0.0551      0.012     -4.466      0.000      -0.079      -0.031
ExterQual_Fa               -0.1845      0.052     -3.522      0.000      -0.287      -0.082
ExterQual_TA               -0.0036      0.017     -0.211      0.833      -0.037       0.030
Foundation_CBlock           0.0856      0.020      4.335      0.000       0.047       0.124
Foundation_PConc            0.0851      0.023      3.732      0.000       0.040       0.130
BsmtQual_Fa                -0.1372      0.042     -3.248      0.001      -0.220      -0.054
BsmtQual_Gd                -0.0605      0.022     -2.806      0.005      -0.103      -0.018
BsmtQual_Missing           -0.1859      0.044     -4.251      0.000      -0.272      -0.100
BsmtQual_TA                -0.0763      0.027     -2.805      0.005      -0.130      -0.023
HeatingQC_Gd               -0.0154      0.014     -1.099      0.272      -0.043       0.012
HeatingQC_TA               -0.0339      0.013     -2.539      0.011      -0.060      -0.008
KitchenQual_Fa             -0.1686      0.041     -4.135      0.000      -0.249      -0.089
KitchenQual_Gd             -0.0676      0.022     -3.019      0.003      -0.112      -0.024
KitchenQual_TA             -0.1125      0.026     -4.298      0.000      -0.164      -0.061
GarageType_Missing         -0.0081      0.012     -0.675      0.500      -0.032       0.015
GarageQual_Fa              -0.1045      0.046     -2.250      0.025      -0.196      -0.013
GarageQual_Missing         -0.0081      0.012     -0.675      0.500      -0.032       0.015
GarageQual_TA              -0.0472      0.039     -1.212      0.226      -0.124       0.029
GarageCond_Missing         -0.0081      0.012     -0.675      0.500      -0.032       0.015
==============================================================================
Omnibus:                      672.544   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            18358.695
Skew:                          -2.336   Prob(JB):                         0.00
Kurtosis:                      22.508   Cond. No.                     1.00e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.54e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

We calculate the predictions, residuals, and standardized residuals the same way we did before, but with the results from the new model.

Code

fitted2 = model2.fittedvalues
residuals2 = model2.resid
standardized_residuals2 = residuals2 / np.sqrt(model2.mse_resid) 

plt.figure(figsize=(6, 4))
plt.scatter(fitted2, residuals2, alpha=0.7)
plt.axhline(0, linestyle='--', color='black')
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Homoscedasticity Check")
plt.show()

Those residuals look much better in terms of heteroscedasticity. There are still some outliers that will make any statistical test say constant variance is broken, but the overall pattern looks much better than we had before.

The log function in R transforms our target variable. We save this new variable as the object y2 instead. We now fit the same OLS regression using this new natural log of SalePrice. The results are printed below, but the main focus is on the new residuals.

Code

y2 = log(train$SalePrice)

# Build the linear regression
model2 <- lm(y2 ~ ., data = X_selected)

# Print the model summary with p-values, coefficients, R², etc.
summary(model2)


Call:
lm(formula = y2 ~ ., data = X_selected)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.67235 -0.07213  0.00943  0.08801  0.43662 

Coefficients: (3 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              6.401e+00  9.521e-01   6.723 2.90e-11 ***
OverallQual              7.416e-02  6.494e-03  11.420  < 2e-16 ***
YearBuilt                1.380e-03  3.651e-04   3.780 0.000166 ***
YearRemodAdd             1.787e-03  3.459e-04   5.166 2.86e-07 ***
GrLivArea                2.422e-04  1.408e-05  17.202  < 2e-16 ***
Fireplaces               5.316e-02  9.032e-03   5.886 5.29e-09 ***
GarageYrBlt             -7.710e-04  3.780e-04  -2.040 0.041643 *  
GarageCars               7.711e-02  1.095e-02   7.040 3.46e-12 ***
WoodDeckSF               1.755e-04  3.904e-05   4.497 7.67e-06 ***
ScreenPorch              3.412e-04  8.328e-05   4.097 4.51e-05 ***
GarageYrBlt_was_missing -3.228e-02  4.784e-02  -0.675 0.499974    
LotShape_IR2             5.576e-02  2.874e-02   1.940 0.052625 .  
LotShape_Reg            -2.135e-02  1.089e-02  -1.960 0.050315 .  
LandContour_HLS          1.226e-02  2.746e-02   0.447 0.655310    
LotConfig_CulDSac        5.562e-02  2.107e-02   2.639 0.008427 ** 
Condition1_Norm          5.456e-02  1.404e-02   3.887 0.000108 ***
BldgType_2fmCon          2.097e-02  3.389e-02   0.619 0.536242    
BldgType_Duplex         -2.852e-02  2.936e-02  -0.972 0.331488    
BldgType_Twnhs          -1.598e-01  2.936e-02  -5.443 6.49e-08 ***
HouseStyle_2Story       -5.511e-02  1.234e-02  -4.466 8.81e-06 ***
ExterQual_Fa            -1.845e-01  5.237e-02  -3.522 0.000446 ***
ExterQual_TA            -3.594e-03  1.702e-02  -0.211 0.832801    
Foundation_CBlock        8.557e-02  1.974e-02   4.335 1.60e-05 ***
Foundation_PConc         8.508e-02  2.280e-02   3.732 0.000200 ***
BsmtQual_Fa             -1.372e-01  4.224e-02  -3.248 0.001197 ** 
BsmtQual_Gd             -6.046e-02  2.155e-02  -2.806 0.005102 ** 
BsmtQual_Missing        -1.859e-01  4.373e-02  -4.251 2.32e-05 ***
BsmtQual_TA             -7.633e-02  2.721e-02  -2.805 0.005126 ** 
HeatingQC_Gd            -1.540e-02  1.401e-02  -1.099 0.271906    
HeatingQC_TA            -3.389e-02  1.335e-02  -2.539 0.011264 *  
KitchenQual_Fa          -1.686e-01  4.078e-02  -4.135 3.83e-05 ***
KitchenQual_Gd          -6.765e-02  2.240e-02  -3.019 0.002593 ** 
KitchenQual_TA          -1.125e-01  2.618e-02  -4.298 1.89e-05 ***
GarageType_Missing              NA         NA      NA       NA    
GarageQual_Fa           -1.045e-01  4.642e-02  -2.250 0.024634 *  
GarageQual_Missing              NA         NA      NA       NA    
GarageQual_TA           -4.723e-02  3.897e-02  -1.212 0.225878    
GarageCond_Missing              NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1528 on 1060 degrees of freedom
Multiple R-squared:  0.8584,    Adjusted R-squared:  0.8539 
F-statistic:   189 on 34 and 1060 DF,  p-value: < 2.2e-16

We calculate the predictions, residuals, and standardized residuals the same way we did before, but with the results from the new model.

Code

fitted2 <- fitted(model2)
residuals2 <- residuals(model2)
standardized_residuals2 <- rstandard(model2)

resid_data2 <- data.frame(Fitted = fitted2, Residuals = residuals2)

# Create the residual plot
ggplot(resid_data2, aes(x = Fitted, y = Residuals)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(
    x = "Fitted values",
    y = "Residuals",
    title = "Homoscedasticity Check"
  ) +
  theme_classic()

Normality

One of the assumptions of regression is that the probability distribution of \(\varepsilon_i\) is Normal. This is one of the hardest assumptions to meet in practice. However, if the assumption is not met on a small scale (symmetric, but not Normal, might be enough) then the results are not altered very much.

There are two common techniques to visually check for Normality of the error distribution.

Histograms of the residuals
Normal probability plots (Q-Q plots) of the residuals

Personally, I do not trust histograms for checking Normality. Normal distributions have specific properties that are hard to see visually. Although a symmetric, bell-shaped histogram might be relatively easy to spot, the thickness of the tails of a Normal distribution is hard to judge by eye, so it is difficult to know for certain whether the distribution is Normal.

Q-Q plots are a far superior visual representation of Normality. These Normal probability plots compare the distribution of the residuals against expected quantiles from a Normal distribution with the same mean and standard deviation as the residuals. If the residuals are approximately equal to their expected points on the Normal distribution, then a straight, diagonal line is formed. Departures from a straight line are signs of the assumption not being met. The plot below shows when the assumption of Normality is being met as well as the common patterns from when it is not met:

If the Q-Q plot has a quadratic shape to it, that means that there is skewness in the data. The direction of the concavity of the quadratic curve shows which direction the skewness occurs. If you see an “S”-like or cubic shape to the Q-Q plot, then you have a kurtosis (thickness of tails) problem with your distribution. One of the deviations means your residuals have wider tails than a Normal distribution (Leptokurtic) and the other a more flat distribution (Platykurtic).
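To make the idea behind a Q-Q plot concrete, here is a minimal sketch of how its points can be computed by hand. The residual values are made up for illustration, and the plotting positions \((i - 0.5)/n\) are one common convention that we are assuming here; plotting software may use a slightly different one.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a Normal Q-Q plot."""
    sample = sorted(residuals)
    n = len(sample)
    # Reference Normal with the same mean and standard deviation as the residuals
    ref = NormalDist(mean(sample), stdev(sample))
    # Assumed plotting positions (i - 0.5) / n keep probabilities strictly inside (0, 1)
    theoretical = [ref.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, sample

# Hypothetical left-skewed residuals: one large negative value in the tail
resid = [-4.0, -0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.5, 0.6, 0.8]
theo, samp = qq_points(resid)
for t, s in zip(theo, samp):
    print(f"theoretical = {t:6.2f}   sample = {s:6.2f}")
```

If the residuals were Normal, each sample value would sit close to its theoretical partner. Here the smallest residual falls well below its expected quantile, which is the kind of tail departure that bends the Q-Q plot away from the diagonal line.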

Let’s build histograms and Q-Q plots in each software!

Python
R

We can use the histplot function from the seaborn package and the qqplot function from the statsmodels package to easily build these for us. We will use the subplots function from matplotlib to plot these beside each other for easier viewing. By putting the standardized residuals we calculated earlier into each of these functions, we can evaluate their normality. Again, it is easier to truly evaluate normality with the Q-Q plot than with the histogram.

Code

fig, ax = plt.subplots(1, 2, figsize = (6, 4))

# Histogram + Density Curve
sns.histplot(standardized_residuals, kde = True, ax = ax[0])
ax[0].set_title("Histogram of Residuals")

# Q-Q plot
sm.qqplot(standardized_residuals, line = '45', ax = ax[1])
ax[1].set_title("Q-Q Plot")

plt.tight_layout()
plt.show()

Code

library(ggplot2)
library(gridExtra)

# Histogram + density plot
p1 <- ggplot(data.frame(resid = standardized_residuals), aes(x = resid)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", color = "black") +
  geom_density(color = "red", linewidth = 1) +
  ggtitle("Histogram of Residuals") +
  theme_minimal()

# Q-Q plot using stat_qq and stat_qq_line
qq_data <- data.frame(sample = standardized_residuals)
p2 <- ggplot(qq_data, aes(sample = sample)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  ggtitle("Q-Q Plot") +
  theme_minimal()

# Arrange plots side-by-side
grid.arrange(p1, p2, ncol = 2)

Those Q-Q plots above show our residuals are not Normally distributed. However, visual inspection can be both difficult and subjective. Therefore, we can add statistical tests for Normality to make the assessment more objective. There are many tests for Normality, but two of the most popular are the following:

Shapiro-Wilk (better for small to medium sample sizes < 2,000)
Anderson-Darling (better for larger sample sizes)

Both of these hypothesis tests have the same null hypothesis - Normality. Since the alternative hypothesis for both tests is that the distribution is not Normal, we would prefer a larger p-value, meaning these tests are not statistically significant.

Let’s calculate both of these tests in each software!

Python
R

In Python we just need to use the stats module from scipy. Inside of that module are the shapiro and anderson functions to run each of our respective Normality tests on our previously calculated residuals.

Code

from scipy import stats

shapiro_test = stats.shapiro(residuals)
print(f"Shapiro-Wilk test: Statistic = {shapiro_test.statistic:.3f}, p-value = {shapiro_test.pvalue:.3f}")

Shapiro-Wilk test: Statistic = 0.794, p-value = 0.000

Code

anderson_test = stats.anderson(residuals)
print(f"Anderson-Darling test: Statistic = {anderson_test.statistic:.3f}")

Anderson-Darling test: Statistic = 29.921

Code

print(f"Significance level: {anderson_test.significance_level}%")

Significance level: [15.  10.   5.   2.5  1. ]%

Code

print(f"Critical value: {anderson_test.critical_values}")

Critical value: [0.574 0.654 0.784 0.915 1.088]

Both of those tests show our data is statistically not Normally distributed at any reasonable \(\alpha\)-level.
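Since the Anderson-Darling output above reports critical values instead of a p-value, the comparison behind that conclusion can be spelled out. This is a small sketch that simply copies the numbers printed above and checks the test statistic against each critical value; Normality is rejected at a given level when the statistic is larger than that level's critical value.

```python
# Values copied from the Anderson-Darling output above
ad_statistic = 29.921
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]       # percent
critical_values = [0.574, 0.654, 0.784, 0.915, 1.088]

rejected = []
for level, crit in zip(significance_levels, critical_values):
    # Reject the null of Normality when the statistic exceeds the critical value
    reject = ad_statistic > crit
    rejected.append(reject)
    print(f"alpha = {level:>4}%: critical value = {crit:.3f}, reject Normality = {reject}")
```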

In R we just need the nortest package to run the Anderson-Darling test. The Shapiro-Wilk test comes with base R’s preloaded stats package. Inside of those packages are the shapiro.test and ad.test functions to run each of our respective Normality tests on our previously calculated residuals.

Code

library(nortest)

shapiro_test <- shapiro.test(residuals)
cat(sprintf("Shapiro-Wilk test: Statistic = %.3f, p-value = %.3f\n",
            shapiro_test$statistic, shapiro_test$p.value))

Shapiro-Wilk test: Statistic = 0.794, p-value = 0.000

Code

ad_test <- ad.test(residuals)
cat(sprintf("Anderson-Darling test: Statistic = %.3f, p-value = %.3f\n",
            ad_test$statistic, ad_test$p.value))

Anderson-Darling test: Statistic = 29.921, p-value = 0.000

Both of those tests show our data is statistically not Normally distributed at any reasonable \(\alpha\)-level.

We have ample evidence at this point to say that our residuals are definitely not Normally distributed.

One possible solution for having residuals that don’t follow a normal distribution is to transform the target variable. Similar to what we did for stabilizing variance for heteroscedasticity, when Normality is broken a transformation might help solve it.

A common statistical transformation is the Box-Cox transformation:

\[ y' = \begin{cases}\frac{y^\lambda-1}{\lambda} & \text{if } \lambda \ne 0 \\ \log(y) & \text{if } \lambda = 0\end{cases} \]

The value of \(\lambda\) is optimized to best fit the data and typically done with the help of computers. Here are some common transformations based on their respective values of \(\lambda\):

Close to 1: No transformation
Close to 0: Natural log
Close to 0.5: Square root
Less than 0: Inverse
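To see why those values of \(\lambda\) line up with familiar transformations, here is a minimal sketch of the Box-Cox formula applied to a single value. The sale price used is hypothetical, and the function only applies the formula for a given \(\lambda\); it does not search for the optimal one.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

price = 200000.0  # hypothetical sale price
for lam in (1, 0.5, 0.001, 0, -1):
    print(f"lambda = {lam:>6}: {box_cox(price, lam):.4f}")
```

With \(\lambda = 1\) the value is only shifted by 1, with \(\lambda = 0.5\) it is a rescaled square root, and with \(\lambda = -1\) it is a rescaled inverse. As \(\lambda\) gets close to 0, the result approaches the natural log, which is why the two cases of the formula fit together.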

Let’s run the Box-Cox transformation in each software and try to fix our lack of Normality problem!

Python
R

Still using the stats module from scipy, we can use the boxcox function to give us an optimal value of \(\lambda\) for our target variable y.

Code

new_data, best_lambda = stats.boxcox(y)

print(f"Optimal λ (lambda): {best_lambda:.4f}")

Optimal λ (lambda): -0.1723

That value of \(\lambda\) is rather close to 0, but it is also negative. Therefore, either the natural log or inverse functions might work for our data. Let’s calculate each of those for our data.

Code

y2 = np.log(train['SalePrice'])
y3 = 1/(train['SalePrice'])

The code is not shown to build the models and Q-Q plots for each as it would be just repeating a lot of the code above. Let’s just examine the Q-Q plots for each.

First, let’s look at the results from the natural log transformation on our target variable.

A majority of the plot looks really good except for the small part of the left hand tail still having some skewness. However, this skewness might be from a series of outliers in that tail.

Next, let’s look at the results from the inverse transformation on our target variable.

We see a similar pattern here as we did with the natural log transformation, just in reverse. Here, we have some outliers causing skewness in the right hand tail instead.

We need the MASS package in R to calculate the Box-Cox transformation. The input to the boxcox function is the linear regression model object we previously built. We will evaluate \(\lambda\) values between -2 and 2 as those seem reasonable. Next, we extract the best \(\lambda\) value from our results using the which.max function.

Code

library(MASS)

boxcox_result <- boxcox(model, lambda = seq(-2, 2, 0.01))

Code

best_lambda <- boxcox_result$x[which.max(boxcox_result$y)]

cat(sprintf("Optimal λ (lambda): %.4f\n", best_lambda))

Optimal λ (lambda): 0.0600

That value of \(\lambda\) is rather close to 0, but it is also positive. Therefore, the natural log might work for our data. Luckily, we already calculated that transformation for our heteroscedasticity concerns earlier.

The code is not shown to build the models and Q-Q plots for this as it would be just repeating a lot of the code above. Let’s just examine the Q-Q plots for the natural log transformation on our target variable.

A majority of the plot looks really good except for the small part of the left hand tail still having some skewness. However, this skewness might be from a series of outliers in that tail.

Although not perfect, these plots look like they can get a strong majority of our data to be relatively normal. Since we needed the natural log transformation to help with heteroscedasticity already, it would be a logical choice to take that same transformation as a good solution to our Normality problem. There is still some skewness, but we have yet to evaluate outliers which we will do in subsequent sections.

Independence of Errors

Independence of errors typically is not a problem with a linear regression model built without using time series data. Cross-sectional data is collected across different individuals at a specific point in time. This is what we have with our housing dataset. Time series data on the other hand is collected for one individual (typically) across consecutive points in time. An example of this would be watching the average price of homes over time.

Regression models of time series data can lead to problems in the modeling process. The value of a time series at the point in time \(t\) is often related to the value at the point in time \(t+1\). For example, the average price of homes today is highly correlated to what that average price of homes will be tomorrow, but much less correlated to what the average price of homes will be in a decade. Errors are now correlated with each other, which typically leads to underestimated standard errors for the \(\beta\) coefficients.

The Durbin-Watson test is a popular statistical test to see if there is possible correlation across observations in our residuals. The null hypothesis for this test is no residual correlation, while the alternative is that there is residual correlation. The Durbin-Watson \(d\) test statistic for residual correlation is the following:

\[ d = \frac{\sum_{t=2}^n (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^n \hat\varepsilon_t^2} \]

The numerator in the above test statistic is the sum of squared differences between residuals at successive points in time. The \(d\) statistic has the following properties:

\(0 \le d \le 4\)
\(d \approx 2\): uncorrelated
\(d<2\): positively correlated
\(d>2\): negatively correlated

The sampling distribution of \(d\) is very complex, so no direct cut-off can be calculated for the statistic. This makes p-value calculations difficult and approximate.
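The \(d\) statistic itself is easy to calculate by hand. Here is a minimal sketch of the formula on two made-up residual sequences, assumed to be ordered in time: one where neighboring residuals look alike and one where they flip sign every step.

```python
def durbin_watson(residuals):
    """Durbin-Watson d: sum of squared successive differences over sum of squares."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual sequences, ordered in time
slow_drift = [1.0, 0.9, 0.8, 0.6, 0.3, -0.2, -0.6, -0.9, -1.0, -0.9]    # neighbors alike
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # neighbors opposite

print(f"slow drift:  d = {durbin_watson(slow_drift):.3f}")
print(f"alternating: d = {durbin_watson(alternating):.3f}")
```

The slowly drifting residuals give a value well below 2 (positive correlation), while the alternating residuals give a value well above 2 (negative correlation).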

Typically, we would not need to even run this test as our data is not recorded over time and therefore, we wouldn’t need to evaluate if it has correlation over time. However, just for completeness, let’s see how to do this in each software!

Python
R

The durbin_watson test that we will use on our residuals is found in the statsmodels package’s stats.stattools library.

Code

dw_stat = sm.stats.stattools.durbin_watson(residuals)
print(f"Durbin-Watson statistic: {dw_stat:.3f}")

Durbin-Watson statistic: 2.046

Not surprisingly, our Durbin-Watson test statistic is rather close to 2, which suggests uncorrelated residuals. Our data isn’t ordered in any notion of time so again, this would not need calculating for our dataset.

We will again use a function from the lmtest package. Here we input our model object into the dwtest function to get the test statistic.

Code

library(lmtest)

dw_result <- dwtest(model)
cat(sprintf("Durbin-Watson statistic: %.3f, p-value: %.4f\n",
            dw_result$statistic, dw_result$p.value))

Durbin-Watson statistic: 2.046, p-value: 0.7606

Not surprisingly, our Durbin-Watson test statistic is rather close to 2, which suggests uncorrelated residuals. Our data isn’t ordered in any notion of time so again, this would not need calculating for our dataset.

Outliers and Influential Observations

We have seen through previous looks through the residuals of our data that there are observations in our data that seem to be out of line with a majority of the data.

Observations with residuals that are extremely large are called outliers. When we say extremely large, we typically mean 3 standard deviations away from the mean. Remember, the mean of the residuals is 0. Our standardized residuals that we calculated earlier help us with this evaluation:

\[ \hat{\varepsilon}_i^* = \frac{\hat{\varepsilon}_i}{\sqrt{MSE}} = \frac{(y_i - \hat y_i)}{s} \]

Instead of just calculating standardized residuals, some people also calculate studentized residuals:

\[ \hat{\varepsilon}_i^{**} = \frac{\hat{\varepsilon}_i}{\sqrt{MSE \times (1-h_i)}} = \frac{(y_i - \hat y_i)}{s \sqrt{1-h_i}} \]

These residuals are not just adjusted for scale, but also for the possibility of their influence, here measured with leverage, \(h_i\). Similar to standardized residuals, the typical value of 3 or more represents large studentized residuals.

The leverage of an observation, \(h_i\), is the influence of that particular observation on the respective predicted value. In other words, how do the respective values of the predictor variables for the \(i^{th}\) observation affect the prediction \(\hat y_i\)? The equation for leverage is not shown here due to its complexity. We will just rely on our software.

Leverage is one way to measure the impact an observation has on the regression analysis. Any observations with large impacts on the regression analysis are called influential observations. The following leverage values would imply an observation has an extremely large influence on the model:

\[ h_i > \frac{2(k+1)}{n} \]
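For a feel of what the software is doing, here is a rough sketch that calculates leverage as the diagonal of the hat matrix, along with the standardized and studentized residuals from the formulas above. The data is simulated purely for illustration (it is not the housing data), with one observation placed far from the other predictor values and another given an unusually large target value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations, k predictors
n, k = 50, 2
X = rng.normal(size=(n, k))
X[0] = [6.0, 6.0]          # predictor values far from the rest -> high leverage
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[1] += 8.0                # target far from the regression surface -> large residual

# Design matrix with an intercept column
A = np.column_stack([np.ones(n), X])

# Leverage = diagonal of the hat matrix H = A (A'A)^{-1} A'
H = A @ np.linalg.solve(A.T @ A, A.T)
leverage = np.diag(H)

# OLS residuals and MSE with n - (k + 1) degrees of freedom
resid = y - H @ y
mse = resid @ resid / (n - (k + 1))

standardized = resid / np.sqrt(mse)
studentized = resid / np.sqrt(mse * (1 - leverage))

threshold = 2 * (k + 1) / n
print("High leverage rows:", np.where(leverage > threshold)[0].tolist())
print("Rows with |studentized residual| > 3:", np.where(np.abs(studentized) > 3)[0].tolist())
```

Notice that the high leverage observation is flagged because of where its predictor values sit, not because of its residual, while the outlier is flagged because of its residual. The two measures answer different questions.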

Cook’s D (distance) is another way to determine if an observation is an influential observation. Cook’s D measures the influence of each observation on the estimated \(\beta\) coefficients overall.

\[ D_i = \frac{(y_i - \hat y_i)^2}{(k+1)MSE}\left(\frac{h_i}{(1-h_i)^2}\right) \]

Observations with large values of Cook’s D are considered influential. Large values of Cook’s D are the following:

\[ D_i > \frac{4}{n} \]
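One way to build intuition for Cook’s D is to compare the formula above with what it is summarizing: how much all of the fitted values move when a single observation is deleted and the model is refit. The sketch below does both on simulated data (hypothetical, not the housing data) and checks that the two versions agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set with n observations and k predictors
n, k = 30, 2
X = rng.normal(size=(n, k))
y = 0.5 + X @ np.array([1.5, -2.0]) + rng.normal(size=n)
A = np.column_stack([np.ones(n), X])   # design matrix with intercept

def fit(A, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(A, y, rcond=None)[0]

beta = fit(A, y)
fitted = A @ beta
resid = y - fitted
mse = resid @ resid / (n - (k + 1))
leverage = np.diag(A @ np.linalg.solve(A.T @ A, A.T))

# Closed form from the text
cooks_formula = resid**2 / ((k + 1) * mse) * leverage / (1 - leverage) ** 2

# Refit version: how far all fitted values move when observation i is deleted
cooks_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = fit(A[keep], y[keep])
    shift = fitted - A @ beta_i
    cooks_refit[i] = shift @ shift / ((k + 1) * mse)

print("Largest difference between the two versions:", np.max(np.abs(cooks_formula - cooks_refit)))
print("Rows above 4/n:", np.where(cooks_formula > 4 / n)[0].tolist())
```

The closed form saves us from refitting the model once per observation, which is why software can report Cook’s D for every row so cheaply.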

Let’s see how to calculate each of these in each software!

Python
R

The primary method of looking at outliers and influential observations in Python’s statsmodels package is the OLSInfluence function from the stats.outliers_influence part of statsmodels. We apply this OLSInfluence function on our model object. Here we are using the original model, but we could easily just use the second model where we transformed the target with the natural log. We already calculated the standardized residuals earlier and can examine if any of those take values over 3. However, here we will calculate the studentized residuals with the resid_studentized_internal function. From there we print our outliers.

Code

from statsmodels.stats.outliers_influence import OLSInfluence

# Influence summary
influence = OLSInfluence(model)

# Studentized Residuals
std_resid = influence.resid_studentized_internal

# Flag potential outliers
outliers = np.where(np.abs(std_resid) > 3)[0]
print(f"Potential outliers (|standardized residual| > 3): {outliers.tolist()}")

Potential outliers (|standardized residual| > 3): [187, 198, 225, 267, 418, 432, 489, 597, 608, 616, 850, 890, 971, 1037, 1082]

The above output gives the row value for each outlier based on studentized residuals. Let’s compare these values to some of the influential observation calculations.

To calculate leverage we will use the hat_matrix_diag function. We will print out only the observations that are greater than our leverage cut-off as defined above.

Code

leverage = influence.hat_matrix_diag

n = model.nobs
k = model.df_model
leverage_threshold = 2 * (k + 1) / n
high_leverage = np.where(leverage > leverage_threshold)[0]

print(f"High leverage points (leverage > {leverage_threshold:.3f}): {high_leverage.tolist()}")

High leverage points (leverage > 0.064): [4, 5, 19, 30, 36, 39, 66, 89, 93, 94, 95, 100, 103, 105, 124, 142, 152, 156, 160, 161, 164, 175, 183, 194, 195, 209, 214, 215, 225, 230, 240, 242, 248, 253, 279, 280, 289, 305, 310, 320, 329, 339, 346, 353, 361, 368, 391, 435, 439, 443, 454, 464, 474, 481, 489, 500, 516, 518, 519, 541, 565, 584, 589, 596, 613, 619, 622, 624, 632, 646, 669, 688, 706, 711, 715, 718, 731, 732, 736, 741, 745, 746, 751, 756, 758, 762, 784, 799, 808, 853, 877, 888, 902, 915, 921, 927, 931, 935, 937, 938, 940, 956, 957, 961, 984, 995, 1007, 1013, 1030, 1049, 1058, 1059, 1068, 1074, 1075, 1078, 1085]

The above output gives the row value for each influential observation based on leverage.

To get Cook’s D we will just use the cooks_distance function from the same OLSInfluence object we have been working with. Again, we will compare all of our observations to our cut-off defined above and print only the influential ones out.

Code

cooks_d = influence.cooks_distance[0]
influential_points = np.where(cooks_d > 4 / n)[0]

print(f"Influential points (Cook's D > {4/n:.3f}): {influential_points.tolist()}")

Influential points (Cook's D > 0.004): [39, 124, 159, 175, 185, 187, 190, 198, 225, 267, 279, 329, 339, 400, 417, 418, 432, 489, 597, 608, 616, 617, 619, 688, 702, 711, 741, 789, 808, 812, 843, 850, 890, 966, 971, 1030, 1037, 1049, 1078, 1082]

To try and view some of these all together, the statsmodels package has a graphics.influence_plot function. We just input our model object from before and specify the criterion = "cooks" option.

Code

fig, ax = plt.subplots(figsize=(8, 6))
sm.graphics.influence_plot(model, criterion = "cooks", ax = ax)
plt.title("Influence Plot: Cook's Distance vs Leverage")
plt.tight_layout()
plt.show()

This plot compares all three of the above. The vertical axis shows studentized residuals. Points far away from the cloud vertically (with values above 3 or below -3) would be outliers. Points with high values of leverage would be points to the far right of the plot. The size of bubbles corresponds to Cook’s D so bigger bubbles are more influential too. We can see that observation 1298 seems to be both a high leverage point as well as an outlier. It also has a really large value for Cook’s D. Not all of these will always agree, but they do on this point.

Outliers and influential points should not just be dropped from an analysis without further inspecting the data point for possible errors or interesting factors that might be able to be included in a model. The code below takes the metrics we calculated, puts them into a pandas DataFrame and then prints them out by descending values of Cook’s D.

Code

diagnostics = pd.DataFrame({
    'y': model.model.endog,
    'fitted': model.fittedvalues,
    'resid': model.resid,
    'std_resid': std_resid,
    'leverage': leverage,
    'cooks_d': cooks_d
})

# Sort by most Cook’s D
print("\nTop 5 most influential observations:")


Top 5 most influential observations:

Code

print(diagnostics.sort_values('cooks_d', ascending = False).head())

           y         fitted          resid  std_resid  leverage   cooks_d
1298  160000  543967.859930 -383967.859930 -11.529578  0.101418  0.394823
523   184750  497788.628015 -313038.628015  -9.182248  0.058344  0.137473
691   755000  502727.010268  252272.989732   7.326346  0.039359  0.057872
1182  745000  503661.235674  241338.764326   7.020772  0.042632  0.057762
1169  625000  438915.918502  186084.081498   5.430344  0.048610  0.039650

The primary tools for looking at outliers and influential observations in R are the rstudent, hatvalues, and cooks.distance functions from the preloaded stats package, along with the car package for plotting. We apply the rstudent function on our model object. Here we are using the original model, but we could easily just use the second model where we transformed the target with the natural log. We already calculated the standardized residuals earlier and can examine if any of those take values over 3. However, here we will calculate the studentized residuals. From there we print our outliers.

Code

library(car)
library(broom)
library(dplyr)

std_resid <- rstudent(model)

outliers <- which(abs(std_resid) > 3)
cat("Potential outliers (|studentized residual| > 3):", outliers, "\n")

Potential outliers (|studentized residual| > 3): 188 199 226 268 419 433 490 598 609 617 851 891 972 1038 1083

The above output gives the row value for each outlier based on studentized residuals. Let’s compare these values to some of the influential observation calculations.

To calculate leverage we will use the hatvalues function. We will print out only the observations that are greater than our leverage cut-off as defined above.

Code

leverage <- hatvalues(model)
n <- nobs(model)
k <- length(coefficients(model)) - 1

leverage_threshold <- 2 * (k + 1) / n
high_leverage <- which(leverage > leverage_threshold)
cat(sprintf("High leverage points (leverage > %.3f): %s\n", leverage_threshold, toString(high_leverage)))

High leverage points (leverage > 0.069): 6, 31, 37, 40, 67, 90, 94, 96, 104, 106, 125, 143, 153, 157, 161, 162, 165, 176, 184, 196, 210, 215, 226, 231, 241, 249, 254, 280, 281, 290, 306, 311, 321, 330, 340, 354, 369, 392, 440, 444, 455, 465, 490, 501, 520, 542, 566, 597, 614, 620, 623, 625, 647, 670, 689, 707, 716, 719, 733, 737, 742, 746, 747, 752, 757, 759, 763, 785, 800, 809, 854, 903, 916, 922, 928, 932, 936, 938, 941, 958, 962, 1014, 1031, 1050, 1069, 1075, 1076, 1086

The above output gives the row value for each influential observation based on leverage.

To get Cook’s D we will just use the cooks.distance function from the same model object we have been working with. Again, we will compare all of our observations to our cut-off defined above and print only the influential ones out.

Code

cooks_d <- cooks.distance(model)

influential_points <- which(cooks_d > 4 / n)
cat(sprintf("Influential points (Cook's D > %.3f): %s\n", 4 / n, toString(influential_points)))

Influential points (Cook's D > 0.004): 40, 125, 160, 176, 186, 188, 191, 199, 226, 268, 280, 330, 340, 401, 418, 419, 433, 435, 450, 475, 490, 598, 609, 617, 618, 620, 642, 689, 694, 703, 712, 716, 742, 790, 809, 813, 844, 851, 891, 946, 967, 972, 1031, 1038, 1050, 1079, 1083

To try and view some of these all together, the car package has an influencePlot function. We just input our model object from before.

Code

library(car)

influencePlot(model,
              id.method = "identify",
              main = "Influence Plot",
              sub = "Bubble size = Cook's distance")

         StudRes        Hat        CookD
523   -9.5662538 0.05834372 1.492560e-01
1298 -12.3226870 0.10141846 4.286653e-01
52     0.3481201 0.17005195 7.100373e-04
246   -0.1264835 0.16944357 9.333812e-05

Outliers and influential points should not just be dropped from an analysis without further inspecting the data point for possible errors or interesting factors that might be able to be included in a model. The code below takes the metrics we calculated, puts them into a dataframe and then prints them out by descending values of Cook’s D.

Code

# ---- Combine into a DataFrame ----
diagnostics <- data.frame(
  y = model$model[[1]],
  fitted = fitted(model),
  resid = resid(model),
  std_resid = rstudent(model),
  leverage = hatvalues(model),
  cooks_d = cooks.distance(model)
)

cat("\nTop 5 most influential observations (by Cook's D):\n")


Top 5 most influential observations (by Cook's D):

Code

top_influential <- diagnostics %>% arrange(desc(cooks_d)) %>% head(5)
print(top_influential)

          y   fitted     resid  std_resid   leverage    cooks_d
1298 160000 543967.9 -383967.9 -12.322687 0.10141846 0.42866530
523  184750 497788.6 -313038.6  -9.566254 0.05834372 0.14925597
691  755000 502727.0  252273.0   7.515646 0.03935867 0.06283273
1182 745000 503661.2  241338.8   7.186540 0.04263179 0.06271281
1169 625000 438915.9  186084.1   5.504894 0.04861047 0.04304850

Multicollinearity

Multicollinearity occurs when two or more of the predictor variables in a regression model are correlated with each other. If two inputs are correlated, they could be bringing similar information to the prediction of the target variable. To be a problem the correlation must be high, which is good because it is difficult to find predictor variables that are not correlated with each other.

High correlation between multiple predictor variables could lead to the following problems:

Errors in the calculation of the parameter estimates
Errors in the calculation of the standard errors
Results that are counterintuitive (parameter estimates with opposite signs)

There are some easy signs for the presence of severe multicollinearity in a regression model:

Incorrect signs of coefficients on predictor variables
Extreme differences in coefficients of predictor variables after the addition (or deletion) of another predictor variable
Switches in significance of predictor variables

A more formal way to evaluate if there are problems with multicollinearity is the variance inflation factor (VIF). The VIF is the amount of inflation of the variance (the squared standard error) of the parameter estimates due to multicollinearity. Recall the equation for the squared standard errors of parameter estimates in multiple linear regression:

\[ s_{\hat \beta_j}^2 = \frac{MSE}{(1-R_j^2)\times \sum_{i=1}^n (x_{j,i} - \bar x_j)^2} = \frac{1}{(1-R_j^2)} \times \frac{MSE}{\sum_{i=1}^n (x_{j,i} - \bar x_j)^2} \]

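The value of \((1-R^2_j)\), where \(R_j^2\) is the \(R^2\) from regressing the \(j^{th}\) predictor on all of the other predictors, is called the tolerance. The inverse of the tolerance, \(\frac{1}{(1-R_j^2)}\), is the VIF. In linear regression, a common rule of thumb is that a VIF greater than 10 signals a problem of multicollinearity.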
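To show where the VIF comes from, here is a rough sketch that calculates it directly from its definition: regress each predictor on all of the other predictors, take that regression’s \(R_j^2\), and invert the tolerance. The three predictors are simulated for illustration, and the helper function vif is our own, not a library function. It adds the intercept itself, so only the predictors are passed in.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all of the other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1 / (1 - r2)   # inverse of the tolerance
    return out

# Hypothetical predictors
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                   # unrelated to the others

values = vif(np.column_stack([x1, x2, x3]))
print(values.round(2))
```

The two nearly identical predictors end up with VIF values far above 10, while the unrelated predictor stays close to 1.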

The solutions to high multicollinearity are the following:

Drop one of the correlated variables
Avoid making inferences about the parameter estimates
Biased regression techniques

The easiest solution to multicollinearity is the first one - dropping one of the variables causing the problem. If there are two variables that are highly correlated with each other we do not want to drop them both. They have made it this far in the analysis because they provide valuable information. However, they are just providing a severe overlap of information. Dropping all but one of these variables usually solves the multicollinearity problem.

The second solution is not as helpful, especially in the business world where we can’t just ignore what a variable’s stated impact is on the regression model. The third solution, biased regression techniques, will be covered in a subsequent section of this code deck.

Let’s calculate the VIF in each software!

Python
R

In Python we can just use the variance_inflation_factor function from the statsmodels.stats.outliers_influence module. We just need to input the values of our predictor variables. We create a blank pandas DataFrame to store our column names and VIF values. The columns attribute will grab the names while the variance_inflation_factor function applied to the values of our predictor variables will calculate the VIFs. We use a list comprehension to apply this function to each predictor variable in turn.

Code

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_data = pd.DataFrame()
vif_data["feature"] = X_selected.columns
vif_data["VIF"] = [variance_inflation_factor(X_selected.values, i) for i in range(X_selected.shape[1])]

print(vif_data)

                    feature           VIF
0                     const  4.248991e+04
1               OverallQual  3.831085e+00
2                 YearBuilt  5.792185e+00
3              YearRemodAdd  2.361090e+00
4                 GrLivArea  2.657506e+00
5                Fireplaces  1.566672e+00
6               GarageYrBlt  3.793925e+00
7                GarageCars  3.101954e+00
8                WoodDeckSF  1.179558e+00
9               ScreenPorch  1.098978e+00
10  GarageYrBlt_was_missing           inf
11             LotShape_IR2  1.131663e+00
12             LotShape_Reg  1.295285e+00
13          LandContour_HLS  1.063126e+00
14        LotConfig_CulDSac  1.162340e+00
15          Condition1_Norm  1.097693e+00
16          BldgType_2fmCon  1.154122e+00
17          BldgType_Duplex  1.318707e+00
18           BldgType_Twnhs  1.111311e+00
19        HouseStyle_2Story  1.507329e+00
20             ExterQual_Fa  1.393342e+00
21             ExterQual_TA  3.210060e+00
22        Foundation_CBlock  4.471441e+00
23         Foundation_PConc  6.026108e+00
24              BsmtQual_Fa  2.011392e+00
25              BsmtQual_Gd  5.306442e+00
26         BsmtQual_Missing  1.999892e+00
27              BsmtQual_TA  8.563747e+00
28             HeatingQC_Gd  1.296765e+00
29             HeatingQC_TA  1.691710e+00
30           KitchenQual_Fa  2.076800e+00
31           KitchenQual_Gd  5.632807e+00
32           KitchenQual_TA  8.029478e+00
33       GarageType_Missing           inf
34            GarageQual_Fa  3.640103e+00
35       GarageQual_Missing           inf
36            GarageQual_TA  6.537258e+00
37       GarageCond_Missing           inf

We can see there are some problems in the above output. Four variables have a VIF of infinity (inf). Those infinite values mean that we have perfect multicollinearity. Perfect multicollinearity occurs when a predictor variable is an exact linear combination of one or more other predictor variables (or takes the exact same values as another predictor). As we explore these variables we see they are all missing-value flag variables for things related to the home’s garage.
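One way to do that exploration is to check directly which columns hold identical values. The sketch below is an assumption-laden illustration: the helper identical_column_groups is hypothetical, and the flag values are made up by hand. Only the column names are borrowed from the output above.

```python
import numpy as np

def identical_column_groups(X, names):
    """Group column names whose values are exactly equal in every row."""
    groups = {}
    for j, name in enumerate(names):
        key = tuple(X[:, j].tolist())  # hashable copy of the column
        groups.setdefault(key, []).append(name)
    return [g for g in groups.values() if len(g) > 1]

# Made-up flags: homes 3, 17 and 42 have no garage, homes 5 and 17 no basement
n = 50
no_garage = np.zeros(n)
no_garage[[3, 17, 42]] = 1
no_basement = np.zeros(n)
no_basement[[5, 17]] = 1

names = ["GarageYrBlt_was_missing", "GarageType_Missing",
         "GarageQual_Missing", "GarageCond_Missing", "BsmtQual_Missing"]
X = np.column_stack([no_garage, no_garage, no_garage, no_garage, no_basement])

groups = identical_column_groups(X, names)
print(groups)
```

Each group returned holds columns that carry exactly the same information, so all but one member of a group can be dropped without losing anything. Note that this check only finds exact duplicates; a column that is a linear combination of several others would not be caught.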

Let’s drop three of those variables and rerun the code to see what the VIF values are now!

Code

X_selected = X_selected.drop(['GarageType_Missing', 
                              'GarageQual_Missing',
                              'GarageCond_Missing'], axis=1)

vif_data = pd.DataFrame()
vif_data["feature"] = X_selected.columns
vif_data["VIF"] = [variance_inflation_factor(X_selected.values, i) for i in range(X_selected.shape[1])]

print(vif_data)

                    feature           VIF
0                     const  42489.905445
1               OverallQual      3.831085
2                 YearBuilt      5.792185
3              YearRemodAdd      2.361090
4                 GrLivArea      2.657506
5                Fireplaces      1.566672
6               GarageYrBlt      3.793925
7                GarageCars      3.101954
8                WoodDeckSF      1.179558
9               ScreenPorch      1.098978
10  GarageYrBlt_was_missing      5.029598
11             LotShape_IR2      1.131663
12             LotShape_Reg      1.295285
13          LandContour_HLS      1.063126
14        LotConfig_CulDSac      1.162340
15          Condition1_Norm      1.097693
16          BldgType_2fmCon      1.154122
17          BldgType_Duplex      1.318707
18           BldgType_Twnhs      1.111311
19        HouseStyle_2Story      1.507329
20             ExterQual_Fa      1.393342
21             ExterQual_TA      3.210060
22        Foundation_CBlock      4.471441
23         Foundation_PConc      6.026108
24              BsmtQual_Fa      2.011392
25              BsmtQual_Gd      5.306442
26         BsmtQual_Missing      1.999892
27              BsmtQual_TA      8.563747
28             HeatingQC_Gd      1.296765
29             HeatingQC_TA      1.691710
30           KitchenQual_Fa      2.076800
31           KitchenQual_Gd      5.632807
32           KitchenQual_TA      8.029478
33            GarageQual_Fa      3.640103
34            GarageQual_TA      6.537258

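Those VIF values are all much more reasonable as every predictor is now below 10. The large value for const belongs to the intercept column rather than to a predictor, so it is generally not interpreted as a sign of multicollinearity.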
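For intuition about where these numbers come from, including the infinite values seen before the garage flags were dropped, the VIF of a predictor can be computed from its definition: regress that predictor on all of the others and take 1 / (1 - R²). The sketch below uses synthetic data and a hypothetical helper named vif_from_definition. It is not how statsmodels is implemented internally, and it simply skips the intercept column.

```python
import numpy as np

def vif_from_definition(X, j):
    """VIF of column j: regress it on all other columns, then 1 / (1 - R^2)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ coef
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((target - target.mean()) ** 2)
    # R^2 of exactly 1 means the column is fully explained by the others
    return float("inf") if r2 >= 1.0 else float(1.0 / (1.0 - r2))

# Synthetic data (assumption: not the housing data from this deck)
rng = np.random.default_rng(0)
n = 200
const = np.ones(n)
area = rng.normal(size=n)
flag = (rng.random(n) < 0.1).astype(float)

names = ["const", "area", "flag", "flag_copy"]
X = np.column_stack([const, area, flag, flag])  # last column duplicates "flag"

# Column 0 is the intercept, so start at column 1
vifs = {names[j]: vif_from_definition(X, j) for j in range(1, X.shape[1])}
print(vifs)

# Dropping the duplicate brings the flag's VIF back down
X_reduced = X[:, :3]
vifs_reduced = {names[j]: vif_from_definition(X_reduced, j) for j in range(1, 3)}
print(vifs_reduced)
```

With the duplicate present, the regression for either flag column fits perfectly, so R² is 1 (up to floating-point error) and the VIF is infinite or astronomically large. Once the copy is removed, the same flag has a VIF close to 1.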

In R, we can simply use the vif function from the car package. The input to the vif function is just the same model object we have been using from our linear regression model. We then create a data.frame to store all of our values to print them out.

However, if you were to run the following code on our current model, R would report an error about aliased coefficients in the model.

Code

library(car)

vif_values <- vif(model)

vif_data <- data.frame(
  feature = names(vif_values),
  VIF = as.numeric(vif_values)
)

print(vif_data)

Those aliased coefficients are the ones in the linear regression output that have NA values. Those mean that we have perfect multicollinearity. Perfect multicollinearity occurs when a predictor variable is an exact linear combination of one or more other predictor variables (or takes the exact same values as another predictor).

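Base R also has an alias function (it lives in the stats package, so no extra library is required) where you can put in your model object to discover the problem variables. In the output below, each row of the Complete section is an aliased variable, and a nonzero entry marks the variable it duplicates. Here all three rows have a 1 under GarageYrBlt_was_missing.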

Code

library(car)

alias(model)

Model :
y ~ OverallQual + YearBuilt + YearRemodAdd + GrLivArea + Fireplaces + 
    GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch + GarageYrBlt_was_missing + 
    LotShape_IR2 + LotShape_Reg + LandContour_HLS + LotConfig_CulDSac + 
    Condition1_Norm + BldgType_2fmCon + BldgType_Duplex + BldgType_Twnhs + 
    HouseStyle_2Story + ExterQual_Fa + ExterQual_TA + Foundation_CBlock + 
    Foundation_PConc + BsmtQual_Fa + BsmtQual_Gd + BsmtQual_Missing + 
    BsmtQual_TA + HeatingQC_Gd + HeatingQC_TA + KitchenQual_Fa + 
    KitchenQual_Gd + KitchenQual_TA + GarageType_Missing + GarageQual_Fa + 
    GarageQual_Missing + GarageQual_TA + GarageCond_Missing

Complete :
                   (Intercept) OverallQual YearBuilt YearRemodAdd GrLivArea
GarageType_Missing 0           0           0         0            0        
GarageQual_Missing 0           0           0         0            0        
GarageCond_Missing 0           0           0         0            0        
                   Fireplaces GarageYrBlt GarageCars WoodDeckSF ScreenPorch
GarageType_Missing 0          0           0          0          0          
GarageQual_Missing 0          0           0          0          0          
GarageCond_Missing 0          0           0          0          0          
                   GarageYrBlt_was_missing LotShape_IR2 LotShape_Reg
GarageType_Missing 1                       0            0           
GarageQual_Missing 1                       0            0           
GarageCond_Missing 1                       0            0           
                   LandContour_HLS LotConfig_CulDSac Condition1_Norm
GarageType_Missing 0               0                 0              
GarageQual_Missing 0               0                 0              
GarageCond_Missing 0               0                 0              
                   BldgType_2fmCon BldgType_Duplex BldgType_Twnhs
GarageType_Missing 0               0               0             
GarageQual_Missing 0               0               0             
GarageCond_Missing 0               0               0             
                   HouseStyle_2Story ExterQual_Fa ExterQual_TA
GarageType_Missing 0                 0            0           
GarageQual_Missing 0                 0            0           
GarageCond_Missing 0                 0            0           
                   Foundation_CBlock Foundation_PConc BsmtQual_Fa BsmtQual_Gd
GarageType_Missing 0                 0                0           0          
GarageQual_Missing 0                 0                0           0          
GarageCond_Missing 0                 0                0           0          
                   BsmtQual_Missing BsmtQual_TA HeatingQC_Gd HeatingQC_TA
GarageType_Missing 0                0           0            0           
GarageQual_Missing 0                0           0            0           
GarageCond_Missing 0                0           0            0           
                   KitchenQual_Fa KitchenQual_Gd KitchenQual_TA GarageQual_Fa
GarageType_Missing 0              0              0              0            
GarageQual_Missing 0              0              0              0            
GarageCond_Missing 0              0              0              0            
                   GarageQual_TA
GarageType_Missing 0            
GarageQual_Missing 0            
GarageCond_Missing 0

As we explore these variables we see they are all missing-value flag variables for things related to the home’s garage. Let’s drop three of those variables and rerun the code to see what the VIF values are now!

Code

library(car)
library(dplyr)  # provides %>% and select() used below

X_selected <- X_selected %>%
  dplyr::select(
    -GarageType_Missing, -GarageQual_Missing, -GarageCond_Missing
  )

model <- lm(y ~ ., data = X_selected)

vif_values <- vif(model)

vif_data <- data.frame(
  feature = names(vif_values),
  VIF = as.numeric(vif_values)
)

print(vif_data)

                   feature      VIF
1              OverallQual 3.831085
2                YearBuilt 5.792185
3             YearRemodAdd 2.361090
4                GrLivArea 2.657506
5               Fireplaces 1.566672
6              GarageYrBlt 3.793925
7               GarageCars 3.101954
8               WoodDeckSF 1.179558
9              ScreenPorch 1.098978
10 GarageYrBlt_was_missing 5.029598
11            LotShape_IR2 1.131663
12            LotShape_Reg 1.295285
13         LandContour_HLS 1.063126
14       LotConfig_CulDSac 1.162340
15         Condition1_Norm 1.097693
16         BldgType_2fmCon 1.154122
17         BldgType_Duplex 1.318707
18          BldgType_Twnhs 1.111311
19       HouseStyle_2Story 1.507329
20            ExterQual_Fa 1.393342
21            ExterQual_TA 3.210060
22       Foundation_CBlock 4.471441
23        Foundation_PConc 6.026108
24             BsmtQual_Fa 2.011392
25             BsmtQual_Gd 5.306442
26        BsmtQual_Missing 1.999892
27             BsmtQual_TA 8.563747
28            HeatingQC_Gd 1.296765
29            HeatingQC_TA 1.691710
30          KitchenQual_Fa 2.076800
31          KitchenQual_Gd 5.632807
32          KitchenQual_TA 8.029478
33           GarageQual_Fa 3.640103
34           GarageQual_TA 6.537258

Those VIF values are all much more reasonable as they are all below 10.

Another option outside of the popular vif function from the car package is the check_collinearity function from the performance package. Unlike the vif function, this function is not bothered by aliased coefficients; it just drops them in the calculation of the VIF.

Code

library(performance)

vif_data <- check_collinearity(model)
print(vif_data)

# Check for Multicollinearity

Low Correlation

              Term  VIF   VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
       OverallQual 3.83 [3.47, 4.24]     1.96      0.26     [0.24, 0.29]
      YearRemodAdd 2.36 [2.16, 2.59]     1.54      0.42     [0.39, 0.46]
         GrLivArea 2.66 [2.43, 2.92]     1.63      0.38     [0.34, 0.41]
        Fireplaces 1.57 [1.46, 1.70]     1.25      0.64     [0.59, 0.69]
       GarageYrBlt 3.79 [3.44, 4.20]     1.95      0.26     [0.24, 0.29]
        GarageCars 3.10 [2.82, 3.42]     1.76      0.32     [0.29, 0.35]
        WoodDeckSF 1.18 [1.12, 1.28]     1.09      0.85     [0.78, 0.90]
       ScreenPorch 1.10 [1.05, 1.20]     1.05      0.91     [0.83, 0.95]
      LotShape_IR2 1.13 [1.08, 1.23]     1.06      0.88     [0.81, 0.93]
      LotShape_Reg 1.30 [1.22, 1.40]     1.14      0.77     [0.71, 0.82]
   LandContour_HLS 1.06 [1.02, 1.18]     1.03      0.94     [0.85, 0.98]
 LotConfig_CulDSac 1.16 [1.10, 1.26]     1.08      0.86     [0.79, 0.91]
   Condition1_Norm 1.10 [1.05, 1.20]     1.05      0.91     [0.84, 0.95]
   BldgType_2fmCon 1.15 [1.09, 1.25]     1.07      0.87     [0.80, 0.91]
   BldgType_Duplex 1.32 [1.24, 1.43]     1.15      0.76     [0.70, 0.81]
    BldgType_Twnhs 1.11 [1.06, 1.21]     1.05      0.90     [0.83, 0.94]
 HouseStyle_2Story 1.51 [1.40, 1.64]     1.23      0.66     [0.61, 0.71]
      ExterQual_Fa 1.39 [1.30, 1.51]     1.18      0.72     [0.66, 0.77]
      ExterQual_TA 3.21 [2.92, 3.54]     1.79      0.31     [0.28, 0.34]
 Foundation_CBlock 4.47 [4.05, 4.96]     2.11      0.22     [0.20, 0.25]
       BsmtQual_Fa 2.01 [1.85, 2.20]     1.42      0.50     [0.45, 0.54]
  BsmtQual_Missing 2.00 [1.84, 2.19]     1.41      0.50     [0.46, 0.54]
      HeatingQC_Gd 1.30 [1.22, 1.40]     1.14      0.77     [0.71, 0.82]
      HeatingQC_TA 1.69 [1.57, 1.84]     1.30      0.59     [0.54, 0.64]
    KitchenQual_Fa 2.08 [1.91, 2.27]     1.44      0.48     [0.44, 0.52]
     GarageQual_Fa 3.64 [3.30, 4.02]     1.91      0.27     [0.25, 0.30]

Moderate Correlation

                    Term  VIF   VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
               YearBuilt 5.79 [5.22, 6.44]     2.41      0.17     [0.16, 0.19]
 GarageYrBlt_was_missing 5.03 [4.54, 5.58]     2.24      0.20     [0.18, 0.22]
        Foundation_PConc 6.03 [5.43, 6.70]     2.45      0.17     [0.15, 0.18]
             BsmtQual_Gd 5.31 [4.79, 5.89]     2.30      0.19     [0.17, 0.21]
             BsmtQual_TA 8.56 [7.70, 9.54]     2.93      0.12     [0.10, 0.13]
          KitchenQual_Gd 5.63 [5.08, 6.26]     2.37      0.18     [0.16, 0.20]
          KitchenQual_TA 8.03 [7.22, 8.95]     2.83      0.12     [0.11, 0.14]
           GarageQual_TA 6.54 [5.89, 7.27]     2.56      0.15     [0.14, 0.17]

We see the same final results here as we got after dropping the variables with perfect multicollinearity.