First, we need to explore our data before building any models to explain or predict our categorical target variable. With categorical variables, we can look at the distribution of the categories as well as see if this distribution has any association with other variables. For this analysis we are going to continue to use the Ames housing dataset. However, we need to come up with a binary target variable for this kind of modeling.
Imagine you worked for a real estate agency and got a bonus check if you sold a house above $175,000 in value. Let’s create this variable in our data:
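In Python, the flag is a simple comparison against the threshold. A minimal sketch, shown here on a toy frame standing in for the training data:

```python
import pandas as pd

# Toy stand-in for the Ames training data (sale prices in dollars)
train = pd.DataFrame({'SalePrice': [150000, 180000, 175000, 300000]})

# bonus = 1 if the home sold for more than $175,000, else 0
train['bonus'] = (train['SalePrice'] > 175000).astype(int)
print(train['bonus'].tolist())  # [0, 1, 0, 1]
```

Note that a sale of exactly $175,000 is not bonus eligible, since the comparison is strictly greater than.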
To keep the results the same between software we have loaded the Python train and test datasets in R and created the same X data frame and y vector for our predictors and target respectively.
Remember that we have already split our data into training and testing pieces from the previous sections. Because models are prone to discovering small, spurious patterns in the data used to create them (the training data), we set aside validation and/or testing data to get a clear view of how they might perform on new data the models have never seen before.
You are interested in what variables might be associated with obtaining a higher chance of getting a bonus (selling a house above $175,000). An association exists between two categorical variables if the distribution of one variable changes when the value of the other categorical variable changes. If there is no association, the distribution of the first variable is the same regardless of the value of the other variable. For example, if we wanted to know if obtaining a bonus on selling a house in Ames, Iowa was associated with whether the house had central air, we could look at the distribution of bonus eligible houses. If we observe that 41% of homes with central air are bonus eligible and 41% of homes without central air are bonus eligible, then it appears that central air has no bearing on whether the home is bonus eligible. However, if instead we observe that only 3% of homes without central air are bonus eligible, but 44% of homes with central air are bonus eligible, then it appears that having central air might be related to a home being bonus eligible.
To understand the distribution of categorical variables we need to look at frequency tables. A frequency table shows the number of observations that occur in certain categories or intervals. A one-way frequency table examines all the categories of one variable. These are easily visualized with bar charts.
Let’s look at the distribution of both bonus eligibility and central air using the value_counts method of the train dataset. The countplot function allows us to view our data in a bar chart.
Code
from matplotlib import pyplot as plt
import seaborn as sns

train['bonus'].value_counts()
bonus
0 623
1 472
Name: count, dtype: int64
Code
ax = sns.countplot(x = "bonus", data = train, color = "blue")
ax.set(xlabel = 'Bonus Eligible', ylabel = 'Frequency', title = 'Bar Graph of Bonus Eligibility')
plt.show()
Code
train['CentralAir'].value_counts()
CentralAir
Y 1033
N 62
Name: count, dtype: int64
Code
plt.cla()
ax = sns.countplot(x = "CentralAir", data = train, color = "blue")
ax.set(xlabel = 'Central Air', ylabel = 'Frequency', title = 'Bar Graph of Central Air Availability')
plt.show()
Frequency tables show single variables, but if we want to explore two variables together we look at cross-tabulation tables. A cross-tabulation table shows the number of observations for each combination of the row and column variables.
Let’s again examine bonus eligibility, but this time across levels of central air. Here, we can use the crosstab function from pandas.
Code
import pandas as pd

pd.crosstab(index = train['CentralAir'], columns = train['bonus'])
bonus 0 1
CentralAir
N 60 2
Y 563 470
Code
plt.cla()
ax = sns.countplot(x = "bonus", data = train, hue = "CentralAir")
ax.set(xlabel = 'Bonus Eligibility', ylabel = 'Frequency', title = 'Comparison of Central Air by Bonus')
plt.show()
From the above output we can see that 62 homes have no central air with only 2 of them being bonus eligible. However, there are 1033 homes that have central air with 470 of them being bonus eligible.
Let’s look at the distribution of both bonus eligibility and central air using the table function. The ggplot function with the geom_bar function allows us to view our data in a bar chart.
Frequency tables show single variables, but if we want to explore two variables together we look at cross-tabulation tables. A cross-tabulation table shows the number of observations for each combination of the row and column variables.
Let’s again examine bonus eligibility, but this time across levels of central air. Again, we can use the table function. The prop.table function allows us to compare two variables in terms of proportions instead of frequencies.
Code
table(train$CentralAir, train$bonus)
0 1
N 60 2
Y 563 470
Code
prop.table(table(train$CentralAir, train$bonus))
0 1
N 0.054794521 0.001826484
Y 0.514155251 0.429223744
Code
ggplot(data = train) +
  geom_bar(mapping = aes(x = bonus, fill = CentralAir))
From the above output we can see that 62 homes have no central air with only 2 of them being bonus eligible. However, there are 1033 homes that have central air with 470 of them being bonus eligible. For an even more detailed breakdown we can use the CrossTable function.
Cell Contents
|-------------------------|
| N |
| Expected N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 1095
| train$bonus
train$CentralAir | 0 | 1 | Row Total |
-----------------|-----------|-----------|-----------|
N | 60 | 2 | 62 |
| 35.275 | 26.725 | |
| 0.968 | 0.032 | 0.057 |
| 0.096 | 0.004 | |
| 0.055 | 0.002 | |
-----------------|-----------|-----------|-----------|
Y | 563 | 470 | 1033 |
| 587.725 | 445.275 | |
| 0.545 | 0.455 | 0.943 |
| 0.904 | 0.996 | |
| 0.514 | 0.429 | |
-----------------|-----------|-----------|-----------|
Column Total | 623 | 472 | 1095 |
| 0.569 | 0.431 | |
-----------------|-----------|-----------|-----------|
Statistics for All Table Factors
Pearson's Chi-squared test
------------------------------------------------------------
Chi^2 = 42.61838 d.f. = 1 p = 6.653136e-11
Pearson's Chi-squared test with Yates' continuity correction
------------------------------------------------------------
Chi^2 = 40.91212 d.f. = 1 p = 1.592307e-10
The advantage of the CrossTable function is that we can easily get not only the frequencies, but the cell, row, and column proportions. For example, the third number in each cell gives us the row proportion. For homes without central air, 96.8% of them are not bonus eligible, while 3.2% of them are. For homes with central air, 54.5% of the homes are not bonus eligible, while 45.5% of them are. It appears that the distribution of bonus eligible homes changes across levels of central air - a relationship between the two variables. This suspected relationship needs to be tested statistically for verification. The expected = TRUE option provides an expected cell count for each cell. These expected counts help calculate the tests of association in the next section.
Tests of Association
We have statistical tests to evaluate relationships between two categorical variables. The null hypothesis for these statistical tests is that the two variables have no association - the distribution of one variable does not change across levels of another variable. The alternative hypothesis is an association between the two variables - the distribution of one variable changes across levels of another variable.
These statistical tests follow a \(\chi^2\)-distribution. The \(\chi^2\)-distribution is a distribution that has the following characteristics:
Bounded below by 0
Right skewed
A single parameter, the degrees of freedom
A plot of a variety of \(\chi^2\)-distributions is shown here:
Two common \(\chi^2\) tests are the Pearson and Likelihood Ratio \(\chi^2\) tests. They compare the observed count of observations in each cell of a cross-tabulation table to their expected count if there was no relationship. The expected cell count applies the overall distribution of one variable across all the levels of the other variable. For example, overall 57% of all homes are not bonus eligible. If that were to apply to every level of central air, then among the 62 homes without central air we would expect 35.3 ( \(= 62 \times 0.569\) ) of them to not be bonus eligible and 26.7 ( \(= 62 \times 0.431\) ) of them to be bonus eligible. We actually observe 60 and 2 homes in these categories respectively. The further the observed data is from the expected data, the more evidence we have that there is a relationship between the two variables.
The test statistic for the Pearson \(\chi^2\) test is the following:
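With \(Obs_{i,j}\) and \(Exp_{i,j}\) denoting the observed and expected counts for the cell in row \(i\) (of \(R\) rows) and column \(j\) (of \(C\) columns):

\[
\chi^2_P = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(Obs_{i,j} - Exp_{i,j})^2}{Exp_{i,j}}
\]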
From the equation above, the closer the observed count of each cross-tabulation table cell (across all rows and columns) is to the expected count, the smaller the test statistic. As with all hypothesis tests, the smaller the test statistic, the larger the p-value, implying less evidence for the alternative hypothesis.
Another common test is the Likelihood Ratio test. The test statistic for this is the following:
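Using the same notation as the Pearson statistic:

\[
\chi^2_L = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} Obs_{i,j} \log\left(\frac{Obs_{i,j}}{Exp_{i,j}}\right)
\]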
The p-value comes from a \(\chi^2\)-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one. Both of the above tests have a sample size requirement. The sample size requirement is 80% or more of the cells in the cross-tabulation table need expected counts larger than 5.
For smaller sample sizes, this might be hard to meet. In those situations, we can use a more computationally expensive test called Fisher’s exact test. This test calculates every possible permutation of the data being evaluated to calculate the p-value without any distributional assumptions.
Both the Pearson and Likelihood Ratio \(\chi^2\) tests can handle any type of categorical variable (either ordinal, nominal, or both). However, ordinal variables provide us extra information since the order of the categories actually matters compared to nominal categories. We can test for even more with ordinal variables against other ordinal variables. We can test whether two ordinal variables have a linear association as compared to just a general one. An ordinal test for association is the Mantel-Haenszel \(\chi^2\) test. The test statistic for the Mantel-Haenszel \(\chi^2\) test is the following:
\[
\chi^2_{MH} = (n-1)r^2
\] where \(r^2\) is the Pearson correlation between the column and row variables. This test follows a \(\chi^2\)-distribution with only one degree of freedom.
Let’s see how to do each of these tests in each software!
Let’s examine the relationship between central air and bonus eligibility with the Pearson and Likelihood Ratio tests using the chi2_contingency function on our crosstab object comparing our two variables.
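A sketch of the call, here run on the cell counts from the cross-tabulation table above (the same call works directly on the crosstab object):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts from the CentralAir (rows: N, Y) by bonus (columns: 0, 1) table
tab = pd.DataFrame([[60, 2], [563, 470]], index=['N', 'Y'], columns=[0, 1])

# Pearson chi-square; correction=True (the default) applies
# Yates' continuity correction for 2x2 tables
stat, p_value, dof, expected = chi2_contingency(tab)
print(round(stat, 3), dof)  # 40.912 1

# Likelihood Ratio (G) test via the lambda_ option
lr_stat, lr_p, _, _ = chi2_contingency(tab, lambda_="log-likelihood")
```

The degrees of freedom here are \((2-1)(2-1) = 1\), and the Yates-corrected statistic matches the R chisq.test output shown elsewhere in this section.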
The above results show an extremely small p-value that is below any reasonable significance level. This implies that we have statistical evidence for a relationship between having central air and bonus eligibility of homes. The p-value comes from a \(\chi^2\)-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one.
Since both the central air and bonus eligibility variables are binary, they are ordinal. Since they are both ordinal, we should use the Mantel-Haenszel \(\chi^2\) test. However, at the time of writing, Python does not have a built-in function for this. Since the previous tests can handle ordinal versus ordinal comparisons, we will just use those.
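That said, because the statistic is simply \((n-1)r^2\), it can be computed by hand. A sketch, reconstructing the observations from the cell counts shown earlier (the 0/1 codings for the two variables are assumptions for illustration):

```python
import numpy as np
from scipy.stats import chi2

# Rebuild the 1095 observations from the cell counts:
# (N,0)=60, (N,1)=2, (Y,0)=563, (Y,1)=470
air   = np.repeat([0, 0, 1, 1], [60, 2, 563, 470])   # CentralAir: N=0, Y=1
bonus = np.repeat([0, 1, 0, 1], [60, 2, 563, 470])   # bonus flag

# Mantel-Haenszel statistic: (n - 1) * r^2, with df = 1
r = np.corrcoef(air, bonus)[0, 1]
mh_stat = (len(air) - 1) * r**2
p_value = chi2.sf(mh_stat, df=1)
print(round(mh_stat, 2))  # 42.58
```

The resulting statistic is close to the general \(\chi^2\) tests above, as expected for two binary variables.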
Although not applicable in this dataset, if we had a small sample size we could use the Fisher’s exact test. To perform this test we can use the fisher_exact function.
Code
from scipy.stats import fisher_exact

fisher_exact(pd.crosstab(index = train['CentralAir'], columns = train['bonus']))
We see the same results as with the other tests because the assumptions were met for sample size.
Let’s examine the relationship between central air and bonus eligibility with the Pearson and Likelihood Ratio tests using the assocstats function on our table object comparing CentralAir to bonus.
The above results show an extremely small p-value that is below any reasonable significance level. This implies that we have statistical evidence for a relationship between having central air and bonus eligibility of homes. The p-value comes from a \(\chi^2\)-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one.
Since both the central air and bonus eligibility variables are binary, they are ordinal. Since they are both ordinal, we should use the Mantel-Haenszel \(\chi^2\) test with the CMHtest function. In the main output table, the first row is the Mantel-Haenszel \(\chi^2\) test.
From here we can see another extremely small p-value as we saw in the earlier, more general \(\chi^2\) tests.
Although not applicable in this dataset, if we had a small sample size we could use the Fisher’s exact test. To perform this test we can use the fisher.test function.
Code
fisher.test(table(train$CentralAir, train$bonus))
Fisher's Exact Test for Count Data
data: table(train$CentralAir, train$bonus)
p-value = 4.062e-13
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
6.564357 211.498059
sample estimates:
odds ratio
25.0044
We see the same results as with the other tests because the assumptions were met for sample size.
Looking at statistical tests one at a time can be time consuming. Similar to linear regression, we might need to evaluate a large number of predictor variables against the target variable.
Since there are different tests to run for different types of predictor variables, the first thing we need to do is split our predictor variables into two dataframes - one for each type. Since our data is already dummy coded as binary variables for all categorical variables, we simply count which columns have only 2 unique values and which have more. To run the tests we use the SelectKBest function from the sklearn.feature_selection package. We also need the chi2 and f_classif functions to get the proper tests for each type of variable.
Code
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Separate categorical (dummy) vs. continuous features
categorical_features = [col for col in X.columns if X[col].nunique() == 2]
continuous_features = [col for col in X.columns if X[col].nunique() > 2]

X_cat = X[categorical_features]
X_cont = X[continuous_features]
Let’s compare our categorical variables to the categorical target variable first. We use the SelectKBest function with the score_func = chi2 option to use the Pearson \(\chi^2\) test. The k = 'all' option keeps all of the predictor variables in the output for us to decide which are significant and which are not. Next, we input the target variable y and the categorical predictor variables X_cat into the fit function. From there we just use the columns, scores_, and pvalues_ attributes to extract the needed results from the SelectKBest object. We use sort_values to rank by the \(\chi^2\) test statistic and then print the output.
Code
# Fit SelectKBest for Categorical Variables
selector = SelectKBest(score_func = chi2, k = 'all')
selector.fit(X_cat, y)
SelectKBest(k='all', score_func=<function chi2 at 0x31e96a7a0>)
Code
# Create a DataFrame with feature names, Chi2-scores, and p-values
scores_cat_df = pd.DataFrame({'Feature': X_cat.columns,
                              'Chi2_score': selector.scores_,
                              'p_value': selector.pvalues_})

# Sort by Chi2-score descending
scores_cat_df = scores_cat_df.sort_values(by = 'Chi2_score', ascending = False)

with pd.option_context('display.max_rows', None):
    print(scores_cat_df)
Let’s compare our continuous variables to the categorical target variable next. We use the SelectKBest function with the score_func = f_classif option to use the ANOVA F test. The k = 'all' option keeps all of the predictor variables in the output for us to decide which are significant and which are not. Next, we input the target variable y and the continuous predictor variables X_cont into the fit function. From there we just use the columns, scores_, and pvalues_ attributes to extract the needed results from the SelectKBest object. We use sort_values to rank by the F test statistic and then print the output.
Code
# Fit SelectKBest for Continuous Variables
selector = SelectKBest(score_func = f_classif, k = 'all')
selector.fit(X_cont, y)
SelectKBest(k='all')
Code
# Create a DataFrame with feature names, F-scores, and p-values
scores_cont_df = pd.DataFrame({'Feature': X_cont.columns,
                               'F_score': selector.scores_,
                               'p_value': selector.pvalues_})

# Sort by F-score descending
scores_cont_df = scores_cont_df.sort_values(by = 'F_score', ascending = False)

with pd.option_context('display.max_rows', None):
    print(scores_cont_df)
That is a much quicker and easier way to get a large number of hypothesis tests comparing both the categorical and continuous predictor variables to our target variable.
Since there are different tests to run for different types of predictor variables, the first thing we need to do is split our predictor variables into two dataframes - one for each type. Since our data is already dummy coded as binary variables for all categorical variables, we simply count which columns have only 2 unique values and which have more. To do this we use the map_lgl and n_distinct functions from the purrr and dplyr packages respectively.
Let’s compare our categorical variables to the categorical target variable first. We use the map_dfr function along with the chisq.test function to run the Pearson \(\chi^2\) test on a table of each predictor variable and the target variable. From there we just extract the variable names, test statistic, and p-values from the necessary objects. These results might differ from Python’s because the chisq.test function applies Yates’ continuity correction by default. We use arrange to rank by the \(\chi^2\) test statistic and then print the output.
Code
chi2_results <- map_dfr(categorical_features, function(var) {
  # Treat each feature as a 0/1 vector
  x <- df[[var]]
  y <- df$target
  tbl <- table(x, y)
  test <- chisq.test(tbl)
  tibble(Feature = var,
         Chi2_score = unname(test$statistic),
         p_value = test$p.value)
})

chi2_results <- chi2_results %>%
  arrange(desc(Chi2_score))

print(chi2_results, n = Inf)
Let’s compare our continuous variables to the categorical target variable next. We use the map_dfr function with the lm function to use the ANOVA F test. We use the paste function to create the needed formulas relating each predictor variable to the target. From there we just extract the needed results from the respective objects. We use arrange to rank by the F test statistic and then print the output.
Code
f_results <- map_dfr(continuous_features, function(var) {
  formula <- as.formula(paste(var, "~ target"))
  model <- lm(formula, data = df)
  f <- summary(model)$fstatistic
  tibble(Feature = var,
         F_score = f[1],
         p_value = pf(f[1], f[2], f[3], lower.tail = FALSE))
})

f_results <- f_results %>%
  arrange(desc(F_score))

print(f_results, n = Inf)
That is a much quicker and easier way to get a large number of hypothesis tests comparing both the categorical and continuous predictor variables to our target variable.
Measures of Association
Tests of association are best designed for just that, testing the existence of an association between two categorical variables. However, hypothesis tests are impacted by sample size. When we have the same sample size, tests of association can rank significance of variables with p-values. However, when sample sizes are not the same (or degrees of freedom are not the same) between two tests, the tests of association are not best for comparing the strength of an association. In those scenarios, we have measures of strength of association that can be compared across any sample size.
Measures of association were not designed to test if an association exists, as that is what statistical testing is for. They are designed to measure the strength of association. There are dozens of these measures. Three of the most common are the following:

- Odds Ratios (only for comparing two binary variables)
- Cramer’s V (able to compare nominal variables with any number of categories)
- Spearman’s Correlation (able to compare ordinal variables with any number of categories)
An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group. The odds of an event occurring is not the same as the probability that an event occurs. The odds of an event occurring is the probability the event occurs divided by the probability that event does not occur.
\[
Odds = \frac{p}{1-p}
\]
Let’s again examine the cross-tabulation table between central air and bonus eligibility.
Cell Contents
|-------------------------|
| N |
| Chi-square contribution |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 1095
| train$bonus
train$CentralAir | 0 | 1 | Row Total |
-----------------|-----------|-----------|-----------|
N | 60 | 2 | 62 |
| 17.330 | 22.875 | |
| 0.968 | 0.032 | 0.057 |
| 0.096 | 0.004 | |
| 0.055 | 0.002 | |
-----------------|-----------|-----------|-----------|
Y | 563 | 470 | 1033 |
| 1.040 | 1.373 | |
| 0.545 | 0.455 | 0.943 |
| 0.904 | 0.996 | |
| 0.514 | 0.429 | |
-----------------|-----------|-----------|-----------|
Column Total | 623 | 472 | 1095 |
| 0.569 | 0.431 | |
-----------------|-----------|-----------|-----------|
Let’s look at the row without central air. The probability that a home without central air is not bonus eligible is 96.8%. That implies that the odds of not being bonus eligible in homes without central air is 30.25 (= 0.968/0.032). For homes with central air, the odds of not being bonus eligible are 1.20 (= 0.545/0.455). The odds ratio between these two would be approximately 25.21 (= 30.25/1.20). In other words, homes without central air are 25.2 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air. This relationship is intuitive based on the numbers we have seen. Without going into details, it can also be shown that homes with central air are 25.2 times as likely (in terms of odds) to be bonus eligible.
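This arithmetic is easy to verify from the raw cell counts; note that the exact counts give roughly 25.0, while the 25.21 above comes from rounding the row proportions first. A quick sketch:

```python
# Odds of NOT being bonus eligible, by central air status
odds_no_air = 60 / 2      # homes without central air: 60 not eligible, 2 eligible
odds_air = 563 / 470      # homes with central air: 563 not eligible, 470 eligible
odds_ratio = odds_no_air / odds_air
print(round(odds_ratio, 2))  # 25.04
```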
Cramer’s V is another measure of strength of association. Cramer’s V is calculated as follows:
\[
V = \sqrt{\frac{\chi^2_P/n}{\min(Rows-1, Columns-1)}}
\]
Cramer’s V is bounded between 0 and 1 for every comparison other than two binary variables. For two binary variables being compared the bounds are -1 to 1. The idea is still the same for both. The further the value is from 0, the stronger the relationship. Unfortunately, unlike \(R^2\), Cramer’s V has no interpretative value. It can only be used for comparison.
Lastly, we have Spearman’s correlation. Much like the Mantel-Haenszel test of association was specifically designed for comparing two ordinal variables, Spearman correlation measures the strength of association between two ordinal variables. Spearman is not limited to only categorical data analysis as it is also used for detecting heteroscedasticity in linear regression. Remember, Spearman correlation is a correlation on the ranks of the observations as compared to the actual values of the observations.
As previously mentioned, these are only a few of the dozens of different measures of association that exist. However, they are the most used ones.
We can use the fisher_exact function from before to also get the odds ratio. The statistic that is reported in the output is the odds ratio between the two binary variables.
Code
from scipy.stats import fisher_exact

fisher_exact(pd.crosstab(index = train['CentralAir'], columns = train['bonus']))
This is the odds ratio of the left column odds in the top row over the left column odds in the bottom row. This means that homes without central air are 25 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air.
The association function from scipy.stats.contingency gives us the Cramer’s V value with the method = "cramer" option.
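A sketch of the call, run here on the cell counts from the cross-tabulation table (the function accepts any 2-D array of frequencies):

```python
import numpy as np
from scipy.stats.contingency import association

# CentralAir (rows) by bonus (columns) counts
tab = np.array([[60, 2], [563, 470]])

# Cramer's V from the (uncorrected) Pearson chi-square
cramers_v = association(tab, method="cramer")
print(round(cramers_v, 3))  # 0.197
```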
The Cramer’s V value is 0.197. There is no good or bad value for Cramer’s V. There is only better or worse when comparing to another variable. For example, when looking at the relationship between the lot shape of the home and bonus eligibility, the Cramer’s V is 0.30. This would mean that lot shape has a stronger association with bonus eligibility than central air.
The spearmanr function provides Spearman’s correlation. Since these variables are both ordinal (all binary variables are ordinal), Spearman’s correlation would be more appropriate than Cramer’s V.
Code
from scipy.stats import spearmanr

spearmanr(train['CentralAir'], train['bonus'])
Again, as with Cramer’s V, Spearman’s correlation is a comparison metric, not a good vs. bad metric. For example, when looking at the relationship between the number of fireplaces of the home and bonus eligibility, the Spearman’s correlation is 0.42. This would mean that fireplace count has a stronger association with bonus eligibility than central air.
Code
from scipy.stats import spearmanr

spearmanr(train['Fireplaces'], train['bonus'])
This is the odds ratio of the left column odds in the top row over the left column odds in the bottom row. This means that homes without central air are 25 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air.
We use the assocstats function to get the Cramer’s V value. This function also provides the Pearson and Likelihood Ratio \(\chi^2\) tests as well.
The Cramer’s V value is 0.197. There is no good or bad value for Cramer’s V. There is only better or worse when comparing to another variable. For example, when looking at the relationship between the lot shape of the home and bonus eligibility, the Cramer’s V is 0.30. This would mean that lot shape has a stronger association with bonus eligibility than central air.
Code
assocstats(table(train$LotShape, train$bonus))
X^2 df P(> X^2)
Likelihood Ratio 101.61 3 0
Pearson 100.96 3 0
Phi-Coefficient : NA
Contingency Coeff.: 0.291
Cramer's V : 0.304
The cor.test function that gave us Pearson’s correlation also provides Spearman’s correlation. Since these variables are both ordinal (all binary variables are ordinal), Spearman’s correlation would be more appropriate than Cramer’s V.
Code
cor.test(x = as.numeric(ordered(train$CentralAir)),
         y = as.numeric(ordered(train$bonus)),
         method = "spearman")
Spearman's rank correlation rho
data: as.numeric(ordered(train$CentralAir)) and as.numeric(ordered(train$bonus))
S = 175651871, p-value = 4.529e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.1972838
Again, as with Cramer’s V, Spearman’s correlation is a comparison metric, not a good vs. bad metric. For example, when looking at the relationship between the number of fireplaces of the home and bonus eligibility, the Spearman’s correlation is 0.42. This would mean that fireplace count has a stronger association with bonus eligibility than central air.
Code
cor.test(x = as.numeric(ordered(train$Fireplaces)),
         y = as.numeric(ordered(train$bonus)),
         method = "spearman")
Spearman's rank correlation rho
data: as.numeric(ordered(train$Fireplaces)) and as.numeric(ordered(train$bonus))
S = 126641241, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4212588
Now we have explored our data! In the next section it is time to start building our logistic regression model.
---title: "Categorical Data Analysis"format: html: code-fold: show code-tools: trueengine: knitreditor: visual---```{r}#| include: false#| warning: false#| error: false#| message: falselibrary(reticulate)reticulate::use_condaenv("wfu_fall_ml", required =TRUE)``````{python}#| warning: false#| error: false#| message: false#| include: falsefrom sklearn.datasets import fetch_openmlimport pandas as pdfrom sklearn.feature_selection import SelectKBest, f_classifimport numpy as npimport matplotlib.pyplot as plt# Load Ames Housing datadata = fetch_openml(name="house_prices", as_frame=True)ames = data.frame# Remove Business Logic Variablesames = ames.drop(['Id', 'MSSubClass','Functional','MSZoning','Neighborhood', 'LandSlope','Condition2','OverallCond','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','MasVnrArea','ExterCond','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinSF1','BsmtFinType2','BsmtFinSF2','BsmtUnfSF','Electrical','LowQualFinSF','BsmtFullBath','BsmtHalfBath','KitchenAbvGr','GarageFinish','SaleType','SaleCondition'], axis=1)# Remove Missingness Variablesames = ames.drop(['PoolQC', 'MiscFeature','Alley', 'Fence'], axis=1)# Impute Missingness for Categorical Variablescat_cols = ames.select_dtypes(include=['object', 'category']).columnsames[cat_cols] = ames[cat_cols].fillna('Missing')# Remove Low Variability Variablesames = ames.drop(['Street', 'Utilities','Heating'], axis=1)# Train / Test Splitfrom sklearn.model_selection import train_test_splittrain, test = train_test_split(ames, test_size =0.25, random_state =1234)# Impute Missing for Continuous Variablesnum_cols = train.select_dtypes(include='number').columnsfor col in num_cols:if train[col].isnull().any():# Create missing flag column train[f'{col}_was_missing'] = train[col].isnull().astype(int)# Impute with median median = train[col].median() train[col] = train[col].fillna(median)```# Exploratory Data AnalysisFirst, we need to first explore our data before building any models to try and 
explain/predict our categorical target variable. With categorical variables, we can look at the distribution of the categories as well as see if this distribution has any association with other variables. For this analysis we are going to continue to use the Ames housing dataset. However, we need to come up with a binary target variable for this kind of modeling.Imagine you worked for a real estate agency and got a bonus check if you sold a house above \$175,000 in value. Let's create this variable in our data:```{python}#| warning: false#| error: false#| message: falsetrain['bonus'] = (train['SalePrice'] >175000).astype(int)test['bonus'] = (test['SalePrice'] >175000).astype(int)predictors = train.drop(columns=['SalePrice', 'bonus'])predictors = pd.get_dummies(predictors, drop_first=True)predictors = predictors.astype(float)predictors = predictors.drop(['GarageType_Missing', 'GarageQual_Missing','GarageCond_Missing'], axis=1)X = predictorsy = train['bonus']```To keep the results the same between software we have loaded the Python train and test datasets in R and created the same `X` data frame and `y` vector for our predictors and target respectively.```{r}#| warning: false#| error: false#| message: falselibrary(tidyverse)train <- py$traintest <- py$testtrain <- train %>%mutate(bonus =as.integer(SalePrice >175000))test <- test %>%mutate(bonus =as.integer(SalePrice >175000))predictors <- train %>%select(-SalePrice, -bonus)X <-model.matrix(~ ., data = predictors)[, -1] %>%as_tibble()X <- X %>%select(-GarageTypeMissing, -GarageQualMissing, -GarageCondMissing)y <- train$bonus```Remember that we have already split our data into training and testing pieces from the previous sections. 
Because models are prone to discovering small, spurious patterns in the data that is used to create them (the training data), we set aside the validation and/or testing data to get a clear view of how they might perform on new data that the models have never seen before.

You are interested in what variables might be associated with obtaining a higher chance of getting a bonus (selling a house above \$175,000). An association exists between two categorical variables if the distribution of one variable changes when the value of the other categorical variable changes. If there is no association, the distribution of the first variable is the same regardless of the value of the other variable. For example, if we wanted to know if obtaining a bonus on selling a house in Ames, Iowa was associated with whether the house had central air, we could look at the distribution of bonus eligible houses. If we observe that 41% of homes with central air are bonus eligible and 41% of homes without central air are bonus eligible, then it appears that central air has no bearing on whether the home is bonus eligible. However, if instead we observe that only 3% of homes without central air are bonus eligible, but 44% of homes with central air are bonus eligible, then it appears that having central air might be related to a home being bonus eligible.

To understand the distribution of categorical variables we need to look at frequency tables. A frequency table shows the number of observations that occur in certain categories or intervals. A one-way frequency table examines all the categories of one variable. These are easily visualized with bar charts.

Let's see how to do this in each software!

::: {.panel-tabset .nav-pills}

## Python

Let's look at the distribution of both bonus eligibility and central air using the `value_counts` method on columns of the `train` dataset.
The `countplot` function allows us to view our data in a bar chart.

```{python}
#| warning: false
#| error: false
#| message: false
from matplotlib import pyplot as plt
import seaborn as sns

train['bonus'].value_counts()

ax = sns.countplot(x = "bonus", data = train, color = "blue")
ax.set(xlabel = 'Bonus Eligible', ylabel = 'Frequency', title = 'Bar Graph of Bonus Eligibility')
plt.show()
```

```{python}
#| warning: false
#| error: false
#| message: false
train['CentralAir'].value_counts()

plt.cla()
ax = sns.countplot(x = "CentralAir", data = train, color = "blue")
ax.set(xlabel = 'Central Air', ylabel = 'Frequency', title = 'Bar Graph of Central Air Availability')
plt.show()
```

Frequency tables show single variables, but if we want to explore two variables together we look at **cross-tabulation** tables. A cross-tabulation table shows the number of observations for each combination of the row and column variables.

Let's again examine bonus eligibility, but this time across levels of central air. Here, we can use the `crosstab` function from `pandas`.

```{python}
#| warning: false
#| error: false
#| message: false
import pandas as pd

pd.crosstab(index = train['CentralAir'], columns = train['bonus'])
```

```{python}
#| warning: false
#| error: false
#| message: false
plt.cla()
ax = sns.countplot(x = "bonus", data = train, hue = "CentralAir")
ax.set(xlabel = 'Bonus Eligibility', ylabel = 'Frequency', title = 'Comparison of Central Air by Bonus')
plt.show()
```

From the above output we can see that 62 homes have no central air with only 2 of them being bonus eligible. However, there are 1033 homes that have central air with 470 of them being bonus eligible.

## R

Let's look at the distribution of both bonus eligibility and central air using the `table` function.
The `ggplot` function with the `geom_bar` function allows us to view our data in a bar chart.

```{r}
#| warning: false
#| error: false
#| message: false
table(y)

ggplot(data = train) +
  geom_bar(mapping = aes(x = bonus))
```

```{r}
#| warning: false
#| error: false
#| message: false
table(train$CentralAir)

ggplot(data = train) +
  geom_bar(mapping = aes(x = CentralAir))
```

Frequency tables show single variables, but if we want to explore two variables together we look at **cross-tabulation** tables. A cross-tabulation table shows the number of observations for each combination of the row and column variables.

Let's again examine bonus eligibility, but this time across levels of central air. Again, we can use the `table` function. The `prop.table` function allows us to compare two variables in terms of proportions instead of frequencies.

```{r}
#| warning: false
#| error: false
#| message: false
table(train$CentralAir, train$bonus)

prop.table(table(train$CentralAir, train$bonus))

ggplot(data = train) +
  geom_bar(mapping = aes(x = bonus, fill = CentralAir))
```

From the above output we can see that 62 homes have no central air with only 2 of them being bonus eligible. However, there are 1033 homes that have central air with 470 of them being bonus eligible. For an even more detailed breakdown we can use the `CrossTable` function.

```{r}
#| warning: false
#| error: false
#| message: false
library(gmodels)

CrossTable(train$CentralAir, train$bonus, prop.chisq = FALSE, expected = TRUE)
```

The advantage of the `CrossTable` function is that we can easily get not only the frequencies, but the cell, row, and column proportions. For example, the third number in each cell gives us the row proportion. For homes without central air, 96.8% of them are not bonus eligible, while 3.2% of them are. For homes with central air, 54.5% of the homes are not bonus eligible, while 45.5% of them are.
It would appear that the distribution of bonus eligible homes changes across levels of central air - a relationship between the two variables. This apparent relationship needs to be tested statistically for verification. The `expected = TRUE` option provides an expected cell count for each cell. These expected counts help calculate the tests of association in the next section.

:::

# Tests of Association

We have statistical tests to evaluate relationships between two categorical variables. The null hypothesis for these statistical tests is that the two variables have no association - the distribution of one variable does not change across levels of another variable. The alternative hypothesis is an association between the two variables - the distribution of one variable changes across levels of another variable.

These statistical tests follow a $\chi^2$-distribution. The $\chi^2$-distribution is a distribution that has the following characteristics:

- Bounded below by 0
- Right skewed
- One set of degrees of freedom

A plot of a variety of $\chi^2$-distributions is shown here:

```{python}
#| warning: false
#| error: false
#| message: false
#| echo: false
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2

# 1. Create the data frame
f = np.linspace(0, 15, 1501)  # equivalent to 0:1500 / 100 in R
df = pd.DataFrame({
    'f': f,
    'df_1': chi2.pdf(f, df=1),
    'df_2': chi2.pdf(f, df=2),
    'df_3': chi2.pdf(f, df=3),
    'df_5': chi2.pdf(f, df=5),
    'df_10': chi2.pdf(f, df=10)
})

# 2. Convert to long format (like gather/pivot_longer in R)
df_long = df.melt(id_vars='f', var_name='df', value_name='density')

# 3. Plot
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_long, x='f', y='density', hue='df', style='df')
plt.title("Chi-Square at Various Degrees of Freedom")
plt.xlabel("Chi-Square")
plt.ylabel("Density")
plt.xlim(0, 15)
plt.ylim(0, 1)
plt.legend(title="Degrees of Freedom")
plt.tight_layout()
plt.show()
```

Two common $\chi^2$ tests are the Pearson and Likelihood Ratio $\chi^2$ tests. They compare the observed count of observations in each cell of a cross-tabulation table to their expected count **if** there was no relationship. The expected cell count applies the overall distribution of one variable across all the levels of the other variable. For example, overall 56.9% of all homes are not bonus eligible. **If** that were to apply to every level of central air, then of the 62 homes without central air we would expect 35.3 ($= 62 \times 0.569$) to not be bonus eligible while 26.7 ($= 62 \times 0.431$) would be bonus eligible. We actually observe 60 and 2 homes for each of these categories respectively. The further the observed data is from the expected data, the more evidence we have that there is a relationship between the two variables.

The test statistic for the Pearson $\chi^2$ test is the following:

$$\chi^2_P = \sum_{i=1}^R \sum_{j=1}^C \frac{(Obs_{i,j} - Exp_{i,j})^2}{Exp_{i,j}}$$

From the equation above, the closer the observed count of each cross-tabulation table cell (across all rows and columns) is to the expected count, the smaller the test statistic. As with all hypothesis tests, the smaller the test statistic, the larger the p-value, implying less evidence for the alternative hypothesis.

Another common test is the Likelihood Ratio test.
The test statistic for this is the following:

$$\chi^2_L = 2 \times \sum_{i=1}^R \sum_{j=1}^C Obs_{i,j} \times \log\left(\frac{Obs_{i,j}}{Exp_{i,j}}\right)$$

The p-value comes from a $\chi^2$-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one.

Both of the above tests have a sample size requirement: 80% or more of the cells in the cross-tabulation table need **expected** counts larger than 5.

For smaller sample sizes, this requirement might be hard to meet. In those situations, we can use a more computationally expensive test called Fisher's exact test. This test evaluates every possible permutation of the data to calculate the p-value without any distributional assumptions.

Both the Pearson and Likelihood Ratio $\chi^2$ tests can handle any type of categorical variable (either ordinal, nominal, or both). However, ordinal variables provide us extra information since the order of the categories actually matters compared to nominal categories. With two ordinal variables we can test for even more: whether they have a **linear association** as compared to just a general one. An ordinal test for association is the Mantel-Haenszel $\chi^2$ test. The test statistic for the Mantel-Haenszel $\chi^2$ test is the following:

$$\chi^2_{MH} = (n-1)r^2$$

where $r$ is the Pearson correlation between the column and row variables.
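To make these formulas concrete, all three statistics can be checked by hand from the central air cross-tabulation counts reported earlier (60 and 2 homes without central air, 563 and 470 with it). This is a rough sketch with the counts hard-coded for illustration, not a replacement for the library functions:

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = central air (N, Y), columns = bonus (0, 1)
obs = np.array([[60.0, 2.0],
                [563.0, 470.0]])
n = obs.sum()

# Expected count for each cell under no association:
# (row total * column total) / overall total
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Pearson chi-square statistic (no continuity correction)
chi2_p = ((obs - expected) ** 2 / expected).sum()

# Likelihood ratio chi-square statistic
chi2_lr = 2 * (obs * np.log(obs / expected)).sum()

# Mantel-Haenszel chi-square: (n - 1) * r^2, with r the Pearson
# correlation between the 0/1 coded row and column variables
rows = np.repeat([0, 0, 1, 1], obs.ravel().astype(int))
cols = np.repeat([0, 1, 0, 1], obs.ravel().astype(int))
r = np.corrcoef(rows, cols)[0, 1]
chi2_mh = (n - 1) * r ** 2

# All three are compared to a chi-square distribution
# (df = 1 here, since the table is 2x2)
p_value = chi2.sf(chi2_p, df=1)
```

The Pearson value here matches what `chi2_contingency` reports with `correction = False`; with `correction = True` (as called later) Yates' continuity correction gives a slightly smaller statistic.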
This test follows a $\chi^2$-distribution with only one degree of freedom.

Let's see how to do each of these tests in each software!

::: {.panel-tabset .nav-pills}

## Python

Let's examine the relationship between central air and bonus eligibility with the Pearson and Likelihood Ratio tests using the `chi2_contingency` function on our `crosstab` object comparing our two variables.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats import chi2_contingency

chi2_contingency(pd.crosstab(index = train['CentralAir'], columns = train['bonus']), correction = True)
```

The above results show an extremely small p-value that is below any reasonable significance level. This implies that we have statistical evidence for a relationship between having central air and bonus eligibility of homes. The p-value comes from a $\chi^2$-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one.

Since both the central air and bonus eligibility variables are binary, they are ordinal. Since they are both ordinal, we should use the Mantel-Haenszel $\chi^2$ test. However, at the time of writing this, Python does not have a built-in function for it. Since the previous test can handle ordinal vs. ordinal variables, we will just use that.

Although not applicable in this dataset, if we had a small sample size we could use Fisher's exact test.
To perform this test we can use the `fisher_exact` function.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats import fisher_exact

fisher_exact(pd.crosstab(index = train['CentralAir'], columns = train['bonus']))
```

We see the same results as with the other tests because the sample size assumptions were met.

## R

Let's examine the relationship between central air and bonus eligibility with the Pearson and Likelihood Ratio tests using the `assocstats` function on our `table` object comparing `CentralAir` to `bonus`.

```{r}
#| warning: false
#| error: false
#| message: false
library(vcd)

assocstats(table(train$CentralAir, train$bonus))
```

The above results show an extremely small p-value that is below any reasonable significance level. This implies that we have statistical evidence for a relationship between having central air and bonus eligibility of homes. The p-value comes from a $\chi^2$-distribution with degrees of freedom that equal the product of the number of rows minus one and the number of columns minus one.

Since both the central air and bonus eligibility variables are binary, they are ordinal. Since they are both ordinal, we should use the Mantel-Haenszel $\chi^2$ test with the `CMHtest` function. In the main output table, the first row is the Mantel-Haenszel $\chi^2$ test.

```{r}
#| warning: false
#| error: false
#| message: false
library(vcdExtra)

CMHtest(table(train$CentralAir, train$bonus))$table[1,]
```

From here we can see another extremely small p-value as we saw in the earlier, more general $\chi^2$ tests.

Although not applicable in this dataset, if we had a small sample size we could use Fisher's exact test. To perform this test we can use the `fisher.test` function.

```{r}
#| warning: false
#| error: false
#| message: false
fisher.test(table(train$CentralAir, train$bonus))
```

We see the same results as with the other tests because the sample size assumptions were met.

:::

Looking at statistical tests one at a time can be time consuming.
Similar to linear regression, we might need to evaluate large numbers of predictor variables against the target variable:

- Categorical predictors: Tests of association
- Continuous predictors: ANOVA F-test

Let's do this optimized version in each software!

::: {.panel-tabset .nav-pills}

## Python

Since there are different tests to run for different types of predictor variables, the first thing we need to do is split our predictor variables into two dataframes - one for each. Since our data is already dummy coded as binary variables for all categorical variables, we simply count which columns have only 2 unique values and which have more. For the tests themselves we use the `SelectKBest` function from the `sklearn.feature_selection` package, along with the `chi2` and `f_classif` functions to get the proper test for each type of variable.

```{python}
#| warning: false
#| error: false
#| message: false
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Separate categorical (dummy) vs. continuous features
categorical_features = [col for col in X.columns if X[col].nunique() == 2]
continuous_features = [col for col in X.columns if X[col].nunique() > 2]

X_cat = X[categorical_features]
X_cont = X[continuous_features]
```

Let's compare our categorical variables to the categorical target variable first. We use the `SelectKBest` function with the `score_func = chi2` option to use the Pearson $\chi^2$ test. The `k = 'all'` option keeps all of the predictor variables in the output for us to decide which are significant and which are not. Next, we input the target variable `y` and the categorical predictor variables `X_cat` into the `fit` function. From there we just use the `columns`, `scores_`, and `pvalues_` attributes to extract the needed results from the `SelectKBest` object.
We use `sort_values` to rank by the $\chi^2$ test statistic and then print the output.

```{python}
#| warning: false
#| error: false
#| message: false
# Fit SelectKBest for Categorical Variables
selector = SelectKBest(score_func = chi2, k = 'all')
selector.fit(X_cat, y)

# Create a DataFrame with feature names, Chi2-scores, and p-values
scores_cat_df = pd.DataFrame({
    'Feature': X_cat.columns,
    'Chi2_score': selector.scores_,
    'p_value': selector.pvalues_
})

# Sort by Chi2-score descending
scores_cat_df = scores_cat_df.sort_values(by = 'Chi2_score', ascending = False)

with pd.option_context('display.max_rows', None):
    print(scores_cat_df)
```

Let's compare our continuous variables to the categorical target variable next. We use the `SelectKBest` function with the `score_func = f_classif` option to use the ANOVA F test. The `k = 'all'` option keeps all of the predictor variables in the output for us to decide which are significant and which are not. Next, we input the target variable `y` and the continuous predictor variables `X_cont` into the `fit` function. From there we just use the `columns`, `scores_`, and `pvalues_` attributes to extract the needed results from the `SelectKBest` object.
We use `sort_values` to rank by the F test statistic and then print the output.

```{python}
#| warning: false
#| error: false
#| message: false
# Fit SelectKBest for Continuous Variables
selector = SelectKBest(score_func = f_classif, k = 'all')
selector.fit(X_cont, y)

# Create a DataFrame with feature names, F-scores, and p-values
scores_cont_df = pd.DataFrame({
    'Feature': X_cont.columns,
    'F_score': selector.scores_,
    'p_value': selector.pvalues_
})

# Sort by F-score descending
scores_cont_df = scores_cont_df.sort_values(by = 'F_score', ascending = False)

with pd.option_context('display.max_rows', None):
    print(scores_cont_df)
```

That is a much quicker and easier way to get a large number of hypothesis tests comparing both the categorical and continuous predictor variables to our target variable.

## R

Since there are different tests to run for different types of predictor variables, the first thing we need to do is split our predictor variables into two dataframes - one for each. Since our data is already dummy coded as binary variables for all categorical variables, we simply count which columns have only 2 unique values and which have more. To do this we use the `map_lgl` and `n_distinct` functions from the `purrr` and `dplyr` packages respectively.

```{r}
#| warning: false
#| error: false
#| message: false
library(dplyr)
library(purrr)
library(broom)
library(tibble)

names(X)[names(X) == "`1stFlrSF`"] <- "FirstFlrSF"
names(X)[names(X) == "`2ndFlrSF`"] <- "SecondFlrSF"
names(X)[names(X) == "`3SsnPorch`"] <- "ThreeStoryPorch"

df <- bind_cols(X, target = y)

categorical_features <- names(X)[map_lgl(X, ~ n_distinct(.) == 2)]
continuous_features <- names(X)[map_lgl(X, ~ n_distinct(.) > 2)]
```

Let's compare our categorical variables to the categorical target variable first. We use the `map_dfr` function along with the `chisq.test` function to use the Pearson $\chi^2$ test on a `table` of our predictor variable and target variable.
From there we just extract the variable names, test statistics, and p-values from the necessary objects. These results might differ from Python because the `chisq.test` function applies Yates' continuity correction. We use `arrange` to rank by the $\chi^2$ test statistic and then print the output.

```{r}
#| warning: false
#| error: false
#| message: false
chi2_results <- map_dfr(categorical_features, function(var) {
  # Treat each feature as a 0/1 vector
  x <- df[[var]]
  y <- df$target

  tbl <- table(x, y)
  test <- chisq.test(tbl)

  tibble(
    Feature = var,
    Chi2_score = unname(test$statistic),
    p_value = test$p.value
  )
})

chi2_results <- chi2_results %>%
  arrange(desc(Chi2_score))

print(chi2_results, n = Inf)
```

Let's compare our continuous variables to the categorical target variable next. We use the `map_dfr` function with the `lm` function to use the ANOVA F test. We use the `paste` function to create the needed formulas relating each predictor variable to the target. From there we just extract the needed results from the respective objects. We use `arrange` to rank by the F test statistic and then print the output.

```{r}
#| warning: false
#| error: false
#| message: false
f_results <- map_dfr(continuous_features, function(var) {
  formula <- as.formula(paste(var, "~ target"))
  model <- lm(formula, data = df)

  tibble(
    Feature = var,
    F_score = summary(model)$fstatistic[1],
    p_value = pf(summary(model)$fstatistic[1],
                 summary(model)$fstatistic[2],
                 summary(model)$fstatistic[3],
                 lower.tail = FALSE)
  )
})

f_results <- f_results %>%
  arrange(desc(F_score))

print(f_results, n = Inf)
```

That is a much quicker and easier way to get a large number of hypothesis tests comparing both the categorical and continuous predictor variables to our target variable.

:::

# Measures of Association

Tests of association are best designed for just that: testing the existence of an association between two categorical variables. However, hypothesis tests are impacted by sample size.
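To see this sample size effect concretely, here is a small sketch with a made-up 2×2 table: multiplying every cell count by ten leaves the proportions unchanged, but the $\chi^2$ statistic grows tenfold, while a measure of strength of association (Cramer's V, covered below) does not move.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Hypothetical table and the same table at ten times the sample size
small = np.array([[30, 20],
                  [20, 30]])
large = small * 10

# Pearson chi-square grows with n even though the proportions are identical
chi2_small = chi2_contingency(small, correction=False)[0]
chi2_large = chi2_contingency(large, correction=False)[0]

# Cramer's V divides the chi-square by n, so it is unchanged
v_small = association(small, method="cramer")
v_large = association(large, method="cramer")

print(chi2_small, chi2_large)  # 4.0 vs. 40.0
print(v_small, v_large)        # 0.2 both times
```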
When we have the same sample size, tests of association can rank significance of variables with p-values. However, when sample sizes are not the same (or degrees of freedom are not the same) between two tests, the tests of association are not best for comparing the strength of an association. In those scenarios, we have measures of strength of association that can be compared across any sample size.

Measures of association were not designed to test if an association exists, as that is what statistical testing is for. They are designed to measure the strength of association. There are dozens of these measures. Three of the most common are the following:

- Odds Ratios (only for comparing two binary variables)
- Cramer's V (able to compare nominal variables with any number of categories)
- Spearman's Correlation (able to compare ordinal variables with any number of categories)

An **odds ratio** indicates how much more likely, with respect to **odds**, a certain event occurs in one group relative to its occurrence in another group. The odds of an event occurring are *not* the same as the probability that an event occurs. The odds of an event occurring are the probability the event occurs divided by the probability that the event does not occur.

$$Odds = \frac{p}{1-p}$$

Let's again examine the cross-tabulation table between central air and bonus eligibility.

```{r}
#| warning: false
#| error: false
#| message: false
#| echo: false
CrossTable(train$CentralAir, train$bonus)
```

Let's look at the row without central air. The probability that a home without central air is not bonus eligible is 96.8%. That implies that the odds of not being bonus eligible in homes without central air are 30.25 (= 0.968/0.032). For homes with central air, the odds of not being bonus eligible are 1.20 (= 0.545/0.455). The odds ratio between these two would be approximately 25.21 (= 30.25/1.20).
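The odds arithmetic above can be reproduced directly from the cell counts (60 and 2 homes without central air, 563 and 470 with it); using the exact counts rather than the rounded proportions gives an odds ratio of about 25.0:

```python
# Cell counts from the cross-tabulation (hard-coded for illustration)
no_air_not_bonus, no_air_bonus = 60, 2
air_not_bonus, air_bonus = 563, 470

# Odds of NOT being bonus eligible within each group
odds_no_air = no_air_not_bonus / no_air_bonus  # 60 / 2 = 30.0
odds_air = air_not_bonus / air_bonus           # 563 / 470, about 1.198

# Odds ratio: odds without central air relative to odds with it
odds_ratio = odds_no_air / odds_air
print(round(odds_ratio, 2))  # 25.04
```

The 25.21 quoted above reflects rounding the proportions to three digits; the exact-count value is what `fisher_exact` reports in the Python tab.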
In other words, homes without central air are 25.2 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air. This relationship is intuitive based on the numbers we have seen. Without going into details, it can also be shown that homes with central air are 25.2 times as likely (in terms of odds) to be bonus eligible.

**Cramer's V** is another measure of strength of association. Cramer's V is calculated as follows:

$$V = \sqrt{\frac{\chi^2_P/n}{\min(Rows - 1, Columns - 1)}}$$

Cramer's V is bounded between 0 and 1 for every comparison other than two binary variables. For two binary variables being compared the bounds are -1 to 1. The idea is still the same for both. The further the value is from 0, the stronger the relationship. Unfortunately, unlike $R^2$, Cramer's V has no interpretative value. It can only be used for comparison.

Lastly, we have Spearman's correlation. Much like the Mantel-Haenszel test of association was specifically designed for comparing two ordinal variables, Spearman's correlation measures the strength of association between two ordinal variables. Spearman's correlation is not limited to categorical data analysis, as it is also used for detecting heteroscedasticity in linear regression. Remember, Spearman's correlation is a correlation on the ranks of the observations as compared to the actual values of the observations.

As previously mentioned, these are only a few of the dozens of different measures of association that exist. However, they are the most used ones.

Let's see how to do this in each software!

::: {.panel-tabset .nav-pills}

## Python

We can use the `fisher_exact` function from before to also get the odds ratio.
The statistic reported in the output is the odds ratio between the two binary variables.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats import fisher_exact

fisher_exact(pd.crosstab(index = train['CentralAir'], columns = train['bonus']))
```

This is the ratio of the odds of the left column (not bonus eligible) in the top row to the same odds in the bottom row. This means that homes without central air are 25 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air.

The `association` function from `scipy.stats.contingency` gives us the Cramer's V value with the `method = "cramer"` option.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats.contingency import association

association(pd.crosstab(index = train['CentralAir'], columns = train['bonus']), method = "cramer")
```

The Cramer's V value is 0.197. There is no good or bad value for Cramer's V. There is only better or worse when comparing to another variable. For example, when looking at the relationship between the lot shape of the home and bonus eligibility, the Cramer's V is 0.30. This would mean that lot shape has a stronger association with bonus eligibility than central air.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats.contingency import association

association(pd.crosstab(index = train['LotShape'], columns = train['bonus']), method = "cramer")
```

The `spearmanr` function provides Spearman's correlation. Since these variables are both ordinal (all binary variables are ordinal), Spearman's correlation would be more appropriate than Cramer's V.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats import spearmanr

spearmanr(train['CentralAir'], train['bonus'])
```

Again, as with Cramer's V, Spearman's correlation is a comparison metric, not a good vs. bad metric.
For example, when looking at the relationship between the number of fireplaces of the home and bonus eligibility, the Spearman's correlation is 0.42. This would mean that fireplace count has a stronger association with bonus eligibility than central air.

```{python}
#| warning: false
#| error: false
#| message: false
from scipy.stats import spearmanr

spearmanr(train['Fireplaces'], train['bonus'])
```

## R

We can use the `OddsRatio` function to get the odds ratio.

```{r}
#| warning: false
#| error: false
#| message: false
library(DescTools)

OddsRatio(table(train$CentralAir, train$bonus))
```

This is the ratio of the odds of the left column (not bonus eligible) in the top row to the same odds in the bottom row. This means that homes without central air are 25 times as likely (in terms of odds) to not be bonus eligible as compared to homes with central air.

We use the `assocstats` function to get the Cramer's V value. This function also provides the Pearson and Likelihood Ratio $\chi^2$ tests as well.

```{r}
#| warning: false
#| error: false
#| message: false
assocstats(table(train$CentralAir, train$bonus))
```

The Cramer's V value is 0.197. There is no good or bad value for Cramer's V. There is only better or worse when comparing to another variable. For example, when looking at the relationship between the lot shape of the home and bonus eligibility, the Cramer's V is 0.30. This would mean that lot shape has a stronger association with bonus eligibility than central air.

```{r}
#| warning: false
#| error: false
#| message: false
assocstats(table(train$LotShape, train$bonus))
```

The `cor.test` function that gave us Pearson's correlation also provides Spearman's correlation.
Since these variables are both ordinal (all binary variables are ordinal), Spearman's correlation would be more appropriate than Cramer's V.

```{r}
#| warning: false
#| error: false
#| message: false
cor.test(x = as.numeric(ordered(train$CentralAir)),
         y = as.numeric(ordered(train$bonus)),
         method = "spearman")
```

Again, as with Cramer's V, Spearman's correlation is a comparison metric, not a good vs. bad metric. For example, when looking at the relationship between the number of fireplaces of the home and bonus eligibility, the Spearman's correlation is 0.42. This would mean that fireplace count has a stronger association with bonus eligibility than central air.

```{r}
#| warning: false
#| error: false
#| message: false
cor.test(x = as.numeric(ordered(train$Fireplaces)),
         y = as.numeric(ordered(train$bonus)),
         method = "spearman")
```

:::

Now we have explored our data! In the next section it is time to start building our logistic regression model.