Part 2: Inside The ‘Black Box’ of ML

What About Comparing Categorical Features?

Hitachi Solutions Braintrust
3 min read · Apr 3, 2019

By Derek Hughes, Data Scientist, Capax Global

Please read Part 1 prior to reading this article.

We can explore relationships among categorical features by first testing each feature for association with the target variable, and then testing the features we keep for association with one another.

We often use a mosaic plot to get a quick visual overview of the categorical features. It shows the relationships between two or three categorical variables, and all of their levels, at once, which makes areas of interest easy to spot.

For measuring the association or effect between categorical features, Goodman and Kruskal’s lambda and Cramér’s V are good places to start; both give a clear numeric result between 0 (no association) and 1 (perfect association).
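As a rough illustration of these two measures, here is a minimal Python sketch. The helper names `cramers_v` and `goodman_kruskal_lambda` and the small `fields` data frame are invented for this example (they are not from the baseball data set); only `pandas.crosstab` and `scipy.stats.chi2_contingency` are library calls.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramér's V for a two-way table of counts (no bias correction)."""
    counts = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(counts, correction=False)[0]
    n = counts.sum()
    min_dim = min(counts.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

def goodman_kruskal_lambda(table):
    """Lambda for predicting the column variable from the row variable."""
    counts = np.asarray(table, dtype=float)
    n = counts.sum()
    best_without_rows = counts.sum(axis=0).max()  # modal column level overall
    best_with_rows = counts.max(axis=1).sum()     # modal column level within each row
    return (best_with_rows - best_without_rows) / (n - best_without_rows)

# Hypothetical survey-style data, invented for illustration only
fields = pd.DataFrame({
    "field_type": ["indoor_turf"] * 30 + ["outdoor_grass"] * 30,
    "field_quality": ["good"] * 25 + ["poor"] * 5 + ["good"] * 10 + ["poor"] * 20,
})
table = pd.crosstab(fields["field_type"], fields["field_quality"])

v = cramers_v(table)
lam = goodman_kruskal_lambda(table)
print(f"Cramér's V = {v:.3f}, lambda = {lam:.3f}")
```

Note that lambda is asymmetric: swapping rows and columns answers a different question (predicting field type from quality rather than quality from type), whereas Cramér's V is the same either way.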

Another particularly interesting method is correspondence analysis, which works from a contingency table to identify associations between the levels of different categorical features. It can also reveal levels that carry redundant information, which we might then consider combining into a single “new” level.

NOTE: The categorical variables in the baseball data set showed very little variance, so for demo purposes, we simulated the results from a survey rating the qualities of different baseball fields.

#R — correspondence analysis with plot
library(ca) #provides ca() and its plot method
field_table <- table(fields$field_quality, fields$field_type) #contingency table
corresp_anlys <- ca(field_table) #fit the correspondence analysis
plot(corresp_anlys, mass = TRUE, contrib = "absolute", map = "rowgreen", arrows = c(FALSE, TRUE)) #draw the biplot

#Python — correspondence analysis with plot
#contingency table
cont_table = contingency.VarVar(field_quality, field_type, fields)
corresp_anyls = correspondence.CA(cont_table, field_quality.values, field_type.values)
corresp_anyls.plot_biplot()

Correspondence Analysis with Plot
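To make the mechanics behind a correspondence analysis plot less opaque, here is a small sketch that computes the row and column coordinates directly with NumPy's SVD instead of a dedicated package. The function name `correspondence_analysis` and the counts are made up for illustration; a library such as R's `ca` handles many details (supplementary points, scaling options) that this sketch ignores.

```python
import numpy as np

def correspondence_analysis(counts):
    """Simple correspondence analysis of a two-way table of counts.

    Returns row principal coordinates, column principal coordinates,
    and the principal inertias (squared singular values).
    """
    counts = np.asarray(counts, dtype=float)
    P = counts / counts.sum()          # correspondence matrix
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    # Standardized residuals from the independence model r * c'
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sv ** 2

# Hypothetical counts: rows = quality ratings, columns = field types
counts = np.array([[20, 5, 5],
                   [6, 18, 6],
                   [4, 7, 19]])
row_coords, col_coords, inertia = correspondence_analysis(counts)
total_inertia = inertia.sum()

# The first two columns of row_coords / col_coords are what a 2-D biplot draws
print(np.round(row_coords[:, :2], 3))
print(np.round(col_coords[:, :2], 3))
```

The total inertia equals the Chi-squared statistic of the table divided by the number of observations, which is one way to see that correspondence analysis and the Chi-squared test are looking at the same departures from independence.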

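Another method that uses contingency tables is the Chi-squared test for independence. A Chi-squared statistic that is large relative to its degrees of freedom (equivalently, a small p-value) is evidence that the two variables are not independent; on its own it does not tell us how strong the association is, which is what a measure like Cramér’s V is for. Chi-squared is a very common, well-researched test that can quickly flag significant dependent relationships.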

#R — mosaic plot
#large residuals = large difference from that which was expected if independent, so
#large residuals = dependence between levels

library("gplots")
#mosaicplot() itself comes from base R's graphics package
mosaicplot(field_table, shade = TRUE, las = 2, main = "Field Feature Quality by Field Type")

#Python — mosaic plot
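import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
field_table = pd.read_csv("Field_Features_Fields.csv")
#list the columns to cross-tabulate; two columns give a two-way mosaic like the R plot
mosaic(field_table, ["field_quality", "field_type"])
plt.show()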

Mosaic Plot

#Python — chi-squared test
import pandas as pd
from scipy import stats
#chi2_contingency expects a table of counts, so cross-tabulate the two columns first
cont_table = pd.crosstab(field_table["field_quality"], field_table["field_type"])
chi2, p_value, dof, expected = stats.chi2_contingency(cont_table)

The Chi-squared correlation plot reiterates the mosaic plot

Stronger dependence levels:
outdoor_grass — lighting/dugout,
indoor_turf — lighting/dugout,
retractable_ — field
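The “stronger dependence” cells listed above are the ones with large residuals. As a sketch of how such cells can be found numerically, the snippet below computes Pearson residuals, (observed - expected) / sqrt(expected), from the output of `chi2_contingency`. The counts and the row label `other_type` are invented for illustration and are not the survey results.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts, invented for illustration only
observed = pd.DataFrame(
    [[30, 10, 10],
     [10, 30, 10],
     [10, 10, 30]],
    index=["outdoor_grass", "indoor_turf", "other_type"],
    columns=["lighting", "dugout", "field"],
)

chi2, p_value, dof, expected = chi2_contingency(observed)

# Pearson residuals: how far each cell is from what independence would predict
residuals = (observed - expected) / np.sqrt(expected)

# Cells ranked by absolute residual; the top ones drive the dependence
largest = residuals.abs().stack().sort_values(ascending=False).head(3)
print(residuals.round(2))
print(largest)
```

The squared residuals sum to the Chi-squared statistic, so this is simply a cell-by-cell breakdown of the test and matches the shading a mosaic plot uses.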

#R — create categorical bucket
bucket_categorical <- c("field_quality", "field_type")

#Python — create categorical bucket
bucket_categorical = ["field_quality", "field_type"]

What about relationships between continuous/numeric and categorical features?

A simple logistic regression can be used here to gauge the association. Set the categorical feature as the target and the continuous feature as the predictor; if the resulting model predicts the categories clearly better than chance, it’s reasonable to believe the two features are related.
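A minimal sketch of this idea, assuming scikit-learn is available. The data are simulated and the names `is_indoor`, `capacity`, and `cv_auc` are hypothetical; cross-validated ROC AUC is used here as one reasonable way to judge whether the model is “accurate”, with 0.5 meaning no better than chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Simulated data: a binary categorical feature and two numeric features
is_indoor = rng.integers(0, 2, size=n)
capacity = rng.normal(loc=30000 + 8000 * is_indoor, scale=5000, size=n)  # related
noise = rng.normal(size=n)                                               # unrelated

def cv_auc(numeric_feature, categorical_target):
    """Mean 5-fold ROC AUC of a one-predictor logistic regression."""
    X = np.asarray(numeric_feature).reshape(-1, 1)
    model = make_pipeline(StandardScaler(), LogisticRegression())
    scores = cross_val_score(model, X, categorical_target, cv=5, scoring="roc_auc")
    return scores.mean()

auc_related = cv_auc(capacity, is_indoor)
auc_unrelated = cv_auc(noise, is_indoor)
print(f"AUC with related feature:   {auc_related:.2f}")
print(f"AUC with unrelated feature: {auc_unrelated:.2f}")
```

Scoring on held-out folds rather than the training data keeps a flexible model from looking accurate by chance, which matters when this check is repeated across many feature pairs.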

Another approach for identifying linear relationships is the point-biserial correlation coefficient. It is a special case of Pearson’s correlation, has an intuitive output between -1 and 1, and is designed for measuring association between a continuous feature and a binary (two-level) categorical feature.
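For a quick illustration, SciPy provides `scipy.stats.pointbiserialr`. The sketch below uses simulated data with hypothetical names (`is_indoor`, `attendance`) and also shows that the result matches Pearson's correlation computed on the 0/1 coding of the category.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

rng = np.random.default_rng(42)
n = 200

# Simulated data: binary category coded 0/1 and a numeric feature
is_indoor = rng.integers(0, 2, size=n)
attendance = rng.normal(loc=20000 + 5000 * is_indoor, scale=4000, size=n)

r_pb, p_value = pointbiserialr(is_indoor, attendance)
r_pearson, _ = pearsonr(is_indoor, attendance)

print(f"point-biserial r = {r_pb:.3f} (p = {p_value:.3g})")
print(f"Pearson r on 0/1 coding = {r_pearson:.3f}")
```

For a categorical feature with more than two levels this coefficient does not apply directly; each level would need its own 0/1 indicator, or a different measure altogether.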

In part three of the series, we’ll go over the third and fourth buckets, which use wrapper and embedded methods to identify useful features. You can also learn more about what Capax Global does at www.capaxglobal.com.
