Part 1: Opening The ‘Black Box’ of ML

Understanding Data Relationships to Identify Useful Features for Machine Learning Models

Hitachi Solutions Braintrust
10 min read · Mar 11, 2019


By Derek Hughes, Data Scientist, Capax Global

Understanding the relationships within data is a critical part of any data scientist’s work. We need to explain to project stakeholders why we selected the features used in the model. We need to understand the qualities of every individual feature, how every feature relates to every other feature, and how combinations of features behave within a myriad of algorithms.

It’s no longer enough to tell a stakeholder “it’s a black box.” Most often, the most effective machine learning models use features engineered from a thorough analysis of the data. This is the “art” part of machine learning and data science. This article intends to shed light on the infamous “black box” and present an organized framework for developing deeper insights about your features so you can create more effective machine learning models.

How do we make sense of it?

Identifying and creating useful features for accurate machine learning models is challenging and time-consuming; having a flexible framework that works for most machine learning problems can help. The goal of this approach is to combine sound statistical methodologies in an organized manner so that we understand our data and, even more importantly, which parts of the data are not well represented by our model. This approach allows us to identify groups of useful features more quickly and clearly, can be used when we have minimal domain knowledge about our use case, assists when explaining our decision making to project stakeholders, and can serve as a basis for future feature engineering tasks.

Why not use every feature?

While it seems intuitive to include in our model every feature that may explain anything about our target feature, doing so can be costly on many levels in the world of machine learning models. At a high level, first, we need to abide by the law of parsimony, which states that a simpler model is often the best model for a few reasons.

  1. First, a simpler model is much easier to interpret and to explain to the data science project stakeholders.
  2. Second, simpler models are less likely to overfit, which is when our model performs fantastically on the same data we used to develop it but very poorly on new, unseen data (which is exactly the data it would see in a production environment and the whole point of creating the model!).
  3. Third, in the age of Big Data, simpler models have much lower computational costs and processing times.

The second major reason not to include every possible feature is the curse of dimensionality — which essentially means we need a lot more data for each additional feature for the model to produce stable results, as the short sketch below illustrates.
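
To make the curse of dimensionality concrete, here is a small sketch (not from the original article, just an illustration assuming numpy is available): it samples 1,000 points in a unit hypercube and shows how quickly a fixed “neighborhood” empties out as features are added.

#Python — toy illustration of the curse of dimensionality (illustrative sketch only)
import numpy as np

rng = np.random.default_rng(42)
for n_features in [1, 2, 5, 10, 20]:
    X = rng.uniform(size=(1000, n_features))  # 1,000 points in the unit hypercube
    # fraction of points landing in a central box spanning 50% of each axis
    inside = np.all((X > 0.25) & (X < 0.75), axis=1).mean()
    print(n_features, round(inside, 4))

With one feature roughly half the points fall in the box; with 10 features almost none do, so far more data is needed to populate the space as features are added.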

Where to start?

Assuming we have a maturely developed supervised learning use case, a good place to begin is with our target, response, or dependent feature — or for the layman, what we’re trying to predict. Now is a good time to point out that I will be using the terms variable and feature interchangeably with both meaning the data in a specific column in a relational database table. Also, the same interchangeability applies when using the target/response/dependent variable to represent the feature we are trying to predict.

Furthermore, it’s important to understand that for structured data these features can be continuous (think numeric in terms of measuring something) or categorical (textual or numeric data that represent categories such as shirt colors) — both of which require vastly different methodologies to understand their internal patterns and external relationships. We need to understand why the statistical techniques are useful, when to use them, if our data meets the assumptions for the methodology, and how to interpret the results.

OK, now that formalities are out of the way, let’s dive into it!

What’s up with the target variable?

Starting with the target variable is useful because, ultimately, this is what we are trying to predict. We want to understand the target variable intimately as a single variable and in relation to other information in the data.

For understanding the target variable, basic exploratory data analysis and common visualizations are sufficient. We want to know the basics: the range, distribution, mean, median, mode, etc. You know, the summary statistics and basic stuff, but why?

Generally, we are looking for non-normal distributions, extreme values, and an understanding of how our target variable differs over the data range. We are looking for abnormalities by applying questions such as, “Does this make sense in the context of this use case?” to each of our summary statistics. For example, “Does it make sense to have these minimum and maximum values for this use case?” or “Should we have negative values here?” How we handle abnormal or unexpected values will be addressed in later parts of this blog series.
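
Before building any plots, a quick look at the summary statistics helps frame those questions. Here is a minimal sketch (assuming the same baseball data frame and hits86 target used in the code below):

#Python — summary statistics for the target variable (minimal sketch; data is assumed to be a pandas DataFrame loaded elsewhere)
data['hits86'].describe()     # count, mean, std, min, quartiles, max
data['hits86'].mode()         # most frequent value(s)
data['hits86'].isna().sum()   # count of missing values, worth flagging early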

#R — organize data
#data set is baseball stats and the target variable to predict is number of hits in 1986

#collect categorical variables
char_var <- names(data)[which(sapply(data, is.character))]
factor_var <- names(data)[which(sapply(data, is.factor))]
cat_var <- c(char_var, factor_var)
cat_var <- cat_var[-c(1,2)] #drop the first two variable names by position

#collect numeric variables
numeric_var <- names(data)[which(sapply(data, is.numeric))]
numeric_var <- numeric_var[which(!numeric_var %in% cat_var )]
numeric_var <- numeric_var[-c(1,18)] #drop the 1st and 18th variable names by position
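
The step above is shown in R only; a rough Python equivalent is sketched below (assuming the same baseball data frame; which identifier or target columns to exclude will depend on the data set, so that step is left out here).

#Python — collect categorical and numeric variables (rough equivalent of the R step above)
cat_var = data.select_dtypes(include=['object', 'category']).columns.tolist()
numeric_var = data.select_dtypes(include='number').columns.tolist()
numeric_var = [c for c in numeric_var if c not in cat_var]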

#R — create density plot
library(ggplot2)
a <- ggplot(data, aes(hits86))
a + geom_density(kernel="gaussian") #density plot of the target variable

#Python — create density plot
import seaborn as sns
sns.distplot(data['hits86'], hist=False, kde=True,
             kde_kws={'linewidth': 3})

#R — create histogram
a + geom_histogram(binwidth=20)

#Python — create histogram
import matplotlib.pyplot as plt
plt.hist(data['hits86'], bins=20, density=True, alpha=0.5,
         histtype='stepfilled', color='#2d98d4',
         edgecolor='#37474f')

How does the target variable relate to the predictor features?

Now that we understand our target variable, let’s see how it relates to the other variables/features in the data — the predictor features. We begin with a scatterplot and heatmap for numeric features and look for any features with strong correlations (positive or negative) to our target variable. We default to Spearman instead of Pearson correlations when possible, because the Spearman metric is non-parametric and rank-based, so it can identify monotonic relationships even when they are non-linear, while Pearson captures only linear relationships.
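
To see the difference, here is a small illustrative sketch (not from the original article; it assumes numpy and scipy are available) comparing the two metrics on a relationship that is perfectly monotonic but far from linear:

#Python — Pearson vs. Spearman on a monotonic, non-linear relationship (illustrative sketch)
import numpy as np
from scipy import stats

x = np.linspace(1, 10, 100)
y = np.exp(x)  # strictly increasing with x, but strongly non-linear

print(stats.pearsonr(x, y)[0])   # noticeably below 1: a straight line fits poorly
print(stats.spearmanr(x, y)[0])  # exactly 1: the ranks move together perfectly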

#R — create correlation heatmap
library(corrplot)

#p-values for the correlations (used to blank out insignificant cells below)
p.mat <- cor.mtest(data[, numeric_var], method = "spearman")$p

#correlation matrix
correlations <- cor(data[, numeric_var], method = "spearman") #get the correlations
corrplot(correlations, order = "hclust") #group the correlations

#plot the heatmap
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA")) #example diverging palette
corrplot(correlations, method = "color", col = col(200),
         type = "upper", order = "hclust",
         addCoef.col = "black",
         tl.col = "black", tl.srt = 45, tl.cex = .75, # Text label color and rotation
         p.mat = p.mat, sig.level = 0.05, insig = "blank", # Combine with significance
         number.cex = 9/ncol(correlations),
         diag = FALSE)

#Python — create correlation heatmap
corr = data.corr(method='spearman') # calculate the (Spearman) correlation matrix

# plot the heatmap
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)

#R — create scatterplot
b <- ggplot(data, aes(hits86, runs86))
b + geom_point(col="#2d98d4") +
  labs(title="Hits vs. Runs in 1986",
       x = "Hits in 1986", y = "Runs in 1986") +
  theme_bw()

#Python — create scatterplot
data.plot.scatter('hits86', 'runs86')

What’s a bucket?

At this point, we put these features into what we will call our “correlation bucket” of features. Think of it as a bucket full of numeric features with strong correlations to our target variable. We’ll be more specific and call this bucket our “numeric correlation bucket,” because in a moment we’ll also create a correlation bucket for categorical features.

#R — create continuous features bucket
bucket_continuous <- data[, c("atbat86", "runs86", "walks86", "rbi86", "homer86")]

#Python — create continuous features bucket
bucket_continuous = data[['atbat86', 'runs86', 'walks86', 'rbi86', 'homer86']]

What do we do with our bucket?

The idea is not to simply use correlations and statistical significance for feature selection (there are already many established techniques for that). The idea goes beyond that: to learn about the data, to learn about the distribution/variance in the target variable that is not being explained, and to understand what new features could be created or what external data could be imported to fill in that gap. Finally, we need to explain to the data science project stakeholders why we selected the features in the model (the “it’s a black box” or “the model selects them on its own” reply often isn’t a sufficient response), and this process provides us with the reasoning behind our decisions.

Multi-what?

Staying within our numeric correlation bucket, the next step is to conduct pairwise correlation checks between each of these features followed by variance checks of each feature. The primary reason for pairwise correlation checks is to identify potential multicollinearity issues.

Imagine you and some friends decide to have some fun, pull a prank, and push an unsuspecting friend into a swimming pool as he walks by. As your friend walks by, at the same moment, you and your friends jump up and simultaneously push your unsuspecting buddy into the pool. But which of your friends contributed the most to launching your buddy into his watery destination? Individually each of your friends is strongly correlated with, in this case, the appropriately named target variable (your wet friend), but because each caused the effect at the same time (providing correlated, overlapping, and redundant information) it’s difficult to identify which of these friends contributed the most.

This is the issue with multicollinearity, and many machine learning models do not respond well to it. Predictor feature sets with multicollinearity can produce erroneous and/or biased results, so removing multicollinearity issues allows us to increase our faith in the model predictions.

There is no hard rule for what correlation level defines multicollinearity, but we want to remove those predictor features that are highly correlated with each other. I personally remove one of the pair if the correlation is above 0.9. To decide which one to remove, I run a simple single-variable regression on the target variable using each predictor variable and see which provides the best accuracy (keeping the predictor with the best accuracy, of course).

At this point we should have a bucket of predictor variables that are a) strongly correlated with the target variable, but b) not obnoxiously correlated with each other.
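
One way to put this into practice is a small helper that automates the pruning. This is only a sketch of the idea (the prune_correlated function below is hypothetical, not part of the article): for each pair with an absolute Spearman correlation above 0.9, fit a one-variable linear regression against the target for each member of the pair and drop the weaker one.

#Python — sketch of automated pruning of highly correlated predictors (hypothetical helper)
from sklearn.linear_model import LinearRegression

def prune_correlated(features, target, threshold=0.9):
    corr = features.corr(method='spearman').abs()
    keep = list(features.columns)
    for i, a in enumerate(features.columns):
        for b in features.columns[i + 1:]:
            if a in keep and b in keep and corr.loc[a, b] > threshold:
                # keep whichever feature explains the target better on its own
                score_a = LinearRegression().fit(features[[a]], target).score(features[[a]], target)
                score_b = LinearRegression().fit(features[[b]], target).score(features[[b]], target)
                keep.remove(b if score_a >= score_b else a)
    return features[keep]

#example usage: bucket_continuous = prune_correlated(bucket_continuous, data['hits86'])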

#R — correlation plot with bucket one (identify multicollinearity)
p.mat <- cor.mtest(bucket_continuous, method = "spearman")$p
correlations <- cor(bucket_continuous, method = "spearman")
corrplot(correlations, order = "hclust")

# plot the heatmap
corrplot(correlations, method = "color", col = col(200),
         type = "upper", order = "hclust",
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 45, tl.cex = .75, # Text label color and rotation
         p.mat = p.mat, sig.level = 0.05, insig = "blank", # Combine with significance
         number.cex = 9/ncol(correlations),
         diag = FALSE)

#Python — correlation plot with bucket one (identify multicollinearity)
corr = bucket_continuous.corr(method='spearman') # calculate the correlation matrix for the bucket

# plot the heatmap
import seaborn as sns
sns.heatmap(corr,
            xticklabels=corr.columns,
            yticklabels=corr.columns)

#Python — conduct pair-wise correlation
data['atbat86'].corr(data['runs86'], method='spearman')

#Which predictor variable to remove?

#R — use single predictor model (linear regression) to compare
summary(lm(hits86 ~ atbat86, data)) #R-squared = 0.9367
summary(lm(hits86 ~ runs86, data)) #R-squared = 0.85

#Python — use single predictor model (linear regression) to compare
import pandas as pd
atbat86 = pd.DataFrame(data['atbat86'], columns=['atbat86'])
runs86 = pd.DataFrame(data['runs86'], columns=['runs86'])
y = pd.DataFrame(data['hits86'], columns=['hits86'])

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(atbat86, y) #fit linear regression on atbat86
lm.score(atbat86, y) #score (R-squared) for atbat86
lm.fit(runs86, y) #fit linear regression on runs86
lm.score(runs86, y) #score (R-squared) for runs86

#R — remove runs in 1986
bucket_continuous <- subset(bucket_continuous, select = -c(runs86))

names(bucket_continuous)
[1] "atbat86" "walks86" "rbi86"   "homer86"

#Python — remove runs in 1986
bucket_continuous = bucket_continuous.drop(['runs86'], axis=1)

Wait, we want variance?

The next step for filtering our features is checking the variance of each feature. It should be noted that the correlation checks above will often have filtered out low-variance features already, but for the sake of demonstration, and for the cases where they do not, it’s worth discussing.

Let’s consider another example with our unsuspecting and now drenched friend. Let’s say our unsuspecting friend is a little slow at identifying the prank and continues to walk in front of you and your friends at the pool. Of course, not to miss any opportunity, you and your friends, with great joy, send him airborne into the pool every time.

This happens over and over, but occasionally some of your friends miss the opportunity because they are not around when he walks by, allowing our water-loving friend to make it past the group of pranksters and escape the pool on occasion. However, if one of the pranksters is always around when your soaked friend approaches (meaning there is little to no variance in this prankster’s behavior), regardless of whether your friend ends up in the pool or not, then it’s very difficult to use that prankster to predict whether the target is going into the water. Hence, we discard features with zero or very little variance in their values.

#R — check for zero or near-zero variance
library(caret)
nearZeroVar(bucket_continuous, saveMetrics = TRUE) #check for zero or near-zero variance features (none here)

#Python — check for zero or near-zero variance
from sklearn.feature_selection import VarianceThreshold
X = bucket_continuous #set features to check
sel = VarianceThreshold(threshold=0.8) #flag features whose variance falls below 0.8
sel.fit_transform(X) #returns only the features that pass the variance check

None of the variables fall below the zero or near-zero variance tolerances, so none are removed.

Summary

So far, we have:

  • Learned how to segment our features into separate buckets for future analysis
  • Identified the need to understand the features in our data at a deep level
  • Covered methods to identify relationships with the feature we’re trying to predict
  • Covered methods to identify relationships between continuous/numeric features
  • Identified methods that help combat modeling issues such as multicollinearity and zero variance

Stay tuned for Part 2 of the series, where we’ll discuss how to create a bucket for categorical features. You can also learn more about what Capax Global does at www.capaxglobal.com.
