Responsible Machine Learning

by Om Bathija, Machine Learning Intern at Eightfold.ai

Manav Mehra
Engineering at Eightfold.ai
May 5, 2023


Having interpretable Machine Learning models can sometimes be as important as having accurate ones. At Eightfold.ai, one of our constant goals has been to encourage diversity and reduce bias in hiring. Since our platform uses a series of complex deep neural network models to make its predictions, understanding how it arrives at those predictions is a critical part of ensuring that it’s a fair predictor that doesn’t rely on any signals that could contribute to bias.

In this post, we will dive into understanding a simple Decision Tree Classifier. In the next post, we will talk about doing the same for deep neural network models.

Experiment setup

We’re going to consider a very simple experiment with randomly generated data: building a classifier that helps us decide whether or not we should play tennis, based on the temperature and the wind speed.

Here’s the code that’s used to generate the data:

import numpy as np
from random import randint

NUM_SAMPLES = 10000

data = []

for i in range(NUM_SAMPLES):
    temp = randint(60, 110)
    windspeed = randint(5, 50)
    unrelated_feature = randint(90, 100)

    if temp < 90 and windspeed < 20:
        # ideal conditions for tennis!
        # let's play
        label = 1
    else:
        label = 0

    # as you can see, the unrelated feature has (true to its name)
    # not been used to calculate the label

    # let's mislabel data ~20% of the time, just to make it more interesting
    mislabel = randint(0, 9)
    if mislabel < 2:
        # flip the label
        label = 0 if label else 1

    data.append([temp, windspeed, unrelated_feature, label])

data = np.array(data)

Training our classifier

This is accomplished in a couple of lines of code using sklearn. Let’s start by training a DecisionTreeClassifier:

from sklearn.tree import DecisionTreeClassifier

features = data[:, :-1]
labels = data[:, -1]
clf = DecisionTreeClassifier(max_depth=3).fit(features, labels)

This is a fairly simple problem, so we’re not going to need a deep tree to achieve a decent classifier. Now that we have a classifier object, we can dig into it to understand what’s going on.
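If you want to confirm that the tree is actually decent before interpreting it, a quick training-set accuracy check is enough for this toy example; a minimal sketch:

# quick sanity check: mean accuracy on the training data
# (with ~20% label noise, expect something well below 1.0)
print(clf.score(features, labels))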

Feature Importance

This is probably the first thing you should check when you have trained a tree-based model. Most tree-based models use a metric such as Gini impurity to decide which feature to split the dataset on at each node. Sklearn’s tree-based classifiers provide the feature_importances_ attribute, which effectively tells you how heavily the model relied on each feature while constructing the tree.

It’s easy to access through the clf.feature_importances_ attribute:

import pandas as pd

print(pd.Series(clf.feature_importances_, index=['temp', 'windspeed', 'unrelated_feature']))

>> temp 0.529555
>> windspeed 0.467440
>> unrelated_feature 0.003006

This shows us that the unrelated_feature is the least important feature, which is what we’d expect to see.

Decision Rules

Any given path down a constructed decision tree is essentially a boolean expression: a conjunction of tests, each comparing a particular feature’s value against a threshold that was determined during construction. One of the most interesting exercises you can carry out is to expose these “decision rules” being used by the classifier.

If you’d like a visual representation of your tree, that’s easy to build, thanks to the export_graphviz function provided by sklearn.

from sklearn import tree

tree.export_graphviz(clf, feature_names=['temp', 'windspeed', 'unrelated_feature'])

While this visual representation works great for simple trees, it’s virtually useless for even moderately complex ones, because there are so many paths to follow down to the bottom.
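If you prefer the rules as plain text rather than as a graph, sklearn also provides an export_text helper; a minimal sketch (it runs into the same readability problem on large trees):

from sklearn.tree import export_text

# plain-text dump of the tree's decision rules
print(export_text(clf, feature_names=['temp', 'windspeed', 'unrelated_feature']))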

A simple way to transform this valuable data into a more usable format is to convert it to a (rule) => (number of samples affected) mapping. This allows you to zero in on the most important paths in the tree.

We actually have access to the entire tree object created by sklearn, so there are many different ways in which you can build out this mapping. The relevant arrays live on clf.tree_:

# left and right children
clf.tree_.children_left
clf.tree_.children_right

# feature used to split at each node
clf.tree_.feature

# threshold used for each split
clf.tree_.threshold
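Putting these together, here’s one way the mapping could be built: a recursive walk over the arrays above that collects each root-to-leaf rule along with the number of training samples that reach that leaf (collect_rules is just an illustrative helper, not part of sklearn):

feature_names = ['temp', 'windspeed', 'unrelated_feature']
tree_ = clf.tree_

def collect_rules(node=0, conditions=()):
    # leaf nodes have no children (both set to -1)
    if tree_.children_left[node] == tree_.children_right[node]:
        yield ' and '.join(conditions), int(tree_.n_node_samples[node])
        return
    name = feature_names[tree_.feature[node]]
    threshold = tree_.threshold[node]
    # the left child holds the samples where feature <= threshold
    yield from collect_rules(tree_.children_left[node],
                             conditions + (f'{name} <= {threshold:.1f}',))
    yield from collect_rules(tree_.children_right[node],
                             conditions + (f'{name} > {threshold:.1f}',))

# (rule) => (number of samples affected), biggest paths first
for rule, samples in sorted(collect_rules(), key=lambda x: -x[1]):
    print(samples, rule)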

Here’s what the mapping looks like for the tree we just constructed:

As we can see, the tree relies heavily on windspeed to make its decisions: almost 7000 of the 10000 samples are classified without even looking at the temperature! This seems counterintuitive, since looking at the feature importances one would assume that temperature and wind speed are used roughly equally. However, since the feature_importances_ attribute is based on the average impurity decrease each feature provides, rather than on the total number of samples that feature affects, it’s always a good idea to check the rules as well.

The other insight afforded by the decision rules is the thresholds being picked. As can be seen here, the tree has actually recovered the artificial thresholds we used (windspeed < 20 and temperature < 90) quite effectively. Of course, we rarely know the underlying distribution that produced our dataset, but looking for outliers in these thresholds is a great way to identify problems in the data or in the model.

So the next time you’re training a tree-based model, don’t just look at the F1 score. The interpretability of Decision Trees is one of their most powerful features, so make sure to take advantage of it!

About Eightfold.ai

Eightfold.ai delivers the Talent Intelligence Platform™, the most effective way for companies to identify promising candidates, meet diversity hiring goals, retain top performers, and serve their recruitment needs at lower total cost. Eightfold’s patented Artificial Intelligence–based platform enables more than 100 enterprises to turn talent management into a competitive advantage. Built by top engineers out of Facebook, Google and other leading technology companies, Eightfold is based in Mountain View, California.
