GAMMA FACET: A New Approach for Universal Explanations of Machine Learning Models

Published in

GAMMA — Part of BCG X

14 min read · Jan 12, 2021

Authors: Jan Ittner, Konstantin Hemker & Malo Grisard

Rapid advances in artificial intelligence (AI) technologies equip us with an ever-evolving toolset to analyze even the most complex real-world business problems and processes. State-of-the-art machine learning algorithms allow decision makers to accurately predict business-critical outcomes such as cost, speed, quality, or yield. But more often than not, the true business value of AI lies not in merely predicting outcomes (which customers are likely to cancel their contract?), but in explaining and optimizing those outcomes (what must we do to retain high-value customers?).

Moreover, a manager’s willingness to accept the use of machine learning to make day-to-day decisions often hinges on his or her trust in the algorithm, which, in turn, requires some understanding of how an AI model makes predictions and decisions. The need to explain AI models will become even more important as companies develop an increased awareness of ethical AI and seek to design AI models capable of unbiased, safe, and responsible decision-making.

Using Cooperative Game Theory to Explain Feature Contributions

Human-explainable AI has seen huge advances in recent years, especially with the arrival of Shapley Additive Explanations (SHAP), a unifying theory and algorithm applying Cooperative Game Theory to explain individual predictions of a machine learning model. What makes SHAP so attractive is that, despite its advanced mathematical underpinnings, the results are intuitive even to general audiences.

SHAP quantifies the contributions of all of a model’s variables to any given predicted outcome. For example, we might use a model to predict a patient’s risk of developing diabetes, based on variables such as weight, age, and exercise habits. The model might tell us that the risk is 67%. SHAP will go further, telling us, for example, that the patient’s age increases the diabetes risk by 5%, while her weight decreases it by 2%.
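To make the additive idea concrete, here is a minimal sketch that computes exact Shapley values for a toy risk function by averaging each feature's marginal contribution over all feature orderings. The risk function, the baseline patient, and all numbers are invented for illustration, and the brute-force enumeration is only practical for a handful of features; the SHAP library relies on far more efficient estimators.

```python
from itertools import permutations
from math import isclose

FEATURES = ("age", "weight", "exercise_hours")

# Hypothetical toy "model": the formula and coefficients are made up.
def risk(age, weight, exercise_hours):
    score = 0.10 + 0.004 * age + 0.002 * weight - 0.02 * exercise_hours
    score += 0.00002 * age * weight  # interaction between age and weight
    return score

def shapley_values(model, patient, baseline):
    """Exact Shapley values of `patient` relative to `baseline`."""
    totals = {name: 0.0 for name in FEATURES}
    orderings = list(permutations(FEATURES))
    for ordering in orderings:
        current = dict(baseline)
        previous = model(**current)
        for name in ordering:
            # Switch one feature from the baseline value to the patient's
            # value and record how much the prediction moves.
            current[name] = patient[name]
            value = model(**current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}

baseline = {"age": 40, "weight": 75, "exercise_hours": 3}  # reference patient
patient = {"age": 60, "weight": 70, "exercise_hours": 1}

phi = shapley_values(risk, patient, baseline)
for name, value in phi.items():
    print(f"{name:>15}: {value:+.4f}")

# Additivity: contributions sum to the prediction minus the baseline prediction.
assert isclose(sum(phi.values()), risk(**patient) - risk(**baseline), abs_tol=1e-12)
```

The final assertion checks the property that makes such explanations easy to read: the per-feature contributions add up exactly to the difference between this patient's prediction and the reference prediction.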

The Clarifying Role of Model Inspection and Virtual Experiments

SHAP offers a highly useful means of explaining individual predictions. Until recently, however, there were only limited means to explain the model as a whole — to explain in general how variables act and interact to come up with predictions. GAMMA FACET presents a new, holistic approach to explain machine learning models. It does so from two angles: First, it uses a newly developed model inspection algorithm to explain how variables of a predictive model collaborate to predict outcomes by identifying patterns across the explanations of many individual predictions. For example, GAMMA FACET might find that weight and body mass need to be considered in combination to assess diabetes risk, whereas body-mass index and height/waist ratio might be interchangeable. Second, it applies a simulation approach to determine in “virtual experiments” how systematic changes to key factors impact predicted outcomes, such as how an increase in age affects diabetes risk across a population of patients.

Case Study: Preventing Breakdowns in Water Drilling

The best way to explore these concepts is through a real-world example. Drilling a water well is dangerous and costly. The costs are driven by the time it takes to complete a well so that water can be pumped from it; depending on soil characteristics, day rates can range from $30k to $250k. To reduce those costs, drillers are usually incentivized to drill at a faster pace, measured as the Rate of Penetration (ROP). But there is a trade-off: Drilling faster increases the risk of incidents, such as a formation collapse or a gas infiltration. We will therefore build a machine-learning model to understand the impact of drilling speed on the incident risk, in the context of other risk factors.

For the sake of clarity, we use a simplified dataset for this example. The dataset contains 500 observations, with each row representing a drilling operation of the past, along with a binary indicator of whether or not a well-drilling incident happened in the operation.

Based on present and past operating conditions, a predictive algorithm can be used to alert the drilling operator of a high incident risk. The operators would then have the opportunity to adjust drilling parameters. However, knowing when to act is often not enough. The operator also needs to understand why there are incidents, and which are the optimal drilling conditions that balance drilling cost with the cost of a potential incident. GAMMA FACET can help to deliver these actionable insights.

scikit-learn and the Model Pipeline

To form the backbone of our explainable machine learning model, we must first build a model pipeline that allows us to trace all outputs of the model back to the initial data inputs, across all transformation and training steps.

GAMMA FACET is designed around scikit-learn, the de-facto industry standard for machine learning in Python. scikit-learn offers a wide range of regression and classification algorithms. It also offers a universal approach to building machine learning pipelines, combining data pre-processing with the actual model fitting in one integrated workflow.

FACET enhances scikit-learn in three essential ways:

End-to-end feature traceability: While native scikit-learn is built around numpy and produces all outputs as numerical arrays, FACET provides enhanced versions of more than 150 scikit-learn transformers, regressors, and classifiers that deliver all outputs as pandas data frames. Additionally, FACET includes attributes for mapping the names of derived features back to the features from which they originated. This mapping is essential if the features are to be referred to by name further downstream in the machine learning pipeline.
Enhanced pipelining: FACET introduces additional two-step pipeline classes that contain one pre-processing step (which may itself be a pipeline) and a learner step. Our experience is that this seemingly minor addition leads to significantly more concise and readable pipelining code.
Enhanced validation: FACET introduces cross-validators for several variants of bootstrapping, a statistical technique that is especially relevant in the context of FACET’s simulation capabilities.
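The idea behind feature traceability can be illustrated without any library. The sketch below is not FACET or sklearndf code: it is a dependency-free toy that one-hot encodes a column and records, for every output feature, which input feature it came from. The column names, values, and the helper one_hot_with_lineage are all hypothetical.

```python
# Toy records; column names and values are invented for illustration.
rows = [
    {"formation": "shale", "depth_ft": 1200.0},
    {"formation": "sandstone", "depth_ft": 800.0},
    {"formation": "shale", "depth_ft": 1500.0},
]

def one_hot_with_lineage(rows, column):
    """One-hot encode `column`; return the encoded rows and a lineage map.

    The lineage map sends each output feature name to the input feature
    it was derived from.
    """
    categories = sorted({row[column] for row in rows})
    lineage = {}
    encoded = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name == column:
                for category in categories:
                    derived = f"{column}_{category}"
                    new_row[derived] = 1 if value == category else 0
                    lineage[derived] = column
            else:
                new_row[name] = value
                lineage[name] = name  # passed through unchanged
        encoded.append(new_row)
    return encoded, lineage

encoded, lineage = one_hot_with_lineage(rows, "formation")
print(list(encoded[0]))  # output feature names, still human-readable
print(lineage)           # derived feature -> originating feature
```

With such a mapping, an importance or explanation computed for a derived column can be attributed back to the original input further downstream.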

Referring back to the drilling example, here is how we might construct a pipeline using FACET’s support for feature traceability:

As the code snippet shows, the pipeline looks almost exactly the same as a pipeline constructed using pure scikit-learn. Notice, however, that we import all pipelines, transformers, and estimators from FACET’s sklearndf package, and that the names of the familiar scikit-learn classes all have a “DF” suffix. Also note the special ClassifierPipelineDF, one of FACET’s enhanced pipelines comprising one optional pre-processing step, along with a subsequent learner step guaranteed to be a classifier. As you can see in the output, the pre-processing result is a data frame that preserves all feature names.

Next, we want to tune our model’s hyper-parameters using FACET’s LearnerRanker. LearnerRanker operates similarly to scikit-learn’s grid searcher, but makes it much easier to let several types of models compete against each other, rather than optimizing the hyper-parameters of a single model:
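The principle of letting several model families compete can be sketched without FACET. The snippet below is not the LearnerRanker API; it is a dependency-free toy in which two hypothetical learners (ThresholdRule and MajorityClass), each with its own parameter grid, are scored by cross-validation and placed in one common ranking. The data and the simple interleaved fold scheme are assumptions made for illustration.

```python
from itertools import product
from statistics import mean

# Toy data: (rate_of_penetration, incident) pairs, invented for illustration.
data = [(5, 0), (8, 0), (12, 0), (15, 0), (18, 1), (22, 0),
        (25, 1), (28, 1), (31, 1), (35, 1), (10, 0), (27, 1)]

class ThresholdRule:
    """Predict an incident when the feature exceeds a fixed threshold."""
    def __init__(self, threshold):
        self.threshold = threshold
    def fit(self, rows):
        return self  # nothing to learn; the threshold is a hyper-parameter
    def predict(self, x):
        return int(x > self.threshold)

class MajorityClass:
    """Always predict the most frequent label seen in training."""
    def fit(self, rows):
        labels = [y for _, y in rows]
        self.label_ = int(sum(labels) * 2 >= len(labels))
        return self
    def predict(self, x):
        return self.label_

def cross_val_accuracy(make_model, rows, n_folds=4):
    """Mean accuracy over interleaved folds (every n-th row is held out)."""
    scores = []
    for fold in range(n_folds):
        test = rows[fold::n_folds]
        train = [r for i, r in enumerate(rows) if i % n_folds != fold]
        model = make_model().fit(train)
        scores.append(mean(int(model.predict(x) == y) for x, y in test))
    return mean(scores)

# Each candidate family brings its own hyper-parameter grid.
candidates = [
    (ThresholdRule, {"threshold": [10, 20, 30]}),
    (MajorityClass, {}),
]

ranking = []
for family, grid in candidates:
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(lambda: family(**params), data)
        ranking.append((score, family.__name__, params))

# One ranking across all families and parameter combinations.
ranking.sort(key=lambda item: item[0], reverse=True)
for score, name, params in ranking:
    print(f"{score:.2f}  {name}  {params}")
```

The key design point is that the ranking is shared: a parameter combination of one model family has to beat every combination of every other family, not just its siblings.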

Model Explanations: What Causes Incidents?

We now have a tuned and trained model that predicts the incident risk of our drilling operations. But we want to be more proactive than just deploying the model and responding ad-hoc to predicted risks. Instead, we want to know what the model has learned about why and when incidents happen. We want the model to help us understand how we can systematically change the way we operate our drilling machinery to reduce the incident risk.

FACET approaches model explanation as a combination of two methods:

1. Explaining global feature interactions: This method tells us what the model has learned about how features contribute both individually and collectively to outcomes. FACET introduces a new algorithm that, for each pair of features, quantifies synergy, redundancy, and independence (see below for more details). This algorithm is based on SHAP vector decomposition, a mathematical framework we developed for global-model explanations and that we will detail further in an upcoming publication.
2. Model-based simulations: This method allows us to identify how systematic feature changes will help achieve a desired outcome, in this case to minimize the risk of a drill breakdown. We achieve this outcome by creating synthetic samples for a range of values and then using the model to evaluate changes in predicted risk. As you will see below, understanding global feature interactions (i.e., method 1) is an essential step to ensuring that our simulations are valid under real-world conditions.

When used in our drilling example, FACET’s LearnerInspector class provides an overview of feature interactions by calculating pairwise synergy and redundancy:

The results are two matrices that, for any pair of features, tell us as a percentage the degree of synergy and redundancy between these two features.

Synergy

Synergy is the degree to which the model combines information from one feature with another to predict the target. For example, let’s assume we are predicting cardiovascular health using age and gender and the fitted model includes a complex interaction between them. This means these two features are synergistic for predicting cardiovascular health. Further, both features are important to the model, and removing either one would significantly impact performance. Let’s assume age is a more important feature than gender, so age contributes more to the combined prediction than gender. This asymmetric contribution means the synergy for (age, gender) is less than the synergy for (gender, age). To think about it another way, imagine the prediction is a coordinate you are trying to reach. From your starting point, age gets you much closer to this point than gender; however, you need both to get there. Synergy reflects the fact that gender gets more help from age (higher synergy from the perspective of gender) than age does from gender (lower synergy from the perspective of age) to reach the prediction.

This leads to an important point: synergy is a naturally asymmetric property of the global information two interacting features contribute to the model predictions. Synergy is expressed as a percentage ranging from 0% (full autonomy) to 100% (full synergy). Note that synergistic features can be completely uncorrelated, and they can be hard to spot through regular exploratory analysis.

To interpret the synergy matrix, the first feature in a pair is the row (“perspective from”), and the second feature the column. In our drilling example, FACET reports that “from the perspective” of rotation speed, 67% of the information is combined with weight on the bit to predict failure. This seems sensible in context, as drilling with both a high bit weight and a high rotation can have a disproportionately large impact on the wear of the equipment, and so drastically increase the incident risk. It is understandable that the synergy is also high from the perspective of weight on the bit (61%). This tells us that we should look at rotation speed and weight on the bit together to understand their contributions to incident risk.
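The row/column convention can be spelled out with a small sketch. It only stores the two synergy values quoted above in a nested mapping keyed first by the "perspective from" feature; the describe helper is hypothetical and no FACET code is involved.

```python
# Synergy values quoted in the text; the outer key is the row
# ("perspective from" feature), the inner key is the column.
synergy = {
    "rotation speed": {"weight on bit": 0.67},
    "weight on bit": {"rotation speed": 0.61},
}

def describe(matrix, row, column):
    """Spell out one matrix cell in words."""
    value = matrix[row][column]
    return (f"From the perspective of {row}, {value:.0%} of its information "
            f"is combined with {column}.")

print(describe(synergy, "rotation speed", "weight on bit"))
print(describe(synergy, "weight on bit", "rotation speed"))

# The matrix is not symmetric: swapping row and column changes the value.
assert synergy["rotation speed"]["weight on bit"] != synergy["weight on bit"]["rotation speed"]
```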

Redundancy

Redundancy is the degree to which a feature in a model duplicates the information of a second feature to predict the target. For example, let’s assume we had house size and number of bedrooms for predicting house price. These features capture similar information: the more bedrooms, the larger the house, and likely the higher the price on average. The redundancy for (number of bedrooms, house size) will be greater than the redundancy for (house size, number of bedrooms). This is because house size “knows” more of what number of bedrooms does for predicting house price than vice-versa. Hence, there is greater redundancy from the perspective of number of bedrooms. Another way to think about it is that removing house size will be more detrimental to model performance than removing number of bedrooms, as house size can better compensate for the absence of number of bedrooms. This also implies that house size would be a more important feature than number of bedrooms in the model.

The important point here is that like synergy, redundancy is a naturally asymmetric property of the global information feature pairs have for predicting an outcome. Redundancy is expressed as a percentage ranging from 0% (full uniqueness) to 100% (full redundancy). Redundancy cannot necessarily be spotted in exploratory analysis if two features are redundant but not linearly correlated.

As with synergy, the matrix row is the “perspective from” feature in the row-column feature pair. For our drilling example, we observe two pairs of highly redundant features:

The first redundant feature pair is ROP and IROP. The redundancy is similar from the perspective of either feature (75%) because one is the inverse of the other and so they can substitute one another in the model for incident risk. This is a good example of FACET’s ability to pick up redundancies between features even when they are not linearly correlated.
The second pair of redundant features is depth of operation and hole diameter. From the perspective of hole diameter, 53% of the information is duplicated with depth of operation to predict failure. Intuitively this makes sense: depth of operation and hole diameter are closely linked, because drillers use thinner drill bits as they drill deeper into the earth. The redundancy for (depth of operation, hole diameter) is slightly lower than for (hole diameter, depth of operation) because depth of operation is a slightly more important feature in the model.
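Why a feature and its inverse can escape a purely linear check can be shown in a few lines. The sketch below compares the Pearson correlation of made-up ROP values and their inverse with the rank-based (Spearman) correlation. It only illustrates the limits of linear correlation; it is not how FACET computes redundancy.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """Rank positions (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result

# Hypothetical ROP values; IROP is defined as the inverse of ROP.
rop = [2, 4, 6, 10, 15, 20, 30, 40]
irop = [1 / v for v in rop]

linear = pearson(rop, irop)
monotonic = pearson(ranks(rop), ranks(irop))  # Spearman rank correlation

print(f"Pearson:  {linear:.2f}")
print(f"Spearman: {monotonic:.2f}")
```

Each feature fully determines the other, yet the linear coefficient stays well short of a perfect -1, while the rank correlation captures the relationship completely. A model-based redundancy measure does not depend on the relationship being linear.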

FACET can also present synergy or redundancy relationships in a second type of diagram, a hierarchical clustering dendrogram, which is very useful for assessing them. Note that this approach relies on a symmetric variant of redundancy (or synergy), which provides not only a simplified perspective but also a feature distance (1 - metric) for clustering. In our example, we are interested in the redundancy dendrogram:

The redundancy dendrogram reveals clusters of redundant features, indicates the degree of mutual redundancy among features in a cluster (the further left features are merged in the dendrogram, the stronger their redundancy) and, using a color scale, shows feature importance for individual features and clusters of features.
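How a feature distance of 1 - metric leads to such a dendrogram can be sketched with a small single-linkage clustering in plain Python. Several assumptions are baked in: the symmetric variant is approximated by averaging both directions (FACET's actual symmetric metric may be computed differently), the values 0.75 and 0.53 are taken from the text, and all other redundancy values are made-up placeholders.

```python
from itertools import combinations

features = ["ROP", "IROP", "depth", "hole diameter"]

# Asymmetric redundancy keyed as (perspective_from, other). 0.75 and 0.53 are
# quoted in the text; every other value is a made-up placeholder.
redundancy = {
    ("ROP", "IROP"): 0.75, ("IROP", "ROP"): 0.75,
    ("hole diameter", "depth"): 0.53, ("depth", "hole diameter"): 0.50,
}
DEFAULT = 0.05  # placeholder redundancy for all remaining pairs

def distance(a, b):
    """1 - symmetric redundancy; averaging both directions is an assumption."""
    r_ab = redundancy.get((a, b), DEFAULT)
    r_ba = redundancy.get((b, a), DEFAULT)
    return 1.0 - (r_ab + r_ba) / 2

def single_linkage(items):
    """Repeatedly merge the two closest clusters; return the merge log."""
    clusters = [frozenset([item]) for item in items]
    merges = []
    while len(clusters) > 1:
        (c1, c2), d = min(
            (((x, y), min(distance(a, b) for a in x for b in y))
             for x, y in combinations(clusters, 2)),
            key=lambda pair: pair[1],
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((sorted(c1), sorted(c2), round(d, 3)))
    return merges

merges = single_linkage(features)
for left, right, d in merges:
    print(f"merge {left} + {right} at distance {d}")
```

Highly redundant pairs have a small distance and are merged first, which corresponds to the merges drawn furthest to the left in the dendrogram.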

Our two pairs of redundant features are clearly recognizable in the dendrogram, including their combined importance. The rate of penetration (ROP) is highly redundant with its inverse feature (IROP) (> 80% redundancy), and the combined importance of both features is 36%. Given that we want to simulate ROP, we will remove IROP to make sure that the feature we simulate is a unique contributor to the outcome (we will provide a more detailed explanation in the next section).

There is an interesting observation when we generate a new redundancy dendrogram after removing IROP and re-training the model: The feature importance of ROP has gone up to 35%, indicating that ROP has taken on the role of the former ROP/IROP cluster in explaining ROP-related contributions to the incident risk.

Redundancy linkage dendrogram after feature pruning

Simulating Feature Uplift

Having inspected the model, we have arrived at a good understanding of how the model makes a prediction, and how the predictors interact with each other.

Frequently these insights lead straight to a “what if” question: How can we systematically change an influenceable variable to improve the outcome? In our example, we want to understand how changes in Rate of Penetration impact incident risk. From an economic standpoint, drill operators will try to drill as fast as possible while maintaining safety and reducing failures. Similar questions apply to other business contexts, where the aim is to reduce costs, maximize yield, retain customers or, indeed, optimize any business outcome based on data.

FACET’s simulation approach singles out one feature, then runs a series of “virtual experiments” for a range of values, pretending for each experiment that the simulated feature always took the given value for each historical observation.

With this approach it is crucial that the feature we simulate is not redundant with any other feature in the model. Otherwise, we would run the risk of creating infeasible scenarios by adjusting the value of one feature but not of its redundant sibling. If, in our example, we were to simulate ROP for a range of values but kept IROP in the model, we would more than likely create infeasible scenarios.

A simulation is conducted in two steps: First, we decide on a feature to be simulated and choose one of FACET’s partitioner classes to split the range of previously observed values for that feature into partitions. Second, we run one simulation per partition, each time fixing the value of the simulated feature at the partition’s central value across all observations.

Therefore, in our example, the simulation is asking the question: “What would my average incident risk have been had I always drilled with X ft/h of ROP?” ROP is measured as a real number, so we use a ContinuousRangePartitioner to create a series of equally sized partitions within the range of values historically observed for ROP. It is important that we simulate only within the historically observed range of values, since the model has been trained on values in that range and usually will not be able to come up with valid out-of-range extrapolations.
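The two steps can be sketched in plain Python. This is not FACET's partitioner or simulator: equal_width_partitions is a hypothetical helper, predict_risk is an invented stand-in for a fitted classifier, and the historical observations are made up.

```python
from math import exp

def equal_width_partitions(values, n_partitions):
    """Split the observed range into equal-width bins; return (centers, counts)."""
    low, high = min(values), max(values)
    width = (high - low) / n_partitions
    centers = [low + (i + 0.5) * width for i in range(n_partitions)]
    counts = [0] * n_partitions
    for v in values:
        # The maximum value belongs to the last bin.
        index = min(int((v - low) / width), n_partitions - 1)
        counts[index] += 1
    return centers, counts

# Hypothetical stand-in for a fitted classifier: incident probability as a
# function of ROP (ft/h) and depth (ft). The coefficients are invented.
def predict_risk(rop, depth):
    return 1 / (1 + exp(-(-4.0 + 0.12 * rop + 0.0005 * depth)))

# Invented historical observations: (rop, depth)
history = [(10, 800), (14, 1200), (18, 1500), (22, 900),
           (26, 2000), (30, 1100), (12, 1700), (20, 1300)]

# Step 1: partition the observed ROP range.
centers, counts = equal_width_partitions([rop for rop, _ in history], 4)

# Step 2: one virtual experiment per partition.
simulated = []
for center in centers:
    # Pretend every historical operation had been drilled at `center`,
    # keeping all other features as observed.
    risks = [predict_risk(center, depth) for _, depth in history]
    simulated.append(sum(risks) / len(risks))

for center, count, risk in zip(centers, counts, simulated):
    print(f"ROP {center:5.1f} ft/h  (n={count})  mean predicted risk {risk:.2f}")
```

Because the partitions are derived from the minimum and maximum of the observed values, the simulation never asks the model about ROP values outside the range it was trained on. The per-partition counts show how much historical support each simulated value has.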

Simulation Step 1 — Partitioning the observed values for the ROP feature into buckets of equal size

Using the best model we previously identified with the LearnerRanker, the simulator now determines the average predicted incident risk for each partition. FACET supports bootstrapping, allowing us to repeat each simulation many times on variations of the model trained on different subsets of the data, and use the distribution of simulation results to quantify the level of confidence in the simulation.
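The bootstrapping idea can be sketched as follows. Everything here is an assumption made for illustration: the data are synthetic, the model is a tiny hand-written logistic regression rather than the tuned FACET pipeline, and the simulation averages predictions over the full sample, which may differ from what FACET does internally.

```python
import random
from math import exp
from statistics import median, quantiles

random.seed(0)  # reproducible illustration

def sigmoid(z):
    return 1 / (1 + exp(-z))

# Synthetic data: (rop, depth, incident), with risk rising in ROP.
def make_row():
    rop, depth = random.uniform(5, 35), random.uniform(500, 2500)
    incident = int(random.random() < sigmoid(-4 + 0.15 * rop + 0.0005 * depth))
    return rop, depth, incident

data = [make_row() for _ in range(100)]

def fit_logistic(rows, epochs=100, lr=1.0):
    """Tiny logistic regression fitted by batch gradient descent."""
    w = [0.0, 0.0, 0.0]  # bias, ROP weight, depth weight
    # Rescale features by fixed constants to keep gradient steps stable.
    scaled = [((1.0, rop / 30, depth / 2000), y) for rop, depth, y in rows]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for x, y in scaled:
            error = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j in range(3):
                grad[j] += error * x[j]
        w = [wi - lr * g / len(scaled) for wi, g in zip(w, grad)]
    return lambda rop, depth: sigmoid(w[0] + w[1] * rop / 30 + w[2] * depth / 2000)

centers = [8, 14, 20, 26, 32]  # partition centers within the observed ROP range
n_bootstraps = 30
results = {c: [] for c in centers}
for _ in range(n_bootstraps):
    resample = random.choices(data, k=len(data))  # sample rows with replacement
    model = fit_logistic(resample)
    for c in centers:
        # Fix ROP at the partition center for every observation.
        risks = [model(c, depth) for _, depth, _ in data]
        results[c].append(sum(risks) / len(risks))

summary = []
for c in centers:
    cuts = quantiles(results[c], n=40, method="inclusive")  # 2.5% steps
    summary.append((c, cuts[0], median(results[c]), cuts[-1]))
    print(f"ROP {c:2d}: median {summary[-1][2]:.2f} "
          f"[{summary[-1][1]:.2f}, {summary[-1][3]:.2f}]")
```

Each bootstrap model yields one simulated risk per partition; the spread of those values across models is what the confidence band summarizes.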

Simulation Step 2 — Simulating outcomes for different values of the ROP feature

The visualization above shows the effect of ROP on the incident risk, while also providing a sense of the confidence in the simulation. The x-axis shows the different partitions for which the simulation was run. The bars below the x-axis show the number of observations in the original sample that fall within the partition, indicating the support we have for each simulated value. (Note how the confidence interval broadens toward the margins as we see fewer actually observed values for these partitions.) The central ascending line represents the median predicted incident risk for each partition, while the outer lines show the 95% confidence interval of all cross-validation splits from our prediction.

The simulation confirms that the incident likelihood increases significantly as we increase ROP. It provides an insight into the risk levels of operating at faster ROP. We can also see that there have been multiple occasions where the ROP was operated at a dangerously high level (> 30 ft/h), leading to incident likelihoods above 70%.

What’s Next?

From a business perspective, having the answers to these “what if” questions is extremely valuable to the process of assessing risk and finding ways to improve current processes. In our example, the journey would not end here. Based on the results from the simulation, a next step could be to conduct a cost-benefit analysis of drilling with a slower or faster ROP to achieve the best trade-off between drilling cost and the financial risk of a drilling incident.

You can easily take GAMMA FACET for a spin. Simply run conda install gamma-facet -c bcg_gamma -c conda-forge, then check out our GitHub repository for further documentation and worked examples.

Acknowledgements

This package would not have been possible without the availability of two outstanding Python packages for machine learning and model explainability:

1. scikit-learn provides the learners and transformers that make up the underlying machine learning pipelines of FACET. Moreover, we designed the FACET API in line with the basic fit/transform/predict paradigm of scikit-learn to give data scientists an easy start with FACET.

2. The SHAP implementation by Scott Lundberg is used to estimate the SHAP vectors being decomposed into the synergy, redundancy, and independence vectors.