Beyond Explainability: Quantifying the Trustworthiness of a Neural Network

Sheldon Fernandez
DarwinAI
Nov 10, 2020

Explainability tells us how AI works, but says nothing about when it can be trusted. This limitation must be confronted.

Earlier this year, in our XAI primer, we introduced the concept of explainability by way of a provocative thought experiment; namely, by asking readers to consider a major scientific advancement and remove the foundational knowledge that enabled it:

At Kitty Hawk in 1904, the Wright brothers chance upon a strange and unwieldy metallic structure. Its design and aerodynamic properties are a mystery, but through much trial-and-error, they’re flying…

In such a scenario, we argued, the euphoria of the moment would be offset by a nagging question: could we leverage something with an imperfect understanding of how it worked? What is the limit of a fundamental breakthrough absent its fundamental understanding?

As mentioned in the aforementioned piece, the ‘black box’ problem that plagues AI — our inability to peek inside neural networks and understand how they work — represents one of the most urgent moral and business imperatives of our time.

The solution, termed explainability, involves demystifying the machinery that makes the system work: uncovering the aerodynamic properties that enabled Wilbur and Orville to steer their mechanical contraption through twelve seconds of flight.

As we detailed at length, explainability in deep learning is considerably more complex: numerous approaches have been explored to identify the critical factors in a network’s decision-making, which can then be verified using clever counterfactual techniques.

For all its promise, however, explainability is just a first step towards a complete understanding of the system. For example, deconstructing how an aircraft works under specific conditions does not assure us it will work for all conditions (strong headwinds, rain, snow, slush, etc.).

In other words, while explainability reveals how and why the system works, it tells us nothing about when and where the system can be trusted (or the opposite: when and where it should not be trusted).

As AI increases in both scope and sophistication, this critical limitation must be confronted…

The basis for quantifying trust

In this post, we detail the academic work of our team who authored two seminal papers on the topic of neural network trust (here and here).

In developing this scholarship, our Chief Scientist, Dr. Alexander Wong, asked two questions about the relationship between trust and confidence:

  1. How much trust do we accord an actor who gives incorrect answers with great confidence?
  2. How much trust do we accord an actor who gives correct answers hesitantly?

These inquiries, which are rooted in social psychology, posit a straightforward relationship between confidence and trustworthiness. Specifically, trust increases when correct answers are given confidently and when incorrect answers are expressed cautiously. Likewise, it decreases when high and low levels of confidence are behind incorrect and correct responses, respectively.

Using this lens, we can begin to quantify the trustworthiness of a neural network against a particular dataset for specific situations. The metrics below all employ this framework and, in concert, they can characterize network trust in a robust and reliable manner. In devising them, our team wanted to ensure a level of practicality in their design, which entailed two additional requirements. Namely, the metrics needed to be:

  1. Simple to compute to enable trust quantification not only end-of-cycle, but in a continuous manner to help identify when a neural network is no longer trustworthy in light of changing data dynamics.
  2. Intuitive, such that human designers and evaluators have a clear understanding of what the metrics mean and the particulars they communicate.

You can judge whether the metrics in the following sections meet these requirements.

Metric 1: Question-answer trust

The first metric, termed question-answer trust, quantifies the trustworthiness of a neural network for an individual answer. At first blush, this metric may seem confusing — after all, deep learning networks provide confidence scores for their operations: “I’m 70% certain this is a picture of a hammer”.

The difference — outlined in the aforementioned papers — is the introduction of an all-knowing oracle into the process who knows the answers ahead of time and can evaluate trust by means of the criteria above, rewarding well-placed confidence and penalizing undeserved overconfidence in line with human assessments of credibility.

The oracle knows with absolute certainty, for example, that ‘Picture X’ contains a hammer. As such:

  • If the network is 90% confident it’s a pen, its question-answer trust score will be negatively affected.
  • If the network is only 30% confident it’s a hammer, its question-answer trust score will likewise be negatively affected.
  • If the network is 80% confident it’s a hammer, its question-answer trust score will be positively affected.
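To make the scoring rule concrete, here is a minimal sketch in Python. It assumes the simplest reading of the idea, rewarding the confidence behind correct answers and the hesitance behind incorrect ones; the function name and the alpha/beta exponents are illustrative rather than the exact formulation in the papers.

```python
# A minimal sketch of a question-answer trust score, assuming the simplest
# reading of the idea above: reward the confidence behind correct answers and
# the hesitance behind incorrect ones. The `alpha` and `beta` exponents
# (reward/penalty strengths) are illustrative, not the papers' exact formula.
def question_answer_trust(confidence, is_correct, alpha=1.0, beta=1.0):
    if is_correct:
        # Correct answer: trust rises with the confidence behind it.
        return confidence ** alpha
    # Incorrect answer: trust rises with the hesitance (1 - confidence).
    return (1.0 - confidence) ** beta

# The oracle knows 'Picture X' contains a hammer:
print(question_answer_trust(0.90, is_correct=False))  # confident and wrong -> low trust (~0.10)
print(question_answer_trust(0.30, is_correct=True))   # hesitant and right  -> modest trust (0.30)
print(question_answer_trust(0.80, is_correct=True))   # confident and right -> high trust (0.80)
```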

It is worth emphasizing that question-answer trust is not intended to be used in isolation, but rather in combination with the metrics that follow.

Metric 2: Trust density

Building on the previous metric, trust density characterizes the trustworthiness of a network across a particular answer scenario (e.g., all hammer examples in the dataset).

This metric is best interpreted visually:

The figure above depicts the trust densities for the ‘hammer’ images in the ImageNet dataset for four popular networks: ResNet-50, MobileNet-V1, MobileNet-V2, and ShuffleNet-V2.

On a technical level, trust density is a probability distribution, which is to say that the area under each curve sums to 1. On a visual level, the metric paints an intuitive portrait of how trustworthy a network is for a specific answer scenario.
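As a rough sketch of how such a curve might be produced, one could run a kernel density estimate over the per-sample question-answer trust scores for a scenario; the snippet below uses synthetic scores and is not the papers’ exact procedure.

```python
import numpy as np
from scipy.stats import gaussian_kde

# A rough sketch of a trust density: a kernel density estimate over the
# per-sample question-answer trust scores for one answer scenario (e.g. every
# image a network labeled 'hammer'). The beta-distributed scores below are a
# synthetic stand-in; real scores would come from question_answer_trust().
rng = np.random.default_rng(0)
hammer_trust_scores = rng.beta(5, 2, size=500)    # stand-in scores in [0, 1]

density = gaussian_kde(hammer_trust_scores)       # the trust density curve
xs = np.linspace(0.0, 1.0, 200)
print(xs[np.argmax(density(xs))])                 # where the 'peak' sits on the trust axis
```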

In this example, we’re evaluating the trustworthiness of each network when it correctly identifies a hammer in the ImageNet database. A glance at the graph suggests ResNet-50 is significantly more trustworthy than the other networks in this context. This is reflected by the ‘peak’ to the right of the graph, which signifies that the preponderance of the network’s hammer predictions are made with a high level of trust according to our question-answer trust metric. In other words, when ResNet-50 identifies a hammer accurately, it does so in a manner that is fairly trustworthy.

More nuanced insight is provided by comparing MobileNet-V2 (green) with ShuffleNet-V2 (red). Although the average height of both graphs is similar (the trust spectrum coefficients described in the next section), the shape of their curves (trust densities) varies considerably. Whereas MobileNet-V2 presents higher densities at both extremes, ShuffleNet-V2 exhibits higher densities in the mid-range trust region and moderate densities in the higher trust range.

The resulting interpretation is subtle and important: when it comes to correctly identifying hammers, both networks exhibit the same level of trust in the aggregate. However, while MobileNet-V2’s moderate score is arrived at unevenly — some of its decisions are highly trustworthy whereas others are not — ShuffleNet-V2 performs with modest consistency. Colloquially, we might say that MobileNet-V2 is a volatile actor, combining moments of great skill with moments of incompetence, whereas ShuffleNet-V2 is the more workmanlike specimen: dependable, reliable, but rarely impressing.

Note that trust densities can also be computed for incorrect answer combinations (e.g., instances where a network identifies a hammer as a pen). In this case, the metric communicates the extent to which one can trust an erroneous prediction. The idea is that actors unsure of their predictions should declare their hesitance — i.e., incorrect answers expressed with low confidence instill trust that the model is in fact on shaky ground.

Metric 2a: Conditional trust density

The dual nature of trust — the extent to which it is influenced by confident incorrect answers and tentative correct ones — allows us to decompose trust densities to further effect. Specifically, we can separate the trust density curve into two curves that reveal a model’s trustworthiness when it provides: 1) the correct answer for a given scenario; and 2) the incorrect one. This bifurcation results in the conditional trust densities shown below.

The green and red curves depict trust densities for correct and incorrect answers, respectively. Together, they sum to the blue curve that depicts the model’s overall trust density (previous section) for a question-answer pair.
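One plausible way to perform the split is sketched below, under the assumption that each conditional curve is weighted by its share of samples so that the green and red curves sum to the blue one; the helper name and inputs are illustrative, not the papers’ exact pipeline.

```python
import numpy as np
from scipy.stats import gaussian_kde

# A hedged sketch of conditional trust densities: split a scenario's trust
# scores by correctness, estimate a density for each subset, and weight each
# by its share of samples so the green and red curves sum to the blue one.
def conditional_trust_densities(trust_scores, is_correct, grid_size=200):
    trust_scores = np.asarray(trust_scores)
    is_correct = np.asarray(is_correct, dtype=bool)
    xs = np.linspace(0.0, 1.0, grid_size)
    w = is_correct.mean()                                        # fraction answered correctly
    green = w * gaussian_kde(trust_scores[is_correct])(xs)       # correct answers
    red = (1 - w) * gaussian_kde(trust_scores[~is_correct])(xs)  # incorrect answers
    return xs, green, red
```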

In this case, the conditional trust densities illustrate the behavior of ResNet-50 for the ‘monitor’ scenario in ImageNet. Moving from left to right along the green curve, we encounter low scores for low trust regions — which is to be expected, as correct answers are less likely to have low trust levels while still being correct (i.e., accurate predictions are typically made with a base level of confidence). The peak of the green curve is instead focused around the mid-trust regions of the density curve, before tapering off as we approach the high-trust areas.

Interpretation: ResNet-50 is moderately trustworthy at correctly identifying monitors; while it is rarely hesitant in such circumstances, it is rarely confident as well.

Shifting our attention to the red curve, and again scanning left to right, we encounter two peaks: one in the very-low trust region and one in the mid-trust area that overlaps with the green peak just described.

The first peak is particularly problematic from a trust perspective, as when ResNet-50 mistakenly identifies an object as a monitor it is quite confident of its answer (overconfident, in fact). The second peak is problematic but in a different way: by overlapping with the green peak, the network is moderately confident about both correct and incorrect answers, which can be difficult to disambiguate.

Based on these results we can conclude that ResNet-50 is not very trustworthy for this specific scenario, as areas of low-trust brought on by overconfidence for incorrect responses are intermixed with areas of moderate trust for right and wrong answers.

In sum, the behaviors characterized by trust densities and their conditional counterparts provide a nuanced understanding of a model’s trustworthiness for question-answer scenarios, where trust breaks down, and whether the methods for mitigating its shortcomings are effective.

Metric 3: Trust Spectrum

If trust density captures a microscopic view of the network — its trustworthiness for a given question-answer pair — trust spectrum provides a macroscopic assessment by depicting trust across all possible answer scenarios.

For this metric, we compute trust spectrum coefficients for each scenario by collapsing the model’s density curve into a single scalar value (essentially the ‘average height’ of the curve, but consult the aforementioned papers for the mathematical nuances). Plotting these coefficients across all answers results in the trust spectrum below.
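As a simplified illustration, the snippet below treats each coefficient as the mean trust score for its scenario, a stand-in for the density-weighted quantity in the papers; the function name and example values are illustrative.

```python
import numpy as np

# A simplified trust spectrum, assuming each coefficient is the mean
# question-answer trust over a scenario's samples (a stand-in for the
# density-weighted quantity described in the papers).
def trust_spectrum(scores_by_scenario):
    """scores_by_scenario: dict mapping answer scenario -> list of trust scores."""
    return {name: float(np.mean(scores)) for name, scores in scores_by_scenario.items()}

spectrum = trust_spectrum({"hammer": [0.85, 0.7, 0.9], "monitor": [0.4, 0.55, 0.3]})
print(spectrum)  # {'hammer': 0.816..., 'monitor': 0.416...}
```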

As before, the figure provides comparative results across four networks. An examination of the spectra reveals a few noteworthy points.

First, there are answers where the trustworthiness of all the networks is quite low (e.g., neither screens nor monitors can be detected with high confidence). In practice, such cases would warrant a deeper investigation into their causes and strategies such as training with additional data to mitigate them.

Second, there are several answer scenarios where the variance in trust is quite pronounced. For example, ResNet-50 is significantly more trustworthy at identifying hammers than MobileNet-V2 and ShuffleNet-V2, which are, in turn, considerably better than MobileNet-V1.

By pinpointing where trust breaks down in a comparative and scenario-specific manner, trust spectra provide useful insight into how a network can be improved. In this example, we might investigate the architectural nuances that allow ResNet-50 to effectively identify hammers so they can be integrated into other network designs.

Metric 4: NetTrustScore

While it is tempting to reduce everything to a single metric (e.g., performance, accuracy), the preceding sections hopefully illustrate that each metric conveys rich information that can have important implications for network design and optimization.

With that caveat in mind, we introduce NetTrustScore, which is a scalar metric of the overall trustworthiness of a deep neural network (you might think this metric is computed by simply averaging the trust scores for all answer scenarios, but it is slightly more complex as it must account for the frequency of each scenario in the dataset).

Fundamentally, NetTrustScore is an aggregated score that indicates how well placed a network’s confidence is under all possible answer scenarios. As such, it is particularly useful for quickly comparing the trustworthiness of various networks before conducting a finer analysis by means of trust densities and spectrums.
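A minimal sketch of that aggregation, assuming a frequency-weighted average of the per-scenario coefficients (the papers’ exact weighting may differ):

```python
import numpy as np

# A minimal sketch of NetTrustScore, assuming a frequency-weighted average of
# the per-scenario trust spectrum coefficients so that common scenarios count
# for more (the papers' exact weighting may differ).
def net_trust_score(spectrum, scenario_counts):
    names = list(spectrum)
    scores = np.array([spectrum[n] for n in names])
    weights = np.array([scenario_counts[n] for n in names], dtype=float)
    return float((scores * weights).sum() / weights.sum())

print(net_trust_score({"hammer": 0.82, "monitor": 0.42}, {"hammer": 120, "monitor": 80}))  # 0.66
```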

Continuing from the previous example, the NetTrustScores for our networks are as follows:

  • MobileNet-V1: 0.713
  • ShuffleNet-V2: 0.723
  • MobileNet-V2: 0.739
  • ResNet-50: 0.776

Unsurprisingly, ResNet-50 achieves the highest NetTrustScore amongst the four models. Its frontrunner status reflects the simple fact that it not only gets a significantly higher number of answers correct, but is also quite confident about the correctness of its answers (look again at the previous figure and observe how much light blue there is along the top of the plot).

The complex nature of trust, however, is reflected by a notable disparity: while ResNet-50 provides ~7% more correct answers than MobileNet-V2, its NetTrustScore is greater by only half that margin (~3.7%).

The reason for the gap is ResNet-50’s overconfidence when giving incorrect answers (~7.2% more often than MobileNet-V2). This unwarranted bravado reduces its trustworthiness in accordance with the aforementioned insights from social psychology.

Zooming Out: The Trust Matrix

While the metrics above are useful in characterizing trust for a particular scenario — identifying a hammer or monitor — they fail to provide an understanding of how trust manifests across all the question-answer combinations a model may encounter.

To this end, we introduce a final tool by way of the trust matrix: a visual grid that provides deeper, more granular insights into where trust breaks down for a neural network and its universe of inputs. The concept, which is inspired by simpler confusion matrices, is illustrated below for ResNet-50 and a subset of ImageNet samples.

Like a confusion matrix, the trust matrix charts a model’s answers against those of a human labeler. In a perfect world of complete agreement, a bright yellow (i.e., high trust) diagonal would bisect the matrix with no ‘off-diagonal’ areas of disagreement. The reality is much more complex, however, and the figure reveals numerous areas where trust breaks down.
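A bare-bones sketch of how such a grid could be assembled, assuming each cell holds the average question-answer trust for an (oracle label, predicted label) pair rather than a count; names and shapes are illustrative.

```python
import numpy as np

# A bare-bones trust matrix sketch: like a confusion matrix, but each cell
# holds the average question-answer trust for an (oracle label, predicted
# label) pair rather than a raw count.
def trust_matrix(oracle_labels, predicted_labels, trust_scores, num_classes):
    totals = np.zeros((num_classes, num_classes))
    counts = np.zeros((num_classes, num_classes))
    for o, p, t in zip(oracle_labels, predicted_labels, trust_scores):
        totals[o, p] += t
        counts[o, p] += 1
    # Average trust per cell; cells never encountered stay at zero.
    return np.divide(totals, counts, out=np.zeros_like(totals), where=counts > 0)
```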

First, the dark blue squares in the off-diagonal regions (see the ‘street sign — switch’ pair circled in red) represent areas of low-trust produced by overconfident yet incorrect answers. These areas are prime targets for deeper analysis and additional data collection.

Second, green and yellow squares that reflect areas of high trust are present both on and off the diagonal. Whereas the former is desirable (strong trust for correct answers), the latter (strong trust for incorrect answers) is more complicated as it pinpoints scenarios where the model doesn’t make the right decision, but isn’t clear as to what the answer should be.

The pink circle around the ‘laptop — water bottle’ pair is illustrative of such a scenario. In such a case, the designer could employ a supplementary system (such as a secondary classifier) to help remedy the confusion.

A final benefit of the trust matrix emerges when it is compared with another:

The figure above contains the trust matrix for ResNet-50 (left) alongside the trust matrix for MobileNet-V2 (right), and illustrates the differing trust characteristics of each network.

Specifically, despite MobileNet-V2 having more incorrect answers (i.e., more off-diagonal squares) and a lower overall NetTrustScore than ResNet-50 (recall, 0.739 vs. 0.776), it exhibits noticeably larger portions of high-trust areas (light greens and yellows) in the off-diagonal. In other words, MobileNet-V2 is more trustworthy for incorrect answer scenarios. This is a desirable trait insofar as the model’s incorrect predictions can be more easily flagged and rectified by another system (e.g., a human).

The Next Flight

Be it accountability or fairness, transparency or ethics, the scrutiny of AI will continue to deepen as the technology proliferates. And while explainability tells us how and why a neural network works, the metrics above tell us when and where it can be trusted.

Our hope is that these metrics provide the foundation for improved tools to evaluate the trustworthiness of neural networks, which will aid practitioners in producing, deploying, and certifying deep learning solutions that can be trusted to operate in mission-critical scenarios.

Returning to our inaugural thought experiment, it was the Wright brothers who said:

“It is possible to fly without motors, but not without knowledge and skill.”

If the fathers of aviation were alive today they might strike a similarly cautious tone about the technical breakthroughs ahead of us and the flights of fancy they will inspire.

We invite you to learn more about these metrics in our two academic papers here and here.

If you’d like to learn more about our work, check out our website or contact us.
