Good Data Science — Interpreting the Black Box

Illustration by Hanne Morstad

In the previous article in this series about Democratized AI and the Black Box Problem, I proposed a checklist of questions that we should be able to answer with some degree of confidence in order to assess whether we have achieved Interpretability within a Machine Learning system. That is, given our hypothesis and the questions we are trying to answer: Is the model answering the right question? Can the data answer the question? Can the answer be explained?

In this post I will provide examples, argue for the relevance of these criteria, and show how they can guide us towards doing Good Data Science.

What are you looking at?

Neural Networks, and especially Deep Neural Networks, outperform almost any other Machine Learning technique for a number of tasks, because they are able to learn models that capture the high-dimensional, non-linear correlations contained in the data, represented in a parameter space with millions of parameters. In practice, decoupling and understanding these learned correlations is an intractable task. But we can to some extent visualize these correlations and gain some intuition about the inner workings of the network.

There are currently two main approaches to inspecting how and what a Neural Network has learned: Layer attribution and Feature visualization. The latter will be discussed in a follow-up post going deeper into Interpretability.

A two-class problem, retriever puppy and kitten, saliency map on the right. Credit: “Feature Visualization” by Olah et al., Distill, 2017.

Layer attribution, also called a saliency map, measures which parts of an image contributed the most to a model predicting a particular object class. In the figure, a classifier model has determined that there are two classes in the image: a retriever puppy and a kitten. The heat map indicates which superpixels belong to which class, and also the relative contribution, or importance, of each pixel in determining each object’s class.
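To make the idea concrete, here is a minimal sketch of occlusion sensitivity, a simple perturbation-based way of building a saliency map: blank out one pixel at a time and record how much the class score drops. This is a swapped-in technique, not the gradient-based method behind the figures. The `toy_score` function is a hypothetical stand-in for a real classifier, and the 3x3 image is made up for illustration.

```python
from typing import Callable, List

Image = List[List[float]]


def toy_score(image: Image) -> float:
    """Hypothetical stand-in for a classifier's class score.

    It only sums the brightness of the top-left 2x2 corner, so we know
    in advance which pixels should light up in the saliency map.
    """
    return sum(image[r][c] for r in range(2) for c in range(2))


def occlusion_saliency(
    image: Image, score: Callable[[Image], float], fill: float = 0.0
) -> Image:
    """Return, per pixel, the drop in score when that pixel is set to `fill`."""
    base = score(image)
    saliency: Image = []
    for r, row in enumerate(image):
        saliency_row = []
        for c in range(len(row)):
            occluded = [list(original_row) for original_row in image]  # copy
            occluded[r][c] = fill
            saliency_row.append(base - score(occluded))
        saliency.append(saliency_row)
    return saliency


# Made-up 3x3 "image"; only the top-left 2x2 block matters to toy_score.
image = [
    [1.0, 2.0, 0.5],
    [3.0, 4.0, 0.5],
    [0.5, 0.5, 0.5],
]
saliency = occlusion_saliency(image, toy_score)
for row in saliency:
    print(row)
```

Only the four pixels that the toy score actually reads receive a non-zero value. With a real network one would typically occlude patches rather than single pixels, and treat the resulting map as a hint about where the model is looking, not as proof.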

A sailboat and terrier, with their respective saliency maps below. Credit: K. Simonyan et al., arXiv:1312.6034v2

Similarly, in the next figure, we can see that the top of the sail was very influential in the classification of “sailboat”, with some contribution also from the texture surrounding the boat, which we can identify as choppy sea. The terrier example on the right is actually quite impressive: even though the dog is photographed against a background of similar colour tones, the model has learned that colour itself is not an important feature for determining the object class. (This article provides examples of how treating colour as an important feature can lead to racist and discriminatory Machine Learning systems.)

Q: Is the model answering the right question?

A practical use case for Layer Attribution is to check whether the classification model is answering the right question. A famous example is the Wolf v Husky classifier, where a model has been trained to answer whether the animal in an image is a husky or a wolf. Seemingly, the performance is quite good and the model gets most images right. That is, until one inspects where in the image the model is actually looking. It turns out that during prediction the model is looking at the background and not paying much attention to the animal itself. In other words, the model is actually answering the question “Is there snow in the background?”.

A saliency map of where in the image the model is looking when asked whether the animal is a wolf or a husky. Credit: “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, M.T. Ribeiro et al.

The reason for this is that during training, the network was presented mostly with images of wolves with snow in the background, while the huskies most frequently appeared in front of green vegetation. So the answers “Yes/No” turn out to be utterly meaningless, since the question actually being answered is “Is there snow in the background?” and not “Is this a wolf?”. Looking to the checklist of Interpretability, this is a prime example of failing to ask the question correctly and thus leaving it up to the Neural Network to figure out which task it is supposed to perform.

By definition, the purpose of a classifier model is to discriminate based on the features, or properties, of the data it is presented with. Here the snow offered a path of least resistance, as it was the most dominant feature available for the model to learn to discriminate the object classes by.
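This shortcut can be reproduced with a toy example. The sketch below is an illustration under loud assumptions, not the model from the paper: each image is reduced to two made-up binary features, `snow` and `wolf_like_build`, and the “classifier” is a one-feature rule that keeps whichever feature scores best on the training set. All data is invented.

```python
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, int], str]  # (features, label)


def make(label: str, snow: int, builds: List[int]) -> List[Example]:
    """Build invented examples that share a background but vary in build."""
    return [({"snow": snow, "wolf_like_build": b}, label) for b in builds]


def accuracy(feature: str, data: List[Example]) -> float:
    """Accuracy of the rule: predict 'wolf' exactly when feature == 1."""
    hits = sum((x[feature] == 1) == (label == "wolf") for x, label in data)
    return hits / len(data)


def fit_stump(data: List[Example], features: List[str]) -> str:
    """Pick the single feature with the highest training accuracy."""
    return max(features, key=lambda f: accuracy(f, data))


# Training set: every wolf is in snow, every husky is on grass.
# The animal's build is informative too, but only 80% of the time.
train = make("wolf", 1, [1, 1, 1, 1, 0]) + make("husky", 0, [0, 0, 0, 0, 1])

# Shifted set: same animals, backgrounds swapped.
shifted = make("wolf", 0, [1, 1, 1, 1, 0]) + make("husky", 1, [0, 0, 0, 0, 1])

chosen = fit_stump(train, ["snow", "wolf_like_build"])
print("chosen feature:", chosen)
print("train accuracy:", accuracy(chosen, train))
print("shifted accuracy:", accuracy(chosen, shifted))
print("build on shifted:", accuracy("wolf_like_build", shifted))
```

The rule latches onto `snow` because it separates the training set perfectly, and then fails completely once the backgrounds are swapped, while the less dominant but genuinely relevant feature would have held up.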

Most progress with tools for Interpretability has been made with Neural Networks in Computer Vision, because the data is visual and hence these visual techniques are fairly low-hanging fruit in terms of gaining intuition about a Network’s inner workings. As Machine Learning becomes more and more prevalent in Decision Making systems based on non-visual data, like unstructured text or time-series data, or a combination of both, the need for an expanded toolbox becomes evident.

Q: Can the data answer the question?

Predictably, there is intense interest in leveraging Machine Learning to try to beat the stock market by forecasting whether the price of a stock will go up or down in the short term. This type of Regression analysis on time-series data is a task that Machine Learning is well suited for, and it is fairly straightforward to train a Neural Network that learns to model historical stock price fluctuations and is able to predict the next time step with reasonable accuracy. However, the transition from a well-controlled testbed to the incredibly intricate and complex system of uncountable and unknowable variables comprising the global trade markets is about as smooth as a gravel path.

A Machine Learning model cannot learn something it was never shown during training and it cannot know what it has never seen before.

The model will fail when it encounters a never-before-seen occurrence in the data stream, in the sense that, while the model will always output an answer, the answer itself is nonsensical because the question is ill-posed and lacks context.
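A small sketch of that behaviour, under stated assumptions: the price series is invented, the “model” is an ordinary least-squares line that predicts the next price from the current one (far simpler than a Neural Network), and `in_training_range` is a hypothetical and deliberately crude guard. It can only notice that an input lies numerically outside what was seen during training; it says nothing about missing context.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented "calm market" prices; not real data.
prices = [100.0, 101.0, 102.0, 101.0, 103.0, 104.0, 103.0, 105.0]
xs, ys = prices[:-1], prices[1:]  # predict the next price from the current one
slope, intercept = fit_line(xs, ys)


def predict(price: float) -> float:
    """The model always returns a number, whatever the input."""
    return slope * price + intercept


def in_training_range(price: float, margin: float = 0.05) -> bool:
    """Crude check: is the input within the range seen during training?"""
    low, high = min(xs), max(xs)
    slack = margin * (high - low)
    return low - slack <= price <= high + slack


# 102.0 resembles the training data; 60.0 mimics a sudden crash.
for price in (102.0, 60.0):
    print(price, round(predict(price), 2), in_training_range(price))
```

Both inputs produce a forecast, and nothing in the forecast itself reveals that the second one is an extrapolation far beyond anything the model was shown.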

Specifically, the model fails because in order for a Neural Network to be sufficiently robust to handle unexpected fluctuations in the stock market, it has to learn basic human behavior and psychology. The model needs to be able to account for how people groupthink and react to geopolitical events, natural disasters and other world-encompassing calamities that perturb the established and stable behavioral patterns in stock trades.

Ideally the model should also be able to read and comprehend press releases regarding staff changes in companies, or to detect mentions in the media of corruption scandals or future regulations that will affect earning potential. The problem of semantic analysis in Natural Language Processing is incredibly difficult and still unsolved, and until it is solved, Machine Learning models cannot reliably distinguish between “good news” and “bad news” (or even “fake” news).

It is also to be expected that when word gets out that some actors are buying and selling on the advice of an algorithm, there will be a positive feedback loop where herds of people will want to get in on the same perceived foolproof trade. Situations may arise where there is no incentive to sell a stock because, after all, the AI stock oracle said to buy it, not to sell.

All of these factors provide the context of a stock’s valuation from moment to moment. The Black Monday crash of ’87 demonstrates how a crash can happen in a short span of time, with multiple players reacting to incomplete information from a variety of sources, such as policymakers, governmental agencies and the good ole rumour mill.

This timeline of the Black Monday crash of October 19, 1987, relating the sequence of events to changes in the S&P 500 index, illustrates that numbers alone are not enough to understand fluctuations in the stock market.

So, let’s check in with the second criterion in the checklist, “Can the data answer the question?”. It is fairly straightforward to construct a dataset from the limited quantitative data regarding a specific stock, such as prices, performance and budgets, but this data will not give a full picture of the complex, hidden variables that affect how and why people decide to bet on a stock or not. And if the answer is not contained in your data, your model will not have the prerequisites for answering your question.

There is currently no generic tool in the Interpretability toolbox to check the veracity and suitability of data, except our own critical thinking and a healthy dose of skepticism.

Q: Can the answer be explained?

Other, perhaps more noble, tasks for a Machine Learning system can be found within the healthcare technology industry, with use cases ranging from detecting anomalies in ECG data, abnormal blood values and other measurable quantities in the human body, to classifying objects in medical imaging such as tumors and fractures.

These are all excellent Machine Learning problems. But when one attempts to use Machine Learning for automatic diagnosis and treatment suggestions based on a fairly limited set of data points pulled from a patient’s medical chart, things start getting hairy fast. The hairiness does not stem from a lack of underlying causation and correlation between groups of patients, diseases, symptoms, treatment regimens and so on, but rather from the question of whether or not the data is sufficient.

As with the stock valuation example, we need to ask if the data contains all possible causes leading to the observed symptoms. Or are there important factors, unobserved variables, that are not quantified within the dataset? Or as with the “Wolf v Husky” example, is there a risk that the model has actually been trained to answer a different question than the one we believe we are asking it?

Currently, Machine Learning for use in medical diagnosis lacks transparency and verifiable results. ALEX HOGAN, IKE SWETLITZ/STAT

Imagine that a specific disease is being treated at one particular hospital, with its in-house group of medical experts, serving a probably fairly homogeneous demographic in one specific area of town. How can one guarantee that the model doesn’t actually classify the socioeconomic status of the patients rather than their medical condition? That the model doesn’t instead learn that heart attacks are only suffered by wealthy, white, middle-aged men in management positions while training for their next triathlon, because that is the typical group of patients served by this particular treatment facility?

What happens if a company tries to sell a Treatment Recommender model like this to a clinic in a sub-Saharan country, where the majority of the patients are women with health issues related to being both female and poor, something that the model has never seen before? To check in with our list again: Can the physician trust the diagnosis and the treatment recommended by the Recommender model in this case?

The short answer is no, because she cannot open up the Black Box and explain how the model arrived at these recommendations. Choosing from a list of suggested treatment regimens sorted by how much “confidence” the model has, ranging from 0 to 100%, becomes an inane exercise without knowledge of the contribution and importance placed on the various input features, like symptoms, vital signs, gender, age, epicrisis, medical history and so on.
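To illustrate the difference between a bare confidence score and a per-feature breakdown, here is a minimal sketch using a logistic model, where each feature’s contribution to the score can be read off directly as weight times value. Everything in it is hypothetical: the feature names, the weights and the patient values are invented for illustration and have no clinical meaning. This clean decomposition holds only because the model is linear; a Deep Neural Network offers no such direct reading, which is exactly the problem.

```python
import math
from typing import Dict

# Hypothetical weights, invented for illustration. NOT clinical values.
WEIGHTS: Dict[str, float] = {
    "age_scaled": 0.3,
    "vital_sign_scaled": 0.8,
    "symptom_present": 1.5,
}
BIAS = -3.0


def logit(patient: Dict[str, float]) -> float:
    """Linear score before it is squashed into a probability."""
    return BIAS + sum(WEIGHTS[name] * patient[name] for name in WEIGHTS)


def confidence(patient: Dict[str, float]) -> float:
    """The single number a Black Box would show: a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit(patient)))


def contributions(patient: Dict[str, float]) -> Dict[str, float]:
    """Per-feature share of the linear score (weight times feature value)."""
    return {name: WEIGHTS[name] * patient[name] for name in WEIGHTS}


# Invented patient record.
patient = {"age_scaled": 1.0, "vital_sign_scaled": 0.5, "symptom_present": 1.0}

contribs = contributions(patient)
ranked = sorted(contribs.items(), key=lambda item: abs(item[1]), reverse=True)

print(f"confidence: {confidence(patient):.2f}")
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

With the breakdown in hand, a physician could at least see which inputs drove the score and challenge them; with the confidence number alone, she cannot.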

This case might sound like hyperbole, but both the concerns and the dangers are real, because Machine Learning for use in medical diagnosis today lacks both transparency and proven safety and efficacy.

First, do no harm

A significant part of working as a Data Scientist comes down to expectation management and giving tempered advice on whether a Machine Learning system is appropriate for a given task, and also on the proper and ethical use of data. Yes, Machine Learning does work! And really, really well, but with certain limitations. These limitations can be difficult to pinpoint precisely, as hopefully illustrated by the examples given in this article. But being aware of limitations and able to identify them is key to doing Good Data Science, or at least to avoiding doing harm.

In the end, we have to rely on critical thinking and follow the precautionary principle: Avoid using technology that is not fully understood in Decision Making systems where the stakes are high. While it is possible to train a model to achieve high accuracy for a specific and narrow task, why and how a model arrives at a result is currently not really knowable, due to the lack of standardized tools and approaches for interpretation and explanation. This is especially true for non-visual tasks.

What we can be certain of, however, is the maxim that if we put garbage in, we are going to get garbage out. If the data is poor, whether due to bad quality or low quantity, or if it cannot even corroborate the hypothesis in the first place, then we should have very low trust in the results and predictions made by the model that was trained on this data.