
Ensuring Trustworthy AI through AI Testing

Diptikalyan Saha
IBM Data Science in Practice
11 min read · Jul 21, 2021


Many industries have centered their business development on AI-based innovation. However, trust in AI output is crucial for the broad adoption of AI systems, and a natural question arises: how do we ensure this trustworthiness? For traditional non-AI applications, industrial practice dedicates phases and personnel in the software development lifecycle to testing for reliability. In this post, I discuss the importance and challenges of AI model testing, along with some of IBM Research’s efforts toward ensuring AI trustworthiness.

Types of Testing in the AI Lifecycle

At a high level, the Data and AI Lifecycle contains three phases:

  • The first phase is the Data Lifecycle where data is pre-processed and made ready for building an AI model.
  • The second phase is the model building phase where data scientists try to produce the best model.
  • The third phase is the post-model building phase, where the model is validated from a business perspective before it is deployed, monitored, and retrained.
Figure: The three stages of AI model development. The first stage, data preparation, includes data selection, ingestion, fusion, quality, and cleaning. The second stage, model building, includes feature engineering, model building, model testing, and selection. The third stage, model deployment, includes risk assessment, deployment, monitoring, analysis and recommendation, and repair.

Testing occurs in all the above phases. In the first phase, a data scientist tests the input data for quality issues and cleans the data. In the second phase, the data scientist uses the validation data to iteratively strengthen the model, and then one model is selected from multiple models based on their performance on the hold-out data. These two activities together are typically termed model testing. In the third phase, in most regulated industries and businesses, an auditor or a risk manager further performs the model validation or risk assessment of the model. Once deployed, the model is continuously monitored with the payload data to check the runtime performance. While data quality checking is incredibly important in AI development, in this article, I focus on the testing of AI models performed in the second and third phases.

Even though all these steps essentially perform ‘testing’, each step is unique in its importance and challenges. The first step of model testing requires the data scientist to debug failures on the validation data and repair the model, either by tuning hyperparameters or by changing the data. Testing with the hold-out data does not require debugging but needs comprehensive test data to compare multiple models. The risk assessment or model validation phase independently tests the model with new test cases and creates a report on the overall model’s performance. This phase has black-box access to the model, compared to the white-box access in the previous steps. This step needs comprehensive testing metrics for the whole model to decide on deployment readiness rather than debugging each failure. It is in this step that models are potentially checked for metrics related to business KPIs. The monitoring step does not require test data, as the payload acts as test input. However, this phase needs to determine whether poor model performance is due to distribution drift in the payload data or to insufficient prior testing, and it informs the data scientist on how to address any issue.

Performing testing across multiple modalities such as tabular, time-series, NLP, image, and speech-to-text is a daunting challenge for the field. Toward this goal, the Reliable AI team at IBM Research India has made significant progress in bringing all these testing techniques together in the AITest tool, which is available as part of the Ignite Quality Platform [7], an offering of Global Business Services. Interested readers can refer to our demonstration paper from ICSE ’21 [4].

There are many dimensions of AI testing, namely test properties and metrics, test dataset preparation, debugging, and repairing. In this article, I discuss the properties and metrics dimension and leave the other topics for follow-up posts.

Properties

Testing properties of AI models are the primary ingredients of AI testing; they are also referred to as metrics or performance indicators. They differ greatly in nature from traditional software testing properties such as null pointer exceptions, assertion failures (safety properties), and concurrency issues. For AI, most testing properties relate to the dimensions of trustworthiness. The pillars of Trustworthy AI include fairness, robustness, and explainability [2]. Additionally, we test for the generalizability of the model, i.e. its effectiveness on unseen samples, and for metrics related to business KPIs.

Several metrics can be used to measure each testing property. For example, generalizability is measured with metrics such as precision, recall, and accuracy. Below, I describe a few properties related to Trustworthy AI.

Group Fairness

We are typically concerned with two types of fairness in AI: group fairness and individual fairness. Disparate impact [8] is a well-known metric for group fairness that compares the proportion of individuals that receive a positive outcome for privileged and underprivileged groups.

As an example, let’s look at a case comparing the numbers of men and women who are approved when applying for a bank loan. Men in this example are the privileged group and women are the underprivileged group. Say that we have 1000 samples, with 700 men and 300 women who have applied for a loan. Out of the 700 men who applied, 300 have their loan approved, and out of the 300 women, 100 have their loan approved.

$$DI = \frac{S_u / A_u}{S_p / A_p}$$

In this post we simplify the explanation of disparate impact somewhat; for a fuller discussion, see [8]. Disparate impact, labeled DI in the equation above, is the proportion of positive outcomes in the underprivileged group (subscript u) divided by the proportion of positive outcomes in the privileged group (subscript p). Each proportion is obtained by dividing the number of successful outcomes, S, by the size of the entire group, A. In our current example, this becomes

$$DI = \frac{100/300}{300/700} \approx 0.78$$

If this value falls below a designated threshold (typically 0.8, based originally on EEOC standards of fairness in employment in the United States), it signifies discrimination against the underprivileged group, which in this case is women. As we can see from the image below, the ratio in this example, 0.78, is slightly below the 0.8 threshold. IBM AIF360 [1] provides a comprehensive list of metrics for measuring group fairness, including disparate impact.

Figure: Two 10×10 grids of outcomes. The grid on the left shows the desired ratio of 0.8 between the rate of favorable outcomes for the underprivileged group and that for the privileged group; the grid on the right shows the actual ratio of 0.78.
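A minimal sketch of this computation in Python follows; the function and values simply mirror the loan example above and are not AIF360’s API.

```python
def disparate_impact(successes_unpriv, total_unpriv, successes_priv, total_priv):
    """Ratio of positive-outcome rates: underprivileged over privileged."""
    rate_unpriv = successes_unpriv / total_unpriv
    rate_priv = successes_priv / total_priv
    return rate_unpriv / rate_priv

# Loan example from the text: 100 of 300 women vs. 300 of 700 men approved.
di = disparate_impact(100, 300, 300, 700)
print(f"DI = {di:.2f}")  # DI = 0.78
print("Potential discrimination" if di < 0.8 else "Within the 0.8 threshold")
```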

Individual Fairness

Individual fairness [13, 14] examines model outcomes for two similar individuals who differ only by one or more fairness/protected attributes such as gender, race, nationality, or other categories defined by law and regulation. To judge fairness, we examine how similar or different the model outputs are for the two individuals and, if they differ, determine what aspects of the data or the model led to the difference. Below we show an instance of gender-based individual discrimination by a tabular classifier (dataset: Adult Income, 8 columns; model: logistic regression). These two samples from the dataset differ only in the gender attribute, yet they produce different income predictions.

Below is an example drawn from a sentiment classifier. The two texts are identical except for the racial group used as a modifier of a noun phrase, shown in all capital letters below. This single switch changes the polarity of the model’s output, likely due to the values associated with the racial groups in the underlying language model.

In this final example of individual fairness, let’s examine input and output from a Speech-to-Text converter. In the first example, where the speaker has an Indian English accent, the model’s output matches the spoken words. In the second example, however, where the speaker has a French accent, the model’s output does not match what the speaker says. Many speech-to-text models have problems with certain accents in English and other languages as well (see [9], [10]). As an accent typically reflects national origin, another protected category, this particular model shows bias against national origin.

> MODEL OUTPUT: “I am not sure what this is”

> MODEL OUTPUT: “I am not sure who it is that she’s”

We measure individual fairness using an error rate, i.e. the percentage of discriminatory pairs in the test suite. Individual fairness is judged pairwise between individuals, and an output counts as an error when it differs because of demographic differences in the input. Examples of such errors are a label flip, a large variation in label confidence, or a large word error rate, as seen in the speech-to-text example above.
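To make the error-rate idea concrete, here is a simplified sketch of how one could count discriminatory pairs for a tabular classifier by flipping a protected attribute; the function, column names, and classifier interface are assumptions made for illustration, not AITest’s implementation.

```python
import pandas as pd

def individual_fairness_error_rate(model, X: pd.DataFrame,
                                   protected_col: str, swap: dict) -> float:
    """Fraction of samples whose prediction flips when only the protected
    attribute is changed (e.g. {'Male': 'Female', 'Female': 'Male'})."""
    X_flipped = X.copy()
    X_flipped[protected_col] = X_flipped[protected_col].replace(swap)
    original = model.predict(X)
    flipped = model.predict(X_flipped)
    return float((original != flipped).mean())

# Example usage with a hypothetical trained classifier and Adult Income-style data:
# rate = individual_fairness_error_rate(clf, X_test, "gender",
#                                       {"Male": "Female", "Female": "Male"})
# print(f"{rate:.1%} of test pairs are discriminatory")
```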

Both forms of fairness, group and individual, are important properties to measure from both ethical and legal points of view. As more businesses in regulated industries adopt AI, the need for such trustworthiness metrics will only grow.

Robustness

Robustness is the property of Trustworthy AI concerned with whether the model is strong enough that small variations in the input do not affect its prediction. We measure this property using the same metric as individual discrimination. Both metrics essentially measure the model’s reaction to some change in the input: for individual discrimination, the change is in fairness attributes like gender, race, or ethnicity, whereas the change examined for robustness must be small by some measure.

Below is an example from a Speech-to-Text converter where you can see the truncated output for the same vocal input after adding restaurant noise to a perfectly working sample.

> MODEL OUTPUT: “I am not sure what this is”

> MODEL OUTPUT: “I am”

Below is a second example of a robustness failure test case. In this case, we show how an MNIST classifier, built with a common image dataset consisting of handwritten digits, fails to correctly classify an image of a handwritten digit when the input image undergoes an inverse color transformation.

Figure: An MNIST classifier correctly predicts a handwritten six written in white on a black background, but predicts zero for the same digit after an inverse color transformation (black writing on a white background).
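A minimal sketch of this kind of robustness check appears below, assuming a trained MNIST classifier with a scikit-learn-style `predict` method; the transformation and helper names are illustrative, not AITest internals.

```python
import numpy as np

def inverse_color(images: np.ndarray) -> np.ndarray:
    """Invert pixel intensities of 8-bit grayscale images so that
    white-on-black digits become black-on-white."""
    return 255 - images

def robustness_error_rate(model, images: np.ndarray) -> float:
    """Fraction of images whose predicted class changes under the transform."""
    flat = images.reshape(len(images), -1)
    original = model.predict(flat)
    transformed = model.predict(inverse_color(images).reshape(len(images), -1))
    return float(np.mean(original != transformed))

# Example usage with a hypothetical trained classifier `clf` and MNIST test images:
# print(f"{robustness_error_rate(clf, x_test):.1%} of predictions flip "
#       "under color inversion")
```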

Explainability

Explainable AI allows humans to comprehend and trust the results of an AI model. As AI has become more advanced and neural models have become the norm, models have become more ‘black-box’, i.e. difficult to interpret. In response, researchers have created several algorithms that allow humans to interpret the whole model (global explainability) or to explain an individual prediction (local explainability) [5]. For both local and global explainability, it is key to check whether such algorithms can faithfully explain a given model, since their effectiveness depends on the model’s complexity.

AITest uses two metrics to measure the global and local complexity of models. Our first metric, tree-simulatability, determines how well an ‘interpretable’ decision tree can simulate a given model [12]. The data scientist testing the model can configure acceptable interpretability characteristics of the decision tree, such as maximum path length, average path length, and the maximum number of decision nodes. Given a test dataset, this metric reports the fidelity of the interpretable model to the actual model as the percentage of test samples on which the two make the same decision. If AITest can find a decision tree (which may not be the optimal tree) within the given limits that produces the same label as the tested model for 80% of the test samples, then the tree-simulatability is 80. This metric essentially characterizes the global complexity of the model.
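The idea can be sketched with a surrogate decision tree fit to the tested model’s own predictions; the scikit-learn code below is an illustrative approximation of the metric, not AITest’s implementation, and the depth and leaf limits stand in for the configurable interpretability constraints.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_simulatability(model, X_test: np.ndarray,
                        max_depth: int = 5, max_leaf_nodes: int = 32) -> float:
    """Fit an interpretability-constrained decision tree to the model's
    predictions and return the percentage of test samples on which the
    tree and the model agree."""
    y_model = model.predict(X_test)               # labels from the model under test
    surrogate = DecisionTreeClassifier(max_depth=max_depth,
                                       max_leaf_nodes=max_leaf_nodes)
    surrogate.fit(X_test, y_model)                # approximate the model, not the ground truth
    fidelity = np.mean(surrogate.predict(X_test) == y_model)
    return 100.0 * fidelity

# A score of 80 means the constrained tree reproduces the model's label
# on 80% of the test samples.
```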

The second metric relates to the stability, or robustness, of local explanations [6] generated by local explainers such as LIME [11], which explains a prediction in terms of feature importance. The stability property asks whether local explanations remain similar for similar inputs. In this case, AITest generates a test set and reports the percentage of test samples for which a given local explainer produces a stable explanation. To test the stability of an explanation, AITest perturbs the test sample and checks whether the resulting explanation remains the same as that of the original sample, under some conditions on feature importance.

In the example below, LIME produces two different explanations for inputs x and x′, which differ only slightly in the field balance. In the explanations, the feature importance of ‘age’ is reversed for these two very similar samples with the same prediction (dataset: bank marketing; model: decision tree). Even though the change is small, the reversal of polarity may confuse the end user. This is an example of an unstable local explanation.

Figure: Side-by-side feature importances for the model’s predictions on inputs x and x′. For x, the ‘no’ side has pdays (0.26), duration (0.21), contact (0.07), and age (0.03), while the ‘yes’ side has marital (0.04). For x′, the ‘no’ side has pdays (0.30), duration (0.22), default (0.11), and contact (0.05), while the ‘yes’ side has age (0.03).
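One possible way to sketch such a stability check with LIME’s tabular explainer is shown below; the top-k feature comparison, the perturbation, and the variable names are assumptions made for illustration, not AITest’s actual procedure.

```python
from lime.lime_tabular import LimeTabularExplainer

def top_k_feature_ids(explainer, predict_proba, sample, k=3, label=1):
    """Return the ids of the k most important features in a LIME explanation."""
    exp = explainer.explain_instance(sample, predict_proba,
                                     labels=(label,), num_features=k)
    return {feat_id for feat_id, _ in exp.as_map()[label]}

def explanation_is_stable(explainer, predict_proba, sample, perturbed, k=3):
    """Consider the explanation stable if the original and perturbed samples
    share the same set of top-k features."""
    return (top_k_feature_ids(explainer, predict_proba, sample, k) ==
            top_k_feature_ids(explainer, predict_proba, perturbed, k))

# Example usage with a hypothetical trained classifier and training data:
# explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
#                                  class_names=["no", "yes"])
# x_prime = x.copy()
# x_prime[balance_idx] += 10           # small perturbation of 'balance'
# print(explanation_is_stable(explainer, clf.predict_proba, x, x_prime))
```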

It is possible to classify all of the above properties into two types: one type that requires ground truth, such as the evaluation of accuracy, and a second type that does not, such as the error rate for individual discrimination. Properties of the second type are called metamorphic properties [3]. The Reliable AI team at IBM Research has spent a lot of time designing such metamorphic properties, because they can be evaluated on any synthetic test samples, beyond the training data split, without worrying about gathering ground truth.

Conclusion

In this article, I described the importance of AI testing in the Data and AI lifecycle and illustrated the key test properties across multiple modalities such as tabular, text, image, and speech-to-text. Examining these test properties, with metrics for each model you develop, is crucial for ensuring a trustworthy AI practice in your business, and they should be adopted and adapted as necessary.

For each property discussed in this article, the test results depend not only on the model but also on the test dataset. Testing with the right data is necessary to illuminate the relevant aspects of trust. In the next article, I will discuss the importance of creating synthetic test datasets for the effective testing of AI models.

References:

  1. IBM AIF360. Link.
  2. How IBM makes AI based on trust, fairness, and explainability. Link.
  3. Xie, X., Ho, J. W., Murphy, C., Kaiser, G., Xu, B., & Chen, T. Y. (2011). Testing and validating machine learning classifiers by metamorphic testing. Journal of Systems and Software, 84(4), 544–558. Link.
  4. Aniya Aggarwal, Samiulla Shaikh, Sandeep Hans, Swastik Haldar, Rema Ananthanarayanan, Diptikalyan Saha: Testing Framework for Black-box AI Models. ICSE 2021 Demonstration. Link.
  5. Interpretable Machine Learning Book. Link.
  6. Agarwal, S., Jabbari, S., Agarwal, C., Upadhyay, S., Wu, Z. S., & Lakkaraju, H. (2021). Towards the Unification and Robustness of Perturbation and Gradient-Based Explanations. Link.
  7. IBM Ignite Quality Platform Link.
  8. Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. (2015). Certifying and Removing Disparate Impact. Link.
  9. Claudia Lopez Lloreda. (2020). Speech Recognition Tech is Yet Another Example of Bias. Link.
  10. Archiki Prasad and Preethi Jyothi. (2020). How Accents Confound: Probing for Accent Information in End-to-End Speech Recognition Systems. Link.
  11. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. (2016). Why Should I Trust You? Explaining the Predictions of any Classifier. Link.
  12. Craven, M., & Shavlik, J. (1995). Extracting tree-structured representations of trained networks. Advances in neural information processing systems, 8, 24–30. Link.
  13. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., & Zemel, R. (2012, January). Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference (pp. 214–226). Link.
  14. Kusner, M. J., Loftus, J. R., Russell, C., & Silva, R. (2017). Counterfactual fairness. arXiv preprint arXiv:1703.06856. Link.

Author: Diptikalyan Saha is a Senior Technical Staff Member at IBM Research AI. He leads the Reliable AI team at IBM Research, Bangalore, and performs research at the intersection of artificial intelligence and software engineering. He holds a doctorate in Computer Science from Stony Brook University.
