Practitioner’s guide to Trustworthy AI

Maleeha Koul
IBM Data Science in Practice
12 min read · Sep 19, 2022

Co-authored by Maleeha Koul, Data Scientist with IBM Data Science Elite, and Courtney Branson, Data Scientist with Expert Labs Trustworthy AI Practice.

What does it take to trust a decision made by a model? There is no one answer to this question; in fact, trustworthiness has many facets, or what we sometimes call 'pillars'. Some commonly discussed pillars are drift, quality, fairness, explainability, transparency, privacy, adversarial robustness, and more. Each of these pillars considers a very different aspect of trust in AI, and the pillars are not all created equal: their relative importance depends on the model and use case in question. Knowing how the data coming into the model has changed over time may be top of mind for one use case, whereas for another we may care more about whether the model is biased against certain demographic groups. It's important to spend time thinking about these different areas before any development starts, so that the proper guard rails are in place at every step of the AI build process. In this blog, we will look at the first four pillars, namely drift, quality, fairness, and explainability, in more detail.

What is trustworthy AI?

AI workflows with predictive models are incomplete, and practically non-operational for a business, if they cannot be 'trusted'. Model performance can be validated in multiple ways, with metrics like accuracy or error measured on test data and hold-out datasets. However, that alone is not enough for a model to be trusted in production. A model in business operations needs to meet KPIs and provide value to an end user while also addressing any concerns of bias. These model-monitoring concepts together make up the notion of Trustworthy AI. A model that cannot provide a complete picture of its predictions is often rendered useless eventually. For example, if a banker doesn't know 'why' a specific person was predicted as risky for issuing credit, they cannot make an informed decision, and the value of the AI model or workflow is almost negligible. The predictions become useful to the banker when a holistic picture of each prediction is provided: what is driving it, how it could be changed (if possible), and so on. Business users embedding AI models in their applications can leverage explainability to better understand which factors contributed to an AI outcome for a specific transaction. For example, if a customer is denied a loan and that decision is partly due to an AI model prediction, the business needs to deliver a clear explanation of the decision to the customer. Done well, this increases efficiency, coverage, and confidence in your models.

Getting started

To get started with Trustworthy AI in operation, this blog will use an example of credit risk, where we use a machine learning model to predict whether a customer poses a risk of defaulting on the loan they are requesting. For a bank, it's critical to understand the behavior and history of their customers to help make important decisions like issuing a loan or credit. However, it is very time intensive for bank employees (i.e., the loan issuers) to go through this entire process on their own. To help, the machine learning model predicts the customer's riskiness to aid in the bank employee's decision-making process for whether to approve the loan.

Trustworthy AI concepts can essentially be implemented using many Python packages or written from scratch. However, IBM Watson OpenScale and the AIF360 toolkit provide comprehensive metrics for Trustworthy AI that are easy to use, implement, and consume. In this article, we will include code snippets we used to set up our model monitoring within IBM OpenScale. It is important to note that the steps preceding the setup of the model monitors have not been included for brevity. If you would like more information on how to set up the early steps of the process, you can read more here: https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=openscale-configuring-watson
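For context, the snippets later in this article assume an OpenScale client and a subscription target along the lines of this minimal sketch; the API key, data mart ID, and subscription ID are placeholders that come out of the setup steps linked above:

```python
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson_openscale import APIClient
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import Target
from ibm_watson_openscale.supporting_classes.enums import TargetTypes

# Authenticate against IBM Cloud and create the OpenScale client.
authenticator = IAMAuthenticator(apikey="<YOUR_IBM_CLOUD_API_KEY>")
wos_client = APIClient(authenticator=authenticator)

# Identifiers created during the (omitted) earlier setup steps.
data_mart_id = "<YOUR_DATA_MART_ID>"        # the OpenScale data mart
subscription_id = "<YOUR_SUBSCRIPTION_ID>"  # subscription for the deployed credit risk model

# Every monitor configured below is attached to this model subscription.
target = Target(target_type=TargetTypes.SUBSCRIPTION, target_id=subscription_id)
```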

Deep Dive

DRIFT

The performance of a model can degrade over time as the data it scores in production diverges from the data it was trained on. This drift in model performance away from its ideal behavior at the time of training and deployment introduces risk in the reliability of the model's predictions. IBM Watson OpenScale provides the functionality to configure a drift monitor out of the box by pointing it at the deployed model and defining thresholds. For the credit risk model, the accuracy at the time of training was 83.60%, but a drop of 3.10% was observed later. A minor change in the metric may not be threatening, and each model can have its own threshold for violation. For the credit risk model, a drop in accuracy of 10% would be a violation, at which point the model would have to be retrained or investigated for the cause of the drift. Changes in the consistency of incoming data can also induce drift in model performance. So drift can be summarized as any change in model performance since training and deployment, as well as any change in the incoming data relative to the training and testing data. To maintain quality models in production, measuring and monitoring drift is imperative.

Figure: Setting up the drift monitor
Figure: Drift detection
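As a rough sketch of the drift monitor setup shown above, the monitor can be enabled through the OpenScale Python SDK along these lines; the 10% threshold mirrors the accuracy-drop tolerance discussed above, the other parameter values are illustrative, and wos_client, data_mart_id, and target come from the setup sketch earlier:

```python
# Enable the drift monitor on the model subscription defined earlier.
# drift_threshold is the accuracy drop we tolerate before an alert (10% here);
# min_samples is how many scoring records to accumulate before evaluating drift.
drift_parameters = {
    "min_samples": 100,
    "drift_threshold": 0.10,
    "train_drift_model": True,   # let OpenScale train its drift-detection model
    "enable_model_drift": True,  # drop in estimated accuracy
    "enable_data_drift": True,   # change in the incoming data distribution
}

drift_monitor = wos_client.monitor_instances.create(
    data_mart_id=data_mart_id,
    background_mode=False,
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.DRIFT.ID,
    target=target,
    parameters=drift_parameters,
).result
```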

QUALITY

Quality metrics describe the performance of a specific model on unseen data (data that the model was not trained on). Depending on the type of algorithm used and the problem statement, quality metrics like accuracy, AUC-ROC, mean squared error, etc., are used to judge how well the model generalizes and performs in a real-world situation or production environment. Watson OpenScale offers a diverse set of quality metrics to choose from based on the model type. The credit risk model is a binary classifier, so we primarily use AUC-ROC and accuracy to monitor its performance. The area under the ROC curve (AUC-ROC) is especially useful because it summarizes the trade-off between the model's TPR (True Positive Rate) and FPR (False Positive Rate). The model has an accuracy of 83.60% and an AUC-ROC score of 0.81. The monitor uses test data to update these metrics.

Figure: Setting up the quality monitor
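The quality monitor setup shown above follows the same pattern as the drift monitor; a minimal sketch, assuming the same client and target as before (the minimum feedback size is illustrative):

```python
# Enable the quality monitor; it evaluates metrics such as accuracy and
# AUC-ROC once enough labelled feedback records are available.
quality_parameters = {"min_feedback_data_size": 90}

quality_monitor = wos_client.monitor_instances.create(
    data_mart_id=data_mart_id,
    background_mode=False,
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.QUALITY.ID,
    target=target,
    parameters=quality_parameters,
).result
```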

FAIRNESS

Fairness metrics help us analyze whether our model and data give a disproportionate advantage to one section of the population compared to others. There are hundreds of metrics that help analyze fairness, so the first step is deciding which metric you need for your use case. This will depend on a variety of factors specific to your use case, such as whether your ground-truth data encodes societal biases, whether the predictions are punitive or assistive in nature, and whether you are interested in group bias, individual bias, or both. IBM has released an open-source toolkit called AIF360 that includes a large variety of fairness metrics, as well as guidance on how to decide which metric is appropriate for your use case (https://aif360.mybluemix.net/resources#guidance).

Since our data set uses historical data, it likely includes social biases. Because of this, we are going to use disparate impact to analyze the independence of the predictions from the protected classes. In a perfect world, the rate of loan approval would be about equal across all groups in our data set, which means we would expect our disparate impact to equal 1. This analyzes what we call group fairness: whether there is bias against a given group within our dataset and/or model.

Figure: Fairness calculation
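Conceptually, disparate impact is the rate at which the monitored group receives the favorable outcome divided by the rate at which the reference group receives it. A small, hand-rolled sketch of that ratio, independent of OpenScale:

```python
def disparate_impact(favorable_rate_monitored: float, favorable_rate_reference: float) -> float:
    """Ratio of favorable-outcome rates: monitored group / reference group.

    A value of 1.0 means both groups receive the favorable outcome at the
    same rate; values below the chosen threshold (e.g. 0.80) signal bias
    against the monitored group.
    """
    return favorable_rate_monitored / favorable_rate_reference

# Example using the numbers discussed later in this post:
# females receive 'No Risk' 74% of the time, males 72% of the time.
print(disparate_impact(0.74, 0.72))  # ~1.03, i.e. a fairness score of 103%
```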

The second choice we need to make is which protected classes we want to analyze fairness for. The most common examples of protected classes are age, gender, income level, and race. However, you can analyze fairness on any subset of the population that you think may be at risk, including less common groupings such as marital status, education level, location, etc. The first thought most practitioners have is to simply leave these protected classes out of the training data entirely to avoid having the model learn unwanted patterns in these fields. However, more and more research shows that leaving out these features is not enough to avoid bias, since the same information can be encoded in unexpected fields, such as zip codes. This is known as implicit bias (https://hdl.handle.net/1813/104229). While it may still be beneficial not to train on features that explicitly include protected classes, you need to keep track of them for each data point, because we still want to ensure fairness for each protected class whether it is used in the training data or not. To keep things simple for our example, we have chosen to look at only two protected classes: age and sex.

Next, we need to decide which subsections of our protected classes we are at risk of being biased against, which will be called the 'monitored group', and which subsection we think will be at an advantage, which will be called the 'reference group'. (Note: these groups have different names depending on which library you are using; they are sometimes called 'unprivileged'/'privileged', 'minority'/'majority', etc.) For our use case, we have decided that the age range of 44–67 will be the reference group and the age range of 18–43 will be the monitored group. We decided to monitor the younger age group because we believe they are more likely to be seen as risky investments, whether due to lack of experience, limited credit history, or any number of other factors. Furthermore, we split the monitored group into 13 buckets to get a more granular view of how the fairness scores change as we get closer to the reference age range. For sex, we have decided that 'female' will be the monitored group and 'male' will be the reference group, based on a variety of factors in the financial profiles of men vs. women, such as the gender pay gap.

There are two final decisions we need to make before we can start calculating metrics. The first is an acceptability threshold for our fairness metrics (i.e., how far our disparate impact ratios can deviate before we say there is bias). In our case, we have arbitrarily set these to 0.99 for age and 0.80 for sex. In a real-world scenario, however, you would want this to be a line-of-business decision in which regulatory requirements, industry best practices, research in the area, and maybe even external standards bodies are combined to come up with a threshold that is appropriate for your use case. The second is to decide which prediction coming out of our model is 'favorable' and which is 'unfavorable'. Since we are predicting credit risk for bank loan approval, we consider the favorable outcome of our model to be 'No Risk' and the unfavorable outcome to be 'Risk'.

The series of steps laid out above are what I consider the foundation of calculating these metrics. They don't require any actual calculations to be performed; instead, you focus on thinking through the problem entirely, usually with the help of domain experts who are closely involved in the problem area. Once you have these definitions, you can simply plug and chug to get your results. By taking the time to define these things first and really thinking through the problem, we are able to move to the calculations more confidently. It also allows the data scientists to get buy-in and understanding from the business side before any code is written. It's important to note that this is an iterative process. If we see later on that our assumptions were wrong and, in fact, a different portion of the population seems to be at a disadvantage, these definitions can be switched and the metrics recalculated. We must continue to validate the choices we make here to ensure fairness is maintained as time passes and data drifts. Now that all of these decisions have been made, and lots of thought put in, we can begin to calculate fairness for our dataset and model.

Figure: Setting up the fairness monitor
Figure: Defining attributes for monitoring fairness
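Here is a sketch of how the decisions above (monitored vs. reference groups, thresholds, and favorable vs. unfavorable outcomes) map onto the OpenScale fairness monitor configuration; the feature names 'Sex' and 'Age' are assumed to match the training data schema, and the finer 13-bucket age split is omitted for brevity:

```python
# Fairness monitor configuration reflecting the decisions made above.
fairness_parameters = {
    "features": [
        {   # reference ("majority") vs. monitored ("minority") groups
            "feature": "Sex",
            "majority": ["male"],
            "minority": ["female"],
            "threshold": 0.80,   # alert if the fairness score drops below 80%
        },
        {
            "feature": "Age",
            "majority": [[44, 67]],
            "minority": [[18, 43]],
            "threshold": 0.99,
        },
    ],
    "favourable_class": ["No Risk"],
    "unfavourable_class": ["Risk"],
    "min_records": 100,  # minimum scoring records before fairness is evaluated
}

fairness_monitor = wos_client.monitor_instances.create(
    data_mart_id=data_mart_id,
    background_mode=False,
    monitor_definition_id=wos_client.monitor_definitions.MONITORS.FAIRNESS.ID,
    target=target,
    parameters=fairness_parameters,
).result
```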
Figure: Fairness over time

There are two main ways that OpenScale allows you to view the fairness statistics coming in. The first is in a time graph as shown above. This allows you to see how the statistic is changing over time for your model. The red line at the bottom shows our threshold set at 80%. Since our calculated values are always above our threshold, we can say that our model has been acting in a non-biased manner with regard to the protected class of ‘sex’.

You can get additional information on how the metric was calculated by clicking on one of those data points on the time graph. Here you can see the measured values for both the monitored and reference groups. Females were receiving a ‘No Risk’ label 74% of the time and males were receiving it 72% of the time. Plugging that into our disparate impact formula from earlier, we get a disparate impact score of 1.03 and therefore a fairness score of 103%. If our disparate impact score was calculated to be below our threshold of 80%, then we would consider our model to be biased. In this case, you would want to perform bias mitigation.

EXPLAINABILITY

Model explainability comes up as part of opening up the black box of a model. Machine learning models are often difficult for an end user or a business consumer to interpret.

From the credit risk point of view, a bare prediction is not very useful to an end user. An auditor would want to know the 'why' before making any decisions based on the model classifying a given customer as risky. Watson OpenScale provides the ability to generate explanations for individual predictions using LIME combined with a proprietary algorithm; SHAP explanations are also available. Looking at the explanations for the credit risk model, the auditor can understand why customer 1 is classified as risky while customer 2 is not. The relative effect of each feature (predictor) also builds trust in the model and confirms that it exhibits the baseline intuitive behavior expected by the SMEs for the use case.

Yet another type of explanation, called a contrastive explanation, is additionally useful in understanding model predictions. It depicts the possible changes in the data that could cause a different outcome than the one predicted by the model. In the credit risk model, as shown in the diagram, the model predicted the outcome as Risk with a confidence of 67.58%. The contrastive explanation shows how the current set of values would have to change in order to produce a No Risk outcome. This is especially helpful to an auditor who wants to understand what could move a consumer from risk to no risk. The auditor can now make informed decisions and recommendations based on the detailed insights provided by these explanations.
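As a rough sketch, LIME and contrastive explanations for an individual transaction can be requested through the SDK along these lines; the scoring ID is a placeholder, and the exact shape of the response object may differ slightly across SDK versions:

```python
# Request explanations for a single scored transaction; scoring_id is the
# transaction ID of the prediction the auditor wants explained.
scoring_id = "<SCORING_TRANSACTION_ID>"

explanation_task = wos_client.monitor_instances.explanation_tasks(
    scoring_ids=[scoring_id],
    explanation_types=["lime", "contrastive"],
).result

# Explanations are computed asynchronously; fetch the finished result by task ID.
task_id = explanation_task.metadata.explanation_task_ids[0]
explanation = wos_client.monitor_instances.get_explanation_tasks(
    explanation_task_id=task_id
).result
```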

CUSTOM METRICS

For each of the pillars covered in this article so far, we have chosen and analyzed the trustworthiness of our model using algorithms and statistics that are available out of the box with Watson OpenScale. However, as we've hinted throughout the article, there are hundreds of other algorithms and statistics that could be used, whether from open-source toolkits such as AIF360, business KPIs, or proprietary algorithms that you have developed yourself. Watson OpenScale allows these metrics to be sent to the platform and displayed just like any of the other monitors we've discussed here. For example, if you wanted a fairness metric that can analyze intersectional fairness, such as smooth empirical differential fairness, you could set up a custom monitor and calculate it along with the others (https://www.ibm.com/docs/en/cloud-paks/cp-data/4.5.x?topic=monitoring-creating-custom-monitors-metrics). No matter what, you need to be thinking about the needs of the model, its users, and the people it will affect in order to ensure you have the correct tools and monitoring in place to make your model worthy of their trust.
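As a hedged sketch, a custom monitor definition can be registered through the SDK roughly as follows; the metric name and tag here are hypothetical, and the metric values you compute yourself (for example with AIF360) still need to be published to the resulting monitor instance as described in the linked documentation:

```python
from ibm_watson_openscale.base_classes.watson_open_scale_v2 import (
    MetricThreshold,
    MonitorMetricRequest,
    MonitorTagRequest,
)

# Hypothetical custom metric: an intersectional fairness score that we compute
# ourselves and then push into OpenScale alongside the built-in monitors.
custom_metrics = [
    MonitorMetricRequest(
        name="smoothed_empirical_differential_fairness",
        thresholds=[MetricThreshold(type="lower_limit", default=0.8)],
    )
]
custom_tags = [
    MonitorTagRequest(name="region", description="Region the metric was computed for")
]

custom_monitor = wos_client.monitor_definitions.add(
    name="intersectional fairness",
    metrics=custom_metrics,
    tags=custom_tags,
    background_mode=False,
).result
```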

Try configuring trustworthy AI for your model yourself for free on IBM Cloud.
