Responsible AI in action, locally and on Azure Machine Learning

Mauro Minella
Published in Microsoft Azure
20 min read · Aug 25, 2020

Demonstrating full awareness and control of our models is the business card that creates trust in the artifacts we build and support today; as a result, better interpretability leads to higher adoption.

Securing fairness in our models means preventing them from “learning biases”. Imagine, for example, a model that should filter the best resumes for an IT technical role: if you train it with resumes of people who got hired and others who didn’t, it may learn biases related to sex that penalize women, simply because their resumes look different from those in the training set. Don’t try to solve this unfairness by removing the protected feature: a bias-blind approach isn’t viable, because the model may infer sex from other features! We should rather consider “bias-aware” models that measure and control the effect of protected attributes.

Tools to control fairness include the SHAP libraries, FairLearn and Microsoft InterpretML, all of which I’ll leverage in the remainder of this article to document my attempt to train a bias-aware credit scoring model from the Census dataset, which contains demographic information (including race and sex) and employment characteristics of roughly 32K individuals. Here are the steps we’ll follow together:

  • build three classification models leveraging Logistic Regression, SVM and CatBoost Classifier;
  • assess their fairness through the FairLearn dashboard that we use in a widget of our Jupyter notebook, and then upload it to an Azure Machine Learning Workspace;
  • use the InterpretML library to deep dive into un-mitigated models in order to understand if/why they don’t meet demographic parity;
  • finally, I’ll show you how to leverage a state-of-the-art mitigation algorithm, GridSearch from FairLearn, to reduce unfairness.

A little bit of theory

Regardless of fairness, what do we mean by a “good” model? People sometimes use words such as accuracy, precision, score and recall as synonyms, but these are different metrics that assess different aspects of performance. So let’s quickly clarify the difference with a simple example: we trained a binary classification model to distinguish images of chairs from non-chairs. Over a dataset of 100 chairs plus 25 non-chairs (animals, landscapes, apples, tables…), our model labels 90 images as chairs (85 of which are correct) and 35 as non-chairs (20 right, 15 wrong). As a result:

total_positive = true_positive + false_negative → 100

total_negative = true_negative + false_positive → 25

total_population = total_positive + total_negative → 125

recall (or sensitivity, or true positive rate) = true_positive/total_positive → 85%

specificity (or true negative rate) = true_negative/total_negative → 80%

accuracy = (true_positive + true_negative) / total_population → 84%

precision = true_positive/(true_positive + false_positive) → 94.44%

f1_score = 2 * (precision * recall) / (precision + recall) → 89.47%
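If you prefer to verify these numbers in code, here is a minimal check with scikit-learn, where y_true and y_pred are simply built to reproduce the counts of the example above:

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 100 chairs (label 1) and 25 non-chairs (label 0)
y_true = np.array([1] * 100 + [0] * 25)
# predictions: 85 true positives, 15 false negatives, 20 true negatives, 5 false positives
y_pred = np.array([1] * 85 + [0] * 15 + [0] * 20 + [1] * 5)

print(recall_score(y_true, y_pred))     # 0.85
print(accuracy_score(y_true, y_pred))   # 0.84
print(precision_score(y_true, y_pred))  # 0.9444...
print(f1_score(y_true, y_pred))         # 0.8947...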

What is fairness instead? There are many ways in which an AI system can behave unfairly; we may group them into two categories: harms of allocation (when a system grants or withholds opportunities or resources) and harms of quality of service (when a system doesn’t work as well for one group as it does for another):

Fairness harms

Where do you want to work?

For environment preparation you have a wide choice of solutions, either local or remote. If you opt for the local one, just create a Conda environment with Python 3.6+ containing the azureml-interpret and fairlearn libraries for the basic features, plus azureml-contrib-interpret and azureml.contrib.fairness for the experimental features (not fully supported yet), plus Jupyter and azureml-widgets to show FairLearn dashboards and SHAP charts within notebooks. Here are the four commands, which took about 10 minutes and 1.2 GB of disk space to complete on my Microsoft Surface:

conda create -n responsible_ai_env python=3.6
activate responsible_ai_env
pip install azureml.core azureml-sdk azureml-widgets fairlearn azureml.contrib.fairness interpret azureml-interpret azureml-contrib-interpret interpret-community flask flask_cors gevent jupyter ipykernel catboost==0.18.1
python -m ipykernel install --name responsible_ai_env --user

Otherwise, you may create a new Azure ML Compute Instance and run Jupyter or JupyterLab from there:

In this case, you will have to add the above packages and kernel manually from an SSH session.

Connect to an Azure ML Workspace

We don’t need it right away, but it’s good practice to connect to the ML workspace as a first step. I’m using interactive authentication, but you may use an Azure Service Principal as well:
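As a minimal sketch (assuming a config.json for the workspace has been downloaded next to the notebook), the connection may look like this:

from azureml.core import Workspace
from azureml.core.authentication import InteractiveLoginAuthentication

# interactive authentication; replace with ServicePrincipalAuthentication if preferred
auth = InteractiveLoginAuthentication()
ws = Workspace.from_config(auth=auth)
print(ws.name, ws.resource_group, ws.location, sep='\n')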

Data Ingestion

As mentioned earlier, I’ll use the census dataset of about 32K records, which I load through the shap library.
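A minimal sketch of the load, assuming the variable names X_raw and y that I’ll keep using below:

import shap

# the Adult census dataset shipped with the shap library
X_raw, y = shap.datasets.adult()
print(X_raw.shape)                   # (32561, 12): the ~32K records and 12 features
print(X_raw.groupby('Race').size())  # split/count by race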

Here are the 12 features and their split/count by race:

Data prep

All the features above look numeric; however, some of them are just numeric codes for categories. So, for more accurate results, I separate the categorical features from the “real” numeric ones:

Now we transform the labels into numbers, which machine learning algorithms prefer:

Here we create a dedicated Pandas dataframe with the two fields we want to protect, Sex and Race:
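Here is a rough sketch of these three preparation steps; the exact list of categorical columns is my assumption based on the Adult census schema:

import pandas as pd

# features that are really category codes rather than measures
categorical_features = ['Workclass', 'Marital Status', 'Occupation',
                        'Relationship', 'Race', 'Sex', 'Country']
numeric_features = [c for c in X_raw.columns if c not in categorical_features]

# labels to numbers: True/False becomes 1/0
y = pd.Series(y).astype(int)

# the two protected attributes, kept in a separate dataframe
A = X_raw[['Sex', 'Race']]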

Data split and feature enrichment

Separate the training and test sets, and make some IDs more meaningful:
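A minimal sketch of the split; the 80/20 proportion and the random seed are assumptions (a 20% split yields the 6513-row test set mentioned later):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test, A_train, A_test = train_test_split(
    X_raw, y, A, test_size=0.2, random_state=123, stratify=y)

# reset the indexes so rows can be matched across the different frames
X_train, X_test = X_train.reset_index(drop=True), X_test.reset_index(drop=True)
y_train, y_test = y_train.reset_index(drop=True), y_test.reset_index(drop=True)
A_train, A_test = A_train.reset_index(drop=True), A_test.reset_index(drop=True)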

Let’s do training!

One key benefit of fairness and interpretation tools is their ability to compare multiple models. For this reason, I’m going to build three different classification models with Logistic Regression, SVM and CatBoost:
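A sketch of the three classifiers (hyperparameters are illustrative; the SVM can take a while on the full training set):

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from catboost import CatBoostClassifier

lr_model = LogisticRegression(solver='liblinear').fit(X_train, y_train)
svm_model = SVC(gamma='auto', probability=True).fit(X_train, y_train)
cbc_model = CatBoostClassifier(verbose=0).fit(X_train, y_train)

models = {'census_lr': lr_model, 'census_svm': svm_model, 'census_catboost': cbc_model}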

Model registration

Registering the models on Azure, although optional, will allow us to access their FairLearn dashboard from there:
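A hedged sketch of the registration step, assuming joblib serialization and the model names used above:

import joblib
from azureml.core import Model

registered_ids = {}
for name, model in models.items():
    path = f'{name}.pkl'
    joblib.dump(model, path)
    aml_model = Model.register(workspace=ws, model_path=path, model_name=name)
    registered_ids[name] = aml_model.id   # used later to link the fairness dashboard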

Grouped and Un-Grouped Metrics

First, we leverage scikit-learn scoring methods to assess the three classifiers against the whole test dataset; we call these outcomes un-grouped metrics:

Then we leverage FairLearn to assess the same metrics, but this time each metric is split by the “protected features”; such metrics are called grouped metrics. In this example I split them by race:
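As a sketch, with a recent fairlearn release the grouped metrics can be computed with MetricFrame (older releases exposed group_summary instead):

from sklearn.metrics import accuracy_score, precision_score, recall_score
from fairlearn.metrics import MetricFrame

for name, model in models.items():
    y_pred = model.predict(X_test)
    # un-grouped metrics over the whole test set
    print(name, accuracy_score(y_test, y_pred),
          precision_score(y_test, y_pred), recall_score(y_test, y_pred))
    # the same accuracy metric, split by the protected feature Race
    grouped = MetricFrame(metrics=accuracy_score, y_true=y_test,
                          y_pred=y_pred, sensitive_features=A_test['Race'])
    print(grouped.by_group)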

FairLearn Dashboards

It’s now time to build a FairLearn dashboard. Once the object is created, we have two ways to read it: use a Jupyter Notebook widget or upload it to an Azure Machine Learning experiment. We’ll now implement both options.

FairLearn Dashboard within Notebook Widgets

This widget is currently supported on a Jupyter Notebook service running locally or on Azure ML Compute Instances. It doesn’t yet work in JupyterLab; we’re working to support it soon.

Here is the code to create and show the interactive dashboard:
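A sketch based on the fairlearn release current at the time of writing (the widget has since moved to the raiwidgets package as FairnessDashboard):

from fairlearn.widget import FairlearnDashboard

FairlearnDashboard(sensitive_features=A_test,
                   sensitive_feature_names=['Sex', 'Race'],
                   y_true=y_test,
                   y_pred={name: model.predict(X_test) for name, model in models.items()})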

FairLearn Dashboard on Azure ML Experiments

This feature, in public preview at the time of writing, allows running the dashboard on the Microsoft Azure Portal, which integrates this visualization within the Experiments blade of the new Machine Learning Studio interface. So we have to create an experiment, then create the dashboard and finally upload it. Let’s do it!

The first cell below is quite similar to the previous one used for the dashboard creation in the Jupyter notebook service. The _create_group_metric_set method has arguments similar to the Dashboard constructor, except that the sensitive features are passed as a dictionary (to ensure that names are available). We must also specify the type of prediction (binary_classification in this case) when calling this method:
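A sketch of that cell, assuming the predictions are keyed by the registered model ids collected earlier:

from fairlearn.metrics._group_metric_set import _create_group_metric_set

ys_pred = {registered_ids[name]: model.predict(X_test)
           for name, model in models.items()}
sf = {'Race': A_test.Race, 'Sex': A_test.Sex}

dash_dict = _create_group_metric_set(y_true=y_test,
                                     predictions=ys_pred,
                                     sensitive_features=sf,
                                     prediction_type='binary_classification')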

Next, I create an Azure ML experiment and then use the upload_dashboard_dictionary method of the azureml.contrib.fairness library to upload the dashboard to the “Fairness” section of the experiment:
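A sketch of the upload, with illustrative experiment and dashboard names:

from azureml.core import Experiment
from azureml.contrib.fairness import upload_dashboard_dictionary

experiment = Experiment(ws, 'responsible_ai_census')   # hypothetical experiment name
run = experiment.start_logging()
try:
    upload_id = upload_dashboard_dictionary(run, dash_dict,
                                            dashboard_name='Fairness insights of Census models',
                                            validate_model_ids=True)
finally:
    run.complete()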

Setting the validate_model_ids parameter to True in the previous code snippet means that the dashboard will be associated with the registered models, in addition to the experiment. Needless to say, the models have to be registered, which is the optional step we did earlier:

When you click the FairLearn dashboard link, it opens within the experiment it was associated with:

Let’s now go and see the dashboard!

FairLearn Dashboard Analysis

Both methods shown above (Notebook widget and Azure ML Experiment) produce the same dashboard shown below. Select “Race” as the protected feature:

The next chart shows the three models we trained as selectable points, and lets us choose the one we want to analyze. The x-axis represents accuracy, with higher being better. The y-axis represents disparity, with lower being better. Accuracy ranges from 84.7% to 87.2%, and the disparity shown here ranges from 14.9% to 18.6%. In this case the most accurate model and the lowest-disparity model coincide, with an accuracy of 87.2% and a disparity of 14.9%:

We’re lucky: the third model (based on CatBoost) reports the highest accuracy (=87.2%) and the lowest disparity at the same time. We choose this one!

This picture tells us that the overall accuracy of 87.2% changes when we calculate it for each value of the protected attribute “Race”: it is 12 points higher (98.1%) for “Amer-Indian-Eskimo” people than for “Asian-Pacific-Islander” people (86.1%), likely because of a bias that this model learnt from our limited dataset:

Another disparity appears in the predictions: 21.3% of individuals recorded as “White” got the loan approved, against 6.38% of those recorded as “Other”. In other words, white individuals have a selection rate more than three times higher:

This doesn’t necessarily identify an unfairness: the result may be justified by other key features of the dataset, like working hours per week or education number. We’ll clarify this in a minute.

What we actually need to ensure is that such disparity doesn’t depend solely on the protected attribute. Better still, I’m going to show how to calculate each attribute’s positive or negative contribution to the final decision.

Before digging into the SDK, it’s important to consider whether our model is a Glass Box or a Black Box.

InterpretML and SHAP: Glass Boxes

Using the classes and methods in the Microsoft InterpretML SDK, we can:

  • explain model prediction by generating feature importance values for the entire model and/or individual datapoints;
  • achieve model interpretability on real-world datasets at scale, during training and inference;
  • use an interactive visualization dashboard to discover patterns in data and explanations at training time.

The third model we trained, based on the CatBoost classifier, is actually transparent since it’s natively interpretable. In other words, its explanation is lossless: it exposes methods to extract each attribute’s weight for every element of our dataset, allowing us to precisely identify the contribution of each feature to the overall rating of that element.

Here we train a new cbc_model using the same training dataset and algorithm (CatBoost Classifier) which produced the best model above:

The get_feature_importance method returns the “SHAP values” containing the weight of each feature for each element of the dataset:

Out of the (6513x12) dataset we passed, the method returned a new array with one extra column that contains the same value (-2.29763) for all records; we capture this value as the “expected value” (or “base value”), then we remove this column from the SHAP values array, which becomes 6513x12, the same shape as the original test dataset:

Each row of this shap_values array contains the weights of each feature of the corresponding element.
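A sketch of these steps with the native CatBoost API (cbc_model is the classifier trained just above):

from catboost import Pool

# SHAP values straight from the glass box
raw_shap = cbc_model.get_feature_importance(Pool(X_test, label=y_test),
                                            type='ShapValues')
print(raw_shap.shape)               # (6513, 13): 12 features plus the extra column

expected_value = raw_shap[0, -1]    # the base value (~ -2.29763), identical on every row
shap_values = raw_shap[:, :-1]      # back to the (6513, 12) shape of X_test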

Let’s analyze the SHAP values in detail. Before we do it for single elements, let’s look at the high-level picture (Global Explanation): the summary_plot method of the shap library (included in the azureml-interpret package) summarizes and orders features by relevance. As we see, Relationship leads the choice, followed by Age and Education-Num; this doesn’t surprise us, does it?

Well, the previous chart is quite clear, but there is one piece of information it’s NOT telling us: which values, for each feature, are influencing predictions positively or negatively?

Here is where SHAP starts shining: the summary plot shows exactly this, with red meaning high values (relative to each feature’s range) and blue meaning low values. For example, the higher the Education or Occupation level or the Age, the better the credit opportunity.
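Both views come from the same method; a minimal sketch:

import shap

# ranked bar chart of global feature importance
shap.summary_plot(shap_values, X_test, plot_type='bar')
# beeswarm version: red = high feature value, blue = low feature value
shap.summary_plot(shap_values, X_test)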

Now let’s see how to analyze a single element.

Here is the rule: for each row, the arithmetic sum of the 12 feature contributions in the SHAP values array is added to the expected value; if the result is >0, the prediction is positive, otherwise this person gets the credit denied. Let’s double-check this using the first record (index=0). The indexes of the SHAP array match those of X_test; a positive SHAP value identifies a feature whose value is improving the positive rate: the age of 57 years improves the absolute rate by 0.79, while Education-Num=9 (relatively low in the 1-14 range) lowers the rate by 0.37. I put just a few arrows to avoid confusion, but we can see that the features sum to 1.9, which sounds good:

This last picture is quite clear, but the conclusion isn’t immediately visible, since it requires calculations and mappings. Once again, SHAP comes to help with the force plot chart, which makes this analysis a breeze by showing positive contributions in red and negative ones in blue, where the boundary between red and blue represents the total sum:

Just remember to run shap.initjs() before plotting the chart to prevent this error:
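A minimal sketch of the single-record force plot for index=0:

shap.initjs()   # loads the JavaScript needed to render the chart in the notebook
shap.force_plot(expected_value, shap_values[0, :], X_test.iloc[0, :])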

To prove the correctness of our interpretation, we manually raise the “blue” Education-Num feature from 9 to 12. In this case the force plot chart moves Education-Num from right to left (from blue to red); in other words, its contribution now helps raise the final rate, the total sum goes from -0.39 to +0.15 and the prediction becomes positive: credit is granted!

We can also look at multiple records at the same time: we just pass a SHAP array with multiple rows, and SHAP rotates the force plot chart 90° to accommodate all the values, whose output value is always the boundary between red and blue, as shown in the tooltip that reports 0.1482 in bold for the first element, matching what the previous chart told us (0.15). This is an interactive chart, so you may also play with the ordering options or the output value:
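The multi-record version is the same call with more rows; the 100-row slice here is just an arbitrary choice to keep the chart light:

shap.force_plot(expected_value, shap_values[:100, :], X_test.iloc[:100, :])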

Microsoft InterpretML also offers the Marginal analysis, which shows information like the Pearson correlation index measuring the linear correlation between two variables X and Y, where Y can be the label (loan granted/rejected) and X one of the features, like Age; the following chart confirms that Age is quite correlated with Y, and also gives us some data about the age distribution (674 individuals whose age is 30, 31, 32 or 33):

I said “can be the label” because we may also use the same Marginal analysis to show the correlation between individual features; the next box plot shows that men (Sex=1) in this training set cover a smaller set of occupations, concentrated in the higher part of the range. It’s not too clear, but it also shows in orange the occupation distribution, regardless of other specific features:
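A sketch of the marginal analysis with the interpret package (the Marginal explainer and the show helper):

from interpret import show
from interpret.data import Marginal

marginal = Marginal().explain_data(X_train, y_train, name='Census train data')
show(marginal)   # interactive chart: pick the feature (e.g. Age) from the dropdown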

InterpretML and SHAP: Black Boxes

The models we trained so far in this tutorial were pretty simple, which allowed us to ask them directly for their SHAP values.

For more complex models like neural networks, which internally produce most of the data they consume (hosted in tens or hundreds of hidden layers), we need to “build” the SHAP values ourselves. There are multiple techniques to achieve this, but at a high level the process is to create small “perturbations” of the features and check how much the label is impacted, either positively or negatively.

KernelExplainer is one implementation of this approach: it builds a weighted linear regression using your data, your predictions, and whatever function produces the predicted values. It computes the variable importance based on the SHAP values, with coefficients estimated through a local linear regression.

The two immediate drawbacks of these black boxes are that 1) we get an approximation of the SHAP values, rather than the precise values we got earlier from get_feature_importance, and 2) the computation takes a long time to run. It’s no coincidence that we get a warning if the dataset we pass to KernelExplainer is bigger than 100 records: those 100 records take about 6 minutes to analyze, while we could get 32K rows of SHAP values from the CatBoost classifier in a fraction of a second.

Anyway, this is the only choice right now and it works quite well. So let’s see how to achieve this!

To build confidence in this tool’s effectiveness, I’ll use KernelExplainer to try to achieve results very similar to the ones we got with the Glass Box. So, as a first step, I extract a subset (100 records) of our original dataset, for both training and test:

Then I initialize the KernelExplainer, passing that dataset and the cbc_model we trained earlier with the CatBoostClassifier. Now let’s extract the expected value but… wait! It’s very different from the glass-box one (which was -2.297, remember?):

Well, so what? Did I ever say we would get the same expected_value? I didn’t, actually, so please Mauro stay calm and complete the experiment before complaining. For sure, this means that if we want the same results, the SHAP values will have to be different. Let’s see.

I’m temporarily disabling the warnings, because one would be generated at each of the 100 iterations of the loop. After 108 seconds we have our SHAP values, whose shape is 100x12 as we’d expect, being the weight of each feature of each element of the reduced dataset. And they look different from the glass-box ones, which is neither bad news nor good news at this point:
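A sketch of the whole black-box path, assuming the 100-record subsets are simply the first rows of the training and test sets:

import warnings
import shap

X_train_reduced = X_train.iloc[:100]
X_test_reduced = X_test.iloc[:100]

# background data plus a prediction function are all KernelExplainer needs
explainer = shap.KernelExplainer(lambda x: cbc_model.predict_proba(x)[:, 1],
                                 X_train_reduced)
print(explainer.expected_value)     # different from the glass-box -2.297

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    black_shap_values = explainer.shap_values(X_test_reduced)  # the slow part
print(black_shap_values.shape)      # (100, 12)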

So a good idea now is to compare the glass-box and black-box summary plots. They look quite similar, don’t they? The dataset used for the black-box one is smaller, actually, but it’s important to notice that the first 4 features and the last 5 are the same, at the global level:

Good enough? Not really, and I want to provide you with stronger evidence. Here it is: we take the same record (index=0) that we’ve just used with the glass box, then we leverage KernelExplainer to calculate its feature importances with the black box. Here is the comparison:

Well, we knew that the base values and the scale are different, but the results are… pretty much identical. Super!

Model Interpretability in Azure Machine Learning

So far we saw how to explicitly engage a specific SHAP explainer to interpret model behaviour. Now we move one level up to see how Azure ML Interpret can automatically choose the best explainer for us.

The interpretability package of the Azure Machine Learning Python SDK allows us to explain the entire model behaviour or individual predictions, on our personal machine or on Azure, through an interactive visualization dashboard.

azureml-interpret uses the interpretability techniques developed in Interpret-Community, an open source python package for training interpretable models and helping to explain blackbox AI systems. Interpret-Community serves as the host for this SDK’s supported explainers, and currently supports 6 interpretability techniques, including 4 SHAP explainers like the KernelExplainer described above.

Besides such interpretability techniques, the SDK supports another SHAP-based explainer, called TabularExplainer, which leverages the 4 SHAP explainers but also offers significant feature and performance enhancements over the direct SHAP explainers, including summarization of the initialization dataset and sampling of the evaluation dataset. More details here.

The following diagram shows the current structure of supported explainers.

One great benefit of TabularExplainer is that, depending on the model, it uses one of the supported SHAP explainers, including those suited to glass boxes and black boxes. I’m now going to show this feature.

If you followed my instructions to build the Conda environment, it’s all set. Otherwise you may refer to this article to properly install and upgrade the single packages. After that, we can resume our example and use TabularExplainer to go beyond SHAP Kernel Explainer features used above.

TabularExplainer is exposed by the interpret.ext.blackbox module of the Microsoft interpret-community library:

Now we create a Global Explanation for the entire model behaviour. It’s important to notice that we can use the full X_test (pointed to by the arrow), which contains 6513 rows, rather than X_test_reduced as we did for KernelExplainer, because TabularExplainer automatically identifies the glass-box explainer exposed by the CatBoostClassifier algorithm:

Now that we have the global explanation object, we can extract the ranked feature values and names, which confirm the high-level results we achieved earlier:
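A sketch of the TabularExplainer flow (it ships with the interpret-community package installed by azureml-interpret; the class names passed are illustrative):

from interpret.ext.blackbox import TabularExplainer

tab_explainer = TabularExplainer(cbc_model, X_train,
                                 features=X_train.columns.tolist(),
                                 classes=['Rejected', 'Approved'])

global_explanation = tab_explainer.explain_global(X_test)
print(global_explanation.get_ranked_global_names())
print(global_explanation.get_ranked_global_values())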

Now we generate Local Explanations for the individual predictions of the whole X_test. Even in this case the command execution is almost instantaneous, thanks to the glass-box explainer identified by TabularExplainer, so we don’t need to use the X_test_reduced version of the dataset:

To prove the perfect match between TabularExplainer and the native explainer of CatBoost, here we analyze the same element (index=0) whose last prediction, after changing Education-Num from 9 to 12, was [0.46 0.54], remember? Then we extract the ranked importance names and values for this element:
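A sketch of the local explanation and of the ranked features for record 0; for a classifier, my understanding is that the first index of the ranked lists selects the class:

local_explanation = tab_explainer.explain_local(X_test)

ranked_names = local_explanation.get_ranked_local_names()
ranked_values = local_explanation.get_ranked_local_values()
print(ranked_names[1][0])    # features ranked for record 0, positive class
print(ranked_values[1][0])   # the corresponding importance values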

Now we draw the force plot and… voilà, we get exactly the same chart that we built earlier, where 0.15 is the overall rate!:

InterpretML Dashboards

Before moving to the last part of this article, focused on mitigation, I’ll comment on the InterpretML dashboards that we can build and run locally or upload to an Azure ML experiment.

Both types of dashboards leverage the TabularExplainer object. To allow you to restart from scratch, I’m creating this object again, this time bound to the Logistic Regression model.
I re-create the global explanation, this time using the reduced dataset, because TabularExplainer will treat the logistic regression model as a black box: in fact, logistic regression doesn’t expose a method like CatBoostClassifier’s get_feature_importance.
We’re now ready to generate the ExplanationDashboard. The following command creates one locally; I strongly suggest clicking on “Open in a new tab” to interact with the dashboard more comfortably.
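A sketch of these three steps; note that the exact import path and argument names of the dashboard have changed across releases (newer builds expose it as raiwidgets.ExplanationDashboard with dataset/true_y arguments):

from interpret.ext.blackbox import TabularExplainer
from interpret_community.widget import ExplanationDashboard

lr_explainer = TabularExplainer(lr_model, X_train,
                                features=X_train.columns.tolist(),
                                classes=['Rejected', 'Approved'])
# logistic regression is treated as a black box, so stick to the reduced dataset
lr_global_explanation = lr_explainer.explain_global(X_test_reduced)

ExplanationDashboard(lr_global_explanation, lr_model,
                     datasetX=X_test_reduced, trueY=y_test.iloc[:100])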

I’m now going to walk you through a couple of interesting examples, inspired by the great “Responsible ML” digital breakout session presented at Microsoft Build by my colleague Sarah Bird.

The first tab evaluates the performance of our model by exploring the distribution of the prediction values and the values of the model performance metrics. In this case, I’m further investigating the model by looking at a comparative analysis of its performance across different cohorts (or subgroups) of our dataset. The picture shows that women have a higher average rejection rate: the difference between men and women is indicated by the red arrow. It also shows that accuracy is better for the “women” cohort, while precision and recall are worse (I explained these metrics at the beginning of this article).
Now we see an important piece of evidence of unfairness within the “What-if” section: first of all, it clearly shows that all but one of these women are concentrated in the 50–100% rejection probability band (red rectangle), while just one woman (50 years old), pointed to by the green arrow, will get the credit accepted, since her rejection rate is lower than 50%. But now let’s concentrate on the “real datapoint 55” (red arrow), whose rejection probability is 76.5%… for now; in fact, I’m going to “perturb” it in a minute, as this interactive chart allows us to do.
So I now change its “Sex” feature from 0 (woman) to 1 (man). As soon as I do it, the new “Modified 55” lowers the rejection probability from 76.5% to 50.88%. This identifies a clear unfairness in our model, which we’ll try to tackle in the last paragraph of this article.
For both datapoint 55 (in red) and “Modified 55” (blue), we can also observe the impact of “Age” on the rejection probability: the higher the age, the lower the rejection probability (i.e. the higher the acceptance rate). In other words, getting a loan looks easier for older people.
We can also select multiple datapoints, like these three, and see which features are influencing the reject/accept decision most.

Uploading an explanation dashboard to Azure ML Experiments

I showed earlier how to upload a FairLearn dashboard to an Azure ML Experiment. Well, we can leverage the same experiment to upload an Explanation Dashboard too: just load the library, create the ExplanationClient object and upload the global explanation:
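A sketch of the upload, reusing the experiment created for the fairness dashboard (older SDKs exposed ExplanationClient from azureml.contrib.interpret instead of azureml.interpret):

from azureml.interpret import ExplanationClient

run = experiment.start_logging()
client = ExplanationClient.from_run(run)
client.upload_model_explanation(global_explanation,
                                comment='Global explanation of the CatBoost census model')
run.complete()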

Let’s now analyze a couple of such visuals.

Dataset Exploration displays an overview of the dataset along with prediction values. In this case, people who received the credit are mainly concentrated around the 40-hours/week range, and their education level is >=9. The color (orange for Approved, blue for Rejected) is the third dimension.
The same Data Exploration plot in this case shows distribution of approved and rejected credits by age and workclass; as we see, credits are mostly approved for people between 40 and 55 years old, whose workclass is 4 (full-time employee).
Global Importance helps us understand the underlying model’s overall behaviour; it aggregates the feature importance values of individual datapoints to show the model’s overall top K (configurable) important features. Apart from K, we can’t set other parameters.
Explanation Exploration demonstrates how a feature affects a change in the model’s prediction values, or in the probability of the prediction values. The distance from the X axis in this case tells us how much the Sex attribute affects credit rejection: the farther the individuals are above the line, the more they are penalized. The color instead tells us whether, regardless of the influence of Sex, the credit was accepted or denied; this is the reason why most orange points are below the X axis. The example pointed to by the green arrow is an exception showing that, although Sex negatively affected his likelihood, the credit was still granted.
Summary Importance uses the individual feature importance values across all data points to show the distribution of each feature’s impact on the prediction value. Using this diagram, we can investigate in which direction the feature values affect the predictions: blue points mean low values, while red means higher values. Here we see how much the value of each feature influenced the rejection of the loan (above the X axis) or its granting (below), considering that the further a value is above the axis, the higher its influence. The element pointed to by the red arrow indicates a high capital gain, which positively and dramatically influenced the loan grant.
Click on the point highlighted by the red arrow to see where the other features of the same observation are located (chart above), and how much they influenced the final decision to grant the loan (below). As I said, the capital gain is by far the #1 reason for the loan acceptance, followed by Education-Num and Occupation. Marital Status, instead, played against it, but not enough to prevent the loan.

Mitigating unfairness

Mitigation techniques depend on several factors, like the specific industry in which the model is used. They may also depend on existing constraints: for example, if the model cannot be retrained, we may use different thresholds per group to calibrate it in a post-processing step.

If, instead, the model CAN be retrained, there are multiple reduction methods available, which re-weight and re-label the data and re-train the model to try to improve its fairness criteria.

So far we trained fairness-unaware predictors, and I showed that this leads to unfair decisions under a specific notion of fairness called demographic parity. We’ll now try to mitigate the unfairness by applying the GridSearch algorithm from the Fairlearn package, one of the best-known mitigation methods, which sweeps a grid of candidate models to find a better trade-off between accuracy and disparity; this example comes from Sarah Bird’s Responsible ML presentation too.

GridSearch works by generating a sequence of relabellings and reweightings, and trains a predictor for each. Since demographic parity requires that individuals are offered the opportunity independently of their membership in the sensitive class, we start by building the object that prepares a pool of 70 models, which will be trained using logistic regression with different parameters.

Ready to train 70 models in 90 seconds? The fit command specifies just the column of sensitive features, in addition to the training data:
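A sketch of the sweep with fairlearn’s GridSearch; grid_size=70 matches the pool described above, while the other settings are assumptions:

from fairlearn.reductions import GridSearch, DemographicParity
from sklearn.linear_model import LogisticRegression

sweep = GridSearch(LogisticRegression(solver='liblinear'),
                   constraints=DemographicParity(),
                   grid_size=70)

# only the sensitive column is passed in addition to the training data
sweep.fit(X_train, y_train, sensitive_features=A_train.Sex)

# recent fairlearn releases expose the 70 trained models via predictors_
mitigated_predictors = sweep.predictors_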

Now we load all these 70 predictors (plus census_unmitigated, the original one) into the Fairness dashboard. The result will be somewhat confusing and slow because of the number of models and predictions (71 pairs), which we could reduce with a technique like this one:

For the purpose of this exercise I used the same code shown above for the 3 un-mitigated predictors, and I load all 71 models into the dashboard. So here is the fairness dashboard, which shows all 71 models in the “Model comparison” tab:

The red arrow identifies the original un-mitigated model, which presented a 17.4% disparity in predictions between men and women: 25.4% of men got the credit approved, against 8% of women. The blue rectangle identifies the lowest-disparity model (mitigated_predictor_31), which presents a discrepancy of 1.4% rather than 17.4%.

Here they are, with the un-mitigated model on the left showing 17% disparity in selection rate, and the mitigated model on the right, whose disparity goes down to 1%; the overall accuracy, however, drops from 87% in the un-mitigated predictor to 82% in the mitigated one:

Deep learning support in Interpret-Community

The interpretation topics and techniques discussed here are extended by Interpret-Community, including model-specific explainers like SHAP Deep Explainer, which builds on a connection with DeepLIFT.

TensorFlow and Keras models using the TensorFlow backend are supported, and there is also preliminary support for PyTorch.

Image “features” are just their pixels, after all, so the summary plot we drew above to show feature importance might simply be implemented by highlighting the relevant/non-relevant pixels which determined the classification:

Responsible ML Resources

Lots of resources are available for the topics I discussed in this article. The main challenge, as always, is to distill them and identify the most effective ones for our purposes, choosing the right level of detail.

In addition to the official documentation, reported below, I found Sarah Bird’s Responsible ML presentation (mentioned at several points above) a great guide to keep as an index of the topics I have explored:

You can find the full Jupyter notebook here; just remember to create the Conda environment as described at the beginning of this article.

Conclusions

Operationalizing responsible AI is a key commitment embraced by every company today.

Vendors like Microsoft offer teams of experts, like the AETHER Committee (AI, Ethics, and Effects in Engineering and Research) and ORA (Office of Responsible AI), to support customers in developing their AI strategy based on principles like fairness, reliability and safety, privacy, inclusiveness, transparency and accountability.

Our contribution as Data Scientists, Data Engineers, Developers and other technical roles is crucial to implementing these principles. I hope this article has helped raise your awareness of these topics and given you some practical tools and examples to move forward in this direction.

Mauro Minella is a Big Data & AI Cloud Solution Architect at Microsoft, where he previously held roles including Developer Evangelist. He teaches Statistics and Big Data at Cattolica University of Milan, and also publishes technical content on YouTube and LinkedIn.
