Comparing VerifyML, AI Fairness 360 and Fairlearn
An overview of the 3 different open source AI fairness assessment tools and a comparison of the feature set
Libraries to assess model fairness are an important part of a data scientist’s toolkit. They are adopted in many industries, from social media to pharmaceuticals and banking, to evaluate model bias.
Used correctly, these tools can help an organization deploy fair and robust models, thus improving users’ trust and confidence in AI systems. While much can be said on how such tools are deployed and used in the business process, this article focuses on a technical evaluation of the different solutions available.
VerifyML Vs Fairlearn Vs AI Fairness 360
With any of these 3 open-source tools, businesses can start evaluating their models with battle-tested libraries instead of developing customized fairness assessment tools from scratch.
Let’s take a look at the different features of each toolkit and understand how they can be used.
VerifyML
To start, we have our very own VerifyML, the winning solution of the Global Veritas Challenge. VerifyML is a governance framework and Python library that aligns data, product, and compliance teams across the AI development lifecycle.
Fairlearn
Started in 2018 by a team at Microsoft as a Python package with algorithms to mitigate unfairness in classification models, Fairlearn has since expanded into a fairness assessment toolkit and is currently a community-driven project.
AI Fairness 360
IBM’s AI Fairness 360 (AIF360) is an extensible toolkit that helps teams examine, report, and mitigate discrimination and bias in machine learning models throughout the AI application lifecycle. It is part of IBM’s trusted AI product suite which includes AI Explainability 360 and Factsheets 360.
For simplicity, we will compare these solutions on 3 different aspects:
- Display of results
- Coverage of fairness measures
- Mitigation strategies
Display of Results
VerifyML
VerifyML treats each fairness consideration as a test, and the user has to specify the metric used and an acceptable threshold. Each test automatically generates a description, the user-specified threshold, and the outcome of the test.
Every test result is also accompanied by a comparison chart, showing how the disadvantaged group fares against the other subgroups. By approaching fairness as tests, VerifyML helps users avoid unintended biases. It makes clear to users whether their expectations were met and where mitigation strategies could be applied.
Fairlearn
Fairlearn also allows a user to specify the sensitive attribute and the metrics that they are interested in comparing. This produces a comparison plot of the different metrics across groups.
A standout feature of the package is the interactive dashboard which provides a nice summary of the different fairness metrics. One can select different sensitive features or performance metrics and see how the comparison changes. The user has to decide whether those metrics and results actually matter for their particular business use case.
AI Fairness 360
AIF360 takes a similar approach to VerifyML and allows a user to configure the threshold and test outcome. Unlike VerifyML, it does not allow users to vary the thresholds across different tests, e.g. setting a threshold of 1.5 for one test involving age and 1.3 for another test involving gender. Per-test thresholds matter because a reasonable threshold depends on the context and should be left customizable by the user.
AIF360 also comes with a striking dashboard as part of Watson Studio. In general, while the metrics and algorithms are open-source, the full end-to-end responsible AI governance workflow is part of IBM’s core AI product offering.
Coverage of fairness measures
Fairness is essentially contested and the appropriate measure of fairness is often context-dependent. While it is the responsibility of the user to decide what is ultimately fair or not, the toolkit should provide a wide range of fairness measures to aid its users in their justification. In this section, we assess the toolkits based on their support for different performance metrics and fairness outcomes.
For performance metrics, all 3 toolkits provide the core metrics used in binary classification problems, such as false positive rate, false negative rate and selection rate.
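For readers less familiar with these metrics, here is a minimal standard-library sketch of how they are usually defined for 0/1 labels and predictions. The `BinaryMetrics` class and `binary_metrics` function are hypothetical names used for illustration only; they are not part of VerifyML, Fairlearn or AIF360, each of which ships its own implementations.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    false_positive_rate: float
    false_negative_rate: float
    selection_rate: float


def binary_metrics(y_true, y_pred):
    """Compute FPR, FNR and selection rate for 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    negatives = fp + tn  # actual negatives
    positives = tp + fn  # actual positives
    total = len(pairs)
    nan = float("nan")
    return BinaryMetrics(
        # Share of actual negatives that were wrongly flagged as positive.
        false_positive_rate=fp / negatives if negatives else nan,
        # Share of actual positives that were missed.
        false_negative_rate=fn / positives if positives else nan,
        # Share of all cases predicted positive, regardless of the true label.
        selection_rate=(tp + fp) / total if total else nan,
    )


# Toy data, made up for illustration.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In a fairness assessment, the same function would be applied separately to each subgroup of the sensitive attribute so that the resulting values can be compared across groups.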
VerifyML and Fairlearn also support metrics for regression problems, such as MSE and MAE, which AIF360 does not currently provide.
Support for multi-class classification problems and for problems outside supervised learning seems to be lacking in all 3 solutions. Data scientists currently have to turn to other standalone packages for those problems.
What sets VerifyML apart from the rest is its coverage of fairness tests. The most common test used in fairness tools is the disparity test. In short, it calculates the difference or ratio of a performance metric between any 2 subgroups and assesses whether one group is disadvantaged relative to the other.
While all 3 toolkits provide a disparity test, VerifyML also provides additional tests (more details in the image above). Let’s take a quick dive into the min/max metric threshold test and understand why having alternative measures of fairness is important. This test checks whether the fairness metric of each subgroup meets the minimum/maximum threshold specified.
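The logic of a disparity test can be sketched in a few lines of plain Python. This is an illustrative sketch, not the API of any of the 3 toolkits: `disparity_test`, its parameters and the sample rates are all assumptions. It also assumes a metric where lower is better (such as false positive rate), so a ratio above 1 means the disadvantaged group fares worse.

```python
import math


def disparity_test(metric_by_group, disadvantaged_group, threshold, method="ratio"):
    """Compare the disadvantaged group's metric against every other group.

    Returns (passed, disparities). The test passes only if every
    disparity is at or below the threshold.
    """
    base = metric_by_group[disadvantaged_group]
    disparities = {}
    for group, value in metric_by_group.items():
        if group == disadvantaged_group:
            continue
        if method == "ratio":
            # Guard against division by zero for a group with a zero rate.
            disparities[group] = base / value if value else math.inf
        elif method == "difference":
            disparities[group] = base - value
        else:
            raise ValueError(f"unknown method: {method}")
    passed = all(d <= threshold for d in disparities.values())
    return passed, disparities


# Hypothetical false positive rates per age group.
fpr_by_age = {"under_30": 0.06, "30_to_60": 0.05, "over_60": 0.03}

passed_ratio, ratios = disparity_test(fpr_by_age, "under_30", threshold=1.5)
passed_diff, diffs = disparity_test(
    fpr_by_age, "under_30", threshold=0.05, method="difference"
)
print(passed_ratio, ratios)
print(passed_diff, diffs)
```

With these made-up numbers the ratio test fails, because the under-30 group's false positive rate is about twice that of the over-60 group, while the difference test passes. The choice between ratio and difference, and the threshold itself, can therefore change the verdict.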
For example, let’s assume that a bank currently employs a rule-based approach for a fraud detection model which has a false positive rate of 2.5%. The bank is interested in deploying a new AI model. We can use this test in this scenario to ensure that all the customers would be better off in the new AI model than the rule-based one. By setting a max FPR of 2.5% with the threshold test, the test will only pass if every subgroup’s FPR is lower than that. In this case, we are probably less concerned with disparity among the subgroups as long as the new model performs better.
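The bank scenario above can be expressed as a small check. This is a hedged sketch of the idea rather than VerifyML's actual implementation: `max_threshold_test` and the subgroup rates are hypothetical, and the strict "lower than" comparison follows the description in the text.

```python
def max_threshold_test(metric_by_group, max_value):
    """Pass only if every subgroup's metric is strictly below max_value.

    Returns (passed, failing), where failing maps each offending group
    to its metric value.
    """
    failing = {
        group: value
        for group, value in metric_by_group.items()
        if value >= max_value
    }
    return not failing, failing


# Hypothetical false positive rates of the new AI model per customer segment.
fpr_by_group = {"segment_a": 0.018, "segment_b": 0.021, "segment_c": 0.027}

# 2.5% is the false positive rate of the existing rule-based system.
passed, failing = max_threshold_test(fpr_by_group, max_value=0.025)
print(passed, failing)
```

Here the test fails because one segment would be worse off than under the rule-based system, even though the other two segments improve. A disparity test alone would not surface that comparison against an external baseline.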
The other 2 toolkits, on the other hand, do not offer many assessment options beyond the disparity test.
If you’re interested in learning more about the available tests in VerifyML, please refer to our GitHub repo.
Mitigation strategies
To improve fairness in an existing model, mitigation steps can be taken at 3 stages of the data science pipeline:
- Data pre-processing
- Model building
- Model post-processing
A simple mitigation strategy might be to remove the protected attribute from the model, incurring a potential trade-off in performance for fairness.
VerifyML currently provides threshold optimization (as part of the min/max threshold test) and a Shapley feature importance test as its mitigation strategies.
The other 2 toolkits provide a more robust suite of mitigation tools to correct unfairness. These strategies include reweighing, adversarial debiasing, correlation removal and exponentiated gradient reduction. While the built-in mitigation tools give users an out-of-the-box solution, it is still up to the users to decide on the optimal approach to fix the problem.
All 3 toolkits allow for easy comparison between the original model and the improved model. For example, Fairlearn has a model comparison dashboard that plots the fairness tradeoffs between models, while VerifyML provides both a Python function to generate a model comparison report and a web application to compare models.
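To give a feel for what a pre-processing strategy such as reweighing does, here is a minimal sketch of the general idea: each training example gets a weight equal to the expected frequency of its (group, label) combination under independence divided by its observed frequency. This is an illustration under that assumption, not the code used by AIF360 or Fairlearn, and `reweighing_weights` is a hypothetical name.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Return one weight per example: P(group) * P(label) / P(group, label).

    Combinations that are under-represented relative to independence
    between group and label receive weights above 1.
    """
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Toy data: group "a" receives the positive label far more often than group "b".
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    rows = [(w, y) for w, g, y in zip(weights, groups, labels) if g == group]
    return sum(w for w, y in rows if y == 1) / sum(w for w, _ in rows)


print(weights)
print(weighted_positive_rate("a"), weighted_positive_rate("b"))
```

After reweighing, the weighted positive rate is the same in both groups, so a model trained with these sample weights no longer sees a correlation between group membership and label in the training data. Whether that is the right fix still depends on the use case.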
This table provides a quick summary of the 3 toolkits, examining how they display fairness results, the coverage of the methods and mitigation strategies.
The comparison also shows how the different solutions could complement each other. An initial overview of different metrics could be generated with Fairlearn, mitigation methods adopted from AIF360, and test suites generated using the VerifyML framework.
With VerifyML, we aim to make it easy for enterprises to adopt fair AI workflows on top of our open-source solution. As part of that, we are improving the workflow of data teams by automatically generating model documentation and alerts using GitHub Actions. Since model risk management involves multiple stakeholders, we are also working on no-code solutions for teams to collaborate and contribute insights effectively throughout the model release cycle.
Feel free to reach out through the GitHub repository for technical questions or through our contact-us page if you would like to collaborate with us on proofs of concept.