Aequitas Flow step-by-step: a Fair ML optimization framework

By Sérgio Jesus, Inês Silva, Pedro Saleiro, Hugo Ferreira, Pedro Bizarro

Feedzai Techblog · 9 min read · Aug 12, 2024

In this blog post, we will visit Aequitas Flow, an open-source framework designed to run complete and standardized experiments with Fair ML algorithms. We encourage you to try Aequitas Flow with the Google Colab notebooks, which are available in the project’s GitHub repository.

This blog post is based on the paper by Sérgio Jesus, Pedro Saleiro, Inês Silva, Beatriz M. Jorge, Rita P. Ribeiro, João Gama, Pedro Bizarro, and Rayid Ghani.

Table of Contents:

1. What is Aequitas Flow?
- 1.1. For Practitioners selecting a model
- 1.2. For Researchers running a benchmark
2. Install Aequitas Flow
3. The components of Aequitas Flow
- 3.1. Experiment
- 3.2. Optimizer
- 3.3. Datasets
- 3.4. Methods
- 3.5. Audit
4. Conclusion

What is Aequitas Flow?

Aequitas Flow is the codename for the latest version of Aequitas, a well-established package for fairness auditing in the ML community. This version extends the package to include experimentation with Fair ML algorithms.

Aequitas started as software to diagnose and alert on disparities in ML models’ decisions across sensitive attributes, such as race, gender, or age. To achieve this, the package runs a Bias Audit, in which it calculates several metrics across all data groups, determined by the sensitive attributes, and compares them to identify any gap in performance.

While some other packages already implement Fair ML methods, they also introduce an overhead of technical knowledge to configure, evaluate, and deploy these models. Because of this, we extended Aequitas to let users conduct experiments with a wide variety of Fair ML methods from the literature through an intuitive user experience. Aequitas Flow was built on the principles of extensibility and reproducibility: it allows users to incorporate their own datasets and methods into the framework using the familiar interfaces of scikit-learn and pandas, and it ensures their work can be replicated by recording an extensive collection of configurations and run information.

The framework is designed for two different experiences, depending on the user’s end goal. It provides the means for:

  • Practitioners — to select a model for a given application or dataset;
  • Researchers — to run a benchmark, or check if a given method is significantly better than others.

For Practitioners selecting a model

A Practitioner can use Aequitas Flow to select a model, for example, one to deploy to a production environment. To help with this task, Aequitas Flow creates an interactive plot to compare models on a given dataset. This provides an intuitive way to view the achieved performance and fairness of the trained models.

Figure 1 — Example of an experiment to select a model. Each model is represented by a circle in the graph on the left. The blue line shows the Pareto frontier, composed of models for which no other model is simultaneously better in both performance and fairness; the recommended model is marked with a star. The models were trained on a dataset provided by Aequitas.

This plot shows all the models of an experiment on a two-axis grid, where the X-axis carries the performance metric (e.g., Accuracy, TPR, Precision) and the Y-axis the fairness metric (e.g., Demographic Parity, Equal Opportunity; more about fairness metrics here). Generally, we aim to maximize both metrics, so the best models sit in the top right corner.

The plot highlights the model with the best combination of the selected performance and fairness metrics, marking it as “Recommended” (with a star). By default, this is the model with the highest combined score of the two metrics; the weight of each metric can be adjusted with an alpha parameter in the plotting method to favor higher-performing or fairer models.

Additionally, the Pareto frontier is highlighted in blue. This frontier is composed of the models that present a dominant tradeoff (i.e., no model can obtain better performance without losing fairness, and vice versa). The plot lets the user interactively click on any model and see how it compares to the “Recommended” model. The comparison includes a comprehensive list of fairness and performance metrics, as well as the hyperparameters of the model. In the provided example, we compare the “Recommended” model with the one that achieved the best predictive performance (+1.9 p.p. in Accuracy). We notice, however, that the latter has a much lower value of fairness (-36.9 p.p. in Equal Opportunity).
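As a minimal sketch, setting up such an experiment and plot might look like the snippet below. The import path, constructor arguments, and the `plot_pareto` method name are assumptions on our part; please check the example notebooks for the exact API.

```python
import pandas as pd

from aequitas.flow import DefaultExperiment  # assumed import path

# Load a tabular dataset with a binary label and a sensitive attribute.
df = pd.read_csv("my_dataset.csv")  # hypothetical file

# Run an experiment with default methods and optimization settings.
# The argument names below are illustrative assumptions.
experiment = DefaultExperiment.from_pandas(
    df,
    target_feature="label",      # binary outcome to predict
    sensitive_feature="group",   # protected attribute for fairness metrics
)
experiment.run()

# Interactive plot of performance vs. fairness, with the Pareto frontier
# and the "Recommended" model; `alpha` weighs the two metrics.
experiment.plot_pareto(alpha=0.5)  # assumed method name and parameter
```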

For Researchers running a benchmark

A common objective among researchers is to run a benchmark, i.e., compare a new method with other methods established in the literature for a wide array of tasks or datasets. Using Aequitas Flow, it is possible to run the necessary experiments and analyze the results quickly and straightforwardly.

Figure 2 — Comparing the tradeoffs between fairness (equal opportunity, in this case) and performance (accuracy, in this case) for several algorithms.

This plot draws each method (represented by a different line) across all possible tradeoffs between fairness and performance. The X-axis encodes this tradeoff, and is what we call the alpha parameter: towards the left, the value plotted on the Y-axis is more influenced by the fairness metric (in this case, Equal Opportunity), while towards the right, it is more influenced by the performance metric (Accuracy). Methods that sit above the others in certain regions are considered dominant there. For example, FairGBM (in orange) is dominant in the region of fairer methods.

The shaded regions around each line are estimated through a bootstrap of controllable size. This emulates the expected result and confidence interval for a given number of trained models per method, allowing researchers to identify both the methods expected to obtain the best results (in fairness, performance, or a combination of both) and the methods that are more or less stable in training.
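To make the construction of Figure 2 concrete, here is an illustrative computation of one method’s tradeoff curve (this is a conceptual sketch, not the package’s internal code): for each alpha, the plotted value is the alpha-weighted combination of performance and fairness, and the band comes from bootstrapping the best score attainable with a fixed number of trained models.

```python
import numpy as np

rng = np.random.default_rng(42)

def tradeoff_curve(fairness, performance, alphas, n_models=50, n_boot=200):
    """Mean and 95% band of the best alpha-weighted score over n_models draws."""
    fairness = np.asarray(fairness)
    performance = np.asarray(performance)
    curve = []
    for alpha in alphas:
        # alpha = 0 -> pure fairness; alpha = 1 -> pure performance.
        scores = alpha * performance + (1 - alpha) * fairness
        # Bootstrap: expected best score when n_models models are trained.
        boots = [scores[rng.integers(0, len(scores), n_models)].max()
                 for _ in range(n_boot)]
        curve.append((np.mean(boots),
                      np.percentile(boots, 2.5),
                      np.percentile(boots, 97.5)))
    return curve

# Hypothetical results for one method: 100 trained models.
fair = rng.uniform(0.60, 0.95, 100)
perf = rng.uniform(0.70, 0.90, 100)
curve = tradeoff_curve(fair, perf, alphas=np.linspace(0.0, 1.0, 21))
```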

The Aequitas Flow framework provides the necessary tools to easily conduct and visualize these experiments while allowing for flexibility in the chosen metrics, datasets, and sensitive attributes being analyzed.

Install Aequitas Flow

Whether you are a practitioner, a researcher, or are just interested in the package, you can easily try it out for yourself! We provide resources to enable anyone to quickly start using the Aequitas Flow framework.

Aequitas Flow is the latest version of Aequitas, and therefore, the first step is as simple as installing the current version of Aequitas via pip, using:
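```
pip install aequitas
```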

To facilitate understanding how to configure an experiment and obtain results using Aequitas Flow, we have curated a series of example notebooks, which you are free to adapt as you follow along.

Across these notebooks, we explain the multiple components of the package and provide examples on running different pipelines, from processing raw public data to training Fair ML models and generating the resulting plots.

The available Google Colab notebooks are linked in the project’s GitHub repository.

Figure 3 — Start of an interactive notebook in Google Colab
Figure 4 — Example of how to configure an experiment

The components of Aequitas Flow

Now that you know what Aequitas Flow is and how to use it, we will explore its package contents in more detail.

Aequitas Flow is designed to offer an intuitive user experience while still being flexible enough for more experienced users to extend its components to meet their unique requirements. We finish this blog post by walking through the architecture that makes Aequitas Flow a versatile and empowering tool.

Experiment

The Experiment is the main orchestrator of the workflow within the package.

It processes configurations — either in the form of files or Python dictionaries — specifying the methods, datasets, and optimization parameters to be used in the experimental process. The Experiment then handles the initialization and population of the necessary classes, ensuring they interact deterministically throughout the execution process. When an experiment is completed, the results are readily available to be analyzed.

This component can also be instantiated with only a dataset, in which case it executes the experiment using default settings for methods and optimization. This feature is intentionally designed to streamline initial experiments and reduce configuration effort.
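For illustration, a fully configured run might be set up as below. The configuration schema, key names, and classpath are assumptions on our part; treat this as a sketch and refer to the example notebooks for the actual format.

```python
from aequitas.flow import Experiment  # assumed import path

# Hypothetical configuration; keys and accepted values are assumptions.
config = {
    "methods": [
        # Fair ML methods and base estimators to compare.
        {"classpath": "aequitas.flow.methods.inprocessing.FairGBM",  # assumed classpath
         "args": {}},
    ],
    "datasets": ["BankAccountFraud"],   # one of the bundled datasets
    "optimization": {"n_trials": 100},  # hyperparameter trials per method
}

experiment = Experiment(config=config)  # files are also accepted
experiment.run()

# After completion, the results (models, scores, predictions) are ready
# to be analyzed and plotted.
```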

Optimizer

The Optimizer component manages hyperparameter selection and model evaluation.

Given the hyperparameter search space of the methods and a split dataset, it handles hyperparameter tuning, evaluates the models, and stores the resulting objects, scores, and predictions. The Optimizer leverages the Optuna framework for hyperparameter selection and employs the Aequitas package for the fairness and performance evaluation of the models.
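Conceptually, what the Optimizer does for each method resembles a plain Optuna study like the one below. This is an illustration of the underlying mechanics, not the package’s internal code, and `train_and_evaluate` is a hypothetical stand-in.

```python
import optuna

def train_and_evaluate(params: dict) -> float:
    # Hypothetical stand-in for fitting a model on the train split and
    # scoring it (performance and/or fairness) on the validation split.
    return 1.0 - abs(params["learning_rate"] - 0.1)

def objective(trial: optuna.Trial) -> float:
    # Sample a hyperparameter configuration from the declared search space.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
    }
    return train_and_evaluate(params)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```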

Datasets

This component has two primary functions:

  • loading the data;
  • generating splits according to user-defined configurations.

It maintains details on target attributes, sensitive features, and categorical variables, presenting the data in an extension of the pandas dataframe format.

The framework initially encompasses eleven datasets, selected for their use in research, including those from the BankAccountFraud (created by Feedzai) and Folktables collections. The Dataset component allows users to supply their own datasets, supporting both CSV and parquet formats.
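Supplying your own data might look like the sketch below. The `GenericDataset` class name, its arguments, and its method names are assumptions; check the Datasets notebook for the real interface.

```python
from aequitas.flow.datasets import GenericDataset  # assumed class and path

# Wrap a user-supplied file; CSV and parquet are both supported.
dataset = GenericDataset(
    path="my_data.parquet",        # hypothetical file
    target_column="fraud",         # label to predict (assumed argument name)
    sensitive_column="age_group",  # protected attribute (assumed argument name)
)
dataset.load_data()      # load the data (assumed method name)
dataset.create_splits()  # train/validation/test splits per the configuration
```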

Methods

These components are dedicated to processing data and to creating and adjusting predictions for the validation and test sets.

Aequitas Flow provides interfaces for three types of Fair ML methods:

  • Pre-processing methods, which modify the input data;
  • In-processing methods, which aim to enforce fairness during model training;
  • Post-processing methods, which adjust the resulting scores or decisions.

Additionally, classical ML methods are included in the “base estimators” category, which functions similarly to in-processing methods. These methods adhere to standardized interfaces to ensure seamless operation and facilitate function calls within the Experiment class.

Currently, 15 methods are supported, including pre-processing techniques such as undersampling and oversampling, label massaging and suppression, and label flipping. Regarding in-processing techniques, the package includes FairGBM, a gradient boosting machine algorithm with fairness constraints developed by Feedzai, and the methods of Exponentiated Gradient and Grid Search for transforming fairness constraints into cost-sensitive classification. For post-processing, the framework includes different implementations of group-wise thresholding.
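Because the methods share standardized, scikit-learn-style interfaces, plugging in a custom one is straightforward. The sketch below shows what a pre-processing method might look like; the expected signatures (fit/transform over features X, labels y, and sensitive attributes s) are our reading of the design described above, not a verbatim copy of the package’s base class.

```python
import pandas as pd

class DropMissingSensitive:
    """Hypothetical pre-processing method: drop rows whose sensitive
    attribute is missing before training."""

    def fit(self, X: pd.DataFrame, y: pd.Series, s: pd.Series):
        # Nothing to learn for this simple transformation.
        return self

    def transform(self, X: pd.DataFrame, y: pd.Series, s: pd.Series):
        # Return the adjusted (X, y, s) triplet.
        mask = s.notna()
        return X[mask], y[mask], s[mask]
```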

Audit

The Audit component simplifies the process of calculating and analyzing any disparities in performance between different groups. The Aequitas toolkit offers a suite of confusion matrix-based metrics for auditing the existence of bias.

Users can specify a group as a reference for the disparity comparison and select the appropriate fairness metric for their analysis. Aequitas Flow then leverages the Audit class to create a dataframe of metrics and disparities for each group in the dataset, allowing for a comprehensive analysis of the predictions that a method has produced for a given dataset.
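A minimal audit might look like the following. We assume the top-level `Audit` class and the plotting helpers behind Figures 5 and 6; the exact names and arguments should be confirmed in the Audit notebook.

```python
import pandas as pd

from aequitas import Audit  # assumed import path for the current API

# One row per instance: the model's decision ("score"), the true outcome
# ("label_value"), and one column per sensitive attribute.
df = pd.DataFrame({
    "score":       [1, 0, 1, 0, 1, 0],
    "label_value": [1, 0, 0, 0, 1, 1],
    "race":        ["A", "A", "B", "B", "A", "B"],
})

audit = Audit(df)  # computes group metrics and disparities

# Figure 5-style summary and Figure 6-style disparity plot; the method
# names and arguments are assumptions.
audit.summary_plot(["fpr", "fdr"])
audit.disparity_plot(attribute="race")
```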

Aequitas Flow further provides tools for visualizing the results of the Bias Audit, as shown below in Figures 5 and 6.

Figure 5 — Summary of an Audit. In this example, we focus on two metrics (False Positive Rate and False Discovery Rate), and three protected attributes (Race, Sex, and Age group).
Figure 6 — Disparity plot of an Audit. Here, we select a specific attribute to plot (in this case, Race). With this plot, we see the relative differences in metrics between the groups of the dataset.

In Figures 5 and 6, we are able to view the results of the bias audit. Figure 5 presents a summary plot of the audit. It identifies the groups with significant differences from the metrics of the reference groups and highlights them in red. Hovering over these groups reveals more information, such as the group’s name, size, and metric value.

Figure 6 shows the relative differences in metrics between groups. The examples illustrate two error rates: FPR, the percentage of negative instances incorrectly predicted as positive, and FDR, the percentage of instances predicted positive that are actually negative. The user can select the metrics shown in these plots.

Conclusion

As more people navigate the complex landscape of fairness in machine learning, Aequitas Flow stands as an effort to drive ethical open-source innovation. We invite you to explore Aequitas Flow and contribute to its development by visiting our GitHub repository. To contribute, feel free to open an issue in the repository or create a pull request addressing one of the open issues.
