Bias in AI: A Python primer

Holistic AI Engineering
6 min read · Mar 7, 2023


AI systems can be biased against some minority groups. Among the high-profile cases, Amazon’s recruiting algorithm was biased against women, and COMPAS incorrectly identified black inmates as more likely to re-offend than white inmates. Maybe you’re putting a model into production and you’re worried that it might be biased. The purpose of this post is to introduce the basic concepts of bias and fairness in ML, so that you can identify them in your own models.

Setting up the problem

What is bias? In this context, an unfair outcome of a model for some underprivileged groups.

What do we assess here? A model’s performance in relation to bias, and whether there is bias in the data.

How? By comparing some metrics between different demographic groups.

Which metrics? It depends! Do we care more about equal representation or about the accuracy of the model? Accuracy always matters, of course, otherwise we may as well predict at random. But the questions to ask are rather:

  • Is equal representation important in my application (e.g. a recruitment process)?
  • How much trust do I have in the ground-truth label I work with?

Depending on the answers to these questions, one may want to use equality of opportunity metrics or equality of outcome metrics.

  • Equality of opportunity. The idea of equal opportunity is that everyone is given an equal chance to compete. The selection process is set up so that there is no room for arbitrariness or prejudice, and the most relevant individual is selected irrespective of who they are. In effect, we want the probability of a person in the positive class being correctly assigned a positive outcome, and the probability of a person in the negative class being incorrectly assigned a positive outcome, to be the same for privileged and unprivileged group members (for instance male and female).
  • Equality of outcome. The likelihood of a positive outcome is equal for individuals regardless of whether they are in the protected group or not. This means that a similar proportion of individuals from each group should be given a positive outcome. (A minimal sketch of both checks follows this list.)
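To make the two notions concrete, here is a minimal sketch of what each criterion compares. The helper names are our own (not from any library), and it assumes binary 0/1 arrays for predictions, ground-truth labels and group membership:

import numpy as np

def selection_rate(y_pred, group):
    # proportion of the group that receives a positive prediction
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 1].mean()

def group_rates(y_true, y_pred, group):
    # true positive rate and false positive rate within the group
    y_true, y_pred, group = np.asarray(y_true), np.asarray(y_pred), np.asarray(group)
    yt, yp = y_true[group == 1], y_pred[group == 1]
    tpr = yp[yt == 1].mean()
    fpr = yp[yt == 0].mean()
    return tpr, fpr

Equality of outcome compares selection_rate across groups; equality of opportunity compares the tpr and fpr returned by group_rates.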

Illustration with binary classification

A handy way of grasping these concepts is to visualise the confusion matrix of each group we are interested in. Let’s say we want to check for potential bias between female and male candidates in a recruitment system. The outcome is pass (1) or fail (0).

The confusion matrices below summarise the performance of the recruitment algorithm for the Female and Male groups. They show the following:

  • TP and TN in green: the number of true positives and true negatives for each group, i.e. the individuals who “should” have passed/failed (according to the ground-truth labels available in the training data) and whose outcome was correctly predicted by the model.
  • FP and FN in red: the number of false positives and false negatives for each group, i.e. the individuals who “should” have failed/passed and whose outcome was wrongly predicted by the model.

We will compare some “female” group metrics to the equivalent “male” metrics and derive a concept of bias from this.

Equality of outcome: We take the disparate impact metric as an example, also called the impact ratio (IR). For this, we do not care about the actual (ground-truth) labels. We simply look at the proportion of individuals in each group that were predicted to pass, also called the Selection Rate (SR):

  • SR_male: 37/50=0.74
  • SR_female: 26/50=0.52

And we divide: IR = SR_female/SR_male ≈ 0.70. Generally, a value below 0.8 is considered to indicate bias against the group in the numerator, here female.
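For readers who want to reproduce the arithmetic, here is the same calculation in plain Python, with the counts taken from the example above:

# selection counts read off the confusion matrices above
sr_male = 37 / 50        # 0.74
sr_female = 26 / 50      # 0.52
impact_ratio = sr_female / sr_male
print(round(impact_ratio, 2))   # ≈ 0.7, below the 0.8 rule of thumb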

Equality of opportunity: We take the average odds difference (AOD) metric as an example. Here, we care about the labels and how closely the model follows them, i.e. we want the model to have similar accuracy, precision, recall, etc. for each group.

It can be tempting to look at accuracy only in this case, as it has traditionally been the metric of choice for measuring the performance of a model. However, the problem with looking only at accuracy is that two groups may have exactly the same accuracy (proportion of correct predictions, i.e. true positives and true negatives), while the privileged group (in this case, men) is more often wrongly predicted as successful because past biases skew the source data in their favour, giving them an unfair advantage. The average odds difference solves this problem by comparing both the False Positive Rates and the True Positive Rates of the different groups:

AOD = ½ [(FPR_0 − FPR_1) + (TPR_0 − TPR_1)]

where 0 represents the underprivileged group and 1 the privileged group, FPR = FP/(FP+TN) is the false positive rate, and TPR = TP/(FN+TP) the true positive rate.

Here FPR_male = 7/17=0.41, FPR_female = 4/22 = 0.18, TPR_male = 30/33=0.91 and TPR_female = 22/28 = 0.79.

Plugging in these numbers gives AOD = ½ [(0.18 − 0.41) + (0.79 − 0.91)] ≈ −0.18. A rule of thumb is that values between −0.1 and 0.1 are considered acceptable, so in this case there is a bias against the Female group.
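And the same worked example in plain Python, with the per-group counts implied by the rates quoted above:

# per-group counts consistent with the rates above (TP, FN, FP, TN)
tp_m, fn_m, fp_m, tn_m = 30, 3, 7, 10    # male
tp_f, fn_f, fp_f, tn_f = 22, 6, 4, 18    # female

fpr_m, tpr_m = fp_m / (fp_m + tn_m), tp_m / (tp_m + fn_m)   # 0.41, 0.91
fpr_f, tpr_f = fp_f / (fp_f + tn_f), tp_f / (tp_f + fn_f)   # 0.18, 0.79
aod = 0.5 * ((fpr_f - fpr_m) + (tpr_f - tpr_m))
print(round(aod, 2))   # ≈ -0.18, outside the [-0.1, 0.1] range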

Okay, and in Python?

We use the holisticai library here (Docs and GitHub). We import the “adult” dataset, which gives us individuals’ features and demographic data, together with a label indicating whether each individual earns more or less than 50k. We will train an XGBoost classifier on the data to predict this label.

We first import the data, transform categorical features in numerical data, and split it into train and test data.

from holisticai.datasets import load_adult
from sklearn.model_selection import train_test_split

data, label = load_adult(as_frame=True, return_X_y=True)
# transform categorical columns into numerical codes
cat_columns = data.select_dtypes(['category']).columns
cat_columns = [col for col in cat_columns
               if col not in ['race', 'sex', 'age', 'label']]
data[cat_columns] = data[cat_columns].apply(lambda x: x.cat.codes)
# concatenate the labels to the data dataframe before splitting
data['label'] = label
data_train, data_test = train_test_split(data, test_size=0.3, random_state=42)

Note that we do not transform the demographic or label columns as we will remove them from the features X. We then extract our features X, labels y and group membership vectors using the function defined below.

def get_X_y_groups(data):
    # get label
    y = (data['label'] == '>50K').astype(int)
    # get group membership vectors for the bias metric functions
    group_f = (data['sex'] == 'Female').astype(int)
    group_m = (data['sex'] == 'Male').astype(int)
    # prepare features X (drop demographic and label columns)
    X = data.drop(columns=['race', 'sex', 'age', 'label'])
    return X, y, group_f, group_m

X_train, y_train, _, _ = get_X_y_groups(data_train)
X_test, y_test, group_f_test, group_m_test = get_X_y_groups(data_test)

Note that we get binary vectors, of the same length as the input dataframe, that flag the individuals belonging to a particular group, here ‘Female’ and ‘Male’. These will be passed as arguments to the metric functions.

Finally, we train an XGBClassifier model and predict the outcome on the test set.

from xgboost import XGBClassifier

# Train model
model = XGBClassifier()
model.fit(X_train,y_train)
# Predict
y_pred_test = model.predict(X_test)

Now we’re ready to get some metrics!

from holisticai.bias.metrics import disparate_impact,average_odds_diff
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test,y_pred_test)
di = disparate_impact(group_f_test,group_m_test,y_pred_test)
aod = average_odds_diff(group_f_test,group_m_test,y_pred_test,y_test)

This should give us an accuracy of 0.87, a disparate impact of 0.32 and an average odds difference of -0.06.

There is a bias against the female group according to the equality of outcome metric, but no significant bias according to the equality of opportunity metric.

Closing remarks

  • Data needed. You need the demographic data and the predicted outcomes to be able to measure bias. For some metrics (equality of opportunity), you also need the ground-truth labels.
  • Reference group. Metrics are calculated for a group by comparing it to a reference group. Usually, the reference group is the one with the highest selection rate. However, if that group contains only a few individuals or makes up a very small share of the total sample, it won’t be used as the reference.
  • Intersectionality. It is good practice to check for bias for intersectional groups as well; for instance, we can compare the black-female, black-male and white-female groups to the white-male group using the same strategy as above (a sketch follows this list).
  • Beyond binary classification. Binary classification is the classic case used to illustrate fairness, but many real-life applications are not binary classification: they are regression, multi-class classification, clustering or recommender systems, for instance. The holisticai library has metrics for all these cases, and we will expand on this in a future blog post. Stay tuned!
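As an illustration of the intersectionality point, here is a hedged sketch of how the same metric could be computed for intersectional groups. It reuses data_test, y_pred_test and disparate_impact from the snippets above, and it assumes the adult dataset’s race column uses the values ‘Black’ and ‘White’:

# build intersectional group membership vectors (assumes these race values exist)
group_bf_test = ((data_test['sex'] == 'Female') & (data_test['race'] == 'Black')).astype(int)
group_wm_test = ((data_test['sex'] == 'Male') & (data_test['race'] == 'White')).astype(int)
# disparate impact of black-female vs white-male
di_intersectional = disparate_impact(group_bf_test, group_wm_test, y_pred_test)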

As we’re constantly looking to learn, we would love to hear your feedback in the comments.

Happy coding 🛠️

Holistic AI is an AI risk management company that aims to empower enterprises to adopt and scale AI confidently. We have pioneered the field of AI risk management and have deep practical experience auditing AI systems, having reviewed 100+ enterprise AI projects covering 20k+ different algorithms. Our clients and partners include Fortune 500 corporations, SMEs, governments and regulators.

We’re hiring :)
