Preventing AI Systems from Amplifying Bias with Adversarial Learning

An introduction to the causes of bias and a practical way of dealing with it using adversarial learning

Rashmi Margani
Published May 23, 2019 (10 min read)


When we talk about amplifying bias, the first question that comes to mind is: how does an AI system amplify bias?

Both discriminative and generative models can cause bias amplification. Discriminative models in particular behave more like a “black box” and learn to answer only for the specific dataset they were trained on.

The next question is: what is a discriminative model, and how does it lead to bias?

Discriminative models, also referred to as conditional models, are trained by depending only on the observed data: they learn how to do the classification from the given statistics rather than modelling how the data was generated. Examples of discriminative models include neural networks, logistic regression, SVMs and conditional random fields.

So, how does an AI algorithm amplify bias?

Let’s take word embedding bias as an example and look at where the amplified bias originates.

The power of machine learning systems not only promises great technical progress but also risks societal harm. Popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding of how word embedding bias depends on the training data. Given a word embedding trained on a corpus, certain bias-metric methods identify how perturbing the corpus will affect the bias of the resulting embedding. This can be used to trace the origins of word embedding bias back to the original training documents.

Fig: how the cultural context of the information learned by the AI system leads to bias

How do these biases affect outcomes?

A model doing a classification task is likely to be trained to maximise classification accuracy. This means that the model will take advantage of whatever information improves accuracy on the dataset, including any biases which exist in the data.
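To make this concrete, here is a minimal sketch with an invented hiring history, not real data. The `proxy` column is a hypothetical feature that merely correlates with a protected attribute, and the past decisions happened to reward it as much as skill. A plain logistic regression trained only for accuracy picks up that preference.

```python
import numpy as np

# Hypothetical hiring history, invented for illustration.
# Each cell: (skill, proxy, number hired out of 4 applicants).
# "proxy" stands for a feature that only correlates with a protected
# attribute; past decisions rewarded it as much as skill.
cells = [(0, 0, 1), (1, 0, 2), (0, 1, 2), (1, 1, 3)]

X, y = [], []
for skill, proxy, hired in cells:
    for i in range(4):
        X.append([skill, proxy, 1.0])  # last column is the intercept
        y.append(1.0 if i < hired else 0.0)
X, y = np.array(X), np.array(y)

# Plain logistic regression fitted by gradient descent on the log loss.
w = np.zeros(3)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 1.0 * X.T @ (p - y) / len(y)

skill_weight, proxy_weight, intercept = w
print("skill weight: %.3f, proxy weight: %.3f" % (skill_weight, proxy_weight))
```

In this toy data the model ends up weighting the proxy feature as heavily as skill, simply because doing so reproduces the historical labels best.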

For example, let’s look at the case study of a real-world automated hiring platform and how it affected candidates before they ever got the job.

“Amazon ditched AI recruiting tool that favoured men for technical jobs.”

How did it happen?

The company’s experimental hiring tool used artificial intelligence to give job candidates scores ranging from one to five stars, much like shoppers rate products on Amazon. “They literally wanted it to be an engine where I’m going to give you 100 resumes, it will spit out the top five, and we’ll hire those.”

Amazon recruitment System bias against women

But by 2015, the company realized its new system was not rating candidates for software developer jobs and other technical posts in a gender-neutral way. That is because the AI models in Amazon’s recruiting platform were trained to vet applicants by observing patterns in resumes submitted to the company over a 10-year period. Most came from men, which led the platform to reflect that male dominance.

How can we deal with gender bias in an AI (NLP) system?

Word embedding models have become a fundamental component in a wide range of Natural Language Processing (NLP) applications. However, embeddings trained on human-generated corpora have been demonstrated to inherit strong gender stereotypes that reflect social constructs.

Embeddings are a powerful mechanism for projecting a discrete variable (e.g. words, locales, URLs) into a multi-dimensional real-valued space. Several strong methods have been developed for learning embeddings. One example is the skip-gram algorithm, in which a word is used to predict the words in its surrounding context. Unfortunately, much real-world textual data has a subtle bias that machine learning algorithms will implicitly include in the embeddings created from that data. This bias can be illustrated by performing a word analogy task using the learned embeddings.
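As a rough illustration of the word analogy task, the sketch below answers “A is to B as C is to ?” by finding the word closest to B + C − A. The three-dimensional vectors in `toy_embed` and the helper names are made up for this example; real embeddings would come from a trained model.

```python
import numpy as np

# Toy 3-d embeddings, invented for illustration (not from a trained model).
# Axes, loosely: [royalty, gender, medical profession].
toy_embed = {
    "man":    np.array([0.0,  1.0, 0.0]),
    "woman":  np.array([0.0, -1.0, 0.0]),
    "king":   np.array([1.0,  1.0, 0.0]),
    "queen":  np.array([1.0, -1.0, 0.0]),
    "doctor": np.array([0.0,  0.6, 1.0]),
    "nurse":  np.array([0.0, -0.6, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, embed):
    """Returns the word D that best completes 'A is to B as C is to D'."""
    target = embed[b] + embed[c] - embed[a]
    candidates = [w for w in embed if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embed[w], target))


print(solve_analogy("man", "king", "woman", toy_embed))    # queen
print(solve_analogy("man", "doctor", "woman", toy_embed))  # nurse
```

The first analogy is the harmless textbook case. The second shows the problem: because the toy “doctor” and “nurse” vectors carry a gender component, the analogy reproduces a stereotype.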

Now, let’s discuss a powerful technique for mitigating bias: adversarial learning.

How does adversarial learning help with bias mitigation?

The adversarial method for removing some of the bias from embeddings is based on the idea that those embeddings are intended to be used to predict some outcome 𝑌 based on an input 𝑋, but that outcome should, in a fair world, be completely unrelated to some protected variable 𝑍. If that were the case, then knowing 𝑌 would not help you predict 𝑍 any better than chance. This principle can be directly translated into two networks in series, as illustrated in the figure below. The first attempts to predict 𝑌 using 𝑋 as input. The second attempts to use the predicted value of 𝑌 to predict 𝑍.

The architecture of the adversarial network
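One way to make “knowing 𝑌 should not help you predict 𝑍” measurable is to fit a simple adversary and see how well it does. The sketch below is only an illustration under stated assumptions: the adversary is a linear least-squares fit, the function name `adversary_r2` is hypothetical, and the arrays are invented.

```python
import numpy as np


def adversary_r2(y_hat, z):
    """R^2 of the best linear predictor of z from y_hat (0 = no leakage)."""
    features = np.column_stack([y_hat, np.ones_like(y_hat)])
    coef, *_ = np.linalg.lstsq(features, z, rcond=None)
    residual = z - features @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((z - z.mean()) ** 2)


z = np.array([1.0, 1.0, -1.0, -1.0])             # protected variable
y_hat_leaky = np.array([2.0, 2.0, -2.0, -2.0])   # predictions that track z
y_hat_fair = np.array([1.0, -1.0, 1.0, -1.0])    # predictions unrelated to z

leaky_score = adversary_r2(y_hat_leaky, z)
fair_score = adversary_r2(y_hat_fair, z)
print("leaky: %.2f, fair: %.2f" % (leaky_score, fair_score))
```

A score near 1 means the adversary can recover 𝑍 from the predictions, and a score near 0 means the predictions carry no linearly recoverable information about 𝑍.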

However, simply training the first network’s weights 𝑊 based on ∇𝑊𝐿1 (the gradient of its prediction loss 𝐿1) and the second network’s weights 𝑈 based on ∇𝑈𝐿2 (the gradient of the adversary’s loss 𝐿2) won’t actually achieve an unbiased model. To do that, you need to incorporate into 𝑊’s update function the concept that 𝑈 should be no better than chance at predicting 𝑍. The way to achieve that is analogous to how Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) train their generators.

In addition to ∇𝑊𝐿1, you incorporate the negation of ∇𝑊𝐿2 into 𝑊’s update function, so that 𝑊 moves in a direction that makes the adversary worse. However, it’s possible that ∇𝑊𝐿1 is changing 𝑊 in a way which improves accuracy by using the very information you are trying to protect. To avoid that, you also incorporate a term which removes that component of ∇𝑊𝐿1 by projecting it onto ∇𝑊𝐿2. Once you’ve incorporated those two terms, the update function for 𝑊 becomes:

∇𝑊𝐿1 − proj_{∇𝑊𝐿2}(∇𝑊𝐿1) − α∇𝑊𝐿2

where α is a tunable weight on the adversarial term.
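The update described here (∇𝑊𝐿1, minus its projection onto ∇𝑊𝐿2, minus α times ∇𝑊𝐿2) can be sketched in a few lines of NumPy. This is a minimal illustration with made-up gradient values; the function name `debiased_update_direction` is hypothetical and the gradients are treated as flat vectors.

```python
import numpy as np


def debiased_update_direction(grad_l1, grad_l2, alpha):
    """Combines predictor and adversary gradients (both taken w.r.t. W)."""
    # The tiny constant guards against division by zero.
    unit_l2 = grad_l2 / (np.linalg.norm(grad_l2) + 1e-12)
    # Component of the prediction gradient that lies along the adversary's.
    projection = np.dot(grad_l1, unit_l2) * unit_l2
    return grad_l1 - projection - alpha * grad_l2


g1 = np.array([3.0, 4.0])  # made-up gradient of the prediction loss L1
g2 = np.array([0.0, 2.0])  # made-up gradient of the adversary loss L2

no_push = debiased_update_direction(g1, g2, alpha=0.0)
with_push = debiased_update_direction(g1, g2, alpha=0.5)
print(no_push, with_push)
```

With α = 0 the result has no component along the adversary’s gradient, so a step on 𝑊 cannot help the adversary to first order. With α > 0 the step also actively pushes against the adversary. 𝑊 would then be updated by a gradient-descent step along this direction.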


The description of how to incorporate adversarial networks into machine-learned models is very generic because the technique is applicable to any type of system which can be described in terms of an input 𝑋 being predictive of 𝑌 but potentially containing information about a protected variable 𝑍. So long as you can construct the relevant update functions, you can apply this technique. However, that doesn’t tell you much about the nature of 𝑋, 𝑌 and 𝑍. In the case of the word analogies task (“𝐴 is to 𝐵 as 𝐶 is to 𝐷”), 𝑋 = 𝐵 + 𝐶 − 𝐴 and 𝑌 = 𝐷. Figuring out what 𝑍 should be is a little bit trickier, though. For that, please refer to the paper by Bolukbasi et al., where they developed an unsupervised methodology for removing gendered semantics from word embeddings.

Now, let’s dive deep into the implementation of bias mitigation using adversarial learning.

The first step is to select pairs of words which are relevant to the type of bias you are trying to remove. In the case of gender, choose word pairs like “man”:“woman” and “boy”:“girl” which have gender as the only difference in their semantics. Taking the difference between the embeddings of each pair produces vectors in the embeddings’ semantic space which are roughly parallel to the semantics of gender. Performing Principal Components Analysis (PCA) on those vectors then gives the major components of the semantics of gender as defined by the gendered word pairs provided. So, let’s define the function that computes the first principal component of the embedding differences:

import numpy as np


def _np_normalize(v):
  """Returns the input vector scaled to unit length.

  Helper not shown in the original listing; assumed here to be plain
  L2 normalization.
  """
  return v / np.linalg.norm(v)


def find_gender_direction(embed,
                          indices):
  """Finds and returns a 'gender direction'."""
  pairs = [
      ("woman", "man"),
      ("her", "his"),
      ("she", "he"),
      ("aunt", "uncle"),
      ("niece", "nephew"),
      ("daughters", "sons"),
      ("mother", "father"),
      ("daughter", "son"),
      ("granddaughter", "grandson"),
      ("girl", "boy"),
      ("stepdaughter", "stepson"),
      ("mom", "dad"),
  ]
  # One difference vector per (female, male) word pair.
  m = []
  for wf, wm in pairs:
    m.append(embed[indices[wf]] - embed[indices[wm]])
  m = np.array(m)
  # the next three lines are just a PCA.
  m = np.cov(np.array(m).T)
  evals, evecs = np.linalg.eig(m)
  return _np_normalize(np.real(evecs[:, np.argmax(evals)]))


# Using the embeddings, find the gender vector.
gender_direction = find_gender_direction(embed, indices)
print("gender direction: %s" % str(gender_direction.flatten()))

Once you have the first principal component of the embedding differences, you can project the embeddings of words onto it. This projection can then be taken as the protected variable 𝑍 which the adversary is attempting to predict on the basis of the predicted value of 𝑌. Let’s now look at the words with the largest negative projection onto the gender dimension.

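import pandas as pd

# Collect every word that appears in any analogy.
words = set()
for a in analogies:
  words.update(a)  # assumed loop body; the listing is cut off here

# Score each word by its projection onto the gender direction.
df = pd.DataFrame(data={"word": list(words)})
df["gender_score"] = df["word"].map(
    lambda w: client.word_vec(w).dot(gender_direction))
df.sort_values(by="gender_score", inplace=True)
print(df.head(10))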

Let’s now look at the words with the largest positive projection onto the gender dimension.

df.sort_values(by="gender_score", inplace=True, ascending=False)
print(df.head(10))

Training the model

Training adversarial networks is hard. They are touchy, and if touched the wrong way, they blow up VERY quickly. One must be very careful to train both models slowly enough that the parameters in the models do not diverge. In practice, this usually entails significantly lowering the step size of both the classifier and the adversary. It is also probably beneficial to initialize the parameters of the adversary to be extremely small, to ensure that the classifier does not overfit against a particular (sub-optimal) adversary (such overfitting can very quickly cause divergence!). It is also possible that if the classifier is too good at hiding the protected variable from the adversary, then the adversary will impose updates that diverge in an effort to improve its performance. The solution to that can sometimes be to actually increase the adversary’s learning rate to prevent divergence (something almost unheard of in most learning systems). The same debiasing model for word embeddings can be found on my GitHub; please look into it to reproduce the experiment. Below is the code for training the model.

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API (sessions, placeholders)


def tf_normalize(x):
  """Returns the input tensor scaled to unit length.

  Helper not shown in the original listing; assumed here. The small constant
  avoids dividing by zero when the gradient is the zero vector.
  """
  return x / (tf.norm(x) + np.finfo(np.float32).tiny)


class AdversarialEmbeddingModel(object):
  """A model for doing adversarial training of embedding models."""

  def __init__(self, client,
               data_p, embed_dim, projection,
               projection_dims, pred):
    """Creates a new AdversarialEmbeddingModel.

    Args:
      client: The (possibly biased) embeddings.
      data_p: Placeholder for the data.
      embed_dim: Number of dimensions used in the embeddings.
      projection: The space onto which we are "projecting".
      projection_dims: Number of dimensions of the projection.
      pred: Prediction layer.
    """
    # load the analogy vectors as well as the embeddings
    self.client = client
    self.data_p = data_p
    self.embed_dim = embed_dim
    self.projection = projection
    self.projection_dims = projection_dims
    self.pred = pred

  def nearest_neighbors(self, sess, in_arr,
                        k):
    """Finds the nearest neighbors to a vector.

    Args:
      sess: Session to use.
      in_arr: Vector to find nearest neighbors to.
      k: Number of nearest neighbors to return

    Returns:
      List of up to k pairs of (word, score).
    """
    v = sess.run(self.pred, feed_dict={self.data_p: in_arr})
    return self.client.similar_by_vector(v.flatten().astype(float), topn=k)

  def write_to_file(self, sess, f):
    """Writes a model to disk."""
    # Assumed body (missing from the listing): save the projection matrix as
    # text so that read_from_file() can load it back.
    np.savetxt(f, sess.run(self.projection))

  def read_from_file(self, sess, f):
    """Reads a model from disk."""
    loaded_projection = np.loadtxt(f).reshape(
        [self.embed_dim, self.projection_dims])
    # Assumed final step (missing from the listing): copy the loaded values
    # into the projection variable.
    sess.run(self.projection.assign(loaded_projection))

  def fit(self, sess, data, data_p, labels, labels_p, protect, protect_p,
          gender_direction, pred_learning_rate, protect_learning_rate,
          protect_loss_weight, num_steps, batch_size, debug_interval):
    """Trains a model.

    Args:
      sess: Session.
      data: Features for the training data.
      data_p: Placeholder for the features for the training data.
      labels: Labels for the training data.
      labels_p: Placeholder for the labels for the training data.
      protect: Protected variables.
      protect_p: Placeholder for the protected variables.
      gender_direction: The vector from find_gender_direction().
      pred_learning_rate: Learning rate for predicting labels.
      protect_learning_rate: Learning rate for protecting variables.
      protect_loss_weight: The constant 'alpha' in the update rule for W.
      num_steps: Number of training steps.
      batch_size: Number of training examples in each step.
      debug_interval: Frequency at which to log performance metrics during
        training.
    """
    feed_dict = {
        data_p: data,
        labels_p: labels,
        protect_p: protect,
    }
    # define the prediction loss
    pred_loss = tf.losses.mean_squared_error(labels_p, self.pred)

    # compute the prediction of the protected variable.
    # The "trainable"/"not trainable" designations are for the predictor. The
    # adversary explicitly specifies its own list of weights to train.
    protect_weights = tf.get_variable(
        "protect_weights", [self.embed_dim, 1], trainable=False)
    protect_pred = tf.matmul(self.pred, protect_weights)
    protect_loss = tf.losses.mean_squared_error(protect_p, protect_pred)

    pred_opt = tf.train.AdamOptimizer(pred_learning_rate)
    protect_opt = tf.train.AdamOptimizer(protect_learning_rate)

    # Gradient of the adversary's loss w.r.t. each predictor variable.
    protect_grad = {v: g for (g, v) in pred_opt.compute_gradients(protect_loss)}
    pred_grad = []
    # applies the update rule for W: remove the component of the prediction
    # gradient that lies along the adversary's gradient, then subtract alpha
    # times the adversary's gradient.
    for (g, v) in pred_opt.compute_gradients(pred_loss):
      unit_protect = tf_normalize(protect_grad[v])
      # the two lines below can be commented out to train without debiasing
      g -= tf.reduce_sum(g * unit_protect) * unit_protect
      g -= protect_loss_weight * protect_grad[v]
      pred_grad.append((g, v))
    pred_min = pred_opt.apply_gradients(pred_grad)

    # compute the loss of the protected variable prediction.
    protect_min = protect_opt.minimize(protect_loss, var_list=[protect_weights])

    # Assumed step (not in the listing): initialize all variables, including
    # the optimizers' internal state, before training.
    sess.run(tf.global_variables_initializer())

    step = 0
    while step < num_steps:
      # pick samples at random without replacement as a minibatch
      ids = np.random.choice(len(data), batch_size, False)
      data_s, labels_s, protect_s = data[ids], labels[ids], protect[ids]
      sgd_feed_dict = {
          data_p: data_s,
          labels_p: labels_s,
          protect_p: protect_s,
      }
      if not step % debug_interval:
        metrics = [pred_loss, protect_loss, self.projection]
        metrics_o = sess.run(metrics, feed_dict=feed_dict)
        pred_loss_o, protect_loss_o, proj_o = metrics_o
        # log stats every so often: number of steps that have passed,
        # prediction loss, adversary loss
        print("step: %d; pred_loss_o: %f; protect_loss_o: %f" % (step,
              pred_loss_o, protect_loss_o))
        for i in range(proj_o.shape[1]):
          print("proj_o: %f; dot(proj_o, gender_direction): %f)" %
                (np.linalg.norm(proj_o[:, i]),
                 np.dot(proj_o[:, i].flatten(), gender_direction)))
      # One training step for both the predictor and the adversary.
      sess.run([pred_min, protect_min], feed_dict=sgd_feed_dict)
      step += 1

def filter_analogies(analogies,
                     index_map):
  """Keeps analogies whose words are all in index_map, as lists of indices."""
  filtered_analogies = []
  for analogy in analogies:
    if not all(word in index_map for word in analogy):
      print("at least one word missing for analogy: %s" % (analogy,))
    else:
      filtered_analogies.append([index_map[word] for word in analogy])
  return filtered_analogies


def make_data(
    analogies, embed,
    gender_direction):
  """Preps the training data.

  Args:
    analogies: a list of analogies
    embed: the embedding matrix from load_vectors
    gender_direction: the gender direction from find_gender_direction

  Returns:
    Three numpy arrays corresponding respectively to the input, output, and
    protected variables.
  """
  data = []
  labels = []
  protect = []
  for analogy in analogies:
    # the input is just the word embeddings of the first three words
    # (assumed line, reconstructed from the comment above)
    data.append(embed[analogy[:3]])
    # the output is just the word embeddings of the last word
    # (assumed line, reconstructed from the comment above)
    labels.append(embed[analogy[3]])
    # the protected variable is the gender component of the output embedding.
    # the extra pair of [] is so that the array has the right shape after
    # it is converted to a numpy array.
    protect.append([np.dot(embed[analogy[3]], gender_direction)])
  # Convert all three to numpy arrays, and return them.
  return tuple(map(np.array, (data, labels, protect)))

The adversarial method helps to reduce the amount of bias in word embeddings and generalizes quite well to other domains and tasks. By trying to hide a protected variable from an adversary, a machine-learned system can reduce the amount of biased information about that protected variable implicit in the system. In addition to the specific method shown here, there are many variations on this theme which can be used to achieve different degrees and types of debiasing.

Hope you enjoyed reading this story and found it helpful. Thank You.


