The Fairness of Finding People’s Names in Text

Liz Gallagher · Wellcome Data
Jan 19, 2021

Is Our Algorithm Better at Finding John Smith than Other Names?

Figure 1 Is our model equally good at identifying all these people’s names from this COVID-19 WHO policy brief document?

We trained a named entity recognition (NER) model to find people's names in policy document text, as described in this article. The repercussions of this model performing differently for different names could be quite damaging. For example, what if the model worked best at finding popular US male names but was less effective at finding minority ethnic names? We don't want our tool to contribute to issues of exclusion and under-representation in academic research.

As outlined in previous Data Labs articles (here and here), as a team we are embedding fairness assessment into the tools and models we build, so in this blog post I will describe how we did this for our NER model using a synthetic dataset.

This post has three parts:

  1. How we created the synthetic dataset,
  2. Our fairness results (TL;DR the model is unfair, but not too bad),
  3. Retraining the model using synthetic data (TL;DR the model performs worse, but the model fairness improves).

How we did it — creating synthetic data

In order to evaluate the fairness of the NER model, we need to tag each name in a corpus of texts with metadata about the name. For example, is it a predominantly male, female or neutral name? Is it common to a particular ethnicity? That way we can calculate and compare the performance of the model on various subsets of the data: for example, how does it do at predicting just the male names compared to just the female names?

To get this dataset of text and name metadata, we could simply tag the names in the model's test data with the relevant metadata. However, doing this by hand would be time-consuming and prone to its own biases (what if you came across a name you hadn't heard of before?), so we opted to create a synthetic dataset.

Policy text templates

For this we needed to take sections of policy document text, and tag the first names and/or surnames. We then used these tagged text fragments as templates to insert other names into.

Figure 2 Creating synthetic data from an original policy document text.
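To make this concrete, here is a minimal sketch of how a tagged text fragment can be turned into a template and re-filled with a new name. The placeholder format, helper name, labels and example sentence are illustrative assumptions, not our actual implementation.

```python
# Minimal sketch: a template is the original text split into fragments, with the
# tagged name tokens marked by a label. Helper and label names are illustrative.

def make_synthetic_example(template, first_name, surname):
    """Fill a template with a first name/surname pair and return the text
    plus the character spans of the inserted entities for evaluation."""
    text = ""
    entities = []  # (start_char, end_char, label)
    for token, label in template:
        if label == "FIRST_NAME":
            token = first_name
        elif label == "SURNAME":
            token = surname
        start = len(text)
        text += token
        if label is not None:
            entities.append((start, len(text), label))
        text += " "
    return text.strip(), entities

# An invented policy-text fragment with its tagged name slots:
template = [
    ("The report was prepared by", None),
    ("FIRST_NAME", "FIRST_NAME"),
    ("SURNAME", "SURNAME"),
    ("for the WHO policy brief.", None),
]

text, spans = make_synthetic_example(template, "Amara", "Okafor")
# text  -> "The report was prepared by Amara Okafor for the WHO policy brief."
# spans -> character offsets of "Amara" and "Okafor", labelled FIRST_NAME / SURNAME
```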

Names datasets

The names we chose to insert into the text templates to create the synthetic data came from a variety of datasets. The ones we included are:

Datasets of people's names from the ONS, Wikipedia, NLTK, Harvard Dataverse, and the US Census.

Most of these can be further broken down by gender, country, continent, and ethnicity.
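As a rough illustration (the column names and example rows below are assumptions, not the real datasets), each names dataset can be thought of as a table of names plus metadata, so that a fairness subgroup is simply a filter on that table:

```python
import pandas as pd

# Illustrative rows only; the real datasets are much larger.
names = pd.DataFrame([
    {"name": "Oliver", "source": "ONS",       "gender": "male",   "country": "UK"},
    {"name": "Olivia", "source": "ONS",       "gender": "female", "country": "UK"},
    {"name": "Nguyen", "source": "Wikipedia", "gender": None,     "country": "Vietnam"},
    {"name": "Garcia", "source": "US Census", "gender": None,     "country": "US"},
])

# A subgroup for the fairness evaluation is then just a filter,
# e.g. UK female first names:
uk_female_first_names = names[
    (names["country"] == "UK") & (names["gender"] == "female")
]["name"].tolist()
```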

Synthetic data

Thus, by inserting names from the different datasets into the policy text templates, we can create various groups of synthetic data between which to evaluate fairness.

In total, we tagged first names and surnames in 157 text extracts from a random selection of policy documents. In these we tagged 506 surname entities and 177 first name entities. To create the synthetic data, each of the 157 text extracts was repeated 50 times with different random names inserted from one of the datasets. Thus, we evaluate the model on a total of 8,850 first name entities and 25,300 surname entities.
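Building on the template helper sketched earlier, the generation step might look something like this. It is a sketch under the assumption that each group is a pair of first-name and surname lists; the function and variable names are made up.

```python
import random

def build_group(templates, first_names, surnames, repeats=50, seed=42):
    """Fill every template `repeats` times with names sampled from one dataset."""
    rng = random.Random(seed)
    examples = []
    for template in templates:
        for _ in range(repeats):
            text, spans = make_synthetic_example(
                template, rng.choice(first_names), rng.choice(surnames)
            )
            examples.append((text, spans))
    return examples

# With 157 templates and 50 repeats this yields 7,850 synthetic texts per group,
# containing the 8,850 first name and 25,300 surname entities quoted above.
```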

Fairness Results — there are differences in how well different names perform


This is how well our model performed on some of the different groups of synthetic data (with the model's test data performance at the top); more results are given in Appendix 1:

How well the model performs on different groups of names.

We debated as a team about how to interpret the results of the different performance metrics. It's common just to look at accuracy or F1, but it is better to look at a range of metrics, as each tells you something different about how the algorithm is performing. For example, F1 can be misleading in cases where opposite types of error occur in different groups, i.e. male names may frequently be mistakenly assigned (false positives) and female names may frequently not be found (false negatives), but the overall F1 for both groups may still be the same.
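A toy example (the counts below are made up) shows how this can happen: the two groups have opposite error profiles but an identical F1.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Group A: names often missed (many false negatives)
print(precision_recall_f1(tp=90, fp=10, fn=60))  # (0.90, 0.60, 0.72)

# Group B: names often wrongly tagged (many false positives)
print(precision_recall_f1(tp=90, fp=60, fn=10))  # (0.60, 0.90, 0.72)
```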

Significance and differences in groups’ results — how do we know what is “too” different?

Thus, from the results above we could say that certain pairs of results have an X% difference in a metric, but is this difference significant? We repeated the evaluation of each dataset 20 times with different independent subsets of the synthetic data, allowing us to perform a two-sample t-test to see whether we can reject the null hypothesis that the means of these sets of results are the same.
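In code, the test boils down to something like the following (the per-run F1 scores here are placeholders, not our actual numbers):

```python
from scipy import stats

# 20 independent evaluation runs per group (placeholder F1 values).
f1_group_a = [0.81, 0.79, 0.83, 0.80, 0.82] * 4
f1_group_b = [0.76, 0.78, 0.75, 0.77, 0.74] * 4

t_stat, p_value = stats.ttest_ind(f1_group_a, f1_group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means differ significantly.")
```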

To carry out the test, we first identified pairs of datasets that it would be interesting to compare and ran the t-test on each pair. When the p-value is less than 0.05, we can say there is a significant difference between the two datasets in the average of these 20 evaluation results. Some of the pairs we found to be significant in at least one of the three metrics are given below, with the full results in Appendix 2:

The difference in model results between two groups of names. *The p-value for the t-test is less than 0.05

We didn't find significant differences in how the model performed on any of the metrics between male and female names when using the NLTK and the Wikipedia data. We also didn't see statistically significant differences between white US surnames and Asian Pacific Islander or Hispanic surnames, or between the white US first names and Black US first names datasets. Thus it seems we are generally only seeing gender and ethnicity biases when using the UK first name and Wikipedia datasets, respectively.

Conclusions — Asian first names perform worst, US first names perform best

Thus, we have found some biases with our NER model. It performs significantly better for US surnames than for double-barrelled US surnames (a difference of around 10% on all metrics). Another difference in results is between United States Wikipedia first names and Asia Wikipedia first names: the latter perform about 5% worse on F1, 7% worse on recall, and 2% worse on precision. In general, comparing US and UK Wikipedia first names with non-UK and non-US first names, the latter perform 4% worse on F1, 2% worse on precision, and 6% worse on recall. Comparing performance by gender using the NLTK or the Wikipedia data shows no significant difference in results; however, comparing the UK male first names and UK female first names datasets does show a significant difference on the recall metric, though only of 2%.

To conclude, we see the worst performance when popular first names in Asia are used and the best when popular first names in the US are used.

We note that it is not straightforward to tell how much of a difference between groups is a noteworthy difference, even if the difference is statistically significant. We decided to take a user-centred approach and thought about the impact of a 10 point difference between groups (the largest difference we found). Roughly speaking, if the model is used 100 times, it will perform worse for one of the groups in 10 of those uses. Is this noteworthy? Considering that our tool currently has very few users, we decided that this was not something that would significantly impact users, but it is something we will need to review in the future.

Retraining using synthetic data — a trade-off of our model’s general performance and being less unfair

After creating this synthetic data we thought we’d see what effect it had on the training of the model. Does it perform better on real test data when trained on these extra names?

We decided to add synthetic data only from the 4 worst-performing names datasets: 'Wikipedia first names — Not US or UK', 'US surnames — double barrelled', 'UK first names — female', and 'Wikipedia first names — Asia'. We added one random synthetic data point from every text template in the training data for each of the 4 datasets, therefore multiplying the size of the training set by 4. To avoid over-fitting, it was important not to use the same text templates in the training and test sets. In comparison to our previous model, where no synthetic data was added to the training data, we see a slight increase in the precision metric but a big decrease in the recall metric. Since our training data now has a different distribution to our test data, this result might be expected: the problem has become harder.

Model results when adding synthetic data to the training data.
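For concreteness, the augmentation step described above might look roughly like this. The group contents and helper names are placeholders, and make_synthetic_example is the helper sketched earlier in this post.

```python
import random

def augment_training_data(training_examples, training_templates, name_groups, seed=0):
    """Add one randomly filled synthetic example per template for each name group."""
    rng = random.Random(seed)
    augmented = list(training_examples)
    for first_names, surnames in name_groups:
        for template in training_templates:
            text, spans = make_synthetic_example(
                template, rng.choice(first_names), rng.choice(surnames)
            )
            augmented.append((text, spans))
    return augmented

# name_groups would hold (first_names, surnames) lists for the four groups:
# Wikipedia non-US/UK, double-barrelled US surnames, UK female, Wikipedia Asia.
```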

We can also evaluate the fairness results again on this newly trained model. The results show an improvement (I’ve given the previous results in brackets for ease of comparison):

The difference in model results between two groups of names, using a model trained with synthetic data. For ease of comparison, the previous results from the model with no synthetic training data are given in brackets. *The p-value for the t-test is less than 0.05

Final thoughts

As might be expected, our analysis of fairness in our NER model showed that the model performs slightly better for US/UK first names. Furthermore, the largest difference is between US first names and Asian first names. We have also shown that adding synthetic training data does not necessarily improve the model's overall performance, but it does improve the fairness of the model.

We hope we have highlighted a relatively simple way in which you can uncover model unfairness by creating synthetic data; our approach is similar to [1]. We've also shown how in this analysis you may be faced with a decision about improving fairness at the expense of the model's overall performance. A nice review of some tools for evaluating model fairness and bias is given in this post.

References

[1] DiCiccio, C., Vasudevan, S., Basu, K., Kenthapadi, K., & Agarwal, D. (2020, August). Evaluating fairness using permutation tests. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1467–1477).
