Responsible AI Series Part III: Right Monitoring

by Batuhan Akcay, Software Engineer at Eightfold.ai

Published in Engineering at Eightfold.ai, May 6, 2023

As part of our series of blog posts on responsible practices in AI, Part III takes a deeper look at the following aspects: active monitoring, perturbation testing, and external audits.

Active Monitoring

In line with our principles, we have laid the groundwork for extensively monitoring the performance of our AI models across different aspects, including latency and bias.

The latency and accuracy metrics are fed into a dashboard that the engineers on the team review on a regular cadence to ensure performance stays within acceptable parameters. Alarms on these metrics are triggered when a value crosses a predetermined threshold, notifying the engineers and prompting an immediate response. In addition, for all the job positions that users accessed in production, we calculate the median of the probabilities of every profile’s match to the position for which it was considered, and plot it over time (Figure 1). Our standard is for this graph to have a generally flat trend.
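The median check described above can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the daily grouping, the function names `daily_medians` and `drift_alerts`, the baseline, the tolerance, and the sample values are all assumptions made for the example.

```python
from statistics import median

def daily_medians(scores_by_day):
    """Map each day to the median match probability observed that day."""
    return {day: median(scores) for day, scores in sorted(scores_by_day.items())}

def drift_alerts(medians, baseline, tolerance):
    """Return the days whose median deviates from the baseline by more than tolerance."""
    return [day for day, m in medians.items() if abs(m - baseline) > tolerance]

# Hypothetical match probabilities logged per day (illustrative values only).
scores_by_day = {
    "2023-05-01": [0.40, 0.50, 0.60],
    "2023-05-02": [0.45, 0.50, 0.55],
    "2023-05-03": [0.70, 0.80, 0.90],
}

medians = daily_medians(scores_by_day)
# Baseline and tolerance are assumed values; in practice they would be
# chosen from the historical behavior of the model.
alerts = drift_alerts(medians, baseline=0.50, tolerance=0.10)
print(alerts)
```

With these sample values, only the third day departs from the flat trend, so it is the only day flagged. The median is used rather than the mean because it is less sensitive to a handful of extreme scores.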

Figure 1: Probability prediction over time

We curate continuously growing golden datasets using a human-in-the-loop approach for all our currently active models, and models in production are evaluated on those datasets on a regular basis. In addition, we have dashboards that generate the bias metrics discussed earlier across varying parameters and granularities, so that we can observe and address any anomalies.

Perturbation Testing

Adverse impact analysis evaluates the fairness of the match score model across different candidates and provides a global picture of its fairness. Perturbation testing evaluates fairness at a more granular level, on an individual candidate basis. This is done by using two slightly different resumes to create the candidate data used by the match score model. One is the candidate's original resume; the other is a slightly modified version in which some text is changed to imply that the candidate may belong to a different gender/race subgroup.

An example of a resume modification used in perturbation testing is shown in Figure 2, where the candidate's name in the resume is replaced to imply that the candidate may belong to another gender subgroup.

Figure 2: A pair of resumes used in perturbation testing. The resume on the left is the original and the resume on the right is the perturbed version.
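As a rough illustration of the kind of name replacement described here, the sketch below swaps a name in plain resume text. The `NAME_SWAPS` table and the `perturb_resume` helper are hypothetical and exist only for this example; a real perturbation would rely on a curated, reviewed set of modifications.

```python
import re

# Hypothetical name pairs used only for illustration.
NAME_SWAPS = {"John": "Jane", "Michael": "Michelle"}

def perturb_resume(text, swaps=NAME_SWAPS):
    """Return a copy of the resume text with each listed name swapped."""
    # Match whole words only, so "Johnson" is not rewritten to "Janeson".
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], text)

original = "John Smith\nSoftware Engineer with 5 years of experience."
perturbed = perturb_resume(original)
print(perturbed)
```

Everything except the name is left untouched, so any change in the match score can be attributed to the modification itself.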

Methodology

To evaluate the fairness of the match score model at this more granular level, a perturbation test takes a resume modification, a position, and a list of candidates, and measures whether the match scores for the position are statistically similar between the candidates using the original resumes and the candidates using the modified resumes.

After computing the match scores for a position, both for candidates using the original resumes and for candidates using the modified resumes, an Independent Samples T-Test is used to compute the t-score and p-value for the null hypothesis that the two sets of match scores have identical mean (expected) values. The t-score quantifies the difference between the means relative to the variability of the scores, and the p-value is the probability of obtaining a t-score with an absolute value at least as large as the one observed if the null hypothesis is true.

A high p-value means there is no strong evidence that the difference between the means is statistically significant. Therefore, a low t-score and a high p-value suggest that the difference in match scores between candidates using the original resumes and candidates using the modified resumes is small and not statistically significant, so it cannot be claimed that the resume modifications introduce bias. Conversely, a high t-score (in absolute value) and a low p-value suggest that the difference is large and statistically significant, so it can be claimed that the resume modifications introduce bias.
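To make the test concrete, here is a minimal, standard-library-only sketch of an independent samples t-test. It assumes the pooled-variance (equal variance) form of the test, which may differ from the exact variant we run; the function names, the sample scores, and the numerical integration used for the p-value are all choices made for this example.

```python
import math
from statistics import mean, variance

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t|.

    The integral is approximated with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)

def independent_t_test(a, b):
    """Pooled-variance independent samples t-test; returns (t, p).

    Assumes each sample has at least two values and the pooled variance
    is non-zero.
    """
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, two_sided_p(t, df)

# Hypothetical match scores for the same position (illustrative values only).
original_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
perturbed_scores = [0.61, 0.72, 0.54, 0.79, 0.68]

t_score, p_value = independent_t_test(original_scores, perturbed_scores)
print(f"t = {t_score:.3f}, p = {p_value:.3f}")
```

For these sample scores the means are almost identical, so the t-score is close to zero and the p-value is close to one, which is the "no evidence of bias" outcome described above. In practice a library routine such as SciPy's `scipy.stats.ttest_ind` would normally be used instead of hand-rolled integration.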

Below are the formulas used to compute t-score and p-value for the Independent Samples T-Test used in the perturbation tests:

Figure 3: Formulas used for Independent Samples T-Test in Perturbation Testing
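For readers who cannot view the figure, the standard textbook form of the pooled-variance independent samples t-test is given below. This is an assumption about which variant is meant; the figure may show a different but related form, such as Welch's unequal-variance test. Here \bar{x}_1 and \bar{x}_2 are the mean match scores for the original and modified resumes, s_1^2 and s_2^2 are the sample variances, n_1 and n_2 are the sample sizes, and T_\nu is a Student's t random variable with \nu degrees of freedom.

```latex
\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2}
\]
\[
p = 2 \, P\left(T_\nu \ge |t|\right),
\qquad
\nu = n_1 + n_2 - 2
\]
```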

External Audits

External bias audits provide an objective perspective from industry experts on biases within AI systems. At Eightfold, we use external bias audits to build trust with stakeholders, customers, and the public, and to demonstrate our commitment to transparency and fairness.

Next Parts of Responsible Practices in AI

Thank you for reading this blog post as part of the Responsible Practices in AI series. If you enjoyed it, keep an eye out for the next part, which will be released next week.