Responsible AI Series Part II: Right Governance

by Yovahn Hoole, Software Engineer at Eightfold.ai

Manav Mehra
Engineering at Eightfold.ai
10 min read · May 6, 2023

--

As part of our series of blog posts on Responsible practices in AI, we would like to take a deep dive into the following aspect for Part II:

Right Governance

In recent years, employers and regulators alike have grown increasingly cognizant of the ethical implications of using AI-based tools in connection with employment decisions.

A key aspect in mitigating these risks is a robust and transparent methodology for measuring AI bias in selection processes. At Eightfold, we seek a solution that effectively bridges the gap between model evaluation frameworks in place today and the decades of research in employment law and adverse impact analysis. Model evaluation frameworks focus on a machine learning model’s ability to understand and generalize patterns within a dataset. In the context of algorithmic fairness, these frameworks help answer the question:

“Is the performance of the model employed by the recruitment tool dependent on subgroup membership?”

In cases where the underlying data is biased, however, even a model that performs equally well across subgroups can result in unequal outcomes. To this end, adverse impact analysis broadly covers the analysis of disparities in employment outcomes. As a result, adverse impact analysis helps answer the question:

“Does the use of the recruitment tool in question result in disparate outcomes across subgroups?”

Both model evaluation frameworks and adverse impact analysis provide unique insights into algorithmic fairness, and both are part of how Eightfold measures bias as applied to real-world data.

Adverse Impact Analysis

Background

Methodologies for adverse impact analysis have historically been employed to evaluate and analyze adverse impact in human decisions. Given the scale of data at which AI operates, some of the assumptions behind these methodologies do not necessarily hold, which can lead to misleading results. We can, however, take inspiration from these methodologies to develop tests that are applicable at different scales of data.

A core component of adverse impact analysis examines selection rate differences among subgroups. It is intended to assess disparities in selection processes. Even unbiased selection processes, when evaluated on a finite sample, may result in selection rate differences due to sampling error. Significance testing is the process through which selection rate differences that are potentially indicative of discrimination are distinguished from those that occur simply due to chance.

In statistical significance testing, a null hypothesis about the total population is tested against a sample of the population. In the context of adverse impact analysis, the null hypothesis is that there is no substantial difference in selection rates between two subgroups. Under a set of assumptions, a statistical significance test evaluates the null hypothesis against applicant flow data by determining the probability of observing the selection rates seen in the sample if the null hypothesis were true. When this probability is below a certain threshold, the selection rate differences are deemed statistically significant. When this probability is greater than the threshold, the differences in selection rates are not significant enough to reject the null hypothesis. As non-significant differences can also result from insufficient data due to small sample sizes, a failure to reject the null hypothesis does not necessarily imply an impartial selection process.

Tests of statistical significance have their share of limitations. Type I and Type II error rates express the probabilities of a test producing a false positive and a false negative, respectively. Statistical power is the complement of the Type II error rate and denotes the probability that a test will correctly reject the null hypothesis when a substantial difference is present. In an ideal world, both Type I and Type II error rates would be low; however, reducing one type of error often results in increasing the other. In the development of a testing framework, we seek a balance between the two.
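To make this trade-off concrete, here is a rough, illustrative simulation (a sketch only, not a description of Eightfold's methodology; the group size, selection rates, and the 1.96 cutoff are assumptions chosen for the example). It estimates the Type I error rate and the statistical power of a pooled two-proportion Z test of the kind discussed under Approach 1 below:

```python
import numpy as np

def z_statistic(sel_f, n_f, sel_c, n_c):
    """Pooled two-proportion Z statistic for selection rate differences."""
    p_f, p_c = sel_f / n_f, sel_c / n_c
    p_t = (sel_f + sel_c) / (n_f + n_c)                  # overall selection rate
    se = np.sqrt(p_t * (1 - p_t) * (1 / n_f + 1 / n_c))
    return (p_f - p_c) / se

def rejection_rate(p_focal, p_comp, n=200, trials=20_000, z_crit=1.96, seed=0):
    """Fraction of simulated samples in which |Z| exceeds the critical value."""
    rng = np.random.default_rng(seed)
    sel_f = rng.binomial(n, p_focal, size=trials)        # focal-group selections
    sel_c = rng.binomial(n, p_comp, size=trials)         # comparator-group selections
    z = z_statistic(sel_f, n, sel_c, n)
    return np.mean(np.abs(z) > z_crit)

# Type I error: both groups truly have a 30% selection rate (null is true).
print("Type I error rate ~", rejection_rate(0.30, 0.30))
# Power: a genuine 30% vs 20% gap in selection rates (null is false).
print("Power             ~", rejection_rate(0.30, 0.20))
```

Tightening the critical value lowers the Type I error rate but also lowers the power, which is exactly the tension described above.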

Additionally, when large sample sizes are used in statistical significance testing, even small, practically insignificant differences can be statistically significant. To alleviate such concerns, practical significance testing is used. Practical significance tests offer domain-specific heuristics that are used to determine whether a difference has a meaningful impact in the real world. At large sample sizes, where statistical tests can flag practically meaningless differences, practical significance tests are a useful complement. However, practical significance tests may be unreliable at small sample sizes.

Methodology

Approach 1

A commonly used way to structure adverse impact analysis is a 2 by 2 contingency table. The contingency table compares the selection rates of a given process between a focal and a comparator group. The focal and comparator groups are two subgroups within a protected category that we want to compare. In the context of match scores, a candidate is, for purposes of this approach, considered “selected” if the match score they received is greater than some cut-off score T. The simulated selection rates cannot be controlled due to the nature of the computation and depend solely on the model’s predictions and the thresholds set. The comparison of selection rates is structured as follows (Table 1, shown here with symbolic counts):

Group          Selected    Not Selected    Total
Focal          a           b               a + b
Comparator     c           d               c + d
Total          a + c       b + d           N

The Selected column represents the number of applicants who had a score above T. The Not Selected column represents the number of applicants who had a score below T. The primary attribute analyzed in adverse impact analysis is the selection rate. In terms of the counts in Table 1, the selection rates for the focal group, the comparator group, and the overall applicant pool are defined as follows:

$$SR_{\text{focal}} = \frac{a}{a + b}, \qquad SR_{\text{comparator}} = \frac{c}{c + d}, \qquad SR_{\text{total}} = \frac{a + c}{N}$$

To illustrate the application of Table 1, consider the following scenario: a given position receives 100 applicants. Of these 100 applicants, 15 applicants declared their race as Asian, 25 declared their race as Black, and 60 declared another race or chose not to declare their race/ethnicity. A recruiter then uses a cut-off match score of 3.5 to filter out applicants. Of the applicants who declared their race/ethnicity, 7 Asians out of 15 received a match score greater than or equal to 3.5 and were thus “selected.” Similarly, 14 out of 25 Black applicants who declared their race/ethnicity received a score of 3.5 or above and were selected. In this scenario, the generated contingency table will be:

Group     Selected    Not Selected    Total
Asian     7           8               15
Black     14          11              25
Total     21          19              40

The corresponding selection rates are roughly 46.7% for Asian applicants and 56.0% for Black applicants, with an overall selection rate of 52.5% among applicants who declared their race/ethnicity.

The goal of this analysis is to determine whether applying such a cut-off score will lead to adverse impact. Contingency tables such as the one above provide a digestible view of applicant flow across two subgroups of a protected category and also simplify statistical calculations.
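As a minimal sketch of this bookkeeping (the counts are simply the illustrative numbers from the example above, and the helper function is our own naming, not part of any particular library):

```python
# Illustrative counts from the example above: (selected, not selected) per group.
contingency = {
    "Asian": (7, 8),    # 15 applicants, 7 scored at or above the 3.5 cut-off
    "Black": (14, 11),  # 25 applicants, 14 scored at or above the 3.5 cut-off
}

def selection_rate(selected: int, not_selected: int) -> float:
    """Selection rate = selected / total applicants in the group."""
    return selected / (selected + not_selected)

group_rates = {g: selection_rate(s, ns) for g, (s, ns) in contingency.items()}
overall_rate = selection_rate(
    sum(s for s, _ in contingency.values()),
    sum(ns for _, ns in contingency.values()),
)

print(group_rates)   # {'Asian': 0.467, 'Black': 0.56} (approximately)
print(overall_rate)  # 0.525
```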

As to statistical tests, the first test we will consider is the Z (or Two Standard Deviation) test which, in its standard pooled form, is calculated as follows:

$$Z = \frac{SR_{\text{focal}} - SR_{\text{comparator}}}{\hat{\sigma}}$$

where $\hat{\sigma}$ is the estimated standard deviation of the difference in selection rates under the null hypothesis, defined below.

This test is used to determine the statistical significance of selection rate differences. When the absolute value of the test statistic is greater than 1.96 (i.e., Z < -1.96 or Z > 1.96), the test indicates a statistically significant difference between the two selection rates. At an intuitive level, the test assumes that, under the null hypothesis, the differences in selection rates are normally distributed with a mean centered at 0 and a standard deviation estimated from the contingency table as:

$$\hat{\sigma} = \sqrt{SR_{\text{total}}\,(1 - SR_{\text{total}})\left(\frac{1}{n_{\text{focal}}} + \frac{1}{n_{\text{comparator}}}\right)} \qquad \text{(Eq. 1)}$$

where $n_{\text{focal}} = a + b$ and $n_{\text{comparator}} = c + d$ are the group sizes from Table 1.

The estimation of the standard deviation from the contingency table, and in particular its reliance on the sample sizes through the term $\left(\frac{1}{n_{\text{focal}}} + \frac{1}{n_{\text{comparator}}}\right)$ in the above equation (Eq. 1), results in a Z statistic that increases monotonically with sample size for a fixed difference in selection rates. Consider a scenario in which the overall selection rate is fixed at 30% and the selection rates of the focal and comparator groups differ by 1%. Further assume that the number of applicants from the focal and comparator groups are equal (both equal to n), so that the test statistic simplifies to:

$$Z = \frac{SR_{\text{focal}} - SR_{\text{comparator}}}{\sqrt{\frac{2\,SR_{\text{total}}\,(1 - SR_{\text{total}})}{n}}} = \frac{0.01}{\sqrt{\frac{2 \times 0.3 \times 0.7}{n}}}$$

The following plot (Fig. 1) shows the value of the Z statistic as the number of applicants from each group varies from 2 to 50,000.

As can be seen from the above figure, the same difference in selection rates increases in statistical significance as the sample size increases. Practically, however, an absolute difference of 1% in selection rates may not be a significant difference regardless of the sample size. Intuitively, as the number of applications increases, the estimated standard deviation decreases. As a result, even small differences in selection rates may be more than 2 standard deviations away from 0. Particularly at the scale of millions of applications, the Z test becomes an unreliable indicator of bias.
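The effect is easy to reproduce; the short sketch below (illustrative only, with the 30% overall rate and 1% gap taken from the scenario above and our own variable names) evaluates the simplified formula across sample sizes:

```python
import numpy as np

def z_for_equal_groups(n, overall_rate=0.30, rate_gap=0.01):
    """Z statistic for a fixed selection-rate gap when both groups have n applicants."""
    sigma = np.sqrt(overall_rate * (1 - overall_rate) * 2 / n)
    return rate_gap / sigma

for n in [2, 100, 1_000, 10_000, 50_000]:
    print(f"n = {n:>6}: Z = {z_for_equal_groups(n):.2f}")

# The same 1% gap crosses the |Z| > 1.96 threshold at roughly 16,000 applicants
# per group, even though the gap itself never changes.
```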

In these cases of very large sample sizes, commonly used practical significance tests such as the 4/5ths rule can be more reliable. The 4/5ths rule [REF] is a guideline built around the adverse impact ratio (IR), defined as

$$IR = \frac{SR_{\text{focal}}}{SR_{\text{comparator}}}$$

which should fall between 0.8 and 1.25. When the IR is below 1, it is an indicator that the comparator group is preferred over the focal group, and when the IR is above 1 it is an indicator that the focal group is preferred over the comparator group. In a perfectly neutral process the ratio would be 1; the 4/5ths rule, however, sets the guideline that slight deviations from 1 will generally not be considered a substantially different rate of selection, while ratios outside of the 0.8 to 1.25 range will generally be considered a substantially different rate of selection. As the 4/5ths rule’s notion of significance is independent of sample size, it provides practically useful results at large sample sizes. At small sample sizes, however, selecting one more applicant from the disadvantaged group instead of the advantaged group can flip the result of the test. The notion of statistical significance makes statistical significance tests robust to such small perturbations at small sample sizes. As practical significance tests do not have such an understanding, additional heuristics such as the “flip flop” rule are applied in practice to make the 4/5ths rule more robust at small sample sizes.

A notable limitation of the IR can be observed at extremely low selection rates (<5%). When overall selection rates are low, small differences in selection rates have a much larger impact on the IR than at high selection rates. To understand this point, assume that there are 100 male applicants and 100 female applicants. Of these, 1 male applicant is selected and 2 female applicants are selected. The selection rates for males and females are 1% and 2% respectively, and the impact ratio with men as the focal group is 0.5, which falls below the threshold of 0.8. Conversely, if 4 male applicants and 5 female applicants were selected, the impact ratio is 0.8 (4% / 5%), which just passes the 4/5ths rule. In essence, the same difference of 1 additional selection yields a significant result at low selection rates and an insignificant result at slightly higher selection rates.
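A small sketch of this sensitivity (the counts are just the two illustrative scenarios above; the helper functions and the rounding safeguard are our own choices, not a prescribed implementation):

```python
def impact_ratio(selected_focal, total_focal, selected_comp, total_comp):
    """Adverse impact ratio: focal selection rate divided by comparator selection rate."""
    return (selected_focal / total_focal) / (selected_comp / total_comp)

def within_four_fifths(ir, low=0.8, high=1.25):
    """4/5ths-rule guideline: flag ratios outside the 0.8 to 1.25 band."""
    return low <= round(ir, 6) <= high   # rounding avoids float noise at the boundary

# 1 of 100 men vs 2 of 100 women selected: IR = 0.5, well outside the band.
ir_low = impact_ratio(1, 100, 2, 100)
print(ir_low, within_four_fifths(ir_low))

# 4 of 100 men vs 5 of 100 women selected: IR = 0.8, right at the edge of the band.
ir_edge = impact_ratio(4, 100, 5, 100)
print(ir_edge, within_four_fifths(ir_edge))
```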

While the notion of statistical significance allows statistical significance tests to differentiate statistically significant differences from those that occur simply due to chance, statistical tests tend to be too conservative in flagging statistically significant results when sample sizes are small. In these cases, the test is said to have “low power.” In addition, the Z test relies on a large-sample assumption. Fisher’s exact test (FET) is used when the large-sample assumption does not hold. In the case of FET, the test assumes that the marginal frequencies are held constant, and it calculates the “exact” probability of selecting the observed number of candidates from the focal group under the null hypothesis. Using the cell counts from Table 1, this probability p can be expressed as [REF]:

$$p = \frac{(a + b)!\,(c + d)!\,(a + c)!\,(b + d)!}{N!\,a!\,b!\,c!\,d!}$$

As exact tests do not rely on approximating the null distribution, but rather compute the p value directly from the true null distribution, exact tests such as FET are the preferred choice when the large-sample assumption is not met. At large sample sizes, however, calculation of the FET becomes non-trivial, as the product of large factorials can quickly lead to arithmetic overflows.
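For small and moderate tables, off-the-shelf implementations handle this directly. A minimal sketch using scipy, applied here to the illustrative Asian/Black contingency table from the earlier example:

```python
from scipy.stats import fisher_exact

# 2x2 contingency table from the earlier example:
# rows = (Asian, Black), columns = (selected, not selected).
table = [[7, 8],
         [14, 11]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p-value = {p_value:.3f}")  # well above 0.05, so the null is not rejected here

# Robust implementations avoid multiplying raw factorials by working in log space
# (e.g., via log-gamma functions), which sidesteps the overflow issue noted above.
```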

Overall, each of the three tests described so far — Z test, FET, and IR — has a set of key limitations that prevent practitioners from solely relying on one.

Statistically, the Z test tends to be less reliable at small sample sizes. At large sample sizes, the Z statistic also tends to flag small, practically insignificant differences in selection rates as significant. That leaves us with moderately sized samples, where the test is most reliable.

Fisher’s Exact Test, on the other hand, is effective in both small and moderately sized samples; however, as the sample size increases, it becomes harder to compute because the factorials involved can lead to arithmetic overflow.

The impact ratio is sensitive at small sample sizes and at low selection rates. At moderate and large sample sizes, the metric is effective and reliable, but it may still lead to over-interpretation of small differences at low selection rates.

Approach 2

Similar to the simulated selection rate approach above, this method uses a predetermined threshold computed as the median of the scores present in the dataset of interest. Using this median value as the threshold, we compute the selection rates for each group within the protected category. The selection rates are then used to compute impact ratios, using the group with the maximum selection rate as the comparator.
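A minimal sketch of this approach (the scores and group labels are made-up illustrative data, the function name is our own, and treating scores at or above the median as “selected” is an assumption of the sketch):

```python
import numpy as np

def median_threshold_impact_ratios(scores, groups):
    """Impact ratio per group, using the dataset median as the selection threshold
    and the highest-selection-rate group as the comparator."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    threshold = np.median(scores)
    selection_rates = {
        g: float(np.mean(scores[groups == g] >= threshold)) for g in np.unique(groups)
    }
    comparator_rate = max(selection_rates.values())
    return {g: rate / comparator_rate for g, rate in selection_rates.items()}

# Illustrative match scores and self-declared group labels.
scores = [3.1, 4.2, 2.8, 3.3, 4.5, 3.0, 3.7, 4.1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(median_threshold_impact_ratios(scores, groups))  # e.g. {'A': ~0.33, 'B': 1.0}
```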

Approach 3

Another approach we have seen employed in the literature to perform adverse impact analysis involves taking the ratio of the average scores associated with different groups within a protected category. This approach might be applicable to systems that assign a score to each position-profile pair based on the suitability of the candidate for that role. The idea is that the ratio of average scores between the focal and comparator groups should be as close to 1 as possible.

For this particular analysis, ratios closer to 1 are preferable.
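A minimal sketch (the scores are made-up illustrative values and the function name is our own):

```python
import numpy as np

def average_score_ratio(focal_scores, comparator_scores):
    """Ratio of the focal group's average score to the comparator group's average score."""
    return float(np.mean(focal_scores) / np.mean(comparator_scores))

# Illustrative match scores for two groups within a protected category.
focal_scores = [3.4, 4.0, 3.8, 4.2]
comparator_scores = [3.6, 4.1, 3.9, 4.3]
print(average_score_ratio(focal_scores, comparator_scores))  # ~0.97, close to 1
```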
