A Framework to Assess Gender Inequity in Hiring using Data

Gender inequity is an important topic in the tech industry. Despite its importance, well-designed studies based on rich data are scarce in the public domain. Within firms, where data is abundant, the lack of a rigorous scientific framework prevents many HR departments from truly understanding the root cause of any inequity that may exist, leaving them to rely on anecdotal evidence.

The primary goal of this post is to propose a replicable framework and methodology for assessing inequity in recruiting, with a case study to illustrate it.



The dataset contains 1,382 applicants (1,029 males, or 74.5%, and 353 females, or 25.5%) who applied to a Data Engineering position.

The gender of each applicant is predicted using Atipica’s Gender Prediction model, which uses information found in the resume, such as the applicant’s name. The model’s accuracy is 96% (an error rate of 4%). This error rate is taken into account in all analyses.

Skills are extracted using Atipica’s Skill Mapper model.

Measure of Inequity

A good way to identify potential inequity between genders is to compare their rejection rates for a specific position. All else being equal between the applicants in each group, we would expect the rejection rates to be similar.

In this study, we measure inequity by comparing rejection rates in the Application Review stage. We limit the comparison to Application Review for two reasons:

1. Many factors that go into assessing an applicant (e.g. communication skills during a phone screen) can’t be assessed from a resume alone. To limit the number of confounding variables, we restrict the analysis to Application Review.

2. The Application Review stage often has the largest impact, and that impact propagates through the rest of the hiring funnel. We find that ~90% of all applicants get rejected at this stage.

Any difference in rejection rates in Application Review can then generally be attributed to:

Objective attributes

  • Difference in years of experience
  • Difference in education/degree
  • Difference in skill-set

Subjective attributes

  • Difference in “quality” of education
  • Difference in “quality” of experience
  • Any form of conscious/unconscious bias

Since subjective attributes are by nature difficult to measure and control for, we limit the controls to objective attributes.

Testing for Significance

We test for statistical significance with the following hypotheses:

In testing for statistical significance, we must take into account two sources of error:

  1. Sampling error
  2. Error from gender prediction (4%)

Since we need to account for the error in gender prediction, traditional parametric tests such as the t-test cannot be used.

As such, we use a permutation test with the Monte Carlo method. To take the gender prediction error into account, in each iteration we exchange 4% of the samples (the gender prediction error rate) between males and females, and then test the exchangeability hypothesis.
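The procedure above can be sketched in a few lines of Python. This is only one possible reading of it, not necessarily the exact implementation used in the study: it assumes that "exchanging 4% of samples" means flipping the predicted gender of a random 4% of applicants in each iteration, and it uses a two-sided comparison. The function names (`rate_diff`, `noisy_permutation_test`) and the toy data are hypothetical.

```python
import random


def rate_diff(rejected, is_female):
    """Female rejection rate minus male rejection rate.

    Assumes both groups are non-empty.
    """
    f = [r for r, g in zip(rejected, is_female) if g]
    m = [r for r, g in zip(rejected, is_female) if not g]
    return sum(f) / len(f) - sum(m) / len(m)


def noisy_permutation_test(rejected, is_female, error_rate=0.04,
                           n_iter=2000, seed=0):
    """Monte Carlo permutation test with simulated gender-label error."""
    rng = random.Random(seed)
    n = len(rejected)
    n_flip = round(error_rate * n)
    extreme = 0
    for _ in range(n_iter):
        # Assumption: model prediction error by flipping a random
        # `error_rate` fraction of the predicted gender labels.
        labels = list(is_female)
        for i in rng.sample(range(n), n_flip):
            labels[i] = not labels[i]
        observed = abs(rate_diff(rejected, labels))
        # Under exchangeability, labels can be shuffled freely.
        rng.shuffle(labels)
        if abs(rate_diff(rejected, labels)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_iter + 1)


# Toy data for illustration only (1 = rejected), not the study's data.
rejected = [1] * 100 + [0] * 100 + [1] * 180 + [0] * 20
is_female = [False] * 200 + [True] * 200
print(noisy_permutation_test(rejected, is_female, n_iter=500))
```

Flipping labels before computing the observed statistic lets the label noise widen the uncertainty around the observed gap, which a plain permutation test would ignore.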


Difference in Rejection Rate

The rejection rates for males and females in application review are 83.0% and 88.6% respectively.

Rejection Rates between Males and Females

Females are 5.6 percentage points more likely to be rejected than males, and this difference is statistically significant (p=0.03).
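The 5.6 figure is the gap between the two rejection rates in percentage points (88.6 minus 83.0), which is not the same as a relative increase. A minimal sketch of both calculations, using only the two rates reported above (the helper name `rate_gap` is hypothetical):

```python
def rate_gap(rate_a, rate_b):
    """Return (percentage-point gap, relative gap in %) of rate_b over rate_a."""
    pp = (rate_b - rate_a) * 100
    rel = (rate_b - rate_a) / rate_a * 100
    return pp, rel


# Male and female rejection rates in Application Review.
pp, rel = rate_gap(0.830, 0.886)
print(f"{pp:.1f} percentage points, {rel:.1f}% relative")
# prints "5.6 percentage points, 6.7% relative"
```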

Next, we assess if any differences in objective attributes (skills, education, years of experience) explain this difference in rejection rates.

Difference in Number of Skills

On average, females list 96 skills per resume, and males list 93 skills. Based on the charts below, the mean and median number of skills are similar.

Descriptive Statistics for Number of Skills

Although there is a difference in the average number of skills, it is too small to be meaningful. Additionally, this difference is not statistically significant (p=0.38).

Difference in Skill-Set

To explore differences in skill-set, 34 relevant skills for the job are extracted from the job description using the Skill Mapper model. These 34 skills are compared with the skills mined from applicants’ resumes. The chart below shows the fraction of applicants with a given skill. For example, ~80% of males and ~80% of females list “java” in their resume.

Qualitatively, we can see that the distribution of skills between males and females is roughly the same for each skill.

To quantify similarity of the aggregate skill-sets between males and females, we can find the average (normalized) difference of the distributions.

Let Aᵢ and Bᵢ be the proportions of males and females with skill i, respectively. For example, A_java = 0.8 and B_java = 0.8. Then the average difference in proportions (D) across n skills is:

This means that, on average, for a given skill i, the percentage of males with skill i is 2.2 percentage points greater than the percentage of females with the same skill, which is not a large difference.

Percentage-point differences are easy to understand, but they fail to capture relative difference: a 2-point gap matters more for a skill that 5% of applicants list than for one that 80% list. We can take relative difference into account by normalizing the difference by the mean:

After normalizing, the “difference” in aggregate skill-set between males and females is about 5.3%; in other words, the skill-sets are 94.7% similar.
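As a rough sketch of the two measures, the snippet below averages the per-skill differences and their mean-normalized counterparts. It assumes absolute differences are averaged, which the text does not state explicitly; a signed average would be the alternative reading. The function name `skill_differences` is hypothetical, and the proportions are illustrative values (the "sql" figures are made up), not the study's data.

```python
def skill_differences(male_props, female_props):
    """Return (mean absolute difference, mean normalized difference).

    Assumption: per-skill differences are taken in absolute value, and the
    normalized version divides each difference by the mean of the two
    proportions for that skill.
    """
    abs_diffs, norm_diffs = [], []
    for skill in sorted(male_props):
        a, b = male_props[skill], female_props[skill]
        diff = abs(a - b)
        mean = (a + b) / 2
        abs_diffs.append(diff)
        # A skill nobody lists contributes no difference.
        norm_diffs.append(diff / mean if mean > 0 else 0.0)
    n = len(abs_diffs)
    return sum(abs_diffs) / n, sum(norm_diffs) / n


# Illustrative proportions only.
males = {"java": 0.80, "hadoop": 0.50, "sql": 0.60}
females = {"java": 0.80, "hadoop": 0.53, "sql": 0.66}
d, d_norm = skill_differences(males, females)
print(f"D = {d:.3f}, normalized D = {d_norm:.3f}")
```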

A difference of 5.3%, while small, could be meaningful, especially if this difference of 5.3% stems primarily from a really important skill.

To address this, we can test whether the difference in proportions for a given skill between males and females is statistically significant, using the permutation-test/Monte-Carlo method described earlier. For example, if 50% of males and 53% of females have the skill “hadoop”, we test whether this difference of 3 percentage points is statistically significant.

Out of the 34 relevant skills, only 2 (sql and statistics) show a statistically significant difference between genders. For both of these skills, a higher proportion of females have the skill.

Skills with Statistically Significant Difference

In essence, we can conclude that the skill-sets are for the most part similar between genders. For the skills where they are not similar (sql and statistics), a higher proportion of females have the skill.

Difference in Years of Experience

The following table shows mean and median years of work experience. Note that the job description doesn’t specify required years of experience.

There is a small difference of 0.6 years (p < 0.01). Although the difference is statistically significant, it is too small to be meaningful in practice. That is, a difference of roughly seven months of experience is unlikely to be a deciding factor.

Difference in Education

The following table shows the distribution of highest degree. The job description recommends a BSc or MSc degree.

A higher proportion of females have graduate degrees (Master and Doctoral) compared to males. Specifically, 82.7% of female applicants have a graduate degree, whereas only 69.9% of male applicants do. This difference is statistically significant (p < 0.01).


Females get rejected at a rate 5.6 percentage points higher than males despite:

  • Higher proportion of females with graduate degrees
  • No meaningful difference in years of experience between males and females
  • No meaningful difference in relevant skill-set between males and females

Although it would be unfounded to claim that the difference in rejection rates is due to (un)conscious bias, we have at least ruled out some objective attributes (experience, education, skill-set) as explanations, and can reasonably claim that the difference is likely due to some subjective attributes.


This study has the following limitations.

Quality of Experience and Education

Although years of experience and education are baseline measures, their context is extremely important in determining who passes application review. However, this context is very subjective, and thus hard to control for.

Skills are not in context

A candidate with 5 years of experience with a certain skill is treated the same as someone who only lists the skill in the “skills” section.

All relevant skills are weighted equally

All 34 relevant skills are weighted equally, which may not be true in practice, as some skills are more relevant for the job than others. However, this can be mitigated by having a recruiter assign a weight or score to each skill.
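One way such weights could enter the comparison is as a weighted average of the per-skill normalized differences. The sketch below is an assumption about how the mitigation might look, not part of the case study; the function name `weighted_skill_difference`, the weights, and the proportions are all hypothetical.

```python
def weighted_skill_difference(male_props, female_props, weights):
    """Weighted mean of per-skill normalized absolute differences.

    `weights` maps each skill to a non-negative recruiter-assigned
    importance; at least one weight must be positive.
    """
    total_weight = sum(weights[s] for s in male_props)
    acc = 0.0
    for skill in male_props:
        a, b = male_props[skill], female_props[skill]
        mean = (a + b) / 2
        norm = abs(a - b) / mean if mean > 0 else 0.0
        acc += weights[skill] * norm
    return acc / total_weight


# Illustrative values only.
males = {"java": 0.80, "hadoop": 0.50, "sql": 0.60}
females = {"java": 0.80, "hadoop": 0.53, "sql": 0.66}
weights = {"java": 3.0, "hadoop": 2.0, "sql": 1.0}
print(weighted_skill_difference(males, females, weights))
```

With equal weights this reduces to the unweighted normalized difference, so the equal-weight analysis is a special case.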

Projects are not considered

Projects and GitHub repositories are sometimes assessed in Application Review, but they are not considered in this case study.