Why You Should Avoid P-Values When Examining the Gender Pay Gap.

Maarten De Schryver
Jan 23, 2023


There are many reasons why organizations may focus on equal pay. One reason is that equal pay is a matter of fairness and justice, and it is important to ensure that all employees are paid fairly for their work regardless of their gender. Another reason may be legislation, such as the EU Pay Transparency Directive, which is currently provisional but may require employers with over 100 employees to report gender pay gap information to their employees and representatives in the future.

Calculating a pay gap is quite straightforward. This is certainly the case in a situation where the context, environment and some demographics are equal (identical). It is indisputable that employees with the same experience and performing the same job at the same level and generating the same output should not be paid differently based on their gender.

But the reality is more complex, and that makes calculating the pay gap more challenging. Today, and certainly in a knowledge economy, we speak of jobs of equal value instead of equal (identical) jobs. Because most jobs are not one-to-one identical, wages can and will differ. In addition, the context in which a salary was decided can differ (e.g., during a recession vs. an economic boom). As a result, there will be variations in salaries between employees who perform similar jobs. With a proposed pay gap limit of 5%, the EU recognizes that an identical wage is not realistic; however, pay differences should not be disproportionate.

Importantly, a higher purpose of exploring the gender pay gap in organizations is to investigate whether there is gender bias or discrimination, so that proper actions can be taken to close the gap. But a gap does not necessarily mean that there is gender bias or discrimination. There can only be gender bias if decisions about employees differ systematically solely based on their gender.

As such, an investigation of a gender pay gap is an evaluation of several decisions. It is not an evaluation of a particular case or individual, but rather the evaluation of a process. A gender pay gap evaluation at the group level can supply valuable information about individual differences, but that requires an alternative approach. If an employee earns significantly less than others, but within the cohort (i.e., a group of employees performing a job of equal value) colleagues of the same sex do not systematically earn less than colleagues of the other sex, there seems to be no evidence for gender bias. Other reasons may then be the cause of the lower wages. The key question then becomes: how do we decide if observed differences are both systematic and related to gender?

Most methods for deciding whether a gender pay gap is systematic are regression-based and rely on significance testing and p-values. The first question these methods ask is: do we see a significant difference in wages between men and women? In statistical terms, a null hypothesis, an alternative hypothesis, and a level of statistical significance (alpha) are then formulated. The null hypothesis assumes innocence: there is no difference in wages between genders. The alternative hypothesis states that the (expected) wages differ between genders. If a certain threshold is surpassed, the null hypothesis is rejected (i.e., there is evidence that the null hypothesis might be false). This threshold depends on the significance level, which is chosen based on the importance of avoiding false positives (when there is no actual difference, but the null hypothesis is rejected) or false negatives (when there is an actual difference but the null hypothesis is not rejected). By convention, a significance level of 5% is often, and arbitrarily, used. One could set the significance level to 1% or 0.1% as well. In the context of the gender pay gap, it is not straightforward to set the proper level for alpha, as any argument can be challenged by those who wish to doubt the results.

The output of the statistical analysis provides an estimate of the difference in expected wages and a p-value. The p-value is the probability of obtaining the observed result, or one even more extreme, if the null hypothesis is true. If the p-value is less than the predetermined level of significance (alpha), the null hypothesis is rejected in favor of the alternative hypothesis. This means that a result this extreme would be unlikely if there were truly no difference, which is taken as evidence for the alternative hypothesis. If the p-value is larger than alpha, it can only be concluded that there is no evidence of a difference. Next, the question is asked whether the observed difference, if significant, is economically relevant. In other words, does the difference also matter in practice? If the analysis shows that the difference is both statistically and economically significant, the results of the regression method flag a gender bias.

Although this method is common in science, we argue that this approach is problematic when applied in an organizational setting. More precisely, it is the use of p-values that makes the regression method problematic, not the modeling approach itself.

As an alternative, we propose using standardized effect size measures instead of p-values. If the null hypothesis is incorrect, standardized effect sizes reflect the extent to which it is incorrect. We will show that when evaluating salaries within an organization, standardized effect sizes can provide information about the degree to which a difference is systematic. Examples of effect sizes are the (point-biserial) correlation coefficient (rpb), Cohen’s d (d), and the probability of superiority (ps). They are called standardized measures because they take the observed variation into account to express the effect. For instance, a d-value of 0 suggests no difference, while a d-value of 1 indicates that female and male salaries differ by 1 standard deviation (the most commonly used measure of dispersion). The ps should be interpreted in probability terms: if you choose a random man and a random woman, what is the probability that the man has a higher salary? If there are no differences, ps will be 50%. If all men have a higher salary than all women, ps will be 100%. Standardized measures are particularly relevant in the context of gender bias, as they allow us to decide if decisions (such as pay) are systematically biased based on gender.
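To make the three measures concrete, here is a minimal sketch of how they could be computed from two lists of salaries. The function names and the salary figures are hypothetical and chosen only for illustration; the pooled-standard-deviation version of Cohen’s d is one common variant and is assumed here.

```python
from itertools import product
from math import sqrt
from statistics import mean, variance


def cohens_d(men, women):
    """Mean difference (men minus women) divided by the pooled standard deviation."""
    n_m, n_w = len(men), len(women)
    pooled_var = ((n_m - 1) * variance(men) + (n_w - 1) * variance(women)) / (n_m + n_w - 2)
    return (mean(men) - mean(women)) / sqrt(pooled_var)


def probability_of_superiority(men, women):
    """Share of all (man, woman) pairs in which the man earns more; ties count as half."""
    wins = sum(1.0 if m > w else 0.5 if m == w else 0.0 for m, w in product(men, women))
    return wins / (len(men) * len(women))


def point_biserial(men, women):
    """Pearson correlation between salary and a 0/1 gender indicator (1 = man)."""
    salaries = list(men) + list(women)
    indicator = [1] * len(men) + [0] * len(women)
    mean_s, mean_g = mean(salaries), mean(indicator)
    cov = sum((s - mean_s) * (g - mean_g) for s, g in zip(salaries, indicator))
    ss_s = sum((s - mean_s) ** 2 for s in salaries)
    ss_g = sum((g - mean_g) ** 2 for g in indicator)
    return cov / sqrt(ss_s * ss_g)


# Hypothetical monthly salaries in EUR (not taken from the article's figures).
men = [3100, 3300, 3500, 3700]
women = [3000, 3200, 3400, 3600]

print(f"d   = {cohens_d(men, women):.2f}")
print(f"ps  = {probability_of_superiority(men, women):.1%}")
print(f"rpb = {point_biserial(men, women):.2f}")
```

None of the three functions needs a significance level or a p-value: each describes how far apart the two salary distributions are relative to their spread.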

In what follows, we discuss why the use of p-values should be avoided and illustrate how these issues can be avoided by putting the focus on effect sizes.

1. The Influence of Sample Size on Significance Testing: why large organizations will always suffer from a pay gap.

An important property of the regression method is that the sample size affects the accuracy of an estimate. The larger the sample size, the more confident we can be in the obtained estimate. This is because a larger sample size provides more information and, as a result, a more accurate estimate of the population parameter. In the context of hypothesis testing, this means that if there is a difference in the population, we are more likely to reject the null hypothesis with a larger sample compared to a smaller sample.

For example, consider two companies (Company 1 and Company 2 in Figure 1) with the same pay gap of 10 EUR. The first company has 200 employees while the second company has 100 employees. Both companies have a female/male ratio of 50:50 and salaries vary similarly. When tested, the pay gap of 10 EUR in the first company is statistically significant, while the pay gap in the second company is not. Both companies have similar salary distributions, an equal female/male ratio, and a pay gap of the same size. They differ only in the total number of employees, and that is why one company is in trouble and the other is not.
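The mechanism can be sketched with a few lines of arithmetic. For two equal-sized groups, the test statistic is roughly d times the square root of n, divided by 2, so it grows with the number of employees even when the gap and the spread stay the same. The sketch below assumes a salary standard deviation of 35 EUR (hypothetical, the example does not state it) and uses a normal approximation to the t-distribution, so the p-values are indicative and are not the values behind Figure 1.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(gap, sd, n_total):
    """Approximate two-sided p-value for a mean difference between two equal-sized groups.

    Uses a normal approximation to the t-distribution, which is reasonable
    for a total sample size of about 100 or more.
    """
    # sd * sqrt(1/n_men + 1/n_women) with n_men = n_women = n_total / 2
    standard_error = sd * sqrt(4 / n_total)
    z = gap / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


GAP_EUR = 10  # pay gap from the example
SD_EUR = 35   # hypothetical salary spread, assumed for illustration

for n_total in (200, 100):
    p = two_sided_p_value(GAP_EUR, SD_EUR, n_total)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n_total}: d = {GAP_EUR / SD_EUR:.2f}, p = {p:.3f} ({verdict} at 5%)")
```

Under these assumptions, the standardized gap is identical in both runs; only the p-value moves across the 5% threshold when the head count changes.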

Figure 1 — Simulation of significance testing in small / large groups

Alternatively, we could rely on the standardized effect sizes for both scenarios. In the example, we calculate Cohen’s d and the ps for both organizations. Cohen’s d is one of the most familiar effect size measures for the comparison of two groups and can be used to express the size of the pay gap independent of the sample size. In this case, both scenarios have comparable effect sizes, which correspond to small to medium effects based on the interpretation guidelines in Table 1. From the ps values we learn that for Company 1, the probability that a randomly chosen man has a higher salary than a randomly chosen woman is 58.4%. In Company 2, that probability is 60.6%.
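As a rough cross-check of how d and ps relate, the two measures can be converted into each other if salaries in both groups are assumed to be normally distributed with equal variance; in that case ps equals the standard normal CDF evaluated at d divided by the square root of 2. This is an assumption made for this sketch, and the values in the example may have been computed directly from the data instead. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def ps_from_d(d):
    """Probability of superiority implied by Cohen's d under normality and equal variances."""
    return _STD_NORMAL.cdf(d / sqrt(2))


def d_from_ps(ps):
    """Cohen's d implied by a probability of superiority under the same assumptions."""
    return sqrt(2) * _STD_NORMAL.inv_cdf(ps)


# ps values reported for the two companies in the example.
for company, ps in (("Company 1", 0.584), ("Company 2", 0.606)):
    print(f"{company}: ps = {ps:.1%} corresponds to d of about {d_from_ps(ps):.2f}")
```

Under these assumptions, both reported ps values map to a d between roughly 0.3 and 0.4, which is consistent with the small to medium range mentioned above.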

Table 1 — Effect size interpretation guidelines

Figure 1 also illustrates that if a company were divided into two entities with 100 employees each, both significance tests may conclude that the observed differences of 10 EUR are due to chance and are therefore not significant. Besides highlighting the unfortunate effect that sample size can have in significance testing, it also illustrates the potential for companies to (unconsciously or unintentionally) manipulate their sample sizes to avoid significant pay gaps. However, when considering the standardized effect size measures, no substantial differences are observed, and the conclusion remains: small to medium effects are observed both for the entire company and for the two separate entities.

2. Economic Significance?

An important consequence of the property discussed above is that the smallest differences in mean pay will be considered significant for large companies.

To better understand the practical implications of the pay gap, it is therefore common, and necessary, to judge whether the observed mean difference in pay is economically relevant. For example, some people may consider a difference of 10 EUR on monthly salaries of 1,500 EUR to be significant, but not a difference of 10 EUR on salaries of 5,000 EUR. Others may consider a difference of even 1 EUR to be significant as a matter of principle. This illustrates that determining economic significance is a subjective decision that can depend on the specific context and individual perspectives.

Take the following scenario, for example. In Figure 2, two organizations are presented with the same absolute pay gap of 100 EUR. In the first organization, salaries range from 2,400 EUR to 3,700 EUR. In the second organization, salaries have less variability and range from 2,850 EUR to 3,250 EUR. Because the same gap is measured against a much smaller spread in the second organization, the effect sizes differ: the pay gaps translate to Cohen’s d values of 0.5 (medium effect) and 2 (very large effect), respectively.
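The arithmetic behind these two d values can be sketched as follows. The standard deviations of 200 EUR and 50 EUR are not reported in the scenario; they are simply what a gap of 100 EUR and d values of 0.5 and 2 imply (gap divided by d). The ps values are a further approximation that assumes normally distributed salaries with equal variance in both groups.

```python
from math import sqrt
from statistics import NormalDist

GAP_EUR = 100  # absolute pay gap, identical in both organizations

# Standard deviations implied by the reported d values (gap / d); assumed, not reported.
IMPLIED_SD_EUR = {"Organization 1": 200, "Organization 2": 50}


def standardized_gap(gap, sd):
    """Return (Cohen's d, probability of superiority under a normal model)."""
    d = gap / sd
    ps = NormalDist().cdf(d / sqrt(2))
    return d, ps


for name, sd in IMPLIED_SD_EUR.items():
    d, ps = standardized_gap(GAP_EUR, sd)
    print(f"{name}: d = {d:.1f}, ps = {ps:.1%}")
```

The same 100 EUR therefore describes a moderate overlap between the two salary distributions in one organization and almost no overlap in the other.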

Figure 2 — Absolute vs relative differences

Therefore, it is also important to consider the entire distribution of salaries and take variance (a statistic for variation) into account when examining the gender pay gap. This can help to identify systematic differences in pay that may be indicative of bias. A difference of 100 EUR could seem substantial, but when a ps of 55% is observed, the conclusion of systematic bias is incorrect. On the other hand, a difference of 1 EUR might seem negligible, but when a ps of 95% is observed, we should conclude that there is systematic bias.

3. Different unit of analysis.

The regression method and the use of p-values are common practices in science. Many published studies on the gender pay gap use significance testing to examine possible pay gaps between male and female employees in large populations (e.g., at the country or sector level). Because not all data are available and not all organizations can be examined, researchers rely on random samples for estimating the population gaps. In this case, significance testing and the use of p-values is the right method because not all organizations are included in the study. Another sample may result in a different estimate, as estimates are subject to sampling error: the risk of drawing false conclusions is real. Proper inference tools are thus required to draw the correct conclusion.

However, when evaluating the gender pay gap within a single organization, all salary data are typically available, so there is no sampling error and no need to control for type I errors (i.e., false positives). Remember, the p-value expresses the probability of obtaining the observed result, or one even more extreme, if the null hypothesis is true. When the full population is observed, the null hypothesis is simply false unless the mean salaries of men and women are exactly identical: any observed difference is a difference in the population. A logical consequence is that the p-value loses its meaning and thus its usefulness. As discussed, there is a worthy alternative: standardized effect size measures.

Figure 3 — Significance testing when all data is available

So far, the observed variance in pay has typically been treated as a sampling error rather than as a reflection of inherent variation in salaries. By considering both the mean difference and the variance, we will improve our understanding of the gender pay gap and identify potential areas for improvement. The usefulness of significance testing in an organization is open to debate. But what should be considered independently of this discussion are the practical concerns surrounding the use of p-values. Large organizations will always suffer from pay gaps, while small organizations can avoid negative results. A significant result does not yet point to systematic differences, and therefore not to gender bias. By measuring the pay gap correctly, we can take the necessary actions. Only then does measurement allow for improvement.

Acknowledgement

Sinan Polatoglu

Maisie Maede

Useful References

Commission welcomes the political agreement on new EU rules for pay transparency

Magnusson, K. (2022). Interpreting Cohen’s d effect size: An interactive visualization (Version 2.6.0) [Web App]. R Psychologist.

De Schryver, M., & De Neve, J. (2019). A tutorial on probabilistic index models: Regression models for the effect size P(Y1 < Y2). Psychological Methods, 24(4), 403–418.
