Decoding Earning Disparity: An Exploratory Analysis
A visual journey with Plotly and Seaborn
I am all for recycling this week!
This is part of a project I did for an introductory course on Programming with Python, which I deemed worth sharing on a wider platform. The idea was to take a simple dataset, one that doesn’t require much bandwidth on cleaning, and dive into unearthing underlying truths in the data to weave a story worth telling.
Introduction and Rationale
Motivation: Gender Pay gap
In recent years, there have been reports of gross inequalities in the compensation provided to women and men for the same designation and expertise in a job. According to a 2019 report by CNBC, this is especially prevalent in Healthcare, Financial Management, and the Legal Profession, where women are known to be offered lower remuneration than their male counterparts. This difference in wages between males and females is commonly known as a Pay Gap.
Data Source: The data used is from an open source dataset available on Kaggle, sourced originally from Glassdoor, a website where employees can post reviews about current and past employers. The platform is typically used by candidates who want to understand the work culture and salary insights of a prospective employer. The dataset chosen contains information on users, ranging from their educational background to their current seniority level and designation.
Through this analysis, the aim is to analyze the pay structures of various candidates across the given attributes to get a comprehensive understanding of the data. Subsequent analysis is done specifically to understand if the trends of employees’ salaries across profiles vary for Males and Females, to identify the presence of a Pay Gap if any, and additionally infer how the Gap varies across different factors.
1. Setting up environment
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import plotly.graph_objects as go
from plotly import __version__
import cufflinks as cf
from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot
cf.go_offline()
# Reading the data
glassdoor_data=pd.read_csv("Glassdoor Gender Pay Gap.csv")
glassdoor_data.head()
The granularity in this dataset is at an individual candidate level, and the information captured includes their Job Title, Gender, Age, their most recent Performance Evaluation, Educational Background, Department of work, Seniority level in the Profession in terms of Work Experience in Years and Compensation— broken into Base Pay and Bonus.
A few more calculated variables are incorporated in the next steps to make the analysis richer.
# Defining colour schemes to be used in the notebook for plotly plots
notebook_colours=["plum","slateblue","navy","firebrick",
"darksalmon","slateblue","maroon","lightskyblue","blue","darkmagenta"]
This is a palette that I customized and often reuse across notebooks, as it comes in handy while using viz libraries.
2. Data Preparation and cleaning
As standard practice, I run some basic hygiene checks on the dataset before proceeding to exploration.
# No null values in the data
glassdoor_data.isnull().sum()
# Summary Stats of numerical variables
glassdoor_data.describe()
# Summary stats of categorical variables
glassdoor_data.describe(include=object) # np.object was removed in NumPy 1.24; the builtin object works the same way here
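The hygiene checks above can also be collected into one small report. The following is only a sketch on a made-up mini-sample; the helper name `hygiene_report` and the sample values are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the Glassdoor data (values are made up)
sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [25, 40, 40, 61],
    "BasePay": [52000, 98000, 98000, 110000],
    "Bonus": [4000, 7500, 7500, 3000],
})

def hygiene_report(df: pd.DataFrame) -> dict:
    """Collect a few basic data-quality counts for a DataFrame."""
    return {
        "rows": len(df),
        "missing_values": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Rows where either pay component is negative
        "negative_pay": int((df[["BasePay", "Bonus"]] < 0).any(axis=1).sum()),
    }

report = hygiene_report(sample)
print(report)
```

On the real data, the same function could be called as `hygiene_report(glassdoor_data)`.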
2.2 Data Preparation
2.2.1. Convert seniority and performance evaluation to factors
Since Seniority and Performance Evaluation are recorded as ordinal variables (meaning discrete levels, ranging from 1–5 in this case), they need not be stored as numeric types and are converted to categorical (object) variables for this analysis.
glassdoor_data['Seniority'] = glassdoor_data['Seniority'].astype(object)
glassdoor_data['PerfEval'] = glassdoor_data['PerfEval'].astype(object)
glassdoor_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 JobTitle 1000 non-null object
1 Gender 1000 non-null object
2 Age 1000 non-null int64
3 PerfEval 1000 non-null object
4 Education 1000 non-null object
5 Dept 1000 non-null object
6 Seniority 1000 non-null object
7 BasePay 1000 non-null int64
8 Bonus 1000 non-null int64
dtypes: int64(3), object(6)
memory usage: 70.4+ KB
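The notebook casts these columns to object. A possible alternative, not used in the original analysis, is an ordered categorical dtype, which keeps the levels discrete while still remembering that 1 < 2 < ... < 5. A minimal sketch on hypothetical ratings:

```python
import pandas as pd

# Hypothetical ratings; in the notebook this would be glassdoor_data['Seniority']
seniority = pd.Series([3, 1, 5, 2, 3, 4])

# Ordered categorical: discrete levels that still carry their ranking
levels = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
seniority_cat = seniority.astype(levels)

print(seniority_cat.dtype)
print(seniority_cat.min(), seniority_cat.max())
print(seniority_cat.sort_values().tolist())
```

The trade-off is that some plotting and aggregation calls treat categorical columns differently from object columns, so the choice depends on what is done downstream.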
2.2.2 Calculate Total Pay
To calculate an employee’s Total Pay per annum, their Base Pay and Bonus Pay are added.
glassdoor_data['TotalPay']=glassdoor_data['BasePay']+glassdoor_data['Bonus']
2.2.3. Creating Age buckets
From the summary statistics we know that age ranges from 18 to 65. Analyzing the trends across age brackets would give better insight than looking at each individual age. The age is bucketed into 4 groups of roughly 12 years each.
# Defining labels for creating age groups
labels = ['18 - 30', '31 - 42', '43 - 54', '55 - 65']
# Creating Bins in Age
bins=[17,30,42,54,65] # pd.cut excludes the lower edge of each bin by default, so an edge of 17 makes the first bin start at 18
age_binned=pd.cut(glassdoor_data['Age'],bins=bins,labels=labels)
glassdoor_data['AgeBuckets']=age_binned
glassdoor_data['AgeBuckets']=glassdoor_data['AgeBuckets'].astype(object)
glassdoor_data.head()
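To see how the bin edges behave, here is a small sketch that runs the same `bins` and `labels` over hypothetical ages sitting exactly on the boundaries. It relies on the `pd.cut` default of `right=True`, where each bin is open on the left and closed on the right.

```python
import pandas as pd

labels = ['18 - 30', '31 - 42', '43 - 54', '55 - 65']
bins = [17, 30, 42, 54, 65]

# Hypothetical ages chosen to sit on the bin boundaries
ages = pd.Series([18, 30, 31, 42, 43, 54, 55, 65])

# right=True is the pd.cut default: each bin is (lower, upper]
buckets = pd.cut(ages, bins=bins, labels=labels)

boundary_check = pd.DataFrame({'Age': ages, 'AgeBucket': buckets.astype(object)})
print(boundary_check)

# An age equal to the lowest edge falls outside every bin and becomes NaN
outside = pd.cut(pd.Series([17]), bins=bins, labels=labels)
print(outside.isna().all())
```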
# Now checking the summary statistics again, this time including Seniority and PerfEval
glassdoor_data.describe(include=object)
Data Summary
- There are slightly more males than females
- Operations is the most common department
- Marketing Associate is the most common job title
- Most candidates have a Seniority of 3 years
- 5 is the most common performance rating
- Most candidates’ educational background is only up to High School
- Most candidates fall into the 18–30 age group
3. Exploratory Data Analysis
3.1 Univariate distribution plots
To understand the spread of the data in each of the variables, the following distribution plots are generated.
3.1.1 Categorical Variables
# For categorical variables
iplot(cf.subplots([glassdoor_data['Gender'].figure(kind='hist',color=notebook_colours[0]),
glassdoor_data['AgeBuckets'].figure(kind='hist',color=notebook_colours[1]),
glassdoor_data['Seniority'].figure(kind='hist',color=notebook_colours[2]),
glassdoor_data['PerfEval'].figure(kind='hist',color=notebook_colours[4]),
glassdoor_data['Education'].figure(kind='hist',color=notebook_colours[6]),
glassdoor_data['Dept'].figure(kind='hist',color=notebook_colours[7]),
glassdoor_data['JobTitle'].figure(kind='hist',color=notebook_colours[8])],shape=(3,3)))
There are no stark outliers or missing values as per the distribution plot of any of the categorical variables.
3.1.2 Continuous Variables
iplot(cf.subplots([glassdoor_data['BasePay'].figure(kind='hist',color=notebook_colours[8]),
glassdoor_data['Bonus'].figure(kind='hist',color=notebook_colours[6]),
glassdoor_data['TotalPay'].figure(kind='hist',color=notebook_colours[4])],shape=(3,1)))
Base Pay, Bonus and Total Pay are each fairly evenly spread, with a single peak near the middle of the range.
3.2 Multivariate Distribution Plots
This section has exploratory analysis to understand the distribution and behaviour of two or more variables in the dataset.
3.2.1 Gender Diversity across different attributes
The frequency distribution of males and females across each of the categorical variables is plotted in this section.
# Creating list of categorical variables to iterate through
bivar=['JobTitle', 'PerfEval', 'Education', 'Dept', 'Seniority', 'AgeBuckets']
# Plotting count of males and females for each categorical variable
for i in bivar:
    fig=px.histogram(glassdoor_data,x=i,color='Gender',color_discrete_sequence=notebook_colours,barmode='group',title='Gender diversity across {}'.format(i))
    fig.show()
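The grouped histograms show the counts visually. When exact numbers are needed, the same counts can be tabulated with `pd.crosstab`. This is a sketch on a made-up mini-sample, and the `FemaleShare` column is an illustrative addition rather than something computed in the original notebook.

```python
import pandas as pd

# Hypothetical mini-sample; the real notebook would use glassdoor_data
sample = pd.DataFrame({
    'Gender': ['Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
    'Dept': ['Sales', 'Sales', 'Engineering', 'Engineering', 'Engineering', 'Sales'],
})

# Rows are departments, columns are genders, cells are head counts
counts = pd.crosstab(sample['Dept'], sample['Gender'])

# Share of women in each department (illustrative extra column)
counts['FemaleShare'] = counts['Female'] / (counts['Female'] + counts['Male'])
print(counts)
```

Swapping `'Dept'` for any other entry of `bivar` would give the matching table for that attribute.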
Findings for Gender Diversity
Age Groups
- The 18–30 age group has the highest number of males; 43–54 is the only age group with more females than males
- Women are approximately equally distributed across all age groups, with 113–120 in each group
Department
- There are fewer women than men in every department
- Management has the fewest women, followed by Engineering
Job Title
- Marketing Associate is the most common job title among women, while the same job title has the fewest men
- The Manager and Software Engineer job titles have the fewest women
Educational Background
- The largest group of women are high school graduates
- There are fewer women than men at every level of educational background except College
Performance rating
- The most common performance rating among females is 1 out of 5
- Compared to males, considerably fewer females receive a perfect evaluation score of 5
Seniority
- Each level of seniority has fewer females than males, except for candidates with 5 years of experience
3.2.2 Jobs vs Educational backgrounds
Frequency distribution of educational backgrounds across different Job Titles is plotted in this section.
px.histogram(glassdoor_data,x="JobTitle",color="Education",barmode="stack",
color_discrete_sequence=notebook_colours)
Key Insights
- Marketing Associate is the most popular job title, and most candidates in it have an educational background up to High School
- In IT jobs, the majority of candidates have a College level of education
- In both Graphic Designer and Software Engineer jobs, the majority of candidates have a Masters level of education
- Data Scientist jobs have the highest number of PhD scholars
3.2.3 Analysis of Salary Components by Department
The components of salary — Base pay and Bonus are different for different job titles within each department.
# Analyzing the average salaries of jobs within each department
px.histogram(glassdoor_data,x="Dept",y="BasePay",color="JobTitle",barmode="group",
histfunc='avg',title='Base Pay offered within each department',
color_discrete_sequence=(notebook_colours))
px.histogram(glassdoor_data,x="Dept",y="Bonus",color="JobTitle",barmode="group",histfunc='avg',
title='Bonus offered within each department',
color_discrete_sequence=(notebook_colours))
px.histogram(glassdoor_data,x="Dept",y="TotalPay",color="JobTitle",barmode="stack",histfunc='avg',
title='Total pay offered within each department',
color_discrete_sequence=(notebook_colours))
Key Insights
- Highest-paying jobs: Managers are paid the most in each department in terms of average Base Pay and Total Pay, followed by Software Engineers
- Lowest-paying jobs: Marketing Associates are paid the least in terms of Base Pay and Total Pay
- Bonuses for the different job titles vary from department to department
3.2.4 Understanding the relationship between Components of Pay
The BasePay and Bonus for Males and Females across different departments are analyzed using a scatter plot with a regression line to understand whether the relationship between the variables is linear.
px.scatter(glassdoor_data, x="BasePay", y="Bonus",trendline="ols" ,color="Gender", facet_col="Dept",color_discrete_sequence=notebook_colours)
Key Insights
The scatter plots do not follow any set trend and the r² values of the regression lines are very small. Even though the relationship appears to be negative, the variability in Bonus cannot be explained completely with Base Pay. Nothing conclusive can be said about the relationship between the bonus and base pay for males or females in any of the departments.
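For reference, the r² of a simple regression line can be computed by hand to see what a "very small" value means. The sketch below uses NumPy on randomly generated, hypothetical data in which Bonus has no built-in dependence on Base Pay; the helper `r_squared` is an assumption for illustration and is not how Plotly Express computes its trendline (it relies on statsmodels for `trendline="ols"`).

```python
import numpy as np
import pandas as pd

def r_squared(x, y) -> float:
    """R² of a simple least-squares line y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation
    return float(1 - ss_res / ss_tot)

# Hypothetical data: bonus is drawn independently of base pay
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'Gender': np.repeat(['Female', 'Male'], 50),
    'BasePay': rng.uniform(40_000, 150_000, size=100),
    'Bonus': rng.uniform(2_000, 11_000, size=100),
})

# One R² per gender, mirroring the colour split in the scatter plot
r2_by_gender = {
    gender: r_squared(group['BasePay'].to_numpy(), group['Bonus'].to_numpy())
    for gender, group in sample.groupby('Gender')
}
print(r2_by_gender)
```

An r² close to 0 means the fitted line explains almost none of the variation in Bonus, while a value close to 1 means the points lie almost exactly on the line.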
3.2.5 Pay at levels of Seniority
The following heatmap is plotted to understand the trends in pay offered with increasing levels of seniority.
# Reshaping the data to a matrix format for heatmap
seniority_pivot = glassdoor_data.pivot_table(index = 'Seniority',columns='JobTitle',values = 'TotalPay')
#agg function by default is mean
seniority_pivot
# Heatmap of payscale with seniority
plt.figure(figsize=(12,6))
sns.heatmap(seniority_pivot,linewidths=1,linecolor='black',cmap='BuGn')
Key Insights
The colour gradient indicates the magnitude of average total pay in each subgroup. A darker shade indicates a higher magnitude of average pay and vice versa. As expected, for each of the job titles, the average total pay offered per annum increases with increasing level of seniority.
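The reshaping behind the heatmap can be checked on a tiny example. The sketch below builds a pivot table from made-up rows and then tests, column by column, whether average pay rises with seniority; the sample values and the monotonicity check are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows; the real notebook pivots glassdoor_data
sample = pd.DataFrame({
    'Seniority': [1, 1, 2, 2, 2],
    'JobTitle': ['Driver', 'Driver', 'Driver', 'Manager', 'Manager'],
    'TotalPay': [50_000, 60_000, 70_000, 120_000, 130_000],
})

# pivot_table aggregates with the mean by default; missing combinations become NaN
pivot = sample.pivot_table(index='Seniority', columns='JobTitle', values='TotalPay')
print(pivot)

# For each job title, does average pay never decrease as seniority goes up?
increasing = pivot.apply(lambda col: col.dropna().is_monotonic_increasing)
print(increasing)
```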
4. Analysis of Disparity in Pay in Females vs Males
4.1 Calculation of Gender Pay gap by department
This section involves the calculation and visualization of the difference in pay across the Genders (if any).
# Box plot of total pay by department, split by gender, to compare the quartiles for females and males
px.box(glassdoor_data,x="Dept",y="TotalPay",color='Gender',color_discrete_sequence=notebook_colours)
# Calculating average pay across Department and gender
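# numeric_only=True averages only the numeric columns; recent pandas versions raise an error on text columns otherwise
gender_dept_pay=glassdoor_data.groupby(['Dept','Gender'],as_index=False).mean(numeric_only=True)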
gender_dept_pay
The resulting table indicates that the average Base Pay and Total Pay offered to women in each department are lower than those offered to their male counterparts.
# Pivoting to get the data at required level for ease of calculation
paygap_dept=gender_dept_pay.pivot(index='Dept',values=['Bonus','BasePay','TotalPay'],columns='Gender')
# Calculating difference in total pay
paygap_dept['DeptPayGap']=paygap_dept['TotalPay','Male']-paygap_dept['TotalPay','Female']
paygap_dept.head()
Key Insights
The pay gap is most pronounced in the Engineering department, where women are paid $111,00 less than men per annum on average. Earlier analyses showed that Engineering has the second-lowest number of women, after Management. The considerable difference in pay may be one of the reasons why women feel discouraged from going into engineering.
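The gap calculation can be reproduced end to end on a handful of made-up rows. This sketch uses `groupby` followed by `unstack` as a compact alternative to the pivot step, and the `PayGapPct` column is an illustrative extra that does not appear in the original notebook.

```python
import pandas as pd

# Hypothetical salaries; the real notebook uses glassdoor_data
sample = pd.DataFrame({
    'Dept': ['Engineering'] * 4 + ['Sales'] * 4,
    'Gender': ['Male', 'Male', 'Female', 'Female'] * 2,
    'TotalPay': [110_000, 100_000, 95_000, 95_000, 90_000, 80_000, 84_000, 82_000],
})

# Average total pay per department, one column per gender
avg_pay = sample.groupby(['Dept', 'Gender'])['TotalPay'].mean().unstack('Gender')

# Absolute gap, as in the notebook, plus the gap as a share of male pay
avg_pay['PayGap'] = avg_pay['Male'] - avg_pay['Female']
avg_pay['PayGapPct'] = 100 * avg_pay['PayGap'] / avg_pay['Male']

print(avg_pay.sort_values('PayGap', ascending=False))
```

Expressing the gap as a percentage makes departments with very different pay levels easier to compare.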
4.2 Understanding Gender Pay Gap by job titles within each department
The following analysis is done to understand if there are certain job titles within each department that drive the income disparity at a departmental level.
# Reshaping data to get the average pay for males and females in each job title within a department
gender_job_dept_pay=glassdoor_data.groupby(['Dept','JobTitle','Gender'],as_index=False).mean(numeric_only=True)
gender_job_dept_pay.head()
# Treemap with outermost layer as department, then job title and gender in the inner layer
fig = px.treemap(gender_job_dept_pay, path=['Dept','JobTitle','Gender'], values='TotalPay',
color='TotalPay', color_continuous_scale='bugn',
title="Earning disparity in Job Titles within each department",
labels={"TotalPay":'Average Total Pay'},width=1200, height=600)
fig.show()
The above treemap helps understand the income disparity with respect to job titles within each department. The colour gradient indicates the magnitude of average total pay in each subgroup. A darker shade indicates a higher magnitude of average pay and vice versa.
Key Insights
It is observed that the pay gap within a given department is not concentrated in one job role. It is the cumulative effect of discrepancies between men’s and women’s earnings across all jobs in a department that gives a net effect of lower average pay for women.
- Sales: Women are paid less in IT, Marketing Associate, Sales Associate, Software Engineer and Warehouse Associate roles
- Engineering: Women are paid less in Manager, Marketing Associate, Sales Associate and Software Engineer roles
- Management: Women are paid less in Financial Analyst, IT, Manager, Marketing Associate, Sales Associate and Software Engineer roles
- Operations: Women are paid less in Driver, Financial Analyst and Software Engineer roles
- Administration: Women are paid less in Driver and Marketing Associate roles. There are no female Software Engineers in Administration.
4.3 Understanding gender Pay gap by Seniority in each department
While earnings do increase with increasing seniority, the earnings for women stand lower than men at each level.
# Reshaping data to get the average pay for males and females at each seniority level within a department
gender_seniority_dept_pay=glassdoor_data.groupby(['Dept','Seniority','Gender'],as_index=False).mean(numeric_only=True)
gender_seniority_dept_pay.head(6)
# Scatter plot with average total pay and seniority
fig=px.scatter(gender_seniority_dept_pay,'Seniority',
'TotalPay',color='Gender',size=(gender_seniority_dept_pay['TotalPay']/10000)-6, # factor of total pay calculated
color_discrete_sequence=notebook_colours, facet_col='Dept',labels={"TotalPay":'Average Total Pay'})
fig.show()
Key Insights
It is observed that for each level of seniority within the departments, women are paid less than men. The size of the gap varies across departments and is most prominent in the Sales department for individuals with a seniority of 5. It appears that even if women have the same number of years of work experience in a given department, they don’t earn the same as men.
A caveat on bubble/scatter plots: I have given the data points in this scatter plot a size element, which is a linear rescaling of average total pay (total pay divided by 10,000, minus 6). This is done specifically to highlight the differences in pay for males vs females with increasing levels of seniority. As observed, the size of the bubble keeps increasing with every level, yet the purple (male) bubble is always bigger than the pink (female) one.
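The size factor can be inspected on its own. The sketch below applies the same rescaling to hypothetical averages and adds a clip at zero as a guard, since marker sizes are expected to be non-negative; the clip is an addition and is not in the original notebook.

```python
import pandas as pd

# Hypothetical average total pay at three seniority levels
avg_pay = pd.DataFrame({
    'Seniority': [1, 3, 5],
    'TotalPay': [70_000, 95_000, 120_000],
})

# Same rescaling as the notebook: a linear function of TotalPay
bubble_size = avg_pay['TotalPay'] / 10_000 - 6

# Guard (an addition to the original): averages below 60,000 would give negative sizes
bubble_size = bubble_size.clip(lower=0)

print(bubble_size.tolist())
```

Because the rescaling subtracts a constant, bubble areas exaggerate relative differences in pay, which is the visual emphasis the caveat above describes.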
4.4 Performance evaluation and Earning Disparity
Women have a fixed range of performance evaluation that does not cross 4.0 in any department. A deep dive was done to analyze if poor performance rating was the reason behind lower average pay in women.
# Box plot to understand spread of performance rating
px.box(glassdoor_data,x="Dept",y="PerfEval",color='Gender',color_discrete_sequence=notebook_colours,
title='Performance Evaluation in Departments')
# Aggregating numerical attributes at a Department, Evaluation and gender level
gender_eval_dept_pay=glassdoor_data.groupby(['Dept','PerfEval','Gender'],as_index=False).mean(numeric_only=True)
gender_eval_dept_pay.head()
# Scatter plot to understand pay disparity in performance ratings
px.scatter(gender_eval_dept_pay,x="Dept",y="TotalPay",color='Gender',color_discrete_sequence=notebook_colours[6:8],
title='Average Pay by performance evaluation',facet_col='PerfEval',size=(gender_eval_dept_pay['BasePay']/10000)-8)
Key Insights
From the box plot it is evident that the distribution of women’s performance in evaluation is left skewed in management and men’s performance evaluation is right skewed in engineering department. However, to further understand if performance evaluation is the reason behind women’s salaries being lower than men’s, the average salary at each performance rating in a department is plotted.
It is observed that men make more than women in each department, regardless of their performance rating. The only exception to this is for employees rated 2/5 in Management department and those rated 5/5 in sales department where there is no considerable difference in Total Pay for men and women.
It can be concluded that women earn less than men despite having the same performance evaluation.
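One way to back this reading with numbers is to count the department and rating combinations in which men out-earn women. The sketch below does that on made-up rows; the threshold of 1,000 for a "considerable" difference is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical averages per department, rating and gender
sample = pd.DataFrame({
    'Dept': ['Sales'] * 4 + ['Management'] * 4,
    'PerfEval': [5, 5, 1, 1, 2, 2, 4, 4],
    'Gender': ['Male', 'Female'] * 4,
    'TotalPay': [90_000, 90_000, 80_000, 75_000, 100_000, 99_500, 120_000, 110_000],
})

# One row per (Dept, PerfEval), one column per gender
avg = sample.groupby(['Dept', 'PerfEval', 'Gender'])['TotalPay'].mean().unstack('Gender')
avg['Gap'] = avg['Male'] - avg['Female']

# Assumed cut-off for a "considerable" gap (arbitrary, for illustration only)
threshold = 1_000
avg['MenEarnMore'] = avg['Gap'] > threshold

print(avg)
print(avg['MenEarnMore'].mean())  # share of subgroups where men earn considerably more
```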
4.5 Understanding earning disparity with Educational Backgrounds
To understand if educational background was the reason why women were paid less than men, the average pay for males and females in a given department was compared based on their level of education.
# Reshaping data to get the average pay for males and females by educational background
gender_ed_dept_pay=glassdoor_data.groupby(['Dept','Education','Gender'],as_index=False).mean(numeric_only=True)
gender_ed_dept_pay.head()
# Sunburst chart with department as the innermost layer, then education, and gender in the outermost layer
fig = px.sunburst(gender_ed_dept_pay, path=['Dept','Education','Gender'], values='TotalPay',
color='TotalPay', color_continuous_scale='bugn',title="Earning disparity in levels of education",
labels={"TotalPay":'Average Total Pay'})
fig.show()
The above sunburst chart helps understand the disparity in earnings in males and females with relation to their educational backgrounds. The colour gradient indicates the magnitude of average total pay in each subgroup. A darker shade indicates a higher magnitude of average pay and vice versa.
Key Insights
Employees who have studied up to PhD level earn the most. Among them, individuals working in the Engineering division earn the most. However, women are found to be earning less than their male counterparts on average in each department despite having the same level of education.
Conclusion
Several factors were analyzed in an attempt to understand the disparity in earnings in individuals from the given dataset. There are multiple perceivable factors that contribute to the Total Pay received by employees. While educational background, seniority, job title and department affect the Pay scale, gender should not play a role in determining an individual’s salary in an ideal scenario. However, through multiple explorations it is observed that this is not the case.
- The Engineering department has the highest pay gap, with women earning $111,00 less than men on average per annum
- Within departments, there are perceivable pay gaps in most job titles. However, the gap is reversed for Data Scientists and Graphic Designers in Management, where women earn more than men
- There is a perceivable disparity at each seniority level, wherein women earn less than men for the same years of work experience. This is most prominently seen in Sales professionals with 5 years of experience
- The performance rating for women in Management is left skewed compared to the distribution for men. However, on analyzing the average salary for men and women in each department at each performance rating, it was found that men make more than women in each department, regardless of their performance rating
- Employees with a PhD level of education earn the most in each department. However, even then, women earn less than men despite having the same educational background
Through analysis of the above components, no factor could be identified that would explain why a female’s average pay should be lower than a male’s. Therefore, it appears that a systemic bias is in play, because of which women earn less than men despite having similar educational background, department of work, job title, performance rating and seniority.
That’s all folks!
This concludes a deep dive analysis on a simplistic dataset using Pandas for Data wrangling and Plotly & Seaborn for Visualizations. I am duty bound to warn that this is an analysis that holds true for the given dataset, and not an opinion. Reserving my personal take on social issues for another time and place, preferably offline.
Till then, thanks for reading and please feel free to reach out and drop your comments and suggestions below!