Analyzing Race/Gender/Job Diversity in the US

Aman Singh Thakur
Published in Nerd For Tech
7 min read · Jun 27, 2021

TABLE OF CONTENTS

  1. Introduction
  2. About the Dataset
  3. Technology Stack
  4. Methodology
  5. Interesting Insights
  6. Future Scope
  7. Experience and Important Links

Introduction

With the wave of protests against racial discrimination last year, many companies have vowed to be more ‘Diverse’, ‘Inclusive’ and ‘Equal’. Covid-19 itself has proved harsher on minorities than on the affluent. The world is slowly awakening to its own cosmopolitan and diverse identity. However, decades of incremental work remain before we are completely inclusive of talent. One way to monitor these changes is to analyze the number of working professionals by race, ethnicity and gender across different industries, regions and states in the US. While examining the data, I arrived at 13 insights that I hope you also find fascinating.

About the Dataset

As part of its mandate under the Civil Rights Act of 1964, the Equal Employment Opportunity Commission requires periodic reports from public/private/union/labour employers that indicate the composition of their workforces by gender, ethnicity and job profile. The dataset used here is:

Job patterns for minorities and women in private industry (EEO-1: 2017/2018)

As a result of this initiative, data is collected annually from almost 75,000 private employers with 100 or more employees or federal contractors with 50 or more employees. The dataset has around 20 million data points in total.

Below is the summarised distribution of employees across various industries for you to get a sense of the dataset:

Fig 1: Industry-wise distribution of employees for the EEO-1 2018 dataset

Below are the attributes used to group the data in this analysis:

  • Region: regional composition of employees (Midwest, Northeast, South, West)
  • States: state-wise composition of employees (all 50 states)
  • Industries/North American Industry Classification System (NAICS): 20 different types of industries in the US

Below are the columns compared within each group:

  • Race: the different races in the US (White, Asian, Black or African American, Biracial, Native Hawaiian or other Pacific Islander, Hispanic, American Indian or Alaska Native)
  • Gender: Male, Female
  • Job Profile: the 10 job levels listed here (Labor, Service, Clerical, Technicians, Sales Workers, Mid Off and Managers, Operatives, Professionals, Senior Off and Managers, Craft)

At runtime, a subset of the data was selected for each use case to create the different views.

Technology Stack

The Python code was written in Jupyter Notebooks and made open-source on GitHub. Pandas data frames and Python idioms helped keep the code lightweight, scalable and efficient. To visualize the different comparisons, Python libraries like ‘matplotlib’ came in extremely handy, especially for creating different graphs with the same subroutine. Efficient use of data structures and algorithms throughout the codebase helped streamline the plotting of a wide variety of graphs.
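
To make the idea of one plotting subroutine serving many graphs concrete, here is a minimal sketch. It is not the notebook's actual code: the function name `plot_grouped_counts`, the column names and the toy counts are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_grouped_counts(df, group_col, category_col, value_col="employees", title=None):
    """Draw one stacked bar per group, split by category, and return the Axes."""
    # Sum the counts for every (group, category) pair into a wide table.
    table = df.pivot_table(index=group_col, columns=category_col,
                           values=value_col, aggfunc="sum", fill_value=0)
    ax = table.plot(kind="bar", stacked=True, figsize=(8, 4))
    ax.set_xlabel(group_col)
    ax.set_ylabel(value_col)
    ax.set_title(title or f"{category_col} by {group_col}")
    return ax


# Toy rows with made-up counts, only to exercise the function.
toy = pd.DataFrame({
    "region": ["West", "West", "South", "South"],
    "gender": ["Male", "Female", "Male", "Female"],
    "race": ["Asian", "White", "White", "Hispanic"],
    "employees": [10, 12, 20, 18],
})

# The same subroutine produces two different comparisons.
ax_gender = plot_grouped_counts(toy, "region", "gender")
ax_race = plot_grouped_counts(toy, "region", "race")
```

Only the grouping and category columns change between calls, which is what lets a single helper cover the industry, region and state comparisons.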

Methodology

Because the datasets are large, it’s impractical to search every scenario exhaustively and then derive insights. For this analysis, I have focused on aggregate-level information. The dataset offers attributes like Region, State and Industry type, which let us aggregate all of the available information for each type. This focused approach gives a consolidated view of diversity across different industries, regions and states in the US. The dataset also offers attributes like Race, Gender and Job profile, which let us compare the data for minorities and majorities and highlight areas where minorities need help.
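
The aggregation described above can be sketched with a pandas `groupby`. This is an illustration under assumptions: the long-format layout, the column names and the counts are hypothetical, and the real EEO-1 file is laid out differently.

```python
import pandas as pd

# Toy long-format rows with made-up counts (hypothetical layout).
records = pd.DataFrame({
    "region":   ["West", "West", "South", "South", "South"],
    "state":    ["California", "California", "Texas", "Texas", "Florida"],
    "industry": ["Retail", "Health Care", "Retail", "Retail", "Health Care"],
    "race":     ["Hispanic", "White", "White", "Hispanic", "Black"],
    "employees": [30, 50, 40, 20, 10],
})


def aggregate_by(df, level, breakdown):
    """Sum employee counts for each (level, breakdown) pair as a wide table."""
    totals = df.groupby([level, breakdown])["employees"].sum()
    # Pairs that never occur in the data become 0 instead of NaN.
    return totals.unstack(fill_value=0)


by_region = aggregate_by(records, "region", "race")
by_industry = aggregate_by(records, "industry", "race")
```

Passing `"state"` as the level gives the state-wise view in the same way, so each comparison is one call on the same data frame.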

To start with, I have drawn graphs that summarise the 2017/2018 dataset across the industry types present in the United States. With only minute yearly variation within industries and a diverse representation, it’s clear that the US has many types of stable businesses offering plentiful jobs in every domain.
Interestingly, the dataset has the most records from the Health Care, Manufacturing and Retail industries, suggesting that these three industries provide the highest number of jobs in the country. However, Agriculture, Education, Mining and Oil Extraction, Public Administration, Real Estate and Utilities each contribute fewer than 5 million employees to the total workforce.

The pièce de résistance is the consolidated research on minorities based on race/gender/job profile.

Fig 2: Snippet from Wikipedia page — Race and ethnicity in US

At best, this dataset counts all the employees of medium and large firms in the United States, so a large segment of the population remains uncounted. The demographics of the entire population are available in the US Census Bureau estimates.

Interesting Insights

Let’s break down this analysis further to understand it in more depth:

Fig 3: Race/Gender/Job profile vs Industries in 2018

Industry-wide Comparison

  1. All industries except Agriculture and Public Administration seem to be dominated by Whites. Health Care, Manufacturing and Retail have the most employees, and in each of them Whites make up more than half of the workforce.
  2. The good news is that the largest numbers of minority employees are found in those same White-dominated fields, which suggests that minority representation grows with the number of jobs in an industry.
  3. After Whites, Blacks/African Americans and Hispanics lead the numbers among minorities. The dataset also accounts for millions of Asians and Biracials. Native Hawaiians/Pacific Islanders/American Indians/Alaska Natives are the most underrepresented minority in the US, as they constitute only 1.5% of the US population.
  4. Not surprisingly, Asians are the most dominant minority in the scientific community, plausibly because many Indian students choose to pursue their research at US academic institutes.
  5. Industries like Construction, Manufacturing and Warehousing are heavily dependent on men, while women continue to dominate the medical and health care field. These trends have persisted for decades, and their origin seems to be deeply rooted in societal biases.

Fig 4: Race/Gender vs Regions in 2018

Region-wide Comparison

  1. Again, Whites dominate in all four regions of the US, as they comprise 60% of the entire populace. Everywhere except the West, the Black/African American population is the most dominant minority. In the West, Hispanics are present in large numbers, likely because of the shared border with Mexico.
  2. Across the four regions of the US, gender diversity is excellent, staying close to a 50% ratio.

Fig 5: Race/Gender vs States in 2018

State-wide Comparison

  1. To dig further into the hidden treasures of the dataset, we can look at diversity at the level of each state. Even though Whites dominate in all states of the US, California and Texas seem to be the most diverse states in the US because of their large Hispanic presence.
  2. States like New York, Ohio, Pennsylvania, Washington, Florida and Illinois have a diverse minority presence, but the gap between the minorities and the majority is too big. Meanwhile, states like Connecticut, Iowa, Kansas, Kentucky, Nebraska, New Hampshire and Wisconsin have virtually no minority representation.
  3. Again, the data shows that all states employ almost equal numbers of men and women.
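
Statements such as "close to a 50% ratio" or "more than half" come down to turning counts into row-wise percentages. A minimal sketch of that step, with placeholder state names and made-up counts rather than EEO-1 figures:

```python
import pandas as pd

# Made-up counts per state and gender, for illustration only.
counts = pd.DataFrame(
    {"Male": [60, 45], "Female": [40, 55]},
    index=pd.Index(["State A", "State B"], name="state"),
)


def to_shares(table):
    """Convert a table of counts into row-wise percentages."""
    row_totals = table.sum(axis=1)
    return table.div(row_totals, axis=0).mul(100).round(1)


shares = to_shares(counts)
```

The same helper works on a table with one column per race, which is how a majority share per industry, region or state could be read off.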

General Insights

  1. The data varies only minutely between 2017 and 2018.
  2. Across industries, professional jobs continue to be in high demand. Most job levels peak in the same industries, so it seems that more jobs lead to more employees at every skill and hierarchical level.
  3. Interestingly, these datasets predate both the wave of inclusivity and diversity initiatives that followed the protests against racial discrimination in the US and the Covid-19 pandemic, so it’s hard to predict how 2019/2020/2021 will compare. When the final datasets are available, it’ll be interesting to see which minorities account for any increase in racial/gender/job profile diversity resulting from these arduous efforts across the globe.

Future Scope

As highlighted in general insight #3, when data collected after Covid-19 arrives, we can identify how much the protests have changed the overall picture of diversity. A time-series analysis would let us bring more advanced tools to bear and estimate how long it may take for us to be truly diverse in every sense of the word.
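
As a rough illustration of what such an estimate could look like, the sketch below fits a straight line to yearly shares and extrapolates to a target. The linear trend is a simplifying assumption of mine, not a method used in this analysis, the function name `year_target_reached` is hypothetical, and the numbers are placeholders. Two years of nearly identical data would not support a real forecast.

```python
import numpy as np


def year_target_reached(years, shares, target):
    """Fit a straight line to (year, share) and return the year it hits target.

    Returns None when the fitted trend is flat or the crossing lies in the past.
    """
    x = np.asarray(years, dtype=float)
    y = np.asarray(shares, dtype=float)
    # Centre the years on the first observation to keep the fit well conditioned.
    slope, intercept = np.polyfit(x - x[0], y, deg=1)
    if abs(slope) < 1e-12:
        return None
    crossing = x[0] + (target - intercept) / slope
    return float(crossing) if crossing >= x.max() else None


# Placeholder shares in percent, not taken from the EEO-1 data.
estimate = year_target_reached([2017, 2018, 2019], [40.0, 41.0, 42.0], target=50.0)
flat = year_target_reached([2017, 2018, 2019], [40.0, 40.0, 40.0], target=50.0)
```

With more years of data, the same interface could wrap a proper time-series model instead of a straight line.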

This dataset still contains many hidden insights. I have only scratched the surface, and I hope to collaborate with fellow enthusiasts to dig deeper into it in the future.

Experience and Important Links

This project has been one of my most satisfying endeavours to date. I feel much more well-versed in the country’s demographics because of this incredible dataset. I also got to experiment with the true power of Pandas data frames and open-source technologies in the world of data science. As an engineer working on large-scale datasets daily, I am still amazed at how data continues to empower communities.

I am deeply grateful to the United States Equal Employment Opportunity Commission for taking the pains to create this dataset. I would also like to extend my sincere thanks to Miss Arunima Singh for her contribution in producing great insights and developing robust graph layouts.

Our ability to reach unity in diversity will be the beauty and the test of our civilisation — Mahatma Gandhi

Diversity Analysis Notebook || Github Codebase || Dataset
