Salary for US professions analysed with Machine Learning
SHAP values for industry, education, job title, and more, together with a detailed gender gap analysis
Published May 13, 2023
In this article, I use a publicly available dataset of about 28,000 responses to a salary survey conducted on the AskAManager.org website during 2021–2023 (mostly in 2021). The data is also available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.
Step 1 — data preprocessing
Here, data preprocessing consists of the following steps:
- selecting only US respondents with salaries in USD;
- defining the label as total yearly compensation (the sum of base and additional compensation), converted to thousands of USD per year;
- excluding null compensations, together with the lowest 1% and the highest 1% of salaries;
- replacing values in selected columns and normalising job titles and industry values;
- encoding rare categories in the employee_residence, job_title, and experience_level columns, so that each column has no more than 70 distinct categories and each category has at least 20 data samples;
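As a rough illustration of the label construction, outlier trimming, and rare-category steps above, here is a minimal pandas sketch. It is not the code from the Kaggle notebook: the column names `annual_salary`, `additional_comp`, and `total_comp_k`, the helper names, the `"Other"` bucket, and the choice to treat missing additional compensation as zero are all assumptions made for this example.

```python
import pandas as pd


def add_total_compensation(df):
    """Add a label column: base plus additional compensation, in thousands of USD/year.

    Assumption: a missing additional compensation counts as zero.
    """
    total = df["annual_salary"] + df["additional_comp"].fillna(0)
    return df.assign(total_comp_k=total / 1000)


def trim_salary_outliers(df, label="total_comp_k", lower=0.01, upper=0.99):
    """Drop rows with a null label, then rows below the 1st or above the 99th percentile."""
    df = df.dropna(subset=[label])
    low, high = df[label].quantile([lower, upper])
    return df[(df[label] >= low) & (df[label] <= high)]


def group_rare_categories(series, max_categories=70, min_samples=20, other="Other"):
    """Keep the most frequent categories that have enough samples; map the rest to `other`.

    Assumption: the `other` bucket is not counted towards `max_categories`.
    """
    counts = series.value_counts()  # sorted from most to least frequent
    keep = counts[counts >= min_samples].index[:max_categories]
    return series.where(series.isin(keep), other)
```

A call such as `df["job_title"] = group_rare_categories(df["job_title"])` would then be applied to each categorical column in turn. Note that in this sketch missing values also end up in the `"Other"` bucket, which may or may not be what a given analysis wants.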