Salary for US professions analysed with Machine Learning

SHAP values for industry, education, job title, and more, together with a detailed gender gap analysis

Dmytro Iakubovskyi
Published in Data And Beyond
5 min read · May 13, 2023

Photo by Mapbox on Unsplash

In this article, I use a publicly available dataset of about 28,000 responses to a salary survey conducted on the AskAManager.org website during 2021–2023 (mostly in 2021). The data is also available on Kaggle. Full details of the analysis can be found in this public Kaggle notebook.

Step 1 — data preprocessing

Here, data preprocessing consists of the following steps:

  • selecting only US respondents with salary in USD;
  • defining the label as total yearly compensation (the sum of base and additional compensation), converted to thousands of USD per year;
  • excluding null compensations, together with the lowest 1% and the highest 1% of salaries;
  • replacing values in selected columns, and normalising job titles and industry values;
  • encoding rare categories in categorical columns (employee_residence, job_title, and experience_level) so that each column has no more than 70 distinct categories, each with at least 20 data samples.
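The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code from the notebook: the raw column names (`country`, `currency`, `base_salary`, `additional_comp`, `industry`), the helper names, the "Other" bucket for rare categories, and treating a missing additional compensation as zero are all assumptions made for the example.

```python
import pandas as pd


def trim_salary_tails(df, label="total_comp_kusd", lower=0.01, upper=0.99):
    """Drop rows with a null label or a label outside the given quantiles."""
    df = df.dropna(subset=[label])
    lo, hi = df[label].quantile([lower, upper])
    return df[df[label].between(lo, hi)].copy()


def group_rare_categories(s, max_categories=70, min_samples=20, other="Other"):
    """Keep the most frequent categories; map the rest to a single bucket.

    The "Other" bucket is an assumption; the article does not say how
    rare categories are encoded.
    """
    counts = s.value_counts()  # sorted from most to least frequent
    keep = counts[counts >= min_samples].index[:max_categories]
    return s.where(s.isin(keep), other)


def preprocess(raw, rare_cols=("employee_residence", "job_title", "experience_level")):
    # Hypothetical column names for country and currency.
    df = raw[(raw["country"] == "US") & (raw["currency"] == "USD")].copy()

    # Label: total yearly compensation in thousands of USD.
    # Assumption: a missing additional compensation counts as zero.
    df["total_comp_kusd"] = (df["base_salary"] + df["additional_comp"].fillna(0)) / 1000

    df = trim_salary_tails(df)

    # Very simple normalisation of free-text columns (assumed form).
    for col in ("job_title", "industry"):
        if col in df.columns:
            df[col] = df[col].str.strip().str.lower()

    # Group rare categories in whichever of the listed columns are present.
    for col in rare_cols:
        if col in df.columns:
            df[col] = group_rare_categories(df[col])
    return df
```

The thresholds (`max_categories=70`, `min_samples=20`, and the 1% tails) mirror the numbers in the list above; everything else is a placeholder to adapt to the actual survey columns.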

Dmytro Iakubovskyi
Data And Beyond

Top writer in AI, Movies | Senior data scientist | Editor in Data And Beyond | https://www.linkedin.com/in/dima806/