Salaries for Data Science professionals explained with Machine Learning

SHAP values of employee residence, experience level, company location, and more

Dmytro Iakubovskyi
Published in Data And Beyond · 4 min read · Jan 5, 2023


In this article, I analyse a dataset with detailed information on 600 salaries in the Data Science domain worldwide for the years 2020–2022, taken from the ai-jobs.net website. The dataset is publicly available on Kaggle, and full details of the analysis can be found in the accompanying public Kaggle notebook.

Photo by Chris Liverani on Unsplash

Step 1 — data preprocessing

Here, data preprocessing consists of the following steps:

  • converting the label (yearly gross salaries) to kUSD/year;
  • excluding the highest 1% and the lowest 1% of salaries;
  • encoding rare categories in the employee_residence, job_title, and experience_level columns, so that each column has no more than 20 distinct categories and each category has at least 10 data samples;
  • finally, dropping unused columns.
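The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the code from the notebook: the label column name `salary_in_usd`, the new column `salary_kusd`, the `"Other"` bucket used for rare categories, and the `unused_cols` parameter are all assumptions made for the example.

```python
import pandas as pd


def preprocess(
    df,
    label_col="salary_in_usd",  # assumed name of the raw salary column
    cat_cols=("employee_residence", "job_title", "experience_level"),
    max_categories=20,
    min_samples=10,
    unused_cols=(),  # hypothetical: columns the model will not use
):
    out = df.copy()

    # 1. Convert the yearly gross salary to kUSD/year.
    out["salary_kusd"] = out[label_col] / 1000.0

    # 2. Keep only salaries between the 1st and 99th percentiles.
    low, high = out["salary_kusd"].quantile([0.01, 0.99])
    out = out[out["salary_kusd"].between(low, high)]

    # 3. Merge rare categories into a single "Other" bucket (an assumed
    #    encoding): keep at most max_categories - 1 frequent values,
    #    each with at least min_samples rows.
    for col in cat_cols:
        counts = out[col].value_counts()  # sorted by frequency, descending
        keep = counts[counts >= min_samples].index[: max_categories - 1]
        out[col] = out[col].where(out[col].isin(keep), "Other")

    # 4. Drop the raw label and any other unused columns.
    return out.drop(columns=[label_col, *unused_cols]).reset_index(drop=True)
```

Note that with this simple bucketing the "Other" category itself may end up with fewer than 10 samples; a stricter variant would need to handle that case separately.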

Step 2 — setting a Machine Learning model to predict the yearly gross salaries

The data prepared in the previous step are randomly split into training and test samples…
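A random split of this kind could look like the sketch below. The 20% test fraction, the fixed seed, and the helper name `random_split` are assumptions for illustration only; the notebook may use different values or a library helper such as scikit-learn's `train_test_split`.

```python
import pandas as pd


def random_split(df, test_fraction=0.2, seed=42):
    """Randomly split df into (train, test); assumes df has a unique index."""
    # Draw the test rows at random, reproducibly for a given seed.
    test = df.sample(frac=test_fraction, random_state=seed)
    # Everything that was not drawn becomes the training sample.
    train = df.drop(index=test.index)
    return train, test
```

Fixing the seed keeps the split reproducible, so that metrics computed on the test sample can be compared between runs.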


Dmytro Iakubovskyi

Top writer in AI, Movies | Senior data scientist | Editor in Data And Beyond | https://www.linkedin.com/in/dima806/