Salary Prediction with Machine Learning (Part 1).

Babatunde Oreoluwa
8 min read · Feb 5, 2022


Image by IT security guru

Data Science is a very broad field that has birthed many other recent data roles, such as data analysis, machine learning engineering, data engineering, analytics engineering, and a few others. While some people have these roles well defined, others work across many of these branches without even realizing it.

I recently stumbled upon a dataset that contains details of data scientists’ earnings/salaries across some countries, based on their education level and years of experience, so I thought it would be interesting to explore.

This article walks through the details of a project I worked on to predict data scientists’ annual salaries.

Prerequisites for understanding this project include:

  • Basic knowledge of Python programming
  • An understanding of data science

The whole process is broken down into four stages:

  • Data Collection
  • Data Preprocessing
  • Model Building
  • Model Deployment

Data Collection: Salary data is not easily available, as HR personnel claim it is proprietary. Therefore, we resorted to using the publicly available data from the Stack Overflow Annual Developer Survey. Here is the link.

Data Cleaning and Preprocessing: The first step in data cleaning and preprocessing is importing the libraries and the dataset. A Python library is a collection of related modules that can be called and used. I will be using four main libraries: pandas (for data analysis), NumPy (for numerical operations), seaborn (for data visualization and exploratory data analysis), and matplotlib.pyplot (for data visualization and graphical plotting). These libraries are brought in with the “import” keyword.

Importing all the necessary libraries
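A minimal sketch of these imports could look like this:

```python
# Core libraries used throughout the project
import numpy as np                # numerical operations
import pandas as pd               # data analysis
import seaborn as sns             # data visualization / EDA
import matplotlib.pyplot as plt   # plotting
```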

Import and load the dataset from Drive: Since I used Google Colab, I had to mount Google Drive and load the dataset from there.
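A sketch of that step, assuming the survey CSV has been uploaded to Drive (the file path below is only a placeholder):

```python
# Mount Google Drive inside Colab and load the survey CSV
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/survey_results_public.csv')
df.head()  # preview a few rows
```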

The output above shows a sample of the dataset.

The dataset contains 64,461 rows and 61 columns.

Let’s start cleaning!!!

Selecting and keeping the columns/features needed for the prediction: When building a machine learning model in real life, feature selection is very important because it is rare that every feature in a dataset is useful for the model. So we select only the few columns needed for the prediction, so the user is not bothered with filling in too much unnecessary information. The columns are Country, EdLevel (the education level), YearsCodePro (the number of years of professional coding experience), Employment (full-time or part-time), and ConvertedComp (the annual salary in dollars), which will later be renamed.
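A sketch of the column selection; renaming ConvertedComp to Salary here is an assumption based on how the column is referred to in the rest of the article:

```python
# Keep only the features needed for the prediction
df = df[["Country", "EdLevel", "YearsCodePro", "Employment", "ConvertedComp"]]

# Rename the annual salary column for readability
df = df.rename({"ConvertedComp": "Salary"}, axis=1)
```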

Dealing with Missing Values: I will only use the rows where the salary is available, so I will drop the rows with a NaN salary.
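That step might look like this:

```python
# Keep only the rows where the salary is available
df = df[df["Salary"].notnull()]
```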

Let’s take a quick look at the dataset
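One way to do that, assuming the df from the previous steps:

```python
# Summary of row count, column dtypes and non-null counts
df.info()
```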

Here we see that we have 34,025 data entries; three columns are of type object, which means they are strings, and only the Salary column is a float. Next, we drop the rows that still contain missing values.

I also dropped the Employment column since it wasn’t really needed for the prediction. Let’s take a quick look at our dataset again.
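Both steps, plus the quick look, might look like this:

```python
# Drop the remaining rows that still contain missing values
df = df.dropna()

# The Employment column is not needed for the prediction
df = df.drop("Employment", axis=1)

df.info()  # check the dataset again
```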

Now we will clean the data in each of the columns.

I will start with the Country column.

The value_counts() function in pandas returns a Series containing the counts of unique values, in descending order, so the first element is the most frequently occurring one. Here we see that the U.S.A. has the most data, and some countries have only one data point, which we will get rid of because our model cannot learn from just one data point.
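The call that produces those counts might look like:

```python
# Number of rows per country, most frequent first
df["Country"].value_counts()
```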

We will clean the Country column with a helper function named shorten_categories: we fix a cut-off value, and if the number of data points for a country is greater than the cut-off we keep it; otherwise we combine it into a new category called ‘Other’.
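A sketch of that helper; the cut-off of 400 data points is only an illustrative choice:

```python
# Map rare countries to a single 'Other' category
def shorten_categories(categories, cutoff):
    category_map = {}
    for i in range(len(categories)):
        if categories.values[i] > cutoff:
            category_map[categories.index[i]] = categories.index[i]
        else:
            category_map[categories.index[i]] = 'Other'
    return category_map

country_map = shorten_categories(df["Country"].value_counts(), 400)
df["Country"] = df["Country"].map(country_map)
df["Country"].value_counts()
```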

Now, after running the above, we discover that the new ‘Other’ category we created has the most data points.

I would like to look at the relationship between the salary column and the country column by plotting a boxplot.
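A sketch of that plot:

```python
# Boxplot of salary per country to inspect the spread and outliers
fig, ax = plt.subplots(figsize=(12, 7))
df.boxplot("Salary", "Country", ax=ax)
plt.suptitle("Salary (USD) vs Country")
plt.xticks(rotation=90)
plt.show()
```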

From the plot we can see that there are a lot of outliers. To keep the data where we have more information, we keep only the rows with salaries less than or equal to $250,000 and greater than or equal to $10,000, and we drop the ‘Other’ country category.
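A sketch of that filtering:

```python
# Keep salaries between $10,000 and $250,000 and drop the 'Other' bucket
df = df[df["Salary"] <= 250000]
df = df[df["Salary"] >= 10000]
df = df[df["Country"] != "Other"]
```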

Plotting the boxplot again, we can see that the number of outliers has reduced.

Cleaning the YearsCodePro feature

The unique() function in pandas returns the unique values of a Series, in order of appearance.
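For example:

```python
# Inspect the raw values of the experience column
df["YearsCodePro"].unique()
```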

After running this, we discover that all the values are strings. For the model to use them, we convert them to floats: if an entry says less than one year we return 0.5, if it says more than fifty years we assign 50, and otherwise we simply convert the string to a float. We do this by creating a function called clean_experience.
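A sketch of clean_experience; the exact answer strings (‘Less than 1 year’, ‘More than 50 years’) are assumed from the survey’s wording:

```python
# Convert the experience strings to numeric values
def clean_experience(x):
    if x == 'Less than 1 year':
        return 0.5
    if x == 'More than 50 years':
        return 50
    return float(x)

df["YearsCodePro"] = df["YearsCodePro"].apply(clean_experience)
df["YearsCodePro"].unique()
```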

After running the above, we see that the YearsCodePro column now holds numeric values instead of strings.

Cleaning the EdLevel feature

Here, we have many different education levels. We will focus on Bachelor’s degrees, Master’s degrees, and other postgraduate degrees; anything apart from these will be grouped as ‘Less than a Bachelor’.
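A sketch of that grouping as a clean_education function; the substring checks are only an approximation of the survey’s answer options:

```python
# Group the education levels into a few categories
def clean_education(x):
    if 'Bachelor' in x:
        return "Bachelor's degree"
    if 'Master' in x:
        return "Master's degree"
    if 'doctoral' in x or 'Professional degree' in x:
        return 'Post grad'
    return 'Less than a Bachelor'

df["EdLevel"] = df["EdLevel"].apply(clean_education)
df["EdLevel"].unique()
```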

After running the code above, we see that only these few grouped education levels remain in the EdLevel column.

Now, we are almost done with the data cleaning.

As we all know, the model cannot work with strings, and we still have columns containing strings! It is therefore necessary to transform the string values into unique numbers. To do this, we use LabelEncoder. Label encoding is part of data preprocessing, so we use the preprocessing module from the sklearn package and import LabelEncoder from it.

Create an instance of LabelEncoder() and store it in a variable called lb_edlevel.

Apply fit_transform, which assigns a numerical value to each categorical value, and store the result back in the EdLevel column.
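A sketch of the EdLevel encoding:

```python
from sklearn import preprocessing

# Encode the EdLevel categories as integers
lb_edlevel = preprocessing.LabelEncoder()
df["EdLevel"] = lb_edlevel.fit_transform(df["EdLevel"])
df["EdLevel"].unique()
```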

Taking a look at the EdLevel column, we no longer have strings; the LabelEncoder has transformed it into integers, which the model can now understand. We will do the same for the Country column.

Create another instance of LabelEncoder() and store it in a variable called lb_country.

Apply fit_transform, which assigns a numerical value to each categorical value, and store the result back in the Country column.
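A sketch of the same encoding for Country:

```python
# Encode the Country categories as integers in the same way
lb_country = preprocessing.LabelEncoder()
df["Country"] = lb_country.fit_transform(df["Country"])
df["Country"].unique()
```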

Looking at the unique values for the Country column, the label encoder has given each country a unique integer value.

Let’s check the dataset

Now, our data is ready for training and testing.

Data Splitting: Data splitting is commonly used in machine learning to divide data into train, test, and sometimes validation sets. This approach allows us to estimate the model’s performance on unseen data. Here, we will only use a train set and a test set.

We split our data into X and y: X contains the features, and y contains the target, which is the salary.

After doing that, we split X and y into train and test sets. To do this we use the train_test_split function.

train_test_split is a function in sklearn.model_selection for splitting data arrays into two subsets: one for training and one for testing. With this function, you don’t need to divide the dataset manually; by default, it makes random partitions for the two subsets.

We train with 70% of the dataset and test with 30%, and set random_state to 42. Fixing the random state ensures that the same split is generated every time the code is run.
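A sketch of the split, assuming the cleaned df from above:

```python
from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = df.drop("Salary", axis=1)
y = df["Salary"]

# 70% train / 30% test, with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```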

So it’s time to build our model!!!

Three different algorithms will be used to build the model, and we will pick the one with the least error.

We start with linear regression, a basic and commonly used algorithm for predictive analysis. We begin by importing LinearRegression from sklearn.linear_model.

Create an instance of LinearRegression(), fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.

In regression predictive modeling, we use error metrics to measure model performance. The error metric we will use is the RMSE, the root mean squared error, which shows how far the predicted values are from the actual values.
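A sketch of the linear regression model and its RMSE:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit linear regression on the training data
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Predict on the test data and compute the RMSE
y_pred = linear_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
```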

Using the LinearRegression algorithm, the difference between the actual and predicted values (RMSE) is $39,558.79, which is very high.

Let’s try the DecisionTreeRegressor algorithm

Import DecisionTreeRegressor from sklearn.tree, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.
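A sketch of that step (the random_state here is only for reproducibility):

```python
from sklearn.tree import DecisionTreeRegressor

# Fit a decision tree regressor and evaluate it with the same RMSE metric
dec_tree_reg = DecisionTreeRegressor(random_state=0)
dec_tree_reg.fit(X_train, y_train)

y_pred = dec_tree_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
```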

The difference between the actual and predicted values using DecisionTreeRegressor is $33,962.56, which is still a little high.

Let’s try the Random Forest Regression algorithm.

Import RandomForestRegressor from sklearn.ensemble, create an instance of it, fit it on the training dataset, then predict on the test dataset and store the predictions in a variable.
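A sketch of the random forest model:

```python
from sklearn.ensemble import RandomForestRegressor

# Fit a random forest regressor and evaluate it with RMSE
random_forest_reg = RandomForestRegressor(random_state=0)
random_forest_reg.fit(X_train, y_train)

y_pred = random_forest_reg.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
```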

Finally, the RandomForestRegressor algorithm gave us the least error. Now we want to find the best parameters for our model using GridSearchCV.

Grid search is the process of performing hyperparameter tuning to determine the optimal values for a given model. This is significant as the performance of the entire model is based on the hyperparameter values specified. It is a useful tool to fine-tune the parameters of your model.

The way it works: import GridSearchCV from sklearn.model_selection, define the set of parameter values to try, create a parameter dictionary whose keys are keyword arguments of RandomForestRegressor (you can check out the documentation for these), create an instance of the regressor, then create a GridSearchCV instance with the regressor, the parameter grid, and the scoring metric. Lastly, fit the GridSearchCV object on the training dataset.
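A sketch of the grid search; the parameter grid below (only max_depth) is just an example, not the exact grid used in the project:

```python
from sklearn.model_selection import GridSearchCV

# Example parameter grid for RandomForestRegressor
max_depth = [None, 2, 4, 6, 8, 10, 12]
parameters = {"max_depth": max_depth}

regressor = RandomForestRegressor(random_state=0)
gs = GridSearchCV(regressor, parameters, scoring="neg_mean_squared_error")
gs.fit(X_train, y_train)
```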

After running the above code, we get the best estimator and store it in a variable called model. We then fit it on the training dataset and use it to predict on the test dataset.
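Something along these lines, reusing the grid-search object from above:

```python
# Take the best estimator found by the grid search, refit and evaluate it
model = gs.best_estimator_
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"${error:,.2f}")
```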

Following this, the error has reduced a bit, from $33,617.45 to $32,911.09, which is still fair.

Making a predictive system

For instance, a user enters the Country as United States, the EdLevel as Master’s degree, and the YearsCodePro as 15 years.
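A sketch of that predictive step; it assumes the country and education labels match the values the encoders were fitted on:

```python
# New sample in the same column order as X: Country, EdLevel, YearsCodePro
X_new = np.array([["United States", "Master's degree", 15]])

# Encode the categorical inputs with the label encoders used in training
X_new[:, 0] = lb_country.transform(X_new[:, 0])
X_new[:, 1] = lb_edlevel.transform(X_new[:, 1])
X_new = X_new.astype(float)

predicted_salary = model.predict(X_new)
print(predicted_salary)
```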

The model then returns the predicted annual salary for this user.

In conclusion, we have seen the step-by-step approach to building the model for our salary prediction web app. In my next article, I will be sharing how to deploy this model.


Babatunde Oreoluwa

An A.I enthusiast and a Research and Innovation intern with Data Scientist Network (DSN)