Predicting Data Science Salaries Scraped From Indeed.com

Zain Khan
Jul 20, 2020 · 9 min read

One of my key projects for the General Assembly Data Science Immersive course was to scrape job listings from a job site of our choice and apply machine learning algorithms to predict salaries for Data Science roles in the United States.

Note: This project was completed in July 2020.

Introduction and Pre-Processing:

The first task was simple and straightforward: find the appropriate tags for each job card and then extract the respective job title, salary and location into a notebook for pre-processing.

Web Scraping:

This is the base URL that I used to scrape all of the key job information.

URL = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&start=10"

The job title is set by the q query parameter and the location by l. I used the usual web scraping methods illustrated below and wrote four separate functions to extract the relevant information from each job listing.

My web scrape in action. Beautiful. Beautiful Soup.
Each function used to extract the information required.
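
For anyone curious, here is a minimal sketch of what those extraction functions can look like with requests and BeautifulSoup. The tag and class names are illustrative of Indeed's markup around 2020, not an exact copy of my notebook:

import requests
from bs4 import BeautifulSoup

URL = "https://www.indeed.com/jobs?q=data+scientist&l=San+Francisco&start=10"

def extract_job_title(card):
    # The job title sits in an anchor tag on each result card
    # (attribute and class names are illustrative; Indeed's markup changes often).
    tag = card.find("a", attrs={"data-tn-element": "jobTitle"})
    return tag.text.strip() if tag else None

def extract_salary(card):
    # Not every listing reports a salary, so return None when it is missing.
    tag = card.find("span", class_="salaryText")
    return tag.text.strip() if tag else None

response = requests.get(URL)
soup = BeautifulSoup(response.text, "html.parser")
cards = soup.find_all("div", class_="jobsearch-SerpJobCard")
titles = [extract_job_title(card) for card in cards]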

From there, I needed to run the scrape across all of the cities I wanted to include as well as all of the job titles. This took a casual 3 hours and 15 minutes, something I wish I had known at the start.

Slow and steady wins the race.

After that long and painful web scrape, I needed to put all of the jobs into a dataframe.

After a 3 hour web scrape, the hour it took to complete this felt like a minute.
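
A rough sketch of the scraping loop, reusing the helper functions from the sketch above; extract_company and extract_location are assumed to follow the same pattern, and the city and title lists below are a hypothetical shortlist rather than the full set I scraped:

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical subsets; the real scrape covered far more cities and titles.
cities = ["New+York", "San+Francisco", "Austin", "Seattle", "Chicago"]
queries = ["data+scientist", "data+analyst", "machine+learning+engineer"]

rows = []
for query in queries:
    for city in cities:
        for start in range(0, 100, 10):                   # page through the results, 10 per page
            url = f"https://www.indeed.com/jobs?q={query}&l={city}&start={start}"
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            for card in soup.find_all("div", class_="jobsearch-SerpJobCard"):
                rows.append({
                    "Job": extract_job_title(card),        # helpers from the sketch above;
                    "Company": extract_company(card),      # extract_company and extract_location
                    "Location": extract_location(card),    # are assumed to follow the same pattern
                    "Salary": extract_salary(card),
                })
            time.sleep(1)                                  # be polite between requests

jobs = pd.DataFrame(rows)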

With all of the required data in a dataframe, I could now run all the amazing models I had learned over the first 6 weeks of the course. Let's begin by looking at the number of jobs I actually managed to scrape:

Job         17714
Company      8756
Location     2782
Salary        860

Since this project is focused on predicting salaries, I needed to clean the data to make sure that the job listings I kept for the analysis had an associated salary. This is where I was nervous. The sunk cost fallacy was kicking in. I had waited almost 5 hours to web scrape and add all my data to a dataframe, so I needed this to work smoothly.

After checking for any null values, I found that our lovely ‘Salary’ column had exactly 14,046 missing values. This is getting tight. After removing the rows without an actual salary, I had to clean the column further to remove all of the daily and monthly salary rates.

I really like the look of this cell.
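
In spirit, that cell does something like the sketch below, assuming the scraped dataframe is called jobs. The exact parsing rules in my notebook differ slightly:

import re

# Keep only listings that actually report a salary.
salaried = jobs.dropna(subset=["Salary"]).copy()

# Drop the daily and monthly salary rates, keeping the rest.
salaried = salaried[~salaried["Salary"].str.contains("day|month", case=False)]

# Strip symbols and take the midpoint of ranges like "$90,000 - $120,000".
def parse_salary(text):
    numbers = [float(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    return sum(numbers) / len(numbers) if numbers else None

salaried["Salary"] = salaried["Salary"].apply(parse_salary)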

A total of 912 values remained by the end of the data cleaning. Not the biggest sample, but sufficient for this project. I probably should not have selected the United States because, unlike in the United Kingdom, most employers or agencies do not list salaries on job postings. You live and learn. I, too, am a self-learning machine.

Here is what my cleaned dataframe looks like at the end of pre-processing:

I felt like a proud father looking at this.

EDA and Setting Up Our Target:

The EDA performed on this data set was basic but, as always, necessary. I needed to check whether there were any strange or out-of-the-ordinary values in the salary column.

count       912.000000
mean     120666.084978
std       44826.590910
min           3.000000
25%       91039.000000
50%      115555.000000
75%      145000.000000
max      475000.000000

There was a salary value of $3 per year, which is definitely a data point I needed to discard, so I did just that. Without even thinking twice, I removed it. That's just how I operate at this point. It made sense, though: with 912 values, I could afford to move on quickly.

This made me question whether or not there were other glaring outliers in my dataset.

A histogram on all salaries.

There are a few values on the higher end which I had an internal debate about removing. Since I am going to be creating dummy variables for my target based on the percentile score, does it really make a difference to my data? Should it? I am working under the assumption that it won't affect my model too much; however, after writing and re-reading my analysis, I may learn otherwise.
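
A quick sketch of the outlier removal and the histogram, assuming the cleaned dataframe from earlier; the $1,000 cut-off is an arbitrary illustration rather than the rule I used:

import matplotlib.pyplot as plt

# Drop the obviously bad $3-a-year data point.
salaried = salaried[salaried["Salary"] > 1000]

# A quick look at the distribution to spot any remaining outliers.
salaried["Salary"].hist(bins=50)
plt.xlabel("Salary (USD)")
plt.ylabel("Count")
plt.show()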

Creating My Target:

I decided to test my models on two different target variables.

1- Multi-class target: four classes, each defined by the percentile (quartile) bin the salary falls into.

Simple and potentially effective.

Bonus: Baseline Accuracy:

2- High/Low target: a simple binary target for salaries higher or lower than the median.

Extremely simple, and will most likely yield the best results.
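
Both targets, plus their baseline accuracies, can be built in a few lines of pandas. A sketch, assuming the cleaned Salary column from above:

# Multi-class target: four bins split at the salary quartiles.
salaried["salary_quartile"] = pd.qcut(salaried["Salary"], q=4, labels=[0, 1, 2, 3])

# Binary target: above (1) or below (0) the median salary.
median_salary = salaried["Salary"].median()
salaried["high_salary"] = (salaried["Salary"] > median_salary).astype(int)

# Baseline accuracy is just the share of the most common class.
print(salaried["salary_quartile"].value_counts(normalize=True).max())   # roughly 0.25
print(salaried["high_salary"].value_counts(normalize=True).max())       # roughly 0.50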

My Models:

The first task was to use only location as a feature for the model. To do that, I had to separate all the locations into city and state and then dummify each city so that I could get a variety of location features for the model.

This dummification process gave me a total of 196 columns, with each city or area as a feature in the model.

Some interesting cities in there, even though I had specified a city list when web scraping. So it goes.

Every X variable (predictor) was split and scaled; it's just good practice. The target variable here is the multi-class target.
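
A sketch of that dummification, split and scale step; the column names are assumptions based on the dataframe above:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split "City, ST" locations and one-hot encode the city.
salaried[["City", "State"]] = salaried["Location"].str.split(",", n=1, expand=True)
X = pd.get_dummies(salaried["City"], drop_first=True)   # this gave me roughly 196 columns
y = salaried["salary_quartile"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)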

I started by running a grid search on a logistic regression, which yielded a fairly low score:

The best parameters for our model: {'C': 719.6856730011528, 'fit_intercept': True, 'max_iter': 100000, 'penalty': 'l2', 'random_state': 7, 'verbose': 1}
The best LogisticRegression score: 0.31711145996860285
The best score on the test set: 0.28205128205128205
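
For reference, the grid search looked roughly like this; the parameter grid below is illustrative rather than the exact one I used:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": np.logspace(-3, 3, 20),
    "penalty": ["l2"],
}
logit_gs = GridSearchCV(
    LogisticRegression(max_iter=100_000, random_state=7),
    param_grid,
    cv=5,
)
logit_gs.fit(X_train_scaled, y_train)

print(logit_gs.best_params_)
print(logit_gs.best_score_)                    # mean cross-validated training score
print(logit_gs.score(X_test_scaled, y_test))   # score on the held-out test set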

This is when I decided to start adding more features, like job seniority (analyst, junior and senior) as well as job title keywords (data scientist, machine learning, etc.).
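
A sketch of how such title-based flags can be derived; the keyword rules below are illustrative:

# Flag seniority and broad role keywords from the job title.
title_lower = salaried["Job"].fillna("").str.lower()

salaried["is_senior"] = title_lower.str.contains(r"senior|sr\.|lead|principal").astype(int)
salaried["is_junior"] = title_lower.str.contains(r"junior|jr\.|graduate|intern").astype(int)
salaried["is_analyst"] = title_lower.str.contains("analyst").astype(int)
salaried["is_ml"] = title_lower.str.contains("machine learning").astype(int)
salaried["is_data_scientist"] = title_lower.str.contains("data scientist").astype(int)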

Once again, I split and scaled the predictors and began with the following models, sketched below:

1- KNN

2- DecisionTree

3- Bagging Classifier with KNN as the base estimator

4- Bagging Classifier with DecisionTree as the base estimator

5- Random Forest
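
A sketch of how these five models can be set up and compared, using default hyperparameters and scikit-learn as it was around 2020 (the bagging argument base_estimator has since been renamed estimator):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

models = {
    "knn": KNeighborsClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=7),
    "bagged_knn": BaggingClassifier(base_estimator=KNeighborsClassifier(), random_state=7),
    "bagged_tree": BaggingClassifier(base_estimator=DecisionTreeClassifier(), random_state=7),
    "random_forest": RandomForestClassifier(random_state=7),
}

for name, model in models.items():
    cv_score = cross_val_score(model, X_train_scaled, y_train, cv=5).mean()
    test_score = model.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
    print(f"{name}: cv={cv_score:.3f}  test={test_score:.3f}")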

The scores were not flattering at all. In fact, they were barely better than when I just had cities as predictors. The DecisionTree was the best model, but it almost hurts to show how bad a score it produced.

DT CV training score:  0.36538461538461536
DT training score:     0.7225274725274725
DT test score:         0.34615384615384615

I tried and tested many different parameters and went back to clean the data further, until I realised I should perhaps change my target variable to the simple binary target defining high and low salaries (above or below the median).

Let’s begin again, this time with a different target, High/Low Salary. Here are the scores and results.

As we can see, an incredible improvement.

The scores are an obvious improvement, and giving the model less work to do when predicting the target seems to have solved the scoring issue. However, there is still noticeable variance in the model: the cross-validation and test scores are close to each other, but the training score is significantly higher.

This made me want to improve the model as much as possible. Others in my class were getting slightly, but consistently, better scores. I figured it may be because they were using data from the United Kingdom, which has much better reported and listed salary data.

I attempted to remove cities that had under 5 job listings because my thinking was that I could feature engineer myself to a better score.

I was wrong.

I performed the following on my data:

Remove all cities with under 5 jobs reported.
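
In pandas terms, something like this sketch, assuming the City column created earlier:

# Keep only cities that appear in at least five of the salaried listings.
city_counts = salaried["City"].value_counts()
common_cities = city_counts[city_counts >= 5].index
salaried = salaried[salaried["City"].isin(common_cities)]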

This led to much less variance but lower scores overall:

logit
Best Parameters:
{'C': 0.0071968567300115215, 'fit_intercept': True, 'max_iter': 100000, 'penalty': 'l2', 'random_state': 7, 'verbose': 1}
Best estimator mean cross validated training score:
0.6497874350495983
Best estimator score on the full training set:
0.6771978021978022
Best estimator score on the test set:
0.6538461538461539


cart
Best Parameters:
{'ccp_alpha': 0, 'max_depth': None, 'max_features': 3, 'min_samples_split': 40}
Best estimator mean cross validated training score:
0.6580255077940482
Best estimator score on the full training set:
0.7129120879120879
Best estimator score on the test set:
0.6703296703296703


knn
Best Parameters:
{'n_neighbors': 1}
Best estimator mean cross validated training score:
0.5768445914029287
Best estimator score on the full training set:
0.7074175824175825
Best estimator score on the test set:
0.6593406593406593


random_forest
Best Parameters:
{'max_depth': 3, 'n_estimators': 100}
Best estimator mean cross validated training score:
0.6401133679735475
Best estimator score on the full training set:
0.6565934065934066
Best estimator score on the test set:
0.6428571428571429

I was really not impressed at this point.

I decided to stick with my earlier model and try GradientBoosting to dampen the effect of poorly chosen features, then analyse the results.

This is what I ended up with:

0.7651098901098901 (training score)
0.6222527472527473 (cv score)
0.7032967032967034 (test score)
Feature importances of my GradientBoostingClassifier
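
A sketch of the GradientBoosting step and the feature importance ranking, assuming X now holds the location dummies plus the title flags and y is the binary high/low target:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

gb = GradientBoostingClassifier(random_state=7)
gb.fit(X_train_scaled, y_train)

print(gb.score(X_train_scaled, y_train))                           # training score
print(cross_val_score(gb, X_train_scaled, y_train, cv=5).mean())   # cv score
print(gb.score(X_test_scaled, y_test))                             # test score

# Rank the features the booster leaned on most.
importances = pd.Series(gb.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))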

That’s not too bad. Along with the list of feature importances, I think I have a good model that works better than the baseline. That’s all I could have hoped for at this stage.

Here is what I would have done differently:

1- I noticed that a few people in class had taken the job summary into account as well. I should have done that, had I had more time. I would have used NLP to analyse the different keywords in the job summary and added them as features.

Adding in the summary as a feature would have eliminated some of the guesswork that went on once the data was scraped and the EDA began. It would have taken me some more time, but it would have been worth it to improve my model by 10–15%, which, in hindsight, seems doable.

2- I should have spent more time on EDA to make sure that the right kind of cities and job titles were included and filtered through. This comes back to point 1 above. If I had a summary variable to work with, I could have eliminated a lot of guesswork and answered many questions about the validity of my features.

3- Pick better data. The United States' job postings were probably not the best selection. The United Kingdom results would have given me more usable data to play with. My logic was that the US is the largest market for data scientists, hence it would have significantly more data points than the UK (which turned out to be true). However, the quality of the data is equally important, and while my scraping and analysis were not the best, they would have improved with a significantly better dataset.

Classification Reports:

I genuinely love classification reports. They are an insight into how good our models actually are at making accurate predictions. The final model I selected as the go-to was the last one we discussed, the GradientBoostingClassifier.

Confusion matrix for our training set.
Confusion matrix for our test set.

Precision measures the percentage of our positive predictions that were actually correct (TP/(TP+FP)), while recall is the percentage of actual positives that our model correctly classified (TP/(TP+FN)).

Let's put it this way: looking at the test report, when the model predicts a salary class it is right about 70% of the time, and it also catches about 70% of the listings that actually belong to each class.

The ROC-AUC curve is another great illustration of this idea. The area under the curve represents how well our model separates the two classes. The higher the AUC, the better our model is at telling them apart.
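
A sketch of how the classification report, ROC-AUC score and ROC curve come out of scikit-learn, using the fitted GradientBoosting model from above:

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

y_pred = gb.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# ROC-AUC is computed from the predicted probability of the "high salary" class.
y_prob = gb.predict_proba(X_test_scaled)[:, 1]
print(roc_auc_score(y_test, y_prob))

fpr, tpr, _ = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label="GradientBoosting")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()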

Not too bad considering the flaws that I see in it myself.

Conclusion:

My model could have been improved in significant ways and the data could have been gathered and cleaned much more effectively. However, I believe I did the best I could with the time given to us.

Our 70% precision and recall scores are better than the baseline, which means we still predict the classes around 20 percentage points better than the 50% baseline. That's a good start. Judging by our ROC curves, there is roughly a 3-in-4 chance that the model ranks a randomly chosen high-salary listing above a randomly chosen low-salary one, based on the models we have selected and tuned.

Notes for the future: get better at recognising patterns, making decisions, and clearly illustrating the different steps and procedures en route to a model. The modelling part is easy and simple, but the pre-processing, cleaning and EDA are where the magic happens.
