Web Scraping for Data Scientist Salaries Across the USA

Valeria Rozenbaum
Apr 24, 2017


Our third project consisted of scraping data scientist job listings from Indeed.com and building a model that predicts, from the job title, whether a listing's salary will be above or below the median.

I began by splitting the scraper into four functions, each responsible for scraping one part of a listing: Location, Salary, Company and Job Title.

The function that extracts the Company name from the listing
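As a rough illustration of what such a function can look like, the sketch below pulls the company name out of a single listing's HTML using only Python's standard library. The `span` element with class `company`, the `CompanyParser` class and the `extract_company` name are all assumptions made for this example; they are not taken from the original scraper or guaranteed to match Indeed's markup.

```python
from html.parser import HTMLParser


class CompanyParser(HTMLParser):
    """Collect the text inside the first <span class="company"> element.

    The tag and class name are assumed for illustration only.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while we are inside the company span
        self._parts = []     # text fragments found inside the span
        self._done = False   # True once the first company span has closed

    def handle_starttag(self, tag, attrs):
        if self._done or tag != "span":
            return
        if self._depth:
            self._depth += 1  # a span nested inside the company span
        elif "company" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)

    @property
    def company(self):
        # Collapse runs of whitespace; return None when nothing was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_company(listing_html):
    """Return the company name for one listing, or None if it is missing."""
    parser = CompanyParser()
    parser.feed(listing_html)
    return parser.company
```

Returning `None` for a missing field, rather than raising, lets the calling loop keep going and leaves the gap visible later in the data frame.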

I then tied the whole scraper together using the previously created functions. The complete scraper looked for jobs in 17 states including the District of Columbia.

The full web scraper used the 4 previously created functions to extract listing information across 17 states

The results were then added to a data frame, with each field in its own column. After each scrape I ran a custom EDA function on the data frame to see what the results had brought in.

My last scrape of Indeed.com brought in a total of 5,168 listings.

The salary data needed a lot of cleanup. I used regex to strip symbols such as dollar signs and commas. Unfortunately, the majority of the job listings on Indeed did not contain salary information, and when they did, the salaries were usually given as ranges. To collect as much salary data as possible, I ran my scraper every night until I had accumulated a little over 10,500 results. Besides being missing or given as ranges, a fair number of salaries were hourly, monthly or even weekly. It would have been possible to convert these to yearly incomes, but that would have required a rather large assumption about the type of jobs these were. Instead, I decided to focus on just the listings with yearly salaries and drop the rest from the data frame.
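The cleanup described above can be sketched roughly as follows. This is a minimal example, not the original code: it assumes the scraped salary text looks like `$80,000 - $120,000 a year` or `$45 an hour`, and the names `parse_salary` and `yearly_only` are hypothetical.

```python
import re

# Pay period named in the salary text (assumed wording, e.g. "a year").
PERIOD_RE = re.compile(r"\b(year|month|week|day|hour)\b", re.IGNORECASE)
# A number that may contain thousands separators and a decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_salary(text):
    """Return (low, high, period), or None when no usable salary is present.

    A single figure is returned as a range whose low and high are equal.
    """
    if not text:
        return None
    # Dollar signs are skipped by the pattern; commas are removed here.
    amounts = [float(n.replace(",", "")) for n in NUMBER_RE.findall(text)]
    period = PERIOD_RE.search(text)
    if not amounts or period is None:
        return None
    return min(amounts), max(amounts), period.group(1).lower()


def yearly_only(raw_salaries):
    """Keep only (low, high) pairs for salaries quoted per year."""
    parsed = (parse_salary(s) for s in raw_salaries)
    return [
        (low, high)
        for low, high, period in (p for p in parsed if p is not None)
        if period == "year"
    ]
```

Keeping the pay period in the parsed result makes the decision to drop hourly, weekly and monthly listings a separate, explicit filtering step instead of something buried in the regex.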

To make the yearly salary ranges usable, I wrote a custom function that took the average of each range's lower and upper bounds and stored the result in a new column. The final data frame consisted of 493 job listings, with a median salary of 100k.
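A minimal sketch of that averaging step is shown below, using plain dictionaries in place of data frame rows. The field names `low`, `high` and `avg_salary`, and both function names, are assumptions for illustration.

```python
from statistics import median


def range_midpoint(low, high):
    """Average of the lower and upper bounds of a salary range."""
    return (low + high) / 2


def add_average_salary(rows):
    """Return copies of the rows with an added 'avg_salary' field.

    Each row is a dict with numeric 'low' and 'high' keys (hypothetical names).
    """
    return [
        dict(row, avg_salary=range_midpoint(row["low"], row["high"]))
        for row in rows
    ]


# Made-up example values, not data from the scrape.
listings = [
    {"title": "Data Scientist", "low": 80000, "high": 120000},
    {"title": "Senior Data Scientist", "low": 130000, "high": 150000},
    {"title": "Data Analyst", "low": 60000, "high": 60000},
]
with_avg = add_average_salary(listings)
median_salary = median(row["avg_salary"] for row in with_avg)
```

A listing with a single salary figure can be treated as a range whose two bounds are equal, so its midpoint is the figure itself.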

To run a model on the data, I needed a way to mark which salaries were above and which were below the median. I created a new column in the data frame called ‘Salary_Above_Median’ and wrote a function that ran through all the salaries, stamping a 1 in the new column if the salary was above the median and a 0 if it was below. Overall, the split between the two classes was fairly even, with 236 listings above the median and 257 below.
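The labeling step can be sketched as below. The function name is hypothetical, and the handling of salaries exactly equal to the median is an assumption: the text only covers salaries above and below it, so this sketch assigns ties a 0.

```python
from statistics import median


def label_above_median(salaries):
    """Return a 1/0 flag per salary: 1 if strictly above the median, else 0.

    Salaries equal to the median get 0 (an assumption of this sketch).
    """
    cutoff = median(salaries)
    return [1 if salary > cutoff else 0 for salary in salaries]


# Made-up example values, not data from the scrape.
salaries = [90000, 100000, 100000, 120000, 150000]
salary_above_median = label_above_median(salaries)
```

With a strict comparison, any listings that sit exactly on the median fall into the 0 class, which is one way a split around the median can end up uneven.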
