Our third project was meant to test our skills in web scraping, natural language processing, and classification modeling. It started with building a web scraper for indeed.com, a popular job posting website used across the world. More specifically, we were looking at Data Science positions across major metropolitan areas in the United States (I guess they were trying to get us excited for our impending job search). I ran my scraper through about 40 cities and pulled around 10,000 jobs with details that included title, company, location, and salary. Of these, only about 600 contained actual numeric salary listings. Not a great feeling to wake up to after letting your scraper run all night, but seeing all the data you pulled was pretty awesome.
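For anyone curious how you separate the listings with real numbers from the rest, here is a minimal sketch. The regular expression, the `parse_salary` helper, and the sample rows are all made up for illustration; they are not the code from my actual scraper, and real listings are messier than this.

```python
import re

# Matches dollar amounts such as "$85,000" or "$42.50". The pattern is an
# illustrative assumption, not the one used in the original scraper.
DOLLAR_AMOUNT = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)")


def parse_salary(text):
    """Return the average of the dollar amounts in `text`, or None if there are none.

    For a range like "$80,000 - $100,000" the average is the midpoint.
    """
    if not text:
        return None
    amounts = [float(m.replace(",", "")) for m in DOLLAR_AMOUNT.findall(text)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)


# Hypothetical scraped rows; real ones would come out of the scraper.
listings = [
    {"title": "Data Scientist", "salary": "$80,000 - $100,000 a year"},
    {"title": "Senior Data Analyst", "salary": None},
    {"title": "Research Scientist", "salary": "Competitive"},
]

# Keep only the rows where a numeric salary could be extracted.
with_salary = [row for row in listings if parse_salary(row["salary"]) is not None]
```

Taking the midpoint of a range is just one simple choice; you could equally keep the low and high ends as separate columns.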
Our goal was to predict whether or not a given salary in our home city, Washington D.C. in this case, would fall above the median salary, which was around $85,000. This was all being done to help our fictitious boss undercut his newly hired data scientist's starting salary. A big part of the challenge here was the tiny percentage of job listings that even had salaries. The next part of the project was to find the modeling technique that would best predict whether the listings without salaries would fall above or below our median. I started off by using location as the only predictor in a Random Forest model (for all non-data scientists who accidentally stumbled across this blog, a great explanation can be found here). It did poorly, performing barely better than a coin toss.
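Turning salaries into an above-or-below-median target is simple enough to sketch in a few lines. The function names and the salary figures below are hypothetical, chosen only to show the idea. The baseline function shows the accuracy you get from always guessing the more common class, which is a handy yardstick for whether a model is really beating a coin toss.

```python
from statistics import median


def label_above_median(salaries, threshold=None):
    """Label each salary 1 if it is above the threshold, else 0.

    If no threshold is given, the median of `salaries` is used.
    """
    if threshold is None:
        threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    ones = sum(labels)
    zeros = len(labels) - ones
    return max(ones, zeros) / len(labels)


# Made-up salaries for illustration only.
salaries = [60000, 72000, 85000, 110000, 140000]
labels = label_above_median(salaries)
baseline = majority_baseline_accuracy(labels)
```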
Time to step it up a little. I then decided to use natural language processing on the job title to see if we could recognize any keywords that would help determine whether a given salary would fall above or below the median. Running this data through a logistic regression model made our accuracy shoot up by around twenty percent. Much better, but I was still aiming higher. The winner was a Random Forest model, which gave us an accuracy of around 80 percent, with slight variations due to the random nature of the model.
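Here is a rough sketch of what the job-title approach can look like in scikit-learn. It assumes a simple bag-of-words setup with `CountVectorizer`; the titles, labels, and parameter values are invented for illustration and are not my actual features or settings. Fixing `random_state` is one way to get rid of the run-to-run variation in the Random Forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up job titles and above-median labels (1 = above, 0 = below).
titles = [
    "Senior Data Scientist",
    "Lead Machine Learning Engineer",
    "Principal Data Scientist",
    "Junior Data Analyst",
    "Research Assistant",
    "Data Entry Clerk",
]
above_median = [1, 1, 1, 0, 0, 0]

# Each pipeline turns titles into word and word-pair counts, then fits a classifier.
logit = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
forest = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    RandomForestClassifier(n_estimators=200, random_state=0),
)

logit.fit(titles, above_median)
forest.fit(titles, above_median)

# Predict for a title the models have not seen.
predictions = forest.predict(["Senior Machine Learning Scientist"])
```

With real data you would hold out a test set or use cross-validation before trusting any accuracy number; six toy titles are only enough to show the plumbing.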
Back to our tight-fisted fictitious boss. After analyzing our predicted salaries for the D.C. Metro area, it looks like the boss got a raw deal. Seventy percent of jobs had predicted salaries over the median, which is not all that surprising considering the high cost of living here (source, you may ask? My $1,500-a-month, walk-in-closet-sized studio apartment). The D.C. listings in our original data that had salaries had a median salary of $125,000, compared with $85,000 nationally. Some other interesting findings: nearly all listings that contained actual numeric salaries were government or academia jobs, so there was very little insight to glean on private sector pay. Workers paid an hourly, daily, or monthly wage cost about $30,000 a year less than salaried workers.
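Comparing hourly, daily, and monthly pay with annual salaries only works once everything is on a yearly scale. A minimal sketch of that conversion is below. The multipliers assume a 40-hour week, 52 weeks, and 260 working days a year; those are my assumptions for illustration, not figures taken from the listings, and the `annualize` helper is hypothetical.

```python
# Assumed multipliers for a full-time year: 40 hours x 52 weeks = 2,080 hours,
# 5 days x 52 weeks = 260 working days.
PERIODS_PER_YEAR = {
    "hour": 2080,
    "day": 260,
    "month": 12,
    "year": 1,
}


def annualize(amount, period):
    """Convert pay quoted per `period` into an approximate yearly figure."""
    if period not in PERIODS_PER_YEAR:
        raise ValueError(f"unknown pay period: {period!r}")
    return amount * PERIODS_PER_YEAR[period]


hourly_as_yearly = annualize(40, "hour")
monthly_as_yearly = annualize(5000, "month")
```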
All in all, I really enjoyed this project, particularly the web scraping. Learning this skill really opens up the internet (within the limits of robots.txt, obviously) to all kinds of data that you didn’t have access to before, which is a hugely valuable asset for anyone in the industry.
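If you want to check robots.txt rules in code before scraping, Python's standard library has `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so it runs without touching the network; the rules, the `example.com` URLs, and the `my-scraper` user agent are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration. Against a real site you would call
# parser.set_url("https://<site>/robots.txt") and parser.read() instead.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL.
jobs_allowed = parser.can_fetch("my-scraper", "https://example.com/jobs")
private_allowed = parser.can_fetch("my-scraper", "https://example.com/private/data")
```

Keep in mind that robots.txt is only part of the picture; a site's terms of service may have their own rules about scraping.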