Prediction Model: What Do Data Science Jobs Pay?

Han Man
12 min read · Mar 1, 2017
Our Jobs.

So what does a data science job pay these days? Ever since the Harvard Business Review called it the sexiest job of the 21st century, way back in 2012, data science has carried the connotation of a hot, lucrative, sophisticated, and desirable career. Since then, there has been some backlash, to put it mildly. But it's 2017 and data scientist is still #1 on Glassdoor's list of the top 50 jobs in America. I feel like I have witnessed the whole cycle: the hype, the backlash, and now the comeback. Hopefully, I haven't missed the big wave already! I wanted to see: what does a listed data scientist job pay today?

The first challenge with this question is that the data science field encompasses many job types. To focus on what current listings look like, I decided to analyze data listed on www.indeed.com, a job site that has gained considerable popularity in recent years.

My goal was to:

1. Gather job listing data from online postings

2. Create a model to classify jobs by salary and explore feature selection techniques to optimize the model

3. Document and present insights/recommendations based on what we can infer about the relationship between attributes and salary

My hero?

Data Acquisition and Processing

The first step for me was to gather a dataset that I could work with. It quickly became clear that Indeed was going to be the right source: it was well populated with data science listings and easy to access. A quick inspection of the website source code revealed that it was relatively well organized.

For example: the job title shown in an instance of the search results page is stored in an “a” element with class=“jobtitle”, inside a “div” element with class=“row result”. Understanding the pattern makes it simple to access the data for collection and analysis. Using similar patterns in the website, I set up a web scraper to gather listings for a few different data science job searches. I wanted to look at a few cities where I would be interested in working. In fact, it looks like Indeed themselves have compiled a report on data science jobs in the US for 2016.

Inspection of the indeed.com webpage.
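
As a minimal illustration of that structure (a sketch only; it assumes the class names above, and Indeed's markup can change at any time), the job title in each result block can be pulled out with a CSS selector:

# Hypothetical sketch: extract job titles from one search results page,
# assuming the "div.row.result" / "a.jobtitle" structure described above.
import requests
from bs4 import BeautifulSoup

page = requests.get("http://www.indeed.com/jobs?q=data+scientist&l=New+York")
soup = BeautifulSoup(page.content, "lxml")

for result in soup.select("div.row.result"):
    title_link = result.find("a", class_="jobtitle")
    if title_link is not None:
        print(title_link.get_text(strip=True))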

I like NYC, Chicago, and California, having lived in all three areas. I prefer to stay on the coasts, but using the Indeed report as a guide, I put together a relatively long list of other cities that had strong data science footprints.

I wrote a function to pull job title, location, company, and salary data. Then I ran that function across the entire list of cities. I performed two separate pulls a week apart in order to increase the number of unique listings in my dataset.

import re
import numpy as np
import requests
from bs4 import BeautifulSoup

def getTitle(souper):
    return souper.find(class_="turnstileLink")['title']

def getSalary(souper):
    # strip the <nobr> wrapper around the listed salary text
    entries = souper.find('td', class_='snip').find('nobr')
    if entries is None:
        return np.NaN
    else:
        return str(entries)[7:(len(str(entries)) - 7)]

def getCompany(souper):
    entries = souper.find('span', class_='company')
    if entries is None:
        return np.NaN
    else:
        return entries.text.replace("\n", "").replace(" ", "")

def getLocation(souper):
    entries = souper.find('span', class_='location')
    return entries.text.replace("\n", "").replace(" ", "")

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={0}&start={1}"
max_results_per_city = 500

locations = []
salaries = []
titles = []
companies = []
cities = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Atlanta', 'Seattle', 'Los+Angeles', 'San+Diego',
                 'Mountain+View', 'San+Jose', 'Washington+DC', 'Boston', 'Denver', 'Dallas', 'Salt+Lake+City',
                 'Minneapolis', 'Santa+Barbara', 'Arlington']):
    # page through results 10 at a time
    for start in range(0, max_results_per_city, 10):
        URL = url_template.format(city, str(start))
        r = requests.get(URL)
        soup = BeautifulSoup(r.content, "lxml")
        # each listing lives in a div whose id starts with "p_"
        entries = soup.findAll('div', class_='result', id=re.compile("p_"))
        for a in entries:
            titles.append(getTitle(a))
            salaries.append(getSalary(a))
            companies.append(getCompany(a))
            locations.append(getLocation(a))
            cities.append(city)

All in all, I gathered roughly 16k listings, of which about 6,000 were unique. My guess is that most cities do not have the 500 listings I requested, so eventually the website started to cycle through job listings. The two pulls a week apart also surely had overlapping entries. Of the ~6,000 unique job listings, however, only 291 listed a salary. My guess is that most employers do not want to explicitly publish salaries. From experience, I know that for most companies the salary discussion during the hiring process happens only after a live conversation, whether over the phone or in person. This was a disappointing data return, but still enough to work with. Let's give it a try.
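
The step from the scraped lists to the dff DataFrame used below isn't shown explicitly; a plausible sketch, assuming pandas and the 'job', 'salary', and 'city' column names used later (the 'company' and 'location' names here are my own guesses), would be:

import pandas as pd

# Hypothetical assembly of the scraped lists into a DataFrame,
# then deduplication of listings repeated across pages and pulls.
dff = pd.DataFrame({'job': titles,
                    'salary': salaries,
                    'company': companies,
                    'location': locations,
                    'city': cities})
dff = dff.drop_duplicates().reset_index(drop=True)
print(len(dff))   # roughly 6,000 unique listings out of ~16k rows scraped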

To clean up the data, I first had to convert the salary figures into a workable numeric form. Listings came in four flavors: hourly, weekly, monthly, and annual. I converted everything to annual (hourly × 2,080 hours, weekly × 52, monthly × 12), assuming that jobs not listed as annual would have enough demand for a new hire to work an entire year. Some salaries came as ranges, so I took the midpoint of the range.

def categorize(a):
    # missing salaries come through as float NaN
    if type(a) == float:
        return a
    elif "hour" in a:
        return "hourly"
    elif "year" in a:
        return "annual"
    elif "month" in a:
        return "monthly"
    elif "week" in a:
        return "weekly"

dff['salary_type'] = dff['salary'].apply(categorize)
def salaryconv(a):
    # convert all salary types to an annual figure
    if type(a) == float:
        return a
    else:
        saldic = {"a": 1, "w": 52, "h": 2080, "m": 12}
        saltype = "a"
        if "hour" in a:
            saltype = "h"
        if "week" in a:
            saltype = "w"
        if "year" in a:
            saltype = "a"
        if "month" in a:
            saltype = "m"
        # strip formatting, pull out the numbers, and average any salary range
        b = a.replace(" ", "").replace("$", "").replace(",", "")
        c = re.findall(r"\d.*\d", b)
        d = []
        for lister in c:
            for listers in lister.split("-"):
                d.append(float(listers))
        output = np.mean(d)
        return output * saldic[saltype]

dff['convSalary'] = dff['salary'].apply(salaryconv)
Distributions of annual salary after conversion.

Assumptions

Before we dive into modeling, I want to discuss a few assumptions we are making. I want to model the relationship between a handful of features in the job description and salary. For this model to be predictive, it needs to be representative of the entire universe of jobs. In general, when using a regression model, we are also assuming an underlying normal distribution in the data we sample. The salary data above is skewed right, toward a tail of higher salaries.
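
As a quick check of that skew (a sketch only, assuming the dff['convSalary'] column built above and that scipy is available), a positive skewness statistic confirms the right tail:

from scipy.stats import skew

# Positive skewness indicates the long tail of higher salaries
salaries_clean = dff['convSalary'].dropna()
print(skew(salaries_clean))
print(salaries_clean.describe())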

My instinct was that the salary data we were able to get is not very representative. We only got 291 reported salaries, indicating that not many employers are willing to post exact salaries. My hypothesis is that lower-salary jobs tend to be the ones that post salaries. I also think that jobs which post their salaries as monthly, weekly, or hourly figures will not necessarily hire for an entire year. All this points to our sample being skewed toward lower-paying, lower-quality jobs. We will examine that further later.

Modeling the Data

I decided that, with relatively few data points, it might be more accurate to build a classification model: convert the salaries into a handful of ranges and run a logistic regression.

First, I wanted to tokenize the words in the job titles. As I mentioned earlier, data science jobs encompass a wide range of functional roles in different industries. The job titles varied greatly: there was no way to map them into easily defined buckets. Therefore, I gathered all of the words and found the most commonly occurring ones. Then I tokenized the words with a binary indicator of whether each word appeared in the title.

def inspectjobs(a):
    # split a job title into lowercase words, treating punctuation as separators;
    # split() with no argument also collapses any repeated whitespace
    jobs = []
    b = a.replace("-", " ").replace("&", " ").replace("/", " ").replace(",", " ").replace("(", " ").replace(")", " ").split()
    for bb in b:
        jobs.append(bb.lower())
    return jobs

# dfc_noNans holds the listings with a reported salary (rows with NaN salaries dropped)
dfc_noNans['job_words'] = dfc_noNans['job'].apply(inspectjobs)
dfc_noNans.head()
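
The article doesn't show how the word frequencies were tallied; a minimal sketch, assuming a simple collections.Counter over the job_words column (the exact counting scheme is my assumption), might look like:

from collections import Counter

# Tally how often each word appears across the job titles
word_counts = Counter()
for words in dfc_noNans['job_words']:
    word_counts.update(words)
print(word_counts.most_common(5))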

The list of words included, you guessed it, “data” and “scientist” as the most common. Below are the top 5 most frequently occurring words. I picked the top ~20 words, making sure to include seniority identifiers such as “junior”, “senior”, and “director”, and functional indicators such as “developer” and “programmer”.

[('scientist', 146),
('data', 144),
('analyst', 121),
('research', 82),
('engineer', 45)...

As far as companies go, the results were less encouraging. The top 7 companies are listed below, and they skew toward educational institutions (UT Austin) and headhunters (Jobspring Partners, Harnham). This list did not look representative of the job listings out there, and does not include any listings from companies I know are hiring data scientists all the time, such as NYC startups (Spotify, Squarespace, Blue Apron, Casper), large internet companies (Google, Facebook, Amazon, Jet), quant hedge funds, etc. I decided not to include these company names as features in my model because I did not want to overfit. The jobs that I want to use this model for are going to be at companies that do not appear on this list.

UniversityofTexasatAustin               18
JobspringPartners                       15
SelbyJennings                           14
Harnham                                 11
IntermountainHealthcare                  9
UniversityofWashingtonMedicalCenter      7
UniversityofUtah                         7

The dataframe we have is shown below.

Dataframe showing our dataset.

The features I planned to use were city, salary type (hourly, weekly, monthly, annual), and job title words (the top ~20 words). I needed to dummy-encode all of the input variables and then bin the salaries into 4 ranges.

job_words = ['scientist', 'data', 'analyst', 'research', 'engineer', 'senior', 'learning', 'machine', 'associate',
             'engineering', 'manager', 'software', 'analysis', 'quantitative', 'statistical', 'assistant',
             'clinical', 'medical', 'analytics', 'specialist', 'director', 'center', 'business', 'risk', 'med',
             'science', 'laboratory', 'sr.', 'full', 'programmer', 'modeling', 'statistician']
dummy_columns = ['city', 'salary_type']

import matplotlib.pyplot as plt
import seaborn as sns

# violin plots of salary against each categorical feature
for n in dummy_columns:
    print(n)
    ax = plt.subplots(figsize=(7, 2.5))
    plt.xticks(rotation='vertical')
    ax = sns.violinplot(x=n, y="convSalary", data=dfc_noNans, linewidth=1)
    plt.show()

# dummy-encode the categorical columns
Rdatadum = dfc_noNans
categories = dummy_columns
for category in categories:
    series = Rdatadum[category]
    dummies = pd.get_dummies(series, prefix=category, drop_first=True)
    Rdatadum = pd.concat([Rdatadum, dummies], axis=1)

# binary indicator column for each of the top job-title words
for word in job_words:
    Rdatadum[word] = Rdatadum['job_words'].apply(lambda x: int(1 * (word in x)))

Salary distribution by city
Salary distribution by salary type

# four equal-width salary bins; np.histogram returns (counts, bin_edges)
salary_bins = list(np.histogram(dfc_noNans['convSalary'], bins=4))[1]
salary_bins

def bin_salary(a):
    if (a >= salary_bins[0] and a <= salary_bins[1]):
        return 0
    if (a >= salary_bins[1] and a <= salary_bins[2]):
        return 1
    if (a >= salary_bins[2] and a <= salary_bins[3]):
        return 2
    if (a >= salary_bins[3] and a <= salary_bins[4]):
        return 3

The salaries were distributed into the following four equal-width bins:

Bin   Salary range           # of listings
0     $25,992 - $81,994      146
1     $81,994 - $137,996      86
2     $137,996 - $193,998     48
3     $193,998 - $250,000     11
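
The X and y used in the feature-selection and modeling code below are not defined explicitly in the snippets above; a plausible construction, assuming the Rdatadum frame, job_words list, and bin_salary function built earlier, would be:

# Hypothetical assembly of the model matrix (dummy columns + word indicators)
# and the binned salary target used as X and y below.
feature_columns = [c for c in Rdatadum.columns
                   if c.startswith('city_') or c.startswith('salary_type_')] + job_words
X = Rdatadum[feature_columns]
y = Rdatadum['convSalary'].apply(bin_salary)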

Feature Selection:

I wanted to try two methods for feature selection: SelectKBest and RFECV. Both of these methods score each feature and keep only the handful that impact the model the most, based on a scoring function. For a classification problem like this one, SelectKBest defaults to the ANOVA F-statistic (f_classif) as its scoring function, while RFECV ranks features using the estimator's own coefficients.

from sklearn.feature_selection import SelectKBest
kbest = SelectKBest(k=20).fit(X, y)
kbest_columns=X.columns[kbest.get_support()]
kbest_columns

The top 20 features returned by SelectKBest are:

Index([u'city_San+Francisco', u'city_San+Jose', u'salary_type_hourly', u'salary_type_monthly', u'scientist', u'data', u'analyst', u'research', u'engineer', u'senior', u'learning', u'machine', u'manager', u'quantitative', u'clinical', u'medical', u'analytics', u'director', u'business', u'science'], dtype='object')

Then using RFECV, which recursively eliminates features to find the most important ones:

from sklearn.feature_selection import RFECV
from sklearn.svm import SVR
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
estimator = LogisticRegression()
selector = RFECV(estimator, step=1, cv=5)
selector = selector.fit(X, y)
rfecv_columns=X.columns[selector.get_support()]
rfecv_columns

The top features are:

Index([u'city_Austin', u'city_Dallas', u'city_Denver', u'city_Los+Angeles', u'city_Minneapolis', u'city_New+York', u'city_Salt+Lake+City', u'city_San+Francisco', u'city_San+Jose', u'city_Santa+Barbara', u'city_Seattle', u'city_Washington+DC', u'salary_type_hourly', u'salary_type_monthly', u'scientist', u'data', u'analyst', u'research', u'engineer', u'senior', u'learning', u'machine', u'associate', u'manager', u'software', u'analysis', u'quantitative', u'statistical', u'clinical', u'medical', u'analytics', u'director', u'center', u'business', u'risk', u'med', u'science', u'laboratory', u'full'], dtype='object')

Fitting the Model:

I wanted to use logistic regression, the classification counterpart of a linear regression model. I tuned it with grid-search cross-validation (GridSearchCV), which helps select hyperparameters such as the C value and the type of regularization penalty (L1 or L2).

LRmodel = LogisticRegression()
C_vals = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1','l2']
gs = GridSearchCV(LRmodel, {'penalty': penalties, 'C': C_vals}, verbose=False, cv=10)
gs.fit(X, y)
gs.best_params_

This returns parameters of:

{'C': 5.0, 'penalty': 'l1'}

Then I fit the model:

opLRmodel = LogisticRegression(C=gs.best_params_['C'], penalty=gs.best_params_['penalty'])
fitLRmodel = opLRmodel.fit(X, y)
# one coefficient per feature per salary class; rank features by their largest absolute coefficient
coef = pd.DataFrame(fitLRmodel.coef_).transpose()
coef['c'] = X.columns
coef['abs'] = coef.iloc[:, :4].apply(lambda x: max(np.abs(max(x)), np.abs(min(x))), axis=1)
coef.sort_values('abs', ascending=False)

To evaluate the model, I used cross_val_score:

from sklearn.cross_validation import cross_val_score
print(np.mean(cross_val_score(fitLRmodel, X, y)))

Returning a score of:

0.666738274257

Now, I wanted to test whether feature selection could improve our model. I used the columns chosen by RFECV, since they included the features identified by SelectKBest.

Xnew=X.loc[:,rfecv_columns]
ynew=y
LRmodel = LogisticRegression()
C_vals = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1','l2']
gs = GridSearchCV(LRmodel, {'penalty': penalties, 'C': C_vals}, verbose=False, cv=10)
gs.fit(Xnew, ynew)
gs.best_params_

That returned meta-parameters very different from the original grid search, implying that limiting the features had a significant impact on the final model.

{'C': 10.0, 'penalty': 'l2'}

Now to test the model score using cross_val_score on the reduced feature set. Our final score:

0.68045112782

If we recall, the previous model score was ~67%, so we got about a 1% improvement… not quite the level I expected, but it's encouraging to see improvement! Let's see what conclusions can be drawn from this model.

Findings and Recommendations

To report the conclusions we can draw from this model, I first used classification_report and confusion_matrix to evaluate the model's performance.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# refit the tuned model on the reduced (RFECV) feature set before predicting
fitLRmodel = LogisticRegression(C=gs.best_params_['C'], penalty=gs.best_params_['penalty']).fit(Xnew, ynew)
ypred = fitLRmodel.predict(Xnew)
print(classification_report(ynew, ypred))

Classification Report:

             precision    recall  f1-score   support

          0       0.85      0.92      0.89       146
          1       0.69      0.71      0.70        86
          2       0.87      0.69      0.77        48
          3       1.00      0.55      0.71        11

avg / total       0.81      0.81      0.80       291

Precision measures, of the positives we identified, how many are actually positive; in other words, how well do we minimize false positives? Type 1, jobs with salaries from roughly $81k to $138k, has the highest incidence of false positives.

Recall measures, of the actual positives in our target, how many we are able to identify. The rate is best at the lowest salary band and gets worse as salary increases. For example, at the highest salary band, “3”, every positive we identify is in fact a category 3 (precision = 100%), but we are only able to identify 55% of all the category 3s in the target list. From this classification report, it looks like our threshold for higher salaries needs to be lowered so that more data points are classified into those bands.

Bin   Salary range           # of listings
0     $25,992 - $81,994      146
1     $81,994 - $137,996      86
2     $137,996 - $193,998     48
3     $193,998 - $250,000     11

Confusion Matrix:

pd.DataFrame(confusion_matrix(ynew, ypred, labels=[0, 1, 2, 3]))
Confusion Matrix

The confusion matrix tells a similar story. The y-axis shows the actual classifications and the x-axis shows our model's classifications; the diagonal holds the correctly classified data points. It looks like our model had the most trouble with band 1, with 22 cases where the salary is actually a 1 but was predicted as a 0. Similarly, there were 14 cases where a type 2 salary was incorrectly classified as a type 1.
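
To connect the two views, per-class precision and recall can be read straight off the confusion matrix; a small sketch (assuming the confusion_matrix call above) might be:

cm = confusion_matrix(ynew, ypred, labels=[0, 1, 2, 3])

# For a given class k: precision = correct / all predicted as k (column sum),
# recall = correct / all actually k (row sum).
for k in range(4):
    precision_k = cm[k, k] / float(cm[:, k].sum())
    recall_k = cm[k, k] / float(cm[k, :].sum())
    print(k, round(precision_k, 2), round(recall_k, 2))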

Finally, we can examine the model coefficients to see if there are any relevant conclusions. These are the top coefficients, ordered by the maximum absolute value of the coefficient for a particular attribute. Four cities appear on the list: Minneapolis has a strong negative coefficient for salary type 2 and a strong positive value for salary type 1, meaning jobs from Minneapolis generally pay less. In the same vein, Denver typically has type 1 jobs, Dallas type 0, and San Francisco type 3. This is in line with our expectation that San Francisco is where higher-paid tech jobs tend to be, and the city generally has a higher cost of living; it also matches the Indeed tech jobs report from earlier. In terms of key words, “director”, “quantitative”, “analytics”, “engineer”, “medical”, and “clinical” have large positive values for salary types 2 and 3. Conversely, “software”, “manager”, “business”, “med”, “research”, “analysis”, and “analyst” have large positive values for salary types 0 and 1. That implies the former word group correlates best with higher pay and the latter with lower pay. Curiously, and contrary to our earlier hypothesis, the type of salary listed (i.e. annual, monthly, etc.) did not have a large effect on salary compared with the other features.

Final model parameter values.

Report:

Model Type- Logistic Regression

Model Meta Parameters- C= 10.0, penalty=l2

Cross Validated Score- 0.68045112782

Classification Report-

             precision    recall  f1-score   support

          0       0.85      0.92      0.89       146
          1       0.69      0.71      0.70        86
          2       0.87      0.69      0.77        48
          3       1.00      0.55      0.71        11

avg / total       0.81      0.81      0.80       291

Recommendations-

When searching for jobs, if high pay is the main objective, there are a few things to look for in a listing's location and job title:

  1. Jobs in San Francisco pay more, while jobs in Denver, Dallas and Minneapolis pay less.
  2. Director and engineer roles, and jobs in the medical or clinical field, tend to pay more. Manager and analyst roles, and jobs in software, business, or research, tend to pay less. Jobs mentioning “analytics” pay more; jobs mentioning “analysis” tend to pay less.

All good things to keep in mind the next time I do a job search on Indeed.
