Predicting Data Science Salaries Based on Natural Language Processing

I set out to determine which words correlate most strongly with changes in data science salaries by building models that predict whether a job listing falls above or below the median salary. I did this using only information gathered from Indeed.com.

It did not take long to realize that very few postings on Indeed include salary information. Because NLP works most effectively with large training sets, I chose to use Indeed’s salary estimates rather than restrict myself to the far smaller set of postings that listed a salary. This approach has inherent flaws: I am now dependent on Indeed’s algorithms, which I cannot reproduce, and I can’t verify how accurate their estimates are. However, because my goal is to predict whether a listing falls above or below the median, pinpoint salary accuracy is not my top concern. I simply need a strong belief that Indeed can place a listing in the correct class, above or below the median, and I do.
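To make that above-or-below-median framing concrete, here is a minimal sketch of the target construction, assuming a pandas DataFrame of scraped listings (the column names are illustrative, not my actual schema):

```python
import pandas as pd

# Illustrative listings; the column names are hypothetical.
jobs = pd.DataFrame({
    "title": ["Senior Data Scientist", "Data Analyst", "ML Engineer"],
    "estimated_salary": [145000, 72000, 130000],
})

# Binary target: 1 if Indeed's estimate falls above the median
# of all scraped estimates, 0 otherwise.
median_salary = jobs["estimated_salary"].median()
jobs["above_median"] = (jobs["estimated_salary"] > median_salary).astype(int)
print(jobs[["title", "above_median"]])
```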

I scraped job titles, job locations, and job summaries so that I could run natural language processing on the text I had acquired. After accumulating the data, I ran both a count vectorizer and TF-IDF on my corpus. A count vectorizer simply counts the occurrences of every word while ignoring stop words that carry very little meaning, such as “this”, “the”, “is”, and “a”. TF-IDF stands for term frequency-inverse document frequency. It weights each word by how often it occurs across documents: if a word appears repeatedly in one job posting but rarely in others, it is weighted as more meaningful; conversely, if a word appears in every posting, it is weighted as less meaningful.
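Here is a minimal sketch of both vectorizers using scikit-learn, run on a toy corpus rather than the real scraped text:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the scraped titles and summaries.
corpus = [
    "senior data scientist building machine learning models",
    "junior data analyst maintaining reporting dashboards",
    "research engineer designing the machine learning database",
]

# Raw term counts; stop_words="english" drops low-meaning words
# like "the", "is", and "a".
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(corpus)

# TF-IDF down-weights words spread across many documents and
# up-weights words concentrated in just a few.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(corpus)

print(count_vec.get_feature_names_out())
print(tfidf.toarray().round(2))
```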

After analyzing the count vectorizer and TF-IDF results, it became clear that some words had more predictive power than others. Words that carried strong signal included “senior”, “research”, “database”, “engineer”, “lead”, “machine”, and “models”. After coupling these NLP features with location and job title, I could run models that predicted whether a posting would fall above or below the median with about 70% accuracy.
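A rough sketch of that kind of model, combining TF-IDF text features with one-hot encoded locations, might look like the following; the logistic regression and the toy data are illustrative choices, not necessarily the exact models I ran:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy labeled data standing in for the scraped listings.
X = pd.DataFrame({
    "summary": [
        "senior machine learning engineer building models",
        "junior analyst supporting the reporting team",
        "lead research scientist designing database architecture",
        "entry level data analyst maintaining dashboards",
    ],
    "location": ["San Francisco", "Houston", "New York", "Chicago"],
})
y = [1, 0, 1, 0]  # 1 = above median salary, 0 = below

# TF-IDF on the summary text, one-hot encoding on the city.
features = ColumnTransformer([
    ("summary_tfidf", TfidfVectorizer(stop_words="english"), "summary"),
    ("city", OneHotEncoder(handle_unknown="ignore"), ["location"]),
])

model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

model.fit(X, y)
print(model.score(X, y))  # in-sample accuracy on the toy data
```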

While the keywords mentioned above were useful in predicting salary, location was the largest predictor. Unsurprisingly, jobs in San Francisco and New York brought in higher salaries on average than jobs in Houston and Chicago.

I also ran models on my own hand-picked set of words: “junior”, “senior”, “masters”, “PhD”, “entry”, “machine”, and “research”. This set did not yield the highest accuracy score, but it did limit the number of false positives in my predictions. A false positive in this context is a job predicted to have a high salary that actually has a low one. Because my goal was to limit false positives, this became the most useful model.
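Restricting a count vectorizer to that fixed vocabulary, and scoring predictions on precision (the metric that penalizes false positives), might look like this sketch; the labels and predictions below are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score

# Only the hand-picked terms are counted; every other word is ignored.
hand_picked = ["junior", "senior", "masters", "phd", "entry", "machine", "research"]
vec = CountVectorizer(vocabulary=hand_picked)

corpus = [
    "senior machine learning research scientist",
    "entry level junior analyst",
]
X = vec.fit_transform(corpus)
print(vec.get_feature_names_out())
print(X.toarray())

# Precision = TP / (TP + FP): the higher it is, the fewer listings
# are wrongly predicted to have a high salary.
y_true = [1, 0, 0, 1]  # hypothetical actual classes
y_pred = [1, 1, 0, 1]  # hypothetical predictions with one false positive
print(precision_score(y_true, y_pred))  # 2 / 3
```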

While a count vectorizer and TF-IDF both seek to measure predictive strength, they do so in different ways. Thus, words that prove meaningful under both techniques give the strongest indication of importance. By comparing the two side by side, I was able to get a solid understanding of which words are truly the most predictive.