Negotiate Your Salary Like a Pro: A Job Description-based Salary Estimator Using Deep Neural Network

Innovative Data Science Job Salary Estimation through Text Mining, BERT model, and Neural Networks.

Sina Nazeri
The Power of AI
14 min read · Mar 22, 2023


Have you ever wondered how much that dream data job you’ve been eyeing will pay? Well, wonder no more! In this cutting-edge project, we harness the power of state-of-the-art technology to give you an estimate of the average salary for a given job posting description.

➜ Pro tip: If you'd like to learn faster and run (or download) this project as a Jupyter Notebook for free, visit CognitiveClass.ai.

The Notebook file contains exercises and further details.

We utilize text mining techniques to clean and vectorize the data, using the BERT model to extract relevant information. Then, we design a deep neural network for training and testing. The end result? A robust model that can estimate a job’s salary just by copying and pasting the job description. Say goodbye to hours of scouring job boards and salary calculators — with our model, you’ll have the information you need at your fingertips in no time.

1. Objectives

After completing this lab, you will have the skills to:

  • Clean and normalize text, then convert it into numerical vectors using word embeddings.
  • Create training and testing sets from preprocessed data.
  • Build and train a deep neural network model using frameworks.
  • Evaluate the model’s performance on testing data.
  • Deploy the trained model for future use on new data.

2. Setup

Installing and Loading Required Libraries

%pip install -U 'yellowbrick' 'skillsnetwork' 'sentence-transformers' 'seaborn'
import skillsnetwork
await skillsnetwork.download_dataset("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX086PEN/data/df1_datascience_jobs.csv")
await skillsnetwork.download_dataset("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-GPXX086PEN/data/df2_datascience_jobs.csv")

import re
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

3. Datasets

We use two datasets that collect data scientist job postings from different sources on the internet.

The first dataset is sourced from data.world and contains 10,000 job postings for data scientists [link]; we keep only the records that include salary information. The second dataset was collected by Luke Barousse and is available on Kaggle [link]. You can also download the two datasets by following the links: df1 and df2.

3.1. Engineering the dataset

After loading the two datasets, we need to create an inclusive dataset that contains the following information:

`Title`: string ➜ The title of the job posting
`Company`: string ➜ The name of the company
`Location`: string ➜ The state (or country) where the job is offered
`Description`: string ➜ The full job description
`Salary`: number ➜ The proposed or estimated salary per year

# load the datasets and take a quick look
df1 = pd.read_csv('df1_datascience_jobs.csv') # data from data.world
df2 = pd.read_csv('df2_datascience_jobs.csv') # data from Kaggle
df1.head()
A quick look at the first dataset

As you can see, the dataset contains numerous columns, and we will only select the ones that are relevant to our analysis.

# select the columns we are interested in
df1 = df1[['job_title','company_name','state','job_description','salary_offered']]

# remove the rows without salaries
df1 = df1[df1['salary_offered'].notna()]
df1.info()

# function for cleaning the salary column
def salary_extract(salaries):
    for i in range(len(salaries)):
        # replace k or K with 000
        salaries[i] = re.sub("[kK]", "000", salaries[i])

        # use a regular expression to extract the numbers
        salaries[i] = re.findall(r'\d+', salaries[i])

        # convert the numbers from strings to integers
        numbers = [int(x) for x in salaries[i]]

        # take the mean of the numbers (e.g. the midpoint of a salary range)
        salaries[i] = sum(numbers) / len(numbers)

    return salaries

# clean the salary_offered column
salaries = df1.salary_offered.tolist()
salaries = salary_extract(salaries)

df1['salary_offered'] = salaries

# rename the columns for consistency
df1.rename(columns={"job_title": "Title","company_name":"Company","state":"Location","job_description":"Description","salary_offered":"Salary"}, inplace=True)
df1.head()
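To see what `salary_extract` produces, here is a quick check on a couple of made-up salary strings (hypothetical inputs, not rows from the dataset); a range such as "$80K - $100K" becomes its midpoint:

# quick sanity check on hypothetical salary strings
print(salary_extract(["$80K - $100K", "120000 per year"]))
# -> [90000.0, 120000.0]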

The first dataset (df1) is ready. Let's clean the second dataset (df2).

df2.columns

# select the columns and remove records with no salary information
df2 = df2[['title','company_name','location','description','salary_standardized']]
df2 = df2[df2['salary_standardized'].notna()]
df2.head(2)

The salary column looks fine, but we need to extract the state from each location. To keep consistency with the first dataset, we also rename the columns accordingly.

The following code iterates over each location string. If the string contains a comma (checked with the `in` operator), it is split on the comma with `str.split()`, and the last element of the resulting list, which should be the state, is kept; `str.strip()` then removes any surrounding spaces. If a string has no comma, "United States" is used instead. The resulting list of states replaces the `location` column.

locations = df2.location.tolist()

# loop through the locations and extract the state if it exists
states = [string.split(",")[-1].strip() if "," in string else "United States" for string in locations ]
df2['location'] = states

# rename the columns for consistency
df2.rename(columns={"title": "Title","company_name":"Company","location":"Location","description":"Description","salary_standardized":"Salary"}, inplace=True)
df2.info()
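As a tiny illustration of this rule, here it is applied to a few made-up location strings (hypothetical values, not actual rows of df2):

# hypothetical location strings
demo_locations = ["Austin, TX", "New York, NY", "United States"]
print([s.split(",")[-1].strip() if "," in s else "United States" for s in demo_locations])
# -> ['TX', 'NY', 'United States']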

The second dataset is ready; next, we concatenate the two by row to create one complete dataset.

3.2 Preparing the Main Dataset

Both datasets are ready, so we concatenate them by rows.

# combining the two datasets
data = pd.concat([df1, df2], ignore_index=True, sort=False)
data.info()
# so we have 5778 non-null rows

To ensure the quality of our analysis, we remove any job listing with an insufficient description; specifically, we drop descriptions shorter than 250 characters.

# get the list of job description lengths and count those shorter than 250 characters
desc_len = [len(x) for x in data.Description]
print('number of jobs with short descriptions ==>',sum(i < 250 for i in desc_len))

# find short job descriptions (fewer than 250 characters) and remove them
ind = [i for i, x in enumerate(desc_len) if x < 250]
data = data.drop(data.index[ind], axis=0)
data.shape

We also remove records with extremely high or low `Salary` values, which can result from mistakes in data collection.

# seems we have faulty data at some point
data.sort_values(by=['Salary'],ascending=False)
# remove records with extremely high or low Salary values
data = data.query('Salary >= 40000 and Salary<=250000')

4. Exploring The Dataset

data['Salary'].describe()
# let's plot the salary
sns.boxplot(y = 'Salary', data = data)
# Plotting the distribution of the salaries.

# make the xlabel ticks smaller to fit properly
plt.rcParams['xtick.labelsize'] = 8
# plot histogram for observing the distribution
data['Salary'].plot.hist(bins=30, alpha=0.5)

Plotting the top-paying companies:

# grouping by companies and calculating the mean salaries
top_companies = data.groupby(['Company']).mean(numeric_only=True).sort_values('Salary', ascending=False).head(20)

plt.style.use('seaborn-darkgrid')
top_companies.plot(kind='barh', legend=False, color='olive')
plt.xlabel('Salary')
plt.ylabel('Companies')
plt.title('Top 20 Paying Companies')

Plotting the top-paying data scientist titles:

# grouping by title and calculating the mean salaries
top_title = data.groupby(['Title']).mean(numeric_only=True).sort_values('Salary', ascending=False).head(20)

plt.style.use('seaborn-darkgrid')
top_title.plot(kind='barh', legend=False, color='olive')
plt.xlabel('Salary')
plt.ylabel('Titles')
plt.title('Top 20 Paying Data Scientist Titles')
plt.show()

5. Text Transformation for ML Model

We select the column `Description` and process it into vectors. The steps for processing the text are as follows:
1. Convert all text to lowercase for consistency and ease of processing.
2. Eliminate any punctuation and stopwords as they do not contribute to the meaning of the text.
3. Use stemming to reduce words to their base form for better grouping of similar words.
4. Apply lemmatization to normalize the words further, treating variations of a word as the same.
5. Transform the processed text into numerical vectors using the S-BERT model for mathematical analysis and processing.

text = data['Description']

import string
import re
import nltk
from nltk.corpus import stopwords
# stemming using PorterStemmer
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

# for stopword removal and lemmatization, download the "stopwords" and "wordnet" corpora
nltk.download("stopwords")
nltk.download("wordnet")

# function for text preparation
def text_prep(text):
    # make everything lower case
    text = [t.lower() for t in text]

    # strip all punctuation from each description
    # (str.translate maps every punctuation character to the empty string)
    table = str.maketrans('', '', string.punctuation)
    text = [t.translate(table) for t in text]

    # convert all numbers to the word 'num' using regular expressions
    text = [re.sub(r'\d+', 'num', t) for t in text]
    # remove '\n' from the text
    text = [t.replace('\n', ' ') for t in text]

    # replace all characters that are not alphanumeric, period, underscore, or whitespace
    text = [re.sub(r'[^a-zA-Z0-9._\s]', '', t) for t in text]
    # collapse repeated whitespace
    text = [re.sub(r'\s+', ' ', t) for t in text]

    # remove stopwords
    stop_words = set(stopwords.words('english'))
    text = [' '.join(w for w in t.split() if w not in stop_words) for t in text]

    # stem every word with PorterStemmer
    stemmer = PorterStemmer()
    text = [' '.join(stemmer.stem(w) for w in t.split()) for t in text]

    # lemmatize every word to normalize it further
    lem = WordNetLemmatizer()
    text = [' '.join(lem.lemmatize(w) for w in t.split()) for t in text]

    return text

text = text_prep(text)
# print a sample of the processed text
text[3]

SentenceTransformer(‘all-MiniLM-L6-v2’) is a pre-trained model provided by the SentenceTransformers library for generating sentence embeddings. It is based on MiniLM, a compact transformer language model; the ‘all-MiniLM-L6-v2’ variant has 6 transformer layers and was fine-tuned on a large collection of sentence-pair data. The resulting embeddings can be used as input to other machine learning models for tasks such as text classification, clustering, semantic search, and question answering.

from sentence_transformers import SentenceTransformer
from sentence_transformers import util
model_Sen = SentenceTransformer('all-MiniLM-L6-v2')
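As a quick check (the sentence below is made up), ‘all-MiniLM-L6-v2’ maps every input string to a 384-dimensional vector:

# encode a single (hypothetical) sentence and inspect the embedding shape
sample_vec = model_Sen.encode(["senior data scientist with python and sql experience"])
print(sample_vec.shape)  # (1, 384)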

SentenceTransformer models are based on complex transformer architectures which are known to be large and deep neural networks. Generating embeddings from these models can be computationally expensive and may take longer to run compared to simpler models. This is because embedding generation requires a significant amount of computational resources, such as CPU or GPU power. Therefore, please note that the following cells may take a few minutes to run.

# encode() converts the sentences into NumPy vectors that can be used for downstream tasks
# increasing the batch size can reduce the conversion time
# the vectorization may take a while

embeddings = model_Sen.encode(text, convert_to_numpy = True, show_progress_bar=True,batch_size=100) # By default, convert_to_numpy = True
embeddings.shape
Transformation process
# Splitting the data for training and testing

from sklearn.model_selection import train_test_split

# Split the data into independent and dependent variables
X = pd.DataFrame(embeddings)
y = data['Salary']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)

# print the shape of train and test data
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

In this project, we utilize a deep neural network (DNN) model for our training and testing task. A deep neural network is a type of artificial neural network with multiple hidden layers between the input and output layers; it is called “deep” because of the number of hidden layers.

In implementing the DNN model, we utilize `Keras` library which is a high-level deep learning library for Python that provides a convenient way to define and train DNNs. It has a user-friendly API that allows developers to build and experiment with various DNN architectures quickly and easily. The library supports multiple backends such as TensorFlow. With `Keras`, you can define the structure of your DNN using its built-in layers and activation functions, then `compile` the model and `fit` it to your data to train it.

Each hidden layer needs an activation function (we chose `relu`).

The `Dropout` layer in a deep neural network is a regularization technique used to prevent overfitting. Overfitting occurs when a model learns the training data too well, resulting in poor generalization to new, unseen data.

`Dropout` works by randomly setting a fraction of the inputs to zero during each training step, effectively “dropping out” those inputs from the computation. This forces the model to learn multiple independent representations of the same data and reduces the dependence on any single feature, making the model more robust and generalizable. In Keras, the `Dropout` layer can be added to a model by simply inserting it between other layers.

from keras.layers import Dense, Dropout
from keras.models import Sequential
from keras.optimizers import Adam
from sklearn.metrics import mean_absolute_error

# Create a DNN model
model = Sequential()

# The Input Layer
model.add(Dense(512, kernel_initializer='normal',input_dim = X_train.shape[1], activation='relu'))

# The Hidden Layers (Dropout layers for avoiding overfitting)
model.add(Dropout(0.2))
model.add(Dense(512, kernel_initializer='normal',activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(512, kernel_initializer='normal',activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(256, kernel_initializer='normal',activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(128, kernel_initializer='normal',activation='relu'))

# The Output Layer
model.add(Dense(1, kernel_initializer='normal',activation='selu'))

# Compile the network
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_absolute_error'])
model.summary()
Deep neural network model design

Executing the model:

model.fit(X_train, y_train, epochs=40, batch_size=35, validation_split = 0.2)
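Optionally, instead of the call above, you can keep the `History` object returned by `fit` and plot the training and validation loss to spot overfitting (a minimal sketch; here the loss is the mean absolute error, and matplotlib is already imported as plt in the setup section):

history = model.fit(X_train, y_train, epochs=40, batch_size=35, validation_split=0.2)

# plot training vs. validation loss over the epochs
plt.plot(history.history['loss'], label='training MAE')
plt.plot(history.history['val_loss'], label='validation MAE')
plt.xlabel('Epoch')
plt.legend()
plt.show()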

Testing the model:

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the mean absolute error of the model
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error: ", mae)

A mean absolute error of about $18,000 is not bad, considering the variety of salary policies across companies and the room left for individual negotiation.
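For context, you can compare this against a naive baseline that always predicts the mean training salary (a rough sanity check, not part of the original pipeline):

# naive baseline: always predict the mean salary of the training set
baseline_pred = np.full(len(y_test), y_train.mean())
print("Baseline MAE:", mean_absolute_error(y_test, baseline_pred))
print("Model MAE:   ", mae)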

6. Estimating A New Job Posting (Deploying the Model)

You can copy and paste the description of any data science-related job posting into the variable sample_job_description below.

sample_job_description=""" 
Job description Company Description
We help the world see new possibilities and inspire change for better tomorrows. Our analytic solutions bridge content, data, and analytics to help business, people, and society become stronger, more resilient, and sustainableJob Description
• Under supervision, participates in the development and execution of analysis project plans for timely project completion. May contribute to project hypothesis and goal setting.
• Utilizes statistical and machine learning techniques to create high-performing models that comply with regulatory and privacy requirements and address business objectives and client needs.
• Ensures statistical, computational, and algorithmic validity of results.
• Performs analyses of structured and unstructured data to solve multiple and complex business problems, utilizing advanced statistical, mathematical and machine learning technique.
• Ability to keep up to date with emerging methods and environments.
• Provides required support for project delivery and implementation.
• Collaboration with Verisk Cloud Data Lakehouse to support Insurance Analytics work
• Develop, construct, test, and maintain architectures
• Align architecture with business requirements
• Data acquisition
• Develop data set processes
• Use programming language and tools
• Identify ways to improve data reliability, efficiency, and quality
• Deploy sophisticated analytics programs, machine learning, and statistical methods
• Creates clear and easy to understand documents for product support (technical).
• Presents analysis ideas, progress reports and results to internal managers, project managers.
• Collaborates with Lead or Principal Data Scientist to develop technical/business approaches and new or enhanced technical tools.

LI-SM1

Qualifications

Required:
• Graduate-level degree with concentration in a quantitative discipline such as statistics, mathematics, economics, operations research, computer science or aligned discipline.
• Skilled with feature engineering, model selection and assessment methodologies, and familiarity with all stages of the data science pipeline.
• Skilled programming in – SQL, Python, PySpark,
• Knowledge in and applied experience with statistical and machine learning.
• Applied, practical experience using statistical and machine learning computer languages (e.g. R, Python, SLQ).
• Working familiarity with cloud computing environments.
• Demonstrated skills with Data Engineering Tools – Database, Data Transformation, Data Ingestion, Data Mining, Data Warehousing and ETL/LTE, Reasl-time Processing Framework, Data Buffering, Cloud Computing, and Data Visualization
• Logical, evidence-based problem solving and critical thinking skills.

Preferred:
• Insurance experience, focusing on the personal lines or related business content.

Additional Information

At the heart of what we do is help clients manage risk. Verisk (Nasdaq: VRSK) provides data and insights to our customers in insurance, energy and the financial services markets so they can make faster and more informed decisions.

Our global team uses AI, machine learning, automation, and other emerging technologies to collect and analyze billions of records. We provide advanced decision-support to prevent credit, lending, and cyber risks. In addition, we monitor and advise companies on complex global matters such as climate change, catastrophes, and geopolitical issues.

But why we do our work is what sets us apart. It stems from a commitment to making the world better, safer and stronger.

It’s the reason Verisk is part of the UN Global Compact sustainability initiative. It’s why we made a commitment to balancing 100 percent of our carbon emissions. It’s the aim of our “returnship” program for experienced professionals rejoining the workforce after time away. And, its what drives our annual Innovation Day, where we identify our next first-to-market innovations to solve our customers’ problems.

At its core, Verisk uses data to minimize risk and maximize value. But far bigger, is why we do what we do.

At Verisk you can build an exciting career with meaningful work; create positive and lasting impact on business; and find the support, coaching, and training you need to advance your career. We have received the Great Place to Work® Certification for the 7th consecutive year. Weve been recognized by Forbes as a World’s Best Employer and a Best Employer for Women, testaments to our culture of engagement and the value we place on an inclusive and diverse workforce. Verisk’s Statement on Racial Equity and Diversity supports our commitment to these values and affecting positive and lasting change in the communities where we live and work.

Verisk Analytics is an equal opportunity employer.

All members of the Verisk Analytics family of companies are equal opportunity employers. We consider all qualified applicants for employment without regard to race, religion, color, national origin, citizenship, sex, gender identity and/or expression, sexual orientation, veterans status, age or disability.

"""

Get the salary:

# prepare your text through text preparation and vectorization
your_text = text_prep([sample_job_description])
your_text_embeddings = model_Sen.encode(your_text, convert_to_numpy=True)

# estimate the salary based on the model we trained before
estimation = model.predict(your_text_embeddings)

print("Estimated salary for the job ==> ", int(estimation[0][0]), "USD")

Thanks for reading!

You can follow me on Medium or LinkedIn and stay tuned for more articles on Data Science, Machine Learning, and AI.

If you are interested in my project, here is my IBM skills network profile:
