Job Market Analysis

Sumukhabharadwaj
SFU Professional Computer Science
16 min read · Apr 20, 2020

A detailed report on the project Job Market Analysis

The photo was taken from THE DREADED JOB SEARCH and inc.com

Contributors: Madana Krishnan V K, Nguyen Cao, Sanjana Chauhan, and Sumukha Bharadwaj

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/pmp}.

MOTIVATION AND BACKGROUND

The job market is a platform through which employers search for employees and job seekers search for jobs. The job market can grow or shrink depending on the demand for candidates and the available supply of employees within the overall economy. Other factors that impact the market are the needs of a specific organization, the need for a particular education level or skill set and required job functions.

Many candidates fail to get a job because they do not understand the requirements mentioned in the job description. Because job descriptions are written as long paragraphs, many people tend to overlook the keywords. Also, when HR writes a job posting, it should be relevant to the specific job title. The most important factors these days are a company's work values and the perks it provides. These are key reasons candidates apply for a job, because everyone wants a good working environment and ample opportunities to showcase their talent.

It is very important to analyze the job market, but from different perspectives. A job seeker might want to know what skills are required to apply for a job post. HR might need a better understanding of which skills are necessary for a job role in order to write a suitable job description. The company, in turn, should identify the values or factors to incorporate in order to attract potential, high-quality candidates to the organization. Internet-based recruiting platforms like Indeed, Glassdoor, and LinkedIn have become popular with almost all companies.

RELATED WORK

With this idea in mind, we decided to study the job market to help job seekers, HR, and companies. Previously, many have analyzed the job market from the viewpoint of a candidate applying for a particular job. In this project, we cover multiple perspectives on the job market, including those of HR and the company. We have also included two online recruiting platforms, Indeed and Glassdoor, which cover a large number of jobs across Canada, allowing us to take a unique approach to the problem.

PROBLEM STATEMENT

Through this project, we dive deep into the job market by analyzing the details of job posts, company reviews, and interview questions. Our work mainly revolves around answering the following questions, keeping the end users in mind:

  1. What skills are necessary for a job seeker to get into a career path?
  2. Which job posting should a job seeker apply to?
  3. Which relevant skills can HR add to a particular job posting?
  4. What factors can a company improve on to increase the number of potential and quality applications?

As explained in the section above, most of the prior work aims to provide insights with the job seeker in mind. In this project, we try to extend the set of possible end users to include HR and companies.

Some of the major challenges we faced during this project are:

  1. Integrating heterogeneous data sources, i.e. integrating data from Indeed and Glassdoor.
  2. Analyzing the job description by performing Natural Language Understanding.
  3. Lack of labeled data, which made evaluating our model's performance difficult.
  4. Mapping the insights and results derived into useful and practical business values.

DATA SCIENCE PIPELINE

The pipeline for Job Market Analysis

The following are the different stages in our data science pipeline:

Stage 1: Data Collection

The first step in our data pipeline was to scrape data from glassdoor.com and indeed.com. We used Scrapy for scraping the data and Selenium for automating the scraping process, and stored the scraped data in a PostgreSQL database. From indeed.com, we scraped job posts for data scientist, data engineer, and data analyst roles, along with company reviews given by former and current employees. From glassdoor.com, we scraped job posts for the same titles, along with interview questions asked by various companies for those roles.

Tools: Scrapy, Selenium, SQLAlchemy, PostgreSQL
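For illustration, a minimal Scrapy spider in the spirit of ours might look like the sketch below. The start URL, CSS selectors, and item fields are hypothetical placeholders rather than our actual selectors, since job-board markup changes frequently; in the real pipeline, the yielded items are written to PostgreSQL via SQLAlchemy, and Selenium drives pages that need JavaScript rendering.

import scrapy


class IndeedJobSpider(scrapy.Spider):
    """Minimal sketch of a job-post spider (URL and selectors are illustrative only)."""
    name = "indeed_jobs"
    # Hypothetical search URL for 'data scientist' jobs in Vancouver, BC
    start_urls = [
        "https://ca.indeed.com/jobs?q=data+scientist&l=Vancouver%2C+BC"
    ]

    def parse(self, response):
        # Assume each job card is wrapped in a div with class 'jobsearch-SerpJobCard'
        for card in response.css("div.jobsearch-SerpJobCard"):
            yield {
                "title": card.css("h2.title a::attr(title)").get(),
                "company": card.css("span.company::text").get(default="").strip(),
                "location": card.css("div.location::text").get(),
                "url": response.urljoin(card.css("h2.title a::attr(href)").get()),
            }

        # Follow pagination if a 'Next' link exists (selector is an assumption)
        next_page = response.css("a[aria-label=Next]::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)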

The following video shows the scraping process:

Scraping indeed and glassdoor websites for job posts

The screenshots below show the scraped data:

Scraped data from indeed.com
Scraped data of company reviews from indeed.com

Stage 2: Data Preprocessing

In this step, we preprocessed the scraped data to make it ready for analysis. This involved data cleaning and data integration. Since the description field was paragraph-based, we removed stop words and punctuation to extract keywords for the NLP tasks. We integrated the data because the job posts were scraped from two different websites and there were many duplicate listings. We then applied an entity-resolution technique, Jaccard similarity, to identify similar pairs after integration.

Tools: Pandas, TextBlob
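To illustrate the entity-resolution step, here is a minimal Jaccard similarity check between two job posts. The tokenization rule and the 0.7 threshold are simplifying assumptions for the sketch, not the exact values used in our pipeline.

import re


def tokens(text: str) -> set:
    """Lowercase a job post and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9+#]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity = |intersection| / |union| of the two token sets."""
    set_a, set_b = tokens(a), tokens(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


post_1 = "Data Scientist - build ML models in Python and SQL at Acme Corp"
post_2 = "Data Scientist building ML models with Python and SQL (Acme Corp)"

# Pairs above a similarity threshold are treated as duplicate listings
if jaccard(post_1, post_2) > 0.7:
    print("Likely the same job posted on both sites")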

The screenshots below depict the data cleaning and data integration process:

Histogram of the top 20 words in a Job Description
Histogram of the top 20 words in a Job Description after cleaning
Entity Resolution: Number of similar pairs after applying Jaccard similarity

Stage 3: Data Exploration

After obtaining the preprocessed data, we performed exploratory data analysis in order to understand the data properly before starting the actual analysis. The visualizations from this step proved to be the most influential in deciding the model for analysis.

We also performed sentiment analysis on the company reviews given by former and current employees, using the TextBlob library, which is an excellent tool for the task. The polarity was calculated for each review and, based on certain thresholds, reviews were classified as good, bad, or neutral. One notable observation was that the average rating given by former employees was lower than that given by current employees.

Tools: Python, Matplotlib, NLTK, Plotly
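A minimal version of the review classification looks like the sketch below; the polarity thresholds of ±0.1 are illustrative assumptions rather than the exact cut-offs we used.

from textblob import TextBlob


def classify_review(review: str, threshold: float = 0.1) -> str:
    """Classify a company review as good, bad, or neutral using TextBlob polarity."""
    # polarity ranges from -1.0 (very negative) to +1.0 (very positive)
    polarity = TextBlob(review).sentiment.polarity
    if polarity > threshold:
        return "good"
    if polarity < -threshold:
        return "bad"
    return "neutral"


print(classify_review("Great culture, supportive managers and real growth opportunities."))
print(classify_review("Long hours, poor pay and management ignores feedback."))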

Many useful insights were discovered, as can be seen in the screenshots below:

Histogram for top 20 bigrams in the job description before removing stop words
Histogram for top 20 bigrams in the job description after removing stop words
Bar chart for the average polarity of reviews
Bar Chart for comparison of ratings by the current and former employees
Word Cloud for Good Reviews
Word Cloud for Bad Reviews

Stage 4: Data Analysis

This stage involved finding the actual answers to the questions raised in the problem statement. For each job role, we compute a similarity score between the job title plus its top bigrams and the occupation titles in the O*Net database. We then compute a similarity score between the job description and the corresponding O*Net occupation's competencies, and return the top 20 competencies for that role. We perform the same analysis on company reviews to identify the weak factors a company has and suggest competencies that would improve its work values. The methodology section gives a deeper explanation of the analysis.

Tools: Python, SpaCy, TextBlob

Stage 5: Data Visualization

We integrated all of the insights we uncovered about the job market into an interactive dashboard, where the findings and results are presented so that any user can understand them easily. We display the results as bar charts, line charts, tables, and word clouds. These visualizations are simple and meaningful.

Tools: Redash, Celery, PostgreSQL, AWS EC2

METHODOLOGY

The above section described the data science pipeline that we followed. In this section, we explain the approach and the methodology we have adopted throughout the pipeline.

Let's remind ourselves of the problem we set out to solve: we want to identify the top skills and responsibilities for a job seeker, and the relevant skills HR could add to a new job post. The crux of the solution is defining a similarity measure between two texts. Let us see how we predict the skills and responsibilities for a particular job post.

Job-Competencies Matching to get the skills

The idea here is to first predict the occupation title from the O*Net database for a given job post using the occupation scorer. Once we have the occupation title, we process the job description sentence by sentence. Using the competency scorer, we compare each sentence with the O*Net occupation’s competencies. We pick the top 20 competencies with the highest similarity scores.

Occupation Scorer

Let's dive into how the occupation scorer works. In any NLP task, the key first step is to preprocess the text. Preprocessing involves converting the text to lower case and removing unnecessary punctuation, pronouns, and stop words (e.g., the, what, why, where), which do not contribute to the meaning of a sentence. The example below shows a sentence from a job description before and after preprocessing.

Before:

After:

To achieve this, we use SpaCy, an open-source Python library for advanced Natural Language Processing tasks. It provides a simple way to parse and preprocess text, as shown above. We will soon see how SpaCy proves valuable elsewhere in this project.
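A bare-bones version of this preprocessing step with SpaCy might look like the following sketch; the model name en_core_web_sm and the exact token filters are assumptions that may differ slightly from our implementation.

import spacy

# Small English pipeline; assumes `python -m spacy download en_core_web_sm` has been run
nlp = spacy.load("en_core_web_sm")


def preprocess(text: str) -> str:
    """Lowercase the text and drop stop words, punctuation, and pronouns."""
    doc = nlp(text.lower())
    kept = [
        token.lemma_
        for token in doc
        if not token.is_stop          # drop stop words (the, what, why, where, ...)
        and not token.is_punct        # drop punctuation
        and token.pos_ != "PRON"      # drop pronouns
        and not token.is_space
    ]
    return " ".join(kept)


print(preprocess("You will build and maintain scalable data pipelines for our analytics team."))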

Coming back to the occupation scorer, we preprocess both the job post title and all the occupation titles from the O*Net database. Along with the job post title, we append the top n-grams (40 bigrams, in our case) present in the job post, which gives a better picture of what the actual role is. We use the top n-grams rather than random words because they help differentiate the various types of jobs we have. Once we have the title plus top n-grams from the job post and the occupation titles from the O*Net database, we are ready to calculate the similarity measure.

We have two pieces of text and we want to measure how similar they are. The conventional way is to use the cosine similarity metric, shown below:

Cosine Similarity Equation
Sentences represented in space (photo from Christian Perone).
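For reference, the equation in the figure above is the standard cosine similarity between two embedding vectors A and B:

\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}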

An important point to note is that the similarity measure is calculated over real-valued vectors, not over raw text. The process of mapping text into numbers is known as word embedding, and there are multiple ways to embed text. For example, we could one-hot encode all the words in our database, but that would be impractical: the representation is extremely large and sparse, and it cannot accommodate new words. Various other embedding techniques exist, but they are out of scope for this article. In this project, we use word2vec embeddings. Word2vec maps every word into a latent space of a certain size (say 100 or 300 dimensions) by learning from the context in which each word appears. We can then use these real-valued embeddings to calculate the similarity measure.

This is the idea behind calculating similarity, but we do not have to implement it all ourselves. SpaCy provides an extremely convenient way to do this: it ships with a pre-trained English NLP model that uses word2vec-style vectors, which we can use to obtain the embedding of our texts. It also provides a method, 'similarity()', which calculates the similarity between two texts: it first converts every word into its vector, combines the word vectors of a sentence into a single vector, and then calculates the cosine similarity.
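Concretely, the whole comparison reduces to a few lines of SpaCy; the model name en_core_web_md (a pipeline that ships with word vectors, unlike the small model) is an assumption about the exact model we loaded:

import spacy

# Medium English pipeline with word vectors; assumes `python -m spacy download en_core_web_md`
nlp = spacy.load("en_core_web_md")

job_title = nlp("data engineer spark data pipelines")
occupation = nlp("database architects")

# similarity() combines each text's word vectors into one vector and returns their cosine similarity
print(job_title.similarity(occupation))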

In the occupation scorer, we calculate these similarity scores and keep the occupation title with the highest score.
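Putting the pieces together, a simplified occupation scorer could look like the sketch below; the function and variable names are illustrative, the cleaning is reduced to lowercasing, and the O*Net titles would in practice come from the full occupation table.

import spacy

nlp = spacy.load("en_core_web_md")


def score_occupation(job_title, top_bigrams, onet_titles):
    """Return the O*Net occupation title most similar to the job title + its top bigrams."""
    # In the full pipeline the text is first cleaned as described above;
    # here we simply lowercase it to keep the sketch short.
    query = nlp((job_title + " " + " ".join(top_bigrams)).lower())

    best_title, best_score = None, -1.0
    for onet_title in onet_titles:
        score = query.similarity(nlp(onet_title.lower()))
        if score > best_score:
            best_title, best_score = onet_title, score
    return best_title, best_score


title, score = score_occupation(
    "Data Scientist",
    ["machine learning", "data pipelines"],
    ["Data Scientists", "Database Architects", "Statisticians"],
)
print(title, round(score, 3))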

Competency Scorer

Once we have predicted the occupation title, we predict the competencies in a similar fashion, except that we compare the job description with the corresponding occupation's competencies. (Note that for the question about skills, we consider only the 'Abilities', 'Technology Skills', 'Tools Used', 'Knowledge', and 'Skills' competencies from the O*Net database.) We keep the top 20 competencies with the highest similarity scores.
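A corresponding sketch of the competency scorer, which works sentence by sentence over the description, is shown below; the competency strings are illustrative stand-ins for the O*Net skills tables, and scoring each competency by its best-matching sentence is a simplification of our exact aggregation.

import spacy

nlp = spacy.load("en_core_web_md")


def top_competencies(job_description, competencies, k=20):
    """Score each O*Net competency against the description's sentences, keep the top k."""
    doc = nlp(job_description)
    scored = []
    for competency in competencies:
        comp_doc = nlp(competency)
        # A competency's score is its best similarity across the description's sentences
        best = max(sent.similarity(comp_doc) for sent in doc.sents)
        scored.append((competency, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


description = ("We are looking for someone with experience in MongoDB or other NoSQL stores. "
               "You will build dashboards and communicate findings to stakeholders.")
competencies = ["NoSQL - Database management system software",
                "Tableau - Analytical or scientific software",
                "Oral Comprehension"]
print(top_competencies(description, competencies, k=2))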

Job-Competencies Matching to get the responsibilities

In order to answer the second question, about responsibilities, we follow the same approach. We already have the computed occupation titles, and we calculate the similarity between the job description and the corresponding occupation's competencies (this time the 'Task Statements' and 'Work Activities' competencies from the O*Net database). We again keep the top 20 responsibility competencies with the highest similarity scores.

EVALUATION

As mentioned at the beginning, the data we use to answer these questions is unlabelled, which means there is no readily available ground truth against which to evaluate our model's performance. To overcome this, we propose the following method.

The idea is to create a labeled dataset from a small number of jobs that are representative of the entire job distribution, and then evaluate the model on its predictions for them.

We pick 100 job posts from our database at random and manually label each with its corresponding occupation from the O*Net database. Since manual labeling is both cumbersome and subjective, we split the 100 jobs randomly into two sets of 50, and each set is examined and labeled by at least two people independently. If the occupation assigned to a job post matches between the two labelers, we consider it correctly labeled; if not, we consider it ambiguously labeled and try to relabel it. If the ambiguity is resolved, we keep the job post; otherwise, we replace it with another, less ambiguous job post.

After obtaining this labeled dataset, evaluating our model is straightforward: we calculate the ratio of correctly predicted occupation titles to the total number of job postings considered (100 in this case).
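Given the hand-labeled set, the evaluation itself is a simple ratio; the dictionaries below are tiny illustrative stand-ins for the 100 labeled posts:

# Hypothetical hand labels and model predictions, keyed by job post id
labels = {1: "Data Scientists", 2: "Database Architects", 3: "Statisticians"}
predictions = {1: "Data Scientists", 2: "Data Scientists", 3: "Statisticians"}

correct = sum(1 for job_id, label in labels.items() if predictions.get(job_id) == label)
accuracy = correct / len(labels)
print(f"Occupation accuracy: {accuracy:.2%}")  # 66.67% for this toy example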

From the data exploration step, we observed that bigrams might help differentiate among various jobs. So we check for the occurrence of the top bigrams in each job posting and append those that are present to the job title. We experimented with different numbers of bigrams, and the accuracy we achieved is shown below.

Accuracy vs top n-grams

We start with 0 bigrams (using just the job title) and increase the count in increments of 5. It is clear that when all of the top 40 bigrams are included, we get the highest accuracy of approximately 75%.

The top 40 bigrams are checked against every job post; for each post, any bigram that occurs is appended to the title. The similarities between the modified title and the O*Net occupation titles are then computed, and the best O*Net occupation title is chosen for each job post. The result can be seen in the following figure.

Table showing actual title and predictions of our model with and without bi-grams

Once the occupation title is obtained, we proceed to calculate the similarity between the job post and the O*Net competencies. We keep the top 20 competencies. An example of this is as shown.

Original Job post along with the predicted O*Net competencies

We observed that the model predicts competencies correctly. For example, the job description contains the sentence 'MongoDB or other NoSQL', and the corresponding predicted competency is 'NoSQL - Database management system software'.

DATA PRODUCT

We have built a data product, hosted on an AWS EC2 instance, that provides the following functionality:

  1. Collect job posts from job portals like indeed.com and glassdoor.ca.
  2. Analyze job posts with open-source NLP libraries like SpaCy and TextBlob.
  3. Visualize results with Redash, an open-source business intelligence system.

The data product is built on the Docker container platform, so it is easy to deploy to a distributed production environment for handling large-scale processing.

Follow the instructions here to install it on Ubuntu 18.04; it is also quite straightforward to install on macOS or Windows machines. Let's see how to make use of the product.

Once the product is running, you can access the above functionality through the command line and the web app. The command line is used for data collection and data analysis, while the web app is used to build customized dashboards. The following are examples of command-line operations that can be performed in our system:

Crawl job posts from Indeed with 'data scientist' as the search keyword, collecting a maximum of 50 posts from Vancouver, BC:

python manage.py job crawl indeed --search_kw 'data scientist' --location 'Vancouver, BC' --max_items 50

Crawl job posts from Glassdoor with 'data engineer' as the search keyword, collecting a maximum of 100 posts from Toronto, ON:

python manage.py job crawl glassdoor --search_kw 'data engineer' --location 'Toronto, ON' --max_items 100 --query toronto-data-engineer-jobs-SRCH_IL.0,7_IC2281069_KO8,21.htm

The --query option is taken from the URL of the corresponding search request on the Glassdoor website, as shown below:

Screenshot of glassdoor.com showing the query string

Analyze job posts to get the top 50 most frequent bigrams:

python manage.py job analyze bigram --n_top 50

In the above command, instead of bigram you can use word or trigram to get the top frequent words or trigrams in the job posts. You can also restrict the bigrams to job posts for a particular search keyword:

python manage.py job analyze bigram --n_top 50 --search_kw 'data engineer'

Compute the matching occupations of job posts using the top 50 frequent bigrams:

python manage.py job occupation-score bigram --k 50 --data_table 'job_occupation'

The result of the computation is saved to a PostgreSQL table named job_occupation_bigram_50. To view the result, log in to your Redash instance and issue a SQL query against this table, as shown:

Screenshot of Redash instance

LESSONS LEARNT

Developing this project gave us a deeper understanding of the data science pipeline and of how each step directly affects the results. We also realized that the correctness of the data plays an important role in getting the desired outcome. Since the project aimed at solving a real-world problem, it was difficult to find labeled data.

Technically, we discovered how NLP models can be adopted into our workflow for data preprocessing, which is crucial. We also learned new libraries: SpaCy, a widely used, production-ready library, and TextBlob, which has an excellent API for beginners to perform sentiment analysis on job reviews. We also noticed that the O*Net database we used is somewhat outdated with respect to recent technologies like deep learning and NLP.

SUMMARY

In a nutshell, we came up with an interesting problem statement and tried to build a one-stop solution to the questions that are essential to understanding the job market, following a data science approach throughout. We scraped data from multiple sources, glassdoor.com and indeed.com, and also used data from the O*Net database. After collecting the data, we performed preprocessing, which involved exploratory data analysis, data integration, and entity resolution. Once the data was normalized, we built the occupation scorer and competency scorer models using the SpaCy and TextBlob libraries. We used similarity measures to map each job description to its corresponding O*Net occupation title and to competencies such as Technology Skills, Abilities, Skills, Work Values, and Work Activities. We also performed sentiment analysis to understand the nature of company reviews and suggested Work Value competencies that can help a company attract potential applicants. Finally, we experimented with feature engineering in the occupation scorer: using job titles alone to predict the occupation performed poorly, but appending different numbers of n-grams to the titles led to much better results.

Here are some key takeaways:

  • A job seeker may want to apply to many different job posts and roles, and each of these jobs has its own specific requirements. It is practically impossible to prepare for every specific skill. Hence, we built this data product, which maps a job title and its description to higher-level competencies so the overall requirements are easier to understand. For example, instead of saying that Python is the most important skill, we say that programming knowledge is the required skill, which gives a higher-level perspective.
  • Sometimes job titles don't match the job description. To overcome this problem of misleading titles, we try to guide HR toward the job title that is most similar to the description. For example, we have seen many job posts titled 'Data Scientist' whose descriptions are really about software development, i.e., skills like 'Experience with Agile software development and SCRUM processes'.
  • Lastly, we analyzed the nature of job reviews to identify the factors a job seeker looks for, and we provide a list of Work Value competencies: if a review is negative, the company can adopt the top competencies on the list to attract potential applicants; if the review is positive, it indicates that employees value those competencies and love working in a place that provides them.

Finally, we used a manually labeled dataset to evaluate our models, varying the number of bigrams used to predict the occupations and competencies.

FUTURE WORK

  • Parallelization of the computation of word sequence similarity.
  • Fine-tune the word2vec embeddings provided by SpaCy using our job posting data to match with our context.
  • Utilization of hierarchical occupation structure in the O*Net database.

LINKS

Dashboard:

Public Access-

  1. Job Post Analysis — https://go.aws/34TnlL0
  2. Job Occupation Scoring — https://go.aws/2RLNZQz
  3. Evaluation Visualization — https://go.aws/2XVNSpz

To execute queries -

http://ec2-107-23-250-99.compute-1.amazonaws.com/

Credentials — ID: cmpt733@sfu.ca | Password: cmpt733

Github repository: https://github.com/data-catalysis/job-cruncher

Video: https://www.youtube.com/watch?v=beW5PgEoVGs
