Optimizing job search using Data Analysis

Data Analysis with Kapil Parekh
Purple Beard Training
5 min read · Nov 7, 2022

INTRODUCTION

Searching for a job is a full-time pursuit in itself and can be very time-consuming. Have you ever wondered how you could optimize your job hunting? If so, keep reading!

My name is Kapil and I am a data scientist. I work for an education company called Purple Beard (don't worry, we are not a hair-grooming brand!). What we actually do is create employable tech talent through the training we deliver.

For those of you interested in a quick, reliable way to look for a job using data analysis, I'll walk you through the project below: gathering job-related information, cleaning the dataset, and finally extracting skills from job descriptions.

IMPLEMENTATION

Step 1: Understanding the structure of the web page

To collect available jobs we rely on multiple sources; let's focus on https://www.gov.uk/find-a-job for this blog.

We use Selenium to scrape the data, driven by ChromeDriver (https://chromedriver.chromium.org/downloads).

Let's start with the fundamentals of Selenium. We can use the find_element method of the webdriver object we created earlier to locate elements in a web page by CLASS, XPATH, CSS SELECTOR, and other criteria. It is critical to understand the structure of the web page in order to identify the path of the element from which we are attempting to retrieve data (refer to the image attached below).

Webpage layout via inspect to identify where elements like text blocks are located
We can store the information highlighted in the image in a DataFrame
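A minimal sketch of those Selenium fundamentals is below. The class name and query-parameter names are assumptions for illustration; confirm the real ones by inspecting the live page as shown above.

```python
FIND_A_JOB = "https://www.gov.uk/find-a-job/search"

def build_search_url(keyword: str, page: int = 1) -> str:
    """Build a results-page URL (the query parameter names are assumed)."""
    return f"{FIND_A_JOB}?keywords={keyword.replace(' ', '+')}&page={page}"

def get_job_cards(driver):
    """Return the job-listing elements on the current results page."""
    from selenium.webdriver.common.by import By
    # "search-result" is a placeholder class name -- verify it via inspect.
    return driver.find_elements(By.CLASS_NAME, "search-result")

# Usage (requires Chrome and a matching ChromeDriver):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(build_search_url("data analyst"))
#   cards = get_job_cards(driver)
```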

Now let's extract the job-related information and store it in a DataFrame.

Step 2: Extract initial job information
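The original script is shown as a screenshot; here is a hedged sketch of what it might look like. The column names and CSS selectors are assumptions, and missing fields are padded with the string "NaN" so the cleaning step below has something to drop.

```python
import pandas as pd

COLUMNS = ["Title", "Company", "Location", "Salary", "Link"]

def rows_to_frame(rows):
    """Collect scraped tuples into the jobs DataFrame."""
    return pd.DataFrame(rows, columns=COLUMNS)

def scrape_results_page(driver):
    """Extract one row per job card. The selectors are placeholders --
    confirm them against the live page with the browser inspector."""
    from selenium.webdriver.common.by import By
    rows = []
    for card in driver.find_elements(By.CLASS_NAME, "search-result"):
        link = card.find_element(By.CSS_SELECTOR, "h3 a")
        fields = [li.text for li in card.find_elements(By.TAG_NAME, "li")]
        # Pad missing fields with a "NaN" sentinel so every row has 5 values.
        fields += ["NaN"] * (3 - len(fields))
        rows.append((link.text, fields[0], fields[1], fields[2],
                     link.get_attribute("href")))
    return rows
```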

Step 3: Extract Job description

Replace the literal string ‘NaN’ with np.nan and drop NaN values from the DataFrame.
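In pandas that cleaning step is a one-liner, sketched here:

```python
import numpy as np
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the literal string 'NaN' into real missing values,
    then drop any row that still contains one."""
    return df.replace("NaN", np.nan).dropna().reset_index(drop=True)
```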

Having done that, we convert the salary range (given as a string) to its mean using regex and store the result back into the column.
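One way to do the regex conversion (the original screenshot is not reproduced, so this is a sketch of the idea):

```python
import re
import numpy as np

def salary_to_mean(salary: str) -> float:
    """Parse strings like '£30,000 to £40,000 a year' into their mean.
    Returns np.nan when no number is found (e.g. 'Competitive')."""
    numbers = [float(n.replace(",", ""))
               for n in re.findall(r"\d[\d,]*(?:\.\d+)?", str(salary))]
    return sum(numbers) / len(numbers) if numbers else np.nan

# Applied to the DataFrame column:
#   df["Salary"] = df["Salary"].apply(salary_to_mean)
```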

Let's get a job description from the links in the Link column. However, not every search result is relevant, and relevancy drops after the first few pages, so we'll slice the DataFrame down to the first 200 rows.
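A sketch of the slicing-and-fetching loop. The description selector is an assumption; inspect one of the job pages to confirm it.

```python
import pandas as pd

def keep_relevant(df: pd.DataFrame, limit: int = 200) -> pd.DataFrame:
    """Relevancy drops after the first few pages, so keep only the top rows."""
    return df.head(limit).copy()

def fetch_descriptions(driver, df: pd.DataFrame) -> pd.DataFrame:
    """Visit each link and pull the description text
    (".description" is a placeholder selector)."""
    from selenium.webdriver.common.by import By
    df = df.copy()
    texts = []
    for url in df["Link"]:
        driver.get(url)
        texts.append(driver.find_element(By.CSS_SELECTOR, ".description").text)
    df["Description"] = texts
    return df
```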

The company and address are intentionally hidden

Step 4: Convert Titles to roles

We will now convert titles to job roles. Looking at the semantics of job titles, most end with a role suffix: engineer, analyst, specialist, and so on. We can begin with a single list and work iteratively, keeping track of any suffixes that did not match and appending them for the next iteration. I've compiled a list of suffixes, detailed below.

The cluster variable is used to store unique roles

After further cleaning, we can convert titles to roles as follows
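A minimal sketch of the suffix-matching idea. The suffix list below is illustrative, not the author's full list; unmatched titles return None so they can be collected and fed into the next iteration.

```python
SUFFIXES = ["engineer", "analyst", "specialist", "developer", "scientist",
            "manager", "consultant", "administrator"]  # illustrative list

def title_to_role(title: str, suffixes=SUFFIXES):
    """Map a raw job title onto a role via its suffix word.
    Returns None when nothing matches, so unmatched titles can be
    reviewed and the suffix list extended on the next iteration."""
    for word in reversed(title.lower().split()):
        if word in suffixes:
            return word
    return None

# df["Role"] = df["Title"].apply(title_to_role)
```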

Now we shall extract skills!

The job-related information available on the DWP digital website can provide us with a summary of the job description. However, it lacks the technical skills required for the job. Using a similar technique, we can extract data from other websites while adhering to each site's crawling policy. (See the robots.txt file on a website to find out which pages you are allowed to extract data from.)

The idea is to extract skills from your CV, compare them to the skills extracted from the job description, calculate the cosine similarity, and then sort the results in descending order to find jobs that match your skills.

Step 1: Extract Text From CV

We will be using a skills-extraction endpoint from RapidAPI.
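The original post does not show how the CV text itself is read, so here is a hedged sketch: plain-text files are read directly, and for PDFs we lean on pypdf (an assumption; the post does not name a PDF library).

```python
def extract_cv_text(path: str) -> str:
    """Read the CV into a single string.
    Plain-text files are read directly; PDFs go through pypdf
    (an assumed choice -- swap in your preferred PDF library)."""
    if path.lower().endswith(".txt"):
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    from pypdf import PdfReader
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```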

Step 2: Extract skills from job description and CV

Script to extract Job Skills
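The script itself appears only as a screenshot in the original post; below is a hedged sketch of what such a call might look like. The endpoint URL, host header, and response shape are all placeholders; substitute the details of whichever RapidAPI skills endpoint you subscribe to.

```python
def extract_skills(text: str, api_key: str):
    """Call a RapidAPI skills-extraction endpoint.
    URL, host, and payload shape below are placeholders."""
    import requests
    resp = requests.post(
        "https://example-skills-api.p.rapidapi.com/extract",  # placeholder
        json={"text": text},
        headers={"X-RapidAPI-Key": api_key,
                 "X-RapidAPI-Host": "example-skills-api.p.rapidapi.com"},
        timeout=30,
    )
    resp.raise_for_status()
    return parse_skills(resp.json())

def parse_skills(payload) -> list:
    """Normalise the JSON payload into a deduplicated, lowercase skill
    list (the payload shape is an assumption)."""
    return sorted({str(s).lower() for s in payload.get("data", [])})
```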

We can do the same for CV skills. To find the cosine similarity, we keep only the skills required by each job description that also appear on the CV. We can accomplish this with a nested list comprehension, as shown below.
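The nested list comprehension might look like this (column names are assumed):

```python
def matching_skills(job_skills_per_row, cv_skills):
    """For each job's skill list, keep only the skills also on the CV."""
    cv = {s.lower() for s in cv_skills}
    return [[s for s in row if s.lower() in cv] for row in job_skills_per_row]

# df["Matched"] = matching_skills(df["Skills"], cv_skills)
```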

Step 3: Calculate Cosine Similarity

We can now calculate cosine similarity with some simple logic: first extract token counts from the skills text in each row of the DataFrame using sklearn's CountVectorizer class, then compute the cosine similarity score.
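A compact sketch of that logic using CountVectorizer and sklearn's cosine_similarity helper (the column names in the usage comment are assumed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def skill_similarity(cv_skills_text: str, job_skills_text: str) -> float:
    """Token-count cosine similarity between CV skills and job skills."""
    counts = CountVectorizer().fit_transform([cv_skills_text, job_skills_text])
    return float(cosine_similarity(counts[0], counts[1])[0, 0])

# Applied per row, then sorted so the best matches rise to the top:
#   df["Similarity"] = df["Skills"].apply(lambda s: skill_similarity(cv_text, s))
#   df = df.sort_values("Similarity", ascending=False)
```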

Final DataFrame

CONCLUSION

This project can be expanded to extract salary information by location and job role and populate it into the same DataFrame, creating a Power BI report with salary, similarity percentage, and skills as filters.

I'd love to hear from anyone who used the project to find work, as well as any lessons you learned that you think others could benefit from. Please do not hesitate to contact me via LinkedIn.

If you want to improve your data skills and solve real-world problems, or simply want to learn data analysis as a hobby, visit our website Purple Beard and sign up for one of the data analysis programs we offer.

We hope to see you on one of our programs soon!

Purple Beard: https://purplebeard.co.uk/


I work as a Data Scientist at Purple Beard and teach data analysis part-time. Follow for interesting data analysis blogs :D