Job matching: a new challenge in search

Jeremy Bradley
Datasparq Technology
7 min read · May 4, 2021

A small bit of web history

Some years ago in the 90s when I was doing my MSc in Computer Science and the web was the next big tech frontier, we had a problem.

Photo by Daniel Lerman on Unsplash

That problem was search — it was broken, and not just a little bit either, but totally broken. The search engines of the day treated search much as a word processor's find function does now. They gave you a search box and returned some or even all of the pages in their catalogue that contained your search term. If there was a tie between matching pages, the engine might check whether your search term appeared in a title or heading, or appeared multiple times within the body of the page.

The inevitable happened — spam pages emerged with huge numbers of hidden words and likely search terms embedded in them. Overnight web search as a tool became completely useless. You couldn’t find anything.

That was of course until Sergey Brin and Larry Page wrote a paper as part of their PhD detailing a neat piece of what would later become known as Data Science that could recover relevance and importance in web search. That piece of applied mathematics was PageRank and it became a $1tn company — you will know it as Google.

Job search

Well, searching for a job is a bit like that now. Job postings appear containing all sorts of relevant or not-so-relevant information. They’re posted with wildly diverging titles even if they represent essentially the same job. So a Primary Teacher role might get posted with the title: “Part-time Teacher (Primary) years 5–6 Ipswich-starts September” or equally “Primary Teacher (Fixed term maternity cover) £24k — Brent”.

All because job forums and recruiting companies think they can elevate their postings in a simple search box. Meanwhile applicants are reduced to searching for terms they think might match the sort of job they are interested in — often the job title. The result is a deeply unsatisfying user experience — lists of over-specific jobs that do not represent the breadth of possibility for the applicant’s capability.

Job search is as broken now as searching for a web page in 1996.

The Job Finder Machine

We started work on the Job Finder Machine to address this problem in December 2020. As my colleague Oli Bartlett has written in his post, this was part of a data-science-for-good project with the Emergent Alliance to answer the challenge of getting people with skills into well-matched jobs after the pandemic.

Our Data Science Hypothesis

Our data set consisted of a set of job titles representing current or recent job openings and associated with each title was a list of skills that were required by that job.

As with all good Data Science projects, we started with a working hypothesis which we tried to disprove.

Hypothesis: We can generate high quality job matches based on user-entered natural language descriptions of skills alone

There were the usual problems of skills being duplicated, spelt wrong, capitalised, not linked with the skill with exactly the same name. Also we were expecting user input of skills from the applicant with exactly the same issues. Some useful techniques from NLP came to hand to help address some of these issues — but not all.

How did we do it?

First we developed a conceptual model of the jobs-skills landscape: a bipartite graph connecting a set U of jobs with a set V of skills.

A bipartite graph representing the relationship between job postings and skills
A graph of U, the set of job postings, to V, the set of skills needed by a job.

Why bipartite? Because at this stage, we only had a relationship between jobs and skills or skills and jobs and there were no direct relationships between jobs (this would come later).

What did this represent? It showed simply that some jobs need a set of skills and those skills might also be required by some other jobs. Not earth shattering but still useful.

Imagine that an applicant highlights some skills in set V that are relevant to them. Could we use that information, together with the graphical model, to return a list of jobs that were a good match for the user’s skills? Potentially.

For scale, U might contain as many as 180,000 current jobs at any one time, and V might have 3 million skills listed.

Just returning all the jobs that happened to mention one of the skills the candidate mentioned was a non-starter — we would potentially have many thousands of matching jobs.

Equally, we realised quite quickly that returning only the jobs that matched all the skills a user mentioned quickly reduced the number of matches to zero — and even when it did not, it very clearly missed lots of jobs that were very similar in nature but happened not to mention one of the skills that other jobs with the same name had mentioned.
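To make the model concrete, the bipartite structure and the naive “any skill matches” query can be sketched in plain Python. The job titles and skills here are purely illustrative toy data, not the real 180,000-job dataset:

```python
# Toy bipartite model: U = job postings, V = skills, stored as an
# adjacency map from job -> set of required skills.
job_skills = {
    "Data Scientist": {"Python", "SQL", "Statistics"},
    "Software Engineer": {"Python", "Git", "Testing"},
    "Primary Teacher": {"Lesson Planning", "Safeguarding"},
}

# The reverse direction of the same bipartite graph: skill -> jobs needing it.
skill_jobs = {}
for job, skills in job_skills.items():
    for skill in skills:
        skill_jobs.setdefault(skill, set()).add(job)

# The naive "return every job mentioning any candidate skill" query --
# the non-starter described above.
candidate = {"Python", "SQL"}
matches = set().union(*(skill_jobs.get(s, set()) for s in candidate))
print(sorted(matches))  # ['Data Scientist', 'Software Engineer']
```

At real scale this any-match set explodes into thousands of results, which is why the steps that follow move on to aggregation, similarity, and ranking.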

Step 1: Constructing a common skills description for a job

Job descriptions and their associated and likely different skills descriptions.

Many jobs with the same job title had slightly differing skills.

We were trying to give a good match to a set of likely jobs and not necessarily a precise match to a specific posting. Also it might be that a hiring manager had left off some skills or had assumed them to be implicit in the role.

Other companies, meanwhile, were more precise about how they described the skills that a job needed.

We could learn from the aggregate data to come to a common skills description of a job with a given title. It would fill in some of the gaps left by the individual postings. It would also allow us to capture phrases and synonyms for skills that different companies had used.

We could also start to pick out the most frequent skills for a job and use that to give an indication of the importance of a skill for each job title.
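Step 1 amounts to aggregating skill mentions across postings that share a title. A minimal sketch, using hypothetical postings and `collections.Counter`, with mention frequency standing in for skill importance:

```python
from collections import Counter

# Hypothetical postings sharing one job title, each with slightly
# different skill lists.
postings = {
    "Primary Teacher": [
        {"Lesson Planning", "Safeguarding", "Phonics"},
        {"Lesson Planning", "Safeguarding", "Behaviour Management"},
        {"Lesson Planning", "Phonics"},
    ],
}

def common_skills(title):
    """Aggregate a common skills description for a job title.

    Importance of a skill = fraction of postings for the title that
    mention it, so skills omitted by one hiring manager are filled in
    from the aggregate.
    """
    counts = Counter()
    for skills in postings[title]:
        counts.update(skills)
    n = len(postings[title])
    return {skill: count / n for skill, count in counts.items()}

profile = common_skills("Primary Teacher")
print(profile["Lesson Planning"])    # 1.0 -- mentioned in every posting
print(round(profile["Phonics"], 2))  # 0.67 -- mentioned in two of three
```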

Step 2: Establishing a jobs network

With a common view of skills for each job title we could now look at how similar jobs are in terms of skills.

The overlap in skills between a Software Engineer and a Data Scientist — represented as a Venn diagram.
A Venn diagram showing the similarity between two related job titles

We constructed a similarity metric between job titles, by comparing their skill sets.

We can do this by computing the Jaccard Index between each pair of job titles.

This is then used to assign a weight to the edge between a network of job titles. This can be visualised in the following example of Job Titles Network.

A network of job titles with associated skills similarity measures
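The Jaccard index for a pair of job titles is the size of the intersection of their skill sets divided by the size of the union. A small sketch with hypothetical aggregated skill sets:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A n B| / |A u B|."""
    return len(a & b) / len(a | b)

# Hypothetical aggregated skill sets for two job titles.
software_engineer = {"Python", "Git", "Testing", "SQL"}
data_scientist = {"Python", "SQL", "Statistics", "Machine Learning"}

# Edge weight between the two titles in the job-titles network.
weight = jaccard(software_engineer, data_scientist)
print(round(weight, 2))  # 0.33 -- 2 shared skills out of 6 distinct
```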

Step 3: Clustering a network of Job Titles

Having obtained a job titles network, we are in the exciting position of being able to solve the problem of having lots of very differently described jobs with very similar skills. These naturally cluster together, and by choosing the clustering granularity we can create a higher or lower level of abstraction over the jobs graph.

Clustering a graph
A clustered graph of job titles

With this clustered representation we can have a single representation of a set of job titles with a set of similar skills.
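One minimal way to cluster such a network — a stand-in for whatever community-detection method the production system uses — is to join titles whose similarity clears a threshold and take connected components. Raising or lowering the threshold gives a coarser or finer abstraction; all titles and weights below are illustrative:

```python
# Hypothetical weighted edges between job titles (Jaccard similarities).
edges = {
    ("Data Scientist", "Data Analyst"): 0.6,
    ("Data Scientist", "Software Engineer"): 0.3,
    ("Primary Teacher", "Teaching Assistant"): 0.5,
}

def cluster(titles, edges, threshold):
    """Group titles via union-find over edges above the threshold."""
    parent = {t: t for t in titles}

    def find(x):
        # Path-halving lookup of a node's cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), w in edges.items():
        if w >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in titles:
        groups.setdefault(find(t), set()).add(t)
    return sorted(map(sorted, groups.values()))

titles = {t for pair in edges for t in pair}
print(cluster(titles, edges, threshold=0.4))
# [['Data Analyst', 'Data Scientist'],
#  ['Primary Teacher', 'Teaching Assistant'],
#  ['Software Engineer']]
```

With `threshold=0.2` the first two titles would absorb Software Engineer as well — the same graph viewed at a coarser granularity.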

Step 4: Ranking job clusters

Finally, we modified the Page and Brin algorithm from 1999 [2] and created our own ranking algorithm across the skills-jobs network using a discrete-time Markov chain. Now we can take into account how important skills are for a given job (how frequently they are mentioned) — and extend that to clusters of similar jobs.

Now when an applicant lights up some of the skills in the original data set, we can produce a probabilistic ranking of job clusters and return meaningful, importance-ranked output based on continuously updated historical data.
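A personalised PageRank-style random walk over the toy jobs-skills graph gives the flavour of this ranking step: the walk alternates between jobs and skills, and teleports back to the candidate's own skills. The data, damping factor, and iteration count are illustrative assumptions, not the production algorithm:

```python
# Toy bipartite graph (hypothetical data).
job_skills = {
    "Data Scientist": {"Python", "SQL", "Statistics"},
    "Software Engineer": {"Python", "Git"},
    "Primary Teacher": {"Lesson Planning", "Safeguarding"},
}
skill_jobs = {}
for job, skills in job_skills.items():
    for s in skills:
        skill_jobs.setdefault(s, set()).add(job)

def rank_jobs(candidate_skills, damping=0.85, iters=50):
    """Personalised PageRank by power iteration over jobs + skills."""
    nodes = list(job_skills) + list(skill_jobs)
    rank = {n: 1 / len(nodes) for n in nodes}
    # Teleport distribution concentrated on the candidate's skills.
    teleport = {n: (1 / len(candidate_skills) if n in candidate_skills else 0.0)
                for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = job_skills.get(n) or skill_jobs.get(n)  # other side of graph
            share = damping * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    # Report only the job nodes, highest probability first.
    return sorted(((rank[j], j) for j in job_skills), reverse=True)

for score, job in rank_jobs({"Python", "SQL"}):
    print(f"{job}: {score:.3f}")
```

A candidate highlighting Python and SQL ranks Data Scientist above Software Engineer (SQL points only at the former), with Primary Teacher effectively at zero — matching partial overlaps instead of demanding all skills.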

What about algorithm bias?

We need to address the issue of potential bias in the algorithm. This is a massive topic for any recruitment or applicant matching algorithm. While this approach is not immune, the good news is that because the training data is not based on individual applicants we side-step the biggest single source of bias in job-matching — that of prior applicant selection bias.

So where can bias occur? Well, in two main areas:

  1. The job descriptions themselves might contain a biased perspective on skills, but unless this is a sector-wide problem, individual postings will get amalgamated with other jobs from the same sector and the effect of the bias will be significantly diluted if not removed altogether.
  2. The candidate applying may be self-selective about how they describe their own abilities. However the algorithm looks for jobs that match partially and “fills in the blanks” for missing skills. This then gets displayed in the percentage match metric.

Summary

We have presented a brief overview of our approach to tackling the job search problem. We had a lot of fun doing it and were able to deploy on a GCP-BigQuery-Dataflow architecture two months after starting the project.

If you are interested in joining us for exciting projects like this or just finding out how your company can benefit from our Data Science and Engineering expertise, please get in touch. We’ll be delighted to hear from you.

Huge credit goes to the team from Code First Girls who brilliantly innovated on the data science behind the Job Finder Machine with us at DataSparQ: Erika Gravina, Dehaja Senanajaka and Rajwinder Bhatoe.

References

[1] The Job Finder Machine: https://labs.datasparq.ai/job-finder-machine/

[2] Page, L., Brin, S., Motwani, R., and Winograd, T. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab. 1999.

[3] Oliver Bartlett. Using Data Science to help people get back into work. Medium, 2021.

[4] Hierarchical clustering of networks. Wikipedia, April 2021.
