Ranking resumes with Natural Language Processing

Using AI technologies to help HR rank candidates

Cheikh Gueye Wane
6 min read · Jun 12, 2022
Resume screening process

The job market is changing: the number of open positions is rising, and so is the number of applicants. On average, a job post can attract 20 to 30 candidates, which makes recruiting hard and time-consuming given the amount of resumes to filter against the company's requirements. If the filtering is too strict, a company can end up retaining no one. In this article we will show how one can build a resume ranking search engine using Natural Language Processing (NLP) to solve these issues and apply it to rank applicants.

Information Retrieval System

Information retrieval process

Generally, the hiring process belongs to the field of Information Retrieval (IR). IR is the science of searching for information (in this case, resumes) that is relevant to a query, within a collection of documents¹. A hiring process can be split into several steps: job posting, receiving applications, resume screening, profile selection, call for interview, and so on. As we can see, it follows the same process as IR:

  • the job posting is the query;
  • the candidate applications form the collection of documents (resumes);
  • and the selected profiles are supposed to be relevant to that query.

There are two crucial questions to answer: how do we retrieve the documents relevant to a query, and how do we rank them?

Let D = {d1, d2, …, dn} be the document set, Q = {q1, q2, …, qm} the set of queries, and f a scoring (ranking) function that maps each document d to a probability of being relevant to a query q:

s = f(q, d)

Ranking or scoring function

Di = {di1, di2, …, dim} represents the set of documents associated with the query qi, and we define a permutation πi on Di as a bijection of Di into itself. Πi denotes the set of all possible permutations on Di, and πi(j) represents the rank, or position, of the j-th document in the permutation πi. Ranking then comes down to choosing a permutation πi ∈ Πi for a given query qi and its associated documents Di, using the scores s = f(qi, di).
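Concretely, choosing a permutation by score is just a sort. Here is a minimal sketch, using a toy word-overlap score as f (an assumption for illustration; real systems use statistical or learned scoring functions):

```python
# Ranking as choosing a permutation: sort documents by a scoring
# function f(q, d). Here f is a toy word-overlap score, NOT a real
# retrieval model.

def f(query: str, doc: str) -> float:
    """Fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def rank(query: str, docs: list[str]) -> list[int]:
    """Return the permutation pi: document indices sorted by descending score."""
    scores = [f(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)

docs = [
    "java developer with spring experience",
    "python developer with 5 years experience",
    "graphic designer and illustrator",
]
print(rank("python developer", docs))  # [1, 0, 2]
```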

Document retrieval step

In this step, given a query, we need to retrieve the documents that are relevant to it. A query is a set of words describing what the user wants. For example, the Google search engine takes a set of words (e.g. how to become a developer) and returns the document(s) judged relevant to the query.

The first approaches to document retrieval relied on rule-based systems and statistical methods. A rule-based system defines a set of rules for recognizing patterns in a document. A pattern can be a date of birth, a name, a date range (years of experience), etc. Rule-based systems are fast and accurate, but they need to be applied in a well-known domain, or in a domain where changes don't occur frequently.
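As a sketch of such a system, here are two hypothetical regex rules (email and year range); any real rule set would be far larger and tuned to the resumes at hand:

```python
import re

# A minimal rule-based sketch: regex rules that pull simple patterns
# out of a resume. The two rules below are illustrative assumptions,
# not a production rule set.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "year_range": re.compile(r"(19|20)\d{2}\s*[-–]\s*(19|20)\d{2}"),
}

def extract(text: str) -> dict[str, list[str]]:
    """Apply every rule and collect its matches."""
    return {name: [m.group(0) for m in rule.finditer(text)]
            for name, rule in RULES.items()}

resume = "Jane Doe, jane.doe@example.com, backend developer 2015-2020"
print(extract(resume))
# {'email': ['jane.doe@example.com'], 'year_range': ['2015-2020']}
```

Rules like these break as soon as the layout or wording varies, which is exactly the brittleness discussed below.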

The statistical methods compute features on each document (a co-occurrence matrix or tf-idf) in order to retrieve the document(s) matching the query. Let's take tf-idf as an example.

tf-idf(t, d) = tf(t, d) · log(N / df(t))

Tf-Idf formula

Tf computes the frequency of a term t in a document d; idf is used to reduce the weight of terms that appear frequently (e.g. articles) and increase the weight of terms that occur rarely.
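The formula fits in a few lines. This sketch uses the smoothed variant idf = log(N / (1 + df)), one of several common idf variants (an assumption, as the article does not fix one):

```python
import math

# From-scratch tf-idf: tf is the term's in-document frequency, idf
# down-weights terms that occur in many documents. Uses the smoothed
# variant log(N / (1 + df)) to avoid division by zero.

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

corpus = [
    "python developer with django experience".split(),
    "java developer with spring experience".split(),
    "data scientist with python and statistics".split(),
]
# "django" is rare (1 of 3 docs) so it scores high; "with" appears in
# every document, so its smoothed idf is negative.
print(tfidf("django", corpus[0], corpus))
print(tfidf("with", corpus[0], corpus))
```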

Word embeddings

A resume is an unstructured or semi-structured document: there is no standard for how to write a resume or lay it out. The main disadvantage of such systems, applied to resumes, is the complexity of the rules, which makes them hard to maintain and evolve. Word context is also not taken into account (neither statistical methods nor rule-based systems can tell a user's name from a company name). That's where Natural Language Processing brings a solution, with Named Entity Recognition and word embeddings.

Named Entity Recognition — NER is a technique used in information extraction to locate and classify named entities in unstructured text. A named entity can be the name of a person, a place, an organisation, a prize, etc.

Named Entity Recognition
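As a toy illustration of the task only, here is a naive gazetteer-based tagger (the entity lists are invented; real NER uses trained sequence models such as the ELMo-based one described later in this article):

```python
# Naive gazetteer NER: look each token up in a fixed entity list.
# The gazetteer below is a made-up example; this approach cannot use
# context, which is precisely why learned models are needed.
GAZETTEER = {
    "google": "ORG", "microsoft": "ORG",
    "paris": "LOC", "dakar": "LOC",
}

def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Label each token with an entity type, 'O' if unknown."""
    return [(t, GAZETTEER.get(t.lower(), "O")) for t in tokens]

print(tag("worked at Google in Dakar".split()))
# [('worked', 'O'), ('at', 'O'), ('Google', 'ORG'), ('in', 'O'), ('Dakar', 'LOC')]
```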

The goal of word embeddings is to create vector representations that carry the meaning of words, sentences or documents. By transforming words into vectors, we can apply operations such as computing the similarity between two words. Each component of the vector is a feature, and taken together, the vectorized words form a multidimensional semantic space. That space is mapped so that documents with similar meanings are close to each other and documents with different meanings are further apart.

Example of word embeddings
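To make the idea concrete, here is a sketch with made-up 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

```python
import math

# Toy "embeddings": hand-made 3-d vectors, chosen only so that
# related words point in similar directions.
VECS = {
    "php":     [0.9, 0.1, 0.0],
    "laravel": [0.8, 0.2, 0.1],
    "python":  [0.1, 0.9, 0.0],
}

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

# Related words sit closer together in the vector space:
print(cosine(VECS["php"], VECS["laravel"]) > cosine(VECS["php"], VECS["python"]))  # True
```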

Thus, using word embeddings we can do what is called query expansion. Expanding a query means creating an alternative query from the initial one. The expanded query is made of words similar to those of the original query (in meaning, or words that most often occur together with them) and can increase the number of documents retrieved.

Query expansion — from the query Laravel developer we can derive: Symfony developer or PHP developer.
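Query expansion can then be sketched as a nearest-neighbour lookup in the embedding table (the vectors below are invented for illustration):

```python
import math

# Query expansion as nearest-neighbour search over toy embeddings.
# The vectors are made up so that laravel/symfony/php cluster together.
VECS = {
    "laravel":   [0.8, 0.2, 0.1],
    "symfony":   [0.7, 0.3, 0.1],
    "php":       [0.9, 0.1, 0.0],
    "developer": [0.0, 0.1, 0.9],
    "designer":  [0.0, 0.8, 0.4],
}

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x))
    return dot / (norm(u) * norm(v))

def expand(term: str, k: int = 2) -> list[str]:
    """The k most similar vocabulary terms, excluding the term itself."""
    others = [w for w in VECS if w != term]
    return sorted(others, key=lambda w: cosine(VECS[term], VECS[w]), reverse=True)[:k]

print(expand("laravel"))  # ['symfony', 'php']
```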

We will use word embeddings and NER to explain how we can retrieve documents for our ranking system. There are many ways to create embeddings, but we will focus on creating word embeddings with recurrent neural networks, more specifically using ELMo.

ELMo: Embeddings from Language Models

ELMo

ELMo is a deep bidirectional language model (biLM)², pre-trained on a large text corpus. ELMo representations can easily be added to existing models, and they significantly improve the state of the art across many challenging NLP problems, including question answering, sentiment analysis and Named Entity Recognition².

Here is the architecture of the model we've created using ELMo:

Named Entity Recognition Model architecture

Ranking step

For the ranking step, we define a scoring function f that assigns to each document a relevance score or probability. The most basic choice is cosine similarity, which measures how close two documents are. For more specific cases we can, for example, rank by the pair years-of-experience + query-matching score.

For demonstration purposes, we define the ranking function as the pair (y.of.exp, query-matching score). The result can be seen in the next figure, where the query was “python developer”:

Ranking using years of experience and query matching score

For each resume, the relevance to the user query is shown as a percentage.
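That demo ranking can be sketched as a lexicographic sort on the pair; the sample resumes below are invented:

```python
# Rank resumes by (query-matching score, years of experience):
# the match score decides first, years of experience breaks ties.
resumes = [
    {"name": "A", "years": 3, "match": 0.90},
    {"name": "B", "years": 6, "match": 0.90},
    {"name": "C", "years": 8, "match": 0.40},
]

ranked = sorted(resumes, key=lambda r: (r["match"], r["years"]), reverse=True)
for r in ranked:
    print(f'{r["name"]}: {r["match"]:.0%} match, {r["years"]} yrs')
# B: 90% match, 6 yrs
# A: 90% match, 3 yrs
# C: 40% match, 8 yrs
```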

We have shown that Natural Language Processing can make the recruitment process easier and more flexible (e.g. query expansion). There is still work to do in this relatively new field, and more advanced techniques can be applied to make screening more accurate (e.g. supervised learning-to-rank models: pointwise, pairwise and listwise algorithms).

Footnotes

1 : https://en.wikipedia.org/wiki/Information_retrieval

2 : https://arxiv.org/abs/1802.05365
