Performing Resumé Analysis using NER with Cosine Similarity

Expediting the Hiring Process using Named Entity Recognition (NER) Systems

Ime Inyang Jnr.
Python’s Gurus
10 min read · Jun 24, 2024


Some weeks ago, I was researching ways to develop a model for automatic equipment logging from email requests when I came across a DataCamp blog post by Adib Ali Anwan on Named Entity Recognition (NER) and decided to try my hand at it.

Named Entity Recognition (NER) is a technology used in natural language processing (NLP) to identify and classify key pieces of information (entities) in text. These entities typically include names of people, organizations, locations, dates, and other specific items. It helps computers understand and extract meaningful information from text, which can then be used for various applications like search engines, recommendation systems, and data analysis.
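To make this concrete, here is a minimal sketch of spaCy's out-of-the-box NER (assuming the en_core_web_sm model is installed; the sentence and names are made up purely for illustration):

import spacy

# Assumes the small English model has been installed with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ada Lovelace joined Google in London in March 2023.")
for ent in doc.ents:
    # Each entity exposes its text span and a label such as PERSON, ORG, GPE, or DATE
    print(ent.text, ent.label_)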

Anwan’s blog post details the steps to perform a resumé analysis for hiring managers using NER. It took me hours of reading and debugging exceptions to build it from his guide, but thanks to ChatGPT, the process was simplified.

In his article, Anwan highlights 6 steps to follow:

  • Importing the necessary packages, including spaCy, nltk, and pandas,
  • Loading the resumé data and the NER model,
  • Creating an entity ruler,
  • Cleaning the text data,
  • Performing entity recognition, and
  • Performing match scoring.

This article details how to build the NER model for resumé analysis using the steps listed. You can find the full code in my GitHub repo.

Importing necessary packages

  • For entity recognition, we will use spaCy.
  • For stop words and lemmatization, we will use stopwords and WordNetLemmatizer from the nltk package.
  • We will also use PyPDF2 to convert the PDF files to string data, and pandas to hold them in a DataFrame.
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import PyPDF2
import pandas as pd

Convert PDF to CSV

Next, we convert the PDF files to CSV. To do this, we:

  • Define a function called extract_text_from_pdf that takes a PDF file path as input and returns the extracted text from the PDF.
  • Iterate over each PDF file path in a list called pdf_files, extract the text from each PDF using the extract_text_from_pdf function, and store the texts in a list.
  • Create a DataFrame with columns ID (to uniquely identify each resumé) and resume_text (to store the extracted text from resumés).
  • Lastly, save the DataFrame to a CSV file named resumes.csv.
def extract_text_from_pdf(pdf_path):
    text = ''
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # extract_text() can return None for pages with no extractable text
            text += page.extract_text() or ''
    return text

# List of PDF file paths containing resumes
pdf_files = ['resume1.pdf', 'resume2.pdf', 'resume3.pdf', 'resume4.pdf', 'resume5.pdf']  # replace with file paths to the resumes being analyzed

# Extract text from each PDF resume and store it in a list
resumes_text = [extract_text_from_pdf(pdf_path) for pdf_path in pdf_files]

# Create a DataFrame with columns 'ID' and 'resume_text'
data = pd.DataFrame({'ID': range(1, len(pdf_files) + 1), 'resume_text': resumes_text})

# Save the DataFrame to a CSV file
data.to_csv('resumes.csv', index=False)

Loading the Data and NER model

Next, we load resumes.csv, which contains the unique IDs and the corresponding resumé text. Then we load the spaCy en_core_web_sm model.

# Load data from CSV file
data = pd.read_csv('resumes.csv')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

Entity Ruler

At this point, we add an entity ruler pipeline. EntityRuler in spaCy is a tool that helps you identify specific types of words or phrases in text based on predefined patterns. It is useful for identifying specific terms relevant to your domain that might not be recognized by pre-trained models, and ensures certain important terms are always recognized and labeled correctly.

We add an entity ruler pipeline to the spaCy model (nlp) and create an entity ruler using a list of dictionaries containing labels and patterns for skills. To ensure that these custom rules are applied before the statistical NER model runs, you can specify the position of the EntityRuler in the pipeline using the before parameter. This ensures that the custom entities identified by the EntityRuler are recognized and preserved when the NER model runs.

# Add an entity ruler pipeline to the spaCy model, before the statistical NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define patterns as dictionaries
# "LOWER" ensures that variations in case (uppercase, lowercase, title case) all match the same pattern
patterns = [
    {"label": "SKILL", "pattern": [{"LOWER": "skill_1"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "skill_2"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "skill_3"}]},
    {"label": "SKILL", "pattern": [{"LOWER": "skill_4"}]}
]

# Add patterns to the entity ruler
ruler.add_patterns(patterns)

Text Cleaning

In this section, we will clean our dataset using the nltk library. Cleaning the resume text is a crucial preprocessing step that enhances the quality and accuracy of entity recognition. It removes irrelevant information, normalizes the text, and reduces noise (hyperlinks, HTML tags, special characters and punctuation), ensuring that the entity recognition process can focus on the actual content of the resumes.

We perform the text cleaning using the following steps:

  • Define a function clean_text that takes a text input and performs the cleaning steps.
  • Remove hyperlinks, special characters, and punctuation using regular expressions.
  • Convert the text to lowercase and tokenize it into words.
  • Lemmatize each word to its base form using the WordNet Lemmatizer.
  • Remove English stopwords using NLTK’s stopwords corpus.
  • Finally, apply clean_text to the resume_text column in the DataFrame and store the cleaned text in a new column called cleaned_resume.
import re
import nltk
from nltk.tokenize import word_tokenize

# Download NLTK resources
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # English stop word list
nltk.download('wordnet')    # lexical database used by the lemmatizer

# Initialize WordNet Lemmatizer
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove hyperlinks, HTML tags, special characters, and punctuation using regex
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[^\w\s\n]', '', text)

    # Convert the text to lowercase
    text = text.lower()

    # Tokenize the text using nltk's word_tokenize
    words = word_tokenize(text)

    # Lemmatize each word to its base form for normalization
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Remove English stop words and rejoin into a single string
    stop_words = set(stopwords.words('english'))
    filtered_words = ' '.join([word for word in lemmatized_words if word not in stop_words])

    return filtered_words

# Clean the 'resume_text' column in the DataFrame
data['cleaned_resume'] = data['resume_text'].apply(clean_text)

Entity Recognition

After adding a new pipeline to our model, we can visualize the named entities in our text using the render function in the spaCy displacy module. By passing the input text through the language model, we can highlight the words with their labels.

Here are the steps we will follow:

  • Import the displacy module from spacy.
  • Define options for visualization, specifying the entity labels we want to display and their corresponding colors.
  • Loop through each cleaned resume text in the DataFrame.
  • Process each resume text with the spaCy model to obtain a Doc object.
  • Use displacy.render to visualize the named entities in the text with their labels highlighted. Set jupyter=True to display the visualization in a Jupyter notebook.
from spacy import displacy

# Define options for visualization: which entity labels to display and their colors
options = {'ents': ['PERSON', 'GPE', 'SKILL'],
           'colors': {'PERSON': 'orange',
                      'GPE': 'lightgreen',
                      'SKILL': 'lightblue'}}

# Visualize named entities in each resume
for resume_text in data['cleaned_resume']:
    doc = nlp(resume_text)
    displacy.render(doc, style="ent", jupyter=True, options=options)
    print('\n\n')

Output: each resumé is rendered with its PERSON, GPE, and SKILL entities highlighted in the colors defined above.

Upon implementation, you may find that the spaCy en_core_web_sm model does not perform well at recognizing names, especially foreign ones. One way to handle this is to create and add custom patterns to the entity ruler to help it recognize specific names.
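For example, here is a hedged sketch of how such name patterns could be added to the same ruler created earlier (the names below are placeholders, not values from the original project):

# Hypothetical PERSON patterns so specific names the pre-trained model misses are always tagged
name_patterns = [
    {"label": "PERSON", "pattern": [{"LOWER": "ime"}, {"LOWER": "inyang"}]},
    {"label": "PERSON", "pattern": [{"LOWER": "adaeze"}]},
]
ruler.add_patterns(name_patterns)

Since the cleaning step lowercases the text anyway, matching on the LOWER attribute keeps these patterns consistent with the rest of the pipeline.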

Match Score

Here comes the crowning and trickiest part of the whole project: calculating the similarity score. The similarity score, or match score, measures how closely one item matches another, in this case how well each resumé matches a company's employment requirements.

Various methods can be implemented to achieve this, including BERT embeddings, word embeddings, and TF-IDF (Term Frequency-Inverse Document Frequency). Here we will use two different methods: TF-IDF with cosine similarity, and skill matching via the entity ruler's skill extraction. I'll explain how each works, along with its advantages and use cases.

TF-IDF Score

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is commonly used in text mining and information retrieval to identify relevant keywords in documents. TF-IDF vectors are numerical representations of text data, capturing the significance of words in the context of the entire corpus.

Term Frequency (TF): TF is the measure of how frequently a term (word) appears in a document. It is the number of times term t appears in document d divided by the total number of terms in document d.

The term frequency increases as the number of occurrences of a term in a document increases.

Inverse Document Frequency (IDF): IDF measures the importance of a term in the entire corpus. It is the logarithm of the total number of documents divided by the number of documents containing term t.

The IDF value increases as the term appears in fewer documents, indicating that the term is more unique.

TF-IDF Score: The TF-IDF score is the product of TF and IDF.
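Putting the three definitions together in standard textbook form (note that scikit-learn's TfidfVectorizer, used below, applies a smoothed IDF and L2-normalizes each vector by default, so its numbers differ slightly from these formulas):

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)

where f_{t,d} is the count of term t in document d, D is the corpus, and N is the total number of documents in it.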

Here are the steps for calculating the match score for each resume using TF-IDF:

  • Define the company requirements as a string.
  • Clean the company requirements using the clean_text function we defined earlier.
  • Calculate the TF-IDF vectors for the company requirements and each resume text.
  • Calculate the cosine similarity between the TF-IDF vector of the company requirements and each resume.
  • Sort the indices of resumes based on the similarity scores in descending order.
  • Display the top N most similar resumes along with their similarity scores.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Define the company requirements
company_requirements = "Company Requirements"

# Clean the company requirements with the same preprocessing used for the resumes
cleaned_company_requirements = clean_text(company_requirements)

# Calculate TF-IDF vectors for the resume texts and the company requirements
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data['cleaned_resume'])
company_tfidf = tfidf_vectorizer.transform([cleaned_company_requirements])

# Calculate cosine similarity between the company requirements and each resume
similarity_scores = cosine_similarity(company_tfidf, tfidf_matrix).flatten()

# Get the indices of resumes sorted by similarity score (descending)
sorted_indices = similarity_scores.argsort()[::-1]

# Display the top 5 most similar resumes
top_n = 5
for i in range(top_n):
    index = sorted_indices[i]
    print(f"Resume ID: {data['ID'][index]}")
    print(f"Similarity Score: {similarity_scores[index]}")
    print(data['resume_text'][index])
    print()

Skill Extraction using Entity_Ruler

The skill extraction using EntityRuler is rather straightforward. It counts the required skills that appear in the resumé and divides that count by the larger of the two skill lists (required versus extracted).

The steps are as follows:

  • Define a function calculate_similarity that takes the resume text and required skills as input.
  • Process the resume text with the spaCy model which has already been trained to recognize some skills.
  • Extract skills from the resume by filtering entities with the label “SKILL” using list comprehension.
  • Calculate the number of matching skills between the resume and the required skills.
  • Calculate the similarity score by dividing the number of matching skills by the maximum of the lengths of required skills and extracted skills.
  • Finally, we return the similarity score.
def calculate_similarity(resume_text, required_skills):
    # Process the resume text with the spaCy model
    doc = nlp(resume_text)

    # Extract skills from the resume using the entity ruler
    skills = [ent.text.lower() for ent in doc.ents if ent.label_ == "SKILL"]

    # Count the extracted skills that also appear in the required skills list
    matching_skills = [skill for skill in skills if skill in required_skills]
    num_matching_skills = len(matching_skills)

    # Calculate the similarity score
    similarity_score = num_matching_skills / max(len(required_skills), len(skills))

    return similarity_score

# Score each cleaned resume against the required skills
required_skills = ["skill_1", "skill_2", "skill_3", "skill_4"]
for text in data[['cleaned_resume']].itertuples(index=False):
    resume_text = str(text[0])
    print(resume_text)
    similarity_score = calculate_similarity(resume_text, required_skills)
    print("Similarity Score:", similarity_score)

TF-IDF with cosine similarity vs calculate_similarity

At this point, you may have noticed that the two similarity calculations produce different results. This is because TF-IDF with cosine similarity compares documents by looking at how often specific words appear in each document. It gives more weight to words that are rare in the overall collection of documents but common within a specific document. Cosine similarity then measures how similar two documents are by calculating the cosine of the angle between their TF-IDF vectors. This method is useful for quantitatively assessing how much textual content two documents share.
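Concretely, for the TF-IDF vectors A and B of two documents, cosine similarity is

\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}

and, because TF-IDF weights are non-negative, the score ranges from 0 (no shared weighted terms) to 1 (identical term-weight profiles).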

On the other hand, the calculate_similarity function evaluates resumés based on the presence of required skills according to the predefined rules. This makes it more reliable when you have specific domain knowledge or requirements that TF-IDF alone does not capture effectively, for example when you need to match resumés on particular skills or attributes that word frequencies do not adequately represent.

TF-IDF with cosine similarity is valuable for general text similarity and can be part of a layered approach to resumé analysis.

Conclusion

The resume analysis project leverages NLP techniques to streamline the hiring process. We compared TF-IDF with cosine similarity for general text matching and a custom calculate_similarity function for precise skill matching. Each method has its strengths and limitations.

Recommendation

While the calculate_similarity function excels at identifying specific, and particularly technical, skills that meet a job requirement, TF-IDF with cosine similarity is better at surfacing soft skills, experiences, and personal qualities relevant to a given role that calculate_similarity may miss.

As such, one can adopt a hybrid approach to resumé analysis: first use the calculate_similarity function to quickly filter resumés based on the presence of specific required skills, then apply TF-IDF with cosine similarity to the filtered resumés for a more detailed assessment of overall content relevance and similarity (or run the two stages in the reverse order). Other approaches, such as BERT, can also be brought in for contextual understanding and semantic similarity.
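As a rough sketch of that hybrid idea (the 0.5 threshold is an arbitrary assumption, and the snippet reuses data, required_skills, cleaned_company_requirements, and calculate_similarity from the earlier steps):

# Hypothetical hybrid pipeline: filter by required skills first, then rank the survivors with TF-IDF
SKILL_THRESHOLD = 0.5  # arbitrary cut-off; tune for your own data

# 1. Keep only resumes whose skill-based score passes the threshold
mask = data['cleaned_resume'].apply(
    lambda txt: calculate_similarity(txt, required_skills) >= SKILL_THRESHOLD)
filtered = data[mask].copy()

# 2. Rank the remaining resumes by TF-IDF cosine similarity to the requirements
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(filtered['cleaned_resume'])
company_vec = vectorizer.transform([cleaned_company_requirements])
filtered['score'] = cosine_similarity(company_vec, matrix).flatten()

# 3. Show the shortlist, best match first
print(filtered.sort_values('score', ascending=False)[['ID', 'score']])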

As for the automatic equipment logger, I'll come back to it shortly; with some more reading I should know whether or not NER is the best solution for that problem.

While you’re here, share your thoughts on either topic and send a clap if you found this insightful.
