Entity Extraction from Resumes using Mistral-7b for Knowledge Graphs

Tejpal Kumawat
8 min read · Feb 21, 2024


In the rapidly evolving field of natural language processing (NLP), the ability to accurately extract and analyze information from unstructured text sources has become increasingly important. One of the most challenging and relevant applications of this capability is in processing resumes for the creation of knowledge graphs. Resumes are dense, complex documents that contain a wealth of information about a candidate’s professional history, skills, and qualifications. However, extracting this information accurately and efficiently requires advanced NLP techniques.

This is where “Entity Extraction from Resumes using Mistral-7b-Instruct-v2 for Knowledge Graphs” comes into play. Mistral-7b-Instruct-v2, a state-of-the-art language instruction model, offers an innovative approach to parsing resumes by identifying and categorizing key entities such as names, organizations, job titles, skills, and education details. By leveraging Mistral-7b’s instruct capabilities, we can not only extract these entities with high precision but also structure them in a way that is conducive to the creation of comprehensive knowledge graphs.

Knowledge graphs organize and visualize relationships between entities, providing a holistic view of the data that can be incredibly valuable for various applications, including recruitment, talent management, and job matching. In this blog, we will delve into how Mistral-7b-instruct can transform the process of resume analysis, the technical foundations behind entity extraction, and the steps to construct a knowledge graph from the extracted data. We will also explore the potential benefits and implications of this technology for the future of HR and recruitment analytics.

This is what a typical knowledge graph looks like: entities as nodes, connected by relationship edges.

Because of data privacy concerns, one may not be able to use OpenAI's or other publicly available APIs. So the question is: how can we use an offline model to do this task accurately?

We will be using https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 for our use case.

We will be going step by step to get the relevant entities from the resume.

Step 1: Extracting text from the PDF or Image.

I am not showing the code for this step, but you can use PyMuPDF if you have PDFs, or Pytesseract if you have images of resumes.
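As a minimal sketch of that step (my own helper names, assuming PyMuPDF (`fitz`) and `pytesseract` are installed; not code from this post):

```python
# Hypothetical helpers for Step 1. The OCR/PDF imports are done lazily
# inside the functions so the script loads even without those libraries.
def extract_text_from_pdf(path: str) -> str:
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)

def extract_text_from_image(path: str) -> str:
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path))

def clean_resume_text(raw: str) -> str:
    # Collapse runs of whitespace so the prompt stays compact.
    return " ".join(raw.split())
```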

Resume text for our use case:

text="Developer <span class=\"hl\">Developer</span> Developer - TATA CONSULTANTCY SERVICE Batavia, OH Relevant course work† Database Systems, Database Administration, Database Security & Auditing, Computer Security,Computer Networks, Programming & Software Development, IT, Information Security Concept & Admin,† IT System Acquisition & Integration, Advanced Web Development, and Ethical Hacking: Network Security & Pen Testing. Work Experience Developer TATA CONSULTANTCY SERVICE June 2016 to Present MRM (Government of ME, RI, MS) Developer†††† Working with various technologies such as Java, JSP, JSF, DB2(SQL), LDAP, BIRT report, Jazz version control, Squirrel SQL client, Hibernate, CSS, Linux, and Windows. Work as part of a team that provide support to enterprise applications. Perform miscellaneous support activities as requested by Management. Perform in-depth research and identify sources of production issues.†† SPLUNK Developer† Supporting the Splunk Operational environment for Business Solutions Unit aiming to support overall business infrastructure. Developing Splunk Queries to generate the report, monitoring, and analyzing machine generated big data for server that has been using for onsite and offshore team. Working with Splunk' premium apps such as ITSI, creating services, KPI, and glass tables. Developing app with custom dashboard with front- end ability and advanced XML to serve Business Solution unit' needs. Had in-house app presented at Splunk's .Conf Conference (2016). Help planning, prioritizing and executing development activities. 
Developer ( front end) intern TOMORROW PICTURES INC - Atlanta, GA April 2015 to January 2016 Assist web development team with multiple front end web technologies and involved in web technologies such as Node.js, express, json, gulp.js, jade, sass, html5, css3, bootstrap, WordPress.†Testing (manually), version control (GitHub), mock up design and ideas Education MASTER OF SCIENCE IN INFORMATION TECHNOLOGY in INFOTMATION TECHNOLOGY KENNESAW STATE UNIVERSITY - Kennesaw, GA August 2012 to May 2015 MASTER OF BUSINESS ADMINISTRATION in INTERNATIONAL BUSINESS AMERICAN INTER CONTINENTAL UNIVERSITY ATLANTA November 2003 to December 2005 BACHELOR OF ARTS in PUBLIC RELATIONS THE UNIVERSITY OF THAI CHAMBER OF COMMERCE - BANGKOK, TH June 1997 to May 2001 Skills Db2 (2 years), front end (2 years), Java (2 years), Linux (2 years), Splunk (2 years), SQL (3 years) Certifications/Licenses Splunk Certified Power User V6.3 August 2016 to Present CERT-112626 Splunk Certified Power User V6.x May 2017 to Present CERT-168138 Splunk Certified User V6.x May 2017 to Present CERT -181476 Driver's License Additional Information Skills† ∑††††SQL, PL/SQL, Knowledge of Data Modeling, Experience on Oracle database/RDBMS.† ∑††††††††Database experience on Oracle, DB2, SQL Sever, MongoDB, and MySQL.† ∑††††††††Knowledge of tools including Splunk, tableau, and wireshark.† ∑††††††††Knowledge of SCRUM/AGILE and WATERFALL methodologies.† ∑††††††††Web technology included: HTML5, CSS3, XML, JSON, JavaScript, node.js, NPM, GIT, express.js, jQuery, Angular, Bootstrap, and Restful API.† ∑††††††††Working Knowledge in JAVA, J2EE, and PHP.† Operating system Experience included: Windows, Mac OS, Linux (Ubuntu, Mint, Kali)"

Step 2: Extraction of Entities.

You can use Google Colab to run the code below; I am using an AWS instance to run it.

To achieve our extraction goal as per the schema, I am going to chain a series of prompts, each focused on only one task: extracting a specific entity type. This way, you can avoid token limitations, and the quality of extraction will be better.
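The chaining pattern can be sketched as follows. Here `run_llm` is a placeholder standing in for the Mistral pipeline we set up below, and the templates are toy stand-ins for the full prompts shown later:

```python
# One focused prompt per entity type; results are merged afterwards.
prompt_templates = {
    "person": "Extract only Person entities from: {text}",
    "education": "Extract only Education entities from: {text}",
    "skill": "Extract only Skill entities from: {text}",
}

def run_llm(prompt: str) -> str:
    # Placeholder: in this post, this is the HuggingFacePipeline LLM.
    return '{"entities": []}'

def extract_all(text: str) -> dict:
    # Each entity type gets its own short prompt, keeping every call
    # well under the model's context limit.
    return {name: run_llm(tpl.format(text=text))
            for name, tpl in prompt_templates.items()}
```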

Necessary libraries:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

Download the model:

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    # load_in_8bit=True,
    # load_in_4bit=True,
    device_map="auto",
    use_cache=True,
)

Setting up LangChain and configuring the model:

from langchain.prompts.prompt import PromptTemplate
from langchain import HuggingFacePipeline
from transformers import TextStreamer, pipeline

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=5000,
    do_sample=False,
    repetition_penalty=1.15,
    streamer=streamer,
)

llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0.1})

Now we have the LLM ready to use.

1. Information about the Person

prompt →


person_prompt_tpl="""From the Resume text for a job aspirant below, extract Entities strictly as instructed below
1. First, look for the Person Entity type in the text and extract the needed information defined below:
`id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring to this property to define the relationship between entities. NEVER create new entity types that aren't mentioned below. Document must be summarized and stored inside Person entity under `description` property
Entity Types:
label:'Person',id:string,role:string,description:string //Person Node
2. Description property should be a crisp text summary and MUST NOT be more than 100 characters
3. If you cannot find any information on the entities & relationships above, it is okay to return empty value. DO NOT create fictitious data
4. Do NOT create duplicate entities
5. Restrict yourself to extracting only Person information. No Position, Company, Education or Skill information should be extracted.
6. NEVER Impute missing values
Example Output JSON:
{{"entities": [{{"label":"Person","id":"person1","role":"Prompt Developer","description":"Prompt Developer with more than 30 years of LLM experience"}}]}}

Question: Now, extract the Person for the text below -

{text}

Answer:
"""

This will help us get the information about the person as JSON, following the instructions we have shown it.

from langchain.chains import LLMChain
import time

prompttemplate = PromptTemplate(template=person_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)
t2 = time.time()
print(t2 - t1)

output → Information about the person

{
  "entities": [
    {
      "label": "Person",
      "id": "developer1",
      "role": "Developer",
      "description": "Experienced developer with expertise in Java, JSP, JSF, DB2(SQL), LDAP, BIRT report, Jazz version control, Squirrel SQL client, Hibernate, CSS, Linux, and Windows. Has worked as a Splunk Developer supporting the Splunk Operational environment for Business Solutions Unit."
    }
  ]
}

This is awesome: we got the label, id, role, and description of the person in the form we want.
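In practice the raw pipeline output is a string that may still contain the echoed prompt, so it helps to pull out just the JSON before using it. A minimal sketch (the `parse_entities` helper is my own, not from this post):

```python
import json
import re

def parse_entities(llm_output: str) -> list:
    """Extract the first {...} block from raw LLM text and return its entities."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if not match:
        return []
    try:
        return json.loads(match.group(0)).get("entities", [])
    except json.JSONDecodeError:
        return []

sample = 'Answer:\n{"entities": [{"label": "Person", "id": "developer1"}]}'
print(parse_entities(sample))  # [{'label': 'Person', 'id': 'developer1'}]
```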

2. Information about the Education of the Person:

prompt →

edu_prompt_tpl="""From the Resume text for a job aspirant below, extract Entities strictly as instructed below
1. Look for Education entity type and generate the information defined below:
`id` property of each entity must be alphanumeric and must be unique among the entities. You will be referring to this property to define the relationship between entities. NEVER create other entity types that aren't mentioned below. You will have to generate as many entities as needed as per the types below:
Entity Definition:
label:'Education',id:string,degree:string,university:string,graduationDate:string,score:string,url:string //Education Node
2. If you cannot find any information on the entities above, it is okay to return empty value. DO NOT create fictitious data
3. Do NOT create duplicate entities or properties
4. Strictly extract only Education. No Skill or other Entities should be extracted
5. DO NOT MISS out any Education related entity
6. NEVER Impute missing values
Output JSON (Strict):
{{"entities": [{{"label":"Education","id":"education1","degree":"Bachelor of Science","graduationDate":"May 2022","score":"0.0"}}]}}

Question: Now, extract Education information as mentioned above for the text below -

{text}

Answer:
"""
from langchain.chains import LLMChain
import time

prompttemplate = PromptTemplate(template=edu_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)
t2 = time.time()
print(t2 - t1)

output → information about the education of the person:

{
"entities": [
{
"label": "Education",
"id": "education1",
"degree": "Master of Science in Information Technology",
"university": "KENNESAW STATE UNIVERSITY",
"graduationDate": "May 2015"
},
{
"label": "Education",
"id": "education2",
"degree": "Master of Business Administration in International Business",
"university": "AMERICAN INTER CONTINENTAL UNIVERSITY ATLANTA",
"graduationDate": "December 2005"
},
{
"label": "Education",
"id": "education3",
"degree": "Bachelor of Arts in Public Relations",
"university": "THE UNIVERSITY OF THAI CHAMBER OF COMMERCE",
"graduationDate": "May 2001"
}
]
}

This is mind-blowing: we got a JSON with the info on all of the person's education.

3. Information about the Skills of the Person

prompt →

skill_prompt_tpl="""From the Resume text below, extract Entities strictly as instructed below
1. Look for prominent Skill Entities in the text. The `id` property of each entity must be alphanumeric and must be unique among the entities. NEVER create new entity types that aren't mentioned below:
Entity Definition:
label:'Skill',id:string,name:string,level:string //Skill Node
2. NEVER Impute missing values
3. If you do not find any level information: assume it as `expert` if the experience in that skill is more than 5 years, `intermediate` for 2-5 years and `beginner` otherwise.
Example Output Format:
{{"entities": [{{"label":"Skill","id":"skill1","name":"Neo4j","level":"expert"}},{{"label":"Skill","id":"skill2","name":"Pytorch","level":"expert"}}]}}

Question: Now, extract entities as mentioned above for the text below -
{text}

Answer:
"""
from langchain.chains import LLMChain
import time

prompttemplate = PromptTemplate(template=skill_prompt_tpl, input_variables=['text'])
chain = LLMChain(llm=llm, prompt=prompttemplate)

t1 = time.time()
result = chain(text)
t2 = time.time()
print(t2 - t1)

output → about the Skills of the person:

{
  "entities": [
    {"label": "Skill", "id": "skill1", "name": "Java", "level": "expert"},
    {"label": "Skill", "id": "skill2", "name": "JSP", "level": "expert"},
    {"label": "Skill", "id": "skill3", "name": "JSF", "level": "expert"},
    {"label": "Skill", "id": "skill4", "name": "DB2", "level": "intermediate"},
    {"label": "Skill", "id": "skill5", "name": "Linux", "level": "expert"},
    {"label": "Skill", "id": "skill6", "name": "Windows", "level": "intermediate"},
    {"label": "Skill", "id": "skill7", "name": "SQL", "level": "expert"},
    {"label": "Skill", "id": "skill8", "name": "Oracle", "level": "intermediate"},
    {"label": "Skill", "id": "skill9", "name": "MySQL", "level": "intermediate"},
    {"label": "Skill", "id": "skill10", "name": "MongoDB", "level": "beginner"},
    {"label": "Skill", "id": "skill11", "name": "HTML5", "level": "expert"},
    {"label": "Skill", "id": "skill12", "name": "CSS3", "level": "expert"},
    {"label": "Skill", "id": "skill13", "name": "XML", "level": "expert"},
    {"label": "Skill", "id": "skill14", "name": "JSON", "level": "expert"},
    {"label": "Skill", "id": "skill15", "name": "JavaScript", "level": "expert"},
    {"label": "Skill", "id": "skill16", "name": "Node.js", "level": "expert"},
    {"label": "Skill", "id": "skill17", "name": "NPM", "level": "expert"},
    {"label": "Skill", "id": "skill18", "name": "GIT", "level": "expert"},
    {"label": "Skill", "id": "skill19", "name": "express.js", "level": "expert"},
    {"label": "Skill", "id": "skill20", "name": "jQuery", "level": "expert"},
    {"label": "Skill", "id": "skill21", "name": "Angular", "level": "expert"},
    {"label": "Skill", "id": "skill22", "name": "Bootstrap", "level": "expert"},
    {"label": "Skill", "id": "skill23", "name": "Restful API", "level": "expert"},
    {"label": "Skill", "id": "skill24", "name": "PHP", "level": "intermediate"},
    {"label": "Skill", "id": "skill25", "name": "SCRUM/AGILE", "level": "expert"},
    {"label": "Skill", "id": "skill26", "name": "WATERFALL methodologies", "level": "expert"}
  ]
}

There we go: we got all the skills of the person as JSON.
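The level values follow the years-of-experience rule baked into the skill prompt. The same heuristic could also be applied in plain Python as a post-processing fallback (a sketch of my own, not code from this post):

```python
def level_from_years(years: float) -> str:
    # Mirrors rule 3 of the skill prompt: more than 5 years -> expert,
    # 2-5 years -> intermediate, otherwise -> beginner.
    if years > 5:
        return "expert"
    if years >= 2:
        return "intermediate"
    return "beginner"

print(level_from_years(3))  # intermediate
```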

From here we can build the knowledge graph, like the one in the image shown at the top.
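A minimal sketch of that last step, assuming the three chains' outputs have already been parsed into Python dicts. The `HAS_EDUCATION`/`HAS_SKILL` relationship names are my own choice, not from this post:

```python
# Merge the per-prompt entity lists into nodes plus relationship triples.
person = {"label": "Person", "id": "developer1", "role": "Developer"}
educations = [{"label": "Education", "id": "education1", "degree": "MS in IT"}]
skills = [{"label": "Skill", "id": "skill1", "name": "Java", "level": "expert"}]

nodes = [person] + educations + skills
edges = ([(person["id"], "HAS_EDUCATION", e["id"]) for e in educations]
         + [(person["id"], "HAS_SKILL", s["id"]) for s in skills])

print(edges)  # [('developer1', 'HAS_EDUCATION', 'education1'), ('developer1', 'HAS_SKILL', 'skill1')]
```

These node and edge lists can then be loaded into a graph store such as Neo4j, or into networkx for visualization.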

Hope you find this useful.
