Building a Human Resource GraphRAG application
How to quickly scan through hundreds of resumes
I’ve been on a mission to find the perfect talent for our team. While it’s an exciting journey, it can get pretty exhausting going through hundreds of resumes to find the right fit. In this blog post, I’ll share how I created a GraphRAG (graph-based Retrieval-Augmented Generation) application to make my search a lot easier.
Important: I got the candidates’ permission to use AI to analyze their data, with the promise that I wouldn’t share their information with anyone else. Only aggregate counts of responses are shared here.
Architecture
Each candidate was asked to fill out a form and upload their CV. A knowledge graph and vector embeddings were then constructed using GraphRAG, Microsoft’s open-source library. As with every retriever, the retrieved information was sent to an LLM to produce a complete answer to the query.
The forms and CVs were extracted as text using PDF parsers. For a detailed explanation of how to do so, please check this out:
Building the Knowledge Graph
Provided you have loaded and extracted the text, the first step is building the knowledge graph. Before building, you need to install all the necessary packages and complete the prerequisites:
Start by initializing the GraphRAG index in your project folder like this:
python -m graphrag.index [--init] [--root PATH]
The project will contain a file called `settings.yaml`, which describes how the system will index the documents into a knowledge graph + vector store.
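After initialization, the project folder should look roughly like this (layout from my run; your version of GraphRAG may differ slightly):

```
project/
├── .env             # API keys
├── settings.yaml    # indexing configuration
└── prompts/         # default prompt files you can edit
```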
Here is a short description of what some of these settings do:
Chunks
- size (int): The maximum chunk size in tokens. The larger the chunk, the more information in each document.
- overlap (int): The chunk overlap in tokens. More overlap allows context to mix between documents.
Entity extraction
- prompt (str): The prompt file to use. I will showcase how I changed this prompt.
- entity_types (list[str]): The entity types to identify. Instead of the default I used: [organization, person, programming language, technical skill].
- max_gleanings (int): The maximum number of gleaning cycles, i.e. how many extra passes the LLM makes over a chunk to pick up entities it missed the first time.
Summarize descriptions
- prompt (str): The prompt file to use.
- max_length (int): The maximum number of output tokens per summarization. Longer gives more detailed descriptions but will cost more!
Claim extraction
- enabled (bool): Whether to enable claim extraction. I set it to true; a claim is “any claims or facts that could be relevant to information discovery”.
Community Reports
- prompt (str): The prompt file to use.
- max_length (int): The maximum number of output tokens per report. I set this to 2000 tokens.
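Putting those together, a trimmed `settings.yaml` excerpt might look like this. The key names follow the GraphRAG defaults and the chunk values are the library’s defaults, so double-check against your generated file:

```yaml
chunks:
  size: 1200
  overlap: 100

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization, person, programming language, technical skill]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: true

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
```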
Changes made to some prompt files:
Entity Extraction:
-Goal-
Given a resume text document and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, capitalized
- entity_type: One of the following types: [person, organization, skillset, programming language, tool]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_strength>)
3. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
4. When finished, output {completion_delimiter}
######################
-Examples-
######################
Example 1:
Entity_types: [person, organization, skillset, programming language, tool]
Text:
John Doe is a software engineer at ABC Corp. He is proficient in Python, Java, and has experience with tools like Docker and Kubernetes. His skillset includes software development, system design, and DevOps practices.
################
Output:
("entity"{tuple_delimiter}"John Doe"{tuple_delimiter}"person"{tuple_delimiter}"John Doe is a software engineer with experience at ABC Corp."){record_delimiter}
("entity"{tuple_delimiter}"ABC Corp"{tuple_delimiter}"organization"{tuple_delimiter}"ABC Corp is an organization where John Doe is employed."){record_delimiter}
("entity"{tuple_delimiter}"Python"{tuple_delimiter}"programming language"{tuple_delimiter}"Python is a programming language that John Doe is proficient in."){record_delimiter}
("entity"{tuple_delimiter}"Java"{tuple_delimiter}"programming language"{tuple_delimiter}"Java is a programming language that John Doe is proficient in."){record_delimiter}
("entity"{tuple_delimiter}"Docker"{tuple_delimiter}"tool"{tuple_delimiter}"Docker is a tool that John Doe has experience with."){record_delimiter}
("entity"{tuple_delimiter}"Kubernetes"{tuple_delimiter}"tool"{tuple_delimiter}"Kubernetes is a tool that John Doe has experience with."){record_delimiter}
("entity"{tuple_delimiter}"software development"{tuple_delimiter}"skillset"{tuple_delimiter}"Software development is a skill that John Doe possesses."){record_delimiter}
("entity"{tuple_delimiter}"system design"{tuple_delimiter}"skillset"{tuple_delimiter}"System design is a skill that John Doe possesses."){record_delimiter}
("entity"{tuple_delimiter}"DevOps practices"{tuple_delimiter}"skillset"{tuple_delimiter}"DevOps practices are part of John Doe's skillset."){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"ABC Corp"{tuple_delimiter}"John Doe works at ABC Corp."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"Python"{tuple_delimiter}"John Doe is proficient in Python."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"Java"{tuple_delimiter}"John Doe is proficient in Java."{tuple_delimiter}9){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"Docker"{tuple_delimiter}"John Doe has experience with Docker."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"Kubernetes"{tuple_delimiter}"John Doe has experience with Kubernetes."{tuple_delimiter}8){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"software development"{tuple_delimiter}"John Doe possesses skills in software development."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"system design"{tuple_delimiter}"John Doe possesses skills in system design."{tuple_delimiter}10){record_delimiter}
("relationship"{tuple_delimiter}"John Doe"{tuple_delimiter}"DevOps practices"{tuple_delimiter}"John Doe possesses skills in DevOps practices."{tuple_delimiter}10){completion_delimiter}
#############################
-Real Data-
######################
Entity_types: [person, organization, skillset, programming language, tool]
Text: {input_text}
######################
Output:
Start Building
You can use the terminal command below to build the GraphRAG index, but be mindful of the settings. Choose an appropriate LLM to avoid high costs, and check the requests-per-minute and other rate limits to prevent errors.
python -m graphrag.index --root PATH
Once indexing is done, it is time to load the data in Python.
# Loaders for the indexes created while ingesting
import os

import pandas as pd
import tiktoken

from graphrag.query.indexer_adapters import (
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
# OpenAI wrappers shipped with graphrag
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
# LanceDB is used as the vector store; it runs locally
from graphrag.vector_stores.lancedb import LanceDBVectorStore
from graphrag.query.input.loaders.dfs import store_entity_semantic_embeddings
api_key = os.environ["OPENAI_API_KEY"]
llm_model = "gpt-4o-mini"
llm = ChatOpenAI(
api_key=api_key,
model=llm_model,
api_type=OpenaiApiType.OpenAI,
max_retries=20,
)
embedding_model = "text-embedding-3-small"
token_encoder = tiktoken.get_encoding("cl100k_base")
text_embedder = OpenAIEmbedding(api_key=api_key, api_type=OpenaiApiType.OpenAI, model=embedding_model, max_retries=20)
INPUT_DIR = "<insert project directory>"
#Names for each of the indexes (set by default)
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COMMUNITY_REPORT_TABLE = "create_final_community_reports"
TEXT_UNIT_TABLE = "create_final_text_units"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"
# COMMUNITY_LEVEL controls how fine-grained the communities you load are.
# It acts as a hyperparameter: the higher the level, the smaller and more
# numerous the groups within the knowledge graph (detailed explanation below).
# 0 is the default (coarsest level).
COMMUNITY_LEVEL = 1
# creating dataframes
entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")
relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
# Building reports, entities, relationships and text_units from the
# indexes
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)
entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)
relationships = read_indexer_relationships(relationship_df)
text_units = read_indexer_text_units(text_unit_df)
# connecting to LanceDB
description_embedding_store = LanceDBVectorStore(collection_name="entity_description_embeddings")
description_embedding_store.connect(db_uri=LANCEDB_URI)
store_entity_semantic_embeddings(entities=entities, vectorstore=description_embedding_store)
What is a Community?
In the context of a knowledge graph, a “community” typically refers to a group of related entities that are interconnected through various relationships or attributes. These entities could be people, organizations, concepts, or any other type of data object. The community concept helps in understanding and analyzing clusters of entities that share common attributes or are linked through specific kinds of relationships.
- Social Networks: In a knowledge graph representing a social network, a community might consist of individuals who frequently interact with each other, share common interests, or are involved in similar activities.
- Corporate Structures: In a business context, a community could be a group of employees, departments, or teams that collaborate on specific projects or share similar functions within the organization.
- Academic Research: For academic research, a community might include researchers working in the same field, co-authoring papers, or participating in related conferences.
Communities help to identify patterns, trends, and insights by grouping related entities and examining their collective behavior or characteristics.
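To make the community-level idea concrete, here is a toy sketch of how a level parameter selects coarser or finer groupings. The dataframe and helper are invented for illustration; the real filtering lives inside GraphRAG’s `read_indexer_reports`:

```python
import pandas as pd

# Invented example: communities from a hierarchical clustering, where
# level 0 is the coarsest partition and level 1 splits those groups further.
reports = pd.DataFrame({
    "community": [0, 1, 2, 3, 4],
    "level":     [0, 0, 1, 1, 1],
    "title": ["Engineering", "Data", "Backend team", "ML team", "Analytics team"],
})

def communities_at(df: pd.DataFrame, community_level: int) -> pd.DataFrame:
    """Keep communities at or below the requested hierarchy level."""
    return df[df["level"] <= community_level]

print(len(communities_at(reports, 0)))  # 2 coarse groups
print(len(communities_at(reports, 1)))  # 5 groups, including finer sub-teams
```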
Performing Search
In the original research paper and subsequent blog posts, Microsoft states that there are two search types: local search and global search. Here is what they mean and how to perform them in Python.
Local Search
The local search method integrates structured data from the knowledge graph with unstructured data from input documents to enhance the LLM’s context with relevant entity information during a query.
This approach is particularly effective for answering questions that involve detailed knowledge about specific entities mentioned in the input documents (e.g., “Tell me about this candidate {candidate_name}?”).
The stress is on specific entities: queries like “How many people have applied? What are the most common degrees?” will not get a good answer from local search; global search is more useful for those.
from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.structured_search.local_search.mixed_context import LocalSearchMixedContext
from graphrag.query.structured_search.local_search.search import LocalSearch

# Bundles the reports, text units, entities, relationships and embeddings
# into a predefined context-builder abstraction.
context_builder_local = LocalSearchMixedContext(
community_reports=reports,
text_units=text_units,
entities=entities,
relationships=relationships,
entity_text_embeddings=description_embedding_store,
embedding_vectorstore_key=EntityVectorStoreKey.ID,
text_embedder=text_embedder,
token_encoder=token_encoder,
)
# Sets retrieval proportions for each of the units.
# A higher community_prop retrieves more community-based context.
# top_k entities/relationships: the more you retrieve, the more types of
# entities/relationships are given to the LLM, but you may run into
# context-window/token-cost issues.
# Conversation history can optionally be included in the context as well.
local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "max_tokens": 12_000,
}
llm_params = {
    "max_tokens": 2_000,
    "temperature": 0.0,
}
local_search_engine = LocalSearch(
llm=llm,
context_builder=context_builder_local,
token_encoder=token_encoder,
llm_params=llm_params,
context_builder_params=local_context_params,
response_type="multiple paragraphs",
)
query = "Tell me about Candidate X? What are their skills? Do they have AI, RAG experience?"
response = local_search_engine.search(query)
print(response.response)
Global Search
With GraphRAG, we can answer big-picture questions about our data. The knowledge graph created by the LLM shows the structure and main themes in the dataset. This organizes the data into clear groups that are already summarized. Using these summaries, you can get a complete view of the relationships, entities, or the entire dataset.
However, specific queries targeting individual candidates may not yield good results with a global search.
# Imports necessary to do Global-Search
from graphrag.query.structured_search.global_search.community_context import GlobalCommunityContext
from graphrag.query.structured_search.global_search.search import GlobalSearch
# context builder for global search
context_builder = GlobalCommunityContext(
community_reports=reports,
entities=entities,
token_encoder=token_encoder,
)
# Parameters and what they do
context_builder_params = {
    "use_community_summary": False,  # use the short community summaries or the full community reports
    "shuffle_data": True,            # randomize the order of the context data
    "include_community_rank": True,  # include the ranking in the retrieved documents
    "min_community_rank": 1,         # minimum rank a community needs to be included in the context
    "max_tokens": 7_000,             # maximum context tokens
    "context_name": "Reports",
}
search_engine = GlobalSearch(
llm=llm,
context_builder=context_builder,
token_encoder=token_encoder,
max_data_tokens=12_000,
allow_general_knowledge=False, # whether the LLM may also answer from its training data or only from the context
json_mode=True,
context_builder_params=context_builder_params,
concurrent_coroutines=10, # Amount of concurrency
response_type="multiple-page report", # Free form text e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)
query = "what universities have candidates studied at? Can you give information on what they studied?"
glob = search_engine.search(query)
print(glob.response)
You can inspect the retrieved context data using `glob.context_data`.
For my situation, local search turned out to be the better fit, because I needed summaries for each candidate, institution, or previous employer. Like many RAG applications, getting the best results involves quite a bit of “engineering” to build the right pipeline for your needs.
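That choice between the two modes can even be automated. Here is a minimal routing heuristic of my own (not part of GraphRAG) that sends aggregate questions to global search and entity-specific ones to local search:

```python
# My own keyword heuristic for picking a search mode; in practice you
# might ask a cheap LLM to classify the query instead.
AGGREGATE_HINTS = ("how many", "most common", "overall", "across all", "universities")

def choose_search(query: str) -> str:
    """Return 'global' for big-picture questions, 'local' otherwise."""
    q = query.lower()
    return "global" if any(hint in q for hint in AGGREGATE_HINTS) else "local"

print(choose_search("How many people have applied?"))       # global
print(choose_search("Tell me about Candidate X's skills"))  # local
```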
Thanks for reading! Follow me and Firebird Technologies to stay updated on our latest projects and what we’re working on next!