Microsoft GraphRAG: Redefining AI-Based Content Interpretation and Search | Local Search | PART 2

JingleMind.Dev
Jul 26, 2024


Microsoft GraphRAG is a powerful tool designed to facilitate advanced data querying and indexing by leveraging knowledge graphs. A knowledge graph is a structured representation of data in which entities and their relationships are captured in graph form. The local search method provided by GraphRAG is particularly useful for extracting and combining information about specific entities mentioned in documents.

Key Features of GraphRAG Local Search:

  1. Entity Extraction: Automatically identifies and extracts key entities from text documents, enabling more precise and contextually accurate searches.
  2. Combination of Data: Merges data from both structured knowledge graphs and unstructured text chunks, providing comprehensive answers to complex queries.
  3. Scalability: Capable of handling large volumes of text and making them searchable through efficient indexing mechanisms using Apache Parquet, a columnar storage file format optimized for performance.
  4. Integration: Can be integrated into various data workflows, enhancing the capability to perform sophisticated local and global searches within enterprise environments.

Implementation Steps

To implement the local search method using Microsoft GraphRAG, follow the steps below:

Prerequisites

Install Required Packages: GraphRAG requires specific Python packages. Install them with:

pip install graphrag python-dotenv
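
The search script later in this article reads GRAPHRAG_API_KEY, GRAPHRAG_LLM_MODEL, and GRAPHRAG_EMBEDDING_MODEL from the environment, which is why python-dotenv is installed. One way to provide them is a .env file in the project root (the model names below are placeholders; substitute whichever OpenAI models you have access to):

# .env
GRAPHRAG_API_KEY=<your OpenAI API key>
GRAPHRAG_LLM_MODEL=gpt-4o-mini
GRAPHRAG_EMBEDDING_MODEL=text-embedding-3-small

Load it at the top of your script with:

from dotenv import load_dotenv

load_dotenv()  # makes the GRAPHRAG_* variables available via os.environ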

Prepare Input Files: Place your input text or CSV files in a designated input folder. These files are the source of data that will be indexed and queried.
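
For example, assuming your source documents are plain-text files (the folder name input matches GraphRAG's default configuration):

mkdir -p ./input
cp /path/to/your/docs/*.txt ./input/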

Initialize the Graphrag Folder:

  • To set up the necessary folder structure for GraphRAG, run:
python -m graphrag.index --init --root .
  • This initializes the current directory (., the root) for GraphRAG operations. After initialization you will see two new folders locally: output and prompts. The prompt files in prompts are used during index generation.

Create the Index:

  • After initialization, create the index for your input files by running:
python -m graphrag.index --root .
  • This command will read the files from the input directory, process them, and generate indexed data stored in parquet files within the output/artifacts directory.
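
A quick way to confirm the run succeeded is to list the generated Parquet tables; a small sketch (the exact set of artifact names can vary between GraphRAG versions):

import glob

# Index runs are written to timestamped folders, e.g. ./output/20240726-162400/artifacts
for path in sorted(glob.glob("./output/*/artifacts/*.parquet")):
    print(path)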

Local Search:

Now we will use the indexed data to answer queries. We will load the generated entities, embeddings, relationships, and community reports and use them in a GraphRAG local search. The implementation steps for local search are:

Loading Data:

  • The script reads various data tables (entities, relationships, community reports, text units) from parquet files and loads them into pandas DataFrames.

Setting up Context and Models:

  • It sets up the necessary API keys and models for ChatOpenAI and embeddings.
  • The data from DataFrames are processed and stored in a LanceDB vector store.

Local Search Execution:

  • A LocalSearchMixedContext is created to combine and manage context data for the search.
  • The LocalSearch engine is configured with parameters to control its behavior and response format.

Running Queries:

  • The script executes sample queries to demonstrate how the local search retrieves and combines data to generate responses.

Inspecting Context and Generating Questions:

  • It inspects the context data used to generate the responses and sets up a question generator for follow-up queries (a sketch of this step appears after the search example below).

import os

import pandas as pd
import tiktoken

from graphrag.query.context_builder.entity_extraction import EntityVectorStoreKey
from graphrag.query.indexer_adapters import (
    read_indexer_covariates,
    read_indexer_entities,
    read_indexer_relationships,
    read_indexer_reports,
    read_indexer_text_units,
)
from graphrag.query.input.loaders.dfs import (
    store_entity_semantic_embeddings,
)
from graphrag.query.llm.oai.chat_openai import ChatOpenAI
from graphrag.query.llm.oai.embedding import OpenAIEmbedding
from graphrag.query.llm.oai.typing import OpenaiApiType
from graphrag.query.question_gen.local_gen import LocalQuestionGen
from graphrag.query.structured_search.local_search.mixed_context import (
    LocalSearchMixedContext,
)
from graphrag.query.structured_search.local_search.search import LocalSearch
from graphrag.vector_stores.lancedb import LanceDBVectorStore

INPUT_DIR = "./output/20240726-162400/artifacts"
LANCEDB_URI = f"{INPUT_DIR}/lancedb"

COMMUNITY_REPORT_TABLE = "create_final_community_reports"
ENTITY_TABLE = "create_final_nodes"
ENTITY_EMBEDDING_TABLE = "create_final_entities"
RELATIONSHIP_TABLE = "create_final_relationships"
COVARIATE_TABLE = "create_final_covariates"
TEXT_UNIT_TABLE = "create_final_text_units"
COMMUNITY_LEVEL = 2

entity_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_TABLE}.parquet")
entity_embedding_df = pd.read_parquet(f"{INPUT_DIR}/{ENTITY_EMBEDDING_TABLE}.parquet")

entities = read_indexer_entities(entity_df, entity_embedding_df, COMMUNITY_LEVEL)

# Load description embeddings into an in-memory LanceDB vector store.
# To connect to a remote database, specify url and port values.
description_embedding_store = LanceDBVectorStore(
    collection_name="entity_description_embeddings",
)
description_embedding_store.connect(db_uri=LANCEDB_URI)
entity_description_embeddings = store_entity_semantic_embeddings(
    entities=entities, vectorstore=description_embedding_store
)

relationship_df = pd.read_parquet(f"{INPUT_DIR}/{RELATIONSHIP_TABLE}.parquet")
relationships = read_indexer_relationships(relationship_df)

report_df = pd.read_parquet(f"{INPUT_DIR}/{COMMUNITY_REPORT_TABLE}.parquet")
reports = read_indexer_reports(report_df, entity_df, COMMUNITY_LEVEL)

text_unit_df = pd.read_parquet(f"{INPUT_DIR}/{TEXT_UNIT_TABLE}.parquet")
text_units = read_indexer_text_units(text_unit_df)
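
Note that read_indexer_covariates is imported and COVARIATE_TABLE is defined above but never loaded. Covariates (claims) only exist if claim extraction was enabled during indexing; if your run produced create_final_covariates.parquet, here is a sketch of loading them with the same adapter pattern and passing the result to the context builder through its covariates argument:

covariate_df = pd.read_parquet(f"{INPUT_DIR}/{COVARIATE_TABLE}.parquet")
claims = read_indexer_covariates(covariate_df)
covariates = {"claims": claims}  # pass covariates=covariates to LocalSearchMixedContext below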

We will now initialize the LLM and the LocalSearchMixedContext so we can run queries.

api_key = os.environ["GRAPHRAG_API_KEY"]
llm_model = os.environ["GRAPHRAG_LLM_MODEL"]
embedding_model = os.environ["GRAPHRAG_EMBEDDING_MODEL"]

llm = ChatOpenAI(
    api_key=api_key,
    model=llm_model,
    api_type=OpenaiApiType.OpenAI,  # OpenaiApiType.OpenAI or OpenaiApiType.AzureOpenAI
    max_retries=20,
)

token_encoder = tiktoken.get_encoding("cl100k_base")

text_embedder = OpenAIEmbedding(
    api_key=api_key,
    api_base=None,
    api_type=OpenaiApiType.OpenAI,
    model=embedding_model,
    deployment_name=embedding_model,
    max_retries=20,
)

context_builder = LocalSearchMixedContext(
    community_reports=reports,
    text_units=text_units,
    entities=entities,
    relationships=relationships,
    entity_text_embeddings=description_embedding_store,
    embedding_vectorstore_key=EntityVectorStoreKey.ID,  # if the vector store uses entity titles as ids, set this to EntityVectorStoreKey.TITLE
    text_embedder=text_embedder,
    token_encoder=token_encoder,
)

local_context_params = {
    "text_unit_prop": 0.5,
    "community_prop": 0.1,
    "conversation_history_max_turns": 5,
    "conversation_history_user_turns_only": True,
    "top_k_mapped_entities": 10,
    "top_k_relationships": 10,
    "include_entity_rank": True,
    "include_relationship_weight": True,
    "include_community_rank": False,
    "return_candidate_context": False,
    "embedding_vectorstore_key": EntityVectorStoreKey.ID,  # set this to EntityVectorStoreKey.TITLE if the vector store uses entity titles as ids
    "max_tokens": 12_000,  # adjust based on your model's token limit (for a model with an 8k limit, a good setting could be 5000)
}

llm_params = {
    "max_tokens": 2_000,  # adjust based on your model's token limit (for a model with an 8k limit, a good setting could be 1000-1500)
    "temperature": 0.0,
}

search_engine = LocalSearch(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
    response_type="multiple paragraphs",  # free-form text describing the response type and format, e.g. prioritized list, single paragraph, multiple paragraphs, multiple-page report
)

# asearch is a coroutine: run it inside an async function or a notebook cell,
# or wrap it as asyncio.run(search_engine.asearch("Your question here")).
result = await search_engine.asearch("Your question here")  # replace with your own query
print(result.response)
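
Finally, to cover the inspection and question-generation step described earlier: result.context_data holds the tables of records that were packed into the LLM's context window, and the imported LocalQuestionGen can propose follow-up questions from the conversation history. A sketch reusing the objects configured above:

# Inspect which entities were placed in the context window for the last query.
print(result.context_data["entities"].head())

question_generator = LocalQuestionGen(
    llm=llm,
    context_builder=context_builder,
    token_encoder=token_encoder,
    llm_params=llm_params,
    context_builder_params=local_context_params,
)

candidate_questions = await question_generator.agenerate(
    question_history=["Your question here"],  # previous user queries
    context_data=None,  # None = rebuild context from the question history
    question_count=5,
)
print(candidate_questions.response)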

Implementing Microsoft’s GraphRAG local search method as demonstrated provides a powerful way to combine structured data from AI-extracted knowledge graphs with unstructured text from the raw documents. This holistic approach is well suited to queries that require a deep understanding of specific entities within the documents.

Happy coding! :)
