Knowledge Graphs, Completeness & Multi-Document Retrieval Benchmark
Knowledge Graphs are a sparse, directed representation of information. They are incredibly useful when you know what you want to capture or segregate, or when your documents follow some type of structure that can be relied on.
One large advantage of a structured representation is that you can collect information across multiple data sources and perform multi-document RAG with a guarantee that all the information you were looking for was completely and exhaustively retrieved. In other words, one key benefit of using Knowledge Graphs is optimizing for Completeness as a RAG metric.
Unlike vector databases, which represent information as embeddings, knowledge graphs explicitly model the relationships between entities, enabling more structured and contextualized retrieval.
We take the definition of completeness from Luxin Zhang of Worldline, quoted below:
“Completeness compares the generated response with the ground truth to measure if the generated response addresses all points in the ground truth. If this score is low, one should first check if IR recall is good enough, as it is the upper bound. Then, one can check the IR precision to see whether the LLM is misled by irrelevant information. Finally, one may need to investigate every test query in the ‘golden dataset’ and adjust prompts to increase the completeness of the generated response.”
We want to highlight how we retrieve complete answers from a graph. The extraction logic works as follows:
- First, based on the existing schema, triples, and question, the LLM determines the right triples and entities to extract, and extracts those entities and relationships in a structured manner.
- Secondly, we perform a Hybrid RAG step by doing a vector retrieval over the triples to identify any other triples we may have missed (a minimal sketch of this step follows below).
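To make the second step concrete, here is a minimal sketch of vector retrieval over triples, assuming the triples have been flattened into head/relation/tail strings and are held in a local list. The function names, the embedding model choice, and the flat-list setup are illustrative, not our production pipeline:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    # Batch-embed strings with OpenAI's embedding endpoint.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def retrieve_missed_triples(question: str, triples: list[dict],
                            already_selected: list[dict], top_k: int = 5) -> list[dict]:
    # Vector-retrieve triples similar to the question that the structured
    # extraction step did not already select.
    candidates = [t for t in triples if t not in already_selected]
    texts = [f'{t["head"]} {t["relation"]} {t["tail"]}' for t in candidates]
    triple_vecs = embed(texts)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and each candidate triple.
    sims = triple_vecs @ q_vec / (
        np.linalg.norm(triple_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    ranked = np.argsort(sims)[::-1][:top_k]
    return [candidates[i] for i in ranked]
```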
Let’s walk through an example. Here, we are interested in understanding the legal cases that Amazon has been involved with over the years. This involves retrieving specific types of information across multiple (5) documents, the Amazon 10-Ks from 2020 to 2024. In all scenarios (Vector & Graph), we only processed (embedded or extracted triples from) the Legal Proceedings section of each 10-K, not the entire document.
There are 2 ways we can attempt to solve this problem.
- Vector Retrieval
- Graph Retrieval
In Vector Retrieval, completeness will always be an issue, as it is never clear to the LLM when retrieval is complete. It would be difficult for the LLM to know whether the number of vector chunks that mention legal cases is 5, 10, or 15, and therefore whether to stop searching or continue. This is because the main task in vector retrieval is semantic similarity, not semantic completeness. There is a pathway involving confidence thresholds, query rephrasing, and agentic recursive retrieval that could theoretically work, but it seems fairly arbitrary. With Knowledge Graphs, complete and exhaustive extraction is simply a matter of selecting all the entities of the relevant type (“legal cases”) related to Amazon. With Chunk Linking, where the vector chunks are linked to the nodes, you can also perform exhaustive vector chunk retrieval (i.e. complete and exhaustive retrieval of all the relevant vector chunks, deterministically), as well as perform Hybrid RAG (i.e. retrieval and comparison or combination of answers from both Graph and Vector retrieval). A sketch of Chunk Linking follows below.
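As a hedged illustration of Chunk Linking, the sketch below assumes each “Legal Case” node stores the IDs of the vector chunks it was extracted from (the node_to_chunks mapping and the chunk IDs are hypothetical). Exhaustive chunk retrieval then becomes a deterministic dictionary look-up rather than a similarity search:

```python
# Hypothetical chunk-linking index: each "Legal Case" node stores the IDs of
# the vector chunks it was extracted from.
node_to_chunks: dict[str, list[str]] = {
    "Kove IO, Inc. v. Amazon Web Services": ["amzn-2020-chunk-41", "amzn-2021-chunk-07"],
    "Dialect, LLC v. Amazon": ["amzn-2023-chunk-12"],
}

def exhaustive_chunks(case_nodes: list[str]) -> list[str]:
    # Deterministic look-up: every linked chunk is returned, with no top-k
    # cutoff and no similarity threshold.
    return [chunk_id for node in case_nodes for chunk_id in node_to_chunks.get(node, [])]

# Retrieving the chunks for all legal-case nodes is exhaustive by construction.
all_chunks = exhaustive_chunks(list(node_to_chunks))
```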
In Graph Retrieval, because we have defined legal cases as part of the things we care about, a discrete number of legal cases are identified and pre-processed accordingly. As such, retrieval from the graph is simply a matter of retrieving from the pre-processed list of legal cases. We can sometimes do so even if the schema is not defined. This is because, uniquely in the process of KG creation, entity extraction is performed (sometimes not conforming to a predefined schema but driven by a prompt), which means that every entity you care about can be preprocessed and categorized.
The difference between Vector and Graph retrieval becomes even more apparent when we consider other limitations of vector retrieval, specifically context window limits and top-k retrieval. Let us assume for a moment that we attempt to retrieve every single chunk (across all documents) that contains every mention of every legal case that Amazon has been involved in, with a top-k of 16.
The number of relevant vector chunks will almost certainly exceed the top-k we set, especially if the information is spread across multiple pages and documents. In this scenario, we have little recourse except to hope that every single legal case mentioned happens to fall within the top 16 most relevant chunks. Given the prevalence of these types of questions (“List all of X”), we may decide to set a larger top-k, say 32. However, pushing against the upper limits of the context window with a larger top-k brings a higher risk of hallucinations, along with higher costs and latency.
Graphs are especially helpful when the search process spans multiple documents. Multiple documents mean that, by definition, the number of distinct, relevant vector chunks increases, which compounds the top-k vector retrieval problem described above. However, with a schema that is shared across multiple documents, we can store the legal cases mentioned across those documents as additional child nodes, simplifying the retrieval process. A sketch of this cross-document structure follows below.
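Here is a minimal sketch of what this cross-document structure might look like, assuming extracted triples in the format shown later in this post (the dictionary shape is illustrative, not the actual graph store):

```python
from collections import defaultdict

# Hypothetical merged graph: one "Amazon" parent node accumulates legal-case
# child nodes from every document's extracted triples.
cases_by_company: dict[str, set[str]] = defaultdict(set)

def add_document_triples(triples: list[dict]) -> None:
    # Each triple looks like the extraction output shown later in the post:
    # {"head": {..., "id": "Amazon"}, "relation": "INVOLVED_IN", "tail": {..., "id": <case>}}
    for t in triples:
        cases_by_company[t["head"]["id"]].add(t["tail"]["id"])

# After processing all five 10-Ks, retrieval is a single key look-up:
# cases_by_company["Amazon"] holds the complete, de-duplicated set of cases.
```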
Multi-Document Graph & Vector Benchmark Experiment
For the experiment, we chunked the texts into 2,000-character chunks and ran the following query with a top-k of 64, to accommodate the 32 cases we expect to find. We attach the raw Vector Retrieval answers in the Appendix.
import os
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from openai import OpenAI

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
openai_client = OpenAI()

# Retrieve embeddings using 1536 dimensions and 2000 character text size
index = pc.Index("demo")
embeddings = OpenAIEmbeddings()

query = "What lawsuits is Amazon dealing with?"
query_embedding = embeddings.embed_query(query)

query_response = index.query(
    top_k=64,
    vector=query_embedding,
    include_metadata=True,
)

# Concatenate the retrieved chunks into a single context string
concatenatedChunks = ""
for chunk in query_response.matches:
    concatenatedChunks += chunk.metadata["pageContent"] + " "

# Generate answer using chunks in response
prompt = f"""
Context:
You are a helpful chatbot. Your job is to answer the following question using only the context provided in the Context below:
Question: {query}
Context: {concatenatedChunks}
Answer:"""

response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=4000,
)
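The indexing side is not shown above. For completeness, here is a minimal sketch of how the 2,000-character chunks might have been embedded and upserted, reusing the `index` and `embeddings` objects from the snippet above (the chunking helper and metadata layout are assumptions, not the exact pipeline we ran):

```python
def chunk_text(text: str, size: int = 2000) -> list[str]:
    # Naive fixed-size chunking; the post only specifies the 2000-character size.
    return [text[i : i + size] for i in range(0, len(text), size)]

def index_document(doc_id: str, text: str) -> None:
    chunks = chunk_text(text)
    vectors = embeddings.embed_documents(chunks)  # 1536-dim OpenAI embeddings
    index.upsert(
        vectors=[
            {
                "id": f"{doc_id}-chunk-{i}",
                "values": vec,
                "metadata": {"pageContent": chunk},
            }
            for i, (chunk, vec) in enumerate(zip(chunks, vectors))
        ]
    )
```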
For the Graphs, we ran the following graph extraction logic.
import json

async def extract_triples(text, company_name):
    prompt = f"""
Using the content below, extract the companies involved and output it in the following JSON format:
[{{"head": {{"type": "Company", "id": "{company_name}"}}, "relation": "INVOLVED_IN", "tail": {{"type": "Legal Proceeding", "id": <legal proceeding name>}}}}, ...]
Do not include supporting information. Do not wrap the response in JSON markers. If there is no relevant information, just return an empty array.
Content:{text}
"""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt},
        ],
        max_tokens=4000,
        temperature=0.1,
    )
    # Parse the model's JSON output into a list of triples
    triples = json.loads(response.choices[0].message.content)
    return triples
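A hypothetical driver for this function might pool the triples from all five 10-Ks into one list (the file names below are illustrative placeholders):

```python
import asyncio

async def extract_all() -> list[dict]:
    sections = {}
    for year in range(2020, 2025):
        # Assumed layout: one pre-extracted Legal Proceedings section per 10-K.
        with open(f"amzn-10k-{year}-legal-proceedings.txt") as f:
            sections[year] = f.read()
    results = await asyncio.gather(
        *[extract_triples(text, "Amazon") for text in sections.values()]
    )
    # Flatten the per-document triple lists into a single pool.
    return [triple for doc_triples in results for triple in doc_triples]

all_triples = asyncio.run(extract_all())
```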
We then ran the following querying logic against the Graph, which takes a natural language query, retrieves the right entities and relations, and generates an answer.
# Fetch the schema's entity and relation types to ground the extraction
schema_id = whyhow_client.graphs.get(graph_id=graph.graph_id).schema_id
schema = whyhow_client.schemas.get(schema_id)
entities = [entity.name for entity in schema.entities]
relations = [relation.name for relation in schema.relations]

query = "What lawsuits is Amazon dealing with?"

# Using the graph entities and relations, extract relevant entities and relations from a question to build a structured query.
prompt = f"""
Perform entity and relation extraction on the question below using the list of entity types and relation types provided.
The output should be an object with three arrays:
"entity_types" which are the entity types detected in the question
"relation_types" which are the relation types detected in the question
"values" which are the relevant entity names detected in the question.
The output should look like this:
{{"entity_types": ["Person", "Place"], "relation_types": ["LIVES_IN"], "values": ["John Doe", "New York"]}}
Do not include supporting information. Do not wrap the response in JSON markers. If there is no relevant information, just return an empty array.
Question:{query}
Entities:{entities}
Relations:{relations}
"""
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt},
    ],
    max_tokens=4000,
    temperature=0.1,
)
structured_query = json.loads(response.choices[0].message.content)

# Run an unstructured query
unstructured_query_response = whyhow_client.graphs.query_unstructured(query=query, graph_id=graph.graph_id)

# Run a structured query using entities and relations extracted from the question, then generate a response by combining the result of the structured and unstructured queries
query_response = whyhow_client.graphs.query_structured(
    entities=structured_query["entity_types"],
    relations=structured_query["relation_types"],
    values=structured_query["values"],
    graph_id=graph.graph_id,
)

# Generate a response to the question using the structured output
prompt = f"""
Using the supporting information from the natural language response and the structured triples below, provide a detailed answer to the question below:
Question: {query}
Natural Language Response: {unstructured_query_response.answer}
Structured Triples: {query_response}
"""
answer = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": prompt},
    ],
    max_tokens=4000,
    temperature=0.1,
)
Now, onto the results we obtained. The list of 32 cases below represents our Golden Dataset of legal proceedings mentioned against Amazon in the Amazon 10-Ks. We calculate a completeness score based on how many of these 32 cases were retrieved; if only 10 cases were retrieved, that is a completeness score of 10/32 ≈ 31%. For the tests with 10 documents, we also added Walmart 10-K documents to reflect that a realistic data processing step may include a range of unrelated documents.
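The scoring itself is a simple set intersection; here is a minimal sketch (the golden set below is truncated for illustration):

```python
# Completeness = fraction of the golden-dataset cases present in the answer.
golden_cases = {
    "Kove IO, Inc. v. Amazon Web Services",
    "Dialect, LLC v. Amazon",
    # ...the remaining cases in the 32-case golden dataset
}

def completeness(retrieved_cases: set[str]) -> float:
    return len(retrieved_cases & golden_cases) / len(golden_cases)

# e.g. retrieving 10 of the 32 golden cases scores 10/32 ≈ 31%.
```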
The graphs we used are shown below.
The results were as follows:
- Green represents a correctly identified case
- Red represents an incorrect and irrelevant case
- White represents an accurate summary of the facts but not a specific case, and does not contribute to the accuracy score
In this experiment, we discovered the following:
- Vector Retrieval’s accuracy goes down for multi-document retrieval.
- Graph Retrieval was able to consistently retrieve a complete set of legal cases (100%), while Vector Retrieval had a 31–41% success rate.
An interesting issue we found was that vector retrieval accuracy decreased as more documents were added to the knowledge base (41% vs. 31%). Although we expected the accuracy percentage to drop as more documents were added, since we were pushing against the top-k limit, the accuracy in terms of the absolute number of cases returned also dropped. We suspect this was a result of more vector chunks decreasing the accuracy of retrieval and overwhelming the context window.
It can also be observed that the LLM had a higher tendency to give summarized rather than specific facts the more documents it was allowed to retrieve from. This can be seen in the number of White cells that were returned (‘Antitrust and Consumer Protection Lawsuits’ and ‘General Employment and Consumer Litigations’ are two such examples for Vector Retrieval (10 Docs)). We were concerned that the phrasing of the initial query (“What lawsuits is Amazon dealing with?”) might bias the experiment in our favor, and so ran the following rephrased query (“List all legal cases that Amazon is facing”) against the Vector store of 10 documents. Here, we found that hallucinations were returned, with only a slight improvement in the retrieved result.
In conclusion, we were able to show consistently complete retrieval of information across multiple documents, with a high completeness score, in a way that vector-chunk-only search struggles to achieve.
Another way to think about what is happening here is that knowledge graphs help turn a Search process into a more deterministic Look-Up process. A Search process is what you use when you are not sure what you are looking for. A Look-Up process retrieves from a structured knowledge base, and is more deterministic and straightforward. Knowledge graphs are not about imposing some arbitrary structure, but about imposing a structure that aids the retrieval process. LLMs are great at helping us create structures around our information to turn Search processes into Look-Up processes. Knowledge Graphs just happen to be one good way to store schema-less data structures that grow as we stack additional structured and unstructured data sources on top of them.
WhyHow.AI’s Knowledge Graph Studio Platform (currently in Beta) is the easiest way to build modular, agentic Knowledge Graphs, combining workflows from LLMs, developers and non-technical domain experts.
If you’re thinking about, in the process of, or have already incorporated knowledge graphs in RAG for accuracy, memory and determinism, we’d love to chat at team@whyhow.ai, or follow our newsletter at WhyHow.AI. Join our discussions about rules, determinism and knowledge graphs in RAG on our Discord.
Appendix
Github Repo — https://whyhow-ai.github.io/whyhow-sdk-docs/examples/query_completeness/
Vector RAG (10 Docs):
- Ten 10-K documents (5 AMZN & 5 WMT), ~1,600 pages
- Top_k = 64
- Question: “What lawsuits is Amazon dealing with?”
Amazon is facing several lawsuits, including:
1. **Class Action Cases (U.S. and Canada)**:
- Allegations of price fixing arrangements between Amazon and third-party sellers.
- Monopolization and attempted monopolization claims.
- Consumer protection and unjust enrichment claims.
- Complaints seeking billions in damages, treble damages, punitive damages, injunctive relief, civil penalties, attorneys' fees, and costs.
2. **Federal Trade Commission and State Attorneys General Lawsuit (September 2023)**:
- Allegations of violations of federal and state antitrust and consumer protection laws.
- Claims that Amazon has a monopoly in markets for online superstores and marketplace services, maintained through anticompetitive practices.
3. **Patent Infringement Lawsuits**:
- **BroadbandiTV, Inc. (October 2020)**: Allegations against Amazon Prime Video features and services.
- **Rensselaer Polytechnic Institute (May 2018)**: Allegations against "Alexa Voice Software and Alexa enabled devices."
- **Kove IO, Inc. (December 2018)**: Allegations involving Amazon S3 and DynamoDB.
- **Acceleration Bay, LLC (July 2022)**: Allegations against multiple AWS and Amazon services.
- **LightGuide, Inc. (November 2022)**: Allegations against Amazon's Nike Intent Detection System.
- **Dialect, LLC (May 2023)**: Allegations against Alexa-enabled products and services.
- **Vocalife LLC (April 2019)**: Allegations involving Amazon Echo devices.
4. **Biometric Data Collection**:
- Multiple cases in Illinois (beginning June 2019) alleging violations of the Illinois Biometric Information Privacy Act.
- Allegations related to Amazon Photos, Alexa, AWS, Ring, Amazon Connect, Flex driver app, and virtual try-on technology.
5. **Regulatory Actions**:
- **European Commission (November 2020)**: Allegations regarding the use of marketplace sellers' data.
- **Luxembourg National Commission for Data Protection (July 2021)**: Allegations regarding compliance with EU General Data Protection Regulation.
- **Italian Competition Authority (December 2021)**: Allegations related to marketplace and logistics practices.
6. **Opioids Litigation**:
- Multiple cases involving opioid dispensing and distribution practices.
- Related subpoenas and investigations from governmental entities in the U.S.
7. **Wage and Hour Lawsuits**:
- Class-action allegations related to federal and state wage and hour laws.
- Claims involving unpaid time for security screenings and other labor-related disputes.
Additionally, Amazon is involved in various other legal proceedings, regulatory investigations, and claims related to patent infringement, competition and antitrust issues, privacy and data protection, consumer protection, and commercial disputes.
Vector RAG (5 Docs):
- Five 10-K documents (5 AMZN), ~550 pages
- Top_k = 64
- Question: “What lawsuits is Amazon dealing with?”
Amazon is facing multiple lawsuits across various domains, including:
1. **Biometric Data**: Several cases in U.S. federal and state courts, including a class action under the Illinois Biometric Information Privacy Act, involving Amazon Photos, Alexa, AWS cloud services, Ring, Amazon Connect, Amazon’s Flex driver app, and Amazon’s virtual try-on technology.
2. **Price Fixing and Antitrust**: Starting with Frame-Wilson v. Amazon.com, Inc., there have been numerous cases in the U.S. and Canada alleging price fixing, monopolization, consumer protection, and unjust enrichment. Similar suits were brought by the Attorneys General for the District of Columbia and California, and the Federal Trade Commission, along with several state Attorneys General.
3. **Patent Infringement**:
- *BroadbandITV, Inc. v. Amazon*: Allegations around Amazon Prime Video infringing patents.
- *Rensselaer Polytechnic Institute and CF Dynamic Advances LLC v. Amazon*: Accusations regarding Alexa Voice Software and Alexa-enabled devices infringing on patents.
- *Kove IO, Inc. v. Amazon Web Services*: Infringement claims related to Amazon S3 and DynamoDB.
- *Acceleration Bay, LLC v. Amazon Web Services*: Infringement accusations concerning multiple AWS services and Amazon devices.
- *LightGuide, Inc. v. Amazon.com*: Claims about Amazon's Nike Intent Detection System.
- *Dialect, LLC v. Amazon*: Allegations about Amazon’s Alexa-enabled products infringing patents.
- *Nokia Technologies Oy v. Amazon*: Infringement claims involving Prime Video and other Amazon technologies across multiple countries.
4. **Regulatory and Privacy**:
- *European Commission*: Statement of Objections for competition rule infringements.
- *Luxembourg National Commission for Data Protection*: Decision over GDPR compliance.
- *Italian Competition Authority*: Decision claiming marketplace and logistics practices infringe EU competition rules.
5. **Miscellaneous**: Various other claims include issues related to taxes, labor and employment, consumer protection, and data protection.
Amazon disputes the allegations of wrongdoing and intends to defend itself vigorously in these matters.