Unlocking the Power of Text: Knowledge Graph Extraction and RAG Integration

Omar Abdrabou
Data Reply IT | DataTech
10 min read · Jul 15, 2024

Introduction

In today’s technological landscape, extracting and utilizing knowledge from unstructured data has become essential for numerous advanced applications. Imagine a system that can find the exact information you need and weave it into a perfectly tailored response. This is the concept behind Retrieval-Augmented Generation (RAG).

Now imagine transforming raw text into a network of interconnected knowledge, where information is not simply stored but understood and leveraged. These structured representations, known as knowledge graphs, turn chaotic data into a goldmine of relationships and context.

In this article, we will dive into the process of converting text into knowledge graphs and demonstrate how to harness their potential in RAG applications. We’ll explore various techniques, provide hands-on Python examples, and discuss the benefits and challenges of these innovative technologies.

The Basics of RAG and Knowledge Graphs

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) enhances the accuracy and reliability of generative AI models by incorporating facts from external sources. This technique allows users to interact effectively with data repositories, creating novel user experiences.

When a user poses a question to the large language model (LLM), it first generates an embedding vector representing the query. This vector is then compared against a knowledge base to retrieve the information most relevant to the question, contextualizing the response that will be returned to the user (Lewis et al., 2020).

High-level logic workflow in RAG application
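
To make this retrieval flow concrete, here is a minimal sketch in Python. The embed function is a toy stand-in for a real embedding model (such as an OpenAI or sentence-transformers encoder), and the knowledge base is reduced to a plain in-memory list:

import numpy as np

# Toy stand-in for a real embedding model; it only illustrates the flow
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(8)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. The knowledge base: documents stored alongside their embeddings
knowledge_base = ["Cafes in Milan open at 7am.", "Espresso costs 1.20 euros."]
kb_vectors = [embed(doc) for doc in knowledge_base]

# 2. Embed the user query and retrieve the most similar document
query = "When do cafes open?"
query_vector = embed(query)
best_doc = max(
    zip(knowledge_base, kb_vectors),
    key=lambda pair: cosine_similarity(query_vector, pair[1]),
)[0]

# 3. The retrieved document contextualizes the generation step
prompt = f"Answer using this context:\n{best_doc}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM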

RAG applications leverage existing customer documentation to create valuable insights. By tapping into these documents, organizations can develop chatbots capable of providing standardized and accurate answers to user queries. This ensures that knowledge is effectively shared across all stakeholders, enhancing customer service and operational efficiency.

Knowledge graphs

A knowledge graph is a structured representation of facts consisting of entities, relationships, and semantic descriptions. Entities can be real-world objects or abstract concepts, relationships represent the relations between entities, and semantic descriptions of entities and their relationships contain types and properties with well-defined meanings (Ji et al., 2020).

Example of a knowledge graph describing cafes and associated entities

By organizing data into interconnected nodes and edges, knowledge graphs enable rapid analysis of complex relationships.
In healthcare, knowledge graphs integrate with electronic health record (EHR) systems and biomedical databases, while in supply chain scenarios they can optimize logistics by modeling supply chain networks, identifying bottlenecks, and predicting demand fluctuations.

Creating insight: focus on the implementation

From Text to Nodes: Graph Generation Techniques

Creating knowledge graphs from text is a multi-step process that transforms unstructured data into structured knowledge. Here’s a high-level overview of the necessary steps:

  1. Identifying Concepts and Entities: First, we need to pick out the key concepts and entities in the text. These entities will be the nodes in our knowledge graph. They can include people, places, organizations, events, and additional terms relevant to the text.
  2. Establishing Relationships: Next, we figure out how these entities are related. These relationships will form the graph’s edges, linking the nodes together. Relationships can be actions, attributes, or any interactions that connect two entities.
  3. Generating and Visualizing the Graph: Finally, we build and visualize the knowledge graph. Visualization tools can then be used to display the graph, giving us a visual map of the knowledge within the text.
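
As a minimal illustration of what these three steps produce, the sketch below builds a tiny graph from hand-written triples using networkx and matplotlib (not the tooling we use later, and the triples are invented for the example):

import matplotlib.pyplot as plt
import networkx as nx

# Hand-written (entity, entity, relationship) triples standing in for
# the output of steps 1 and 2
triples = [
    ("Bilbo Baggins", "Ring", "gives"),
    ("Frodo", "Bilbo Baggins", "heir_of"),
    ("Bilbo Baggins", "birthday celebration", "participates_in"),
]

G = nx.DiGraph()
for subject, obj, relation in triples:
    G.add_edge(subject, obj, label=relation)  # nodes are created implicitly

# Step 3: visualize the resulting knowledge graph
nx.draw(G, with_labels=True, node_color="lightblue")
plt.show()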

While the last step is relatively straightforward, finding meaningful entities and their relationships in a corpus of text can be quite challenging, as there are no specific rules to follow.

A common approach is to create a well-designed prompt to instruct an LLM to identify entities and relationships and generate the graph elements. While this method works well, there are already several packages available that can handle the heavy lifting for us.
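
As an illustration, a minimal extraction prompt might look like the hypothetical sketch below (graph-maker ships its own, more elaborate version of this idea):

# A purely illustrative extraction prompt, not the one used by graph-maker
EXTRACTION_PROMPT = """
You are a knowledge graph extractor. From the text below, return every
(entity, entity, relationship) triple you can find, as a JSON list:
[{"node_1": "...", "node_2": "...", "relationship": "..."}]

Text: {chunk}
"""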

For this article, we will be using the graph-maker library, which you can find in Rahul Nayak’s repository (Nayak, 2024). For simplicity’s sake, we will also use the same subject for the text corpus, which is already split into chunks.

The way the library works is straightforward:

  • Our knowledge base is represented by the corpus of documents we will use as a base. Each document consists of a chunk of the full text and is represented by a simple Pydantic model containing that chunk’s text and additional metadata that can give it context.
  • The ontology is also defined as a Pydantic model containing a list of labels to find, each with its own definition, and a list of relationships.
from typing import Dict, List, Union
from pydantic import BaseModel

# These mirror the models exposed by the graph-maker library
class Document(BaseModel):
    text: str
    metadata: dict

class Ontology(BaseModel):
    labels: List[Union[str, Dict]]
    relationships: List[str]

In our specific case, the Ontology will contain generic definitions for the entities, as we do not know the content of the documents in advance. Otherwise, we would simply refine the list of labels and the definitions of the entities to find.

ontology = Ontology(
    labels=[
        {"Person": "A person name without any adjectives. Remember, a person may be referenced by their name or using a pronoun."},
        {"Object": "An object name without any adjectives. Do not add definite or indefinite articles with the object name, like 'a' or 'the'."},
        {"Event": "Event involving one or multiple people. Do not include qualifiers or verbs like gives, leaves, works etc. If the event is described as a long sentence, condense it into a single word. For example, 'delivers a package' would become 'delivery'."},
        "Place",
        "Document",
        "Organisation",
        "Action",
        {"Miscellaneous": "Any important concept that can not be categorised with other given labels"},
    ],
    relationships=[
        "Relation between any pair of Entities"
    ],
)

Now that we have defined the text structure for the process and the elements our model will have to look for, the last step consists of choosing a model to carry out the text analysis.
Out of the box, the library integrates with OpenAI and Groq clients, allowing users to access models like GPT-3.5, Mixtral, Llama, and Gemma, but it is entirely possible to define a custom client and use it to run other models, like Claude.
The only prerequisite is to have a valid API key for Groq or OpenAI.
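
For example, a custom client for Claude could look roughly like the sketch below. It assumes, based on how the client is used later in this article, that graph-maker only requires a generate(user_message, system_message) method; check the library’s client interface before relying on it:

import anthropic

# A hypothetical custom client; the generate() signature mirrors the
# one used by the built-in clients below
class ClaudeClient:
    def __init__(self, model="claude-3-haiku-20240307", temperature=0.1):
        self._client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        self._model = model
        self._temperature = temperature

    def generate(self, user_message: str, system_message: str) -> str:
        response = self._client.messages.create(
            model=self._model,
            max_tokens=1024,
            temperature=self._temperature,
            system=system_message,
            messages=[{"role": "user", "content": user_message}],
        )
        return response.content[0].text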

import datetime

# Imports assume the graph-maker package layout; adjust them to match
# your installation of the library
from knowledge_graph_maker import (
    Document, GraphMaker, GroqClient, Ontology, OpenAIClient
)

## Groq models
model = "mixtral-8x7b-32768"
# model = "llama3-8b-8192"
# model = "llama3-70b-8192"
# model = "gemma-7b-it"

## OpenAI models
# oai_model = "gpt-3.5-turbo"

llm = GroqClient(model=model, temperature=0.1, top_p=0.5)
# llm = OpenAIClient(model=oai_model, temperature=0.1, top_p=0.5)

# This will generate the additional summary stored in each document's metadata
def generate_summary(text):
    SYS_PROMPT = (
        "Summarize the text provided by the user. "
        "Respond only with the summary and no other comments."
    )
    try:
        summary = llm.generate(user_message=text, system_message=SYS_PROMPT)
    except Exception:
        # Fall back to an empty summary if the LLM call fails
        summary = ""
    return summary

current_time = str(datetime.datetime.now())

graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=False)

# example_text_list holds the pre-chunked corpus described earlier
docs = map(
    lambda t: Document(
        text=t,
        metadata={"summary": generate_summary(t), "generated_at": current_time},
    ),
    example_text_list,
)

# A delay between LLM calls helps stay within provider rate limits
graph = graph_maker.from_documents(
    list(docs),
    delay_s_between=15,
)

Regardless of the model you choose, after running the above code you will end up with a set of extracted edges representing your nodes and their relationships.
This is an example of the kind of output you should expect:

GRAPH MAKER VERBOSE - 2024-07-10 18:11:35 - INFO
LLM Response:
[
    {
        "node_1": {"label": "Person", "name": "Bilbo Baggins"},
        "node_2": {"label": "Event", "name": "birthday celebration"},
        "relationship": "Bilbo Baggins participates in the birthday celebration event."
    },
    {
        "node_1": {"label": "Person", "name": "Bilbo Baggins"},
        "node_2": {"label": "Object", "name": "Ring"},
        "relationship": "Bilbo Baggins gives the Ring to Frodo."
    },
    {
        "node_1": {"label": "Person", "name": "Frodo"},
        "node_2": {"label": "Person", "name": "Bilbo Baggins"},
        "relationship": "Frodo is the heir of Bilbo Baggins."
    },
    ...
]

The final step in our journey of transforming text into actionable knowledge is visualizing the results. Visualization not only provides a tangible way to explore and understand the data but also enables us to uncover insights that might remain hidden in raw, unstructured text.

For our visualization, we leverage the power of Neo4j, a robust and scalable graph database designed for handling highly interconnected data. Neo4j allows us to store our knowledge graph in a way that is both intuitive and efficient, making it an ideal choice for this task.
By saving our results in a Neo4j graph database, we can take full advantage of its sophisticated querying capabilities. This enables us to easily traverse the relationships within our data and explore the intricate connections between entities.

import os
import re

from neo4j import GraphDatabase

uri = os.getenv('NEO4J_URI')
username = os.getenv('NEO4J_USERNAME')
password = os.getenv('NEO4J_PASSWORD')

driver = GraphDatabase.driver(uri, auth=(username, password))

# Neo4j relationship types cannot contain spaces or punctuation,
# so replace every non-word character with an underscore
def sanitize_relationship_type(relationship):
    return re.sub(r"[^\w]", "_", relationship)


def load_graph_into_neo4j(graph):
    with driver.session() as session:
        for edge in graph:
            node_1 = edge.node_1
            node_2 = edge.node_2
            relationship = edge.relationship
            metadata = edge.metadata

            # Create or merge nodes
            session.run(
                """
                MERGE (n1:{label1} {{ name: $name1 }})
                MERGE (n2:{label2} {{ name: $name2 }})
                """.format(
                    label1=node_1.label,
                    label2=node_2.label
                ),
                name1=node_1.name,
                name2=node_2.name
            )

            # Prepare relationship properties from the edge metadata
            rel_props_str = ', '.join([f'{key}: ${key}' for key in metadata.keys()])
            params = {**{'name1': node_1.name, 'name2': node_2.name}, **metadata}

            # Create the relationship with explicitly specified properties
            session.run(
                """
                MATCH (n1:{label1} {{ name: $name1 }})
                MATCH (n2:{label2} {{ name: $name2 }})
                MERGE (n1)-[r:{rel_type} {{ {rel_props} }}]->(n2)
                """.format(
                    label1=node_1.label,
                    label2=node_2.label,
                    rel_type=sanitize_relationship_type(relationship),
                    rel_props=rel_props_str
                ),
                **params
            )

load_graph_into_neo4j(graph)
driver.close()
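
With the data loaded, Cypher’s querying capabilities are immediately available. As a small example (to run before the driver.close() call above), the query below pulls everything directly connected to a single entity; the name comes from our example output:

# Explore every relationship around one entity of the example graph
with driver.session() as session:
    result = session.run(
        """
        MATCH (p:Person {name: $name})-[r]-(other)
        RETURN p.name AS person, type(r) AS relation, other.name AS target
        """,
        name="Bilbo Baggins",
    )
    for record in result:
        print(record["person"], record["relation"], record["target"])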

Once all the data has been imported into Neo4j, we gain a detailed and insightful graph model that captures the essence of our analysis. This model is more than a mere collection of data points; it dynamically represents the entities and their interrelationships, offering deep insights into the information we have processed.
At the heart of our graph model are nodes representing the entities identified in our text analysis. These entities span a variety of categories, including people, organizations, events, concepts, and other relevant elements. Each node is distinctly labeled and defined, providing clear context and meaning. Connecting these nodes are edges that signify the relationships between entities, illustrating how different entities interact and relate to each other.
To enhance the clarity and usability of our graph, each node and edge comes with a summary and annotations. These summaries offer concise, informative descriptions that help explain the significance of each entity and relationship within the context of our analysis.

Extract of generated graph model

From Graphs to Answers: Graph RAG Overview

Now that we have seen how to generate a knowledge graph from text, it is time to leverage its power to implement a Graph-based Retrieval-Augmented Generation (GRAG) system.
GRAG combines the strengths of knowledge graphs and generative models to provide enhanced, contextually relevant responses by utilizing structured knowledge efficiently.
The process involves integrating the generated graph with a generative AI model to improve the accuracy and relevance of the responses.
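
As a rough sketch of the idea, graph-based retrieval can be as simple as pulling the neighborhood of an entity mentioned in the question and handing those facts to the model. The snippet below reuses the Neo4j driver and llm client from the previous sections and simplifies entity linking to a hard-coded name:

# A minimal GRAG sketch, assuming the driver and llm objects defined earlier
def grag_answer(question: str, entity_name: str) -> str:
    # 1. Retrieve the subgraph around the entity mentioned in the question
    with driver.session() as session:
        rows = session.run(
            "MATCH (e {name: $name})-[r]-(n) "
            "RETURN e.name AS source, type(r) AS rel, n.name AS target",
            name=entity_name,
        )
        facts = [f"{r['source']} -[{r['rel']}]-> {r['target']}" for r in rows]

    # 2. Use the retrieved relationships as structured context for the LLM
    context = "\n".join(facts)
    return llm.generate(
        user_message=question,
        system_message=f"Answer using only these graph facts:\n{context}",
    )

# e.g. print(grag_answer("Who inherits from Bilbo?", "Bilbo Baggins"))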
If we try to list the possible approaches we could take, the article written by Tomaz Bratanic does an amazing job of summarizing them:

RAG Retriever: Traditional method where the exact data indexed is the data retrieved based on the context similarity.

Parent Retriever: Instead of indexing entire documents, data is divided into smaller chunks, referred to as Parent and Child documents. Child documents are indexed for better representation of specific concepts, while parent documents are retrieved to ensure context retention.

Hypothetical Questions: Documents are processed to generate potential questions they might answer. These questions are then indexed for better representation of specific concepts, while parent documents are retrieved to ensure context retention.

Summaries: Instead of indexing the entire document, a summary of the document is created and indexed. Similarly, the parent document is retrieved in a RAG application.

As the corpus of documents used to generate our knowledge graph was very small (fewer than 15 short documents), we will make use of an already available Langchain template to showcase the advantages of GRAG over traditional RAG approaches.

To start things up, you only need the langchain-cli and git packages installed and a valid OpenAI API key: that’s it!
Then, you can immediately pull the Langchain template and start the automatic ingestion process to generate the graph from the underlying documents, similar to what was shown in the previous section.

langchain app new my-app --package neo4j-advanced-rag


# Make sure to have properly set these variables
export OPENAI_API_KEY=sk-..
export NEO4J_USERNAME=neo4j_username
export NEO4J_PASSWORD=neo4j_password
export NEO4J_URI=neo4j_uri

python ingest.py

What you will immediately notice is that the generated graph closely mirrors the narrative structure we derived from the raw text of The Lord of the Rings (just a different series of books!). These graph representations not only preserve the detail and complexity of the original summaries but also enhance them, making it easier to explore and understand the intricate web of connections within the stories.

By firing up the Langchain application, we step into an interactive playground where we can dive deep into exploring and querying our data. This platform lets us uncover hidden patterns, navigate complexities, and gain insights into our analysis.

langchain serve

Comparison between information retrieved by different approaches

A few questions are sufficient to distinguish traditional RAG retrieval from GRAG: the latter not only enhances answer precision but also reveals connections between different pieces of information that a conventional RAG would overlook.

Conclusions

In this article, we explored the theory and practice of converting text into knowledge graphs and leveraging them in GRAG applications. Despite the powerful capabilities of these methods, the output from text-to-graph conversion is not optimal, which directly impacts the quality of downstream applications. It often requires considerable fine-tuning and care to ensure accuracy and relevance. Entities and relationships can be complex and nuanced, making automated extraction challenging. Therefore, continuous refinement and adjustments are essential to improve the quality of the knowledge graph and ensure it meets the desired standards.

Moreover, GRAG (Graph-Based Retrieval-Augmented Generation) appears particularly promising in this context. It has shown remarkable potential in capturing complex relationships between concepts that traditional RAG methods might overlook. By enhancing both the precision of answers and uncovering intricate connections between pieces of information, GRAG represents a significant advancement in knowledge representation and retrieval.

This iterative process of tweaking models, prompts, and parameters is crucial as we strive to better capture the intricacies of source texts. As these methods continue to evolve and improve, the potential for knowledge graphs to revolutionize various fields will expand, fostering innovation and deeper understanding.

References

  1. Ji, S., Pan, S., Cambria, E., Marttinen, P., & Yu, P. S. (2020). A Survey on Knowledge Graphs: Representation, Acquisition and Applications. arXiv:2002.00388.
  2. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.
  3. Nayak, R. (2024). graph_maker [GitHub repository]. https://github.com/rahulnyk/graph_maker
  4. Bratanic, T. (2023). blogs [GitHub repository]. https://github.com/tomasonjo/blogs
