
Enhancing Data Interactivity with LLMs and Neo4j Knowledge Graphs

Research Graph
9 min read · May 16, 2024
Image generated in DALL-E 2.


Introduction

Since OpenAI launched ChatGPT, a chatbot based on a large language model (LLM), in late 2022, it has set off a technological wave. These models, capable of understanding and generating human-like text, can adapt to various conversational contexts and address questions across a broad range of topics. They can even simulate human behaviour, revolutionising the way humans interact with machines. People now use these models for tasks such as summarising articles, rewriting emails and answering questions, sparking a new wave of applications in artificial intelligence (AI). Nonetheless, LLMs come with challenges such as generating inaccurate information (hallucinations), a lack of explainability and domain-specific limitations, which can hinder their practical application. To address these issues, knowledge graphs, known for their complex analytical capabilities, structured representations and semantic query functionalities, can be integrated with LLMs. This integration enhances the intelligence of AI systems, leading to more precise and reliable outputs.

LLMs are large, general-purpose language models that are first pre-trained on extensive datasets and then fine-tuned for specific applications. To learn more about LLMs, please read the article Large Language Models. A knowledge graph captures information about the main entities in a domain and the relationships between them. Neo4j is a graph database management system: it embraces a graph-based model, organising data as nodes, relationships and properties. To learn more about Neo4j and Cypher (the query language supported by Neo4j), please check the article Neo4j.

A standard approach to integrating LLMs and knowledge graphs is Retrieval-Augmented Generation (RAG). We have discussed various ways in which RAG can be used to retrieve information and generate responses effectively in the article RAG.

In this article, we explore the integration of knowledge graphs, specifically Neo4j, with LLMs, and their applications.

Framework

Instead of having users interact directly with LLMs in isolation, we guide the LLM to retrieve data from Neo4j’s knowledge graph when a question is posed. This approach allows the model to provide a richer and more accurate response. By leveraging Neo4j’s knowledge graph, we not only reduce instances of hallucination where the model might generate misleading or irrelevant information, but also enhance the explainability and traceability of the responses. This makes Neo4j a particularly robust choice compared to other solutions for integrating RAG techniques.

To fully leverage the benefits of Neo4j within LLM applications, a structured approach is employed, as shown in the image below:

Framework for integrating LLMs and Neo4j. Source: https://neo4j.com/labs/genai-ecosystem/
  • Data Retrieval: When a user queries using natural language, the LLM interacts with Neo4j to fetch relevant data, grounding the responses in verified information and reducing the likelihood of inaccuracies.
  • Response Accuracy and Reliability: By integrating knowledge graphs, the responses generated by LLMs are not only more accurate but also include explanations that trace back to concrete data entities and their relationships, enhancing their reliability and usefulness in practical applications.
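The retrieval and grounding steps above can be sketched in a few lines. Here `query_graph` and `call_llm` are hypothetical stand-ins: in a real application, `query_graph` would run Cypher via the official neo4j Python driver, and `call_llm` would call your LLM provider of choice.

```python
# Minimal sketch of grounding an LLM answer in Neo4j query results.
# query_graph and call_llm are injected, hypothetical callables.

def build_grounded_prompt(question, facts):
    """Embed retrieved graph facts into the prompt so the LLM answers
    from verified data rather than from its parametric memory alone."""
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Answer the question using only the facts below.\n"
        f"Facts:\n{context}\n"
        f"Question: {question}\n"
    )

def answer(question, query_graph, call_llm):
    facts = query_graph(question)  # e.g. a Cypher MATCH over the knowledge graph
    prompt = build_grounded_prompt(question, facts)
    return call_llm(prompt)
```

Because the retrieved facts are listed explicitly in the prompt, a response can be traced back to the exact graph entities that produced it.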

Use Case

LLM Graph Builder

The Neo4j LLM Knowledge Graph Builder is designed to convert unstructured text from various sources such as PDFs, YouTube videos, and webpages into a structured knowledge graph. This tool supports a diverse range of inputs including platforms like YouTube and Wikipedia, as well as cloud storage services such as Amazon S3 and Google Cloud Storage (GCS). Additionally, users can directly upload PDF files or other file types, enhancing the flexibility and breadth of data that can be processed. It employs advanced machine learning models like Google Gemini, Diffbot, and OpenAI GPT to transform textual content into a network where entities become nodes and their relationships are represented as links. To explore this application further, you can access it through this graph builder link. This tool is ideal for creating detailed knowledge graphs from multiple data sources.

The process involves dividing documents into segments, generating embeddings for each segment, and then employing LLMs to identify and extract entities and their relationships from these segments. For instance, I processed text from Wikipedia about Neo4j using OpenAI's GPT-4. From the Wikipedia content, the tool extracted 24 nodes and 16 relationships.
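The first step, splitting documents into segments, can be sketched with a simple fixed-size chunker. This is only one plausible approach under simplifying assumptions: real graph-builder tools typically use token-aware splitters, while this sketch counts characters for brevity.

```python
# Simple fixed-size chunker with overlap, so context at segment
# boundaries is not lost before embedding and entity extraction.

def chunk_text(text, size=500, overlap=50):
    """Split text into overlapping segments of at most `size` characters."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

Each resulting segment would then be embedded and passed to an LLM for entity and relationship extraction.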

GraphBuilder Demo. Source created through: https://llm-graph-builder.neo4jlabs.com/

To visualise the graph, simply connect it to a Neo4j database. This allows for effectively displaying and analysing the relationships and entities within the knowledge graph.

Visualise Result. Source created through: https://llm-graph-builder.neo4jlabs.com/

GraphRAG

The GraphRAG system utilises a knowledge graph derived from SEC filings (formal documents, such as financial statements, submitted to the U.S. Securities and Exchange Commission), enriched with additional company data, using Neo4j’s vector index and LangChain (a framework designed to simplify the creation of LLM-powered applications) to provide a RAG chatbot. This chatbot can address queries about SEC filings and provide supplementary information about the company. For a practical demonstration, please refer to this GraphRAG Demo.

To construct a Neo4j knowledge graph, the information is organised into nodes and relationships. For unstructured data like 10-K filings, the approach involves creating “document” nodes. Each document node corresponds to a segment of text from a 10-K filing, and an embedding model from the GenAI service is used to generate vectors for these nodes, which later enables vector search. For semi-structured Form 13 filings, the system employs Cypher templates to input the data in a structured manner, creating “company” and “manager” nodes and building an “owns” relationship between them while linking back to the document. Additionally, an alternative method for ingesting Form 13 uses named entity recognition powered by an LLM from the GenAI service provider, offering a more robust way to handle filings that may be malformed, inconsistent, or semi-structured XML.
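A Cypher template for the Form 13 shape described above might look like the following sketch. The property names (`cusip`, `cik`, `shares`) are illustrative only, not the demo's actual schema.

```python
# Hypothetical Cypher template following the company/manager/OWNS
# structure used for Form 13 ingestion. MERGE makes the load idempotent:
# re-running it will not create duplicate nodes or relationships.
FORM13_TEMPLATE = """
MERGE (c:Company {cusip: $cusip})
  SET c.name = $companyName
MERGE (m:Manager {cik: $cik})
  SET m.name = $managerName
MERGE (m)-[o:OWNS]->(c)
  SET o.shares = $shares
"""

def form13_params(record):
    """Map one parsed Form 13 record to Cypher query parameters."""
    return {
        "cusip": record["cusip"],
        "companyName": record["companyName"],
        "cik": record["cik"],
        "managerName": record["managerName"],
        "shares": record["shares"],
    }
```

With the neo4j Python driver, the template and parameters would be executed together, e.g. via `session.run(FORM13_TEMPLATE, form13_params(record))`.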

GraphRAG Framework. Source: https://neo4j.com/labs/genai-ecosystem/rag-demo/

NeoConverse

NeoConverse is an application for making graph databases accessible to non-technical users through natural language queries. It utilises LLMs to interpret user questions and translate them into Cypher queries against the Neo4j graph database. This process involves leveraging the graph database schema, sample questions and model fine-tuning to ensure accuracy and relevance. Once a query is formulated, it goes through validation before execution against the database. The results of this query are then fed back into the LLM, which crafts a response in natural language to convey the findings clearly to the user. Alternatively, NeoConverse can configure the LLM to use the query results to generate data and configuration for visual representations, such as charts, enhancing the interpretability and presentation of the information.

The process of NeoConverse:

  1. Questioning: Users ask a natural language question about their enterprise data stored in Neo4j.
  2. Contextual Setup: NeoConverse prepares the LLM by providing it with the schema information that is stored in the Neo4j database and a few relevant data examples to guide the model.
  3. Query Generation: The LLM uses the provided context to generate a Cypher query, which is used to retrieve the specific information as required.
  4. Query Execution: The query is refined, formatted, and executed against the Neo4j database to fetch the relevant data.
  5. Response Creation: Using the retrieved data, the LLM crafts a detailed and accurate response based on the original query, ensuring that the answer is directly relevant to the user’s data.
NeoConverse Pipeline. Source: https://neo4j.com/labs/genai-ecosystem/neoconverse/
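The contextual-setup and query-generation steps above can be sketched as a prompt builder. The schema string and the few-shot examples here are hypothetical placeholders, not NeoConverse's actual prompt format.

```python
# Sketch of packing schema information and few-shot examples into a
# text-to-Cypher prompt, as in NeoConverse's contextual-setup step.

def cypher_generation_prompt(schema, examples, question):
    """Build a prompt asking the LLM to translate a question to Cypher.

    examples: list of (question, cypher) pairs used as few-shot guidance.
    """
    shots = "\n".join(f"Question: {q}\nCypher: {c}" for q, c in examples)
    return (
        "Generate a Cypher query for the question below.\n"
        f"Graph schema:\n{schema}\n"
        f"Examples:\n{shots}\n"
        f"Question: {question}\nCypher:"
    )
```

The generated query would then be validated and executed against Neo4j, with the results fed back to the LLM for the final natural-language answer.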

By leveraging internal data, NeoConverse empowers LLMs to provide more accurate and relevant answers to user queries within an enterprise setting.

Main functions of NeoConverse

  1. Customisable Neo4j Database Agents: Users can connect their own Neo4j databases to NeoConverse, transforming these databases into conversational agents that can be interacted with in plain English. This includes configuring schema information to provide context for the LLM, adding few-shot learning examples for enhanced model understanding, and dynamically interacting with the database through a natural language interface.
  2. Visual Data Representation: Beyond textual interactions, NeoConverse supports generating chart visualisations from natural language queries. This allows users to not only ask questions but to see their data represented visually, aiding in better data comprehension and presentation.

GenAI Stack

The GenAI Stack is a collaborative tool launched at DockerCon 2023, integrating technologies from Docker, Neo4j, LangChain, and Ollama. It’s designed to enhance data handling and AI capabilities, particularly in managing and querying large datasets effectively.

GenAI Stack Pipeline. Source: https://neo4j.com/labs/genai-ecosystem/genai-stack/

The workflow of GenAI Stack is as follows: When a user asks a question, the system first uses a vector index for similarity search to find the documents or data entries that most closely match the user query based on semantic similarity. This task is handled by a local LLM served through Ollama (a platform for running LLMs on local machines), which processes and understands natural language inputs. The relevant documents are then fetched from a structured Neo4j database, which organises data using a combination of a vector index and a knowledge graph. The LLM analyses these documents to generate an accurate and contextually relevant response, which is subsequently delivered to the user.
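The similarity-search step can be illustrated with a minimal cosine-similarity ranking over stored embeddings. This is a naive linear scan for illustration only; in the actual stack this work is delegated to Neo4j's vector index rather than done in application code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, docs, k=3):
    """docs: list of (doc_id, embedding). Return ids of the k best matches."""
    scored = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```

The ids returned by the search are then used to fetch the full documents, and their surrounding graph context, from Neo4j.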

Key Features of GenAI Stack:

  1. Setup: It allows the setup of local or remote LLMs, Neo4j databases, and LangChain demo applications using Docker.
  2. Model Integration: Users can pull Ollama models and sentence transformers as required.
  3. Data Management: The stack supports the import of StackOverflow’s Questions and Answers by tags and enables the creation of knowledge graphs and vector embeddings for these items.
  4. Embedding Creation: It can create knowledge graph and vector embeddings for questions and answers.
  5. Application Development: The stack features a Streamlit Chat App with capabilities for vector search and GraphRAG answer generation, tailored for enhancing user interaction.
  6. Support Ticket Generation: It can create “Support Tickets” from unanswered questions on StackOverflow, considering the relevance and quality of existing questions.
  7. PDF Interaction: The PDF chat feature supports loading PDFs, chunking text, indexing vectors, and searching to generate answers.
  8. Software Architecture: It includes a Python backend and a Svelte front-end for developing robust chat applications with vector search and GraphRAG integration.

Conclusion

In conclusion, the integration of Large Language Models (LLMs) with Neo4j knowledge graphs represents a significant advancement in leveraging artificial intelligence to enhance data interactivity and precision. This combination not only addresses the limitations associated with LLMs, such as generating inaccurate information (hallucinations) and a lack of explainability, but it also maximises the strengths of both technologies. By grounding LLM responses in structured, queryable data from Neo4j, we can achieve more reliable, contextually accurate, and accessible outputs.

However, this integration is not without its challenges. A primary concern is the increased complexity and computational demand required to merge these sophisticated technologies successfully. Furthermore, the effectiveness of this integration is not universally applicable, as it heavily depends on the quality, depth, and breadth of the knowledge graph. In scenarios where the knowledge base is limited or skewed towards specific areas, knowledge-graph-based integrations such as Neo4j’s might not outperform traditional RAG methods, and the associated costs could be considerably high, given that organising a knowledge graph demands substantial time and resources. Despite these challenges, this innovative integration promises to enhance AI systems’ ability to mirror human cognition and discovery more effectively.
