A Gentle Introduction to Large Language Models and Knowledge Graphs

Claudio Stamile
BIP xTech
10 min read · Aug 17, 2023

Introduction

Over the last few months, Large Language Models (LLMs) and Generative AI have seemed to be the only topic the data science community talks about. These models are certainly interesting, and many applications can be built by combining them with other technologies. As the title suggests, in this article we investigate the application of LLMs to graphs and Graph Machine Learning (GML).

A comprehensive survey of the possible applications of LLMs to graphs is provided in the paper Unifying Large Language Models and Knowledge Graphs: A Roadmap by Pan et al. [1].

In the paper, the authors describe three main macro-areas where the combination of LLMs with graphs can be beneficial.

1) KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or for the purpose of enhancing understanding of the knowledge learned by LLMs;

2) LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering;

3) Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge.

In this article, we will build an example of the third point: starting from a given graph, we use an LLM to generate the Cypher queries needed to extract data from a Neo4j database. A nice overview of this feature in LangChain is available in a Medium article [2]. Here we want to understand and debug how this process works, and how it could be generalized to other technologies (on both the LLM side and the graph database side).

Tools and techniques

The goal is to build a pipeline in Python that generates Cypher queries capable of answering specific questions using the knowledge base stored in the graph database. We do not want to learn any unfamiliar Python library to do this; the goal is to let the LLM do the complex work.

In this article, we will use three tools: Python, of course; Neo4j as the graph database [3]; and Google Codey [4] to generate code from a given input. Before going into detail, let’s introduce Google Codey.

Codey belongs to the family of Google foundational coding models built on PaLM 2. Codey was fine-tuned on a large dataset of high-quality, permissively licensed code from external sources and supports 20+ coding languages, including Python, Java, JavaScript, Go, and others. The Codey models have been used to enhance software development tasks across various Google surfaces such as Colab, Android Studio, Google Cloud, and Google Search. For developers, the benefits include faster coding, better code quality, and a smaller skills gap between novice and expert developers. Some of the tasks that Codey can help with include:

Code completion: Codey suggests the next few lines based on the existing context of code.

Code generation: Codey generates code based on natural language prompts from a developer.

Code chat: Codey lets developers converse with a bot to get help with debugging, documentation, learning new concepts, and other code-related questions.

Let’s start with the simplest operation: creating the Neo4j instance we will use to build and test the code. We used the (free) Neo4j sandbox, which lets you create a new Neo4j instance in a few clicks. You can select an empty instance or one with a pre-loaded dataset; in this article, we use the instance with the Twitch dataset. After a few seconds, the instance is up and running, and you can connect to it from Python or other programming languages.

Example of credentials from the Neo4j sandbox

The easy part is completed with zero code. Let’s now dive into the code that uses Google Codey to query the database automatically, without any prior knowledge of the Cypher query language.

Build the pipeline and write the code

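The process we want to build consists of three steps: building the schema prompt, adding the natural-language question and sending the prompt to the LLM, and executing the generated query against the database.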

Step 1 — Building the prompt for the schema (We are not using the LLM yet)

We start with the first step: generating the “context” prompt that contains information about the structure of the data in the graph database. In this process, we convert the entities (nodes) and the relationships (edges), along with their properties, from a graph format to a text version. We can think of this step as a “graph2text” process in which the graph structure is described as text. To perform this task, we use the Neo4jGraph class from the LangChain package.

from typing import Any, Dict, List

node_properties_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE NOT type = "RELATIONSHIP" AND elementType = "node"
WITH label AS nodeLabels, collect({property:property, type:type}) AS properties
RETURN {labels: nodeLabels, properties: properties} AS output

"""

rel_properties_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE NOT type = "RELATIONSHIP" AND elementType = "relationship"
WITH label AS nodeLabels, collect({property:property, type:type}) AS properties
RETURN {type: nodeLabels, properties: properties} AS output
"""

rel_query = """
CALL apoc.meta.data()
YIELD label, other, elementType, type, property
WHERE type = "RELATIONSHIP" AND elementType = "node"
RETURN "(:" + label + ")-[:" + property + "]->(:" + toString(other[0]) + ")" AS output
"""


class Neo4jGraph:
    """Neo4j wrapper for graph operations."""

    def __init__(
        self, url: str, username: str, password: str, database: str = "neo4j"
    ) -> None:
        """Create a new Neo4j graph wrapper instance."""
        try:
            import neo4j
        except ImportError:
            raise ValueError(
                "Could not import neo4j python package. "
                "Please install it with `pip install neo4j`."
            )

        self._driver = neo4j.GraphDatabase.driver(url, auth=(username, password))
        self._database = database
        self.schema = ""
        # Verify connection
        try:
            self._driver.verify_connectivity()
        except neo4j.exceptions.ServiceUnavailable:
            raise ValueError(
                "Could not connect to Neo4j database. "
                "Please ensure that the url is correct"
            )
        except neo4j.exceptions.AuthError:
            raise ValueError(
                "Could not connect to Neo4j database. "
                "Please ensure that the username and password are correct"
            )
        # Set schema
        try:
            self.refresh_schema()
        except neo4j.exceptions.ClientError:
            raise ValueError(
                "Could not use APOC procedures. "
                "Please ensure the APOC plugin is installed in Neo4j and that "
                "'apoc.meta.data()' is allowed in Neo4j configuration "
            )

    @property
    def get_schema(self) -> str:
        """Returns the schema of the Neo4j database"""
        return self.schema

    def query(self, query: str, params: dict = {}) -> List[Dict[str, Any]]:
        """Query Neo4j database."""
        from neo4j.exceptions import CypherSyntaxError

        with self._driver.session(database=self._database) as session:
            try:
                data = session.run(query, params)
                return [r.data() for r in data]
            except CypherSyntaxError as e:
                raise ValueError("Generated Cypher Statement is not valid\n" f"{e}")

    def refresh_schema(self) -> None:
        """
        Refreshes the Neo4j graph schema information.
        """
        node_properties = self.query(node_properties_query)
        relationships_properties = self.query(rel_properties_query)
        relationships = self.query(rel_query)

        self.schema = f"""
        Node properties are the following:
        {[el['output'] for el in node_properties]}
        Relationship properties are the following:
        {[el['output'] for el in relationships_properties]}
        The relationships are the following:
        {[el['output'] for el in relationships]}
        """

This class is really simple: it runs a few queries on the database to extract information about the schema. We can instantiate it and read the schema as follows.

n4jg = Neo4jGraph("neo4j://xx.xxx.xxx.xxx:7687", "username", "password")
# get_schema is a property, so it is read without parentheses
print(n4jg.get_schema)

The URL, username, and password are the ones shown on the sandbox screen. As a result, we get:

Node properties are the following:
[{'properties': [{'property': 'createdAt', 'type': 'DATE_TIME'}, {'property': 'id', 'type': 'STRING'}, {'property': 'description', 'type': 'STRING'}, {'property': 'url', 'type': 'STRING'}, {'property': 'name', 'type': 'STRING'}, {'property': 'followers', 'type': 'INTEGER'}, {'property': 'total_view_count', 'type': 'INTEGER'}], 'labels': 'Stream'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'Game'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'Language'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'User'}, {'properties': [{'property': 'createdAt', 'type': 'DATE_TIME'}, {'property': 'name', 'type': 'STRING'}, {'property': 'id', 'type': 'STRING'}], 'labels': 'Team'}]
Relationship properties are the following: []
The relationships are the following:
['(:Stream)-[:PLAYS]->(:Game)', '(:Stream)-[:HAS_LANGUAGE]->(:Language)', '(:Stream)-[:CHATTER]->(:Stream)', '(:Stream)-[:HAS_TEAM]->(:Team)', '(:Stream)-[:MODERATOR]->(:Stream)', '(:Stream)-[:VIP]->(:Stream)', '(:User)-[:CHATTER]->(:Stream)', '(:User)-[:VIP]->(:Stream)', '(:User)-[:PLAYS]->(:Game)', '(:User)-[:HAS_LANGUAGE]->(:Language)', '(:User)-[:MODERATOR]->(:Stream)', '(:User)-[:HAS_TEAM]->(:Team)']

The string obtained is simply a textual description of the schema of the Twitch Neo4j database.
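The schema string is a raw Python representation of the query results, which is fairly verbose. As a minimal sketch (not part of LangChain), the same information could be rendered in a more compact, Cypher-like form before it goes into the prompt. The helper name format_schema and the output layout are assumptions made for illustration; the input mirrors the structure returned by the schema queries.

```python
from typing import Any, Dict, List


def format_schema(
    node_properties: List[Dict[str, Any]], relationships: List[str]
) -> str:
    """Render node properties and relationship patterns as compact text.

    Hypothetical helper: the layout is an illustrative choice, not the
    format that LangChain produces.
    """
    lines = ["Nodes:"]
    for node in node_properties:
        # One line per label, e.g. "(:Game {name: STRING})"
        props = ", ".join(
            f"{p['property']}: {p['type']}" for p in node["properties"]
        )
        lines.append(f"  (:{node['labels']} {{{props}}})")
    lines.append("Relationships:")
    lines.extend(f"  {rel}" for rel in relationships)
    return "\n".join(lines)


# A small excerpt of the Twitch schema, used here as sample input.
sample_nodes = [
    {"labels": "Game", "properties": [{"property": "name", "type": "STRING"}]},
    {"labels": "User", "properties": [{"property": "name", "type": "STRING"}]},
]
sample_rels = ["(:User)-[:PLAYS]->(:Game)"]

compact_schema = format_schema(sample_nodes, sample_rels)
print(compact_schema)
```

Whether a shorter schema actually helps the model is something to verify on your own data; the main practical benefit is a smaller prompt.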

Step 2 — Add to the prompt the NLP query and send it to the LLM (Start to use the LLM)

The second step is still prompt engineering. The goal is to build the final prompt that will be “executed” by the LLM (Google Codey). To build the prompt, we use the CYPHER_GENERATION_TEMPLATE variable defined in the LangChain source code, adapting it into the following function.

build_qa = lambda schema, question: f"""Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

Using this function in combination with the class defined previously (for schema generation), we get:

n4jg = Neo4jGraph("neo4j://xx.xxx.xxx.xxx:7687", "username", "password")
prom = build_qa(n4jg.get_schema, "the users who play the game Call of Duty: Warzone")

Executing this code gives the prompt that will be sent to the LLM.

Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:

Node properties are the following:
[{'properties': [{'property': 'createdAt', 'type': 'DATE_TIME'}, {'property': 'id', 'type': 'STRING'}, {'property': 'description', 'type': 'STRING'}, {'property': 'url', 'type': 'STRING'}, {'property': 'name', 'type': 'STRING'}, {'property': 'followers', 'type': 'INTEGER'}, {'property': 'total_view_count', 'type': 'INTEGER'}], 'labels': 'Stream'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'Game'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'Language'}, {'properties': [{'property': 'name', 'type': 'STRING'}], 'labels': 'User'}, {'properties': [{'property': 'createdAt', 'type': 'DATE_TIME'}, {'property': 'name', 'type': 'STRING'}, {'property': 'id', 'type': 'STRING'}], 'labels': 'Team'}]
Relationship properties are the following:
[]
The relationships are the following:
['(:Stream)-[:PLAYS]->(:Game)', '(:Stream)-[:HAS_LANGUAGE]->(:Language)', '(:Stream)-[:CHATTER]->(:Stream)', '(:Stream)-[:HAS_TEAM]->(:Team)', '(:Stream)-[:MODERATOR]->(:Stream)', '(:Stream)-[:VIP]->(:Stream)', '(:User)-[:CHATTER]->(:Stream)', '(:User)-[:VIP]->(:Stream)', '(:User)-[:MODERATOR]->(:Stream)', '(:User)-[:PLAYS]->(:Game)', '(:User)-[:HAS_LANGUAGE]->(:Language)']

Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
the users who play the game Call of Duty: Warzone

Now that the prompt is generated, let’s execute it. The idea is to build a class capable of: i) connecting to the Codey environment on the Google Cloud Platform (GCP), and ii) sending the generated prompt to the LLM and returning the generated query. The code to perform these actions is defined as follows:

import vertexai
# Depending on the SDK version, this class may instead live under
# vertexai.preview.language_models.
from vertexai.language_models import CodeChatModel


class CodyGraphPrompt:
    """Generate Cypher queries from natural-language questions with Codey."""

    def __init__(
        self,
        neo4j_connector: Neo4jGraph,
        project_name: str,
        location: str,
        model_name: str = "codechat-bison@001",
        parameters: Dict = {"temperature": 0.2, "max_output_tokens": 1024},
    ):
        vertexai.init(project=project_name, location=location)
        self.chat_model = CodeChatModel.from_pretrained(model_name)
        self.parameters = parameters
        self.neo4j_connector = neo4j_connector

    def prompt_builder(self, schema: str, question: str) -> str:
        return f"""Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.

The question is:
{question}"""

    def run_nlp_query(self, prompt: str) -> str:
        chat = self.chat_model.start_chat()
        # The model parameters belong to send_message, not to prompt_builder
        response = chat.send_message(
            self.prompt_builder(self.neo4j_connector.get_schema, prompt),
            **self.parameters)
        # Keep only the content of the first fenced block in the answer
        return response.text.split("```")[1]

It is possible to use the previous class in the following way:

n4jg = Neo4jGraph("neo4j://xx.xxx.xxx.xxx:7687","username", "password")
asd = CodyGraphPrompt(n4jg, "google_project_name", "us-central1")
asd.run_nlp_query("the users who play the game Call of Duty: Warzone")

As a result, we get the Cypher query created by the LLM.

\nMATCH (user:User)-[:PLAYS]->(game:Game)\nWHERE game.name = 
"Call of Duty: Warzone"\nRETURN user.name;\n

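Splitting the response on the code fence works for the output shown here, but it would raise an IndexError if the model answered without a fence, and it would keep a language tag such as "cypher" if the model added one after the opening fence. The following is a minimal sketch of a more tolerant extraction step; the function name extract_cypher is hypothetical, and the behavior on unusual responses is a design choice rather than something prescribed by LangChain or Vertex AI.

```python
import re

# Three backticks, built programmatically to keep the example easy to read.
FENCE = "`" * 3

# A fenced block: opening fence, optional language tag, newline, body, closing fence.
_BLOCK = re.compile(
    re.escape(FENCE) + r"[ \t]*\w*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_cypher(response_text: str) -> str:
    """Return the query inside the first fenced block, or the whole text."""
    match = _BLOCK.search(response_text)
    if match:
        return match.group(1).strip()
    # No well-formed fenced block: drop any stray fences and use the raw text.
    return response_text.replace(FENCE, "").strip()


tagged = FENCE + "cypher\nMATCH (n) RETURN n;\n" + FENCE
print(extract_cypher(tagged))
```

In the class above, this helper could replace the split call in run_nlp_query.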
Step 3 — Execute the query to the database and return the results

The last and easiest step is to execute the query generated by the model. This can be done by modifying the run_nlp_query function as follows:

    def run_nlp_query(self, prompt: str) -> List[Dict[str, Any]]:
        chat = self.chat_model.start_chat()
        response = chat.send_message(self.prompt_builder(
            self.neo4j_connector.get_schema, prompt),
            **self.parameters)
        # Use the connector stored on the instance rather than a global variable
        return self.neo4j_connector.query(response.text.split("```")[1])

By running the new function we get, as a result, the list of users that satisfy the given query.

[{'user.name': 'montanablack88'}, {'user.name': 'danucd'}, ....]
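The prompt asks the model to use only the relationship types in the schema, but nothing enforces this before the query reaches the database. As a rough sketch, a check like the one below could be run on the generated query first. The function names are hypothetical, and the regular expression is a heuristic that does not parse Cypher properly, so it can be fooled by list literals or by strings that contain brackets.

```python
import re
from typing import List, Set


def relationship_types(cypher: str) -> Set[str]:
    """Collect relationship types such as PLAYS from patterns like [:PLAYS]."""
    types: Set[str] = set()
    for body in re.findall(r"\[([^\]]*)\]", cypher):
        if ":" not in body:
            continue  # e.g. an anonymous relationship [] or a list literal
        after_colon = body.split(":", 1)[1]
        # Handle alternatives such as [:VIP|MODERATOR] or [:VIP|:MODERATOR]
        for part in after_colon.split("|"):
            match = re.match(r"\s*:?\s*(\w+)", part)
            if match:
                types.add(match.group(1))
    return types


def unknown_relationship_types(
    cypher: str, schema_relationships: List[str]
) -> Set[str]:
    """Return the types used in the query that the schema does not define."""
    allowed: Set[str] = set()
    for pattern in schema_relationships:
        allowed |= relationship_types(pattern)
    return relationship_types(cypher) - allowed


schema_rels = ["(:User)-[:PLAYS]->(:Game)", "(:User)-[:VIP]->(:Stream)"]
generated = 'MATCH (user:User)-[:PLAYS]->(game:Game) RETURN user.name'
print(unknown_relationship_types(generated, schema_rels))
```

An empty set means every relationship type in the query appears in the schema; a non-empty set is a reason to reject the query or ask the model again.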

Conclusion and Next Article

In this first article, we discussed a simple way to build Cypher queries using Google Codey. We extensively used and adapted the LangChain code in order to simplify the understanding of the whole pipeline and the (simple) process under the hood. The notebook with the code discussed in this article is available on Colab.

In the next articles in this series on LLMs + graphs, we will show how LLMs can boost Graph Machine Learning.
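Since none of the three steps depends on a specific vendor, the pipeline can be summarized as a single function whose components are passed in as callables. This is only a sketch of that idea: the function answer_question and the stand-in components below are hypothetical, and the stand-ins return canned values so that the example runs without Neo4j or Vertex AI.

```python
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def answer_question(
    question: str,
    get_schema: Callable[[], str],
    build_prompt: Callable[[str, str], str],
    generate: Callable[[str], str],
    execute: Callable[[str], Rows],
) -> Rows:
    """Run the three steps: schema -> prompt -> LLM -> database."""
    schema = get_schema()                    # Step 1: graph2text
    prompt = build_prompt(schema, question)  # Step 2: build the prompt
    cypher = generate(prompt)                # Step 2: ask the LLM
    return execute(cypher)                   # Step 3: run the query


# Stand-ins for the real components (canned values, for illustration only).
def fake_schema() -> str:
    return "['(:User)-[:PLAYS]->(:Game)']"


def fake_prompt(schema: str, question: str) -> str:
    return f"Schema:\n{schema}\nThe question is:\n{question}"


def fake_llm(prompt: str) -> str:
    return "MATCH (user:User)-[:PLAYS]->(game:Game) RETURN user.name"


def fake_db(cypher: str) -> Rows:
    return [{"user.name": "montanablack88"}]


rows = answer_question(
    "the users who play the game Call of Duty: Warzone",
    fake_schema, fake_prompt, fake_llm, fake_db,
)
print(rows)
```

Swapping the LLM or the graph database then means replacing one callable, for example passing a function that wraps a different model in place of fake_llm.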

References

[1] Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/pdf/2306.08302.pdf

[2] LangChain has added Cypher search. https://towardsdatascience.com/langchain-has-added-cypher-search-cb9d821120d5

[3] Neo4j graph database. https://neo4j.com/

[4] Google Codey. https://cloud.google.com/vertex-ai/docs/generative-ai/code/code-models-overview

Claudio Stamile
BIP xTech

Machine Learning Scientist | Double PhD | Software Engineer