Knowledge Graphs + Large Language Models = The ability for users to ask their own questions?

--

Any database is pretty useless unless the user can ask, and get answers to, their questions.

  • Knowledge graphs are superb at representing a domain’s data and, together with a model of that domain, can deliver answers, though the questions need to be formulated as queries by an expert.
  • Conversely, Large Language Models (LLMs) allow any user to ask a question and retrieve a comprehensive answer, but that answer will not include information drawn from the user’s domain data.

In this article, we show how a knowledge graph can prompt or fine-tune an LLM, enabling users to ask their own questions. To illustrate this, we use an RDF knowledge graph of a process plant, the core of a digital twin, to prompt or fine-tune OpenAI’s GPT LLMs.

Example Problem

The example chosen to illustrate the use of an LLM to answer complex graph questions is a process flow sheet as shown below:

Process Flow Sheet

A process flow sheet, or process flow diagram (PFD), is a diagrammatic representation of the process equipment and its associated instrumentation and controls. This particular process flow sheet shows how material is processed through a distillation tower (Distill-501) to produce two products emerging via BL-055 and BL-056. It also shows the instrumentation that has been added, along with some pressure, temperature, and flow controllers.

This problem has been chosen because, IMHO, the best way to capture the flowsheet connectivity (aka the digital twin) is unambiguously with a knowledge graph. There are different ways of expressing the flowsheet connectivity as a graph, but for this example the simplest has been chosen.

Elsewhere, in Graph Connectivity Model, the different approaches to capturing a digital twin of this flowsheet are described. The simplest, referred to as a 1-D model, is used here but is the least useful. Reification of the connectedTo predicate brings greater flexibility; this is referred to as a 2-D model. Finally, the addition of attributes to the reified predicate captures the evolution of a typical flowsheet over time as equipment is replaced or repaired; this is referred to as a 3-D model.

For example, the connectivity of Reboiler-601 is expressed as an RDF Knowledge Graph as follows:

:Reboiler-601
    connectivity:connectedTo :Pipe-005 ;
    connectivity:connectedTo :Pipe-006 ;
    connectivity:connectedTo :Pipe-012 ;
.
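
For readers who want to experiment, a minimal sketch of loading that fragment with rdflib and listing the connections might look like the following; the namespace URIs are assumptions for illustration, not those of the original graph.

```
# A minimal sketch (assuming rdflib and illustrative namespace URIs) of loading
# the fragment above and listing what Reboiler-601 is connected to.
from rdflib import Graph, Namespace

turtle = """
@prefix :             <http://example.org/plant#> .
@prefix connectivity: <http://example.org/connectivity#> .

:Reboiler-601
    connectivity:connectedTo :Pipe-005 ;
    connectivity:connectedTo :Pipe-006 ;
    connectivity:connectedTo :Pipe-012 .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

PLANT = Namespace("http://example.org/plant#")
CONNECTIVITY = Namespace("http://example.org/connectivity#")

for pipe in g.objects(PLANT["Reboiler-601"], CONNECTIVITY.connectedTo):
    print(pipe)   # the URIs of Pipe-005, Pipe-006 and Pipe-012
```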

Questioning

A database is pretty useless unless we can ask, and get answers to, questions about the data.

Now there are innumerable questions we can ask of any database. For example, we could ask how many things are connected to an item in the flowsheet, such as Reboiler-601. If the database cannot answer this then it is time it was retired!

In the ‘real world’, a user would be asking ‘real-world’ questions, such as:

What is the impact of opening Valve-104 on vessel Reflux-401?

We are likely to be disappointed if we expect our database to answer this question directly. However, the strength of a knowledge graph is that we can deduce what paths exist between things. So we could simplify this question to the following:

How is Valve-104 connected to Reflux-401?

Asking the RDF Graph

So can we ask the RDF knowledge graph this question directly? Well, actually, no, because the most common query language for RDF is SPARQL, which is, to be honest, weak at path queries.

PathQL is an alternative language for accessing RDF graphs, optimized for querying paths between things in the RDF graph.

:Valve-104/:connectedTo{1,*}/:Reflux-401

would return all of the routes between :Valve-104 and :Reflux-401.

The SPARQL for this question would be something like this:

SELECT *
WHERE {
    VALUES (?start ?end) { (:Valve-104 :Reflux-401) }
    {
        ?start ?p01 ?end .
    } UNION {
        ?start ?p01 ?i1 .
        ?i1 ?p12 ?end .
    } UNION {
        # ....lots of missing parts to the query!!
    } UNION {
        ?start ?p01 ?i1 .
        ?i1 ?p12 ?i2 .
        ?i2 ?p23 ?i3 .
        ?i3 ?p34 ?i4 .
        ?i4 ?p45 ?i5 .
        ?i5 ?p56 ?i6 .
        ?i6 ?p67 ?i7 .
        ?i7 ?p78 ?end .
    }
}

Yes, it can be written more compactly using SPARQL property paths, but they are limited, especially when there are loops or multiple paths: a property path can confirm that a route exists, but it cannot return the ordered sequence of nodes along that route.

This is not something we would want an end-user to endure.
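
As a concrete illustration of that limitation, here is a sketch of the property-path form run with rdflib, assuming the whole flowsheet has been loaded into the graph `g` of the earlier sketch: it confirms that a route exists between Valve-104 and Reflux-401, but it cannot enumerate the nodes along it.

```
# The property-path alternative: an ASK query (run against the assumed rdflib
# graph `g`) that only reports whether any route exists.
query = """
PREFIX :             <http://example.org/plant#>
PREFIX connectivity: <http://example.org/connectivity#>
ASK { :Valve-104 connectivity:connectedTo+ :Reflux-401 }
"""
result = g.query(query)
print(result.askAnswer)   # True if a route exists; the route itself is not returned
```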

Asking a Large Language Model

Everyone is talking about AI, ChatGPT, and Large Language Models (LLMs). So I will too, not because it is fashionable to do so, but because they are the best thing since sliced bread😁

However, if I ask ChatGPT my question:

How is Valve-104 connected to Reflux-401?

It will respond something like this:

I'm sorry, but without additional context or information about Valve-104 and Reflux-401, I am unable to provide an answer to your question. Can you please provide more details or clarify your question?

Ask an LLM … to tell me the SPARQL

This is not at all surprising, as it knows nothing about the process flowsheet I am asking about. I could, however, provide the ontology of my RDF graph as the context prompt for the LLM:

Using the OpenAI gpt-3.5-turbo LLM

I am going to use a RDF graph with this ontology: 
connectivity:connectedTo
a owl:ObjectProperty ;
rdfs:domain connectivity:Node;
rdfs:range connectivity:Node;
.
connectivity:Node
a owl:Class ;
.

To which it responds:

The ontology includes two main components: the object property "connectedTo" and the class "Node." The object property "connectedTo" represents the connection between two nodes, where the domain and range are both the class "Node." This means that "connectedTo" can only link two "Node" instances together.
The class "Node" is a generic class that can represent any type of node, such as a node in a network or a node in a graph. It does not have any specific properties or restrictions, but it can be used to create more specific classes or subclasses that inherit its properties.
Overall, this ontology provides a basic framework for representing connections between nodes in a network or graph.
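
For reference, the same context can be supplied programmatically; here is a minimal sketch, assuming the openai Python package’s v0.x chat API and an API key in the OPENAI_API_KEY environment variable:

```
# Sketch of supplying the ontology as a context prompt to gpt-3.5-turbo,
# assuming the openai Python package (v0.x ChatCompletion API).
import openai

ontology = """
connectivity:connectedTo
    a owl:ObjectProperty ;
    rdfs:domain connectivity:Node ;
    rdfs:range  connectivity:Node .
connectivity:Node
    a owl:Class .
"""

messages = [
    {"role": "user",
     "content": "I am going to use a RDF graph with this ontology:\n" + ontology},
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])
```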

Having provided the ontology as the context, I can now ask it to prepare some SPARQL:

Please provide a SPARQL query that tells me everything a particular node is connected to.

And in response, I get the correct answer:

Sure! Assuming you know the URI of the node you are interested in, you can use the following SPARQL query to retrieve all the nodes that it is connected to using the "connectionTo" object property:
```
PREFIX connectivity: <http://example.org/ontology#>
SELECT ?connectedNode
WHERE {
<http://example.org/node1> connectivity:connectedTo ?connectedNode .
}
```
In this example, the URI `<http://example.org/node1>` is used as the subject of the query. You can replace it with the URI of the node you are interested in. The query uses the "connectedTo" object property to retrieve all the nodes that the subject is connected to, which are represented by the variable "?connectedNode."
If you run this query, it will return a list of URIs representing all the nodes that the subject is connected to. You can modify the query to retrieve additional information about these nodes, such as their labels or other properties, depending on what information you have available in your RDF graph.

But that is not the actual question I want to ask, so I politely ask another:

Politeness is important when talking with an LLM. You never know when you will be working for the LLM.

Great. Now a more complex query. Please provide the SPARQL query that returns the sequence of nodes that connect between two particular nodes.

To which it returns:

Sure! Here is an example of a SPARQL query that returns the sequence of nodes that connect two particular nodes, assuming that there is a unique path between them:
```
PREFIX connectivity: <http://example.org/ontology#>
SELECT ?path
WHERE {
{
<http://example.org/node1> (connectivity:connectedTo)+ ?path .
?path connectivity:connectedTo <http://example.org/node2> .
}
}
```
This query uses the property path syntax to find all paths that connect the two nodes. The syntax `(connectivity:connectedTo)+` means that we are looking for one or more occurrences of the "connectedTo" property, which allows us to traverse a path of arbitrary length.
The query returns the variable "?path," which represents the sequence of nodes that connect the two nodes. This variable will contain a list of URIs representing the nodes in the path.
Note that this query assumes that there is a unique path between the two nodes. If there are multiple paths, or if there are cycles in the graph, the query may return multiple results or an incorrect result.

Subject to the LLM’s own caveat, this is an impressive query, limited only by the expressive power of SPARQL to answer the question.
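
If one wanted to try the generated query against the graph, a minimal sketch with rdflib (reusing the assumed graph `g` and substituting the plant URIs for the placeholders) might look like this; note, per the caveat, that ?path binds to individual reachable nodes that touch the end node rather than an ordered route.

```
# Running the LLM-generated query against the assumed rdflib graph `g`.
# ?path binds to nodes reachable from Valve-104 that connect directly to
# Reflux-401, not to a full ordered path.
generated = """
PREFIX :             <http://example.org/plant#>
PREFIX connectivity: <http://example.org/connectivity#>
SELECT ?path
WHERE {
    :Valve-104 (connectivity:connectedTo)+ ?path .
    ?path connectivity:connectedTo :Reflux-401 .
}
"""
for row in g.query(generated):
    print(row.path)
```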

Ask an LLM … to query the RDF graph

Instead of providing the ontology as the context to the LLM, why not provide the RDF graph itself as the context? Then the LLM has everything that it needs to answer the question itself.

This aligns well with the language foundation of any LLM, since an RDF graph can be thought of as a highly formalized language description of the data, having been built from subject-predicate(verb)-object statements or sentences, just like any other language.

Thus one could rewrite the RDF graph as:


Reboiler-601 is connected to Pipe-005.
Reboiler-601 is connected to Pipe-006.
Reboiler-601 is connected to Pipe-012.
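
As a sketch, assuming the rdflib graph `g` and CONNECTIVITY namespace from the earlier snippets, such sentences could be generated from the graph like this (the exact wording is illustrative):

```
# Verbalise the graph's connectedTo edges as English sentences for use as an
# LLM context prompt (assuming the rdflib graph `g` and namespaces above).
def local_name(uri):
    """Return the fragment after '#' or the last '/' of a URI."""
    text = str(uri)
    return text.split("#")[-1].split("/")[-1]

sentences = [
    f"{local_name(s)} is connected to {local_name(o)}."
    for s, o in g.subject_objects(CONNECTIVITY.connectedTo)
]
print("\n".join(sentences))
# Reboiler-601 is connected to Pipe-005.
# Reboiler-601 is connected to Pipe-006.
# ...
```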

The initial context we provide is as follows:

Let's introduce the concept of Node.
BL-051,BL-052,…, are all individuals of type Node in the ontology.

Reboiler-601 is connected to Pipe-005.
Reboiler-601 is connected to Pipe-006.
Reboiler-601 is connected to Pipe-012.

Let’s start off with some simple questions:

What is connected to or from Distill-501?

To which we get the correct answer:

Distill-501 is connected to several different nodes:
1. Distill-501 is connected to Pipe-002.
2. Distill-501 is connected to Pipe-005.

8. Distill-501 is connected to SignalLine-208.
9. Distill-501 is connected to SignalLine

Following up with another low-ball question:

How many things are connected to Distill-501?

The LLM response is a little tetchy:-)

Distill-501 is connected to a total of 9 different things, as listed in my previous response.

Now for the real question:

How is Distill-501 connectedTo Reflux-401?

Unfortunately GPT answers incorrectly:

Distill-501 is connected to Reflux-401 through Pipe-022.

So we are forced to correct it:

No. Distill-501 is connected to Pipe-016, which is connected to Condenser-303, which is connected to Pipe-022, which then connects to Reflux-401.

To which it responds:

My apologies for the mistake. You are correct, Distill-501 is connected to Reflux-401 through Pipe-016, Condenser-303, and Pipe-022, as you described. Thank you for correcting me.

This prompt/response/correction cycle continues for several iterations, as suggested by ‘chain-of-thought’ prompting, and each time the LLM improves its accuracy. Eventually, this question is asked:

So how is Valve-104 connected To Reflux-401?

To which it answers, correctly and completely:

Valve-104 is indirectly connected to Reflux-401 through two different paths:
1. Valve-104 is directly connected to Pipe-011, which is connected to Reboiler-601. Reboiler-601 is then indirectly connected to Reflux-401 through Pipe-013, Distill-501, Pipe-016, Condenser-303, and Pipe-022.
2. Valve-104 is directly connected to SignalLine-212, which is connected to TempCont-901. TempCont-901 is then indirectly connected to Reflux-401 through SignalLine-213, Distill-501, Pipe-018,…

This is now really promising, because it gets the correct answers even though there are multiple routes.
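
The prompt/response/correction loop above can also be scripted; here is a minimal sketch, assuming the openai v0.x ChatCompletion API and that `sentences` holds the verbalised graph from the earlier sketch, with the running `messages` list carrying the context, the model’s answers, and the corrections into each turn:

```
# Multi-turn questioning with corrections, assuming the openai Python package
# (v0.x ChatCompletion API) and the `sentences` list produced earlier.
import openai

context = "\n".join(sentences)          # the "X is connected to Y." sentences
messages = [{"role": "user", "content": context}]

turns = [
    "How is Distill-501 connectedTo Reflux-401?",
    "No. Distill-501 is connected to Pipe-016, which is connected to Condenser-303, "
    "which is connected to Pipe-022, which then connects to Reflux-401.",
    "So how is Valve-104 connected To Reflux-401?",
]
for turn in turns:
    messages.append({"role": "user", "content": turn})
    reply = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    answer = reply["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})
    print(answer)
```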

Scaling

Interacting with the OpenAI models using the above interface is limited by the size of the context that can be provided. This context limit is roughly 3 to 10 pages of text, depending on the version of the model chosen. The limit is increasing as more models are released, but will it ever be large enough to accommodate the linguistic version of a realistic graph? An alternative is required to allow this approach to scale.

Scaling … by pre-indexing the RDF graph

One solution to an overly large context is to pre-index the documents you want as the context, and then add only those fragments of the documents that match the question being asked.

For example, if asking the question:

So how is Valve-104 connected To Reflux-401?

The index should include only those document fragments that reference Valve-104 and Reflux-401. So what is the ‘document’ equivalent of the RDF graph? One way of creating such a document is to use a linguistically expressed Concise Bounded Description (CBD) for each of the entities within the graph. This can be queried from the graph to produce the following, repeated for each entity in the graph.

Distill-501 
has connectedTo SignalLine-213,
which has connectedTo TempCont-901,
which is connectedTo of Distill-501,
has connectedTo Pipe-013,
which has connectedTo Reboiler-601,
which is connectedTo of Distill-501,

is connectedTo of Pipe-014,
which has connectedTo Distill-501,
is connectedTo of Pipe-018,
which has connectedTo Distill-501,
is connectedTo of Pipe-005,
which has connectedTo Distill-501.
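
A sketch of producing such per-entity descriptions, reusing the rdflib graph `g`, the CONNECTIVITY namespace, and the local_name helper from the earlier snippets (it walks two hops out from each entity, and the phrasing is purely illustrative):

```
# Build a linguistically expressed, CBD-like description for each entity.
def describe(entity):
    lines = [local_name(entity)]
    for nbr in g.objects(entity, CONNECTIVITY.connectedTo):
        lines.append(f"  has connectedTo {local_name(nbr)},")
        for nbr2 in g.objects(nbr, CONNECTIVITY.connectedTo):
            lines.append(f"    which has connectedTo {local_name(nbr2)},")
    for nbr in g.subjects(CONNECTIVITY.connectedTo, entity):
        lines.append(f"  is connectedTo of {local_name(nbr)},")
    return "\n".join(lines)

entities = set(g.subjects(CONNECTIVITY.connectedTo, None)) \
         | set(g.objects(None, CONNECTIVITY.connectedTo))
# The concatenated descriptions become the 'document' that is pre-indexed below.
cbd_text = "\n\n".join(describe(entity) for entity in entities)
```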

We can then pre-index the CBD file and ask the question:

# Pre-index the CBD document with llamaIndex and query it
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex

documents = SimpleDirectoryReader("Flowsheet.ICB2D").load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
response = index.query("How is Valve-104 connected To Reflux-401?")
print(response)

And the response is …

Valve-104 is not directly connected to Reflux-401, but it is connected to SignalLine-213, which is connected to TempCont-901, which is connected to SignalLine-207, which is connected to TempCont-951, which is connected to SignalLine-208, which is connected to TempCont-952, which is connected to SignalLine-209, which is connected to TempCont-953, which is connected to Reflux-401.

… wrong!

So what went wrong? Creating CBDs for each entity within the graph might not work well when we are pursuing paths through the graph. Each CBD is like a mini-graph centered on an entity, so the LLM only receives context containing these mini-graphs, whereas we want to find a path through the whole graph. It is therefore likely that this llamaIndex approach is not a good match for this question or this type of graph. It is, however, likely to work well when the graph’s information is clustered around individual entities.

I’m not giving up on this approach, as other graph serializations need to be investigated. We could serialize the paths within a graph, an approach postulated in Neural Graph Databases and RDF2Vec, so if interested, watch this space for updates.
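
As a taste of what that might look like, here is a sketch of serializing random walks over the connectedTo edges, in the spirit of RDF2Vec, each rendered as a sentence for pre-indexing (reusing `g`, CONNECTIVITY, and local_name from the earlier snippets; the walk length and number of walks are arbitrary choices):

```
# Serialize random walks through the graph as sentences for pre-indexing.
import random

def random_walk(start, length=6):
    path = [start]
    while len(path) <= length:
        neighbours = list(g.objects(path[-1], CONNECTIVITY.connectedTo)) \
                   + list(g.subjects(CONNECTIVITY.connectedTo, path[-1]))
        if not neighbours:
            break
        path.append(random.choice(neighbours))
    return " is connected to ".join(local_name(n) for n in path) + "."

nodes = list(set(g.subjects(CONNECTIVITY.connectedTo, None)))
walk_sentences = [random_walk(random.choice(nodes)) for _ in range(100)]
```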

Scaling … by fine-tuning with the RDF graph

An off-the-shelf LLM is not tuned to your domain’s information. We have seen that its generic knowledge corpus can be supplemented by providing context information about the domain; however, that context is severely limited in size.

An LLM can be fine-tuned with an unlimited amount of training data … at a cost.

Training data needs to be gathered, cleaned, and prepared. Preparing the data for fine-tuning involves supplying a series of prompts and their corresponding completions:

{"prompt": "The cat sat on?", "completion": " the mat"}

In other words, the prompt is the start of a sentence, whilst the completion is the end of that sentence. In its simplest form, a prompt:completion pair seems to correspond to an edge of an RDF graph. So the following graph fragment:

:Reboiler-601
    connectivity:connectedTo :Pipe-005 ;
    connectivity:connectedTo :Pipe-006 ;
    connectivity:connectedTo :Pipe-012 ;
.

can be used as fine-tuning prompts as follows:

{"prompt":" What is Reboiler-601 connectedTo?","completion": " Reboiler-601 is connectedTo Pipe-005 "}
{"prompt":" What is Reboiler-601 connectedTo?","completion": " Reboiler-601 is connectedTo Pipe-006 "}
{"prompt":" What is Reboiler-601 connectedTo?","completion": " Reboiler-601 is connectedTo Pipe-012 "}
{"prompt":" What is connectedTo Reboiler-601?","completion": " Pipe-011 is connectedTo Reboiler-601 "}
{"prompt":" What is connectedTo Reboiler-601?","completion": " Pipe-013 is connectedTo Reboiler-601 "}

So far answers to even the simplest questions have been disappointing.

The successful questioning of an LLM given the graph as context used gpt-3.5-turbo, which is well tuned for such chain-of-thought prompting. Unfortunately, fine-tuning is only available on simpler, less well-tuned base models.

Where to next?

The value of information is proportional to the square of the number of connections between things: a variant of Metcalfe’s law.

  • Therefore the integration of one’s domain graph with the information within an LLM creates an explosive increase in the value of one’s domain knowledge graph.

Being on the bleeding edge means you must expect to get injured.

  • None of this is yet working ‘out-of-the-box’, but the successes exceed the failures, so far, and it is only going to improve.
  • Not all the examples cited could use the same model. GPT-3.5-turbo was great for prompted training of graph paths, but fine-tuning was only available using a less well-tuned base model.

There are many further ways of fine-tuning an LLM with a Knowledge Graph.

  • Fine-tuning could use serialized statements combined with Concise Bounded Descriptions.
  • Pre-indexing could use serialized random walks through the graph.

Knowledge graphs look, IMHO, like the perfect precursor for preparing LLM training data, whether for pre-prompting (demonstrated) or fine-tuning (TBD).

--


Peter Lawrence, answering users' data questions

Providing solutions that answer users’ information questions using database technology (SQL, RDF, knowledge graphs) for the industrial, legal, and scientific domains.