Relationship Graphs using LLMs with Retrieval-Augmented Generation (RAG) and a vector database

Yogesh
5 min read · Feb 29, 2024


Introduction

A relationship graph is a useful representation for understanding and analyzing data. Enterprises have large amounts of data in unstructured formats, such as PDFs and HTML. Extracting relations from text has long been a common task in Natural Language Processing (NLP). Using Large Language Models (LLMs), we can effectively extract, store, and analyze relations from text. When the extracted relationships are stored as structured data (e.g., in a graph format), further analysis and inference become easier.

Understanding the relationships of a chemical compound is one of the many use cases of relation/knowledge graphs in the biomedical field. Below, we will explore building relationship graphs for one compound, “Cetirizine”. I downloaded the following three documents from PubMed using the keyword search “Cetirizine”:

https://www.ncbi.nlm.nih.gov/books/NBK548420/?report=printable
https://www.ncbi.nlm.nih.gov/books/NBK549776/?report=printable
https://www.ncbi.nlm.nih.gov/books/NBK501509/?report=printable

I extracted the texts and the images from the above HTML documents and saved the references in the database.
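As a rough sketch, this extraction step can be done with BeautifulSoup. The function and file names below are illustrative assumptions, not the repository's actual code.

from pathlib import Path
from bs4 import BeautifulSoup

def extract_text_and_images(html_path):
    """Return the visible text and the image references of an HTML document."""
    soup = BeautifulSoup(Path(html_path).read_text(encoding="utf-8"), "html.parser")
    # Drop script/style noise before pulling the text
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = " ".join(soup.get_text(separator=" ").split())
    images = [img.get("src") for img in soup.find_all("img") if img.get("src")]
    return text, images

text, image_refs = extract_text_and_images("NBK548420.html")  # hypothetical local file name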

Application

You can view the source code for the application in this repository.

This application requires PostgreSQL with the pgvector extension. For more details on the background and setup, refer to this post and the related GitHub repository.

High level steps

  1. Read the text documents
    — Vectorize and upload text chunks to the vector database (a sketch of this step follows the list)
    — Use LLM to generate relationship graph on the sentence, and save to the database
  2. When answering user queries
    — Use LLM with RAG
    — Show related documents
    — Show images from the documents
    — Show relationship graph
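
For step 1, a minimal sketch of vectorizing chunks and uploading them to a pgvector table might look like the following. The embedding model, connection string, and table schema here are assumptions made for illustration; the application's actual schema is in the linked repository.

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")       # 384-dimensional embeddings

conn = psycopg2.connect("dbname=ragdb user=postgres")    # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
                       id serial PRIMARY KEY,
                       doc text,
                       chunk text,
                       embedding vector(384));""")
register_vector(conn)                                     # lets psycopg2 pass numpy arrays as vectors

def store_chunk(doc_name, chunk_text):
    """Embed one text chunk and insert it into the pgvector table."""
    emb = embedder.encode(chunk_text)
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO chunks (doc, chunk, embedding) VALUES (%s, %s, %s);",
                    (doc_name, chunk_text, emb))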

1. LLM to build relationship graph

This is the interesting part of the application. We extract each relationship as a triplet <subj> <relation> <obj>, and we ask the LLM to generate it for us. Here is the relevant LLM code.

import transformers
import spacy
import torch

msg_tmplt = [{"role": "system", "content": ""},
             {"role": "user", "content": ""}]
# Request LLM for an entity relation triplet
system_content = """Translate the user content as entity relation triplet in
{"subj": "", "relation": "", "obj": ""} json format."""
msg_tmplt[0]['content'] = system_content

pipeline = transformers.pipeline("text-generation",
                                 model="HuggingFaceH4/zephyr-7b-beta",
                                 torch_dtype=torch.bfloat16,
                                 device_map="auto",
                                 )
gconfigdct = pipeline.model.generation_config.to_dict()
gconfigdct["do_sample"] = True
gconfigdct["top_k"] = 50
gconfigdct["top_p"] = 0.95
gconfigdct["pad_token_id"] = pipeline.model.config.eos_token_id
gconfigdct["temperature"] = .1
gconfigdct["max_new_tokens"] = 512
gconfig = transformers.GenerationConfig(**gconfigdct)

def show_triplet(text):
    # scispaCy biomedical model; install en_core_sci_lg separately
    prsr = spacy.load("en_core_sci_lg")
    doc = prsr(text)
    pos = {tkn.pos_ for tkn in doc}
    # Generate relations only on sentences with fewer than 25 (default) tokens
    # (else the generated relations can get too complicated) and that contain
    # at least one noun and one verb
    if len(doc) < 25 and 'NOUN' in pos and 'VERB' in pos:
        msg_tmplt[1]['content'] = text
        prompt = pipeline.tokenizer.apply_chat_template(msg_tmplt, tokenize=False,
                                                        add_generation_prompt=False)
        outputs = pipeline(prompt, generation_config=gconfig)
        res = outputs[0]["generated_text"].split("<|assistant|>\n")[1]
        print(f"Generated triplet: {res}")

Note:

  1. Domain specific answers should be based on existing literature. Hence, the temperature is set to a low value, 0.1. This keeps the LLM from becoming too creative.
  2. We use the scispaCy large model “en_core_sci_lg”, as it is geared toward biomedical data.

A few outputs

>>> text = "Cetirizine and levocetirizine have been linked to rare, isolated instances of clinically apparent acute liver injury."
>>> show_triplet(text)
Generated triplet: {
"subj": "Cetirizine and levocetirizine",
"relation": "associated with",
"obj": "rare, isolated instances of clinically apparent acute liver injury"
}


>>> text = "Although considered to be nonsedating antihistamines, cetirizine and levocetirizine can cause mild drowsiness particularly at higher doses."
>>> show_triplet(text)
Generated triplet: {
"subj": "cetirizine",
"relation": "can_cause",
"obj": "mild drowsiness"
},
{
"subj": "levocetirizine",
"relation": "can_cause",
"obj": "mild drowsiness"
},
{
"subj": "cetirizine",
"relation": "is_considered",
"obj": "nonsedating antihistamine"
},
{
"subj": "levocetirizine",
"relation": "is_considered",
"obj": "nonsedating antihistamine"
}

>>> text = "Cetirizine and its enantiomer levocetirizine are second generation antihistamines that are used for the treatment of allergic rhinitis, angioedema and chronic urticaria."
>>> show_triplet(text)
Generated triplet: {
"subj": "Cetirizine",
"relation": "is a",
"obj": "second generation antihistamine"
},
{
"subj": "Levocetirizine",
"relation": "is",
"obj": "enantiomer of cetirizine"
},
{
"subj": "Cetirizine",
"relation": "and",
"obj": "levocetirizine"
},
{
"subj": "Cetirizine",
"relation": "are",
"obj": "used for the treatment of allergic rhinitis, angioedema and chronic urticaria"
}

Note: the JSON output may not be consistent; e.g., <obj> can be a string, a dictionary, or a list. Refer to save_chunk_relations in the repository to see how these scenarios are handled.
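
As a rough illustration of the kind of normalization save_chunk_relations has to do (this is a sketch, not the repository's implementation), the triplets can be coerced into flat (subj, relation, obj) tuples before saving:

import json

def normalize_triplets(llm_output):
    """Coerce the LLM's triplet output into a flat list of (subj, relation, obj) tuples.
    Sketch only: the repository's save_chunk_relations handles more edge cases."""
    raw = llm_output.strip()
    try:
        # The model sometimes emits several comma-separated objects; wrap them in a list
        data = json.loads(raw if raw.startswith("[") else "[" + raw + "]")
    except json.JSONDecodeError:
        return []
    triplets = []
    for item in (data if isinstance(data, list) else [data]):
        if not isinstance(item, dict):
            continue
        subj, rel, obj = item.get("subj"), item.get("relation"), item.get("obj")
        # <obj> may itself be a dict or a list; flatten it to one string per triplet
        if isinstance(obj, dict):
            obj = ", ".join(str(v) for v in obj.values())
        for o in (obj if isinstance(obj, list) else [obj]):
            if subj and rel and o:
                triplets.append((str(subj), str(rel), str(o)))
    return triplets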

There are other approaches to extracting relations:
- Custom rules: building common rules that hold across a wide variety of enterprise texts is not easy.
- Other models, e.g. REBEL.

These approaches may (or may not) work for your task. For example, the relationships extracted from the same texts using REBEL fall a bit short.

REBEL output on the same texts

>>> from transformers import pipeline
>>> t_extract = pipeline('text2text-generation', model='Babelscape/rebel-large' )
>>>
>>>
>>> text = "Cetirizine and levocetirizine have been linked to rare, isolated instances of clinically apparent acute liver injury."
>>> e_text = t_extract.tokenizer.batch_decode([t_extract(text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
>>> print(e_text[0])
<s><triplet> Cetirizine <subj> liver injury <obj> has effect
<triplet> levocetirizine <subj> liver injury <obj> has effect
<triplet> liver injury <subj> Cetirizine <obj> has cause <subj> levocetirizine <obj> has cause</s>


>>> text = "Although considered to be nonsedating antihistamines, cetirizine and levocetirizine can cause mild drowsiness particularly at higher doses."
>>> e_text = t_extract.tokenizer.batch_decode([t_extract(text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
>>> print(e_text[0])
<s><triplet> cetirizine <subj> antihistamines <obj> subject has role
<triplet> levocetirizine <subj> antihistamines <obj> subject has role</s>


>>> text = "Cetirizine and its enantiomer levocetirizine are second generation antihistamines that are used for the treatment of allergic rhinitis, angioedema and chronic urticaria."
>>> e_text = t_extract.tokenizer.batch_decode([t_extract(text, return_tensors=True, return_text=False)[0]["generated_token_ids"]])
>>> print(e_text[0])
<s><triplet> Cetirizine <subj> urticaria <obj> medical condition treated
<triplet> urticaria <subj> Cetirizine <obj> drug used for treatment</s>

2. On user query, show the relationship graph

When a user submits a query, it is tokenized and normalized. We then query the vector database for texts similar to the user query to get the list of documents, the list of chunks, and the relations of the sentences in those chunks.
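
The similarity lookup itself can be expressed directly in SQL with pgvector's distance operator. Below is a rough sketch that reuses the illustrative table and embedder from the earlier sketch; the real application keeps its SQL statements in a dbo_stmts dictionary, as the snippet further down shows.

import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
conn = psycopg2.connect("dbname=ragdb user=postgres")    # hypothetical connection string
register_vector(conn)

def similar_chunks(user_query, top_k=5):
    """Return ids and texts of the stored chunks closest to the user query."""
    qemb = embedder.encode(user_query)
    with conn, conn.cursor() as cur:
        # '<->' is pgvector's L2 distance operator; order by distance to the query embedding
        cur.execute("""SELECT id, doc, chunk
                         FROM chunks
                        ORDER BY embedding <-> %s
                        LIMIT %s;""", (qemb, top_k))
        return cur.fetchall()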

With the relation results from the database, we build the relationship graph using graphviz and serve it up as a base64-encoded JPEG image embedded in HTML.

import graphviz
from base64 import b64encode

# Get relations for the sentences of the chunks
# (inside the query handler; self.emb wraps the database helper)
dbqry = self.emb.dbo_stmts['sim_chunks']
dbqry = dbqry.replace("{qargs}", ','.join(str(i) for i in sim_chunk_ids))
dbres = self.emb.dbexec(dbqry, None, "Get chunk relations")
sim_chunk_lst = []
for each in dbres:
    for row in each:
        for itm in row:
            sim_chunk_lst.append((itm["subj"], itm['obj'], itm['relation']))

# Build graph, return as a base64-encoded jpeg image embedded in HTML
grph = graphviz.Digraph('wide')
for row in sim_chunk_lst:
    grph.edge(row[0].lower(), row[1].lower(), row[2].lower())
unflt = grph.unflatten(stagger=5)
grph_html = "<h2>Relations graph</h2><div style='max-width:100%; max-height:720px; overflow:auto'>"
grph_img = '<img src="data:image/jpeg;base64,%s"/></div>'
grph_html += grph_img % (b64encode(unflt._repr_image_jpeg()).decode('ascii'))

Results

Refer to the Jupyter notebook, LLM-RAG-GRAPH, in the repo for the Gradio widget.
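
As a minimal illustration of what such a widget can look like (the handler below is a stand-in, not the notebook's actual pipeline code):

import gradio as gr

# Placeholder handler: the real notebook wires this to the RAG pipeline that
# returns the answer, related documents, images and relations graph as HTML
def answer_query(query: str) -> str:
    return f"<h2>Results for: {query}</h2>"   # stand-in for the pipeline output

demo = gr.Interface(fn=answer_query,
                    inputs=gr.Textbox(label="Query"),
                    outputs=gr.HTML(label="Results"))
demo.launch()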

Here are the results.

Generated jpeg graph: Full resolution

Relationship graph

With a query typo

Generated jpeg graph: Full resolution

Relationship graph

Final Note

You can try exploring:

  • Generating RDF graphs with an LLM (it can get complicated)
  • Storing/retrieving relationship graphs in a graph DB
  • Integrating the LLM-generated relations with your enterprise ontologies
