Building Knowledge Graphs: REBEL, LlamaIndex, and REBEL + LlamaIndex

Exploring building knowledge graphs using LlamaIndex and NebulaGraph

Saurav Joshi
10 min read · Oct 3, 2023

Knowledge graphs have become indispensable tools for leading tech companies, powering recommendation systems, search engines, and a multitude of diverse applications. While large pre-trained language models have demonstrated the capacity to embed factual knowledge and excel in various NLP tasks, they still encounter limitations, especially when required to access and manipulate knowledge with precision. One promising avenue to address this limitation is Retrieval Augmented Generation (RAG), which introduces non-parametric knowledge to these models through dense information retrieval. Interestingly, leveraging knowledge graphs for retrieval has shown potential in further enhancing the efficacy of this approach. In this article, we will construct knowledge graphs using three distinct methodologies: Relation Extraction By End-to-end Language generation (REBEL), LlamaIndex, and a combination of both REBEL and LlamaIndex. In future articles, we will delve deeper into LlamaIndex’s intricate functionalities concerning knowledge graphs.

Use Case

We will use Wikipedia pages from HuggingFace to build knowledge graphs. Wikipedia offers a rich source of entities and relations, foundational for building a robust knowledge graph. An important step involves segmenting these extensive passages using LangChain, detailed further in the “Implementation” section. Our high-level architectural framework looks like this:

System Architecture: 3 different methods — REBEL, LlamaIndex, and REBEL + LlamaIndex to construct knowledge graphs

If you’re well-versed with Knowledge Graphs and LlamaIndex, feel free to jump to the “Implementation” section. For newcomers, please continue reading.

Knowledge Graph (KG)

A knowledge graph is a way of organizing and connecting information in a graph format, where nodes represent entities, and edges represent the relationships between those entities. The graph structure allows for efficient storage, retrieval, and analysis of data. In the real world, knowledge graphs fuel advanced search engines, recommendation systems, and AI-driven decision-making, addressing challenges in sectors like e-commerce, healthcare, and media by enabling more insightful data-driven conclusions.
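
At its simplest, a knowledge graph can be represented in plain Python as a set of (head, relation, tail) triples; here is a toy sketch (the triples are illustrative):

triples = [
    ("1860 United States presidential election", "candidate", "Abraham Lincoln"),
    ("Somerset County, Pennsylvania", "located in", "Pennsylvania"),
]
# Nodes are the distinct entities; labelled edges are the relations between them.
nodes = {entity for head, _, tail in triples for entity in (head, tail)}
print(nodes)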

NebulaGraph

NebulaGraph is an open-source distributed, scalable, and high-performance graph database designed to manage vast amounts of interconnected data. NebulaGraph has been widely used for social media, recommendation systems, knowledge graphs, security, capital flows, fraud detection, AI, etc. NebulaGraph services need to run locally, much like Neo4j. You can find different ways to install NebulaGraph here. NebulaGraph Query Language (nGQL) is a declarative graph query language for NebulaGraph.
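
As a point of reference, here is a minimal connection sketch using the nebula3-python client, assuming a locally running instance on the default port with the default root/nebula credentials (later in this article we use the %ngql Jupyter magic instead):

from nebula3.Config import Config
from nebula3.gclient.net import ConnectionPool

# Connect to a locally running NebulaGraph instance (default address and credentials).
pool = ConnectionPool()
pool.init([("127.0.0.1", 9669)], Config())
session = pool.get_session("root", "nebula")
print(session.execute("SHOW SPACES;"))
session.release()
pool.close()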

Relation Extraction By End-to-end Language generation (REBEL)

REBEL, a relation extraction model developed by Babelscape, uses the BART model to convert raw sentences into relation triplets. Trained on over 200 relation types, it uses a dataset derived from Wikipedia and Wikidata, filtered with a RoBERTa model. The paper highlights REBEL’s efficiency in extracting vital information for applications like knowledge base population. By leveraging autoregressive seq2seq models, REBEL offers streamlined, top-tier performance in relation extraction.

LlamaIndex

LlamaIndex is an open-source project designed to facilitate in-context learning. The toolkit offers data loaders that serialize diverse knowledge sources like PDFs, Wikipedia pages, and Twitter into a standardized format, eliminating the need for manual preprocessing. With a single line of code, LlamaIndex generates and stores embeddings, whether in memory or in vector databases. In addition to VectorStoreIndex, LlamaIndex provides KnowledgeGraphIndex, which automates the construction of knowledge graphs from raw text and enables precise entity-based querying. This capability enhances search efficiency, especially in contexts requiring broader, cross-node information. A simple demonstration of the need for a Knowledge Graph Index is provided in this article here:

See the following diagram: assume the question is about x, and 20 of all the pieces of data are highly related to it. We could still get the top 3 pieces of the doc (say, no. 1, 2, and 96) as the main context to be sent; apart from that, we ask for two hops of graph traversal around x from the knowledge graph. The full context will then be:

The question “Tell me things about the author and x”

Raw docs from pieces number 1, 2, and 96 (in LlamaIndex, these are called node 1, node 2, and node 96).

10 knowledge triplets containing "x" from a two-hop graph traversal:

x -> y (from node 1)

x -> a (from node 2)

x -> m (from node 4)

x <- b -> c (from node 95)

x -> d (from node 96)

n -> x (from node 98)

x <- z <- i (from node 1 and node 3)

x <- z <- b (from node 1 and node 95)

And clearly, the refined information related to topic x, coming both from other nodes and from across nodes, is included in the context used to build the prompt for in-context learning.
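
This retrieval pattern is easy to picture in plain Python. The sketch below is not LlamaIndex's internals, just an illustration of combining top-k chunks with two-hop triplets around an entity (the function names here are made up for illustration):

def two_hop_triplets(triples, entity):
    """Collect triples within two hops of `entity` from a list of (head, relation, tail)."""
    one_hop = {t for t in triples if entity in (t[0], t[2])}
    neighbors = {n for t in one_hop for n in (t[0], t[2]) if n != entity}
    two_hop = {t for t in triples if t[0] in neighbors or t[2] in neighbors}
    return one_hop | two_hop

def build_prompt(question, top_chunks, triples, entity):
    """Illustrative prompt assembly: top-k chunks plus graph facts around `entity`."""
    kg_facts = [f"{h} -[{r}]-> {t}" for h, r, t in sorted(two_hop_triplets(triples, entity))]
    return "\n\n".join(["Context documents:", *top_chunks,
                        "Knowledge graph facts:", *kg_facts,
                        f"Question: {question}"])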

REBEL + LlamaIndex

Utilizing REBEL alongside LlamaIndex offers a refined approach to knowledge graph construction and querying. While LlamaIndex excels in triplet extraction and querying, its default LLM-driven process can be resource-intensive. By integrating REBEL, a model adept at efficient relation extraction, the process becomes more streamlined. This hybrid method ensures faster retrievals with minimized token usage, demonstrating the practicality of combining specialized tools for improved efficiency.

Implementation

In this section, we’ll explore how knowledge graphs are constructed using three approaches: REBEL, LlamaIndex, and a combination of REBEL + LlamaIndex. Our primary focus will be on evaluating the resulting triplets and their count from each method. While I’ve also demonstrated querying the knowledge graphs using LlamaIndex’s Knowledge Graph Query Engine, our main emphasis remains on the building process.

Refer to my GitHub repo for the complete Jupyter notebook to build knowledge graphs.

Step 1: Data Preparation

Here we’re setting up essential tools and libraries. We’ll use these for handling datasets, tokenizing input, managing text chunks, and working with sequence-to-sequence language models.

import os
import random
import json
import hashlib
from datasets import load_dataset
from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm.auto import tqdm
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

Next, we begin by loading Wikipedia data from HuggingFace. Then, we randomly sample 10 rows from the test data as our primary source. Leveraging the bert-base-uncased tokenizer, we determine the length of tokenized content. The main objective here is to break the content into manageable chunks using the RecursiveCharacterTextSplitter from LangChain. Once processed, the chunks are saved in a jsonl format for subsequent operations.

validation_data, test_data = load_dataset("suolyer/pile_wikipedia", split=['validation', 'test'])

data = []
random_rows = random.sample(range(len(test_data)), 10)
build_data = [test_data[val]['text'] for val in random_rows]

m = hashlib.md5()
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def bert_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

def create_chunk_dataset(content):
    m.update(content.encode('utf-8'))
    uid = m.hexdigest()[:12]
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=400,
        chunk_overlap=40,
        length_function=bert_len,
        separators=['\n\n', '\n', ' ', ''],
    )
    chunks = text_splitter.split_text(content)
    for i, chunk in enumerate(chunks):
        data.append({
            'id': f'{uid}-{i}',
            'text': chunk
        })

for dt in build_data:
    create_chunk_dataset(dt)

filename = '../data/knowledge graphs/rebel_llamaindex/wiki_chunks.jsonl'

# save the chunks to a JSONL file, one record per line
with open(filename, 'w') as outfile:
    for x in data:
        outfile.write(json.dumps(x) + '\n')
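
As a quick, optional sanity check, we can read the first chunk back and confirm its token length stays roughly within the 400-token target (IDs are derived from content hashes, so yours will differ):

# Optional sanity check: read the first chunk back and inspect it.
with open(filename) as f:
    first_chunk = json.loads(f.readline())
print(first_chunk["id"], bert_len(first_chunk["text"]))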

Step 2: REBEL

Here we extract relation triplets from the given text using the REBEL model. A utility function, extract_triplets, is defined to parse the model's output and extract relation triplets. The tokenizer and model are also initialized from Babelscape/rebel-large.

def extract_triplets(text):
    """Parse REBEL's linearized output (<triplet> head <subj> tail <obj> relation) into triplet dicts."""
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(), 'tail': object_.strip()})
    return triplets

tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
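
To make the parsing concrete, here is a hand-constructed example of REBEL's linearized output format (head, then tail, then relation between the special tokens); actual model output will of course vary:

# Hand-written sample in REBEL's linearized format (illustrative, not real model output).
sample = "<s><triplet> 1860 United States presidential election <subj> Abraham Lincoln <obj> candidate</s>"
print(extract_triplets(sample))
# [{'head': '1860 United States presidential election', 'type': 'candidate', 'tail': 'Abraham Lincoln'}]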

With these tools in place, the main logic in the subsequent snippet tokenizes and processes batches of text data. Each processed batch undergoes model generation, which is then decoded and passed to extract_triplets to isolate the relation triplets. The unique set of these extracted triplets is finally saved to a JSON file.

gen_kwargs = {
    "max_length": 256,
    "length_penalty": 0,
    "num_beams": 3,
    "num_return_sequences": 1,
}

triples = []

def generate_triples(texts):
    model_inputs = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
    generated_tokens = model.generate(
        model_inputs["input_ids"].to(model.device),
        attention_mask=model_inputs["attention_mask"].to(model.device),
        **gen_kwargs
    )
    # keep special tokens so extract_triplets can find the <triplet>/<subj>/<obj> markers
    decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=False)
    for idx, sentence in enumerate(decoded_preds):
        et = extract_triplets(sentence)
        for t in et:
            triples.append((t['head'], t['type'], t['tail']))

# process the chunks in batches of two
for i in tqdm(range(0, len(data), 2)):
    try:
        texts = [data[i]['text'], data[i+1]['text']]
    except IndexError:
        texts = [data[i]['text']]
    generate_triples(texts)

distinct_triples = list(set(triples))

# save the distinct triples
with open('../data/knowledge graphs/rebel_llamaindex/rebel_triples.json', 'w') as file:
    json.dump(distinct_triples, file)

Let us now print a few of the triples extracted by REBEL, which make up the knowledge graph, and also look at the number of triples present in the knowledge graph.

distinct_triples[:5]
[['Edward III', 'child', 'John of Gaunt'],
['Playing God', 'cast member', 'David Duchovny'],
["Union–Republican People's Commissariat of the Armed Forces of the Soviet Union",
'replaces',
"People's Commissariat of the Navy of the Soviet"],
['Somerset County, Pennsylvania',
'located in the administrative territorial entity',
'Pennsylvania'],
['1860 United States presidential election', 'candidate', 'Abraham Lincoln']]
len(distinct_triples)
43

Step 3: LlamaIndex KnowledgeGraphIndex

Next, we will build a knowledge graph using the LlamaIndex KnowledgeGraphIndex. We first set up the environment for using the LlamaIndex library with OpenAI. After configuring logging and loading environment variables, the LlamaIndex components are imported, and the OpenAI model text-davinci-002 is instantiated for use within LlamaIndex’s ServiceContext.

from dotenv import load_dotenv
load_dotenv()
os.environ["OPENAI_API_KEY"] = os.getenv('OPENAI_API_KEY')

import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)

from llama_index import (
    KnowledgeGraphIndex,
    LLMPredictor,
    ServiceContext,
    SimpleDirectoryReader,
)
from llama_index.storage.storage_context import StorageContext
from llama_index.graph_stores import NebulaGraphStore
from llama_index.llms import OpenAI

from IPython.display import Markdown, display

llm = OpenAI(temperature=0, model="text-davinci-002")
service_context = ServiceContext.from_defaults(llm=llm, chunk_size_limit=512)

To set up NebulaGraph locally, begin by establishing a connection using its default credentials. Once connected, you can create a space called llamaindex to define the schema. This space will house entities and their relationships, represented by tags and edges respectively.

# CREATE SPACE llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
# :sleep 10;
# USE llamaindex;
# CREATE TAG entity(name string);
# CREATE EDGE relationship(relationship string);
# CREATE TAG INDEX entity_index ON entity(name(256));

os.environ["NEBULA_USER"] = "root"
os.environ["NEBULA_PASSWORD"] = "nebula"
os.environ[
"NEBULA_ADDRESS"
] = "127.0.0.1:9669"

space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], [
"relationship"
]
tags = ["entity"]

With our new space created in NebulaGraph, let’s construct our NebulaGraphStore.

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

Next, the data is loaded into the system using LlamaIndex’s SimpleDirectoryReader, which reads documents from a specified directory. A Knowledge Graph index, kg_index, is then constructed from these documents. For each chunk, a maximum of 5 triplets is extracted. The include_embeddings=True parameter ensures that semantic embeddings of the knowledge graph's nodes and edges are also stored in the index, facilitating semantically driven queries later on.
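
One detail worth spelling out: SimpleDirectoryReader reads plain files from disk, so the assumption here is that the Step 1 chunks were also written out as text files under the wiki/ directory. A small sketch of how that could look (the file naming is illustrative):

# Assumption: write each Step 1 chunk out as a .txt file so SimpleDirectoryReader can load it.
wiki_dir = "../data/knowledge graphs/rebel_llamaindex/wiki/"
os.makedirs(wiki_dir, exist_ok=True)
for record in data:
    with open(os.path.join(wiki_dir, f"{record['id']}.txt"), "w") as f:
        f.write(record["text"])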

from llama_index import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="../data/knowledge graphs/rebel_llamaindex/wiki/")
documents = reader.load_data()

kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    storage_context=storage_context,
    max_triplets_per_chunk=5,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

After connecting to the local NebulaGraph instance, we query the llamaindex space using the NebulaGraph Query Language (nGQL), fetch ten relationship triplets from the constructed knowledge graph, and count all such relationships present in the database.

%load_ext ngql
%ngql --address 127.0.0.1 --port 9669 --user root --password nebula

%ngql USE llamaindex;

%ngql MATCH (m)-[e]->(n) RETURN m.entity.name AS m_entity,e.relationship AS relationship,n.entity.name AS n_entity LIMIT 10
%ngql MATCH (m)-[e]->(n) RETURN COUNT(*)

With the graph populated, we also set up LlamaIndex’s KnowledgeGraphQueryEngine, which uses the LLM to translate natural-language questions into graph queries against the same storage context.

from llama_index.query_engine import KnowledgeGraphQueryEngine

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,
    service_context=service_context,
    llm=llm,
    verbose=True,
)

Let’s now run a simple query.

Tell me about Sébastien Pan?

response = query_engine.query(
    "Tell me about Sébastien Pan?",
)
display(Markdown(f"<b>{response}</b>"))

Step 4: REBEL + LlamaIndex KnowledgeGraphIndex

Now, let’s establish a new space rebel_llamaindex, which leverages the capabilities of both REBEL and LlamaIndex to build a knowledge graph.

# CREATE SPACE rebel_llamaindex(vid_type=FIXED_STRING(256), partition_num=1, replica_factor=1);
# :sleep 10;
# USE rebel_llamaindex;
# CREATE TAG entity(name string);
# CREATE EDGE relationship(relationship string);
# CREATE TAG INDEX entity_index ON entity(name(256));

space_name = "rebel_llamaindex"
edge_types, rel_prop_names = ["relationship"], [
"relationship"
]
tags = ["entity"]
graph_store = NebulaGraphStore(
space_name=space_name,
edge_types=edge_types,
rel_prop_names=rel_prop_names,
tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

Here, we construct the index while leveraging REBEL to extract triplets. Observe that we’re employing the HuggingFace pipeline, which hides much of the intricacy involved in initializing REBEL. A small wrapper (shown in the snippet below) adapts REBEL’s output into the (subject, relation, object) tuples that LlamaIndex’s kg_triplet_extract_fn expects.

from transformers import pipeline

triplet_extractor = pipeline('text2text-generation', model='Babelscape/rebel-large', tokenizer='Babelscape/rebel-large')
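
# Note (my assumption, not spelled out in the original code): kg_triplet_extract_fn
# receives a raw text chunk and must return (subject, relation, object) tuples,
# while extract_triplets above parses REBEL's decoded output. This wrapper,
# named extract_rebel_triplets here for illustration, bridges the two.
def extract_rebel_triplets(input_text):
    generated = triplet_extractor(input_text, return_tensors=True, return_text=False)
    decoded = triplet_extractor.tokenizer.batch_decode(
        [generated[0]["generated_token_ids"]]
    )[0]
    return [(t['head'], t['type'], t['tail']) for t in extract_triplets(decoded)]
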
rebel_kg_index = KnowledgeGraphIndex.from_documents(
    documents,
    kg_triplet_extract_fn=extract_rebel_triplets,
    storage_context=storage_context,
    max_triplets_per_chunk=5,
    service_context=service_context,
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
    include_embeddings=True,
)

Now we query the rebel_llamaindex space using the NebulaGraph Query Language, fetch ten relationship triplets from the constructed knowledge graph, and count all such relationships present in the database.

%ngql USE rebel_llamaindex;

%ngql MATCH (m)-[e]->(n) RETURN m.entity.name AS m_entity,e.relationship AS relationship,n.entity.name AS n_entity LIMIT 10
%ngql MATCH (m)-[e]->(n) RETURN COUNT(*)
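
Note that the query_engine constructed earlier still points at the llamaindex space, so before querying we rebuild it against the new storage_context (a brief sketch mirroring the earlier setup):

query_engine = KnowledgeGraphQueryEngine(
    storage_context=storage_context,  # now backed by the rebel_llamaindex space
    service_context=service_context,
    llm=llm,
    verbose=True,
)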

Let’s now run a simple query.

Tell me about Savoy Hotel?

response = query_engine.query(
    "Tell me about Savoy Hotel?",
)
display(Markdown(f"<b>{response}</b>"))

Key Takeaways

The REBEL + LlamaIndex knowledge graph contains fewer triples than those produced by LlamaIndex or REBEL alone. Compared with LlamaIndex, the reduction arises because LlamaIndex relies on an OpenAI model for relation extraction, whereas REBEL + LlamaIndex uses the rebel-large model. The count is also lower than REBEL’s knowledge graph because of the max_triplets_per_chunk=5 constraint applied during index creation with REBEL + LlamaIndex. Nevertheless, REBEL + LlamaIndex is faster, and adjusting parameters such as max_triplets_per_chunk could yield more detailed triples.

Summary

In this article, we delve into the creation of knowledge graphs using REBEL, LlamaIndex, and their combined approach. Harnessing the vast information in Wikipedia, we experiment with these frameworks to optimize knowledge graph creation. By synergizing LlamaIndex’s capabilities with REBEL’s efficient relation extraction, the combined method offers rapid knowledge graph construction. Despite producing fewer triples than either approach alone, this integrated approach highlights the potential of merging specialized tools for enhanced results.
