Make a Meaningful Knowledge Graph with the Open-Source REBEL Model
In this tutorial, we will explore how to construct a knowledge graph database from any webpage by leveraging LlamaIndex in conjunction with Wikipedia.
Read more about LlamaIndex in the official documentation:
- installation
- basic usage
- knowledge graph index
- llamahub
Read more about the REBEL model and Transformers:
- Building a Knowledge Base from Texts: a Full Practical Example
- Babelscape/rebel-large
- transformers
- pytorch
Utilizing Wikipedia allows us to validate the plausibility of a relation before incorporating it into our knowledge base.
Subsequently, we will employ the open-source REBEL model for the extraction of triplets — head, type, and tail structures that provide more meaningful insights into relationships.
This tutorial is designed for a MacOS environment running on Apple Silicon. For those operating on different platforms, the primary distinction lies in the GPU acceleration configuration; you will simply need to modify the device settings accordingly.
Setup
Environment and Dependencies
Depending on your environment, set up a venv or conda environment following the requirements for the latest PyTorch or TensorFlow:
For MacOS users with Apple Silicon interested in GPU acceleration, you may:
- Install PyTorch Metal: PyTorch on Apple Silicon
- Alternatively, install TensorFlow along with the TensorFlow-Metal plugin
For users on other systems:
- Install PyTorch: PyTorch Official Site
- Alternatively, install TensorFlow
Then run the following in a Jupyter Notebook cell:
%pip install html2text wikipedia pyvis IPython cchardet transformers llama_index
Note:
- If you use Google Colab, use `!` in place of `%`.
- If other packages are missing in your local environment, run:
%pip install your-missing-package
Then set your OpenAI API key (OPENAI_API_KEY) in your environment. Check the guide here.
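For example, a minimal way to set the key inside a notebook session (my own sketch; replace the placeholder with your actual key, or better, load it from a secrets manager or .env file) looks like this:
import os

# Assumed approach: set the key directly in the notebook session.
os.environ["OPENAI_API_KEY"] = "your-api-key-here"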
Verifying Your PyTorch Device for GPU Acceleration
Before proceeding, validate your PyTorch device’s capability for GPU acceleration:
- For MacOS Apple Silicon users, execute the following code:
import torch
if torch.backends.mps.is_available():
mps_device = torch.device("mps")
x = torch.ones(1, device=mps_device)
print(x)
else:
print("MPS device not found.")
If GPU acceleration is available, the output should be:
tensor([1.], device='mps:0')
Then run:
device = 'mps:0'
- For CUDA users (e.g., on a free Google Colab GPU runtime), execute the following code:
import torch
if torch.cuda.is_available():
cuda_device = torch.device("cuda")
x = torch.ones(1, device=cuda_device)
print(x)
else:
print("CUDA device not found.")
If GPU acceleration is available, the output should be:
tensor([1.], device='cuda:0')
Then run:
device = 'cuda:0'
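If you prefer not to hardcode the device, a small sketch like the following (my own addition, not part of the original setup) picks the best available option automatically:
import torch

# Prefer CUDA, then Apple Silicon (MPS), otherwise fall back to CPU.
if torch.cuda.is_available():
    device = 'cuda:0'
elif torch.backends.mps.is_available():
    device = 'mps:0'
else:
    device = 'cpu'
print(f"Using device: {device}")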
Method 1 — Using HuggingFace Pipeline
The following approach is based on the Google Colab notebook Rebel + LlamaIndex Knowledge Graph Query Engine and the LlamaIndex documentation on knowledge graphs.
Initialize the Model Pipeline
We will leverage the HuggingFace library to load a pre-trained model tailored for triplet extraction tasks. The `pipeline` function simplifies the model initialization process, allowing you to specify the task, model, and tokenizer with only a few lines of code. This approach enables you to quickly deploy state-of-the-art machine learning models without the need for extensive configurations.
Specifically, we will use `Babelscape/rebel-large`; see the Hugging Face model card.
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation',
                             model='Babelscape/rebel-large',
                             tokenizer='Babelscape/rebel-large',
                             device=device)
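To get a feel for what the pipeline returns, you can decode the generated tokens for a short sentence (the sample sentence below is my own and the exact output will depend on the model):
# Inspect the raw linearized REBEL output for a sample sentence.
sample = "The IBM 1401 was announced by IBM in 1959."
raw = triplet_extractor(sample, return_tensors=True, return_text=False)
decoded = triplet_extractor.tokenizer.batch_decode([raw[0]["generated_token_ids"]])[0]
print(decoded)  # expect special markers such as <triplet>, <subj>, <obj>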
Implement the triplet extractor method
Now we will create a function that processes and parses the model output into triplets.
def extract_triplets(input_text):
    # run the pipeline and decode the generated token ids back into the linearized REBEL text
    text = triplet_extractor.tokenizer.batch_decode(
        [
            triplet_extractor(input_text, return_tensors=True, return_text=False)[0][
                "generated_token_ids"
            ]
        ]
    )[0]
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    # walk the special tokens and accumulate head / relation / tail pieces
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append(
                    (
                        subject.strip(),
                        relation.strip(),
                        object_.strip()
                    )
                )
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append(
                    (
                        subject.strip(),
                        relation.strip(),
                        object_.strip()
                    )
                )
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append(
            (
                subject.strip(),
                relation.strip(),
                object_.strip()
            )
        )
    return triplets
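As a quick sanity check, you can call the function on a short sentence of your own (the sentence below is illustrative; the exact triplets returned will vary with the model):
print(extract_triplets("The Eiffel Tower is located in Paris, France."))
# illustrative output, e.g. [('Eiffel Tower', 'located in the administrative territorial entity', 'Paris')]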
Method 2: Using Wikipedia to filter relations
The methodology in this section is based on the article Building a Knowledge Base from Texts: a Full Practical Example, authored by Fabio Chiusano.
A challenge inherent in using pre-trained models like the HuggingFace pipeline is the “black-box” nature of these models; their internal operations remain largely opaque to us. To enhance the reliability of the extracted relations, we can impose additional validation steps. One effective approach is to use Wikipedia as a supplementary source for vetting the relations we extract. By cross-referencing the model’s output with Wikipedia’s extensive database, we can add an extra layer of credibility to our knowledge graph, ensuring that the relations are not only syntactically correct but also semantically meaningful.
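As a minimal sketch of the validation idea (assuming the wikipedia package from the install step; the helper name and the example entities below are my own), an entity check can look like this:
import wikipedia

def entity_exists_on_wikipedia(name: str) -> bool:
    # Returns True only if the candidate resolves to an actual Wikipedia page.
    try:
        wikipedia.page(name, auto_suggest=False)
        return True
    except Exception:
        return False

print(entity_exists_on_wikipedia("IBM 1401"))         # a real page, so True
print(entity_exists_on_wikipedia("Xyzzy Not A Page")) # no such page, so False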
Load the model from Hugging Face
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
model.to(device)
Set up a model output parser
This function parses the model output into triplets, similar to Method 1, but instead of calling the pipeline directly on raw text, we pass only the decoded model output into it.
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
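To see how the parser behaves, you can feed it a hand-written REBEL-style string (this string is illustrative, not actual model output):
sample_output = "<s><triplet> IBM 1401 <subj> IBM <obj> manufacturer </s>"
print(extract_relations_from_model_output(sample_output))
# [{'head': 'IBM 1401', 'type': 'manufacturer', 'tail': 'IBM'}]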
Set up a KB class that uses Wikipedia filtering when creating relations
We can create a custom KB class to manipulate and validate the model output triplets. Here we use the wikipedia package to check that each candidate entity corresponds to an actual Wikipedia page; read more here.
import math
import wikipedia

class KB():
    def __init__(self):
        self.entities = {}   # { entity_title: {...} }
        self.relations = []  # [ head: entity_title, type: ..., tail: entity_title,
                             #   meta: { article_url: { spans: [...] } } ]
        self.sources = {}    # { article_url: {...} }

    def merge_with_kb(self, kb2):
        for r in kb2.relations:
            article_url = list(r["meta"].keys())[0]
            source_data = kb2.sources[article_url]
            self.add_relation(r, source_data["article_title"],
                              source_data["article_publish_date"])

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r2):
        r1 = [r for r in self.relations
              if self.are_relations_equal(r2, r)][0]
        # if different article
        article_url = list(r2["meta"].keys())[0]
        if article_url not in r1["meta"]:
            r1["meta"][article_url] = r2["meta"][article_url]
        # if existing article
        else:
            spans_to_add = [span for span in r2["meta"][article_url]["spans"]
                            if span not in r1["meta"][article_url]["spans"]]
            r1["meta"][article_url]["spans"] += spans_to_add

    def get_wikipedia_data(self, candidate_entity):
        try:
            page = wikipedia.page(candidate_entity, auto_suggest=False)
            entity_data = {
                "title": page.title,
                "url": page.url,
                "summary": page.summary
            }
            return entity_data
        except:
            return None

    def add_entity(self, e):
        self.entities[e["title"]] = {k: v for k, v in e.items() if k != "title"}

    def add_relation(self, r, article_title, article_publish_date):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # add source if not in kb
        article_url = list(r["meta"].keys())[0]
        if article_url not in self.sources:
            self.sources[article_url] = {
                "article_title": article_title,
                "article_publish_date": article_publish_date
            }

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
        print("Sources:")
        for s in self.sources.items():
            print(f"  {s}")
Create a method that calls the model and builds the KB
def from_text_to_kb(text, article_url, span_length=128, article_title=None,
                    article_publish_date=None, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")
    # inputs["input_ids"] = inputs["input_ids"].to("cuda")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids).to(device),
        "attention_mask": torch.stack(tensor_masks).to(device)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                article_url: {
                    "spans": [spans_boundaries[current_span_index]]
                }
            }
            kb.add_relation(relation, article_title, article_publish_date)
        i += 1
    return kb
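As an illustrative usage (the sample text and URL below are my own; the call needs network access for the Wikipedia lookups, and the extracted relations will vary):
sample_text = (
    "Napoleon Bonaparte was a French military and political leader "
    "who rose to prominence during the French Revolution."
)
# The URL here is only an illustrative placeholder used as the source key.
kb = from_text_to_kb(sample_text, "https://example.com/napoleon", verbose=True)
kb.print()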
Test
Now we are ready to put the above methods to use. If you are unfamiliar with LlamaIndex, please check my previous article on creating a simple knowledge graph with LlamaIndex.
Set up logging and LlamaIndex
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import (
    SimpleWebPageReader,
    ServiceContext,
    KnowledgeGraphIndex,
)
from llama_index.graph_stores import SimpleGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index.llms import OpenAI
Set up index
# LLM model and service context
use_context = {
    "temperature": 0,
    "model": "gpt-3.5-turbo",
    "chunk_size": 256,
    "max_triplets_per_chunk": 3
}
# example url
url = "http://paulgraham.com/worked.html"
# load data
documents = SimpleWebPageReader(html_to_text=True).load_data([url])
# set up service context
llm = OpenAI(temperature=use_context['temperature'],
             model=use_context['model'])
service_context = ServiceContext.from_defaults(llm=llm,
                                               chunk_size=use_context['chunk_size'])
# set up graph storage context
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)
Using Method 1
To use Method 1, we simply pass the triplet extraction function we implemented above into the LlamaIndex KnowledgeGraphIndex initialization via `kg_triplet_extract_fn`.
Like this,
index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=use_context["max_triplets_per_chunk"],
    kg_triplet_extract_fn=extract_triplets,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
);
We can then query the index and create a graph:
from IPython.display import Markdown, HTML
# input query
query = "tell me more about IBM 1401"
# set up query engine
query_engine = index.as_query_engine(include_text=False, response_mode="tree_summarize")
# response
response = query_engine.query(query)
Markdown(f"<b>{response}</b>")
You should see:
IBM 1401 is a subject that is related to the predicate “manufacturer” and the object “IBM”.
from pyvis.network import Network
g = index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example_pipeline.html")
HTML(filename='example_pipeline.html')
Using Method 2
To use Method 2 the same way we used Method 1, we need to write a wrapper around the whole implementation so we can pass it in as an argument as well.
def extract_triplets_wiki(input_text):
    # build a Wikipedia-validated KB, then convert its relations to (head, type, tail) tuples
    kb = from_text_to_kb(input_text, url)
    triplets = [(r['head'], r['type'], r['tail']) for r in kb.relations]
    return triplets
and similarly (note: this is very slow; I will try to improve it in a future iteration):
# set up index
index1 = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=use_context["max_triplets_per_chunk"],
    kg_triplet_extract_fn=extract_triplets_wiki,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
);
Query and make the graph:
# input query
query = "tell me more about IBM 1401"
# set up query engine
query_engine = index1.as_query_engine(include_text=False, response_mode="tree_summarize")
# response
response = query_engine.query(query)
Markdown(f"<b>{response}</b>")
You should see:
IBM 1401 is a subject that is related to the predicate “service entry” and the object “1401”. It is also related to the predicate “manufacturer” and the object “IBM”. Additionally, it is influenced by the programming language Fortran.
# create graph
g = index1.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example_wiki.html")
HTML(filename='example_wiki.html')
Discussion and Next Steps
As we can see from the graphs, these two methods yield very different knowledge graphs. In the next article of our series on exploring RAG with knowledge graphs and LlamaIndex, we will look into how to evaluate our datasets.
Thanks for reading.
originally published on:
https://www.quantoceanli.com/writings/Make+Meaningful+Knowledge+Graph+with+REBEL