Make a Meaningful Knowledge Graph with the Open-Source REBEL Model
In this tutorial, we will explore how to construct a knowledge graph database from any webpage by leveraging LlamaIndex in conjunction with Wikipedia.
Read more about LlamaIndex in the official documentation:
- installation
- basic usage
- knowledge graph index
- llamahub
Read more about the REBEL model and Transformers:
- Building a Knowledge Base from Texts: a Full Practical Example
- Babelscape/rebel-large
- transformers
- pytorch
Utilizing Wikipedia allows us to validate the plausibility of a relation before incorporating it into our knowledge base.
Subsequently, we will employ the open-source REBEL model for the extraction of triplets — head, type, and tail structures that provide more meaningful insights into relationships.
This tutorial is designed for a MacOS environment running on Apple Silicon. For those operating on different platforms, the primary distinction lies in the GPU acceleration configuration; you will simply need to modify the device settings accordingly.
Setup
Environment and Dependencies
Depending on your environment, set up a venv or conda environment following the requirements for the latest PyTorch or TensorFlow:
For MacOS users with Apple Silicon interested in GPU acceleration, you may:
- Install PyTorch Metal: PyTorch on Apple Silicon
- Alternatively, install TensorFlow along with the TensorFlow-Metal plugin
For users on other systems:
- Install PyTorch: PyTorch Official Site
- Alternatively, install TensorFlow
Then run the following in a Jupyter Notebook cell:
%pip install html2text wikipedia pyvis IPython cchardet transformers llama_index
Note:
- If you use Google Colab, use `!` in place of `%`.
- If other packages are missing in your local environment, run:
%pip install your-missing-package
Then set your OpenAI API key (OPENAI_API_KEY) in your environment. Check the guide here.
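For example, a minimal way to set the key inside a notebook session (my own sketch; replace the placeholder with your actual key, or better, load it from a secrets manager or .env file) looks like this:
import os

# Assumed approach: set the key directly in the notebook session.
os.environ["OPENAI_API_KEY"] = "your-api-key-here"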
Verifying Your PyTorch Device for GPU Acceleration
Before proceeding, validate your PyTorch device’s capability for GPU acceleration:
- For MacOS Apple Silicon users, execute the following code:
import torch
if torch.backends.mps.is_available():
mps_device = torch.device("mps")
x = torch.ones(1, device=mps_device)
print(x)
else:
print("MPS device not found.")
If GPU acceleration is available, the output should be:
tensor([1.], device='mps:0')
Then run:
device = 'mps:0'
- For CUDA users (e.g., on a free Google Colab GPU runtime), execute the following code:
import torch
if torch.cuda.is_available():
cuda_device = torch.device("cuda")
x = torch.ones(1, device=cuda_device)
print(x)
else:
print("CUDA device not found.")
If GPU acceleration is available, the output should be:
tensor([1.], device='cuda:0')
Then run:
device = 'cuda:0'
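If you prefer not to hardcode the device, a small sketch like the following (my own addition, not part of the original setup) picks the best available option automatically:
import torch

# Prefer CUDA, then Apple Silicon (MPS), otherwise fall back to CPU.
if torch.cuda.is_available():
    device = 'cuda:0'
elif torch.backends.mps.is_available():
    device = 'mps:0'
else:
    device = 'cpu'
print(f"Using device: {device}")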
Method 1 — Using HuggingFace Pipeline
The following approach is based on the Google Colab notebook Rebel + LlamaIndex Knowledge Graph Query Engine and the LlamaIndex documentation on knowledge graphs.
Initialize the Model Pipeline
We will leverage the HuggingFace library to load a pre-trained model tailored for triplet extraction tasks. The `pipeline` function simplifies the model initialization process, allowing you to specify the task, model, and tokenizer with only a few lines of code. This approach enables you to quickly deploy state-of-the-art machine learning models without the need for extensive configurations.
Specifically, we will use `Babelscape/rebel-large`; see the Hugging Face model card.
from transformers import pipeline
triplet_extractor = pipeline('text2text-generation',
                             model='Babelscape/rebel-large',
                             tokenizer='Babelscape/rebel-large',
                             device=device)
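To get a feel for what the pipeline returns, you can decode the generated tokens for a short sentence (the sample sentence below is my own and the exact output will depend on the model):
# Inspect the raw linearized REBEL output for a sample sentence.
sample = "The IBM 1401 was announced by IBM in 1959."
raw = triplet_extractor(sample, return_tensors=True, return_text=False)
decoded = triplet_extractor.tokenizer.batch_decode([raw[0]["generated_token_ids"]])[0]
print(decoded)  # expect special markers such as <triplet>, <subj>, <obj>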
Implement the triplet extractor method
Now we will create a function that processes and parses the model output into triplets.
def extract_triplets(input_text):
    # run the pipeline and decode the generated token ids back into the linearized REBEL text
    text = triplet_extractor.tokenizer.batch_decode(
        [
            triplet_extractor(input_text, return_tensors=True, return_text=False)[0][
                "generated_token_ids"
            ]
        ]
    )[0]
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    # walk the special tokens and accumulate head / relation / tail pieces
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append(
                    (
                        subject.strip(),
                        relation.strip(),
                        object_.strip()
                    )
                )
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append(
                    (
                        subject.strip(),
                        relation.strip(),
                        object_.strip()
                    )
                )
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append(
            (
                subject.strip(),
                relation.strip(),
                object_.strip()
            )
        )
    return triplets
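As a quick sanity check, you can call the function on a short sentence of your own (the sentence below is illustrative; the exact triplets returned will vary with the model):
print(extract_triplets("The Eiffel Tower is located in Paris, France."))
# illustrative output, e.g. [('Eiffel Tower', 'located in the administrative territorial entity', 'Paris')]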
Method 2: Using Wikipedia to filter relations
The methodology in this section is based on the article Building a Knowledge Base from Texts: a Full Practical Example, authored by Fabio Chiusano.
A challenge inherent in using pre-trained models like the HuggingFace pipeline is the “black-box” nature of these models; their internal operations remain largely opaque to us. To enhance the reliability of the extracted relations, we can impose additional validation steps. One effective approach is to use Wikipedia as a supplementary source for vetting the relations we extract. By cross-referencing the model’s output with Wikipedia’s extensive database, we can add an extra layer of credibility to our knowledge graph, ensuring that the relations are not only syntactically correct but also semantically meaningful.
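As a minimal sketch of the validation idea (assuming the wikipedia package from the install step; the helper name and the example entities below are my own), an entity check can look like this:
import wikipedia

def entity_exists_on_wikipedia(name: str) -> bool:
    # Returns True only if the candidate resolves to an actual Wikipedia page.
    try:
        wikipedia.page(name, auto_suggest=False)
        return True
    except Exception:
        return False

print(entity_exists_on_wikipedia("IBM 1401"))         # a real page, so True
print(entity_exists_on_wikipedia("Xyzzy Not A Page")) # no such page, so False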
Load the model from Hugging Face
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")
model.to(device)
Set up a model output parser
This function parses the model output into triplets, similar to Method 1, but instead of calling the pipeline directly on raw text, we pass only the decoded model output into it.
def extract_relations_from_model_output(text):
    relations = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    text_replaced = text.replace("<s>", "").replace("<pad>", "").replace("</s>", "")
    for token in text_replaced.split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                relations.append({
                    'head': subject.strip(),
                    'type': relation.strip(),
                    'tail': object_.strip()
                })
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        relations.append({
            'head': subject.strip(),
            'type': relation.strip(),
            'tail': object_.strip()
        })
    return relations
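To see how the parser behaves, you can feed it a hand-written REBEL-style string (this string is illustrative, not actual model output):
sample_output = "<s><triplet> IBM 1401 <subj> IBM <obj> manufacturer </s>"
print(extract_relations_from_model_output(sample_output))
# [{'head': 'IBM 1401', 'type': 'manufacturer', 'tail': 'IBM'}]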
Set up a KB class that uses Wikipedia filtering when creating relations
We can create a custom KB class to manipulate and validate the model output triplets. Here we use the wikipedia package to check that each candidate entity corresponds to an actual Wikipedia page; read more here.
import math
import wikipedia

class KB():
    def __init__(self):
        self.entities = {}   # { entity_title: {...} }
        self.relations = []  # [ head: entity_title, type: ..., tail: entity_title,
                             #   meta: { article_url: { spans: [...] } } ]
        self.sources = {}    # { article_url: {...} }

    def merge_with_kb(self, kb2):
        for r in kb2.relations:
            article_url = list(r["meta"].keys())[0]
            source_data = kb2.sources[article_url]
            self.add_relation(r, source_data["article_title"],
                              source_data["article_publish_date"])

    def are_relations_equal(self, r1, r2):
        return all(r1[attr] == r2[attr] for attr in ["head", "type", "tail"])

    def exists_relation(self, r1):
        return any(self.are_relations_equal(r1, r2) for r2 in self.relations)

    def merge_relations(self, r2):
        r1 = [r for r in self.relations
              if self.are_relations_equal(r2, r)][0]
        # if different article
        article_url = list(r2["meta"].keys())[0]
        if article_url not in r1["meta"]:
            r1["meta"][article_url] = r2["meta"][article_url]
        # if existing article
        else:
            spans_to_add = [span for span in r2["meta"][article_url]["spans"]
                            if span not in r1["meta"][article_url]["spans"]]
            r1["meta"][article_url]["spans"] += spans_to_add

    def get_wikipedia_data(self, candidate_entity):
        try:
            page = wikipedia.page(candidate_entity, auto_suggest=False)
            entity_data = {
                "title": page.title,
                "url": page.url,
                "summary": page.summary
            }
            return entity_data
        except:
            return None

    def add_entity(self, e):
        self.entities[e["title"]] = {k: v for k, v in e.items() if k != "title"}

    def add_relation(self, r, article_title, article_publish_date):
        # check on wikipedia
        candidate_entities = [r["head"], r["tail"]]
        entities = [self.get_wikipedia_data(ent) for ent in candidate_entities]

        # if one entity does not exist, stop
        if any(ent is None for ent in entities):
            return

        # manage new entities
        for e in entities:
            self.add_entity(e)

        # rename relation entities with their wikipedia titles
        r["head"] = entities[0]["title"]
        r["tail"] = entities[1]["title"]

        # add source if not in kb
        article_url = list(r["meta"].keys())[0]
        if article_url not in self.sources:
            self.sources[article_url] = {
                "article_title": article_title,
                "article_publish_date": article_publish_date
            }

        # manage new relation
        if not self.exists_relation(r):
            self.relations.append(r)
        else:
            self.merge_relations(r)

    def print(self):
        print("Entities:")
        for e in self.entities.items():
            print(f"  {e}")
        print("Relations:")
        for r in self.relations:
            print(f"  {r}")
        print("Sources:")
        for s in self.sources.items():
            print(f"  {s}")
Create a method that calls the model and builds the KB
def from_text_to_kb(text, article_url, span_length=128, article_title=None,
                    article_publish_date=None, verbose=False):
    # tokenize whole text
    inputs = tokenizer([text], return_tensors="pt")
    # inputs["input_ids"] = inputs["input_ids"].to("cuda")

    # compute span boundaries
    num_tokens = len(inputs["input_ids"][0])
    if verbose:
        print(f"Input has {num_tokens} tokens")
    num_spans = math.ceil(num_tokens / span_length)
    if verbose:
        print(f"Input has {num_spans} spans")
    overlap = math.ceil((num_spans * span_length - num_tokens) /
                        max(num_spans - 1, 1))
    spans_boundaries = []
    start = 0
    for i in range(num_spans):
        spans_boundaries.append([start + span_length * i,
                                 start + span_length * (i + 1)])
        start -= overlap
    if verbose:
        print(f"Span boundaries are {spans_boundaries}")

    # transform input with spans
    tensor_ids = [inputs["input_ids"][0][boundary[0]:boundary[1]]
                  for boundary in spans_boundaries]
    tensor_masks = [inputs["attention_mask"][0][boundary[0]:boundary[1]]
                    for boundary in spans_boundaries]
    inputs = {
        "input_ids": torch.stack(tensor_ids).to(device),
        "attention_mask": torch.stack(tensor_masks).to(device)
    }

    # generate relations
    num_return_sequences = 3
    gen_kwargs = {
        "max_length": 256,
        "length_penalty": 0,
        "num_beams": 3,
        "num_return_sequences": num_return_sequences
    }
    generated_tokens = model.generate(
        **inputs,
        **gen_kwargs,
    )

    # decode relations
    decoded_preds = tokenizer.batch_decode(generated_tokens,
                                           skip_special_tokens=False)

    # create kb
    kb = KB()
    i = 0
    for sentence_pred in decoded_preds:
        current_span_index = i // num_return_sequences
        relations = extract_relations_from_model_output(sentence_pred)
        for relation in relations:
            relation["meta"] = {
                article_url: {
                    "spans": [spans_boundaries[current_span_index]]
                }
            }
            kb.add_relation(relation, article_title, article_publish_date)
        i += 1
    return kb
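As an illustrative usage (the sample text and URL below are my own; the call needs network access for the Wikipedia lookups, and the extracted relations will vary):
sample_text = (
    "Napoleon Bonaparte was a French military and political leader "
    "who rose to prominence during the French Revolution."
)
# The URL here is only an illustrative placeholder used as the source key.
kb = from_text_to_kb(sample_text, "https://example.com/napoleon", verbose=True)
kb.print()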
Test
Now we are ready to put the above methods to use. If you are unfamiliar with LlamaIndex, please check my previous article on creating a simple knowledge graph with LlamaIndex.
Set up logging and LlamaIndex
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from llama_index import (
    SimpleWebPageReader,
    ServiceContext,
    KnowledgeGraphIndex,
)
from llama_index.graph_stores import SimpleGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index.llms import OpenAI
Set up index
# LLM model and service context
use_context = {
    "temperature": 0,
    "model": "gpt-3.5-turbo",
    "chunk_size": 256,
    "max_triplets_per_chunk": 3
}
# example url
url = "http://paulgraham.com/worked.html"
# load data
documents = SimpleWebPageReader(html_to_text=True).load_data([url])
# set up service context
llm = OpenAI(temperature=use_context['temperature'],
             model=use_context['model'])
service_context = ServiceContext.from_defaults(llm=llm,
                                               chunk_size=use_context['chunk_size'])
# set up graph storage context
graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)
Using Method 1
To use Method 1, we simply pass the triplet extraction function we implemented above into the LlamaIndex KnowledgeGraphIndex initialization via `kg_triplet_extract_fn`.
Like this,
index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=use_context["max_triplets_per_chunk"],
    kg_triplet_extract_fn=extract_triplets,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
);
We can then query the index and create a graph:
from IPython.display import Markdown, HTML
# input query
query = "tell me more about IBM 1401"
# set up query engine
query_engine = index.as_query_engine(include_text=False, response_mode="tree_summarize")
# response
response = query_engine.query(query)
Markdown(f"<b>{response}</b>")
You should see:
IBM 1401 is a subject that is related to the predicate “manufacturer” and the object “IBM”.
from pyvis.network import Network
g = index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example_pipeline.html")
HTML(filename='example_pipeline.html')
Using Method 2
To use Method 2 the same way we used Method 1, we need to write a wrapper around the whole implementation so we can pass it in as an argument as well.
def extract_triplets_wiki(input_text):
    # build a Wikipedia-validated KB, then convert its relations to (head, type, tail) tuples
    kb = from_text_to_kb(input_text, url)
    triplets = [(r['head'], r['type'], r['tail']) for r in kb.relations]
    return triplets
and similarly (note: this is very slow; I will try to improve it in a future iteration):
# set up index
index1 = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=use_context["max_triplets_per_chunk"],
    kg_triplet_extract_fn=extract_triplets_wiki,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
);
Query and make the graph:
# input query
query = "tell me more about IBM 1401"
# set up query engine
query_engine = index1.as_query_engine(include_text=False, response_mode="tree_summarize")
# response
response = query_engine.query(query)
Markdown(f"<b>{response}</b>")
You should see:
IBM 1401 is a subject that is related to the predicate “service entry” and the object “1401”. It is also related to the predicate “manufacturer” and the object “IBM”. Additionally, it is influenced by the programming language Fortran.
# create graph
g = index1.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example_wiki.html")
HTML(filename='example_wiki.html')
Discussion and Next Steps
As we can see from the graphs, these two methods yield very different knowledge graphs. In the next article of our series on exploring RAG with knowledge graphs and LlamaIndex, we will look into how to evaluate our datasets.
Thanks for reading.
originally published on:
https://www.quantoceanli.com/writings/Make+Meaningful+Knowledge+Graph+with+REBEL