Make a Knowledge Graph RAG with LlamaIndex from Your Own Obsidian Notes

Haiyang(Ocean) Li
6 min read · Sep 28, 2023


(Image by author)

Read more about LlamaIndex in the official documentation.

Introduction

While we have a lot of data at our fingertips, the real challenge is making sense of it all. In their report, McKinsey estimated that, on average, a knowledge worker spends 1.8 hours a day gathering information. Though Large Language Models like GPT-3 have made strides in understanding human language, they can sometimes produce errors, often called “hallucinations.”

This is where Retrieval-Augmented Generation (RAG) comes into play. It improves the performance of Large Language Models by combining them with external databases. One effective way to do this is through a Knowledge Graph, a structured way to organize and link information. This approach has broad applications, ranging from search engines to natural language processing and artificial intelligence.

This tutorial will walk you through creating a Knowledge Graph using the Llama Index Python package. It works well with Obsidian, a popular note-taking app that stores notes as markdown files. By using Llama Index, you can convert your Obsidian notes into a structured Knowledge Graph. This allows you to query your own notes and get more accurate, context-relevant answers from language models. By the end of this guide, you’ll be able to generate more reliable, higher-quality answers grounded in your own curated sets of notes.

Setup

  1. Python & Jupyter Notebook

The tutorial requires a Python 3 environment within Jupyter Notebook. If you have not yet configured Jupyter Notebook, follow the official Jupyter installation guide first.

2. Dependencies

  • Install the dependencies by executing the command below in a Jupyter Notebook cell.
  • This command installs three essential packages: `llama_index`, `pyvis`, and `IPython`, all of which will be indispensable for the tasks at hand.
%pip install llama_index pyvis IPython

3. Enabling Diagnostic Logging

  • Logging provides valuable insights into code execution, making it considerably easier to debug issues and understand the code’s internal logic.
  • To initialize logging, run the Python code snippet below. This code configures the logging level to `INFO`, which will output messages that assist in monitoring the application’s operational flow.
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

4. Import Key Modules from Llama Index

  • The `llama_index` package offers a suite of tools for working with Large Language Models. Execute the following code to import the essential modules:
from llama_index import (
    ObsidianReader,
    LLMPredictor,
    ServiceContext,
    KnowledgeGraphIndex,
)

from llama_index.graph_stores import SimpleGraphStore
from llama_index.storage.storage_context import StorageContext
from llama_index.llms import OpenAI
from IPython.display import Markdown, display

Here’s a concise overview of each module’s function:

  • ObsidianReader: Reads markdown notes from an Obsidian vault directory.
  • LLMPredictor: Utilized for generating predictions using Large Language Models (LLMs).
  • ServiceContext: Supplies contextual data vital for orchestrating various services.
  • KnowledgeGraphIndex: Central for both the construction and manipulation of Knowledge Graphs.
  • SimpleGraphStore: Serves as a straightforward repository for storing graph data.
  • StorageContext: Manages the storage layer, crucial for the efficient saving and retrieval of data.
  • OpenAI: Specifies the module for leveraging OpenAI-based LLMs.
  • Markdown, display: These are IPython tools useful for rendering Markdown text within Jupyter Notebooks.

Constructing the Knowledge Graph Index

  1. First, locate your Obsidian vault’s markdown files.
filepath = './data/md'
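Before loading anything, it can help to fail fast on a missing vault path or API key, since `OpenAI` reads the `OPENAI_API_KEY` environment variable. The `check_setup` helper below is my own illustrative addition, not part of `llama_index`:

```python
import os
from pathlib import Path

def check_setup(vault_dir: str) -> bool:
    """Return True when the vault folder exists and an OpenAI API key is set."""
    return Path(vault_dir).is_dir() and bool(os.getenv("OPENAI_API_KEY"))

filepath = './data/md'
if not check_setup(filepath):
    print("Fix your vault path or set OPENAI_API_KEY before continuing.")
```

Catching these problems here is much cheaper than discovering them mid-way through index construction.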

2. Initialize variables

Set Large Language Model (LLM) context by setting various parameters. These include the model to employ (`text-davinci-002` in our example), the “temperature,” and the chunk size for data processing.

use_context = {
    "temperature": 0,
    "model": "text-davinci-002",
    "chunk_size": 512,
}

The `temperature` parameter dictates the randomness of the generated model output. A value of zero renders the output deterministic, minimizing variances.
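To build intuition for what temperature does, here is a toy illustration in pure Python (unrelated to `llama_index` internals): sampling temperature rescales a model’s token logits before they are converted to probabilities, so a low temperature concentrates almost all probability mass on the top token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(softmax_with_temperature(logits, 1.0))  # probability spread across tokens
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the top token
```

A temperature of exactly zero is typically implemented as greedy (argmax) decoding rather than literal division by zero, which is why it yields deterministic output.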

The data loader (`ObsidianReader` in this case) loads your markdown notes from Obsidian into a variable called `documents`. For data persistence, a `SimpleGraphStore` object is needed. To instantiate these, employ the code snippet below:

documents = ObsidianReader(filepath).load_data()
llm = OpenAI(temperature=use_context['temperature'], model=use_context['model'])
service_context = ServiceContext.from_defaults(
    llm=llm,
    chunk_size=use_context['chunk_size'],
)

graph_store = SimpleGraphStore()
storage_context = StorageContext.from_defaults(graph_store=graph_store)

3. Constructing the Knowledge Graph Index

Now we can construct the knowledge graph index:

index = KnowledgeGraphIndex.from_documents(
    documents=documents,
    max_triplets_per_chunk=2,
    storage_context=storage_context,
    service_context=service_context,
    include_embeddings=True,
)
With `INFO` logging enabled, the (subject, predicate, object) triplets extracted from each chunk are printed as the index is built. Here is a sample from my probability notes:

(Probability Space, $(\Omega, F, P)$)
(The Sample Space, is, the set of all possible outcomes of a random experiment)
(The sigma-algebra, is, a collection of subsets of The Sample Space)
(probability measure, is, function)
(probability measure, maps, sets)
(probability measure, satisfies, countable additivity property)
(Sigma-field, Example, Coin Toss)
(Sigma-field, Define map, $X$)
(Random Variables, is, R.V.)
(Random Variables, are, measurable maps)
(Random Variables, map, sample space)
(Expectation, is, fundamental statistical measure)
(Expectation, defined as, $\int_{\Omega}X(\omega)dP(\omega)$)
(expectation, is defined as, sum of x_i times p_i)
(random variable, is defined as, probability that random variable takes the value x_i)
(Conditional Expectations, is, $\mathbb{E}[X|Y]$)
(conditional expectation, is, average of random variable)
(conditional expectation, is, conditional density of random variable)
(random variables, have joint density, f)
(conditional probability, is given by, f)
(marginal density, is referred to as, f)
(conditional expectation, can be expressed as, mathbb)
(G, is sub-$\sigma$-algebra of, F)
(X, is random variable that, is either non-negative or adheres to integrability conditions)

(Gaussian distribution, can be used in, finance)
(The draft, provides, examples of notation)
(The draft, provides, formulas)
(The draft, provides, references)

`max_triplets_per_chunk` governs the number of relationship triplets processed per data chunk, while `include_embeddings` toggles the inclusion of vector embeddings within the index for advanced analytics.
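Conceptually, the index stores each extracted triplet under its subject, so related facts can be looked up by entity. The sketch below is illustrative only (plain Python, not the actual `SimpleGraphStore` implementation), using triplets from the log output above:

```python
from collections import defaultdict

class TinyGraphStore:
    """Toy stand-in for a graph store: subject -> list of (relation, object)."""

    def __init__(self):
        self._rel_map = defaultdict(list)

    def upsert_triplet(self, subj, rel, obj):
        self._rel_map[subj].append((rel, obj))

    def get(self, subj):
        return list(self._rel_map.get(subj, []))

store = TinyGraphStore()
store.upsert_triplet("probability measure", "is", "function")
store.upsert_triplet("probability measure", "maps", "sets")
store.upsert_triplet("probability measure", "satisfies", "countable additivity property")
print(store.get("probability measure"))
```

At query time, matching an entity in your question against such a subject map is what lets the index pull in facts that plain text retrieval might miss.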

Query

Begin by configuring the query engine. This involves specifying multiple parameters such as `response_mode`, `embedding_mode`, and `similarity_top_k` to fine-tune the query process:

query = "Tell me more about Black Scholes formula assumptions"
query_engine = index.as_query_engine(
    include_text=True,
    response_mode="tree_summarize",
    embedding_mode="hybrid",
    similarity_top_k=5,
)

response = query_engine.query(query)
display(Markdown(f"<b>{response}</b>"))

Executing the above Python snippet will yield a Markdown-formatted output that not only addresses your query but does so with enhanced accuracy and contextual relevance. This is achieved by leveraging the underlying Knowledge Graph Index.

<b>
The Black-Scholes model is predicated on a set of underlying assumptions that act as simplifications for the intricate dynamics of financial markets. Key assumptions encompass:
1. **European-Style Options**: This model is fundamentally designed for European options that can be exercised only upon reaching the expiration date.

2. **Constant Volatility and Interest Rates**: The model presumes both the volatility \(\sigma\) and the risk-free interest rate \(r\) to be invariant throughout the option's lifecycle.

3. **Log-Normally Distributed Returns**: Asset returns are conjectured to conform to a log-normal distribution.

4. **Frictionless Markets**: Assumes a market devoid of transaction costs, taxation, or borrowing expenses, coupled with limitless borrowing and lending at the risk-free rate.

5. **No Dividends**: Assumes that the underlying asset does not disburse dividends over the lifespan of the option.
6. **Continuous Trading**: The model presupposes uninterrupted trading, permitting the purchase or sale of fractional shares.</b>

Graph Visualization and Data Persistence

For the graphical representation of the knowledge graph, we deploy the `pyvis` library. This library furnishes a user-friendly medium for the design of interactive graphs. If you have not yet added this package to your environment, please initiate its installation by executing `%pip install pyvis` in a Jupyter Notebook cell.

The code segment below outlines the procedure for graph visualization:

from pyvis.network import Network
g = index.get_networkx_graph()
net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(g)
net.show("example.html")

Here, the `get_networkx_graph()` method is called upon the `index` object, yielding a NetworkX graph object denoted as `g`. This graph captures the essence of your knowledge graph. Subsequently, the `Network` class from `pyvis` is instantiated to form a `net` object. This instantiation is customized by setting several options:

  • `notebook=True` secures the graph’s compatibility with Jupyter Notebooks.
  • `cdn_resources="in_line"` specifies the in-line arrangement of resources.
  • `directed=True` designates the graph as a directed entity.

Data retention plays an instrumental role, particularly when your knowledge graph and associated index are intricate or have necessitated significant computational effort for their construction. By persisting the data, you can effortlessly retrieve it for future analysis without the need for a complete rebuild.

Data persistence is straightforwardly achieved via the code below; `persist()` writes the graph and index data to disk (by default under a local `./storage` directory), from which it can be reloaded later:

storage_context.persist()
