Creating a Custom AI RAG from your Notion Database: OpenAI, Python, LangChain, Notion, Qdrant, Streamlit
INTRODUCTION
What is Retrieval Augmented Generation (RAG)
Retrieval Augmented Generation (RAG) is a technique for retrieving context to include in prompts to Large Language Models (LLMs). RAG starts by searching a collection of documents (text or image files) for content that is relevant to a query. You then include the retrieved text and/or image context in the prompt. This lets you ask questions using additional context that the model would not otherwise have. Check out this article for more information: What is Retrieval Augmented Generation (RAG).
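To make the retrieve-then-prompt flow concrete, here is a minimal sketch in plain Python. It is not the implementation used in this project: a toy word-count similarity stands in for real embeddings and a vector database, and the names `tokenize`, `score`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text: str) -> Counter:
    """Lowercase a string and count its words (a stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def score(query: str, document: str) -> float:
    """Cosine similarity between the word-count vectors of query and document."""
    q, d = tokenize(query), tokenize(document)
    dot = sum(q[w] * d[w] for w in q)
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question in a single prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


docs = [
    "Qdrant is a vector database for storing embeddings.",
    "Streamlit builds web apps in Python.",
    "Notion databases hold pages with properties.",
]
question = "What is a vector database?"
top = retrieve(question, docs, k=1)
prompt = build_prompt(question, top)
```

The resulting `prompt` string is what would be sent to the LLM. In a real setup the scoring step is replaced by an embedding model plus a nearest-neighbor search.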
Our RAG Architecture with Notion
We will use OpenAI for our LLM.
Next, we will review the Python packages that implement this solution.
- notion-utils — uses the Notion API to fetch data from a Notion database.
- notion-load — (a) uses notion-utils to fetch data from the Notion database; (b) creates embeddings; (c) loads the embeddings into the qdrant vector database.
- notion-chat — provides the app that calls the LLM and the qdrant vector database to answer questions.
SETUP
The setup will assume the following local prerequisites.
- Python is installed
- Poetry for Python is installed
Setup qdrant
- Create a free qdrant account. Save the following for later use:
- Cluster name (QDRANT_COLLECTION_NAME)
- Cluster URL (QDRANT_URL)
- Create an API key (QDRANT_API_KEY) and save it for later.
Setup Notion
- Create a Notion integration API key (NOTION_TOKEN) and save the value for later.
- Enable your Notion database to use your API key (share the database with your integration).
- Find and save the ID for your Notion database:
- Click on your Notion database in the browser.
- Copy the Notion database ID (NOTION_DATABASE_ID) from the browser URL and save it for later. For example,
https://www.notion.so/johntday/db0ee43b0572d7c9ad97d8dd57ff34a3?v=e16d4dba585746de9c067f4c32c0b020
In my case, the Notion database ID is db0ee43b0572d7c9ad97d8dd57ff34a3
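If you prefer not to pick the ID out of the URL by hand, the step can be sketched in Python. This is only an illustration, assuming your URL has the same shape as the example above; the function name `notion_database_id` is hypothetical and not part of the project repos.

```python
from urllib.parse import urlparse


def notion_database_id(url: str) -> str:
    """Return the 32-character ID from a Notion database URL.

    Assumes the shape shown above:
    https://www.notion.so/<workspace>/<database_id>?v=<view_id>
    """
    # The query string (?v=...) identifies the view, not the database.
    path = urlparse(url).path.rstrip("/")
    candidate = path.rsplit("/", 1)[-1]
    # Some Notion URLs prefix the ID with a title slug ("My-Title-<id>");
    # keep only the part after the final hyphen.
    candidate = candidate.rsplit("-", 1)[-1]
    is_hex = all(c in "0123456789abcdef" for c in candidate.lower())
    if len(candidate) != 32 or not is_hex:
        raise ValueError(f"No 32-character hex ID found in {url!r}")
    return candidate


example = (
    "https://www.notion.so/johntday/"
    "db0ee43b0572d7c9ad97d8dd57ff34a3?v=e16d4dba585746de9c067f4c32c0b020"
)
database_id = notion_database_id(example)
```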
Setup OpenAI
- Create an OpenAI account.
- Create an API key (OPENAI_API_KEY) and save for later.
Setup Python
# STEP 0
# --------------------------------------------------------------------
# navigate to a good place to clone the repos
# save this dir in an exported variable for later use
export MYHOME=$(pwd)
# CLONE REPOS and SETUP PYTHON ENV
# --------------------------------------------------------------------
# the first 2 repos just contain the code dependencies for the last repo
cd $MYHOME
git clone https://github.com/johntday/notion-utils.git
# contains utils for using Notion API
cd notion-utils
poetry install
cd $MYHOME
git clone https://github.com/johntday/notion-load.git
# contains qdrant client and loader script
cd notion-load
poetry install
cd $MYHOME
git clone https://github.com/johntday/notion-chat.git
# contains app
cd notion-chat
poetry install
# UPDATE ENV VARIABLES
# --------------------------------------------------------------------
# add app keys to ".env" file for each project
# for example
cd $MYHOME/notion-utils
cp .env.example .env
# update ".env" with your information
# NOTE: in my setup, I only have one ".env" file. Symbolic links provide access for each project
# SIMPLE NOTION DATABASE CONNECTION TEST
# --------------------------------------------------------------------
cd $MYHOME/notion-utils && source .venv/bin/activate
python3 notion_utils/test_notion_utils.py
# this will print a count of the records in your Notion database
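Before running the loader or the app, it can help to confirm that every key collected during setup is actually available. The sketch below is not part of the project repos; it assumes the values from your ".env" file have already been loaded into the process environment, and the names `REQUIRED_SETTINGS` and `missing_settings` are hypothetical.

```python
import os
from collections.abc import Mapping

# Variable names collected in the setup steps above.
REQUIRED_SETTINGS = (
    "NOTION_TOKEN",
    "NOTION_DATABASE_ID",
    "OPENAI_API_KEY",
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "QDRANT_COLLECTION_NAME",
)


def missing_settings(env: Mapping[str, str]) -> list[str]:
    """Return the required names that are unset or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```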
LOADING qdrant FROM NOTION DATABASE
Here is an example of one of my Notion databases.
Required Fields for Your Notion Database
The following fields are required.
- Title — (“Name” field) — the default “Name” field, renamed to “Title”. WHY: the app uses this to provide a title and link for the Notion pages it used.
- Source ID — (Select) — Source tag. For example, I use this to identify the domain of the source article. WHY: the app will use this in the future for filtering.
- Published — (Date) — Publish date of the Notion page. WHY: the app uses this to show the publish date of the Notion pages it used.
- Source — (URL) — URL to the Notion page. WHY: the app uses this to provide a link to the Notion pages it used.
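The field list above can be checked mechanically. The following sketch is an illustration, not code from the repos: it assumes you already have the database's `properties` mapping in the shape the Notion API returns (property name mapped to an object with a `type` key), and `REQUIRED_FIELDS` and `schema_problems` are hypothetical names.

```python
# Required field names mapped to the Notion property type each should have.
REQUIRED_FIELDS = {
    "Title": "title",
    "Source ID": "select",
    "Published": "date",
    "Source": "url",
}


def schema_problems(properties: dict[str, dict]) -> list[str]:
    """Describe every required field that is missing or has the wrong type."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        prop = properties.get(name)
        if prop is None:
            problems.append(f"missing field: {name}")
        elif prop.get("type") != expected:
            problems.append(
                f"{name}: expected type {expected}, found {prop.get('type')}"
            )
    return problems


# Example: "Published" was created as plain text and "Source" is absent.
sample = {
    "Title": {"type": "title"},
    "Source ID": {"type": "select"},
    "Published": {"type": "rich_text"},
}
problems = schema_problems(sample)
```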
How to Use PDF Attachments
To make sure PDFs are processed and included in the load process, you need to do the following.
- Create checkbox field called “PDF”
RULES
- Each Notion database page can have 0 or 1 PDF attachment.
- For a Notion database page with 1 PDF attachment, the PDF should be the first content after the title, and the PDF checkbox field should be true.
- For a Notion database page with no PDF attachment, the PDF checkbox field should be false.
I also like to include a link to the original source of the PDF after the PDF attachment.
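The PDF rules above can be expressed as a small check. This is a simplified sketch, not the loader's own logic: it assumes each page has been reduced to its “PDF” checkbox value plus the ordered list of its content block types, with a PDF attachment appearing as the type `"pdf"`. The function name `pdf_rule_violations` is hypothetical.

```python
def pdf_rule_violations(pdf_checked: bool, block_types: list[str]) -> list[str]:
    """Return the PDF rules that a single Notion page breaks.

    pdf_checked  value of the page's "PDF" checkbox field
    block_types  types of the page's content blocks, in page order
    """
    violations = []
    pdf_count = block_types.count("pdf")
    # Rule: a page may hold 0 or 1 PDF attachment.
    if pdf_count > 1:
        violations.append("more than one PDF attachment")
    # Rule: the PDF must be the first content after the title.
    if pdf_count >= 1 and block_types[0] != "pdf":
        violations.append("PDF is not the first content after the title")
    # Rule: the checkbox is true exactly when the page has a PDF.
    if pdf_checked != (pdf_count >= 1):
        violations.append("PDF checkbox does not match the page content")
    return violations


good_page = pdf_rule_violations(True, ["pdf", "bookmark", "paragraph"])
bad_page = pdf_rule_violations(False, ["paragraph", "pdf"])
```

Running a check like this over all pages before a load makes it easier to spot pages whose PDF would be skipped.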
Notion Query Filter
Before loading, check that the following “Notion query filter” is appropriate for your database.
# the default filter from Notion database is found in repo notion-utils at
MyNotionDBLoader.py
In my case, I only want to load records that have Pub=True AND Status=“Reviewed”.
For example, if you want to load all records (no filter), then
# change query filter for fetching ALL records
...
QUERY_DICT = {
    "page_size": 100
}
...
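For comparison, a filtered query such as Pub=True AND Status=“Reviewed” might look like the sketch below. It follows the filter format of the Notion API's database query endpoint, but the details are assumptions: that “Pub” is a checkbox property, that “Status” is a select property (a Notion status-type property would use the key `"status"` instead), and that the repo's QUERY_DICT accepts this shape. The helper `build_query` is hypothetical.

```python
def build_query(pub: bool = True, status: str = "Reviewed", page_size: int = 100) -> dict:
    """Build a Notion database query that combines two conditions with AND."""
    return {
        "filter": {
            "and": [
                # Assumes "Pub" is a checkbox property.
                {"property": "Pub", "checkbox": {"equals": pub}},
                # Assumes "Status" is a select property.
                {"property": "Status", "select": {"equals": status}},
            ]
        },
        "page_size": page_size,
    }


QUERY_DICT = build_query()
```

Adjust the property names and types to match your own database before using a filter like this.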
Load Command
# start loading qdrant vector database with embeddings from Notion database
cd $MYHOME/notion-load
source .venv/bin/activate
python3 notion_load/load.py notion qdrant -r -v
RUN RAG CHAT WEB APP
cd $MYHOME/notion-chat
source .venv/bin/activate
cd notion_chat
streamlit run chat.py
This will open a Streamlit app in your default browser.
Settings Panel
- Reset chat — (Button) — starts a new chat thread.
- Model — (Select) — pick an OpenAI LLM.
- Temperature — (Number from 0 to 1) — sets the temperature, which governs the randomness, and thus the creativity, of the responses.
- K — (Number from 1 to 10) — sets k, the number of nearest vectors (your text embeddings) that match a query. This number determines how many “Top Sources for Answers” are used.
- Search Type — (Select) — Similarity or MMR (Maximal Marginal Relevance). MMR tries to reduce the redundancy of results while at the same time maintaining query relevance.
- Verbose Logging — (Checkbox) — enables additional logging.
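To show how the K and Search Type settings interact, here is a minimal sketch of MMR selection in plain Python. It is not the code the app or its libraries run; the vectors are toy two-dimensional embeddings, and the names `cosine`, `mmr`, and `lambda_mult` (the relevance versus diversity weight) are hypothetical.

```python
from math import sqrt


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: list[float], candidates: list[list[float]], k: int,
        lambda_mult: float = 0.5) -> list[int]:
    """Return the indices of up to k candidates chosen by MMR.

    Each round picks the candidate with the best trade-off between
    similarity to the query and similarity to results already chosen.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = remaining[0], float("-inf")
        for i in remaining:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
vectors = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
diverse = mmr(query, vectors, k=2)                    # MMR-style selection
similar = mmr(query, vectors, k=2, lambda_mult=1.0)   # pure similarity ranking
```

With `lambda_mult=1.0` the redundancy term vanishes and the result is a plain top-k similarity search, which is why the two calls above can return different second results.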
Top Sources for Answer Panel
REFERENCES
- Project repos: notion-utils, notion-load, notion-chat
- Retrieval Augmented Generation — Pinecone
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- qdrant — vector database
- Notion API
- AI Researchers
- RAG — Retrieval Augmented Generation
- What is Retrieval Augmented Generation (RAG)
- LLM — large language model. In our case, we use OpenAI.
- OpenAI