Creating a Custom AI RAG from your Notion Database: OpenAI, Python, LangChain, Notion, Qdrant, Streamlit

Feb 9, 2024

INTRODUCTION

What is Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique for retrieving context to include in prompts to Large Language Models (LLMs). RAG starts by searching a collection of documents, which may contain text or images, for content that is relevant to a query. The retrieved text and/or image context is then added to the prompt. This lets you ask questions that rely on information the model does not otherwise have access to. For more information, see this article: What is Retrieval Augmented Generation (RAG).
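
To make the retrieve-then-prompt idea concrete, here is a minimal sketch in plain Python. It is not the code used in this project: word counts stand in for real embeddings, and the function names (tokenize, retrieve, build_prompt) and the sample documents are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


# Made-up sample documents; a real system would search embedded Notion pages.
documents = [
    "Qdrant is a vector database that stores embeddings.",
    "Streamlit turns Python scripts into web apps.",
    "Notion databases hold pages with properties.",
]
question = "Which database stores embeddings?"
prompt = build_prompt(question, retrieve(question, documents, k=1))
print(prompt)
```

In a real RAG setup, the word-count vectors would be replaced by embeddings from a model, and the sorted list by a vector database query.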

Our RAG Architecture with Notion

We will use OpenAI for our LLM.

Next, we will review the Python packages that implement this solution.

notion-utils — uses the Notion API to fetch data from a Notion database.
notion-load — (a) uses notion-utils to fetch data from the Notion database; (b) creates embeddings; (c) loads the embeddings into the qdrant vector database.
notion-chat — provides the app that calls the LLM and the qdrant vector database to answer questions.

Process to load Notion data into qdrant as embeddings

Chat with OpenAI, using our embeddings to provide context

SETUP

The setup will assume the following local prerequisites.

Python is installed
Poetry for Python is installed

Setup qdrant

Create a free qdrant account. Save the following for later use:
- Cluster name (QDRANT_COLLECTION_NAME)
- Cluster URL (QDRANT_URL)
Create an API key (QDRANT_API_KEY) and save for later.

Setup Notion

Create a Notion integration API key (NOTION_TOKEN) and save the value for later.
Enable your Notion database to use your API key
Find and save the ID for your Notion database
- Click on your Notion database in browser
- Copy the Notion database ID (NOTION_DATABASE_ID) from the browser URL and save it for later. For example,

https://www.notion.so/johntday/db0ee43b0572d7c9ad97d8dd57ff34a3?v=e16d4dba585746de9c067f4c32c0b020

In my case, the Notion database ID is db0ee43b0572d7c9ad97d8dd57ff34a3
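
If you prefer not to copy the ID by hand, the same extraction can be sketched in a few lines of Python. The helper name notion_database_id is hypothetical and not part of the project repos; it assumes the ID is the 32 hexadecimal characters at the end of the last URL path segment, as in the example URL above.

```python
import re
from urllib.parse import urlparse


def notion_database_id(url: str) -> str:
    """Return the 32-character hex ID from the last path segment of a Notion URL."""
    last_segment = urlparse(url).path.rstrip("/").split("/")[-1]
    # Take only the trailing 32 hex characters, in case the segment carries a prefix.
    match = re.search(r"([0-9a-f]{32})$", last_segment)
    if match is None:
        raise ValueError(f"No Notion database ID found in URL: {url}")
    return match.group(1)


url = "https://www.notion.so/johntday/db0ee43b0572d7c9ad97d8dd57ff34a3?v=e16d4dba585746de9c067f4c32c0b020"
print(notion_database_id(url))  # prints db0ee43b0572d7c9ad97d8dd57ff34a3
```

The value after "?v=" identifies the database view, not the database, so it is ignored here.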

Setup OpenAI

Create an OpenAI account.
Create an API key (OPENAI_API_KEY) and save for later.

Setup Python

# STEP 0
# --------------------------------------------------------------------
# navigate to a good place to clone the repos
# save this dir in an exported variable for later use
export MYHOME=$(pwd)


# CLONE REPOS and SETUP PYTHON ENV
# --------------------------------------------------------------------
# the first 2 repos just contain the code dependencies for the last repo
cd $MYHOME
git clone https://github.com/johntday/notion-utils.git
# contains utils for using Notion API
cd notion-utils
poetry install

cd $MYHOME
git clone https://github.com/johntday/notion-load.git
# contains qdrant client and loader script
cd notion-load
poetry install

cd $MYHOME
git clone https://github.com/johntday/notion-chat.git
# contains app
cd notion-chat
poetry install


# UPDATE ENV VARIABLES
# --------------------------------------------------------------------
# add app keys to ".env" file for each project
# for example
cd $MYHOME/notion-utils
cp .env.example .env
# update ".env" with your information
# NOTE: in my setup, I only have one ".env" file. Symbolic links provide access for each project


# SIMPLE NOTION DATABASE CONNECTION TEST
# --------------------------------------------------------------------
cd $MYHOME/notion-utils && source .venv/bin/activate
python3 notion_utils/test_notion_utils.py
# this will print a count of the records in your Notion database
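
Before running the loader, it can help to confirm that every key collected during setup is actually set. The following is a minimal sketch, not part of the project repos: the variable names come from the setup steps above, while the helper missing_settings is hypothetical. It reads the process environment, so the values from your ".env" file need to be exported or otherwise loaded first.

```python
import os

# Variable names as listed in the setup steps above.
REQUIRED_SETTINGS = (
    "QDRANT_COLLECTION_NAME",
    "QDRANT_URL",
    "QDRANT_API_KEY",
    "NOTION_TOKEN",
    "NOTION_DATABASE_ID",
    "OPENAI_API_KEY",
)


def missing_settings(env) -> list[str]:
    """Return the required names that are absent or blank in the given mapping."""
    return [name for name in REQUIRED_SETTINGS if not str(env.get(name, "")).strip()]


if __name__ == "__main__":
    missing = missing_settings(os.environ)
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```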

LOADING qdrant FROM NOTION DATABASE

Here is an example of one of my Notion databases.

Example metadata for my Notion “Hybris” database

Required Fields for Your Notion Database

The following fields are required.

Title — (“name” field) — the default “name” field, renamed to “Title”. WHY: the app uses this to provide a title and link for the Notion pages it used.
Source ID — (Select) — source tag. For example, I use this to identify the domain of the source article. WHY: the app will use this for filtering in the future.
Published — (Date) — publish date of the Notion page. WHY: the app uses this to show the publish date of the Notion pages it used.
Source — (URL) — URL to the Notion page. WHY: the app uses this to provide a link to the Notion pages it used.
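
The required fields above can be checked programmatically. Here is a rough sketch, assuming you already have the "properties" object of your database as returned by the Notion API (a mapping from field name to an object with a "type" key). The helper field_problems and the sample data are hypothetical and not part of the project repos.

```python
# Expected property types, using the type names from the Notion API.
REQUIRED_FIELDS = {
    "Title": "title",
    "Source ID": "select",
    "Published": "date",
    "Source": "url",
}


def field_problems(properties: dict) -> list[str]:
    """Compare a database's properties against REQUIRED_FIELDS and describe any mismatch."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        actual = properties.get(name)
        if actual is None:
            problems.append(f"missing field: {name}")
        elif actual.get("type") != expected_type:
            problems.append(f"{name}: expected {expected_type}, found {actual.get('type')}")
    return problems


# Hand-written sample shaped like the "properties" object of a Notion database.
sample = {
    "Title": {"type": "title"},
    "Source ID": {"type": "select"},
    "Published": {"type": "rich_text"},
}
print(field_problems(sample))
```

For the sample above, the check reports that "Published" has the wrong type and that "Source" is missing.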

How to Use PDF Attachments

To make sure PDFs are processed and included in the load, do the following.

Create a checkbox field called “PDF”.

RULES

Each Notion database page can have 0 or 1 PDF attachment.
For a Notion database page with 1 PDF attachment, the PDF should be the first content after the title, and the PDF checkbox field should be true.
For a Notion database page with no PDF attachment, the PDF checkbox field should be false.

I also like to include a link to the original source of the PDF after the PDF attachment.
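
The PDF rules can be expressed as a small check. This is a sketch only: it assumes a simplified view of a page, namely the value of the “PDF” checkbox plus the ordered list of content block types, where "pdf" is the Notion API block type for a PDF attachment. The helper pdf_rule_violations is hypothetical and not part of the loader.

```python
def pdf_rule_violations(pdf_checked: bool, block_types: list[str]) -> list[str]:
    """Check one page against the PDF rules; block_types lists the page content in order."""
    pdf_count = block_types.count("pdf")
    violations = []
    if pdf_count > 1:
        violations.append("page has more than one PDF attachment")
    if pdf_count >= 1 and block_types[0] != "pdf":
        violations.append("PDF is not the first content block")
    if pdf_count >= 1 and not pdf_checked:
        violations.append("PDF checkbox should be true")
    if pdf_count == 0 and pdf_checked:
        violations.append("PDF checkbox should be false")
    return violations


# A page with the PDF first, followed by a link to its origin: no violations.
print(pdf_rule_violations(True, ["pdf", "bookmark"]))
# A page whose PDF is not first and whose checkbox is unset: two violations.
print(pdf_rule_violations(False, ["paragraph", "pdf"]))
```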

Example of Notion database page with PDF attachment

Notion Query Filter

Before loading, check that the following “Notion query filter” is appropriate for your database.

# the default filter for the Notion database is found in repo notion-utils, in
MyNotionDBLoader.py

Query filter when fetching records from Notion database

In my case, I only want to load records that have Pub=True AND Status=“Reviewed”.

For example, if you want to load all records (no filter), then use:

# change query filter for fetching ALL records
...
QUERY_DICT = {
  "page_size": 100
}
...
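
For comparison, a filtered query such as Pub=True AND Status=“Reviewed” could look roughly like the sketch below, written in the Notion API's filter format. The property types are assumptions: "Pub" is treated as a checkbox and "Status" as a select. The default filter in MyNotionDBLoader.py may be written differently, so treat this as an illustration and not as the repo's code.

```python
# Assumed property names and types: "Pub" as a checkbox and "Status" as a select.
# If "Status" uses Notion's dedicated status type, replace the "select" key with "status".
QUERY_DICT = {
    "filter": {
        "and": [
            {"property": "Pub", "checkbox": {"equals": True}},
            {"property": "Status", "select": {"equals": "Reviewed"}},
        ]
    },
    "page_size": 100,
}
```

The "and" list requires every condition to hold; the Notion API also accepts "or" for conditions where any one match is enough.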

Load Command

# start loading qdrant vector database with embeddings from Notion database
cd $MYHOME/notion-load
source .venv/bin/activate
python3 notion_load/load.py notion qdrant -r -v

RUN RAG CHAT WEB APP

cd $MYHOME/notion-chat
source .venv/bin/activate
cd notion_chat
streamlit run chat.py

This will open a Streamlit app in your default browser.

Settings Panel

Reset chat — (Button) — starts a new chat thread.
Model — (Select) — pick an OpenAI LLM.
Temperature — (Number from 0 to 1) — picks temperature. Temperature governs the randomness and thus creativity of the responses.
K — (Number from 1 to 10) — picks k. K is the number of nearest vectors (your text embeddings) that are returned as matches for a query. This number determines the number of “Top Sources for Answers” that are used.
Search Type — (Select) — Similarity or MMR (Maximal Marginal Relevance). MMR tries to reduce the redundancy of results while at the same time maintaining query relevance.
Verbose Logging — (Checkbox) — enables additional logging.
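
To illustrate how the K and Search Type settings interact, here is a small self-contained sketch of MMR selection next to plain similarity search. It is a toy implementation on hand-made 2D vectors, not the code the app or its libraries run; the function names and the lambda_mult weighting parameter are chosen for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query: list[float], docs: list[list[float]], k: int, lambda_mult: float = 0.5) -> list[int]:
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected: list[int] = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            # Reward relevance to the query, penalize similarity to already chosen documents.
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0]
docs = [
    [1.0, 0.2],  # most relevant
    [1.0, 0.1],  # near-duplicate of the first
    [0.0, 1.0],  # less relevant, but covers a different direction
]
similarity_top2 = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)[:2]
print(similarity_top2)        # prints [0, 1]
print(mmr(query, docs, k=2))  # prints [0, 2]
```

With k=2, similarity search returns the two near-duplicates, while MMR keeps the most relevant document and swaps the duplicate for the one that adds new information.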

REFERENCES

Project repos: notion-utils, notion-load, notion-chat
Retrieval Augmented Generation — Pinecone
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
qdrant — vector database
Notion API
AI Researchers
RAG — Retrieval Augmented Generation
What is Retrieval Augmented Generation (RAG)
LLM — large language model. In our case, we use OpenAI.
OpenAI