Retrieval Augmented Generation (RAG) with pgvector vector database

Yogesh
Jan 17, 2024


Introduction

Using a Large Language Model (LLM) out of the box has very limited value for an organization. The LLM needs to be adapted to the outcomes the organization expects. There are a few options to make the model fit for use:
- Prompt tuning
- Fine tuning
- RAG
- Reinforcement Learning From Human Feedback (RLHF)

This document focuses on RAG in an enterprise setting, i.e. the data (queries and answers) stays in-house within the organization. RAG is a cost-efficient approach to increase the relevancy, accuracy, and trustworthiness of answers.
Read more on RAG here

Good RAG depends on enriching the query with relevant context. Good context is generated using semantic search instead of conventional keyword search. In this example, the pgvector vector database is used to store vectors and perform semantic search.

pgvector is an open source vector similarity search extension for the PostgreSQL database. PostgreSQL is an enterprise-grade database, widely used and supported by enterprise IT teams, so a pgvector-based RAG implementation is unlikely to run into major roadblocks with enterprise IT.

pgvector is scalable and performant. In the overall picture, any millisecond differences compared with other vector DBs are mostly shadowed by the latency of the LLM.

Build an end to end application

Application setup

Ingest your enterprise data and build a chat application that users can query to get relevant product descriptions.

Our example application ingests HTML documents from the Cloudera docs and provides relevant answers about specific product components.

High level application setup steps

  1. Install and configure the vector database
  2. Crawler to download sitemap and html files
  3. Extract text from html, pdf, docx, xlsx, csv, and text files
  4. Vectorize and upload text to the vector database
  5. Answer queries with the model, with RAG, without RAG, or both

You can view the source code for the application in this repository

1. Install and configure pgvector

a. Assuming a 2 GB database server on Ubuntu, here is the PostgreSQL 14.10 installation:

# Refer: https://www.postgresql.org/download/linux/ubuntu/
sh -c 'echo "deb https://apt.postgresql.org/pub/repos/apt $(lsb_release -cs 2>/dev/null)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

# Import the repository signing key:
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | apt-key add -

# Update the package lists:
apt-get update

# Install PostgreSQL 14 and the matching -dev package (required to build pgvector):
apt-get -y install postgresql-14 postgresql-server-dev-14

echo "host all all 192.168.0.0/16 scram-sha-256" > /etc/postgresql/14/main/pg_hba.conf


sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/" /etc/postgresql/14/main/postgresql.conf
sed -i "s/shared_buffers = 128MB/shared_buffers = 256MB/" /etc/postgresql/14/main/postgresql.conf
sed -i "s/#maintenance_work_mem = 64MB/maintenance_work_mem = 512MB/" /etc/postgresql/14/main/postgresql.conf
sed -i "s/#jit = on/jit = off/" /etc/postgresql/14/main/postgresql.conf

systemctl restart postgresql

Note:
> The -dev package (e.g. postgresql-server-dev-14) is required to build pgvector; the extension itself is then compiled and installed from source (git clone, make, make install) as described in the pgvector README.
> jit is disabled (jit = off) for performance; we don't have long-running analytical queries.
> Configure shared_buffers and maintenance_work_mem to suit your application requirements, e.g. for high-volume semantic searches and text ingestion, use a higher-capacity server with larger values.

b. Configure pgvector database

echo "create database ragdb;
create user ragu with encrypted password '<yourpassword>';
grant all privileges on database ragdb to ragu;

\c ragdb
CREATE EXTENSION if not exists vector;
CREATE TABLE t_documents (id bigserial PRIMARY KEY, doc_name varchar(256),
created_at timestamp default now());

CREATE TABLE t_document_chunks (id bigserial PRIMARY KEY,
doc_id bigserial not null references t_documents(id),
chunk jsonb,
embedding vector(384),
created_at timestamp default now());

CREATE INDEX ON t_document_chunks USING hnsw (embedding vector_ip_ops) WITH
(m = 16, ef_construction = 128);
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA PUBLIC to ragu;
GRANT ALL ON ALL SEQUENCES IN SCHEMA PUBLIC to ragu;
\q
" > pgvector.sql

su -c 'psql < pgvector.sql' postgres

Note:
> ef_construction = 128 (default 64) is set for better recall during semantic (similarity) search.
> vector_ip_ops: we are using the inner product (instead of cosine) for a slight performance gain. It means the vectors we store need to be normalized; we will come back to this later.
> We are storing all the sentences of a chunk as a JSON object in the database.
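Why does normalization make the inner product safe to use? For unit-length vectors the inner product equals the cosine similarity, so the cheaper operator produces the same ranking. A small Python sketch (the vectors are arbitrary illustrative values):

import numpy as np

a = np.array([0.3, 0.2, 0.9])
b = np.array([0.1, 0.8, 0.4])

# Normalize both vectors to unit length (L2 / Frobenius norm)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a_n, b_n)

# For normalized vectors, inner product == cosine similarity
print(np.isclose(cosine, inner))  # True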

2. Crawler to download sitemap and html files

The crawler is written with Scrapy. lxml and BeautifulSoup are used for parsing XML and HTML files.

Refer to the files sitemap_spider.py and cdphtmldocs_spider.py in the repo.
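For illustration, a minimal sitemap-driven spider could look like the sketch below; the class name, output directory, and sitemap URL are assumptions, not the actual spiders in the repo.

from pathlib import Path
from scrapy.spiders import SitemapSpider

class DocsSitemapSpider(SitemapSpider):
    name = "docs_sitemap"
    # Assumed sitemap location; point this at the site you want to crawl
    sitemap_urls = ["https://docs.cloudera.com/sitemap.xml"]

    def parse(self, response):
        # Save each crawled page locally for the text extraction step
        outdir = Path("html_files")
        outdir.mkdir(exist_ok=True)
        name = response.url.rstrip("/").split("/")[-1] or "index"
        (outdir / name).with_suffix(".html").write_bytes(response.body)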

3. Extract text from html files

BeautifulSoup is used for text extraction. Feel free to choose any other library for HTML parsing and text extraction.

Refer to txtfrmfl.py in the repository.

from bs4 import BeautifulSoup as bs
with open(fname, 'rt', encoding='utf-8', errors='replace') as html_fl:
    html = bs(html_fl, "html.parser")
    ...
    ...

We perform basic text preprocessing and cleanup; see the two functions below.

alltxts = self.get_parsed_lines(html.text.splitlines())
self.default_filters(alltxts)

The texts are then stored with a corresponding .txt extension in a separate folder:

with open(Path(_TEXTDIR, fldr.name, fl_w), 'w') as wfl:
    for snt in ptxt.sents:
        wfl.write(f"{str(snt).strip()}\n")
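The ptxt object above is a parsed document exposing a .sents iterator. A minimal sketch of producing it with spaCy (the model name is an assumption; the repo may use a different pipeline):

import spacy

# Small English pipeline; any pipeline with a parser or sentencizer works
nlp = spacy.load("en_core_web_sm")

def split_sentences(text: str):
    """Yield stripped sentences from raw extracted text."""
    ptxt = nlp(text)
    for snt in ptxt.sents:
        yield str(snt).strip()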

Note: the packages below are used for text extraction from other file formats (a short pdfminer-six sketch follows the list):

  • pdfminer-six: pdf files
  • Custom code (unzip, read xml): docx files
  • openpyxl: xlsx files
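As an example, extracting text from a PDF with pdfminer-six can be as simple as the sketch below (the file name is hypothetical; the repo's extraction code may differ):

from pdfminer.high_level import extract_text

pdf_text = extract_text("product-guide.pdf")  # hypothetical input file
for line in pdf_text.splitlines():
    line = line.strip()
    if line:
        print(line)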

There is no crawler to download and save files in other formats. But if those files are found in the input directory, they will be processed.

There is also a bit of code to identify the type of file, in case the file does not have an extension.

ftype = txtext.guess_filetype(rfl)
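A minimal illustration of magic-byte based detection is sketched below; the actual guess_filetype in the repo may use a different approach (for example the python-magic library).

def guess_filetype(path: str) -> str:
    # Sniff the first few bytes instead of trusting the file extension
    with open(path, "rb") as fl:
        head = fl.read(16)
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # docx / xlsx are zip containers
    if head.lstrip().lower().startswith((b"<!doctype", b"<html")):
        return "html"
    return "text"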

Important: Do not trust files downloaded from external (or internal) sites. Virus-scan the files before using them for text extraction.

4. Vectorize and upload text to the pgvector vector database

This and the next section are the interesting parts of the application.

LLMs can have large embedding dimensions (1024, 4096, etc.). Storing, indexing, and querying a vector database with large dimensions is resource intensive and affects performance. Hence, choose a separate model with a smaller embedding dimension (384 here) for encoding the text.

Refer to the MTEB leaderboard and choose the model that fits best.
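For example, all-MiniLM-L6-v2 from sentence-transformers produces 384-dimension embeddings, matching the vector(384) column created earlier (the specific model here is only an illustration):

from sentence_transformers import SentenceTransformer

# 384-dimension embedding model; matches the vector(384) column in t_document_chunks
emb_mdl = SentenceTransformer("all-MiniLM-L6-v2")
print(emb_mdl.get_sentence_embedding_dimension())  # 384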

We read the text files, chunk them, vectorize, normalize, and then store them in the vector database.

Refer to save_embeddings_to_db in coreutils.py.

txtchunk = f"{txtchunk} {txt}"
txtlst.append(txt)
if len(txtchunk.split()) >= _MAX_TKNLEN:
    embeddings = self.emb_mdl.encode(txtchunk)
    # Normalizing the embeddings, just in case
    # default is Frobenius norm
    # https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html
    fnorm = np.linalg.norm(embeddings)
    lst = list(embeddings / fnorm)
    # json supports only np.float64. Convert np.float32
    embed_str = json.dumps(lst, default=np.float64)
    _ = self.dbexec(self.dbo_stmts['ins_txt'],
                    (docid[0][0], json.dumps(txtlst), embed_str),
                    "Insert chunk into Document")

We chunk so that we stay within the model's sequence length. Why do we normalize before storing in the vector DB? See the note above in step 1.b.

We are storing all the individual sentences as a list, json.dumps(txtlst), in the database. This helps in sentence deduplication during context building. See below.

5. Model answers queries with RAG

When a user submits a query, the query is encoded into an embedding and normalized. Refer to the function get_similar_texts in coreutils.py.

embeddings = self.emb_mdl.encode(text)
# Normalize before querying the DB
fnorm = np.linalg.norm(embeddings)
lst = list(embeddings/fnorm)
# json supports only np.float64. Convert np.float32
embed_str = json.dumps(lst, default=np.float64)

The vector database is then queried to find similar texts. The query submitted to the database is:

 "sim_txts":f"SELECT doc_id, chunk FROM t_document_chunks \
ORDER BY embedding <#> %s LIMIT {_MAX_SIM_TXTS}"
sim_txts = self.dbexec(self.dbo_stmts['sim_txts'], (embed_str,), "Get similar texts")

We are using the inner product operator <#>. Why inner product? See the note above in step 1.b.

We perform a few more steps to build the context.

  1. We will be retrieving a few document chunks that are similar to our query. Those chunks may have overlapping sentences, hence deduplication is required. The less noise in the context, the better the answers from the LLM. So we iterate through the JSON chunk and filter out duplicate sentences; sentence order is maintained.
all_txts = []
contxt = ''
# Avoid duplicate sentences, less noise in context is better for LLM response
for itm in sim_txts:
    for txt in itm[1]:
        if txt not in all_txts:
            all_txts.append(txt)
            contxt = f"{contxt} {txt}"

2. We restrict the context length (total tokens). This has the benefit of reducing processing costs; more tokens require more GPU compute and memory and can lead to OOM errors on smaller GPUs.

if len(contxt.split()) >= _MAX_TKNLEN*_MAX_SIM_TXTS:
    break

Important: Good context is the key to accurate responses from the LLM in an enterprise setting. Above are two simple approaches; there are other techniques that can be used as well, collectively known as advanced RAG. Refer here and here for more details. We can choose to implement other advanced RAG techniques depending on our use cases.
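Once the context is assembled, it is combined with the user query into a prompt and sent to the LLM. A minimal sketch of such a prompt template (the wording and the generate call are assumptions, not the repo's exact code):

def build_rag_prompt(contxt: str, query: str) -> str:
    # Instruct the model to stay within the retrieved context
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{contxt}\n\n"
        f"Question: {query}\nAnswer:"
    )

# prompt = build_rag_prompt(contxt, user_query)
# answer = llm.generate(prompt)  # hypothetical LLM call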

Results

Refer to the Jupyter notebook, LLM-RAG, in the repo for the Gradio widget.
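A minimal sketch of wiring the query flow into a Gradio widget (the notebook's actual widget and helper names may differ):

import gradio as gr

def answer_query(query: str) -> str:
    # Embed + normalize the query, fetch similar chunks from pgvector,
    # build the prompt, and ask the LLM, as in the steps above
    contxt = get_context_for(query)  # hypothetical helper
    return ask_llm(build_rag_prompt(contxt, query))  # hypothetical helpers

demo = gr.Interface(fn=answer_query, inputs="text", outputs="text",
                    title="Product docs Q&A")
demo.launch()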

Here are 2 question-answer results

Observe the product details in the answer when context is provided.

Final Note: In this example application, only a few documents were used; texts were extracted, stored in the vector DB, and the context was enriched using similarity search to answer our questions. The model performed reasonably well.
