Simplifying RAG with PostgreSQL and PGVector

Levi Stringer
8 min read · Jan 23, 2024


Source code: https://github.com/levi-katarok/simplified-rag

Building an application powered by Retrieval Augmented Generation (RAG) can be difficult, time-consuming, and expensive. After spending a lot of time in the LLM space, you begin to crave simplicity and avoid bloated libraries. When building a RAG application, I wanted only the necessities to get started. To achieve this I’ve begun using PGVector inside PostgreSQL: a tried-and-true database that is ideal if, like me, you don’t want to manage several database providers and just want to get started quickly. These are the preliminary steps in getting the right context to give to an LLM. Here is a fantastic article if you’re unfamiliar with the concepts of RAG.

Goal

Develop a basic application that:

  1. Extracts text from documents (e.g., PDFs).
  2. Converts text into embeddings using a pre-trained model.
  3. Stores embeddings in a PostgreSQL database using PG Vector.
  4. Queries the database to find documents whose embeddings are most similar to the embedding of a query text.

This post outlines the process of transforming textual content from PDF documents into vectorized form and querying it for similarity, using PostgreSQL with PG Vector and SQLAlchemy.

This will be useful for those who prefer sticking to a single database like PostgreSQL and avoiding the overhead of juggling multiple database systems. We’ll be exploring a straightforward example that demonstrates how to create embeddings from text data and utilize PostgreSQL for efficient querying. A simplified overview can be seen below.

Interaction between components (Application, PostgreSQL, OpenAI)

Setup and Requirements

  • PostgreSQL Database: Make sure you have PostgreSQL installed and running. This method assumes you’re working with a local or remote PostgreSQL instance where you have the privileges to install extensions and create tables. Ensure you have PG Vector installed in your PostgreSQL database; this usually requires superuser access. To enable the extension (or verify it is already available), connect to your PostgreSQL database and run:
CREATE EXTENSION IF NOT EXISTS vector;
  • Python Environment: A Python environment (preferably Python 3.6 or newer) is essential for running the scripts. I highly recommend using a virtual environment for your project to manage dependencies efficiently and avoid conflicts with system-wide packages. If you’re unfamiliar with Python virtual environments, they allow you to create isolated spaces on your machine, tailor-made for your project’s requirements. To set up a virtual environment, run:
python3 -m venv myenv
source myenv/bin/activate # On Windows, use `myenv\Scripts\activate`
  • Libraries and Modules:
  • sqlalchemy for interacting with the PostgreSQL database.
  • pgvector.sqlalchemy for integrating PG Vector with SQLAlchemy.
  • openai for accessing OpenAI’s embedding models for vectorization. (Read the embedding model description below.)
  • pypdf for reading PDF documents.
  • pandas for handling data manipulations.
  • numpy for numerical operations.

Note: You can do this with simple Python lists instead of pandas and numpy. In general, NumPy arrays use less memory than Python lists for numerical calculations. We aren’t doing any calculations on our embeddings here; I’ve just gotten into the habit of using them.

  • Embedding Model: I’ve used OpenAI’s text-embedding-ada-002 model. You’ll need an API key from OpenAI to generate embeddings for this example; you can sign up and get an OpenAI API key here. Note that this process will work with any embedding model you like. The choice of embedding model, such as OpenAI’s text-embedding-ada-002, affects the dimensionality and quality of your data embeddings, so select a model that aligns with your application’s requirements for the best results. Other paid options include Cohere, Hugging Face, and Google’s Vertex AI, and there is a list of open-source ones here.

To install these libraries:

pip install numpy pandas sqlalchemy pgvector openai pypdf
  • When using embedding models, it’s vital to adjust the N_DIM parameter to match the dimensionality of the embeddings produced by your chosen model. For OpenAI's text-embedding-ada-002 model, the dimension is 1536. This information is crucial when defining your database schema to store embeddings correctly. If you opt for a different model, consult the model's documentation to find the correct dimensionality and adjust the N_DIM in your code accordingly. This ensures that your database is correctly set up to store the vector data without loss or truncation.
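
If you’re ever unsure of a model’s output dimension, a quick sanity check is to generate a single embedding and inspect its length. A minimal sketch, assuming the v1.x OpenAI Python client and a placeholder API key:

from openai import OpenAI

client = OpenAI(api_key="your_openai_api_key")

# Generate one embedding and check its dimensionality
sample = client.embeddings.create(input="hello world", model="text-embedding-ada-002")
print(len(sample.data[0].embedding))  # 1536 for text-embedding-ada-002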

Troubleshooting Common Issues

As you embark on setting up your RAG application, you might encounter a few common hurdles:

  1. PG Vector Installation Issues: If you run into problems while installing PG Vector, ensure you have the necessary permissions on your PostgreSQL instance. Sometimes, cloud-hosted databases require specific steps to enable extensions. Check your hosting provider’s documentation for details.
  2. Python Dependency Conflicts: In case of conflicts between installed libraries, ensure that you’re using a virtual environment as recommended. If issues persist, try updating your packages to their latest versions, or consult the package documentation for compatibility information.
  3. API Key Security: Keep your OpenAI API key secure. Avoid hard-coding it into your scripts; instead, use environment variables or secure vaults to store sensitive information, as shown in the snippet after this list. Don’t push your API key to GitHub!
  4. Database Connection Errors: Ensure your database connection strings are correct. Connection issues often arise from incorrect credentials, hostnames, or firewall settings blocking access to your database.
  5. Model Dimensionality Mismatch: If you encounter errors related to the size of the data being inserted into the database, double-check the N_DIM parameter against your embedding model's output dimensionality. Mismatches here can lead to data truncation or insertion failures.
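
On the API key point (issue 3), a minimal sketch of reading the key from an environment variable rather than hard-coding it. The v1.x OpenAI Python client will also pick up OPENAI_API_KEY automatically if no key is passed:

import os
from openai import OpenAI

# Read the key from the environment rather than embedding it in source
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])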

Part 1: Extracting Text

For this example I’ve chosen to demonstrate how to extract text from a PDF. After Part 1, Step 1, the process remains exactly the same for text extracted from any document format: .txt, .docx, .html, etc. So the first part of our process involves reading a PDF document and extracting its text. This is a crucial step, as the extracted text will be used for generating embeddings.

Step 1: Reading a PDF Document

We’ll use the pypdf library to read a PDF document from your local machine. PDF parsing has been a battleground for a long time, and there are many different libraries that can do it (this post outlines a few). I chose pypdf as I found it had the largest community support.

from pypdf import PdfReader

def read_pdf(file_path):
    pdf_reader = PdfReader(file_path)
    text = ""

    for page in pdf_reader.pages:
        extracted_text = page.extract_text()
        if extracted_text:  # Check if text was extracted successfully
            text += extracted_text + "\n"  # Append the text of each page

    return text

# Example usage
pdf_text = read_pdf("path_to_your_pdf.pdf")
print(pdf_text)

Step 2: Splitting Text into Manageable Chunks

After extracting the text, we’ll split it into smaller, manageable chunks. This is particularly useful for large documents, as embedding generation typically has a limit on the text length. For this tutorial I’ve just used a naive chunking strategy; this article by Roie Schwaber-Cohen outlines why and when different strategies should be used. TL;DR: it will take some experimentation to figure out the best way to chunk your documents.

def split_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    end = 0

    while end < len(text):
        end = start + chunk_size
        if end > len(text):
            end = len(text)
        chunks.append(text[start:end])
        start = end - overlap  # Overlap chunks so context isn't cut off abruptly

    return chunks

# Example usage
text_chunks = split_text(pdf_text)

Part 2: Generating Text Embeddings

After splitting the text into manageable chunks, we’ll generate embeddings for each chunk using OpenAI’s text embedding model. This involves sending the chunks to the model and receiving a vector representation for each. The vector dimension for OpenAI’s text-embedding-ada-002 model is 1536.

from openai import OpenAI

# The OpenAI Python SDK (v1.x and later) uses a client object
client = OpenAI(api_key="your_openai_api_key")

def generate_embeddings(text_chunks):
    embeddings = []
    for chunk in text_chunks:
        response = client.embeddings.create(
            input=chunk,
            model="text-embedding-ada-002"
        )
        embeddings.append(response.data[0].embedding)
    return embeddings
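
With the chunks from Part 1 in hand, generating their embeddings is then a single call:

# Example usage
embeddings = generate_embeddings(text_chunks)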

Part 3: Storing Embeddings in PostgreSQL

To store the embeddings in PostgreSQL, we’ll first need to ensure the PG Vector extension is enabled in our database. Then, using SQLAlchemy and pgvector.sqlalchemy, we’ll create a table to hold our text embeddings. Typically you’ll also want to store the content itself in some fashion to augment your LLM prompt; this can be useful for citations as well. I used N_DIM = 1536 here, as that is the output dimension of OpenAI’s embedding model.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker
from pgvector.sqlalchemy import Vector

Base = declarative_base()
N_DIM = 1536

class TextEmbedding(Base):
    __tablename__ = 'text_embeddings'
    id = Column(Integer, primary_key=True, autoincrement=True)
    content = Column(String)
    embedding = Column(Vector(N_DIM))  # pgvector column sized to the model's output

# Connect to PostgreSQL
engine = create_engine('postgresql://user:password@localhost/dbname')
Base.metadata.create_all(engine)

# Create a session
Session = sessionmaker(bind=engine)
session = Session()

To insert embeddings, storing the original chunk text alongside each vector so it can be returned as context later:

def insert_embeddings(chunks, embeddings):
    # Pair each chunk with its embedding and persist both
    for content, embedding in zip(chunks, embeddings):
        new_embedding = TextEmbedding(content=content, embedding=embedding)
        session.add(new_embedding)
    session.commit()
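
Putting the pieces together, a minimal end-to-end ingestion pass using the functions defined above might look like this (the PDF path is a placeholder):

# Example usage: PDF -> chunks -> embeddings -> PostgreSQL
pdf_text = read_pdf("path_to_your_pdf.pdf")
text_chunks = split_text(pdf_text)
embeddings = generate_embeddings(text_chunks)
insert_embeddings(text_chunks, embeddings)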

Part 4: Querying for Similar Text Embeddings

Finally, we’ll query the stored embeddings to find similar text. PG Vector supports various similarity and distance metrics that can be leveraged for this purpose.
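
For reference, the pgvector SQLAlchemy integration exposes these metrics as comparator methods on the Vector column. A short sketch (query_embedding is assumed to be a 1536-dimensional list of floats):

# Distance metrics available on the pgvector column
l2 = TextEmbedding.embedding.l2_distance(query_embedding)        # Euclidean distance (<->)
ip = TextEmbedding.embedding.max_inner_product(query_embedding)  # inner product (<#>)
cos = TextEmbedding.embedding.cosine_distance(query_embedding)   # cosine distance (<=>)

We’ll use cosine distance below.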

In the find_similar_embeddings function, k and similarity_threshold are parameters used to refine the search for embeddings similar to a given query:

  • k (default 5): Specifies the maximum number of similar embeddings to retrieve. Setting k = 5 means the query will return at most five embeddings that meet the similarity criteria, balancing relevance against quantity.
  • similarity_threshold (default 0.7): Defines how close an embedding must be to the query embedding to count as a match, based on cosine distance. With cosine distance, lower values mean more similar vectors, so the filter keeps only rows whose distance to the query embedding is below 0.7. This helps ensure the relevance of the results to the query.

Together, k and similarity_threshold fine-tune the search by determining the quantity and relevance of the returned embeddings, allowing for a balanced retrieval based on the specific needs of the application.

def find_similar_embeddings(query_embedding, k=5, similarity_threshold=0.7):
    # Cosine distance: 0 means identical direction; higher means less similar
    distance = TextEmbedding.embedding.cosine_distance(query_embedding)
    return (
        session.query(TextEmbedding, distance.label("distance"))
        .filter(distance < similarity_threshold)
        .order_by(distance)
        .limit(k)
        .all()
    )

This method allows you to pass a query embedding and retrieve the most similar embeddings from the database, based on a specified similarity threshold and ordered by similarity.
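
To close the RAG loop, here is a sketch of how retrieval might feed an LLM prompt. The question string and prompt template are placeholders of my choosing, not a prescribed format:

# Embed the user's question with the same model used for the documents
question = "What is this document about?"
query_embedding = generate_embeddings([question])[0]

# Retrieve similar chunks and assemble them into context for the LLM
rows = find_similar_embeddings(query_embedding)
context = "\n\n".join(row.TextEmbedding.content for row in rows)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"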

Conclusion

By following these steps, you can efficiently transform textual content into vectorized form and perform similarity queries using PostgreSQL, PG Vector, and SQLAlchemy. Hopefully this setup simplifies the architecture for applications requiring text similarity search while leveraging a free and open-source database such as PostgreSQL.

To further enhance this solution, consider integrating more advanced text preprocessing, experimenting with different embedding models, and optimizing database queries for performance.
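
On query performance specifically: once the table grows, an approximate nearest-neighbor index speeds up similarity search considerably. One option is pgvector’s IVFFlat index with the cosine operator class (the lists value below is a tunable starting point, not a recommendation):

CREATE INDEX ON text_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);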

Remember to replace placeholder values (e.g., database connection strings, OpenAI API key) with your actual configuration details.

Please share your thoughts, questions, or any corrections in the comments below. I hope this was helpful!
