Powering Semantic Search: A Guide to Using OpenAI Embeddings with PostgreSQL in AWS Lambda

Tim Rizvanov
7 min read · Sep 16, 2024


In this article, we’ll explore how to generate embeddings using OpenAI and store them in PostgreSQL within an AWS Lambda function. This approach can be useful for creating searchable databases of text data, such as artist descriptions.

Understanding Embeddings

Before we dive into the implementation, let’s briefly discuss what embeddings are and how they work. Embeddings are dense vector representations of text, where each word or piece of text is mapped to a high-dimensional vector of real numbers. These vectors are designed so that semantically similar pieces of text are closer together in the vector space. For example, the embeddings for “cat” and “kitten” would be closer to each other than to “automobile”.

Embeddings have a remarkable ability to capture complex semantic relationships. A fascinating illustration of this is the analogy between cities and their countries. In a well-trained embedding space, we often find that:

vector(“Paris”) - vector(“France”) ≈ vector(“Tokyo”) - vector(“Japan”)

This means that if we take the vector for “Paris” and subtract the vector for “France”, we get a result that’s very similar to what we’d get if we subtracted the vector for “Japan” from the vector for “Tokyo”. This difference roughly represents the concept of “capital city of”.
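To make this concrete, here is a toy sketch using made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and these values are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; 0.0 means unrelated."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Invented toy vectors -- not real embedding values
paris = np.array([0.9, 0.1, 0.8])
france = np.array([0.8, 0.1, 0.2])
tokyo = np.array([0.2, 0.9, 0.8])
japan = np.array([0.1, 0.9, 0.2])

# Both differences point in roughly the same "capital city of" direction
print(cosine_similarity(paris - france, tokyo - japan))  # close to 1.0
```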

For a deeper dive into embeddings and their applications in AI, check out Andrew Ng’s courses on DeepLearning.AI, which cover these concepts in detail.

As the analogy above shows, embeddings capture more than surface-level word similarity: they encode relationships between concepts, which allows a system to reason about abstract ideas and how they relate to each other.

These embeddings have many practical uses. We can use them to find similar texts, group documents by topic, or even power recommendation systems. They open up new ways to understand and work with text data in various applications.

Why Store Embeddings in the Database?

Storing embeddings directly in the database alongside the original text data offers several advantages:

1. Performance: By keeping embeddings in the same database as your text data, you can perform similarity searches and other operations directly in the database, reducing the need for data transfer between your application and the database.

2. Consistency: Storing embeddings with the corresponding text ensures that they stay in sync. If you update the text, you can easily update the embedding in the same transaction.

3. Scalability: Modern databases like PostgreSQL with the `pgvector` extension can efficiently handle vector operations on large datasets, allowing your system to scale as your data grows (see the index sketch after this list).

4. Simplified Architecture: By leveraging database features for vector operations, you can offload complex calculations to the database, potentially simplifying your application code.

5. Real-time Updates: With embeddings stored in the database, you can easily update and immediately use new data without needing to regenerate or reload embeddings in a separate system.

6. Cost-Effectiveness: Storing vectors in your own database can be more cost-effective than using dedicated vector databases, especially for small to medium-sized datasets. Extensions like `pgvector` offer similar functionality at a lower cost, though large-scale applications might still benefit from specialized solutions.
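To make the scalability point concrete, here is a minimal sketch of adding an approximate nearest-neighbour index with `pgvector`. It assumes pgvector 0.5.0 or later (which introduced HNSW indexes) and the `artistinfo` table we create in the Prerequisites section below:

```python
import psycopg2

# Connection details are placeholders; substitute your own
conn = psycopg2.connect(host="your-database-host", dbname="your-database-name",
                        user="your-username", password="your-password")
with conn, conn.cursor() as cur:
    # vector_cosine_ops matches the <=> (cosine distance) operator used later in this article
    cur.execute(
        "CREATE INDEX IF NOT EXISTS artistinfo_embedding_idx "
        "ON artistinfo USING hnsw (embedding vector_cosine_ops);"
    )
conn.close()
```

Without an index, pgvector falls back to an exact sequential scan, which stays accurate but slows down as the table grows.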

Now, let’s look at how to implement this approach.

Prerequisites

Before we begin, ensure you have the following:

1. An AWS account
2. OpenAI API access
3. A PostgreSQL database
4. The `pgvector` extension installed in your PostgreSQL database
5. Python libraries: `boto3`, `psycopg2`, and `openai`

Note: To use the `psycopg2` and `openai` libraries in AWS Lambda, you’ll need to create Lambda layers for these dependencies. Creating and using Lambda layers is beyond the scope of this article, but we’ll cover this topic in a future post. For now, assume these libraries are available in your Lambda environment.

To install the `pgvector` extension:

1. Connect to your PostgreSQL database.
2. Run the following command:

```sql
CREATE EXTENSION vector;
```

3. Create a table to store your data and embeddings:

```sql
CREATE TABLE artistinfo (
    id SERIAL PRIMARY KEY,
    artist TEXT NOT NULL,
    description TEXT,
    embedding vector(1536),
    updated TIMESTAMP
);
```

The `vector(1536)` dimension matches the output of OpenAI’s `text-embedding-3-small` model; if you use a different embedding model, adjust it to that model’s output dimension.

Integrating with AWS Lambda

Now let’s put all these pieces together in an AWS Lambda function. Here’s a more complete example of how our Lambda function might look:


```python
import json
import logging
import os

import psycopg2
from openai import OpenAI

# Initialize logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Initialize the OpenAI client
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

# Environment variables
EMBEDDING_MODEL = os.environ.get('EMBEDDING_MODEL', 'text-embedding-3-small')
DB_HOST = os.environ['DB_HOST']
DB_NAME = os.environ['DB_NAME']
DB_USER = os.environ['DB_USER']
DB_PASSWORD = os.environ['DB_PASSWORD']


def get_postgres_connection():
    """Connect to the PostgreSQL database."""
    try:
        conn = psycopg2.connect(
            host=DB_HOST,
            dbname=DB_NAME,
            user=DB_USER,
            password=DB_PASSWORD
        )
        return conn
    except Exception as e:
        logger.error(f"Error connecting to PostgreSQL: {e}")
        return None


def generate_embedding(description):
    """Generate an embedding for the given description using OpenAI."""
    try:
        response = client.embeddings.create(
            input=description,
            model=EMBEDDING_MODEL
        )
        return response.data[0].embedding
    except Exception as e:
        logger.error(f"Error generating embedding: {e}")
        return None


def insert_artist_info_postgres(artist_name, description, embedding):
    """Insert artist information and its embedding into PostgreSQL."""
    try:
        conn = get_postgres_connection()
        if not conn:
            logger.error("PostgreSQL connection failed.")
            return
        cur = conn.cursor()
        insert_query = """
            INSERT INTO artistinfo (artist, description, embedding, updated)
            VALUES (%s, %s, %s, CURRENT_TIMESTAMP)
        """
        # Serialize the embedding to a '[0.1, 0.2, ...]' string, which pgvector parses
        cur.execute(insert_query, (artist_name, description, json.dumps(embedding)))
        conn.commit()
        cur.close()
        conn.close()
        logger.info(f"Inserted artist '{artist_name}' into PostgreSQL.")
    except Exception as e:
        logger.error(f"Error inserting artist '{artist_name}' into PostgreSQL: {e}")


def lambda_handler(event, context):
    """Main Lambda function handler."""
    logger.info("Received event.")
    # Example: process a new artist entry
    artist_name = "Vincent van Gogh"
    description = (
        "Vincent van Gogh was a Dutch post-impressionist painter who is among "
        "the most famous and influential figures in Western art history."
    )
    # Generate the embedding
    embedding = generate_embedding(description)
    if embedding:
        # Insert into PostgreSQL
        insert_artist_info_postgres(artist_name, description, embedding)
        return {
            'statusCode': 200,
            'body': json.dumps(f'Successfully processed artist: {artist_name}')
        }
    else:
        return {
            'statusCode': 500,
            'body': json.dumps('Failed to generate embedding')
        }
```

In this Lambda function:

1. We’ve included all necessary imports and environment variable setups.
2. The `lambda_handler` function processes a sample artist entry (Vincent van Gogh) by generating an embedding and inserting it into the database.
3. We’ve kept the core functionalities of connecting to the database, generating embeddings, and inserting data.

To use this Lambda function:

1. Set up the necessary environment variables in your Lambda configuration (OpenAI API key, database credentials, etc.).
2. Deploy the function to AWS Lambda.
3. Configure a trigger for your Lambda function (e.g., an API Gateway endpoint, an S3 event, etc.).

When triggered, this Lambda function will process the hardcoded artist information for Vincent van Gogh. In a real-world scenario, you’d likely modify the `lambda_handler` to process artist information from the triggering event, allowing it to handle different artists dynamically.
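For instance, if the function sits behind an API Gateway proxy integration, the handler could read the artist data from the request body. Here’s a minimal sketch, assuming a JSON payload with hypothetical `artist` and `description` fields:

```python
def lambda_handler(event, context):
    """Handle an API Gateway event carrying artist data in its JSON body."""
    # API Gateway proxy integrations deliver the request payload as a JSON
    # string in event['body']; the field names below are illustrative only
    body = json.loads(event.get('body') or '{}')
    artist_name = body.get('artist')
    description = body.get('description')
    if not artist_name or not description:
        return {'statusCode': 400,
                'body': json.dumps('Expected "artist" and "description" fields')}

    embedding = generate_embedding(description)
    if not embedding:
        return {'statusCode': 500,
                'body': json.dumps('Failed to generate embedding')}

    insert_artist_info_postgres(artist_name, description, embedding)
    return {'statusCode': 200,
            'body': json.dumps(f'Successfully processed artist: {artist_name}')}
```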

Retrieving Information Using Vector Similarity

Once you have stored embeddings in your PostgreSQL database, you can use them to perform similarity searches. Here’s a separate Python script that demonstrates how to retrieve information based on vector similarity:


```python
import json
import os

import psycopg2
from openai import OpenAI

# Set up the OpenAI API client (ensure your API key is set in an environment variable)
client = OpenAI(api_key=os.environ.get('OPENAI_API_KEY'))


def get_query_embedding(query_text):
    """Get an embedding for a search query."""
    response = client.embeddings.create(
        input=query_text,
        model='text-embedding-3-small'  # must match the model used when inserting
    )
    return response.data[0].embedding


# Define the natural language query
query_text = "Find Dutch painters"
query_embedding = get_query_embedding(query_text)

# Serialize the query embedding to a '[0.1, 0.2, ...]' string that pgvector can parse
query_embedding_str = json.dumps(query_embedding)

# Connect to the PostgreSQL database
conn = psycopg2.connect(
    host="your-database-host",
    port="5432",
    user="your-username",
    password=os.environ.get('POSTGRES_PASSWORD'),  # use an environment variable for the password
    dbname="your-database-name"
)
cur = conn.cursor()

# <=> is pgvector's cosine distance operator: smaller values mean closer matches,
# so ordering ascending returns the most similar rows first
similarity_query = """
    SELECT artist, description, (embedding <=> %s) AS distance
    FROM artistinfo
    ORDER BY distance
    LIMIT 5;
"""

# Execute the query, passing in the query embedding
cur.execute(similarity_query, (query_embedding_str,))

# Fetch and display the results
results = cur.fetchall()
for artist, description, distance in results:
    print(f"Artist: {artist}, Description: {description}, Distance: {distance:.4f}")

# Close the database connection
cur.close()
conn.close()
```

This script does the following:

1. It uses the OpenAI API to generate an embedding for the natural language query “Find Dutch painters”.
2. It connects to the PostgreSQL database where your artist information and embeddings are stored.
3. It performs a similarity search using the `<=>` operator (cosine distance) to find the most similar artists based on the query embedding (see the note on interpreting this value after this list).
4. It retrieves and displays the top 5 most similar artists and their descriptions.
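Note that pgvector’s `<=>` operator returns cosine distance, not similarity, so a smaller value means a closer match. If you’d rather report a familiar similarity score, cosine similarity is simply one minus the cosine distance:

```python
# Cosine distance ranges from 0 (same direction) to 2 (opposite direction);
# cosine similarity = 1 - cosine distance, giving the usual -1..1 scale
for artist, description, distance in results:
    print(f"{artist}: cosine similarity = {1 - distance:.3f}")
```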

To use this script, make sure to replace the database connection details with your own, and ensure that your OpenAI API key and PostgreSQL password are set as environment variables for security.

This example demonstrates how you can leverage the embeddings stored in your database to perform semantic searches, allowing you to find relevant information even when there isn’t an exact text match. In this case, it would likely return Vincent van Gogh among the results, as he was described as a Dutch painter in our earlier example.

It’s worth noting that this search method, while powerful, relies on the semantic understanding of the query and the stored descriptions. For a query like “Find Dutch painters”, it might return artists who are associated with Dutch art in various ways, not just those explicitly described as Dutch painters.

The accuracy and relevance of search results can be further tuned by writing an appropriate prompt around the text when generating embeddings (a small sketch of this idea follows below).

Edit: I wrote a quick article on how the retrieval of such information can be improved.

While not covered in this article, prompt engineering is an important aspect of optimising embedding-based searches.
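As a purely hypothetical illustration of this idea, you could wrap each text in a consistent template before embedding it, so that stored descriptions and incoming queries are framed the same way (the template and function below are assumptions for illustration, not a tested recipe):

```python
def embed_with_context(client, text, model='text-embedding-3-small'):
    """Hypothetical example: frame raw text consistently before embedding it."""
    framed = f"Artist biography for semantic search: {text}"
    response = client.embeddings.create(input=framed, model=model)
    return response.data[0].embedding
```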

Conclusion

We’ve created a system that generates and stores AI-powered embeddings along with text data. This approach enables semantic search and other advanced text processing capabilities, all while leveraging the power and efficiency of storing embeddings directly in the database.

Remember to keep an eye on your API usage and stay updated with the latest documentation for OpenAI API and AWS services.

Thank you for reading!
