Q&A With Your Docs: A Gentle Introduction to Matching Engine + PaLM
How to Use Similarity Search and Document Q&A on GCP
Introduction
Unlike traditional databases, which require exact query matches, vector databases enable similarity search: retrieving content by semantic similarity rather than exact matching.
This is a powerful way to surface content for all kinds of use cases, including search and recommendations. Semantic similarity search is also a foundational component of modern “Q&A-with-your-docs”-style LLM interactions, which I will demonstrate in this tutorial.
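To make “semantic similarity” concrete: embeddings represent text as numeric vectors, and closeness between vectors is typically scored with a measure like cosine similarity. Here is a minimal, self-contained sketch using made-up toy vectors (real embedding models produce hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|); values near 1.0
    # mean the vectors point in nearly the same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (purely illustrative values)
query = [0.9, 0.1, 0.0]
doc_about_same_topic = [0.8, 0.2, 0.1]
doc_about_other_topic = [0.0, 0.1, 0.9]

print(cosine_similarity(query, doc_about_same_topic))   # high similarity
print(cosine_similarity(query, doc_about_other_topic))  # low similarity
```

A vector database generalizes this idea: it stores millions of embeddings and finds the nearest ones to a query vector efficiently, rather than by brute-force comparison.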
This how-to guide demonstrates, step by step, how to get up and running with Vertex AI’s Matching Engine in Google Cloud. (Update: Matching Engine has since been rebranded to Vector Search.) Then we’ll pair Matching Engine with Google’s PaLM API to enable context-aware generative AI responses.
This diagram provides an overview of how this system will work:
Names and References
The names and IDs I’m using throughout this how-to are:
- Project ID: genai-jsg
- GCS Bucket Name: genai-jsg-b
- Matching Engine Index Name: pg-index
- Public Endpoint Name: public-endpoint-test1
- Deployed Index ID: genai_jsg_deployed_index_id
- Deployed Index Name: genai_jsg_deployed_index_name
- Region: us-central1
Step 1: Enable Needed APIs
Run `gcloud init` to authenticate with your GCP user and project.
Enable the necessary APIs:

```shell
gcloud services enable aiplatform.googleapis.com --async
```
Step 2: Gather Your Documents
Gather the documents that you want to index. For this demo, I used a modified version of this code to pull Paul Graham’s essays into individual local `.txt` files, which I placed in a local directory called `./essays/`.
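Whatever document source you use, the rest of this tutorial assumes a flat directory of plain-text files. As a quick sketch of the expected layout (using hypothetical placeholder essays, not the real ones):

```python
import os

os.makedirs("./essays/", exist_ok=True)

# Hypothetical placeholder content; substitute your own documents here
samples = {
    "essay_one.txt": "The most common mistake startups make is ...",
    "essay_two.txt": "Writing and programming are similar in that ...",
}
for name, text in samples.items():
    with open(os.path.join("./essays/", name), "w") as f:
        f.write(text)

print(sorted(os.listdir("./essays/")))
```

Each `.txt` file becomes one indexed document in the steps that follow.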
Step 3: Generate Embeddings
Once you have your documents, you need to convert their contents to vector embeddings. These embeddings are what will populate your Matching Engine index. Each document will have its own corresponding set of embeddings.
To generate embeddings, we will use the `textembedding-gecko` model from Vertex AI's Model Garden. Below is the code that I used:
```python
from vertexai.preview.language_models import TextEmbeddingModel
import os
import json

model = TextEmbeddingModel.from_pretrained("textembedding-gecko")

# Add all text filenames to a list
filenames = []
for filename in os.listdir("./essays/"):
    filenames.append(os.path.join("./essays/", filename))

# Extract the contents of each file into a list
texts = []
for f in filenames:
    print("Opening: ", f)
    with open(f, "r") as f_d:
        texts.append((f_d.read(), f))

data = {}    # To hold the document ID and embeddings
lookup = {}  # To associate the document ID with the filename

# Get embeddings and write to file
i = 0
for text, filename in texts:
    embeddings = model.get_embeddings([text])
    vector = embeddings[0].values
    data["id"] = str(i)
    data["embedding"] = vector
    with open("data.json", "a") as f:
        json.dump(data, f)
        f.write("\n")
    lookup[i] = filename
    i += 1

with open("lookup.json", "w") as f:
    json.dump(lookup, f)
```
This code will produce two files: `data.json` and `lookup.json`. We will use `data.json` to populate the Matching Engine index; the `lookup.json` file will be used to map each document ID back to its actual filepath.
Note that `data.json` is not a true JSON file; it’s actually JSON Lines format, with one record per line and no commas between records. It looks like this:
```
{"id": "0", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "1", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "2", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "3", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
```
Errors in this format could cause the Matching Engine index creation to fail later on.
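Since malformed records are an easy way to make index creation fail, it can be worth sanity-checking `data.json` before uploading it. Here is a small validation sketch; the 768 dimensionality matches `textembedding-gecko`, and the demo filename and zero-vector records are made up for illustration:

```python
import json

EXPECTED_DIM = 768  # textembedding-gecko returns 768-dimensional vectors

def validate_jsonl(path, expected_dim=EXPECTED_DIM):
    """Check that every line is a JSON object with a string id and an
    embedding of the expected length; return the record count."""
    count = 0
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            assert isinstance(record["id"], str), f"line {lineno}: id must be a string"
            dims = len(record["embedding"])
            assert dims == expected_dim, f"line {lineno}: expected {expected_dim} dims, got {dims}"
            count += 1
    return count

# Self-contained demo on a throwaway file with zero-vectors; on your
# real output you would run validate_jsonl("data.json") instead.
with open("demo.json", "w") as f:
    for i in range(2):
        json.dump({"id": str(i), "embedding": [0.0] * 768}, f)
        f.write("\n")

print(validate_jsonl("demo.json"), "records look OK")
```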
Step 4: Upload Embeddings to GCS
Before we can create the Matching Engine index, we first need to upload the embeddings we just generated to Google Cloud Storage.
Use `gsutil` to create the bucket and upload `data.json` to it:

```shell
gsutil mb -l us-central1 gs://genai-jsg-b
gsutil cp ./data.json gs://genai-jsg-b
```
Step 5: Create the Matching Engine Index
Create a config file called `index_metadata.json` with the following contents:
```json
{
  "contentsDeltaUri": "gs://genai-jsg-b",
  "config": {
    "dimensions": 768,
    "approximateNeighborsCount": 150,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "shardSize": "SHARD_SIZE_MEDIUM",
    "algorithm_config": {
      "treeAhConfig": {
        "leafNodeEmbeddingCount": 5000,
        "leafNodesToSearchPercent": 3
      }
    }
  }
}
```
The given parameters are a good starting point; see here for more information.
Now you can create the Matching Engine index with the following `gcloud` commands:
```shell
PROJECT_ID=genai-jsg
LOCATION=us-central1

gcloud ai indexes create \
  --metadata-file=./index_metadata.json \
  --display-name=pg-index \
  --project=$PROJECT_ID \
  --region=$LOCATION

gcloud ai indexes list \
  --project=$PROJECT_ID \
  --region=$LOCATION
```
The create command can take a while; it took over 30 minutes for me.
Step 6: Create Public Endpoint
In order to deploy the index, you will need an endpoint. I am using a public endpoint for this tutorial.
First create `request.json`:

```json
{
  "display_name": "public-endpoint-test1",
  "publicEndpointEnabled": true
}
```
Now send the creation POST request:
```shell
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/genai-jsg/locations/us-central1/indexEndpoints"
```
Step 7: Deploy Index to Endpoint
Now you can deploy the index to the endpoint. Deploy it with:
```shell
gcloud ai index-endpoints deploy-index xxxxxxxxxxxxxxx0896 \
  --deployed-index-id=genai_jsg_deployed_index_id \
  --display-name=genai_jsg_deployed_index_name \
  --index=xxxxxxxxxxxxxxx8464 \
  --project=genai-jsg \
  --region=us-central1
```
Note that this step will also take a while, over 30 minutes in my case.
Once the index has been deployed to the endpoint, you will need the `publicEndpointDomainName` of your deployed index. To find it, first look at the details of your deployed index in the response to this `gcloud` command:

```shell
gcloud ai indexes list --project="genai-jsg" --region="us-central1"
```
Use the response to that command to populate the ENDPOINT, PROJECT_ID, REGION, and INDEX_ENDPOINT_ID variables in preparation for this final `curl` call:
```shell
ENDPOINT=https://us-central1-aiplatform.googleapis.com
PROJECT_ID=xxxxxxxx3856
REGION=us-central1
INDEX_ENDPOINT_ID=xxxxxxxxxxxxxxx0896

curl -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  ${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/indexEndpoints/${INDEX_ENDPOINT_ID}
```
In the response, take note of the `publicEndpointDomainName` value. It should look something like this:

```
0123456789.us-central1-xxxxxxxx3856.vdb.vertexai.goog
```
Now you are ready to query your index’s endpoint!
Step 8: Query Index and Combine with LLM
The following code is doing three main things (recall the diagram from the start of this post):
- Convert the user query into an embedding vector
- Use this vector to lookup relevant documents in the Matching Engine Index
- Use the relevant documents in the LLM text generation context
We will use the `text-bison@001` text generation model from PaLM.
Be sure to replace the needed variables with the values from above. Note that the `NUM_RELEVANT_DOCS` variable controls how many of the closest returned documents are included in the LLM context.
```python
import google.cloud.aiplatform_v1beta1 as aiplatform_v1beta1
from vertexai.preview.language_models import TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel
import sys
import re
import json

# Set variables
API_ENDPOINT = "0123456789.us-central1-xxxxxxxx3856.vdb.vertexai.goog"
INDEX_ENDPOINT = "projects/xxxxxxxx3856/locations/us-central1/indexEndpoints/xxxxxxxxxxxxxxx0896"
DEPLOYED_INDEX_ID = "genai_jsg_deployed_index_id"

# Load the embedding and generation models
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

# Configure the Matching Engine index client
client_options = {
    "api_endpoint": API_ENDPOINT
}
vertex_ai_client = aiplatform_v1beta1.MatchServiceClient(
    client_options=client_options,
)

# Get the user query
command_line_arguments = sys.argv
if len(command_line_arguments) > 1:
    user_query = command_line_arguments[1]
    print("\nYour question is: ", user_query, "\n")
else:
    print("Must specify a query")
    sys.exit(1)

# Get embeddings from the user query
embeddings = embedding_model.get_embeddings([user_query])
vector = embeddings[0].values

# Query the Matching Engine index with the user query embedding
datapoint = aiplatform_v1beta1.IndexDatapoint(
    datapoint_id="0",
    feature_vector=vector
)
query = aiplatform_v1beta1.FindNeighborsRequest.Query(
    datapoint=datapoint
)
request = aiplatform_v1beta1.FindNeighborsRequest(
    index_endpoint=INDEX_ENDPOINT,
    deployed_index_id=DEPLOYED_INDEX_ID,
)
request.queries.append(query)
response = vertex_ai_client.find_neighbors(request)  # https://cloud.google.com/python/docs/reference/aiplatform/1.26.1/google.cloud.aiplatform_v1.types.FindNeighborsResponse

# Parse the response for nearest neighbors
with open("lookup.json", "r") as f:
    filepaths = json.load(f)

nn = []
for r in response.nearest_neighbors:
    for n in r.neighbors:
        id = n.datapoint.datapoint_id
        distance = n.distance
        filepath = filepaths[str(id)]
        nn.append((id, distance, filepath))

print("The most relevant documents related to this question are:\n\n")
print("ID\tDist.\tFilepath\t\n")
print("".join([f"{id}\t{round(distance, 4)}\t{filepath}\n" for id, distance, filepath in nn]), "\n")

# Read in essay content from the most relevant docs
context = ""
NUM_RELEVANT_DOCS = 1
for i in range(NUM_RELEVANT_DOCS):
    n = nn[i]
    filepath = n[2]  # Access filepath
    with open(filepath, "r") as f:
        context += f.read()
context = re.compile(r"<.*?>", re.DOTALL).sub("", context)  # Remove residual HTML content

# Craft the prompt and invoke the model
prompt = f"""
Context: You are Paul Graham, a programmer, startup advisor, and essayist.
Use the following essay you wrote to give a detailed answer to any questions you receive: {context}

Question: {user_query}
"""

print("Answer:")
print(generation_model.predict(prompt, temperature=0.2, max_output_tokens=1024))
```
Here are some example queries and their responses:
Just using the vanilla PaLM model without the added context would result in only generic responses. Even though they’re not perfect, these answers seem a lot closer to what Paul Graham himself might say.
Conclusion
I hope this has been a helpful introduction to Document Q&A with Matching Engine and PaLM. Note that this tutorial was intended to get you touching all the different pieces and building something that works; it is clearly not a production-ready system. One area for improvement would be in splitting up the documents. Feeding the LLM only the most relevant paragraph(s) of an essay instead of the entire piece would likely provide better results.
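As a rough sketch of what that splitting could look like, here is a simple greedy chunker that packs paragraphs (split on blank lines) into independently embeddable chunks. The delimiter and `max_chars` value are arbitrary starting points, not tuned recommendations:

```python
def chunk_text(text, max_chars=1500):
    """Greedily pack paragraphs into chunks of at most max_chars, so
    each chunk can be embedded and retrieved on its own instead of
    feeding a whole essay into the LLM context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for p in paragraphs:
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)  # flush the full chunk
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks

# Demo on made-up essay text
essay = "First paragraph about startups.\n\n" + "\n\n".join(
    f"Paragraph {i} " + "word " * 60 for i in range(10)
)
chunks = chunk_text(essay, max_chars=800)
print(len(chunks), "chunks; longest is", max(len(c) for c in chunks), "chars")
```

Each chunk would then get its own embedding and ID in `data.json`, and `lookup.json` would map IDs to (file, chunk) pairs rather than whole files.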
Additionally, with the `generation_model.predict()` call, how the prompt is formulated and which parameters are chosen have a big impact on the results; there's endless opportunity for tweaking. You can explore this area further here: Gen AI Overview of text prompt design
References and Further Reading
- Vertex AI Matching Engine setup
- Very in-depth and rigorous demonstration of Doc Q&A from Googler Mike Henderson: GitHub
- Helpful YouTube tutorial on Matching Engine