Q&A With Your Docs: A Gentle Introduction to Matching Engine + PaLM

How to Use Similarity Search and Document Q&A on GCP

John Grinalds
Google Cloud - Community
7 min read · Jul 13, 2023


Introduction

Unlike traditional databases, which require exact query matches, vector databases enable similarity search: retrieving content based on semantic similarity rather than exact matches.

This is a powerful way to surface content for all kinds of use cases, including search and recommendations. Additionally, semantic similarity search is a foundational component of modern “Q&A-with-your-docs”-style LLM interactions, which I will demonstrate in this tutorial.
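To make the idea concrete, here is a minimal, self-contained sketch (not part of this tutorial's pipeline) of how semantic similarity is typically scored: texts are embedded as vectors, and the closeness of those vectors, e.g. cosine similarity or dot product, stands in for closeness in meaning. The toy vectors below are made up purely for illustration.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: closer to 1.0 means more similar in direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real models like textembedding-gecko use 768 dims)
query_vec   = np.array([0.9, 0.1, 0.0, 0.2])  # "How do I start a startup?"
essay_a_vec = np.array([0.8, 0.2, 0.1, 0.3])  # essay about founding companies
essay_b_vec = np.array([0.0, 0.9, 0.8, 0.1])  # essay about programming languages

print(cosine_similarity(query_vec, essay_a_vec))  # higher -> more semantically similar
print(cosine_similarity(query_vec, essay_b_vec))  # lower  -> less related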

This how-to guide will demonstrate, step by step, how to get up and running with Vertex AI’s Matching Engine in Google Cloud. (Update: Matching Engine has since been rebranded as Vector Search.) Then we’ll pair Matching Engine with Google’s PaLM API to enable context-aware generative AI responses.

This diagram provides an overview of how this system will work:

Names and References

Names and IDs I’m using throughout this how-to are:

Project ID: genai-jsg
GCS Bucket Name: genai-jsg-b
Matching Engine Index Name: pg-index
Public Endpoint Name: public-endpoint-test1
Deployed Index ID: genai_jsg_deployed_index_id
Deployed Index Name: genai_jsg_deployed_index_name
Region: us-central1

Step 1: Enable Needed APIs

Run gcloud init to authenticate with your GCP user and project.

Enable the necessary APIs:

gcloud services enable aiplatform.googleapis.com --async
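The Python snippets later in this guide use the Vertex AI SDK (google-cloud-aiplatform). If you are running them locally rather than in a Vertex AI notebook, you will likely also need application-default credentials (gcloud auth application-default login) and an explicit SDK initialization. A minimal sketch, assuming the project ID and region listed above:

# Assumes: pip install google-cloud-aiplatform
import vertexai

# Project ID and region from the "Names and References" section above
vertexai.init(project="genai-jsg", location="us-central1")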

Step 2: Gather Your Documents

Gather the documents that you want to index. For this demo, I used a modified version of this code to pull Paul Graham’s essays into individual local .txt files, which I placed in a local directory called ./essays/. A rough sketch of that approach is shown below.
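If you want something similar without tracking down that script, the rough shape is: fetch the essay index page, follow each essay link, and save the page text to its own file. This is a hypothetical sketch, not the code the author used; the URL and HTML handling are assumptions, and the saved text may still contain HTML fragments (which the query script in Step 8 strips out with a regex).

# Hypothetical sketch -- not the script the author used.
# Assumes: pip install requests beautifulsoup4
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://www.paulgraham.com/"
os.makedirs("./essays/", exist_ok=True)

index = BeautifulSoup(requests.get(urljoin(BASE, "articles.html")).text, "html.parser")
for link in index.find_all("a", href=True):
    href = link["href"]
    # Only follow relative links to .html pages on the same site
    if href.startswith("http") or not href.endswith(".html"):
        continue
    page = BeautifulSoup(requests.get(urljoin(BASE, href)).text, "html.parser")
    with open(os.path.join("./essays/", href.replace(".html", ".txt")), "w") as out:
        out.write(page.get_text())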

Step 3: Generate Embeddings

Once you have your documents, you need to convert their contents to vector embeddings. These embeddings are what will populate your Matching Engine index. Each document will have its own corresponding embedding vector.

To generate embeddings, we will use the textembedding-gecko model from Vertex AI's Model Garden. Below is the code that I used:

from vertexai.preview.language_models import TextEmbeddingModel
import os
import json

model = TextEmbeddingModel.from_pretrained("textembedding-gecko")

# Add all text filenames to a list
filenames = []
for filename in os.listdir("./essays/"):
    filenames.append(os.path.join("./essays/", filename))

# Extract the contents of each file into a list
texts = []
for f in filenames:
    print("Opening: ", f)
    with open(f, "r") as f_d:
        texts.append((f_d.read(), f))

data = {}    # To hold the document ID and embeddings
lookup = {}  # To associate the document ID with the filename

# Get embeddings and write them to file
i = 0
for text, filename in texts:
    embeddings = model.get_embeddings([text])
    vector = embeddings[0].values

    data["id"] = str(i)
    data["embedding"] = vector
    with open("data.json", "a") as f:
        json.dump(data, f)
        f.write("\n")

    lookup[i] = filename
    i += 1

with open("lookup.json", "w") as f:
    json.dump(lookup, f)

This code will produce two files: data.json and lookup.json. We will use data.json to populate the Matching Engine index. The lookup.json file will be used to associate the document ID with the actual filepath.

Note that data.json is not a single valid JSON document; it is in JSON Lines format, with one record per line and no commas between records. It looks like this:

{"id": "0", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "1", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "2", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}
{"id": "3", "embedding": [0.1, -0.1, ... , 0.1, -0.1]}

Errors in this format could cause the Matching Engine index creation to fail later on.
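Since a malformed file only surfaces as a failed index build much later, it can be worth sanity-checking data.json before uploading it. Here is a small sketch of such a check (the 768-dimension value matches the index config used in Step 5):

import json

EXPECTED_DIMS = 768  # must match the "dimensions" value in the index config below

with open("data.json", "r") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)  # raises an error if the line isn't valid JSON
        assert "id" in record and "embedding" in record, f"line {line_number}: missing field"
        assert len(record["embedding"]) == EXPECTED_DIMS, f"line {line_number}: wrong dimensions"

print("data.json looks well-formed")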

Step 4: Upload Embeddings to GCS

Before we can create the Matching Engine index, we first need to upload the embeddings we just generated to Google Cloud Storage.

Use gsutil to create the bucket and upload data.json to it:

gsutil mb -l us-central1 gs://genai-jsg-b 
gsutil cp ./data.json gs://genai-jsg-b

Step 5: Create the Matching Engine Index

Create a config file called index_metadata.json with the following contents:

{
  "contentsDeltaUri": "gs://genai-jsg-b",
  "config": {
    "dimensions": 768,
    "approximateNeighborsCount": 150,
    "distanceMeasureType": "DOT_PRODUCT_DISTANCE",
    "shardSize": "SHARD_SIZE_MEDIUM",
    "algorithm_config": {
      "treeAhConfig": {
        "leafNodeEmbeddingCount": 5000,
        "leafNodesToSearchPercent": 3
      }
    }
  }
}

The given parameters are a good starting point; see here for more information.

Now you can create the Matching Engine Index with the following gcloud commands:

PROJECT_ID=genai-jsg
LOCATION=us-central1

gcloud ai indexes create \
  --metadata-file=./index_metadata.json \
  --display-name=pg-index \
  --project=$PROJECT_ID \
  --region=$LOCATION

gcloud ai indexes list \
  --project=$PROJECT_ID \
  --region=$LOCATION

The create command can take a while; it took over 30 minutes for me.

Step 6: Create a Public Endpoint

Now in order to deploy the index, you will need an endpoint. I am using a public endpoint for this tutorial.

First create request.json:

{
  "display_name": "public-endpoint-test1",
  "publicEndpointEnabled": "true"
}

Now send the creation POST request:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d @request.json \
  "https://us-central1-aiplatform.googleapis.com/v1/projects/genai-jsg/locations/us-central1/indexEndpoints"

Step 7: Deploy the Index to the Endpoint

Now you can deploy the index to the endpoint, using your index endpoint’s numeric ID (from the creation response in the previous step, or from gcloud ai index-endpoints list) and your index’s numeric ID (from gcloud ai indexes list):

gcloud ai index-endpoints deploy-index xxxxxxxxxxxxxxx0896 \
  --deployed-index-id=genai_jsg_deployed_index_id \
  --display-name=genai_jsg_deployed_index_name \
  --index=xxxxxxxxxxxxxxx8464 \
  --project=genai-jsg \
  --region=us-central1

Note that this step also takes a while, again over 30 minutes.

Once the index has been deployed, you will need the publicEndpointDomainName of the index endpoint. To find it, first check the details of your deployed index in the response to this gcloud command:

gcloud ai indexes list --project="genai-jsg" --region="us-central1"

Use the response to that command to populate the ENDPOINT, PROJECT_ID, REGION, and INDEX_ENDPOINT_ID variables in preparation for this final curl call:

ENDPOINT=https://us-central1-aiplatform.googleapis.com
PROJECT_ID=xxxxxxxx3856
REGION=us-central1
INDEX_ENDPOINT_ID=xxxxxxxxxxxxxxx0896

curl -H "Content-Type: application/json" -H "Authorization: Bearer `gcloud auth print-access-token`" ${ENDPOINT}/v1/projects/${PROJECT_ID}/locations/${REGION}/indexEndpoints/${INDEX_ENDPOINT_ID}

In the response, take note of the publicEndpointDomainName value. It should look something like this:

0123456789.us-central1-xxxxxxxx3856.vdb.vertexai.goog

Now you are ready to query your index’s endpoint!

Step 8: Query the Index and Combine with the LLM

The following code does three main things (recall the diagram from the start of this post):

  1. Converts the user query into an embedding vector
  2. Uses this vector to look up relevant documents in the Matching Engine index
  3. Passes the relevant documents to the LLM as context for text generation

We will use the text-bison@001 text generation model from PaLM.

Be sure to replace the needed variables with the values from above. Note that the NUM_RELEVANT_DOCS variable indicates how many of the closest documents returned will be included in the LLM context.

import google.cloud.aiplatform_v1beta1 as aiplatform_v1beta1
from vertexai.preview.language_models import TextEmbeddingModel
from vertexai.preview.language_models import TextGenerationModel
import sys
import re
import json

# Set variables
API_ENDPOINT = "0123456789.us-central1-xxxxxxxx3856.vdb.vertexai.goog"
INDEX_ENDPOINT = "projects/xxxxxxxx3856/locations/us-central1/indexEndpoints/xxxxxxxxxxxxxxx0896"
DEPLOYED_INDEX_ID = "genai_jsg_deployed_index_id"

# Load the embedding and generation models
embedding_model = TextEmbeddingModel.from_pretrained("textembedding-gecko")
generation_model = TextGenerationModel.from_pretrained("text-bison@001")

# Configure the Matching Engine index client
client_options = {
    "api_endpoint": API_ENDPOINT
}
vertex_ai_client = aiplatform_v1beta1.MatchServiceClient(
    client_options=client_options,
)

# Get the user query from the command line
command_line_arguments = sys.argv
if len(command_line_arguments) > 1:
    user_query = command_line_arguments[1]
    print("\nYour question is: ", user_query, "\n")
else:
    sys.exit("Must specify a query")

# Get the embedding for the user query
embeddings = embedding_model.get_embeddings([user_query])
vector = embeddings[0].values

# Query the Matching Engine index with the user query embedding
datapoint = aiplatform_v1beta1.IndexDatapoint(
    datapoint_id="0",
    feature_vector=vector
)
query = aiplatform_v1beta1.FindNeighborsRequest.Query(
    datapoint=datapoint
)
request = aiplatform_v1beta1.FindNeighborsRequest(
    index_endpoint=INDEX_ENDPOINT,
    deployed_index_id=DEPLOYED_INDEX_ID,
)
request.queries.append(query)
response = vertex_ai_client.find_neighbors(request)  # https://cloud.google.com/python/docs/reference/aiplatform/1.26.1/google.cloud.aiplatform_v1.types.FindNeighborsResponse

# Parse the response for the nearest neighbors
with open("lookup.json", "r") as f:
    filepaths = json.load(f)

nn = []
for r in response.nearest_neighbors:
    for n in r.neighbors:
        id = n.datapoint.datapoint_id
        distance = n.distance
        filepath = filepaths[str(id)]
        nn.append((id, distance, filepath))

print("The most relevant documents related to this question are:\n\n")
print("ID\tDist.\tFilepath\t\n")
print("".join([f"{id}\t{round(distance, 4)}\t{filepath}\n" for id, distance, filepath in nn]), "\n")

# Read in the essay content from the most relevant docs
context = ""
NUM_RELEVANT_DOCS = 1
for i in range(NUM_RELEVANT_DOCS):
    n = nn[i]
    filepath = n[2]  # Access the filepath
    with open(filepath, "r") as f:
        context += f.read()

context = re.compile(r"<.*?>", re.DOTALL).sub("", context)  # Remove residual HTML content

# Craft the prompt and invoke the model
prompt = f"""
Context: You are Paul Graham, a programmer, startup advisor, and essayist.
Use the following essay you wrote to give a detailed answer to any questions you receive: {context}
Question: {user_query}
"""

print("Answer:")
print(generation_model.predict(prompt, temperature=0.2, max_output_tokens=1024))

Here are some example queries and their responses (shown as screenshots in the original post), on the topics of cities, kids, and founders.

Just using the vanilla PaLM model without the added context would result in only generic responses. Even though they’re not perfect, these answers seem a lot closer to what Paul Graham himself might say.

Conclusion

I hope this has been a helpful introduction to Document Q&A with Matching Engine and PaLM. Note that this tutorial was intended to get you touching all the different pieces and building something that works; it is clearly not a production-ready system. One area for improvement would be in splitting up the documents. Feeding the LLM only the most relevant paragraph(s) of an essay instead of the entire piece would likely provide better results.
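As a rough illustration of that idea (a hedged sketch, not something from the tutorial), documents could be split on blank lines into paragraph-level chunks, with each chunk embedded and indexed separately so that only the closest chunks end up in the prompt:

def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split a document into paragraph-based chunks no longer than max_chars."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if len(current) + len(paragraph) > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Each chunk would then get its own ID and embedding in data.json,
# and lookup.json would map chunk IDs back to (filepath, chunk index).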

Additionally, with the generation_model.predict() call, how the prompt is formulated and what parameters are chosen have a big impact on the results; there's endless opportunity for tweaking. You can explore this area further here: Gen AI Overview of text prompt design

References and Further Reading

  1. Vertex AI Matching Engine setup
  2. Very in-depth and rigorous demonstration of Doc Q&A from Googler Mike Henderson: Github
  3. Helpful YouTube tutorial on Matching Engine
