BigQuery Vector Search: A Practitioner’s Guide

Roy Arsan
Google Cloud - Community
11 min read · Nov 26, 2024

Special thanks to Francis Lan, Minmin Jin, Guang Cheng, and Joe Malone for contributing to this unofficial guide, and to the entire BigQuery team.

This technical article on BigQuery vector search is geared for administrators and users, including data analysts, machine learning engineers, and database administrators who are responsible for managing and optimizing vector search indexes and queries in BigQuery.

We first provide a conceptual high-level flow diagram of vector search to explain the different processing stages, such as pruning, pre-filtering, and post-filtering. This offers insights into how to tweak your query to improve accuracy, latency, and cost-efficiency. We then present best practices spanning index management, query optimization, and cost control. Finally, we go over common pitfalls to avoid and answer frequently asked questions, all based on learnings from the field.

It’s important to note that BigQuery vector search is an evolving feature. Internal details may change, and new capabilities may be introduced over time. Consult the BigQuery reference documentation and don’t hesitate to engage with your Google Cloud team for the latest updates and recommendations for your specific vector search use case, such as log analysis, anomaly detection, semantic search, entity matching, drug discovery, and more.

Journey from N clusters to top-k matches

Dude, where are my top-k?

Let’s use the following query as an example:

SELECT
  query.id, base.id, base.type, distance
FROM
  VECTOR_SEARCH(
    -- base table filter on column 'type' is treated as:
    -- pre-filter if 'type' is a stored column in the index,
    -- otherwise, treated as post-filter
    (SELECT * FROM mydataset.base_table WHERE type != 'anomaly'),
    'embedding',
    (SELECT id, embedding FROM mydataset.test_table),
    top_k => 3,
    options => '{"fraction_lists_to_search":0.01}'
  )
WHERE
  -- query filter on column 'id' is treated as post-filter
  base.id < 1000

The following diagram conceptually shows the sequence of operations during a VECTOR_SEARCH query.

Conceptual flow diagram of BigQuery vector search query and parameters

While the internal vector search implementation is more involved and may change over time, this simple diagram abstraction is meant to visualize how index-time and search-time settings impact each processing stage, and ultimately your final result set of matches:

  1. The number of clusters (N), or lists, in your index can be set directly using num_lists (IVF index) or indirectly using leaf_node_embeddings_count (TreeAH index) at index creation time. At this time, N does not change if the dataset grows. For a TreeAH index, N also depends on the initial dataset size: the TreeAH algorithm divides the base table size by leaf_node_embeddings_count (default 1000) to get an approximate N.
  2. The number of top clusters (m) depends on fraction_lists_to_search. By adjusting this value, you can influence the portion of the index lists that are searched, thereby controlling the tradeoff between search speed and accuracy at query time.
  3. The number of base records (L) to compare against depends on the set of records in the top m clusters and any pre-filters applied to that set. With high pruning (low fraction_lists_to_search) and selective pre-filtering, L can fall below top-k, which results in fewer than top-k matches. In that case, try increasing fraction_lists_to_search.
  4. The number of final matches depends on any post-filters, so the result set could have fewer than top-k matches or even zero matches. In that case, try increasing top_k to widen the pool of candidate matches before the final post-filtering stage, as shown in the adjusted query below.
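
For illustration, here is one possible tweak of the example query above (same hypothetical tables and columns), applying both remedies at once:

SELECT
  query.id, base.id, base.type, distance
FROM
  VECTOR_SEARCH(
    (SELECT * FROM mydataset.base_table WHERE type != 'anomaly'),
    'embedding',
    (SELECT id, embedding FROM mydataset.test_table),
    -- request more candidates so enough survive the post-filter below
    top_k => 10,
    -- search more clusters to counter pruning and selective pre-filtering
    options => '{"fraction_lists_to_search":0.05}'
  )
WHERE
  base.id < 1000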

Best Practices

Grab your map and compass!

1) Indexing and Data Management

  • Append-Only Ingestion: appending new rows to base tables is recommended over DML operations that modify existing rows (update/delete/merge), which can invalidate larger portions of the index, leading to temporary drops in index coverage. Maintaining high index coverage ensures faster and more consistent search performance.
  • Use Partitioned Tables: partitioning your base tables helps internal index management, especially if only a limited number of partitions are updated each time. Partitioning also optimizes query performance and cost if your vector queries include partition filters, such as filtering on the timestamp partition column of time-partitioned tables.
  • Use Stored Columns: storing frequently accessed columns in the index alongside the embeddings improves performance and reduces costs by avoiding base table scans for joins. At index creation, use the STORING clause to store all referenced columns, including columns used in your vector search query (SELECT or WHERE clause) and any columns used as base table filters (the WHERE clause of the base table argument). See the example after this list, and refer to ‘pre-filtering’ in the next section “Query Optimization”.
  • Experiment with the index option num_lists (or leaf_node_embeddings_count for a TreeAH index) for the optimum performance-cost tradeoff. If you don’t specify num_lists, BigQuery calculates a reasonable value based on your base table size, which is appropriate for most cases. Depending on your data distribution, a higher num_lists value may lead to higher recall but also increases indexing cost; indexing time generally scales linearly with num_lists. Use num_lists in conjunction with fraction_lists_to_search (a search-time parameter) for optimal performance. Refer to “Tuning fraction_lists_to_search” in the next section “Query Optimization”.
  • Vector Index Monitoring: regularly monitoring the INFORMATION_SCHEMA.VECTOR_INDEXES view and job stats helps track index coverage and usage, enabling proactive identification and resolution of potential issues.
  • Recreate the index* after large data ingestions: after ingesting a significant amount of new data, particularly if the new data doesn’t follow the same distribution as the existing data, recreating the vector index is encouraged to maintain result accuracy and avoid data skew.
  • Sharding for Scale*: for massive datasets (more than 10 billion entries), splitting your base table across multiple tables (and indexes) allows scaling beyond the capacity of individual indexes (max 10 billion entries for an IVF index), while also parallelizing your queries.

* The last 2 tips are valid at the time of writing, but may change or become obsolete as new capabilities are added to BigQuery vector index.
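
As a sketch of the stored-columns, num_lists, and monitoring recommendations, here is what index creation and a coverage check might look like. The index name, distance_type, and num_lists value below are illustrative assumptions, not prescriptions:

-- Create an IVF index with stored columns used by search queries and filters
CREATE OR REPLACE VECTOR INDEX my_ivf_index
ON mydataset.base_table(embedding)
STORING (id, type)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE',
  ivf_options = '{"num_lists": 1000}');

-- Monitor index status and coverage for the dataset
SELECT table_name, index_name, index_status, coverage_percentage, last_refresh_time
FROM mydataset.INFORMATION_SCHEMA.VECTOR_INDEXES;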

2) Query Optimization

  • Pre-filtering using stored columns: pre-filtering expedites searches by narrowing down the search space to records with specific attributes. For example, you could pre-filter by time window or by feature, depending on your use case. This is currently only supported for the IVF index (see the example after this list). What about the cost impact?
    — For on-demand users, storing the column only marginally increases storage costs, but potentially reduces the cost for bytes scanned as long as the base table join is avoided.
    — For capacity-based users, storing columns only marginally increases storage costs, but improves performance and reduces slot consumption.
  • Batch Queries: use large batch jobs whenever possible for scale and cost optimization (amortizing index scan cost and base table join cost when applicable). Using large batches also helps you stay within job concurrency limits if your vector search SQL query uses Vertex AI LLM functions such as ML.GENERATE_EMBEDDING, whose concurrency limit defaults to 1 or 5 depending on the base model used (that quota limit can be increased).
    — For on-demand users, batch query cost (bytes billed) will eventually plateau, yielding further cost efficiencies. That usually occurs when there is a sufficient number of different test records to scan all clusters in the base table.
    — For large batch queries (10k-20k and more) and small base tables (<200M rows), consider the TreeAH index for improved performance over the IVF index.
  • Tuning fraction_lists_to_search: this parameter can significantly impact query performance and cost. It is recommended to increase it in the following two cases:
    — If the number of clusters is low, increase fraction_lists_to_search to help search enough top clusters. The default value is only around 0.002 for IVF indexes and 0.05 for TreeAH indexes. With a small number of clusters, you may want to start with values such as 0.01 and experiment based on your base table size and recall requirements.
    — If the pre-filter is very selective, increase fraction_lists_to_search to increase the chance of finding filtered base table vectors within the top clusters; otherwise the top clusters might not contain enough filtered base vectors. For highly selective pre-filtering, a higher fraction_lists_to_search value or even use_brute_force might be appropriate (as computations remain low), while for less selective predicates, a lower value might suffice.

Note: When using selective pre-filtering with a TreeAH index, the top-k result set is likely to be reduced, since pre-filters are treated as post-filters. The impact can be significant, especially if the pre-filters correlate with the closest clusters. In that case, to account for the reduction and improve recall, you can increase the top_k value (e.g. by 50%, 2x or even 3x); in contrast, increasing fraction_lists_to_search may have little to no impact. If increasing the top_k parameter is not an option (due to cost or latency), consider creating a separate indexed table with the pre-filtered subset of base records to avoid the recall hit of TreeAH base table filtering.
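
To illustrate the pre-filtering recommendation with an IVF index, here is a sketch assuming ‘type’ is a stored column in the index (as in the earlier STORING example); the filter value ‘rare_event’ is hypothetical:

SELECT
  query.id, base.id, base.type, distance
FROM
  VECTOR_SEARCH(
    -- 'type' is a stored column, so this filter acts as a pre-filter
    (SELECT * FROM mydataset.base_table WHERE type = 'rare_event'),
    'embedding',
    (SELECT id, embedding FROM mydataset.test_table),
    top_k => 10,
    -- selective pre-filter: search more lists so the top clusters
    -- still contain enough filtered base vectors
    options => '{"fraction_lists_to_search":0.1}'
  )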

3) Cost Control

  • Use capacity-based pricing and slot reservations for both cost predictability and cost optimization. A vector search index optimizes compute (slots) rather than IO (bytes scanned); therefore, it is generally more beneficial to pay for slots in your queries. Using slots, you can also realize incremental cost savings from better slot usage with stored columns, and avoid scanned-bytes costs from potential base table joins. Other benefits of slot reservations include decoupling indexing from search workloads, avoiding resource contention, and simpler cost attribution.
  • Implement safeguards for zero index coverage: to prevent unexpected costs due to full table scans when the index coverage temporarily drops to zero, implement safeguards in your query logic; for example, skip the query execution when index coverage is zero, if your use case allows it (see the sketch below). See the “Append-Only Ingestion” and “Use Partitioned Tables” recommendations above to mitigate index coverage drops.
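
Here is a minimal sketch of such a safeguard using BigQuery scripting, assuming the hypothetical index name from earlier and the coverage_percentage column of the INFORMATION_SCHEMA.VECTOR_INDEXES view:

-- Look up current index coverage
DECLARE coverage INT64 DEFAULT (
  SELECT coverage_percentage
  FROM mydataset.INFORMATION_SCHEMA.VECTOR_INDEXES
  WHERE table_name = 'base_table' AND index_name = 'my_ivf_index');

-- Abort instead of silently falling back to a full table scan
IF coverage IS NULL OR coverage = 0 THEN
  SELECT ERROR('Vector index coverage is 0%; skipping vector search');
END IF;

-- Coverage is non-zero; proceed with the vector search
SELECT query.id, base.id, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.base_table,
  'embedding',
  (SELECT id, embedding FROM mydataset.test_table),
  top_k => 3);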

Things to Avoid

Look before you leap.

  • Using IVF with a very small base table (< 500k rows) may hit a performance bottleneck (unusually high query latency) due to the small number of shards in the underlying index table. To improve parallelism, consider using a TreeAH index, which generally performs better, particularly for smaller base tables.
  • Using use_brute_force with a large batch query may exhaust your entire slot allocation (with capacity-based pricing) or your maximum allowed CPU time (with on-demand pricing), at which point the query fails without results. For brute-force testing, use a single test vector query or a very small batch (10 queries or fewer).
  • Setting fraction_lists_to_search too high (e.g., 0.3) can negatively impact performance and cost, with little to no recall benefit. To evaluate maximum recall, instead use use_brute_force with a small batch. To increase the number of matches when using post-filters, increase top_k instead.
  • Using temporary tables instead of the base table: referencing the base table in the first argument of the VECTOR_SEARCH function, either directly or in a query, is crucial for leveraging vector indexes and improving performance. Using temporary tables (e.g. a nested subquery) bypasses the index, leading to brute-force searches and potentially slower query times, especially with larger datasets. See the contrast below.
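
For example (a sketch with the same hypothetical tables), the first query below can leverage the vector index, while the second copies base records into a temporary table and forces a brute-force search:

-- Indexed: the base table is referenced directly
SELECT query.id, base.id, distance
FROM VECTOR_SEARCH(
  TABLE mydataset.base_table,
  'embedding',
  (SELECT id, embedding FROM mydataset.test_table),
  top_k => 3);

-- Not indexed: the temporary copy carries no vector index,
-- so the search falls back to brute force
CREATE TEMP TABLE base_subset AS
SELECT * FROM mydataset.base_table WHERE type != 'anomaly';

SELECT query.id, base.id, distance
FROM VECTOR_SEARCH(
  TABLE base_subset,
  'embedding',
  (SELECT id, embedding FROM mydataset.test_table),
  top_k => 3);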

FAQs

You’re in good company.

1. How many clusters does the vector index have?

You can usually deduce how many clusters there are in your vector search index by inspecting your query execution graph.

Let’s take this example query where:

  • Base table with close to 150k records
  • Index IVF created with num_lists=50
  • Search with fraction_lists_to_search=0.06, top_k=10

Look for the index table at the top of the execution graph. You can see 50 records were retrieved from the index table, corresponding to the 50 clusters and associated details like centroid ID and embeddings.

Note: Your query’s execution graph may be different from the one depicted in this example, as it varies depending on specific query and the underlying index type. The vector search internal implementation could also change over time.

2. How many clusters are searched?

As you tune your vector search efficiency, you may want to confirm the initial cluster pruning is working as expected.

Let’s continue with the same example query and its corresponding execution graph.

Those 50 cluster records are then fed into one of the first JOIN steps (S04: Join+ in this case) for cluster pruning. At this step, the m closest clusters are determined (to be searched later) using ORDER BY … LIMIT 3, where 3 = fraction_lists_to_search * num_lists = 0.06 * 50, as expected.

3. How can I evaluate the recall of my vector search?

Vector search uses the Approximate Nearest Neighbor (ANN) method to find the nearest neighbors, or top-k results.

To evaluate the recall of your vector search, you can compare these top-k results against results returned with the same vector search query but using brute force. For an example query, see Evaluate recall.
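
Here is a sketch of that comparison, computing recall as the fraction of brute-force ground-truth neighbors recovered by the approximate search (tables and tuning values are the hypothetical ones from earlier; keep the query batch small when using brute force):

WITH approx AS (
  -- Approximate (indexed) results
  SELECT query.id AS query_id, base.id AS base_id
  FROM VECTOR_SEARCH(
    TABLE mydataset.base_table, 'embedding',
    (SELECT id, embedding FROM mydataset.test_table),
    top_k => 10,
    options => '{"fraction_lists_to_search":0.01}')
),
exact AS (
  -- Ground truth via brute force
  SELECT query.id AS query_id, base.id AS base_id
  FROM VECTOR_SEARCH(
    TABLE mydataset.base_table, 'embedding',
    (SELECT id, embedding FROM mydataset.test_table),
    top_k => 10,
    options => '{"use_brute_force":true}')
)
-- recall = fraction of true nearest neighbors found by the ANN search
SELECT COUNT(*) / (SELECT COUNT(*) FROM exact) AS recall
FROM approx
INNER JOIN exact USING (query_id, base_id);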

4. How can I improve the recall of my vector search?

You can improve the recall of your BigQuery vector search results by:

a) Tuning fraction_lists_to_search: this search-time parameter controls the balance between speed and accuracy at search time. It determines the fraction of the index lists to search for potential matches. Increasing the value improves the chances of finding relevant matches but can also slow down the search. You can use this in conjunction with num_lists to tune the approximate-ness of your search.

b) Tuning num_lists (or, inversely, leaf_node_embeddings_count in the case of a TreeAH index): this index-time parameter affects the accuracy and performance of the search, in conjunction with fraction_lists_to_search. You are not required to specify num_lists, as BigQuery chooses an appropriate value based on your data size. However, if the recall target is not met (when compared to brute-force search), and if your data is distributed in many small groups in the embedding space, consider re-indexing the data with a higher num_lists to obtain an index with more lists, where each list has fewer and closer data points. Note that a high num_lists may increase the index build time.

Let’s go back to our earlier example…

Let’s assume the vector embeddings actually form 200 separate clusters in the embedding space, each with ~750 data points. The IVF index was created with only 50 lists, dividing the embedding space into 50 clusters of roughly 3,000 data points each, thereby grouping together somewhat dissimilar data points. As noted earlier, given fraction_lists_to_search=0.06, only the top 3 clusters are scanned, for a total of ~9,000 distance computations. These 3 clusters might miss some potential matches, and each cluster may have high variability with (distant) base records, negatively impacting the final top_k results.

In this case, consider re-indexing with num_lists set to 200 (instead of 50) to better match your data distribution. Assuming the same fraction_lists_to_search=0.06, the search will scan the top 12 clusters. Roughly the same number of distance computations are performed, but this time against finer-grained clusters whose data points are generally closer to the test record. This new index should yield better search accuracy. You might even be able to reduce fraction_lists_to_search, cutting search time and cost, without impacting recall.

Note that num_lists is currently capped at 5,000 to limit the index build cost.
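
Under those assumptions, the re-indexing might look as follows (index and table names are the hypothetical ones used earlier):

-- Rebuild the IVF index to match the ~200 natural clusters
CREATE OR REPLACE VECTOR INDEX my_ivf_index
ON mydataset.base_table(embedding)
STORING (id, type)
OPTIONS (
  index_type = 'IVF',
  distance_type = 'COSINE',
  ivf_options = '{"num_lists": 200}');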

Summary

BigQuery vector search offers powerful capabilities for finding similar items based on vector embeddings. In this article, we focused on practical tips for administrators and users deploying and using BigQuery vector search, including best practices for indexing, data management, and query optimization, in addition to cost control strategies. We also highlighted some pitfalls to avoid and answered FAQs gathered from working with customers in the field.

We hope this article helps you unlock the possibilities for building and operating innovative vector search-based applications on BigQuery. To get started, see the Search embeddings with vector search introductory tutorial. For a more advanced use case, see the Anomaly Detection of Infrastructure Logs using Gemini and BigQuery vector search Jupyter notebook.

We cannot wait to see how businesses revolutionize their applications and push boundaries in their respective domains, empowered by the cost-effectiveness and scalability of BigQuery vector search.
