Databricks SQL AI: Query a Vector Search Endpoint with only SQL

Reilly Wild-Williams
DBSQL SME Engineering
15 min read · Sep 4, 2024
Vector Search in DBSQL AI/BI Dashboards

Intro

With the boom in AI accessibility for the average person over the last year, I’ve seen a growing gap emerge for the analyst. On one hand, many tools, like ChatGPT, are geared toward the layperson and require no prior coding experience to use. On the other hand, advanced implementations of AI in the business sphere tend to be built by experienced software engineers who may know multiple coding languages, like Python and Java. Where, then, are the tools for the analyst? Where are the tools geared toward business intelligence users who only know SQL? There is a gap in the market: mainstream platforms have yet to give their SQL analysts a way to build similar-quality AI applications using large language models (LLMs) without worrying about pipelining and infrastructure management.

Ideally, SQL analysts should be able to build their own AI applications without bothering platform architects and data engineers. They should be able to leverage their coding skills to write targeted queries and prompts based on their business knowledge, without regressing to no-code solutions that are built to be flashy for the general public but ultimately need more finesse to be deployed in a production environment.

Enter Databricks SQL AI Functions, “built-in SQL functions that allow you to apply AI on your data directly with SQL.” In this article, and others like it that I will publish, I will walk you through how to implement these built-in functions on your enterprise data.

This article will focus on the vector_search* function, which allows you to query a vector store. To follow along with this article, you will need:

  • A Unity Catalog enabled Databricks workspace
  • Some basic SQL skills
  • A bottle of wine (just wait, you’ll see why)

* This function is in Private Preview as of the publishing date of this article. Submit the enrollment form to participate in the preview.

What Exactly is a Vector Store, and Why Do I Care?

So, what exactly is a vector store? Many analysts may not be familiar with this term yet, even though it is a pervasive architecture pattern in AI applications like retrieval-augmented generation (RAG). The TL;DR is this: a vector store is a specialized database that stores vectors, which are lists of numbers that capture the essential features of an item of data in mathematical form.

Why do we store data in vectors? The main reason is that, in machine learning and AI applications, most algorithms require numerical input. You can’t pass raw text, images, audio, or other unstructured data to these algorithms directly. Vectorizing the data also improves computational efficiency, since computers can quickly perform operations (like addition and multiplication) on vectors. Additionally, depending on the vectorization method you choose, vectorizing data lets you measure item similarity, since similar items end up close to each other in the vector space. This is essential for efficient recommendation engines, search engines, and natural language processing. Lastly, vectors allow data type interoperability: items that were originally different types, like text and images, can be compared to each other once they are all vectors. There are other benefits, too, like dimensionality reduction and feature representation, but for now, what you need to know is that vector stores store vectors, which are the input to the algorithms behind AI and LLMs.
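As a quick illustration of what “close together” means, similarity between two vectors is commonly measured with cosine similarity, which compares the angle between two vectors a and b:

```latex
\mathrm{similarity}(\mathbf{a},\mathbf{b})
  = \cos\theta
  = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}}\,\sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Values near 1 mean the two vectors point in nearly the same direction in the vector space (very similar items); values near 0 mean the items are essentially unrelated. This is just one common metric; vector stores may also use other distance measures, like Euclidean distance.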

How do you create vectors? You can create vectors in many different ways, and that topic alone could fill several other articles, so here we won’t focus on the how so much as the why.

When should you use a vector store? You should use one when the application you’re building requires factual recall and context. Simple language use cases like summarization, translation, and sentiment analysis require nothing beyond an understanding of language itself, so you do not need a vector store for them. Even for these “simple” use cases, though, it is worth weighing the cost of not having a vector database with specific business context for your application. Suppose your business uses specialized vocabulary like acronyms and industry-specific terms; you might find that off-the-shelf LLMs are not specific enough to cover your use case sufficiently. So, if you anticipate your application needing business-specific context, like providing your chatbot with information about your product SKUs and internal names for processes, you probably need a vector store. A quick rule of thumb: could a random person off the street answer the questions your application is built to solve without any added context? If yes, you don’t need a vector store. If no (which should be the majority of business applications), you do.

Common Vector Store Applications

Some common business use cases that would require a vector store are:

  • Semantic Search: enhancing search functionality by understanding the intent behind search queries and not just keyword matching to the search terms
    - Example: An e-commerce platform might use a vector store to power a semantic search engine, allowing customers to find products conceptually similar to their queries, even if the exact keywords aren’t present.
  • Recommendation Systems: providing personalized recommendations based on similarities between user history/preferences and items
    - Example: Video streaming services use vector stores to compare user behavior and preferences with content features (e.g., movies and songs) to suggest similar content.
  • Fraud Detection: Identifying patterns and anomalies in transaction data to detect fraudulent patterns
    - Example: Financial institutions might use vector stores to store embeddings of transaction data, allowing them to identify transactions similar to known fraudulent patterns quickly.
  • Document Retrieval and Summarization: retrieving relevant documents and summarizing them based on a user’s needs
    - Example: Legal firms could use a vector store to quickly retrieve and summarize relevant case files, contracts, or legal documents by comparing the context of the query to the stored document embeddings.
  • Natural Language Processing (NLP) Applications: enhancing NLP models, like chatbots or translation systems, with contextual relevant information
    - Example: A customer support chatbot could use a vector store to retrieve the most contextually appropriate responses based on the user’s previous queries and interactions.
  • Image and Video Search: Allowing the user to query images or videos based on visual similarities and not just text-based metadata
    - Example: A digital asset management system might use a vector store to allow users to find images that are visually similar to a given image, enabling more intuitive searches.
  • Personalized Marketing: Delivering customized marketing content based on user behavior and preferences
    - Example: An online retailer could use a vector store to analyze customer interactions and preferences, enabling them to send targeted marketing emails with products similar to those the customer has shown interest in.
  • Content Moderation: Detecting and filtering inappropriate content across platforms
    - Example: Social media platforms might use a vector store to compare new content against embeddings of known inappropriate or harmful content to flag or remove it.
  • Anomaly Detection in Manufacturing: Detecting deviations from standard patterns in manufacturing processes
    - Example: A manufacturing company might use a vector store to detect anomalies in machine sensor data, identifying potential equipment failures before they happen.
  • Product or Feature Clustering: Grouping similar products or features based on their characteristics
    - Example: A product management team might use a vector store to cluster similar features or product ideas, enabling teams to identify trends or gaps in the market.

So, by now you should understand what a vector is, what a vector store is, when you might need one, some common use cases, and why the concept comes up so often when discussing AI and LLM applications.

Creating a Vector Store and Vector Search Index in Databricks without any code

Okay, now how do you create a Vector Store in Databricks? You know why they are essential but not yet how to make one. Using Mosaic AI Vector Search, you “can create Vector Search indices using the UI, the Python SDK, or the REST API.” In this article, I will briefly cover creating a Vector Search index using the UI. First, there are a few requirements to go over.

Prerequisites

  • Your workspace must have Unity Catalog enabled
  • You must have Serverless compute enabled
  • You must have the appropriate permissions (CREATE TABLE) on the table that you wish to create the Vector Search index on
  • You will need either a Personal Access Token (PAT) or a service principal token to create the endpoint

To create a Vector Search index, we must first make a Vector Search endpoint. To do this, navigate to your workspace and click ‘Compute’ in the left sidebar.

Then click ‘Vector Search’ and ‘Create’ to create your endpoint.

Okay, that’s it for creating the endpoint! It may take a while to provision, but don’t worry: it’s being created in the background.

Now, let’s create the index on top of some data in our lakehouse. In this article, I will walk you through creating a Delta Sync Index, which “automatically syncs with a source Delta Table, automatically and incrementally updating the index as the underlying data in the Delta Table changes.” Still, you should be aware that you can also create a Direct Vector Access Index, which supports direct reading and writing of vectors, and the user is responsible for updating the index. However, you cannot create Direct Vector Access Indices in the UI, only the Python SDK or the REST API.

Before we make the index on top of our sample data, I want you to go and pour yourself a glass of wine (here’s where that bottle I said you would need earlier comes in). Creating the Vector Search Index and generating the embeddings will take a while, depending on the size of your data and the embedding model you choose. Oh, and by the way, the sample data I’ll be using for the rest of this demo is the Global Wine Ratings dataset.

We must first load the data in a Unity Catalog delta table to create the index. I’ll use the UI to do this, but you can also do it via the Python SDK or API. To do it in the UI, navigate to ‘Catalog’ in the left sidebar and go to where you want to store the table. Once you’re in the schema you want to use, click the ‘Create’ button and then ‘Create table.’

Once you’ve clicked that button, upload the wine ratings CSV. On the next screen, you can change the table name and validate the types of each column before clicking ‘Create table.’
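If you’d rather script this step than click through the UI, the same table can also be created with a SQL statement using the read_files table-valued function. This is just a sketch: the catalog, schema, and volume path below are hypothetical, so swap in your own before running it.

```sql
-- Hypothetical names: adjust the catalog, schema, and volume path
-- to match your workspace before running.
CREATE TABLE demos_rww.articles.wine_ratings AS
SELECT *
FROM read_files(
  '/Volumes/demos_rww/articles/raw/wine_ratings.csv',
  format => 'csv',
  header => true
);
```

Either way, you end up with a Unity Catalog Delta table that the Vector Search index can sync against.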

Navigate to your table in the catalog explorer. Now, it’s finally time to create the vector index. Click ‘Create’ and ‘Vector Search index’ to start.

This next screen has some terms that might intimidate you if you’re unfamiliar with vector stores, and that’s okay. I will break down each of these fields and what they mean.

Name: This is going to be the name of your index. It’s going to follow the three-level namespace convention in Unity Catalog, so it’ll be <catalog>.<schema>.<name>, and this is going to be how you’ll refer to your index in queries. For this data, I’m going with a generic name, ‘wine_demo_index,’ but if you are building multiple Vector Search indices on top of a single table, I recommend a more specific naming convention. For example, we’re only going to be embedding the ‘notes’ column, but if in one index you’ll be embedding ‘notes’ and in another, you’ll be embedding ‘name,’ then maybe you should name one ‘wine_notes_<embedding_model>,’ where you replace <embedding_model> with the name of the embedding model you chose.

Moving on, the Primary Key is the primary key column for your index. In the case of the wine dataset, that’s the ‘name.’

Next, Endpoint, which is the Vector Search endpoint you created earlier (in the ‘Compute’ section).

Columns to sync: This is where you specify which columns to ‘sync,’ or retain, in your Vector Search index. We’ll leave this blank for this demo to sync all columns, but you should evaluate this choice for your specific use case. For example, you could sync only the primary key and embedding column (which are always synced automatically) and simply join back to your source table whenever you need the other columns in your queries.

Okay, here’s where it gets a bit technical, so I will explain each of these as simply as possible.

  • Embedding source: There are two options for creating embeddings: ‘Compute embeddings’ or ‘Use existing embedding model.’ We’ll choose ‘Compute embeddings’ here. Embeddings are vectors that represent the meaning of items, helping algorithms understand complex data. While creating embeddings, long texts may need to be broken into smaller chunks due to limits on input length or for better efficiency and accuracy. However, in our case, the source column is short, so we can directly select ‘Compute embeddings’ without worrying about chunking.
  • Embedding source column: The column on which to generate the embeddings. For this data, we’ll be using the ‘notes’ column.
  • Embedding model: The embedding model used to generate the embeddings. Going into any detail here would require more depth than you probably need as an analyst, so for now, just select ‘databricks-gte-large-en.’
  • Sync computed embeddings: You can toggle this to save the generated embeddings to a table in Unity Catalog. We’re not going to be toggling this.
  • Sync mode controls how often the index syncs with the underlying table, meaning that when you add new records, modify current records, etc., this choice controls how quickly those changes would be reflected in the Vector Search index. Continuous would keep the index in sync with seconds of latency, but it is expensive since it “requires a compute cluster to be provisioned to run the continuous sync streaming pipeline.” Triggered is more cost-effective, but for the index to be in sync, you must manually start the sync using the API. Since this is only an example, we will select ‘Triggered’ because we won’t need to update the source table. However, for production jobs, you should carefully weigh the cost benefits of your sync mode.

Okay, that’s all of the configurations! Congratulations, once you click ‘Create,’ you’ve just created your Vector Search index. You can find the documentation here and read more details if you need more information on any of the parameters discussed here. The process of generating the embeddings will probably take a while, so now is a good time to sip on that glass of wine that you poured yourself earlier ;)

Can We Use Some SQL Now?: The vector_search Function

At this point, you’re probably wondering where the SQL comes in. You’re probably thinking, ‘Reilly, you said this article would be about SQL, and so far, we haven’t used any SQL, and I’ve already had a glass of wine.’ Well, first of all, you’re welcome for convincing you to sample some wine in the name of learning SQL, and second of all, the SQL part comes in now (finally). But first, to summarize what you’ve done so far, you have:

  • Created a Vector Search index using the UI
  • Uploaded an example dataset to Unity Catalog using the UI
  • Generated embeddings and created a Vector Search index on your example dataset using the UI
  • Drank a glass of wine while waiting for the embeddings to generate

It’s finally time to use that Vector Search index you created. As I mentioned in the introduction, Databricks has a suite of built-in SQL AI functions, and the one we are focusing on today is the vector_search function. This function allows you to query a Vector Search index using Databricks SQL. This means that analysts who only know SQL can now harness the power of Vector Search indexes in their queries and dashboards.

So, let’s assume that I really like oak, cherry, and vanilla notes in my wines. I’m going to use the vector_search function to search my index and find new wines to try.

The syntax of the function is as follows:

vector_search(index, query, num_results)

A quick note: you must pass all of the arguments to this function by name, using the => syntax. Like so:

vector_search(
  index => <index name>,
  query => <query text>,
  num_results => <number of results>
)
  • index: STRING, the whole, three-level name (<catalog>.<schema>.<name>) of an existing Vector Search index to query in the same workspace. The person using this must have SELECT permissions on the index
    - Example: demos_rww.articles.wine_demo_index
  • query: STRING, what to search the index for
    - Example: “cherry, oak, vanilla, spicy”
  • num_results (optional): the number of rows to return; default is 10

When executed with the above examples, we get the following:

SELECT * EXCEPT (search_score)
FROM VECTOR_SEARCH(
  index => "demos_rww.articles.wines_index_demo",
  query => "cherry, oak, vanilla, spicy",
  num_results => 15
)

Which returns:

That’s it. No, really, that’s all it takes to query a Vector Search index using Databricks SQL. Now, you can tailor this query to be a bit more specific and get fancy with it by searching for multiple terms at once using a LATERAL query, but that’s all there is to it.

Here’s a quick example of using LATERAL to query for multiple search terms at the same time. To explain what’s going on here a bit more, I first created a temporary view named ‘query_table’ with three search terms (‘cherry, oak, vanilla, spice,’ ‘pear,’ and ‘zesty, lime’) and then I used a LATERAL subquery to search for all of those terms at once using vector_search.

CREATE OR REPLACE TEMPORARY VIEW query_table AS (
  SELECT 'cherry, oak, vanilla, spice' AS query_txt
  UNION ALL SELECT 'pear'
  UNION ALL SELECT 'zesty, lime'
);


SELECT
  query_table.query_txt,
  search.* EXCEPT (search_score)
FROM
  query_table,
  LATERAL (
    SELECT * FROM VECTOR_SEARCH(
      index => "demos_rww.articles.wines_index_demo",
      query => query_table.query_txt,
      num_results => 2
    )
  ) AS search;

The results of the above query look like this:
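One more tip before we move on: the index also returns a search_score column indicating how closely each row matches your query, so you can rank or threshold results with it. And if you synced only the primary key and embedding column (as discussed in the ‘Columns to sync’ section earlier), you can join the matches back to the source table for the remaining columns. The sketch below is illustrative; the source table name and the join key are assumptions based on this demo’s setup.

```sql
-- Illustrative sketch: rank matches by relevance and join back to the
-- source table for any columns that weren't synced to the index.
-- (The source table name here is hypothetical.)
SELECT src.*, hits.search_score
FROM VECTOR_SEARCH(
  index => "demos_rww.articles.wines_index_demo",
  query => "cherry, oak, vanilla, spicy",
  num_results => 15
) AS hits
JOIN demos_rww.articles.wine_ratings AS src
  ON hits.name = src.name
ORDER BY hits.search_score DESC;
```

Joining on the primary key keeps the index small (and cheaper to sync) while still giving your queries access to every column in the source table.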

Using vector_search in a dashboard

A really cool feature of this function is that you can also use it in Databricks dashboards. With some small changes to the query, we can ensure it’s in the right format for a dashboard. Start by navigating to ‘Dashboards’ in the left sidebar and click ‘Create Dashboard’ to make a new dashboard.

Once you’ve created a dashboard, add data to it by clicking ‘Data’ and then ‘Create from SQL.’

When you get to the next page to write your query, modify the multi-term search query from above by replacing the hard-coded search terms with the named parameters :search_1, :search_2, and :search_3. These modifications let the dashboard expose the search terms as parameters.

WITH query_table AS (
  SELECT :search_1 AS query_txt
  UNION ALL
  SELECT :search_2
  UNION ALL
  SELECT :search_3
)
SELECT
  query_table.query_txt,
  search.* EXCEPT (search_score)
FROM
  query_table
  CROSS JOIN LATERAL (
    SELECT * FROM VECTOR_SEARCH(
      index => "demos_rww.articles.wines_index_demo",
      query => query_table.query_txt,
      num_results => 2
    )
  ) AS search;

Once you paste the above query and add values for the parameters, you should see this.

Now click ‘Canvas’ to go to the dashboard canvas and build visualizations. We will keep it simple here and just display a table showing the query results. Click the blue ‘Add a visualization’ button (number 2) and create a table showing the query results (shown in the pink outline). Then, click the ‘Add a filter (field/parameter)’ button (number 3) three times to add three filters to the dashboard, one for each search term. This will allow users to define their own search terms when viewing the dashboard. Finally, once you’re happy with how the dashboard looks, publish it so others can see it by clicking the ‘Publish’ button (number 4).

Now, you have a dashboard where others can query the Vector Search index we built, and they won’t need any SQL or coding skills.

Conclusions

So, by now you should know what a vector store is, why you should care, and how to create a Vector Search index in Databricks using the UI. You should also know how to use the vector_search function and why it’s important that SQL users have access to this feature now. Hopefully, you can also now use that new index you created from the sample data and search for some new wines to sample!

Resources

Please find below links to the official documentation for many of the topics I covered in this article. In addition, I have included links to two different demos: one demonstrating AI Functions and their uses, and the other for deploying an LLM chatbot with RAG (written in Python, but it can be applied using SQL AI Functions).
