A RAG approach using Databricks AI Lakehouse functionalities

Lavinia Hriscu
Published in SDG Group
10 min read · Jan 22, 2024

Please note that the information provided in this article is accurate as of January 2024. However, due to the dynamic nature of the topic, it is recommended to check the source links periodically for updates so that you have the latest information available.

Source: Databricks blog.

In the latter half of 2023, Databricks fortified its position as a preeminent platform for managing generative AI models in a centralized way by introducing innovative components and enhancing existing ones:

1. Vector Search Indexing: Databricks introduced a cutting-edge vector database designed for automatic indexing of data as embedding vectors. This feature also facilitates semantic search functionality based on similarity.

2. Self-Managed Models: Through model serving, Databricks now provides access to open-source models directly within its Marketplace. The catalog has undergone meticulous review to optimize performance. Furthermore, Databricks offers GPU-optimized model serving specifically tailored for leading open-source LLMs.

3. AutoML for LLMs: Building upon the existing toolset, Databricks expanded its AutoML functionality to enable fine-tuning of text classification and embedding models with custom data, ensuring adaptability to diverse use cases.

4. Lakehouse Monitoring: Databricks now offers a comprehensive, unified service for monitoring both data and AI assets integrated into your workflow to improve efficiency.

5. Inference Tables: Delta tables designed to store the inputs and outputs of serving endpoints, enabling a systematic evaluation of performance and result quality.

6. MLflow for LLMOps: Databricks extended the MLflow API to efficiently track LLM models and their parameters. Additionally, new prompt engineering evaluation tools have been integrated for managing and optimizing LLMOps.

We will see all of them in detail in a use case performing Retrieval Augmented Generation (RAG), together with a sample architecture that gives an overview of all the elements required to run the use case in production. By incorporating these advancements, Databricks has solidified its commitment to providing a comprehensive and streamlined environment for working with generative AI models, addressing key aspects of efficiency, adaptability, and performance monitoring.

Model serving

A remarkable innovation is the integration of generative AI models, specifically Large Language Models (LLMs), into Databricks’ Model Serving service. This service provides automatic scalability for the deployed endpoints, adjusting their capabilities to the technical demands of each model, in order to optimize costs and enhance overall performance.

Model serving supports:

  • Custom models: you can register your own Python model packaged in the MLflow format in the Unity Catalog or the Model Registry. Afterwards, you need to create a serving endpoint in order to query your custom model (see Model serving tutorial).
  • Foundation model APIs: accessible open-source models are at your disposal, crafted for optimized inference processes. This resource proves beneficial across diverse applications, including, but not limited to, constructing chatbots, querying Large Language Models (LLMs), and conducting comparative analyses between distinct LLMs.

When developing, Databricks recommends interacting with the models directly within a notebook environment through the Python SDK (see How to query Foundation Model APIs). When moving to production, other options to query such models include the Serving UI (see Send scoring requests with the UI) and the REST API (see Query serving endpoints using the REST API).
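As a rough illustration of what querying a pay-per-token chat endpoint from a notebook can look like, here is a minimal sketch using the MLflow Deployments client (one of several client options); the endpoint name databricks-llama-2-70b-chat and the shape of the response payload are assumptions to verify against your workspace.

```python
# Minimal sketch: querying a pay-per-token chat endpoint from a notebook.
# Assumes the MLflow Deployments client and an endpoint named
# "databricks-llama-2-70b-chat"; verify both against your workspace.
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")  # picks up the notebook's workspace credentials

response = client.predict(
    endpoint="databricks-llama-2-70b-chat",
    inputs={
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is Retrieval Augmented Generation?"},
        ],
        "max_tokens": 256,
        "temperature": 0.1,
    },
)

# Chat endpoints return an OpenAI-style payload (assumption to verify).
print(response["choices"][0]["message"]["content"])
```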

Foundation models require a Databricks API token to authenticate endpoint requests. They are subject to pricing across two distinct modalities, depending on your use case (see Model serving pricing):

  • Pay-per-token: This modality enables the execution of tasks such as chat or text completion and text embedding. Available regions include us-east-1, us-east-2, us-west-2.

The supported models for pay-per-token are listed in the documentation; for additional information, see Supported models and Foundation Model APIs reference.

  • Provisioned throughput: recommended for workloads that require guaranteed performance, fine-tuned models, or additional security requirements. This option requires serverless compute and models logged with either MLflow 2.4 (or above) or Databricks Runtime 13.2 ML (or above). The regions of availability include: us-east-1, us-east-2, us-west-2, ca-central-1, ap-southeast-2, eu-central-1, eu-west-1.

Databricks also recommends registering models in Unity Catalog for faster upload and better performance. You can search for a model family in the Marketplace, click “Get access” on the model page, and finally provide login credentials to install the model into Unity Catalog (see Databricks Marketplace).

Supported model families are:

  1. LLaMA-V2 models
  2. MPT-family models
  3. Mistral models

For more information, see Deploy provisioned throughput Foundation Model APIs.

External providers

Available models offer text embedding and chat or text completion. For more information on supported models and specifications, see External models in Databricks Model Serving.

Serving endpoints

The essential component enabling interaction with the aforementioned models is the serving endpoint, which is linked to one or more models (see Serve multiple models to a serving endpoint).

The configuration of the endpoint varies based on whether the model is custom, from the foundation API, or an external model (for specific setup instructions see Create and configure model serving endpoints).

As with querying, there are also several options for creating a serving endpoint, including the Serving UI and the REST API; a minimal sketch using the REST API is shown below.
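The snippet below sketches the REST route for a custom model registered in Unity Catalog; the host, token, endpoint name, and model path are placeholders, and the exact payload fields should be checked against the Create and configure model serving endpoints documentation.

```python
# Hedged sketch: creating a serving endpoint for a custom Unity Catalog model
# through the REST API. Host, token, endpoint name, model name and version
# are all placeholders.
import requests

DATABRICKS_HOST = "https://<workspace-url>"
DATABRICKS_TOKEN = "<personal-access-token>"

payload = {
    "name": "rag-chatbot-endpoint",  # hypothetical endpoint name
    "config": {
        "served_models": [
            {
                "model_name": "main.rag.chatbot_model",  # hypothetical UC model
                "model_version": "1",
                "workload_size": "Small",
                "scale_to_zero_enabled": True,
            }
        ]
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/serving-endpoints",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```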

At the computational level, it is crucial to highlight the option of selecting either a GPU or CPU workload, each presenting its own set of advantages and disadvantages.

Langchain provides functionalities to wrap Databricks serving endpoints as LLMs (see Langchain: Databricks).
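For instance, a serving endpoint can be wrapped and queried in a couple of lines; the endpoint name below is a placeholder, and the snippet assumes it runs inside a Databricks notebook so credentials are resolved automatically.

```python
# Minimal sketch: wrapping a Databricks serving endpoint as a LangChain LLM.
# The endpoint name is a placeholder; running inside a Databricks notebook
# lets the wrapper pick up workspace credentials automatically.
from langchain.llms import Databricks

llm = Databricks(endpoint_name="rag-chatbot-endpoint")
print(llm("What is a vector database?"))
```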

Requirements

To leverage model serving functionality, a number of prerequisites must be met; they are detailed in the Databricks documentation.

You can find more information about Limitations and Data protection here.

Retrieval Augmented Generation

To apply the diverse innovations we have delineated, we present an architectural framework for implementing RAG within the Databricks environment.

But first of all, what is RAG? Retrieval-augmented generation is an AI framework designed to enhance the accuracy of responses generated by LLMs. The primary motivation behind its development lies in addressing the tendency of LLMs to generate inaccurate information, commonly referred to as hallucinations, particularly when confronted with unfamiliar queries. In addition, RAG presents a cost-effective alternative to more expensive fine-tuning methods.

Implementing RAG involves ingesting use-case-specific data into a vector database. Queries to the model are then formulated so that the response is explicitly framed within the retrieved contextual information, as illustrated by the prompt sketch below.
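To make this concrete, a RAG prompt typically injects the retrieved passages as context and instructs the model to answer only from them; the template below is an illustrative sketch using LangChain’s PromptTemplate, and the wording is entirely an example.

```python
# Illustrative sketch of a RAG prompt: retrieved chunks are injected as
# context, and the instructions constrain the answer to that context.
from langchain.prompts import PromptTemplate

rag_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "You are an assistant that answers questions using only the context below.\n"
        "If the answer is not contained in the context, say you do not know.\n\n"
        "Context:\n{context}\n\n"
        "Question: {question}\n"
        "Answer:"
    ),
)

print(rag_prompt.format(context="<retrieved chunks>", question="<user question>"))
```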

Below, we outline two proposals for implementing Retrieval Augmented Generation in Databricks. It is essential to observe that there are two distinct approaches, namely the production-oriented and the development-oriented, as optimizing the workflow entails tailoring each component to its specific role within each environment.

RAG approach for development.
RAG approach for production.

Detailed example code in LLM Chatbot With Retrieval Augmented Generation (RAG) and Llama 2 70B.

Data preparation

The initial stage involves ingesting the source documents into the Databricks platform. To fit the LLM’s context window, the records must be segmented into manageable chunks. This process includes data parsing and metadata extraction. Since LLMs operate on tokens, the splitting should be performed in tokens, ideally using the same tokenizer as the LLM. Langchain provides several functionalities to implement this first step (see Split by token).
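A minimal sketch of token-based splitting is shown below, assuming LangChain together with a Hugging Face tokenizer that approximates the serving model’s tokenizer; the model name, chunk size, and overlap are placeholders to adapt to your use case.

```python
# Minimal sketch: splitting raw text into token-based chunks with LangChain.
# The tokenizer is a placeholder chosen to approximate the serving model's
# tokenizer; chunk_size and chunk_overlap should be tuned to your use case.
from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")  # gated model; any close tokenizer works
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=500,    # maximum tokens per chunk
    chunk_overlap=50,  # overlap to preserve context across chunk boundaries
)

documents = ["<raw text extracted from your source documents>"]
chunks = [chunk for doc in documents for chunk in splitter.split_text(doc)]
```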

Following the segmentation of documents, the resultant chunks are stored in a Delta Table, and an embedding model must be called to compute embeddings for these chunks. The configuration of both the table structure and the embedding call depends on the specific type of Vector Search Index chosen, as discussed below.
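As a hedged sketch of the first option (letting a Delta Sync index manage the embeddings later on), the snippet below simply persists the chunks to a Delta table and enables Change Data Feed, which the index requires; table names are placeholders and `spark` is the notebook’s Spark session.

```python
# Hedged sketch: persisting the chunks to a Delta table and enabling Change
# Data Feed, which a Delta Sync vector index requires. Table names are
# placeholders; `spark` is the notebook's Spark session and `chunks` comes
# from the splitting sketch above.
from pyspark.sql import Row

rows = [Row(id=i, content=chunk) for i, chunk in enumerate(chunks)]
spark.createDataFrame(rows).write.format("delta").mode("overwrite") \
    .saveAsTable("main.rag.document_chunks")

spark.sql(
    "ALTER TABLE main.rag.document_chunks "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)
```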

Vector Search Index

Databricks Vector Search is designed to store a vectorized representation of data and perform similarity searches on it. In a vector database, the vector search indexes are stored in association with the source data (see Vector Search).

There are two primary types of Vector Search Index:

1. Delta Sync Index: Databricks oversees the indexing process using embeddings from a Delta source table. It automatically synchronizes and updates the indexes in response to changes in the table. You have the option to let Databricks handle the embeddings by specifying the model serving as the endpoint for the embedding model, or you can choose to manage them independently (refer to differences below).

2. Direct Vector Access Index: You have control over both embedding vectors and index updates. Tasks such as reading and writing embedding vectors and metadata are accomplished through a REST API or SDK.

Source: Databricks LLM Chatbot With Retrieval Augmented Generation (RAG) and Llama 2 70B.

To leverage this capability, you need to create a vector search endpoint (see How to create and query a Vector Search index). Subsequently, a vector search index must be created over your data, after which it can be queried through the REST API or the SDK. A single vector search endpoint can host multiple indexes.
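A hedged sketch of this flow with the Python SDK is shown below: it creates a vector search endpoint, builds a Delta Sync index whose embeddings are computed by a Databricks embedding endpoint, and runs a similarity search. All names are placeholders, and the exact method signatures should be checked against the Vector Search reference.

```python
# Hedged sketch: creating a vector search endpoint and a Delta Sync index
# whose embeddings are computed by a Databricks embedding endpoint, then
# running a similarity search. All names are placeholders.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

vsc.create_endpoint(name="rag-vs-endpoint", endpoint_type="STANDARD")

index = vsc.create_delta_sync_index(
    endpoint_name="rag-vs-endpoint",
    index_name="main.rag.document_chunks_index",
    source_table_name="main.rag.document_chunks",
    pipeline_type="TRIGGERED",
    primary_key="id",
    embedding_source_column="content",
    embedding_model_endpoint_name="databricks-bge-large-en",  # example embedding endpoint
)

results = index.similarity_search(
    query_text="How does Databricks Vector Search work?",
    columns=["id", "content"],
    num_results=3,
)
print(results)
```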

Requirements

  • Unity Catalog enabled: The indexes are configured as object-like tables. The administration of diverse permissions associated with them is regulated by the Unity Catalog.
  • Serverless compute enabled in the account console
  • Source tables must have Change Data Feed enabled
  • CREATE TABLE privileges on catalog schema(s) to create indexes.
  • Personal access tokens enabled.

See availability regions and other limitations here.

QA chain

Langchain provides a range of functionalities for interacting with Large Language Models, including operations such as embedding calls, prompt engineering, and retrieval chains. In addition, Langchain can wrap Databricks endpoints and establish vector stores tailored for Databricks Vector Search.

Following the last section, the next step involves orchestrating user interactions by asking questions to an LLM. The model will use relevant information associated with the question, drawing from documents supplied by the Vector Search Index as its contextual framework.
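A hedged sketch of such a chain is shown below, reusing the endpoint, index, and prompt names introduced in the earlier sketches; it assumes a Delta Sync index with Databricks-managed embeddings, and the class and parameter names should be verified against the current LangChain integration.

```python
# Hedged sketch: a retrieval QA chain over the Delta Sync index, answering
# with the chat endpoint. Endpoint, index, and prompt names are carried over
# from the earlier sketches and remain placeholders.
from databricks.vector_search.client import VectorSearchClient
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatDatabricks
from langchain.vectorstores import DatabricksVectorSearch

index = VectorSearchClient().get_index(
    endpoint_name="rag-vs-endpoint",
    index_name="main.rag.document_chunks_index",
)

# For a Delta Sync index with Databricks-managed embeddings, the vector store
# can be built directly from the index object.
retriever = DatabricksVectorSearch(index).as_retriever(search_kwargs={"k": 3})

llm = ChatDatabricks(endpoint="databricks-llama-2-70b-chat", max_tokens=256)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",  # concatenates retrieved chunks into the prompt
    chain_type_kwargs={"prompt": rag_prompt},  # template from the earlier sketch
)

print(qa_chain.run("What does the documentation say about vector indexes?"))
```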

If you want to evaluate prompts to define the best structure for your use case, MLflow is the right tool for that (see Prompt engineering).
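As a rough sketch of such an evaluation, the snippet below scores the QA chain from the previous sketch on a tiny illustrative dataset with mlflow.evaluate; the dataset, metric choice, and the assumption that a plain callable can be passed as the model should all be treated as examples to validate against the MLflow documentation.

```python
# Rough sketch: evaluating the QA chain on a tiny illustrative dataset with
# MLflow. Assumes mlflow.evaluate accepts a callable model that returns one
# prediction per input row; validate against the MLflow documentation.
import mlflow
import pandas as pd

eval_df = pd.DataFrame(
    {
        "inputs": ["What is a Delta Sync index?"],
        "ground_truth": ["An index that Databricks keeps in sync with a Delta source table."],
    }
)

def answer_questions(df: pd.DataFrame) -> pd.Series:
    # Reuses qa_chain from the previous sketch.
    return pd.Series([qa_chain.run(question) for question in df["inputs"]])

with mlflow.start_run():
    results = mlflow.evaluate(
        model=answer_questions,
        data=eval_df,
        targets="ground_truth",
        model_type="question-answering",
    )
    print(results.metrics)
```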

Lakehouse monitoring

The final segment establishes a monitoring framework applicable to various stages within the process, providing a consolidated tool that enables the continual tracking of data quality, model performance, and user feedback over time.

These features make it easier to diagnose possible errors, perform root cause analysis and find solutions. Simultaneously, it facilitates the generation of reporting visualizations to enhance the flexibility and dynamism of the monitoring process.

To monitor a table in Databricks, you create a monitor attached to the table. To monitor the performance of a machine learning model, you attach the monitor to an inference table that holds the model’s inputs and corresponding predictions.
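A heavily hedged sketch of attaching an inference-log monitor with the Lakehouse Monitoring Python API is shown below; the table, schema, and column names are placeholders, and the profile arguments should be confirmed against the current API reference.

```python
# Heavily hedged sketch: attaching an inference-log monitor to an inference
# table with the Lakehouse Monitoring Python API. Table, schema, and column
# names are placeholders; confirm the argument names in the API reference.
from databricks import lakehouse_monitoring as lm

lm.create_monitor(
    table_name="main.rag.endpoint_inference_table",
    profile_type=lm.InferenceLog(
        problem_type="classification",   # per the API; adapt to your model type
        prediction_col="response",
        timestamp_col="timestamp_ms",
        model_id_col="model_version",
        granularities=["1 day"],
    ),
    output_schema_name="main.rag_monitoring",
)
```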

Types of analysis

Requirements

  • Your workspace must be enabled for Unity Catalog and you must have access to Databricks SQL.
  • Only Delta managed tables, external tables, views, materialized views, and streaming tables are supported for monitoring. Materialized views and streaming tables do not support incremental processing.

See information about Lakehouse monitoring expenses.

Databricks’ downsides

Having covered all these innovations and how to use them properly, it is time to discuss the main drawbacks of these functionalities:

  • Steep learning curve and intricate setup: Despite the extensive documentation available, grasping the full scope of Databricks’ framework and tools, along with their interconnections, may pose a challenge for developers initially. The absence of visualization tools hinders a more user-friendly deployment and development experience.
  • High optimization requirement and associated costs: While Databricks’ computation can be cost-effective with optimization, achieving this efficiency demands a deep understanding of the platform and lakehouse infrastructure — particularly crucial for large models with billions of parameters and extensive data sources. Without proper optimization, costs can escalate significantly.
  • Absence of user interface: Once models are deployed, a user-friendly platform becomes essential for interaction. Databricks lacks this interface, necessitating users to acclimate to the technical framework to query the model and retrieve desired information.
  • Separate pricing model: In addition to potential high costs, it’s important to note that certain features, such as Lakehouse monitoring and Vector Search Index, come with separate pricing. This contrasts with other platforms where these functionalities might be seamlessly integrated within a single fee structure.
  • Compact open-source community: As noted, even though there is an extensive array of documents accessible for navigating the Databricks platform, users may encounter complications and uncertainties. Developers usually seek the community for answers and examples, particularly when they perceive the existing documentation to be insufficient in addressing their queries. Unlike some larger communities, the Databricks community is relatively small, and there is restricted no-code support, intensifying the challenges associated with the learning journey previously mentioned.
