Using Vector Store Retriever in a Conversational Chain with LangChain: LanceDB vs. Chroma

Trevor Thayer
Indicium Engineering
7 min read · Jun 10, 2024


Let’s explore how to use a Vector Store retriever in a conversational chain with LangChain. Specifically, we will compare two popular vector stores: LanceDB and Chroma. These tools are crucial for managing and retrieving data efficiently, making them indispensable for AI applications. Vector stores allow AI systems to perform complex tasks with high accuracy and speed, significantly enhancing the performance of applications such as search engines, recommendation systems, and natural language processing (NLP). By efficiently handling high-dimensional data, vector stores enable real-time data processing, which is essential for dynamic and responsive AI systems.

What are Vector Stores?

Before diving in, let’s review what a vector store is. A vector store is a specialized type of database designed to store and manage high-dimensional vectors, which are numerical representations of data. These vectors capture the semantic meaning of data, such as words in a document or features in an image. By converting data into vectors, it becomes possible to perform operations like similarity search, where you can find data points similar to a given query.

This capability is particularly useful in applications like search engines, recommendation systems, and artificial intelligence (AI). For example, in natural language processing (NLP), sentences or words can be transformed into vectors that represent their meanings. A vector store then allows us to efficiently search through these vectors to find relevant information, enabling AI systems to understand and respond to queries more effectively. Therefore, vector stores enable us to ‘chat with’ our data, making interactions with large datasets more intuitive and efficient.

Loading and Preprocessing Data

Before being able to retrieve information from the vector store, there are some preliminary steps you must take to store your data. LangChain provides document loaders that allow you to load and preprocess data. In this example, we will use a PDF document. The process involves three main steps: loading the PDF, splitting the data into manageable chunks, and preparing it for retrieval operations.

  1. Loading: The document loader reads the content of the PDF and transforms it into a format that can be easily manipulated and queried. This initial step involves converting raw data into a structured format, which is essential for subsequent processing. Once the documents are loaded, we call the split_documents function to perform the next step: breaking the documents into chunks.
from langchain_community.document_loaders import PyPDFLoader


def load_pdf(pdf_names):
    """
    Load PDF documents and split them into chunks.

    Args:
        pdf_names (list): List of PDF filenames to be loaded.

    Returns:
        list: List of split document chunks.
    """
    docs = []
    for pdf in pdf_names:
        # Each page of the PDF is loaded as a separate Document object.
        loader = PyPDFLoader(pdf)
        docs.extend(loader.load())
    return split_documents(docs)

2. Splitting: As mentioned, once the documents are loaded, they are divided into smaller sections using the split_documents function defined below. This step is crucial for several reasons:

  • Manageability: Large documents can be unwieldy to process in a single chunk. Splitting them into smaller parts makes them easier to handle.
  • Relevance: Smaller chunks allow for more precise retrieval. When a query is made, the system can return the most relevant sections of the document rather than an entire, large document, improving the efficiency of the retrieval process.

Here’s a code snippet that demonstrates how to split the loaded PDF documents using LangChain. We define separators to specify where the text should be split:

from langchain.text_splitter import RecursiveCharacterTextSplitter


def split_documents(docs):
    """
    Split documents into smaller chunks for processing.

    Args:
        docs (list): List of documents to split.

    Returns:
        list: List of split document chunks.
    """
    r_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=150,
        # Separators are tried in order: paragraph breaks, line breaks,
        # sentence ends, spaces, and finally individual characters.
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    return r_splitter.split_documents(docs)

These separators ensure that the text is divided into coherent sections, making it easier to process and retrieve relevant information efficiently. The chunk size and overlap parameters further control the size of each section and the overlap between consecutive chunks, balancing manageability and relevance.

3. Embedding: After splitting the documents, the next step is to embed these chunks into a vector store. Embedding converts the text into high-dimensional vectors that capture its semantic meaning and stores those vectors. This is done using machine learning models that understand the context and meaning of the words.

Embedding is crucial because it translates the textual information into a numerical format that the vector store can efficiently handle. For instance, two different phrases with similar meanings will have closely positioned vectors in the high-dimensional space, enabling the system to identify and retrieve them effectively.
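To make this concrete, here is a minimal sketch of that idea. The two phrases and the cosine-similarity calculation are illustrative only and are not part of the project code; it simply shows that phrases with similar meanings produce nearby vectors.

import numpy as np
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings()

# Two phrases with similar meanings but different wording (illustrative examples).
vec_a, vec_b = embedding.embed_documents([
    "The cat sat on the mat.",
    "A kitten rested on the rug.",
])

# Cosine similarity close to 1.0 means the vectors point in nearly the same
# direction, i.e. the phrases are semantically similar.
similarity = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(f"Cosine similarity: {similarity:.3f}")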

Here is a code snippet demonstrating how to embed the document splits and store them with Chroma.

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma


def embed_and_store_splits(splits):
    """
    Embed and store document splits in Chroma.

    Args:
        splits (list): List of split document chunks.

    Returns:
        Chroma: Vector store with embedded documents.
    """
    persist_directory = 'docs/chroma/'
    embedding = OpenAIEmbeddings()
    vectordb = Chroma.from_documents(
        documents=splits,
        embedding=embedding,
        # Persist the vectors to disk so they can be reused in later sessions.
        persist_directory=persist_directory,
    )
    return vectordb

Understanding the Embedding Process:

  • Embedding Model: The OpenAIEmbeddings model converts each document section into a high-dimensional vector, leveraging advanced machine learning techniques to understand text context and semantics.
  • Vector Storage: The generated vectors are stored in Chroma, a database designed for efficient storage and retrieval of high-dimensional data, allowing quick and accurate similarity searches.
  • Persistence: The persist_directory parameter specifies where the vectors are stored on the filesystem, ensuring they can be retrieved and used in future sessions without reprocessing.

Embedding and storing document sections as vectors allows for efficient retrieval based on semantic similarity, enhancing speed and accuracy. This process also enables advanced AI functionalities, allowing your system to interact with and understand the data more effectively.
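With the documents embedded, the vector store can be exposed as a retriever and plugged into a conversational chain. Here is a minimal sketch assuming the vectordb returned by embed_and_store_splits above; the chat model name, the value of k, and the memory setup are illustrative choices rather than requirements.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI

# vectordb is the Chroma store returned by embed_and_store_splits above.
# Expose it as a retriever that returns the top-k most similar chunks.
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# Memory keeps the chat history so follow-up questions have context.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

qa_chain = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=retriever,
    memory=memory,
)

result = qa_chain.invoke({"question": "What are the main topics covered in the document?"})
print(result["answer"])

Each call retrieves the most relevant chunks from the vector store and passes them, along with the conversation history, to the language model to generate an answer.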

Let’s explore the differences between LanceDB and Chroma.

Using LanceDB for Vector Retrieval

LanceDB is designed for large-scale, high-performance data management. It’s particularly useful for applications that require handling massive datasets with high efficiency. Setting up LanceDB involves loading your data into the system and configuring it for optimal retrieval performance. This makes it suitable for real-time applications where quick data retrieval is critical.

  1. Loading Data: The data is loaded into LanceDB, where it is indexed and prepared for retrieval. This step ensures that the data is structured in a way that allows for efficient searching and retrieval.
  2. Configuring Retrieval: Once the data is loaded, LanceDB is configured to optimize retrieval performance. This may involve setting up indexing strategies and other optimizations that make searching through large datasets fast and efficient.
  3. Retrieval Operations: With the data loaded and configured, LanceDB can now handle retrieval operations. When a query is made, LanceDB quickly searches through the indexed data and returns the most relevant results, making it ideal for applications that require quick, real-time responses.
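As a rough sketch of what this looks like in practice, LangChain ships a LanceDB integration that mirrors the Chroma workflow shown earlier. The exact constructor arguments vary across langchain_community and lancedb versions, and the database path and table name below are hypothetical, so treat this as an illustrative outline rather than a drop-in implementation.

import lancedb
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAIEmbeddings


def embed_and_store_splits_lancedb(splits):
    """Embed document splits and store them in a local LanceDB table (illustrative)."""
    embedding = OpenAIEmbeddings()
    # Connect to (or create) a LanceDB database directory on disk.
    db = lancedb.connect("docs/lancedb")
    vectordb = LanceDB.from_documents(
        documents=splits,
        embedding=embedding,
        connection=db,           # version-dependent argument
        table_name="documents",  # hypothetical table name
    )
    return vectordb

Retrieval then works the same way as with Chroma, for example via vectordb.similarity_search("your query", k=4) or by wrapping the store with as_retriever() and using it in a conversational chain.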

Using Chroma for Vector Retrieval

Chroma, on the other hand, focuses on simplicity and ease of use. It’s ideal for small to medium-scale applications where integration with existing tools and frameworks is more important than handling massive datasets. Chroma can be easily set up and used to load data, making it a great choice for projects involving natural language processing and conversational AI.

  1. Ease of Setup: Chroma is designed to be user-friendly, with a simple setup process. This makes it accessible for developers who may not have extensive experience with data management systems.
  2. Loading Data: Similar to LanceDB, data is loaded into Chroma and indexed for retrieval. The focus here is on making the process as straightforward as possible.
  3. Integration: Chroma integrates seamlessly with existing tools and frameworks, particularly those used in natural language processing and conversational AI. This makes it an excellent choice for projects where ease of use and quick integration are priorities.

Overview

LanceDB is ideal for high-performance, large-scale applications but requires more setup and has a steeper learning curve. Chroma offers simplicity and ease of use, making it suitable for smaller projects with a focus on quick integration and user-friendliness. So, choosing which to use will depend on your own goals for your system and your team’s capabilities.

Here are some examples of applications suited to each vector store, based on their strengths:

Use Chroma for:

  • Customer Support Chatbots: Quickly deploy a conversational AI that can understand and respond to customer queries effectively by leveraging Chroma’s seamless integration with NLP tools.
  • Content Recommendation in Blogs: Implement a recommendation system for a blog or a small-scale content platform where ease of setup and maintenance is crucial.
  • Personalized Learning Assistants: Develop an educational assistant that provides personalized responses and recommendations to students based on their queries.
  • Document Summarization: Create systems that summarize large documents or articles, making it easier for users to grasp the main points quickly.

Use LanceDB for:

  • Real-time Fraud Detection: For financial institutions needing to analyze large volumes of transaction data in real-time to detect fraudulent activities, LanceDB’s high-performance capabilities are essential.
  • E-commerce Recommendation Engines: Implement sophisticated recommendation systems that can handle and analyze massive datasets of user behavior and product information to provide personalized shopping experiences.
  • Large-scale Document Search Systems: Develop a robust document search and retrieval system for legal, academic, or enterprise environments where quick access to a large repository of documents is necessary.
  • Genomic Data Analysis: Handle and analyze large-scale genomic data for research in bioinformatics, where quick and efficient retrieval of relevant data can significantly accelerate discoveries.

To Review…

In this tutorial, we’ve explored how to use Vector Store retrievers with LangChain, comparing LanceDB and Chroma. We covered the basics of setting up your environment, loading and preprocessing data, and using LanceDB and Chroma for vector retrieval. Each tool has its strengths and is suited to different types of projects, so the choice depends on your specific requirements. There are countless real-world applications of these practices, and depending on your use case, different vector stores might be better suited to your needs.

For further reading and resources, check out the LangChain Documentation and the DeepLearning.AI LangChain Course.

Supporting Code on GitHub

You can find the complete supporting code in the GitHub repository. It demonstrates the process outlined above for chatting with your data using Chroma and includes:

  • The langchain_agent.py script
  • A requirements.txt file
  • A README.md file to guide you through the process
