Vector Search & RAG Landscape: A review with txtai

Evaluating txtai against popular open-source alternatives

David Mezzetti
NeuML
Jul 4, 2024 · 7 min read


The Generative AI space is one of the fastest growing business sectors in the world. Most people don’t know exactly what “AI” is, but many want to get in on the action. Businesses feel compelled to introduce AI into their processes with the expectation of increased efficiency and improved outcomes.

The biggest challenge with Generative AI is creating reliable content based on a business’s own internal data. Large Language Models (LLMs) have a tendency to generate incorrect content, often when there is no conclusive or easy answer. The common term for this is “hallucinations”.

Retrieval Augmented Generation (RAG) helps reduce the risk of hallucinations by limiting the context in which an LLM can generate answers. This is typically done with a search query that hydrates a prompt with relevant context. RAG has been one of the most practical use cases of the Generative AI era.

🚀 Let’s talk about how to do this.

Vector Search and RAG Landscape

Web searches for vector databases and RAG frameworks bring back a seemingly infinite number of options. How does one decide?

The most popular RAG frameworks are LangChain (88K+ ⭐’s on GitHub) and LlamaIndex (33K+ ⭐’s). The fastest growing vector database is Chroma DB (13K+ ⭐’s). All are open source and permissively licensed (Apache 2.0 or MIT) and written in Python.

txtai (7K+ ⭐’s) is an all-in-one embeddings database for semantic search, LLM orchestration and language model workflows. txtai is also open source (Apache 2.0) and written in Python.

The remainder of this article will explore vector search and RAG with txtai versus the referenced libraries. A baseline knowledge of vector search, LLMs and RAG is assumed. If you’re not familiar with these topics, check out the articles below for more.

Vector Search

First, we’ll evaluate vector search. The first step in a RAG process is the R, or Retrieval. This can be any query: a keyword search, a SQL statement, grepping through text, you name it. The most common method, though, is running a vector search.

LangChain vs txtai

While LangChain isn’t quite a vector database, it has a number of integrations built in for working with vector index formats such as Faiss. It pairs those with methods to store associated metadata. Let’s see how this compares to txtai.
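Here is a minimal sketch of the LangChain side. It is a simplified take rather than a copy of the Gist, and the embeddings model shown is an illustrative choice.

```python
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

# The standard txtai intro dataset
data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day"
]

# Build a Faiss-backed vector store with a Hugging Face embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
store = FAISS.from_texts(data, embeddings)

# Run a vector search
print(store.similarity_search("public health story", k=1))
```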

The code above loads the standard txtai intro dataset (a handful of text strings). A full copy of this code is available in this Gist.

While LangChain receives a lot of negative commentary regarding its code quality and developer experience, I found it easy to work with. The architecture made sense and it was easy to review the code when I wanted to understand more. The overall developer experience was good.

A large dataset wasn’t considered, given that the Faiss vector store component only supports flat indexes. Larger datasets (1M+ documents) caused memory issues, and there was no workaround other than switching to a different vector store.

txtai vector indexes use SQLite + Faiss by default. This enables search with SQL and dynamic columns. Results are standard Python dictionaries, which allows direct integration with Pandas/Polars DataFrames.
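For reference, here is a sketch of the equivalent txtai code, indexing the same intro dataset with the same illustrative embeddings model.

```python
from txtai import Embeddings

# content=True stores text and metadata in SQLite alongside the Faiss index
embeddings = Embeddings(path="sentence-transformers/all-MiniLM-L6-v2", content=True)
embeddings.index(data)

# Vector search - results are plain Python dictionaries
print(embeddings.search("public health story", 1))

# SQL search combining vector similarity with standard SQL clauses
print(embeddings.search("SELECT id, text, score FROM txtai WHERE similar('public health story')"))
```

Since results are plain dictionaries, `pd.DataFrame(results)` is all it takes to move them into Pandas.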

LlamaIndex vs txtai

Like LangChain, LlamaIndex also isn’t quite a vector database. I found both frameworks to be very similar. Let’s see how this compares to txtai.
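Here is a minimal sketch of the LlamaIndex side, reusing the data list from the LangChain sketch above. The module paths assume the post-0.10 package layout.

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Wrap the intro dataset in LlamaIndex Document objects
documents = [Document(text=text) for text in data]

# Build an in-memory vector index with a local embeddings model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Retrieve the closest match
print(index.as_retriever(similarity_top_k=1).retrieve("public health story"))
```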

The code above loads the standard txtai intro dataset. A full copy of this code is available in this Gist.

I found LangChain to be easier to work with than LlamaIndex. While both are extremely similar, navigating the LangChain codebase and structures was clearer. If you understand one of these frameworks, though, switching to the other takes minimal effort.

One advantage LlamaIndex has over LangChain with its Faiss component is that a developer can manually create the underlying Faiss index. While this is still more cumbersome than txtai, at least it’s possible.
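Continuing from the sketch above, manually creating the underlying Faiss index looks roughly like this; the HNSW parameters are illustrative.

```python
import faiss

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore

# Manually create the underlying Faiss index - here an HNSW index
# (384 = embedding dimensions for all-MiniLM-L6-v2, 32 = HNSW links per node)
faiss_index = faiss.IndexHNSWFlat(384, 32)

# Hand the prebuilt index to LlamaIndex
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context, embed_model=embed_model)
```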

Otherwise, the same comments as with LangChain apply regarding the results format.

Chroma DB vs txtai

For the last comparison, let’s actually use a project that sets out to be a vector database! Chroma is referenced in many demos as an easy-to-use local vector database. Let’s see how it compares to txtai.
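Here is a sketch of the Chroma code. It reflects the final working configuration described below (persistent client, batched adds, explicit CUDA device), and the data loading is a simplified stand-in for streaming the arxiv_dataset.

```python
import chromadb
from chromadb.utils import embedding_functions

# Persistent client - keeps the index on disk instead of fully in memory
client = chromadb.PersistentClient(path="chroma-arxiv")

# Sentence Transformers embeddings with the CUDA device set explicitly
encoder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-MiniLM-L6-v2", device="cuda"
)

collection = client.create_collection("arxiv", embedding_function=encoder)

# Stand-in for streaming (id, abstract) pairs from the arxiv_dataset on Hugging Face
rows = [
    ("2101.00001", "We study large scale vector search..."),
    ("2101.00002", "A survey of retrieval augmented generation...")
]

def stream(rows, size=1024):
    # Yield batches of rows to avoid Chroma's single-call batch size limit
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

for batch in stream(rows):
    collection.add(ids=[uid for uid, _ in batch], documents=[text for _, text in batch])
```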

The code above builds a vector database using the arxiv_dataset found on Hugging Face. It’s roughly 2.3M article abstracts. A full copy of this code is available in this Gist.

Chroma was easy to install and get started with. It’s built with the great Hnswlib library. The following obstacles were encountered as we went along. All were resolved successfully.

  • While Chroma supports in-memory databases, loading the arxiv_dataset caused memory issues on a machine with 32GB of RAM. Switching to a PersistentClient fixed this.
  • Trying to stream the data with a single add call resulted in a batch size error as mentioned in this issue. Batching the add calls fixed this.
  • The first loading attempt was slow, as the CUDA device must be manually specified to properly utilize the GPU. Setting that improved the vectorization time.

Now let’s look at the code for txtai. It’s two calls: one that creates the Embeddings instance and another that streams the full dataset into the index. txtai also supports HNSW indexes and that setting was enabled to match.
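Here is a sketch of that, using the same stand-in for the dataset rows as above.

```python
from txtai import Embeddings

# Stand-in for streaming (id, abstract) pairs from the arxiv_dataset
rows = [
    ("2101.00001", "We study large scale vector search..."),
    ("2101.00002", "A survey of retrieval augmented generation...")
]

# Create the Embeddings instance - HNSW backend enabled, content stored in SQLite
embeddings = Embeddings(
    path="sentence-transformers/all-MiniLM-L6-v2",
    backend="hnsw",
    content=True
)

# Stream the full dataset into the index as (id, text, tags) tuples
embeddings.index((uid, text, None) for uid, text in rows)
```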

GPU accelerated vectorization is automatically used if it’s available. Batching is built in. txtai uses mmap-ing and other techniques to ensure that memory limits are respected.

Streaming vector generation and offloading those vectors during index creation allows txtai to build large local indexes without running out of memory.

In terms of runtime performance, txtai was 3x faster than Chroma. It’s entirely possible there are ways to streamline Chroma’s performance, but I couldn’t find a solution.

Retrieval Augmented Generation (RAG)

Now let’s talk RAG. Given that Chroma is not a RAG framework and the significant overlap between LangChain and LlamaIndex, we’ll just do a single comparison here with LangChain.
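Here is a sketch of the LangChain RAG pipeline. The document path, chunking parameters and the LLM are illustrative stand-ins rather than the exact configuration from the Gist.

```python
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Extract text from the PDFs (DirectoryLoader defaults to the unstructured library) and chunk it
documents = DirectoryLoader("documents/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(documents)

# Load the chunks into a Faiss vector store
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = FAISS.from_documents(chunks, embeddings).as_retriever()

# Prompt hydrated with the retrieved context
prompt = PromptTemplate.from_template(
    "Answer the following question using only the context below.\n\n"
    "Question: {question}\n\nContext: {context}"
)

# Illustrative local LLM
llm = HuggingFacePipeline.from_model_id(
    model_id="microsoft/Phi-3-mini-4k-instruct", task="text-generation"
)

def format_docs(docs):
    # Join retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# LCEL chain: retrieve -> build prompt -> generate -> parse text
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)

print(chain.invoke("What model does txtai recommend for image captioning?"))
```

And a sketch of the txtai version, with the same illustrative LLM and hypothetical file names.

```python
from txtai import Embeddings, LLM
from txtai.pipeline import Textractor

# Extract and chunk text from the PDFs with Apache Tika (via the Textractor pipeline)
textractor = Textractor(paragraphs=True)
chunks = [paragraph for path in ["document1.pdf", "document2.pdf"] for paragraph in textractor(path)]

# Load the chunks into the default SQLite + Faiss index
embeddings = Embeddings(content=True)
embeddings.index(chunks)

# Illustrative local LLM
llm = LLM("microsoft/Phi-3-mini-4k-instruct")

question = "What model does txtai recommend for image captioning?"
context = "\n".join(result["text"] for result in embeddings.search(question, 3))

print(llm(f"Answer the following question using only the context below.\n\nQuestion: {question}\n\nContext: {context}"))
```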

The code above builds a RAG process using these documents from txtai’s test dataset. A full copy of this code is available in this Gist.

In both cases, all PDF documents had text extracted, chunked and were loaded into Faiss. The same prompts, embeddings model and LLM were used in both cases.

Clearly, txtai can accomplish the same task in much less code.

There are also some differences in how the text is extracted. With the default settings, LangChain uses the unstructured library for text extraction. txtai uses Apache Tika. Unstructured is built in Python and has support for these formats. Apache Tika supports these formats.

txtai is able to preserve structured formatting with tables and lists. This often helps the LLM produce better answers. In this instance, that enabled a more relevant answer with txtai.

  • LangChain says: txtai recommends using the Image Captions Labels model for image captioning.
  • txtai says: txtai recommends the BLIP model for image captioning.

The source with the answer is shown below.

txtai also has an easy-to-use LLM pipeline that automatically loads models from Hugging Face, llama.cpp and APIs (OpenAI, Ollama etc). See this Gist for more on that.
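Here is a sketch of that flexibility; the model names are illustrative.

```python
from txtai import LLM

# Hugging Face Transformers model
llm = LLM("microsoft/Phi-3-mini-4k-instruct")

# llama.cpp model - detected by the .gguf file extension
llm = LLM("TheBloke/Mistral-7B-OpenOrca-GGUF/mistral-7b-openorca.Q4_K_M.gguf")

# API-backed models (routed through LiteLLM)
llm = LLM("gpt-4o")
llm = LLM("ollama/llama3")

print(llm("Answer in one word: what color is the sky?"))
```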

Wrapping Up

An article written by the author and primary maintainer of a software project runs an extensive comparison and concludes that their library is the best? How convenient, right?

For small to medium datasets, most of what’s here doesn’t matter. If you have a couple of documents, you can store the vectors in NumPy, as Karpathy once said. With larger production datasets, performance starts to matter more.
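At that scale, the whole thing really is a few lines of NumPy; the random vectors below stand in for real embeddings.

```python
import numpy as np

# A handful of document vectors (n x d) and one query vector (d,)
vectors = np.random.rand(100, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)

# Normalize, take dot products (cosine similarity) and grab the top 3 matches
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query /= np.linalg.norm(query)
print(np.argsort(-(vectors @ query))[:3])
```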

Developer experience is a preference. Given that I built txtai, it’s no surprise that I like using it. Not everyone sees things the same way, and it’s certainly possible others prefer the structures of the other libraries.

txtai has primarily been one person working on a project for the last 4 years. I’ve often taken the path of just focusing on my own project and tuning out the noise.

The thing is this: txtai does a really good job with the components it focuses on. It doesn’t have the breadth of other frameworks, but it absolutely should be in the conversation. There is also a sense of obligation to make sure that a library that can help many people is more widely known.

So with your next vector search and/or RAG project, give txtai a spin!

Founder/CEO at NeuML. Building easy-to-use semantic search and workflow applications with txtai.