Build Generative Q&A over 10M Docs on a Laptop in 4 Minutes

Nicholas · Published in ThirdAI Blog · 5 min read · Apr 11, 2024

With the latest release of ThirdAI, we can build a Q&A system on MSMarco, the largest dataset in the BEIR benchmark, achieving SoTA accuracy in less time than it takes to read this article!

Running on my M1 MacBook Air with 16GB of RAM (using only the CPU cores), the following demo takes only 4 minutes and achieves an accuracy that is on par with or exceeds that of popular Embedding Models.

1. Getting Started

Run pip3 install "thirdai>=0.7.41" to install ThirdAI version 0.7.41 or later. If you don’t already have one, you can get a trial license for ThirdAI here.
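If your trial license comes as an activation key, it can be registered before building the database. The snippet below is a minimal sketch based on ThirdAI's licensing examples; the licensing.activate call and the placeholder key are assumptions, not part of this demo's original code.

from thirdai import licensing

# Placeholder key for illustration; substitute your own trial license key.
licensing.activate("YOUR-THIRDAI-LICENSE-KEY")

Now let’s import the library and download the dataset.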

from thirdai import neural_db as ndb
from thirdai import demos
import pandas as pd
import time

# Downloads msmarco and puts all of the documents into a csv file.
documents, train_queries, test_queries, _ = demos.download_beir_dataset("msmarco")

2. Building the System

qna_model = ndb.NeuralDB(low_memory=True)

start = time.perf_counter()

qna_model.insert([
    ndb.CSV(
        documents,
        id_column="DOC_ID",        # Indicates which column stores the document ids.
        strong_columns=["TITLE"],  # Indicates which column contains the title.
        weak_columns=["TEXT"],     # Indicates which column contains the text.
    )
])

end = time.perf_counter()
print(f"Completed in {end-start:.3f} s")

And that’s it! With just a few lines of code, you’ve built a search system over 8.8 million documents in a matter of minutes, without any specialized hardware!
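Before running the full evaluation, we can sanity-check the index with a single query. This is a quick sketch; the sample question is made up, and it assumes each search result exposes an id and a text field (the id field is also used in the evaluation code below).

# Hypothetical sanity-check query (not part of the benchmark).
results = qna_model.search("what is the definition of a molecule", top_k=3)
for r in results:
    # Assumes each result exposes .id and .text attributes.
    print(r.id, r.text[:80])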

3. Evaluation

def parse_labels(labels):
    return list(map(int, labels.split(":")))


def evaluate(qna_model, test_queries):
    test_df = pd.read_csv(test_queries)
    test_df["DOC_ID"] = test_df["DOC_ID"].map(parse_labels)

    true_positives = 0
    start = time.perf_counter()
    for _, row in test_df.iterrows():
        result = qna_model.search(row["QUERY"], top_k=5)
        if len(result) and result[0].id in row["DOC_ID"]:
            true_positives += 1
    end = time.perf_counter()

    precision = true_positives / len(test_df)
    print(f"precision@1={precision:.3f}")
    avg_time = (end - start) / len(test_df) * 1000
    print(f"average query time: {avg_time:.3f} ms")


evaluate(qna_model=qna_model, test_queries=test_queries)

Output:

precision@1=0.791
average query time: 21.369 ms

We see here that ThirdAI achieves a precision@1 of 0.791. How does that compare to state-of-the-art embedding models and other retrieval systems?

* Numbers for Google T5 and OpenAI Ada are computed using exact search over the embeddings. Using ANN search with a Vector DB will result in slightly worse accuracy.

Now what about the feasibility of these alternatives? OpenAI Ada Embeddings will require 54GB of storage just to store the embeddings, with the Vector DB imposing additional resource requirements on top of that. ThirdAI requires just 10–12GB of RAM on my laptop.
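The 54GB figure is simple arithmetic over the size of the corpus. Here is a back-of-the-envelope check, assuming roughly 8.8 million passages, 1536-dimensional Ada embeddings, and 4-byte floats:

# Rough storage estimate for dense embeddings of the MSMarco corpus.
num_docs = 8_800_000        # ~8.8M passages in MSMarco
embedding_dim = 1536        # dimensionality of OpenAI Ada embeddings
bytes_per_float = 4         # float32

total_gb = num_docs * embedding_dim * bytes_per_float / 1e9
print(f"{total_gb:.1f} GB")  # ~54.1 GB, before any Vector DB overhead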

Running an open source Embedding Model such as T5 requires GPUs to sustain reasonable throughput, and will likely require similar hardware in production to achieve low query latency. ThirdAI has a latency of 21 ms on a laptop, without any specialized hardware, and this same code could easily run on whatever infrastructure you already have at hand, with comparable results.

4. Domain Specialization

While this system already achieves accuracy comparable to the best Embedding Models, what if we want to improve it further? One of the defining characteristics of successful production search systems is their ability to continually improve based on user interactions. For example, say a company uses the custom acronym IDD to mean “initial design document”. Since this acronym doesn’t appear in the training data used to create the LLMs in the search system, user queries like “summarize the IDD for project xyz” will fail because the system doesn’t understand the acronym. With domain specialization, the system can adapt to understand these relationships and answer such queries correctly. These user interactions allow the underlying system to learn patterns and trends in user preferences that aren’t present in the raw documents.
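One lightweight way to encode such a relationship is through NeuralDB's feedback methods. The sketch below assumes an associate(source, target) method as shown in ThirdAI's NeuralDB examples; the acronym and query strings are illustrative, and this is separate from the supervised finetuning shown next.

# Sketch: teach the model that the custom acronym "IDD" means
# "initial design document" (assumes NeuralDB exposes associate()).
qna_model.associate(source="IDD", target="initial design document")

# Queries using the acronym can then surface the intended documents.
results = qna_model.search("summarize the IDD for project xyz", top_k=5)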

A major limitation of the Vector DB + Embedding Model approach is that there is no efficient way to implement this continual learning, especially at scale. Finetuning the embedding model requires expensive GPU hours and is a complicated process involving careful selection of negative samples, data cleaning, and other techniques that are still the subject of ongoing research. Even if we succeed in finetuning the embedding model, we then have to regenerate all of the embeddings and rebuild the Vector DB, which takes even more time and incurs further cost. It is also worth noting that some embedding services, such as OpenAI Ada embeddings, are currently not fine-tunable at all.

With ThirdAI, we can efficiently domain specialize (or finetune) based on user interactions without having to rebuild the entire system. With just a few more seconds and a few more lines of code, we can leverage the 300,000 training queries in the MSMarco dataset to boost performance.

train_df = pd.read_csv(train_queries)
train_df["DOC_ID"] = train_df["DOC_ID"].map(parse_labels)

start = time.perf_counter()

qna_model.supervised_train_with_ref_ids(
    queries=train_df["QUERY"].to_list(), labels=train_df["DOC_ID"].to_list()
)

end = time.perf_counter()
print(f"finetuned in {end-start:.3f} s")

evaluate(qna_model=qna_model, test_queries=test_queries)

Output:

finetuned in 2.164 s
precision@1=0.814
average query time: 21.012 ms

Now if we rerun the evaluation, we see that the precision@1 has improved to 0.814 just by feeding sample user interactions into the system, and the whole finetuning process just took a few seconds.

Comparison of ThirdAI with and without Domain Specialization.

A Note on Energy-Efficient AI

Another point of emphasis should be the environmental impact. While new advances in AI are undoubtedly impressive and revolutionary, their energy requirements are enormous. The Nvidia H100 GPUs now in production are alone projected to surpass the energy usage of a small nation. Finding energy-efficient alternatives is essential as this technology continues to develop. A system like ThirdAI that uses only a fraction of the computing resources and requires no hardware accelerators offers a path to significantly lower energy usage when deploying generative AI systems.

Conclusion

Building a search system using the standard recipe of Embedding Models and Vector DBs quickly runs into a host of issues such as generating and storing the embeddings, scaling the system, and specializing it to your knowledge base.

In this demo we presented a new solution to these problems, with a system that offers unparalleled scaling without sacrificing accuracy. If you are interested in learning more about our technology, please reach out to contact@thirdai.com.
