Comparing RAG Part 2b: Vector Stores and Top-k; FAISS vs Chroma — Retrieving Multiple Documents

Stepkurniawan
Jan 19, 2024


Intro

Both FAISS and Chroma are popular, open-source vector stores that anyone can use freely.

FAISS boasts speed through a combination of techniques, such as Product Quantization (PQ) and inverted file indexing (IVF) together with Hierarchical Navigable Small World (HNSW) graphs, whereas Chroma uses HNSW alone. In this comparison, however, both are configured to be deterministic by fixing all the random seeds in their respective code.
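
As a rough illustration (not the exact index used in this experiment), FAISS can compose these techniques in a single index through its index factory; the factory string, dimensionality, and toy data below are assumptions for demonstration only.

```python
# Illustrative only: composing IVF, HNSW, and PQ in one FAISS index.
import faiss
import numpy as np

d = 1024  # e.g., BGE-large embedding dimensionality
xb = np.random.rand(10_000, d).astype("float32")  # toy corpus vectors

# IVF with 256 lists, an HNSW coarse quantizer, and PQ-compressed codes.
index = faiss.index_factory(d, "IVF256_HNSW32,PQ16")
index.train(xb)  # PQ and IVF both require a training pass
index.add(xb)

distances, ids = index.search(xb[:1], 5)  # top-5 nearest neighbors (L2)
```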

In this experiment, we test which of the two is better at retrieving passages (text chunks) as the retriever in a Retrieval-Augmented Generation (RAG) architecture.

Experiment Setup

1. Knowledge base: sustainable wiki — a wiki focused on sustainability and statistical methods, though it also covers other topics from a normative data science point of view.
2. Splitting parameters: chunk size = 200 characters, chunk overlap = 10% (20 characters)
3. Embedding: BGE-large
4. Distance metric: Euclidean (L2); see the setup sketch after this list
5. Evaluator: RAGAS
6. Questions & ground truth: custom sustainability dataset
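
Below is a minimal sketch of this setup, assuming LangChain's FAISS, Chroma, and BGE embedding wrappers; `wiki_documents` stands in for the loaded knowledge base, and the exact embedding model name is an assumption.

```python
# Sketch: building both vector stores with the parameters listed above.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import FAISS, Chroma

embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,    # 200 characters per chunk
    chunk_overlap=20,  # 10% overlap
)
chunks = splitter.split_documents(wiki_documents)  # placeholder: the loaded wiki pages

# LangChain's FAISS wrapper defaults to an exact L2 (Euclidean) flat index.
faiss_store = FAISS.from_documents(chunks, embeddings)

# Chroma builds an HNSW index; "l2" selects squared-Euclidean distance.
chroma_store = Chroma.from_documents(
    chunks, embeddings, collection_metadata={"hnsw:space": "l2"}
)
```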

Experiment Steps

  1. Set up similar environments for both vector stores, FAISS and Chroma.
  2. Using the same 50 custom queries, test both vector stores; each should retrieve the correct passage from the knowledge base.
  3. Vary the number of documents to be retrieved (top-k) from one to ten.
  4. RAGAS then takes the passages retrieved from the vector stores and compares them with the ground truth (see the evaluation loop sketched after these steps).
  5. If the retrieved passages can answer the ground truth, they receive a high score (max. 1.0); otherwise, a low score (min. 0.0).
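
The loop below is a hypothetical sketch of these steps, assuming LangChain vector stores and the RAGAS `evaluate` API; `questions` and `ground_truths` stand in for the 50 custom queries and their answers.

```python
# Sketch: score each vector store with RAGAS while varying top-k from 1 to 10.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

def score_retriever(store, questions, ground_truths, k):
    # Retrieve the top-k chunks for every query.
    contexts = [
        [doc.page_content for doc in store.similarity_search(q, k=k)]
        for q in questions
    ]
    dataset = Dataset.from_dict({
        "question": questions,
        "contexts": contexts,
        "ground_truth": ground_truths,
    })
    # RAGAS judges each sample with an LLM and averages the scores.
    return evaluate(dataset, metrics=[context_precision, context_recall])

for k in range(1, 11):
    faiss_scores = score_retriever(faiss_store, questions, ground_truths, k)
    chroma_scores = score_retriever(chroma_store, questions, ground_truths, k)
```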

Results

We compared the context precision and context recall of FAISS and Chroma. Additionally, we manually checked the first retrieved passages to verify the quality of RAGAS’ scores.

Overall results of comparing FAISS and Chroma with different numbers of top documents.

Upon examining the data presented in the table, it becomes evident that, in terms of context recall, FAISS generally outperforms Chroma. However, when we shift our focus to context precision, the superiority of one store over the other becomes less clear-cut. The results do not distinctly favor either FAISS or Chroma, indicating that both exhibit comparable precision. Interestingly, Chroma behaves more stably as the number of retrieved documents increases.

Conclusion

In conclusion, while FAISS demonstrates superior context recall compared to Chroma, the distinction between the two in terms of context precision is less definitive. Both vector stores show similar levels of precision, with Chroma displaying more stability as the number of retrieved documents increases. However, further research is needed to determine definitively which one excels across diverse scenarios.

Problem

The RAGAS evaluator is an automated framework that leverages an LLM to assign context precision and recall scores based on the ground truth. At the time of writing, the best LLM to use with RAGAS is GPT-3.5.
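
For reference, here is a hedged sketch of pinning the judge model, assuming RAGAS’ `evaluate` accepts a LangChain chat model through its `llm` parameter and reusing the `dataset` built in the earlier sketch.

```python
# Sketch: using GPT-3.5 as the RAGAS judge model.
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

judge = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
scores = evaluate(dataset, metrics=[context_precision, context_recall], llm=judge)
```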

Unfortunately, we do not have a high usage tier with OpenAI, so we hit a wall when trying to compare more than five retrieved text chunks.

Error message from OpenAI showing that we hit the token rate limit.
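
One possible workaround (the batch size and pause below are guesses, not a tested configuration) is to evaluate in small batches so each stays under the per-minute token quota:

```python
# Hypothetical workaround: batch the RAGAS evaluation to respect rate limits.
import time
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

def evaluate_in_batches(dataset: Dataset, batch_size: int = 5, pause_s: float = 60.0):
    results = []
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        batch = dataset.select(range(start, stop))
        results.append(evaluate(batch, metrics=[context_precision, context_recall]))
        time.sleep(pause_s)  # wait for the per-minute token quota to reset
    return results
```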

Another challenge with the RAGAS evaluator lies in its agreement with human evaluators. As stated in the RAGAS research paper, the algorithm’s decisions align with those of human evaluators only about 70% of the time (for context precision and recall). This discrepancy underscores the complexity of the task and the nuances involved in human judgement. Therefore, while the results presented above provide valuable insights, they should be interpreted as a general guideline rather than an absolute measure.

If you have any questions, feel free to reach me at https://www.linkedin.com/in/stepkurniawan/
