Evaluating RAG Pipelines with Ragas
Leveraging the Ragas framework to determine the performance of your retrieval augmented generation (RAG) pipeline
Artificial intelligence is really cool, but for better or worse, the outputs of all AI models are inferences. In other words, these outputs are educated guesses, and we can never be truly certain that the output is correct. In traditional machine learning contexts, we can often calculate metrics like ROC AUC, RMSE, and more to ensure that a model remains performant over time.
Unfortunately, there aren’t analogous mathematical metrics in the deep learning context, which also includes the outputs of LLMs. More specifically, we might be interested in determining how to assess the effectiveness of retrieval augmented generation (RAG) use cases. Given that we can’t apply some typical mathematical formula to derive a metric, what options does that leave us with?
The first option that is always available is human evaluation. While this is certainly an effective route, it’s neither efficient nor always the most reliable. First, the challenge with using human evaluators is that they come with their own biases, meaning that you can’t expect one evaluator to be consistent with another. Additionally, it can…