Ragas is a popular framework for evaluating Retrieval-Augmented Generation (RAG) applications. Braintrust natively supports the Ragas metrics, with several improvements to aid debugging and accuracy, in our open source autoevals library.
In this cookbook, we'll walk through using a few Ragas metrics to evaluate a simple RAG pipeline that does Q&A on Coda's help desk. We'll reuse many of the components we built in a previous cookbook on RAG, which
you can check out to learn some of the basics around evaluating RAG systems.
Let's dive in and start by installing dependencies:
We'll quickly set up a full end-to-end RAG application, based on our earlier cookbook. We use the Coda Q&A dataset, LanceDB for our vector database, and OpenAI's embedding model.
Done! Next, we'll write some simple, framework-free code to (a) retrieve relevant documents and (b) generate an answer given those documents.
To perform retrieval, we'll use the same embedding model as we did for the document sections to embed the input query, and then search for the
TOP_K (2) most relevant documents.
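To make the retrieval step concrete, here is a minimal, dependency-free sketch of a top-K search. It is not the LanceDB or OpenAI code used in this cookbook: the toy vectors, the document titles, the `top_k_documents` helper, and the choice of cosine similarity as the ranking measure are all assumptions made for illustration.

```python
import math

TOP_K = 2  # same value as in the text


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(query_vector, indexed_docs, k=TOP_K):
    """Return the k document texts whose vectors are most similar to the query.

    `indexed_docs` is a list of (text, vector) pairs. In a real pipeline the
    vectors would come from the embedding model and the search from the vector
    database; here everything is in memory (hypothetical stand-in).
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed_docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" (made up for this example, not real model output).
toy_index = [
    ("Starring a doc", [0.9, 0.1, 0.0]),
    ("Sharing a doc", [0.7, 0.3, 0.1]),
    ("Billing and plans", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.0, 0.0]
results = top_k_documents(query_vector, toy_index)
```

The important property is that the query and the documents are embedded with the same model, so their vectors live in the same space and can be compared directly.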
You'll notice that here and elsewhere we've decorated functions with @braintrust.traced. For now, it's a no-op, but we'll see shortly how @braintrust.traced
helps us trace Python functions and debug them in Braintrust.
Let's try it out on a simple question, and take a look at the retrieved documents:
To generate the final answer, we can simply pass in the retrieved documents and the original question to a simple prompt defined below. Feel free to tweak this prompt as you experiment!
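As a rough illustration of what passing the retrieved documents and the original question to a prompt can look like, here is a small sketch that only builds the prompt string. The wording of the prompt and the `build_prompt` helper are hypothetical, not the exact prompt used in this cookbook, and the call to the model is deliberately left out.

```python
def build_prompt(question, documents):
    """Assemble a single prompt string from the question and the retrieved documents."""
    # Number each document so it is easy to spot in the prompt (and later in traces).
    context = "\n\n".join(
        f"Document {i}:\n{doc}" for i, doc in enumerate(documents, start=1)
    )
    return (
        "Answer the question using only the documents below. "
        "If the answer is not in the documents, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "Does starring a doc affect other users?",
    ["Starring a doc only changes your own view.", "Docs can be shared with a link."],
)
```

Keeping prompt construction in its own function makes it easy to tweak the wording and compare the effect across experiments.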
We'll define a convenience function that combines these two steps and returns both the final answer and the retrieved documents, so we can see whether we picked useful ones. (Returning the documents will also come in handy later for evaluations.)
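One way to sketch such a convenience function is to take the retrieval and generation steps as arguments, which keeps the example runnable without a database or a model. The `answer_question` name, the `answer` and `context` keys, and the stub functions below are assumptions for illustration.

```python
def answer_question(question, retrieve, generate):
    """Run retrieval, then generation, and return both results together.

    `retrieve(question)` should return a list of document strings and
    `generate(question, documents)` should return the answer string.
    """
    documents = retrieve(question)
    answer = generate(question, documents)
    # Return the documents alongside the answer so scorers can inspect both.
    return {"answer": answer, "context": documents}


# Stub steps standing in for the real retrieval and LLM calls (hypothetical).
def fake_retrieve(question):
    return ["Starring a doc only changes your own view."]


def fake_generate(question, documents):
    return f"Based on {len(documents)} document(s): no, it does not."


result = answer_question(
    "Does starring a doc affect other users?", fake_retrieve, fake_generate
)
```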
Perfect! Now that we have the whole system working, we can compute Ragas metrics and try a couple improvements.
To get a large enough sample size for evaluations, we're going to use the synthetic test questions we generated in our earlier cookbook. Feel free to check out that cookbook for details on how the synthetic data generation process works.
Ragas provides a variety of metrics, but for the purposes of this guide, we'll show you how to calculate two scores we've found to be useful:
ContextRecall compares the retrieved context to the information in the ground truth answer. This is a helpful way of testing how relevant the retrieved documents are with respect to the answer itself.
AnswerCorrectness evaluates the generated answer against the golden answer. Under the hood, it checks each statement in the answer and classifies it as a true positive, false positive, or false negative.
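To build some intuition for how statement-level judgments turn into scores, here is a simplified sketch. In practice the labels come from an LLM, and the real metrics may blend in other signals (such as semantic similarity), so treat the two formulas below as assumptions for illustration rather than the exact Ragas or autoevals implementation.

```python
def context_recall(attributed):
    """Fraction of ground-truth statements that can be attributed to the retrieved context.

    `attributed` holds one boolean per statement in the ground-truth answer.
    """
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)


def statement_f1(true_positives, false_positives, false_negatives):
    """F1-style score computed from statement classifications.

    true positive:  statement in the answer that is supported by the golden answer
    false positive: statement in the answer that is not supported by the golden answer
    false negative: statement in the golden answer that is missing from the answer
    """
    denominator = true_positives + 0.5 * (false_positives + false_negatives)
    if denominator == 0:
        return 0.0
    return true_positives / denominator


# Three of four ground-truth statements are covered by the retrieved documents.
recall_score = context_recall([True, True, False, True])

# Two correct statements, one unsupported statement, one missing statement.
f1_score = statement_f1(true_positives=2, false_positives=1, false_negatives=1)
```

Because a single misclassified statement moves the score noticeably when an answer contains only a few statements, it is worth inspecting the individual classifications when a score looks off.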
Before we calculate metrics, we'll write a short wrapper class that splits the returned output and context into two arguments that our Ragas evaluator classes can easily ingest.
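The shape of such a wrapper might look like the sketch below. It assumes the task returns a dictionary with `answer` and `context` keys and that the wrapped scorer accepts `output`, `context`, `expected`, and `input` keyword arguments. The `SplitOutputScorer` class and the stub scorer are hypothetical and use no autoevals code, so the example runs on its own.

```python
class SplitOutputScorer:
    """Wrap a scorer so it receives the answer and the context as separate arguments."""

    def __init__(self, scorer):
        self.scorer = scorer

    def __call__(self, output, expected=None, input=None, **kwargs):
        # The task returns {"answer": ..., "context": ...}; unpack it here.
        return self.scorer(
            output=output["answer"],
            context=output["context"],
            expected=expected,
            input=input,
            **kwargs,
        )


# Stub scorer standing in for a real evaluator (hypothetical): exact-match on the answer.
def exact_match_scorer(output, context, expected=None, input=None):
    return {"score": 1.0 if output == expected else 0.0, "num_documents": len(context)}


wrapped = SplitOutputScorer(exact_match_scorer)
score = wrapped(
    output={"answer": "No.", "context": ["doc one", "doc two"]},
    expected="No.",
    input="Does starring a doc affect other users?",
)
```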
And now we can run our evaluation!
Not bad! It looks like we're doing really well on context recall, but worse on the final answer's correctness.
Although Ragas is very powerful, it can be difficult to get detailed insight into low-scoring values. Braintrust makes that very simple.
Sometimes an average of 67% means that 2/3 of the values had a score of 1 and 1/3 had a score of 0. However, the distribution chart makes it clear
that in our case, many of the scores are partially correct:
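To see why the average alone can mislead, here is a small sketch with two made-up score lists that share the same mean of roughly 67% but have very different distributions. The scores and the `histogram` helper are invented for illustration and are not taken from the experiment.

```python
from collections import Counter
from statistics import mean


def histogram(scores, bins=4):
    """Count how many scores fall into each of `bins` equal-width buckets over [0, 1]."""
    # A score of exactly 1.0 goes into the last bucket.
    counts = Counter(min(int(score * bins), bins - 1) for score in scores)
    return [counts.get(i, 0) for i in range(bins)]


# Two-thirds perfect, one-third completely wrong.
all_or_nothing = [1.0, 1.0, 0.0] * 4

# Every answer is partially correct.
partially_correct = [0.6, 0.7, 0.7] * 4

mean_a = mean(all_or_nothing)
mean_b = mean(partially_correct)
hist_a = histogram(all_or_nothing)
hist_b = histogram(partially_correct)
```

The two means are practically identical, yet the first histogram has all of its mass at the extremes while the second is concentrated in a single middle bucket, which calls for a different kind of debugging.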
Now, let's dig into one of these records. Braintrust allows us to see all the raw outputs from the constituent pieces:
To me, this looks like it might be an error in the scoring function itself. The statement "No, starring a doc in Coda does not affect other users" seems like a true positive, not a false positive.
Let's try changing the scoring model for AnswerCorrectness to be gpt-4, and see if that changes anything.
By default, Ragas is configured to use gpt-3.5-turbo-16k. As we observed, it looks like the AnswerCorrectness score may be returning bogus
results, and maybe we should try using gpt-4 instead. Braintrust lets us test the effect of this quickly, directly in the UI, before we run
a full experiment:
Looks better. Let's update our scoring function to use it and re-run the experiment.
Great, it looks like changing our grading model improved the answer correctness score for the same set of questions:
Now, let's see if we can further optimize our RAG pipeline without regressing scores. We're going to try pulling just one document, rather than two.
Although it isn't an outright failure, it looks like in 3 cases we're no longer retrieving the right documents, and 11 cases had worse results.
We can drill down on individual examples of each regression type to better understand it. The side-by-side diffs built into Braintrust make
it easy to deeply understand every step of the pipeline, for example, which documents were missing, and why.
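The idea behind counting regressions can be sketched in a few lines: line up the scores of two experiments by example and bucket the differences. The `compare_experiments` helper, the example IDs, and the scores below are all made up for illustration; Braintrust computes this comparison for you in the UI.

```python
def compare_experiments(baseline, candidate, tolerance=1e-9):
    """Bucket example IDs by whether the candidate score improved, regressed, or stayed the same.

    `baseline` and `candidate` map example IDs to scores. Only IDs present in
    both experiments are compared.
    """
    summary = {"improved": [], "regressed": [], "unchanged": []}
    for example_id in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[example_id] - baseline[example_id]
        if delta > tolerance:
            summary["improved"].append(example_id)
        elif delta < -tolerance:
            summary["regressed"].append(example_id)
        else:
            summary["unchanged"].append(example_id)
    return summary


# Hypothetical context recall scores with two documents vs. one document.
two_documents = {"q1": 1.0, "q2": 1.0, "q3": 0.5, "q4": 0.75}
one_document = {"q1": 1.0, "q2": 0.5, "q3": 0.5, "q4": 1.0}

summary = compare_experiments(two_documents, one_document)
```

The list of regressed IDs is the natural starting point for drilling into individual examples.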
And there you have it! Ragas is a powerful technique that, with the right tools and iteration, can lead to really high-quality RAG applications. Happy evaling!