Welcome to Braintrust ! In this notebook, you'll build and evaluate an AI app that answers questions about Coda's help desk .
To provide the LLM with relevant information from Coda's help desk, we'll use a technique called
RAG (retrieval-augmented generation) to infuse our prompts with text from the most-relevant
sections of their docs. To evaluate the performance of our app, we'll use an LLM to generate
question-answer pairs from the docs, and we'll use a technique called model graded evaluation
to automatically evaluate the final responses against the expected answers.
Before starting, please make sure that you have a Braintrust account. If you do not, please sign up or get in touch . After this tutorial, feel free to dig deeper by visiting the docs .
% pip install - U autoevals braintrust requests openai lancedb markdownify
QA_GEN_MODEL = "gpt-3.5-turbo"
QA_ANSWER_MODEL = "gpt-3.5-turbo"
QA_GRADING_MODEL = "gpt-4"
RELEVANCE_MODEL = "gpt-3.5-turbo"
NUM_SECTIONS = 20
NUM_QA_PAIRS = 20 # Increase this number to test at a larger scale
Let's start by downloading the Coda docs and splitting them into their constituent Markdown sections.
import asyncio
import os
import re
import time
import autoevals
import braintrust
import markdownify
import openai
import requests
data = requests.get(
"https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json"
).json()
markdown_docs = [{ "id" : row[ "id" ], "markdown" : markdownify.markdownify(row[ "body" ])} for row in data]
i = 0
markdown_sections = []
for markdown_doc in markdown_docs:
sections = re.split( r " (. * \n = + \n ) " , markdown_doc[ "markdown" ])
current_section = ""
for section in sections:
if not section.strip():
continue
if re.match( r " . * \n = + \n " , section):
current_section = section
else :
section = current_section + section
markdown_sections.append({ "doc_id" : markdown_doc[ "id" ], "section_id" : i, "markdown" : section.strip()})
current_section = ""
i += 1
print ( f "Downloaded {len (markdown_sections) } Markdown sections. Here are the first 3:" )
markdown_sections[: 3 ]
Downloaded 988 Markdown sections. Here are the first 3:
[{'doc_id': '8179780',
'section_id': 0,
'markdown': "Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\n\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\n\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also for others in your team or workspace, you’ll [use pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) instead."},
{'doc_id': '8179780',
'section_id': 1,
'markdown': '**Star your docs**\n==================\n\n\nTo star a doc, hover over its name in the doc list and click the star icon. Alternatively, you can star a doc from within the doc itself. Hover over the doc title in the upper left corner, and click on the star.\n\n\n\nOnce you star a doc, you can access it quickly from the [My Shortcuts](https://coda.io/shortcuts) tab of your doc list.\n\n\n\n![](https://downloads.intercomcdn.com/i/o/793964361/55a80927217f85d68d44a3c3/Star+doc+to+my+shortcuts.gif)And, as your doc needs change, simply click the star again to un-star the doc and remove it from **My Shortcuts**.'},
{'doc_id': '8179780',
'section_id': 2,
'markdown': '**FAQs**\n========\n\n\nWhen should I star a doc and when should I pin it?\n--------------------------------------------------\n\n\n\nStarring docs is best for docs of *personal* importance. Starred docs appear in your **My Shortcuts**, but they aren’t starred for anyone else in your workspace. For instance, you may want to star your personal to-do list doc or any docs you use on a daily basis.\n\n\n\n[Pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) is recommended when you want to flag or shortcut a doc for *everyone* in your workspace or folder. For instance, you likely want to pin your company wiki doc to your workspace. And you may want to pin your team task tracker doc to your team’s folder.\n\n\n\nCan I star docs for everyone?\n-----------------------------\n\n\n\nStarring docs only applies to your own view and your own My Shortcuts. To pin docs (or templates) to your workspace or folder, [refer to this article](https://help.coda.io/en/articles/2865511-starred-pinned-docs).\n\n\n\n\n\n---'}]
The Braintrust AI proxy provides a single API to access OpenAI and Anthropic models, LLaMa 2, Mistral and others. Here we use it to access gpt-3.5-turbo
. Because the Braintrust AI proxy automatically caches and reuses results (when temperature=0
or the seed
parameter is set, or when the caching mode is set to always
), we can re-evaluate the following prompts many times without incurring additional API costs.
If you'd prefer not to use the proxy, simply omit the base_url
and default_headers
parameters below.
client = braintrust.wrap_openai(openai.AsyncOpenAI(
base_url = "https://braintrustproxy.com/v1" ,
default_headers = { "x-bt-use-cache" : "always" },
api_key = os.environ.get( "OPENAI_API_KEY" , "Your OPENAI_API_KEY here" ),
))
Before we start evaluating some prompts, let's first use the LLM to generate a bunch of question/answer pairs from the text at hand. We'll use these QA pairs as ground truth when grading our models later.
import json
from typing import List
from pydantic import BaseModel, Field
class QAPair ( BaseModel ):
questions: List[ str ] = Field(
... , description = "List of questions, all with the same meaning but worded differently"
)
answer: str = Field( ... , description = "Answer" )
class QAPairs ( BaseModel ):
pairs: List[QAPair] = Field( ... , description = "List of question/answer pairs" )
async def produce_candidate_questions (row):
response = await client.chat.completions.create(
model = QA_GEN_MODEL ,
messages = [{ "role" : "user" , "content" : f """ \
Please generate 8 question/answer pairs from the following text. For each question, suggest
2 different ways of phrasing the question, and provide a unique answer.
Content:
{ row[ 'markdown' ] }
""" ,
}],
functions = [
{
"name" : "propose_qa_pairs" ,
"description" : "Propose some question/answer pairs for a given document" ,
"parameters" : QAPairs.schema(),
}
],
)
pairs = QAPairs( ** json.loads(response.choices[ 0 ].message.function_call.arguments))
return pairs.pairs
all_candidates_tasks = [
asyncio.create_task(produce_candidate_questions(a)) for a in markdown_sections[: NUM_SECTIONS ]
]
all_candidates = [ await f for f in all_candidates_tasks]
data = []
row_id = 0
for row, doc_qa in zip (markdown_sections[: NUM_SECTIONS ], all_candidates):
for i, qa in enumerate (doc_qa):
for j, q in enumerate (qa.questions):
data.append(
{
"input" : q,
"expected" : qa.answer,
"metadata" : {
"document_id" : row[ "doc_id" ],
"section_id" : row[ "section_id" ],
"question_idx" : i,
"answer_idx" : j,
"id" : row_id,
"split" : "test" if j == len (qa.questions) - 1 and j > 0 else "train" ,
},
}
)
row_id += 1
print ( f "Generated {len (data) } QA pairs. Here are the first 10..." )
for x in data[: 10 ]:
print (x)
/var/folders/4t/66zjzrhs3wjf928c7dh270kr0000gn/T/ipykernel_9598/1104079419.py:33: PydanticDeprecatedSince20: The \`schema\` method is deprecated; use \`model_json_schema\` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.4/migration/
"parameters": QAPairs.schema(),
/var/folders/4t/66zjzrhs3wjf928c7dh270kr0000gn/T/ipykernel_9598/1104079419.py:33: PydanticDeprecatedSince20: The \`schema\` method is deprecated; use \`model_json_schema\` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.4/migration/
"parameters": QAPairs.schema(),
Generated 212 QA pairs. Here are the first 10...
{'input': 'What is the purpose of starring docs?', 'expected': 'Starring docs is a great way to mark docs of personal importance and have them easily accessible in the My Shortcuts section of your doc list.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 0, 'id': 0, 'split': 'train'}}
{'input': 'How can starring docs help you stay organized?', 'expected': 'Starring docs is a great way to mark docs of personal importance and have them easily accessible in the My Shortcuts section of your doc list.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 1, 'id': 1, 'split': 'test'}}
{'input': 'What happens when you star a doc?', 'expected': 'When you star a doc, it will appear in the My Shortcuts section of your doc list.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 0, 'id': 2, 'split': 'train'}}
{'input': 'Where do starred docs appear in your doc list?', 'expected': 'When you star a doc, it will appear in the My Shortcuts section of your doc list.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 1, 'id': 3, 'split': 'test'}}
{'input': 'Can starred docs be from different workspaces?', 'expected': 'Yes, starred docs can be from multiple different workspaces and they will all appear in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 0, 'id': 4, 'split': 'train'}}
{'input': 'Do starred docs only appear in your personal My Shortcuts section?', 'expected': 'Yes, starred docs can be from multiple different workspaces and they will all appear in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 1, 'id': 5, 'split': 'test'}}
{'input': 'Does starring docs affect the view for others in your workspace?', 'expected': "Starring docs only saves them to your personal My Shortcuts and it doesn't affect the view for others in your workspace.", 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 0, 'id': 6, 'split': 'train'}}
{'input': 'How does starring docs impact other users in your workspace?', 'expected': "Starring docs only saves them to your personal My Shortcuts and it doesn't affect the view for others in your workspace.", 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 1, 'id': 7, 'split': 'test'}}
{'input': 'Is pinning the same as starring docs?', 'expected': 'No, pinning is different from starring docs. You should use pinning when you want to shortcut docs not just for yourself but also for others in your team or workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 0, 'id': 8, 'split': 'train'}}
{'input': 'When should you use pinning instead of starring docs?', 'expected': 'No, pinning is different from starring docs. You should use pinning when you want to shortcut docs not just for yourself but also for others in your team or workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 1, 'id': 9, 'split': 'test'}}
Now let's evaluate a simple prompt that poses each question without providing any context from the Markdown docs. We'll evaluate this naive approach using (again) gpt-3.5-turbo
, with the Factuality prompt from the Braintrust AutoEvals library.
async def simple_qa (input):
completion = await client.chat.completions.create(
model = QA_ANSWER_MODEL ,
messages = [{ "role" : "user" , "content" : f """ \
Please answer the following question:
Question: {input}
""" ,
}],
)
return completion.choices[ 0 ].message.content
await braintrust.Eval(
name = "Coda Help Desk Cookbook" ,
experiment_name = "No RAG" ,
data = data[: NUM_QA_PAIRS ],
task = simple_qa,
scores = [autoevals.Factuality( model = QA_GRADING_MODEL )],
)
Experiment No RAG is running at https://www.braintrust.dev/app/braintrust.dev/p/Coda%20Help%20Desk%20Cookbook/No%20RAG
Coda Help Desk Cookbook [experiment_name=No RAG] (data): 20it [00:00, 45764.36it/s]
Coda Help Desk Cookbook [experiment_name=No RAG] (tasks): 0%| | 0/20 [00:00<?, ?it/s]
=========================SUMMARY=========================
See results for No RAG at https://www.braintrust.dev/app/braintrust.dev/p/Coda%20Help%20Desk%20Cookbook/No%20RAG
The cell above will print a link to a Braintrust experiment -- click on it to view our baseline eval.
Let's see if RAG (retrieval-augmented generation) can improve our results on this task.
First we'll compute embeddings for each Markdown section using text-embedding-ada-002
and create an index over the embeddings in LanceDB , a vector database. Then, for any given query, we can convert it to an embedding and efficiently find the most relevant context for an input query by converting it into an embedding and finding the best matches embedding space, and provide the corresponding text as additional context in our prompt.
import tempfile
import lancedb
tempdir = tempfile.TemporaryDirectory()
LANCE_DB_PATH = os.path.join(tempdir.name, "docs-lancedb" )
@braintrust.traced
async def embed_text (text):
params = dict ( input = text, model = "text-embedding-ada-002" )
response = await client.embeddings.create( ** params)
embedding = response.data[ 0 ].embedding
braintrust.current_span().log(
metrics = { "tokens" : response.usage.total_tokens, "prompt_tokens" : response.usage.prompt_tokens},
metadata = { "model" : response.model},
input = text,
output = embedding,
)
return embedding
embedding_tasks = [asyncio.create_task(embed_text(row[ "markdown" ])) for row in markdown_sections[: NUM_SECTIONS ]]
embeddings = [ await f for f in embedding_tasks]
db = lancedb.connect( LANCE_DB_PATH )
try :
db.drop_table( "sections" )
except :
pass
table = db.create_table(
"sections" ,
data = [
{ "doc_id" : row[ "doc_id" ], "section_id" : row[ "section_id" ], "vector" : embedding}
for (row, embedding) in zip (markdown_sections[: NUM_SECTIONS ], embeddings)
],
)
We're almost there! One more trick -- let's actually retrieve a few more of the best-matching candidates from the vector database than we intend to use, then use gpt-3.5-turbo
to score the relevance of each candidate to the input query. We'll use the TOP_K
blurbs by relevance score in our QA prompt -- this should be a little more intelligent than just using the closest embeddings.
TOP_K = 2
@braintrust.traced
async def relevance_score (query, document):
response = await client.chat.completions.create(
model = RELEVANCE_MODEL ,
messages = [{ "role" : "user" , "content" : f """ \
Consider the following query and a document
Query:
{ query }
Document:
{ document }
Please score the relevance of the document to a query, on a scale of 0 to 1.
""" ,
}],
functions = [
{
"name" : "has_relevance" ,
"description" : "Declare the relevance of a document to a query" ,
"parameters" : {
"type" : "object" ,
"properties" : {
"score" : { "type" : "number" },
},
},
}
],
)
arguments = response.choices[ 0 ].message.function_call.arguments
result = json.loads(arguments)
braintrust.current_span().log(
input = { "query" : query, "document" : document},
output = result,
)
return result[ "score" ]
async def retrieval_qa (input):
embedding = await embed_text( input )
with braintrust.current_span().start_span( name = "vector search" , input = input ) as span:
result = table.search(embedding).limit( TOP_K + 3 ).to_arrow().to_pylist()
docs = [markdown_sections[i[ "section_id" ]][ "markdown" ] for i in result]
relevance_scores = []
for doc in docs:
relevance_scores.append( await relevance_score( input , doc))
span.log(
output = [
{ "doc" : markdown_sections[r[ "section_id" ]][ "markdown" ], "distance" : r[ "_distance" ]} for r in result
],
metadata = { "top_k" : TOP_K , "retrieval" : result},
scores = {
"avg_relevance" : sum (relevance_scores) / len (relevance_scores),
"min_relevance" : min (relevance_scores),
"max_relevance" : max (relevance_scores),
},
)
context = " \n ------ \n " .join(docs[: TOP_K ])
completion = await client.chat.completions.create(
model = QA_ANSWER_MODEL ,
messages = [{ "role" : "user" , "content" : f """ \
Given the following context
{ context }
Please answer the following question:
Question: {input}
""" ,
}],
)
return completion.choices[ 0 ].message.content
await braintrust.Eval(
name = "Coda Help Desk Cookbook" ,
experiment_name = f "RAG TopK= {TOP_K} " ,
data = data[: NUM_QA_PAIRS ],
task = retrieval_qa,
scores = [autoevals.Factuality( model = QA_GRADING_MODEL )],
)
Experiment RAG TopK=2 is running at https://www.braintrust.dev/app/braintrust.dev/p/Coda%20Help%20Desk%20Cookbook/RAG%20TopK%3D2
Coda Help Desk Cookbook [experiment_name=RAG TopK=2] (data): 20it [00:00, 59918.63it/s]
Coda Help Desk Cookbook [experiment_name=RAG TopK=2] (tasks): 0%| | 0/20 [00:00<?, ?it/s]
=========================SUMMARY=========================
RAG TopK=2 compared to No RAG:
68.00% (+23.00%) 'Factuality' score (9 improvements, 2 regressions)
2.78s (+216.68%) 'duration' (0 improvements, 20 regressions)
See results for RAG TopK=2 at https://www.braintrust.dev/app/braintrust.dev/p/Coda%20Help%20Desk%20Cookbook/RAG%20TopK%3D2
Click into the new experiment and check it out. You should notice a few things:
Braintrust will automatically compare the new experiment to your previous one.
You should see an increase in scores with RAG. Click around to see exactly which examples improved.
Try playing around with the constants set at the beginning of this tutorial, such as NUM_QA_PAIRS
, to evaluate on a larger dataset.
We hope you had fun with this tutorial! You can learn more about Braintrust at https://www.braintrust.dev/docs .