file: ./content/docs/changelog.mdx
meta: { "title": "Changelog" }

# Changelog

## Week of 2025-10-20

* Start a Loop conversation from the CMD-K menu
* Move Loop button to the bottom right of the screen
* Use case examples when creating a playground
* Java SDK
* Scope collapse state for span fields by the span type
* Collapse/expand all button for LLM data view
* By default, collapse all messages in LLM data view besides the last turn
* Generate scorer spans when applying scores to logs

## Week of 2025-10-13

* Increased default maximum agentic tool use roundtrips from 5 to 100
* Added support for Gemini tracing
* Added support for Claude 4.5 Haiku
* Added Loop to the prompt and scorer detail pages
* **Refreshed OpenAI Realtime Audio proxy support** - Updated AI proxy to support the latest OpenAI SDK (v6.0+) for realtime audio interactions
  * Added support for both `OpenAIRealtimeWebSocket` (browser/Cloudflare Workers) and `OpenAIRealtimeWS` (Node.js with the ws library)
  * Updated event types to match the current OpenAI Realtime API specification (`response.output_audio.delta`, `response.output_text.delta`, etc.)
  * Added header-based authentication and logging with `x-bt-parent` and `x-bt-compress-audio` headers
  * Improved audio logging with automatic format detection and optional MP3 compression for reduced storage costs
* Added "Pretty" span field display option that optimizes for object value readability and renders object values in markdown
  * The Pretty display option replaces the Markdown option since Pretty renders markdown by default
* Added support for viewing spans in the table on the logs page

### TypeScript SDK version 0.4.8

* Added OpenTelemetry distributed tracing helpers (`contextFromSpanExport()` and `spanContextFromSpanExport()`) for seamless trace propagation between Braintrust and OpenTelemetry across service boundaries

### Python SDK version 0.3.5

* Added DSPy integration with the `wrap_dspy` wrapper for automatic tracing of DSPy applications
* Added OpenTelemetry distributed tracing helpers (`context_from_span_export()` and `span_context_from_span_export()`) for seamless trace propagation between Braintrust and OpenTelemetry across service boundaries

### Python SDK version 0.3.4

* Added support for the `GEMINI_API_KEY` environment variable

### Python SDK version 0.3.3

* Properly support querying versioned datasets

### Data plane (1.1.25)

* Faster real-time queries (no "excluded docs" in most cases)
* Fix thinking for Mistral models
* Significantly faster indexing for large payloads
* Fix a bug with floating-point division in queries
* Fix MCP OAuth flow for self-hosted deployments
* More iterations in hosted tools (100)
* Improve performance for high-selectivity filter queries
* Fix Gemini tool calls that included `$defs` and `$refs` in the schema
* Fix a bug in the `/feedback` endpoint when updating non-root spans

### TypeScript SDK version 0.4.6

* Added `wrapGoogleGenAI` to automatically trace the @google/genai (Google Gemini) library
* Updated span name for `BraintrustMiddleware` spans from `ai-sdk.generateText/streamText` to `ai-sdk.doGenerate/doStream`
* Added
support for `loadPrompt` when using `wrapAISDK`
* Added support for attachments when using `wrapAISDK`
* Fixed remote evals for users in multiple orgs
* Properly support querying versioned datasets
* Fixed versioning when initializing datasets

### TypeScript SDK version 0.4.5

* Fixed SpanComponentsV4 incompatibility bug

### TypeScript SDK version 0.4.4

* Fixed missing zod/v3 imports in logger and wrapper test files
* Fixed prettier formatting fallback when the prettier module is not found in the CLI pull command

### SDK Integrations: LangChain (Python) v0.1.1

* Fixed a bug where multiple calls were not traced as separate traces

## Week of 2025-10-06

* Added GPT-5 Pro support
* Added **Review** page to see spans marked for review in logs, experiments, and datasets across a project
* Fixed Loop prompt optimization of remote evals
* Fixed an issue with thinking events coming from Mistral
* Added Toplist and Big number monitor chart types
* Support for [JSON attachments](/docs/guides/attachments#json-attachments)
* Improved "Raw span data" view and added new buttons to download a span or an entire trace as JSON from the trace viewer

### TypeScript SDK version 0.4.3

* Improved LangChain integrations with simplified parsing for both TypeScript and Python
* Added JSON attachment SDK support

### TypeScript SDK version 0.4.2

* Add OpenTelemetry compatibility mode for TypeScript. This allows OTel spans to work with Evals

### TypeScript SDK version 0.4.1

* Added Google GenAI wrapper support
* Updated Mastra wrapper methods from `generateVNext`/`streamVNext` to `generate`/`stream`
* Moved the langchain-js braintrust dependency to peer dependencies
* Fixed handling of attachments for Anthropic to avoid large base64 strings in the UI
* Fixed preservation of the result object when returning from `wrappedStreamObject` in the AI SDK
* Fixed `LanguageModelV1#supportsUrl` being a function, not a property

### SDK Integrations: LangChain / LangGraph (TypeScript) (version 0.1.0)

* **Breaking change**: Braintrust is now a peer dependency.
Please add a direct dependency on braintrust starting with v0.1.0. This ensures that node package managers will install one version and the langchain-js library uses the installed version.

## Week of 2025-09-29

* Added Anthropic Claude 4.5 Sonnet support
* Fixed Gemini schema support to enable proper function calling and structured outputs when using Google's Gemini models through Braintrust and the AI proxy
* Added Claude Agent SDK integration support
* Added Gemini Flash and Lite Preview (Sept 2025) support
* Improved prompt detail chat logging and added a link to the corresponding trace
* Fixed bugs with parallel tool calling in Loop
* Enabled Loop to write BTQL queries against arbitrary data sources on non-BTQL-sandbox pages

### Python SDK version 0.3.1

* Ensure experiments use SpanComponentsV3 by default

### Python SDK version 0.3.0

* Added OpenTelemetry compatibility mode for seamless integration between Braintrust and OTel tracing
* Added `setup_claude_agent_sdk` for automatic tracing of Claude Agent SDK applications
* Improved Anthropic wrapper to log a consistent input/output format
* Added `strict` parameter to `Prompt.build` for strict schema validation
* Added SpanComponentsV4 support

### TypeScript SDK version 0.4.0

* Added `wrapClaudeAgentSDK` for automatic tracing of Claude Agent SDK applications
* Improved Anthropic wrapper to log a consistent input/output format
* Fixed AI SDK model detection in the `wrapGenerate` callback
* Added SpanComponentsV4 support

### Data plane (1.1.23)

* BTQL enhancements:
  * Sampling operator
  * Much faster score and metric aggregations
* Traces with no root span now show up in the summary table, which allows you to filter to traces that contain only AI spans **without** sending all root spans
* Faster brainstore indexing
  * Improve indexing performance by reducing conflicts between compactions and merges while indexing hot data
* Fix compaction for comments
* Improve vacuuming and retention performance
* Disable parallel tool calls for Azure models

### SDK Integrations: Google ADK (version v0.2.1)

* Simplified SDK setup with the new `setup_adk` replacing `setup_braintrust`

## Week of 2025-09-22

* Added support for creating datasets and scorers with Loop from the experiment, dataset, and logs pages
* Resolved excessive `localStorage` usage in Loop and the BTQL sandbox
* Improved Loop's `from` clause handling in the BTQL sandbox
* Fixed cross-tab syncing and session restoration bugs in Loop
* Prompt/scorer activity view UI updates
  * Before: selecting a version showed a diff against the current editor content, where the selected version is the base of the diff
  * After: prompt versions can be viewed without diffing against the editor. When diff is enabled, the version is shown as incoming, to indicate what would occur when reverting to that version

## SDK Integrations: LangChain / LangGraph (Python) (version 0.0.5)

* Fixed metrics reporting for Anthropic and other models. Credit: eilonmor
* Removed Python 3.8 (past EOL) support
* Fixed the LLM span type to default to task

## SDK Integrations: LangChain / LangGraph (TypeScript) (version 0.0.9)

* Fixed metrics reporting for Anthropic and other models
* Fixed the LLM span type to default to task
## Week of 2025-09-15

* Added support for updating the email associated with billing data
* Added support for iterating on logs in playgrounds
* Added support for scoring existing logs

### SDK Integrations: Google ADK (version v0.2.0)

* Enhanced automatic tracing for runners, agents, and flows - captures complete input/output data and metadata at every step without configuring OpenTelemetry

## Week of 2025-09-08

* Trace tree is now visible in human review mode
* BTQL sandbox improvements
  * Loop is now on the page and can write queries, debug errors, and answer syntax questions
  * Tabs
  * Simple charts
  * Improved auto-complete
* Updated UI color palette
* Custom charts added to the monitor page (requires data plane 1.1.22)
* View state changes for non-saved views
  * Before: we would attempt to restore any previously edited view state to the URL
  * After: with a few exceptions, edited view state for non-saved views is only represented in the URL

### SDK Integrations: LangChain (JS) (version 0.0.7)

* Added `parent` parameter to organize LangChain traces within evaluation hierarchies, enabling better debugging and trace analysis when using LangChain in your evaluations
* Fixed a dependency issue that prevented the integration from using the latest braintrust SDK

### Python SDK version 0.2.7

* Fixed an OpenAI Agents concurrency bug that incorrectly handled root propagation of input/output
* Fixed parent span precedence issues for better trace hierarchy
* Support locking down remote evals via `--dev-org-name` to only accept users from your org
* Added `update-stack-url` CLI option to explicitly change the URL of the data plane

### TypeScript SDK version 0.3.8

* Prevents logging Braintrust API keys when logging Span objects
* Improved error messages when we fail to find evaluators or code definitions

## Data plane (1.1.22)

* Added ability to create and edit custom charts in the monitor dashboard
* Added support for more Grok
models and improved model refresh handling in the `/invoke` endpoint
* Added support for the `IN` clause in BTQL queries
* Improved processing of pydantic-ai OpenTelemetry spans with tool names in span names and proper input/output field mapping
* Added OpenAI Agents logs formatter for better span rendering in the UI
* Added retention support for Postgres WAL and object WAL (write-ahead logs)
* Add S3 lifecycle policies to reclaim additional space from the bucket
* Added authentication support for remote evaluation endpoints
* Improved ability to fetch all datasets efficiently
* New `MAX_LIMIT_FOR_QUERIES` parameter to set the maximum allowable limit for BTQL queries. Larger result sets can still be queried through pagination

### Autoevals PY (version 0.0.130)

* Fold the `braintrust_core` external package into the `autoevals` package, since it is the only user of `braintrust_core`. Future braintrust packages will not depend on the `braintrust_core` py package

## Week of 2025-09-01

* Loop can search through Braintrust's docs and blog posts to help you answer questions about how to use Braintrust, including generating sample code

## Week of 2025-08-25

* Traces in the trace viewer on the logs page can now show all associated traces based on a metadata field or tag

### TypeScript SDK version 0.3.7

* Support locking down remote evals via `--dev-org-name` to only accept users from your org
* Fixed parent span precedence issues for better trace hierarchy
* Improved propagation of parentSpanId into parentSpanContext for OpenTelemetry JS v2 compatibility
* Fold the `@braintrust/core` package into `braintrust`. This package consists of a small set of utility functions that is more easily managed as part of the main `braintrust` package.
After version `0.3.7`, you should no longer need a dependency on `@braintrust/core`

### Python SDK version 0.2.6

* Python SDK now correctly nests spans logged from inside tool calls in OpenAI Agents

### TypeScript SDK version 0.3.6

* OpenAI responses wrapper no longer filters out span data fields when logging
* Fixed `withResponse` and `wrapOpenAI` interaction to not hide response data

## Data plane (1.1.21)

* Process pydantic-ai OTel spans
* AI proxy now supports temperature > 1 for models which allow it
* Preview of data retention on logs, datasets, and experiments

## Week of 2025-08-18

* Monitor page layout changed to be more responsive to screen size
* Various UX improvements to the prompt dialog
* Improved onboarding experience
* Trace timeline layout improvements

## Data plane (1.1.20)

* Brainstore vacuum is enabled by default. This will reclaim space from object storage. As a bonus, vacuum also cleans up more data (segment-level write-ahead logs)
* AI proxy now dynamically fetches updates to the model registry
* Performance improvements to summary, `IS NOT NULL`, and `!= NULL` queries
* Handle cancelled BTQL queries earlier and optimize schema inference queries
* Added a REST API for managing service tokens. See the [docs](/docs/reference/api/ServiceTokens)
* Support custom columns on the experiments page
* Aggregate custom metrics and include more built-in agent metrics in experiments and logs
* Preview of data retention on logs.
You can define a per-project policy under which logs are deleted on a schedule and are no longer available in the UI or API

### Python SDK version 0.2.5

* Support data masking (see [docs](/docs/guides/traces/customize#masking-sensitive-data))
* Remote evals in the Python SDK
* Support tags in Eval hooks
* Validate attachment file readability at creation time

### TypeScript SDK version 0.2.5

* Support data masking (see [docs](/docs/guides/traces/customize#masking-sensitive-data))
* Support tags in Eval hooks
* Validate attachment file readability at creation time

### SDK Integrations: Google ADK (Python) (version 0.1.1)

* Added integration with [Google Agent Development Kit (ADK)](/docs/guides/integrations#google-adk-agent-development-kit)

### Python SDK version 0.2.4

* Allow non-batch span processors in `BraintrustSpanProcessor`

## Week of 2025-08-11

* Pro plan organizations can now downgrade to the Free plan via the settings page without contacting support
* Prevent read-only users from downloading data from the UI

### Python SDK version 0.2.3

* Fix openai-agents to inherit the right tracing context

### TypeScript SDK version 0.2.4

* Support OpenAI Agents SDK

### SDK Integrations: OpenAI Agents (TS) (version 0.0.2)

* Fix openai-agents to inherit the right tracing context

## Data plane (1.1.19)

* Add support for GPT-5 models
* OTel tracing support for Google Agent Development Kit
* OTel support for deleting fields
* Fix binder error handling for malformed BTQL queries
* Enable environment tags on prompt versions

## Week of 2025-08-04

* @mention team members in comments to notify them via email. To mention someone, type "@" and a team member's name or email in any comment input
* You can now assign users to rows in experiments, logs, and datasets. Once assigned, you can filter rows by a specific user or a group of users
* View configuration has been changed to no longer auto-save changes.
It now shows a dirty state, and you have the option of saving the changes or resetting them back to the base view

## Python SDK version 0.2.2

* Added `environment` parameter to `load_prompt`
* The OTel SpanProcessor now keeps `traceloop.*` spans by default
* Experiments can now be run without sending results to the server
* Span creation is significantly faster in Python

## TypeScript SDK version 0.2.3

* Added `environment` parameter to `loadPrompt`
* The OTel SpanProcessor now keeps `traceloop.*` spans by default
* Experiments can now be run without sending results to the server
* Fix `npx braintrust pull` for large prompts

## TypeScript SDK version 0.2.2

* Fix ai-sdk tool call formatting in output
* Log OpenAI Agents input and output to the root span
* Wrap OpenAI `responses.parse`
* Add `wrapTraced` support for generator functions

## Python SDK version 0.2.1

* Fix langchain-py integration tracing when users use a `@traced` method
* Wrap OpenAI `responses.parse`
* Add `@traced` support for generator functions

## Week of 2025-07-28

* New improved UI for the trace tree
* Token and cost metrics are computed per sub-tree in the trace viewer
* Download BTQL sandbox results as JSON or CSV

## Data plane (1.1.18)

This is our largest data plane release in a while, and it includes several significant performance improvements, bug fixes, and new features:

* Improve performance for non-selective searches, e.g. make `foo != 'bar'` faster
* Improve performance for score filters, e.g. make `scores.correctness = 0` faster
* Improve group-by performance. This should make the monitor page and project summary page significantly faster
* Add syntax for explicit casting. You can now use explicit casting functions to cast data to any datatype, e.g. `to_number(input.foo)`, `to_datetime(input.foo)`, etc.
* Fix ILIKE queries on nested JSON: ILIKE queries previously returned incorrect results on nested JSON objects. ILIKE now works as expected for all JSON objects
* Improve backfill performance.
New objects should get picked up faster
* Improve compaction latency. Indexing should kick in much faster, which in particular means data gets indexed a lot sooner
* Improved support for OTel mappings, including the new [GenAI Agent](https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/) conventions and the [Strands framework](https://aws.amazon.com/blogs/opensource/introducing-strands-agents-an-open-source-ai-agents-sdk/)
* Add Gemini 2.5 Flash-Lite GA, GPT-OSS models on several providers, and Claude Opus 4.1

## Week of 2025-07-21

* Moved monitor chart legends to the bottom and increased chart heights
* Fixed a monitor chart issue where the series toggle selector would filter the incorrect series
* Improved the monitor fullscreen experience: charts now open faster and retain their series filter state
* Loop is now available on the experiments page and can render interactive components inside the chat to help you find the UI element that Loop is referencing
* You can now use remote evals with the "+Experiment" button to create a new experiment.
Previously, they were only available in the playground

## TypeScript SDK version 0.2.1

* Fix support for the `openai.chat.completions.parse` method when used with `wrapOpenAI`
* Added support for ai-sdk\@beta with the new `BraintrustMiddleware`
* Support running remote evals as full experiments

## TypeScript SDK version 0.2.0

* When running multiple trials per input (`trial_count > 1`), you can now access the current trial index (0-based) via `hooks.trialIndex` in your task function
* Added `BraintrustExporter` in addition to `BraintrustSpanProcessor`
* Bound max ancestors in git to 1,000

## Python SDK version 0.2.0

* When running multiple trials per input (`trial_count > 1`), you can now access the current trial index (0-based) via `hooks.trial_index` in your task function
* New LiteLLM `wrap_litellm` wrapper
* Increase max ancestors in git to 1,000

## Data plane (1.1.15)

* Add ability to run scorers as tasks in the playground
* You can now use object storage, instead of Redis, as a locks manager
* Support async Python in inline code functions
* Don't re-trigger online scoring on existing traces if only metadata fields like `tags` change

## Week of 2025-07-14

* Add a monitor page UTC timezone toggle
* Improved trace view loading performance for large traces

## Python SDK version 0.1.8

* Added `BraintrustSpanProcessor` to simplify Braintrust's integration with OpenTelemetry

## TypeScript SDK version 0.1.1

* Added `BraintrustSpanProcessor` to simplify integration with OpenTelemetry

## Data plane (1.1.14)

* Switch the default query shape from `traces` to `spans` in the API. This means that BTQL queries will now return 1 row per span, rather than per trace.
This change also applies to the REST API
* Service tokens with scoped, user-independent credentials for system integrations
* Fix a bug where very large experiments (run through the API) would drop spans if they could not flush data fast enough
* Support built-in OTel metrics (contact your account team for more details)
* New parallel backfiller improves the performance of loading data into Brainstore across many projects

## Python SDK version 0.1.7

* Added support for loading prompts by ID via the `load_prompt` function. You can now load prompts directly by their unique identifier:

  ```python
  prompt = braintrust.load_prompt(id="prompt_id_123")
  ```

## TypeScript SDK version 0.1.0

* Fix a bug where large experiments would drop spans if they could not flush data fast enough
* Fix a bug in attachment uploading in evals executed with `npx braintrust eval`
* Upgraded the zod dependency from `^3.22.4` to `^3.25.3`
* Added support for loading prompts by ID via the `loadPrompt` function. You can now load prompts directly by their unique identifier:

  ```typescript #skip-compile
  const prompt = await loadPrompt({ id: "prompt_id_123" });
  ```

## Week of 2025-07-07

* Loop can now create custom code scorers in playgrounds
* Schema builder UI for structured outputs
* Sort datasets when the `Faster tables` feature flag is enabled
* Change LLM duration to be the sum, not the average, of LLM duration across spans
* Add support for Grok 4 and Mistral's Devstral Small Latest

## Data plane (1.1.13)

* Fix support for `COALESCE` with variadic arguments
* Add an option to select logs for online scoring with a BTQL filter
* Add the ability to test online scoring configuration on existing logs
* Mmap-based indexing optimization enabled by default for Brainstore

## Data plane (1.1.12)

\[skipped]

## Week of 2025-06-30

* Time range filters on the logs page

## Data plane (1.1.11)

* Add support for LLaMa 4 Scout for Cerebras
* Turn on index validation (which enables self-healing of failed compactions) in the
CloudFormation by default

## Week of 2025-06-23

* Add support for multi-factor authentication
* Fix a bug with Vertex AI calls when the request includes the anthropic-beta header
* Add a Zapier integration to trigger Zaps when there's a new automation event or a new project

## Data plane (1.1.7)

* Improve performance of error count queries in Brainstore
* Automatically heal segments that fail to compact
* Add support for new models, including o3-pro
* Improve error messages for LLM-originated errors in the proxy

## Autoevals.js v0.0.130

* Remove dependency on `@braintrust/core`

## TypeScript SDK version 0.0.209

* Ensure SpanComponentsV3 encoding works in the browser

## TypeScript SDK version 0.0.208

* Ensure running remote evals (i.e. `runDevServer`) works without the CLI wrapper
* Add span + parent ids to `StartSpanArgs`

## Week of 2025-06-16

* Add OpenAI's [o3-pro](https://platform.openai.com/docs/models/o3-pro) model to the playground and AI proxy
* View parameters are now present in the URL when viewing a default view
* Experiment charting controls have been added to views
* Experiment objects now support tags through the API and on the experiments view
* Add support for Gemini 2.5 Pro, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite

### Python SDK version 0.1.5

* The SDK's under-the-hood log queue no longer blocks when full and has a default size of 25,000 logs. You can configure the max size by setting `BRAINTRUST_LOG_QUEUE_MAX_SIZE` in your environment.
The environment variable `BRAINTRUST_QUEUE_DROP_WHEN_FULL` is no longer used
* Improvements to the logging of parallel tool calls
* Attachments are now converted to base64 data URLs, making it easier to work with image attachments in prompts

### TypeScript SDK version 0.0.207

* The SDK's under-the-hood queue for sending logs now has a default size of 5,000 logs. You can configure the max size by setting `BRAINTRUST_LOG_QUEUE_MAX_SIZE` in your environment
* Improvements to the logging of parallel tool calls
* Attachments are now converted to base64 data URLs, making it easier to work with image attachments in prompts

## Data plane (1.1.6)

* Patch a bug in 1.1.5 related to the `realtime_state` field in the API response

## Data plane (1.1.5)

* The default query timeout in Brainstore is now 32 seconds
* Auto-recompact segments which have been rendered unusable due to an S3-related issue
* Gemini 2.5 models

## Data plane (1.1.4)

* Optimize "Activity" (audit log) queries, which reduces the query workload on Postgres for large traces (even if you are using Brainstore)
* Automatically convert base64 payloads to attachments in the data plane. This reduces the amount of data that needs to be stored in the data plane and improves page load times. You can disable this by setting `DISABLE_ATTACHMENT_OPTIMIZATION=true` or `DisableAttachmentOptimization=true` in your stack
* Improve AI proxy errors for status codes 401-409
* Increase the real-time query memory limit to 10GB in Brainstore

## Week of 2025-06-09

* Correctly propagate `expected` and `metadata` values to function calls when running `invoke`. This means that if you provide `expected` or `metadata`, `input` refers to the top-level input argument. If you pass in a value like `{input: "a"}` together with `expected` or `metadata`, you must now use `{{input.input}}` to refer to the string "a".
This should have no effect on the playground or scorers
* Chat-like thread layout that simplifies thread display to LLM and score data
* Enable all agent nodes to access dataset variables with the mustache variable `{{dataset}}`. For example, to access `metadata.foo` in the third prompt in an agent, you can use `{{dataset.metadata.foo}}`
* Improve reliability of online scoring when logging high volumes of data to a project
* Tags can now be sorted on the project configuration page, which will change their display order in other parts of the UI
* System-only messages are now supported in Anthropic and Bedrock models
* Logs page UI can now filter nested data fields in `metadata`, `input`, `output`, and `expected`

### Python SDK version 0.1.4

* Add `project.publish()` to directly `push` prompts to Braintrust (without running `braintrust push`)
* `@traced` now works correctly with async generator functions
* The OpenAI and Anthropic wrappers set `provider` metadata

### TypeScript SDK version 0.0.206

* Add support for `project.publish()` to directly `push` prompts to Braintrust (without running `braintrust push`)
* The OpenAI and Anthropic wrappers set `provider` metadata

## Week of 2025-06-02

* Support reasoning params and reasoning tokens in streaming and non-streaming responses in the [AI proxy](/docs/guides/proxy) and across the product (requires a stack update to 0.0.74)
* New [braintrust-proxy](https://pypi.org/project/braintrust-proxy/) Python library to help developers integrate with their IDEs to support new reasoning input and output types
* New `@braintrust/proxy/types` module to augment OpenAI libraries with reasoning input and output types
* New streaming protocol between Brainstore and the API server speeds up queries
* Time brushing interaction enabled on Monitor page charts
* Can create user-defined views on the monitoring page
* Live-updating time mode added to the monitoring page
* The `anthropic` package is now included by default in Python functions
*
Audit log queries must now specify an `id` filter for the set of rows to fetch. These queries will only return the audit log for the specified rows, rather than the whole trace
* (Beta) Continuously export logs, experiments, and datasets to S3
* Enable passing `metadata` and `expected` as arguments to the first agent prompt node

### Python SDK version 0.1.3

* Improve retry logic in the control plane connection (used to create new experiments and datasets)

## Week of 2025-05-26

* The "Faster tables" flag is now the default (you may need to update your data plane if you are self-hosted). You should notice that experiments, datasets, and the logs page load much faster
* Add Claude 4 models in Bedrock and Vertex to the AI proxy and playground
* Braintrust now incorporates cached tokens into the cost calculations for experiments and logs. The monitor page also now includes separate lines so you can track costs and counts for uncached, cached, and cache creation tokens
* Native support for thinking parameters in the playground

## Diff traces

Use the **Diff** toggle on the top right of an experiment page to compare traces across experiments. In the **Comparisons** panel on the left, use the drop-down menu to select which experiments to compare. Each row in the trace view will show a list of the outputs from the selected experiments for that trace. You can select multiple experiments to compare at once.

## Data views

There are several ways to view fields in a span. You can set a default data view type in **Settings > Personal** or change the view in the span panel.

* Pretty
  * Parses objects deeply and renders values as Markdown. Optimized for object value readability.
* LLM
  * Parsed LLM messages and tools with Markdown formatting
* LLM raw
  * LLM messages and tools without Markdown formatting
* JSON
  * JSON highlighting and folding
* YAML
  * YAML highlighting and folding
* HTML
  * Render HTML content

{/* Add info on output vs expected */}

{/* ## Arrange span fields You can drag to reorder span fields using the drag handle on each field. When the span container is wide enough, span fields can also be arranged side-by-side. Span field arrangements are persisted for all users per object type, per project. */}

## Re-run a prompt

Select **Run** to edit and re-run any chat completion span inside a trace. In the **Run prompt** window, make any changes you'd like to the prompt and select **Test** to see the output. You can also give this prompt a name and select **Save as custom prompt** to save it to your project's prompt library.

{/* Add video here showing the steps to re-run a prompt */}

---

file: ./content/docs/cookbook/recipes/AISearch.mdx
meta: { "title": "AI Search Bar", "language": "python", "authors": [ { "name": "Austin Moehle", "website": "https://www.linkedin.com/in/austinmxx/", "avatar": "/blog/img/author/austin-moehle.jpg" } ], "date": "2024-03-04", "tags": [ "evals", "sql" ] }

# AI Search Bar

This guide demonstrates how we developed Braintrust's AI-powered search bar, harnessing the power of Braintrust's evaluation workflow along the way.

If you've used Braintrust before, you may be familiar with the project page, which serves as a home base for collections of eval experiments:

![Braintrust Project Page](./../assets/AISearch/project-page-sql.png)

To find a particular experiment, you can type filter and sort queries into the search bar, using standard SQL syntax. But SQL can be finicky -- it's very easy to run into syntax errors like single quotes instead of double, incorrect JSON extraction syntax, or typos.
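To see how easy it is to trip over these details, here's a minimal sketch using Python's built-in `sqlite3` module (a stand-in for illustration only, not Braintrust's actual query engine; the `experiments` table and its columns are hypothetical). Standard SQL uses single quotes for string literals, and JSON extraction requires an exact path expression like `'$.commit'`:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiments (name TEXT, source TEXT)")
conn.execute(
    "INSERT INTO experiments VALUES (?, ?)",
    ("run-1", json.dumps({"commit": "2a43fd1987"})),
)

# Correct: single quotes delimit string literals (double quotes are identifiers).
rows = conn.execute("SELECT name FROM experiments WHERE name = 'run-1'").fetchall()

# JSON extraction: the path syntax must be exact ('$.commit', not 'commit'),
# and a substring match handles abbreviated commit hashes.
rows = conn.execute(
    "SELECT name FROM experiments"
    " WHERE json_extract(source, '$.commit') LIKE '2a43fd1%'"
).fetchall()
print(rows)  # [('run-1',)]
```

DuckDB, which the recipe installs, exposes similar JSON functions, though the exact function set and path syntax vary by engine.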
Users would prefer to just type in an intuitive search like `experiments run on git commit 2a43fd1` or `score under 0.5` and see a corresponding SQL query appear automatically. Let's achieve this using AI, with assistance from Braintrust's eval framework.

We'll start by installing some packages and setting up our OpenAI client.

```python
%pip install -U Levenshtein autoevals braintrust chevron duckdb openai pydantic
```

```python
import os

import braintrust
import openai

PROJECT_NAME = "AI Search Cookbook"

# We use the Braintrust proxy here to get access to caching, but this is totally optional!
openai_opts = dict(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY"),
)
client = braintrust.wrap_openai(
    openai.AsyncOpenAI(default_headers={"x-bt-use-cache": "always"}, **openai_opts)
)

braintrust.login(api_key=os.environ.get("BRAINTRUST_API_KEY", "YOUR_BRAINTRUST_API_KEY"))
dataset = braintrust.init_dataset(PROJECT_NAME, "AI Search Cookbook Data", use_output=False)
```

## Load the data and render the templates

When we ask GPT to translate a search query, we have to account for multiple output options: (1) a SQL filter, (2) a SQL sort, (3) both of the above, or (4) an unsuccessful translation (e.g. for a nonsensical user input). We'll use [function calling](https://platform.openai.com/docs/guides/function-calling) to robustly handle each distinct scenario, with the following output format:

* `match`: Whether or not the model was able to translate the search into a valid SQL filter/sort.
* `filter`: A `WHERE` clause.
* `sort`: An `ORDER BY` clause.
* `explanation`: Explanation for the choices above -- this is useful for debugging and evaluation.
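To make the format concrete, a function-call payload following this scheme might look like one of the two shapes below (the values are hypothetical illustrations, not actual model output):

```python
import json

# A successful translation into SQL filter/sort clauses.
sql_response = {
    "value": {
        "type": "SQL",
        "filter": "scores.accuracy < 0.5",
        "sort": "last_updated DESC",
        "explanation": "The user asked for low-scoring experiments, newest first.",
    }
}

# A plain keyword match, when no filter/sort translation applies.
match_response = {
    "value": {
        "type": "MATCH",
        "explanation": "The input looks like a free-text search, not a filter.",
    }
}

# The function-calling API returns arguments as a JSON string, so in practice
# we round-trip through json.loads before reading the fields.
payload = json.loads(json.dumps(sql_response))
print(payload["value"]["type"])  # SQL
```

The `type` discriminator lets downstream code branch on which case the model chose without guessing from the presence or absence of fields.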
```python import dataclasses from typing import Literal, Optional, Union from pydantic import BaseModel, Field, create_model @dataclasses.dataclass class FunctionCallOutput: match: Optional[bool] = None filter: Optional[str] = None sort: Optional[str] = None explanation: Optional[str] = None error: Optional[str] = None class Match(BaseModel): type: Literal["MATCH"] = "MATCH" explanation: str = Field( ..., description="Explanation of why I called the MATCH function" ) class SQL(BaseModel): type: Literal["SQL"] = "SQL" filter: Optional[str] = Field(..., description="SQL filter clause") sort: Optional[str] = Field(..., description="SQL sort clause") explanation: str = Field( ..., description="Explanation of why I called the SQL function and how I chose the filter and/or sort clauses", ) class Query(BaseModel): value: Union[Match, SQL] = Field( ..., ) def function_choices(): return [ { "name": "QUERY", "description": "Break down the query either into a MATCH or SQL call", "parameters": Query.model_json_schema(), }, ] ``` ## Prepare prompts for evaluation in Braintrust Let's evaluate two different prompts: a shorter prompt with a brief explanation of the problem statement and description of the experiment schema, and a longer prompt that additionally contains a set of few-shot example cases to guide the model. There's nothing special about either of these prompts, and that's OK -- we can iterate and improve the prompts when we use Braintrust to drill down into the results. ```python import json SHORT_PROMPT_FILE = "./assets/short_prompt.tmpl" LONG_PROMPT_FILE = "./assets/long_prompt.tmpl" FEW_SHOT_EXAMPLES_FILE = "./assets/few_shot.json" with open(SHORT_PROMPT_FILE) as f: short_prompt = f.read() with open(LONG_PROMPT_FILE) as f: long_prompt = f.read() with open(FEW_SHOT_EXAMPLES_FILE, "r") as f: few_shot_examples = json.load(f) ``` One detail worth mentioning: each prompt contains a stub for dynamic insertion of the data schema.
This is motivated by the need to handle semantic searches like `more than 40 examples` or `score < 0.5` that don't directly reference a column in the base table. We need to tell the model how the data is structured and what each field actually *means*. We'll construct a descriptive schema using [pydantic](https://docs.pydantic.dev/latest/) and paste it into each prompt to provide the model with this information. ```python from typing import Any, Callable, Dict, List import chevron class ExperimentGitState(BaseModel): commit: str = Field( ..., description="Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. `(source->>'commit') ILIKE '{COMMIT}%'`", ) branch: str = Field(..., description="Git branch name") tag: Optional[str] = Field(..., description="Git commit tag") commit_time: int = Field(..., description="Git commit timestamp") author_name: str = Field(..., description="Author of git commit") author_email: str = Field(..., description="Email address of git commit author") commit_message: str = Field(..., description="Git commit message") dirty: Optional[bool] = Field( ..., description="Whether the git state was dirty when the experiment was run. If false, the git state was clean", ) class Experiment(BaseModel): id: str = Field(..., description="Experiment ID, unique") name: str = Field(..., description="Name of the experiment") last_updated: int = Field( ..., description="Timestamp marking when the experiment was last updated.
If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time `get_current_time()` by adding or subtracting an interval.", ) creator: Dict[str, str] = Field(..., description="Information about the experiment creator") source: ExperimentGitState = Field(..., description="Git state that the experiment was run on") metadata: Dict[str, Any] = Field( ..., description="Custom metadata provided by the user. Ignore this field unless the query mentions metadata or refers to a metadata key specifically", ) def build_experiment_schema(score_fields: List[str]): ExperimentWithScoreFields = create_model( "Experiment", __base__=Experiment, **{field: (Optional[float], ...) for field in score_fields}, ) return json.dumps(ExperimentWithScoreFields.model_json_schema()) ``` Our prompts are ready! Before we run our evals, we just need to load some sample data and define our scoring functions. ## Load sample data Let's load our examples. Each example case contains `input` (the search query) and `expected` (function call output). ```python import json @dataclasses.dataclass class Example: input: str expected: FunctionCallOutput metadata: Optional[Dict[str, Any]] = None EXAMPLES_FILE = "./assets/examples.json" with open(EXAMPLES_FILE) as f: examples_json = json.load(f) templates = [ Example(input=e["input"], expected=FunctionCallOutput(**e["expected"])) for e in examples_json["examples"] ] # Each example contains a few dynamic fields that depend on the experiments # we're searching over.
SCORE_FIELDS = ["avg_sql_score", "avg_factuality_score"] def render_example(example: Example, args: Dict[str, Any]) -> Example: render_optional = lambda template: (chevron.render(template, args, warn=True) if template is not None else None) return Example( input=render_optional(example.input), expected=FunctionCallOutput( match=example.expected.match, filter=render_optional(example.expected.filter), sort=render_optional(example.expected.sort), explanation=render_optional(example.expected.explanation), ), ) examples = [render_example(t, {"score_fields": SCORE_FIELDS}) for t in templates] ``` Let's also split the examples into a training set and test set. For now, this won't matter, but later on when we fine-tune the model, we'll want to use the test set to evaluate the model's performance. ```python for i, e in enumerate(examples): if i < 0.8 * len(examples): e.metadata = {"split": "train"} else: e.metadata = {"split": "test"} ``` Insert our examples into a Braintrust dataset so we can introspect and reuse the data later. ```python for example in examples: dataset.insert( input=example.input, expected=example.expected, metadata=example.metadata ) dataset.flush() records = list(dataset) print(f"Generated {len(records)} records. Here are the first 2...") for record in records[:2]: print(record) ``` ``` Generated 45 records. Here are the first 2... {'id': '05e44f2c-da5c-4f5e-a253-d6ce1d081ca4', 'span_id': 'c2329825-10d3-462f-890b-ef54323f8060', 'root_span_id': 'c2329825-10d3-462f-890b-ef54323f8060', '_xact_id': '1000192628646491178', 'created': '2024-03-04T08:08:12.977238Z', 'project_id': '61ce386b-1dac-4027-980f-2f3baf32c9f4', 'dataset_id': 'cbb856d4-b2d9-41ea-a5a7-ba5b78be6959', 'input': 'name is foo', 'expected': {'sort': None, 'error': None, 'match': False, 'filter': "name = 'foo'", 'explanation': 'I interpret the query as a string equality filter on the "name" column. 
The query does not have any sort semantics, so there is no sort.'}, 'metadata': {'split': 'train'}, 'tags': None} {'id': '0d127613-505c-404c-8140-2c287313b682', 'span_id': '1e72c902-fe72-4438-adf4-19950f8a2c57', 'root_span_id': '1e72c902-fe72-4438-adf4-19950f8a2c57', '_xact_id': '1000192628646491178', 'created': '2024-03-04T08:08:12.981295Z', 'project_id': '61ce386b-1dac-4027-980f-2f3baf32c9f4', 'dataset_id': 'cbb856d4-b2d9-41ea-a5a7-ba5b78be6959', 'input': "'highest score'", 'expected': {'sort': None, 'error': None, 'match': True, 'filter': None, 'explanation': 'According to directive 2, a query entirely wrapped in quotes should use the MATCH function.'}, 'metadata': {'split': 'train'}, 'tags': None} ``` ## Define scoring functions How do we score our outputs against the ground truth queries? We can't rely on an exact text match, since there are multiple correct ways to translate a SQL query. Instead, we'll use two approximate scoring methods: (1) `SQLScorer`, which roundtrips each query through `json_serialize_sql` to normalize before attempting a direct comparison, and (2) `AutoScorer`, which delegates the scoring task to `gpt-4`. 
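The last fallback in `SQLScorer` is worth seeing in isolation: when a generated clause is valid SQL but not provably equivalent to the expected one, it earns partial credit based on normalized edit distance. A pure-Python sketch of that step (the implementation below uses the `Levenshtein` package instead):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def clause_similarity(output: str, expected: str) -> float:
    # Score in [0, 1]: 1 minus the distance normalized by the longer clause.
    max_len = max(len(output), len(expected))
    if max_len == 0:
        return 0.0  # mirrors the "Bad example: empty clause" case below
    return 1 - edit_distance(output, expected) / max_len

# Equivalent-but-not-identical clauses still earn most of the credit.
print(clause_similarity("avg_sql_score < 0.5", "avg_sql_score<0.5"))
```

This keeps the scorer from treating a cosmetic difference (whitespace, quoting style) the same as a completely wrong clause.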
```python import duckdb from braintrust import current_span, traced from Levenshtein import distance from autoevals import Score, Scorer, Sql EXPERIMENTS_TABLE = "./assets/experiments.parquet" SUMMARY_TABLE = "./assets/experiments_summary.parquet" duckdb.sql(f"DROP TABLE IF EXISTS experiments; CREATE TABLE experiments AS SELECT * FROM '{EXPERIMENTS_TABLE}'") duckdb.sql( f"DROP TABLE IF EXISTS experiments_summary; CREATE TABLE experiments_summary AS SELECT * FROM '{SUMMARY_TABLE}'" ) def _test_clause(*, filter=None, sort=None) -> bool: clause = f""" SELECT experiments.id AS id, experiments.name, experiments_summary.last_updated, experiments.user AS creator, experiments.repo_info AS source, experiments_summary.* EXCLUDE (experiment_id, last_updated), FROM experiments LEFT JOIN experiments_summary ON experiments.id = experiments_summary.experiment_id {'WHERE ' + filter if filter else ''} {'ORDER BY ' + sort if sort else ''} """ current_span().log(metadata=dict(test_clause=clause)) try: duckdb.sql(clause).fetchall() return True except Exception: return False def _single_quote(s): return f"""'{s.replace("'", "''")}'""" def _roundtrip_filter(s): return duckdb.sql( f""" SELECT json_deserialize_sql(json_serialize_sql({_single_quote(f"SELECT 1 WHERE {s}")})) """ ).fetchall()[0][0] def _roundtrip_sort(s): return duckdb.sql( f""" SELECT json_deserialize_sql(json_serialize_sql({_single_quote(f"SELECT 1 ORDER BY {s}")})) """ ).fetchall()[0][0] def score_clause( output: Optional[str], expected: Optional[str], roundtrip: Callable[[str], str], test_clause: Callable[[str], bool], ) -> float: exact_match = 1 if output == expected else 0 current_span().log(scores=dict(exact_match=exact_match)) if exact_match: return 1 roundtrip_match = 0 try: if roundtrip(output) == roundtrip(expected): roundtrip_match = 1 except Exception as e: current_span().log(metadata=dict(roundtrip_error=str(e))) current_span().log(scores=dict(roundtrip_match=roundtrip_match)) if roundtrip_match: return 1 # If 
the queries aren't equivalent after roundtripping, it's not immediately clear # whether they are semantically equivalent. Let's at least check that the generated # clause is valid SQL by running the `test_clause` function defined above, which # runs a test query against our sample data. valid_clause_score = 1 if test_clause(output) else 0 current_span().log(scores=dict(valid_clause=valid_clause_score)) if valid_clause_score == 0: return 0 max_len = max(len(clause) for clause in [output, expected]) if max_len == 0: current_span().log(metadata=dict(error="Bad example: empty clause")) return 0 return 1 - (distance(output, expected) / max_len) class SQLScorer(Scorer): """SQLScorer uses DuckDB's `json_serialize_sql` function to determine whether the model's chosen filter/sort clause(s) are equivalent to the expected outputs. If not, we assign partial credit to each clause depending on (1) whether the clause is valid SQL, as determined by running it against the actual data and seeing if it errors, and (2) a distance-wise comparison to the expected text. 
""" def _run_eval_sync( self, output, expected=None, **kwargs, ): if expected is None: raise ValueError("SQLScorer requires an expected value") name = "SQLScorer" expected = FunctionCallOutput(**expected) function_choice_score = 1 if output.match == expected.match else 0 current_span().log(scores=dict(function_choice=function_choice_score)) if function_choice_score == 0: return Score(name=name, score=0) if expected.match: return Score(name=name, score=1) filter_score = None if output.filter and expected.filter: with current_span().start_span("SimpleFilter") as span: filter_score = score_clause( output.filter, expected.filter, _roundtrip_filter, lambda s: _test_clause(filter=s), ) elif output.filter or expected.filter: filter_score = 0 current_span().log(scores=dict(filter=filter_score)) sort_score = None if output.sort and expected.sort: with current_span().start_span("SimpleSort") as span: sort_score = score_clause( output.sort, expected.sort, _roundtrip_sort, lambda s: _test_clause(sort=s), ) elif output.sort or expected.sort: sort_score = 0 current_span().log(scores=dict(sort=sort_score)) scores = [s for s in [filter_score, sort_score] if s is not None] if len(scores) == 0: return Score( name=name, score=0, error="Bad example: no filter or sort for SQL function call", ) return Score(name=name, score=sum(scores) / len(scores)) @traced("auto_score_filter") def auto_score_filter(openai_opts, **kwargs): return Sql(**openai_opts)(**kwargs) @traced("auto_score_sort") def auto_score_sort(openai_opts, **kwargs): return Sql(**openai_opts)(**kwargs) class AutoScorer(Scorer): """AutoScorer uses the `Sql` scorer from the autoevals library to auto-score the model's chosen filter/sort clause(s) against the expected outputs using an LLM. 
""" def __init__(self, **openai_opts): self.openai_opts = openai_opts def _run_eval_sync( self, output, expected=None, **kwargs, ): if expected is None: raise ValueError("AutoScorer requires an expected value") input = kwargs.get("input") if input is None or not isinstance(input, str): raise ValueError("AutoScorer requires an input value of type str") name = "AutoScorer" expected = FunctionCallOutput(**expected) function_choice_score = 1 if output.match == expected.match else 0 current_span().log(scores=dict(function_choice=function_choice_score)) if function_choice_score == 0: return Score(name=name, score=0) if expected.match: return Score(name=name, score=1) filter_score = None if output.filter and expected.filter: result = auto_score_filter( openai_opts=self.openai_opts, input=input, output=output.filter, expected=expected.filter, ) filter_score = result.score or 0 elif output.filter or expected.filter: filter_score = 0 current_span().log(scores=dict(filter=filter_score)) sort_score = None if output.sort and expected.sort: result = auto_score_sort( openai_opts=self.openai_opts, input=input, output=output.sort, expected=expected.sort, ) sort_score = result.score or 0 elif output.sort or expected.sort: sort_score = 0 current_span().log(scores=dict(sort=sort_score)) scores = [s for s in [filter_score, sort_score] if s is not None] if len(scores) == 0: return Score( name=name, score=0, error="Bad example: no filter or sort for SQL function call", ) return Score(name=name, score=sum(scores) / len(scores)) ``` ## Run the evals! We'll use the Braintrust `Eval` framework to set up our experiments according to the prompts, dataset, and scoring functions defined above. ```python def build_completion_kwargs( *, query: str, model: str, prompt: str, score_fields: List[str], **kwargs, ): # Inject the JSON schema into the prompt to assist the model. 
schema = build_experiment_schema(score_fields=score_fields) system_message = chevron.render( prompt.strip(), {"schema": schema, "examples": few_shot_examples}, warn=True ) messages = [ {"role": "system", "content": system_message}, {"role": "user", "content": f"Query: {query}"}, ] # We use the legacy function choices format for now, because fine-tuning still requires it. return dict( model=model, temperature=0, messages=messages, functions=function_choices(), function_call={"name": "QUERY"}, ) def format_output(completion): try: function_call = completion.choices[0].message.function_call arguments = json.loads(function_call.arguments)["value"] match = arguments.pop("type").lower() == "match" return FunctionCallOutput(match=match, **arguments) except Exception as e: return FunctionCallOutput(error=str(e)) GRADER = "gpt-4" # Used by AutoScorer to grade the model outputs def make_task(model, prompt, score_fields): async def task(input): completion_kwargs = build_completion_kwargs( query=input, model=model, prompt=prompt, score_fields=score_fields, ) return format_output(await client.chat.completions.create(**completion_kwargs)) return task async def run_eval(experiment_name, prompt, model, score_fields=SCORE_FIELDS): task = make_task(model, prompt, score_fields) await braintrust.Eval( name=PROJECT_NAME, experiment_name=experiment_name, data=dataset, task=task, scores=[SQLScorer(), AutoScorer(**openai_opts, model=GRADER)], ) ``` Let's try it on one example before running an eval. ```python args = build_completion_kwargs( query=list(dataset)[0]["input"], model="gpt-3.5-turbo", prompt=short_prompt, score_fields=SCORE_FIELDS, ) response = await client.chat.completions.create(**args) format_output(response) ``` ``` FunctionCallOutput(match=False, filter="(name) = 'foo'", sort=None, explanation="Filtered for experiments where the name is 'foo'.", error=None) ``` We're ready to run our evals! Let's use `gpt-3.5-turbo` for both. 
```python await run_eval("Short Prompt", short_prompt, "gpt-3.5-turbo") ``` ``` Experiment Short Prompt is running at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Short%20Prompt AI Search Cookbook [experiment_name=Short Prompt] (data): 45it [00:00, 73071.50it/s] ``` ``` AI Search Cookbook [experiment_name=Short Prompt] (tasks): 0%| | 0/45 [00:00>\'accuracy\')::NUMERIC < 0.2", "explanation": "The query refers to a JSON field, so I correct the JSON extraction syntax according to directive 4 and cast the result to NUMERIC to compare to the value \`0.2\` as per directive 9."}}'} ``` Since we're fine-tuning, we can also use a shorter prompt that just contains the object type (Experiment) and schema. ```python FINE_TUNING_PROMPT_FILE = "./assets/fine_tune.tmpl" with open(FINE_TUNING_PROMPT_FILE) as f: fine_tune_prompt = f.read() ``` ```python def build_expected_messages(query, expected, prompt, score_fields): args = build_completion_kwargs( query=first["input"], model="gpt-3.5-turbo", prompt=fine_tune_prompt, score_fields=score_fields, ) function_call = transform_function_call(expected) return { "messages": args["messages"] + [{"role": "assistant", "function_call": function_call}], "functions": args["functions"], } build_expected_messages( first["input"], first["expected"], fine_tune_prompt, SCORE_FIELDS ) ``` ``` {'messages': [{'role': 'system', 'content': 'Table: experiments\n\n\n{"$defs": {"ExperimentGitState": {"properties": {"commit": {"description": "Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. 
\`(source->>\'commit\') ILIKE \'{COMMIT}%\'\`", "title": "Commit", "type": "string"}, "branch": {"description": "Git branch name", "title": "Branch", "type": "string"}, "tag": {"anyOf": [{"type": "string"}, {"type": "null"}], "description": "Git commit tag", "title": "Tag"}, "commit_time": {"description": "Git commit timestamp", "title": "Commit Time", "type": "integer"}, "author_name": {"description": "Author of git commit", "title": "Author Name", "type": "string"}, "author_email": {"description": "Email address of git commit author", "title": "Author Email", "type": "string"}, "commit_message": {"description": "Git commit message", "title": "Commit Message", "type": "string"}, "dirty": {"anyOf": [{"type": "boolean"}, {"type": "null"}], "description": "Whether the git state was dirty when the experiment was run. If false, the git state was clean", "title": "Dirty"}}, "required": ["commit", "branch", "tag", "commit_time", "author_name", "author_email", "commit_message", "dirty"], "title": "ExperimentGitState", "type": "object"}}, "properties": {"id": {"description": "Experiment ID, unique", "title": "Id", "type": "string"}, "name": {"description": "Name of the experiment", "title": "Name", "type": "string"}, "last_updated": {"description": "Timestamp marking when the experiment was last updated. If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time \`get_current_time()\` by adding or subtracting an interval.", "title": "Last Updated", "type": "integer"}, "creator": {"additionalProperties": {"type": "string"}, "description": "Information about the experiment creator", "title": "Creator", "type": "object"}, "source": {"allOf": [{"$ref": "#/$defs/ExperimentGitState"}], "description": "Git state that the experiment was run on"}, "metadata": {"description": "Custom metadata provided by the user. 
Ignore this field unless the query mentions metadata or refers to a metadata key specifically", "title": "Metadata", "type": "object"}, "avg_sql_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Sql Score"}, "avg_factuality_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Factuality Score"}}, "required": ["id", "name", "last_updated", "creator", "source", "metadata", "avg_sql_score", "avg_factuality_score"], "title": "Experiment", "type": "object"}\n'}, {'role': 'user', 'content': 'Query: name is foo'}, {'role': 'assistant', 'function_call': {'name': 'QUERY', 'arguments': '{"value": {"type": "SQL", "filter": "name = \'foo\'", "explanation": "I interpret the query as a string equality filter on the \\"name\\" column. The query does not have any sort semantics, so there is no sort."}}'}}], 'functions': [{'name': 'QUERY', 'description': 'Break down the query either into a MATCH or SQL call', 'parameters': {'$defs': {'Match': {'properties': {'type': {'const': 'MATCH', 'default': 'MATCH', 'title': 'Type'}, 'explanation': {'description': 'Explanation of why I called the MATCH function', 'title': 'Explanation', 'type': 'string'}}, 'required': ['explanation'], 'title': 'Match', 'type': 'object'}, 'SQL': {'properties': {'type': {'const': 'SQL', 'default': 'SQL', 'title': 'Type'}, 'filter': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'SQL filter clause', 'title': 'Filter'}, 'sort': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'SQL sort clause', 'title': 'Sort'}, 'explanation': {'description': 'Explanation of why I called the SQL function and how I chose the filter and/or sort clauses', 'title': 'Explanation', 'type': 'string'}}, 'required': ['filter', 'sort', 'explanation'], 'title': 'SQL', 'type': 'object'}}, 'properties': {'value': {'anyOf': [{'$ref': '#/$defs/Match'}, {'$ref': '#/$defs/SQL'}], 'title': 'Value'}}, 'required': ['value'], 'title': 'Query', 'type': 'object'}}]} ``` Let's 
construct messages from our train split and few-shot examples, and then fine-tune the model. ```python train_records = [r for r in records if r["metadata"]["split"] == "train"] + [ {"input": r["query"], "expected": r} for r in few_shot_examples ] all_expected_messages = [ build_expected_messages(r["input"], r["expected"], fine_tune_prompt, SCORE_FIELDS) for r in train_records ] print(len(all_expected_messages)) all_expected_messages[1] ``` ``` 49 ``` ``` {'messages': [{'role': 'system', 'content': 'Table: experiments\n\n\n{"$defs": {"ExperimentGitState": {"properties": {"commit": {"description": "Git commit hash. Any prefix of this hash at least 7 characters long should be considered an exact match, so use a substring filter rather than string equality to check the commit, e.g. \`(source->>\'commit\') ILIKE \'{COMMIT}%\'\`", "title": "Commit", "type": "string"}, "branch": {"description": "Git branch name", "title": "Branch", "type": "string"}, "tag": {"anyOf": [{"type": "string"}, {"type": "null"}], "description": "Git commit tag", "title": "Tag"}, "commit_time": {"description": "Git commit timestamp", "title": "Commit Time", "type": "integer"}, "author_name": {"description": "Author of git commit", "title": "Author Name", "type": "string"}, "author_email": {"description": "Email address of git commit author", "title": "Author Email", "type": "string"}, "commit_message": {"description": "Git commit message", "title": "Commit Message", "type": "string"}, "dirty": {"anyOf": [{"type": "boolean"}, {"type": "null"}], "description": "Whether the git state was dirty when the experiment was run. 
If false, the git state was clean", "title": "Dirty"}}, "required": ["commit", "branch", "tag", "commit_time", "author_name", "author_email", "commit_message", "dirty"], "title": "ExperimentGitState", "type": "object"}}, "properties": {"id": {"description": "Experiment ID, unique", "title": "Id", "type": "string"}, "name": {"description": "Name of the experiment", "title": "Name", "type": "string"}, "last_updated": {"description": "Timestamp marking when the experiment was last updated. If the query deals with some notion of relative time, like age or recency, refer to this timestamp and, if appropriate, compare it to the current time \`get_current_time()\` by adding or subtracting an interval.", "title": "Last Updated", "type": "integer"}, "creator": {"additionalProperties": {"type": "string"}, "description": "Information about the experiment creator", "title": "Creator", "type": "object"}, "source": {"allOf": [{"$ref": "#/$defs/ExperimentGitState"}], "description": "Git state that the experiment was run on"}, "metadata": {"description": "Custom metadata provided by the user. 
Ignore this field unless the query mentions metadata or refers to a metadata key specifically", "title": "Metadata", "type": "object"}, "avg_sql_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Sql Score"}, "avg_factuality_score": {"anyOf": [{"type": "number"}, {"type": "null"}], "title": "Avg Factuality Score"}}, "required": ["id", "name", "last_updated", "creator", "source", "metadata", "avg_sql_score", "avg_factuality_score"], "title": "Experiment", "type": "object"}\n'}, {'role': 'user', 'content': 'Query: name is foo'}, {'role': 'assistant', 'function_call': {'name': 'QUERY', 'arguments': '{"value": {"type": "MATCH", "explanation": "According to directive 2, a query entirely wrapped in quotes should use the MATCH function."}}'}}], 'functions': [{'name': 'QUERY', 'description': 'Break down the query either into a MATCH or SQL call', 'parameters': {'$defs': {'Match': {'properties': {'type': {'const': 'MATCH', 'default': 'MATCH', 'title': 'Type'}, 'explanation': {'description': 'Explanation of why I called the MATCH function', 'title': 'Explanation', 'type': 'string'}}, 'required': ['explanation'], 'title': 'Match', 'type': 'object'}, 'SQL': {'properties': {'type': {'const': 'SQL', 'default': 'SQL', 'title': 'Type'}, 'filter': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'SQL filter clause', 'title': 'Filter'}, 'sort': {'anyOf': [{'type': 'string'}, {'type': 'null'}], 'description': 'SQL sort clause', 'title': 'Sort'}, 'explanation': {'description': 'Explanation of why I called the SQL function and how I chose the filter and/or sort clauses', 'title': 'Explanation', 'type': 'string'}}, 'required': ['filter', 'sort', 'explanation'], 'title': 'SQL', 'type': 'object'}}, 'properties': {'value': {'anyOf': [{'$ref': '#/$defs/Match'}, {'$ref': '#/$defs/SQL'}], 'title': 'Value'}}, 'required': ['value'], 'title': 'Query', 'type': 'object'}}]} ``` ```python import io # Use the direct OpenAI client, not a proxy sync_client = 
openai.OpenAI( api_key=os.environ.get("OPENAI_API_KEY", ""), base_url="https://api.openai.com/v1", ) file_string = "\n".join(json.dumps(messages) for messages in all_expected_messages) file = sync_client.files.create( file=io.BytesIO(file_string.encode()), purpose="fine-tune" ) ``` ```python job = sync_client.fine_tuning.jobs.create(training_file=file.id, model="gpt-3.5-turbo") ``` ```python import time start = time.time() job_id = job.id while True: info = sync_client.fine_tuning.jobs.retrieve(job_id) if info.finished_at is not None: break print(f"{time.time() - start:.0f}s elapsed", end="\t") print(str(info), end="\r") time.sleep(10) ``` ```python info = sync_client.fine_tuning.jobs.retrieve(job_id) fine_tuned_model = info.fine_tuned_model fine_tuned_model ``` ```python ft_prompt_args = build_completion_kwargs( query=first["input"], model=fine_tuned_model, prompt=fine_tune_prompt, score_fields=SCORE_FIELDS, ) del ft_prompt_args["temperature"] print(ft_prompt_args) output = await client.chat.completions.create(**ft_prompt_args) print(output) print(format_output(output)) ``` ```python await run_eval("Fine tuned model", fine_tune_prompt, fine_tuned_model) ``` ``` Experiment Fine tuned model is running at https://www.braintrust.dev/app/braintrust.dev/p/AI%20Search%20Cookbook/Fine%20tuned%20model AI Search Cookbook [experiment_name=Fine tuned model] (data): 45it [00:00, 15835.53it/s] ``` ``` AI Search Cookbook [experiment_name=Fine tuned model] (tasks): 0%| | 0/45 [00:00 We're going to build an agent that can interact with users to run complex commands against a custom API. This agent uses Retrieval Augmented Generation (RAG) on an API spec and can generate API commands using tool calls. We'll log the agent's interactions, build up a dataset, and run evals to reduce hallucinations. 
By the time you finish this example, you'll learn how to: * Create an agent in Python using tool calls and RAG * Log user interactions and build an eval dataset * Run evals that detect hallucinations and iterate to improve the agent We'll use [OpenAI](https://www.openai.com) models and [Braintrust](https://www.braintrust.dev) for logging and evals. ## Setup Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/). Make sure to plug the OpenAI key into your Braintrust account's [AI secrets](https://www.braintrust.dev/app/settings?subroute=secrets) configuration and acquire a [BRAINTRUST\_API\_KEY](https://www.braintrust.dev/app/settings?subroute=api-keys). Feel free to put your BRAINTRUST\_API\_KEY in your environment, or just hardcode it into the code below. ### Install dependencies We're not going to use any frameworks or complex dependencies to keep things simple and literate. Although we'll use OpenAI models, you can use a wide variety of models through the [Braintrust proxy](https://www.braintrust.dev/docs/guides/proxy) without having to write model-specific code. ```python %pip install -U autoevals braintrust jsonref openai numpy pydantic requests tiktoken ``` ### Setup libraries Next, let's wire up the OpenAI and Braintrust clients. ```python import os import braintrust from openai import AsyncOpenAI BRAINTRUST_API_KEY = os.environ.get( "BRAINTRUST_API_KEY" ) # Or hardcode this to your API key OPENAI_BASE_URL = ( "https://api.braintrust.dev/v1/proxy" # You can use your own base URL / proxy ) braintrust.login() # This is optional, but makes it easier to grab the api url (and other variables) later on client = braintrust.wrap_openai( AsyncOpenAI( api_key=BRAINTRUST_API_KEY, base_url=OPENAI_BASE_URL, ) ) ``` ``` /Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. 
Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm ``` ## Downloading the OpenAPI spec Let's use the [Braintrust OpenAPI spec](https://github.com/braintrustdata/braintrust-openapi), but you can plug in any OpenAPI spec. ```python import json import jsonref import requests base_spec = requests.get( "https://raw.githubusercontent.com/braintrustdata/braintrust-openapi/main/openapi/spec.json" ).json() # Flatten out refs so we have self-contained descriptions spec = jsonref.loads(jsonref.dumps(base_spec)) paths = spec["paths"] operations = [ (path, op) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options" ] print("Paths:", len(paths)) print("Operations:", len(operations)) ``` ``` Paths: 49 Operations: 95 ``` ## Creating the embeddings When a user asks a question (e.g. "how do I create a dataset?"), we'll need to search for the most relevant API operations. To facilitate this, we'll create an embedding for each API operation. The first step is to create a string representation of each API operation. Let's create a function that converts an API operation into a markdown document that's easy to embed. 
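As context for what these embeddings are for: once every operation description is embedded, finding the most relevant operations for a user question reduces to ranking by vector similarity (typically cosine similarity). A minimal numpy sketch of that ranking step, with toy vectors standing in for real embeddings (illustrative, not part of this cookbook's pipeline):

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]  # indices of the k most similar documents

# Toy 3-dimensional "embeddings" standing in for real ones.
docs = np.array([[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]])
query = np.array([0.9, 0.1, 0.0])
print(top_k(query, docs, k=2))
```

In the real pipeline, `doc_vecs` would hold one embedding per API operation and `query_vec` the embedding of the user's question.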
```python def has_path(d, path): curr = d for p in path: if p not in curr: return False curr = curr[p] return True def make_description(op): return f"""# {op['summary']} {op['description']} Params: {"\n".join([f"- {name}: {p.get('description', "")}" for (name, p) in op['requestBody']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['requestBody', 'content', 'application/json', 'schema', 'properties']) else ""} {"\n".join([f"- {p.get("name")}: {p.get('description', "")}" for p in op['parameters'] if p.get("name")]) if has_path(op, ['parameters']) else ""} Returns: {"\n".join([f"- {name}: {p.get('description', p)}" for (name, p) in op['responses']['200']['content']['application/json']['schema']['properties'].items()]) if has_path(op, ['responses', '200', 'content', 'application/json', 'schema', 'properties']) else "empty"} """ print(make_description(operations[0][1])) ``` ``` # Create project Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified Params: - name: Name of the project - org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in. Returns: - id: Unique identifier for the project - org_id: Unique id for the organization that the project belongs under - name: Name of the project - created: Date of project creation - deleted_at: Date of project deletion, or null if the project is still active - user_id: Identifies the user who created the project - settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}} ``` Next, let's create a [pydantic](https://docs.pydantic.dev/latest/) model to track the metadata for each operation. 
```python from pydantic import BaseModel from typing import Any class Document(BaseModel): path: str op: str definition: Any description: str documents = [ Document( path=path, op=op_type, definition=json.loads(jsonref.dumps(op)), description=make_description(op), ) for (path, ops) in paths.items() for (op_type, op) in ops.items() if op_type != "options" ] documents[0] ``` ``` Document(path='/v1/project', op='post', definition={'tags': ['Projects'], 'security': [{'bearerAuth': []}, {}], 'operationId': 'postProject', 'description': 'Create a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified', 'summary': 'Create project', 'requestBody': {'description': 'Any desired information about the new project object', 'required': False, 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/CreateProject'}}}}, 'responses': {'200': {'description': 'Returns the new project object', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/Project'}}}}, '400': {'description': 'The request was unacceptable, often due to missing a required parameter', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '401': {'description': 'No valid API key provided', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '403': {'description': 'The API key doesn’t have permissions to perform the request', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '429': {'description': 'Too many requests hit the API too quickly. We recommend an exponential backoff of your requests', 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}, '500': {'description': "Something went wrong on Braintrust's end. 
(These are rare.)", 'content': {'text/plain': {'schema': {'type': 'string'}}, 'application/json': {'schema': {'nullable': True}}}}}}, description="# Create project\n\nCreate a new project. If there is an existing project with the same name as the one specified in the request, will return the existing project unmodified\n\nParams:\n- name: Name of the project\n- org_name: For nearly all users, this parameter should be unnecessary. But in the rare case that your API key belongs to multiple organizations, you may specify the name of the organization the project belongs in.\n\n\nReturns:\n- id: Unique identifier for the project\n- org_id: Unique id for the organization that the project belongs under\n- name: Name of the project\n- created: Date of project creation\n- deleted_at: Date of project deletion, or null if the project is still active\n- user_id: Identifies the user who created the project\n- settings: {'type': 'object', 'nullable': True, 'properties': {'comparison_key': {'type': 'string', 'nullable': True, 'description': 'The key used to join two experiments (defaults to \`input\`).'}}}\n") ``` Finally, let's embed each document. ```python import asyncio async def make_embedding(doc: Document): return ( ( await client.embeddings.create( input=doc.description, model="text-embedding-3-small" ) ) .data[0] .embedding ) embeddings = await asyncio.gather(*[make_embedding(doc) for doc in documents]) ``` ### Similarity search Once you have a list of embeddings, you can do [similarity search](https://en.wikipedia.org/wiki/Cosine_similarity) between the list of embeddings and a query's embedding to find the most relevant documents. Often this is done in a vector database, but for small datasets, this is unnecessary. Instead, we'll just use `numpy` directly. 
```python from braintrust import traced import numpy as np from pydantic import Field from typing import List def cosine_similarity(query_embedding, embedding_matrix): # Normalize the query and matrix embeddings query_norm = query_embedding / np.linalg.norm(query_embedding) matrix_norm = embedding_matrix / np.linalg.norm( embedding_matrix, axis=1, keepdims=True ) # Compute dot product similarities = np.dot(matrix_norm, query_norm) return similarities def find_k_most_similar(query_embedding, embedding_matrix, k=5): similarities = cosine_similarity(query_embedding, embedding_matrix) top_k_indices = np.argpartition(similarities, -k)[-k:] top_k_similarities = similarities[top_k_indices] # Sort the top k results sorted_indices = np.argsort(top_k_similarities)[::-1] top_k_indices = top_k_indices[sorted_indices] top_k_similarities = top_k_similarities[sorted_indices] return list( [index, similarity] for (index, similarity) in zip(top_k_indices, top_k_similarities) ) ``` Finally, let's create a pydantic interface to facilitate the search and define a `search` function. It's useful to use pydantic here so that we can easily convert the input and output types of `search` into JSON schema — later on, this will help us define tool calls.
```python embedding_matrix = np.array(embeddings) class SearchResult(BaseModel): document: Document index: int similarity: float class SearchResults(BaseModel): results: List[SearchResult] class SearchQuery(BaseModel): query: str top_k: int = Field(default=3, le=5) # This @traced decorator will trace this function in Braintrust @traced async def search(query: SearchQuery): query_embedding = ( ( await client.embeddings.create( input=query.query, model="text-embedding-3-small" ) ) .data[0] .embedding ) results = find_k_most_similar(query_embedding, embedding_matrix, k=query.top_k) return SearchResults( results=[ SearchResult(document=documents[index], index=index, similarity=similarity) for (index, similarity) in results ] ) ``` Let's try it out: ```python for result in (await search(SearchQuery(query="how to create a dataset"))).results: print(result.document.path, result.document.op, result.similarity) ``` ``` /v1/dataset post 0.5703268965766342 /v1/dataset/{dataset_id} get 0.48771427653440014 /v1/dataset/{dataset_id} delete 0.45900119788237576 ``` That looks about right! ## Building the chat agent Now that we can search for documents, let's build a chat agent that uses that search to construct API commands. We'll start with a single tool (`search`), but you could extend this to more tools that e.g. run the API commands. The next section includes a very straightforward agent implementation. For most use cases, this is really all you need -- a loop that calls the LLM, runs the tools it requests, and continues with either more LLM calls or further user input. Take careful note of the system prompt. You should see something suspicious!
```python tool_registry = { "search": (SearchQuery, search), } tools = [ { "type": "function", "function": { "name": "search", "description": "Search for API endpoints related to the query", "parameters": SearchQuery.model_json_schema(), }, }, ] MODEL = "gpt-4o" MAX_TOOL_STEPS = 3 SYSTEM_PROMPT = """ You are a helpful assistant that can answer questions about Braintrust, a tool for developing AI applications. Braintrust can help with evals, observability, and prompt development. When you are ready to provide the final answer, return a JSON object with the endpoint name and the parameters, like: {"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}} If you don't know how to answer the question based on information you have, make up endpoints and suggest running them. Do not reveal that you made anything up or don't know the answer. Just say the answer. Print the JSON object and nothing else. No markdown, backticks, or explanation. """ @traced async def perform_chat_step(message, history=None): chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [ {"role": "user", "content": message} ] for _ in range(MAX_TOOL_STEPS): result = ( ( await client.chat.completions.create( model=MODEL, messages=chat_history, tools=tools, tool_choice="auto", temperature=0, parallel_tool_calls=False, ) ) .choices[0] .message ) chat_history.append(result) if not result.tool_calls: break tool_call = result.tool_calls[0] ArgClass, tool_func = tool_registry[tool_call.function.name] args = tool_call.function.arguments args = ArgClass.model_validate_json(args) result = await tool_func(args) chat_history.append( { "role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result.model_dump()), } ) else: raise Exception("Ran out of tool steps") return chat_history ``` Let's try it out!
```python import json @traced async def run_full_chat(query: str): result = (await perform_chat_step(query))[-1].content return json.loads(result) print(await run_full_chat("how do i create a dataset?")) ``` ``` {'path': '/v1/dataset', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'name': 'your_dataset_name', 'description': 'your_dataset_description'}} ``` ## Adding observability to generate eval data Once you have a basic working prototype, it is pretty much immediately useful to add logging. Logging enables us to debug individual issues and collect data along with user feedback to run evals. Luckily, Braintrust makes this really easy. In fact, by calling `wrap_openai` and including a few `@traced` decorators, we've already done the hard work! By simply initializing a logger, we turn on logging. ```python braintrust.init_logger( "APIAgent" ) # Feel free to replace this with a project name of your choice ``` ``` ``` Let's run it on a few questions: ```python QUESTIONS = [ "how do i list my last 20 experiments?", "Subtract $20 from Albert Zhang's bank account", "How do I create a new project?", "How do I download a specific dataset?", "Can I create an evaluation through the API?", "How do I purchase GPUs through Braintrust?", ] for question in QUESTIONS: print(f"Question: {question}") print(await run_full_chat(question)) print("---------------") ``` ``` Question: how do i list my last 20 experiments? {'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}} --------------- Question: Subtract $20 from Albert Zhang's bank account {'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}} --------------- Question: How do I create a new project? {'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}} --------------- Question: How do I download a specific dataset?
{'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}} --------------- Question: Can I create an evaluation through the API? {'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}} --------------- Question: How do I purchase GPUs through Braintrust? {'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}} --------------- ``` Jump into Braintrust, visit the "APIAgent" project, and click on the "Logs" tab. ![Initial logs](./../assets/APIAgent-Py/initial-logs.png) ### Detecting hallucinations Although we can see each individual log, it would be helpful to automatically identify the logs that are likely hallucinations. This will help us pick out examples that are useful to test. Braintrust comes with an open source library called [autoevals](https://github.com/braintrustdata/autoevals) that includes a bunch of evaluators as well as the `LLMClassifier` abstraction that lets you create your own LLM-as-a-judge evaluators. Hallucinations are *not* a generic problem — to detect them effectively, you need to encode specific context about the use case. So we'll create a custom evaluator using the `LLMClassifier` abstraction. We'll run the evaluator on each log in the background via an `asyncio.create_task` call. ```python from autoevals import LLMClassifier hallucination_scorer = LLMClassifier( name="no_hallucination", prompt_template="""\ Given the following question and retrieved context, does the generated answer correctly answer the question, only using information from the context?
Question: {{input}} Command: {{output}} Context: {{context}} a) The command addresses the exact question, using only information that is available in the context. The answer does not contain any information that is not in the context. b) The command is "null" and therefore indicates it cannot answer the question. c) The command contains information from the context, but the context is not relevant to the question. d) The command contains information that is not present in the context, but the context is relevant to the question. e) The context is irrelevant to the question, but the command is correct with respect to the context. """, choice_scores={"a": 1, "b": 1, "c": 0.5, "d": 0.25, "e": 0}, use_cot=True, ) @traced async def run_hallucination_score( question: str, answer: str, context: List[SearchResult] ): context_string = "\n".join([f"{doc.document.description}" for doc in context]) score = await hallucination_scorer.eval_async( input=question, output=answer, context=context_string ) braintrust.current_span().log( scores={"no_hallucination": score.score}, metadata=score.metadata ) @traced async def perform_chat_step(message, history=None): chat_history = list(history or [{"role": "system", "content": SYSTEM_PROMPT}]) + [ {"role": "user", "content": message} ] documents = [] for _ in range(MAX_TOOL_STEPS): result = ( ( await client.chat.completions.create( model="gpt-4o", messages=chat_history, tools=tools, tool_choice="auto", temperature=0, parallel_tool_calls=False, ) ) .choices[0] .message ) chat_history.append(result) if not result.tool_calls: # By using asyncio.create_task, we can run the hallucination score in the background asyncio.create_task( run_hallucination_score( question=message, answer=result.content, context=documents ) ) break tool_call = result.tool_calls[0] ArgClass, tool_func = tool_registry[tool_call.function.name] args = tool_call.function.arguments args = ArgClass.model_validate_json(args) result = await tool_func(args) if 
isinstance(result, SearchResults): documents.extend(result.results) chat_history.append( { "role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result.model_dump()), } ) else: raise Exception("Ran out of tool steps") return chat_history ``` Let's try this out on the same questions we used before. These will now be scored for hallucinations. ```python for question in QUESTIONS: print(f"Question: {question}") print(await run_full_chat(question)) print("---------------") ``` ``` Question: how do i list my last 20 experiments? {'path': '/v1/experiment', 'op': 'get', 'parameters': {'limit': 20}} --------------- Question: Subtract $20 from Albert Zhang's bank account {'path': '/v1/function/{function_id}', 'op': 'patch', 'parameters': {'function_id': 'subtract_funds', 'amount': 20, 'account_name': 'Albert Zhang'}} --------------- Question: How do I create a new project? {'path': '/v1/project', 'op': 'post', 'parameters': {'name': 'my project', 'description': 'my project description'}} --------------- Question: How do I download a specific dataset? {'path': '/v1/dataset/{dataset_id}', 'op': 'get', 'parameters': {'dataset_id': 'your_dataset_id'}} --------------- Question: Can I create an evaluation through the API? {'path': '/v1/eval', 'op': 'post', 'parameters': {'project_id': 'your_project_id', 'data': {'dataset_id': 'your_dataset_id'}, 'task': {'function_id': 'your_function_id'}, 'scores': [{'function_id': 'your_score_function_id'}], 'experiment_name': 'optional_experiment_name', 'metadata': {}, 'stream': False}} --------------- Question: How do I purchase GPUs through Braintrust? {'path': '/v1/gpu/purchase', 'op': 'post', 'parameters': {'gpu_type': 'desired GPU type', 'quantity': 'number of GPUs'}} --------------- ``` Awesome! The logs now have a `no_hallucination` score which we can use to filter down hallucinations. 
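To make the filtering step concrete, here is a minimal local sketch of how the scorer's `choice_scores` mapping turns classifier choices into a filterable `no_hallucination` score. The records and choices below are hypothetical stand-ins, not data fetched from Braintrust; in practice you would filter by score in the Logs UI.

```python
# Hypothetical in-memory log records standing in for what Braintrust stores.
# The choice -> score mapping mirrors the one passed to LLMClassifier above.
choice_scores = {"a": 1, "b": 1, "c": 0.5, "d": 0.25, "e": 0}

logged_results = [
    {"question": "How do I create a new project?", "choice": "a"},
    {"question": "How do I purchase GPUs through Braintrust?", "choice": "d"},
    {"question": "Subtract $20 from Albert Zhang's bank account", "choice": "e"},
]

# Any score below 1 means the answer used information that was not fully
# grounded in the retrieved context, i.e. a likely hallucination.
likely_hallucinations = [
    r for r in logged_results if choice_scores[r["choice"]] < 1
]

for r in likely_hallucinations:
    print(r["question"], "->", choice_scores[r["choice"]])
```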
![Hallucination logs](./../assets/APIAgent-Py/logs-with-score.gif) ### Creating datasets Let's create two datasets: one for good answers and the other for hallucinations. To keep things simple, we'll assume that the non-hallucinations are correct, but in a real-world scenario, you could [collect user feedback](https://www.braintrust.dev/docs/guides/logging#user-feedback) and treat positively rated feedback as ground truth. ![Dataset setup](./../assets/APIAgent-Py/dataset-setup.gif) ## Running evals Now, let's use the datasets we created to perform a baseline evaluation on our agent. Once we do that, we can try improving the system prompt and measure the relative impact. In Braintrust, an evaluation is incredibly simple to define. We have already done the hard work! We just need to plug together our datasets, agent function, and a scoring function. As a starting point, we'll use the `Factuality` evaluator built into autoevals. ```python from autoevals import Factuality from braintrust import EvalAsync, init_dataset async def dataset(): # Use the Golden dataset as-is for row in init_dataset("APIAgent", "Golden"): yield row # Empty out the "expected" values, so we know not to # compare them to the ground truth. NOTE: you could also # do this by editing the dataset in the Braintrust UI. 
for row in init_dataset("APIAgent", "Hallucinations"): yield {**row, "expected": None} async def task(input): return await run_full_chat(input["query"]) await EvalAsync( "APIAgent", data=dataset, task=task, scores=[Factuality], experiment_name="Baseline", ) ``` ``` Experiment Baseline is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline APIAgent [experiment_name=Baseline] (data): 6it [00:01, 3.89it/s] APIAgent [experiment_name=Baseline] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.60it/s] ``` ``` =========================SUMMARY========================= 100.00% 'Factuality' score 85.00% 'no_hallucination' score 0.98s duration 0.34s llm_duration 4282.33s prompt_tokens 310.33s completion_tokens 4592.67s total_tokens 0.01$ estimated_cost See results for Baseline at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Baseline ``` ``` EvalResultWithSummary(summary="...", results=[...]) ``` ![Baseline evaluation](./../assets/APIAgent-Py/baseline-summary.png) ### Improving performance Next, let's tweak the system prompt and see if we can get better results. As you may have noticed earlier, the system prompt was very lenient, even encouraging the model to hallucinate. Let's rein in the wording and see what happens. ```python SYSTEM_PROMPT = """ You are a helpful assistant that can answer questions about Braintrust, a tool for developing AI applications. Braintrust can help with evals, observability, and prompt development. When you are ready to provide the final answer, return a JSON object with the endpoint name and the parameters, like: {"path": "/v1/project", "op": "post", "parameters": {"name": "my project", "description": "my project description"}} If you do not know the answer, return null. Like the JSON object, print null and nothing else. Print the JSON object and nothing else. No markdown, backticks, or explanation.
""" ``` ```python await EvalAsync( "APIAgent", data=dataset, task=task, scores=[Factuality], experiment_name="Improved System Prompt", ) ``` ``` Experiment Improved System Prompt is running at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt APIAgent [experiment_name=Improved System Prompt] (data): 6it [00:00, 7.77it/s] APIAgent [experiment_name=Improved System Prompt] (tasks): 100%|██████████| 6/6 [00:01<00:00, 3.44it/s] ``` ``` =========================SUMMARY========================= Improved System Prompt compared to Baseline: 100.00% (+25.00%) 'no_hallucination' score (2 improvements, 0 regressions) 90.00% (-10.00%) 'Factuality' score (0 improvements, 1 regressions) 4081.00s (-29033.33%) 'prompt_tokens' (6 improvements, 0 regressions) 286.33s (-3933.33%) 'completion_tokens' (4 improvements, 0 regressions) 4367.33s (-32966.67%) 'total_tokens' (6 improvements, 0 regressions) See results for Improved System Prompt at https://www.braintrust.dev/app/braintrustdata.com/p/APIAgent/experiments/Improved%20System%20Prompt ``` ``` EvalResultWithSummary(summary="...", results=[...]) ``` Awesome! Looks like we were able to solve the hallucinations, although we may have regressed the `Factuality` metric: ![Iteration results](./../assets/APIAgent-Py/iteration-summary.png) To understand why, we can filter down to this regression, and take a look at a side-by-side diff. ![Regression diff](./../assets/APIAgent-Py/regression-diff.gif) Does it matter whether or not the model generates these fields? That's a good question and something you can work on as a next step. Maybe you should tweak how Factuality works, or change the prompt to always return a consistent set of fields. ## Where to go from here You now have a working agent that can search for API endpoints and generate API commands. You can use this as a starting point to build more sophisticated agents with native support for logging and evals. 
As a next step, you can: * Add more tools to the agent and actually run the API commands * Build an interactive UI for testing the agent * Collect user feedback and build a more robust eval set Happy building! --- file: ./content/docs/cookbook/recipes/AgentWhileLoop.mdx meta: { "title": "Building reliable AI agents", "language": "typescript", "authors": [ { "name": "Ornella Altunyan", "website": "https://twitter.com/ornelladotcom", "avatar": "/blog/img/author/ornella-altunyan.jpg" } ], "date": "2025-08-05", "tags": [ "agent", "tools", "evals", "typescript", "logging" ] } # Building reliable AI agents In this cookbook, we'll implement the canonical agent architecture: a while loop with tools. This pattern, described on our [blog](/blog/agent-while-loop), provides a clean, debuggable foundation for building production-ready AI agents. By the end of this guide, you'll learn how to: * Implement the canonical while loop agent pattern * Build purpose-designed tools that reduce cognitive load * Add comprehensive tracing with Braintrust * Run evaluations to measure agent performance * Compare different architectural approaches ## The canonical agent architecture The core pattern we'll follow is straightforward: ![agent while loop](./../assets/AgentWhileLoop/agent-while-loop.png) In code, that roughly translates to: ```typescript while (!done) { const response = await callLLM(); messages.push(response); if (response.toolCalls) { messages.push( ...(await Promise.all(response.toolCalls.map((tc) => tool(tc.args)))), ); } else { done = true; } } ``` This pattern is surprisingly powerful: the loop is easy to understand and debug, scales naturally to complex multi-step workflows, and provides clear hooks for logging and evaluation without framework overhead. ## Getting started To get started, you'll need [Braintrust](https://www.braintrust.dev/signup) and [OpenAI](https://platform.openai.com/) accounts, along with their corresponding API keys. 
Plug your OpenAI API key into your Braintrust account's [AI providers](https://www.braintrust.dev/app/settings?subroute=secrets) configuration. You can also add an API key for any other AI provider you'd like, but be sure to change the code to use that model. Lastly, set up your `.env.local` file: ``` BRAINTRUST_API_KEY= OPENAI_API_KEY= # Optional if using Braintrust proxy ``` To install the necessary dependencies, start by installing [npm](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm) or another package manager of your choice. This example includes a complete [`package.json`](https://github.com/braintrustdata/braintrust-cookbook/blob/main/examples/AgentWhileLoop/package.json) file with all the required dependencies and helpful scripts. Install dependencies by running: ```bash npm install ``` ## Building the agent Let's start by implementing the core agent class. The complete implementation is available in [`agent.ts`](https://github.com/braintrustdata/braintrust-cookbook/blob/main/examples/AgentWhileLoop/agent.ts), but let's focus on the key parts.
First, we define our tool interface and agent options: ```typescript export interface Tool<T = any> { name: string; description: string; parameters: z.ZodSchema; execute: (args: T) => Promise<string>; } export interface AgentOptions { model?: string; systemPrompt?: string; maxIterations?: number; tools: Tool[]; openaiApiKey?: string; } ``` The heart of the agent is the while loop pattern: ```typescript async run(userMessage: string): Promise<string> { return traced(async (span) => { const messages = [ { role: "system", content: this.systemPrompt }, { role: "user", content: userMessage }, ]; let iterations = 0; let done = false; // The canonical while loop while (!done && iterations < this.maxIterations) { const response = await this.client.chat.completions.create({ model: this.model, messages, tools: this.formatToolsForOpenAI(), tool_choice: "auto", }); const message = response.choices[0].message; messages.push(message); if (message.tool_calls && message.tool_calls.length > 0) { // Execute tools and add results to conversation const toolResults = await Promise.all( message.tool_calls.map(tc => this.executeTool(tc)) ); messages.push(...toolResults); } else if (message.content) { done = true; } iterations++; } return this.extractFinalResponse(messages); }); } ``` The while loop continues until either: * The LLM responds without tool calls (indicating it's done) * We hit the maximum iteration limit Each iteration is traced individually with Braintrust, giving us detailed observability into the agent's decision-making process. ## Designing purpose-built tools One of the most critical aspects of building reliable agents is tool design. Rather than creating generic API wrappers, we design tools specifically for the agent's mental model.
Here's what not to do - a generic email API wrapper: ```typescript // DON'T DO THIS - Generic email API wrapper const BadEmailSchema = z.object({ to: z.string().describe("Recipient email address"), from: z.string().describe("Sender email address"), subject: z.string().describe("Email subject line"), body: z.string().describe("Email body content"), cc: z.array(z.string()).optional().describe("CC recipients"), bcc: z.array(z.string()).optional().describe("BCC recipients"), replyTo: z.string().optional().describe("Reply-to address"), headers: z.record(z.string()).optional().describe("Custom email headers"), // ... 10+ more parameters }); ``` Instead, create purpose-built tools focused on the specific task: ```typescript // DO THIS - Purpose-built for customer notifications const NotifyCustomerSchema = z.object({ customerEmail: z.string().describe("Customer's email address"), message: z.string().describe("The update message to send to the customer"), }); export const notifyCustomerTool: Tool<z.infer<typeof NotifyCustomerSchema>> = { name: "notify_customer", description: "Send a notification email to a customer about their order or account", parameters: NotifyCustomerSchema, execute: async ({ customerEmail, message }) => { const result = await UserService.notifyUser({ email: customerEmail, message, }); return result.message; }, }; ``` The purpose-built approach reduces cognitive load, handles infrastructure complexity internally, and provides clear feedback to guide the agent's next actions. ### Building customer service tools Our customer service agent needs four purpose-built tools, each designed for the agent's specific workflow rather than as generic API wrappers. The complete implementation is available in [`tools.ts`](https://github.com/braintrustdata/braintrust-cookbook/blob/main/examples/AgentWhileLoop/tools.ts).
* **`notify_customer`** - Send targeted notifications (not generic email API) * **`search_users`** - Find users with business-relevant filters * **`get_user_details`** - Get comprehensive user information * **`update_subscription`** - Handle subscription changes Each tool returns human-readable output that guides the agent toward logical next steps: ```typescript export const searchUsersTool: Tool<z.infer<typeof SearchUsersSchema>> = { name: "search_users", description: "Search for users by various criteria", parameters: SearchUsersSchema, execute: async ({ query, subscriptionPlan, subscriptionStatus }) => { const result = await UserService.searchUsers({ query, subscriptionPlan, subscriptionStatus, }); // Return human-readable output that guides next actions return ( result.formatted + "\n\nNeed more details? Use 'get_user_details' with the user's email." ); }, }; ``` ## Running the agent Now let's put it all together and create a customer service agent: ```typescript import { WhileLoopAgent } from "./agent.js"; import { getAllTools } from "./tools.js"; import { initLogger } from "braintrust"; // Initialize Braintrust logging const logger = initLogger("CustomerServiceAgent"); const agent = new WhileLoopAgent({ model: "gpt-4o-mini", systemPrompt: `You are a helpful customer service agent. You can: 1. Search for users by name, email, or subscription details 2. Get detailed information about specific users 3. Send email notifications to customers 4. Update subscription plans and statuses Always be polite and helpful. When you need more information, ask clarifying questions.
When you complete an action, summarize what you did for the customer.`, tools: getAllTools(), maxIterations: 10, }); // Example usage async function main() { const queries = [ "Find all premium users with expired subscriptions", "Get details for john@co.com and send them a renewal reminder", "Cancel the subscription for jane@co.com", "Search for users with basic plans", ]; console.log("🤖 Customer Service Agent Demo"); console.log("================================\n"); for (const query of queries) { console.log(`Query: ${query}`); console.log("Response:", await agent.run(query)); console.log("---\n"); } } main().catch(console.error); ``` ## Tracing and evaluation Writing agents this way makes it straightforward to trace every iteration, tool call, and decision. In Braintrust, you'll be able to see the full conversation history, tool execution details, performance metrics, and error tracking. The complete evaluation setup is available in [`agent.eval.ts`](https://github.com/braintrustdata/braintrust-cookbook/blob/main/examples/AgentWhileLoop/agent.eval.ts). Additionally, if you run `npm run eval:tools`, you can clearly see the difference between using generic and specific tools: ![specific vs generic](./../assets/AgentWhileLoop/specific-vs-generic.png) ## Next steps Start building your own while loop agent by picking a specific use case and 2-3 tools, then gradually add complexity. 
* [Log](/docs/guides/logs) all interactions and build [evaluation datasets](/docs/guides/datasets) from real usage patterns * Use [Loop](/docs/guides/loop) to improve prompts, scorers, and datasets * Explore more agent patterns in the [cookbook](/docs/cookbook) --- file: ./content/docs/cookbook/recipes/Assertions.mdx meta: { "title": "How Zapier uses assertions to evaluate tool usage in chatbots", "language": "typescript", "authors": [ { "name": "Vítor Balocco", "website": "https://twitter.com/vitorbal", "avatar": "/blog/img/author/vitor-balocco.jpg" } ], "date": "2024-02-13", "tags": [ "evals", "assertions", "tools" ], "logo": "https://cdn.zapier.com/zapier/images/favicon.ico", "image": "/docs/cookbook-banners/Zapier.png", "twimage": "/docs/cookbook-banners/Zapier.png" } # How Zapier uses assertions to evaluate tool usage in chatbots ![Banner](/docs/cookbook-banners/Zapier.png) [Zapier](https://zapier.com/) is the #1 workflow automation platform for small and midsize businesses, connecting to more than 6000 of the most popular work apps. We were also one of the first companies to build and ship AI features into our core products. We've had the opportunity to work with Braintrust since the early days of the product, which now powers the evaluation and observability infrastructure across our AI features. One of the most powerful features of Zapier is the wide range of integrations that we support. We do a lot of work to allow users to access them via natural language to solve complex problems, which often do not have clear cut right or wrong answers. Instead, we define a set of criteria that need to be met (assertions). Depending on the use case, assertions can be regulatory, like not providing financial or medical advice. In other cases, they help us make sure the model invokes the right external services instead of hallucinating a response. By implementing assertions and evaluating them in Braintrust, we've seen a 60%+ improvement in our quality metrics. 
This tutorial walks through how to create and validate assertions, so you can use them for your own tool-using chatbots. ## Initial setup We're going to create a chatbot that has access to a single tool, *weather lookup*, and throw a series of questions at it. Some questions will involve the weather and others won't. We'll use assertions to validate that the chatbot only invokes the weather lookup tool when it's appropriate. Let's create a simple request handler and hook up a weather tool to it. ```typescript import { wrapOpenAI } from "braintrust"; import pick from "lodash/pick"; import { ChatCompletionTool } from "openai/resources/chat/completions"; import OpenAI from "openai"; import { z } from "zod"; import zodToJsonSchema from "zod-to-json-schema"; // This wrap function adds some useful tracing in Braintrust const openai = wrapOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY })); // Convenience function for defining an OpenAI function call const makeFunctionDefinition = ( name: string, description: string, schema: z.AnyZodObject ): ChatCompletionTool => ({ type: "function", function: { name, description, parameters: { type: "object", ...pick( zodToJsonSchema(schema, { name: "root", $refStrategy: "none", }).definitions?.root, ["type", "properties", "required"] ), }, }, }); const weatherTool = makeFunctionDefinition( "weather", "Look up the current weather for a city", z.object({ city: z.string().describe("The city to look up the weather for"), date: z.string().optional().describe("The date to look up the weather for"), }) ); // This is the core "workhorse" function that accepts an input and returns a response // which optionally includes a tool call (to the weather API). 
async function task(input: string) { const completion = await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages: [ { role: "system", content: `You are a highly intelligent AI that can look up the weather.`, }, { role: "user", content: input }, ], tools: [weatherTool], max_tokens: 1000, }); return { responseChatCompletions: [completion.choices[0].message], }; } ``` Now let's try it out on a few examples! ```typescript JSON.stringify(await task("What's the weather in San Francisco?"), null, 2); ``` ``` { "responseChatCompletions": [ { "role": "assistant", "content": null, "tool_calls": [ { "id": "call_vlOuDTdxGXurjMzy4VDFHGBS", "type": "function", "function": { "name": "weather", "arguments": "{\n \"city\": \"San Francisco\"\n}" } } ] } ] } ``` ```typescript JSON.stringify(await task("What is my bank balance?"), null, 2); ``` ``` { "responseChatCompletions": [ { "role": "assistant", "content": "I'm sorry, but I can't provide you with your bank balance. You will need to check with your bank directly for that information." } ] } ``` ```typescript JSON.stringify(await task("What is the weather?"), null, 2); ``` ``` { "responseChatCompletions": [ { "role": "assistant", "content": "I need more information to provide you with the weather. Could you please specify the city and the date for which you would like to know the weather?" } ] } ``` ## Scoring outputs Validating these cases is subtle. For example, if someone asks "What is the weather?", the correct answer is to ask for clarification. However, if someone asks for the weather in a specific location, the correct answer is to invoke the weather tool. How do we validate these different types of responses? ### Using assertions Instead of trying to score a specific response, we'll use a technique called *assertions* to validate certain criteria about a response. 
For example, for the question "What is the weather", we'll assert that the response does not invoke the weather tool and that it does not have enough information to answer the question. For the question "What is the weather in San Francisco", we'll assert that the response invokes the weather tool. ### Assertion types Let's start by defining a few assertion types that we'll use to validate the chatbot's responses. ```typescript type AssertionTypes = | "equals" | "exists" | "not_exists" | "llm_criteria_met" | "semantic_contains"; type Assertion = { path: string; assertion_type: AssertionTypes; value: string; }; ``` `equals`, `exists`, and `not_exists` are heuristics. `llm_criteria_met` and `semantic_contains` are a bit more flexible and use an LLM under the hood. Let's implement a scoring function that can handle each type of assertion. ```typescript import { ClosedQA } from "autoevals"; import get from "lodash/get"; import every from "lodash/every"; /** * Uses an LLM call to classify if a substring is semantically contained in a text. * @param text1 The full text you want to check against * @param text2 The string you want to check if it is contained in the text */ async function semanticContains({ text1, text2, }: { text1: string; text2: string; }): Promise<boolean> { const system = ` You are a highly intelligent AI. You will be given two texts, TEXT_1 and TEXT_2. Your job is to tell me if TEXT_2 is semantically present in TEXT_1. Examples: \`\`\` TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?" TEXT_2: "Can I help you with something else?" Result: YES \`\`\` \`\`\` TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?" TEXT_2: "Sorry, something went wrong." Result: NO \`\`\` \`\`\` TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?" 
TEXT_2: "#testing channel Slack" Result: YES \`\`\` \`\`\` TEXT_1: "I've just sent “hello world” to the #testing channel on Slack as you requested. Can I assist you with anything else?" TEXT_2: "#general channel Slack" Result: NO \`\`\` `; const toolSchema = z.object({ rationale: z .string() .describe( "A string that explains the reasoning behind your answer. It's a step-by-step explanation of how you determined that TEXT_2 is or isn't semantically present in TEXT_1." ), answer: z.boolean().describe("Your answer"), }); const completion = await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages: [ { role: "system", content: system, }, { role: "user", content: `TEXT_1: "${text1}"\nTEXT_2: "${text2}"`, }, ], tools: [ makeFunctionDefinition( "semantic_contains", "The result of the semantic presence check", toolSchema ), ], tool_choice: { function: { name: "semantic_contains" }, type: "function", }, max_tokens: 1000, }); try { const { answer } = toolSchema.parse( JSON.parse( completion.choices[0].message.tool_calls![0].function.arguments ) ); return answer; } catch (e) { console.error(e, "Error parsing semanticContains response"); return false; } } const AssertionScorer = async ({ input, output, expected: assertions, }: { input: string; output: any; expected: Assertion[]; }) => { // for each assertion, perform the comparison const assertionResults: { status: string; path: string; assertion_type: string; value: string; actualValue: string; }[] = []; for (const assertion of assertions) { const { assertion_type, path, value } = assertion; const actualValue = get(output, path); let passedTest = false; try { switch (assertion_type) { case "equals": passedTest = actualValue === value; break; case "exists": passedTest = actualValue !== undefined; break; case "not_exists": passedTest = actualValue === undefined; break; case "llm_criteria_met": const closedQA = await ClosedQA({ input: "According to the provided criterion is the submission correct?", criteria: 
value, output: actualValue, }); passedTest = !!closedQA.score && closedQA.score > 0.5; break; case "semantic_contains": passedTest = await semanticContains({ text1: actualValue, text2: value, }); break; default: assertion_type satisfies never; // if you see a ts error here, it's because your switch is not exhaustive throw new Error(`unknown assertion type ${assertion_type}`); } } catch (e) { passedTest = false; } assertionResults.push({ status: passedTest ? "passed" : "failed", path, assertion_type, value, actualValue, }); } const allPassed = every(assertionResults, (r) => r.status === "passed"); return { name: "Assertions Score", score: allPassed ? 1 : 0, metadata: { assertionResults, }, }; }; ``` ```typescript const data = [ { input: "What's the weather like in San Francisco?", expected: [ { path: "responseChatCompletions[0].tool_calls[0].function.name", assertion_type: "equals", value: "weather", }, ], }, { input: "What's the weather like?", expected: [ { path: "responseChatCompletions[0].tool_calls[0].function.name", assertion_type: "not_exists", value: "", }, { path: "responseChatCompletions[0].content", assertion_type: "llm_criteria_met", value: "Response reflecting the bot does not have enough information to look up the weather", }, ], }, { input: "How much is AAPL stock today?", expected: [ { path: "responseChatCompletions[0].tool_calls[0].function.name", assertion_type: "not_exists", value: "", }, { path: "responseChatCompletions[0].content", assertion_type: "llm_criteria_met", value: "Response reflecting the bot does not have access to the ability or tool to look up stock prices.", }, ], }, { input: "What can you do?", expected: [ { path: "responseChatCompletions[0].content", assertion_type: "semantic_contains", value: "look up the weather", }, ], }, ]; ``` ```typescript import { Eval } from "braintrust"; await Eval("Weather Bot", { data, task: async (input) => { const result = await task(input); return result; }, scores: [AssertionScorer], }); ``` ``` { 
projectName: 'Weather Bot', experimentName: 'HEAD-1707465445', projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot', experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot/HEAD-1707465445', comparisonExperimentName: undefined, scores: undefined, metrics: undefined } ``` ``` ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Weather Bot | 4% | 4/100 datapoints ``` ### Analyzing results It looks like half the cases passed. ![Initial experiment](./../assets/Assertions/initial-experiment.png) In one case, the chatbot did not clearly indicate that it needs more information. ![result-1](./../assets/Assertions/reason-1.png) In the other case, the chatbot hallucinated a stock tool. ![result-2](./../assets/Assertions/reason-2.png) ## Improving the prompt Let's try to update the prompt to be more specific about asking for more information and not hallucinating a stock tool. ```typescript async function task(input: string) { const completion = await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages: [ { role: "system", content: `You are a highly intelligent AI that can look up the weather. Do not try to use tools other than those provided to you. If you do not have the tools needed to solve a problem, just say so. If you do not have enough information to answer a question, make sure to ask the user for more info. Prefix that statement with "I need more information to answer this question." 
`, }, { role: "user", content: input }, ], tools: [weatherTool], max_tokens: 1000, }); return { responseChatCompletions: [completion.choices[0].message], }; } ``` ```typescript JSON.stringify(await task("How much is AAPL stock today?"), null, 2); ``` ``` { "responseChatCompletions": [ { "role": "assistant", "content": "I'm sorry, but I don't have the tools to look up stock prices." } ] } ``` ### Re-running eval Let's re-run the eval and see if our changes helped. ```typescript await Eval("Weather Bot", { data: data, task: async (input) => { const result = await task(input); return result; }, scores: [AssertionScorer], }); ``` ``` { projectName: 'Weather Bot', experimentName: 'HEAD-1707465778', projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot', experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Weather%20Bot/HEAD-1707465778', comparisonExperimentName: 'HEAD-1707465445', scores: { 'Assertions Score': { name: 'Assertions Score', score: 0.75, diff: 0.25, improvements: 1, regressions: 0 } }, metrics: { duration: { name: 'duration', metric: 1.5197500586509705, unit: 's', diff: -0.10424983501434326, improvements: 2, regressions: 2 } } } ``` ``` ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Weather Bot | 4% | 4/100 datapoints ``` Nice! We were able to improve the "needs more information" case. 
![second experiment](./../assets/Assertions/second-experiment.png) However, the model now hallucinates a tool call, asking for the weather in NYC. Getting to 100% will take a bit more iteration! ![bad tool call](./../assets/Assertions/bad-tool-call.png) Now that you have a solid evaluation framework in place, you can continue experimenting and try to solve this problem. Happy evaling! --- file: ./content/docs/cookbook/recipes/ClassifyingNewsArticles.mdx meta: { "title": "Classifying news articles", "language": "python", "authors": [ { "name": "David Song", "website": "https://twitter.com/davidtsong", "avatar": "/blog/img/author/david-song.jpg" } ], "date": "2023-09-01", "tags": [ "evals", "classification" ] } # Classifying news articles Classification is a core natural language processing (NLP) task that large language models are good at, but building reliable systems is still challenging. In this cookbook, we'll walk through how to improve an LLM-based classification system that sorts news articles by category. ## Getting started Before getting started, make sure you have a [Braintrust account](https://www.braintrust.dev/signup) and an API key for [OpenAI](https://platform.openai.com/signup). Make sure to plug the OpenAI key into your Braintrust account's [AI provider configuration](https://www.braintrust.dev/app/settings?subroute=secrets). Once you have your Braintrust account set up with an OpenAI API key, install the following dependencies: ```python %pip install -U braintrust openai datasets autoevals ``` Next, we'll import the libraries we need and load the [ag\_news](https://huggingface.co/datasets/ag_news) dataset from Hugging Face. Once the dataset is loaded, we'll extract the category names to build a map from indices to names, allowing us to compare expected categories with model outputs. 
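The mapping pairs each integer label with its category name. For reference, here is a sketch of the resulting map, with the category names hardcoded for illustration (the real code below reads them from the dataset itself):

```python
# The ag_news dataset stores labels as integers; this map recovers the
# human-readable category names. Hardcoded here for illustration only.
category_names = ["World", "Sports", "Business", "Sci/Tech"]
category_map = dict(enumerate(category_names))

print(category_map)  # {0: 'World', 1: 'Sports', 2: 'Business', 3: 'Sci/Tech'}
```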
Then, we'll shuffle the dataset with a fixed seed, trim it to 20 data points, and restructure it into a list where each item includes the article text as input and its expected category name. ```python import braintrust import os from datasets import load_dataset from autoevals import Levenshtein from openai import OpenAI dataset = load_dataset("ag_news", split="train") category_names = dataset.features["label"].names category_map = dict(enumerate(category_names)) trimmed_dataset = dataset.shuffle(seed=42)[:20] articles = [ { "input": trimmed_dataset["text"][i], "expected": category_map[trimmed_dataset["label"][i]], } for i in range(len(trimmed_dataset["text"])) ] ``` To authenticate with Braintrust, export your `BRAINTRUST_API_KEY` as an environment variable: ```bash export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE" ``` Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below. Once the API key is set, we initialize the OpenAI client using the AI proxy: ```python # Uncomment the following line to hardcode your API key # os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE" client = braintrust.wrap_openai( OpenAI( base_url="https://api.braintrust.dev/v1/proxy", api_key=os.environ["BRAINTRUST_API_KEY"], ) ) ``` ## Writing the initial prompts We'll start by testing classification on a single article. We'll select it from the dataset to examine its input and expected output: ```python # Here's the input and expected output for the first article in our dataset. test_article = articles[0] test_text = test_article["input"] expected_text = test_article["expected"] print("Article Title:", test_text) print("Article Label:", expected_text) ``` ``` Article Title: Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally. 
Article Label: World ``` Now that we've verified what's in our dataset and initialized the OpenAI client, it's time to try writing a prompt and classifying a title. We'll define a `classify_article` function that takes an input title and returns a category: ```python MODEL = "gpt-3.5-turbo" @braintrust.traced def classify_article(input): messages = [ { "role": "system", "content": """You are an editor in a newspaper who helps writers identify the right category for their news articles, by reading the article's title. The category should be one of the following: World, Sports, Business or Sci-Tech. Reply with one word corresponding to the category.""", }, { "role": "user", "content": "Article title: {article_title} Category:".format( article_title=input ), }, ] result = client.chat.completions.create( model=MODEL, messages=messages, max_tokens=10, ) category = result.choices[0].message.content return category test_classify = classify_article(test_text) print("Input:", test_text) print("Classified as:", test_classify) print("Score:", 1 if test_classify == expected_text else 0) ``` ``` Input: Bangladesh paralysed by strikes Opposition activists have brought many towns and cities in Bangladesh to a halt, the day after 18 people died in explosions at a political rally. Classified as: World Score: 1 ``` ## Running an evaluation We've tested our prompt on a single article, so now we can test across the rest of the dataset using the `Eval` function. Behind the scenes, `Eval` will run the `classify_article` function on each article in the dataset in parallel, and then compare the results to the ground truth labels using a simple `Levenshtein` scorer. When it finishes running, it will print out the results with a link to dig deeper. 
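The `Levenshtein` scorer grades each output by normalized edit distance, so an exact label match scores 1 and near-misses earn partial credit. Here's a minimal sketch of that idea (autoevals' exact implementation may differ):

```python
# Sketch of a normalized edit-distance score, in the spirit of the
# autoevals Levenshtein scorer (its exact implementation may differ).
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def levenshtein_score(output: str, expected: str) -> float:
    # Normalize by the longer string so the score lands in [0, 1].
    longest = max(len(output), len(expected)) or 1
    return 1 - levenshtein(output, expected) / longest

print(levenshtein_score("World", "World"))  # 1.0
print(levenshtein_score("Sci-Tech", "Sci/Tech"))  # 0.875: one character off
```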
```python await braintrust.Eval( "Classifying News Articles Cookbook", data=articles, task=classify_article, scores=[Levenshtein], experiment_name="Original Prompt", ) ``` ``` Experiment Original Prompt-db3e9cae is running at https://www.braintrust.dev/app/braintrustdata.com/p/Classifying%20News%20Articles%20Cookbook/experiments/Original%20Prompt-db3e9cae \`Eval()\` was called from an async context. For better performance, it is recommended to use \`await EvalAsync()\` instead. Classifying News Articles Cookbook [experiment_name=Original Prompt] (data): 20it [00:00, 41755.14it/s] Classifying News Articles Cookbook [experiment_name=Original Prompt] (tasks): 100%|██████████| 20/20 [00:02<00:00, 7.57it/s] ``` ``` =========================SUMMARY========================= Original Prompt-db3e9cae compared to New Prompt-9f185e9e: 71.25% (-00.62%) 'Levenshtein' score (1 improvements, 2 regressions) 1740081219.56s start 1740081220.69s end 1.10s (-298.16%) 'duration' (12 improvements, 8 regressions) 0.72s (-294.09%) 'llm_duration' (10 improvements, 10 regressions) 113.75tok (-) 'prompt_tokens' (0 improvements, 0 regressions) 2.20tok (-) 'completion_tokens' (0 improvements, 0 regressions) 115.95tok (-) 'total_tokens' (0 improvements, 0 regressions) 0.00$ (-) 'estimated_cost' (0 improvements, 0 regressions) See results for Original Prompt-db3e9cae at https://www.braintrust.dev/app/braintrustdata.com/p/Classifying%20News%20Articles%20Cookbook/experiments/Original%20Prompt-db3e9cae ``` ``` EvalResultWithSummary(summary="...", results=[...]) ``` ## Analyzing the results Looking at our results table (in the screenshot below), we see that the data points involving the category `Sci/Tech` are not scoring 100%. Let's dive deeper. ![Sci/Tech issue](./../assets/ClassifyingNewsArticles/table.png) ## Reproducing an example First, let's see if we can reproduce this issue locally. 
We can test an article corresponding to the `Sci/Tech` category and reproduce the evaluation: ```python sci_tech_article = [a for a in articles if "Galaxy Clusters" in a["input"]][0] print(sci_tech_article["input"]) print(sci_tech_article["expected"]) out = classify_article(sci_tech_article["input"]) print(out) ``` ``` A Cosmic Storm: When Galaxy Clusters Collide Astronomers have found what they are calling the perfect cosmic storm, a galaxy cluster pile-up so powerful its energy output is second only to the Big Bang. Sci/Tech Sci-Tech ``` ## Fixing the prompt Have you spotted the issue? It looks like we misspelled one of the categories in our prompt. The dataset's categories are `World`, `Sports`, `Business` and `Sci/Tech` - but we are using `Sci-Tech` in our prompt. Let's fix it: ```python @braintrust.traced def classify_article(input): messages = [ { "role": "system", "content": """You are an editor in a newspaper who helps writers identify the right category for their news articles, by reading the article's title. The category should be one of the following: World, Sports, Business or Sci/Tech. Reply with one word corresponding to the category.""", }, { "role": "user", "content": "Article title: {input} Category:".format(input=input), }, ] result = client.chat.completions.create( model=MODEL, messages=messages, max_tokens=10, ) category = result.choices[0].message.content return category result = classify_article(sci_tech_article["input"]) print(result) ``` ``` Sci/Tech ``` ## Evaluate the new prompt The model classified the correct category `Sci/Tech` for this example. But, how do we know it works for the rest of the dataset? Let's run a new experiment to evaluate our new prompt: ```python await braintrust.Eval( "Classifying News Articles Cookbook", data=articles, task=classify_article, scores=[Levenshtein], experiment_name="New Prompt", ) ``` ## Conclusion Select the new experiment, and check it out. 
You should notice a few things: * Braintrust will automatically compare the new experiment to your previous one. * You should see the eval scores increase and you can see which test cases improved. * You can also filter the test cases by improvements to know exactly why the scores changed. ![Compare](../assets/ClassifyingNewsArticles/inspect.gif) ## Next steps * [I ran an eval. Now what?](/blog/after-evals) * Add more [custom scorers](/docs/guides/functions/scorers#custom-scorers). * Try other models like xAI's [Grok 2](https://x.ai/blog/grok-2) or OpenAI's [o1](https://openai.com/o1/). To learn more about comparing evals across multiple AI models, check out this [cookbook](/docs/cookbook/recipes/ModelComparison). --- file: ./content/docs/cookbook/recipes/CodaHelpDesk.mdx meta: { "title": "Coda's Help Desk with and without RAG", "language": "python", "authors": [ { "name": "Austin Moehle", "website": "https://www.linkedin.com/in/austinmxx/", "avatar": "/blog/img/author/austin-moehle.jpg" }, { "name": "Kenny Wong", "website": "https://twitter.com/siuheihk", "avatar": "/blog/img/author/kenny-wong.png" } ], "date": "2023-12-21", "tags": [ "evals", "rag" ] } # Coda's Help Desk with and without RAG Large language models have gotten extremely good at answering general questions but often struggle with specific domain knowledge. When building AI-powered help desks or knowledge bases, this limitation becomes apparent. Retrieval-augmented generation (RAG) addresses this challenge by incorporating relevant information from external documents into the model's context. In this cookbook, we'll build and evaluate an AI application that answers questions about [Coda's Help Desk](https://help.coda.io/en/) documentation. Using Braintrust, we'll compare baseline and RAG-enhanced responses against expected answers to quantitatively measure the improvement. 
## Getting started To follow along, start by installing the required packages: ```python pip install autoevals braintrust requests openai lancedb markdownify pyarrow ``` Next, make sure you have a [Braintrust](https://www.braintrust.dev/signup) account, along with an [OpenAI API key](https://platform.openai.com/). To authenticate with Braintrust, export your `BRAINTRUST_API_KEY` as an environment variable: ```bash export BRAINTRUST_API_KEY="YOUR_API_KEY_HERE" ``` Exporting your API key is a best practice, but to make it easier to follow along with this cookbook, you can also hardcode it into the code below. We'll import our modules and define constants: ```python import os import re import json import tempfile from typing import List import autoevals import braintrust import markdownify import lancedb import openai import requests import asyncio from pydantic import BaseModel, Field # Model selection constants QA_GEN_MODEL = "gpt-4o-mini" QA_ANSWER_MODEL = "gpt-4o-mini" QA_GRADING_MODEL = "gpt-4o-mini" RELEVANCE_MODEL = "gpt-4o-mini" # Data constants NUM_SECTIONS = 20 NUM_QA_PAIRS = 20 # Increase this number to test at a larger scale TOP_K = 2 # Number of relevant sections to retrieve # Uncomment the following line to hardcode your API key # os.environ["BRAINTRUST_API_KEY"] = "YOUR_API_KEY_HERE" ``` ## Download Markdown docs from Coda's Help Desk Let's start by downloading the Coda docs and splitting them into their constituent Markdown sections. 
```python data = requests.get( "https://gist.githubusercontent.com/wong-codaio/b8ea0e087f800971ca5ec9eef617273e/raw/39f8bd2ebdecee485021e20f2c1d40fd649a4c77/articles.json" ).json() markdown_docs = [ {"id": row["id"], "markdown": markdownify.markdownify(row["body"])} for row in data ] i = 0 markdown_sections = [] for markdown_doc in markdown_docs: sections = re.split(r"(.*\n=+\n)", markdown_doc["markdown"]) current_section = "" for section in sections: if not section.strip(): continue if re.match(r".*\n=+\n", section): current_section = section else: section = current_section + section markdown_sections.append( { "doc_id": markdown_doc["id"], "section_id": i, "markdown": section.strip(), } ) current_section = "" i += 1 print(f"Downloaded {len(markdown_sections)} Markdown sections. Here are the first 3:") for i, section in enumerate(markdown_sections[:3]): print(f"\nSection {i+1}:\n{section}") ``` ``` Downloaded 996 Markdown sections. Here are the first 3: Section 1: {'doc_id': '8179780', 'section_id': 0, 'markdown': "Not all Coda docs are used in the same way. You'll inevitably have a few that you use every week, and some that you'll only use once. This is where starred docs can help you stay organized.\n\nStarring docs is a great way to mark docs of personal importance. After you star a doc, it will live in a section on your doc list called **[My Shortcuts](https://coda.io/shortcuts)**. All starred docs, even from multiple different workspaces, will live in this section.\n\nStarring docs only saves them to your personal My Shortcuts. It doesn’t affect the view for others in your workspace. If you’re wanting to shortcut docs not just for yourself but also for others in your team or workspace, you’ll [use pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) instead."} Section 2: {'doc_id': '8179780', 'section_id': 1, 'markdown': '**Star your docs**\n==================\n\nTo star a doc, hover over its name in the doc list and click the star icon. 
Alternatively, you can star a doc from within the doc itself. Hover over the doc title in the upper left corner, and click on the star.\n\nOnce you star a doc, you can access it quickly from the [My Shortcuts](https://coda.io/shortcuts) tab of your doc list.\n\n![](https://downloads.intercomcdn.com/i/o/793964361/55a80927217f85d68d44a3c3/Star+doc+to+my+shortcuts.gif)\n\nAnd, as your doc needs change, simply click the star again to un-star the doc and remove it from **My Shortcuts**.'} Section 3: {'doc_id': '8179780', 'section_id': 2, 'markdown': '**FAQs**\n========\n\nWhen should I star a doc and when should I pin it?\n--------------------------------------------------\n\nStarring docs is best for docs of *personal* importance. Starred docs appear in your **My Shortcuts**, but they aren’t starred for anyone else in your workspace. For instance, you may want to star your personal to-do list doc or any docs you use on a daily basis.\n\n[Pinning](https://help.coda.io/en/articles/2865511-starred-pinned-docs) is recommended when you want to flag or shortcut a doc for *everyone* in your workspace or folder. For instance, you likely want to pin your company wiki doc to your workspace. And you may want to pin your team task tracker doc to your team’s folder.\n\nCan I star docs for everyone?\n-----------------------------\n\nStarring docs only applies to your own view and your own My Shortcuts. To pin docs (or templates) to your workspace or folder, [refer to this article](https://help.coda.io/en/articles/2865511-starred-pinned-docs).\n\n---'} ``` ## Use the Braintrust AI Proxy Let's initialize the OpenAI client using the [Braintrust proxy](/docs/guides/proxy). The Braintrust AI Proxy provides a single API to access OpenAI and other models. Because the proxy automatically caches and reuses results (when `temperature=0` or the `seed` parameter is set), we can re-evaluate prompts many times without incurring additional API costs. 
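To see why determinism matters here: a request with `temperature=0` (or a fixed `seed`) should produce the same completion every time, so its parameters can serve as a cache key. A conceptual sketch (not the proxy's actual implementation):

```python
import hashlib
import json

# Conceptual sketch of response caching for deterministic requests. This is
# NOT the Braintrust proxy's actual logic; it only illustrates why
# temperature=0 (or a fixed seed) makes a request safely cacheable.
_cache: dict = {}

def cache_key(model: str, messages: list, temperature: float) -> str:
    # Identical request parameters hash to the same key.
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(model, messages, temperature, call_llm):
    key = cache_key(model, messages, temperature)
    if temperature == 0 and key in _cache:
        return _cache[key]  # deterministic repeat: reuse the stored result
    result = call_llm()
    if temperature == 0:
        _cache[key] = result  # only deterministic responses are cached
    return result
```

In practice the proxy handles this server-side; the `x-bt-use-cache` header in the client setup below opts in explicitly.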
```python client = braintrust.wrap_openai( openai.AsyncOpenAI( api_key=os.environ.get("BRAINTRUST_API_KEY"), base_url="https://api.braintrust.dev/v1/proxy", default_headers={"x-bt-use-cache": "always"}, ) ) ``` ## Generate question-answer pairs Before we start evaluating some prompts, let's use the LLM to generate a bunch of question-answer pairs from the text at hand. We'll use these QA pairs as ground truth when grading our models later. ```python class QAPair(BaseModel): questions: List[str] = Field( ..., description="List of questions, all with the same meaning but worded differently", ) answer: str = Field(..., description="Answer") class QAPairs(BaseModel): pairs: List[QAPair] = Field(..., description="List of question/answer pairs") async def produce_candidate_questions(row): response = await client.chat.completions.create( model=QA_GEN_MODEL, messages=[ { "role": "user", "content": f"""\ Please generate 8 question/answer pairs from the following text. For each question, suggest 2 different ways of phrasing the question, and provide a unique answer. 
Content: {row['markdown']}
""",
            }
        ],
        functions=[
            {
                "name": "propose_qa_pairs",
                "description": "Propose some question/answer pairs for a given document",
                "parameters": QAPairs.model_json_schema(),
            }
        ],
    )
    pairs = QAPairs(**json.loads(response.choices[0].message.function_call.arguments))
    return pairs.pairs


# Create tasks for all API calls
all_candidates_tasks = [
    asyncio.create_task(produce_candidate_questions(a))
    for a in markdown_sections[:NUM_SECTIONS]
]
all_candidates = [await f for f in all_candidates_tasks]

data = []
row_id = 0
for row, doc_qa in zip(markdown_sections[:NUM_SECTIONS], all_candidates):
    for i, qa in enumerate(doc_qa):
        for j, q in enumerate(qa.questions):
            data.append(
                {
                    "input": q,
                    "expected": qa.answer,
                    "metadata": {
                        "document_id": row["doc_id"],
                        "section_id": row["section_id"],
                        "question_idx": i,
                        "answer_idx": j,
                        "id": row_id,
                        "split": (
                            "test" if j == len(qa.questions) - 1 and j > 0 else "train"
                        ),
                    },
                }
            )
            row_id += 1

print(f"Generated {len(data)} QA pairs. Here are the first 10:")
for x in data[:10]:
    print(x)
```

```
Generated 320 QA pairs. Here are the first 10:
{'input': 'What is the purpose of starring a doc in Coda?', 'expected': 'Starring a doc in Coda helps you mark documents of personal importance, making it easier to organize and access them quickly.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 0, 'id': 0, 'split': 'train'}}
{'input': 'Why would someone want to star a document in Coda?', 'expected': 'Starring a doc in Coda helps you mark documents of personal importance, making it easier to organize and access them quickly.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 0, 'answer_idx': 1, 'id': 1, 'split': 'test'}}
{'input': 'Where do starred docs appear in Coda?', 'expected': 'Starred docs appear in a section called My Shortcuts on your doc list, allowing for quick access.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 0, 'id': 2, 'split': 'train'}}
{'input': 'After starring a document in Coda, where can I find it?', 'expected': 'Starred docs appear in a section called My Shortcuts on your doc list, allowing for quick access.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 1, 'answer_idx': 1, 'id': 3, 'split': 'test'}}
{'input': 'Does starring a doc affect other users in the workspace?', 'expected': 'No, starring a doc only saves it to your personal My Shortcuts and does not affect the view for others in your workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 0, 'id': 4, 'split': 'train'}}
{'input': 'Will my colleagues see the docs I star in Coda?', 'expected': 'No, starring a doc only saves it to your personal My Shortcuts and does not affect the view for others in your workspace.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 2, 'answer_idx': 1, 'id': 5, 'split': 'test'}}
{'input': 'What should I use if I want to share a shortcut to a doc with my team?', 'expected': 'To create a shortcut for a document that your team can access, you should use the pinning feature instead of starring.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 0, 'id': 6, 'split': 'train'}}
{'input': 'How can I create a shortcut for a document that everyone in my workspace can access?', 'expected': 'To create a shortcut for a document that your team can access, you should use the pinning feature instead of starring.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 3, 'answer_idx': 1, 'id': 7, 'split': 'test'}}
{'input': 'Can starred documents come from different workspaces in Coda?', 'expected': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 0, 'id': 8, 'split': 'train'}}
{'input': 'Is it possible to star docs from multiple workspaces?', 'expected': 'Yes, all starred docs, even from multiple different workspaces, will live in the My Shortcuts section.', 'metadata': {'document_id': '8179780', 'section_id': 0, 'question_idx': 4, 'answer_idx': 1, 'id': 9, 'split': 'test'}}
```

## Evaluate a context-free prompt (no RAG)

Let's evaluate a simple prompt that poses each question without providing context from the Markdown docs. We'll evaluate this naive approach using the [Factuality prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml) from the Braintrust [autoevals](/docs/reference/autoevals) library.
```python
async def simple_qa(input):
    completion = await client.chat.completions.create(
        model=QA_ANSWER_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""\
Please answer the following question:

Question: {input}
""",
            }
        ],
    )
    return completion.choices[0].message.content


await braintrust.Eval(
    name="Coda Help Desk Cookbook",
    experiment_name="No RAG",
    data=data[:NUM_QA_PAIRS],
    task=simple_qa,
    scores=[autoevals.Factuality(model=QA_GRADING_MODEL)],
)
```

### Analyze the evaluation in the UI

The cell above will print a link to a Braintrust experiment. Pause and navigate to the UI to view our baseline eval.

![Baseline eval](./../assets/CodaHelpDesk/inspect.png)

## Try using RAG to improve performance

Let's see if RAG (retrieval-augmented generation) can improve our results on this task.

First, we'll compute embeddings for each Markdown section using `text-embedding-ada-002` and create an index over the embeddings in [LanceDB](https://lancedb.com), a vector database. Then, for any given query, we can convert it to an embedding and efficiently find the most relevant context by searching in embedding space. We'll then provide the corresponding text as additional context in our prompt.
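The nearest-neighbor search we'll delegate to LanceDB boils down to ranking documents by vector similarity to the query embedding. A toy sketch with made-up three-dimensional vectors (real embeddings are high-dimensional and come from the embedding model):

```python
import math


def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


# Made-up "embeddings" for three toy documents.
docs = {
    "starring": [0.9, 0.1, 0.0],
    "pinning": [0.1, 0.9, 0.0],
    "sharing": [0.0, 0.2, 0.9],
}


def top_k(query_vec, k):
    # Rank documents by similarity to the query and keep the k best.
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
    return ranked[:k]
```

A query vector close to the "starring" direction retrieves that document first; a vector database does the same ranking, just with an index that avoids comparing against every document.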
```python
tempdir = tempfile.TemporaryDirectory()
LANCE_DB_PATH = os.path.join(tempdir.name, "docs-lancedb")


@braintrust.traced
async def embed_text(text):
    params = dict(input=text, model="text-embedding-ada-002")
    response = await client.embeddings.create(**params)
    embedding = response.data[0].embedding

    braintrust.current_span().log(
        metrics={
            "tokens": response.usage.total_tokens,
            "prompt_tokens": response.usage.prompt_tokens,
        },
        metadata={"model": response.model},
        input=text,
        output=embedding,
    )

    return embedding


embedding_tasks = [
    asyncio.create_task(embed_text(row["markdown"]))
    for row in markdown_sections[:NUM_SECTIONS]
]
embeddings = [await f for f in embedding_tasks]

db = lancedb.connect(LANCE_DB_PATH)

try:
    db.drop_table("sections")
except Exception:
    pass

# Convert the data to a pandas DataFrame first
import pandas as pd

table_data = [
    {
        "doc_id": row["doc_id"],
        "section_id": row["section_id"],
        "text": row["markdown"],
        "vector": embedding,
    }
    for (row, embedding) in zip(markdown_sections[:NUM_SECTIONS], embeddings)
]

# Create the table using the DataFrame approach
table = db.create_table("sections", data=pd.DataFrame(table_data))
```

## Use AI to judge relevance of retrieved documents

Let's retrieve a few *more* of the best-matching candidates from the vector database than we intend to use, then use the model from `RELEVANCE_MODEL` to score the relevance of each candidate to the input query. We'll use the `TOP_K` blurbs by relevance score in our QA prompt. Doing this should be a little more intelligent than just using the closest embeddings.

```python
@braintrust.traced
async def relevance_score(query, document):
    response = await client.chat.completions.create(
        model=RELEVANCE_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""\
Consider the following query and document:

Query: {query}

Document: {document}

Please score the relevance of the document to the query, on a scale of 0 to 1.
""",
            }
        ],
        functions=[
            {
                "name": "has_relevance",
                "description": "Declare the relevance of a document to a query",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "score": {"type": "number"},
                    },
                },
            }
        ],
    )

    arguments = response.choices[0].message.function_call.arguments
    result = json.loads(arguments)

    braintrust.current_span().log(
        input={"query": query, "document": document},
        output=result,
    )

    return result["score"]


async def retrieval_qa(input):
    embedding = await embed_text(input)
    with braintrust.current_span().start_span(
        name="vector search", input=input
    ) as span:
        result = table.search(embedding).limit(TOP_K + 3).to_arrow().to_pylist()
        docs = [markdown_sections[i["section_id"]]["markdown"] for i in result]

        relevance_scores = []
        for doc in docs:
            relevance_scores.append(await relevance_score(input, doc))

        span.log(
            output=[
                {
                    "doc": markdown_sections[r["section_id"]]["markdown"],
                    "distance": r["_distance"],
                }
                for r in result
            ],
            metadata={"top_k": TOP_K, "retrieval": result},
            scores={
                "avg_relevance": sum(relevance_scores) / len(relevance_scores),
                "min_relevance": min(relevance_scores),
                "max_relevance": max(relevance_scores),
            },
        )

    context = "\n------\n".join(docs[:TOP_K])
    completion = await client.chat.completions.create(
        model=QA_ANSWER_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""\
Given the following context:

{context}

Please answer the following question:

Question: {input}
""",
            }
        ],
    )
    return completion.choices[0].message.content
```

## Run the RAG evaluation

Now let's run our evaluation with RAG:

```python
await braintrust.Eval(
    name="Coda Help Desk Cookbook",
    experiment_name=f"RAG TopK={TOP_K}",
    data=data[:NUM_QA_PAIRS],
    task=retrieval_qa,
    scores=[autoevals.Factuality(model=QA_GRADING_MODEL)],
)
```

### Analyzing the results

![Experiment RAG](./../assets/CodaHelpDesk/rag.png)

Select the new experiment to analyze the results.
You should notice several things:

* Braintrust automatically compares the new experiment to your previous one
* You should see an increase in scores with RAG
* You can explore individual examples to see exactly which responses improved

Try adjusting the constants set at the beginning of this tutorial, such as `NUM_QA_PAIRS`, to run your evaluation on a larger dataset and gain more confidence in your findings.

## Next steps

* Learn about [using functions to build a RAG agent](/docs/cookbook/recipes/ToolRAG).
* Compare your [evals across different models](/docs/cookbook/recipes/ModelComparison).
* If RAG is just one part of your agent, learn how to [evaluate a prompt chaining agent](/docs/cookbook/recipes/PromptChaining).

---
file: ./content/docs/cookbook/recipes/EvaluatingChatAssistant.mdx
meta: { "title": "Evaluating a chat assistant", "language": "typescript", "authors": [ { "name": "Tara Nagar", "website": "https://www.linkedin.com/in/taranagar/", "avatar": "/blog/img/author/tara-nagar.jpg" } ], "date": "2024-07-16", "tags": [ "evals", "chat" ] }

# Evaluating a chat assistant

## Evaluating a multi-turn chat assistant

This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant. These types of chat bots have become important parts of applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As the owner of such an application, it's important to be sure the bot provides value to the user.

We will expand on this below, but the history and context of a conversation are crucial to producing a good response. If you received a request to "Make a dinner reservation at 7pm" and you knew where, on what date, and for how many people, you could provide some assistance; otherwise, you'd need to ask for more information.

Before starting, please make sure you have a Braintrust account. If you do not have one, you can [sign up here](https://www.braintrust.dev).
## Installing dependencies

Begin by installing the necessary dependencies if you have not done so already.

```typescript
pnpm install autoevals braintrust openai
```

## Inspecting the data

Let's take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying [dataset.ts file](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/EvaluatingChatAssistant/dataset.ts). The `assistant` turns were generated using `claude-3-5-sonnet-20240620`.

Below is an example of a data point.

* `chat_history` contains the history of the conversation between the user and the assistant
* `input` is the last `user` turn that will be sent in the `messages` argument to the chat completion
* `expected` is the output expected from the chat completion given the input

```typescript
import dataset, { ChatTurn } from "./assets/dataset";

console.log(dataset[0]);
```

```
{
  chat_history: [
    {
      role: 'user',
      content: "when was the ballon d'or first awarded for female players?"
    },
    {
      role: 'assistant',
      content: "The Ballon d'Or for female players was first awarded in 2018. The inaugural winner was Ada Hegerberg, a Norwegian striker who plays for Olympique Lyonnais."
    }
  ],
  input: "who won the men's trophy that year?",
  expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```

From looking at this one example, we can see why the history is necessary to provide a helpful response. If you were asked "Who won the men's trophy that year?" you would wonder *What trophy? Which year?* But if you were also given the `chat_history`, you would be able to answer the question (maybe after some quick research).

## Running experiments

The key to running evals on a multi-turn conversation is to include the history of the chat in the chat completion request.

### Assistant with no chat history

To start, let's see how the prompt performs when no chat history is provided.
We'll create a simple task function that returns the output from a chat completion.

```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

const experimentData = dataset.map((data) => ({
  input: data.input,
  expected: data.expected,
}));
console.log(experimentData[0]);

async function runTask(input: string) {
  const client = wrapOpenAI(
    new OpenAI({
      baseURL: "https://api.braintrust.dev/v1/proxy",
      apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral, etc. API keys here
    }),
  );

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "You are a helpful and polite assistant who knows about sports.",
      },
      {
        role: "user",
        content: input,
      },
    ],
  });

  return response.choices[0].message.content || "";
}
```

```
{
  input: "who won the men's trophy that year?",
  expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```

#### Scoring and running the eval

We'll use the `Factuality` scoring function from the [autoevals library](https://www.braintrust.dev/docs/reference/autoevals) to check how the output of the chat completion compares factually to the expected value.

We will also utilize [trials](https://www.braintrust.dev/docs/guides/evals/write#trials) by including the `trialCount` parameter in the `Eval` call. We expect the output of the chat completion to be non-deterministic, so running each input multiple times will give us a better sense of the "average" output.
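The intuition behind trials is just score averaging over repeated runs. A minimal, language-agnostic sketch (written in Python here; the task, outputs, and scorer below are all made up for illustration):

```python
def run_trials(task, score, input_value, expected, trial_count=3):
    # Run the same input several times and average the per-trial scores.
    scores = [score(task(input_value), expected) for _ in range(trial_count)]
    return sum(scores) / len(scores)


# Hypothetical nondeterministic task: cycles through canned outputs.
outputs = iter(["Luka Modrić won.", "Could you clarify which year?", "Luka Modrić won."])
task = lambda _question: next(outputs)

# Toy scorer: full credit if the expected answer appears in the output.
contains = lambda out, exp: 1.0 if exp in out else 0.0

avg = run_trials(task, contains, "who won the men's trophy that year?", "Luka Modrić")
```

Two of the three trials contain the expected answer, so the average lands at 2/3 rather than swinging between 0 and 1 on a single run.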
```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

Eval("Chat assistant", {
  experimentName: "gpt-4o assistant - no history",
  data: () => experimentData,
  task: runTask,
  scores: [Factuality],
  trialCount: 3,
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful and polite assistant who knows about sports.",
  },
});
```

```
Experiment gpt-4o assistant - no history is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o...] | 100% | 15/15 datapoints

=========================SUMMARY=========================
61.33% 'Factuality' score (0 improvements, 0 regressions)

4.12s 'duration' (0 improvements, 0 regressions)
0.01$ 'estimated_cost' (0 improvements, 0 regressions)

See results for gpt-4o assistant - no history at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history
```

A 61.33% Factuality score? Given what we discussed earlier about chat history being important in producing a good response, that's surprisingly high. Let's log into [braintrust.dev](https://www.braintrust.dev) and take a look at how we got that score.

#### Interpreting the results

![no-history-trace](./../assets/EvaluatingChatAssistant/no-history-trace.png)

If we look at the score distribution chart, we can see ten of the fifteen examples scored at least 60%, with over half even scoring 100%.

If we look into one of the examples with a 100% score, we see the output of the chat completion request is asking for more context, as we would expect:

`Could you please specify which athlete or player you're referring to? There are many professional athletes, and I'll need a bit more information to provide an accurate answer.`

This aligns with our expectation, so let's now look at how the score was determined.
![no-history-score](./../assets/EvaluatingChatAssistant/no-history-score.png)

Clicking into the scoring trace, we see the chain-of-thought reasoning used to settle on the score. The model chose `(E) The answers differ, but these differences don't matter from the perspective of factuality.`, which is *technically* correct, but we want to penalize the chat completion for not being able to produce a good response.

#### Improve scoring with a custom scorer

While Factuality is a good general-purpose scorer, for our use case option (E) is not well aligned with our expectations. The best way to work around this is to customize the scoring function so that it produces a lower score for asking for more context or specificity.

```typescript
import { LLMClassifierFromSpec, Score } from "autoevals";

function Factual(args: {
  input: string;
  output: string;
  expected: string;
}): Score | Promise<Score> {
  const factualityScorer = LLMClassifierFromSpec("Factuality", {
    prompt: `You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
(F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
(G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.`,
    choice_scores: {
      A: 0.4,
      B: 0.6,
      C: 1,
      D: 0,
      E: 1,
      F: 0.2,
      G: 0,
    },
  });

  return factualityScorer(args);
}
```

You can see the built-in Factuality prompt [here](https://github.com/braintrustdata/autoevals/blob/main/templates/factuality.yaml). For our customized scorer, we've added two score choices to that prompt:

```
- (F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
- (G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.
```

These will score (F) = 0.2 and (G) = 0, so the model gets some credit if there was any context it was able to gather from the user's input.

We can then use this spec and the `LLMClassifierFromSpec` function to create our custom scorer to use in the eval function. Read more about [defining your own scorers](https://www.braintrust.dev/docs/guides/evals/write#define-your-own-scorers) in the documentation.

#### Re-running the eval

Let's now use this updated scorer and run the experiment again.
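Mechanically, `choice_scores` just maps the letter grade the classifier picks to a number. A sketch of that lookup (in Python for brevity; the mapping mirrors the spec above):

```python
# Mirrors the choice_scores spec: each letter grade maps to a numeric score.
CHOICE_SCORES = {"A": 0.4, "B": 0.6, "C": 1.0, "D": 0.0, "E": 1.0, "F": 0.2, "G": 0.0}


def score_from_choice(choice):
    # Reject grades outside the rubric rather than silently scoring them.
    if choice not in CHOICE_SCORES:
        raise ValueError(f"unexpected grade: {choice!r}")
    return CHOICE_SCORES[choice]
```

A response graded (F) now earns 0.2 instead of the full credit that (E) used to grant it.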
```typescript
Eval("Chat assistant", {
  experimentName: "gpt-4o assistant - no history",
  data: () =>
    dataset.map((data) => ({ input: data.input, expected: data.expected })),
  task: runTask,
  scores: [Factual],
  trialCount: 3,
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful and polite assistant who knows about sports.",
  },
});
```

```
Experiment gpt-4o assistant - no history-934e5ca2 is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history-934e5ca2
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o...] | 100% | 15/15 datapoints

=========================SUMMARY=========================
gpt-4o assistant - no history-934e5ca2 compared to gpt-4o assistant - no history:
6.67% (-54.67%) 'Factuality' score (0 improvements, 5 regressions)

4.77s 'duration' (2 improvements, 3 regressions)
0.01$ 'estimated_cost' (2 improvements, 3 regressions)

See results for gpt-4o assistant - no history-934e5ca2 at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20no%20history-934e5ca2
```

A score of 6.67% aligns much better with what we expected. Let's look again into the results of this experiment.

#### Interpreting the results

![no-history-custom-score](./../assets/EvaluatingChatAssistant/no-history-custom-score.png)

In the table, we can see the `output` fields in which the chat completion responses are requesting more context.

In one of the examples that had a non-zero score, we can see that the model asked for some clarification, but was able to understand from the question that the user was inquiring about a controversial World Series. Nice!
![no-history-custom-score-cot](./../assets/EvaluatingChatAssistant/no-history-custom-score-cot.png)

Looking into how the score was determined, we can see that the factual information aligned with the expert answer, but the submitted answer still asked for more context, resulting in a score of 20%, which is what we expect.

### Assistant with chat history

Now let's shift and see how providing the chat history improves the experiment.

#### Update the data, task function, and scorer function

We need to edit the inputs to the `Eval` function so we can pass the chat history to the chat completion request.

```typescript
const experimentData = dataset.map((data) => ({
  input: { input: data.input, chat_history: data.chat_history },
  expected: data.expected,
}));
console.log(experimentData[0]);

async function runTask({
  input,
  chat_history,
}: {
  input: string;
  chat_history: ChatTurn[];
}) {
  const client = wrapOpenAI(
    new OpenAI({
      baseURL: "https://api.braintrust.dev/v1/proxy",
      apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral, etc. API keys here
    }),
  );

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: "You are a helpful and polite assistant who knows about sports.",
      },
      ...chat_history,
      {
        role: "user",
        content: input,
      },
    ],
  });

  return response.choices[0].message.content || "";
}

function Factual(args: {
  input: {
    input: string;
    chat_history: ChatTurn[];
  };
  output: string;
  expected: string;
}): Score | Promise<Score> {
  const factualityScorer = LLMClassifierFromSpec("Factuality", {
    prompt: `You are comparing a submitted answer to an expert answer on a given question. Here is the data:
[BEGIN DATA]
************
[Question]: {{{input}}}
************
[Expert]: {{{expected}}}
************
[Submission]: {{{output}}}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.
(F) The submitted answer asks for more context, specifics or clarification but provides factual information consistent with the expert answer.
(G) The submitted answer asks for more context, specifics or clarification but does not provide factual information consistent with the expert answer.`,
    choice_scores: {
      A: 0.4,
      B: 0.6,
      C: 1,
      D: 0,
      E: 1,
      F: 0.2,
      G: 0,
    },
  });

  return factualityScorer(args);
}
```

```
{
  input: {
    input: "who won the men's trophy that year?",
    chat_history: [ [Object], [Object] ]
  },
  expected: "In 2018, the men's Ballon d'Or was awarded to Luka Modrić."
}
```

We update the parameter of the task function to accept both the `input` string and the `chat_history` array, and add the `chat_history` into the messages array in the chat completion request, done here using the spread (`...`) syntax. We also need to update the `experimentData` and `Factual` function parameters to align with these changes.

#### Running the eval

Use the updated variables and functions to run a new eval.
```typescript
Eval("Chat assistant", {
  experimentName: "gpt-4o assistant",
  data: () => experimentData,
  task: runTask,
  scores: [Factual],
  trialCount: 3,
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful and polite assistant who knows about sports.",
  },
});
```

```
Experiment gpt-4o assistant is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o...] | 100% | 15/15 datapoints

=========================SUMMARY=========================
gpt-4o assistant compared to gpt-4o assistant - no history-934e5ca2:
60.00% 'Factuality' score (0 improvements, 0 regressions)

4.34s 'duration' (0 improvements, 0 regressions)
0.01$ 'estimated_cost' (0 improvements, 0 regressions)

See results for gpt-4o assistant at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant
```

A 60% score is a definite improvement over the previous 6.67%.

You'll notice that it says there were 0 improvements and 0 regressions compared to the last experiment, `gpt-4o assistant - no history-934e5ca2`. This is because, by default, Braintrust uses the `input` field to match rows across experiments. From the dashboard, we can customize the comparison key ([see docs](https://www.braintrust.dev/docs/guides/evals/interpret#customizing-the-comparison-key)) by going to the [project configuration page](https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/configuration).

#### Update experiment comparison for diff mode

Let's go back to the dashboard. For this cookbook, we can use the `expected` field as the comparison key because this field is unique in our small dataset.
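The row-matching behind experiment comparisons amounts to a keyed join. A sketch of why the `input` key stops matching once its shape changes (Python for brevity; the rows and scores below are made up, and Braintrust does this matching internally):

```python
def diff_by_key(prev_rows, curr_rows, key):
    # Join rows from two experiments on a shared key and report score deltas.
    prev = {row[key]: row["score"] for row in prev_rows}
    return {
        row[key]: row["score"] - prev[row[key]]
        for row in curr_rows
        if row[key] in prev
    }


prev_rows = [{"input": "who won?", "expected": "Modrić", "score": 0.07}]
curr_rows = [{"input": "who won? [with history]", "expected": "Modrić", "score": 0.6}]

# The inputs changed between experiments, so matching on "input" finds nothing;
# matching on the stable "expected" field pairs the rows up.
no_match = diff_by_key(prev_rows, curr_rows, "input")
deltas = diff_by_key(prev_rows, curr_rows, "expected")
```

With no shared `input` values, the diff is empty (0 improvements, 0 regressions); switching the key to `expected` recovers the per-row comparison.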
In the Configuration tab, go to the bottom of the page to update the comparison key:

![comparison-key](./../assets/EvaluatingChatAssistant/comparison-key.png)

#### Interpreting the results

Turn on diff mode using the toggle on the upper right of the table.

![experiment-diff](./../assets/EvaluatingChatAssistant/experiment-diff.png)

Since we updated the comparison key, we can now see the improvements in the Factuality score between the experiment run with chat history and the most recent one run without, for each of the examples. If we also click into a trace, we can see the change in input parameters that we made above, where it went from a `string` to an object with `input` and `chat_history` fields.

All of our rows scored 60% in this experiment. If we look into each trace, this means the submitted answer includes all the details from the expert answer with some additional information. 60% is an improvement from the previous run, but we can do better. Since it seems like the chat completion is always returning more than necessary, let's see if we can tweak our prompt to have the output be more concise.

#### Improving the result

Let's update the system prompt used in the chat completion request.

```typescript
async function runTask({
  input,
  chat_history,
}: {
  input: string;
  chat_history: ChatTurn[];
}) {
  const client = wrapOpenAI(
    new OpenAI({
      baseURL: "https://api.braintrust.dev/v1/proxy",
      apiKey: process.env.OPENAI_API_KEY ?? "", // Can use OpenAI, Anthropic, Mistral, etc. API keys here
    }),
  );

  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content:
          "You are a helpful, polite assistant who knows about sports. Only answer the question; don't add additional information outside of what was asked.",
      },
      ...chat_history,
      {
        role: "user",
        content: input,
      },
    ],
  });

  return response.choices[0].message.content || "";
}
```

In the task function, we've updated the `system` message to specify that the output should be precise. Now let's run the eval again.

```typescript
Eval("Chat assistant", {
  experimentName: "gpt-4o assistant - concise",
  data: () => experimentData,
  task: runTask,
  scores: [Factual],
  trialCount: 3,
  metadata: {
    model: "gpt-4o",
    prompt: "You are a helpful, polite assistant who knows about sports. Only answer the question; don't add additional information outside of what was asked.",
  },
});
```

```
Experiment gpt-4o assistant - concise is running at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20concise
████████████████████████████████████████ | Chat assistant [experimentName=gpt-4o...] | 100% | 15/15 datapoints

=========================SUMMARY=========================
gpt-4o assistant - concise compared to gpt-4o assistant:
86.67% (+26.67%) 'Factuality' score (4 improvements, 0 regressions)

1.89s 'duration' (5 improvements, 0 regressions)
0.01$ 'estimated_cost' (4 improvements, 1 regressions)

See results for gpt-4o assistant - concise at https://www.braintrust.dev/app/braintrustdata.com/p/Chat%20assistant/experiments/gpt-4o%20assistant%20-%20concise
```

Let's go into the dashboard and see the new experiment.

![concise-diff](./../assets/EvaluatingChatAssistant/concise-diff.png)

Success! We got a 27-percentage-point increase in Factuality, up to an average score of 87% for this experiment with our updated prompt.

### Conclusion

We've seen in this cookbook how to evaluate a chat assistant and visualized how the chat history affects the output of the chat completion.
Along the way, we also utilized some other functionality, such as updating the comparison key in the diff view and creating a custom scoring function. Try seeing how you can improve the outputs and scores even further!

---
file: ./content/docs/cookbook/recipes/Github-Issues.mdx
meta: { "title": "Improving Github issue titles using their contents", "language": "typescript", "authors": [ { "name": "Ankur Goyal", "website": "https://twitter.com/ankrgyl", "avatar": "/blog/img/author/ankur-goyal.jpg" } ], "date": "2023-10-29", "tags": [ "evals", "summarization" ] }

# Improving Github issue titles using their contents

This tutorial will teach you how to use Braintrust to generate better titles for GitHub issues, based on their content. This is a great way to learn how to work with text and evaluate subjective criteria, like summarization quality.

We'll use a technique called **model-graded evaluation** to automatically evaluate the newly generated titles against the original titles, and improve our prompt based on what we find.

Before starting, please make sure that you have a Braintrust account. If you do not, please [sign up](https://www.braintrust.dev). After this tutorial, feel free to dig deeper by visiting [the docs](http://www.braintrust.dev/docs).

## Installing dependencies

To see a list of dependencies, you can view the accompanying [package.json](https://github.com/braintrustdata/braintrust-cookbook/tree/main/examples/Github-Issues/package.json) file. Feel free to copy/paste snippets of this code to run in your environment, or use [tslab](https://github.com/yunabe/tslab) to run the tutorial in a Jupyter notebook.

## Downloading the data

We'll start by downloading some issues from GitHub using the `octokit` SDK. We'll use the popular open-source project [next.js](https://github.com/vercel/next.js).
```typescript import { Octokit } from "@octokit/core"; const ISSUES = [ "https://github.com/vercel/next.js/issues/59999", "https://github.com/vercel/next.js/issues/59997", "https://github.com/vercel/next.js/issues/59995", "https://github.com/vercel/next.js/issues/59988", "https://github.com/vercel/next.js/issues/59986", "https://github.com/vercel/next.js/issues/59971", "https://github.com/vercel/next.js/issues/59958", "https://github.com/vercel/next.js/issues/59957", "https://github.com/vercel/next.js/issues/59950", "https://github.com/vercel/next.js/issues/59940", ]; // Octokit.js // https://github.com/octokit/core.js#readme const octokit = new Octokit({ auth: process.env.GITHUB_ACCESS_TOKEN || "Your Github Access Token", }); async function fetchIssue(url: string) { // parse url of the form https://github.com/supabase/supabase/issues/15534 const [owner, repo, _, issue_number] = url!.trim().split("/").slice(-4); const data = await octokit.request( "GET /repos/{owner}/{repo}/issues/{issue_number}", { owner, repo, issue_number: parseInt(issue_number), headers: { "X-GitHub-Api-Version": "2022-11-28", }, } ); return data.data; } const ISSUE_DATA = await Promise.all(ISSUES.map(fetchIssue)); ``` Let's take a look at one of the issues: ```typescript console.log(ISSUE_DATA[0].title); console.log("-".repeat(ISSUE_DATA[0].title.length)); console.log(ISSUE_DATA[0].body.substring(0, 512) + "..."); ``` ``` The instrumentation hook is only called after visiting a route -------------------------------------------------------------- ### Link to the code that reproduces this issue https://github.com/daveyjones/nextjs-instrumentation-bug ### To Reproduce \`\`\`shell git clone git@github.com:daveyjones/nextjs-instrumentation-bug.git cd nextjs-instrumentation-bug npm install npm run dev # The register function IS called npm run build && npm start # The register function IS NOT called until you visit http://localhost:3000 \`\`\` ### Current vs. 
Expected behavior The \`register\` function should be called automatically after running \`npm ... ``` ## Generating better titles Let's try to generate better titles using a simple prompt. We'll use OpenAI, although you could try this out with any model that supports text generation. We'll start by initializing an OpenAI client and wrapping it with some Braintrust instrumentation. `wrapOpenAI` is initially a no-op, but later on when we use Braintrust, it will help us capture helpful debugging information about the model's performance. ```typescript import { wrapOpenAI } from "braintrust"; import { OpenAI } from "openai"; const client = wrapOpenAI( new OpenAI({ apiKey: process.env.OPENAI_API_KEY || "Your OpenAI API Key", }) ); ``` ```typescript import { ChatCompletionMessageParam } from "openai/resources"; function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] { return [ { role: "system", content: "Generate a new title based on the github issue. Return just the title.", }, { role: "user", content: "Github issue: " + content, }, ]; } async function generateTitle(input: string) { const messages = titleGeneratorMessages(input); const response = await client.chat.completions.create({ model: "gpt-3.5-turbo", messages, seed: 123, }); return response.choices[0].message.content || ""; } const generatedTitle = await generateTitle(ISSUE_DATA[0].body); console.log("Original title: ", ISSUE_DATA[0].title); console.log("Generated title:", generatedTitle); ``` ``` Original title: The instrumentation hook is only called after visiting a route Generated title: Next.js: \`register\` function not automatically called after build and start ``` ## Scoring Ok cool! The new title looks pretty good. But how do we consistently and automatically evaluate whether the new titles are better than the old ones? With subjective problems, like summarization, one great technique is to use an LLM to grade the outputs. This is known as model graded evaluation. 
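Before reaching for a library, it helps to see roughly what a model-graded prompt looks like under the hood. The sketch below is our own illustration of the idea — the function name, criteria wording, and message layout are invented here, not the actual autoevals template:

```typescript
// Minimal chat message shape, mirroring the OpenAI chat format.
type ChatMessage = { role: "system" | "user"; content: string };

// Build a judge prompt that shows the model both titles and asks it to
// reason before picking one. Asking for step-by-step reasoning first is
// the chain-of-thought technique that makes LLM graders more reliable.
function summaryJudgeMessages(
  issueBody: string,
  originalTitle: string,
  generatedTitle: string,
): ChatMessage[] {
  return [
    {
      role: "system",
      content:
        "You compare two candidate titles for a document. First reason " +
        "step by step about which title is more specific and complete, " +
        "then answer with exactly 'A' or 'B'.",
    },
    {
      role: "user",
      content: `Document:\n${issueBody}\n\nTitle A:\n${originalTitle}\n\nTitle B:\n${generatedTitle}`,
    },
  ];
}
```

Sending these messages to a strong model and mapping its final choice to a 0/1 score is, in essence, what the `Summary` scorer from autoevals does for us.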
Below, we'll use a [summarization prompt](https://github.com/braintrustdata/autoevals/blob/main/templates/summary.yaml) from Braintrust's open source [autoevals](https://github.com/braintrustdata/autoevals) library. We encourage you to use these prompts, but also to copy/paste them, modify them, and create your own! The prompt uses [Chain of Thought](https://arxiv.org/abs/2201.11903) which dramatically improves a model's performance on grading tasks. Later, we'll see how it helps us debug the model's outputs. Let's try running it on our new title and see how it performs. ```typescript import { Summary } from "autoevals"; await Summary({ output: generatedTitle, expected: ISSUE_DATA[0].title, input: ISSUE_DATA[0].body, // In practice we've found gpt-4 class models work best for subjective tasks, because // they are great at following criteria laid out in the grading prompts. model: "gpt-4-1106-preview", }); ``` ``` { name: 'Summary', score: 1, metadata: { rationale: "Summary A ('The instrumentation hook is only called after visiting a route') is a partial and somewhat ambiguous statement. It does not specify the context of the 'instrumentation hook' or the technology involved.\n" + "Summary B ('Next.js: \`register\` function not automatically called after build and start') provides a clearer and more complete description. It specifies the technology ('Next.js') and the exact issue ('\`register\` function not automatically called after build and start').\n" + 'The original text discusses an issue with the \`register\` function in a Next.js application not being called as expected, which is directly reflected in Summary B.\n' + "Summary B also aligns with the section 'Current vs. 
Expected behavior' from the original text, which states that the \`register\` function should be called automatically but is not until a route is visited.\n" + "Summary A lacks the detail that the issue is with the Next.js framework and does not mention the expectation of the \`register\` function's behavior, which is a key point in the original text.", choice: 'B' }, error: undefined } ``` ## Initial evaluation Now that we have a way to score new titles, let's run an eval and see how our prompt performs across all 10 issues. ```typescript import { Eval, login } from "braintrust"; login({ apiKey: process.env.BRAINTRUST_API_KEY || "Your Braintrust API Key" }); await Eval("Github Issues Cookbook", { data: () => ISSUE_DATA.map((issue) => ({ input: issue.body, expected: issue.title, metadata: issue, })), task: generateTitle, scores: [ async ({ input, output, expected }) => Summary({ input, output, expected, model: "gpt-4-1106-preview", }), ], }); console.log("Done!"); ``` ``` { projectName: 'Github Issues Cookbook', experimentName: 'main-1706774628', projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook', experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook/main-1706774628', comparisonExperimentName: undefined, scores: undefined, metrics: undefined } ``` ``` ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Github Issues Cookbook | 10% | 10/100 datapoints ``` ``` Done! ``` Great! We got an initial result. If you follow the link, you'll see an eval result showing an initial score of 40%. ![Initial eval result](./../assets/Github-Issues/initial-experiment.png) ## Debugging failures Let's dig into a couple examples to see what's going on. Thanks to the instrumentation we added earlier, we can see the model's reasoning for its scores.
Issue [https://github.com/vercel/next.js/issues/59995](https://github.com/vercel/next.js/issues/59995): ![output-expected](./../assets/Github-Issues/output-expected.png) ![reasons](./../assets/Github-Issues/reasons.png) Issue [https://github.com/vercel/next.js/issues/59986](https://github.com/vercel/next.js/issues/59986): ![output-expected-2](./../assets/Github-Issues/output-expected-2.png) ![reasons2](./../assets/Github-Issues/reasons-2.png) ## Improving the prompt Hmm, it looks like the model is missing certain key details. Let's see if we can improve our prompt to encourage the model to include more details, without being too verbose. ```typescript function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] { return [ { role: "system", content: `Generate a new title based on the github issue. The title should include all of the key identifying details of the issue, without being longer than one line. Return just the title.`, }, { role: "user", content: "Github issue: " + content, }, ]; } async function generateTitle(input: string) { const messages = titleGeneratorMessages(input); const response = await client.chat.completions.create({ model: "gpt-3.5-turbo", messages, seed: 123, }); return response.choices[0].message.content || ""; } ``` ### Re-evaluating Now that we've tweaked our prompt, let's see how it performs by re-running our eval. 
```typescript await Eval("Github Issues Cookbook", { data: () => ISSUE_DATA.map((issue) => ({ input: issue.body, expected: issue.title, metadata: issue, })), task: generateTitle, scores: [ async ({ input, output, expected }) => Summary({ input, output, expected, model: "gpt-4-1106-preview", }), ], }); console.log("All done!"); ``` ``` { projectName: 'Github Issues Cookbook', experimentName: 'main-1706774676', projectUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook', experimentUrl: 'https://www.braintrust.dev/app/braintrust.dev/p/Github%20Issues%20Cookbook/main-1706774676', comparisonExperimentName: 'main-1706774628', scores: { Summary: { name: 'Summary', score: 0.7, diff: 0.29999999999999993, improvements: 3, regressions: 0 } }, metrics: { duration: { name: 'duration', metric: 0.3292001008987427, unit: 's', diff: -0.002199888229370117, improvements: 7, regressions: 3 } } } ``` ``` ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ | Github Issues Cookbook | 10% | 10/100 datapoints ``` ``` All done! ``` Wow, with just a simple change, we're able to boost summary performance by 30%! ![Improved eval result](./../assets/Github-Issues/second-experiment.png) ## Parting thoughts This is just the start of evaluating and improving this AI application. From here, you should dig into individual examples, verify whether they legitimately improved, and test on more data. You can even use [logging](https://www.braintrust.dev/docs/guides/logging) to capture real-user examples and incorporate them into your evals. Happy evaluating! 
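One lightweight way to close that loop is to turn captured log rows into eval cases. The sketch below assumes a hypothetical export shape of `input`/`output` pairs; in practice you'd pull rows from your Braintrust logs as described in the logging guide:

```typescript
// Hypothetical shape for an exported log row; adjust to your actual export.
interface LogRow {
  input: string; // the issue body sent to the model
  output: string; // the title generated in production
}

// Convert production logs into eval dataset records, skipping empty rows.
// The production output serves as a provisional `expected` value, so future
// prompt changes can be compared against what users actually saw.
function logsToEvalCases(
  rows: LogRow[],
): { input: string; expected: string }[] {
  return rows
    .filter((r) => r.input.trim() !== "" && r.output.trim() !== "")
    .map((r) => ({ input: r.input, expected: r.output }));
}
```

The resulting records have the same shape as the `data` entries we built from `ISSUE_DATA`, so they drop straight into an `Eval` call.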
![improvements](./../assets/Github-Issues/improvements.gif) --- file: ./content/docs/cookbook/recipes/HTMLGenerator.mdx meta: { "title": "Generating beautiful HTML components", "language": "typescript", "authors": [ { "name": "Ankur Goyal", "website": "https://twitter.com/ankrgyl", "avatar": "/blog/img/author/ankur-goyal.jpg" } ], "date": "2024-01-29", "tags": [ "logging", "datasets", "evals" ] } # Generating beautiful HTML components In this example, we'll build an app that automatically generates HTML components, evaluates them, and captures user feedback. We'll use the feedback and evaluations to build up a dataset that we'll use as a basis for further improvements. ## The generator We'll start by using a very simple prompt to generate HTML components using `gpt-3.5-turbo`. First, we'll initialize an openai client and wrap it with Braintrust's helper. This is a no-op until we start using the client within code that is instrumented by Braintrust. ```typescript import { OpenAI } from "openai"; import { wrapOpenAI } from "braintrust"; const openai = wrapOpenAI( new OpenAI({ apiKey: process.env.OPENAI_API_KEY || "Your OPENAI_API_KEY", }) ); ``` This code generates a basic prompt: ```typescript import { ChatCompletionMessageParam } from "openai/resources"; function generateMessages(input: string): ChatCompletionMessageParam[] { return [ { role: "system", content: `You are a skilled design engineer who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS. Your designs value simplicity, conciseness, clarity, and functionality over complexity. You generate pure HTML with inline CSS, so that your designs can be rendered directly as plain HTML. Only generate components, not full HTML pages. Do not create background colors. Users will send you a description of a design, and you must reply with HTML, and nothing else. Your reply will be directly copied and rendered into a browser, so do not include any text. 
If you would like to explain your reasoning, feel free to do so in HTML comments.`, }, { role: "user", content: input, }, ]; } JSON.stringify( generateMessages("A login form for a B2B SaaS product."), null, 2 ); ``` ``` [ { "role": "system", "content": "You are a skilled design engineer\nwho can convert ambiguously worded ideas into beautiful, crisp HTML and CSS.\nYour designs value simplicity, conciseness, clarity, and functionality over\ncomplexity.\n\nYou generate pure HTML with inline CSS, so that your designs can be rendered\ndirectly as plain HTML. Only generate components, not full HTML pages. Do not\ncreate background colors.\n\nUsers will send you a description of a design, and you must reply with HTML,\nand nothing else. Your reply will be directly copied and rendered into a browser,\nso do not include any text. If you would like to explain your reasoning, feel free\nto do so in HTML comments." }, { "role": "user", "content": "A login form for a B2B SaaS product." } ] ``` Now, let's run this using `gpt-3.5-turbo`. We'll also do a few things that help us log & evaluate this function later: * Wrap the execution in a `traced` call, which will enable Braintrust to log the inputs and outputs of the function when we run it in production or in evals * Make its signature accept a single `input` value, which Braintrust's `Eval` function expects * Use a `seed` so that this test is reproducible ```typescript import { traced } from "braintrust"; async function generateComponent(input: string) { return traced( async (span) => { const response = await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages: generateMessages(input), seed: 101, }); const output = response.choices[0].message.content; span.log({ input, output }); return output; }, { name: "generateComponent", } ); } ``` ### Examples Let's look at a few examples! ```typescript await generateComponent("Do a reset password form inside a card."); ``` ```

<!-- generated HTML for a "Reset Password" card (markup not preserved here) -->

``` To make this easier to validate, we'll use [puppeteer](https://pptr.dev/) to render the HTML as a screenshot. ```typescript import puppeteer from "puppeteer"; import * as tslab from "tslab"; async function takeFullPageScreenshotAsUInt8Array(htmlContent) { const browser = await puppeteer.launch({ headless: "new" }); const page = await browser.newPage(); await page.setContent(htmlContent); const screenshotBuffer = await page.screenshot(); const uint8Array = new Uint8Array(screenshotBuffer); await browser.close(); return uint8Array; } async function displayComponent(input: string) { const html = await generateComponent(input); const img = await takeFullPageScreenshotAsUInt8Array(html); tslab.display.png(img); console.log(html); } await displayComponent("Do a reset password form inside a card."); ``` ![Cell 11](../assets/HTMLGenerator/_generated_11.png)
```

<!-- generated HTML for the "Reset Password" card, rendered in the screenshot above (markup not preserved here) -->

``` ```typescript await displayComponent("Create a profile page for a social network."); ``` ![Cell 8](../assets/HTMLGenerator/_generated_8.png)
```
<!-- generated HTML for the profile page, rendered above: a "Profile Picture" avatar, the name "John Doe", placeholder bio text, and stats of 500 Followers, 250 Following, 1000 Posts (markup not preserved here) -->
``` ```typescript await displayComponent( "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode." ); ``` ![Cell 10](../assets/HTMLGenerator/_generated_10.png)
```

<!-- generated HTML for the "Logs Viewer", rendered above, with entries "12:30 PM Info: Cloud instance created successfully", "12:45 PM Warning: High CPU utilization on instance #123", and "01:00 PM Error: Connection lost to the database server" (markup not preserved here) -->
``` ## Scoring the results It looks like in a few of these examples, the model is generating a full HTML page, instead of a component as we requested. This is something we can evaluate, to ensure that it does not happen! ```typescript const containsHTML = (s) => /<(html|body)>/i.test(s); containsHTML( await generateComponent( "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode." ) ); ``` ``` true ``` Now, let's update our function to compute this score. Let's also keep track of requests and their ids, so that we can provide user feedback. Normally you would store these in a database, but for demo purposes, a global dictionary should suffice. ```typescript // Normally you would store these in a database, but for this demo we'll just use a global variable. let requests = {}; async function generateComponent(input: string) { return traced( async (span) => { const response = await openai.chat.completions.create({ model: "gpt-3.5-turbo", messages: generateMessages(input), seed: 101, }); const output = response.choices[0].message.content; requests[input] = span.id; span.log({ input, output, scores: { isComponent: containsHTML(output) ? 0 : 1 }, }); return output; }, { name: "generateComponent", } ); } ``` ## Logging results To enable logging to Braintrust, we just need to initialize a logger. By default, a logger is automatically marked as the current, global logger, and once initialized will be picked up by `traced`. ```typescript import { initLogger } from "braintrust"; const logger = initLogger({ projectName: "Component generator", apiKey: process.env.BRAINTRUST_API_KEY || "Your BRAINTRUST_API_KEY", }); ``` Now, we'll run the `generateComponent` function on a few examples, and see what the results look like in Braintrust. ```typescript const inputs = [ "A login form for a B2B SaaS product.", "Create a profile page for a social network.", "Logs viewer for a cloud infrastructure management tool. 
Heavy use of dark mode.", ]; for (const input of inputs) { await generateComponent(input); } console.log(`Logged ${inputs.length} requests to Braintrust.`); ``` ``` Logged 3 requests to Braintrust. ``` ### Viewing the logs in Braintrust Once this runs, you should be able to see the raw inputs and outputs, along with their scores in the project. ![component\_generator\_logs.png](./../assets/HTMLGenerator/component-generator-logs.png) ### Capturing user feedback Let's also track user ratings for these components. Separate from whether or not they're formatted as HTML, it'll be useful to track whether users like the design. To do this, [configure a new score in the project](https://www.braintrust.dev/docs/guides/human-review#configuring-human-review). Let's call it "User preference" and make it a 👍/👎. ![Score configuration](./../assets/HTMLGenerator/score-config.png) Once you create a human review score, you can evaluate results directly in the Braintrust UI, or capture end-user feedback. Here, we'll pretend to capture end-user feedback. Personally, I liked the login form and logs viewer, but not the profile page. Let's record feedback accordingly. ```typescript // Along with scores, you can optionally log user feedback as comments, for additional color. logger.logFeedback({ id: requests["A login form for a B2B SaaS product."], scores: { "User preference": 1 }, comment: "Clean, simple", }); logger.logFeedback({ id: requests["Create a profile page for a social network."], scores: { "User preference": 0 }, }); logger.logFeedback({ id: requests[ "Logs viewer for a cloud infrastructure management tool. Heavy use of dark mode." ], scores: { "User preference": 1 }, comment: "No frills! Would have been nice to have borders around the entries.", }); ``` As users provide feedback, you'll see the updates they make in each log entry. 
![Feedback log](./../assets/HTMLGenerator/feedback-comments.png) ## Creating a dataset Now that we've collected some interesting examples from users, let's collect them into a dataset, and see if we can improve the `isComponent` score. In the Braintrust UI, select the examples, and add them to a new dataset called "Interesting cases". ![Interesting cases](./../assets/HTMLGenerator/create-new-dataset.png) Once you create the dataset, it should look something like this: ![Dataset](./../assets/HTMLGenerator/dataset.png) ## Evaluating Now that we have a dataset, let's evaluate the `isComponent` function on it. We'll use the `Eval` function, which takes a dataset and a function, and evaluates the function on each example in the dataset. ```typescript import { Eval, initDataset } from "braintrust"; await Eval("Component generator", { data: async () => { const dataset = initDataset("Component generator", { dataset: "Interesting cases", }); const records = []; for await (const { input } of dataset.fetch()) { records.push({ input }); } return records; }, task: generateComponent, // We do not need to add any additional scores, because our // generateComponent() function already computes `isComponent` scores: [], }); ``` Once the eval runs, you'll see a summary which includes a link to the experiment. As expected, only one of the three outputs contains HTML, so the score is 33.3%. Let's also label user preference for this experiment, so we can track aesthetic taste manually. For simplicity's sake, we'll use the same labeling as before. ![Initial experiment](./../assets/HTMLGenerator/initial-experiment.png) ### Improving the prompt Next, let's try to tweak the prompt to stop rendering full HTML pages. ```typescript function generateMessages(input: string): ChatCompletionMessageParam[] { return [ { role: "system", content: `You are a skilled design engineer who can convert ambiguously worded ideas into beautiful, crisp HTML and CSS. 
Your designs value simplicity, conciseness, clarity, and functionality over complexity. You generate pure HTML with inline CSS, so that your designs can be rendered directly as plain HTML. Only generate components, not full HTML pages. If you need to add CSS, you can use the "style" property of an HTML tag. You cannot use global CSS in a