Module: Braintrust::Eval

Defined in:
lib/braintrust/eval.rb,
lib/braintrust/eval/case.rb,
lib/braintrust/eval/cases.rb,
lib/braintrust/eval/trace.rb,
lib/braintrust/eval/result.rb,
lib/braintrust/eval/runner.rb,
lib/braintrust/eval/scorer.rb,
lib/braintrust/eval/context.rb,
lib/braintrust/eval/summary.rb,
lib/braintrust/eval/evaluator.rb,
lib/braintrust/eval/formatter.rb,
lib/braintrust/eval/functions.rb

Overview

Evaluation framework for testing AI systems with custom test cases and scoring functions.

The Eval module provides tools for running systematic evaluations of your AI systems. An evaluation consists of:

  • Cases: Test inputs with optional expected outputs

  • Task: The code/model being evaluated (a Task or callable)

  • Scorers: Functions that judge the quality of outputs (String name, Scorer, or callable)

Tasks and scorers use keyword arguments. Only declare the keywords you need; extra kwargs are automatically filtered out.

When using multiple scorers, each must have a unique name: scores are keyed by name, so duplicates overwrite each other. Use Scorer.new("name") or a Scorer subclass to assign names. Anonymous lambdas default to "scorer".
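For example (a minimal sketch; the scorer names here are illustrative), giving each lambda a name keeps both scores in the result, and each scorer declares only the keywords it needs:

exact    = Braintrust::Scorer.new("exact") { |expected:, output:| output == expected ? 1.0 : 0.0 }
nonempty = Braintrust::Scorer.new("nonempty") { |output:| output.to_s.empty? ? 0.0 : 1.0 }

# Scores are keyed as "exact" and "nonempty" instead of colliding on "scorer"
scorers: [exact, nonempty]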

Examples:

Basic evaluation with inline cases

require "braintrust"

Braintrust.init

Braintrust::Eval.run(
  project: "my-project",
  experiment: "food-classifier",
  cases: [
    {input: "apple", expected: "fruit"},
    {input: "carrot", expected: "vegetable"},
    {input: "banana", expected: "fruit"}
  ],
  task: ->(input:) { input.include?("a") ? "fruit" : "vegetable" },
  scorers: [
    ->(expected:, output:) { output == expected ? 1.0 : 0.0 }
  ]
)

Different ways to define scorers

# String — references a scorer defined in your Braintrust project
scorers: ["accuracy-scorer", "relevance-scorer"]

# Lambda — declare only the kwargs you need (input:, expected:, output:, metadata:, tags:)
exact = ->(expected:, output:) { output == expected ? 1.0 : 0.0 }

# Named scorer with Scorer.new
named = Braintrust::Scorer.new("case_insensitive") { |expected:, output:| output.downcase == expected.downcase ? 1.0 : 0.0 }

# Class-based pattern (auto-derives name from class: "fuzzy_match")
class FuzzyMatch
  include Braintrust::Scorer
  def call(expected:, output:)
    # scoring logic here
    1.0
  end
end
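
# All of these forms can be mixed in one scorers: array (a sketch; it assumes
# class-based scorers are passed as instances so they respond to #call)
scorers: ["accuracy-scorer", exact, named, FuzzyMatch.new]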

Different ways to define tasks

# Lambda with keyword args
task = ->(input:) { process(input) }

# Named task with Task.new
task = Braintrust::Task.new("my_task") { |input:| process(input) }

# Class-based pattern
class MyTask
  include Braintrust::Task
  def call(input:)
    process(input)
  end
end

# Legacy lambdas (positional args) are also accepted for backwards compatibility
legacy_task = ->(input) { process(input) }

Using datasets instead of inline cases

Braintrust::Eval.run(
  project: "my-project",
  experiment: "with-dataset",
  dataset: "my-dataset-name", # fetches from same project
  task: ->(input:) { input.upcase },
  scorers: [->(expected:, output:) { output == expected ? 1.0 : 0.0 }]
)

# Or with more options
Braintrust::Eval.run(
  project: "my-project",
  experiment: "with-dataset-options",
  dataset: { name: "my-dataset", project: "other-project", version: "1.0", limit: 100 },
  task: ->(input:) { input.upcase },
  scorers: [->(expected:, output:) { output == expected ? 1.0 : 0.0 }]
)

Using parameters for configurable tasks

# Tasks and scorers that declare `parameters:` receive it automatically.
# Those that don't are unaffected — KeywordFilter strips unknown kwargs.
Braintrust::Eval.run(
  project: "my-project",
  experiment: "with-params",
  cases: [{input: "hello", expected: "HELLO!"}],
  task: ->(input:, parameters:) {
    suffix = parameters["suffix"] || ""
    input.upcase + suffix
  },
  scorers: [->(expected:, output:) { output == expected ? 1.0 : 0.0 }],
  parameters: {"suffix" => "!"}
)

Using metadata and tags

Braintrust::Eval.run(
  project: "my-project",
  experiment: "with-metadata",
  cases: [
    {
      input: "apple",
      expected: "fruit",
      tags: ["tropical", "sweet"],
      metadata: {threshold: 0.9, category: "produce"}
    }
  ],
  task: ->(input:) { "fruit" },
  scorers: [
    ->(expected:, output:, metadata:) {
      threshold = metadata[:threshold] || 0.5
      # scoring logic using threshold
      1.0
    }
  ],
  tags: ["v1", "production"],
  metadata: { model: "gpt-4", temperature: 0.7, version: "1.0.0" }
)
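
Tracking progress and running cases in parallel

# A sketch: parallelism > 1 runs cases on a thread pool (the task and scorers
# must be thread-safe), and on_progress receives a Hash after each case (see
# the on_progress parameter below for its contents).
Braintrust::Eval.run(
  project: "my-project",
  experiment: "with-progress",
  cases: [
    {input: "apple", expected: "fruit"},
    {input: "carrot", expected: "vegetable"}
  ],
  task: ->(input:) { input == "carrot" ? "vegetable" : "fruit" },
  scorers: [->(expected:, output:) { output == expected ? 1.0 : 0.0 }],
  parallelism: 2,
  on_progress: ->(event) { puts event.inspect }
)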

Defined Under Namespace

Modules: Formatter, Functions, Scorer
Classes: Case, Cases, Context, Evaluator, ExperimentSummary, Result, Runner, ScorerStats, Trace

Class Method Summary

Class Method Details

.run(task:, scorers:, project: nil, experiment: nil, cases: nil, dataset: nil, on_progress: nil, parallelism: 1, tags: nil, metadata: nil, update: false, quiet: false, state: nil, tracer_provider: nil, project_id: nil, parent: nil, parameters: nil) ⇒ Result

Run an evaluation

Parameters:

  • project (String, nil) (defaults to: nil)

    The project name (triggers full API mode: creates project + experiment)

  • experiment (String, nil) (defaults to: nil)

    The experiment name

  • cases (Array, Enumerable, nil) (defaults to: nil)

    The test cases (mutually exclusive with dataset)

  • dataset (String, Hash, nil) (defaults to: nil)

    Dataset to fetch (mutually exclusive with cases)

    • String: dataset name (fetches from same project)

    • Hash: id:, project:, version:, limit:

  • task (#call)

    The task to evaluate (must be callable)

  • scorers (Array<String, Scorer, #call>)

    The scorers to use (String names, Scorer objects, or callables)

  • on_progress (#call, nil) (defaults to: nil)

    Optional callback fired after each test case. Receives a Hash: "output" => output, "scores" => {name => value} on success, or "error" => message on failure.

  • parallelism (Integer) (defaults to: 1)

    Number of parallel workers (default: 1). When parallelism > 1, test cases are executed concurrently using a thread pool. The task and scorers MUST be thread-safe when using parallelism > 1.

  • tags (Array<String>) (defaults to: nil)

    Optional experiment tags

  • metadata (Hash) (defaults to: nil)

    Optional experiment metadata

  • update (Boolean) (defaults to: false)

    If true, allow reusing existing experiment (default: false)

  • quiet (Boolean) (defaults to: false)

    If true, suppress result output (default: false)

  • state (State, nil) (defaults to: nil)

    Braintrust state (defaults to global state)

  • tracer_provider (TracerProvider, nil) (defaults to: nil)

    OpenTelemetry tracer provider (defaults to global)

  • project_id (String, nil) (defaults to: nil)

    Project UUID (skips project creation when provided)

  • parent (Hash, nil) (defaults to: nil)

    Parent span context (object_id:, generation:)

  • parameters (Hash, nil) (defaults to: nil)

    Runtime parameters passed to task and scorers as a `parameters:` keyword argument

Returns:

  • (Result)

# File 'lib/braintrust/eval.rb', line 180

def run(task:, scorers:, project: nil, experiment: nil,
  cases: nil, dataset: nil, on_progress: nil,
  parallelism: 1, tags: nil, metadata: nil, update: false, quiet: false,
  state: nil, tracer_provider: nil, project_id: nil, parent: nil,
  parameters: nil)
  # Validate required parameters
  validate_params!(task: task, scorers: scorers, cases: cases, dataset: dataset)

  experiment_id = nil
  project_name = project

  # Full API mode: project name or project_id provided, resolve via API
  if project || project_id
    state ||= Braintrust.current_state
    state.

    if dataset
      resolved = resolve_dataset(dataset, project, state)
      cases = resolved[:cases]
    end

    # Skip experiment creation for remote evals (parent present).
    # The OTLP backend creates experiments from ingested spans.
    unless parent
      project_id, project_name = resolve_project(state, project, project_id)
      experiment_id = create_experiment(
        state, experiment, project_id,
        update: update, tags: tags, metadata: metadata,
        dataset_id: resolved&.dig(:dataset_id),
        dataset_version: resolved&.dig(:dataset_version)
      )
      parent = {object_type: "experiment_id", object_id: experiment_id}
    end
  end

  # Build normalized context and run
  context = Context.build(
    task: task,
    scorers: scorers,
    cases: cases,
    experiment_id: experiment_id,
    experiment_name: experiment,
    project_id: project_id,
    project_name: project_name,
    state: state,
    tracer_provider: tracer_provider,
    on_progress: on_progress,
    parent: parent,
    parameters: parameters
  )
  result = Runner.new(context).run(parallelism: parallelism)

  # Print result summary unless quiet
  print_result(result) unless quiet

  result
end

.scorer(name, callable = nil, &block) ⇒ Object

Deprecated.

Use Scorer.new instead
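
For example, an existing call can be rewritten as follows (a sketch):

# Before (deprecated)
exact = Braintrust::Eval.scorer("exact") { |expected:, output:| output == expected ? 1.0 : 0.0 }

# After
exact = Braintrust::Scorer.new("exact") { |expected:, output:| output == expected ? 1.0 : 0.0 }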



# File 'lib/braintrust/eval.rb', line 149

def scorer(name, callable = nil, &block)
  Log.warn_once(:eval_scorer, "Braintrust::Eval.scorer is deprecated: use Braintrust::Scorer.new instead.")
  block = callable.method(:call) if callable && !block
  Scorer.new(name, &block)
end