Class: SkillBench::Evaluation::Runner

Inherits: Object
Defined in: lib/skill_bench/evaluation/runner.rb

Overview

Orchestrates the evaluation pipeline.

Coordinates blind judging of baseline and context agent outputs, then computes deltas and determines the final verdict.
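The overview above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `evaluate_blind` helper, the callable `judge`, the numeric per-criterion scores, and the verdict symbols are hypothetical and are not taken from SkillBench. It only shows one plausible way to judge two outputs without revealing which is which, then derive deltas and a verdict.

```ruby
# Hypothetical sketch: none of these names come from SkillBench itself.
# A "judge" here is any callable that takes an output string and returns
# a Hash of criterion name => numeric score.
def evaluate_blind(baseline_output:, context_output:, judge:, rng: Random.new)
  # Judge the two outputs in random order so the judge cannot infer
  # which one had access to the skill context.
  labelled = [[:baseline, baseline_output], [:context, context_output]].shuffle(random: rng)
  scores = labelled.to_h { |role, output| [role, judge.call(output)] }

  # Delta per criterion: positive means the context agent scored higher.
  deltas = scores[:context].to_h do |criterion, score|
    [criterion, score - scores[:baseline].fetch(criterion)]
  end

  total = deltas.values.sum
  verdict =
    if total.positive?
      :improved
    elsif total.negative?
      :regressed
    else
      :neutral
    end

  { scores: scores, deltas: deltas, verdict: verdict }
end

# Toy judge for illustration only: scores an output by its word count.
toy_judge = ->(output) { { "completeness" => output.split.size } }

report = evaluate_blind(
  baseline_output: "short answer",
  context_output: "a longer and more complete answer",
  judge: toy_judge
)
```

With the toy judge, `report[:deltas]` is `{ "completeness" => 4 }` and `report[:verdict]` is `:improved`. The real Runner may score, aggregate, and report quite differently.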

Class Method Summary

.call(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ Hash

Runs the evaluation pipeline.

Instance Method Summary

#call ⇒ Hash

Orchestrates judging and delta computation.

#initialize(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ Runner

A new instance of Runner.

Constructor Details

#initialize(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ `Runner`

Returns a new instance of Runner.

Parameters:

task (String) — The task description.
criteria (SkillBench::Criteria) — The eval criteria.
skill_context (String) — The skill context XML.
baseline_output (String) — The baseline agent output.
context_output (String) — The context agent output.
judge_params (Hash) (defaults to: {}) — Provider config passed to the Judge as client_params.

# File 'lib/skill_bench/evaluation/runner.rb', line 29

def initialize(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {})
  @task = task
  @criteria = criteria
  @skill_context = skill_context
  @baseline_output = baseline_output
  @context_output = context_output
  @judge_params = judge_params.is_a?(Hash) ? judge_params : {}
end
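The last assignment in the constructor silently replaces any non-Hash `judge_params` with an empty Hash. The stand-alone sketch below isolates that guard; `normalize_judge_params` is a hypothetical helper name used only for illustration, and `"example-model"` is a placeholder value.

```ruby
# Hypothetical helper that mirrors the guard used in Runner#initialize.
def normalize_judge_params(judge_params)
  # Anything that is not a Hash (nil, String, Array, ...) becomes {}.
  judge_params.is_a?(Hash) ? judge_params : {}
end

from_hash   = normalize_judge_params({ model: "example-model" })
from_nil    = normalize_judge_params(nil)
from_string = normalize_judge_params("not a hash")
```

Here `from_hash` is returned unchanged, while `from_nil` and `from_string` are both `{}`. One consequence of this design is that a malformed configuration does not raise at construction time, so a caller who passes the wrong type gets the Judge's defaults rather than an error.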

Class Method Details

.call(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ `Hash`

Runs the evaluation pipeline.

Parameters:

task (String) — The task description.
criteria (SkillBench::Criteria) — The eval criteria.
skill_context (String) — The skill context XML.
baseline_output (String) — The baseline agent output.
context_output (String) — The context agent output.
judge_params (Hash) (defaults to: {}) — Provider config passed to the Judge as client_params (api_key, model, provider).

Returns:

(Hash) — Service response with report or error.

# File 'lib/skill_bench/evaluation/runner.rb', line 19

def self.call(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {})
  new(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params:).call
end
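The class method above is a thin wrapper: it builds an instance and immediately invokes `#call`, using the keyword-argument shorthand available from Ruby 3.1 (`task:` is equivalent to `task: task`). The sketch below reproduces that pattern on a hypothetical `EchoService` class, which is a stand-in for illustration and not part of SkillBench.

```ruby
# Hypothetical stand-in that follows the same .call -> new(...).call pattern.
class EchoService
  def self.call(task:, judge_params: {})
    # Ruby 3.1+ shorthand: `task:` expands to `task: task`.
    new(task:, judge_params:).call
  end

  def initialize(task:, judge_params: {})
    @task = task
    @judge_params = judge_params
  end

  # Returns a Hash, as the Runner's #call is documented to do.
  def call
    { task: @task, judge_params: @judge_params }
  end
end

result = EchoService.call(task: "Summarize the README")
```

Callers get a one-line entry point and never hold a reference to the instance, which keeps per-run state confined to a single invocation.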

Instance Method Details

#call ⇒ `Hash`

Orchestrates judging and delta computation.