Class: SkillBench::Evaluation::Runner

Inherits:
Object
  • Object
show all
Defined in:
lib/skill_bench/evaluation/runner.rb

Overview

Orchestrates the evaluation pipeline.

Coordinates blind judging of baseline and context agent outputs, then computes deltas and determines the final verdict.

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ Runner

Returns a new instance of Runner.

Parameters:

  • task (String)

    The task description.

  • criteria (SkillBench::Criteria)

    The eval criteria.

  • skill_context (String)

    The skill context XML.

  • baseline_output (String)

    The baseline agent output.

  • context_output (String)

    The context agent output.

  • judge_params (Hash) (defaults to: {})

    Provider config passed to the Judge as client_params.



29
30
31
32
33
34
35
36
# File 'lib/skill_bench/evaluation/runner.rb', line 29

def initialize(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {})
  @task = task
  @criteria = criteria
  @skill_context = skill_context
  @baseline_output = baseline_output
  @context_output = context_output
  @judge_params = judge_params.is_a?(Hash) ? judge_params : {}
end

Class Method Details

.call(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {}) ⇒ Hash

Runs the evaluation pipeline.

Parameters:

  • task (String)

    The task description.

  • criteria (SkillBench::Criteria)

    The eval criteria.

  • skill_context (String)

    The skill context XML.

  • baseline_output (String)

    The baseline agent output.

  • context_output (String)

    The context agent output.

  • judge_params (Hash) (defaults to: {})

    Provider config passed to the Judge as client_params (api_key, model, provider).

Returns:

  • (Hash)

    Service response with report or error.



19
20
21
# File 'lib/skill_bench/evaluation/runner.rb', line 19

def self.call(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params: {})
  new(task:, criteria:, skill_context:, baseline_output:, context_output:, judge_params:).call
end

Instance Method Details

#callHash

Orchestrates judging and delta computation.

Returns:

  • (Hash)

    Service response with report or error.



41
42
43
44
45
46
47
48
49
50
51
52
# File 'lib/skill_bench/evaluation/runner.rb', line 41

def call
  baseline_judge = judge_run(baseline_output, nil)
  return baseline_judge unless baseline_judge[:success]

  context_judge = judge_run(context_output, skill_context)
  return context_judge unless context_judge[:success]

  compute_deltas(baseline_judge, context_judge)
rescue StandardError => e
  SkillBench::ErrorLogger.log_error(e, 'Evaluation::Runner Error')
  { success: false, response: { error: { message: e.message } } }
end