Module: LlmConductor::Eval

Defined in:
lib/llm_conductor/eval.rb,
lib/llm_conductor/eval/spec.rb,
lib/llm_conductor/eval/judge.rb,
lib/llm_conductor/eval/report.rb,
lib/llm_conductor/eval/result.rb,
lib/llm_conductor/eval/runner.rb,
lib/llm_conductor/eval/verdict.rb,
lib/llm_conductor/eval/store/base.rb,
lib/llm_conductor/eval/json_parser.rb,
lib/llm_conductor/eval/model_runner.rb,
lib/llm_conductor/eval/report_builder.rb,
lib/llm_conductor/eval/store/in_memory.rb,
lib/llm_conductor/eval/store/file_store.rb

Overview

Opt-in model-evaluation harness. Load it with `require 'llm_conductor/eval'`; users of the core `require 'llm_conductor'` pay nothing.

Runs the same prompt across N (model, vendor) pairs over M caller-supplied inputs, then compares them on cost, latency, tokens, and LLM-judged quality. The engine is feature-agnostic; everything feature-specific lives in a Spec.

require 'llm_conductor/eval'

report = LlmConductor::Eval.run(
  spec:   MyFeatureSpec.new,
  inputs: my_inputs,                       # any enumerable; engine never selects/queries
  models: [{ model: 'gpt-4o-mini', vendor: :openai },
           { model: 'gemini-2.5-flash', vendor: :gemini }],
  judge:  { model: 'llama-3.3-70b-versatile', vendor: :groq }
)
report.summary       # per-model aggregates
report.to_markdown   # decision-aid report (caller persists)
report.to_csv        # per-row data
report.needs_review  # rows flagged for human review

Defined Under Namespace

Modules: JsonParser, Store

Classes: Judge, ModelRunner, Report, ReportBuilder, Result, Runner, Spec, Verdict

Constant Summary

BORDERLINE_RANGE = (50..70)

Scores in this range are “borderline”: the judge is uncertain enough that the row is flagged for human review. Tuned in the Rails prototype.
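
As a minimal sketch of how a range like this could be used to flag rows, the snippet below redefines the constant locally and tests scores with Range#cover?. The helper name borderline? is hypothetical and not part of LlmConductor::Eval; the harness's actual flagging logic may differ.

```ruby
# Local copy of the documented constant so the sketch runs without the gem.
BORDERLINE_RANGE = (50..70)

# Hypothetical helper: true when a judge score falls inside the range.
# Both endpoints are included because the range is written with `..`.
def borderline?(score)
  BORDERLINE_RANGE.cover?(score)
end

scores  = [49, 50, 65.5, 70, 71]
flagged = scores.select { |score| borderline?(score) }
# flagged contains 50, 65.5 and 70; 49 and 71 fall outside the range
```

Range#cover? compares against the endpoints only, so it also accepts fractional scores, should the judge return them.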

Class Method Summary

Class Method Details

.default_logger ⇒ Object



# File 'lib/llm_conductor/eval.rb', line 67

def default_logger
  LlmConductor.configuration.logger || Logger.new($stdout)
end
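
The fallback above can be exercised in isolation. This is a sketch under stated assumptions: SketchConfig is a hypothetical stand-in for LlmConductor.configuration, and resolve_logger is an illustrative name, not a library method.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in for LlmConductor.configuration; only the
# `logger` attribute matters for this sketch.
SketchConfig = Struct.new(:logger)

# Mirrors the documented fallback: use the configured logger when one is
# set, otherwise create a logger that writes to standard output.
def resolve_logger(config)
  config.logger || Logger.new($stdout)
end

custom = Logger.new(StringIO.new)
resolve_logger(SketchConfig.new(custom)) # returns the configured logger
resolve_logger(SketchConfig.new(nil))    # returns a new $stdout logger
```

Note that the fallback branch builds a new Logger on each call when no logger is configured.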

.generate_run_id ⇒ Object



# File 'lib/llm_conductor/eval.rb', line 71

def generate_run_id
  "run_#{Time.now.utc.strftime('%Y%m%d_%H%M%S')}"
end
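
To make the id format concrete, here is a small sketch that uses the same strftime pattern but takes the clock as a parameter so the result is reproducible. The method name sketch_run_id and its now parameter are additions for illustration only.

```ruby
# Same format string as the documented method; `now` is injected so the
# output does not depend on the wall clock.
def sketch_run_id(now = Time.now.utc)
  "run_#{now.strftime('%Y%m%d_%H%M%S')}"
end

t = Time.utc(2024, 1, 2, 3, 4, 5)
sketch_run_id(t)       # => "run_20240102_030405"
sketch_run_id(t + 0.5) # same second, so the same id
```

Because the format has one-second resolution, two runs started within the same second would receive the same generated id. Callers for whom that matters can pass their own run_id to .run.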

.judge_only(run_id:, spec:, store:, judge: {}, logger: nil) ⇒ Object

Re-judges stored candidate outputs without calling the candidate models again.



# File 'lib/llm_conductor/eval.rb', line 58

def judge_only(run_id:, spec:, store:, judge: {}, logger: nil)
  Runner.judge_only(run_id:, spec:, store:, judge:, logger: logger || default_logger)
end

.report_only(run_id:, spec:, store:) ⇒ Object

Rebuilds the Report from a stored manifest without making any model or judge calls.



# File 'lib/llm_conductor/eval.rb', line 63

def report_only(run_id:, spec:, store:)
  Runner.report_only(run_id:, spec:, store:)
end

.run(spec:, inputs:, models:, judge: {}, store: nil, logger: nil, run_id: nil) ⇒ Object

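The single entrypoint. spec implements Eval::Spec; inputs is any enumerable of opaque objects that the spec knows how to interpret; models is the caller-owned list of { model:, vendor: } candidate pairs. When omitted, store defaults to a new Store::InMemory, logger to default_logger, and run_id to the value of generate_run_id.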



# File 'lib/llm_conductor/eval.rb', line 48

def run(spec:, inputs:, models:, judge: {}, store: nil, logger: nil, run_id: nil)
  Runner.new(
    spec:, inputs:, models:, judge:,
    store: store || Store::InMemory.new,
    logger: logger || default_logger,
    run_id: run_id || generate_run_id
  ).run
end
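
The defaulting style used here (keywords default to nil and the real default is applied with `||` in the body) can be sketched on its own. SketchStore and resolve_options are hypothetical names standing in for Store::InMemory and the body of .run; the literal 'run_generated' stands in for a generated id.

```ruby
# Hypothetical stand-in for Store::InMemory.
class SketchStore; end

# Keyword defaults are nil; the effective defaults are applied with `||`,
# so an explicit nil from the caller is treated the same as omission.
def resolve_options(store: nil, run_id: nil)
  { store: store || SketchStore.new, run_id: run_id || 'run_generated' }
end

resolve_options                      # both defaults applied
resolve_options(run_id: 'run_fixed') # caller's id kept, default store
resolve_options(store: nil)          # explicit nil still gets the default
```

One consequence of this style is that a caller can forward possibly-nil options (for example from its own configuration) without first checking whether each one is set.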