Module: LlmConductor::Eval
Defined in:
lib/llm_conductor/eval.rb,
lib/llm_conductor/eval/spec.rb,
lib/llm_conductor/eval/judge.rb,
lib/llm_conductor/eval/report.rb,
lib/llm_conductor/eval/result.rb,
lib/llm_conductor/eval/runner.rb,
lib/llm_conductor/eval/verdict.rb,
lib/llm_conductor/eval/store/base.rb,
lib/llm_conductor/eval/json_parser.rb,
lib/llm_conductor/eval/model_runner.rb,
lib/llm_conductor/eval/report_builder.rb,
lib/llm_conductor/eval/store/in_memory.rb,
lib/llm_conductor/eval/store/file_store.rb
Overview
Opt-in model-evaluation harness. `require 'llm_conductor/eval'` to load it; core `require 'llm_conductor'` users pay nothing.
Runs the same prompt across N (model, vendor) pairs over M caller-supplied inputs, then compares them on cost, latency, tokens, and LLM-judged quality. The engine is feature-agnostic; everything feature-specific lives in a Spec.
require 'llm_conductor/eval'

report = LlmConductor::Eval.run(
  spec: MyFeatureSpec.new,
  inputs: my_inputs, # any enumerable; engine never selects/queries
  models: [{ model: 'gpt-4o-mini', vendor: :openai },
           { model: 'gemini-2.5-flash', vendor: :gemini }],
  judge: { model: 'llama-3.3-70b-versatile', vendor: :groq }
)

report.summary      # per-model aggregates
report.to_markdown  # decision-aid report (caller persists)
report.to_csv       # per-row data
report.needs_review # rows flagged for human eyeball
Defined Under Namespace
Modules: JsonParser, Store
Classes: Judge, ModelRunner, Report, ReportBuilder, Result, Runner, Spec, Verdict
Constant Summary
- BORDERLINE_RANGE = (50..70)
Scores in this range are “borderline”: the judge is uncertain enough that the row is flagged for human review. Tuned in the Rails prototype.
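As a rough illustration of how a range like this can gate human review, the sketch below checks a few judge scores against the same bounds. The constant is redefined locally so the snippet runs without the gem, `needs_review?` is a hypothetical helper rather than part of the library, and the 0 to 100 score scale is an assumption.

```ruby
# Local copy of the documented range so the sketch runs without the gem.
BORDERLINE_RANGE = (50..70)

# Hypothetical helper: true when a judge score falls inside the borderline band.
# Range#cover? is inclusive at both ends and accepts non-integer scores.
def needs_review?(score)
  BORDERLINE_RANGE.cover?(score)
end

scores = [49, 50, 62.5, 70, 71]
flagged = scores.select { |score| needs_review?(score) }
puts flagged.inspect # => [50, 62.5, 70]
```

Both endpoints count as borderline because the range is written with two dots, not three.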
Class Method Summary
- .default_logger ⇒ Object
- .generate_run_id ⇒ Object
- .judge_only(run_id:, spec:, store:, judge: {}, logger: nil) ⇒ Object
Re-judge stored candidate outputs without calling the candidate models again.
- .report_only(run_id:, spec:, store:) ⇒ Object
Rebuild the Report from a stored manifest, with no model or judge calls.
- .run(spec:, inputs:, models:, judge: {}, store: nil, logger: nil, run_id: nil) ⇒ Object
The single entrypoint.
Class Method Details
.default_logger ⇒ Object
# File 'lib/llm_conductor/eval.rb', line 67
def default_logger
  LlmConductor.configuration.logger || Logger.new($stdout)
end
.generate_run_id ⇒ Object
# File 'lib/llm_conductor/eval.rb', line 71
def generate_run_id
  "run_#{Time.now.utc.strftime('%Y%m%d_%H%M%S')}"
end
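To make the generated ID format concrete, the sketch below applies the same format string to fixed timestamps. `run_id_for` is a hypothetical helper written for this illustration; only the format string comes from the source above.

```ruby
# Hypothetical helper: same format string as generate_run_id, but the time is
# passed in so the result is reproducible.
def run_id_for(time)
  "run_#{time.getutc.strftime('%Y%m%d_%H%M%S')}"
end

first  = run_id_for(Time.utc(2025, 1, 2, 3, 4, 5))
# Same wall-clock second, half a second later (7th argument is microseconds).
second = run_id_for(Time.utc(2025, 1, 2, 3, 4, 5, 500_000))

puts first           # => run_20250102_030405
puts first == second # => true
```

Because the format stops at whole seconds, two runs started within the same second produce the same ID. Where that matters, for example with a shared store, an explicit `run_id:` can be passed to `.run` instead.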
.judge_only(run_id:, spec:, store:, judge: {}, logger: nil) ⇒ Object
Re-judge stored candidate outputs without calling the candidate models again.
# File 'lib/llm_conductor/eval.rb', line 58
def judge_only(run_id:, spec:, store:, judge: {}, logger: nil)
  Runner.judge_only(run_id:, spec:, store:, judge:, logger: logger || default_logger)
end
.report_only(run_id:, spec:, store:) ⇒ Object
Rebuild the Report from a stored manifest, with no model or judge calls.
# File 'lib/llm_conductor/eval.rb', line 63
def report_only(run_id:, spec:, store:)
  Runner.report_only(run_id:, spec:, store:)
end
.run(spec:, inputs:, models:, judge: {}, store: nil, logger: nil, run_id: nil) ⇒ Object
The single entrypoint. `spec` implements `Eval::Spec`; `inputs` is any enumerable of opaque objects the spec knows how to interpret; `models` is the caller-owned list of `{ model:, vendor: }` candidate pairs.
# File 'lib/llm_conductor/eval.rb', line 48
def run(spec:, inputs:, models:, judge: {}, store: nil, logger: nil, run_id: nil)
  Runner.new(
    spec:, inputs:, models:, judge:,
    store: store || Store::InMemory.new,
    logger: logger || default_logger,
    run_id: run_id || generate_run_id
  ).run
end
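The `value || default` fallbacks in `.run` mean that omitting an option and passing `nil` explicitly behave the same way. The sketch below mirrors that pattern in isolation; `InMemoryStore` and `resolve_options` are hypothetical stand-ins defined here so the snippet runs without the gem, not the library's own classes or methods.

```ruby
require 'logger'

# Hypothetical stand-in for Store::InMemory so the sketch is self-contained.
class InMemoryStore; end

# Mirrors the fallback pattern shown in .run: any nil option gets a default.
def resolve_options(store: nil, logger: nil, run_id: nil)
  {
    store: store || InMemoryStore.new,
    logger: logger || Logger.new($stdout),
    run_id: run_id || "run_#{Time.now.utc.strftime('%Y%m%d_%H%M%S')}"
  }
end

defaults = resolve_options
custom   = resolve_options(store: nil, run_id: 'nightly_001')

puts defaults[:store].class # => InMemoryStore
puts custom[:run_id]        # => nightly_001
```

One consequence of using `||` is that an explicit `false` would also be replaced by the default, which is harmless for these options since none of them is a boolean.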