Class: LlmConductor::Eval::Judge
- Inherits: Object
  - Object
  - LlmConductor::Eval::Judge
- Defined in: lib/llm_conductor/eval/judge.rb
Overview
LLM-as-judge for one candidate (input, model) output.
Sends the judge model the original input data, the spec’s rubric excerpt, and the candidate’s parsed output (or the raw text if parsing failed), and expects strict JSON back containing a quality_score plus per-dimension scores.
The judge defaults to Groq’s llama-3.3-70b-versatile for two reasons. First, it sits outside the Gemini/OpenAI/Ollama families that dominate most candidate lists, which avoids self-judge bias (Gemini grades its own output about 10 points high). Second, Groq’s free tier offers far more throughput than Gemini Pro’s roughly 2 RPM. Override the default via the judge: config. The default requires Groq credentials to be configured. Rows where the judged model is the same as the judge model are flagged self_judge in the report.
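The self_judge flag described above can be pictured with a small sketch. The helper `self_judge?`, the constant `DEFAULT_JUDGE_MODEL`, and the row Hashes are hypothetical stand-ins for illustration; the gem's actual report code is not shown on this page.

```ruby
# Hypothetical helper, not part of LlmConductor: marks a report row when the
# candidate model is the same model that is doing the judging.
DEFAULT_JUDGE_MODEL = 'llama-3.3-70b-versatile' # mirrors Judge::DEFAULT_MODEL

def self_judge?(candidate_model, judge_model = DEFAULT_JUDGE_MODEL)
  # Compare as strings so Symbols and Strings are treated alike.
  candidate_model.to_s == judge_model.to_s
end

# Illustrative rows only; the real report row shape is an assumption.
rows = [
  { model: 'candidate-model-a' },
  { model: 'llama-3.3-70b-versatile' }
]

flagged = rows.map { |row| row.merge(self_judge: self_judge?(row[:model])) }
flagged.each { |row| puts "#{row[:model]}: self_judge=#{row[:self_judge]}" }
```

A flagged row is still scored; the flag only signals that its score may be inflated by the bias described above.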
Constant Summary
- DEFAULT_MODEL = 'llama-3.3-70b-versatile'
- DEFAULT_VENDOR = :groq
Class Method Summary
- .borderline?(score) ⇒ Boolean
Instance Method Summary
- #initialize(spec:, store:, run_id:, logger:, judge_model: DEFAULT_MODEL, judge_vendor: DEFAULT_VENDOR, rate_limit_retries: 3, rate_limit_backoff_seconds: 20) ⇒ Judge
  constructor
  A new instance of Judge.
- #judge(model_result:, input_data:) ⇒ Object
  model_result is an Eval::Result.
Constructor Details
#initialize(spec:, store:, run_id:, logger:, judge_model: DEFAULT_MODEL, judge_vendor: DEFAULT_VENDOR, rate_limit_retries: 3, rate_limit_backoff_seconds: 20) ⇒ Judge
Returns a new instance of Judge.
# File 'lib/llm_conductor/eval/judge.rb', line 30

def initialize(spec:, store:, run_id:, logger:, judge_model: DEFAULT_MODEL, judge_vendor: DEFAULT_VENDOR,
               rate_limit_retries: 3, rate_limit_backoff_seconds: 20)
  @spec = spec
  @store = store
  @run_id = run_id
  @logger = logger
  @judge_model = judge_model
  @judge_vendor = judge_vendor.to_sym
  @rate_limit_retries = rate_limit_retries
  @rate_limit_backoff_seconds = rate_limit_backoff_seconds
end
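The rate_limit_retries and rate_limit_backoff_seconds options suggest a retry loop around the judge call. The following is a minimal sketch of one plausible shape, assuming a fixed wait between attempts. `RateLimitError`, `with_rate_limit_retry`, and the injectable `sleeper` are hypothetical names, and the gem's private call_with_rate_limit_retry may behave differently.

```ruby
# Hypothetical error class standing in for whatever the vendor client raises
# when rate limited; the real error type is not shown on this page.
class RateLimitError < StandardError; end

# Runs the block, retrying up to `retries` additional times on RateLimitError
# and waiting a fixed `backoff_seconds` between attempts. `sleeper` is
# injectable so the sketch can be exercised without actually sleeping.
def with_rate_limit_retry(retries: 3, backoff_seconds: 20, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RateLimitError
    raise if attempts > retries # retries exhausted: surface the error

    sleeper.call(backoff_seconds)
    retry
  end
end

# Usage: the first two calls are rate limited, the third succeeds.
waits = []
calls = 0
result = with_rate_limit_retry(retries: 3, backoff_seconds: 20, sleeper: ->(s) { waits << s }) do
  calls += 1
  raise RateLimitError, 'rate limited' if calls < 3

  :ok
end
puts "result=#{result} calls=#{calls} waits=#{waits.inspect}"
```

With the defaults of 3 retries and 20 seconds, a loop of this shape would wait at most 60 seconds in total before giving up on a single judge call.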
Class Method Details
.borderline?(score) ⇒ Boolean
# File 'lib/llm_conductor/eval/judge.rb', line 26

def self.borderline?(score)
  Verdict.borderline?(score)
end
Instance Method Details
#judge(model_result:, input_data:) ⇒ Object
model_result is an Eval::Result. input_data is the spec’s data Hash for the input being judged.
# File 'lib/llm_conductor/eval/judge.rb', line 45

def judge(model_result:, input_data:)
  prompt = build_prompt(model_result:, input_data:)
  response, latency_ms = call_with_rate_limit_retry(prompt)

  unless response&.success?
    error = response&.metadata&.dig(:error) || 'judge LLM call failed'
    return failure_verdict(latency_ms:, response:, error:)
  end

  parsed = JsonParser.parse(response.output)
  if parsed.nil?
    return failure_verdict(latency_ms:, response:,
                           error: "judge output not valid JSON: #{response.output.to_s[0, 200]}")
  end

  build_verdict(parsed:, latency_ms:, response:)
rescue StandardError => e
  @logger.error("[Eval::Judge] #{@judge_model}: #{e.class}: #{e.message}")
  Verdict.new(judge_model: @judge_model, judge_error: "#{e.class}: #{e.message}")
end
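The JSON handling in #judge can be illustrated in isolation: parse the judge's reply, and on failure record an error that quotes at most the first 200 characters of the raw output. This is a sketch using Ruby's standard json library. `parse_judge_output` and `verdict_from` are hypothetical stand-ins for JsonParser.parse and the verdict builders, and the "dimensions" key is an assumption (the source names only quality_score).

```ruby
require 'json'

# Hypothetical stand-in for JsonParser.parse: returns a Hash, or nil when the
# text is not a JSON object.
def parse_judge_output(text)
  parsed = JSON.parse(text.to_s)
  parsed.is_a?(Hash) ? parsed : nil
rescue JSON::ParserError
  nil
end

# Builds a plain Hash in place of a Verdict. Only quality_score comes from
# the source; the "dimensions" key is assumed for illustration.
def verdict_from(text)
  parsed = parse_judge_output(text)
  if parsed.nil?
    # Quote at most 200 characters of the raw output, as #judge does.
    return { judge_error: "judge output not valid JSON: #{text.to_s[0, 200]}" }
  end

  { quality_score: parsed['quality_score'], dimensions: parsed['dimensions'] || {} }
end

good = verdict_from('{"quality_score": 82, "dimensions": {"accuracy": 90}}')
bad  = verdict_from('Sure! Here is my assessment...')
puts good.inspect
puts bad.inspect
```

Truncating the quoted output keeps a chatty, non-JSON reply from flooding the logs and the report while still showing enough to diagnose the failure.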