Class: LlmConductor::Eval::Judge

Inherits:
Object
  • Object
Defined in:
lib/llm_conductor/eval/judge.rb

Overview

LLM-as-judge for one candidate (input, model) output.

Sends the judge model the original input data, the spec’s rubric excerpt, and the candidate’s parsed output (or its raw text on parse failure), and expects strict JSON back containing a quality_score plus per-dimension scores.

The judge defaults to Groq’s llama-3.3-70b-versatile for two reasons. First, it sits outside the Gemini/OpenAI/Ollama families that dominate most candidate lists, which avoids self-judge bias (Gemini grades its own output ~10 points high). Second, Groq’s free tier offers far more throughput than Gemini Pro’s ~2 RPM. Override the default via the judge: config. The default requires Groq credentials to be configured. Rows where the judged model is the same as the judge model are flagged self_judge in the report.
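As a rough illustration of the reply shape and the self_judge flag described above, here is a minimal sketch. Only quality_score and self_judge are names taken from this page; the `dimensions` key, the dimension names, the 0-100 scale, the helper names, and the candidate model name are assumptions made for the example, not the library's actual schema.

```ruby
require 'json'

# Hypothetical judge reply. Only `quality_score` is named in the overview;
# the `dimensions` key, its entries, and the 0-100 scale are assumptions.
SAMPLE_REPLY = '{"quality_score": 82, "dimensions": {"accuracy": 85, "completeness": 78}}'

# Returns the parsed Hash, or nil when the text is not a JSON object.
def parse_judge_reply(text)
  parsed = JSON.parse(text)
  parsed.is_a?(Hash) ? parsed : nil
rescue JSON::ParserError
  nil
end

# A row counts as self-judged when the judged model is the judge model.
def self_judge?(candidate_model, judge_model)
  candidate_model == judge_model
end

reply = parse_judge_reply(SAMPLE_REPLY)
puts reply['quality_score']
puts self_judge?('llama-3.3-70b-versatile', 'llama-3.3-70b-versatile')
puts self_judge?('candidate-model-a', 'llama-3.3-70b-versatile')
```

A report built this way can keep self-judged rows visible while letting readers discount them, rather than silently dropping them.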

Constant Summary collapse

DEFAULT_MODEL =
'llama-3.3-70b-versatile'
DEFAULT_VENDOR =
:groq

Constructor Details

#initialize(spec:, store:, run_id:, logger:, judge_model: DEFAULT_MODEL, judge_vendor: DEFAULT_VENDOR, rate_limit_retries: 3, rate_limit_backoff_seconds: 20) ⇒ Judge

Returns a new instance of Judge.

# File 'lib/llm_conductor/eval/judge.rb', line 30

def initialize(spec:, store:, run_id:, logger:, judge_model: DEFAULT_MODEL,
               judge_vendor: DEFAULT_VENDOR, rate_limit_retries: 3,
               rate_limit_backoff_seconds: 20)
  @spec = spec
  @store = store
  @run_id = run_id
  @logger = logger
  @judge_model = judge_model
  @judge_vendor = judge_vendor.to_sym
  @rate_limit_retries = rate_limit_retries
  @rate_limit_backoff_seconds = rate_limit_backoff_seconds
end
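The rate_limit_retries and rate_limit_backoff_seconds arguments suggest a retry loop with a fixed pause between attempts. The private helper that uses them is not shown on this page, so the following is only a sketch of one plausible reading: the RateLimited error class, the with_rate_limit_retry helper, and the injectable sleeper are all hypothetical, and the real implementation may detect rate limits differently (for example, by inspecting the response rather than rescuing an exception).

```ruby
# Hypothetical error standing in for whatever signals a rate limit (e.g. HTTP 429).
class RateLimited < StandardError; end

# Calls the block, retrying up to `retries` extra times on RateLimited and
# pausing `backoff_seconds` between attempts. `sleeper` is injectable so the
# sketch can run without actually waiting.
def with_rate_limit_retry(retries: 3, backoff_seconds: 20, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue RateLimited
    raise if attempts > retries # retries exhausted: surface the error

    sleeper.call(backoff_seconds)
    retry
  end
end

waits = []
result = with_rate_limit_retry(sleeper: ->(s) { waits << s }) do |attempt|
  raise RateLimited, 'slow down' if attempt < 3

  "ok on attempt #{attempt}"
end
puts result
puts waits.inspect
```

With the defaults shown in the constructor signature, this reading would make at most four calls in total (one initial attempt plus three retries).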

Class Method Details

.borderline?(score) ⇒ Boolean

Returns:

  • (Boolean)

# File 'lib/llm_conductor/eval/judge.rb', line 26

def self.borderline?(score)
  Verdict.borderline?(score)
end

Instance Method Details

#judge(model_result:, input_data:) ⇒ Object

model_result is an Eval::Result. input_data is the spec’s data Hash for the input being judged.

# File 'lib/llm_conductor/eval/judge.rb', line 45

def judge(model_result:, input_data:)
  prompt = build_prompt(model_result:, input_data:)
  response, latency_ms = call_with_rate_limit_retry(prompt)

  unless response&.success?
    error = response&.&.dig(:error) || 'judge LLM call failed'
    return failure_verdict(latency_ms:, response:, error:)
  end

  parsed = JsonParser.parse(response.output)
  if parsed.nil?
    return failure_verdict(latency_ms:, response:,
                           error: "judge output not valid JSON: #{response.output.to_s[0, 200]}")
  end

  build_verdict(parsed:, latency_ms:, response:)
rescue StandardError => e
  @logger.error("[Eval::Judge] #{@judge_model}: #{e.class}: #{e.message}")
  Verdict.new(judge_model: @judge_model, judge_error: "#{e.class}: #{e.message}")
end
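The listing above has three failure exits (the call failed, the output was not JSON, or an exception was raised) and one success exit. The sketch below reproduces that control flow with stand-in objects so it can be run in isolation. FakeResponse, FakeVerdict, and sketch_judge are hypothetical, JSON.parse from the standard library stands in for JsonParser, and the real Verdict likely carries more fields than shown here.

```ruby
require 'json'

# Stand-ins for the real response and verdict objects (hypothetical names).
FakeResponse = Struct.new(:output, :ok, keyword_init: true) do
  def success?
    ok
  end
end
FakeVerdict = Struct.new(:quality_score, :judge_error, keyword_init: true)

def sketch_judge(response)
  # Exit 1: no response, or the call itself failed.
  return FakeVerdict.new(judge_error: 'judge LLM call failed') unless response&.success?

  parsed = begin
    JSON.parse(response.output)
  rescue JSON::ParserError
    nil
  end

  # Exit 2: the judge replied, but not with a JSON object.
  unless parsed.is_a?(Hash)
    snippet = response.output.to_s[0, 200]
    return FakeVerdict.new(judge_error: "judge output not valid JSON: #{snippet}")
  end

  # Success: carry the score into the verdict.
  FakeVerdict.new(quality_score: parsed['quality_score'])
rescue StandardError => e
  # Exit 3: anything unexpected still yields a verdict instead of raising.
  FakeVerdict.new(judge_error: "#{e.class}: #{e.message}")
end

good = sketch_judge(FakeResponse.new(output: '{"quality_score": 90}', ok: true))
chatty = sketch_judge(FakeResponse.new(output: 'Sure! Here is my review', ok: true))
down = sketch_judge(nil)
puts good.quality_score
puts chatty.judge_error
puts down.judge_error
```

The point of the pattern is that callers always receive a verdict object, so a single flaky judge call appears as a judge_error entry rather than aborting the whole evaluation run.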