Class: Phronomy::Eval::Scorer::LlmJudge
- Defined in: lib/phronomy/eval/scorer/llm_judge.rb
Overview
LLM-as-a-Judge scorer. Sends a structured prompt to an LLM and interprets its numeric reply as a quality score in [0.0, 1.0].
The prompt template accepts three named placeholders:
%<input>s: the original input question
%<expected>s: the expected answer
%<actual>s: the actual response being judged
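The LLM is expected to reply with a single decimal number. The scorer takes the first number it finds in the reply, ignores any surrounding text, and clamps the value to [0.0, 1.0]. If no number can be parsed, the score is 0.0. If the LLM call itself fails, the scorer logs a warning and returns 0.0 rather than raising, unless raise_on_error is true.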
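The parsing rule above can be tried in isolation. The following is a minimal sketch, assuming the same regular expression that appears in the #score source; parse_score is a hypothetical helper name used only for illustration and is not part of the library.

```ruby
# Hypothetical helper (not part of Phronomy) mirroring the parsing step of #score.
def parse_score(reply)
  # Take the first number-like token in the reply. When nothing matches,
  # scan returns [], first is nil, and nil.to_f is 0.0.
  reply.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
end

parse_score("0.85")             # plain decimal reply
parse_score("Score: 0.7 (ok)")  # surrounding text is ignored
parse_score("7")                # out of range, clamped to the upper bound
parse_score("-0.3")             # negative, clamped to the lower bound
parse_score("no idea")          # no number at all, falls back to 0.0
```

Note that only the first number counts: a reply such as "I would say 3 out of 10, so 0.3" would yield 1.0 after clamping, which is one reason the default prompt asks for a bare number.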
Constant Summary
- DEFAULT_PROMPT =
<<~PROMPT
  You are an impartial judge evaluating the quality of an AI assistant response.
  Rate the response on a scale from 0.0 (completely wrong or unhelpful) to 1.0 (perfect).
  Respond with ONLY a single decimal number between 0.0 and 1.0 — no other text.

  Question: %<input>s
  Expected answer: %<expected>s
  Actual response: %<actual>s

  Score:
PROMPT
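A custom template can be checked before it is handed to the scorer by running it through Kernel#format, which is what #score uses to fill the placeholders. This is a sketch under stated assumptions: CUSTOM_PROMPT and the sample question are hypothetical, and only the three placeholder names come from DEFAULT_PROMPT.

```ruby
# Hypothetical custom template; the wording is illustrative only.
# The placeholder names must match the keys that #score passes to format.
CUSTOM_PROMPT = <<~PROMPT
  Rate how closely the response matches the expected answer, from 0.0 to 1.0.
  Reply with a single decimal number and nothing else.

  Question: %<input>s
  Expected answer: %<expected>s
  Actual response: %<actual>s

  Score:
PROMPT

# Kernel#format substitutes %<name>s references from a hash of values.
prompt = format(CUSTOM_PROMPT, input: "What is 2 + 2?", expected: "4", actual: "four")
puts prompt
```

Because the template goes through format, a literal percent sign in a custom prompt has to be written as %%, and a placeholder name other than the three above raises KeyError at scoring time.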
Instance Method Summary
-
#initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false) ⇒ LlmJudge
constructor
A new instance of LlmJudge.
-
#score(actual:, expected:, input: nil) ⇒ Float
Asks the judge model to rate actual against expected and returns the score as a Float clamped to [0.0, 1.0]. Returns 0.0 on failure unless raise_on_error is set.
Constructor Details
#initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false) ⇒ LlmJudge
Returns a new instance of LlmJudge.
# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 40
def initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false)
  @model = model
  @prompt_template = prompt_template
  @raise_on_error = raise_on_error
end
Instance Method Details
#score(actual:, expected:, input: nil) ⇒ Float
mutant:disable - multiple genuine equivalent mutations:
- actual.to_str / actual: (shorthand) are genuine (callers pass String)
- expected.to_str / expected: are genuine (String)
- response.content.strip (no to_s) is genuine (content is String)
- lstrip/rstrip/no-strip are genuine (whitespace doesn't affect number scanning)
- scan(/-?\d.?\d*/) is genuine (for [0,1] range responses, single-digit-before-decimal matches are the same after clamp)
- response.content.to_str.strip is genuine (String)
- all warn variations (warn no-arg, warn(nil), warn(e), warn(nil literal), nil-replacing-warn, warn-deletion) are genuine because the rescue block still returns 0.0; warn is a side-effect not tested by value assertions
# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 59
def score(actual:, expected:, input: nil)
  # Fill the named placeholders; nil values become empty strings via to_s.
  prompt = format(@prompt_template, input: input.to_s, expected: expected.to_s, actual: actual.to_s)
  # Run the LLM call on the runtime's blocking-IO executor and wait for the reply.
  response = Phronomy::Runtime.instance.blocking_io.submit {
    RubyLLM.chat(model: @model).ask(prompt)
  }.await
  # First number in the reply, clamped to [0.0, 1.0]; no match yields 0.0.
  response.content.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
rescue => e
  raise if @raise_on_error
  warn "[LlmJudge] Scoring failed: #{e.message}"
  0.0
end
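The effect of raise_on_error can be illustrated without a real model. The sketch below is not the library's class: SketchJudge is a hypothetical stand-in that copies the rescue logic shown above and replaces the LLM call with an injected callable (the ask: parameter is an assumption made for this example), so the failure path can be exercised offline.

```ruby
# Hypothetical stand-in for LlmJudge. The LLM call is replaced by an injected
# callable (ask:), which the real class does not accept.
class SketchJudge
  def initialize(ask:, raise_on_error: false)
    @ask = ask
    @raise_on_error = raise_on_error
  end

  def score(actual:, expected:, input: nil)
    reply = @ask.call("#{input} | #{expected} | #{actual}")
    reply.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
  rescue => e
    raise if @raise_on_error # a bare raise re-raises the rescued error
    warn "[SketchJudge] Scoring failed: #{e.message}"
    0.0
  end
end

# A callable that always fails, simulating a network or provider error.
failing = ->(_prompt) { raise IOError, "connection reset" }

# Default behaviour: the error is logged to stderr and the score is 0.0.
lenient_score = SketchJudge.new(ask: failing).score(actual: "four", expected: "4")

# With raise_on_error: true the original exception reaches the caller.
strict_raised =
  begin
    SketchJudge.new(ask: failing, raise_on_error: true).score(actual: "four", expected: "4")
    false
  rescue IOError
    true
  end

# Happy path for comparison: a well-formed reply is parsed as usual.
ok_score = SketchJudge.new(ask: ->(_prompt) { "0.9" }).score(actual: "4", expected: "4")
```

With the default setting, a failed call and a genuinely bad answer both come back as 0.0, so an evaluation run that needs to tell the two apart would presumably want raise_on_error: true.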