Class: Phronomy::Eval::Scorer::LlmJudge

Inherits: Base

Object
Base
Phronomy::Eval::Scorer::LlmJudge

Defined in: lib/phronomy/eval/scorer/llm_judge.rb

Overview

LLM-as-a-Judge scorer. Sends a structured prompt to an LLM and interprets its numeric reply as a quality score in [0.0, 1.0].

The prompt template accepts three named placeholders:
  %<input>s: the original input question
  %<expected>s: the ground-truth / reference answer
  %<actual>s: the output being evaluated
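As a minimal sketch of how such a template is filled in, the snippet below renders a custom template with Ruby's built-in `format`, which is the call the `#score` source listing uses. The template text and the sample values are hypothetical; only the three placeholder names come from this page.

```ruby
# Hypothetical custom template. Only the placeholder names
# (input, expected, actual) are taken from DEFAULT_PROMPT.
template = <<~PROMPT
  Question: %<input>s
  Reference: %<expected>s
  Candidate: %<actual>s
  Score (0.0 to 1.0):
PROMPT

# Kernel#format substitutes each %<name>s with the matching keyword value.
rendered = format(
  template,
  input: "Capital of France?",
  expected: "Paris",
  actual: "Paris, France"
)

puts rendered
```

Because the template goes through `format`, a literal percent sign in a custom template has to be written as `%%`, and a placeholder name with no matching key raises a `KeyError`.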

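The LLM is expected to reply with a single decimal number; the first number found in the reply is used, any other text is ignored, and the value is clamped to [0.0, 1.0]. If no number can be parsed, the scorer returns 0.0 rather than raising. Other errors (for example a failed LLM call) also yield 0.0 unless raise_on_error is true.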
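The parsing behaviour can be tried in isolation. The helper below, `parse_score`, is a hypothetical name; its body copies the expression from the `#score` source listing on this page, and the sample replies are invented for illustration.

```ruby
# Hypothetical helper wrapping the parsing chain from #score:
# take the first number in the reply, convert to Float, clamp to [0.0, 1.0].
def parse_score(reply)
  reply.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
end

parse_score("0.85")                # => 0.85
parse_score("Score: 0.7 out of 1") # => 0.7 (first number wins)
parse_score("7/10")                # => 1.0 (7 is clamped to the upper bound)
parse_score("-0.3")                # => 0.0 (negative values clamp to 0.0)
parse_score("no idea")             # => 0.0 (no match, nil.to_f is 0.0)
parse_score(".5")                  # => 1.0 (pattern needs a leading digit, so it reads "5")
```

The last two cases are worth noting when designing a prompt: a reply without a leading digit, or one on a different scale, still produces a score rather than an error.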

Examples:

judge = LlmJudge.new(model: "gpt-4o-mini")
judge.score(actual: "Paris", expected: "Paris", input: "Capital of France?")

Constant Summary

DEFAULT_PROMPT =

<<~PROMPT
  You are an impartial judge evaluating the quality of an AI assistant response.
  Rate the response on a scale from 0.0 (completely wrong or unhelpful) to 1.0 (perfect).
  Respond with ONLY a single decimal number between 0.0 and 1.0 — no other text.

  Question: %<input>s
  Expected answer: %<expected>s
  Actual response: %<actual>s

  Score:
PROMPT

Instance Method Summary

#initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false) ⇒ LlmJudge constructor
A new instance of LlmJudge.
#score(actual:, expected:, input: nil) ⇒ Float
Asks the judge LLM to rate the actual output against the expected answer and returns the parsed score in [0.0, 1.0].

Constructor Details

#initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false) ⇒ `LlmJudge`

Returns a new instance of LlmJudge.

Parameters:

model (String) —
RubyLLM model identifier
prompt_template (String) (defaults to: DEFAULT_PROMPT) —
format string with %<input>s, %<expected>s, %<actual>s
raise_on_error (Boolean) (defaults to: false) —
when true, re-raises scoring exceptions instead of returning 0.0. Use this in batch eval pipelines where silent failures are unacceptable.

# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 40

def initialize(model:, prompt_template: DEFAULT_PROMPT, raise_on_error: false)
  @model = model
  @prompt_template = prompt_template
  @raise_on_error = raise_on_error
end

Instance Method Details

#score(actual:, expected:, input: nil) ⇒ `Float`

mutant:disable - multiple genuine equivalent mutations:
  actual.to_str / actual: (shorthand) are genuine (callers pass String)
  expected.to_str / expected: are genuine (String)
  response.content.strip (no to_s) is genuine (content is String)
  lstrip/rstrip/no-strip are genuine (whitespace doesn't affect number scanning)
  scan(/-?\d.?\d*/) is genuine (for [0,1] range responses, single-digit-before-decimal matches are the same after clamp)
  response.content.to_str.strip is genuine (String)
  all warn variations (warn no-arg, warn(nil), warn(e), warn(nil literal), nil-replacing-warn, warn-deletion) are genuine because the rescue block still returns 0.0; warn is a side effect not tested by value assertions

Returns:

(Float) —
score in [0.0, 1.0]; 0.0 on error when raise_on_error is false

# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 59

def score(actual:, expected:, input: nil)
  prompt = format(@prompt_template, input: input.to_s, expected: expected.to_s, actual: actual.to_s)
  response = Phronomy::Runtime.instance.blocking_io.submit { RubyLLM.chat(model: @model).ask(prompt) }.await
  response.content.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
rescue => e
  raise if @raise_on_error

  warn "[LlmJudge] Scoring failed: #{e.message}"
  0.0
end
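The effect of raise_on_error can be seen without calling an LLM. The sketch below uses a stand-in class, `FakeJudge` (hypothetical, not part of Phronomy), that reproduces only the rescue pattern from the listing above and always fails with a simulated error.

```ruby
# Hypothetical stand-in: copies the rescue pattern from #score,
# but the "LLM call" is replaced by a simulated failure.
class FakeJudge
  def initialize(raise_on_error: false)
    @raise_on_error = raise_on_error
  end

  def score
    raise IOError, "simulated LLM failure"
  rescue => e
    # A bare `raise` re-raises the exception currently being handled.
    raise if @raise_on_error

    warn "[FakeJudge] Scoring failed: #{e.message}"
    0.0
  end
end

# Default mode: the failure is logged to stderr and scored as 0.0.
lenient = FakeJudge.new.score

# Strict mode: the original exception reaches the caller.
strict_error =
  begin
    FakeJudge.new(raise_on_error: true).score
  rescue IOError => e
    e
  end

puts lenient
puts strict_error.message
```

In the default mode a failed call is indistinguishable from a genuine score of 0.0 in the return value, which is why the strict mode is suggested for batch pipelines.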