Class: Phronomy::Eval::Scorer::LlmJudge

Inherits:
Base
  • Object
Defined in:
lib/phronomy/eval/scorer/llm_judge.rb

Overview

LLM-as-a-Judge scorer. Sends a structured prompt to an LLM and interprets its numeric reply as a quality score in [0.0, 1.0].

The prompt template accepts three named placeholders:

  • %<input>s: the original input question
  • %<expected>s: the ground-truth / reference answer
  • %<actual>s: the output being evaluated

The LLM is expected to reply with a single decimal number; any extra text is stripped and the value is clamped to [0.0, 1.0]. If parsing fails, the scorer returns 0.0 rather than raising.
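For illustration, the reply handling in #score (source listed below) resolves raw LLM replies like this; the sample replies are made up:

parse = ->(reply) { reply.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0) }

parse.call("0.85")            # => 0.85
parse.call("Score: 0.9")      # => 0.9  (surrounding text is ignored)
parse.call("1.5")             # => 1.0  (clamped to the upper bound)
parse.call("no number here")  # => 0.0  (no number found; nil.to_f fallback)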

Examples:

judge = LlmJudge.new(model: "gpt-4o-mini")
judge.score(actual: "Paris", expected: "Paris", input: "Capital of France?")
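
A custom prompt template only needs to use the same named placeholders; the template wording below is illustrative:

strict_judge = LlmJudge.new(
  model: "gpt-4o-mini",
  prompt_template: <<~PROMPT
    Grade the candidate strictly against the reference.
    Question: %<input>s
    Reference answer: %<expected>s
    Candidate answer: %<actual>s
    Reply with one decimal number between 0.0 and 1.0:
  PROMPT
)
strict_judge.score(actual: "Lyon", expected: "Paris", input: "Capital of France?")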

Constant Summary

DEFAULT_PROMPT =
<<~PROMPT
  You are an impartial judge evaluating the quality of an AI assistant response.
  Rate the response on a scale from 0.0 (completely wrong or unhelpful) to 1.0 (perfect).
  Respond with ONLY a single decimal number between 0.0 and 1.0 — no other text.

  Question: %<input>s
  Expected answer: %<expected>s
  Actual response: %<actual>s

  Score:
PROMPT

Instance Method Summary

Constructor Details

#initialize(model:, prompt_template: DEFAULT_PROMPT) ⇒ LlmJudge

Returns a new instance of LlmJudge.

Parameters:

  • model (String)

    RubyLLM model identifier

  • prompt_template (String) (defaults to: DEFAULT_PROMPT)

    format string using the named placeholders %<input>s, %<expected>s, and %<actual>s



# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 37

def initialize(model:, prompt_template: DEFAULT_PROMPT)
  @model = model
  @prompt_template = prompt_template
end

Instance Method Details

#score(actual:, expected:, input: nil) ⇒ Float

Returns score in [0.0, 1.0]; 0.0 on any error.

Returns:

  • (Float)

    score in [0.0, 1.0]; 0.0 on any error



# File 'lib/phronomy/eval/scorer/llm_judge.rb', line 43

def score(actual:, expected:, input: nil)
  prompt = format(@prompt_template, input: input.to_s, expected: expected.to_s, actual: actual.to_s)
  response = RubyLLM.chat(model: @model).ask(prompt)
  response.content.to_s.strip.scan(/-?\d+\.?\d*/).first.to_f.clamp(0.0, 1.0)
rescue => e
  warn "[LlmJudge] Scoring failed: #{e.message}"
  0.0
end
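
Called directly, #score always returns a Float; a quick illustration (inputs are made up):

judge = Phronomy::Eval::Scorer::LlmJudge.new(model: "gpt-4o-mini")
judge.score(actual: "Berlin", expected: "Paris", input: "Capital of France?")
# => a Float in [0.0, 1.0]; if the LLM call or number parsing fails,
#    a warning is written to stderr and 0.0 is returned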