ask-eval

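LLM evaluation framework for Ruby. Provides Minitest-native assertions for testing LLM outputs: LLM-as-judge checks for faithfulness, hallucination, bias, toxicity, and correctness, plus deterministic assertions for basic checks (substrings, regexes, JSON validity, length). Reports results in CI-native formats (JUnit XML, GitHub Actions annotations).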

Installation

Add the gem to your Gemfile, then run bundle install:

gem "ask-eval"

Quick Start

require "minitest/autorun"
require "ask/eval"
require "ask/eval/dsl"

class MyEvalTest < Minitest::Test
  include Ask::Eval::DSL

  # Plain Minitest discovers tests by the test_ method-name prefix.
  def test_response_is_faithful_to_context
    response = my_rag_app.query("What's the return policy?")
    assert_faithful response, context: [my_docs]
  end

  def test_response_contains_expected_info
    response = my_app.generate_email("Order confirmation")
    assert_contains response, "Thank you for your order"
    assert_regex response, /order #\d{5}/
  end
end

Deterministic Assertions

assert_contains output, "substring"
assert_not_contains output, "bad word"
assert_regex output, /pattern/
assert_json output                     # valid JSON?
assert_max_tokens output, 500
assert_starts_with output, "Hello"
assert_ends_with output, "Goodbye"
assert_equals output, "exact string"
assert_min_length output, 10
assert_max_length output, 500
assert_url output
assert_email output
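
To make concrete what checks like these amount to, here is a minimal sketch of three of them as plain predicates. The module name DeterministicChecks and its methods are hypothetical and are not the gem's implementation; the real assertions presumably raise Minitest failures rather than return booleans.

```ruby
require "json"

# Hypothetical helpers, NOT part of ask-eval: plain-Ruby versions of three checks.
module DeterministicChecks
  module_function

  # True when the output includes the given substring.
  def contains?(output, substring)
    output.include?(substring)
  end

  # True when the output parses as JSON.
  def json?(output)
    JSON.parse(output)
    true
  rescue JSON::ParserError
    false
  end

  # True when the output is at most `max` characters long.
  def max_length?(output, max)
    output.length <= max
  end
end
```

Because these checks never call a model, they are free and repeatable, so they are a good first line of defense before any judge-based assertion runs.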

LLM-as-Judge Assertions

assert_faithful response, context: docs        # faithful to source?
assert_not_hallucinating response, context: docs # made-up info?
refute_bias response
refute_toxicity response
assert_correctness response, expected: expected

These require a judge model. Pass one per assertion or configure globally:

# Configure a default judge model
Ask::Eval.configure do |c|
  c.default_judge = model  # any callable, Ask::Provider instance, or model string
end

Or pass a model directly to each assertion:

assert_faithful response, context: docs, model: my_model

The model can be:

A callable (lambda/proc) that accepts messages and returns a response
An Ask::Provider instance (e.g., Ask::Providers::OpenAI.new)
A model string (e.g., "openai/gpt-4o-mini" — requires ask-llm-providers)

Using a lambda for testing

require "json"

model = ->(messages) {
  { content: JSON.generate({ passed: true, score: 0.95, reason: "OK" }) }
}
assert_faithful response, context: docs, model: model
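
A stub judge does not have to return a constant verdict. The sketch below inspects the messages it receives and fails when a forbidden phrase appears. It assumes messages is an Array of Hashes with :role and :content keys, which this README does not specify; KEYWORD_JUDGE and the phrase it looks for are hypothetical.

```ruby
require "json"

# Assumption: messages looks like [{ role: "user", content: "..." }, ...].
# Adapt the lookup if the gem passes a different shape.
KEYWORD_JUDGE = lambda do |messages|
  text = messages.map { |m| m[:content].to_s }.join("\n")
  passed = !text.match?(/lifetime warranty/i)
  verdict = {
    passed: passed,
    score: passed ? 1.0 : 0.0,
    reason: passed ? "No unsupported claim found" : "Mentions a lifetime warranty"
  }
  # Same reply shape as the stub above: a Hash with a JSON string under :content.
  { content: JSON.generate(verdict) }
end

reply = KEYWORD_JUDGE.call([{ role: "user", content: "We offer a lifetime warranty." }])
puts JSON.parse(reply[:content])["reason"]
```

A stub like this keeps the test suite offline and deterministic while still exercising both the passing and failing paths of an assertion.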

Minitest Plugin

For automatic inclusion in all Minitest tests, use the plugin:

# test/test_helper.rb
require "ask/eval/minitest"
# Now ALL test classes have assert_faithful, assert_contains, etc.

CI Integration

JUnit XML (works with Jenkins, CircleCI, GitLab CI):

# runner is assumed to be an already-run eval runner; its setup is not shown in this README
results = runner.summary[:results]
xml = Ask::Eval::Reporters::JUnit.new(results).to_xml
File.write("eval-results.xml", xml)
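
For readers unfamiliar with the format, the sketch below builds a minimal JUnit-style document by hand. It is not the gem's reporter: the result Hash keys (:name, :passed, :reason) and the helper names are hypothetical, and real JUnit files usually carry more attributes (time, classname).

```ruby
XML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;", "'" => "&apos;"
}.freeze

# Escape text for use inside an XML attribute value.
def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical result shape: { name: String, passed: Boolean, reason: String }.
def junit_xml(results, suite: "ask-eval")
  failures = results.count { |r| !r[:passed] }
  cases = results.map do |r|
    name = xml_escape(r[:name])
    if r[:passed]
      %(  <testcase name="#{name}"/>)
    else
      reason = xml_escape(r[:reason])
      %(  <testcase name="#{name}">\n    <failure message="#{reason}"/>\n  </testcase>)
    end
  end
  [
    %(<?xml version="1.0" encoding="UTF-8"?>),
    %(<testsuite name="#{xml_escape(suite)}" tests="#{results.size}" failures="#{failures}">),
    *cases,
    "</testsuite>"
  ].join("\n")
end

puts junit_xml([
  { name: "faithful", passed: true, reason: "OK" },
  { name: "toxicity", passed: false, reason: "Score < 0.5" }
])
```

CI servers read the tests and failures counts on the testsuite element and show each failure message next to its test case.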

GitHub Actions — annotations on PRs:

# results is the same collection passed to the JUnit reporter above
reporter = Ask::Eval::Reporters::GitHub.new(results)
reporter.report  # prints ::warning and ::error annotations
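
The annotations are GitHub Actions workflow commands written to standard output. As a rough sketch of the format (not the gem's reporter), the code below turns a result into an ::error or ::warning line. The result Hash keys, the helper names, and the 0.7 warning threshold are all hypothetical.

```ruby
# Escape message data the way workflow commands expect (%, CR, LF).
def escape_annotation(text)
  text.to_s.gsub("%", "%25").gsub("\r", "%0D").gsub("\n", "%0A")
end

# Hypothetical result shape: { name:, passed:, score:, reason:, file:, line: }.
# Failed results become errors; passing results with a low score become warnings.
def github_annotation(result, warn_below: 0.7)
  level =
    if !result[:passed]
      "error"
    elsif result[:score] < warn_below
      "warning"
    end
  return nil unless level

  message = escape_annotation("#{result[:name]}: #{result[:reason]}")
  "::#{level} file=#{result[:file]},line=#{result[:line]}::#{message}"
end

line = github_annotation(
  { name: "faithful", passed: false, score: 0.2, reason: "Contradicts context",
    file: "test/eval_test.rb", line: 12 }
)
puts line if line
```

When a workflow step prints such a line, GitHub attaches the message to the given file and line in the pull request view.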

Cost Tracking

Ask::Eval.configure do |c|
  c.track_cost = true
end
# Access accumulated costs
puts Ask::Eval.cost_report
# => { total: 0.00015, by_judge: { faithful: { calls: 2, total_cost: 0.00015 } } }
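
To show how a report of that shape could be accumulated, here is a minimal sketch of a per-judge tracker. CostTracker is hypothetical and says nothing about how ask-eval computes or stores costs internally; the per-call costs in the usage lines are made-up numbers.

```ruby
# Hypothetical tracker, NOT ask-eval's internals: sums cost and call counts per judge.
class CostTracker
  def initialize
    @by_judge = Hash.new { |hash, judge| hash[judge] = { calls: 0, total_cost: 0.0 } }
  end

  # Record one judge call and its cost in dollars.
  def record(judge, cost)
    entry = @by_judge[judge]
    entry[:calls] += 1
    entry[:total_cost] += cost
    self
  end

  # Same shape as the report above: a grand total plus a per-judge breakdown.
  def report
    { total: @by_judge.values.sum(0.0) { |e| e[:total_cost] }, by_judge: @by_judge.dup }
  end
end

tracker = CostTracker.new
tracker.record(:faithful, 0.00005).record(:faithful, 0.0001)
puts tracker.report[:by_judge][:faithful][:calls]
```

Keeping the breakdown per judge makes it easy to see which assertion type dominates the bill for a test run.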

Running Tests

bundle exec rake test

Design Philosophy

This gem is NOT a port of ruby_llm-tribunal. See the comparison below:

ruby_llm-tribunal	ask-eval
Standalone evaluator with its own API	Minitest-native assertions — drops into existing tests
10 judges (including niche: jailbreak, PII, refusal)	5 essential judges — faithful, hallucination, bias, toxicity, correctness
6 reporters (console, text, JSON, HTML, JUnit, GitHub)	3 reporters — console (dev), JUnit (CI), GitHub Actions (annotations)
Dataset management, red teaming, custom judges	No datasets, no red teaming. Focus on what matters for 80% of users.
Tied to RubyLLM for judge model	Any model as judge — cheap gpt-4o-mini, accurate claude, or local
Cost tracking: none	Cost tracking per evaluation
Snapshot testing: none	Eval snapshots for regression detection (v0.2.0)
Test framework integration: requires include	Minitest plugin — auto-loads with `require "ask/eval/minitest"`

License

MIT

Custom Judges

The 5 built-in judges cover common cases, but you can create your own by subclassing Ask::Eval::Judge:

class BrandVoiceJudge < Ask::Eval::Judge
  def call(tc)
    query_judge(tc)
  end

  private

  def system_prompt
    <<~PROMPT
      You are a brand voice evaluator. Determine if the response matches our guidelines:
      - Friendly but professional tone
      - No jargon or technical terms
      - Empathetic and helpful

      Respond in JSON format:
      { "passed": true/false, "score": 0.0-1.0, "reason": "..." }
    PROMPT
  end

  def user_message(tc)
    "Response to evaluate: " + tc.actual_output
  end
end

# Use it directly
judge = BrandVoiceJudge.new(model: my_model)
result = judge.call(Ask::Eval::TestCase.new(actual_output: response))
puts result.reason unless result.passed?  # print the judge's explanation on failure

Using a lambda for custom evaluation

For simple checks you do not need a full judge class. Pass a callable as the model: parameter; it receives the judge's messages and returns the verdict as JSON:

assert_faithful response, context: docs, model: ->(messages) {
  { content: JSON.generate({ passed: true, score: 1.0, reason: "All good" }) }
}

No registration system needed. Subclassing Judge and implementing #call, #system_prompt, and #user_message is the entire API.
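
The system prompt above asks the model for a JSON verdict, but real models sometimes wrap that JSON in code fences or extra prose. The sketch below shows one defensive way to parse such a reply. Verdict and parse_verdict are hypothetical names, not part of ask-eval, and the gem may already handle this for you.

```ruby
require "json"

# Hypothetical value object mirroring the { passed, score, reason } verdict format.
Verdict = Struct.new(:passed, :score, :reason, keyword_init: true) do
  def passed?
    passed == true
  end
end

# Extract the widest {...} span from the reply and parse it.
# Anything unparseable is treated as a failed verdict rather than raising.
def parse_verdict(raw)
  unparseable = Verdict.new(passed: false, score: 0.0, reason: "Unparseable judge reply")
  json = raw.to_s[/\{.*\}/m]
  return unparseable if json.nil?

  data = JSON.parse(json, symbolize_names: true)
  Verdict.new(passed: data[:passed], score: data[:score].to_f, reason: data[:reason].to_s)
rescue JSON::ParserError
  unparseable
end

verdict = parse_verdict(%(Sure!\n{"passed": true, "score": 0.9, "reason": "On brand"}))
puts verdict.reason if verdict.passed?
```

Treating an unparseable reply as a failure is a conservative choice: a judge that cannot be understood should not silently pass a test.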