ask-eval
LLM evaluation framework for Ruby. Minitest-native assertions for testing LLM outputs. LLM-as-judge for faithfulness, hallucination, bias, and toxicity. Deterministic assertions for basic checks. CI-native output.
Installation
gem "ask-eval"
Quick Start
require "ask/eval"
require "ask/eval/dsl"
class MyEvalTest < Minitest::Test
  include Ask::Eval::DSL

  # Plain Minitest discovers tests by the test_ method-name prefix.
  def test_response_is_faithful_to_context
    response = my_rag_app.query("What's the return policy?")
    assert_faithful response, context: [my_docs]
  end

  def test_response_contains_expected_info
    response = my_app.generate_email("Order confirmation")
    assert_contains response, "Thank you for your order"
    assert_regex response, /order #\d{5}/
  end
end
Deterministic Assertions
assert_contains output, "substring"
assert_not_contains output, "bad word"
assert_regex output, /pattern/
assert_json output # valid JSON?
assert_max_tokens output, 500
assert_starts_with output, "Hello"
assert_ends_with output, "Goodbye"
assert_equals output, "exact string"
assert_min_length output, 10
assert_max_length output, 500
assert_url output
assert_email output
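To make the intent of these checks concrete, here is a minimal plain-Ruby sketch of what a few of them might verify. The module `DeterministicChecks` and its method names are hypothetical and are not the gem's implementation; in particular, the whitespace-based token count is only an assumption about how a token limit could be approximated.

```ruby
require "json"
require "uri"

# Hypothetical helpers, NOT ask-eval's code: predicates that approximate
# what a few of the deterministic assertions check.
module DeterministicChecks
  module_function

  # True when the text parses as JSON.
  def valid_json?(text)
    JSON.parse(text)
    true
  rescue JSON::ParserError
    false
  end

  # True when the whole text is an http(s) URL with a host.
  def url?(text)
    uri = URI.parse(text)
    uri.is_a?(URI::HTTP) && !uri.host.nil?
  rescue URI::InvalidURIError
    false
  end

  # Rough token estimate: whitespace-separated words.
  # ASSUMPTION: real tokenizers count differently.
  def within_max_tokens?(text, limit)
    text.split.size <= limit
  end
end

puts DeterministicChecks.valid_json?('{"status": "ok"}')
puts DeterministicChecks.url?("https://example.com/orders/12345")
```

Because checks like these need no model call, they are cheap enough to run on every commit.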
LLM-as-Judge Assertions
assert_faithful response, context: docs # faithful to source?
assert_not_hallucinating response, context: docs # made-up info?
refute_bias response
refute_toxicity response
assert_correctness response, expected: expected
These require a judge model. Pass one per assertion or configure globally:
# Configure a default judge model
Ask::Eval.configure do |c|
  c.default_judge = model # any callable, Ask::Provider instance, or model string
end
Or pass a model directly to each assertion:
assert_faithful response, context: docs, model: my_model
The model can be:
- A callable (lambda/proc) that accepts messages and returns a response
- An Ask::Provider instance (e.g., Ask::Providers::OpenAI.new)
- A model string (e.g., "openai/gpt-4o-mini"; requires ask-llm-providers)
Using a lambda for testing
require "json"
# The callable receives the judge's messages; this stub ignores them.
model = ->(_messages) {
  { content: JSON.generate({ passed: true, score: 0.95, reason: "OK" }) }
}
assert_faithful response, context: docs, model: model
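A stub judge does not have to return a fixed verdict. The sketch below shows a deterministic stand-in that inspects what it is given, which can be handy for offline tests. It assumes the messages arrive as an array of hashes with `:role` and `:content` keys; that format is an assumption, not something this README specifies, and `keyword_judge` is a hypothetical name.

```ruby
require "json"

# A deterministic stand-in judge for offline tests.
# ASSUMPTION: messages are an array of hashes with :role and :content keys.
keyword_judge = lambda do |messages|
  text = messages.map { |message| message[:content].to_s }.join("\n")
  # Fail whenever the text mentions a refund window we know is unsupported.
  passed = !text.match?(/refund within 90 days/i)
  verdict = {
    passed: passed,
    score: passed ? 1.0 : 0.0,
    reason: passed ? "OK" : "Mentions an unsupported 90-day refund"
  }
  { content: JSON.generate(verdict) }
end

reply = keyword_judge.call([{ role: "user", content: "We offer a refund within 90 days." }])
verdict = JSON.parse(reply[:content])
puts verdict["reason"] # prints "Mentions an unsupported 90-day refund"
```

The return value keeps the same shape as the lambda above: a hash whose `:content` is a JSON string with `passed`, `score`, and `reason`.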
Minitest Plugin
For automatic inclusion in all Minitest tests, use the plugin:
# test/test_helper.rb
require "ask/eval/minitest"
# Now ALL test classes have assert_faithful, assert_contains, etc.
CI Integration
JUnit XML (works with Jenkins, CircleCI, GitLab CI):
results = runner.summary[:results]
xml = Ask::Eval::Reporters::JUnit.new(results).to_xml
File.write("eval-results.xml", xml)
GitHub Actions — annotations on PRs:
reporter = Ask::Eval::Reporters::GitHub.new(results)
reporter.report # prints ::warning and ::error annotations
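For readers unfamiliar with GitHub Actions annotations, the following sketch shows how `::error` workflow-command lines can be built from failed results. The `EvalResult` struct and helper names are hypothetical and do not reflect the gem's reporter internals; the escaping follows GitHub's workflow-command rules for message data and property values.

```ruby
# Hypothetical result shape for illustration; ask-eval's result objects may differ.
EvalResult = Struct.new(:name, :passed, :reason, keyword_init: true)

# Escape message text for a workflow command: %, CR and LF are percent-encoded.
def escape_data(text)
  text.to_s.gsub("%", "%25").gsub("\r", "%0D").gsub("\n", "%0A")
end

# Property values (such as title) additionally encode ':' and ','.
def escape_property(text)
  escape_data(text).gsub(":", "%3A").gsub(",", "%2C")
end

# Builds one ::error line per failed result; passing results produce nothing.
def github_annotations(results)
  results.reject(&:passed).map do |result|
    "::error title=#{escape_property(result.name)}::#{escape_data(result.reason)}"
  end
end

results = [
  EvalResult.new(name: "faithful: return policy", passed: false,
                 reason: "Claims a 90-day window\nnot found in context"),
  EvalResult.new(name: "toxicity", passed: true, reason: "OK")
]
puts github_annotations(results)
```

GitHub reads these lines from the job's standard output, so printing them during a workflow run is enough to surface the failures on the pull request.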
Cost Tracking
Ask::Eval.configure do |c|
  c.track_cost = true
end
# Access accumulated costs
puts Ask::Eval.cost_report
# => { total: 0.00015, by_judge: { faithful: { calls: 2, total_cost: 0.00015 } } }
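As a rough illustration of how a report with this shape could be accumulated, here is a small standalone sketch. `CostTracker` is a hypothetical class and not the gem's internals; only the report layout (`total`, `by_judge`, `calls`, `total_cost`) is taken from the example above.

```ruby
# Hypothetical tracker, NOT ask-eval's implementation: accumulates per-judge
# call counts and costs into the report shape shown above.
class CostTracker
  def initialize
    @by_judge = Hash.new { |hash, key| hash[key] = { calls: 0, total_cost: 0.0 } }
  end

  # Record one judge call and what it cost (in whatever currency unit you track).
  def record(judge, cost)
    entry = @by_judge[judge]
    entry[:calls] += 1
    entry[:total_cost] += cost
  end

  # Returns a snapshot; later calls to #record do not mutate it.
  def report
    {
      total: @by_judge.values.sum(0.0) { |entry| entry[:total_cost] },
      by_judge: @by_judge.transform_values(&:dup)
    }
  end
end

tracker = CostTracker.new
tracker.record(:faithful, 0.0001)
tracker.record(:faithful, 0.00005)
tracker.record(:toxicity, 0.0002)
report = tracker.report
puts report[:by_judge][:faithful][:calls] # prints 2
```

Grouping by judge makes it easier to see which assertion types dominate spend before switching them to a cheaper model.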
Running Tests
bundle exec rake test
Design Philosophy
This gem is NOT a port of ruby_llm-tribunal. See the comparison below:
| ruby_llm-tribunal | ask-eval |
|---|---|
| Standalone evaluator with its own API | Minitest-native assertions — drops into existing tests |
| 10 judges (including niche: jailbreak, PII, refusal) | 5 essential judges — faithful, hallucination, bias, toxicity, correctness |
| 6 reporters (console, text, JSON, HTML, JUnit, GitHub) | 3 reporters — console (dev), JUnit (CI), GitHub Actions (annotations) |
| Dataset management, red teaming, custom judges | No datasets, no red teaming. Focus on what matters for 80% of users. |
| Tied to RubyLLM for judge model | Any model as judge — cheap gpt-4o-mini, accurate claude, or local |
| Cost tracking: none | Cost tracking per evaluation |
| Snapshot testing: none | Eval snapshots for regression detection (v0.2.0) |
| Test framework integration: requires include | Minitest plugin — auto-loads with require "ask/eval/minitest" |
License
MIT
Custom Judges
The 5 built-in judges cover common cases, but you can create your own by
subclassing Ask::Eval::Judge:
class BrandVoiceJudge < Ask::Eval::Judge
  def call(tc)
    query_judge(tc)
  end

  private

  # Instructions sent to the judge model.
  def system_prompt
    <<~PROMPT
      You are a brand voice evaluator. Determine if the response matches our guidelines:
      - Friendly but professional tone
      - No jargon or technical terms
      - Empathetic and helpful
      Respond in JSON format:
      { "passed": true/false, "score": 0.0-1.0, "reason": "..." }
    PROMPT
  end

  # The per-test-case message containing the output under evaluation.
  def user_message(tc)
    "Response to evaluate: " + tc.actual_output
  end
end
# Use it directly
judge = BrandVoiceJudge.new(model: my_model)
result = judge.call(Ask::Eval::TestCase.new(actual_output: response))
puts result.reason if result.passed?
Using a lambda for custom evaluation
For simple checks, pass a callable directly as the model: parameter --
you do not need a full judge class:
assert_faithful response, context: docs, model: ->(_messages) {
  { content: JSON.generate({ passed: true, score: 1.0, reason: "All good" }) }
}
No registration system needed. Subclassing Judge and implementing
#call, #system_prompt, and #user_message is the entire API.
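To show how the three methods might fit together, the sketch below reimplements the pattern with standalone stand-ins so it runs without the gem. `SketchJudge`, `SketchCase`, `SketchResult`, and the message format are all hypothetical; in particular, this `query_judge` is only a guess at what the base class does, not ask-eval's actual code.

```ruby
require "json"

# Hypothetical stand-ins, NOT ask-eval's classes.
SketchCase = Struct.new(:actual_output, keyword_init: true)
SketchResult = Struct.new(:passed, :score, :reason, keyword_init: true) do
  def passed?
    passed
  end
end

class SketchJudge
  def initialize(model:)
    @model = model
  end

  def call(test_case)
    query_judge(test_case)
  end

  private

  # ASSUMPTION: the model takes chat-style messages and returns a hash whose
  # :content is a JSON verdict with passed, score and reason.
  def query_judge(test_case)
    messages = [
      { role: "system", content: system_prompt },
      { role: "user", content: user_message(test_case) }
    ]
    data = JSON.parse(@model.call(messages).fetch(:content), symbolize_names: true)
    SketchResult.new(passed: data[:passed] == true,
                     score: data[:score].to_f,
                     reason: data[:reason].to_s)
  end

  def system_prompt
    raise NotImplementedError, "subclasses define #system_prompt"
  end

  def user_message(_test_case)
    raise NotImplementedError, "subclasses define #user_message"
  end
end

class PoliteJudge < SketchJudge
  private

  def system_prompt
    "Decide whether the response is polite. Reply in JSON."
  end

  def user_message(test_case)
    "Response to evaluate: #{test_case.actual_output}"
  end
end

stub_model = ->(_messages) { { content: JSON.generate({ passed: true, score: 0.9, reason: "Polite" }) } }
result = PoliteJudge.new(model: stub_model).call(SketchCase.new(actual_output: "Thanks for reaching out!"))
puts result.reason if result.passed? # prints "Polite"
```

The subclass supplies only the two prompts; the shared base handles the model call and verdict parsing, which is why no registration step is required.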