ask-eval
LLM evaluation framework for Ruby. Minitest-native assertions for testing LLM outputs. LLM-as-judge for faithfulness, hallucination, bias, and toxicity. Deterministic assertions for basic checks. CI-native output.
Installation
gem "ask-eval"
Quick Start
require "minitest/autorun"
require "ask/eval"
require "ask/eval/dsl"

class MyEvalTest < Minitest::Test
  include Ask::Eval::DSL

  # Judge-based check: is the answer supported by the retrieved documents?
  def test_response_is_faithful_to_context
    response = my_rag_app.query("What's the return policy?")
    assert_faithful response, context: [my_docs]
  end

  # Deterministic checks: no judge model needed.
  def test_response_contains_expected_info
    response = my_app.generate_email("Order confirmation")
    assert_contains response, "Thank you for your order"
    assert_regex response, /order #\d{5}/
  end
end
Deterministic Assertions
assert_contains output, "substring"
assert_not_contains output, "bad word"
assert_regex output, /pattern/
assert_json output # valid JSON?
assert_max_tokens output, 500
assert_starts_with output, "Hello"
assert_ends_with output, "Goodbye"
assert_equals output, "exact string"
assert_min_length output, 10
assert_max_length output, 500
assert_url output
assert_email output
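The assertions above are deterministic: they inspect the output string directly and never call a judge model. As a rough illustration of what a few of them verify, here is a sketch in plain Ruby. The `DeterministicChecks` module and its predicate names are hypothetical and are not ask-eval's implementation.

```ruby
require "json"
require "uri"

# Hypothetical stand-ins showing what a few deterministic checks verify.
# These are NOT ask-eval's internals, only a sketch of the underlying idea.
module DeterministicChecks
  module_function

  # True when the string parses as JSON.
  def valid_json?(output)
    JSON.parse(output)
    true
  rescue JSON::ParserError
    false
  end

  # True when the string is an absolute http(s) URL with a host.
  def url?(output)
    uri = URI.parse(output)
    uri.is_a?(URI::HTTP) && !uri.host.nil?
  rescue URI::InvalidURIError
    false
  end

  # True when the string length falls within the inclusive bounds.
  def length_between?(output, min, max)
    output.length.between?(min, max)
  end
end

DeterministicChecks.valid_json?('{"ok": true}') # true
DeterministicChecks.valid_json?("not json")     # false
```

Because checks like these are pure functions of the output, they are cheap to run on every commit and give the same result each time.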
LLM-as-Judge Assertions
assert_faithful response, context: docs # faithful to source?
assert_not_hallucinating response, context: docs # made-up info?
refute_bias response
refute_toxicity response
assert_correctness response, expected: expected
These require a judge model. Pass one per assertion or configure globally:
# Configure a default judge model
Ask::Eval.configure do |c|
  c.default_judge = model # any callable, Ask::Provider instance, or model string
end
Or pass a model directly to each assertion:
assert_faithful response, context: docs, model: my_model
The model can be:
- A callable (lambda/proc) that accepts messages and returns a response
- An Ask::Provider instance (e.g., Ask::Providers::OpenAI.new)
- A model string (e.g., "openai/gpt-4o-mini"; requires ask-llm-providers)
Using a lambda for testing
require "json"
# The judge is called with the messages, so the lambda must accept one argument.
model = ->(_messages) {
  { content: JSON.generate({ passed: true, score: 0.95, reason: "OK" }) }
}
assert_faithful response, context: docs, model: model
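The stub judge above replies with a hash whose `content` is a JSON document carrying `passed`, `score`, and `reason`. A minimal sketch of how a reply of that shape could be decoded follows. `Verdict` and `parse_verdict` are hypothetical names, the message format passed to the judge is an assumption, and the gem's real parsing may differ.

```ruby
require "json"

# Hypothetical value object for a judge's decision; not part of ask-eval.
Verdict = Struct.new(:passed, :score, :reason, keyword_init: true)

# Decode a judge reply shaped like the stub above:
#   { content: '{"passed":true,"score":0.95,"reason":"OK"}' }
def parse_verdict(reply)
  data = JSON.parse(reply.fetch(:content))
  Verdict.new(
    passed: data.fetch("passed") == true,
    score: Float(data.fetch("score", 0.0)),
    reason: data.fetch("reason", "").to_s
  )
rescue JSON::ParserError, KeyError => e
  # Treat an unreadable reply as a failed evaluation rather than crashing.
  Verdict.new(passed: false, score: 0.0, reason: "unparseable judge reply: #{e.message}")
end

stub_judge = ->(_messages) {
  { content: JSON.generate({ passed: true, score: 0.95, reason: "OK" }) }
}

# Assumed message format: an array of role/content hashes.
verdict = parse_verdict(stub_judge.call([{ role: "user", content: "Is this faithful?" }]))
puts verdict.passed
```

Failing closed on a malformed reply is one possible design choice: a judge that returns prose instead of JSON then shows up as a failed assertion with a readable reason.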
Minitest Plugin
For automatic inclusion in all Minitest tests, use the plugin:
# test/test_helper.rb
require "ask/eval/minitest"
# Now ALL test classes have assert_faithful, assert_contains, etc.
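One common way for a helper file to make assertions available in every test class is to include a module into `Minitest::Test` when the file is loaded. The sketch below shows that general technique only. `DemoAssertions` and `assert_contains_demo` are hypothetical, and this is not a claim about how `ask/eval/minitest` is implemented.

```ruby
require "minitest"

# Hypothetical assertion module, standing in for a library's DSL.
module DemoAssertions
  def assert_contains_demo(output, substring)
    assert output.include?(substring),
           "Expected output to contain #{substring.inspect}"
  end
end

# Including the module into the base class makes the assertion available
# to every subclass of Minitest::Test without a per-class `include`.
Minitest::Test.include(DemoAssertions)

class GreetingTest < Minitest::Test
  def test_greeting
    assert_contains_demo "Hello, world", "Hello"
  end
end
```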
CI Integration
JUnit XML (works with Jenkins, CircleCI, GitLab CI):
results = runner.summary[:results]
xml = Ask::Eval::Reporters::JUnit.new(results).to_xml
File.write("eval-results.xml", xml)
GitHub Actions — annotations on PRs:
reporter = Ask::Eval::Reporters::GitHub.new(results)
reporter.report # prints ::warning and ::error annotations
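GitHub Actions turns lines of the form `::error file=...,line=...::message` printed to standard output into annotations. The following sketch builds such lines from failing results. The result hashes and the `github_annotations` helper are hypothetical; the real shape of ask-eval results and the reporter's exact output may differ.

```ruby
# Hypothetical result records; the real shape of ask-eval results may differ.
results = [
  { name: "faithful", passed: false, file: "test/eval_test.rb", line: 12, message: "Claim not in context" },
  { name: "contains", passed: true,  file: "test/eval_test.rb", line: 20, message: "OK" }
]

# Build GitHub Actions workflow commands for the failing results.
def github_annotations(results)
  results.reject { |r| r[:passed] }.map do |r|
    # Workflow commands require %, CR, and LF in the message to be percent-encoded.
    message = r[:message].gsub("%", "%25").gsub("\r", "%0D").gsub("\n", "%0A")
    "::error file=#{r[:file]},line=#{r[:line]},title=#{r[:name]}::#{message}"
  end
end

puts github_annotations(results)
# ::error file=test/eval_test.rb,line=12,title=faithful::Claim not in context
```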
Cost Tracking
Ask::Eval.configure do |c|
  c.track_cost = true
end
# Access accumulated costs
puts Ask::Eval.cost_report
# => { total: 0.00015, by_judge: { faithful: { calls: 2, total_cost: 0.00015 } } }
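To make the report shape above concrete, here is a sketch of an accumulator that produces a hash with the same keys. `CostLedger` is a hypothetical class, and ask-eval's internal bookkeeping and cost units may differ.

```ruby
# Hypothetical accumulator; ask-eval's internal bookkeeping may differ.
class CostLedger
  def initialize
    # Each judge gets its own counter the first time it is recorded.
    @by_judge = Hash.new { |h, k| h[k] = { calls: 0, total_cost: 0.0 } }
  end

  # Record one judge call and its cost; returns self so calls can be chained.
  def record(judge, cost)
    entry = @by_judge[judge]
    entry[:calls] += 1
    entry[:total_cost] += cost
    self
  end

  # Summarise in the same shape as the report shown above.
  def report
    {
      total: @by_judge.values.sum { |e| e[:total_cost] },
      by_judge: @by_judge.transform_values(&:dup)
    }
  end
end

ledger = CostLedger.new
ledger.record(:faithful, 0.0001).record(:faithful, 0.00005)
report = ledger.report
puts report[:by_judge][:faithful][:calls]
```

Totals are floating-point sums, so compare them with a tolerance rather than for exact equality.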
Running Tests
bundle exec rake test
Design Philosophy
ask-eval is deliberately not a port of ruby_llm-tribunal. The table below compares the two:
| ruby_llm-tribunal (~500 lines, 25+ files) | ask-eval (~300 lines, 10 files) |
|---|---|
| Standalone evaluator with its own API | Minitest-native assertions — drops into existing tests |
| 10 judges (including niche: jailbreak, PII, refusal) | 5 essential judges — faithful, hallucination, bias, toxicity, correctness |
| 6 reporters (console, text, JSON, HTML, JUnit, GitHub) | 3 reporters — console (dev), JUnit (CI), GitHub Actions (annotations) |
| Dataset management, red teaming, custom judges | No datasets, no red teaming. Focus on what matters for 80% of users. |
| Tied to RubyLLM for judge model | Any model as judge — cheap gpt-4o-mini, accurate claude, or local |
| Cost tracking: none | Cost tracking per evaluation |
| Snapshot testing: none | Eval snapshots for regression detection |
| Test framework integration: requires include | Minitest plugin — auto-loads with require "ask/eval/minitest" |
License
MIT