Ruby Skill Bench
A high-fidelity evaluation engine for benchmarking AI agent skills across any stack (Rails-first, but extensible).
Features
- Side-by-Side Evaluation: Quantify the "ROI of Context" by comparing baseline vs. skill-enhanced agent runs.
- Isolated Git Sandboxes: Every run operates in a temporary repo. Clean diffs, zero side-effects, 100% reproducibility.
- Blind Judging with Dimensions: An LLM judge scores baseline and context independently across 5 canonical dimensions (Correctness, Skill Adherence, Code Quality, Test Coverage, Documentation). Eval authors configure weights and thresholds via `criteria.json`.
- Sophisticated ReAct Loop: Employs a robust Thought → Tool → Observation loop to handle complex, multi-step engineering tasks.
- Multi-Provider Ecosystem: Native support for OpenAI, Anthropic, Google Gemini, Azure OpenAI, Ollama, Groq, DeepSeek, and OpenCode.
- Standardized Intelligence: Consistent reporting format regardless of the underlying LLM provider.
Architecture Overview
The system is decoupled so that the reasoning engine remains agnostic of the execution environment.
CLI / API → RunnerService → Sandbox + ReAct Agent → LLM Client Layer → Provider
↓
EvaluationRunner (baseline + context)
↓
Judge (blind scoring)
↓
DeltaReport
Configuration & Orchestration
Environment Variable Mapping
| Provider | Required Env Variables | Registry Key |
|---|---|---|
| OpenAI | `SKILL_BENCH_OPENAI_API_KEY` | `:openai` |
| Anthropic | `SKILL_BENCH_ANTHROPIC_API_KEY` | `:anthropic` |
| Gemini | `SKILL_BENCH_GEMINI_API_KEY` | `:gemini` |
| Azure | `SKILL_BENCH_AZURE_API_KEY` | `:azure` |
| Ollama | — | `:ollama` |
| Groq | `SKILL_BENCH_GROQ_API_KEY` | `:groq` |
| DeepSeek | `SKILL_BENCH_DEEPSEEK_API_KEY` | `:deepseek` |
| OpenCode | `SKILL_BENCH_OPENCODE_API_KEY`, `SKILL_BENCH_OPENCODE_BASE_URL` | `:opencode` |
Note: Environment variables are loaded automatically. You can also configure provider settings in `skill-bench.json` (created by `skill-bench init`).
OpenCode requires a custom `base_url`: OpenCode does not host a public LLM API. You must provide your own OpenAI-compatible endpoint (e.g. a LiteLLM proxy, self-hosted vLLM, or company gateway) via the `base_url` config key. Without it, the provider will fail with "Base URL not set for Opencode".
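As a quick sanity check before a run, you can confirm from Ruby that the variable for your chosen provider is exported. The mapping below simply repeats entries from the table above; keeping it in a small lookup hash is just an illustration.
```ruby
# Minimal sketch: verify the API key env var for the configured provider is set.
# Variable names come from the table above; the hash itself is illustrative.
REQUIRED_KEYS = {
  openai:    "SKILL_BENCH_OPENAI_API_KEY",
  anthropic: "SKILL_BENCH_ANTHROPIC_API_KEY",
  gemini:    "SKILL_BENCH_GEMINI_API_KEY"
}.freeze

provider = :openai
key_name = REQUIRED_KEYS.fetch(provider)
abort "Missing #{key_name}" if ENV[key_name].to_s.empty?
```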
Command Allowlist
By default, no shell commands are permitted. You must configure allowed_commands in skill-bench.json:
{
"provider": "openai",
"max_execution_time": 30,
"allowed_commands": ["rspec", "bundle", "ruby", "git"],
"config": {
"api_key": null,
"model": "gpt-4o"
}
}
Security: The agent can only execute commands on this list. Dangerous commands (bash, curl, sudo, etc.) are always blocked regardless of configuration.
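Conceptually, the check is an allowlist test on the command's first token plus a hard blocklist that always wins. The sketch below is an assumption about the shape of that check, not the gem's actual implementation:
```ruby
# Illustrative allowlist/blocklist check (not the gem's internal code).
ALWAYS_BLOCKED = %w[bash sh python curl sudo].freeze

def command_allowed?(command, allowed_commands)
  binary = command.split.first
  allowed_commands.include?(binary) && !ALWAYS_BLOCKED.include?(binary)
end

command_allowed?("rspec spec/", %w[rspec bundle ruby git])   # => true
command_allowed?("curl http://example.com", %w[curl rspec])  # => false, curl is always blocked
```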
Configuration Hierarchy
Configuration is loaded in this order (later sources override earlier ones):
- Code defaults — built-in defaults for provider, model, and timeout
- Home JSON — `~/.skill-bench.json` for user-wide settings
- Local JSON — `./skill-bench.json` for project-specific settings
- Environment variables — provider API keys and models from `ENV`
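In practice this is an ordinary layered merge where later hashes win. A minimal sketch, assuming each layer can be represented as a Hash:
```ruby
# Minimal sketch of layered configuration: later layers override earlier ones.
defaults   = { "provider" => "openai", "max_execution_time" => 30 }
home_json  = { "provider" => "anthropic" }                          # ~/.skill-bench.json
local_json = { "max_execution_time" => 300 }                        # ./skill-bench.json
env_layer  = { "api_key" => ENV["SKILL_BENCH_ANTHROPIC_API_KEY"] }.compact

merged = [defaults, home_json, local_json, env_layer].reduce(:merge)
# provider comes from the home file, the timeout from the local file,
# and the api_key from ENV when the variable is set.
```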
Getting Started
Installation
gem install ruby-skill-bench
Or add to your Gemfile:
gem 'ruby-skill-bench'
Usage: The 4-Step Flow
Each command creates specific files. Here is exactly what lands on disk after each step.
1. Initialize Configuration
skill-bench init --openai
Creates: skill-bench.json (provider configuration)
{
"provider": "openai",
"max_execution_time": 30,
"allowed_commands": ["rspec", "bundle", "ruby", "git"],
"config": {
"api_key": null,
"model": "gpt-4o"
}
}
Available providers: --openai, --anthropic, --gemini, --ollama, --azure, --groq, --deepseek, --opencode
Use --force to overwrite an existing config.
2. Create a Skill
skill-bench skill new my-service --mode=rails --template=service_object
Creates:
skills/
└── my-service/
└── SKILL.md # <- Your skill instructions go here
SKILL.md is free-form Markdown. It typically contains:
- What pattern the skill implements (e.g., "Service Object with `.call`")
- Hard rules the agent must follow
- Code examples
- Response format expectations
Example SKILL.md:
# Service Object Skill
## Pattern
All service objects use the `.call` class method and return a standardized hash:
```ruby
{ success: true, response: { data: ... } }
```

## Hard Rules
- Every `.rb` file begins with `# frozen_string_literal: true`
- Every public method has YARD docs (`@param`, `@return`, `@raise`)
- `rescue StandardError` blocks must log backtrace
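To make those rules concrete, here is a hedged sketch of a service object that would satisfy the example skill. The class name and behavior are invented for illustration; they are not part of the gem or its templates.
```ruby
# frozen_string_literal: true

# Illustrative only: a service object shaped the way the example SKILL.md asks for.
class GreetUser
  # @param name [String] the user's display name
  # @return [Hash] standardized result hash
  # @raise [ArgumentError] never propagated; errors are rescued and logged
  def self.call(name:)
    raise ArgumentError, "name required" if name.to_s.empty?

    { success: true, response: { data: "Hello, #{name}!" } }
  rescue StandardError => e
    warn(e.full_message) # hard rule: rescue blocks must log the backtrace
    { success: false, response: { error: { message: e.message } } }
  end
end
```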
Using TemplateRegistry for Rapid Eval Scaffolding
For programmatic eval creation, use SkillBench::Services::TemplateRegistry to generate scaffolding from pre-built templates. This is ideal for automating eval creation or building tools on top of SkillBench.
Basic Usage:
require 'skill_bench'
# Generate a task template for a CRUD service
task_content = SkillBench::Services::TemplateRegistry.call(
:task_md,
:crud,
skill_name: "UserCreator"
)
# Generate criteria JSON for an API client
criteria_content = SkillBench::Services::TemplateRegistry.call(:criteria_json, :api)
# Generate skill instructions for a background job
skill_content = SkillBench::Services::TemplateRegistry.call(
:skill_md,
:background_job,
skill_name: "OrderProcessor"
)
Available Template Types:
| Type | Output | Purpose |
|---|---|---|
| `task_md` | Markdown | Agent prompt with requirements |
| `criteria_json` | JSON | Scoring rules and dimensions |
| `skill_md` | Markdown | Skill instructions for the agent |
Supported Categories:
| Category | Use Case |
|---|---|
| `crud` | Service Objects with Create, Read, Update, Delete |
| `api` | API clients with authentication and error handling |
| `background_job` | ActiveJob/Sidekiq workers with retry logic |
| `controller` | RESTful controllers with strong parameters |
| `model` | ActiveRecord models with validations |
| `migration` | Database migrations with indexes |
| `concern` | ActiveSupport::Concern modules |
| `policy` | Authorization policies (Pundit-style) |
| `form_object` | Form objects with validations |
| `view_component` | ViewComponent components with previews |
Variable Interpolation:
Templates support {{variable_name}} syntax for dynamic content:
# Custom variables are interpolated into templates
task = SkillBench::Services::TemplateRegistry.call(
:task_md,
:api,
skill_name: "PaymentGateway",
endpoint: "/api/v1/payments"
)
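Under the hood, the substitution can be pictured as a single pass over the template string; the sketch below is an assumption about that mechanism, not the registry's actual implementation.
```ruby
# Hypothetical illustration of {{variable}} substitution (not the gem's internals).
def interpolate(template, variables)
  template.gsub(/\{\{(\w+)\}\}/) { variables.fetch(Regexp.last_match(1).to_sym, "") }
end

interpolate("Build {{skill_name}} against {{endpoint}}",
            skill_name: "PaymentGateway", endpoint: "/api/v1/payments")
# => "Build PaymentGateway against /api/v1/payments"
```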
Complete Workflow Example:
require 'fileutils'
require 'skill_bench'
# Define your skill name
skill_name = "OrderService"
# Generate all eval scaffolding
task_md = SkillBench::Services::TemplateRegistry.call(:task_md, :crud, skill_name: skill_name)
criteria_json = SkillBench::Services::TemplateRegistry.call(:criteria_json, :crud)
skill_md = SkillBench::Services::TemplateRegistry.call(:skill_md, :crud, skill_name: skill_name)
# Write to disk
FileUtils.mkdir_p("evals/order-service")
File.write("evals/order-service/task.md", task_md)
File.write("evals/order-service/criteria.json", criteria_json)
FileUtils.mkdir_p("skills/order-service")
File.write("skills/order-service/SKILL.md", skill_md)
puts "Eval scaffolding created for #{skill_name}!"
Note: `TemplateRegistry` is a pure function with no side effects. It returns template strings that you can customize before writing to disk.
3. Create an Eval
You have two options: manual or auto-generated.
Option A — Manual (full control):
skill-bench eval new my-first-eval --runtime=rails
Creates:
evals/
└── my-first-eval/
├── task.md # <- The task description for the agent
└── criteria.json # <- Scoring rules and dimension weights
task.md tells the agent what to build. Be specific — the agent receives this as its user prompt.
Example task.md:
Create a `UserRegistrationService` that:
1. Accepts `email` and `password`
2. Validates email format with a regex
3. Validates password length (minimum 8 characters)
4. Returns `{ success: true, response: { user_id: ... } }` on success
5. Returns `{ success: false, response: { error: { message: ... } } }` on failure
6. Includes YARD documentation for every public method
7. Includes RSpec tests that cover both success and failure paths
criteria.json tells the judge how to score the agent's output. See the Scoring Engine section for the full format.
Option B — Auto-Generated (from a skill):
skill-bench eval generate my-service --name my-first-eval
Reads skills/my-service/SKILL.md, sends it to the LLM, and auto-generates task.md + criteria.json. The generated eval is immediately validated against the same rules as manual evals.
4. Run the Eval
skill-bench run my-first-eval --skill=my-service
What happens internally:
- Resolve — Load eval (`task.md` + `criteria.json`), skill (`SKILL.md`), and provider config
- Baseline run — Agent receives `task.md` as a prompt, no skill context → produces output A
- Context run — Agent receives `task.md` + `SKILL.md` as prompt → produces output B
- Blind judging — LLM judge scores output A and output B independently across the dimensions defined in `criteria.json`
- Delta computation — Compare scores, compute deltas, apply pass/fail logic
- History recording — Store result in `.skill-bench-history.json` for trend tracking
Provider is read from skill-bench.json — no --provider flag needed.
Run with multiple skills (skill chaining):
skill-bench run my-first-eval --skill=skill-a --skill=skill-b
Both skill contexts are concatenated and sent to the agent. The judge evaluates whether the combined context improves results.
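Conceptually, chaining amounts to joining the skill files into one context block before the run. A minimal sketch under that assumption (the separator is invented):
```ruby
# Assumed behavior: chained skills become one concatenated context string.
skill_paths = ["skills/skill-a/SKILL.md", "skills/skill-b/SKILL.md"]
combined_context = skill_paths.map { |path| File.read(path) }.join("\n\n---\n\n")
# combined_context is what the agent receives alongside task.md
```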
Output Formats:
- Human-readable (default)
- JSON: `--format json`
- JUnit XML: `--format junit`
File Reference: What Lives on Disk
SkillBench creates and manages three files in your project. Understanding them helps you iterate faster.
skill-bench.json — Your Configuration
What it is: The config file you create with skill-bench init. It tells SkillBench which LLM provider to use, your API key, timeout settings, and which shell commands the agent is allowed to run.
Who edits it: You. This is the only file SkillBench expects you to write by hand.
Typical contents:
{
"provider": "openai",
"max_execution_time": 300,
"allowed_commands": ["rspec", "bundle", "ruby", "git"],
"config": {
"api_key": "sk-...",
"model": "gpt-4o",
"max_iterations": 25
}
}
Key rules:
- Configuration is loaded in this order: code defaults → `~/.skill-bench.json` (user-wide) → `./skill-bench.json` (local) → environment variables. Later sources override earlier ones.
- If `api_key` is `null`, SkillBench looks for the matching environment variable (e.g. `SKILL_BENCH_OPENAI_API_KEY`).
- `allowed_commands` is a safeguard, not a convenience. By default the agent cannot run any shell command. Add only what your evals need.
.skill-bench-history.json — Evaluation History (Auto-Generated)
What it is: A JSON array that records every successful eval run. SkillBench appends to it automatically. It stores the timestamp, eval name, skill names, scores, and deltas so you can track improvement over time.
Who edits it: Nobody. SkillBench writes it; you read it. If you delete it, you lose your trend data.
Example entry:
[
{
"timestamp": "2026-05-12T10:30:00Z",
"eval_name": "my-first-eval",
"skill_names": ["my-service"],
"verdict": true,
"baseline_total": 32,
"context_total": 87,
"deltas": {
"correctness": 16,
"skill_adherence": 17,
"code_quality": 6,
"test_coverage": 10,
"documentation": 6
}
}
]
Why it matters: This file powers the TREND line you see in human-readable output:
TREND: baseline ↑ (+2), context ↑ (+7)
The trend compares the current run against the previous run of the same eval + skill. This tells you at a glance whether your latest skill edit made things better or worse.
Pro tip: Commit .skill-bench-history.json to git if you want to share trend data with your team. Add it to .gitignore if you prefer to keep scores private.
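If you want to reproduce the TREND comparison yourself, the history format shown above is enough. A short, hedged sketch (the eval and skill names are the ones used throughout this guide):
```ruby
require "json"

# Sketch: recompute the TREND line from the last two runs of the same eval + skill.
history = JSON.parse(File.read(".skill-bench-history.json"))
runs = history.select do |entry|
  entry["eval_name"] == "my-first-eval" && entry["skill_names"] == ["my-service"]
end

previous, current = runs.last(2)
if previous && current
  puts "baseline trend: #{current['baseline_total'] - previous['baseline_total']}"
  puts "context trend:  #{current['context_total'] - previous['context_total']}"
end
```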
.skill-bench-history.json.bak — Backup (Auto-Generated)
What it is: A copy of .skill-bench-history.json created every time SkillBench writes a new entry. If the main file gets corrupted (e.g. you kill the process mid-write), SkillBench automatically falls back to the .bak file.
Who edits it: Nobody. It is a safety net.
When to care: Almost never. If you see a "History file corrupted" warning, SkillBench has already recovered from the .bak for you.
Iterating on Skills: A Practical Workflow
Writing a good skill is rarely a one-shot process. Here is a tested workflow that uses the history file to guide your improvements.
Step 1: Write a V1 Skill
Create a skill and an eval that exercises it:
skill-bench skill new my-service --mode=rails --template=service_object
skill-bench eval new my-first-eval --runtime=rails
# ... edit SKILL.md, task.md, and criteria.json ...
Step 2: Run the Eval (Baseline + Context)
skill-bench run my-first-eval --skill=my-service
This executes the full evaluation pipeline: a baseline run (agent receives the task without the skill) and a context run (agent receives the task with the skill). The two outputs are scored independently by the judge and compared.
Read the output carefully. Look at two things:
- Verdict: Did it pass? If not, which dimension failed?
- Delta: Which dimensions improved the most? Which improved the least?
Step 3: Inspect the History
cat .skill-bench-history.json | jq '.[-1]'
This shows the latest entry. Focus on the dimension with the smallest delta — that is where your skill is weakest.
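If you prefer Ruby to jq, the same inspection takes a few lines; the field names match the example history entry shown earlier.
```ruby
require "json"

# Find the dimension with the smallest delta in the latest history entry.
latest  = JSON.parse(File.read(".skill-bench-history.json")).last
weakest = latest["deltas"].min_by { |_dimension, delta| delta }

puts "Weakest dimension: #{weakest[0]} (delta #{weakest[1]})"
```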
Step 4: Edit the Skill
Suppose test_coverage only improved by +3. Open skills/my-service/SKILL.md and add a concrete rule:
## Hard Rules
... existing rules ...
5. Every service must include RSpec tests with at least:
- One happy-path test
- One error-path test
- Use of `let` and `subject` blocks
Step 5: Re-run and Compare Trends
skill-bench run my-first-eval --skill=my-service
Watch the TREND line:
TREND: baseline → (0), context ↑ (+5)
The context score went up by 5 points compared to the previous run. If test_coverage delta jumped from +3 to +8, your edit worked.
Step 6: Iterate Until Stable
Repeat steps 4-5 until:
- The eval passes consistently (2-3 runs in a row)
- Deltas are stable (not swinging wildly)
- The trend line shows `context → (0)` or small positive deltas
When to Stop Iterating
| Situation | Action |
|---|---|
| Context score is ~95+ and deltas are flat | Your skill is mature. Move on. |
| Context score is stuck below threshold | Your eval task might be too hard, or your skill rules are too vague. Rewrite task.md with clearer acceptance criteria. |
| Baseline score is already high | The task is too easy. Make task.md harder so the skill has room to show value. |
| One dimension is always low | Add a specific rule to SKILL.md targeting that dimension. |
Scoring Engine
The engine runs every eval twice — once without skill context (baseline) and once with skill context — then uses an LLM judge to score both outputs independently across configurable dimensions.
How It Works (Visual Walkthrough)
┌────────────────────────────────────────────────────────────────────────┐
│ EVALUATION PIPELINE │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ Step 1: Baseline Run │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ task.md │───→│ Agent │───→│ Output A │ │
│ └─────────────┘ │ (no skill) │ │ (git diff) │ │
│ └─────────────┘ └─────────────┘ │
│ │
│ Step 2: Context Run │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ task.md │───→│ Agent │───→│ Output B │ │
│ │ SKILL.md │───→│ (+ skill) │───→│ (git diff) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Step 3: Blind Judging (two independent calls) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Output A │───→│ Judge │───→│ Score A │ │
│ │ criteria │ │ (baseline) │ │ per dim │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Output B │───→│ Judge │───→│ Score B │ │
│ │ criteria │ │ (context) │ │ per dim │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Step 4: Verdict │
│ Delta = Score B - Score A │
│ Pass if: Score B >= pass_threshold AND Delta >= minimum_delta │
│ │
└────────────────────────────────────────────────────────────────────────┘
Key principle: The judge never sees both outputs in the same call. This eliminates "halo effect" bias — the judge scores each output on its own merits, not by direct comparison.
Canonical Dimensions
These 5 dimensions are mandatory in every criteria.json. You can add custom dimensions beyond these, but you cannot remove any of the core 5.
| Dimension | Default Description | Typical Weight |
|---|---|---|
| Correctness | Does the output fulfill the task requirements? Are all specified behaviors present and correct? | 25-35 |
| Skill Adherence | Did the agent follow the specific patterns, hard gates, and workflows defined in the skill? | 20-30 |
| Code Quality | Is the code clean, well-structured, free of smells, follows SRP, and avoids duplication? | 15-25 |
| Test Coverage | Are there meaningful tests? Do they test the right things? Do they follow TDD and best practices? | 10-20 |
| Documentation | Is there adequate YARD documentation, clear intent, and helpful inline comments where needed? | 5-15 |
Why these weights? Correctness and Skill Adherence are usually the highest because they directly measure "did the agent do the right thing" and "did the skill help." Test Coverage and Documentation are lower because they are supporting qualities.
criteria.json Format
{
"context": "Evaluate whether the skill helps build a proper API REST collection",
"dimensions": [
{ "name": "correctness", "max_score": 30 },
{ "name": "skill_adherence", "max_score": 25 },
{ "name": "code_quality", "max_score": 20 },
{ "name": "test_coverage", "max_score": 15 },
{ "name": "documentation", "max_score": 10 }
],
"pass_threshold": 70,
"minimum_delta": 10
}
Field-by-field breakdown:
| Field | Type | Required | Description |
|---|---|---|---|
| `context` | string | Yes | Human-readable description of what this eval measures. Shown in the judge prompt. |
| `dimensions` | array | Yes | List of dimension objects. Must include all 5 canonical dimensions. Each needs `name` and `max_score`. `max_score` values must sum to exactly 100. |
| `pass_threshold` | integer | No | Minimum total context score (0-100) to pass. Default: 70. |
| `minimum_delta` | integer | No | Minimum total improvement (context - baseline) required to pass. Default: 10. |
Rules:
- Sum to 100: `dimensions` `max_score` values must sum to exactly 100. The engine rejects any eval where they don't.
- All 5 core dimensions required: You cannot omit `correctness`, `skill_adherence`, `code_quality`, `test_coverage`, or `documentation`.
- Custom dimensions allowed: You can add dimensions beyond the core 5. Their `max_score` values still count toward the 100 total.
- Pass/fail logic: Both conditions must be true (see the sketch below):
  - `context_total >= pass_threshold` (the agent with skill scored high enough)
  - `total_delta >= minimum_delta` (the skill made a meaningful difference)
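A sketch of that verdict logic, using the per-dimension scores from the example output later in this section (illustration only, not the engine's code):
```ruby
# Illustrative verdict computation following the pass/fail rules above.
def verdict(baseline_scores, context_scores, pass_threshold: 70, minimum_delta: 10)
  context_total = context_scores.values.sum
  total_delta   = context_total - baseline_scores.values.sum

  context_total >= pass_threshold && total_delta >= minimum_delta
end

baseline = { correctness: 12, skill_adherence: 5,  code_quality: 10, test_coverage: 3,  documentation: 2 }
context  = { correctness: 28, skill_adherence: 22, code_quality: 16, test_coverage: 13, documentation: 8 }

verdict(baseline, context) # => true (87 >= 70 and delta +55 >= 10)
```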
Example with custom dimension descriptions:
{
"context": "Evaluate REST API collection skill",
"dimensions": [
{ "name": "correctness", "max_score": 30 },
{ "name": "skill_adherence", "max_score": 25, "description": "Did the agent use the `.call` pattern and return the standardized hash?" },
{ "name": "code_quality", "max_score": 20 },
{ "name": "test_coverage", "max_score": 15 },
{ "name": "documentation", "max_score": 10 }
],
"pass_threshold": 70,
"minimum_delta": 10
}
Example with a custom dimension (6 total, still summing to 100):
{
"context": "Evaluate with performance considerations",
"dimensions": [
{ "name": "correctness", "max_score": 25 },
{ "name": "skill_adherence", "max_score": 20 },
{ "name": "code_quality", "max_score": 15 },
{ "name": "test_coverage", "max_score": 15 },
{ "name": "documentation", "max_score": 10 },
{ "name": "performance", "max_score": 15, "description": "Is the solution performant? Are N+1 queries avoided?" }
],
"pass_threshold": 70,
"minimum_delta": 10
}
Understanding the Output
Human-readable format:
═══════════════════════════════════════════════════════
Eval: my-first-eval
Skill: my-service
Provider: openai
═══════════════════════════════════════════════════════
=== BASELINE ITERATIONS ===
Step 1: Read task → Tool: read_file → Observation: content...
Step 2: Plan changes → Tool: write_file → Observation: Success...
Step 3: Run tests → Tool: run_command → Observation: 3 runs, 0 failures
Step 4: Final answer
=== CONTEXT ITERATIONS ===
Step 1: Read task → Tool: read_file → Observation: content...
Step 2: Apply skill pattern → Tool: write_file, run_command → Observation: Success...
Step 3: Final answer
DIMENSION BASELINE CONTEXT DELTA
──────────────────────── ───────── ───────── ───────
Correctness (30) 12 28 +16
Skill Adherence (25) 5 22 +17
Code Quality (20) 10 16 +6
Test Coverage (15) 3 13 +10
Documentation (10) 2 8 +6
──────────────────────── ───────── ───────── ───────
TOTAL 32/100 87/100 +55
TREND: baseline ↑ (+2), context ↑ (+7)
VERDICT: PASS (threshold: 70, minimum delta: 10)
═══════════════════════════════════════════════════════
=== WHAT WENT WELL ===
Correctness (28/30, baseline: 12/30)
The agent correctly implemented all required behaviors.
Skill Adherence (22/25, baseline: 5/25)
Followed the service object pattern and hard gates.
=== WHAT WENT WRONG ===
Test Coverage (13/15, baseline: 3/15)
Tests exist but edge cases are missing.
Advice: Are there meaningful tests? Do they test the right things?
What each column means:
- BASELINE: The agent's score without the skill. This is the "unaided" performance.
- CONTEXT: The agent's score with the skill. This is the "aided" performance.
- DELTA: `CONTEXT - BASELINE`. How much the skill helped.
- TOTAL: Sum of all dimension scores. Max possible is 100.
- TREND: Comparison against the previous run of the same eval + skill (from `.skill-bench-history.json`). Shows whether scores are improving over time.
- VERDICT: `PASS` only if `CONTEXT >= pass_threshold` AND `DELTA >= minimum_delta`.
Iteration timeline:
Each run (baseline and context) shows the ReAct loop steps the agent took: thinking, calling tools, and observing results. This helps you understand how the agent worked through the task. Observations are truncated to keep the output readable. If the timeline is empty, the agent finished in a single LLM call without using tools.
Feedback sections:
- WHAT WENT WELL — Dimensions where the context score is ≥ 80% of the max, with the judge's reasoning. These are the strengths of your skill.
- WHAT WENT WRONG — Dimensions where the context score is < 80% of the max, with the judge's reasoning and the baseline score for comparison. These are where your skill needs work.
- ADVICE — Each low-scoring dimension shows its description from `criteria.json` as actionable guidance. If the description is empty, no advice line appears.
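The 80% cutoff that drives those two sections can be expressed in a few lines; the scores below are hypothetical and only illustrate the bucketing.
```ruby
# Hypothetical example of splitting dimensions by the 80%-of-max cutoff.
scores = { "correctness" => [28, 30], "documentation" => [6, 10] } # [context score, max_score]

went_well, went_wrong = scores.partition { |_name, (score, max)| score >= 0.8 * max }

went_well.map(&:first)  # => ["correctness"]
went_wrong.map(&:first) # => ["documentation"]
```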
Verdict Decision Matrix
Your eval result depends on both conditions. Here is every scenario:
| Context Score | Delta | Verdict | Why |
|---|---|---|---|
| 87 | +55 | PASS | Context >= 70 and delta >= 10. The skill helped a lot. |
| 87 | -2 | FAIL | Context >= 70 but delta < 10. The skill made things worse. |
| 65 | +15 | FAIL | Context < 70 even though delta >= 10. Absolute score too low. |
| 65 | +5 | FAIL | Context < 70 and delta < 10. Both conditions failed. |
Negative delta means the skill hurt performance. If baseline=89 and context=87, your skill confused the agent or added noise. This is the most common "unexpected FAIL" — the skill reads well to humans but backfires with the LLM.
FAIL example — skill made things worse:
DIMENSION BASELINE CONTEXT DELTA
──────────────────────── ───────── ───────── ───────
Correctness (30) 28 25 -3
Skill Adherence (25) 23 22 -1
Code Quality (20) 18 18 +0
Test Coverage (15) 12 13 +1
Documentation (10) 8 9 +1
──────────────────────── ───────── ───────── ───────
TOTAL 89/100 87/100 -2
VERDICT: FAIL (threshold: 70, minimum delta: 10)
Why this FAILs: Context score (87) is above the threshold (70), but the delta is negative (-2). The agent scored 89 without the skill and only 87 with it. The skill actively hurt performance. Common causes:
- Skill is too long or contradictory — the agent ignores the task to follow the skill
- Skill prescribes patterns that conflict with the task requirements
- Skill adds boilerplate that the judge penalizes (over-engineering)
Fix: Remove rules that don't directly improve the dimension with the lowest delta. Shorter skills usually beat longer ones.
Reliability & Security
- Safe-by-Design: No code execution occurs on the host system; everything happens in the sandbox.
- Command Blocklist: Dangerous commands (`bash`, `sh`, `python`, `curl`, etc.) are always blocked, even if listed in `allowed_commands`.
- Path Validation: Eval paths are validated to prevent directory traversal attacks.
- Atomic History Writes: Benchmark history uses file locking to prevent corruption from concurrent writes.
- URL Sanitization: All provider URL parameters are CGI-escaped to prevent injection.
- YAML Safety: Config loading uses `permitted_classes: []` to prevent symbol DoS attacks.
- Traceability: Every thought and tool call is logged with full backtrace for post-mortem analysis.
- Robust Error Recovery: Handles provider outages and rate limits gracefully with standardized error logging.
- XML-Safe Output: JUnit XML output is properly escaped to prevent injection attacks.
- Test Coverage: 373+ tests covering core engine, CLI commands, and all provider clients.
Testing
The project uses Minitest with WebMock for HTTP stubbing.
# Run all tests
bundle exec rake test
# Run with coverage
bundle exec rake test COVERAGE=true
# Run specific test file
bundle exec ruby -Itest test/integration_test.rb
Test Structure:
- `test/evaluator/` — Core evaluation engine tests
- `test/agent_eval/` — CLI, models, and service tests
- `test/clients/` — Provider client tests
CI/CD Integration
GitHub Actions workflow included (.github/workflows/ci.yml):
- Runs on push and pull requests
- Tests against Ruby 3.3 and 3.4
- Executes rubocop, reek, and minitest
- Outputs JUnit XML for test reporting
# Run locally with CI output
skill-bench run my-eval --skill=my-skill --format json
License
The gem is available as open source under the terms of the MIT License.