rspec-llm
RSpec matchers, helpers, and a thin DSL for testing LLM-backed code in Ruby.
rspec-llm provides:
- Custom matchers for asserting on LLM outputs:
  - pass_llm_judge("criterion"): LLM-as-judge boolean evaluation
  - match_llm_intent("the response is a polite apology"): judge framed as intent matching
  - match_json_schema(schema): JSON Schema validation
  - be_semantically_similar_to("reference text"): cosine similarity over embeddings
- Adapters for the ruby_llm and langchainrb gems, plus a programmable in-memory fake for hermetic unit specs.
- A describe_llm / evaluate DSL for running batches of prompt → expectation pairs.
- VCR-friendly: deterministic playback is delegated to your existing VCR config (see VCR pattern).
Installation
bundle add rspec-llm
Install whichever of ruby_llm or langchainrb you intend to use (or both). rspec-llm does not pull in either as a hard dependency.
bundle add ruby_llm # if you use ruby_llm
bundle add langchainrb # if you use langchainrb
Configuration
Configure once in spec/spec_helper.rb:
require "rspec/llm"
require "ruby_llm"
RSpec::LLM.configure do |c|
  c.client = RubyLLM.chat(model: "claude-sonnet-4-6")
  c.judge = RubyLLM.chat(model: "claude-haiku-4-5")
  c.embedder = ->(text) { RubyLLM.embed(text).vectors }
  c.similarity_threshold = 0.8
end
With langchainrb it looks like:
require "rspec/llm"
require "langchain"
llm = Langchain::LLM::OpenAI.new(api_key: ENV.fetch("OPENAI_API_KEY"))
RSpec::LLM.configure do |c|
  c.client = llm
  c.judge = llm
  c.embedder = ->(text) { llm.embed(text: text).embedding }
end
Matchers
RSpec.describe "Summarizer" do
  let(:response) { MyApp.summarize(article) }

  it { expect(response).to pass_llm_judge("contains a single-sentence summary") }
  it { expect(response).to match_llm_intent("a summary of the article") }
  it { expect(response).to match_json_schema(MyApp::SUMMARY_SCHEMA) }
  it { expect(response).to be_semantically_similar_to(expected_gist).within(0.85) }
end
pass_llm_judge
Sends the response and a criterion to the configured judge model. When the ruby_llm gem is loaded, the judge uses a RubyLLM::Schema-backed structured contract that forces the model to return { passed: boolean, reason: string }. This avoids brittle first-token YES/NO parsing and lets the failure message include the judge's stated reason. For all other adapters, the matcher transparently falls back to parsing the first YES/NO token.
expect(reply).to pass_llm_judge("is polite and apologetic")
To use a different judge for a single matcher, chain .using(some_client).
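The two parsing paths described above can be sketched in plain Ruby. This is a minimal illustration, not the gem's implementation: parse_verdict is a hypothetical helper name, and the exact rules (such as ignoring punctuation after the YES/NO token) are assumptions.

```ruby
require "json"

# Hypothetical helper (not part of rspec-llm): turn a raw judge reply into
# { passed: true/false, reason: "..." }.
def parse_verdict(raw)
  text = raw.to_s.strip

  # Structured path: a JSON object with a boolean "passed" key.
  begin
    data = JSON.parse(text)
    if data.is_a?(Hash) && [true, false].include?(data["passed"])
      return { passed: data["passed"], reason: data["reason"].to_s }
    end
  rescue JSON::ParserError
    # Not JSON: fall through to the text fallback.
  end

  # Fallback path: the first whitespace-delimited token must be YES or NO
  # (case-insensitive, trailing punctuation ignored); the rest is the reason.
  token, rest = text.split(/\s+/, 2)
  case token.to_s.upcase.delete("^A-Z")
  when "YES" then { passed: true, reason: rest.to_s.strip }
  when "NO"  then { passed: false, reason: rest.to_s.strip }
  else raise ArgumentError, "unparseable judge reply: #{text.inspect}"
  end
end

parse_verdict('{"passed":true,"reason":"Looks good to me."}')
# => {passed: true, reason: "Looks good to me."}
parse_verdict("YES\nLooks good to me.")
# => {passed: true, reason: "Looks good to me."}
```

Trying the structured form first means a reply that happens to begin with "YES" inside a JSON string is never misread by the token fallback.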
match_llm_intent
Same machinery as pass_llm_judge, framed as intent matching — useful when the "criterion" is naturally a description of what the response should say.
expect(reply).to match_llm_intent("a refund confirmation for order #12345")
match_json_schema
Parses the actual value as JSON (or accepts a Hash/Array directly) and validates against a schema via the json-schema gem. The schema argument can be:
A raw JSON Schema hash (original behaviour — fully backward-compatible):
schema = {
  "type" => "object",
  "required" => ["summary"],
  "properties" => { "summary" => { "type" => "string" } }
}
expect(response).to match_json_schema(schema)
A Ruby class — Data.define, Struct, or any PORO with attr_accessor. The matcher introspects the class and derives the required fields automatically:
# Data.define (Ruby >= 3.2)
UserProfile = Data.define(:full_name, :verified_email)
expect(response).to match_json_schema(UserProfile)
# Struct
Point = Struct.new(:x, :y)
expect(response).to match_json_schema(Point)
# PORO
class OrderSummary
  attr_accessor :order_id, :total, :status
end
expect(response).to match_json_schema(OrderSummary)
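To make the class introspection concrete, here is one way required fields could be derived from a class. It is a sketch under assumptions: schema_from_class is a hypothetical name, the gem's actual derivation may differ, and leaving every property untyped is a choice made here because a bare field list carries no type information.

```ruby
# Hypothetical helper (not the gem's API): build a JSON Schema hash whose
# required fields mirror the attributes of a Ruby class.
def schema_from_class(klass)
  fields =
    if klass.respond_to?(:members)
      # Struct and Data.define classes list their fields via .members.
      klass.members.map(&:to_s)
    else
      # PORO: treat each public setter (e.g. from attr_accessor) as a field.
      # Sorted so the result does not depend on method-table order.
      klass.public_instance_methods(false)
           .map(&:to_s)
           .grep(/\A\w+=\z/)
           .map { |name| name.chomp("=") }
           .sort
    end

  {
    "type" => "object",
    "required" => fields,
    "properties" => fields.to_h { |name| [name, {}] } # untyped on purpose
  }
end

Point = Struct.new(:x, :y)

class OrderSummary
  attr_accessor :order_id, :total, :status
end

schema_from_class(Point)["required"]        # => ["x", "y"]
schema_from_class(OrderSummary)["required"] # => ["order_id", "status", "total"]
```

The resulting hash has the same shape as a raw JSON Schema hash, so in this sketch both argument styles could share one validation path.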
be_semantically_similar_to
Embeds both sides via the configured embedder, computes cosine similarity, and compares to the threshold. Override the threshold per-matcher with .within(0.9).
expect(response).to be_semantically_similar_to("the cat sat on the mat").within(0.9)
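The comparison itself is ordinary cosine similarity. The sketch below shows the arithmetic in plain Ruby; the helper names, the zero-vector handling, and the use of >= against the threshold are assumptions rather than documented behaviour.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero? # assumption: zero vectors never match

  dot / (norm_a * norm_b)
end

# Hypothetical check: embed both texts with any callable embedder, then
# compare the score to the threshold (>= is an assumption).
def similar?(embedder, actual, expected, threshold: 0.8)
  cosine_similarity(embedder.call(actual), embedder.call(expected)) >= threshold
end

cosine_similarity([1, 0], [1, 0]) # => 1.0
cosine_similarity([1, 0], [0, 1]) # => 0.0
```

Because the score depends entirely on the embedding model, a threshold tuned for one embedder may be too strict or too loose for another.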
Fake adapter
For fast, hermetic unit tests, stub the client with the built-in fake:
RSpec.describe "Greeter" do
  it "returns the canned greeting" do
    stub_llm do |fake|
      fake.respond_to("Say hi to Alice").with("Hi, Alice!")
      fake.respond_to_pattern(/^Say hi to/).with { |prompt| "Hi! (from #{prompt})" }
      fake.default("…")
    end

    expect(MyApp.greet("Alice")).to eq("Hi, Alice!")
  end
end
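For intuition about how such a fake can resolve a prompt, here is a small stand-alone sketch. FakeLLMSketch is a hypothetical class, not the gem's fake, and "first registered rule wins" is an assumption about precedence.

```ruby
# Hypothetical stand-in for a programmable fake; not the gem's actual class.
class FakeLLMSketch
  # Small builder so rules read as respond_to(...).with(...).
  Rule = Struct.new(:matcher, :responder) do
    def with(text = nil, &block)
      self.responder = block || ->(_prompt) { text }
      self
    end
  end

  def initialize
    @rules = []
    @default = nil
  end

  # Rule that fires only on an exact prompt match.
  def respond_to(prompt)
    add_rule(->(candidate) { candidate == prompt })
  end

  # Rule that fires when the prompt matches a Regexp.
  def respond_to_pattern(pattern)
    add_rule(->(candidate) { pattern.match?(candidate) })
  end

  # Response used when no rule matches.
  def default(text)
    @default = text
  end

  # Assumption: rules are tried in registration order; first match wins.
  def call(prompt)
    rule = @rules.find { |r| r.matcher.call(prompt) }
    rule ? rule.responder.call(prompt) : @default
  end

  private

  def add_rule(matcher)
    Rule.new(matcher, ->(_prompt) { nil }).tap { |rule| @rules << rule }
  end
end

fake = FakeLLMSketch.new
fake.respond_to("Say hi to Alice").with("Hi, Alice!")
fake.respond_to_pattern(/^Say hi to/).with { |prompt| "Hi! (from #{prompt})" }
fake.default("(no match)")

fake.call("Say hi to Alice")  # => "Hi, Alice!"
fake.call("Say hi to Bob")    # => "Hi! (from Say hi to Bob)"
fake.call("What time is it?") # => "(no match)"
```

Because no network call is involved, specs built on a fake like this stay fast and deterministic.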
Use stub_llm_judge to stub the judge model separately — handy when testing your own code that wraps pass_llm_judge. Stub with a JSON string to exercise the structured-output path, or with a YES/NO string to exercise the text-parsing fallback:
# Structured output (recommended; matches real ruby_llm behaviour)
stub_llm_judge do |fake|
  fake.default('{"passed":true,"reason":"Looks good to me."}')
end

# Legacy text format (still works for backward compatibility)
stub_llm_judge do |fake|
  fake.default("YES\nLooks good to me.")
end
DSL
describe_llm is a thin wrapper around RSpec.describe that adds an evaluate group helper:
RSpec.describe_llm "Summarizer evals" do
  evaluate "single sentence",
    prompt: "Summarize: #{ARTICLE}",
    expect: [
      pass_llm_judge("is one sentence"),
      match_llm_intent("a summary of the article")
    ]
end
Each evaluate call defines one RSpec example: it calls the configured client with the prompt and applies every matcher in expect:.
VCR pattern
For integration tests against real APIs, record once and replay deterministically using VCR + WebMock:
# spec/spec_helper.rb
require "vcr"
VCR.configure do |c|
  c.cassette_library_dir = "spec/cassettes"
  c.hook_into :webmock
  c.filter_sensitive_data("<OPENAI_API_KEY>") { ENV["OPENAI_API_KEY"] }
  c.configure_rspec_metadata! # lets vcr: true metadata select a cassette
end

RSpec.describe "Summarizer", vcr: true do
  it "summarizes" do
    expect(MyApp.summarize(article)).to pass_llm_judge("is concise")
  end
end
Development
bin/setup # install deps
bundle exec rspec # run tests
bundle exec rubocop # lint
bin/console # interactive prompt
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/washu/rspec-llm.
License
MIT — see LICENSE.txt.