rspec-llm

RSpec matchers, helpers, and a thin DSL for testing LLM-backed code in Ruby.

rspec-llm provides:

- Custom matchers for asserting on LLM outputs:
  - pass_llm_judge("criterion"): LLM-as-judge boolean evaluation
  - match_llm_intent("the response is a polite apology"): judge framed as intent matching
  - match_json_schema(schema): JSON Schema validation
  - be_semantically_similar_to("reference text"): cosine similarity over embeddings
- Adapters for the ruby_llm and langchainrb gems, plus a programmable in-memory fake for hermetic unit specs.
- A describe_llm / evaluate DSL for running batches of prompt → expectation pairs.
- VCR-friendly: deterministic playback is delegated to your existing VCR config (see VCR pattern).

Installation

bundle add rspec-llm

Install ruby_llm, langchainrb, or both, depending on which adapter you intend to use. rspec-llm does not declare either as a hard dependency.

bundle add ruby_llm        # if you use ruby_llm
bundle add langchainrb     # if you use langchainrb

Configuration

Configure once in spec/spec_helper.rb:

require "rspec/llm"
require "ruby_llm"

RSpec::LLM.configure do |c|
  c.client   = RubyLLM.chat(model: "claude-sonnet-4-6")
  c.judge    = RubyLLM.chat(model: "claude-haiku-4-5")
  c.embedder = ->(text) { RubyLLM.embed(text).vectors }
  c.similarity_threshold = 0.8
end

With langchainrb, the equivalent configuration is:

require "rspec/llm"
require "langchain"

llm = Langchain::LLM::OpenAI.new(api_key: ENV.fetch("OPENAI_API_KEY"))

RSpec::LLM.configure do |c|
  c.client   = llm
  c.judge    = llm
  c.embedder = ->(text) { llm.embed(text: text).embedding }
end

Matchers

RSpec.describe "Summarizer" do
  let(:response) { MyApp.summarize(article) }

  it { expect(response).to pass_llm_judge("contains a single-sentence summary") }
  it { expect(response).to match_llm_intent("a summary of the article") }
  it { expect(response).to match_json_schema(MyApp::SUMMARY_SCHEMA) }
  it { expect(response).to be_semantically_similar_to(expected_gist).within(0.85) }
end

`pass_llm_judge`

Sends the response and a criterion to the configured judge model and parses a YES/NO verdict from the first token of the reply. The judge's reasoning is surfaced in the failure message.

expect(reply).to pass_llm_judge("is polite and apologetic")

To use a different judge for a single matcher, chain .using(some_client).
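
As a rough illustration of the first-token verdict parsing described above, here is a minimal plain-Ruby sketch. The parse_verdict helper and the Verdict struct are hypothetical names, not part of rspec-llm's API, and the tolerance for lowercase or punctuated verdicts is an assumption.

```ruby
# Hypothetical helper for illustration; not rspec-llm's actual implementation.
Verdict = Struct.new(:passed, :reasoning)

def parse_verdict(reply)
  # The first whitespace-delimited token is the verdict; the rest is reasoning.
  first, rest = reply.to_s.strip.split(/\s+/, 2)
  # Assumption: tolerate variants such as "Yes," or "no." by keeping only letters.
  token = first.to_s.upcase.delete("^A-Z")
  unless %w[YES NO].include?(token)
    raise ArgumentError, "no YES/NO verdict in #{reply.inspect}"
  end

  Verdict.new(token == "YES", rest.to_s.strip)
end

verdict = parse_verdict("YES\nLooks good to me.")
puts verdict.passed
puts verdict.reasoning
```

Raising on an unparseable reply, rather than treating it as NO, keeps a misbehaving judge from looking like a legitimate failure of the code under test.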

`match_llm_intent`

Same machinery as pass_llm_judge, framed as intent matching — useful when the "criterion" is naturally a description of what the response should say.

expect(reply).to match_llm_intent("a refund confirmation for order #12345")

`match_json_schema`

Parses the actual value as JSON (or accepts a Hash/Array directly) and validates against the given JSON Schema via the json-schema gem.

schema = {
  "type" => "object",
  "required" => ["summary"],
  "properties" => { "summary" => { "type" => "string" } }
}
expect(response).to match_json_schema(schema)
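
The "parse as JSON or accept a Hash/Array directly" step can be sketched as below. normalize_json_actual is a hypothetical name chosen for this example; the schema validation itself, which the matcher delegates to the json-schema gem, is left out so the snippet only needs the standard library.

```ruby
require "json"

# Hypothetical helper for illustration; not rspec-llm's actual implementation.
# Returns the value that would be handed to the schema validator.
def normalize_json_actual(actual)
  case actual
  when Hash, Array
    actual                # already structured data, use as-is
  when String
    JSON.parse(actual)    # raises JSON::ParserError on malformed input
  else
    raise ArgumentError, "expected a JSON String, Hash, or Array, got #{actual.class}"
  end
end

p normalize_json_actual('{"summary": "ok"}')
p normalize_json_actual({ "summary" => "ok" })
```

Note that JSON.parse yields string keys, which matches the string-keyed schema shown above.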

`be_semantically_similar_to`

Embeds both sides via the configured embedder, computes their cosine similarity, and compares the result against the configured similarity_threshold. Override the threshold for a single matcher with .within(0.9).

expect(response).to be_semantically_similar_to("the cat sat on the mat").within(0.9)
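
For reference, cosine similarity over two embedding vectors can be computed as in the sketch below. The cosine_similarity name is hypothetical, and returning 0.0 for a zero-length vector is an assumption made for this example, not documented rspec-llm behavior.

```ruby
# Hypothetical helper for illustration; not rspec-llm's actual implementation.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  # Assumption: treat a zero vector as dissimilar to everything.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

score = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Assumption: the matcher passes when the score is at or above the threshold.
puts score >= 0.9
```

Because the score depends only on the angle between vectors, two texts of very different lengths can still score close to 1.0 if their embeddings point the same way.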

Fake adapter

For fast, hermetic unit tests, stub the client with the built-in fake:

RSpec.describe "Greeter" do
  it "returns the canned greeting" do
    stub_llm do |fake|
      fake.respond_to("Say hi to Alice").with("Hi, Alice!")
      fake.respond_to_pattern(/^Say hi to/).with { |prompt| "Hi! (from #{prompt})" }
      fake.default("…")
    end

    expect(MyApp.greet("Alice")).to eq("Hi, Alice!")
  end
end
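
To make the idea of a programmable in-memory fake concrete, here is a small standalone sketch. SketchFake and its on / on_pattern / call methods are hypothetical and deliberately named differently from the real fake; the lookup order (exact prompt, then patterns in registration order, then default) is an assumption, not documented behavior.

```ruby
# Hypothetical stand-in for illustration; the real fake's internals may differ.
class SketchFake
  def initialize
    @exact = {}      # prompt string => canned reply
    @patterns = []   # [regex, block] pairs, checked in registration order
    @default = nil
  end

  def on(prompt, reply)
    @exact[prompt] = reply
    self
  end

  def on_pattern(regex, &block)
    @patterns << [regex, block]
    self
  end

  def default(reply)
    @default = reply
    self
  end

  # Assumed resolution order: exact match, first matching pattern, default.
  def call(prompt)
    return @exact[prompt] if @exact.key?(prompt)

    _, block = @patterns.find { |regex, _| regex.match?(prompt) }
    return block.call(prompt) if block

    @default || raise(KeyError, "no canned reply for #{prompt.inspect}")
  end
end

fake = SketchFake.new
fake.on("Say hi to Alice", "Hi, Alice!")
fake.on_pattern(/^Say hi to/) { |prompt| "Hi! (from #{prompt})" }
puts fake.call("Say hi to Alice")
puts fake.call("Say hi to Bob")
```

No network access or API key is involved, which is what keeps specs built on a fake like this fast and hermetic.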

Use stub_llm_judge to stub the judge model separately — handy when testing your own code that wraps pass_llm_judge:

stub_llm_judge do |fake|
  fake.default("YES\nLooks good to me.")
end

DSL

describe_llm is a thin wrapper around RSpec.describe that adds an evaluate group helper:

RSpec.describe_llm "Summarizer evals" do
  evaluate "single sentence",
    prompt: "Summarize: #{ARTICLE}",
    expect: [
      pass_llm_judge("is one sentence"),
      match_llm_intent("a summary of the article")
    ]
end

Each evaluate call defines one RSpec example: it calls the configured client with the prompt and applies every matcher in expect:.
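
The flow of a single evaluate call (one client call, then every expectation applied to the same response) can be modeled in plain Ruby as below. run_evaluation is a hypothetical name, and lambdas stand in for real matchers. Whether the real helper stops at the first failing matcher or aggregates failures is not stated here; this sketch collects all of them.

```ruby
# Hypothetical plain-Ruby model of one evaluate call; not rspec-llm's implementation.
# client:       anything responding to #call(prompt) and returning a String
# expectations: Hash of name => lambda(response) returning true or false
def run_evaluation(client:, prompt:, expectations:)
  response = client.call(prompt) # the client is called once per evaluation
  failed = expectations.reject { |_name, check| check.call(response) }
  { response: response, failed: failed.keys }
end

echo_client = ->(prompt) { "Echo: #{prompt}" }
result = run_evaluation(
  client: echo_client,
  prompt: "Summarize: some article text",
  expectations: {
    "starts with Echo" => ->(r) { r.start_with?("Echo") },
    "is one word"      => ->(r) { r.split.length == 1 }
  }
)
p result[:failed]
```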

VCR pattern

For integration tests against real APIs, record once and replay deterministically using VCR + WebMock:

# spec/spec_helper.rb
require "vcr"

VCR.configure do |c|
  c.cassette_library_dir = "spec/cassettes"
  c.hook_into :webmock
  c.filter_sensitive_data("<OPENAI_API_KEY>") { ENV["OPENAI_API_KEY"] }
  c.configure_rspec_metadata!
end

RSpec.describe "Summarizer", vcr: true do
  it "summarizes" do
    expect(MyApp.summarize(article)).to pass_llm_judge("is concise")
  end
end

Development

bin/setup            # install deps
bundle exec rspec    # run tests
bundle exec rubocop  # lint
bin/console          # interactive prompt

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/salscotto/rspec-llm.

License

MIT — see LICENSE.txt.