Class: Rubino::LLM::CacheBreakpointMiddleware

Inherits:
Faraday::Middleware
  • Object
show all
Defined in:
lib/rubino/llm/cache_breakpoint_middleware.rb

Overview

Faraday request middleware that stamps an Anthropic prompt-cache breakpoint on the GROWING conversation tail of EVERY outgoing /messages request — the last content block of the last message, advancing one block each turn.

Why a Faraday middleware (and not the adapter’s turn-boundary code): ruby_llm 1.16 runs the WHOLE model<->tool loop inside a single ask(), so the intermediate tool round-trips never re-enter rubino’s per-turn code. The ONLY seam that sees every actual outgoing request — including those intermediate tool round-trips — is a Faraday request middleware on the Anthropic connection (it runs after ruby_llm has serialized the body). This is exactly the round-trip #532’s load_history tail-stamping missed: that ran once per ask(), never on the tool turns inside it.

Wire shape (Anthropic “growing conversation” / longest-cached-prefix): cache_control: ephemeral is a wire-valid SIBLING key on a content block. We add it to the LAST block of the LAST message UNCONDITIONALLY —whether that block is text, tool_use (assistant tail) or tool_result (user tail). We never restructure the block; a bare-string message content is skipped (no block to stamp).

Breakpoint budget: Anthropic allows at most 4 cache_control breakpoints per request (across tools + system + messages). rubino already places 2 static ones (last tool schema, system prefix). This middleware adds 1 (the moving tail) and, on a very long turn, an optional 2nd “leapfrog” breakpoint ~15 blocks behind the tail so a long burst of tool round-trips still gets a cache READ of the earlier blocks. If stamping would exceed 4, we evict the OLDEST message-level breakpoint first — never a system or tools breakpoint.

The middleware is installed by RubyLLMAdapter ONLY on the anthropic-family path and only when prompts.prompt_cache is on; openai/ollama connections never carry it. It is fully defensive: any parse/shape surprise leaves the body byte-identical.

Constant Summary collapse

EPHEMERAL =
{ "type" => "ephemeral" }.freeze
MAX_BREAKPOINTS =

Anthropic hard cap on cache_control breakpoints per request.

4
LEAPFROG_THRESHOLD =

A turn longer than this (content blocks in the messages array) earns the optional second “leapfrog” breakpoint, placed LOOKBACK blocks behind the tail so a long run of tool round-trips still reads the earlier prefix.

20
LEAPFROG_LOOKBACK =
15

Instance Method Summary collapse

Instance Method Details

#call(env) ⇒ Object



52
53
54
55
56
57
58
# File 'lib/rubino/llm/cache_breakpoint_middleware.rb', line 52

def call(env)
  stamp!(env)
  @app.call(env)
rescue StandardError
  # Never let cache bookkeeping break a real request — forward untouched.
  @app.call(env)
end