Class: Rubino::LLM::CacheBreakpointMiddleware
- Inherits: Faraday::Middleware
  - Object
  - Faraday::Middleware
  - Rubino::LLM::CacheBreakpointMiddleware
- Defined in: lib/rubino/llm/cache_breakpoint_middleware.rb
Overview
Faraday request middleware that stamps an Anthropic prompt-cache breakpoint on the GROWING conversation tail of EVERY outgoing /messages request — the last content block of the last message, advancing one block each turn.
Why a Faraday middleware (and not the adapter’s turn-boundary code): ruby_llm 1.16 runs the WHOLE model<->tool loop inside a single ask(), so the intermediate tool round-trips never re-enter rubino’s per-turn code. The ONLY seam that sees every actual outgoing request — including those intermediate tool round-trips — is a Faraday request middleware on the Anthropic connection (it runs after ruby_llm has serialized the body). This is exactly the round-trip #532’s load_history tail-stamping missed: that ran once per ask(), never on the tool turns inside it.
Wire shape (Anthropic “growing conversation” / longest-cached-prefix): cache_control with the value { "type" => "ephemeral" } is a wire-valid SIBLING key on a content block. We add it to the LAST block of the LAST message UNCONDITIONALLY, whether that block is text, tool_use (assistant tail) or tool_result (user tail). We never restructure the block; a message whose content is a bare string is skipped, because there is no block to stamp.
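The wire shape described above can be illustrated with a minimal sketch. It is not the middleware's actual stamp! implementation, which this page does not show; the module name TailStampSketch and the method stamp_tail are hypothetical, and the body is assumed to be the JSON string that ruby_llm has already serialized.

```ruby
require "json"

# Hypothetical illustration only; not the real stamp! method.
module TailStampSketch
  EPHEMERAL = { "type" => "ephemeral" }.freeze

  # Returns a JSON body whose last message's last content block carries
  # cache_control. Any unexpected shape returns the input string unchanged.
  def self.stamp_tail(body)
    payload = JSON.parse(body)
    return body unless payload.is_a?(Hash)

    messages = payload["messages"]
    return body unless messages.is_a?(Array) && messages.last.is_a?(Hash)

    content = messages.last["content"]
    # A bare-string content has no block to stamp, so it is skipped.
    return body unless content.is_a?(Array) && content.last.is_a?(Hash)

    # Added as a sibling key; the block itself is not restructured.
    content.last["cache_control"] = EPHEMERAL
    JSON.generate(payload)
  rescue JSON::ParserError, TypeError
    body
  end
end

body = JSON.generate(
  "messages" => [
    { "role" => "user", "content" => [{ "type" => "text", "text" => "list files" }] },
    { "role" => "assistant",
      "content" => [{ "type" => "tool_use", "id" => "t1", "name" => "ls", "input" => {} }] }
  ]
)
stamped = JSON.parse(TailStampSketch.stamp_tail(body))
```

In this sketch only the assistant's tool_use block gains the cache_control key; the earlier user block is left as it was.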
Breakpoint budget: Anthropic allows at most 4 cache_control breakpoints per request (across tools + system + messages). rubino already places 2 static ones (last tool schema, system prefix). This middleware adds 1 (the moving tail) and, on a very long turn, an optional 2nd “leapfrog” breakpoint ~15 blocks behind the tail so a long burst of tool round-trips still gets a cache READ of the earlier blocks. If stamping would exceed 4, we evict the OLDEST message-level breakpoint first — never a system or tools breakpoint.
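The budget rule can be sketched as follows. This is an assumption-laden illustration rather than the middleware's own code: BudgetSketch, make_room!, and the helper names are hypothetical, and the payload is assumed to be the parsed request hash with system given as an array of blocks.

```ruby
# Hypothetical illustration of the eviction rule; not the real implementation.
module BudgetSketch
  MAX_BREAKPOINTS = 4

  def self.stamped?(block)
    block.is_a?(Hash) && block.key?("cache_control")
  end

  # Message-level blocks that carry cache_control, oldest first.
  def self.message_breakpoints(payload)
    Array(payload["messages"]).flat_map do |message|
      content = message.is_a?(Hash) ? message["content"] : nil
      content.is_a?(Array) ? content.select { |b| stamped?(b) } : []
    end
  end

  # Breakpoints on tools and system; these are never evicted.
  def self.static_breakpoints(payload)
    system = payload["system"]
    system_count = system.is_a?(Array) ? system.count { |b| stamped?(b) } : 0
    Array(payload["tools"]).count { |t| stamped?(t) } + system_count
  end

  # Removes the oldest message-level breakpoints until `adding` new ones fit.
  def self.make_room!(payload, adding:)
    stamped = message_breakpoints(payload)
    total = static_breakpoints(payload) + stamped.size
    while total + adding > MAX_BREAKPOINTS && !stamped.empty?
      stamped.shift.delete("cache_control")
      total -= 1
    end
    payload
  end
end
```

With 2 static breakpoints and 2 message-level ones already present, asking for room for 1 more removes only the oldest message-level breakpoint, which brings the request back to the cap of 4.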
The middleware is installed by RubyLLMAdapter ONLY on the anthropic-family path and only when prompts.prompt_cache is on; openai/ollama connections never carry it. It is fully defensive: any parse/shape surprise leaves the body byte-identical.
Constant Summary
- EPHEMERAL = { "type" => "ephemeral" }.freeze
- MAX_BREAKPOINTS = 4
  Anthropic hard cap on cache_control breakpoints per request.
- LEAPFROG_THRESHOLD = 20
  A turn longer than this (content blocks in the messages array) earns the optional second “leapfrog” breakpoint, placed LOOKBACK blocks behind the tail so a long run of tool round-trips still reads the earlier prefix.
- LEAPFROG_LOOKBACK = 15
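How the two leapfrog constants interact can be sketched as below. The exact counting used by the middleware is not shown on this page, so this sketch assumes that "blocks" means content blocks flattened across all messages and that the leapfrog position is counted back from the last block; LeapfrogSketch and its methods are hypothetical names.

```ruby
# Hypothetical illustration; the real placement logic may count differently.
module LeapfrogSketch
  LEAPFROG_THRESHOLD = 20
  LEAPFROG_LOOKBACK = 15

  # Flattens every array-shaped message content into one list of blocks.
  def self.all_blocks(messages)
    messages.flat_map { |m| m["content"].is_a?(Array) ? m["content"] : [] }
  end

  # Returns the block LEAPFROG_LOOKBACK positions behind the tail,
  # or nil when the turn is not long enough to earn a leapfrog breakpoint.
  def self.leapfrog_block(messages)
    blocks = all_blocks(messages)
    return nil unless blocks.size > LEAPFROG_THRESHOLD

    blocks[blocks.size - 1 - LEAPFROG_LOOKBACK]
  end
end
```

Under these assumptions a turn of exactly 20 blocks gets no leapfrog breakpoint, while a turn of 21 blocks places it on the sixth block (index 5), leaving 15 blocks between it and the tail.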
Instance Method Summary
Instance Method Details
#call(env) ⇒ Object
# File 'lib/rubino/llm/cache_breakpoint_middleware.rb', line 52

def call(env)
  stamp!(env)
  @app.call(env)
rescue StandardError
  # Never let cache bookkeeping break a real request; forward untouched.
  @app.call(env)
end
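One property of the listing above is worth noting: the method-level rescue also covers @app.call(env), so a StandardError raised downstream would trigger a second @app.call(env). If that is not intended, a variant that scopes the rescue to the stamping step alone could look like the sketch below. ScopedRescueSketch is a hypothetical stand-in that takes the stamping step as a block, so it runs without Faraday; it is not the library's code.

```ruby
# Hypothetical variant; the stamping step is passed in as a block.
class ScopedRescueSketch
  def initialize(app, &stamper)
    @app = app
    @stamper = stamper
  end

  def call(env)
    begin
      @stamper.call(env)
    rescue StandardError
      # Stamping failed; fall through and forward the request as-is.
    end
    # Called exactly once; downstream errors propagate to the caller.
    @app.call(env)
  end
end

passthrough = ->(env) { env }
middleware = ScopedRescueSketch.new(passthrough) { |_env| raise ArgumentError, "unexpected shape" }
result = middleware.call({ body: "{}" })
```

Here the stamping failure is swallowed and the request is still forwarded, while an error raised by the downstream app would surface once instead of causing a retry.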