Class: Rubino::LLM::RubyLLMAdapter

Inherits: Object

Defined in: lib/rubino/llm/ruby_llm_adapter.rb

Overview

Adapter wrapping ruby_llm to isolate all LLM integration details. The rest of the application never calls ruby_llm directly.

Constant Summary

OUTPUT_LIMIT_BY_PROVIDER =

Per-provider maximum output-token ceilings for the fallback default, mirroring Hermes’ _ANTHROPIC_OUTPUT_LIMITS. Thinking tokens count toward max_tokens, so a flat 16_384 default starves a thinking-enabled model: with an 8_000 thinking budget, only about 8_384 tokens remain for visible output. A heavy turn whose assistant emits a large single-shot tool_use (e.g. writing a whole file) overran that limit mid-stream, and MiniMax terminated the request with a generic “invalid params” error. This was reproduced as roughly 30 seconds of silent generation followed by a failed-response error. It is not a request-shape rejection: the same body replays with a 200.

MiniMax’s real output ceiling is 131_072 (Hermes uses the same value), so it gets that much room. Only providers in this table change; every other provider keeps the conservative 16_384 default, because a model whose hard cap is lower (e.g. a native Anthropic 3.5 at 8_192) must not be over-asked.

{ "minimax" => 131_072 }.freeze

DEFAULT_OUTPUT_LIMIT =

16_384
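
The arithmetic behind these two constants can be sketched as follows. This is a minimal illustration, not the adapter's actual lookup code: the module name OutputLimitSketch and the helpers output_limit_for and visible_output_budget are hypothetical, and only the constant values come from the documentation above.

```ruby
# Hypothetical sketch; only the two constant values are taken from the docs.
module OutputLimitSketch
  OUTPUT_LIMIT_BY_PROVIDER = { "minimax" => 131_072 }.freeze
  DEFAULT_OUTPUT_LIMIT = 16_384

  module_function

  # Pick the ceiling for a provider, falling back to the conservative default.
  def output_limit_for(provider)
    OUTPUT_LIMIT_BY_PROVIDER.fetch(provider.to_s, DEFAULT_OUTPUT_LIMIT)
  end

  # Thinking tokens count toward max_tokens, so visible output gets the rest.
  def visible_output_budget(max_tokens, thinking_budget)
    max_tokens - thinking_budget
  end
end

default_room = OutputLimitSketch.visible_output_budget(
  OutputLimitSketch.output_limit_for("some_other_provider"), 8_000
)
minimax_room = OutputLimitSketch.visible_output_budget(
  OutputLimitSketch.output_limit_for("minimax"), 8_000
)
puts default_room # 8384
puts minimax_room # 123072
```

With the default ceiling, a single large tool_use payload has to fit in roughly 8k tokens; with the MiniMax entry it has over 120k.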

Instance Attribute Summary

#model_id ⇒ Object readonly

Returns the value of attribute model_id.
#provider ⇒ Object readonly

Returns the value of attribute provider.

Instance Method Summary

#call(request) ⇒ Object

The single LLM boundary entry: take one LLM::Request, dispatch to the streaming vs non-streaming transport based on request.stream, and return a normalized AdapterResponse.
#chat(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil, on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil) ⇒ Object

Sends a chat completion request (non-streaming).
#initialize(model_id: nil, provider: nil, config: nil, ui: nil, event_bus: nil, tool_executor: nil, cancel_token: nil, isolate_config: false) ⇒ RubyLLMAdapter constructor

A new instance of RubyLLMAdapter.
#stream(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil, on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil) ⇒ Object

Sends a streaming chat request, yielding chunks.

Constructor Details

#initialize(model_id: nil, provider: nil, config: nil, ui: nil, event_bus: nil, tool_executor: nil, cancel_token: nil, isolate_config: false) ⇒ `RubyLLMAdapter`

Returns a new instance of RubyLLMAdapter.

# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 47

def initialize(model_id: nil, provider: nil, config: nil, ui: nil, event_bus: nil,
               tool_executor: nil, cancel_token: nil, isolate_config: false)
  @config        = config || Rubino.configuration
  @model_id      = model_id || @config.dig("model", "default")
  @provider      = provider || resolve_provider
  @temperature   = @config.dig("model", "temperature")
  @ui            = ui || Rubino.ui
  @event_bus     = event_bus || Rubino.event_bus
  @tool_executor = tool_executor # nil = ToolBridge falls back to direct tool.call
  @cancel_token  = cancel_token

  # SLICE-7: when built as a FallbackChain entry, scope provider config
  # (api keys / base_url / timeout) into a per-adapter RubyLLM::Context
  # instead of the process-global RubyLLM.configure. This is the heart of
  # the global-config hazard fix: switching providers
  # for a fallback must NOT mutate the global, or concurrent sessions on the
  # API/server path corrupt each other's provider config. The primary
  # adapter (isolate_config: false) keeps writing the global exactly as
  # before, so existing single-provider setups are byte-identical.
  if isolate_config
    @context = RubyLLM.context { |c| apply_provider_config!(c) }
  else
    configure_ruby_llm!
  end
end
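
The hazard described in the constructor comment can be illustrated with a small stand-in. This sketch does not use ruby_llm at all: SketchConfig, SketchLLM and SketchAdapter are hypothetical names that only mimic the shape of "write the process-global config" versus "build a private, per-adapter copy".

```ruby
# Stand-in for a process-global configuration object (NOT the ruby_llm API).
SketchConfig = Struct.new(:api_key, :base_url)

module SketchLLM
  @global = SketchConfig.new("primary-key", "https://primary.example")

  class << self
    attr_reader :global

    # Return a private copy of the global config, customised by the block.
    def context
      copy = global.dup
      yield copy
      copy
    end
  end
end

class SketchAdapter
  attr_reader :config

  def initialize(api_key:, base_url:, isolate_config: false)
    if isolate_config
      # Fallback entry: provider settings live only on this adapter's copy.
      @config = SketchLLM.context do |c|
        c.api_key = api_key
        c.base_url = base_url
      end
    else
      # Primary adapter: keeps writing the shared global, as before.
      SketchLLM.global.api_key = api_key
      SketchLLM.global.base_url = base_url
      @config = SketchLLM.global
    end
  end
end

fallback = SketchAdapter.new(api_key: "fallback-key",
                             base_url: "https://fallback.example",
                             isolate_config: true)
puts fallback.config.api_key  # fallback-key
puts SketchLLM.global.api_key # primary-key
```

Building the fallback adapter leaves the global untouched, so a concurrent session that reads the global still sees the primary provider's settings.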

Instance Attribute Details

#model_id ⇒ `Object` (readonly)

Returns the value of attribute model_id.




# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 30

def model_id
  @model_id
end

#provider ⇒ `Object` (readonly)

Returns the value of attribute provider.




# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 30

def provider
  @provider
end

Instance Method Details

#call(request) ⇒ `Object`

The single LLM boundary entry: take one LLM::Request, dispatch to the streaming vs non-streaming transport based on request.stream, and return a normalized AdapterResponse. The streaming variant yields chunks to the block then returns the same Response. This is the front door the conversation loop depends on; #chat / #stream remain as the underlying transports and stay valid for existing callers.

Graceful thinking degradation (#75): a provider on the anthropic-compatible path that rejects the thinking budget used to hard-error the user’s very first prompt (the default effort is medium). When the rejection is recognised, the adapter remembers it for the session, tells the user once, and retries the same request without the budget. Re-issuing is safe because the rejection is a pre-stream 400, so no token has reached the UI.

# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 86

def call(request, &)
  dispatch(request, &)
rescue StandardError => e
  raise unless thinking_budget_rejected?(e)

  ThinkingSupport.mark_unsupported!(@provider, notify: @ui)
  dispatch(request, &)
end
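
The retry-once pattern used by #call can be sketched in isolation. Everything here is hypothetical and stands in for the real collaborators: ThinkingRejected replaces the provider error recognised by thinking_budget_rejected?, and DegradingCaller keeps the "unsupported" flag and the one-time notice that ThinkingSupport and the UI handle in the adapter.

```ruby
# Hypothetical error type standing in for a recognised thinking-budget rejection.
class ThinkingRejected < StandardError; end

class DegradingCaller
  attr_reader :notices

  def initialize
    @thinking_supported = true
    @notices = []
  end

  # transport is any callable accepting (prompt, thinking:).
  def call(prompt, transport)
    transport.call(prompt, thinking: @thinking_supported)
  rescue ThinkingRejected
    # Already degraded: the rejection has another cause, so do not loop.
    raise unless @thinking_supported

    @thinking_supported = false                       # remember for the session
    @notices << "thinking disabled for this session"  # tell the user once
    transport.call(prompt, thinking: false)           # retry without the budget
  end
end

attempts = []
transport = lambda do |prompt, thinking:|
  attempts << thinking
  raise ThinkingRejected, "thinking budget not supported" if thinking

  "ok: #{prompt}"
end

degrading = DegradingCaller.new
puts degrading.call("hello", transport) # ok: hello
puts attempts.inspect                   # [true, false]
```

A second call on the same object goes straight to the budget-free request and adds no further notice.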

#chat(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil, on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil) ⇒ `Object`

Sends a chat completion request (non-streaming). image_paths, if any, are forwarded to ruby_llm’s `with:` slot so the primary model ingests the bytes natively (no `vision` tool round-trip). This is only meaningful on the first model call of a turn; Loop strips it for follow-ups.

# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 99

def chat(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil,
         on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil)
  if bedrock_bearer_mode?
    bedrock_bearer_client.chat(messages: messages, tools: tools)
  else
    chat_instance = build_chat(tools: tools, response_format: response_format,
                               budget_exhausted: budget_exhausted)
    load_history(chat_instance, messages)
    apply_prefill(chat_instance, prefill)
    usage = wire_round_trip_callbacks(chat_instance,
                                      on_intermediate_message: on_intermediate_message,
                                      on_round_trip: on_round_trip)
    response = chat_instance.ask(last_user_content(messages), with: presence(image_paths))
    build_response(response, usage: usage)
  end
end

#stream(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil, on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil) ⇒ `Object`

Sends a streaming chat request, yielding chunks. Inline <think>…</think> sentinels are routed to the :thinking channel. Buffered partial content is preserved across mid-stream parse errors so downstream code can show whatever the model produced before the failure.
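
The sentinel routing can be pictured with a deliberately simplified sketch. It assumes the text is already joined into one string; a real streaming implementation also has to cope with a tag split across two chunks, which this version ignores. The helper name route_think_segments is hypothetical.

```ruby
# Split text on <think>/</think> sentinels and label each piece with a channel.
def route_think_segments(text)
  segments = []
  channel = :content
  # A capture group makes String#split keep the delimiters in the result.
  text.split(%r{(<think>|</think>)}).each do |piece|
    case piece
    when "<think>"  then channel = :thinking
    when "</think>" then channel = :content
    else
      segments << [channel, piece] unless piece.empty?
    end
  end
  segments
end

p route_think_segments("<think>plan the edit</think>Here is the answer.")
# [[:thinking, "plan the edit"], [:content, "Here is the answer."]]
```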

# File 'lib/rubino/llm/ruby_llm_adapter.rb', line 120

def stream(messages:, tools: nil, response_format: nil, image_paths: [], prefill: nil,
           on_intermediate_message: nil, on_round_trip: nil, budget_exhausted: nil, &)
  if bedrock_bearer_mode?
    # BedrockBearerClient#stream buffers the whole /converse response before
    # its first emit, so a transport error can only fire pre-first-chunk —
    # no token reached the UI. It raises straight through to the runner,
    # which re-issues a fresh request (safe, no double output).
    return bedrock_bearer_client.stream(messages: messages, tools: tools, &)
  end

  # No retry wrapper here — retry ownership moved to Agent::ModelCallRunner
  # (Slice 4) to avoid double-retrying the same failure. The streaming
  # transport-drop PROTECTION still lives inside #stream_once: it RAISES a
  # transport drop only when NOTHING was emitted to the UI yet
  # (chunks_seen.zero?), so the runner can re-issue a fresh request before
  # any token reached the user — no double output. Once a chunk has flowed
  # it RETURNS the buffered partial instead of raising, so the drop can
  # never be retried mid-stream. The raise-vs-return decision (the only
  # streaming-specific safety) stays here; the actual retrying is the
  # runner's job.
  stream_once(messages: messages, tools: tools, response_format: response_format,
              image_paths: image_paths, prefill: prefill,
              on_intermediate_message: on_intermediate_message,
              on_round_trip: on_round_trip, budget_exhausted: budget_exhausted, &)
end
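
The raise-versus-return decision described in the comment above can be sketched as follows. This is not #stream_once itself: TransportDrop, stream_once_sketch and the hash-shaped result are hypothetical, and the chunk source is a plain Enumerator rather than a provider connection.

```ruby
# Hypothetical error standing in for a dropped streaming connection.
class TransportDrop < StandardError; end

# source is any object whose #each yields chunks and may raise TransportDrop.
def stream_once_sketch(source)
  buffer = +""
  chunks_seen = 0
  source.each do |chunk|
    chunks_seen += 1
    buffer << chunk
    yield chunk if block_given? # the chunk reaches the UI here
  end
  { content: buffer, partial: false }
rescue TransportDrop
  # Nothing was shown yet: re-raise so the caller can safely retry.
  raise if chunks_seen.zero?

  # Something was shown: return the partial so nobody retries mid-stream.
  { content: buffer, partial: true }
end

# Build a source that yields n chunks and then drops the connection.
dropping_after = lambda do |n|
  Enumerator.new do |y|
    n.times { |i| y << "c#{i}" }
    raise TransportDrop, "connection reset"
  end
end

p stream_once_sketch(dropping_after.call(2))
# {:content=>"c0c1", :partial=>true} (Ruby 3.4+ prints {content: "c0c1", partial: true})
```

With dropping_after.call(0) the same method raises TransportDrop instead, which is the only case a retrying caller ever sees.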