Module: Rubino::LLM::ErrorClassifier
- Defined in:
- lib/rubino/llm/error_classifier.rb
Overview
Centralized API-error classifier: the single source of truth for “is this error worth a retry?”, replacing the adapter’s boolean transient_error?. It is a port of the reference classify_api_error, reduced to the structural signals ruby_llm actually surfaces: a typed error class and the wrapped HTTP status. We do NOT port the giant message-pattern tables (billing/rate-limit/context phrase lists); ruby_llm raises typed classes, so status + class carry the same information without the brittle matching. The message-based branches that remain are deliberately narrow: the MiniMax “unknown error” (code 999/1000) blip, which arrives statusless and must stay in the retryable `unknown` bucket, plus the provider-specific pattern lists in the Constant Summary.
Constant Summary
- STREAM_DROP_ERRORS =
Transport-level drops that surface mid-request and never reach an HTTP status — always retryable. faraday-net_http re-raises IOError/EOFError (and friends) as Faraday::ConnectionFailed, the type we actually see for an upstream socket close; the rest are defensive.
[ Faraday::ConnectionFailed, Faraday::TimeoutError, Net::OpenTimeout, Net::ReadTimeout, EOFError, IOError, Errno::ECONNRESET, Errno::EPIPE ].freeze
- RETRYABLE_HTTP =
ruby_llm 1.15 raises a typed error per HTTP status. Map the classes we can name directly; everything else falls through to status-based then unknown classification. The lambda itself is the status-only test: a status is retryable when it is 5xx or 429.
->(status) { status && (status >= 500 || status == 429) }.freeze
- UNKNOWN_PROVIDER_ERROR_PATTERNS =
Body/message fragments identifying a transient provider “unknown error” (MiniMax api_error 999/1000 on the Anthropic-compatible endpoint). Kept narrow and provider-blip-specific. Moved here from the adapter so the classifier is the single source of truth (folds Slice 0(b)).
[ "unknown error", "api_error 999", "api_error 1000", "\"code\":999", "\"code\": 999", "\"code\":1000", "\"code\": 1000", "code 999", "code 1000" ].freeze
- TRANSIENT_TRANSPORT_PATTERNS =
Last-resort transport-drop phrases for statusless errors that never surfaced as a typed transport class.
[ "timeout", "timed out", "connection reset", "connection refused", "broken pipe", "end of file reached" ].freeze
- LOCAL_PROGRAMMING_ERRORS =
Local Ruby PROGRAMMING errors: unambiguous bugs in our own code (or a caller’s), not provider/API blips. These must NEVER be retried: a retry storm would mask the bug behind backoff (the very thing that turned a mid-turn `NoMethodError` from the UI into three `llm.retry` warnings). They reach `classify` only because ModelCallRunner rescues StandardError broadly around the boundary call; the reference classify_api_error never sees them because it only ever runs at the API layer. So we short-circuit them to NON-retryable (reason stays :unknown) BEFORE the unknown→retryable fallback, surfacing the bug immediately. The set is curated by CLASS, not message: every entry is a clear local bug. RuntimeError is deliberately EXCLUDED: it is too generic (ruby_llm/providers raise it for transient conditions), so it stays on the message-based path and keeps its provider-blip retryability.
[ NoMethodError, NameError, NoMatchingPatternError, NoMatchingPatternKeyError, ArgumentError, TypeError, NotImplementedError, FrozenError, LocalJumpError, ThreadError, FiberError ].freeze
- MISSING_CREDENTIAL_PATTERNS =
A missing / unconfigured credential — raised BEFORE any HTTP call, so it carries no status and would otherwise fall through to the unknown→retryable default and trigger an ~80s retry storm that exits empty (#93). ruby_llm raises RubyLLM::ConfigurationError (“Missing configuration for OpenRouter: openrouter_api_key”) when a provider’s key is unset; our own adapter raises Rubino::Error (“Missing API key for provider …”). A missing key is a credential problem the user must fix — classify it as a NON-retryable AUTH error so the runner surfaces it immediately.
[ "missing configuration for", "missing api key", "no api key", "api key is not set", "_api_key" ].freeze
- INVALID_CREDENTIAL_PATTERNS =
A PRESENT but INVALID credential rejected by the provider via a statusless / untyped error body (MiniMax’s Anthropic-compatible endpoint says “login fail” with no 401), which used to fall through to the unknown→retryable default and burn ~60-90s of silent retries on a deterministic auth failure (#126). Same deal as a typed 401/403: NON-retryable AUTH, surfaced immediately. Patterns are the literal provider phrasings, kept narrow.
[ "login fail", "invalid api key", "incorrect api key", "invalid x-api-key", "authentication_error", "authentication failed" ].freeze
- DNS_FAILURE_PATTERNS =
A PERMANENTLY unresolvable host is a misconfiguration, not a transient blip: every retry re-runs the same DNS lookup and fails identically, so retrying burns the whole budget (~81s) on a typo’d base_url (#361a). faraday-net_http wraps the underlying SocketError in a Faraday::ConnectionFailed, so we match on the literal resolver phrasings.
CRUCIAL distinction: only the PERMANENT resolver errors (EAI_NONAME: the host genuinely does not exist) belong here. “Temporary failure in name resolution” (EAI_AGAIN) is the TRANSIENT case: the resolver was momentarily unavailable, e.g. a getaddrinfo storm when several background subagents dial the SAME provider host at once, and the next lookup usually succeeds. It must NOT be marked permanent: left out of this list it falls through to #classify_transport (Faraday::ConnectionFailed → retryable) and recovers, matching Hermes’ transport-retry behaviour.
[ "name or service not known", "nodename nor servname provided", "no address associated with hostname" ].freeze
- INVALID_MEDIA_PATTERNS =
Provider media/image validation rejections — a PERMANENT 4xx-class complaint about the attachment itself, which some providers (MiniMax Anthropic-compat) surface statusless so it used to fall through to the unknown→retryable default and burn the whole retry budget (~80s) on a bad image (#98). The same attachment fails identically on every retry, so fail fast. Patterns are the literal provider phrasings, kept narrow.
[ "media exceeds size limit", "invalid image content", "image: unknown format", "could not process image" ].freeze
- INVALID_PARAMS_PATTERNS =
A deterministic request-VALIDATION rejection (a 4xx “invalid params” / “invalid request” / unprocessable body) that some providers surface STATUSLESS, so it used to fall through to the unknown→retryable default and burn the full api_max_retries:5 backoff (~85s) on a request that fails identically every time (#327). The same body is rejected on every retry, so fail fast. Kept narrow (literal provider phrasings) and ordered AFTER the media check so an image rejection keeps its own reason. The context-overflow phrases are deliberately excluded (skip_if_overflow); those are handled by the compress-not-fail path above.
[ "invalid params", "invalid parameter", "invalid request", "unprocessable entity", "validation error", "invalid_request_error" ].freeze
- RATE_LIMIT_PATTERNS =
A rate-limit / quota / usage-plan rejection that a provider surfaces with the WRONG shape on the streaming path. The observed case (#WHATIF): a MiniMax HTTP 429 “Plan usage limit reached” reaches ruby_llm’s anthropic-compatible STREAMING parser, which re-wraps it as a 400 BadRequestError carrying the generic default message “Invalid request - please check your input”, so the original 429 is lost and the error is mis-shown as a 400 client/input error, sending a dev to edit a fine prompt. The original “rate_limit_error” / “usage limit” signal survives only in the response BODY, so this stage scans the body as well as the message (the only stage that does) and runs BEFORE invalid-params so it wins over the clobbered “invalid request” text. RATE_LIMIT is retryable (honours Retry-After), matching the typed-429 path.
[ "rate_limit_error", "rate limit", "rate-limit", "ratelimit", "too many requests", "usage limit reached", "usage limit", "token plan", "quota", "plan usage" ].freeze
- MESSAGE_STAGES_PRE_TRANSPORT =
Ordered message-pattern classification stages. Each entry matches when the downcased message contains any pattern (and, for missing-credential, when the error is a RubyLLM::ConfigurationError). Order is PRECEDENCE: the first matching entry wins, mirroring the original `||` chain.
Stage keys:
  :status → :http carries the wrapped HTTP status; :none forces nil (an unresolvable host is statusless by definition).
  :skip_if_overflow → defer on any context-overflow phrasing so a compressible overflow keeps the compress-not-fail path (only invalid-params needs this guard).
  :config_error → also match RubyLLM::ConfigurationError by class.
The transport (class-based) stage runs BETWEEN the pre- and post-transport groups, so an unresolvable host is caught before transport but a media / params rejection is checked after, exactly as the original chain ordered.
[ { patterns: MISSING_CREDENTIAL_PATTERNS, reason: FailoverReason::AUTH, status: :http, config_error: true }, { patterns: INVALID_CREDENTIAL_PATTERNS, reason: FailoverReason::AUTH, status: :http }, { patterns: DNS_FAILURE_PATTERNS, reason: FailoverReason::FORMAT_ERROR, status: :none } ].freeze
- MESSAGE_STAGES_POST_TRANSPORT =
[ { patterns: INVALID_MEDIA_PATTERNS, reason: FailoverReason::FORMAT_ERROR, status: :http }, { patterns: INVALID_PARAMS_PATTERNS, reason: FailoverReason::FORMAT_ERROR, status: :http, skip_if_overflow: true } ].freeze
- CONTEXT_OVERFLOW_PATTERNS =
[ "context length", "context window", "maximum context", "token limit", "too many tokens", "prompt is too long", "max_tokens" ].freeze
- MODEL_NOT_FOUND_PATTERNS =
[ "is not a valid model", "invalid model", "model not found", "model_not_found", "does not exist", "no such model", "unknown model" ].freeze
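The permanent-versus-transient DNS distinction drawn for DNS_FAILURE_PATTERNS can be illustrated with a small standalone sketch. The pattern list is copied from the constant above; the helper name `permanent_dns_failure?` and the two sample messages are hypothetical and only approximate what a wrapped resolver error might say.

```ruby
# Copy of the permanent-resolver phrasings listed in DNS_FAILURE_PATTERNS.
SKETCH_DNS_FAILURE_PATTERNS = [
  "name or service not known",
  "nodename nor servname provided",
  "no address associated with hostname"
].freeze

# Hypothetical helper: true only for the permanent resolver phrasings.
def permanent_dns_failure?(message)
  msg = message.to_s.downcase
  SKETCH_DNS_FAILURE_PATTERNS.any? { |pattern| msg.include?(pattern) }
end

# Illustrative messages (assumed wording, not captured logs).
permanent = "Failed to open TCP connection to api.exmaple.com:443 " \
            "(getaddrinfo: Name or service not known)"
transient = "Failed to open TCP connection to api.example.com:443 " \
            "(getaddrinfo: Temporary failure in name resolution)"

puts permanent_dns_failure?(permanent) # true  (fail fast)
puts permanent_dns_failure?(transient) # false (left to the transport stage)
```

Because the transient phrasing is simply absent from the list, no extra exclusion logic is needed: the message falls through to the class-based transport check.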
Class Method Summary
-
.classify(error) ⇒ Object
Classify an error into a ClassifiedError with reason + recovery hints.
-
.classify_by_status(status, error) ⇒ Object
HTTP status classification with message-aware refinement, mirroring _classify_by_status (error_classifier.py:725) for the CORE reasons.
- .classify_message(error, stages) ⇒ Object
-
.classify_rate_limit(error) ⇒ Object
A rate-limit / quota rejection mis-shaped by the provider’s streaming path (the MiniMax 429→400 BadRequestError case).
-
.classify_statusless(error) ⇒ Object
No decisive status: the MiniMax “unknown error” blip and bare transport drops.
-
.classify_transport(error) ⇒ Object
Transport drops (Faraday::ConnectionFailed for the MiniMax EOF, read/ connect timeouts, …) are retryable regardless of message — they never reach an HTTP status.
-
.classify_typed(error) ⇒ Object
Typed ruby_llm errors we can name without a status lookup.
- .config_error?(error) ⇒ Boolean
- .context_overflow?(error) ⇒ Boolean
-
.error_body(error) ⇒ Object
The raw response body of a typed RubyLLM error (where a provider’s original error frame survives after ruby_llm overwrote the message with a generic default), or “” when unavailable.
-
.http_status(error) ⇒ Object
HTTP status from a typed RubyLLM::Error’s wrapped Faraday response, or nil.
- .local_programming_error?(error) ⇒ Boolean
- .model_not_found?(error) ⇒ Boolean
-
.result_for(reason, status, error, retryable:) ⇒ Object
Helper: builds the ClassifiedError for a classification, truncating the message to 500 characters.
-
.retryable?(error) ⇒ Boolean
Convenience: just the boolean the adapter’s retry loop needs.
Class Method Details
.classify(error) ⇒ Object
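Classify an error into a ClassifiedError with reason + recovery hints. Priority mirrors the reference pipeline (typed/transport class → HTTP status → statusless provider-unknown / transport → unknown), with the message stages interleaved. The exact order is: pre-transport message stages → transport class → mis-shaped rate limit → post-transport message stages → typed class → HTTP status → statusless checks → unknown (retryable by default, non-retryable for a local programming error).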
# File 'lib/rubino/llm/error_classifier.rb', line 112
def classify(error)
  status = http_status(error)

  result = classify_message(error, MESSAGE_STAGES_PRE_TRANSPORT) ||
           classify_transport(error) ||
           classify_rate_limit(error) ||
           classify_message(error, MESSAGE_STAGES_POST_TRANSPORT) ||
           classify_typed(error) ||
           (status && classify_by_status(status, error)) ||
           classify_statusless(error)
  return result if result

  # A genuine local Ruby bug (NoMethodError, ArgumentError, …) is NOT a
  # retryable provider blip: propagate it immediately instead of letting
  # the unknown→retryable default mask it behind a backoff storm.
  return result_for(FailoverReason::UNKNOWN, status, error, retryable: false) if local_programming_error?(error)

  result_for(FailoverReason::UNKNOWN, status, error, retryable: true)
end
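The precedence rule in `classify` (first non-nil stage wins, then a class check decides the fallback) can be sketched without ruby_llm. This is a simplified stand-in, not the module's code: `sketch_classify`, the three toy stages, and the `[reason, retryable]` pairs are all assumptions made for illustration.

```ruby
# Simplified stand-in for the `||` chain: each stage yields a
# [reason, retryable] pair or nil, and the first non-nil result wins.
def sketch_classify(error)
  msg = error.message.to_s.downcase

  result = (msg.include?("missing api key") ? [:auth, false] : nil) ||
           (error.is_a?(IOError) ? [:timeout, true] : nil) ||
           (msg.include?("rate limit") ? [:rate_limit, true] : nil)
  return result if result

  # Nothing matched: a local bug fails fast, anything else is retried.
  return [:unknown, false] if error.is_a?(NameError)

  [:unknown, true]
end

p sketch_classify(IOError.new("Missing API key"))         # [:auth, false]
p sketch_classify(IOError.new("closed stream"))           # [:timeout, true]
p sketch_classify(NoMethodError.new("undefined method"))  # [:unknown, false]
p sketch_classify(RuntimeError.new("provider blip"))      # [:unknown, true]
```

The first call shows why order matters: the error is an IOError, but the earlier message stage claims it before the transport stage is consulted.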
.classify_by_status(status, error) ⇒ Object
HTTP status classification with message-aware refinement, mirroring _classify_by_status (error_classifier.py:725) for the CORE reasons.
# File 'lib/rubino/llm/error_classifier.rb', line 367
def classify_by_status(status, error)
  case status
  when 401, 403
    result_for(FailoverReason::AUTH, status, error, retryable: false)
  when 402
    result_for(FailoverReason::BILLING, status, error, retryable: false)
  when 404
    # Generic 404 with no "model not found" signal is treated as unknown
    # (retryable) per the reference: a misconfigured
    # endpoint or proxy glitch shouldn't masquerade as a missing model.
    if model_not_found?(error)
      result_for(FailoverReason::MODEL_NOT_FOUND, status, error, retryable: false)
    else
      result_for(FailoverReason::UNKNOWN, status, error, retryable: true)
    end
  when 429
    result_for(FailoverReason::RATE_LIMIT, status, error, retryable: true)
  when 503, 529
    result_for(FailoverReason::OVERLOADED, status, error, retryable: true)
  when 400
    if context_overflow?(error)
      result_for(FailoverReason::CONTEXT_OVERFLOW, status, error, retryable: false)
    elsif model_not_found?(error)
      result_for(FailoverReason::MODEL_NOT_FOUND, status, error, retryable: false)
    else
      result_for(FailoverReason::FORMAT_ERROR, status, error, retryable: false)
    end
  else
    if status >= 500
      result_for(FailoverReason::SERVER_ERROR, status, error, retryable: true)
    elsif status >= 400
      result_for(FailoverReason::FORMAT_ERROR, status, error, retryable: false)
    end
  end
end
.classify_message(error, stages) ⇒ Object
# File 'lib/rubino/llm/error_classifier.rb', line 301
def classify_message(error, stages)
  msg = error.message.to_s.downcase
  stages.each do |stage|
    next if stage[:skip_if_overflow] && context_overflow?(error)

    matched = (stage[:config_error] && config_error?(error)) ||
              stage[:patterns].any? { |p| msg.include?(p) }
    next unless matched

    status = stage[:status] == :http ? http_status(error) : nil
    return result_for(stage[:reason], status, error, retryable: false)
  end
  nil
end
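The table-driven matching in `classify_message`, including the `skip_if_overflow` guard, can be sketched as follows. The stage table here is a trimmed, hypothetical one (two stages, symbol reasons instead of FailoverReason constants) and works on a plain message string rather than an error object.

```ruby
# Trimmed, hypothetical pattern lists for the sketch.
SKETCH_OVERFLOW_PATTERNS = ["context window", "too many tokens"].freeze

SKETCH_STAGES = [
  { patterns: ["invalid image content"], reason: :invalid_media },
  { patterns: ["invalid request"], reason: :invalid_params, skip_if_overflow: true }
].freeze

# Returns the reason of the first matching stage, or nil when none match.
def sketch_classify_message(message, stages)
  msg = message.to_s.downcase
  overflow = SKETCH_OVERFLOW_PATTERNS.any? { |p| msg.include?(p) }

  stages.each do |stage|
    # An overflow phrasing defers guarded stages to the compress path.
    next if stage[:skip_if_overflow] && overflow
    return stage[:reason] if stage[:patterns].any? { |p| msg.include?(p) }
  end
  nil
end

p sketch_classify_message("Invalid request: invalid image content", SKETCH_STAGES)
# :invalid_media (earlier stage wins)
p sketch_classify_message("Invalid request", SKETCH_STAGES)
# :invalid_params
p sketch_classify_message("Invalid request: context window exceeds limit", SKETCH_STAGES)
# nil (guarded stage skipped)
```

Keeping precedence in the array order means a new stage is added by inserting a hash at the right position, without touching the matching loop.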
.classify_rate_limit(error) ⇒ Object
A rate-limit / quota rejection mis-shaped by the provider’s streaming path (the MiniMax 429→400 BadRequestError case). Scans BOTH the message and the response body: once ruby_llm has clobbered the message to its generic “Invalid request” 400 default, the only signal of the original 429 lives in the body. A context-overflow phrasing wins instead (a “token limit” 429 is an overflow to compress, not a rate limit to back off). RATE_LIMIT is retryable so Retry-After/backoff applies; the result carries the wrapped HTTP status, falling back to 429 when none is available.
# File 'lib/rubino/llm/error_classifier.rb', line 283
def classify_rate_limit(error)
  return if context_overflow?(error)

  text = "#{error.message} #{error_body(error)}".downcase
  return unless RATE_LIMIT_PATTERNS.any? { |p| text.include?(p) }

  result_for(FailoverReason::RATE_LIMIT, http_status(error) || 429, error, retryable: true)
end
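A minimal sketch of why the body scan matters, using stand-in classes so it runs without ruby_llm. `SketchResponse`, `SketchBadRequest`, `rate_limited?`, the shortened pattern list, and the JSON body are all hypothetical; only the generic message text is taken from the description above.

```ruby
# Stand-ins for a typed error that wraps an HTTP response.
SketchResponse = Struct.new(:status, :body)

class SketchBadRequest < StandardError
  attr_reader :response

  def initialize(message, response)
    super(message)
    @response = response
  end
end

SKETCH_RATE_PATTERNS = ["rate_limit_error", "usage limit"].freeze

# Scan message AND body, since the message alone may have been overwritten.
def rate_limited?(error)
  body = if error.respond_to?(:response) && error.response.respond_to?(:body)
           error.response.body.to_s
         else
           ""
         end
  text = "#{error.message} #{body}".downcase
  SKETCH_RATE_PATTERNS.any? { |p| text.include?(p) }
end

clobbered = SketchBadRequest.new(
  "Invalid request - please check your input",
  SketchResponse.new(400, '{"type":"rate_limit_error","message":"Plan usage limit reached"}')
)

puts rate_limited?(clobbered)                 # true, found in the body
puts rate_limited?(StandardError.new("boom")) # false
```

A message-only scan would miss the first case, because the generic message contains none of the rate-limit phrasings.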
.classify_statusless(error) ⇒ Object
No decisive status: the MiniMax “unknown error” blip and bare transport drops. A permanent 4xx never reaches here (returned above), so the provider-unknown net stays narrow — mirrors the reference unknown→retryable.
# File 'lib/rubino/llm/error_classifier.rb', line 406
def classify_statusless(error)
  msg = error.message.to_s.downcase
  if UNKNOWN_PROVIDER_ERROR_PATTERNS.any? { |p| msg.include?(p) }
    return result_for(FailoverReason::UNKNOWN, nil, error, retryable: true)
  end
  if TRANSIENT_TRANSPORT_PATTERNS.any? { |p| msg.include?(p) }
    return result_for(FailoverReason::TIMEOUT, nil, error, retryable: true)
  end
  # A statusless "unknown model" / "invalid model" (some providers, or
  # ruby_llm's pre-flight, report it as an untyped error rather than a
  # ModelNotFoundError) is a deterministic config error: fail fast instead
  # of the unknown→retryable backoff storm (#417).
  return result_for(FailoverReason::MODEL_NOT_FOUND, nil, error, retryable: false) if model_not_found?(error)

  nil
end
.classify_transport(error) ⇒ Object
Transport drops (Faraday::ConnectionFailed for the MiniMax EOF, read/connect timeouts, …) are retryable regardless of message, because they never reach an HTTP status. STREAM_DROP_ERRORS is defined on this module (see the Constant Summary). An unresolvable host is caught BEFORE this (in #classify) so a permanent DNS failure does not get swept into the retryable timeout bucket.
# File 'lib/rubino/llm/error_classifier.rb', line 321
def classify_transport(error)
  return unless STREAM_DROP_ERRORS.any? { |klass| error.is_a?(klass) }

  result_for(FailoverReason::TIMEOUT, nil, error, retryable: true)
end
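The class-based check is what makes this stage independent of message wording. A small sketch, restricted to the core Ruby classes from STREAM_DROP_ERRORS so that it runs without Faraday or net/http (the Faraday and Net entries are omitted here by assumption, and `transport_drop?` is a hypothetical name):

```ruby
# Core-Ruby subset of the transport classes; the Faraday:: and Net::
# entries are left out so the sketch has no gem dependencies.
SKETCH_DROP_ERRORS = [EOFError, IOError, Errno::ECONNRESET, Errno::EPIPE].freeze

def transport_drop?(error)
  SKETCH_DROP_ERRORS.any? { |klass| error.is_a?(klass) }
end

puts transport_drop?(EOFError.new("end of file reached"))   # true
puts transport_drop?(Errno::ECONNRESET.new)                 # true
puts transport_drop?(RuntimeError.new("connection reset"))  # false: class decides
```

The last case is not lost in the real pipeline: a statusless error whose message mentions "connection reset" is still picked up later by the TRANSIENT_TRANSPORT_PATTERNS check in classify_statusless.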
.classify_typed(error) ⇒ Object
Typed ruby_llm errors we can name without a status lookup.
# File 'lib/rubino/llm/error_classifier.rb', line 328
def classify_typed(error)
  # A permanent context-overflow can arrive DISGUISED as a 5xx: MiniMax
  # wraps the "context window exceeds limit" 400 in a RubyLLM::ServerError,
  # which the blanket ServerError/OverloadedError branches below would
  # blindly mark retryable -> a 5x retry storm (~133s) on a request that
  # fails identically every time (#356). Run the message-based overflow
  # check FIRST so an overflow masquerading as 5xx routes to
  # compress-not-retry, regardless of the wrapping error class.
  if context_overflow?(error)
    return result_for(FailoverReason::CONTEXT_OVERFLOW, http_status(error), error, retryable: false)
  end

  case error
  when RubyLLM::ContextLengthExceededError
    result_for(FailoverReason::CONTEXT_OVERFLOW, http_status(error), error, retryable: false)
  when RubyLLM::ModelNotFoundError
    # A deterministic CONFIG error: ruby_llm raises ModelNotFoundError
    # ("Unknown model: ...") BEFORE any HTTP call when the configured model
    # id isn't registered. It is statusless, so it used to fall through to the
    # unknown→retryable default and burn the full api_max_retries backoff
    # (~73s) on a request that can NEVER succeed (#417). The model id is
    # fixed for the run, so every retry re-fails identically: fail fast as a
    # non-retryable config error with the actionable message.
    result_for(FailoverReason::MODEL_NOT_FOUND, http_status(error), error, retryable: false)
  when RubyLLM::UnauthorizedError, RubyLLM::ForbiddenError
    result_for(FailoverReason::AUTH, http_status(error), error, retryable: false)
  when RubyLLM::PaymentRequiredError
    result_for(FailoverReason::BILLING, http_status(error), error, retryable: false)
  when RubyLLM::RateLimitError
    result_for(FailoverReason::RATE_LIMIT, http_status(error) || 429, error, retryable: true)
  when RubyLLM::OverloadedError, RubyLLM::ServiceUnavailableError
    result_for(FailoverReason::OVERLOADED, http_status(error), error, retryable: true)
  when RubyLLM::ServerError
    result_for(FailoverReason::SERVER_ERROR, http_status(error), error, retryable: true)
  end
end
.config_error?(error) ⇒ Boolean
# File 'lib/rubino/llm/error_classifier.rb', line 468
def config_error?(error)
  defined?(RubyLLM::ConfigurationError) && error.is_a?(RubyLLM::ConfigurationError)
end
.context_overflow?(error) ⇒ Boolean
# File 'lib/rubino/llm/error_classifier.rb', line 452
def context_overflow?(error)
  return true if error.is_a?(RubyLLM::ContextLengthExceededError)

  msg = error.message.to_s.downcase
  CONTEXT_OVERFLOW_PATTERNS.any? { |p| msg.include?(p) }
end
.error_body(error) ⇒ Object
The raw response body of a typed RubyLLM error (where a provider’s original error frame survives after ruby_llm overwrote the message with a generic default), or “” when unavailable.
# File 'lib/rubino/llm/error_classifier.rb', line 295
def error_body(error)
  return "" unless error.respond_to?(:response) && error.response.respond_to?(:body)

  error.response.body.to_s
end
.http_status(error) ⇒ Object
HTTP status from a typed RubyLLM::Error’s wrapped Faraday response, or nil.
# File 'lib/rubino/llm/error_classifier.rb', line 435
def http_status(error)
  return unless error.respond_to?(:response) && error.response.respond_to?(:status)

  status = error.response.status
  status if status.is_a?(Integer)
end
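The duck-typed guard means any object can be passed in safely. The sketch below repeats the same logic under a different name and exercises it with hypothetical Struct stand-ins (`SketchResp`, `SketchErr`) in place of a RubyLLM error and its Faraday response.

```ruby
# Hypothetical stand-ins for an error and its wrapped response.
SketchResp = Struct.new(:status)
SketchErr  = Struct.new(:response)

# Same shape as http_status: nil unless an Integer status is reachable.
def sketch_http_status(error)
  return unless error.respond_to?(:response) && error.response.respond_to?(:status)

  status = error.response.status
  status if status.is_a?(Integer)
end

p sketch_http_status(SketchErr.new(SketchResp.new(429)))   # 429
p sketch_http_status(SketchErr.new(SketchResp.new("429"))) # nil: not an Integer
p sketch_http_status(SketchErr.new(nil))                   # nil: no response
p sketch_http_status(StandardError.new("boom"))            # nil: no #response
```

Returning nil rather than raising is what lets `classify` write `status && classify_by_status(status, error)` and skip the status stage for statusless errors.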
.local_programming_error?(error) ⇒ Boolean
# File 'lib/rubino/llm/error_classifier.rb', line 464
def local_programming_error?(error)
  LOCAL_PROGRAMMING_ERRORS.any? { |klass| error.is_a?(klass) }
end
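Because the check uses `is_a?`, Ruby's exception hierarchy matters: NoMethodError is a subclass of NameError, and FrozenError is a subclass of RuntimeError, yet a plain RuntimeError stays outside the set. A sketch with a shortened class list (the name `local_bug?` and the subset are assumptions for illustration):

```ruby
# Shortened subset of LOCAL_PROGRAMMING_ERRORS for the sketch.
SKETCH_LOCAL_BUGS = [NoMethodError, NameError, ArgumentError, TypeError, FrozenError].freeze

def local_bug?(error)
  SKETCH_LOCAL_BUGS.any? { |klass| error.is_a?(klass) }
end

# Trigger real errors rather than constructing them by hand.
no_method = begin
  nil.upcase
rescue StandardError => e
  e
end

frozen = begin
  "abc".freeze << "d"
rescue StandardError => e
  e
end

puts local_bug?(no_method)                           # true  (NoMethodError)
puts local_bug?(frozen)                              # true  (FrozenError)
puts local_bug?(RuntimeError.new("provider blip"))   # false (parent class only)
```

Listing FrozenError explicitly is what catches it: matching on RuntimeError instead would also sweep in the generic provider errors the description says must remain retryable.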
.model_not_found?(error) ⇒ Boolean
# File 'lib/rubino/llm/error_classifier.rb', line 459
def model_not_found?(error)
  msg = error.message.to_s.downcase
  MODEL_NOT_FOUND_PATTERNS.any? { |p| msg.include?(p) }
end
.result_for(reason, status, error, retryable:) ⇒ Object
Helper: builds the ClassifiedError for a classification, truncating the message to 500 characters.
# File 'lib/rubino/llm/error_classifier.rb', line 425
def result_for(reason, status, error, retryable:)
  ClassifiedError.new(
    reason: reason,
    status_code: status,
    message: error.respond_to?(:message) ? error.message.to_s[0, 500] : error.to_s[0, 500],
    retryable: retryable
  )
end
.retryable?(error) ⇒ Boolean
Convenience: just the boolean the adapter’s retry loop needs.
# File 'lib/rubino/llm/error_classifier.rb', line 133
def retryable?(error)
  classify(error).retryable
end
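How a retry loop might consume this boolean can be sketched as below. This is not the adapter's actual loop: `with_retries`, its keyword arguments, and the exponential backoff schedule are assumptions, and the `retryable:` callable stands in for something like `Rubino::LLM::ErrorClassifier.method(:retryable?)`.

```ruby
# Hypothetical retry helper: re-runs the block while the predicate says the
# error is retryable, doubling the delay after each failed attempt.
def with_retries(retryable:, max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue StandardError => e
    # Give up when the budget is spent or the error is not worth retrying.
    raise if attempt >= max_attempts || !retryable.call(e)

    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

delays = []
calls = 0
result = with_retries(retryable: ->(e) { e.is_a?(IOError) },
                      sleeper: ->(s) { delays << s }) do
  calls += 1
  raise IOError, "closed stream" if calls < 3

  :ok
end

p result # :ok
p calls  # 3
p delays # [0.5, 1.0]
```

Injecting the sleeper keeps the sketch fast to run and makes the backoff schedule observable; a non-retryable error is re-raised on the first attempt without any delay.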