# llm_optimizer

A Smart Gateway for LLM API calls in Ruby and Rails applications. Reduces token usage and API costs through four composable optimizations, all opt-in and all independently configurable.

## How it works

Every call to LlmOptimizer.optimize passes through an ordered pipeline:

```
prompt → Compressor → ModelRouter → SemanticCache lookup → HistoryManager → LLM call → SemanticCache store → OptimizeResult
```

Each stage is independently enabled via configuration flags. If any stage fails, the gem falls through to a raw LLM call, so your app never breaks because of the optimizer.

## Optimizations

### 1. Semantic Caching

Stores prompt embeddings in Redis. On subsequent calls, computes cosine similarity against stored embeddings. If similarity ≥ threshold, the cached response is returned instantly, with no LLM call made.
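The similarity check behind a cache hit can be sketched in a few lines. This is illustrative only, not the gem's internals; embeddings are plain `Array<Float>` values (as returned by `embedding_caller`), and the threshold default matches `similarity_threshold`:

```ruby
# Cosine similarity between two embedding vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  mag = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  mag.zero? ? 0.0 : dot / mag
end

# A stored response counts as a hit when similarity clears the threshold.
def cache_hit?(query_embedding, stored_embedding, threshold: 0.96)
  cosine_similarity(query_embedding, stored_embedding) >= threshold
end
```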

### 2. Intelligent Model Routing

Classifies each prompt and routes it to the appropriate model tier:

- Simple → cheaper/faster model (e.g. llama3, gemini-2.5-flash-lite)
- Complex → premium model (e.g. claude-haiku-4-5-20251001, gemini-3.0-pro)

Routing uses a three-layer decision chain:

1. Explicit override — if `route_to: :simple` or `:complex` is set, always use that
2. Fast-path signals — code fences (`` ``` `` or `~~~`) and keywords (analyze, refactor, debug, architect, explain in detail) route instantly to `:complex`, with no LLM call
3. LLM classifier (optional) — for ambiguous prompts, calls a cheap model with a classification prompt; falls back to the word-count heuristic if not configured or if the call fails

This hybrid approach fixes the core weakness of pure heuristics:

- "Fix this bug" → only 3 words, but routed `:complex` via the classifier
- "Explain Ruby blocks simply" → long, but `:simple` via the classifier
- "analyze this code" → keyword fast-path → `:complex` instantly (no classifier call)
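The three-layer chain can be sketched as follows. The method and constant names here are hypothetical, not the gem's internals; the keyword list is a subset of the signals listed above, and the 20-word cutoff matches the documented fallback heuristic:

```ruby
COMPLEX_KEYWORDS = %w[analyze refactor debug architect].freeze

def route(prompt, override: nil, classifier: nil)
  # 1. Explicit override always wins
  return override if %i[simple complex].include?(override)

  # 2. Fast-path signals: code fences or complexity keywords
  return :complex if prompt.include?("```") || prompt.include?("~~~")
  downcased = prompt.downcase
  return :complex if COMPLEX_KEYWORDS.any? { |kw| downcased.include?(kw) } ||
                     downcased.include?("explain in detail")

  # 3. Optional LLM classifier, with the word-count heuristic as fallback
  if classifier
    begin
      return classifier.call(prompt) == "complex" ? :complex : :simple
    rescue StandardError
      # fall through to the heuristic on any classifier failure
    end
  end
  prompt.split.size < 20 ? :simple : :complex
end
```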

Configure the classifier with any cheap model your app already uses:

```ruby
config.classifier_caller = ->(prompt) {
  RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
    .ask(prompt).content.strip.downcase
}
```

If classifier_caller is not set, the router falls back to the word-count heuristic (< 20 words → :simple).

### 3. Token Pruning

Removes common English stop words from prompts before sending to the LLM. Preserves fenced code block content unchanged. Typically reduces token count by 10–20%.
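A minimal sketch of the idea: strip stop words outside fenced blocks and pass code through untouched. The stop-word list here is a small illustrative sample, not the gem's actual list:

```ruby
STOP_WORDS = %w[the a an is are was were of to in on at and or that this].freeze

def prune(prompt)
  # Split on fenced blocks (the capture group keeps them in the result)
  # so code segments pass through unchanged.
  prompt.split(/(```.*?```)/m).map do |segment|
    next segment if segment.start_with?("```")
    segment.split(/\s+/).reject { |w| STOP_WORDS.include?(w.downcase) }.join(" ")
  end.join(" ")
end
```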

### 4. Conversation History Sliding Window

When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message. History is stored in Redis for fast retrieval and summarization.
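The sliding-window idea in miniature. This is a sketch, not the gem's implementation: `summarize` stands in for a call to the configured simple model, `keep_recent` is a hypothetical parameter, and the rough 4-characters-per-token estimate is an assumption, not the gem's tokenizer:

```ruby
# Very rough token estimate: ~4 characters per token.
def estimate_tokens(messages)
  messages.sum { |m| m[:content].length / 4 + 1 }
end

# When over budget, fold the oldest messages into one system summary
# message and keep the most recent turns verbatim.
def compact_history(messages, budget:, keep_recent: 2, summarize:)
  return messages if estimate_tokens(messages) <= budget

  old    = messages[0...-keep_recent]
  recent = messages[-keep_recent..]
  [{ role: "system", content: summarize.call(old) }] + recent
end
```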

## Installation

Add to your Gemfile:

```ruby
gem "llm_optimizer"
```

Then run:

```bash
bundle install
```

For Rails apps, generate the initializer:

```bash
rails generate llm_optimizer:install
```

This creates config/initializers/llm_optimizer.rb with all options pre-filled and commented.

## Quick Start

```ruby
LlmOptimizer.configure do |config|
  config.compress_prompt    = true
  config.use_semantic_cache = true
  config.redis_url          = ENV["REDIS_URL"]

  # Wire up your app's LLM client
  config.llm_caller = ->(prompt, model:) {
    # Use whatever LLM client your app already has
    MyLlmService.chat(prompt, model: model)
  }

  # Wire up your embeddings provider (required if use_semantic_cache: true)
  config.embedding_caller = ->(text) {
    MyEmbeddingService.embed(text)
  }
end

result = LlmOptimizer.optimize("What is Redis?")

puts result.response          # => "Redis is an in-memory data store..."
puts result.cache_status      # => :hit or :miss
puts result.model_tier        # => :simple or :complex
puts result.model             # => "gemini-2.5-flash-lite"
puts result.original_tokens   # => 5
puts result.compressed_tokens # => 4
puts result.latency_ms        # => 12.4
```

## Configuration

### Rails initializer

```ruby
# config/initializers/llm_optimizer.rb
require "llm_optimizer"

LlmOptimizer.configure do |config|
  # --- Feature flags (all off by default) ---
  config.compress_prompt    = true   # strip stop words before sending to LLM
  config.use_semantic_cache = true   # cache responses by vector similarity
  config.manage_history     = true   # summarize old messages when over token budget

  # --- Model routing ---
  config.route_to      = :auto                       # :auto, :simple, or :complex
  config.simple_model  = "gemini-2.5-flash-lite"     # used for simple prompts
  config.complex_model = "claude-haiku-4-5-20251001" # used for complex prompts

  # --- Redis (required if use_semantic_cache: true) ---
  config.redis_url = ENV["REDIS_URL"]

  # --- Token / cache settings ---
  config.similarity_threshold = 0.96   # cosine similarity cutoff for cache hit
  config.token_budget         = 4000   # max tokens before history summarization
  config.cache_ttl            = 86400  # cache TTL in seconds (24h)
  config.timeout_seconds      = 5      # timeout for external API calls

  # --- Logging ---
  config.logger        = Rails.logger
  config.debug_logging = Rails.env.development? # logs full prompt+response in dev

  # --- Wire up your app's LLM client ---
  # Replace the body with however your app calls the LLM
  config.llm_caller = ->(prompt, model:) {
    model ||= "claude-haiku-4-5-20251001"
    provider = if model.include?("claude") then :anthropic
               elsif model.include?("gpt") then :openai
               elsif model.include?("gemini") then :gemini
               else :ollama
               end
    chat = RubyLLM.chat(model: model, provider: provider, assume_model_exists: true)
    chat.ask(prompt).content
  }

  # Embeddings caller — wire to your embeddings provider (required if use_semantic_cache: true)
  config.embedding_caller = ->(text) {
    response = RubyLLM.embed(text, provider: :gemini, model: "gemini-embedding-001")
    response.vectors
  }

  # Classifier caller — optional, improves routing accuracy for ambiguous prompts
  # Falls back to word-count heuristic if not set or if the call fails
  config.classifier_caller = ->(prompt) {
    RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
      .ask(prompt).content.strip.downcase
  }

  # System prompt — optional, seeded as the first message of a new conversation
  config.system_prompt = "You are a sarcastic comic person who gives witty responses in a non-harmful way. If any serious question is asked, handle it in a calm way."

  # Messages caller — optional, handles conversation history and summarization
  config.messages_caller = ->(messages, model:) {
    chat = RubyLLM.chat(model: model)
    messages[0..-2].each { |m| chat.add_message(role: m[:role], content: m[:content]) }
    response = chat.ask(messages.last[:content])
    response.content
  }
end
```

## Configuration reference

| Key | Type | Default | Description |
|---|---|---|---|
| `compress_prompt` | Boolean | `false` | Strip stop words before sending to LLM |
| `use_semantic_cache` | Boolean | `false` | Enable Redis-backed semantic cache |
| `manage_history` | Boolean | `false` | Enable conversation history summarization |
| `route_to` | Symbol | `:auto` | `:auto`, `:simple`, or `:complex` |
| `simple_model` | String | `"gemini-2.5-flash-lite"` | Model for simple prompts |
| `complex_model` | String | `"claude-haiku-4-5-20251001"` | Model for complex prompts |
| `similarity_threshold` | Float | `0.96` | Minimum cosine similarity for cache hit |
| `token_budget` | Integer | `4000` | Token limit before history summarization |
| `cache_ttl` | Integer | `86400` | Cache entry TTL in seconds |
| `timeout_seconds` | Integer | `5` | Timeout for external API calls |
| `redis_url` | String | `nil` | Redis connection URL |
| `embedding_model` | String | `"gemini-embedding-001"` | Embedding model name (OpenAI fallback) |
| `logger` | Logger | `Logger.new($stdout)` | Any Logger-compatible object |
| `debug_logging` | Boolean | `false` | Log full prompt and response at DEBUG level |
| `llm_caller` | Lambda | `nil` | `(prompt, model:) -> String` |
| `embedding_caller` | Lambda | `nil` | `(text) -> Array<Float>` |
| `classifier_caller` | Lambda | `nil` | `(prompt) -> "simple"` or `"complex"` |
| `messages_caller` | Lambda | `nil` | `(messages, model:) -> String`; used when `conversation_id` is present; receives full history including current user turn |
| `system_prompt` | String | `nil` | Seeded as the first system message when a new conversation is created via `conversation_id` |
| `conversation_ttl` | Integer | `86400` | TTL in seconds for Redis-backed conversation history (`0` for no expiry) |

## Per-call configuration

Override global config for a single call using a block:

```ruby
result = LlmOptimizer.optimize(prompt) do |config|
  config.route_to        = :simple
  config.compress_prompt = false
end
```

## Conversation history

Pass a messages array to enable history management:

```ruby
messages = [
  { role: "user",      content: "Tell me about Redis" },
  { role: "assistant", content: "Redis is an in-memory data store..." },
  # ... more messages
]

result = LlmOptimizer.optimize("What else can it do?", messages: messages)
```

## OptimizeResult

Every call returns an `OptimizeResult` struct:

| Field | Type | Description |
|---|---|---|
| `response` | String | The LLM response text |
| `model` | String | Model name actually used |
| `model_tier` | Symbol | `:simple` or `:complex` |
| `cache_status` | Symbol | `:hit` or `:miss` |
| `original_tokens` | Integer | Estimated token count before compression |
| `compressed_tokens` | Integer | Estimated token count after compression (`nil` if not compressed) |
| `latency_ms` | Float | Total wall-clock time for the optimize call |
| `messages` | Array | Final messages array sent to the LLM, after history management and conversation hydration (`nil` on a cache hit) |

The `messages` field reflects the actual array passed to `messages_caller` (or built from `conversation_id`), including any summarization applied by the history manager. You can pass it back as `options[:messages]` on the next call to continue a stateless conversation.

## Resilience

| Failure | Behavior |
|---|---|
| Redis unavailable (read) | Treat as cache miss, continue |
| Redis unavailable (write) | Log warning, return LLM result normally |
| Embedding API failure | Treat as cache miss, continue |
| Any component exception | Log error, fall through to raw LLM call |
| History summarization failure | Log warning, return original messages unchanged |
| Conversation load failure | Log warning, proceed without history |
| Conversation save failure | Log warning, return result with pre-save messages |
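The fail-open pattern behind all of these rows can be sketched in miniature. This is illustrative, not the gem's code: `raw_llm_call` stands in for the configured `llm_caller`, and `pipeline` for the optimizer stages:

```ruby
# Any optimizer-stage error is logged and the raw LLM call proceeds,
# so the optimizer can never take the app down with it.
def optimize_with_fallback(prompt, raw_llm_call:, logger: nil, pipeline: nil)
  return raw_llm_call.call(prompt) unless pipeline

  begin
    pipeline.call(prompt)
  rescue StandardError => e
    logger&.warn("optimizer stage failed: #{e.message}")
    raw_llm_call.call(prompt)
  end
end
```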

## Development

```bash
bundle install
bundle exec rake test     # run tests
bundle exec rake rubocop  # lint
bundle exec rake          # test + lint
```

Generate the Rails initializer in a target app:

```bash
rails generate llm_optimizer:install
```

## Contributing

See CONTRIBUTING.md

## License

MIT


GitHub · RubyGems · Changelog