lex-llm-vllm
LegionIO LLM provider extension for vLLM.
This gem provides a complete vLLM adapter for the LegionIO LLM routing layer. It speaks the OpenAI-compatible API, discovers models at runtime, publishes availability events, and supports vLLM-specific features like thinking mode and server lifecycle management.
Namespace: Legion::Extensions::Llm::Vllm
Provider slug: :vllm
Dependency: lex-llm >= 0.4.3
Load with:
require 'legion/extensions/llm/vllm'
Architecture at a Glance
Legion::Extensions::Llm::Vllm # Root module (namespace, discovery, defaults)
|-- Provider # Per-instance provider (chat, models, management)
| |-- OpenAICompatible (mixin) # Shared request/response handling
| |-- Capabilities (module) # Capability predicates for offerings
|
|-- Actor::DiscoveryRefresh # Periodic actor: refreshes discovered model list
|-- Actor::FleetWorker # Subscription actor: consumes fleet requests
|
|-- Runners::FleetWorker # Runner: delegates to Fleet::ProviderResponder
File Map
| File | What |
|---|---|
| lib/legion/extensions/llm/vllm.rb | Root module, discover_instances, default_settings, alias normalization |
| lib/legion/extensions/llm/vllm/version.rb | VERSION constant |
| lib/legion/extensions/llm/vllm/provider.rb | Provider class, chat/embeddings/model discovery, management endpoints |
| lib/legion/extensions/llm/vllm/actors/discovery_refresh.rb | Periodic actor to refresh model discovery cache |
| lib/legion/extensions/llm/vllm/actors/fleet_worker.rb | Subscription actor for fleet request consumption |
| lib/legion/extensions/llm/vllm/runners/fleet_worker.rb | Runner entrypoint that delegates to Fleet::ProviderResponder |
Key Classes
Legion::Extensions::Llm::Vllm (Root Module)
The top-level module. It handles auto-registration via Legion::Extensions::Llm::AutoRegistration, instance discovery, and configuration normalization.
Constants:
- PROVIDER_FAMILY: :vllm
- DEFAULT_INSTANCE_TIER: { tier: :direct, capabilities: [:completion, :streaming, :vision, :tools] }
Class methods:
| Method | Description |
|---|---|
| default_settings | Returns the full default settings hash (endpoint, fleet, thinking, etc.) |
| provider_class | Returns Provider |
| registry_publisher | Memoized Legion::Extensions::Llm::RegistryPublisher instance |
| discover_instances | Probes localhost:8000 health endpoint, merges configured instances from Legion::Settings |
| normalize_instance_config(config) | Normalizes config keys (base_url/api_base/endpoint -> vllm_api_base), infers tier |
| normalize_api_base(url) | Strips trailing /v1 from URLs |
| infer_tier_from_endpoint(url) | Returns :local for localhost addresses, :direct otherwise |
Instance discovery sources:
- HTTP health probe against http://localhost:8000 (0.1s timeout) -> :local tier
- Configured instances under Legion::Settings[:extensions][:llm][:vllm][:instances]
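The probe and tier inference described above can be pictured roughly as follows. This is a minimal standalone sketch, not the gem's source: `infer_tier`, `health_probe`, and the `LOCAL_HOSTS` list are hypothetical names, and the exact set of hosts the gem treats as local is an assumption.

```ruby
require 'net/http'
require 'uri'

# Assumption: these hosts count as "local"; the gem may use a different list.
LOCAL_HOSTS = %w[localhost 127.0.0.1 ::1].freeze

# Hypothetical stand-in for infer_tier_from_endpoint.
def infer_tier(url)
  host = URI.parse(url.to_s).hostname
  LOCAL_HOSTS.include?(host) ? :local : :direct
rescue URI::InvalidURIError
  :direct
end

# Hypothetical health probe: true only when GET /health answers 2xx
# within the timeout; any connection error or timeout counts as "not found".
def health_probe(base_url, timeout: 0.1)
  uri = URI.join(base_url, '/health')
  Net::HTTP.start(uri.hostname, uri.port,
                  use_ssl: uri.scheme == 'https',
                  open_timeout: timeout, read_timeout: timeout) do |http|
    http.get(uri.path).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end
```

Swallowing every error in the probe keeps discovery cheap and non-fatal when no local server is running, which matches the short timeout the README mentions.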
Legion::Extensions::Llm::Vllm::Provider
The per-instance provider class. Inherits from Legion::Extensions::Llm::Provider and mixes in OpenAICompatible for shared HTTP request/response handling.
Class methods:
| Method | Returns |
|---|---|
| slug | 'vllm' |
| local? | false |
| default_transport | :http |
| default_tier | :direct |
| configuration_options | [:vllm_api_base, :vllm_api_key] |
| configuration_requirements | [] (no required fields) |
| capabilities | Capabilities module |
| registry_publisher | Delegates to Vllm.registry_publisher |
Instance methods:
| Method | Description |
|---|---|
| api_base | Normalized API root from config, settings, or http://localhost:8000 |
| headers | Identity headers + optional Bearer token |
| settings | Returns Vllm.default_settings |
| health(live:) | GET /health |
| readiness(live:) | Checks readiness, publishes async readiness event when live: true |
| list_models | GET /v1/models, publishes async model availability events |
| discover_offerings(live:, **) | Builds ModelOffering instances from discovered models (uses cache when not live) |
| version | GET /version |
| fetch_model_detail(model_name) | Re-fetches /v1/models to resolve context_window on cache miss |
| stream_usage_supported? | Always true for vLLM |
| reset_prefix_cache(reset_running_requests:, reset_external:) | POST /reset_prefix_cache |
| reset_mm_cache | POST /reset_mm_cache |
| sleep(level:) | POST /sleep |
| wake_up(tags:) | POST /wake_up |
Payload rendering: Overrides render_payload to support vLLM thinking mode via chat_template_kwargs and strips reasoning_effort.
Provider::Capabilities (Module)
Predicate methods for model capability detection. Every predicate returns true for vLLM by default:
- chat?(model), streaming?(model), vision?(model), functions?(model), embeddings?(model)
- critical_capabilities_for(model): returns an array of active capability names
Actor::DiscoveryRefresh
Periodic actor (extends Legion::Extensions::Actors::Every) that refreshes the vLLM discovered model list.
- Default interval: 1800 seconds (30 minutes)
- Configurable via: Legion::Settings[:extensions][:llm][:vllm][:discovery_interval]
- Action: Calls Legion::LLM::Discovery.refresh_discovered_models!(provider: :vllm)
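As a rough illustration of how the interval could be resolved, the sketch below reads the setting from a plain nested Hash standing in for Legion::Settings and falls back to the documented default. `discovery_interval` and `DEFAULT_DISCOVERY_INTERVAL` are hypothetical names, not identifiers from the gem.

```ruby
# Documented default: 1800 seconds (30 minutes).
DEFAULT_DISCOVERY_INTERVAL = 1800

# Hypothetical helper; `settings` is a plain Hash used in place of Legion::Settings.
def discovery_interval(settings)
  settings.dig(:extensions, :llm, :vllm, :discovery_interval) || DEFAULT_DISCOVERY_INTERVAL
end
```

For example, `discovery_interval({})` falls back to the default, while a settings hash that sets `discovery_interval: 600` under `extensions.llm.vllm` yields 600.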
Actor::FleetWorker
Subscription actor (extends Legion::Extensions::Actors::Subscription) that consumes LLM fleet requests routed to vLLM.
- Only activates when Fleet::ProviderResponder.enabled_for? returns true for discovered instances
- Delegates execution to Runners::FleetWorker.handle_fleet_request
Runners::FleetWorker
Runner module that dispatches fleet requests to Legion::Extensions::Llm::Fleet::ProviderResponder with vLLM-specific context (provider family, class, instance discovery callback).
Defaults
Legion::Extensions::Llm::Vllm.default_settings
# {
# provider_family: :vllm,
# instances: {
# default: {
# endpoint: "http://localhost:8000",
# tier: :direct,
# transport: :http,
# credentials: { api_key: nil },
# enable_thinking: true,
# usage: { inference: true, embedding: true, image: true },
# limits: { concurrency: 1 },
# fleet: {
# enabled: false,
# respond_to_requests: false,
# capabilities: [:chat, :stream_chat, :embed],
# lanes: [],
# concurrency: 1,
# queue_suffix: nil
# }
# }
# }
# }
Configuration
Per-instance via Legion::Extensions::Llm.configure
Legion::Extensions::Llm.configure do |config|
config.vllm_api_base = "http://localhost:8000"
config.vllm_api_key = ENV["VLLM_API_KEY"]
config.default_model = "meta-llama/Llama-3.1-8B-Instruct"
config. = "BAAI/bge-base-en-v1.5"
end
Multi-instance via Legion::Settings
extensions:
llm:
vllm:
discovery_interval: 1800 # seconds between model list refreshes
instances:
production:
vllm_api_base: "https://vllm.example.com"
tier: :direct
local:
vllm_api_base: "http://localhost:8000"
tier: :local
Endpoint alias normalization
The following keys are all resolved to vllm_api_base during instance config normalization:
- base_url
- api_base
- endpoint
Trailing /v1 is stripped automatically.
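The normalization rules above can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the gem's code: the top-level methods mirror the documented names, and both the alias precedence (first match in the list wins) and the removal of alias keys after mapping are assumptions.

```ruby
# Alias keys named in the README; the precedence order is an assumption.
ENDPOINT_ALIASES = %i[base_url api_base endpoint].freeze

# Strips a trailing /v1 (with or without a final slash) from the URL.
def normalize_api_base(url)
  url.to_s.sub(%r{/v1/?\z}, '')
end

# Maps any alias key to :vllm_api_base, then normalizes the URL.
def normalize_instance_config(config)
  normalized = config.transform_keys(&:to_sym)
  alias_key = ENDPOINT_ALIASES.find { |key| normalized.key?(key) }
  normalized[:vllm_api_base] ||= normalized[alias_key] if alias_key
  ENDPOINT_ALIASES.each { |key| normalized.delete(key) } # assumption: aliases are dropped
  if normalized[:vllm_api_base]
    normalized[:vllm_api_base] = normalize_api_base(normalized[:vllm_api_base])
  end
  normalized
end
```

With this sketch, `normalize_instance_config(endpoint: 'http://localhost:8000/v1')` yields a hash whose only key is `:vllm_api_base`, pointing at `http://localhost:8000`.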
Fleet Responder
Provider instances can opt in to consuming Legion LLM fleet requests. The fleet actor only starts when at least one configured instance enables respond_to_requests.
extensions:
llm:
vllm:
instances:
local:
fleet:
enabled: true
respond_to_requests: true
capabilities:
- chat
- stream_chat
- embed
Execution flow: Actor::FleetWorker (receives message) -> Runners::FleetWorker.handle_fleet_request -> Fleet::ProviderResponder.call.
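The activation rule ("at least one configured instance enables respond_to_requests") might look like the check below. `fleet_responder_enabled?` is a hypothetical helper operating on a plain Hash with symbol keys; the real decision is made by Fleet::ProviderResponder.enabled_for?, and requiring `enabled` in addition to `respond_to_requests` is an assumption.

```ruby
# Hypothetical check over the configured instances hash.
# Assumption: both fleet flags must be true for an instance to opt in.
def fleet_responder_enabled?(instances)
  instances.values.any? do |instance|
    fleet = instance.fetch(:fleet, {})
    fleet[:enabled] == true && fleet[:respond_to_requests] == true
  end
end
```

An instance with no fleet section, or with either flag left at its default of false, does not activate the actor under this sketch.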
Thinking Mode
vLLM supports a "thinking" mode that enables extended reasoning. Enable via:
Instance-level:
extensions:
llm:
vllm:
instances:
default:
enable_thinking: true
Global:
# Legion::Settings or settings JSON
{ llm: { providers: { vllm: { enable_thinking: true } } } }
Per-request:
# Pass thinking: { enabled: true } in the chat kwargs
When enabled, the provider adds chat_template_kwargs: { enable_thinking: true } to the chat payload and strips the OpenAI-specific reasoning_effort key.
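The payload change described above can be sketched as a pure function over a Hash. `apply_thinking` is a hypothetical name; the real logic lives in the provider's render_payload override. Stripping reasoning_effort even when thinking is disabled is an assumption of this sketch.

```ruby
# Hypothetical sketch of the thinking-mode payload transformation.
def apply_thinking(payload, enable_thinking:)
  # Assumption: reasoning_effort is always removed, since vLLM does not use it here.
  rendered = payload.reject { |key, _value| key == :reasoning_effort }
  return rendered unless enable_thinking

  # Preserve any chat_template_kwargs the caller already supplied.
  kwargs = rendered.fetch(:chat_template_kwargs, {})
  rendered.merge(chat_template_kwargs: kwargs.merge(enable_thinking: true))
end
```

The function returns a new Hash and leaves the caller's payload untouched, which keeps it safe to apply per request.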
Management Endpoints
| Method | Endpoint | Kwargs | Description |
|---|---|---|---|
| health(live:) | GET /health | live: | Server health check |
| version | GET /version | none | Server version info |
| reset_prefix_cache | POST /reset_prefix_cache | reset_running_requests:, reset_external: | Clear prefix cache |
| reset_mm_cache | POST /reset_mm_cache | none | Clear multimodal cache |
| sleep(level:) | POST /sleep | level: (default: 1) | Put worker to sleep |
| wake_up(tags:) | POST /wake_up | tags: | Wake worker up |
Registry Publishing
When lex-llm routing and Legion transport are available, the provider publishes best-effort availability events to the llm.registry exchange:
- Readiness events on readiness(live: true) calls
- Model availability events on list_models discovery
All publishing is async (background threads) and never blocks the caller. Failures are logged via handle_exception.
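The "async and never blocks the caller" pattern could be sketched as below. `publish_async`, `publisher`, and `on_error` are hypothetical names: the gem uses its own RegistryPublisher and logs failures via handle_exception, neither of which is reproduced here.

```ruby
# Hypothetical best-effort publisher: runs in a background thread and
# routes any failure to an error handler instead of raising to the caller.
def publish_async(event, publisher:, on_error:)
  Thread.new do
    publisher.call(event)
  rescue StandardError => e
    on_error.call(e)
  end
end

# Usage: the caller gets the thread back immediately and may ignore it.
errors = []
thread = publish_async(
  { type: :readiness, provider: :vllm },
  publisher: ->(_event) { raise 'broker unavailable' },
  on_error: ->(error) { errors << error.message }
)
thread.join # joined here only so the example is deterministic
```

Returning the thread is a convenience for tests; production callers would normally fire and forget.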
Model Discovery & Offerings
On list_models, vLLM returns max_model_len which is mapped to context_length. This value is:
- Attached to Model::Info objects
- Cached via cache_set with 86400s TTL keyed by model_detail_cache_key
- Available in routing offerings via limits: { context_window: ctx }
discover_offerings(live: false) serves from the cached model list without hitting the network.
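To make the mapping concrete, the sketch below extracts max_model_len from a /v1/models response body and shapes it into the limits hash mentioned above. `context_windows`, `offering_limits`, and the sample JSON are illustrative assumptions; the gem's Model::Info objects and cache calls are not reproduced.

```ruby
require 'json'

# Illustrative response body; real responses carry more fields.
SAMPLE_MODELS_JSON = <<~JSON
  {
    "object": "list",
    "data": [
      { "id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model", "max_model_len": 8192 }
    ]
  }
JSON

# Hypothetical helper: model id => context window taken from max_model_len.
def context_windows(models_json)
  JSON.parse(models_json).fetch('data', []).to_h do |model|
    [model['id'], model['max_model_len']]
  end
end

# Hypothetical helper: the limits fragment attached to a routing offering.
def offering_limits(context_window)
  { context_window: context_window }
end

windows = context_windows(SAMPLE_MODELS_JSON)
limits = offering_limits(windows['meta-llama/Llama-3.1-8B-Instruct'])
```

A model entry without max_model_len maps to nil in this sketch, leaving it to the caller to decide whether to re-fetch details or omit the limit.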
Development
bundle install
bundle exec rspec
bundle exec rubocop -A
License
MIT