# lex-llm-vllm
LegionIO LLM provider extension for vLLM.

This gem lives under `Legion::Extensions::Llm::Vllm` and depends on `lex-llm` >= 0.4.3 for shared provider-neutral routing, response normalization, fleet envelopes, responder-side fleet execution, and schema primitives.

Load it with `require 'legion/extensions/llm/vllm'`.
## What It Provides

- `Legion::Extensions::Llm::Provider` registration as `:vllm`
- Shared `Legion::Extensions::Llm::Provider::OpenAICompatible` request and response handling
- Chat requests through `POST /v1/chat/completions` (sketched below)
- Streaming chat with `stream_usage_supported?` for token usage reporting
- Model discovery through `GET /v1/models`
- Embeddings through `POST /v1/embeddings`
- vLLM thinking mode via `chat_template_kwargs` (configurable through `Legion::Settings`)
- Best-effort `llm.registry` readiness and model availability event publishing when transport is loaded
- vLLM management helpers: `/health`, `/version`, `/reset_prefix_cache`, `/reset_mm_cache`, `/sleep`, `/wake_up`
- Normalized OpenAI-compatible capability and modality metadata for discovered models
- Shared fleet/default settings via `Legion::Extensions::Llm.provider_settings`
- Structured `Legion::Logging::Helper` handling for provider discovery and fallback paths
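
As a quick orientation, a chat call routed through this provider might look like the sketch below. The `require` path and default model come from this README; the `Provider.for` lookup and the `chat` signature are assumptions, not a confirmed lex-llm API.

```ruby
require 'legion/extensions/llm/vllm'

# Hypothetical usage sketch: Provider.for and the chat signature are
# assumptions; only the require path and model name come from this README.
provider = Legion::Extensions::Llm::Provider.for(:vllm)

response = provider.chat(
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  messages: [{ role: 'user', content: 'Hello from Legion' }]
)
puts response[:content]
```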
## Defaults

```ruby
Legion::Extensions::Llm::Vllm.default_settings
# {
#   provider_family: :vllm,
#   instances: {
#     default: {
#       endpoint: "http://localhost:8000",
#       tier: :direct,
#       transport: :http,
#       credentials: { api_key: nil },
#       enable_thinking: true,
#       usage: { inference: true, embedding: true, image: true },
#       limits: { concurrency: 1 },
#       fleet: {
#         enabled: false,
#         respond_to_requests: false,
#         capabilities: [:chat, :stream_chat, :embed],
#         lanes: [],
#         concurrency: 1,
#         queue_suffix: nil
#       }
#     }
#   }
# }
```
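
To override any of these per instance, a settings fragment only needs the keys being changed. The nesting below mirrors the `llm.providers.vllm` path used in the Thinking Mode section; deep-merge of partial overrides is assumed, and the endpoint and limit values are examples:

```ruby
# Settings fragment whose shape mirrors the defaults above; deep-merge
# behavior is assumed, and the values here are illustrative only.
{
  llm: {
    providers: {
      vllm: {
        instances: {
          default: {
            endpoint: 'http://gpu-box:8000',
            limits: { concurrency: 4 }
          }
        }
      }
    }
  }
}
```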
## Configuration

```ruby
Legion::Extensions::Llm.configure do |config|
  config.vllm_api_base = "http://localhost:8000"
  config.vllm_api_key = ENV["VLLM_API_KEY"]
  config.default_model = "meta-llama/Llama-3.1-8B-Instruct"
  config.default_embedding_model = "BAAI/bge-base-en-v1.5" # setter name assumed
end
```
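
With configuration in place, an embedding call could look like the following sketch. The `embed` method name is inferred from the fleet capability `:embed` and, like `Provider.for`, is an assumption rather than a documented API:

```ruby
# Hypothetical embedding call; only the endpoint behavior
# (POST /v1/embeddings) and the model name come from this README.
provider = Legion::Extensions::Llm::Provider.for(:vllm)

result = provider.embed(
  model: 'BAAI/bge-base-en-v1.5',
  input: ['first document', 'second document']
)
result[:embeddings].each { |vector| puts vector.length }
```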
## Fleet Responder

Provider instances can opt in to consuming Legion LLM fleet requests. The provider-owned fleet actor only starts when at least one configured instance enables `respond_to_requests`, and request execution delegates to `Legion::Extensions::Llm::Fleet::ProviderResponder`. A sketch of the opt-in rule follows the example below.

```yaml
extensions:
  llm:
    vllm:
      instances:
        local:
          fleet:
            enabled: true
            respond_to_requests: true
            capabilities:
              - chat
              - stream_chat
              - embed
```
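
Conceptually, the start-up gate amounts to the check below; this is an illustration of the rule just described, not the actual actor code:

```ruby
# Illustrative gate, not the real actor: the fleet responder only
# starts when at least one instance opts in via respond_to_requests.
def fleet_responder_needed?(instances)
  instances.any? do |_name, settings|
    fleet = settings[:fleet] || {}
    fleet[:enabled] && fleet[:respond_to_requests]
  end
end

# Execution itself is delegated to the shared lex-llm responder:
# Legion::Extensions::Llm::Fleet::ProviderResponder.
```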
## Thinking Mode

Enable vLLM thinking mode globally via settings:

```ruby
# In Legion::Settings or settings JSON
{ llm: { providers: { vllm: { enable_thinking: true } } } }
```

Or pass `thinking: { enabled: true }` per request. When enabled, the provider adds `chat_template_kwargs: { enable_thinking: true }` to the payload and strips `reasoning_effort`.
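
In payload terms, the behavior just described is equivalent to this transformation; the sketch restates the documented effect rather than the provider's actual implementation:

```ruby
# Sketch of the documented thinking-mode payload rewrite.
def apply_thinking_mode(payload, thinking_enabled)
  return payload unless thinking_enabled

  payload
    .merge(chat_template_kwargs: { enable_thinking: true })
    .reject { |key, _| key == :reasoning_effort } # stripped when thinking is on
end

apply_thinking_mode({ model: 'm', reasoning_effort: 'high' }, true)
# => { model: "m", chat_template_kwargs: { enable_thinking: true } }
```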
## Management Endpoints

The provider exposes helpers for vLLM server management:

| Method | Endpoint | Description |
|---|---|---|
| `health` | `GET /health` | Server health check |
| `version` | `GET /version` | Server version info |
| `reset_prefix_cache` | `POST /reset_prefix_cache` | Clear prefix cache |
| `reset_mm_cache` | `POST /reset_mm_cache` | Clear multimodal cache |
| `sleep(level:)` | `POST /sleep` | Put server to sleep |
| `wake_up(tags:)` | `POST /wake_up` | Wake server up |
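
A hedged usage sketch: the helper names and endpoints come from the table above, while the receiver (`Provider.for(:vllm)`) and the argument values are assumptions:

```ruby
provider = Legion::Extensions::Llm::Provider.for(:vllm) # lookup helper assumed

provider.health                      # GET /health
provider.version                     # GET /version
provider.reset_prefix_cache          # POST /reset_prefix_cache
provider.reset_mm_cache              # POST /reset_mm_cache
provider.sleep(level: 1)             # POST /sleep; level value is an example
provider.wake_up(tags: ['weights'])  # POST /wake_up; tag value is an example
```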
## Registry Publishing

When lex-llm routing and Legion transport are available, the provider publishes best-effort availability events to the `llm.registry` exchange:

- Readiness events on `readiness(live: true)` calls
- Model availability events on `list_models` discovery

Publishing is asynchronous (background threads) and never blocks the caller. All failures are handled gracefully via `handle_exception`.
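
The fire-and-forget pattern this describes looks roughly like the sketch below; the `llm.registry` exchange name and `handle_exception` come from this README, while the publish call itself is an assumption:

```ruby
# Illustrative fire-and-forget publish. `transport` stands in for the
# loaded Legion transport handle; its publish API is an assumption,
# not a documented Legion interface.
def publish_registry_event(event)
  Thread.new do
    transport.publish(exchange: 'llm.registry', payload: event)
  rescue StandardError => e
    handle_exception(e) # failures never propagate to the caller
  end
end
```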
## Development

```sh
bundle install
bundle exec rspec --format json --out tmp/rspec_results.json --format progress --out tmp/rspec_progress.txt
bundle exec rubocop -A
```
## License
MIT