lex-llm-vllm
LegionIO LLM provider extension for vLLM.
This gem lives under Legion::Extensions::Llm::Vllm and depends on lex-llm for shared provider-neutral routing, fleet, and schema primitives.
Load it with require 'legion/extensions/llm/vllm'.
What It Provides
- Legion::Extensions::Llm::Provider registration as :vllm
- Shared Legion::Extensions::Llm::Provider::OpenAICompatible request and response handling
- Chat requests through POST /v1/chat/completions
- Streaming chat with stream_usage_supported? for token usage reporting
- Model discovery through GET /v1/models
- Embeddings through POST /v1/embeddings
- vLLM thinking mode via chat_template_kwargs (configurable through Legion::Settings)
- Best-effort llm.registry readiness and model availability event publishing when transport is loaded
- vLLM management helpers: /health, /version, /reset_prefix_cache, /reset_mm_cache, /sleep, /wake_up
- Normalized OpenAI-compatible capability and modality metadata for discovered models
- Shared fleet/default settings via Legion::Extensions::Llm.provider_settings
- Full Legion::Logging::Helper integration with structured handle_exception across all classes
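To make the chat endpoint concrete, here is a minimal sketch of the kind of OpenAI-compatible request that POST /v1/chat/completions expects. It uses only the Ruby standard library and does not call into this gem. The helper name build_chat_request is hypothetical, and the stream_options field is the OpenAI-style option for streamed usage reporting; whether the provider sets it in exactly this way is an assumption.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical helper (not part of lex-llm-vllm): builds, but does not send,
# an OpenAI-compatible chat completion request for a vLLM server.
def build_chat_request(base_url, model:, messages:, api_key: nil, stream: false)
  uri = URI.join(base_url, "/v1/chat/completions")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = "application/json"
  request["Authorization"] = "Bearer #{api_key}" if api_key

  payload = { model: model, messages: messages, stream: stream }
  # Assumption: ask the server to include token usage in the streamed response.
  payload[:stream_options] = { include_usage: true } if stream
  request.body = JSON.generate(payload)
  request
end

request = build_chat_request(
  "http://localhost:8000",
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Hello" }],
  stream: true
)
puts request.path
puts request.body
```

Sending the request is left out so the sketch runs without a server; with one available, Net::HTTP.start(host, port) { |http| http.request(request) } would dispatch it.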
Defaults
Legion::Extensions::Llm::Vllm.default_settings
# {
# provider_family: :vllm,
# instances: {
# default: {
# endpoint: "http://localhost:8000",
# tier: :private,
# transport: :http,
# usage: { inference: true, embedding: true },
# limits: { concurrency: 8 }
# }
# }
# }
Configuration
Legion::Extensions::Llm.configure do |config|
config.vllm_api_base = "http://localhost:8000"
config.vllm_api_key = ENV["VLLM_API_KEY"]
config.default_model = "meta-llama/Llama-3.1-8B-Instruct"
config. = "BAAI/bge-base-en-v1.5"
end
Thinking Mode
Enable vLLM thinking mode globally via settings:
# In Legion::Settings or settings JSON
{ llm: { providers: { vllm: { enable_thinking: true } } } }
Or pass thinking: { enabled: true } per-request. When enabled, the provider adds chat_template_kwargs: { enable_thinking: true } to the payload and strips reasoning_effort.
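The payload adjustment described above can be sketched as a small pure function. This is an illustration of the described behavior, not the gem's actual code; the name apply_thinking and the symbol-keyed payload shape are assumptions.

```ruby
# Hypothetical helper (not part of lex-llm-vllm): when thinking is enabled,
# add chat_template_kwargs: { enable_thinking: true } and drop reasoning_effort.
def apply_thinking(payload, enabled:)
  return payload unless enabled

  # Remove reasoning_effort without mutating the caller's hash.
  adjusted = payload.reject { |key, _value| key == :reasoning_effort }
  # Preserve any chat_template_kwargs that were already present.
  kwargs = adjusted.fetch(:chat_template_kwargs, {})
  adjusted.merge(chat_template_kwargs: kwargs.merge(enable_thinking: true))
end

payload = {
  model: "meta-llama/Llama-3.1-8B-Instruct",
  messages: [{ role: "user", content: "Why is the sky blue?" }],
  reasoning_effort: "high"
}

thinking_payload = apply_thinking(payload, enabled: true)
plain_payload = apply_thinking(payload, enabled: false)

puts thinking_payload.key?(:reasoning_effort)
puts thinking_payload[:chat_template_kwargs].inspect
```

With thinking disabled the payload is returned unchanged, so reasoning_effort is still passed through in that case.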
Management Endpoints
The provider exposes helpers for vLLM server management:
| Method | Endpoint | Description |
|---|---|---|
| health | GET /health | Server health check |
| version | GET /version | Server version info |
| reset_prefix_cache | POST /reset_prefix_cache | Clear prefix cache |
| reset_mm_cache | POST /reset_mm_cache | Clear multimodal cache |
| sleep(level:) | POST /sleep | Put server to sleep |
| wake_up(tags:) | POST /wake_up | Wake server up |
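As a rough illustration of how the helpers in the table map onto HTTP calls, the sketch below builds the corresponding requests with the Ruby standard library. The names MANAGEMENT_ENDPOINTS and management_request are hypothetical, and passing level and tags as query parameters is an assumption about how the vLLM server reads them, not something this README states.

```ruby
require "net/http"
require "uri"

# Verb/path pairs copied from the table above.
MANAGEMENT_ENDPOINTS = {
  health: [:get, "/health"],
  version: [:get, "/version"],
  reset_prefix_cache: [:post, "/reset_prefix_cache"],
  reset_mm_cache: [:post, "/reset_mm_cache"],
  sleep: [:post, "/sleep"],
  wake_up: [:post, "/wake_up"]
}.freeze

# Hypothetical helper (not part of lex-llm-vllm): builds, but does not send,
# the request for one management endpoint.
def management_request(base_url, name, **params)
  verb, path = MANAGEMENT_ENDPOINTS.fetch(name)
  uri = URI.join(base_url, path)
  # Assumption: options such as level and tags travel as query parameters.
  uri.query = URI.encode_www_form(params) unless params.empty?
  request_class = verb == :get ? Net::HTTP::Get : Net::HTTP::Post
  request_class.new(uri)
end

health = management_request("http://localhost:8000", :health)
sleep_request = management_request("http://localhost:8000", :sleep, level: 1)
wake_request = management_request("http://localhost:8000", :wake_up, tags: ["weights"])

[health, sleep_request, wake_request].each do |request|
  puts "#{request.method} #{request.path}"
end
```

Using fetch on the endpoint table means an unknown helper name raises KeyError immediately instead of producing a request for a path the server does not expose.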
Registry Publishing
When lex-llm routing and Legion transport are available, the provider publishes best-effort availability events to the llm.registry exchange:
- Readiness events on readiness(live: true) calls
- Model availability events on list_models discovery
Publishing runs asynchronously on background threads, so it never blocks the caller. All failures are caught and reported through handle_exception.
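The best-effort pattern described here can be sketched in plain Ruby: the publish work runs on a background thread, and any error is routed to a handler instead of reaching the caller. The name publish_best_effort and the on_error callback are hypothetical stand-ins; the gem's own publisher and its handle_exception helper may be structured differently.

```ruby
# Hypothetical sketch (not part of lex-llm-vllm): run a publish block on a
# background thread and hand failures to an error callback instead of raising.
def publish_best_effort(on_error:, &publish)
  Thread.new do
    publish.call
  rescue StandardError => e
    on_error.call(e)
  end
end

published = Queue.new
errors = Queue.new
handler = ->(error) { errors << error }

# A publish that succeeds and one that fails because transport is unavailable.
ok = publish_best_effort(on_error: handler) { published << { event: :readiness, live: true } }
bad = publish_best_effort(on_error: handler) { raise IOError, "transport unavailable" }

# Joining is only for this demonstration; a real caller would not wait.
[ok, bad].each(&:join)

failure = errors.pop
puts published.size
puts failure.message
```

The caller gets the thread handle back immediately, which is what keeps discovery and readiness calls from being slowed down by a slow or missing transport.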
Development
bundle install
bundle exec rspec
bundle exec rubocop
License
MIT