lex-llm-vllm

LegionIO LLM provider extension for vLLM.

This gem lives under Legion::Extensions::Llm::Vllm and depends on lex-llm for shared provider-neutral routing, fleet, and schema primitives.

Load it with require 'legion/extensions/llm/vllm'.

What It Provides

  • Legion::Extensions::Llm::Provider registration as :vllm
  • Shared Legion::Extensions::Llm::Provider::OpenAICompatible request and response handling
  • Chat requests through POST /v1/chat/completions (see the wire-level sketch after this list)
  • Streaming chat with stream_usage_supported? for token usage reporting
  • Model discovery through GET /v1/models
  • Embeddings through POST /v1/embeddings
  • vLLM thinking mode via chat_template_kwargs (configurable through Legion::Settings)
  • Best-effort publishing of readiness and model-availability events to llm.registry when Legion transport is loaded
  • vLLM management helpers: /health, /version, /reset_prefix_cache, /reset_mm_cache, /sleep, /wake_up
  • Normalized OpenAI-compatible capability and modality metadata for discovered models
  • Shared fleet/default settings via Legion::Extensions::Llm.provider_settings
  • Full Legion::Logging::Helper integration with structured handle_exception across all classes
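
The chat and embeddings routes above follow the OpenAI-compatible contract that vLLM serves, so the wire-level call is easy to inspect. A minimal Ruby sketch of the chat request the provider issues under the hood (the endpoint and model id are assumptions; only the route and body shape come from the list above):

require "net/http"
require "json"
require "uri"

uri = URI("http://localhost:8000/v1/chat/completions")
req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
req.body = {
  model: "meta-llama/Llama-3.1-8B-Instruct", # assumed model id
  messages: [{ role: "user", content: "Hello from Legion" }]
}.to_json
res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
puts JSON.parse(res.body).dig("choices", 0, "message", "content")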

Defaults

Legion::Extensions::Llm::Vllm.default_settings
# {
#   provider_family: :vllm,
#   instances: {
#     default: {
#       endpoint: "http://localhost:8000",
#       tier: :private,
#       transport: :http,
#       usage: { inference: true, embedding: true },
#       limits: { concurrency: 8 }
#     }
#   }
# }
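
Fleet deployments add further entries under instances using the same shape; defaults and fleet settings are shared through Legion::Extensions::Llm.provider_settings. A hypothetical second instance (the name, endpoint, and limits are illustrative):

{
  instances: {
    gpu_pool_a: {                       # hypothetical instance name
      endpoint: "http://10.0.0.5:8000",
      tier: :private,
      transport: :http,
      usage: { inference: true, embedding: false },
      limits: { concurrency: 16 }
    }
  }
}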

Configuration

Legion::Extensions::Llm.configure do |config|
  config.vllm_api_base = "http://localhost:8000"
  config.vllm_api_key = ENV["VLLM_API_KEY"]
  config.default_model = "meta-llama/Llama-3.1-8B-Instruct"
  config.default_embedding_model = "BAAI/bge-base-en-v1.5"
end
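
With default_embedding_model set, embedding calls hit the standard OpenAI-compatible route. A minimal wire-level sketch (the endpoint is an assumption; the bearer header reflects how OpenAI-compatible servers typically read an API key):

require "net/http"
require "json"
require "uri"

uri = URI("http://localhost:8000/v1/embeddings")
req = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
req["Authorization"] = "Bearer #{ENV["VLLM_API_KEY"]}" if ENV["VLLM_API_KEY"]
req.body = { model: "BAAI/bge-base-en-v1.5", input: "legion routing" }.to_json
res = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(req) }
puts JSON.parse(res.body).dig("data", 0, "embedding").length # embedding dimension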

Thinking Mode

Enable vLLM thinking mode globally via settings:

# In Legion::Settings or settings JSON
{ llm: { providers: { vllm: { enable_thinking: true } } } }

Alternatively, pass thinking: { enabled: true } per request. When enabled, the provider adds chat_template_kwargs: { enable_thinking: true } to the payload and strips reasoning_effort.
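
A hypothetical per-request call (the chat method name and model id are assumptions; the payload transformation in the comments is the documented behavior):

# `provider` stands in for the registered :vllm provider instance.
provider.chat(
  model: "Qwen/Qwen3-8B",                                    # assumed model id
  messages: [{ role: "user", content: "What is 17 * 24?" }],
  thinking: { enabled: true },  # provider injects chat_template_kwargs: { enable_thinking: true }
  reasoning_effort: "high"      # stripped from the payload before it reaches vLLM
)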

Management Endpoints

The provider exposes helpers for vLLM server management:

Method              Endpoint                  Description
health              GET /health               Server health check
version             GET /version              Server version info
reset_prefix_cache  POST /reset_prefix_cache  Clear the prefix cache
reset_mm_cache      POST /reset_mm_cache      Clear the multimodal cache
sleep(level:)       POST /sleep               Put the server to sleep
wake_up(tags:)      POST /wake_up             Wake the server up
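
A hedged sketch of the helpers in use (`vllm` stands in for a provider instance; how one is obtained is not covered here, and the sleep level and wake tags shown are illustrative):

vllm.health                      # GET /health
vllm.version                     # GET /version
vllm.reset_prefix_cache          # POST /reset_prefix_cache
vllm.reset_mm_cache              # POST /reset_mm_cache
vllm.sleep(level: 1)             # POST /sleep
vllm.wake_up(tags: ["weights"])  # POST /wake_up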

Registry Publishing

When lex-llm routing and Legion transport are available, the provider publishes best-effort availability events to the llm.registry exchange:

  • Readiness events on readiness(live: true) calls
  • Model availability events on list_models discovery

Publishing is async (background threads) and never blocks the caller. All failures are handled gracefully via handle_exception.
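
Both triggers are ordinary provider calls; a sketch (again assuming `vllm` is a provider instance):

vllm.readiness(live: true)  # publishes a readiness event to llm.registry in the background
models = vllm.list_models   # GET /v1/models; model availability events publish as a side effect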

Development

bundle install
bundle exec rspec
bundle exec rubocop

License

MIT