Class: Tep::Llm::OpenAI::Backend
- Inherits:
- Object
- Defined in:
- lib/tep/openai_server.rb
Overview
The interface an app’s backend implements. Defaults make a bare backend safe to compile + serve (empty model list, chat unsupported, cpu device). Subclasses override what they offer.
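The override pattern described above can be sketched as follows. This is a minimal illustration in plain Ruby, not the library's own code: `BaseBackend` is a stand-in that copies the documented defaults so the snippet runs without tep loaded, and `ToyBackend` and the model name `toy-1` are hypothetical.

```ruby
# Stand-in for Tep::Llm::OpenAI::Backend (assumption: mirrors only the
# documented defaults so this sketch runs without tep loaded).
class BaseBackend
  def list_models
    []
  end

  def supports_chat?
    false
  end

  def supports_embeddings?
    false
  end

  def device_kind
    "cpu"
  end
end

# Hypothetical backend: overrides only what it offers and inherits the rest.
class ToyBackend < BaseBackend
  def list_models
    ["toy-1"]
  end
end

backend = ToyBackend.new
puts backend.list_models.inspect
puts backend.supports_chat?
puts backend.device_kind
```

Because the defaults are inherited, a backend that only lists models still answers every other call with a safe value.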
Instance Method Summary
- #chat_completion(req) ⇒ Object
  Message-level (chat) generation.
- #chat_completion_stream(req, sink) ⇒ Object
  Streaming chat (#127).
- #device_kind ⇒ Object
  Backend’s device, surfaced into the run_start event’s backend.kind at serve! time.
- #generate_embeddings(model, token_ids) ⇒ Object
  Embedding generation for /v1/embeddings.
- #generate_from_tokens(model, token_ids, sampling) ⇒ Object
  PRIMARY shape: token-level generation (maps to /v1/completions, non-streaming).
- #generate_stream_from_tokens(model, token_ids, sampling, sink) ⇒ Object
  STREAMING shape (7.2): the per-token variant for SSE /v1/completions when the request carries "stream": true.
- #list_models ⇒ Object
  Available model names -> [String].
- #supports_chat? ⇒ Boolean
  Does this backend implement message-level (chat) generation? When false, /v1/chat/completions returns 501.
- #supports_embeddings? ⇒ Boolean
  Backends that can embed override this -> true (gates /v1/embeddings, chunk 7.3).
Instance Method Details
#chat_completion(req) ⇒ Object
Message-level (chat) generation. Mirrors generate_from_tokens but receives the raw req so the backend can parse the messages array itself and apply its own chat template. Tep doesn’t pre-build a Message[] because templating and role ordering are per-model; the JSON tools live in Tep::Json. The return type is reused from the token path: a Tep::Llm::OpenAI::Completion whose text becomes the assistant message’s content. Base no-op; subclasses override. Only reached when supports_chat? returns true; the handler gates with a 501 otherwise.
# File 'lib/tep/openai_server.rb', line 77

def chat_completion(req)
  Tep::Llm::OpenAI::Completion.new
end
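A chat override might look roughly like the sketch below. Several things here are assumptions rather than tep's API: the request is modeled as a plain Hash with a `:messages` array (the real req type and the Tep::Json calls are not shown in this section), `SketchCompletion` stands in for Tep::Llm::OpenAI::Completion with hypothetical `text` and `usage` accessors, and the `"role: content"` template is invented for illustration.

```ruby
# Stand-in for Tep::Llm::OpenAI::Completion (assumption: accessor names
# are hypothetical; the docs only say it carries text + usage).
SketchCompletion = Struct.new(:text, :usage)

# Hypothetical chat backend that owns its own template.
class EchoChatBackend
  def supports_chat?
    true
  end

  # Hypothetical per-model chat template: one "<role>: <content>" line per
  # message, followed by an open assistant turn.
  def render_prompt(messages)
    lines = messages.map { |m| "#{m[:role]}: #{m[:content]}" }
    (lines + ["assistant:"]).join("\n")
  end

  # Assumption: req is a Hash here; a real backend would parse the raw
  # request itself.
  def chat_completion(req)
    messages = req[:messages]
    prompt = render_prompt(messages)
    # A real backend would run the model on `prompt`; this one echoes the
    # last user message instead.
    last_user = messages.reverse.find { |m| m[:role] == "user" }
    reply = last_user ? "echo: #{last_user[:content]}" : ""
    SketchCompletion.new(reply, { prompt_chars: prompt.length })
  end
end

req = {
  messages: [
    { role: "system", content: "be brief" },
    { role: "user", content: "hi" }
  ]
}
puts EchoChatBackend.new.chat_completion(req).text
```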
#chat_completion_stream(req, sink) ⇒ Object
Streaming chat (#127). Per-token variant for SSE /v1/chat/completions when the request carries "stream": true. Backend writes each token to `sink` via sink.emit_token(piece); the sink formats it as the OpenAI chat-streaming delta frame and writes one chunked frame. Same subclass-override-sink pattern as 7.2 (generate_stream_from_tokens). Base no-op.
# File 'lib/tep/openai_server.rb', line 87

def chat_completion_stream(req, sink)
  0
end
#device_kind ⇒ Object
Backend’s device, surfaced into the run_start event’s backend.kind at serve! time. Defaults to cpu.
# File 'lib/tep/openai_server.rb', line 93

def device_kind
  "cpu"
end
#generate_embeddings(model, token_ids) ⇒ Object
Embedding generation for /v1/embeddings. `token_ids` is the encoded input (Array; this server speaks IDs only, so tokenize client-side, same policy as generate_from_tokens). Returns the pooled embedding as an Array of length d_model; the backend owns the lookup + pooling strategy (toy mean-pools per-token embeddings). Base returns an empty vector so a bare backend compiles; only reached when supports_embeddings? is true (EmbeddingsHandler gates with a 501 otherwise).
# File 'lib/tep/openai_server.rb', line 111

def generate_embeddings(model, token_ids)
  empty = [0.0]
  empty.delete_at(0)
  empty
end
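The mean-pooling strategy mentioned above could be sketched like this. The embedding table, its values, and the `ToyEmbedder` class are all hypothetical, and the snippet is plain Ruby that does not depend on tep; a real backend would look vectors up in its model weights.

```ruby
# Hypothetical embedding table: token id -> vector of length d_model (2 here).
SKETCH_EMBEDDINGS = {
  0 => [1.0, 0.0],
  1 => [0.0, 1.0],
  2 => [1.0, 1.0]
}.freeze

# Hypothetical backend that can embed.
class ToyEmbedder
  def supports_embeddings?
    true
  end

  # Mean-pools the per-token vectors into one vector of length d_model.
  # `model` is ignored because the sketch has a single table.
  def generate_embeddings(model, token_ids)
    return [] if token_ids.empty?

    d_model = SKETCH_EMBEDDINGS.values.first.length
    sums = Array.new(d_model, 0.0)
    token_ids.each do |id|
      SKETCH_EMBEDDINGS.fetch(id).each_with_index { |x, i| sums[i] += x }
    end
    sums.map { |s| s / token_ids.length }
  end
end

puts ToyEmbedder.new.generate_embeddings("toy-1", [0, 1]).inspect
```

Returning an empty Array for empty input matches the base class's empty-vector default and avoids dividing by zero.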
#generate_from_tokens(model, token_ids, sampling) ⇒ Object
PRIMARY shape: token-level generation (maps to /v1/completions, non-streaming). `token_ids` is the encoded prompt (Array); `sampling` is a Tep::Llm::OpenAI::Sampling. Returns a Tep::Llm::OpenAI::Completion (text + usage). The base returns an empty completion so a bare backend compiles; real backends override.
# File 'lib/tep/openai_server.rb', line 44

def generate_from_tokens(model, token_ids, sampling)
  Tep::Llm::OpenAI::Completion.new
end
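As an illustration of the token-in, Completion-out shape, here is a toy override under stated assumptions. `TokenCompletion` and `TokenSampling` are stand-ins for Tep::Llm::OpenAI::Completion and Tep::Llm::OpenAI::Sampling; the `max_tokens` field, the usage keys, and the three-entry vocabulary are hypothetical, since this section does not list the real fields.

```ruby
# Stand-ins (assumption: field names are hypothetical).
TokenCompletion = Struct.new(:text, :usage)
TokenSampling = Struct.new(:max_tokens)

# Hypothetical vocabulary: token id is the index into this array.
TOKEN_VOCAB = ["hello", " world", "!"].freeze

# Hypothetical backend: "generates" by replaying the prompt tokens in
# reverse order, capped at sampling.max_tokens. No model is involved.
class ReverseBackend
  def generate_from_tokens(model, token_ids, sampling)
    out_ids = token_ids.reverse.first(sampling.max_tokens)
    text = out_ids.map { |id| TOKEN_VOCAB.fetch(id) }.join
    usage = {
      prompt_tokens: token_ids.length,
      completion_tokens: out_ids.length
    }
    TokenCompletion.new(text, usage)
  end
end

completion = ReverseBackend.new.generate_from_tokens("toy-1", [0, 1, 2], TokenSampling.new(2))
puts completion.text
puts completion.usage.inspect
```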
#generate_stream_from_tokens(model, token_ids, sampling, sink) ⇒ Object
STREAMING shape (7.2): the per-token variant for SSE /v1/completions when the request carries "stream": true. The backend writes each token to `sink` via sink.emit_token(piece); the sink (Tep::Llm::OpenAI::StreamSink) formats it as an OpenAI SSE frame and writes to the outbound chunked stream. Blocks/yields don’t lower across the spinel boundary, so a typed sink replaces the block – backends never see SSE wire format or the client fd. Base no-op (subclasses override).
# File 'lib/tep/openai_server.rb', line 57

def generate_stream_from_tokens(model, token_ids, sampling, sink)
  0
end
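The sink pattern above can be sketched with a recording stand-in. Only `emit_token(piece)` comes from the documentation; `RecordingSink`, the vocabulary, and the choice to return the number of emitted pieces are assumptions (the base returns 0 and this section does not say what the return value means).

```ruby
# Stand-in for Tep::Llm::OpenAI::StreamSink. The real sink formats SSE
# frames; this one only records pieces so the sketch can be inspected.
class RecordingSink
  attr_reader :pieces

  def initialize
    @pieces = []
  end

  def emit_token(piece)
    @pieces << piece
  end
end

# Hypothetical vocabulary: token id is the index into this array.
STREAM_VOCAB = ["hello", " world", "!"].freeze

# Hypothetical backend: streams the prompt back one token at a time.
class StreamingToyBackend
  def generate_stream_from_tokens(model, token_ids, sampling, sink)
    emitted = 0
    token_ids.each do |id|
      # The backend hands over plain text pieces; it never builds SSE frames.
      sink.emit_token(STREAM_VOCAB.fetch(id))
      emitted += 1
    end
    emitted # assumption: the meaning of the return value is not documented
  end
end

sink = RecordingSink.new
StreamingToyBackend.new.generate_stream_from_tokens("toy-1", [0, 1, 2], nil, sink)
puts sink.pieces.inspect
```

Keeping the wire format inside the sink means the same backend method can be exercised in tests with a recording sink like this one.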
#list_models ⇒ Object
Available model names -> [String]. /v1/models wraps these.
# File 'lib/tep/openai_server.rb', line 31

def list_models
  empty = [""]
  empty.delete_at(0)
  empty
end
#supports_chat? ⇒ Boolean
Does this backend implement message-level (chat) generation? When false, /v1/chat/completions returns 501. (The chat template is per-model + an ML concern; tep doesn’t ship one.)
# File 'lib/tep/openai_server.rb', line 64

def supports_chat?
  false
end
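The 501 gate can be pictured as a capability check in front of the backend call. The sketch below is not tep's handler: `handle_chat`, its `[status, body]` return shape, and both backend classes are hypothetical and exist only to show the control flow.

```ruby
# Hypothetical backend that keeps the default (no chat).
class NoChatBackend
  def supports_chat?
    false
  end
end

# Hypothetical backend that opts in to chat.
class YesChatBackend
  def supports_chat?
    true
  end

  def chat_completion(req)
    "reply to #{req}"
  end
end

# Hypothetical handler: returns [status, body]. chat_completion is only
# called once supports_chat? has returned true.
def handle_chat(backend, req)
  return [501, "chat not supported by this backend"] unless backend.supports_chat?

  [200, backend.chat_completion(req)]
end

puts handle_chat(NoChatBackend.new, "hi").inspect
puts handle_chat(YesChatBackend.new, "hi").inspect
```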
#supports_embeddings? ⇒ Boolean
Backends that can embed override this -> true (gates /v1/embeddings, chunk 7.3).
# File 'lib/tep/openai_server.rb', line 99

def supports_embeddings?
  false
end