Class: Pikuri::VectorDb::Tokenizer::LlamaServer

Inherits: Object

Defined in: lib/pikuri/vector_db/tokenizer/llama_server.rb

Overview

Exact tokenization via HTTP POST /tokenize against a llama.cpp server. Faraday wire format:

POST <endpoint>/tokenize
Content-Type: application/json
{ "content": "hello world" }
→
200 OK
{ "tokens": [123, 456, 789] }

#count returns the length of the tokens array in the response body. This matches the native llama.cpp REST shape, so no OpenAI-compat layer is needed; the /tokenize endpoint isn’t part of the OpenAI spec anyway, so this is llama.cpp’s own.
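As a rough illustration of the wire format above, here is a minimal sketch using only the Ruby standard library rather than Faraday. The helper names build_tokenize_request and token_count_from are hypothetical and not part of pikuri, and the endpoint URL is an assumed local address.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) the POST /tokenize request.
def build_tokenize_request(endpoint, text)
  base = endpoint.end_with?('/') ? endpoint : "#{endpoint}/"
  request = Net::HTTP::Post.new(URI.join(base, 'tokenize'))
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate({ 'content' => text })
  request
end

# Hypothetical helper: the token count is the length of the "tokens" array.
def token_count_from(response_body)
  JSON.parse(response_body).fetch('tokens').length
end

# Sending the request needs a live llama.cpp server (address assumed):
#   req = build_tokenize_request('http://localhost:8081', 'hello world')
#   res = Net::HTTP.start(req.uri.host, req.uri.port) { |http| http.request(req) }
#   token_count_from(res.body)
```

Splitting request construction from response parsing keeps both halves checkable without a running server.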

Point this at the embedder, not the chat server

Chunking is sized to fit the *embedder’s* context (e.g., bge-small-en-v1.5 has a 512-token limit; text-embedding-3-large has 8192). The chat model’s tokenizer would give the wrong number when they’re different models. So endpoint: should be the URL of the server hosting the embedder, not the chat model.

Per-string round-trip cost

Indexing a 10 MB corpus chunked at 512 tokens ≈ 20k tokenize calls. Each is a few ms on localhost. So indexing pays roughly a minute or two of extra wall time over CharHeuristic (assuming 3–5 ms per call, 20k calls ≈ 60–100 s). This is a one-time cost, paid at boot or on explicit reindex. Acceptable; not worth optimizing.

Errors are loud

Per CLAUDE.md “Errors are loud”: HTTP non-2xx, JSON parse failure, missing tokens key, network failure all raise rather than swallow + return a fudge value. The caller (Chunker) is internal pikuri code, not the LLM; this is bug territory, not “tell the model and let it retry.”
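The failure modes listed above could be handled along these lines. This is a sketch only: TokenizeError and strict_token_count are hypothetical names, and the exception classes the real #count raises are not shown in this page.

```ruby
require 'json'

# Hypothetical error class for this sketch; the real class may raise something else.
class TokenizeError < StandardError; end

# Validate a /tokenize response and return the token count.
# Every failure raises; nothing is replaced by a guessed value.
def strict_token_count(status, raw_body)
  raise TokenizeError, "tokenize returned HTTP #{status}" unless (200..299).cover?(status)

  parsed = JSON.parse(raw_body) # raises JSON::ParserError on malformed JSON
  tokens = parsed['tokens'] if parsed.is_a?(Hash)
  raise TokenizeError, 'response is missing a "tokens" array' unless tokens.is_a?(Array)

  tokens.length
end
```

Network failures need no special handling in this shape: the exception from the HTTP client simply propagates to the caller.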

Instance Method Summary

#count(text) ⇒ Integer

Exact token count via the llama.cpp server.

#initialize(endpoint:, connection: nil) ⇒ LlamaServer

Constructor.

Constructor Details

#initialize(endpoint:, connection: nil) ⇒ `LlamaServer`

Parameters:

endpoint (String) — base URL of the llama.cpp server, including the scheme, e.g. ‘http://localhost:8081’. The /tokenize path is appended internally.

connection (Faraday::Connection, nil) (defaults to: nil) — optional injection point for tests. When nil, a fresh Faraday connection is built against endpoint with the JSON request and response middleware applied.

Raises:

(ArgumentError) — if endpoint is nil or empty.

# File 'lib/pikuri/vector_db/tokenizer/llama_server.rb', line 59

def initialize(endpoint:, connection: nil)
  raise ArgumentError, 'endpoint must be non-empty' if endpoint.nil? || endpoint.empty?

  @endpoint = endpoint
  @connection = connection || Faraday.new(url: endpoint) do |f|
    f.request :json
    f.response :json
    f.adapter Faraday.default_adapter
  end
end
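
The connection: parameter makes it possible to exercise the tokenizer without a running server. The sketch below shows the idea with plain Ruby stand-ins: FakeResponse, FakeConnection and SketchTokenizer are all hypothetical, and SketchTokenizer#count is only an assumed shape, since the real method body is not shown in this page. The fake response carries an already-parsed Hash, mirroring what Faraday's JSON response middleware would hand back.

```ruby
# Stand-in for a Faraday response: only the two readers the sketch uses.
FakeResponse = Struct.new(:status, :body)

# Stand-in for Faraday::Connection; records requests and returns a canned reply.
class FakeConnection
  attr_reader :requests

  def initialize(reply)
    @reply = reply
    @requests = []
  end

  def post(path, payload)
    @requests << [path, payload]
    @reply
  end
end

# Hypothetical tokenizer with an assumed #count; not the pikuri implementation.
class SketchTokenizer
  def initialize(connection:)
    @connection = connection
  end

  def count(text)
    response = @connection.post('tokenize', { content: text })
    raise "tokenize failed with HTTP #{response.status}" unless (200..299).cover?(response.status)

    response.body.fetch('tokens').length # KeyError if "tokens" is missing
  end
end

conn = FakeConnection.new(FakeResponse.new(200, { 'tokens' => [123, 456, 789] }))
puts SketchTokenizer.new(connection: conn).count('hello world')
```

With the real class, a test would pass a similar object (or a Faraday connection built on Faraday's test adapter) as connection: and assert on both the returned count and the recorded request.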

Instance Method Details

#count(text) ⇒ `Integer`

Exact token count via the llama.cpp server.