Module: Pikuri::VectorDb::Tokenizer
- Defined in:
- lib/pikuri/vector_db/tokenizer.rb,
lib/pikuri/vector_db/tokenizer/llama_server.rb,
lib/pikuri/vector_db/tokenizer/char_heuristic.rb
Overview
Namespace for tokenizers. Two ship in v1:
-
CharHeuristic: zero-dependency default. Approximates token counts at ~4 chars/token (configurable). Slight overshoot is fine because embedders truncate gracefully.
-
LlamaServer: exact counts via HTTP POST /tokenize against a llama.cpp server (typically the same endpoint hosting the embedder model, so chunk sizing matches the embedder’s actual vocab).
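As a rough illustration of the two approaches, the sketch below shows one way each could be written. It is not the shipped implementation: the class names SketchCharHeuristic and SketchLlamaServer, the chars_per_token: and base_url: parameters, the round-up behaviour, and the default URL are all assumptions made for this example. The request and response shapes (a JSON body with a "content" key, a reply with a "tokens" array) follow the llama.cpp server's /tokenize endpoint as commonly documented; verify them against the server version in use.

```ruby
require "json"
require "net/http"
require "uri"

# Hypothetical stand-in for the character heuristic: roughly
# chars_per_token characters per token, rounded up (rounding is an assumption).
class SketchCharHeuristic
  def initialize(chars_per_token: 4)
    raise ArgumentError, "chars_per_token must be positive" unless chars_per_token.positive?

    @chars_per_token = chars_per_token
  end

  # Returns an Integer; the empty string yields 0.
  def count(text)
    (text.length.to_f / @chars_per_token).ceil
  end
end

# Hypothetical stand-in for the server-backed tokenizer: asks a
# llama.cpp server to tokenize the text and counts the returned tokens.
class SketchLlamaServer
  def initialize(base_url: "http://localhost:8080") # assumed default address
    @uri = URI.join(base_url, "/tokenize")
  end

  def count(text)
    return 0 if text.empty? # no round trip needed for empty input

    response = Net::HTTP.post(@uri, JSON.generate(content: text),
                              "Content-Type" => "application/json")
    raise "tokenize failed: HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

    JSON.parse(response.body).fetch("tokens").length
  end
end

heuristic = SketchCharHeuristic.new
heuristic.count("hello world") # => 3 (11 characters / 4, rounded up)
```

Both objects expose the same #count method, so either can be handed to a consumer that only needs a token count.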
Tokenizer protocol
Duck-typed, single method. The Chunker::FixedWindow consumes any object responding to:
-
#count(text): returns the token count for text as an Integer. An empty string returns 0. Implementations should be deterministic: the same text in returns the same count.
No abstract base class. The Chunker doesn’t care which implementation it gets, only that #count exists; matches pikuri’s other duck-typed seams (Backend, Confirmer, Filesystem).
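A minimal sketch of what conforming to the protocol looks like, assuming nothing beyond the #count contract described above. WordTokenizer, byte_counter, and fits? are hypothetical names invented for this example and are not part of pikuri.

```ruby
# Hypothetical tokenizer: one token per whitespace-separated word.
class WordTokenizer
  def count(text)
    text.split.length
  end
end

# Any object works, not only class instances. Here a bare Object
# gets a singleton #count that treats each byte as one token.
byte_counter = Object.new
def byte_counter.count(text)
  text.bytesize
end

# Hypothetical consumer: depends only on #count, never on the class.
def fits?(tokenizer, text, budget)
  tokenizer.count(text) <= budget
end

fits?(WordTokenizer.new, "one two three", 3) # => true
fits?(byte_counter, "one two three", 3)      # => false (13 bytes)
```

Because the consumer never checks the tokenizer's class, swapping implementations requires no inheritance or registration step.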
Why a separate protocol rather than baking heuristics into the Chunker
The chunker’s job is “pack text into size-token windows”; how the count is derived is orthogonal. Lifting tokenization out as its own seam means:
-
The Chunker is testable against a deterministic fake tokenizer that returns whatever the test wants.
-
A future Tokenizer::Tiktoken (or llama_cpp gem-based tokenizer) plugs in without touching the Chunker.
-
The CharHeuristic / LlamaServer choice is a host-level configuration knob; the host wires whichever fits their stack into the chunker: argument passed to Extension, with no other change required.
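The testability point can be sketched with a fake tokenizer driving a toy window packer. Everything here is hypothetical: FakeTokenizer and pack_windows are illustrative names, and the greedy word-packing loop is a simplified stand-in, not the actual Chunker::FixedWindow algorithm.

```ruby
# Hypothetical fake: each word costs 1 token unless the test
# assigns it a different cost. Fully deterministic.
class FakeTokenizer
  def initialize(costs = {})
    @costs = costs
  end

  def count(text)
    text.split.sum { |word| @costs.fetch(word, 1) }
  end
end

# Toy packer: greedily groups words into windows of at most `size`
# tokens, as measured by whatever tokenizer it is given.
def pack_windows(text, tokenizer:, size:)
  windows = []
  current = []
  text.split.each do |word|
    candidate = (current + [word]).join(" ")
    if current.any? && tokenizer.count(candidate) > size
      windows << current.join(" ") # close the full window
      current = [word]             # start a new one with this word
    else
      current << word
    end
  end
  windows << current.join(" ") unless current.empty?
  windows
end

tokenizer = FakeTokenizer.new("big" => 3)
pack_windows("a b big c d", tokenizer: tokenizer, size: 4)
# => ["a b", "big c", "d"]
```

The test controls token costs directly, so window boundaries can be asserted exactly without a real vocabulary or a running server.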
Where this lives
Pikuri::VectorDb::Tokenizer::* for v1. Promotes to Pikuri::Tokenizer::* in pikuri-core when a second consumer arrives (a token-aware bash-output cap, a token-aware Read truncation) — the protocol shape stays the same; only the namespace and dependency direction change.
Defined Under Namespace
Classes: CharHeuristic, LlamaServer