Class: Pikuri::VectorDb::Tokenizer::CharHeuristic
- Inherits:
-
Object
- Object
- Pikuri::VectorDb::Tokenizer::CharHeuristic
- Defined in:
- lib/pikuri/vector_db/tokenizer/char_heuristic.rb
Overview
Approximate tokenization via a fixed chars-per-token ratio. The zero-dep default; ships at construction without a network round-trip. Slight overshoot (under-counting tokens) is fine because embedders truncate gracefully — a chunk that overshoots by 10% is silently capped, not rejected.
The 4-chars-per-token rule
English BPE tokenizers — GPT, llama, BGE, the family of sentence-transformer models pikuri-vectordb users actually run — settle around 3.5–4.5 chars per token on natural-language text. Code drops to ~3 (punctuation density). Mandarin / Japanese drop to ~1–2 (one logograph per token). Default 4 sits in the English-prose sweet spot; override chars_per_token: for other corpora.
When to choose this vs LlamaServer
-
CharHeuristic — you want zero network dependency, you trust the embedder’s truncation, you’re indexing English / Western-European prose. The educational / out-of-the-box default.
-
LlamaServer — you want exact counts, your corpus is non-English or code-heavy, or your embedder has a tight context window where overshoot matters (e.g., bge-small-en-v1.5 at 512 tokens — 10% slop is 50 tokens of clipped content).
Instance Method Summary collapse
-
#count(text) ⇒ Integer
Approximate token count.
- #initialize(chars_per_token: 4.0) ⇒ CharHeuristic constructor
Constructor Details
#initialize(chars_per_token: 4.0) ⇒ CharHeuristic
41 42 43 44 45 |
# File 'lib/pikuri/vector_db/tokenizer/char_heuristic.rb', line 41 def initialize(chars_per_token: 4.0) raise ArgumentError, 'chars_per_token must be positive' if chars_per_token <= 0 @chars_per_token = chars_per_token.to_f end |
Instance Method Details
#count(text) ⇒ Integer
Approximate token count. Empty string returns 0; short strings round up to at least 1 token.
52 53 54 55 56 |
# File 'lib/pikuri/vector_db/tokenizer/char_heuristic.rb', line 52 def count(text) return 0 if text.empty? (text.length / @chars_per_token).ceil end |