Class: Woods::Embedding::TokenCounter
- Inherits: Object
  - Object
    - Woods::Embedding::TokenCounter
- Defined in: lib/woods/embedding/token_counter.rb
Overview
Exact or estimated token counts for embedding inputs.
When the optional `tokenizers` gem (ankane) is installed, TokenCounter loads the `bert-base-uncased` WordPiece tokenizer that nomic-embed-text is built on and returns exact token counts. Otherwise it falls back to a conservative chars/token ratio and warns once.
Exact counting is strictly preferred for the Ollama path: Ollama v0.13.5+ stopped honouring the `truncate: true` flag on `/api/embed` (see ollama/ollama#14186), so chunks that exceed `num_ctx` return a 400 instead of being truncated. Client-side sizing is the only reliable option until the regression is fixed upstream, and chars/token ratios vary too widely across Rails internals for a single fixed number to cover every case.
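As a rough illustration of the client-side sizing described above, the sketch below packs lines into chunks that stay within a token budget. `StubCounter` and `pack_lines` are hypothetical names introduced only for this example; the stub stands in for TokenCounter and relies on nothing beyond the documented `#count(text) ⇒ Integer` contract, and its rounding is an assumption.

```ruby
# Stand-in for Woods::Embedding::TokenCounter so the sketch runs on its own.
# Only the #count(text) => Integer contract comes from the docs; the
# chars/token rounding used here is an assumption.
class StubCounter
  def initialize(chars_per_token: 1.2)
    @chars_per_token = chars_per_token
  end

  def count(text)
    return 0 if text.nil? || text.empty?

    (text.length / @chars_per_token).ceil
  end
end

# Hypothetical helper: greedily packs lines into chunks whose token count
# stays at or below max_tokens. A single line that is over budget on its
# own is emitted as its own chunk and left for the caller to handle.
def pack_lines(text, counter:, max_tokens:)
  chunks = []
  current = ''
  text.each_line do |line|
    candidate = current + line
    if !current.empty? && counter.count(candidate) > max_tokens
      chunks << current
      current = line
    else
      current = candidate
    end
  end
  chunks << current unless current.empty?
  chunks
end
```

With a real counter, `max_tokens` would presumably be derived from the model's `num_ctx`, so that no request depends on server-side truncation.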
Constant Summary
- BERT_MODEL = 'bert-base-uncased'
  HuggingFace tokenizer id shared by every nomic-embed-text variant.
- CONSERVATIVE_CHARS_PER_TOKEN = 1.2
  Conservative floor for when the tokenizer gem isn’t installed. Lower than any ratio we’ve observed failing in the testbed against dense Rails source. Still approximate; install `tokenizers` for exact counts.
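To make the 1.2 ratio concrete, here is a minimal sketch of how a chars/token fallback could work. The real `#estimate` method is not shown on this page, so the rounding direction and the use of `String#length` are assumptions, and `estimate_tokens` and `max_chars_for` are hypothetical names.

```ruby
# Mirrors CONSERVATIVE_CHARS_PER_TOKEN from the docs above.
CHARS_PER_TOKEN = 1.2

# Assumed fallback: characters divided by the ratio, rounded up so the
# estimate errs on the high side.
def estimate_tokens(text, chars_per_token: CHARS_PER_TOKEN)
  return 0 if text.nil? || text.empty?

  (text.length / chars_per_token).ceil
end

# Inverse view: the longest input, in characters, whose estimate still
# fits inside the given token budget.
def max_chars_for(token_budget, chars_per_token: CHARS_PER_TOKEN)
  (token_budget * chars_per_token).floor
end
```

Under these assumptions a 2048-token budget allows at most 2457 characters, which shows how much smaller chunks become in estimate mode than an exact tokenizer would typically require.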
Class Attribute Summary
- .warned_messages ⇒ Object (readonly)
  Returns the value of attribute warned_messages.
- .warned_mutex ⇒ Object (readonly)
  Returns the value of attribute warned_mutex.
Instance Attribute Summary
- #chars_per_token ⇒ Float (readonly)
  Fallback chars-per-token ratio.
Class Method Summary
- .reset_warned! ⇒ Object
  Reset the per-process warning dedup.
Instance Method Summary
- #count(text) ⇒ Integer
  Exact token count when the tokenizer is loaded, chars/token estimate otherwise.
- #exact? ⇒ Boolean
  True when the real tokenizer is loaded and in use.
- #initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL) ⇒ TokenCounter (constructor)
  A new instance of TokenCounter.
Constructor Details
#initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL) ⇒ TokenCounter
Returns a new instance of TokenCounter.
    # File 'lib/woods/embedding/token_counter.rb', line 39
    def initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL)
      @chars_per_token = chars_per_token
      @tokenizer_id = tokenizer_id
      @load_attempted = false
      @load_mutex = Mutex.new
    end
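The `@load_attempted` flag and `@load_mutex` suggest that the tokenizer is loaded lazily and at most once. The private `tokenizer` method is not shown on this page, so the following is only a sketch of that pattern: `LazyLoader` and its `loader:` argument are hypothetical, introduced so the example runs without the gem installed. In the real class the loader would presumably require `tokenizers` and call `Tokenizers.from_pretrained(@tokenizer_id)`.

```ruby
# Sketch of a load-once pattern guarded by a mutex. Not the library's
# actual implementation; names here are illustrative.
class LazyLoader
  # loader: any callable that returns a tokenizer object or raises
  # (for example LoadError when an optional gem is missing).
  def initialize(loader:)
    @loader = loader
    @load_attempted = false
    @load_mutex = Mutex.new
    @tokenizer = nil
  end

  # Returns the loaded tokenizer, or nil if loading failed. The loader
  # runs at most once, even when several threads call this concurrently.
  def tokenizer
    @load_mutex.synchronize do
      unless @load_attempted
        @load_attempted = true
        @tokenizer = begin
          @loader.call
        rescue LoadError, StandardError
          # LoadError is a ScriptError, so it must be rescued explicitly.
          nil
        end
      end
      @tokenizer
    end
  end
end
```

Recording the attempt before calling the loader means a failed load is not retried on every call, which keeps the fallback path cheap.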
Class Attribute Details
.warned_messages ⇒ Object (readonly)
Returns the value of attribute warned_messages.
    # File 'lib/woods/embedding/token_counter.rb', line 113
    def warned_messages
      @warned_messages
    end
.warned_mutex ⇒ Object (readonly)
Returns the value of attribute warned_mutex.
    # File 'lib/woods/embedding/token_counter.rb', line 113
    def warned_mutex
      @warned_mutex
    end
Instance Attribute Details
#chars_per_token ⇒ Float (readonly)
Returns the fallback chars-per-token ratio.
    # File 'lib/woods/embedding/token_counter.rb', line 47
    def chars_per_token
      @chars_per_token
    end
Class Method Details
.reset_warned! ⇒ Object
Reset the per-process warning dedup. For tests only — production callers should never need to clear it.
    # File 'lib/woods/embedding/token_counter.rb', line 117
    def reset_warned!
      @warned_mutex.synchronize { @warned_messages.clear }
    end
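For context, a per-process warn-once mechanism of the kind `reset_warned!` clears might look like the sketch below. The page does not show how warnings are recorded, so the `Set` storage, the `warn_once` method, and the `WarnOnce` class name are all assumptions; only `warned_messages`, `warned_mutex`, and `reset_warned!` correspond to names in the docs.

```ruby
require 'set'

# Illustrative stand-in, not the library's class.
class WarnOnce
  @warned_messages = Set.new # assumed storage type
  @warned_mutex = Mutex.new

  class << self
    attr_reader :warned_messages, :warned_mutex

    # Hypothetical helper: prints the message the first time it is seen
    # in this process. Returns true if it was printed, false otherwise.
    def warn_once(message, io: $stderr)
      # Set#add? returns nil when the element is already present.
      first_time = @warned_mutex.synchronize { !@warned_messages.add?(message).nil? }
      io.puts(message) if first_time
      first_time
    end

    # Same body as the documented reset_warned!.
    def reset_warned!
      @warned_mutex.synchronize { @warned_messages.clear }
    end
  end
end
```

In a test suite, calling `reset_warned!` in a setup hook would let each example assert on the warning independently.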
Instance Method Details
#count(text) ⇒ Integer
Exact token count when the tokenizer is loaded, chars/token estimate otherwise.
    # File 'lib/woods/embedding/token_counter.rb', line 54
    def count(text)
      return 0 if text.nil? || text.empty?

      tok = tokenizer
      tok ? tok.encode(text).ids.length : estimate(text)
    end
#exact? ⇒ Boolean
True when the real tokenizer is loaded and in use.
    # File 'lib/woods/embedding/token_counter.rb', line 64
    def exact?
      !tokenizer.nil?
    end