Class: Woods::Embedding::TokenCounter

Inherits: Object

Defined in: lib/woods/embedding/token_counter.rb

Overview

Exact or estimated token counts for embedding inputs.

When the optional `tokenizers` gem (ankane) is installed, TokenCounter loads the `bert-base-uncased` WordPiece tokenizer that nomic-embed-text is built on and returns exact token counts. Otherwise it falls back to a conservative chars/token ratio and warns once.

Exact counting is strictly preferred for the Ollama path: Ollama v0.13.5+ stopped honouring the `truncate: true` flag on `/api/embed` (see ollama/ollama#14186), so chunks that exceed `num_ctx` return a 400 instead of being truncated. Client-side sizing is the only reliable option until the regression is fixed upstream, and chars/token ratios vary too widely across Rails internals to cover every case with a fixed number.
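The client-side sizing described above can be sketched as a greedy packer that keeps each chunk within a token budget. This is only an illustration: `pack_lines`, `StubCounter`, and the `budget` parameter are hypothetical names that do not appear in Woods, and the stub counter stands in for any object that responds to `#count(text)`.

```ruby
# Hypothetical helper, not part of Woods: greedily packs lines into chunks
# whose token count stays within `budget`. `counter` is any object that
# responds to #count(text) and returns an Integer.
# Note: a single line that exceeds the budget on its own is still emitted
# as one oversized chunk; a real pipeline would need to split it further.
def pack_lines(lines, counter:, budget:)
  chunks = []
  current = ""
  lines.each do |line|
    candidate = current + line
    if !current.empty? && counter.count(candidate) > budget
      chunks << current   # close the chunk that still fits
      current = line      # start a new chunk with the overflowing line
    else
      current = candidate
    end
  end
  chunks << current unless current.empty?
  chunks
end

# Stand-in counter (an assumption for this sketch) so the example runs
# without the gem; it rounds a chars/token estimate up.
StubCounter = Struct.new(:chars_per_token) do
  def count(text)
    return 0 if text.nil? || text.empty?

    (text.length / chars_per_token).ceil
  end
end

source = ["class Foo\n", "  def bar; end\n", "end\n"]
pack_lines(source, counter: StubCounter.new(1.2), budget: 20)
```

With a real TokenCounter in place of the stub, the budget would presumably be derived from the model's `num_ctx`, minus whatever margin the caller considers safe.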

Examples:

counter = Woods::Embedding::TokenCounter.new
counter.count("ActionController::Metal::ConditionalGet")  # => 13

Constant Summary

BERT_MODEL = 'bert-base-uncased'

HuggingFace tokenizer id shared by every nomic-embed-text variant.

CONSERVATIVE_CHARS_PER_TOKEN = 1.2

Conservative floor for when the tokenizer gem isn’t installed. Lower than any ratio we’ve observed failing in the testbed against dense Rails source. Still approximate; install `tokenizers` for exact counts.
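To make the fallback ratio concrete, here is a minimal sketch of a chars/token estimate. The private helper that TokenCounter actually uses is not shown on this page, so `estimate_tokens` is a hypothetical name and rounding up is an assumption.

```ruby
# Hypothetical stand-in for the fallback path; the real private helper
# is not shown on this page and may round differently.
CONSERVATIVE_CHARS_PER_TOKEN = 1.2

def estimate_tokens(text, chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN)
  return 0 if text.nil? || text.empty?

  # Round up so the estimate never undercounts relative to the ratio.
  (text.length / chars_per_token).ceil
end

estimate_tokens("ActionController::Metal::ConditionalGet") # => 33
```

The input is 39 characters long, so the estimate is 33 tokens, compared with the exact count of 13 reported in the class example. That gap is the price of a ratio low enough to stay on the safe side.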

Class Attribute Summary

.warned_messages ⇒ Object readonly

Returns the value of attribute warned_messages.
.warned_mutex ⇒ Object readonly

Returns the value of attribute warned_mutex.

Instance Attribute Summary

#chars_per_token ⇒ Float readonly

Fallback chars-per-token ratio.

Class Method Summary

.reset_warned! ⇒ Object

Reset the per-process warning dedup.

Instance Method Summary

#count(text) ⇒ Integer

Exact token count when the tokenizer is loaded, chars/token estimate otherwise.
#exact? ⇒ Boolean

True when the real tokenizer is loaded and in use.
#initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL) ⇒ TokenCounter constructor

A new instance of TokenCounter.

Constructor Details

#initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL) ⇒ `TokenCounter`

Returns a new instance of TokenCounter.

Parameters:

chars_per_token (Float) (defaults to: CONSERVATIVE_CHARS_PER_TOKEN) —

fallback ratio when the tokenizer is unavailable
tokenizer_id (String) (defaults to: BERT_MODEL) —

HuggingFace model id passed to `Tokenizers.from_pretrained`

# File 'lib/woods/embedding/token_counter.rb', line 39

def initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL)
  @chars_per_token = chars_per_token
  @tokenizer_id = tokenizer_id
  @load_attempted = false
  @load_mutex = Mutex.new
end
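The `@load_attempted` flag and `@load_mutex` set up in the constructor suggest that the tokenizer is loaded lazily and at most once per instance. The private loader is not shown on this page, so the following is only a sketch of that pattern. `LazyTokenizer` and its injected `loader` callable are hypothetical and exist so the example runs without the gem.

```ruby
# Hypothetical standalone class illustrating a load-once pattern; it is
# not the actual Woods implementation.
class LazyTokenizer
  # loader: a callable that returns a tokenizer object or raises
  # (for example because the optional gem is missing).
  def initialize(loader)
    @loader = loader
    @load_attempted = false
    @load_mutex = Mutex.new
    @tokenizer = nil
  end

  # Returns the tokenizer, or nil when loading failed. The loader runs at
  # most once, even when several threads call this concurrently.
  def tokenizer
    @load_mutex.synchronize do
      unless @load_attempted
        @load_attempted = true
        @tokenizer = begin
          @loader.call
        rescue LoadError, StandardError
          nil # signal "fall back to the estimate" to the caller
        end
      end
      @tokenizer
    end
  end
end

missing = LazyTokenizer.new(-> { raise LoadError, "tokenizers gem not installed" })
missing.tokenizer # => nil
```

Recording the attempt before calling the loader means a failed load is not retried on every call, which keeps the fallback path cheap.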

Class Attribute Details

.warned_messages ⇒ `Object` (readonly)

Returns the value of attribute warned_messages.



# File 'lib/woods/embedding/token_counter.rb', line 113

def warned_messages
  @warned_messages
end

.warned_mutex ⇒ `Object` (readonly)

Returns the value of attribute warned_mutex.



# File 'lib/woods/embedding/token_counter.rb', line 113

def warned_mutex
  @warned_mutex
end

Instance Attribute Details

#chars_per_token ⇒ `Float` (readonly)

Returns fallback chars-per-token ratio.

Returns:

(Float) —

fallback chars-per-token ratio



# File 'lib/woods/embedding/token_counter.rb', line 47

def chars_per_token
  @chars_per_token
end

Class Method Details

.reset_warned! ⇒ `Object`

Reset the per-process warning dedup. For tests only — production callers should never need to clear it.



# File 'lib/woods/embedding/token_counter.rb', line 117

def reset_warned!
  @warned_mutex.synchronize { @warned_messages.clear }
end
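The per-process warning dedup that `reset_warned!` clears can be sketched as follows. `WarnOnce`, `warn_once`, and the use of a `Set` for `@warned_messages` are assumptions made for illustration; this page only shows that the collection responds to `clear` and is guarded by `@warned_mutex`.

```ruby
require "set"

# Hypothetical standalone class, not the Woods implementation.
class WarnOnce
  @warned_messages = Set.new # assumption: the real collection type is not documented
  @warned_mutex = Mutex.new

  class << self
    attr_reader :warned_messages, :warned_mutex

    # Prints `message` the first time it is seen in this process.
    # Returns true when the message was printed, false when suppressed.
    def warn_once(message, io: $stderr)
      # Set#add? returns nil when the element is already present.
      first = @warned_mutex.synchronize { @warned_messages.add?(message) }
      io.puts(message) if first
      !first.nil?
    end

    # Clears the dedup state; intended for test setup only.
    def reset_warned!
      @warned_mutex.synchronize { @warned_messages.clear }
    end
  end
end
```

In a test suite, calling `reset_warned!` in a setup hook lets each example assert on the warning independently of the order in which examples run.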

Instance Method Details

#count(text) ⇒ `Integer`

Exact token count when the tokenizer is loaded, chars/token estimate otherwise.

Parameters:

text (String, nil)

Returns:

(Integer)

# File 'lib/woods/embedding/token_counter.rb', line 54

def count(text)
  return 0 if text.nil? || text.empty?

  tok = tokenizer
  tok ? tok.encode(text).ids.length : estimate(text)
end

#exact? ⇒ `Boolean`

True when the real tokenizer is loaded and in use.

Returns:

(Boolean)



# File 'lib/woods/embedding/token_counter.rb', line 64

def exact?
  !tokenizer.nil?
end
