Class: Woods::Embedding::TokenCounter

Inherits:
Object
Defined in:
lib/woods/embedding/token_counter.rb

Overview

Exact or estimated token counts for embedding inputs.

When the optional `tokenizers` gem (ankane) is installed, loads the `bert-base-uncased` WordPiece tokenizer that nomic-embed-text is built on and returns exact token counts. Otherwise falls back to a conservative chars/token ratio and warns once.

Exact counting is strictly preferred for the Ollama path: Ollama v0.13.5+ stopped honouring the `truncate: true` flag on `/api/embed` (see ollama/ollama#14186), so chunks that exceed `num_ctx` return a 400 instead of being truncated. Client-side sizing is the only reliable option until the regression is fixed upstream, and chars/token ratios vary too widely across Rails internals to cover every case with a fixed number.
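To make "client-side sizing" concrete, here is a minimal sketch of trimming a chunk until it fits a token budget. `fit_to_budget` and `StubCounter` are hypothetical names used only for illustration and are not part of Woods; the stub stands in for any object that responds to `#count(text)`.

```ruby
# Hypothetical helper, not part of Woods. `counter` is any object that
# responds to #count(text) and returns an Integer.
def fit_to_budget(text, max_tokens, counter)
  return text if counter.count(text) <= max_tokens

  # Binary search for the longest prefix whose count fits the budget.
  # Assumes the count does not decrease as the prefix grows. That holds for
  # the ratio-based stub below and is only an approximation for a real
  # tokenizer, so re-check the result before sending it.
  lo = 0
  hi = text.length
  while lo < hi
    mid = (lo + hi + 1) / 2
    if counter.count(text[0, mid]) <= max_tokens
      lo = mid
    else
      hi = mid - 1
    end
  end
  text[0, lo]
end

# Stand-in counter (assumption): one token per `chars_per_token` characters,
# rounded up.
StubCounter = Struct.new(:chars_per_token) do
  def count(text)
    return 0 if text.nil? || text.empty?

    (text.length / chars_per_token).ceil
  end
end

counter = StubCounter.new(1.2)
chunk = fit_to_budget("a" * 100, 10, counter)
puts counter.count(chunk) <= 10
```

Cutting at an arbitrary character offset can split an identifier mid-word; a real chunker would more likely cut at line or statement boundaries and use a check like this only as a final guard.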

Examples:

counter = Woods::Embedding::TokenCounter.new
counter.count("ActionController::Metal::ConditionalGet")  # => 13

Constant Summary

BERT_MODEL =

HuggingFace tokenizer id shared by every nomic-embed-text variant.

'bert-base-uncased'
CONSERVATIVE_CHARS_PER_TOKEN =

Conservative floor for when the tokenizer gem isn’t installed. Lower than any ratio we’ve observed failing in the testbed against dense Rails source. Still approximate; install `tokenizers` for exact counts.

1.2
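A rough sketch of why a low ratio is the conservative choice: fewer characters per token means more estimated tokens, so chunks get sized smaller than strictly necessary. The private estimator is not shown on this page, so `estimate_tokens` below is a hypothetical stand-in that assumes a simple length-divided-by-ratio rule rounded up.

```ruby
# Mirrors CONSERVATIVE_CHARS_PER_TOKEN from this page.
CHARS_PER_TOKEN = 1.2

# Hypothetical stand-in for the fallback estimator (assumed formula).
def estimate_tokens(text, chars_per_token = CHARS_PER_TOKEN)
  (text.length / chars_per_token).ceil
end

# 39 characters at 1.2 chars/token.
puts estimate_tokens("ActionController::Metal::ConditionalGet") # => 33
```

Under that assumed formula the estimate (33) sits well above the exact count of 13 quoted in the class example, which is the intended direction of error for a fallback.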

Constructor Details

#initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL) ⇒ TokenCounter

Returns a new instance of TokenCounter.

Parameters:

  • chars_per_token (Float) (defaults to: CONSERVATIVE_CHARS_PER_TOKEN)

    fallback ratio when the tokenizer is unavailable

  • tokenizer_id (String) (defaults to: BERT_MODEL)

    HuggingFace model id passed to `Tokenizers.from_pretrained`



# File 'lib/woods/embedding/token_counter.rb', line 39

def initialize(chars_per_token: CONSERVATIVE_CHARS_PER_TOKEN, tokenizer_id: BERT_MODEL)
  @chars_per_token = chars_per_token
  @tokenizer_id = tokenizer_id
  @load_attempted = false
  @load_mutex = Mutex.new
end
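The `@load_attempted` flag and `@load_mutex` suggest that the tokenizer is loaded lazily and at most once per instance. The loader itself is not shown on this page, so the following is only a sketch of that pattern: `LazyTokenizer` and its loader block are hypothetical, and the real class may differ in how it handles errors.

```ruby
# Illustrative sketch only; LazyTokenizer is not part of Woods.
class LazyTokenizer
  # The block performs the actual load. In practice it would presumably
  # require the gem and call Tokenizers.from_pretrained(tokenizer_id).
  def initialize(&loader)
    @loader = loader
    @load_attempted = false
    @load_mutex = Mutex.new
    @tokenizer = nil
  end

  # Returns the loaded tokenizer, or nil when loading failed.
  def tokenizer
    return @tokenizer if @load_attempted

    @load_mutex.synchronize do
      # Re-check inside the lock so only one thread runs the loader.
      unless @load_attempted
        begin
          @tokenizer = @loader.call
        rescue LoadError, StandardError
          @tokenizer = nil
        ensure
          @load_attempted = true
        end
      end
    end
    @tokenizer
  end
end

calls = 0
lazy = LazyTokenizer.new do
  calls += 1
  raise LoadError, "tokenizers gem not installed"
end
lazy.tokenizer
lazy.tokenizer
puts calls
```

Recording the attempt even on failure keeps a missing gem from triggering a new `require` on every call.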

Class Attribute Details

.warned_messages ⇒ Object (readonly)

Returns the value of attribute warned_messages.



# File 'lib/woods/embedding/token_counter.rb', line 113

def warned_messages
  @warned_messages
end

.warned_mutex ⇒ Object (readonly)

Returns the value of attribute warned_mutex.



# File 'lib/woods/embedding/token_counter.rb', line 113

def warned_mutex
  @warned_mutex
end

Instance Attribute Details

#chars_per_token ⇒ Float (readonly)

Returns fallback chars-per-token ratio.

Returns:

  • (Float)

    fallback chars-per-token ratio



# File 'lib/woods/embedding/token_counter.rb', line 47

def chars_per_token
  @chars_per_token
end

Class Method Details

.reset_warned! ⇒ Object

Reset the per-process warning dedup. For tests only — production callers should never need to clear it.



# File 'lib/woods/embedding/token_counter.rb', line 117

def reset_warned!
  @warned_mutex.synchronize { @warned_messages.clear }
end
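For context, a per-process warning dedup of this shape could look roughly like the sketch below. `WarnOnce` and `warn_once` are hypothetical, and the use of a `Set` is an assumption: this page only shows that `@warned_messages` responds to `clear` and is guarded by `@warned_mutex`.

```ruby
require "set"
require "stringio"

# Hypothetical module illustrating warn-once behaviour; not part of Woods.
module WarnOnce
  @warned_messages = Set.new # assumed container type
  @warned_mutex = Mutex.new

  class << self
    attr_reader :warned_messages, :warned_mutex

    # Writes `message` to `io` the first time it is seen in this process.
    # Returns true when the message was emitted, false when it was deduped.
    def warn_once(message, io: $stderr)
      # Set#add? returns nil when the element is already present.
      first_time = @warned_mutex.synchronize { !@warned_messages.add?(message).nil? }
      io.puts(message) if first_time
      first_time
    end

    # Clears the dedup state so a test can observe the warning again.
    def reset_warned!
      @warned_mutex.synchronize { @warned_messages.clear }
    end
  end
end

out = StringIO.new
WarnOnce.warn_once("tokenizers gem missing; using estimate", io: out)
WarnOnce.warn_once("tokenizers gem missing; using estimate", io: out)
puts out.string.lines.length
```

In a test suite, calling the reset in a setup hook keeps one example's warning from suppressing the next example's assertion on it.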

Instance Method Details

#count(text) ⇒ Integer

Exact token count when the tokenizer is loaded, chars/token estimate otherwise.

Parameters:

  • text (String, nil)

Returns:

  • (Integer)


# File 'lib/woods/embedding/token_counter.rb', line 54

def count(text)
  return 0 if text.nil? || text.empty?

  tok = tokenizer
  tok ? tok.encode(text).ids.length : estimate(text)
end

#exact? ⇒ Boolean

True when the real tokenizer is loaded and in use.

Returns:

  • (Boolean)


# File 'lib/woods/embedding/token_counter.rb', line 64

def exact?
  !tokenizer.nil?
end