Module: Woods::TokenUtils

Defined in: lib/woods/token_utils.rb

Overview

Shared token estimation utility — the single source of truth for the chars-per-token ratio used across cost estimation, context assembly, and embedding budgeting.

Ratios:

`:openai` / default: 4.0 chars/token. Benchmarked against tiktoken (cl100k_base) on 19 Ruby source files (mean 4.41 chars/token). We use 4.0 as a conservative floor (~10.6% overestimate) so truncation never hands the model more tokens than it budgeted for. See `docs/TOKEN_BENCHMARK.md`.
`:ollama`: 1.5 chars/token. Matches the BERT WordPiece tokenizers used by nomic-embed-text and mxbai-embed-large. See `docs/EMBEDDING_MODELS.md` and `Woods::Builder#chars_per_token_for`.

Callers should prefer TokenUtils.chars_per_token_for over hardcoding a divisor so future tokenizer changes propagate in one place instead of drifting between ContextAssembler, Builder, and cost-model components.
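As a rough illustration of the advice above, the sketch below converts a token budget into a character budget through the shared ratio instead of a hardcoded divisor. `TokenUtilsSketch` is a stand-in that mirrors the constants and method documented on this page, and `max_chars_for` is a hypothetical caller; neither is part of Woods.

```ruby
# Stand-in mirroring the constants documented on this page. In the real
# codebase the caller would use Woods::TokenUtils instead.
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  def self.chars_per_token_for(provider)
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end
end

# Hypothetical caller: how many characters fit in a given token budget?
# Rounding down keeps the result within the budget.
def max_chars_for(token_budget, provider)
  (token_budget * TokenUtilsSketch.chars_per_token_for(provider)).floor
end

puts max_chars_for(8_000, :openai) # 8_000 * 4.0
puts max_chars_for(8_000, :ollama) # 8_000 * 1.5
```

Because the ratio is looked up rather than written inline, a change to the table would reach this caller without further edits.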

Constant Summary

CHARS_PER_TOKEN_BY_PROVIDER =

{
  openai: 4.0,
  ollama: 1.5
}.freeze

DEFAULT_CHARS_PER_TOKEN =

CHARS_PER_TOKEN_BY_PROVIDER[:openai]

Class Method Summary

.chars_per_token_for(provider) ⇒ Float

Chars-per-token ratio for the given embedding provider.

.estimate_tokens(text) ⇒ Integer

Estimate token count for a string using the default (OpenAI) ratio.

.estimate_tokens_for(text, provider:) ⇒ Integer

Estimate token count for a string using the provider’s native ratio.

Class Method Details

.chars_per_token_for(provider) ⇒ `Float`

Chars-per-token ratio for the given embedding provider.

Parameters:

provider (Symbol, String, nil) —

Provider identifier. Unknown or nil providers fall back to DEFAULT_CHARS_PER_TOKEN.

Returns:

(Float)




# File 'lib/woods/token_utils.rb', line 36

def chars_per_token_for(provider)
  CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
end
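
The fallback behaviour can be checked with a small sketch. `TokenUtilsSketch` is a stand-in that copies the listing above so the example runs on its own, and `:anthropic` is only a hypothetical example of an unregistered provider.

```ruby
# Stand-in copying the documented lookup; not the real Woods module.
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  def self.chars_per_token_for(provider)
    # nil&.to_sym is nil, which is not a key, so fetch returns the default.
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end
end

p TokenUtilsSketch.chars_per_token_for(:ollama)    # Symbol key
p TokenUtilsSketch.chars_per_token_for("ollama")   # String is converted with to_sym
p TokenUtilsSketch.chars_per_token_for(nil)        # falls back to the default
p TokenUtilsSketch.chars_per_token_for(:anthropic) # hypothetical unknown provider, default
```

One consequence of the lookup as written is that it is case-sensitive: `"Ollama"` becomes `:Ollama`, which is not a key, so it would receive the default ratio and not 1.5.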

.estimate_tokens(text) ⇒ `Integer`

Estimate token count for a string using the default (OpenAI) ratio. Use estimate_tokens_for when a specific provider is in play.

Parameters:

text (String) —

Text to estimate

Returns:

(Integer) —

Estimated token count




# File 'lib/woods/token_utils.rb', line 45

def estimate_tokens(text)
  estimate_tokens_for(text, provider: nil)
end
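
A minimal sketch of the rounding behaviour, assuming the default 4.0 ratio documented on this page. `TokenUtilsSketch` is a stand-in that inlines the division so the example is self-contained; the real method delegates to `estimate_tokens_for`.

```ruby
# Stand-in for the default-ratio estimate; not the real Woods module.
module TokenUtilsSketch
  DEFAULT_CHARS_PER_TOKEN = 4.0

  def self.estimate_tokens(text)
    # Float division followed by ceil: any partial token counts as a whole one.
    (text.length / DEFAULT_CHARS_PER_TOKEN).ceil
  end
end

p TokenUtilsSketch.estimate_tokens("")            # 0 chars
p TokenUtilsSketch.estimate_tokens("abcd")        # 4 chars  -> 1.0  -> 1
p TokenUtilsSketch.estimate_tokens("abcde")       # 5 chars  -> 1.25 -> 2
p TokenUtilsSketch.estimate_tokens("hello world") # 11 chars -> 2.75 -> 3
```

Rounding up is consistent with the conservative stance described in the overview: the estimate errs toward more tokens, not fewer.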

.estimate_tokens_for(text, provider:) ⇒ `Integer`

Estimate token count for a string using the provider’s native ratio.

Parameters:

text (String) —

Text to estimate
provider (Symbol, String, nil) —

`:openai`, `:ollama`, or nil.

Returns:

(Integer) —

Estimated token count




# File 'lib/woods/token_utils.rb', line 54

def estimate_tokens_for(text, provider:)
  (text.length / chars_per_token_for(provider)).ceil
end
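
To show how much the provider changes the result, the sketch below estimates the same 1,000-character string under each ratio. `TokenUtilsSketch` is a stand-in that copies the two listings above so the example runs without Woods loaded.

```ruby
# Stand-in copying the documented methods; not the real Woods module.
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  def self.chars_per_token_for(provider)
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end

  def self.estimate_tokens_for(text, provider:)
    (text.length / chars_per_token_for(provider)).ceil
  end
end

text = "x" * 1_000

p TokenUtilsSketch.estimate_tokens_for(text, provider: :openai) # 1000 / 4.0 = 250
p TokenUtilsSketch.estimate_tokens_for(text, provider: :ollama) # 1000 / 1.5 = 666.67, rounds up to 667
p TokenUtilsSketch.estimate_tokens_for(text, provider: nil)     # default ratio, same as :openai
```

The same text is estimated at well over twice as many tokens under the `:ollama` ratio, which is why a caller budgeting for an embedding model would want to pass the provider explicitly.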
