Module: Woods::TokenUtils

Defined in:
lib/woods/token_utils.rb

Overview

Shared token estimation utility — the single source of truth for the chars-per-token ratio used across cost estimation, context assembly, and embedding budgeting.

Ratios:

  • `:openai` / default: 4.0 chars/token. Benchmarked against tiktoken (cl100k_base) on 19 Ruby source files (mean 4.41 chars/token). We use 4.0 as a conservative floor (~10.6% overestimate) so that truncating text to a token budget does not hand the model more tokens than were budgeted. See `docs/TOKEN_BENCHMARK.md`.

  • `:ollama`: 1.5 chars/token. Matches the BERT WordPiece tokenizers used by nomic-embed-text and mxbai-embed-large. See `docs/EMBEDDING_MODELS.md` and `Woods::Builder#chars_per_token_for`.

Callers should prefer TokenUtils.chars_per_token_for over hardcoding a divisor so future tokenizer changes propagate in one place instead of drifting between ContextAssembler, Builder, and cost-model components.
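As an illustration of the advice above, here is a minimal sketch of a caller that derives its character limit from the shared ratio instead of hardcoding a divisor. `TokenUtilsSketch` is a stand-in that copies the constants and lookup documented on this page so the example runs on its own, and `truncate_to_budget` is a hypothetical helper; neither is part of Woods.

```ruby
# Stand-in for Woods::TokenUtils (assumption: mirrors the documented constants
# and lookup; module_function is used here only to expose a class-level method).
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  module_function

  def chars_per_token_for(provider)
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end
end

# Hypothetical helper: keep only as many characters as the token budget allows.
def truncate_to_budget(text, max_tokens:, provider: nil)
  max_chars = (max_tokens * TokenUtilsSketch.chars_per_token_for(provider)).floor
  text[0, max_chars]
end

source = "x" * 1_000
puts truncate_to_budget(source, max_tokens: 100, provider: :openai).length # 400
puts truncate_to_budget(source, max_tokens: 100, provider: :ollama).length # 150
```

Because the helper asks for the ratio at call time, a change to the ratio table would flow through without touching the caller.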

Constant Summary

CHARS_PER_TOKEN_BY_PROVIDER =
{
  openai: 4.0,
  ollama: 1.5
}.freeze
DEFAULT_CHARS_PER_TOKEN =
CHARS_PER_TOKEN_BY_PROVIDER[:openai]

Class Method Details

.chars_per_token_for(provider) ⇒ Float

Chars-per-token ratio for the given embedding provider.

Parameters:

  • provider (Symbol, String, nil)

    Provider identifier. Unknown or nil providers fall back to DEFAULT_CHARS_PER_TOKEN.

Returns:

  • (Float)


# File 'lib/woods/token_utils.rb', line 36

def chars_per_token_for(provider)
  CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
end
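The fallback behaviour follows from `Hash#fetch` with a default combined with safe navigation. The sketch below reproduces the lookup in a stand-in module (`TokenUtilsSketch`, an assumption made so the example runs without Woods loaded) and walks through the input kinds the parameter documentation lists.

```ruby
# Stand-in for Woods::TokenUtils (assumption: mirrors the documented lookup).
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  module_function

  def chars_per_token_for(provider)
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end
end

p TokenUtilsSketch.chars_per_token_for(:ollama)  # 1.5, symbol key found
p TokenUtilsSketch.chars_per_token_for("ollama") # 1.5, string converted by to_sym
p TokenUtilsSketch.chars_per_token_for(nil)      # 4.0, nil&.to_sym is nil, so fetch falls back
p TokenUtilsSketch.chars_per_token_for(:mistral) # 4.0, unknown key falls back
p TokenUtilsSketch.chars_per_token_for("Ollama") # 4.0, lookup is case-sensitive
```

Note the last case: a differently cased name silently receives the default ratio rather than raising, so callers that accept user-supplied provider names may want to normalise them first.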

.estimate_tokens(text) ⇒ Integer

Estimate token count for a string using the default (OpenAI) ratio. Use estimate_tokens_for when a specific provider is in play.

Parameters:

  • text (String)

    Text to estimate

Returns:

  • (Integer)

    Estimated token count



# File 'lib/woods/token_utils.rb', line 45

def estimate_tokens(text)
  estimate_tokens_for(text, provider: nil)
end

.estimate_tokens_for(text, provider:) ⇒ Integer

Estimate token count for a string using the provider’s native ratio.

Parameters:

  • text (String)

    Text to estimate

  • provider (Symbol, String, nil)

    `:openai`, `:ollama`, or nil.

Returns:

  • (Integer)

    Estimated token count



# File 'lib/woods/token_utils.rb', line 54

def estimate_tokens_for(text, provider:)
  (text.length / chars_per_token_for(provider)).ceil
end
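To make the rounding concrete, the sketch below runs both estimators over a few inputs. It uses a stand-in module (`TokenUtilsSketch`, an assumption that copies the three method bodies shown on this page) so that it runs without Woods loaded.

```ruby
# Stand-in for Woods::TokenUtils (assumption: mirrors the documented methods).
module TokenUtilsSketch
  CHARS_PER_TOKEN_BY_PROVIDER = { openai: 4.0, ollama: 1.5 }.freeze
  DEFAULT_CHARS_PER_TOKEN = CHARS_PER_TOKEN_BY_PROVIDER[:openai]

  module_function

  def chars_per_token_for(provider)
    CHARS_PER_TOKEN_BY_PROVIDER.fetch(provider&.to_sym, DEFAULT_CHARS_PER_TOKEN)
  end

  def estimate_tokens(text)
    estimate_tokens_for(text, provider: nil)
  end

  def estimate_tokens_for(text, provider:)
    (text.length / chars_per_token_for(provider)).ceil
  end
end

ten_chars = "a" * 10
p TokenUtilsSketch.estimate_tokens(ten_chars)                         # 3 (10 / 4.0 = 2.5, rounded up)
p TokenUtilsSketch.estimate_tokens_for(ten_chars, provider: :ollama)  # 7 (10 / 1.5 = 6.67, rounded up)
p TokenUtilsSketch.estimate_tokens("")                                # 0
p TokenUtilsSketch.estimate_tokens("héllo")                           # 2 (length counts characters, not bytes)
```

Two properties are worth keeping in mind: `ceil` means any non-empty string estimates to at least one token, and `String#length` counts characters, so multibyte text is measured by character count rather than byte size.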