Class: Woods::Embedding::TextPreparer

Inherits:
Object
Defined in:
lib/woods/embedding/text_preparer.rb

Overview

Prepares ExtractedUnit data for embedding by building context-prefixed text.

Follows the context prefix format from docs/CONTEXT_AND_CHUNKING.md:

[type] identifier
namespace: ...
file: ...
dependencies: dep1, dep2, ...
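
As an illustration of the format above, here is a minimal sketch of how such a prefix could be assembled. The SketchUnit struct and its attribute names (type, identifier, namespace, file_path, dependencies) are assumptions made for this example; the real ExtractedUnit class and the private build_prefix helper are not shown on this page and may differ.

```ruby
# Hypothetical stand-in for ExtractedUnit; attribute names are assumed.
SketchUnit = Struct.new(:type, :identifier, :namespace, :file_path, :dependencies,
                        keyword_init: true)

# Assemble the four-line context prefix in the documented order.
def sketch_prefix(unit)
  [
    "[#{unit.type}] #{unit.identifier}",
    "namespace: #{unit.namespace}",
    "file: #{unit.file_path}",
    "dependencies: #{Array(unit.dependencies).join(', ')}"
  ].join("\n")
end

sketch_unit = SketchUnit.new(
  type: :model,
  identifier: "User",
  namespace: "Accounts",
  file_path: "app/models/user.rb",
  dependencies: %w[Account Profile]
)

puts sketch_prefix(sketch_unit)
# [model] User
# namespace: Accounts
# file: app/models/user.rb
# dependencies: Account, Profile
```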

Handles token limit enforcement by truncating text that exceeds the embedding model’s context window.

Examples:

preparer = Woods::Embedding::TextPreparer.new(max_tokens: 8192)
text = preparer.prepare(unit)
chunks = preparer.prepare_chunks(unit)

Constant Summary

DEFAULT_MAX_TOKENS = 8192

DEFAULT_CHARS_PER_TOKEN = TokenUtils::DEFAULT_CHARS_PER_TOKEN

Aliased to the single source of truth in TokenUtils so the OpenAI 4.0 / Ollama 1.5 ratios stay consistent across TextPreparer, ContextAssembler, Builder, and cost_model/. See docs/TOKEN_BENCHMARK.md and lib/woods/token_utils.rb.

Constructor Details

#initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN) ⇒ TextPreparer

Returns a new instance of TextPreparer.

Parameters:

  • max_tokens (Integer) (defaults to: DEFAULT_MAX_TOKENS)

    maximum token budget for prepared text

  • chars_per_token (Float) (defaults to: DEFAULT_CHARS_PER_TOKEN)

    tokenizer-calibrated char/token ratio



# File 'lib/woods/embedding/text_preparer.rb', line 32

def initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN)
  @max_tokens = max_tokens
  @chars_per_token = chars_per_token
end
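
To show how the two constructor arguments interact, the sketch below converts a token budget into a character budget. The formula (max_tokens multiplied by chars_per_token, rounded down) is an assumption about how the ratio is applied; sketch_char_budget is a hypothetical helper, not part of TextPreparer. The 4.0 and 1.5 ratios are the OpenAI and Ollama values mentioned in the constant description.

```ruby
# Hypothetical helper: assumed conversion from a token budget to characters.
def sketch_char_budget(max_tokens, chars_per_token)
  (max_tokens * chars_per_token).floor
end

puts sketch_char_budget(8192, 4.0) # 32768
puts sketch_char_budget(8192, 1.5) # 12288
```

Under this assumption, a lower chars-per-token ratio leaves fewer characters for the same token budget, so the same unit may be truncated earlier.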

Instance Attribute Details

#chars_per_token ⇒ Float (readonly)

Returns configured chars-per-token ratio.

Returns:

  • (Float)

    configured chars-per-token ratio



# File 'lib/woods/embedding/text_preparer.rb', line 38

def chars_per_token
  @chars_per_token
end

#max_tokens ⇒ Integer (readonly)

Returns configured token budget.

Returns:

  • (Integer)

    configured token budget



# File 'lib/woods/embedding/text_preparer.rb', line 41

def max_tokens
  @max_tokens
end

Instance Method Details

#prepare(unit) ⇒ String

Prepare text for embedding from an ExtractedUnit.

Builds a context prefix and appends the unit’s source code (or first chunk content for chunked units). Enforces token limits via truncation.

Parameters:

  • unit (ExtractedUnit)

    the unit to prepare

Returns:

  • (String)

    context-prefixed text ready for embedding



# File 'lib/woods/embedding/text_preparer.rb', line 50

def prepare(unit)
  prefix = build_prefix(unit)
  content = select_content(unit)
  text = "#{prefix}\n#{content}"
  enforce_token_limit(text)
end
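
The private enforce_token_limit helper is not shown on this page. As a rough sketch of what truncation by estimated token count could look like, the hypothetical sketch_enforce_limit below cuts the text at an assumed character budget of max_tokens multiplied by chars_per_token. The real implementation may truncate differently, for example at a line boundary.

```ruby
# Hypothetical truncation helper; not the library's actual implementation.
def sketch_enforce_limit(text, max_tokens:, chars_per_token:)
  max_chars = (max_tokens * chars_per_token).floor
  return text if text.length <= max_chars

  # Keep the leading characters so the context prefix survives truncation.
  text[0, max_chars]
end

sketch_header = "[model] User\nfile: app/models/user.rb"
sketch_body = "class User; end\n" * 10
sketch_text = "#{sketch_header}\n#{sketch_body}"

truncated = sketch_enforce_limit(sketch_text, max_tokens: 10, chars_per_token: 4.0)
puts truncated.length # 40
```

Because the prefix comes first in the combined string, cutting from the end removes source content before it removes context lines.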

#prepare_chunks(unit) ⇒ Array<String>

Prepare text for each chunk of an ExtractedUnit.

If the unit has no chunks, returns a single-element array with the full prepared text. For chunked units, each chunk gets the same context prefix prepended.

Parameters:

  • unit (ExtractedUnit)

    the unit whose chunks are prepared

Returns:

  • (Array<String>)

    array of context-prefixed texts



# File 'lib/woods/embedding/text_preparer.rb', line 65

def prepare_chunks(unit)
  return [prepare(unit)] unless unit.chunks&.any?

  prefix = build_prefix(unit)
  unit.chunks.map do |chunk|
    text = "#{prefix}\n#{chunk[:content]}"
    enforce_token_limit(text)
  end
end
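
The two branches of the method above can be exercised with a small stand-in. In this sketch, SketchChunkedUnit and its source_code attribute are assumptions for illustration, the prefix is passed in as a plain string, and truncation is left out for brevity; only the chunks array of hashes with a :content key follows the listing.

```ruby
# Hypothetical stand-in for ExtractedUnit; source_code is an assumed attribute.
SketchChunkedUnit = Struct.new(:source_code, :chunks, keyword_init: true)

# Mirrors the branching in prepare_chunks, without token limit enforcement.
def sketch_prepare_chunks(unit, prefix)
  return ["#{prefix}\n#{unit.source_code}"] unless unit.chunks&.any?

  unit.chunks.map { |chunk| "#{prefix}\n#{chunk[:content]}" }
end

chunk_prefix = "[service] Billing::Charge"
whole_unit = SketchChunkedUnit.new(source_code: "def call; end", chunks: nil)
split_unit = SketchChunkedUnit.new(
  source_code: "",
  chunks: [{ content: "part one" }, { content: "part two" }]
)

p sketch_prepare_chunks(whole_unit, chunk_prefix).length # 1
p sketch_prepare_chunks(split_unit, chunk_prefix).length # 2
```

An empty chunks array takes the same path as nil, since the guard uses safe navigation followed by any?.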