Class: Woods::Embedding::TextPreparer

Inherits: Object

Defined in: lib/woods/embedding/text_preparer.rb

Overview

Prepares ExtractedUnit data for embedding by building context-prefixed text.

Follows the context prefix format from docs/CONTEXT_AND_CHUNKING.md:

[type] identifier
namespace: ...
file: ...
dependencies: dep1, dep2, ...

Handles token limit enforcement by truncating text that exceeds the embedding model’s context window.

Examples:

preparer = Woods::Embedding::TextPreparer.new(max_tokens: 8192)
text = preparer.prepare(unit)
chunks = preparer.prepare_chunks(unit)
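
The prefix layout described in the overview can be sketched as a small standalone method. This is only an illustration of the four-line format: the method name `sketch_prefix`, its keyword arguments, and the sample values are hypothetical and are not taken from the Woods source.

```ruby
# Hypothetical sketch of the context prefix layout; the keyword names
# are assumptions and do not mirror ExtractedUnit's real attributes.
def sketch_prefix(type:, identifier:, namespace:, file_path:, dependencies:)
  [
    "[#{type}] #{identifier}",
    "namespace: #{namespace}",
    "file: #{file_path}",
    "dependencies: #{dependencies.join(', ')}"
  ].join("\n")
end

# Illustrative values only.
prefix = sketch_prefix(
  type: "model",
  identifier: "User",
  namespace: "Accounts",
  file_path: "app/models/accounts/user.rb",
  dependencies: %w[Account Profile]
)
```

The resulting string is what `prepare` would place ahead of the unit's content, separated by a newline.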

Constant Summary

DEFAULT_MAX_TOKENS =

DEFAULT_CHARS_PER_TOKEN = TokenUtils::DEFAULT_CHARS_PER_TOKEN

Aliased to the single source of truth in TokenUtils so the OpenAI 4.0 / Ollama 1.5 ratios stay consistent across TextPreparer, ContextAssembler, Builder, and cost_model/. See docs/TOKEN_BENCHMARK.md and lib/woods/token_utils.rb.

Instance Attribute Summary

#chars_per_token ⇒ Float readonly

Configured chars-per-token ratio.

#max_tokens ⇒ Integer readonly

Configured token budget.

Instance Method Summary

#initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN) ⇒ TextPreparer constructor

A new instance of TextPreparer.

#prepare(unit) ⇒ String

Prepare text for embedding from an ExtractedUnit.

#prepare_chunks(unit) ⇒ Array<String>

Prepare text for each chunk of an ExtractedUnit.

Constructor Details

#initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN) ⇒ `TextPreparer`

Returns a new instance of TextPreparer.

Parameters:

max_tokens (Integer) (defaults to: DEFAULT_MAX_TOKENS): maximum token budget for prepared text

chars_per_token (Float) (defaults to: DEFAULT_CHARS_PER_TOKEN): tokenizer-calibrated char/token ratio

# File 'lib/woods/embedding/text_preparer.rb', line 32

def initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN)
  @max_tokens = max_tokens
  @chars_per_token = chars_per_token
end
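
To give a feel for what a chars-per-token ratio implies, the sketch below estimates a token count from a string's length. The helper `estimate_tokens` is hypothetical and not part of TextPreparer; the 4.0 and 1.5 ratios are the ones named in the constant summary, and dividing length by the ratio is an assumption about how such a ratio is typically applied.

```ruby
# Hypothetical helper, not part of Woods: estimates a token count by
# dividing the character count by a chars-per-token ratio, rounding up.
def estimate_tokens(text, chars_per_token)
  (text.length / chars_per_token.to_f).ceil
end

sample = "def hello; end" # 14 characters

openai_estimate = estimate_tokens(sample, 4.0) # 14 / 4.0 = 3.5, rounds up to 4
ollama_estimate = estimate_tokens(sample, 1.5) # 14 / 1.5 = 9.33, rounds up to 10
```

A lower ratio yields a higher token estimate for the same text, so the same `max_tokens` budget admits fewer characters.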

Instance Attribute Details

#chars_per_token ⇒ `Float` (readonly)

Returns the configured chars-per-token ratio.

Returns:

(Float): configured chars-per-token ratio




# File 'lib/woods/embedding/text_preparer.rb', line 38

def chars_per_token
  @chars_per_token
end

#max_tokens ⇒ `Integer` (readonly)

Returns the configured token budget.

Returns:

(Integer): configured token budget




# File 'lib/woods/embedding/text_preparer.rb', line 41

def max_tokens
  @max_tokens
end

Instance Method Details

#prepare(unit) ⇒ `String`

Prepare text for embedding from an ExtractedUnit.

Builds a context prefix and appends the unit’s source code (or first chunk content for chunked units). Enforces token limits via truncation.

Parameters:

unit (Woods::ExtractedUnit): the unit to prepare

Returns:

(String): context-prefixed text ready for embedding

# File 'lib/woods/embedding/text_preparer.rb', line 50

def prepare(unit)
  prefix = build_prefix(unit)
  content = select_content(unit)
  text = "#{prefix}\n#{content}"
  enforce_token_limit(text)
end
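
The private `enforce_token_limit` step is not shown on this page. A minimal sketch of character-based truncation, assuming the character budget is `max_tokens * chars_per_token` and that excess text is simply cut from the end, might look like this. The method name `truncate_to_budget` is hypothetical, and the real implementation may differ.

```ruby
# Hypothetical stand-in for the private enforce_token_limit step.
# Assumes the budget in characters is max_tokens * chars_per_token.
def truncate_to_budget(text, max_tokens:, chars_per_token:)
  max_chars = (max_tokens * chars_per_token).floor
  return text if text.length <= max_chars

  # Keep only the leading max_chars characters.
  text[0, max_chars]
end

# With a budget of 2 tokens at 4.0 chars per token, at most 8 characters survive.
short_text = truncate_to_budget("abc", max_tokens: 2, chars_per_token: 4.0)
long_text = truncate_to_budget("a" * 20, max_tokens: 2, chars_per_token: 4.0)
```

Because the prefix comes first in the combined text, cutting from the end under this assumption would preserve the context prefix and drop trailing content.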

#prepare_chunks(unit) ⇒ `Array<String>`

Prepare text for each chunk of an ExtractedUnit.

If the unit has no chunks, returns a single-element array with the full prepared text. For chunked units, each chunk gets the same context prefix prepended.

Parameters:

unit (Woods::ExtractedUnit): the unit to prepare

Returns:

(Array<String>): array of context-prefixed texts

# File 'lib/woods/embedding/text_preparer.rb', line 65

def prepare_chunks(unit)
  return [prepare(unit)] unless unit.chunks&.any?

  prefix = build_prefix(unit)
  unit.chunks.map do |chunk|
    text = "#{prefix}\n#{chunk[:content]}"
    enforce_token_limit(text)
  end
end
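
The two branches of `prepare_chunks` can be exercised with a stand-in object. In the sketch below, `sketch_unit` is a hypothetical Struct that replaces `Woods::ExtractedUnit`, the `source_code` field name is an assumption, the prefix is passed in as a plain string, and truncation is omitted. Only the `chunk[:content]` hash access is taken from the listing above.

```ruby
# Hypothetical stand-in for Woods::ExtractedUnit with only the fields used here.
sketch_unit = Struct.new(:source_code, :chunks, keyword_init: true)

# Mirrors the branching in prepare_chunks, without token limit enforcement.
def sketch_prepare_chunks(unit, prefix)
  # Unchunked units yield a single-element array.
  return ["#{prefix}\n#{unit.source_code}"] unless unit.chunks&.any?

  # Chunked units yield one text per chunk, each with the same prefix.
  unit.chunks.map { |chunk| "#{prefix}\n#{chunk[:content]}" }
end

prefix = "[model] User"

unchunked = sketch_unit.new(source_code: "class User; end", chunks: nil)
chunked = sketch_unit.new(
  source_code: "class User; end",
  chunks: [{ content: "part one" }, { content: "part two" }]
)

single = sketch_prepare_chunks(unchunked, prefix)
multiple = sketch_prepare_chunks(chunked, prefix)
```

Repeating the prefix on every chunk means each embedded text carries its own context, at the cost of spending part of every chunk's token budget on the prefix.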
