Class: Woods::Embedding::TextPreparer
- Inherits: Object
  - Object
    - Woods::Embedding::TextPreparer
- Defined in: lib/woods/embedding/text_preparer.rb
Overview
Prepares ExtractedUnit data for embedding by building context-prefixed text.
Follows the context prefix format from docs/CONTEXT_AND_CHUNKING.md:
    [type] identifier
    namespace: ...
    file: ...
    dependencies: dep1, dep2, ...
Handles token limit enforcement by truncating text that exceeds the embedding model’s context window.
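The prefix layout above can be illustrated with a minimal sketch. `UnitSketch` and `build_prefix_sketch` are hypothetical names, not part of Woods. The field names (`type`, `identifier`, `namespace`, `file_path`, `dependencies`) and the choice to skip empty lines are assumptions; the real private helper may behave differently.

```ruby
# Hypothetical stand-in for ExtractedUnit; the real class and its field names may differ.
UnitSketch = Struct.new(:type, :identifier, :namespace, :file_path, :dependencies, keyword_init: true)

# Builds the documented prefix layout. Skipping lines with no value is an assumption.
def build_prefix_sketch(unit)
  lines = ["[#{unit.type}] #{unit.identifier}"]
  lines << "namespace: #{unit.namespace}" if unit.namespace
  lines << "file: #{unit.file_path}" if unit.file_path
  deps = Array(unit.dependencies)
  lines << "dependencies: #{deps.join(', ')}" unless deps.empty?
  lines.join("\n")
end

unit = UnitSketch.new(
  type: :model,
  identifier: "Order",
  namespace: "Shop",
  file_path: "app/models/shop/order.rb",
  dependencies: %w[Customer LineItem]
)
puts build_prefix_sketch(unit)
# [model] Order
# namespace: Shop
# file: app/models/shop/order.rb
# dependencies: Customer, LineItem
```

Putting the prefix ahead of the source text means the embedded vector likely carries the unit's type, location, and dependencies along with the code itself.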
Constant Summary
- DEFAULT_MAX_TOKENS = 8192
- DEFAULT_CHARS_PER_TOKEN = TokenUtils::DEFAULT_CHARS_PER_TOKEN
  Aliased to the single source of truth in TokenUtils so the OpenAI 4.0 / Ollama 1.5 ratios stay consistent across TextPreparer, ContextAssembler, Builder, and cost_model/. See docs/TOKEN_BENCHMARK.md and lib/woods/token_utils.rb.
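To make the two constants concrete, here is a rough sketch of how a token budget and a chars-per-token ratio translate into a character budget. It reads the description as ratios of 4.0 (OpenAI) and 1.5 (Ollama) characters per token. The helper names and the floor/ceil rounding are assumptions, not the TokenUtils implementation.

```ruby
# Ratios quoted in the constant description; the authoritative values live in TokenUtils.
OPENAI_CHARS_PER_TOKEN = 4.0
OLLAMA_CHARS_PER_TOKEN = 1.5
MAX_TOKENS = 8192

# Character budget implied by a token budget and a chars-per-token ratio.
def char_budget(max_tokens, chars_per_token)
  (max_tokens * chars_per_token).floor
end

# Rough token estimate for a string under the same ratio.
def estimate_tokens(text, chars_per_token)
  (text.length / chars_per_token).ceil
end

puts char_budget(MAX_TOKENS, OPENAI_CHARS_PER_TOKEN)    # 32768
puts char_budget(MAX_TOKENS, OLLAMA_CHARS_PER_TOKEN)    # 12288
puts estimate_tokens("a" * 100, OLLAMA_CHARS_PER_TOKEN) # 67
```

A lower ratio is the more conservative choice: the same 8192-token budget allows far fewer characters at 1.5 than at 4.0.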
Instance Attribute Summary
- #chars_per_token ⇒ Float (readonly)
  Configured chars-per-token ratio.
- #max_tokens ⇒ Integer (readonly)
  Configured token budget.
Instance Method Summary
- #initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN) ⇒ TextPreparer (constructor)
  A new instance of TextPreparer.
- #prepare(unit) ⇒ String
  Prepare text for embedding from an ExtractedUnit.
- #prepare_chunks(unit) ⇒ Array<String>
  Prepare text for each chunk of an ExtractedUnit.
Constructor Details
#initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN) ⇒ TextPreparer
Returns a new instance of TextPreparer.
    # File 'lib/woods/embedding/text_preparer.rb', line 32
    def initialize(max_tokens: DEFAULT_MAX_TOKENS, chars_per_token: DEFAULT_CHARS_PER_TOKEN)
      @max_tokens = max_tokens
      @chars_per_token = chars_per_token
    end
Instance Attribute Details
#chars_per_token ⇒ Float (readonly)
Returns the configured chars-per-token ratio.
    # File 'lib/woods/embedding/text_preparer.rb', line 38
    def chars_per_token
      @chars_per_token
    end
#max_tokens ⇒ Integer (readonly)
Returns the configured token budget.
    # File 'lib/woods/embedding/text_preparer.rb', line 41
    def max_tokens
      @max_tokens
    end
Instance Method Details
#prepare(unit) ⇒ String
Prepare text for embedding from an ExtractedUnit.
Builds a context prefix and appends the unit’s source code (or first chunk content for chunked units). Enforces token limits via truncation.
    # File 'lib/woods/embedding/text_preparer.rb', line 50
    def prepare(unit)
      prefix = build_prefix(unit)
      content = select_content(unit)
      text = "#{prefix}\n#{content}"
      enforce_token_limit(text)
    end
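The private `enforce_token_limit` helper is not shown on this page. The sketch below is one plausible reading of "truncation": convert the token budget to a character budget and cut the text there. Both method names are hypothetical, and the real helper may truncate differently (for example, on line boundaries).

```ruby
# Hypothetical truncation helper; the real enforce_token_limit is private and not shown here.
def enforce_token_limit_sketch(text, max_tokens:, chars_per_token:)
  max_chars = (max_tokens * chars_per_token).floor
  text.length > max_chars ? text[0, max_chars] : text
end

# Mirrors the shape of #prepare: prefix, newline, content, then the limit.
def prepare_sketch(prefix, content, max_tokens: 8192, chars_per_token: 4.0)
  text = "#{prefix}\n#{content}"
  enforce_token_limit_sketch(text, max_tokens: max_tokens, chars_per_token: chars_per_token)
end

short = prepare_sketch("[model] Order", "class Order; end")
long = prepare_sketch("[model] Order", "x" * 100, max_tokens: 10, chars_per_token: 4.0)

puts short       # prefix and content, unchanged because it fits the budget
puts long.length # 40
```

Because the prefix comes first, cutting from the end under this approach drops trailing source code while keeping the context lines intact.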
#prepare_chunks(unit) ⇒ Array<String>
Prepare text for each chunk of an ExtractedUnit.
If the unit has no chunks, returns a single-element array with the full prepared text. For chunked units, each chunk gets the same context prefix prepended.
    # File 'lib/woods/embedding/text_preparer.rb', line 65
    def prepare_chunks(unit)
      return [prepare(unit)] unless unit.chunks&.any?

      prefix = build_prefix(unit)
      unit.chunks.map do |chunk|
        text = "#{prefix}\n#{chunk[:content]}"
        enforce_token_limit(text)
      end
    end
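The two branches of `#prepare_chunks` can be exercised with a small stand-in. `ChunkedUnitSketch` and `prepare_chunks_sketch` are hypothetical. The sketch takes a precomputed prefix and omits the token limit step to stay short, and it assumes chunks are hashes with a `:content` key, as the listing above suggests.

```ruby
# Hypothetical stand-in for ExtractedUnit with a precomputed prefix.
ChunkedUnitSketch = Struct.new(:prefix, :source_code, :chunks, keyword_init: true)

def prepare_chunks_sketch(unit)
  # Units without chunks fall back to a single prepared text.
  return ["#{unit.prefix}\n#{unit.source_code}"] unless unit.chunks&.any?

  # Every chunk gets the same prefix prepended.
  unit.chunks.map { |chunk| "#{unit.prefix}\n#{chunk[:content]}" }
end

chunked = ChunkedUnitSketch.new(
  prefix: "[model] Order",
  chunks: [{ content: "def total; end" }, { content: "def tax; end" }]
)
plain = ChunkedUnitSketch.new(prefix: "[model] Order", source_code: "class Order; end", chunks: [])

puts prepare_chunks_sketch(chunked).length # 2
puts prepare_chunks_sketch(plain).length   # 1
```

Repeating the prefix on every chunk costs some of each chunk's token budget, but it keeps every chunk tied to its unit's identifier, file, and dependencies.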