Class: Woods::Chunking::SemanticChunker
- Inherits: Object (Object → Woods::Chunking::SemanticChunker)
- Defined in: lib/woods/chunking/semantic_chunker.rb
Overview
Splits ExtractedUnits into semantic chunks based on unit type.
Models are split by summary, associations, validations, callbacks, scopes, and methods. Controllers are split by summary (filters) and per action. Class-like types (services, jobs, mailers, concerns, policies, …) are split by summary, one chunk per public method, and bundled private methods, via MethodChunker. Other types stay whole.
Any chunk that still exceeds `max_chars` after semantic splitting is sliced into line-balanced sub-chunks so that no single chunk is larger than the embedding provider’s input budget.
Units at or below the token threshold are returned as a single :whole chunk.
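The line-balanced slicing described above can be illustrated with a minimal sketch. `line_balanced_parts` is a hypothetical helper, not the library's implementation. It assumes that "line-balanced" means splitting on line boundaries into groups of roughly equal line count, and it leaves a single over-long line intact.

```ruby
# Hypothetical helper (not part of Woods): split text on line boundaries
# into the smallest number of roughly equal-sized groups of lines such
# that every group fits within max_chars.
def line_balanced_parts(text, max_chars)
  return [text] if text.length <= max_chars

  lines = text.lines # each element keeps its trailing newline

  # Try 2 parts, then 3, and so on, until every part fits the ceiling.
  (2..lines.size).each do |count|
    per_part = (lines.size.to_f / count).ceil
    parts = lines.each_slice(per_part).map(&:join)
    return parts if parts.all? { |part| part.length <= max_chars }
  end

  # Fallback: one line per part. A single line longer than max_chars is
  # returned unchanged in this sketch.
  lines
end

sample = ("x" * 9 + "\n") * 10 # ten lines of ten characters each
p line_balanced_parts(sample, 40).map(&:length)
```

Trying increasing part counts keeps the parts close in size, whereas a greedy fill would tend to leave a short tail part. That trade-off is a choice made for this sketch only.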
Constant Summary
- DEFAULT_THRESHOLD = 200
  Default token threshold at or below which units stay whole.
Instance Attribute Summary
- #max_chars ⇒ Integer? readonly
- #max_tokens ⇒ Integer? readonly
- #token_counter ⇒ Woods::Embedding::TokenCounter? readonly
Instance Method Summary
- #chunk(unit) ⇒ Array<Chunk>
  Split an ExtractedUnit into semantic chunks.
- #enforce_chunk_limits!(unit) ⇒ void
  Enforce `@max_chars` on a unit’s already-populated `chunks` array (hashes produced by extraction or a prior chunking pass).
- #initialize(threshold: DEFAULT_THRESHOLD, max_chars: nil, token_counter: nil, max_tokens: nil) ⇒ SemanticChunker (constructor)
  A new instance of SemanticChunker.
Constructor Details
#initialize(threshold: DEFAULT_THRESHOLD, max_chars: nil, token_counter: nil, max_tokens: nil) ⇒ SemanticChunker
Returns a new instance of SemanticChunker.
    # File 'lib/woods/chunking/semantic_chunker.rb', line 85
    def initialize(threshold: DEFAULT_THRESHOLD, max_chars: nil, token_counter: nil, max_tokens: nil)
      @threshold = threshold
      @max_chars = max_chars
      @token_counter = token_counter
      @max_tokens = max_tokens
    end
Instance Attribute Details
#max_chars ⇒ Integer? (readonly)
    # File 'lib/woods/chunking/semantic_chunker.rb', line 100
    def max_chars
      @max_chars
    end
#max_tokens ⇒ Integer? (readonly)
    # File 'lib/woods/chunking/semantic_chunker.rb', line 97
    def max_tokens
      @max_tokens
    end
#token_counter ⇒ Woods::Embedding::TokenCounter? (readonly)
    # File 'lib/woods/chunking/semantic_chunker.rb', line 94
    def token_counter
      @token_counter
    end
Instance Method Details
#chunk(unit) ⇒ Array<Chunk>
Split an ExtractedUnit into semantic chunks.
    # File 'lib/woods/chunking/semantic_chunker.rb', line 106
    def chunk(unit)
      return [] if unit.source_code.nil? || unit.source_code.strip.empty?
      return [build_whole_chunk(unit)] if unit.estimated_tokens <= @threshold

      enforce_char_limit(chunks_for(unit), unit)
    end
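To make the order of the guard clauses in `#chunk` concrete, here is a small self-contained sketch. `GuardUnit`, `sketch_chunk`, and the blank-line splitter are hypothetical stand-ins for illustration only. The real method delegates to `build_whole_chunk`, `chunks_for`, and `enforce_char_limit`, whose behaviour is not shown in this page.

```ruby
# Hypothetical stand-in for an ExtractedUnit; only the two readers that
# #chunk consults are modelled here.
GuardUnit = Struct.new(:source_code, :estimated_tokens)

SKETCH_THRESHOLD = 200 # mirrors DEFAULT_THRESHOLD

def sketch_chunk(unit, threshold: SKETCH_THRESHOLD)
  # Guard 1: nil or whitespace-only source yields no chunks at all.
  return [] if unit.source_code.nil? || unit.source_code.strip.empty?

  # Guard 2: units at or below the threshold stay whole (note the <=).
  if unit.estimated_tokens <= threshold
    return [{ chunk_type: :whole, content: unit.source_code }]
  end

  # Assumed placeholder for semantic splitting: one chunk per
  # blank-line-separated section. The real chunker splits by unit type.
  unit.source_code.split(/\n{2,}/).each_with_index.map do |section, index|
    { chunk_type: :"section_#{index}", content: section }
  end
end

source = "def a; end\n\ndef b; end"
p sketch_chunk(GuardUnit.new(source, 200)).map { |c| c[:chunk_type] }
p sketch_chunk(GuardUnit.new(source, 201)).map { |c| c[:chunk_type] }
```

The point of the sketch is the boundary: a unit whose estimate equals the threshold is still returned whole, and splitting only starts one token above it.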
#enforce_chunk_limits!(unit) ⇒ void
This method returns an undefined value.
Enforce `@max_chars` on a unit’s already-populated `chunks` array (hashes produced by extraction or a prior chunking pass). Oversize chunks are split into line-balanced siblings with `_part_N` chunk types; small chunks pass through unchanged. No-op when `@max_chars` is unset or `unit.chunks` is empty.
Exists so the Indexer can apply the same ceiling to pre-chunked units (e.g. `rails_source`) that extraction already sliced. The extractor’s own chunker is unaware of the embedding provider’s budget and can emit chunks larger than the ceiling we’d pick here.
    # File 'lib/woods/chunking/semantic_chunker.rb', line 126
    def enforce_chunk_limits!(unit)
      return unless enforcement_active?
      return if unit.chunks.nil? || unit.chunks.empty?

      unit.chunks = unit.chunks.flat_map { |chunk| split_oversize_hash_chunk(chunk) }
    end
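The `flat_map` expansion can be sketched in isolation as follows. Everything here is an assumption made for illustration: the `PrechunkedUnit` struct, the `:chunk_type` and `:content` hash keys, the greedy line splitter, and the choice to number parts from 1. The actual `split_oversize_hash_chunk` is not shown in this page and may differ on all of these points.

```ruby
# Hypothetical stand-in for a unit whose chunks were populated earlier.
PrechunkedUnit = Struct.new(:chunks)

# Assumed splitter: fills each part line by line up to max_chars, then
# tags the resulting siblings with a _part_N suffix (N starting at 1 here).
def split_oversize(chunk, max_chars)
  return [chunk] if chunk[:content].length <= max_chars

  parts = [String.new]
  chunk[:content].each_line do |line|
    # Start a new part when this line would push the current one over.
    parts << String.new if !parts.last.empty? && parts.last.length + line.length > max_chars
    parts.last << line
  end

  parts.each_with_index.map do |part, index|
    chunk.merge(chunk_type: :"#{chunk[:chunk_type]}_part_#{index + 1}", content: part)
  end
end

def sketch_enforce_limits!(unit, max_chars)
  return if max_chars.nil?                         # ceiling unset: no-op
  return if unit.chunks.nil? || unit.chunks.empty? # nothing to enforce

  # flat_map lets one oversize chunk expand into several siblings while
  # small chunks pass through as single-element arrays.
  unit.chunks = unit.chunks.flat_map { |chunk| split_oversize(chunk, max_chars) }
end

unit = PrechunkedUnit.new([
  { chunk_type: :summary, content: "short\n" },
  { chunk_type: :methods, content: ("a" * 9 + "\n") * 4 }
])
sketch_enforce_limits!(unit, 20)
p unit.chunks.map { |c| c[:chunk_type] }
```

Using `flat_map` rather than `map` keeps `unit.chunks` a flat array, so downstream consumers do not need to know whether a given chunk was ever split.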