Class: Woods::Chunking::SemanticChunker

Inherits:
Object
Defined in:
lib/woods/chunking/semantic_chunker.rb

Overview

Splits ExtractedUnits into semantic chunks based on unit type.

Units are split according to their type:

  • Models: summary, associations, validations, callbacks, scopes, methods.
  • Controllers: summary (filters), then one chunk per action.
  • Class-like types (services, jobs, mailers, concerns, policies, …): summary, one chunk per public method, and bundled private methods, via MethodChunker.
  • Other types stay whole.

Any chunk that still exceeds `max_chars` after semantic splitting is sliced into line-balanced sub-chunks so that no single chunk is larger than the embedding provider’s input budget.
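
The line-balanced slicing described above can be sketched roughly as follows. This is an illustration only, not the library's code: `line_balanced_slices` is a hypothetical helper, it assumes `max_chars` is a positive integer, and it splits strictly on line boundaries, so a single line longer than `max_chars` is returned unsplit.

```ruby
# Hypothetical helper (not part of Woods): cut text into pieces of roughly
# equal line counts so that each piece fits under max_chars.
# Assumes max_chars is a positive Integer.
def line_balanced_slices(text, max_chars)
  return [text] if text.length <= max_chars

  lines = text.lines
  # Start from the smallest part count the total size could allow, then
  # add parts until every slice fits or slices are down to one line each.
  parts = (text.length.to_f / max_chars).ceil
  loop do
    per_part = (lines.size.to_f / parts).ceil
    slices = lines.each_slice(per_part).map(&:join)
    return slices if per_part == 1 || slices.all? { |s| s.length <= max_chars }

    parts += 1
  end
end

source = "aaaa\nbbbb\ncccc\ndddd\n"
slices = line_balanced_slices(source, 10)
```

Joining the slices back together reproduces the original text, which is a cheap invariant to check in any real implementation.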

Units whose estimated token count is at or below the threshold are returned as a single :whole chunk.

Examples:

chunker = SemanticChunker.new(threshold: 200, max_chars: 20_480)
chunks = chunker.chunk(extracted_unit)
chunks.map(&:chunk_type) # => [:summary, :associations, :validations, :methods]

Constant Summary

DEFAULT_THRESHOLD = 200

Default token threshold; units at or below it stay whole.


Constructor Details

#initialize(threshold: DEFAULT_THRESHOLD, max_chars: nil, token_counter: nil, max_tokens: nil) ⇒ SemanticChunker

Returns a new instance of SemanticChunker.

Parameters:

  • threshold (Integer) (defaults to: DEFAULT_THRESHOLD)

    Token count threshold for chunking. Units whose estimated token count is at or below this value are returned whole.

  • max_chars (Integer, nil) (defaults to: nil)

    Hard character ceiling for any single chunk. When set, any chunk larger than this is sliced into line-balanced sub-chunks. `nil` disables the safety net.

  • token_counter (Woods::Embedding::TokenCounter, nil) (defaults to: nil)

    Optional exact-token counter. When both this and `max_tokens` are set, oversize detection uses the real tokenizer rather than the char-length estimate, and post-slice verification recursively re-splits any piece that still exceeds `max_tokens`.

  • max_tokens (Integer, nil) (defaults to: nil)

    Token budget used with `token_counter` for the authoritative oversize check.



# File 'lib/woods/chunking/semantic_chunker.rb', line 85

def initialize(threshold: DEFAULT_THRESHOLD, max_chars: nil,
               token_counter: nil, max_tokens: nil)
  @threshold = threshold
  @max_chars = max_chars
  @token_counter = token_counter
  @max_tokens = max_tokens
end
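
As a rough illustration of the post-slice verification described for `token_counter` and `max_tokens`, the sketch below re-splits a piece by lines until every piece fits a token budget. Everything here is an assumption made for the example: `WordCounter` is a stand-in that counts whitespace-separated words, its `count` method is not confirmed to match the interface of Woods::Embedding::TokenCounter, and `resplit_to_budget` is a hypothetical helper rather than the library's method.

```ruby
# Hypothetical stand-in for an exact tokenizer: one token per word.
# The real counter's interface may differ.
class WordCounter
  def count(text)
    text.split.size
  end
end

# Hypothetical helper: halve a piece on a line boundary and recurse
# until each piece is within max_tokens. A piece that is a single line
# cannot be split further and is returned as-is.
def resplit_to_budget(text, counter, max_tokens)
  return [text] if counter.count(text) <= max_tokens

  lines = text.lines
  return [text] if lines.size <= 1

  mid = lines.size / 2
  [lines.first(mid).join, lines.drop(mid).join].flat_map do |half|
    resplit_to_budget(half, counter, max_tokens)
  end
end

counter = WordCounter.new
text = "a b c\nd e f\ng h i\nj k l\n"
pieces = resplit_to_budget(text, counter, 3)
```

Measuring with the counter after each split, rather than trusting a character estimate, is what makes the check authoritative in this sketch.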

Instance Attribute Details

#max_chars ⇒ Integer? (readonly)

Returns:

  • (Integer, nil)


# File 'lib/woods/chunking/semantic_chunker.rb', line 100

def max_chars
  @max_chars
end

#max_tokens ⇒ Integer? (readonly)

Returns:

  • (Integer, nil)


# File 'lib/woods/chunking/semantic_chunker.rb', line 97

def max_tokens
  @max_tokens
end

#token_counter ⇒ Woods::Embedding::TokenCounter? (readonly)



# File 'lib/woods/chunking/semantic_chunker.rb', line 94

def token_counter
  @token_counter
end

Instance Method Details

#chunk(unit) ⇒ Array<Chunk>

Split an ExtractedUnit into semantic chunks.

Parameters:

Returns:

  • (Array<Chunk>)

    Ordered list of chunks



# File 'lib/woods/chunking/semantic_chunker.rb', line 106

def chunk(unit)
  return [] if unit.source_code.nil? || unit.source_code.strip.empty?
  return [build_whole_chunk(unit)] if unit.estimated_tokens <= @threshold

  enforce_char_limit(chunks_for(unit), unit)
end

#enforce_chunk_limits!(unit) ⇒ void

This method returns an undefined value.

Enforce `@max_chars` on a unit’s already-populated `chunks` array (hashes produced by extraction or a prior chunking pass). Oversize chunks are split into line-balanced siblings with `_part_N` chunk types; small chunks pass through unchanged. No-op when `@max_chars` is unset or `unit.chunks` is empty.

Exists so the Indexer can apply the same ceiling to pre-chunked units (e.g. `rails_source`) that extraction already sliced. The extractor’s own chunker is unaware of the embedding provider’s budget and can emit chunks larger than the ceiling chosen here.

Parameters:



# File 'lib/woods/chunking/semantic_chunker.rb', line 126

def enforce_chunk_limits!(unit)
  return unless enforcement_active?
  return if unit.chunks.nil? || unit.chunks.empty?

  unit.chunks = unit.chunks.flat_map { |chunk| split_oversize_hash_chunk(chunk) }
end
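
The splitting of oversize hash chunks into `_part_N` siblings might look roughly like the sketch below. It is not the library's `split_oversize_hash_chunk`: the helper name `split_oversize`, the hash keys `:chunk_type` and `:content`, the use of symbols for chunk types, and numbering parts from 1 are all assumptions made for illustration. It also does a single balancing pass and does not re-verify each part against the limit.

```ruby
# Hypothetical chunk shape: { chunk_type: Symbol, content: String }.
# Small chunks pass through unchanged; oversize chunks become
# line-balanced siblings typed <original>_part_N.
def split_oversize(chunk, max_chars)
  content = chunk[:content]
  return [chunk] if content.length <= max_chars

  lines = content.lines
  # Never ask for more parts than there are lines to distribute.
  parts = [(content.length.to_f / max_chars).ceil, lines.size].min
  per_part = (lines.size.to_f / parts).ceil
  lines.each_slice(per_part).each_with_index.map do |slice, i|
    chunk.merge(chunk_type: :"#{chunk[:chunk_type]}_part_#{i + 1}",
                content: slice.join)
  end
end

chunks = [
  { chunk_type: :summary, content: "class Foo\n" },
  { chunk_type: :methods,
    content: "def a; end\ndef b; end\ndef c; end\ndef d; end\n" }
]
result = chunks.flat_map { |c| split_oversize(c, 24) }
result.map { |c| c[:chunk_type] }
# => [:summary, :methods_part_1, :methods_part_2]
```

Using `flat_map` mirrors the source listing above: each input chunk yields one or more output chunks, and order is preserved.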