Module: Ace::Bundle::Atoms::BoundaryFinder
- Defined in:
- lib/ace/bundle/atoms/boundary_finder.rb
Overview
Pure functions that find semantic boundaries in XML-structured content. Used by ContextChunker to split content at clean boundaries (between </file> and <file>, and between </output> and <output>).
## Whitespace Handling
Whitespace-only content between XML elements is intentionally dropped. This means the sum of block line counts may be less than the total content line count. This is acceptable because:
- The primary goal is preserving XML element integrity, not exact line counting.
- Chunk limits are approximate; slightly exceeding them is better than splitting elements.
- Typical variance is ~2-5% of content lines.
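The line-count gap described above can be illustrated with a small sketch. It assumes input that contains only file elements separated by whitespace, so scanning for elements yields the same element blocks that parsing would keep. The `WhitespaceDemo` module, its `line_counts` helper, and the sample string are hypothetical names introduced for this illustration; the pattern is copied from `FILE_ELEMENT_PATTERN`.

```ruby
# Hypothetical demo module; only the pattern comes from the documented constants.
module WhitespaceDemo
  FILE_ELEMENT_PATTERN = %r{<file\s+[^>]*>.*?</file>}m

  module_function

  # Returns [sum of element line counts, total content line count].
  def line_counts(content)
    elements = content.scan(FILE_ELEMENT_PATTERN)
    [elements.sum { |element| element.lines.size }, content.lines.size]
  end
end

# Two elements separated by blank lines (made-up sample input).
sample = "<file path=\"a.rb\">\nputs 1\n</file>\n\n\n" \
         "<file path=\"b.rb\">\nputs 2\n</file>\n"

block_lines, total_lines = WhitespaceDemo.line_counts(sample)
puts "blocks: #{block_lines}, content: #{total_lines}"
# prints "blocks: 6, content: 8"
```

The two blank lines between the elements belong to no block, which is why the block total is lower than the content total.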
Constant Summary
- FILE_ELEMENT_PATTERN =
  XML element patterns for semantic blocks. These elements should never be split in the middle.
  %r{<file\s+[^>]*>.*?</file>}m
- OUTPUT_ELEMENT_PATTERN =
  %r{<output\s+[^>]*>.*?</output>}m
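The behaviour of these patterns can be checked in isolation. The sketch below copies both patterns verbatim into a hypothetical `PatternDemo` module and runs them against made-up sample strings; neither the module name nor the samples come from the library.

```ruby
# Hypothetical demo module holding verbatim copies of the documented patterns.
module PatternDemo
  FILE_ELEMENT_PATTERN = %r{<file\s+[^>]*>.*?</file>}m
  OUTPUT_ELEMENT_PATTERN = %r{<output\s+[^>]*>.*?</output>}m
end

sample = %(<file path="a.rb">\nputs 1\n</file>\n<file path="b.rb">\nputs 2\n</file>)

# The lazy `.*?` stops at the first closing tag, so each element matches separately.
file_matches = sample.scan(PatternDemo::FILE_ELEMENT_PATTERN)
puts file_matches.size # 2

# The `m` flag lets `.` match newlines, so a multi-line element is one match.
puts file_matches.first.include?("\n") # true

# `\s+` after the tag name requires at least one attribute-style separator,
# so a bare `<file>` with no attributes is not matched.
puts "<file>x</file>".match?(PatternDemo::FILE_ELEMENT_PATTERN) # false
```

Note the last case: as written, the patterns only recognise opening tags that carry at least one whitespace character after the tag name.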
Class Method Summary
- .create_block(content, type) ⇒ Hash
  Create a block hash.
- .has_semantic_elements?(content) ⇒ Boolean
  Check if content contains XML elements that require semantic chunking.
- .parse_blocks(content) ⇒ Array<Hash>
  Parse content into semantic blocks. Each block represents a unit that should not be split.
Class Method Details
.create_block(content, type) ⇒ Hash
Create a block hash.

    # File 'lib/ace/bundle/atoms/boundary_finder.rb', line 112

    def create_block(content, type)
      {
        content: content,
        type: type,
        lines: content.lines.size
      }
    end
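A short usage sketch may help show what the `lines` field contains. The method body is copied from the listing above; the `BlockDemo` wrapper module and the input strings are hypothetical and exist only so the example runs on its own.

```ruby
# Hypothetical wrapper so the documented method can run standalone.
module BlockDemo
  module_function

  def create_block(content, type)
    { content: content, type: type, lines: content.lines.size }
  end
end

block = BlockDemo.create_block("line one\nline two\n", :text)
puts block[:type]  # text
puts block[:lines] # 2

# String#lines also counts a final line that lacks a trailing newline.
puts BlockDemo.create_block("a\nb\nc", :text)[:lines] # 3

# An empty string has no lines at all.
puts BlockDemo.create_block("", :text)[:lines] # 0
```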
.has_semantic_elements?(content) ⇒ Boolean
Check if content contains XML elements that require semantic chunking.

    # File 'lib/ace/bundle/atoms/boundary_finder.rb', line 100

    def has_semantic_elements?(content)
      return false if content.nil? || content.empty?

      content.match?(FILE_ELEMENT_PATTERN) || content.match?(OUTPUT_ELEMENT_PATTERN)
    end
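The following sketch exercises the check on a few inputs. The method and patterns are copied from this page into a hypothetical `SemanticCheckDemo` module; the sample strings are invented for illustration.

```ruby
# Hypothetical wrapper holding copies of the documented patterns and method.
module SemanticCheckDemo
  FILE_ELEMENT_PATTERN = %r{<file\s+[^>]*>.*?</file>}m
  OUTPUT_ELEMENT_PATTERN = %r{<output\s+[^>]*>.*?</output>}m

  module_function

  def has_semantic_elements?(content)
    return false if content.nil? || content.empty?

    content.match?(FILE_ELEMENT_PATTERN) || content.match?(OUTPUT_ELEMENT_PATTERN)
  end
end

puts SemanticCheckDemo.has_semantic_elements?(nil)          # false
puts SemanticCheckDemo.has_semantic_elements?("plain text") # false
puts SemanticCheckDemo.has_semantic_elements?(%(<output command="ls">a.rb</output>)) # true

# An opening tag without its closing tag is not a complete element.
puts SemanticCheckDemo.has_semantic_elements?(%(<file path="a.rb">unclosed)) # false
```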
.parse_blocks(content) ⇒ Array<Hash>
Parse content into semantic blocks. Each block represents a unit that should not be split.

    # File 'lib/ace/bundle/atoms/boundary_finder.rb', line 43

    def parse_blocks(content)
      return [] if content.nil? || content.empty?

      blocks = []
      remaining = content

      while remaining && !remaining.empty?
        # Find the next XML element (file or output)
        file_match = remaining.match(FILE_ELEMENT_PATTERN)
        output_match = remaining.match(OUTPUT_ELEMENT_PATTERN)

        # Determine which comes first
        next_match = nil
        match_type = nil

        if file_match && output_match
          if file_match.begin(0) <= output_match.begin(0)
            next_match = file_match
            match_type = :file
          else
            next_match = output_match
            match_type = :output
          end
        elsif file_match
          next_match = file_match
          match_type = :file
        elsif output_match
          next_match = output_match
          match_type = :output
        end

        if next_match
          # Add text before the match as a text block (if non-whitespace)
          if next_match.begin(0) > 0
            text_content = remaining[0...next_match.begin(0)]
            # Only add text blocks with actual content (not just whitespace)
            blocks << create_block(text_content, :text) unless text_content.strip.empty?
          end

          # Add the XML element as a block
          blocks << create_block(next_match[0], match_type)

          # Move past this match
          remaining = remaining[next_match.end(0)..]
        else
          # No more XML elements, add remaining as text (if non-whitespace)
          blocks << create_block(remaining, :text) unless remaining.strip.empty?
          break
        end
      end

      blocks
    end
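To see the resulting block sequence on mixed input, here is a condensed re-expression of the same loop. It is a sketch, not the library source: it swaps the two separate matches for a single `Regexp.union` of both patterns, which finds the leftmost element of either kind. `ParseDemo`, `ANY_ELEMENT`, and the sample text are hypothetical names introduced here.

```ruby
# Hypothetical condensed variant of the documented parsing loop.
module ParseDemo
  FILE_ELEMENT_PATTERN = %r{<file\s+[^>]*>.*?</file>}m
  OUTPUT_ELEMENT_PATTERN = %r{<output\s+[^>]*>.*?</output>}m
  # Regexp.union keeps each pattern's own flags; the leftmost match wins.
  ANY_ELEMENT = Regexp.union(FILE_ELEMENT_PATTERN, OUTPUT_ELEMENT_PATTERN)

  module_function

  def create_block(content, type)
    { content: content, type: type, lines: content.lines.size }
  end

  def parse_blocks(content)
    return [] if content.nil? || content.empty?

    blocks = []
    remaining = content
    until remaining.empty?
      match = remaining.match(ANY_ELEMENT)
      unless match
        # No more elements: keep the tail only if it has real content.
        blocks << create_block(remaining, :text) unless remaining.strip.empty?
        break
      end
      text = match.pre_match
      blocks << create_block(text, :text) unless text.strip.empty?
      type = match[0].start_with?("<file") ? :file : :output
      blocks << create_block(match[0], type)
      remaining = match.post_match
    end
    blocks
  end
end

sample = <<~TEXT
  Intro line
  <file path="a.rb">
  puts 1
  </file>

  <output command="ls">
  a.rb
  </output>
  Trailing note
TEXT

ParseDemo.parse_blocks(sample).each do |block|
  puts "#{block[:type]}: #{block[:lines]} line(s)"
end
```

On this sample the blocks come out in the order text, file, output, text; the blank line between the two elements produces no block of its own.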