Class: SemanticTextChunker::Splitters::StructureSplitter
- Inherits: Object
  - Object
  - SemanticTextChunker::Splitters::StructureSplitter
- Defined in: lib/semantic_text_chunker/splitters/structure_splitter.rb
Overview
Splits text while respecting document structure. Blank-line paragraph breaks and markdown headings produce “hard” boundaries that chunks are never merged across. A standalone heading is attached to the block of content that follows it, so the heading travels with its section.
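The segmentation step itself is private and not shown on this page, so the following is only a minimal sketch of the behavior described above, not the gem's implementation. The method name sketch_segment and the blank-line regex are assumptions made for illustration; the heading pattern is copied from HEADING_LINE.

```ruby
# Hypothetical sketch of the block segmentation described in the overview.
# It is NOT the gem's private #segment method, which this page does not show.
def sketch_segment(text)
  heading = /\A\#{1,6}\s+\S/ # same pattern as HEADING_LINE

  # Blank lines separate blocks.
  blocks = text.split(/\n\s*\n/).map(&:strip).reject(&:empty?)

  merged = []
  pending_heading = nil
  blocks.each do |block|
    standalone_heading = block.lines.size == 1 && block.match?(heading)
    if standalone_heading
      # Hold the heading back so it can travel with the next content block.
      pending_heading = [pending_heading, block].compact.join("\n")
    else
      merged << [pending_heading, block].compact.join("\n")
      pending_heading = nil
    end
  end
  # A trailing heading with no content after it becomes its own block.
  merged << pending_heading if pending_heading
  merged
end

text = "Intro paragraph.\n\n## Setup\n\nInstall the gem.\n\nThen configure it."
blocks = sketch_segment(text)
```

Here the "## Setup" heading ends up in the same block as "Install the gem.", so the boundary falls before the heading rather than between the heading and its section.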
Constant Summary
- HEADING_LINE = /\A\#{1,6}\s+\S/
  ATX markdown heading line, e.g. “## Section title”
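To make the pattern concrete, the short check below runs a copy of the regex against a few sample lines. The sample strings are made up for illustration; only the pattern comes from this page. Note that the backslash before # only prevents Ruby from reading #{ as string interpolation.

```ruby
# Local copy of the HEADING_LINE pattern documented above.
heading_line = /\A\#{1,6}\s+\S/

# Illustrative inputs (assumptions, not taken from the gem's tests).
samples = [
  "## Section title",     # 2 hashes, space, text
  "# H1",                 # 1 hash
  "###### Deepest",       # 6 hashes, the maximum
  "####### seven hashes", # 7 hashes: too many
  "#NoSpace",             # no whitespace after the hashes
  "  ## indented",        # \A requires the hash at the very start
  "## "                   # no non-space character after the whitespace
]

matches = samples.select { |line| line.match?(heading_line) }
```

Only the first three samples match; the rest are treated as ordinary text.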
Instance Method Summary
- #initialize(sentence_splitter: SentenceSplitter.new) ⇒ StructureSplitter (constructor)
  A new instance of StructureSplitter.
- #split(text) ⇒ Object
  Returns [sentences, hard_boundaries], where sentences is a flat array of sentence strings across all blocks and hard_boundaries lists the sentence indices that must end a chunk.
Constructor Details
#initialize(sentence_splitter: SentenceSplitter.new) ⇒ StructureSplitter
Returns a new instance of StructureSplitter.
# File 'lib/semantic_text_chunker/splitters/structure_splitter.rb', line 13

def initialize(sentence_splitter: SentenceSplitter.new)
  @sentence_splitter = sentence_splitter
end
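Because the constructor takes the sentence splitter as a keyword argument, and #split only calls split(block) on it and expects an array back, any object with that shape should be usable in its place. The sketch below defines a hypothetical LineSentenceSplitter to show that shape; the class name and its line-based behavior are assumptions for illustration and are not part of the gem.

```ruby
# Hypothetical stand-in for SentenceSplitter: treats every non-empty line
# of a block as one "sentence". Only the #split(block) => Array contract
# is taken from this page; everything else is illustrative.
class LineSentenceSplitter
  def split(block)
    block.lines.map(&:strip).reject(&:empty?)
  end
end

splitter = LineSentenceSplitter.new
result = splitter.split("First line.\n\nSecond line.\n")

# With the gem loaded, the stand-in would be injected like this:
# SemanticTextChunker::Splitters::StructureSplitter.new(sentence_splitter: splitter)
```

Injecting a simple splitter like this can be handy in tests, where predictable sentence boundaries matter more than linguistic accuracy.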
Instance Method Details
#split(text) ⇒ Object
Returns [sentences, hard_boundaries] where:
sentences - flat array of sentence strings across all blocks
hard_boundaries - sentence indices that must end a chunk
# File 'lib/semantic_text_chunker/splitters/structure_splitter.rb', line 20

def split(text)
  sentences = []
  hard = []

  segment(text).each do |block|
    block_sentences = @sentence_splitter.split(block)
    next if block_sentences.empty?

    sentences.concat(block_sentences)
    hard << sentences.size - 1
  end

  # The last block's trailing boundary is the document end, not a split.
  hard.pop
  [sentences, hard]
end
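The index bookkeeping in #split can be traced by hand with a standalone sketch. The blocks below are written out directly, as an assumption about what the private segment step might yield, and naive_split is a deliberately crude stand-in for SentenceSplitter; neither is taken from the gem.

```ruby
# Blocks as the segmentation step might yield them (assumed for illustration).
blocks = [
  "Intro one. Intro two.",
  "## Setup\nInstall the gem.",
  "Last block."
]

# Crude stand-in for SentenceSplitter: break after ., ! or ? and at newlines.
naive_split = lambda do |block|
  block.split(/(?<=[.!?])\s+|\n/).map(&:strip).reject(&:empty?)
end

# Same loop as #split above, with the stand-ins swapped in.
sentences = []
hard = []
blocks.each do |block|
  block_sentences = naive_split.call(block)
  next if block_sentences.empty?

  sentences.concat(block_sentences)
  hard << sentences.size - 1 # index of the last sentence in this block
end
hard.pop # the final block ends at the document end, not at a split
```

With these inputs, sentences holds five entries and hard is [1, 3]: a chunk must end after "Intro two." (index 1) and after "Install the gem." (index 3), while the heading at index 2 stays free to merge with the sentence that follows it.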