Class: SemanticTextChunker::Splitters::StructureSplitter

Inherits:
Object
Defined in:
lib/semantic_text_chunker/splitters/structure_splitter.rb

Overview

Splits text while respecting document structure. Blank-line paragraph breaks and Markdown headings produce “hard” boundaries, and chunks are never merged across a hard boundary. A standalone heading is attached to the block of content that follows it, so the heading travels with its section.
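The private segmentation step is not shown on this page, so the following is only a minimal sketch of the behaviour described above, under stated assumptions: `segment_sketch` is a hypothetical name, the pattern is copied from the documented `HEADING_LINE` constant, and headings are assumed to be separated from their content by blank lines (a heading on the line directly above its content simply stays inside that paragraph here).

```ruby
# Same pattern as the documented HEADING_LINE constant, repeated so the sketch runs standalone.
HEADING_LINE = /\A\#{1,6}\s+\S/

# Hypothetical helper, NOT the library's private #segment.
# Splits on blank lines and glues a standalone heading onto the next content block.
def segment_sketch(text)
  blocks = []
  pending_heading = nil

  text.split(/\n\s*\n/).each do |paragraph|
    paragraph = paragraph.strip
    next if paragraph.empty?

    if !paragraph.include?("\n") && paragraph.match?(HEADING_LINE)
      # Standalone heading: hold it until the next content block arrives.
      pending_heading = [pending_heading, paragraph].compact.join("\n")
    else
      blocks << [pending_heading, paragraph].compact.join("\n")
      pending_heading = nil
    end
  end

  # A trailing heading with nothing after it becomes its own block.
  blocks << pending_heading if pending_heading
  blocks
end

text = "# Title\n\nIntro paragraph.\n\n## Section\n\nBody one.\n\nBody two."
p segment_sketch(text)
# => ["# Title\nIntro paragraph.", "## Section\nBody one.", "Body two."]
```

Each returned block would then end in a hard boundary, so "Body one." and "Body two." could never be merged into the same chunk under this reading.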

Constant Summary

ATX markdown heading line, e.g. “## Section title”

HEADING_LINE = /\A\#{1,6}\s+\S/
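To make the pattern concrete, here is a small sketch that applies it to a few sample lines. The helper `heading_line?` is a hypothetical name used only for illustration. Whether the splitter strips leading whitespace before matching is not shown on this page, so the indented case below reflects the raw pattern only.

```ruby
# Same pattern as the documented constant, repeated so the sketch runs standalone.
# The "\#" escape keeps Ruby from reading "#{" as string interpolation.
HEADING_LINE = /\A\#{1,6}\s+\S/

# Hypothetical helper for illustration only.
def heading_line?(line)
  line.match?(HEADING_LINE)
end

p heading_line?("## Section title")     # => true
p heading_line?("#hashtag")             # => false (no whitespace after the hash)
p heading_line?("####### seven hashes") # => false (more than six hashes)
p heading_line?("# ")                   # => false (no title text)
p heading_line?("  ## indented")        # => false (\A anchors at the string start)
```

Because `\A` anchors at the start of the string rather than the start of a line, the pattern is meant to be applied to one line at a time.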

Instance Method Summary

Constructor Details

#initialize(sentence_splitter: SentenceSplitter.new) ⇒ StructureSplitter

Returns a new instance of StructureSplitter.



# File 'lib/semantic_text_chunker/splitters/structure_splitter.rb', line 13

def initialize(sentence_splitter: SentenceSplitter.new)
  @sentence_splitter = sentence_splitter
end

Instance Method Details

#split(text) ⇒ Object

Returns [sentences, hard_boundaries] where:

sentences       - flat array of sentence strings across all blocks
hard_boundaries - indices into sentences marking the last sentence of every block except the final one; a chunk must end at each of these indices


# File 'lib/semantic_text_chunker/splitters/structure_splitter.rb', line 20

def split(text)
  sentences = []
  hard      = []

  segment(text).each do |block|
    block_sentences = @sentence_splitter.split(block)
    next if block_sentences.empty?

    sentences.concat(block_sentences)
    hard << sentences.size - 1
  end

  # The last block's trailing boundary is the document end, not a split.
  hard.pop

  [sentences, hard]
end
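The boundary bookkeeping is easiest to follow on a small input. The sketch below mirrors the loop in #split but takes blocks that are already segmented, since the private #segment step is not shown here. `NaiveSentenceSplitter` and `collect_boundaries` are hypothetical names, and the real SentenceSplitter may split differently.

```ruby
# Hypothetical stand-in for SentenceSplitter: breaks after ., ! or ? followed by whitespace.
class NaiveSentenceSplitter
  def split(text)
    text.strip.split(/(?<=[.!?])\s+/)
  end
end

# Mirrors the documented loop, operating on pre-segmented blocks.
def collect_boundaries(blocks, sentence_splitter)
  sentences = []
  hard = []

  blocks.each do |block|
    block_sentences = sentence_splitter.split(block)
    next if block_sentences.empty? # blank blocks contribute no boundary

    sentences.concat(block_sentences)
    hard << sentences.size - 1 # index of this block's last sentence
  end

  hard.pop # the final block ends at the document end, not at a split
  [sentences, hard]
end

blocks = ["First point. Second point.", "   ", "Third point. Fourth point. Fifth point."]
sentences, hard = collect_boundaries(blocks, NaiveSentenceSplitter.new)
p sentences.size # => 5
p hard           # => [1]
```

Here index 1 ("Second point.") is the only hard boundary: the whitespace-only block is skipped, and the boundary after "Fifth point." is dropped by the pop. With no blocks at all, pop on an empty array is a no-op and the result is [[], []].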