Class: Phronomy::Splitter::RecursiveSplitter

Inherits:
Base
  • Object
Defined in:
lib/phronomy/splitter/recursive_splitter.rb

Overview

Splits text recursively using a prioritised list of separator strings.

The splitter tries each separator in order. When a separator produces chunks that are still larger than chunk_size, it recurses with the next separator in the list. This mirrors LangChain's RecursiveCharacterTextSplitter behaviour.

Default separators (in priority order):

  1. "\n\n" — paragraph breaks
  2. "\n" — line breaks
  3. ". " — sentence boundaries
  4. " " — word boundaries
  5. "" — character-level fallback
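The priority-ordered fallback described above can be sketched roughly as follows. This is an illustration only: sketch_recursive_split is a hypothetical helper, not Phronomy's private recursive_split, and it assumes separators are dropped from the pieces and that no merging or overlap is applied.

```ruby
# Hypothetical sketch, not the library's implementation.
# Splits on the first separator, then recurses into any piece that is
# still longer than chunk_size using the remaining separators.
def sketch_recursive_split(text, chunk_size, separators = ["\n\n", "\n", ". ", " ", ""])
  return [text] if text.length <= chunk_size

  sep, *rest = separators
  # Empty separator (or none left): fall back to fixed-width character slices.
  return text.scan(/.{1,#{chunk_size}}/m) if sep.nil? || sep.empty?

  text.split(sep).reject(&:empty?).flat_map do |piece|
    sketch_recursive_split(piece, chunk_size, rest)
  end
end

pieces = sketch_recursive_split("aaaa bbbb\n\ncccc", 5)
```

Here the paragraph break is tried first, and only the piece that is still too long ("aaaa bbbb") falls through to the later separators.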

Examples:

splitter = Phronomy::Splitter::RecursiveSplitter.new(chunk_size: 300, chunk_overlap: 30)
chunks   = splitter.split({ text: long_markdown, metadata: { source: "guide.md" } })

Constant Summary

DEFAULT_SEPARATORS =
["\n\n", "\n", ". ", " ", ""].freeze

Instance Method Summary

Methods inherited from Base

#split_all

Constructor Details

#initialize(chunk_size: 1000, chunk_overlap: 200, separators: DEFAULT_SEPARATORS) ⇒ RecursiveSplitter

Returns a new instance of RecursiveSplitter.

Parameters:

  • chunk_size (Integer) (defaults to: 1000)

    maximum number of characters per chunk

  • chunk_overlap (Integer) (defaults to: 200)

    number of characters of overlap between chunks; must be less than chunk_size

  • separators (Array<String>) (defaults to: DEFAULT_SEPARATORS)

    separator list in priority order

Raises:

  • (ArgumentError)

    if chunk_overlap is greater than or equal to chunk_size


# File 'lib/phronomy/splitter/recursive_splitter.rb', line 28

def initialize(chunk_size: 1000, chunk_overlap: 200, separators: DEFAULT_SEPARATORS)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size

  @chunk_size = chunk_size
  @chunk_overlap = chunk_overlap
  @separators = separators
end
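The guard in the constructor can be tried in isolation. The sketch below copies only the check from the listing above into a hypothetical stand-alone method (check_chunking! and chunking_valid? are illustrative names, not part of Phronomy) to show which argument combinations are rejected.

```ruby
# Hypothetical stand-in that reproduces only the constructor's guard.
def check_chunking!(chunk_size: 1000, chunk_overlap: 200)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size

  { chunk_size: chunk_size, chunk_overlap: chunk_overlap }
end

# Returns true when the guard accepts the arguments, false when it raises.
def chunking_valid?(**opts)
  check_chunking!(**opts)
  true
rescue ArgumentError
  false
end

chunking_valid?(chunk_size: 300, chunk_overlap: 30)   # accepted
chunking_valid?(chunk_size: 100, chunk_overlap: 100)  # rejected: overlap equals size
```

Because the comparison is >=, an overlap equal to chunk_size is rejected as well as a larger one.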

Instance Method Details

#split(document) ⇒ Array<Hash>

Parameters:

  • document (Hash, String)

    the document to split: a Hash with :text and :metadata keys, or a String

Returns:

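  • (Array<Hash>)

    one Hash per chunk with :text and :metadata keys; the metadata is the document's metadata plus a zero-based :chunk index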


# File 'lib/phronomy/splitter/recursive_splitter.rb', line 38

def split(document)
  doc = normalise(document)
  texts = recursive_split(doc[:text], @separators)
  merge_with_overlap(texts).each_with_index.map do |text, idx|
    {text: text, metadata: doc[:metadata].merge(chunk: idx)}
  end
end
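The shape of the returned array follows from the last step of the listing above. As a rough illustration, the sketch below reproduces just that step with a hypothetical helper (attach_chunk_index is not a Phronomy method) and made-up chunk texts; the splitting and overlap merging that come before it are not modelled.

```ruby
# Hypothetical helper mirroring the final each_with_index.map step of #split.
def attach_chunk_index(texts, metadata)
  texts.each_with_index.map do |text, idx|
    # Hash#merge returns a new hash, so the document's metadata is not mutated.
    { text: text, metadata: metadata.merge(chunk: idx) }
  end
end

metadata = { source: "guide.md" }
chunks = attach_chunk_index(["first chunk", "second chunk"], metadata)
chunks.each { |c| puts "#{c[:metadata][:chunk]}: #{c[:text]}" }
```

Each chunk carries its own copy of the metadata, so downstream code can trace a chunk back to its source and position.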