Class: Phronomy::Splitter::RecursiveSplitter
- Defined in: lib/phronomy/splitter/recursive_splitter.rb
Overview
Splits text recursively using a prioritised list of separator strings.
The splitter tries each separator in order. When a separator produces chunks that are still larger than +chunk_size+, it recurses with the next separator in the list. This mirrors LangChain's RecursiveCharacterTextSplitter behaviour.
Default separators (in priority order):
- "\n\n" — paragraph breaks
- "\n" — line breaks
- ". " — sentence boundaries
- " " — word boundaries
- "" — character-level fallback
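The fallback order above can be sketched as a small standalone function. This is an illustration only, not the library's private recursive_split helper: the names sketch_recursive_split and SKETCH_SEPARATORS are hypothetical, and the sketch assumes that separators are dropped from the output and that merging pieces back together happens in a later step.

```ruby
# Hypothetical stand-in for the private recursive_split helper.
# Illustrative only; the library's implementation may differ.
SKETCH_SEPARATORS = ["\n\n", "\n", ". ", " ", ""].freeze

def sketch_recursive_split(text, separators, chunk_size)
  # Already small enough: keep the piece as it is.
  return [text] if text.length <= chunk_size
  # No separators left: fall back to fixed-width slices.
  return text.chars.each_slice(chunk_size).map(&:join) if separators.empty?

  separator, *remaining = separators
  # An empty separator means "split into single characters".
  pieces = separator.empty? ? text.chars : text.split(separator)
  # Recurse into each piece with the lower-priority separators.
  pieces.reject(&:empty?).flat_map do |piece|
    sketch_recursive_split(piece, remaining, chunk_size)
  end
end

sample = "Alpha beta. Gamma delta.\n\nEpsilon zeta eta theta"
p sketch_recursive_split(sample, SKETCH_SEPARATORS, 12)
# => ["Alpha beta", "Gamma delta.", "Epsilon", "zeta", "eta", "theta"]
```

With a limit of 12 characters, the first paragraph is resolved at the sentence level, while the second has no sentence boundary and falls through to word boundaries. Note that Ruby's String#split treats a single-space separator specially (it splits on runs of whitespace), which is acceptable for a sketch but may not match the library's behaviour.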
Constant Summary
- DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", " ", ""].freeze
Instance Method Summary
- #initialize(chunk_size: 1000, chunk_overlap: 200, separators: DEFAULT_SEPARATORS) ⇒ RecursiveSplitter (constructor)
  A new instance of RecursiveSplitter.
- #split(document) ⇒ Array<Hash>
Methods inherited from Base
Constructor Details
#initialize(chunk_size: 1000, chunk_overlap: 200, separators: DEFAULT_SEPARATORS) ⇒ RecursiveSplitter
Returns a new instance of RecursiveSplitter. Raises ArgumentError if chunk_overlap is greater than or equal to chunk_size.
# File 'lib/phronomy/splitter/recursive_splitter.rb', line 28
def initialize(chunk_size: 1000, chunk_overlap: 200, separators: DEFAULT_SEPARATORS)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size
  @chunk_size = chunk_size
  @chunk_overlap = chunk_overlap
  @separators = separators
end
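One consequence of the guard in the constructor is worth spelling out: because chunk_overlap defaults to 200, lowering chunk_size to 200 or less without also lowering chunk_overlap is rejected. The snippet below reproduces only that guard in a hypothetical stand-in class (GuardSketch, with a hypothetical helper guard_message) so that it runs without the gem; it is not the library class itself.

```ruby
# Stand-in that copies only the guard from the constructor,
# so the snippet runs without the gem installed.
class GuardSketch
  def initialize(chunk_size: 1000, chunk_overlap: 200)
    raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size
    @chunk_size = chunk_size
    @chunk_overlap = chunk_overlap
  end
end

# Returns nil when the arguments are accepted, else the error message.
def guard_message(**options)
  GuardSketch.new(**options)
  nil
rescue ArgumentError => e
  e.message
end

p guard_message(chunk_size: 500, chunk_overlap: 50) # accepted
p guard_message(chunk_size: 100)                    # rejected: default overlap is 200
p guard_message(chunk_size: 200, chunk_overlap: 200) # rejected: equal values
```

Under these assumptions, a caller who wants small chunks would pass both keywords, for example chunk_size: 100 together with chunk_overlap: 20.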
Instance Method Details
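#split(document) ⇒ Array<Hash>
Normalises the document, splits its text recursively, merges the pieces with overlap, and returns one hash per chunk. Each hash has a :text key and a :metadata key holding the document's metadata plus a zero-based :chunk index.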
# File 'lib/phronomy/splitter/recursive_splitter.rb', line 38
def split(document)
  doc = normalise(document)
  texts = recursive_split(doc[:text], @separators)
  merge_with_overlap(texts).each_with_index.map do |text, idx|
    {text: text, metadata: doc[:metadata].merge(chunk: idx)}
  end
end
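The shape of the return value follows from the final each_with_index.map step. The sketch below isolates that step in a hypothetical helper, sketch_tag_chunks, and feeds it placeholder chunk texts in place of the real merge_with_overlap output. The metadata key :source is an assumption chosen for illustration.

```ruby
# Hypothetical helper isolating the last step of #split:
# pair each chunk text with the document metadata plus its index.
def sketch_tag_chunks(texts, metadata)
  texts.each_with_index.map do |text, idx|
    # Hash#merge returns a new hash, so the caller's metadata is not mutated.
    {text: text, metadata: metadata.merge(chunk: idx)}
  end
end

source_metadata = {source: "notes.txt"} # assumed metadata, for illustration
chunks = sketch_tag_chunks(["First part.", "Second part."], source_metadata)

chunks.each do |chunk|
  puts "#{chunk[:metadata][:chunk]}: #{chunk[:text]} (#{chunk[:metadata][:source]})"
end
```

Each element keeps the original metadata and gains a :chunk key counting from 0, so downstream code can restore chunk order or trace a chunk back to its source document.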