Class: Documentrix::Documents::Splitters::RecursiveCharacter
- Inherits: Object
- Includes: Common
- Defined in: lib/documentrix/documents/splitters/character.rb
Overview
The RecursiveCharacter class implements a hierarchical splitting strategy.
It splits text using a priority list of separators. If a resulting chunk is still larger than the specified chunk_size, it recursively applies the next separator in the list to that chunk, until the chunk fits within chunk_size or no separators remain. Once the separators are exhausted, a still-oversized chunk is returned unchanged.
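The strategy described above can be sketched in plain Ruby. This is a simplified stand-in, not the library's code: `recursive_split_sketch` is a hypothetical name, it uses `String#split` directly instead of the `Character` splitter, and it never recombines small pieces with a combining string.

```ruby
# Simplified sketch of hierarchical splitting (assumption: plain String#split
# stands in for the Character splitter used by the real class).
def recursive_split_sketch(text, separators, chunk_size)
  return [text] if separators.empty?

  separator, *rest = separators
  pieces = text.split(separator).reject(&:empty?)
  return [text] if pieces.empty?

  pieces.flat_map do |piece|
    if piece.size > chunk_size
      # Still too large: retry this piece with the next, finer separator.
      recursive_split_sketch(piece, rest, chunk_size)
    else
      [piece]
    end
  end
end

separators = [/(?:\r?\n){2,}/, /\r?\n/, /\b/, //]
text = "alpha beta\ngamma\n\ndelta"
chunks = recursive_split_sketch(text, separators, 6)
p chunks # => ["alpha", " ", "beta", "gamma", "delta"]
```

Only "alpha beta" needs the word-boundary separator here; "gamma" and "delta" already fit after the coarser splits, so they are never broken further.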
Constant Summary
- DEFAULT_SEPARATORS =
The default priority list of regexes used for recursive splitting. The strategy is to split by the coarsest grain first (paragraphs) and move toward the finest grain (individual characters) as needed.
Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
[ /(?:\r?\n){2,}/, /\r?\n/, /\b/, //, ].freeze
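To see what each grain in the list does, the regexes can be applied one at a time with plain `String#split`. This is only an illustration of the patterns themselves; the class delegates to the `Character` splitter, whose handling of separators and recombination may differ from raw `String#split`. The sample strings are made up.

```ruby
# The patterns are copied from DEFAULT_SEPARATORS so the demo is self-contained.
separators = [/(?:\r?\n){2,}/, /\r?\n/, /\b/, //].freeze
sample = "one two\nthree\n\nfour"

paragraphs = sample.split(separators[0])    # blank-line runs
lines      = sample.split(separators[1])    # single newlines
words      = "one two".split(separators[2]) # word boundaries
chars      = "two".split(separators[3])     # every character

p paragraphs # => ["one two\nthree", "four"]
p lines      # => ["one two", "three", "", "four"]
p words      # => ["one", " ", "two"]
p chars      # => ["t", "w", "o"]
```

The empty regex `//` matches between every pair of characters, which is why it can always reduce a piece to single characters.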
Instance Method Summary
- #initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096) ⇒ RecursiveCharacter (constructor)
Initializes a new RecursiveCharacter splitter.
- #split(text, separators: @separators) ⇒ Array<String>
Recursively splits the given text into chunks using the list of separators.
Constructor Details
#initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096) ⇒ RecursiveCharacter
Initializes a new RecursiveCharacter splitter.
# File 'lib/documentrix/documents/splitters/character.rb', line 90
def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
  separators.empty? and raise ArgumentError, "non-empty array of separators required"
  @separators, @include_separator, @combining_string, @chunk_size =
    separators, include_separator, combining_string, chunk_size
  @force = separators.last == //
end
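The constructor does two checks on the separator list: it rejects an empty list, and it records in `@force` whether the last separator is the empty regex `//`. How `@force` is used afterwards is not shown in this listing. The sketch below isolates just those two checks; `check_separators` is a hypothetical helper, not part of the library.

```ruby
# Hypothetical helper mirroring the constructor's guard and its @force test.
def check_separators(separators)
  separators.empty? and raise ArgumentError, "non-empty array of separators required"
  separators.last == // # Regexp#== compares pattern source and options
end

default_separators = [/(?:\r?\n){2,}/, /\r?\n/, /\b/, //]
custom_separators  = [/(?:\r?\n){2,}/, /\r?\n/]

p check_separators(default_separators) # => true
p check_separators(custom_separators)  # => false
```

With the default list the flag is true, because the list ends in `//`; a custom list that stops at newlines yields false.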
Instance Method Details
#split(text, separators: @separators) ⇒ Array<String>
Recursively splits the given text into chunks using the list of separators.
# File 'lib/documentrix/documents/splitters/character.rb', line 104
def split(text, separators: @separators)
  separators.empty? and return [ text ]
  separators = separators.dup
  separator = separators.shift
  texts = Character.new(
    separator:,
    include_separator: @include_separator,
    combining_string: @combining_string,
    chunk_size: @chunk_size
  ).split(text)
  texts.count == 0 and return [ text ]
  texts.inject([]) do |r, t|
    if t.size > @chunk_size
      r.concat(split(t, separators:))
    else
      r.concat([ t ])
    end
  end
end
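One detail of the method above is the `dup` before `shift`. `DEFAULT_SEPARATORS` is frozen, and `@separators` is reused across calls, so each recursion level works on a copy. A minimal illustration, using a made-up three-element list in place of the real constant:

```ruby
# Stand-in list, frozen like DEFAULT_SEPARATORS.
separators = [/\n\n/, /\n/, //].freeze

remaining = separators.dup  # dup returns an unfrozen shallow copy
current   = remaining.shift # take the coarsest separator for this level

p current         # => /\n\n/
p remaining       # => [/\n/, //]
p separators.size # => 3
```

Calling `shift` directly on the frozen list would raise a FrozenError, and on an unfrozen `@separators` it would shorten the list for every later call.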