Class: Documentrix::Documents::Splitters::Character
- Inherits:
- Object
- Includes:
- Common
- Defined in:
- lib/documentrix/documents/splitters/character.rb
Overview
The Character class provides basic text splitting based on a single separator and bundles the resulting segments into chunks of a maximum size.
It allows for the preservation of separators and uses a combining string to join segments back together into chunks.
Constant Summary
- DEFAULT_SEPARATOR = /(?:\r?\n){2,}/
The default regex used to identify paragraph boundaries. It matches a run of two or more consecutive line breaks, each of which may be CRLF or LF.
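To illustrate what this pattern treats as a paragraph boundary, the following sketch applies the same regex to a string with mixed line endings. The constant names and the sample text are illustrative only and are not part of the library.

```ruby
# Local copy of the pattern shown above, so the snippet runs on its own.
PARAGRAPH_BREAK = /(?:\r?\n){2,}/

# Illustrative input: an LF-LF break, a single LF (not a break),
# and a run of three CRLF line breaks.
SAMPLE = "First paragraph.\n\nSecond paragraph,\nstill second.\r\n\r\n\r\nThird paragraph."

SAMPLE.split(PARAGRAPH_BREAK).each_with_index do |paragraph, index|
  puts "#{index}: #{paragraph.inspect}"
end
```

The single newline inside the second paragraph is not a boundary, because the pattern requires at least two line breaks in a row.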
Instance Method Summary
- #initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false) ⇒ Character (constructor)
  Initializes a new Character splitter.
- #split(text) ⇒ Array<String>
  Splits the given text into chunks based on the configured separator and size limit.
Constructor Details
#initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false) ⇒ Character
Initializes a new Character splitter.
# File 'lib/documentrix/documents/splitters/character.rb', line 23

def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
  @separator, @include_separator, @combining_string, @chunk_size, @force =
    separator, include_separator, combining_string, chunk_size, force
  if include_separator
    @separator = Regexp.new("(#@separator)")
  end
end
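The constructor wraps the separator in a capture group when include_separator is set. The sketch below shows why that matters: String#split returns the text matched by capture groups as elements of its own, so the separators are no longer discarded. The helper name with_capture and the sample text are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper mirroring the wrapping done in the constructor.
# Interpolating a Regexp into a String uses Regexp#to_s, so the result
# is the original pattern enclosed in a single capture group.
def with_capture(separator)
  Regexp.new("(#{separator})")
end

separator = /(?:\r?\n){2,}/
text = "alpha\n\nbeta\n\n\ngamma"

p text.split(separator)               # separators are dropped
p text.split(with_capture(separator)) # separators appear as their own elements
```

With the capturing pattern, the matched separators sit between the segments in the result, which is what lets #split re-attach each one to the segment before it.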
Instance Method Details
#split(text) ⇒ Array<String>
Splits the given text into chunks based on the configured separator and size limit.
# File 'lib/documentrix/documents/splitters/character.rb', line 36

def split(text)
  texts = []
  text.split(@separator) do |t|
    if @include_separator && t =~ @separator
      texts.last&.concat t
    else
      texts.push(t)
    end
  end
  result = []
  current_text = +''
  texts.each do |t|
    if current_text.size + t.size < @chunk_size
      current_text << t << @combining_string
    else
      current_text.empty? or result << current_text
      current_text = t
    end
  end
  result.concat force_split(current_text)
  result
end
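The second half of #split is a greedy bundling loop, which can be sketched in isolation as follows. This is a simplified stand-in and not the library's code: the method name bundle is hypothetical, and because force_split (presumably supplied by the included Common module) is not shown on this page, the sketch appends the leftover text as-is instead of calling it.

```ruby
# Hypothetical standalone version of the bundling loop in #split.
# Assumption: the final force_split step is replaced by a plain append.
def bundle(segments, chunk_size:, combining_string: "\n\n")
  result = []
  current = +''
  segments.each do |segment|
    if current.size + segment.size < chunk_size
      # The segment still fits: append it, followed by the combining string.
      current << segment << combining_string
    else
      # The chunk is full: emit it if non-empty and start a new one.
      result << current unless current.empty?
      current = segment.dup
    end
  end
  result << current unless current.empty?
  result
end

p bundle(%w[one two three], chunk_size: 12)
```

Here "one" and "two" fit together under the 12-character limit, while adding "three" would exceed it, so "three" starts a second chunk. The real method may post-process the final chunk differently, depending on what force_split does.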