Class: Documentrix::Documents::Splitters::RecursiveCharacter

Inherits:
Object
Includes:
Common
Defined in:
lib/documentrix/documents/splitters/character.rb

Overview

The RecursiveCharacter class implements a hierarchical splitting strategy.

It attempts to split text using a priority list of separators. If a resulting chunk is still larger than the specified chunk_size, it recursively applies the next separator in the list until the size limit is met or all separators have been exhausted.

Constant Summary

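DEFAULT_SEPARATORS =

The default priority list of regexes used for recursive splitting. The strategy is to split at the coarsest grain first (paragraphs) and fall back to progressively finer grains, down to individual characters, only for chunks that are still too large.

Order: Paragraphs -> Newlines -> Word Boundaries -> Characters

Returns:

  • (Array<Regexp>)
[
  /(?:\r?\n){2,}/, # paragraph break: two or more consecutive line endings
  /\r?\n/,         # single line ending (LF or CRLF)
  /\b/,            # word boundary (zero-width match)
  //,              # empty pattern: matches between every character
].freeze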
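
To illustrate the grain of each default separator, the following sketch applies the four regexes to one sample string with plain String#split. It does not use Documentrix itself; the sample text and variable names are made up for this example.

```ruby
# Illustration only: plain String#split with each default regex.
# The sample string and variable names are assumptions, not library code.
sample = "ab cd\nef\n\ngh"

by_paragraph = sample.split(/(?:\r?\n){2,}/) # => ["ab cd\nef", "gh"]
by_line      = sample.split(/\r?\n/)         # => ["ab cd", "ef", "", "gh"]
by_word      = sample.split(/\b/)            # => ["ab", " ", "cd", "\n", "ef", "\n\n", "gh"]
by_char      = sample.split(//)              # one element per character

puts by_paragraph.inspect
puts by_line.inspect
puts by_word.inspect
puts by_char.length
```

Each step down the list yields more, smaller pieces, which is why the splitter only moves to the next separator for chunks that are still too large.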

Instance Method Summary

Constructor Details

#initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096) ⇒ RecursiveCharacter

Initializes a new RecursiveCharacter splitter.

Parameters:

  • separators (Array<Regexp>) (defaults to: DEFAULT_SEPARATORS)

    a priority list of regexes to use for splitting, ordered from coarsest to finest

  • include_separator (Boolean) (defaults to: false)

    whether to include the separator in the resulting chunks

  • combining_string (String) (defaults to: "\n\n")

    the string used to join segments into chunks

  • chunk_size (Integer) (defaults to: 4096)

    the maximum size of each resulting chunk, in characters

Raises:

  • (ArgumentError)

    if the separators array is empty



# File 'lib/documentrix/documents/splitters/character.rb', line 90

def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
  # Reject an empty separator list up front; there would be nothing to split on.
  separators.empty? and
    raise ArgumentError, "non-empty array of separators required"
  @separators, @include_separator, @combining_string, @chunk_size =
    separators, include_separator, combining_string, chunk_size
  # Record whether the last separator is the empty regex (character-level split).
  @force = separators.last == //
end
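
The two checks in the constructor can be tried in isolation. The sketch below is a stand-in, not Documentrix code: inspect_separators is a hypothetical helper that mirrors the empty-list guard and the comparison against the empty regex, which relies on Regexp#== treating two regexes with the same source and options as equal.

```ruby
# Hypothetical helper mirroring the constructor's two checks.
# It is not part of Documentrix; the name and return shape are assumptions.
def inspect_separators(separators)
  raise ArgumentError, "non-empty array of separators required" if separators.empty?

  { count: separators.size, ends_with_empty_regex: separators.last == // }
end

default_like = inspect_separators([/\r?\n/, //])     # last entry is the empty regex
custom       = inspect_separators([/\r?\n/, /\s+/])  # last entry is not

# An empty list is rejected with the same message as in the listing above.
rejected = begin
  inspect_separators([])
rescue ArgumentError => e
  e.message
end

puts default_like.inspect
puts custom.inspect
puts rejected
```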

Instance Method Details

#split(text, separators: @separators) ⇒ Array<String>

Recursively splits the given text into chunks using the list of separators.

Parameters:

  • text (String)

    the text to be split

  • separators (Array<Regexp>) (defaults to: @separators)

    the priority list of separators still to be tried; each recursion level consumes the first entry

Returns:

  • (Array<String>)

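    an array of text chunks; once the separator list is exhausted, the remaining text is returned as a single chunk regardless of its size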



# File 'lib/documentrix/documents/splitters/character.rb', line 104

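def split(text, separators: @separators)
  # Base case: no separators left, so return the text as a single chunk.
  separators.empty? and return [ text ]
  # Work on a copy; the caller's array may be frozen (DEFAULT_SEPARATORS is).
  separators = separators.dup
  separator = separators.shift
  # Split with the coarsest remaining separator.
  texts = Character.new(
    separator:,
    include_separator: @include_separator,
    combining_string: @combining_string,
    chunk_size: @chunk_size
  ).split(text)
  texts.count == 0 and return [ text ]
  texts.inject([]) do |r, t|
    if t.size > @chunk_size
      # Still too large: retry this piece with the remaining, finer separators.
      r.concat(split(t, separators:))
    else
      r.concat([ t ])
    end
  end
end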
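
The recursion in #split can be traced with a simplified, self-contained sketch. It assumes plain String#split in place of the Character splitter, so it leaves out the joining of small segments via combining_string and the include_separator option; recursive_split and the sample text are hypothetical names used only for this illustration.

```ruby
# Simplified sketch of the recursive strategy, not the library's implementation.
# Assumption: String#split stands in for Character#split, so small pieces are
# not merged back together with a combining string.
def recursive_split(text, separators, chunk_size)
  return [text] if separators.empty?

  separator, *rest = separators
  pieces = text.split(separator).reject(&:empty?)
  return [text] if pieces.empty?

  pieces.flat_map do |piece|
    # Only oversized pieces are retried with the finer separators.
    piece.size > chunk_size ? recursive_split(piece, rest, chunk_size) : [piece]
  end
end

separators = [/(?:\r?\n){2,}/, /\r?\n/, /\b/, //]
text = "short para\n\nfirst line is long\nline two"

chunks = recursive_split(text, separators, 12)
puts chunks.inspect
# => ["short para", "first", " ", "line", " ", "is", " ", "long", "line two"]
```

The first paragraph and the last line fit within 12 characters and are kept whole, while the 18-character line falls through to the word-boundary separator. The character-level separator is never reached in this run.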