Class: Documentrix::Documents::Splitters::Character

Inherits:
Object
Includes:
Common
Defined in:
lib/documentrix/documents/splitters/character.rb

Overview

The Character class provides basic text splitting based on a single separator and bundles the resulting segments into chunks of a maximum size.

Separators can optionally be kept in the output, and segments are joined back into chunks with a configurable combining string.

Constant Summary

DEFAULT_SEPARATOR =

The default regex used to identify paragraph boundaries. It matches two or more consecutive line breaks, each of which may be either LF or CRLF.

Returns:

  • (Regexp)
/(?:\r?\n){2,}/
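
To make the pattern concrete, the following sketch applies the same regex to a string that mixes LF and CRLF paragraph breaks. It is standalone and does not load the gem; the constant name PARAGRAPH_SEPARATOR and the helper paragraphs are illustrative assumptions, and only the pattern itself is taken from the documentation above.

```ruby
# Illustrative copy of the documented pattern; this name is not part of Documentrix.
PARAGRAPH_SEPARATOR = /(?:\r?\n){2,}/

# Hypothetical helper: splits text at paragraph boundaries only.
def paragraphs(text)
  text.split(PARAGRAPH_SEPARATOR)
end

sample = "First paragraph.\n\nSecond paragraph.\r\n\r\nThird,\nstill third."
p paragraphs(sample)
# => ["First paragraph.", "Second paragraph.", "Third,\nstill third."]
```

A single line break is not treated as a boundary, so the last paragraph stays in one piece.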

Instance Method Summary

Constructor Details

#initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false) ⇒ Character

Initializes a new Character splitter.

Parameters:

  • separator (Regexp) (defaults to: DEFAULT_SEPARATOR)

    the regex used to split the text

  • include_separator (Boolean) (defaults to: false)

    whether to keep each separator in the output, appended to the segment that precedes it

  • combining_string (String) (defaults to: "\n\n")

    the string used to join segments into chunks

  • chunk_size (Integer) (defaults to: 4096)

    the maximum size, in characters, of each resulting chunk

  • force (Boolean) (defaults to: false)

    whether to force-split the final chunk if it exceeds chunk_size



# File 'lib/documentrix/documents/splitters/character.rb', line 23

def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
  @separator, @include_separator, @combining_string, @chunk_size, @force =
    separator, include_separator, combining_string, chunk_size, force
  if include_separator
    @separator = Regexp.new("(#@separator)")
  end
end
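
The effect of wrapping the separator in a capture group can be seen with plain String#split, which returns captured separators as elements of their own. The sketch below is standalone and does not load the gem; the constant names are illustrative assumptions.

```ruby
# Illustrative names, not part of Documentrix.
PLAIN_SEPARATOR = /(?:\r?\n){2,}/

# Same kind of wrapping as in the constructor: interpolating a Regexp into a
# string embeds its source (with flags), here inside a new capture group.
CAPTURING_SEPARATOR = Regexp.new("(#{PLAIN_SEPARATOR})")

p "a\n\nb".split(PLAIN_SEPARATOR)     # => ["a", "b"]
p "a\n\nb".split(CAPTURING_SEPARATOR) # => ["a", "\n\n", "b"]
```

With include_separator enabled, #split relies on this behavior to see each separator as its own element and concatenate it onto the preceding segment.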

Instance Method Details

#split(text) ⇒ Array<String>

Splits the given text into chunks based on the configured separator and size limit.

Parameters:

  • text (String)

    the text to be split

Returns:

  • (Array<String>)

    an array of text chunks



# File 'lib/documentrix/documents/splitters/character.rb', line 36

def split(text)
  texts = []
  text.split(@separator) do |t|
    if @include_separator && t =~ @separator
      texts.last&.concat t
    else
      texts.push(t)
    end
  end
  result = []
  current_text = +''
  texts.each do |t|
    if current_text.size + t.size < @chunk_size
      current_text << t << @combining_string
    else
      current_text.empty? or result << current_text
      current_text = t
    end
  end
  result.concat force_split(current_text)
  result
end
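
The bundling loop in the listing above can be traced with a small standalone sketch. The helper bundle_segments is a hypothetical name that mirrors the loop for illustration only. One simplification is assumed: the listing hands the remaining text to force_split (provided by Common and not shown on this page), whereas the sketch simply appends the remainder as the last chunk.

```ruby
# Hypothetical helper mirroring the bundling loop of #split; not part of Documentrix.
# Assumption: the remainder is appended as-is instead of going through force_split.
def bundle_segments(segments, chunk_size:, combining_string: "\n\n")
  result = []
  current = +''
  segments.each do |segment|
    if current.size + segment.size < chunk_size
      # The segment still fits: append it followed by the combining string.
      current << segment << combining_string
    else
      # The segment does not fit: flush the current chunk and start a new one.
      result << current unless current.empty?
      current = segment.dup
    end
  end
  result << current unless current.empty?
  result
end

segments = "alpha\n\nbeta\n\ngamma delta epsilon".split(/(?:\r?\n){2,}/)
p bundle_segments(segments, chunk_size: 16)
# => ["alpha\n\nbeta\n\n", "gamma delta epsilon"]
```

Two details follow from the loop as written. The size check compares only the segment lengths, so the combining string is not counted against chunk_size. A segment that starts a new chunk after a flush is stored without a trailing combining string, so the next segment that fits is appended directly to it.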