Class: Phronomy::Splitter::FixedSizeSplitter

Inherits:
Base
  • Object
Defined in:
lib/phronomy/splitter/fixed_size_splitter.rb

Overview

Splits text into fixed-size character chunks with optional overlap.

Examples:

splitter = Phronomy::Splitter::FixedSizeSplitter.new(chunk_size: 200, chunk_overlap: 20)
chunks   = splitter.split({ text: long_text, metadata: { source: "doc.txt" } })
# => [
#   { text: "...(200 chars)...", metadata: { source: "doc.txt", chunk: 0 } },
#   { text: "...(200 chars, 20-char overlap)...", metadata: { source: "doc.txt", chunk: 1 } },
# ]

Instance Method Summary

Methods inherited from Base

#split_all

Constructor Details

#initialize(chunk_size: 1000, chunk_overlap: 200) ⇒ FixedSizeSplitter

Returns a new instance of FixedSizeSplitter.

Parameters:

  • chunk_size (Integer) (defaults to: 1000)

    maximum characters per chunk (default: 1000)

  • chunk_overlap (Integer) (defaults to: 200)

    characters to repeat at the start of each subsequent chunk (default: 200); must be less than chunk_size

Raises:

  • (ArgumentError)

    if chunk_overlap is greater than or equal to chunk_size


# File 'lib/phronomy/splitter/fixed_size_splitter.rb', line 18

def initialize(chunk_size: 1000, chunk_overlap: 200)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size

  @chunk_size = chunk_size
  @chunk_overlap = chunk_overlap
end
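
The guard matters because #split advances by chunk_size - chunk_overlap characters on each iteration, so an overlap equal to or larger than the chunk size would give a step of zero or less and the loop would never move forward. A minimal sketch of that arithmetic follows; chunk_stride is a hypothetical helper written for illustration and is not part of Phronomy.

```ruby
# Hypothetical helper (not part of Phronomy): the distance between the start
# offsets of two consecutive chunks, guarded the same way as #initialize.
def chunk_stride(chunk_size:, chunk_overlap:)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size

  chunk_size - chunk_overlap
end

# With the documented defaults each chunk starts 800 characters after the last.
puts chunk_stride(chunk_size: 1000, chunk_overlap: 200)

# An overlap equal to the chunk size is rejected before any splitting happens.
begin
  chunk_stride(chunk_size: 100, chunk_overlap: 100)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```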

Instance Method Details

#split(document) ⇒ Array<Hash>

Parameters:

  • document (Hash, String)

    a document Hash with :text and :metadata keys, or a plain String

Returns:

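  • (Array<Hash>)

    one Hash per chunk, each with :text and the document's :metadata merged with a zero-based :chunk index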


# File 'lib/phronomy/splitter/fixed_size_splitter.rb', line 27

def split(document)
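  doc = normalise(document)
  text = doc[:text]
  metadata = doc[:metadata]

  chunks = []
  start = 0 # character offset where the current chunk begins
  index = 0 # zero-based chunk position, stored under metadata[:chunk]

  while start < text.length
    chunk_text = text[start, @chunk_size]
    chunks << {text: chunk_text, metadata: metadata.merge(chunk: index)}
    # Stop once this chunk reaches the end of the text.
    break if start + @chunk_size >= text.length

    # Step back by the overlap so consecutive chunks share @chunk_overlap characters.
    start += @chunk_size - @chunk_overlap
    index += 1
  end

  chunks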
end
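
To make the windowing concrete, the loop above can be traced on a short string. The sketch below is a stand-alone approximation and not the library's code: fixed_size_chunks is a hypothetical helper that takes a plain String, skips the normalise step and the metadata, and returns only the chunk texts.

```ruby
# Hypothetical stand-alone helper (not part of Phronomy): mirrors the loop in
# #split, but works on a plain String and returns only the chunk texts.
def fixed_size_chunks(text, chunk_size:, chunk_overlap:)
  raise ArgumentError, "chunk_overlap must be less than chunk_size" if chunk_overlap >= chunk_size

  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, chunk_size]
    # The chunk just taken already reaches the end of the text.
    break if start + chunk_size >= text.length

    # Next chunk starts chunk_overlap characters before this one ended.
    start += chunk_size - chunk_overlap
  end
  chunks
end

p fixed_size_chunks("abcdefghij", chunk_size: 4, chunk_overlap: 1)
# => ["abcd", "defg", "ghij"]
```

Each chunk begins with the last character of the previous one ("d", then "g"), and the start offsets are 0, 3 and 6, that is, multiples of chunk_size - chunk_overlap. When the text length does not line up with the step, the final chunk is simply shorter than chunk_size.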