Class: Parse::Retrieval::Chunker::FixedSizeOverlap

Inherits:
Base
  • Object
Defined in:
lib/parse/retrieval/chunker.rb

Overview

Fixed-size sliding-window chunker with overlap.

Splits text into windows of size units, advancing by size - overlap each step so consecutive chunks share overlap units of context. by: :chars (default) counts characters; by: :tokens counts whitespace-delimited tokens (a cheap approximation — there is no model tokenizer here; see the :tokens note below).

  c = Parse::Retrieval::Chunker::FixedSizeOverlap.new(size: 800, overlap: 100)
  c.chunk(long_text) #=> ["…800 chars…", "…overlap+800…", …]
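The windowing helpers themselves are not shown on this page, so the following is only a minimal sketch of the sliding-window idea in :chars mode. The method name sketch_window_chars is hypothetical, and the stopping rule (end as soon as a window reaches the end of the text) is an assumption rather than the library's confirmed behaviour.

```ruby
# Hypothetical helper, not part of the library: slices `text` into
# character windows of `size`, advancing by `size - overlap` each step.
def sketch_window_chars(text, size:, overlap:)
  raise ArgumentError, "overlap must be < size" unless overlap < size

  stride = size - overlap
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, size]
    # Assumption: stop once a window reaches the end of the text, so no
    # trailing chunk consists only of already-seen overlap.
    break if start + size >= text.length

    start += stride
  end
  chunks
end

# stride is 4 - 1 = 3, so windows start at offsets 0, 3 and 6
chunks = sketch_window_chars("abcdefghij", size: 4, overlap: 1)
p chunks # prints ["abcd", "defg", "ghij"]
```

Each chunk after the first begins with the last character of the previous one, which is the one unit of shared context that overlap: 1 asks for.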

== Amplification cap

max_chunks_per_document (default 200) bounds how many chunks a single document can yield. Beyond the cap the chunker truncates rather than raising: it returns the first max_chunks_per_document chunks, and #chunk_with_meta reports truncated: true together with the pre-cap count in total_before_truncation. This is the DoS guard: a 10 MB field at 800-char windows would otherwise yield roughly 12,500 chunks with no overlap, and about 14,300 at the default stride of 700 (size 800, overlap 100).
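The chunk counts behind this guard can be checked with a little arithmetic. The sketch below uses a hypothetical helper, sketch_chunk_count, and assumes that 10 MB means 10,000,000 single-byte characters and that windowing stops at the first window to reach the end of the text. Neither assumption is confirmed by the source.

```ruby
# Hypothetical helper: how many windows a text of `length` units yields
# before any cap is applied.
def sketch_chunk_count(length, size:, overlap:)
  return 0 if length <= 0
  return 1 if length <= size

  stride = size - overlap
  # One initial window, plus one more per stride needed to cover the rest
  # (integer ceiling division).
  1 + (length - size + stride - 1) / stride
end

ten_mb = 10_000_000 # assumption: one byte per character

no_overlap = sketch_chunk_count(ten_mb, size: 800, overlap: 0)
defaults   = sketch_chunk_count(ten_mb, size: 800, overlap: 100)

puts no_overlap           # prints 12500
puts defaults             # prints 14286
puts [defaults, 200].min  # prints 200, the count after the default cap
```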

== :tokens

by: :tokens treats size/overlap as literal whitespace-token counts supplied by the caller. The chunker does NOT consult an embedding provider's max_input_tokens; that hint is the caller's concern (see Parse::Retrieval.retrieve). The chunker always does exactly what it was constructed with and never silently switches modes.
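To make "whitespace-token counts" concrete, here is a minimal sketch of :tokens-style windowing. The helper name sketch_window_tokens is hypothetical. Re-joining tokens with a single space and stopping at the first window that reaches the end are assumptions, not behaviour confirmed by the source.

```ruby
# Hypothetical helper: windows over whitespace-delimited tokens.
def sketch_window_tokens(text, size:, overlap:)
  raise ArgumentError, "overlap must be < size" unless overlap < size

  tokens = text.split # with no argument, splits on runs of whitespace
  stride = size - overlap
  chunks = []
  start = 0
  while start < tokens.length
    # Assumption: tokens are re-joined with one space, so the original
    # whitespace (newlines, tabs) is not preserved.
    chunks << tokens[start, size].join(" ")
    break if start + size >= tokens.length

    start += stride
  end
  chunks
end

text = "the quick brown fox jumps over the lazy dog"
sketch_window_tokens(text, size: 4, overlap: 2).each { |chunk| puts chunk }
# prints:
#   the quick brown fox
#   brown fox jumps over
#   jumps over the lazy
#   the lazy dog
```

Because these are whitespace tokens and not model tokens, a caller targeting a provider limit would likely want to choose size with some headroom.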

Constructor Details

#initialize(size: 800, overlap: 100, by: :chars, max_chunks_per_document: 200) ⇒ FixedSizeOverlap

Returns a new instance of FixedSizeOverlap.

Parameters:

  • size (Integer) (defaults to: 800)

    window width (> 0).

  • overlap (Integer) (defaults to: 100)

    shared units between windows (0 <= overlap < size).

  • by (Symbol) (defaults to: :chars)

    :chars (default) or :tokens.

  • max_chunks_per_document (Integer) (defaults to: 200)

    cap (> 0, default 200).

Raises:

  • (ArgumentError)

    on any out-of-range argument. In particular, overlap >= size is refused: the stride size - overlap would then be zero or negative, so the window would never advance and the chunker would loop forever.



# File 'lib/parse/retrieval/chunker.rb', line 130

def initialize(size: 800, overlap: 100, by: :chars, max_chunks_per_document: 200)
  unless size.is_a?(Integer) && size > 0
    raise ArgumentError, "size must be a positive Integer (got #{size.inspect})."
  end
  unless overlap.is_a?(Integer) && overlap >= 0
    raise ArgumentError, "overlap must be a non-negative Integer (got #{overlap.inspect})."
  end
  if overlap >= size
    raise ArgumentError,
          "overlap (#{overlap}) must be strictly less than size (#{size}); " \
          "a stride of size - overlap <= 0 would never advance."
  end
  unless %i[chars tokens].include?(by)
    raise ArgumentError, "by must be :chars or :tokens (got #{by.inspect})."
  end
  unless max_chunks_per_document.is_a?(Integer) && max_chunks_per_document > 0
    raise ArgumentError,
          "max_chunks_per_document must be a positive Integer " \
          "(got #{max_chunks_per_document.inspect})."
  end
  @size = size
  @overlap = overlap
  @by = by
  @max_chunks_per_document = max_chunks_per_document
  @stride = size - overlap
end

Instance Attribute Details

#by ⇒ Symbol (readonly)

Returns :chars or :tokens.

Returns:

  • (Symbol)

    :chars or :tokens.



# File 'lib/parse/retrieval/chunker.rb', line 118

def by
  @by
end

#max_chunks_per_document ⇒ Integer (readonly)

Returns hard cap on chunks emitted per document.

Returns:

  • (Integer)

    hard cap on chunks emitted per document.



# File 'lib/parse/retrieval/chunker.rb', line 120

def max_chunks_per_document
  @max_chunks_per_document
end

#overlap ⇒ Integer (readonly)

Returns units shared between consecutive windows.

Returns:

  • (Integer)

    units shared between consecutive windows.



# File 'lib/parse/retrieval/chunker.rb', line 116

def overlap
  @overlap
end

#size ⇒ Integer (readonly)

Returns window width in by: units.

Returns:

  • (Integer)

    window width in by: units.



# File 'lib/parse/retrieval/chunker.rb', line 114

def size
  @size
end

Instance Method Details

#chunk(text) ⇒ Array<String>

Returns chunks (capped at #max_chunks_per_document). [] for blank input.

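Parameters:

  • text

    the text to split into chunks.

Returns:

  • (Array<String>)

    chunks (capped at #max_chunks_per_document). [] for blank input.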


# File 'lib/parse/retrieval/chunker.rb', line 160

def chunk(text)
  chunk_with_meta(text)[:chunks]
end

#chunk_with_meta(text) ⇒ Hash

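Returns the chunks together with truncation metadata. This override of the Base implementation (which does NOT cap and reports the chunk list as produced) enforces max_chunks_per_document: when windowing yields more chunks than the cap, only the first max_chunks_per_document are returned, truncated is true, and total_before_truncation holds the pre-cap count. Blank input yields no chunks, truncated: false, and a count of 0. In this class #chunk delegates to this method and returns only the :chunks entry.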

Parameters:

  • text

    the text to split into chunks.

Returns:

  • (Hash)

    { chunks: Array<String>, truncated: Boolean, total_before_truncation: Integer }.



# File 'lib/parse/retrieval/chunker.rb', line 165

def chunk_with_meta(text)
  source = normalize(text)
  return { chunks: [], truncated: false, total_before_truncation: 0 } if source.nil?

  all = (@by == :tokens) ? window_tokens(source) : window_chars(source)
  total = all.length
  if total > @max_chunks_per_document
    { chunks: all.first(@max_chunks_per_document),
      truncated: true,
      total_before_truncation: total }
  else
    { chunks: all, truncated: false, total_before_truncation: total }
  end
end
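A caller will usually want to act on the truncated flag instead of silently indexing a partial document. The sketch below shows one way to do that. It uses a hand-written stand-in for the returned Hash with made-up values, and the helper sketch_truncation_note is hypothetical.

```ruby
# Stand-in for the Hash returned by #chunk_with_meta; the values are
# invented for illustration only.
meta = { chunks: %w[a b c], truncated: true, total_before_truncation: 7 }

# Hypothetical helper: describes how much of the document was dropped,
# or returns nil when nothing was truncated.
def sketch_truncation_note(meta)
  return nil unless meta[:truncated]

  kept = meta[:chunks].length
  total = meta[:total_before_truncation]
  "kept #{kept} of #{total} chunks (#{total - kept} dropped)"
end

puts sketch_truncation_note(meta) # prints "kept 3 of 7 chunks (4 dropped)"
```

Whether to log such a note, raise, or re-chunk with a larger size is a policy decision that the chunker leaves to the caller.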