Class: Parse::Retrieval::Chunker::FixedSizeOverlap
Defined in: lib/parse/retrieval/chunker.rb
Overview
Fixed-size sliding-window chunker with overlap.
Splits text into windows of size units, advancing by
size - overlap each step so consecutive chunks share overlap
units of context. by: :chars (default) counts characters;
by: :tokens counts whitespace-delimited tokens (a cheap
approximation — there is no model tokenizer here; see the
:tokens note below).
    c = Parse::Retrieval::Chunker::FixedSizeOverlap.new(size: 800, overlap: 100)
    c.chunk(long_text) #=> ["…800 chars…", "…overlap+800…", …]
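The stride arithmetic described above can be illustrated with a minimal sketch. The helper name sliding_windows is hypothetical and this is not the library's own windowing code (which is not shown on this page); it assumes windows start every size - overlap characters and that the final window simply holds whatever remains.

```ruby
# Hypothetical illustration of fixed-size windows with overlap.
# Not the library's implementation; names and tail handling are assumptions.
def sliding_windows(text, size:, overlap:)
  raise ArgumentError, "overlap must be less than size" unless overlap < size

  stride = size - overlap # how far each window advances
  windows = []
  start = 0
  while start < text.length
    windows << text[start, size]
    # Stop once this window has reached the end of the text.
    break if start + size >= text.length

    start += stride
  end
  windows
end

p sliding_windows("abcdefghij", size: 4, overlap: 1)
# prints ["abcd", "defg", "ghij"]
```

Each window repeats the last character of the previous one, which is the overlap of 1 in this example.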
== Amplification cap
max_chunks_per_document (default 200) bounds how many chunks a
single document can yield. Beyond the cap the chunker
truncates — it returns the first max_chunks_per_document
chunks rather than raising — and #chunk_with_meta reports
truncated: true. This is the DoS guard: a 10 MB field at
800-char windows would otherwise yield roughly 12,500 chunks
even with no overlap, and more once overlap shortens the stride.
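To make the amplification figure concrete, the following sketch estimates how many windows a document of a given length would produce. The helper estimated_chunk_count is hypothetical, and it assumes 10 MB means 10,000,000 single-byte characters and that windows start every size - overlap units.

```ruby
# Hypothetical estimator, not part of the library.
# Assumes one window per stride, with the last window covering the tail.
def estimated_chunk_count(length, size:, overlap:)
  return 0 if length <= 0
  return 1 if length <= size

  stride = size - overlap
  # First window, plus enough strides to cover the remainder (ceiling division).
  1 + (length - size + stride - 1) / stride
end

ten_mb = 10_000_000
puts estimated_chunk_count(ten_mb, size: 800, overlap: 0)   # prints 12500
puts estimated_chunk_count(ten_mb, size: 800, overlap: 100) # prints 14286

# With the default cap of 200, only a small fraction would be kept.
puts [estimated_chunk_count(ten_mb, size: 800, overlap: 100), 200].min # prints 200
```

Under these assumptions the default overlap of 100 raises the uncapped count to about 14,300, so the cap matters even more than the no-overlap figure suggests.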
== :tokens
by: :tokens treats size/overlap as literal whitespace-token
counts supplied by the caller. The chunker does NOT consult an
embedding provider's max_input_tokens; that hint is the
caller's concern (see Parse::Retrieval.retrieve). The chunker
always does exactly what it was constructed with and never
silently switches modes.
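As a rough illustration of what whitespace-token counting means, the sketch below windows over the result of String#split. The helper token_windows is hypothetical; rejoining tokens with a single space is an assumption, and the library's own token windowing may treat whitespace differently.

```ruby
# Hypothetical sketch of whitespace-token windowing, not the library's code.
def token_windows(text, size:, overlap:)
  raise ArgumentError, "overlap must be less than size" unless overlap < size

  tokens = text.split # splits on runs of whitespace, dropping empty tokens
  stride = size - overlap
  windows = []
  start = 0
  while start < tokens.length
    # Assumption: tokens are rejoined with single spaces.
    windows << tokens[start, size].join(" ")
    break if start + size >= tokens.length

    start += stride
  end
  windows
end

p token_windows("the quick brown fox jumps over the lazy dog", size: 4, overlap: 2)
# prints ["the quick brown fox", "brown fox jumps over", "jumps over the lazy", "the lazy dog"]
```

A whitespace token is not a model token, so a caller targeting a provider limit would likely want to pick size conservatively.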
Instance Attribute Summary
- #by ⇒ Symbol (readonly)
  :chars or :tokens.
- #max_chunks_per_document ⇒ Integer (readonly)
  Hard cap on chunks emitted per document.
- #overlap ⇒ Integer (readonly)
  Units shared between consecutive windows.
- #size ⇒ Integer (readonly)
  Window width in by: units.
Instance Method Summary
- #chunk(text) ⇒ Array<String>
  Chunks (capped at #max_chunks_per_document).
- #chunk_with_meta(text) ⇒ Hash
  Wrap #chunk with truncation metadata.
- #initialize(size: 800, overlap: 100, by: :chars, max_chunks_per_document: 200) ⇒ FixedSizeOverlap (constructor)
  A new instance of FixedSizeOverlap.
Constructor Details
#initialize(size: 800, overlap: 100, by: :chars, max_chunks_per_document: 200) ⇒ FixedSizeOverlap
Returns a new instance of FixedSizeOverlap.
    # File 'lib/parse/retrieval/chunker.rb', line 130

    def initialize(size: 800, overlap: 100, by: :chars, max_chunks_per_document: 200)
      unless size.is_a?(Integer) && size > 0
        raise ArgumentError, "size must be a positive Integer (got #{size.inspect})."
      end
      unless overlap.is_a?(Integer) && overlap >= 0
        raise ArgumentError, "overlap must be a non-negative Integer (got #{overlap.inspect})."
      end
      if overlap >= size
        raise ArgumentError, "overlap (#{overlap}) must be strictly less than size (#{size}); " \
                             "a stride of size - overlap <= 0 would never advance."
      end
      unless %i[chars tokens].include?(by)
        raise ArgumentError, "by must be :chars or :tokens (got #{by.inspect})."
      end
      unless max_chunks_per_document.is_a?(Integer) && max_chunks_per_document > 0
        raise ArgumentError, "max_chunks_per_document must be a positive Integer " \
                             "(got #{max_chunks_per_document.inspect})."
      end
      @size = size
      @overlap = overlap
      @by = by
      @max_chunks_per_document = max_chunks_per_document
      @stride = size - overlap
    end
Instance Attribute Details
#by ⇒ Symbol (readonly)
Returns :chars or :tokens.
    # File 'lib/parse/retrieval/chunker.rb', line 118

    def by
      @by
    end
#max_chunks_per_document ⇒ Integer (readonly)
Returns the hard cap on chunks emitted per document.

    # File 'lib/parse/retrieval/chunker.rb', line 120

    def max_chunks_per_document
      @max_chunks_per_document
    end
#overlap ⇒ Integer (readonly)
Returns the units shared between consecutive windows.

    # File 'lib/parse/retrieval/chunker.rb', line 116

    def overlap
      @overlap
    end
#size ⇒ Integer (readonly)
Returns the window width in by: units.

    # File 'lib/parse/retrieval/chunker.rb', line 114

    def size
      @size
    end
Instance Method Details
#chunk(text) ⇒ Array<String>
Returns the chunks, capped at #max_chunks_per_document. Returns [] for blank input.
    # File 'lib/parse/retrieval/chunker.rb', line 160

    def chunk(text)
      chunk_with_meta(text)[:chunks]
    end
#chunk_with_meta(text) ⇒ Hash
Wraps #chunk with truncation metadata. This override enforces the
max_chunks_per_document cap and returns a Hash with three keys:
chunks (at most max_chunks_per_document entries), truncated (true
when the cap was hit), and total_before_truncation (the pre-cap
chunk count). The default chunker implementation does NOT cap; it
reports the chunk list as produced.
    # File 'lib/parse/retrieval/chunker.rb', line 165

    def chunk_with_meta(text)
      source = normalize(text)
      return { chunks: [], truncated: false, total_before_truncation: 0 } if source.nil?
      all = (@by == :tokens) ? window_tokens(source) : window_chars(source)
      total = all.length
      if total > @max_chunks_per_document
        {
          chunks: all.first(@max_chunks_per_document),
          truncated: true,
          total_before_truncation: total
        }
      else
        { chunks: all, truncated: false, total_before_truncation: total }
      end
    end
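A caller will usually want to notice when the cap was hit. The sketch below shows one way to read the returned Hash. The helper describe_chunking is hypothetical, and the sample Hash is hand-built to match the keys shown in the source above rather than produced by the library.

```ruby
# Hypothetical caller-side helper; only the Hash keys come from the source.
def describe_chunking(meta)
  kept = meta[:chunks].length
  if meta[:truncated]
    dropped = meta[:total_before_truncation] - kept
    "kept #{kept} of #{meta[:total_before_truncation]} chunks (#{dropped} dropped by the cap)"
  else
    "kept all #{kept} chunks"
  end
end

# Hand-built sample in the same shape as the Hash returned by #chunk_with_meta.
sample = { chunks: %w[a b], truncated: true, total_before_truncation: 5 }
puts describe_chunking(sample)
# prints "kept 2 of 5 chunks (3 dropped by the cap)"
```

Logging the dropped count is one option; a caller could equally decide to reject or split oversized documents before chunking.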