Class: Pikuri::VectorDb::Chunker::FixedWindow
- Inherits: Object
  - Object
  - Pikuri::VectorDb::Chunker::FixedWindow
- Defined in: lib/pikuri/vector_db/chunker/fixed_window.rb
Overview
Sliding-window chunker. Splits text on whitespace into words, then walks forward, emitting chunks of approximately size tokens. Each chunk after the first repeats the last overlap tokens of the previous chunk.
Why whitespace-split rather than a richer tokenization
Whitespace gives word boundaries for English and other Western European languages, which is good enough for v1 against any text the TextExtractor produces (Markdown, PDF, and HTML are all flattened to whitespace-separated prose). CJK languages, which do not separate words with whitespace, degrade to one-huge-unit chunks. This is a documented limitation, to be picked up if a real CJK use case arrives.
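The limitation is easy to see in plain Ruby. This is a minimal sketch that uses only String#split with no arguments, the same call that appears in #chunk; the sample strings are illustrative and not taken from the library.

```ruby
# String#split with no arguments splits on runs of whitespace and
# ignores leading and trailing whitespace.
english = "The quick  brown\nfox"
cjk     = "これは日本語の文章です" # no whitespace between words

english_words = english.split
cjk_words     = cjk.split

p english_words    # four separate words
p cjk_words.length # the whole sentence is a single unit
```

Because the chunker can only cut between units, a sentence that arrives as a single unit ends up whole in one chunk regardless of size.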
How overlap works
Each chunk after the first re-includes the tail of the previous chunk, sized to overlap tokens. This gives the embedder context across chunk boundaries: an answer that straddles two sliding-window positions stays intact in at least one chunk. Standard RAG practice; LangChain / LlamaIndex / opencode all do the same.
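As an illustration of the overlap idea only, here is a simplified sketch in which one word stands in for one token. The helper name sliding_windows is hypothetical, and the real class measures chunks with a tokenizer rather than by word count.

```ruby
# Hypothetical helper: windows of `size` words, where each window after the
# first re-includes the last `overlap` words of the previous window.
# Assumption: one word counts as one token.
def sliding_windows(words, size:, overlap:)
  raise ArgumentError, "overlap must be less than size" if overlap >= size
  return [] if words.empty?

  step = size - overlap # how far the window moves each iteration
  chunks = []
  start = 0
  loop do
    finish = [start + size, words.length].min
    chunks << words[start...finish].join(' ')
    break if finish >= words.length

    start += step
  end
  chunks
end

windows = sliding_windows(%w[a b c d e f g h], size: 4, overlap: 2)
p windows # => ["a b c d", "c d e f", "e f g h"]
```

The pair "c d" appears in both the first and second window, so a phrase spanning the boundary after "d" is still intact in at least one chunk.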
Per-chunk tokenizer cost
The greedy “add one word, check token count, repeat” algorithm calls tokenizer.count once per word per candidate chunk, for O(n_words * chunks_in_text) calls. For Tokenizer::CharHeuristic this is negligible. For Tokenizer::LlamaServer each call is one HTTP round-trip, so indexing a large corpus takes minutes rather than seconds. This is a one-time cost paid at boot or on explicit reindex, which is acceptable for v1. A doubling-search plus binary-refine variant is the obvious optimization if the cost bites; the protocol stays the same.
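A rough sketch of what a doubling-search plus binary-refine variant might look like, compared against the greedy approach. Only tokenizer.count(text) comes from the description above. The helper names, the CountingTokenizer class, and its four-characters-per-token heuristic are assumptions made for illustration. The sketch also assumes the token count never decreases as words are added.

```ruby
# Hypothetical tokenizer: roughly four characters per token (an assumption),
# and it records how many times count is called.
class CountingTokenizer
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def count(text)
    @calls += 1
    (text.length / 4.0).ceil
  end
end

# Greedy reference: add one word, check the token count, repeat.
def find_chunk_end_linear(words, start, size, tokenizer)
  finish = start + 1 # always take at least one word
  while finish < words.length &&
        tokenizer.count(words[start..finish].join(' ')) <= size
    finish += 1
  end
  finish
end

# Doubling search to bracket the boundary, then binary search inside it.
def find_chunk_end_fast(words, start, size, tokenizer)
  fits = ->(finish) { tokenizer.count(words[start...finish].join(' ')) <= size }

  lo = start + 1 # largest end known to fit (one word is the forced minimum)
  hi = nil       # smallest end known not to fit
  span = 1
  while lo < words.length
    span *= 2
    candidate = [start + span, words.length].min
    if fits.call(candidate)
      lo = candidate
    else
      hi = candidate
      break
    end
  end
  return lo if hi.nil? # everything up to the end of the text fits

  while hi - lo > 1
    mid = (lo + hi) / 2
    fits.call(mid) ? lo = mid : hi = mid
  end
  lo
end

words = Array.new(200, "word")
linear_tok = CountingTokenizer.new
fast_tok   = CountingTokenizer.new
linear_end = find_chunk_end_linear(words, 0, 50, linear_tok)
fast_end   = find_chunk_end_fast(words, 0, 50, fast_tok)

puts "linear: end=#{linear_end}, calls=#{linear_tok.calls}"
puts "fast:   end=#{fast_end}, calls=#{fast_tok.calls}"
```

Both helpers return the same chunk end, but the second needs a number of tokenizer calls that grows logarithmically with the chunk length rather than linearly, which matters most when each call is an HTTP round-trip.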
Forward-progress guard
The constructor rejects overlap >= size to keep the sliding window actually moving forward. Even with valid inputs, the inner loop always advances by at least one word, which guarantees termination even if a pathological tokenizer reports misleading counts.
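One way such a guard could be written is sketched below. The real find_next_start helper is not shown on this page, so next_start_with_guard and ZeroTokenizer are hypothetical names and this is not the library's implementation. The tokenizer here always reports zero tokens, which would let an unguarded rewind fall all the way back to the previous start.

```ruby
# Hypothetical pathological tokenizer: claims every text costs zero tokens.
class ZeroTokenizer
  def count(_text)
    0
  end
end

# Hypothetical helper: rewind from `finish` to re-include up to `overlap`
# tokens of tail, but never return an index at or before `start`.
def next_start_with_guard(words, start, finish, overlap, tokenizer)
  candidate = finish
  while candidate > start + 1 &&
        tokenizer.count(words[(candidate - 1)...finish].join(' ')) <= overlap
    candidate -= 1
  end
  candidate
end

words = %w[a b c d e]
tokenizer = ZeroTokenizer.new
starts = []
start = 0
while start < words.length
  starts << start
  finish = [start + 2, words.length].min # fixed two-word chunks, for illustration
  break if finish >= words.length

  start = next_start_with_guard(words, start, finish, 1, tokenizer)
end

p starts # => [0, 1, 2, 3]
```

The `candidate > start + 1` condition is what keeps each new start strictly greater than the last, so the outer loop finishes in at most words.length iterations whatever the tokenizer reports.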
Instance Attribute Summary
- #overlap ⇒ Integer (readonly)
  Tokens of overlap between adjacent chunks.
- #size ⇒ Integer (readonly)
  Target token count per chunk.
Instance Method Summary
- #chunk(text) ⇒ Array<String>
  Chunk text into approximately size-token windows with overlap-token tail repeats.
- #initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new) ⇒ FixedWindow (constructor)
Constructor Details
#initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new) ⇒ FixedWindow
    # File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 71

    def initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new)
      raise ArgumentError, "size must be positive (got #{size})" if size <= 0
      raise ArgumentError, "overlap must be >= 0 (got #{overlap})" if overlap.negative?
      if overlap >= size
        raise ArgumentError,
              "overlap (#{overlap}) must be strictly less than size (#{size}) " \
              "— the sliding window would not advance"
      end

      @size = size
      @overlap = overlap
      @tokenizer = tokenizer
    end
Instance Attribute Details
#overlap ⇒ Integer (readonly)
Returns tokens of overlap between adjacent chunks.
    # File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 56

    def overlap
      @overlap
    end
#size ⇒ Integer (readonly)
Returns target token count per chunk.
    # File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 52

    def size
      @size
    end
Instance Method Details
#chunk(text) ⇒ Array<String>
Chunk text into approximately size-token windows with overlap-token tail repeats. Empty / whitespace-only input returns [].
    # File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 92

    def chunk(text)
      words = text.split
      return [] if words.empty?

      chunks = []
      start = 0
      while start < words.length
        finish = find_chunk_end(words, start)
        chunks << words[start...finish].join(' ')
        break if finish >= words.length

        start = find_next_start(words, start, finish)
      end
      chunks
    end