Class: Pikuri::VectorDb::Chunker::FixedWindow

Inherits: Object

Defined in: lib/pikuri/vector_db/chunker/fixed_window.rb

Overview

Sliding-window chunker. Splits text on whitespace into words, then walks forward emitting chunks of approximately size tokens, each one starting with roughly overlap tokens repeated from the tail of the previous chunk.

Why whitespace-split rather than a richer tokenization

Whitespace gives word boundaries for English and Western European languages, which is good enough for v1 against any text the TextExtractor produces (Markdown, PDF, and HTML are all flattened to whitespace-separated prose). CJK languages (no whitespace) degrade to one-huge-unit chunks; this is a documented limitation, to be picked up if a real CJK use case arrives.
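
To make the CJK limitation concrete, here is a small illustration using plain String#split, the same call the #chunk source uses. The sample strings are made up for the example.

```ruby
# String#split with no arguments splits on runs of whitespace.
english = 'the quick brown fox'
cjk = '吾輩は猫である名前はまだ無い' # no whitespace between words

english_words = english.split
cjk_words = cjk.split

puts english_words.length # 4 separate units to window over
puts cjk_words.length     # 1 unit: the whole string becomes a single "word"
```

Because a chunk always contains at least one word, a whitespace-free text ends up as a single chunk no matter what size is set to.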

How overlap works

Each chunk after the first re-includes the tail of the previous chunk, sized to overlap tokens. This gives the embedder context across chunk boundaries: an answer that straddles two sliding-window positions stays intact in at least one chunk. Standard RAG practice; LangChain / LlamaIndex / opencode all do the same.
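
The overlap idea can be sketched in a few lines. This is a simplified illustration, not the class's implementation: it assumes every word is exactly one token, and the sliding_windows helper is a hypothetical name introduced only for this example.

```ruby
# Simplified sliding window: assumes one token per word (illustration only).
def sliding_windows(words, size:, overlap:)
  raise ArgumentError, 'overlap must be < size' if overlap >= size
  return [] if words.empty?

  step = size - overlap # how far the window start moves each time
  windows = []
  start = 0
  loop do
    windows << words[start, size].join(' ')
    break if start + size >= words.length # this window reached the end

    start += step
  end
  windows
end

windows = sliding_windows(%w[a b c d e f g h], size: 4, overlap: 1)
puts windows.inspect
# => ["a b c d", "d e f g", "g h"]
```

Each window after the first starts with the last word of the previous one, so a phrase such as "d e" that crosses the first boundary still appears whole in the second window.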

Per-chunk tokenizer cost

The greedy “add one word, check token count, repeat” algorithm calls tokenizer.count once per word per candidate chunk — O(n_words * chunks_in_text) calls. For Tokenizer::CharHeuristic this is negligible. For Tokenizer::LlamaServer this is one HTTP round-trip per call; indexing a large corpus takes minutes-not-seconds. One-time cost paid at boot or explicit reindex; acceptable for v1. A doubling-search + binary-refine variant is the obvious optimization if it bites; the protocol stays the same.
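
A rough sketch of what the doubling-search plus binary-refine variant might look like follows. It is not part of the class: find_end_fast and CountingTokenizer are hypothetical names, the tokenizer counts one token per word, and the search assumes the token count never decreases as words are appended.

```ruby
# Hypothetical tokenizer: one token per word, and it records how many
# times #count was called so the cost is visible.
class CountingTokenizer
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def count(text)
    @calls += 1
    text.split.length
  end
end

# Returns the exclusive end index of the largest window starting at `start`
# that fits in `size` tokens. Always includes at least one word.
def find_end_fast(words, start, size, tokenizer)
  fits = ->(finish) { tokenizer.count(words[start...finish].join(' ')) <= size }

  lo = start + 1 # accepted unconditionally (forward progress)
  hi = nil       # smallest end index known not to fit
  span = 2
  # Doubling phase: grow the window until it overflows or the text ends.
  while lo < words.length
    candidate = [start + span, words.length].min
    if fits.call(candidate)
      lo = candidate
      span *= 2
    else
      hi = candidate
      break
    end
  end
  return lo if hi.nil?

  # Binary refine: `lo` fits, `hi` does not; narrow until they are adjacent.
  while hi - lo > 1
    mid = (lo + hi) / 2
    if fits.call(mid)
      lo = mid
    else
      hi = mid
    end
  end
  lo
end

tokenizer = CountingTokenizer.new
words = Array.new(100) { |i| "w#{i}" }
puts find_end_fast(words, 0, 10, tokenizer) # 10
puts tokenizer.calls                        # 7
```

In this toy setup a 10-token window is found with seven count calls, and the number of calls grows logarithmically with the window size rather than linearly. With a tokenizer that costs one HTTP round-trip per call, that difference is where the savings would come from.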

Forward-progress guard

The constructor rejects overlap >= size to keep the sliding window actually moving forward. Even with valid inputs, the inner loop always advances by at least one word — guarantees termination even if a pathological tokenizer were to report misleading counts.

Instance Attribute Summary

#overlap ⇒ Integer readonly

Tokens of overlap between adjacent chunks.

#size ⇒ Integer readonly

Target token count per chunk.

Instance Method Summary

#chunk(text) ⇒ Array<String>

Chunk text into approximately size-token windows with overlap-token tail repeats.

#initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new) ⇒ FixedWindow (constructor)

Constructor Details

#initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new) ⇒ `FixedWindow`

Parameters:

size (Integer): target token count per chunk. Must be positive. Common values: 256, 512, 1024; pick to match the embedder’s context (e.g. 512 for bge-small-en-v1.5).

overlap (Integer) (defaults to: 0): tokens of overlap between adjacent chunks. Must be >= 0 and strictly less than size. A common choice is about 10% of size.

tokenizer (#count) (defaults to: Tokenizer::CharHeuristic.new): a Tokenizer, i.e. anything responding to count(text) -> Integer. Defaults to Tokenizer::CharHeuristic (zero-dependency, roughly 4 characters per token).

Raises:

(ArgumentError): on invalid size or overlap.
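
Since the tokenizer parameter is duck-typed, a custom tokenizer only needs a count(text) method that returns an Integer. The class below is a hypothetical stand-in written for this example. It is not the source of Tokenizer::CharHeuristic, and the rounding rule (ceiling of length / 4) is an assumption.

```ruby
# Hypothetical tokenizer: estimates tokens as roughly one per 4 characters.
# Not the actual Tokenizer::CharHeuristic; the rounding is assumed.
class ApproxCharTokenizer
  CHARS_PER_TOKEN = 4.0

  # @param text [String]
  # @return [Integer] estimated token count
  def count(text)
    (text.length / CHARS_PER_TOKEN).ceil
  end
end

tokenizer = ApproxCharTokenizer.new
puts tokenizer.count('abcdefgh') # 2
puts tokenizer.count('abcde')    # 2 (rounds up)
puts tokenizer.count('')         # 0

# Assumed usage, following the constructor signature documented here:
#   Pikuri::VectorDb::Chunker::FixedWindow.new(size: 512, overlap: 50, tokenizer: tokenizer)
```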

# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 71

def initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new)
  raise ArgumentError, "size must be positive (got #{size})" if size <= 0
  raise ArgumentError, "overlap must be >= 0 (got #{overlap})" if overlap.negative?
  if overlap >= size
    raise ArgumentError,
          "overlap (#{overlap}) must be strictly less than size (#{size}) " \
          "— the sliding window would not advance"
  end

  @size = size
  @overlap = overlap
  @tokenizer = tokenizer
end

Instance Attribute Details

#overlap ⇒ `Integer` (readonly)

Returns tokens of overlap between adjacent chunks.

Returns:

(Integer): tokens of overlap between adjacent chunks.

# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 56

def overlap
  @overlap
end

#size ⇒ `Integer` (readonly)

Returns target token count per chunk.

Returns:

(Integer): target token count per chunk.

# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 52

def size
  @size
end

Instance Method Details

#chunk(text) ⇒ `Array<String>`

Chunk text into approximately size-token windows with overlap-token tail repeats. Empty / whitespace-only input returns [].

Parameters:

text (String)

Returns:

(Array<String>): the chunks, in source order. Each chunk is a non-empty string; the array itself is empty when the input has no words.

# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 92

def chunk(text)
  words = text.split
  return [] if words.empty?

  chunks = []
  start = 0
  while start < words.length
    finish = find_chunk_end(words, start)
    chunks << words[start...finish].join(' ')

    break if finish >= words.length

    start = find_next_start(words, start, finish)
  end

  chunks
end
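
The #chunk source relies on two private helpers, find_chunk_end and find_next_start, whose bodies are not shown on this page. The standalone sketch below is a guess at how they could behave, based only on the Overview. SketchFixedWindow and WordTokenizer are hypothetical names, and the helper bodies are assumptions rather than the library's code.

```ruby
# Stand-in tokenizer for the sketch: one token per word.
class WordTokenizer
  def count(text)
    text.split.length
  end
end

class SketchFixedWindow
  def initialize(size:, overlap: 0, tokenizer: WordTokenizer.new)
    raise ArgumentError, 'size must be positive' if size <= 0
    raise ArgumentError, 'overlap must be in 0...size' if overlap.negative? || overlap >= size

    @size = size
    @overlap = overlap
    @tokenizer = tokenizer
  end

  # Same control flow as the #chunk listing on this page.
  def chunk(text)
    words = text.split
    return [] if words.empty?

    chunks = []
    start = 0
    while start < words.length
      finish = find_chunk_end(words, start)
      chunks << words[start...finish].join(' ')
      break if finish >= words.length

      start = find_next_start(words, start, finish)
    end
    chunks
  end

  private

  # Assumed behavior: extend the window one word at a time while it still
  # fits in @size tokens. The first word is always taken.
  def find_chunk_end(words, start)
    finish = start + 1
    while finish < words.length &&
          @tokenizer.count(words[start..finish].join(' ')) <= @size
      finish += 1
    end
    finish
  end

  # Assumed behavior: walk back from the chunk end while the repeated tail
  # fits in @overlap tokens, but never return an index <= start.
  def find_next_start(words, start, finish)
    next_start = finish
    while next_start - 1 > start &&
          @tokenizer.count(words[(next_start - 1)...finish].join(' ')) <= @overlap
      next_start -= 1
    end
    next_start
  end
end

chunker = SketchFixedWindow.new(size: 4, overlap: 1)
puts chunker.chunk('a b c d e f g h').inspect
# => ["a b c d", "d e f g", "g h"]
```

In this sketch, find_next_start always returns an index greater than start, which is one way to get the "always advances by at least one word" property described under Forward-progress guard.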
