Class: Pikuri::VectorDb::Chunker::FixedWindow

Inherits:
Object
Defined in:
lib/pikuri/vector_db/chunker/fixed_window.rb

Overview

Sliding-window chunker. Splits text on whitespace into words, then walks forward emitting chunks of approximately size tokens with overlap tokens of repeated tail from the previous chunk.

Why whitespace-split rather than a richer tokenization

Whitespace gives word boundaries for English and other Western European languages, which is good enough for v1 against any text the TextExtractor produces (Markdown, PDF, and HTML are all flattened to whitespace-separated prose). Text in CJK languages, which have no whitespace between words, degrades to a single huge unit per run of text. This is a documented limitation, to be picked up if a real CJK use case arrives.
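To make the CJK limitation concrete, the snippet below shows what a plain String#split does to an English and a Japanese sentence. The sample strings are illustrative only and are not taken from the library's tests.

```ruby
# Plain whitespace splitting, the same call the chunker starts from.
english  = 'The quick brown fox'
japanese = '吾輩は猫である。名前はまだ無い。' # no whitespace anywhere

english_words  = english.split
japanese_words = japanese.split

puts english_words.length   # 4 separate words
puts japanese_words.length  # 1 unit: the whole sentence
```

Because the whole sentence is one unit, a window can never end inside it, so such a chunk may exceed size by a wide margin.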

How overlap works

Each chunk after the first re-includes the tail of the previous chunk, sized to overlap tokens. This gives the embedder context across chunk boundaries: an answer that straddles two sliding-window positions stays intact in at least one chunk. Standard RAG practice; LangChain / LlamaIndex / opencode all do the same.
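The overlap idea can be sketched in a few lines. This is a simplified illustration, not the class's implementation: it assumes every word costs exactly one token, whereas the real chunker asks its tokenizer. The method name sliding_windows is hypothetical.

```ruby
# Simplified illustration of overlapping windows.
# ASSUMPTION: one word == one token. The real chunker measures windows
# with a tokenizer instead.
def sliding_windows(words, size:, overlap:)
  raise ArgumentError, 'overlap must be < size' if overlap >= size
  return [] if words.empty?

  step = size - overlap # how far the window start moves each time
  chunks = []
  start = 0
  loop do
    chunks << words[start, size].join(' ')
    break if start + size >= words.length # this window reached the end

    start += step
  end
  chunks
end

windows = sliding_windows(%w[a b c d e f g h], size: 4, overlap: 2)
# => ["a b c d", "c d e f", "e f g h"]
```

Each chunk after the first starts with the last two words of the previous one, so a phrase such as "d e" that straddles the first boundary still appears whole in the second chunk.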

Per-chunk tokenizer cost

The greedy “add one word, check token count, repeat” algorithm calls tokenizer.count once per word per candidate chunk, which is O(n_words * chunks_in_text) calls. For Tokenizer::CharHeuristic this is negligible. For Tokenizer::LlamaServer each call is one HTTP round-trip, so indexing a large corpus takes minutes rather than seconds. This is a one-time cost paid at boot or on an explicit reindex, which is acceptable for v1. A doubling-search plus binary-refine variant is the obvious optimization if the cost bites; the tokenizer protocol stays the same.
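A minimal sketch of the doubling-search plus binary-refine idea mentioned above follows. It is not part of the library: the method name find_chunk_end_fast and the WordCounter tokenizer are hypothetical, and the search assumes the token count never decreases when a word is appended.

```ruby
# Hypothetical tokenizer used only for this sketch: one token per word.
class WordCounter
  def count(text)
    text.split.length
  end
end

# Returns the exclusive end index of the largest window starting at `start`
# that fits in `size` tokens. Always returns at least start + 1.
# ASSUMPTION: tokenizer.count is monotone in the number of words.
def find_chunk_end_fast(words, start, size, tokenizer)
  fits = ->(finish) { tokenizer.count(words[start...finish].join(' ')) <= size }

  lo = start + 1 # accepted end; a single word is always taken
  hi = nil       # first end known NOT to fit, once found
  step = 1

  # Doubling phase: grow the window by 1, 2, 4, ... words until it overflows.
  while lo < words.length
    candidate = [lo + step, words.length].min
    if fits.call(candidate)
      lo = candidate
      step *= 2
    else
      hi = candidate
      break
    end
  end
  return lo if hi.nil? # reached the end of the text without overflowing

  # Binary refine between the last fitting end and the first overflowing end.
  while hi - lo > 1
    mid = (lo + hi) / 2
    fits.call(mid) ? lo = mid : hi = mid
  end
  lo
end

sample = (1..20).map { |i| "w#{i}" }
fast_end = find_chunk_end_fast(sample, 0, 7, WordCounter.new) # => 7
```

This needs a logarithmic rather than linear number of count calls per window, which matters most when each call is a network round-trip.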

Forward-progress guard

The constructor rejects overlap >= size to keep the sliding window actually moving forward. Even with valid inputs, the inner loop always advances by at least one word, which guarantees termination even if a pathological tokenizer reports misleading counts.

Constructor Details

#initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new) ⇒ FixedWindow

Parameters:

  • size (Integer)

    target token count per chunk. Must be positive. Common values: 256, 512, 1024. Pick a value that matches the embedder’s context (e.g. 512 for bge-small-en-v1.5).

  • overlap (Integer) (defaults to: 0)

    tokens of overlap between adjacent chunks. Must be >= 0 and strictly less than size. Common values: ~10% of size.

  • tokenizer (#count) (defaults to: Tokenizer::CharHeuristic.new)

    a Tokenizer; anything responding to count(text) -> Integer. Defaults to Tokenizer::CharHeuristic (zero-dep, ~4-chars-per-token approximation).

Raises:

  • (ArgumentError)

    on invalid size or overlap.
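Since the tokenizer parameter is duck-typed, any object with a count(text) method returning an Integer should be accepted. The sketch below defines a hypothetical WordCountTokenizer (not shipped with the library) and, only if the library is loaded, passes it to the constructor and shows the ArgumentError raised for an invalid overlap.

```ruby
# Hypothetical tokenizer for illustration: one token per whitespace-separated
# word. It only needs to respond to count(text) and return an Integer.
class WordCountTokenizer
  def count(text)
    text.split.length
  end
end

tokenizer = WordCountTokenizer.new
token_total = tokenizer.count('one two three') # => 3

# Guarded so the snippet also runs where the library is not loaded.
if defined?(Pikuri::VectorDb::Chunker::FixedWindow)
  chunker = Pikuri::VectorDb::Chunker::FixedWindow.new(
    size: 256, overlap: 25, tokenizer: tokenizer
  )
  puts chunker.size    # 256
  puts chunker.overlap # 25

  begin
    # overlap must be strictly less than size
    Pikuri::VectorDb::Chunker::FixedWindow.new(size: 10, overlap: 10)
  rescue ArgumentError => e
    puts "rejected: #{e.message}"
  end
end
```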



# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 71

def initialize(size:, overlap: 0, tokenizer: Tokenizer::CharHeuristic.new)
  raise ArgumentError, "size must be positive (got #{size})" if size <= 0
  raise ArgumentError, "overlap must be >= 0 (got #{overlap})" if overlap.negative?
  if overlap >= size
    raise ArgumentError,
          "overlap (#{overlap}) must be strictly less than size (#{size}) " \
          "— the sliding window would not advance"
  end

  @size = size
  @overlap = overlap
  @tokenizer = tokenizer
end

Instance Attribute Details

#overlap ⇒ Integer (readonly)

Returns tokens of overlap between adjacent chunks.

Returns:

  • (Integer)

    tokens of overlap between adjacent chunks.



# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 56

def overlap
  @overlap
end

#size ⇒ Integer (readonly)

Returns target token count per chunk.

Returns:

  • (Integer)

    target token count per chunk.



# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 52

def size
  @size
end

Instance Method Details

#chunk(text) ⇒ Array<String>

Chunk text into approximately size-token windows with overlap-token tail repeats. Empty / whitespace-only input returns [].

Parameters:

  • text (String)

Returns:

  • (Array<String>)

    non-empty chunks, in source order. May be empty.



# File 'lib/pikuri/vector_db/chunker/fixed_window.rb', line 92

def chunk(text)
  words = text.split
  return [] if words.empty?

  chunks = []
  start = 0
  while start < words.length
    finish = find_chunk_end(words, start)
    chunks << words[start...finish].join(' ')

    break if finish >= words.length

    start = find_next_start(words, start, finish)
  end

  chunks
end
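The private helpers find_chunk_end and find_next_start are not shown on this page. The sketch below is one plausible reconstruction based only on the Overview text (greedy growth, overlap-sized tail, at least one word of progress). The class name SketchWindow and both helper bodies are assumptions and may differ from the real implementation.

```ruby
# Hypothetical stand-in for the real class; helper bodies are guesses.
class SketchWindow
  def initialize(size:, overlap:, tokenizer:)
    @size = size
    @overlap = overlap
    @tokenizer = tokenizer
  end

  # Same driver loop as the documented #chunk.
  def chunk(text)
    words = text.split
    return [] if words.empty?

    chunks = []
    start = 0
    while start < words.length
      finish = find_chunk_end(words, start)
      chunks << words[start...finish].join(' ')
      break if finish >= words.length

      start = find_next_start(words, start, finish)
    end
    chunks
  end

  private

  # Greedy: take one word unconditionally, then keep adding words while the
  # window including the next word still fits in @size tokens.
  def find_chunk_end(words, start)
    finish = start + 1
    while finish < words.length &&
          @tokenizer.count(words[start..finish].join(' ')) <= @size
      finish += 1
    end
    finish
  end

  # Walk back from finish while the tail fits in @overlap tokens, but never
  # reach the previous start, so each window starts later than the last.
  def find_next_start(words, start, finish)
    next_start = finish
    while next_start - 1 > start &&
          @tokenizer.count(words[(next_start - 1)...finish].join(' ')) <= @overlap
      next_start -= 1
    end
    next_start
  end
end

# Hypothetical one-token-per-word tokenizer for the demonstration.
word_tokenizer = Object.new
def word_tokenizer.count(text)
  text.split.length
end

sketch = SketchWindow.new(size: 4, overlap: 2, tokenizer: word_tokenizer)
result = sketch.chunk('a b c d e f g h')
# => ["a b c d", "c d e f", "e f g h"]
```

Because find_chunk_end returns at least start + 1 and find_next_start returns a value strictly greater than start, the loop terminates regardless of what the tokenizer reports.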