Module: Kreuzberg::ChunkSizing

Extended by:
T::Helpers, T::Sig
Included in:
ChunkSizingCharacters, ChunkSizingTokenizer
Defined in:
lib/kreuzberg/native.rb

Overview

How chunk size is measured.

Defaults to `Characters` (Unicode character count). When using token-based sizing, chunks are sized by token count according to the specified tokenizer.

Token-based sizing uses HuggingFace tokenizers loaded at runtime. Any tokenizer available on HuggingFace Hub can be used, including OpenAI-compatible tokenizers (e.g., `Xenova/gpt-4o`, `Xenova/cl100k_base`).
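To make the character-based default concrete, here is a minimal sketch using only plain Ruby string methods, not Kreuzberg's API. It assumes "Unicode character count" means code points, which this page does not specify. It shows that a character count differs from a byte count once the text contains non-ASCII characters.

```ruby
# Illustration only: plain Ruby, no Kreuzberg calls.
# Assumption: "Unicode character count" refers to code points.
text = "na\u00EFve caf\u00E9" # "naïve café" with precomposed accents

char_count = text.length    # number of Unicode code points
byte_count = text.bytesize  # number of UTF-8 bytes

puts "characters: #{char_count}, bytes: #{byte_count}"
```

A token count would differ again and depends entirely on the tokenizer chosen, so a limit expressed in characters and the same number expressed in tokens generally do not bound the same amount of text.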

Class Method Summary

Class Method Details

.from_hash(hash) ⇒ Object

# File 'lib/kreuzberg/native.rb', line 30

def self.from_hash(hash)
  discriminator = hash[:type] || hash["type"]
  case discriminator
  when "characters" then ChunkSizingCharacters.from_hash(hash)
  when "tokenizer" then ChunkSizingTokenizer.from_hash(hash)
  else raise "Unknown discriminator: #{discriminator}"
  end
end
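
The dispatch above can be tried out with a self-contained sketch. Everything in it is a stand-in so that it runs without the gem: `SketchChunkSizing`, its two `Struct` classes, and the `model` field are hypothetical names, not Kreuzberg's actual classes or attributes. Only the `case` logic mirrors the method shown above.

```ruby
# Hypothetical stand-ins; the real ChunkSizingCharacters and
# ChunkSizingTokenizer classes are not reproduced here.
module SketchChunkSizing
  Characters = Struct.new(:type)
  Tokenizer  = Struct.new(:type, :model) # :model is an assumed field name

  # Same dispatch as the documented method: look up the discriminator
  # under a symbol key first, then a string key.
  def self.from_hash(hash)
    discriminator = hash[:type] || hash["type"]
    case discriminator
    when "characters" then Characters.new(discriminator)
    when "tokenizer" then Tokenizer.new(discriminator, hash[:model] || hash["model"])
    else raise "Unknown discriminator: #{discriminator}"
    end
  end
end

# Symbol keys and string keys are both accepted.
by_symbol = SketchChunkSizing.from_hash({ type: "tokenizer", model: "Xenova/gpt-4o" })
by_string = SketchChunkSizing.from_hash({ "type" => "characters" })

# The discriminator value itself must be a String: a Symbol such as
# :characters matches neither `when` branch and falls through to `raise`.
error_message =
  begin
    SketchChunkSizing.from_hash({ type: :characters })
    nil
  rescue RuntimeError => e
    e.message
  end

puts by_symbol.inspect, by_string.inspect, error_message
```

Because the method raises with a bare string, the error is a `RuntimeError`. Callers that build the hash from user input may want to normalise the `type` value to a String before calling `from_hash`.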