Module: Kreuzberg::ChunkSizing
- Extended by: T::Helpers, T::Sig
- Included in: ChunkSizingCharacters, ChunkSizingTokenizer
- Defined in: lib/kreuzberg/native.rb
Overview
How chunk size is measured.
Defaults to `Characters` (Unicode character count). When using token-based sizing, chunks are sized by token count according to the specified tokenizer.
Token-based sizing uses HuggingFace tokenizers loaded at runtime. Any tokenizer available on HuggingFace Hub can be used, including OpenAI-compatible tokenizers (e.g., `Xenova/gpt-4o`, `Xenova/cl100k_base`).
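To make the default measure concrete, here is a minimal sketch in plain Ruby. It is not Kreuzberg's implementation: it assumes that "Unicode character count" means counting characters (code points) rather than bytes, and `char_chunks` is a hypothetical helper introduced only for illustration.

```ruby
# Hypothetical helper (not part of Kreuzberg): split text into pieces of at
# most max_chars Unicode characters, ignoring word boundaries.
def char_chunks(text, max_chars)
  raise ArgumentError, "max_chars must be positive" unless max_chars.positive?

  text.each_char.each_slice(max_chars).map(&:join)
end

# "\u00EF" and "\u00E9" are single characters that take two bytes in UTF-8.
sample = "na\u00EFve caf\u00E9"

puts sample.length    # 10 characters
puts sample.bytesize  # 12 bytes in UTF-8
p char_chunks(sample, 4)
```

With token-based sizing the same text would be measured in tokens instead, and the count would depend on the chosen tokenizer's vocabulary.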
Class Method Summary
Class Method Details
.from_hash(hash) ⇒ Object
# File 'lib/kreuzberg/native.rb', line 30

def self.from_hash(hash)
  discriminator = hash[:type] || hash["type"]
  case discriminator
  when "characters" then ChunkSizingCharacters.from_hash(hash)
  when "tokenizer" then ChunkSizingTokenizer.from_hash(hash)
  else raise "Unknown discriminator: #{discriminator}"
  end
end
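The dispatch above can be tried without the gem using the self-contained sketch below. `CharactersSizing`, `TokenizerSizing`, and `sizing_from_hash` are stand-ins invented for this example, and the `model` field is an assumption; the fields actually accepted by `ChunkSizingCharacters` and `ChunkSizingTokenizer` are not shown in this section.

```ruby
# Stand-in variant classes (hypothetical). The real variants are defined in
# lib/kreuzberg/native.rb; the :model field here is assumed for illustration.
CharactersSizing = Struct.new(:type) do
  def self.from_hash(_hash)
    new("characters")
  end
end

TokenizerSizing = Struct.new(:type, :model) do
  def self.from_hash(hash)
    new("tokenizer", hash[:model] || hash["model"])
  end
end

# Same dispatch shape as ChunkSizing.from_hash: read the discriminator from
# a symbol or string key, then delegate to the matching variant.
def sizing_from_hash(hash)
  discriminator = hash[:type] || hash["type"]
  case discriminator
  when "characters" then CharactersSizing.from_hash(hash)
  when "tokenizer" then TokenizerSizing.from_hash(hash)
  else raise "Unknown discriminator: #{discriminator}"
  end
end

p sizing_from_hash({ type: "characters" })
p sizing_from_hash({ "type" => "tokenizer", "model" => "Xenova/gpt-4o" })

# The key may be a symbol or a string, but the value must be a String:
# "characters" === :characters is false, so a symbol value is rejected.
begin
  sizing_from_hash({ type: :characters })
rescue RuntimeError => e
  puts e.message
end
```

Because the `case` compares against string literals, a hash with a missing `type` key or a symbol value falls through to the `else` branch and raises a `RuntimeError`.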