Class: SemanticTextChunker::Chunker
- Inherits: Object
  - Object
    - SemanticTextChunker::Chunker
- Defined in: lib/semantic_text_chunker/chunker.rb
Instance Method Summary
- #chunk(text) ⇒ Object
- #chunk_with_metadata(text, **metadata) ⇒ Object
- #initialize(embedder: Embedders::Null.new, threshold: 0.75, max_tokens: 512, overlap_sentences: 2, respect_structure: true, extra_abbreviations: []) ⇒ Chunker (constructor): A new instance of Chunker.
Constructor Details
#initialize(embedder: Embedders::Null.new, threshold: 0.75, max_tokens: 512, overlap_sentences: 2, respect_structure: true, extra_abbreviations: []) ⇒ Chunker
Returns a new instance of Chunker.
# File 'lib/semantic_text_chunker/chunker.rb', line 11
def initialize(
  embedder: Embedders::Null.new,
  threshold: 0.75,
  max_tokens: 512,
  overlap_sentences: 2,
  respect_structure: true,
  extra_abbreviations: []
)
  @embedder = embedder
  @threshold = threshold
  @max_tokens = max_tokens
  @overlap_sentences = overlap_sentences
  @respect_structure = respect_structure
  @splitter = Splitters::SentenceSplitter.new(extra_abbreviations: extra_abbreviations)
  @structure_splitter = Splitters::StructureSplitter.new(sentence_splitter: @splitter)
end
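The listing does not say how `threshold` is interpreted. One common reading, which is an assumption here and not confirmed by the source, is a cosine-similarity cutoff between adjacent sentence embeddings: a new chunk starts wherever similarity drops below the threshold. A minimal sketch of that idea follows; `cosine_similarity` and `similarity_boundaries` are hypothetical helpers, not part of SemanticTextChunker, and the vectors are hand-written rather than produced by an embedder.

```ruby
# Hypothetical helper (not part of the gem): cosine similarity between
# two equal-length numeric vectors. Returns 0.0 for a zero vector.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

# Hypothetical helper (assumed behaviour): returns every index i at which
# sentence i is less similar than `threshold` to sentence i - 1.
def similarity_boundaries(embeddings, threshold: 0.75)
  (1...embeddings.length).select do |i|
    cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold
  end
end

# Two sentences pointing one way, then two pointing another way.
embeddings = [
  [1.0, 0.0],
  [0.9, 0.1],
  [0.0, 1.0],
  [0.1, 0.9]
]
boundaries = similarity_boundaries(embeddings, threshold: 0.75)
# boundaries == [2]: only the pair (1, 2) falls below the cutoff.
```

Under this reading, raising `threshold` produces more, smaller chunks, and lowering it produces fewer, larger ones.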
Instance Method Details
#chunk(text) ⇒ Object
# File 'lib/semantic_text_chunker/chunker.rb', line 28
def chunk(text)
  return [] if text.nil? || text.strip.empty?

  if @respect_structure
    sentences, hard = @structure_splitter.split(text)
  else
    sentences = @splitter.split(text)
    hard = []
  end

  # The method name after `@embedder.` is missing from this listing and is left as-is.
  embeddings = @embedder.(sentences)

  boundaries = BoundaryDetector.new(
    sentences: sentences,
    embeddings: embeddings,
    threshold: @threshold,
    max_tokens: @max_tokens,
    embedder: @embedder,
    forced: hard
  ).boundaries

  ChunkBuilder.new(
    sentences: sentences,
    boundaries: boundaries,
    overlap_sentences: @overlap_sentences,
    hard_boundaries: hard
  ).build
end
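`ChunkBuilder` itself is not shown in this listing. As an illustration only, here is one way sentences and boundary indices could be assembled into chunks with a sentence overlap. `build_chunks` is a hypothetical helper, and the sketch assumes that boundaries are sentence start indices and that overlap is taken from the sentences just before each boundary. It ignores the `hard_boundaries` argument that the real class accepts.

```ruby
# Hypothetical helper (not part of the gem): split `sentences` at the given
# start indices, and prefix every chunk after the first with up to
# `overlap_sentences` sentences that immediately precede it.
def build_chunks(sentences, boundaries, overlap_sentences: 2)
  starts = ([0] + boundaries).uniq.sort
  starts.each_with_index.map do |start, i|
    stop = starts[i + 1] || sentences.length
    from = i.zero? ? start : [start - overlap_sentences, 0].max
    sentences[from...stop].join(" ")
  end
end

sentences = ["A one.", "A two.", "A three.", "B one.", "B two."]
chunks = build_chunks(sentences, [3], overlap_sentences: 1)
# chunks[0] covers sentences 0..2; chunks[1] starts one sentence early,
# so "A three." appears in both chunks.
```

Overlap of this kind trades some duplication for context: each chunk carries the tail of the previous one, which can help when chunks are retrieved in isolation.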