Class: Documentrix::Documents::Splitters::Semantic
- Inherits:
- Object
- Includes:
- Common, Utils::Math
- Defined in:
- lib/documentrix/documents/splitters/semantic.rb
Overview
Semantic splitter that divides text based on thematic changes in meaning.
It works by splitting text into sentences, computing embeddings for each, and then calculating the cosine distance between adjacent sentences. Where the distance exceeds a calculated threshold (the "breakpoint"), a semantic boundary is identified.
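The idea above can be illustrated with a minimal sketch. It is not the library's implementation: `cosine_sim`, `adjacent_distances`, `percentile`, and `breakpoints` are hypothetical helpers, the two-dimensional vectors stand in for real embeddings, and the nearest-rank percentile is only one assumed way of deriving a threshold.

```ruby
# Cosine similarity of two equal-length numeric arrays (local stand-in helper).
def cosine_sim(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Cosine distance (1 - similarity) between each pair of adjacent vectors.
def adjacent_distances(vectors)
  vectors.each_cons(2).map { |a, b| 1.0 - cosine_sim(a, b) }
end

# Nearest-rank percentile; an assumed threshold rule for illustration only.
def percentile(values, pct)
  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Indices i where the distance between item i and item i + 1 exceeds the threshold.
def breakpoints(vectors, pct: 95)
  distances = adjacent_distances(vectors)
  threshold = percentile(distances, pct)
  distances.each_index.select { |i| distances[i] > threshold }
end

# Toy "embeddings": the first two point one way, the last two another.
vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
p breakpoints(vectors, pct: 50)
# => [1]
```

Here the only large distance lies between the second and third vectors, so index 1 is reported as the semantic boundary.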
Constant Summary
- DEFAULT_SEPARATOR =
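The default regex used to identify sentence boundaries for semantic splitting. It matches a punctuation mark followed by optional whitespace, and then requires a word boundary or the end of the string. Besides the sentence-ending marks (., !, ?), the character class also contains the comma and the semicolon, so clause boundaries match as well.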
/[.!?,;]\s*(?:\b|\z)/
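As a quick check of what this pattern matches, the sketch below copies the regex into a local constant (`SEPARATOR`, a name chosen here only for illustration) and scans two sample strings with it.

```ruby
# Local copy of the documented pattern, for experimentation only.
SEPARATOR = /[.!?,;]\s*(?:\b|\z)/

# Each punctuation mark plus its trailing whitespace is one match.
p "Hello, world. How are you?".scan(SEPARATOR)
# => [", ", ". ", "?"]

# A dot directly followed by a digit also sits at a word boundary.
p "Pi is 3.14!".scan(SEPARATOR)
# => [".", "!"]
```

The second call shows that the pattern also matches the dot inside a decimal number, which may be worth keeping in mind for numeric text.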
Instance Method Summary
- #initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false) ⇒ Semantic (constructor)
Initializes a new Semantic splitter.
- #split(text, batch_size: 100, breakpoint: :percentile, **opts) ⇒ Array<String>
Splits the given text into semantic chunks.
Methods included from Utils::Math
#convert_to_vector, #cosine_similarity, #norm
Constructor Details
#initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false) ⇒ Semantic
Initializes a new Semantic splitter.
    # File 'lib/documentrix/documents/splitters/semantic.rb', line 35
    def initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false)
      @ollama, @model, @model_options, @separator, @chunk_size, @force =
        ollama, model, model_options, separator, chunk_size, force
    end
Instance Method Details
#split(text, batch_size: 100, breakpoint: :percentile, **opts) ⇒ Array<String>
Splits the given text into semantic chunks.
The method first decomposes the text into sentences, then identifies gaps in semantic similarity. It then groups these sentences into chunks that respect both the semantic boundaries and the maximum chunk size.
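The grouping step can be sketched in isolation as follows. This is a simplified illustration under stated assumptions, not the method's actual code: `group_sentences` is a hypothetical helper, `gaps` is assumed to hold the index of the last sentence in each semantic segment, and the extra splitting of oversized chunks that the real method delegates to `force_split` is left out.

```ruby
# Groups sentences into chunks. A chunk is closed at every semantic gap and
# whenever adding the next sentence would reach max_size characters.
def group_sentences(sentences, gaps, max_size:)
  chunks = []
  current = +''
  start = 0
  # Make sure the final sentence always closes the last segment.
  boundaries = (gaps + [sentences.size - 1]).uniq.sort
  boundaries.each do |gap|
    start.upto(gap) do |i|
      sentence = sentences[i]
      if current.size + sentence.size < max_size
        current << sentence
      else
        chunks << current unless current.empty?
        current = sentence.dup # dup so the input strings are never mutated
      end
    end
    chunks << current unless current.empty?
    current = +''
    start = gap + 1
  end
  chunks
end

sentences = ["A. ", "B. ", "C. ", "D."]
p group_sentences(sentences, [1], max_size: 100)
# => ["A. B. ", "C. D."]
```

With a generous `max_size`, the single gap after the second sentence yields two chunks; a smaller limit such as `max_size: 5` would force every sentence into its own chunk.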
    # File 'lib/documentrix/documents/splitters/semantic.rb', line 58
    def split(text, batch_size: 100, breakpoint: :percentile, **opts)
      sentences = Documentrix::Documents::Splitters::Character.new(
        separator: @separator,
        include_separator: opts.fetch(:include_separator, true),
        chunk_size: 1,
      ).split(text)
      … = sentences.…(label: 'Split').each_slice(batch_size).inject([]) do |e, batch|
        e.concat …(batch)
        ….progress by: batch.size
        e
      end
      ….newline
      ….size < 2 and return sentences
      distances = ….each_cons(2).map do |a, b|
        1.0 - cosine_similarity(a:, b:)
      end
      max_distance = calculate_breakpoint_threshold(breakpoint, distances, **opts)
      gaps = distances.each_with_index.select do |d, i|
        d > max_distance
      end.transpose.last
      gaps or return sentences
      if gaps.last < distances.size
        gaps << distances.size
      end
      if gaps.last < sentences.size - 1
        gaps << sentences.size - 1
      end
      result = []
      sg = 0
      current_text = +''
      gaps.each do |g|
        sg.upto(g) do |i|
          sentence = sentences[i]
          if current_text.size + sentence.size < @chunk_size
            current_text += sentence
          else
            result.concat(force_split(current_text))
            current_text = sentence
          end
        end
        if current_text.present?
          result.concat(force_split(current_text))
          current_text = +''
        end
        sg = g.succ
      end
      result.concat(force_split(current_text))
      result
    end