Class: Documentrix::Documents::Splitters::Semantic

Inherits:
Object
Includes:
Common, Utils::Math
Defined in:
lib/documentrix/documents/splitters/semantic.rb

Overview

Semantic splitter that divides text based on thematic changes in meaning.

It works by splitting text into sentences, computing an embedding for each, and then calculating the cosine distance (1 minus cosine similarity) between the embeddings of adjacent sentences. Where the distance exceeds a calculated threshold (the "breakpoint"), a semantic boundary is identified.
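
To make the distance step concrete, here is a minimal sketch using toy two-dimensional vectors in place of real sentence embeddings. The `cosine` lambda and the fixed `threshold` are illustrative assumptions, not part of the library; the class itself uses `#cosine_similarity` from `Utils::Math` and a computed breakpoint.

```ruby
# Toy stand-ins for sentence embeddings (real ones come from the embedding model).
embeddings = [
  [1.0, 0.0],
  [1.0, 0.0],
  [0.0, 1.0],
]

# Cosine similarity of two equal-length numeric arrays (illustrative helper).
cosine = lambda do |a, b|
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Cosine distance between each pair of adjacent embeddings.
distances = embeddings.each_cons(2).map { |a, b| 1.0 - cosine.(a, b) }

# Hypothetical fixed threshold; the splitter derives its own from the distances.
threshold  = 0.5
boundaries = distances.each_index.select { |i| distances[i] > threshold }

p distances  # => [0.0, 1.0]
p boundaries # => [1]
```

A boundary index `i` here means the topic changes between sentence `i` and sentence `i + 1`.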

Examples:

splitter = Documentrix::Documents::Splitters::Semantic.new(
  ollama: ollama_client,
  model: 'mxbai-embed-large'
)
chunks = splitter.split(text, breakpoint: :percentile, percentile: 90)

Constant Summary

DEFAULT_SEPARATOR =

The default regex used to identify sentence boundaries for semantic splitting. It matches a sentence- or clause-ending punctuation mark (., !, ?, a comma, or a semicolon) followed by optional whitespace, and then requires a word boundary or the end of the string.

Returns:

  • (Regexp)
/[.!?,;]\s*(?:\b|\z)/
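
As a quick illustration of what this pattern matches, the sketch below applies a copy of the regex with core `String` methods only. It does not go through the `Character` splitter, whose handling of separators may differ.

```ruby
# Local copy of the pattern shown above, so the snippet is self-contained.
separator = /[.!?,;]\s*(?:\b|\z)/

text = "Hello, world. How are you? Fine!"

# Every separator occurrence, including trailing whitespace.
matches = text.scan(separator)
p matches # => [", ", ". ", "? ", "!"]

# Pieces between separators (String#split drops the separators themselves).
pieces = text.split(separator)
p pieces  # => ["Hello", "world", "How are you", "Fine"]
```

Note that commas and semicolons count as boundaries, so the units being embedded can be clauses rather than full sentences.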

Instance Method Summary

Methods included from Utils::Math

#convert_to_vector, #cosine_similarity, #norm

Constructor Details

#initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false) ⇒ Semantic

Initializes a new Semantic splitter.

Parameters:

  • ollama (Ollama::Client)

    the client used for generating embeddings

  • model (String)

    the embedding model name

  • model_options (Hash, nil) (defaults to: nil)

    optional parameters passed to the embedding model

  • separator (Regexp) (defaults to: DEFAULT_SEPARATOR)

    the regex used to identify sentence boundaries

  • chunk_size (Integer) (defaults to: 4096)

    the maximum character length of a resulting chunk

  • force (Boolean) (defaults to: false)

    whether to force-split chunks that exceed chunk_size



# File 'lib/documentrix/documents/splitters/semantic.rb', line 35

def initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false)
  @ollama, @model, @model_options, @separator, @chunk_size, @force =
    ollama, model, model_options, separator, chunk_size, force
end

Instance Method Details

#split(text, batch_size: 100, breakpoint: :percentile, **opts) ⇒ Array<String>

Splits the given text into semantic chunks.

The method first decomposes the text into sentences, then identifies gaps in semantic similarity. It then groups these sentences into chunks that respect both the semantic boundaries and the maximum chunk size.
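
The grouping step can be sketched roughly as follows. This is a simplified illustration, not the library's code: the `group` lambda is hypothetical, `gaps` is assumed to hold the index of the last sentence in each semantic group, and oversized single sentences are not force-split.

```ruby
# Group sentences into chunks that end at semantic gaps and stay under chunk_size.
group = lambda do |sentences, gaps, chunk_size|
  chunks = []
  start  = 0
  gaps.each do |last|
    current = +''
    sentences[start..last].each do |sentence|
      if current.size + sentence.size < chunk_size
        current << sentence
      else
        # Size limit reached: close the current chunk and start a new one.
        chunks << current unless current.empty?
        current = sentence.dup
      end
    end
    # A semantic boundary always closes the current chunk.
    chunks << current unless current.empty?
    start = last + 1
  end
  chunks
end

sentences = ["A one. ", "A two. ", "B one. ", "B two. "]
gaps      = [1, 3] # groups: sentences 0..1 and 2..3

roomy = group.(sentences, gaps, 20)
tight = group.(sentences, gaps, 10)

p roomy # => ["A one. A two. ", "B one. B two. "]
p tight # => ["A one. ", "A two. ", "B one. ", "B two. "]
```

With a generous size limit the chunks follow the semantic groups; with a tight one, each group is broken further so that no chunk reaches the limit.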

Parameters:

  • text (String)

    the text to be split

  • batch_size (Integer) (defaults to: 100)

    the number of sentences to embed in a single API call

  • breakpoint (Symbol) (defaults to: :percentile)

    the method used to determine the distance threshold

    • :percentile (default) - uses the N-th percentile of distances
    • :standard_deviation - uses mean + (std_dev * multiplier)
    • :interquartile - uses mean + (iqr * multiplier)
  • opts (Hash)

    additional options for the splitting process:

    • :include_separator [Boolean] whether to keep the sentence separator in the result (default: true)
    • :percentile [Integer] the percentile to use if breakpoint is :percentile (default: 95)
    • :percentage [Integer] the multiplier percentage for :standard_deviation or :interquartile (default: 100)

Returns:

  • (Array<String>)

    an array of semantically grouped text chunks
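
For intuition about the three breakpoint methods, the sketch below computes one threshold per method on a small list of distances. The threshold calculation itself is not shown on this page, so everything here is an assumption: the linear-interpolation percentile, the population standard deviation, the quartile-based IQR, and reading a `:percentage` of 100 as a multiplier of 1.0.

```ruby
distances = [0.1, 0.2, 0.3, 0.4, 1.0]

# Linear-interpolation percentile (one common definition; the library may use another).
percentile = lambda do |values, p|
  sorted = values.sort
  rank   = (p / 100.0) * (sorted.size - 1)
  lower  = sorted[rank.floor]
  upper  = sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

mean    = distances.sum / distances.size
# Population standard deviation (assumption; a sample deviation would divide by n - 1).
std_dev = Math.sqrt(distances.sum { |d| (d - mean)**2 } / distances.size)
# Interquartile range: 75th percentile minus 25th percentile.
iqr     = percentile.(distances, 75) - percentile.(distances, 25)

# Assumption: the :percentage option is divided by 100 to obtain the multiplier.
multiplier = 100 / 100.0

thresholds = {
  percentile:         percentile.(distances, 95),
  standard_deviation: mean + std_dev * multiplier,
  interquartile:      mean + iqr * multiplier,
}

thresholds.each { |name, value| puts format('%-20s %.4f', name, value) }
```

On this data only the outlier distance of 1.0 exceeds any of the three thresholds, so each method would place a single boundary at that position.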



# File 'lib/documentrix/documents/splitters/semantic.rb', line 58

def split(text, batch_size: 100, breakpoint: :percentile, **opts)
  sentences  = Documentrix::Documents::Splitters::Character.new(
    separator: @separator,
    include_separator: opts.fetch(:include_separator, true),
    chunk_size: 1,
  ).split(text)
  embeddings = sentences.with_infobar(label: 'Split').each_slice(batch_size).inject([]) do |e, batch|
    e.concat sentence_embeddings(batch)
    infobar.progress by: batch.size
    e
  end
  infobar.newline
  embeddings.size < 2 and return sentences
  distances = embeddings.each_cons(2).map do |a, b|
    1.0 - cosine_similarity(a:, b:)
  end
  max_distance = calculate_breakpoint_threshold(breakpoint, distances, **opts)
  gaps = distances.each_with_index.select do |d, i|
    d > max_distance
  end.transpose.last
  gaps or return sentences
  if gaps.last < distances.size
    gaps << distances.size
  end
  if gaps.last < sentences.size - 1
    gaps << sentences.size - 1
  end
  result = []
  sg = 0
  current_text = +''
  gaps.each do |g|
    sg.upto(g) do |i|
      sentence = sentences[i]
      if current_text.size + sentence.size < @chunk_size
        current_text += sentence
      else
        result.concat(force_split(current_text))
        current_text = sentence
      end
    end
    if current_text.present?
      result.concat(force_split(current_text))
      current_text = +''
    end
    sg = g.succ
  end
  result.concat(force_split(current_text))
  result
end