Class: Documentrix::Documents::Splitters::Semantic

Inherits:

Object

Includes: Common, Utils::Math

Defined in: lib/documentrix/documents/splitters/semantic.rb

Overview

Semantic splitter that divides text based on thematic changes in meaning.

It works by splitting text into sentences, computing an embedding for each sentence, and then calculating the cosine distance between the embeddings of adjacent sentences. Where that distance exceeds a calculated threshold (the "breakpoint"), a semantic boundary is identified.
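The distance-and-threshold step can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper names (cosine_distance, adjacent_distances, percentile_threshold, breakpoint_indices) are hypothetical, and linear interpolation between ranks is an assumed way of computing the percentile.

```ruby
# Hypothetical helpers for illustration only; names and the percentile
# interpolation scheme are assumptions, not Documentrix's actual code.

# Cosine distance = 1 - cosine similarity of two equal-length vectors.
def cosine_distance(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

# Distance between each sentence embedding and the next one.
def adjacent_distances(embeddings)
  embeddings.each_cons(2).map { |a, b| cosine_distance(a, b) }
end

# Percentile of a list of values, using linear interpolation between ranks.
def percentile_threshold(values, percentile)
  sorted = values.sort
  rank   = (percentile / 100.0) * (sorted.size - 1)
  lower  = sorted[rank.floor]
  upper  = sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

# Index i in the result marks a boundary between sentence i and sentence i + 1.
def breakpoint_indices(embeddings, percentile: 90)
  distances = adjacent_distances(embeddings)
  return [] if distances.empty?

  threshold = percentile_threshold(distances, percentile)
  distances.each_index.select { |i| distances[i] > threshold }
end

embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
boundaries = breakpoint_indices(embeddings, percentile: 90)
```

With these toy embeddings, the only large jump in meaning is between the second and third vectors, so a single boundary is reported there.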

Examples:

splitter = Documentrix::Documents::Splitters::Semantic.new(
  ollama: ollama_client,
  model: 'mxbai-embed-large'
)
chunks = splitter.split(text, breakpoint: :percentile, percentile: 90)

Constant Summary

DEFAULT_SEPARATOR = The default regex used to identify sentence boundaries for semantic splitting. It matches one punctuation mark from the character class [.!?,;] (the sentence-ending marks ., !, and ?, as well as the comma and semicolon), followed by optional whitespace, and must end at a word boundary or at the end of the string. Returns: (Regexp)

/[.!?,;]\s*(?:\b|\z)/
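To see what the pattern above matches in practice, the following sketch scans a short sample string. The local variable names are hypothetical, and the regex is copied into the snippet so that it runs without the gem loaded. How the splitter itself applies the separator is not shown here.

```ruby
# Local copy of the pattern shown above, so the snippet runs standalone.
separator = /[.!?,;]\s*(?:\b|\z)/

sample  = 'Hello, world. Bye!'

# Each match is one punctuation mark plus any whitespace that follows it.
matches = sample.scan(separator)
```

Note that the comma after "Hello" is matched as well, so with this default pattern clauses, not only full sentences, can act as split points.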

Instance Method Summary

#initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false) ⇒ Semantic constructor
Initializes a new Semantic splitter.
#split(text, batch_size: 100, breakpoint: :percentile, **opts) ⇒ Array<String>
Splits the given text into semantic chunks.

Methods included from Utils::Math

#convert_to_vector, #cosine_similarity, #norm

Constructor Details

#initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false) ⇒ `Semantic`

Initializes a new Semantic splitter.

Parameters:

ollama (Ollama::Client) —
the client used for generating embeddings
model (String) —
the embedding model name
model_options (Hash, nil) (defaults to: nil) —
optional parameters passed to the embedding model
separator (Regexp) (defaults to: DEFAULT_SEPARATOR) —
the regex used to identify sentence boundaries
chunk_size (Integer) (defaults to: 4096) —
the maximum character length of a resulting chunk
force (Boolean) (defaults to: false) —
whether to force-split chunks that exceed chunk_size

# File 'lib/documentrix/documents/splitters/semantic.rb', line 35

def initialize(ollama:, model:, model_options: nil, separator: DEFAULT_SEPARATOR, chunk_size: 4096, force: false)
  @ollama, @model, @model_options, @separator, @chunk_size, @force =
    ollama, model, model_options, separator, chunk_size, force
end

Instance Method Details

#split(text, batch_size: 100, breakpoint: :percentile, **opts) ⇒ `Array<String>`

Splits the given text into semantic chunks.

The method first decomposes the text into sentences, then identifies gaps in semantic similarity. It then groups these sentences into chunks that respect both the semantic boundaries and the maximum chunk size.
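The grouping step can be sketched as below. This is an assumption-laden illustration, not the method's real body: group_sentences is a hypothetical helper, breakpoints are assumed to be indices marking a boundary after the sentence at that index, and a chunk is simply closed early when adding the next sentence would exceed chunk_size.

```ruby
# Hypothetical sketch of grouping sentences into chunks; not the library's code.
# breakpoints: indices i such that a semantic boundary follows sentence i.
def group_sentences(sentences, breakpoints, chunk_size:)
  chunks  = []
  current = +''

  sentences.each_with_index do |sentence, i|
    # Close the current chunk first if this sentence would overflow it.
    if !current.empty? && current.size + sentence.size > chunk_size
      chunks << current
      current = +''
    end

    current << sentence

    # Close the chunk at a semantic boundary.
    if breakpoints.include?(i)
      chunks << current
      current = +''
    end
  end

  chunks << current unless current.empty?
  chunks
end

sentences = ['A. ', 'B. ', 'C. ', 'D.']
by_meaning = group_sentences(sentences, [1], chunk_size: 100)
by_size    = group_sentences(sentences, [1], chunk_size: 4)
```

In this sketch a single sentence longer than chunk_size is still emitted as one oversized chunk; handling that case is presumably what the force option is for, but that behavior is not modeled here.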