Class: SemanticTextChunker::BoundaryDetector

Inherits:

Object

Defined in: lib/semantic_text_chunker/boundary_detector.rb

Instance Method Summary

#boundaries ⇒ Object

Returns an array of sentence indices at which chunks end.

#initialize(sentences:, embeddings:, threshold:, max_tokens:, embedder:) ⇒ BoundaryDetector constructor

A new instance of BoundaryDetector.

Constructor Details

#initialize(sentences:, embeddings:, threshold:, max_tokens:, embedder:) ⇒ `BoundaryDetector`

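Returns a new instance of BoundaryDetector. Judging from how #boundaries uses the arguments, sentences and embeddings are parallel arrays (one embedding per sentence), threshold is the minimum cosine similarity a sentence needs to stay in the current chunk, max_tokens caps the token count of a chunk, and embedder must respond to cosine_similarity.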

# File 'lib/semantic_text_chunker/boundary_detector.rb', line 3

def initialize(sentences:, embeddings:, threshold:, max_tokens:, embedder:)
  @sentences  = sentences
  @embeddings = embeddings
  @threshold  = threshold
  @max_tokens = max_tokens
  @embedder   = embedder
end
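
The constructor only stores its arguments, so the one requirement on embedder that is visible on this page is that it responds to cosine_similarity(a, b). A minimal stand-in is sketched below, assuming embeddings are plain arrays of numbers. The class name ToyEmbedder is hypothetical and not part of the gem, and the real embedder may behave differently (for example on zero vectors).

```ruby
# Hypothetical stand-in for the embedder collaborator; not part of the gem.
class ToyEmbedder
  # Cosine similarity of two equal-length numeric vectors.
  # Returns 0.0 when either vector has zero magnitude (an assumption
  # made here to avoid dividing by zero).
  def cosine_similarity(a, b)
    dot    = a.zip(b).sum { |x, y| x * y }
    norm_a = Math.sqrt(a.sum { |x| x * x })
    norm_b = Math.sqrt(b.sum { |x| x * x })
    return 0.0 if norm_a.zero? || norm_b.zero?

    dot / (norm_a * norm_b)
  end
end
```

An object like this could be passed as embedder: when experimenting with thresholds on small hand-made vectors.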

Instance Method Details

#boundaries ⇒ `Object`

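Returns an array of sentence indices at which chunks end. A boundary is recorded before sentence i when appending it would push the chunk past max_tokens, or when its cosine similarity to the mean embedding of the current chunk falls below threshold. The index of the final sentence is never included, so the last chunk runs implicitly to the end of the input; an input of zero or one sentences yields an empty array.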

# File 'lib/semantic_text_chunker/boundary_detector.rb', line 12

def boundaries
  return [] if @sentences.size <= 1

  boundaries  = []
  chunk_start = 0

  @sentences.each_with_index do |sentence, i|
    next if i == 0

    # Text of the chunk so far, and what it would be with this sentence appended
    current_text = @sentences[chunk_start..i - 1].join(" ")
    next_text    = "#{current_text} #{sentence}"

    # Force a boundary if adding this sentence would exceed the token limit
    # (tokens is a helper that is not shown in this listing)
    if tokens(next_text) > @max_tokens
      boundaries << i - 1
      chunk_start = i
      next
    end

    # Compute similarity between accumulated chunk and next sentence
    chunk_embedding    = mean_embedding(@embeddings[chunk_start..i - 1])
    sentence_embedding = @embeddings[i]
    similarity         = @embedder.cosine_similarity(chunk_embedding, sentence_embedding)

    if similarity < @threshold
      boundaries << i - 1
      chunk_start = i
    end
  end

  boundaries
end
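
#boundaries calls tokens and mean_embedding, neither of which appears on this page. The sketch below is one guess at what they might do (whitespace token counting and an element-wise mean); the gem's actual helpers may differ. It also adds a hypothetical chunks_from helper, not part of the gem, that turns the returned indices into groups of sentences, relying on the fact that the end of the last chunk is not in the list.

```ruby
# Assumed helper: counts whitespace-separated tokens.
# The gem may use a real tokenizer instead.
def tokens(text)
  text.split.size
end

# Assumed helper: element-wise mean of equal-length numeric vectors.
def mean_embedding(vectors)
  vectors.transpose.map { |column| column.sum / column.size.to_f }
end

# Hypothetical helper: groups sentences using the end indices
# returned by #boundaries. The final chunk runs to the last sentence.
def chunks_from(sentences, boundaries)
  return [] if sentences.empty?

  starts = [0] + boundaries.map { |b| b + 1 }
  ends   = boundaries + [sentences.size - 1]
  starts.zip(ends).map { |s, e| sentences[s..e] }
end

sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
chunks_from(sentences, [1])
# => [["Cats purr.", "Cats nap."], ["Stocks fell.", "Bonds rose."]]
```

Returning end indices rather than the chunks themselves keeps the detector independent of how callers want to join or post-process the sentences.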
