SemanticTextChunker

Embedding-aware semantic chunking for Ruby RAG pipelines. Splits text into coherent chunks by detecting topic boundaries using embedding similarity, rather than blindly splitting on character count.

Installation

Add to your Gemfile:

gem "semantic_text_chunker"

Then run:

bundle install

Or install directly:

gem install semantic_text_chunker

Quick Start

require "semantic_text_chunker"

text = "Your long document text here..."

# Using OpenAI embeddings
chunks = SemanticTextChunker.chunk(text,
  embedder: SemanticTextChunker::Embedders::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
)

chunks.each { |chunk| puts chunk, "---" }

Embedders

OpenAI

embedder = SemanticTextChunker::Embedders::OpenAI.new(
  api_key: ENV["OPENAI_API_KEY"],
  model: "text-embedding-3-small"  # default
)

Cohere

embedder = SemanticTextChunker::Embedders::Cohere.new(
  api_key: ENV["COHERE_API_KEY"],
  model: "embed-english-v3.0"  # default
)

OpenRouter

embedder = SemanticTextChunker::Embedders::OpenRouter.new(
  api_key: ENV["OPENROUTER_API_KEY"],
  model: "openai/text-embedding-3-small"  # default
)

Null (no API required)

A hash-based embedder useful for testing and development. No external API calls needed.

embedder = SemanticTextChunker::Embedders::Null.new
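
To give a feel for what a hash-based embedder can look like, here is a minimal standalone sketch. It is not the gem's Null implementation: the class name ToyHashEmbedder, the dimensions parameter, and the use of CRC32 feature hashing are all assumptions made for illustration. It only mirrors the embed(texts) contract described under Custom Embedders.

```ruby
require "zlib"

# Hypothetical toy embedder: hashes each word into one of N buckets and
# counts occurrences. Deterministic across runs because CRC32 is stable
# (unlike String#hash, which is seeded per process).
class ToyHashEmbedder
  def initialize(dimensions: 64)
    @dimensions = dimensions
  end

  # texts is an array of strings; returns one vector (array of floats) per text.
  def embed(texts)
    texts.map do |text|
      vector = Array.new(@dimensions, 0.0)
      text.downcase.scan(/\w+/) do |word|
        vector[Zlib.crc32(word) % @dimensions] += 1.0
      end
      vector
    end
  end
end

embedder = ToyHashEmbedder.new
vectors = embedder.embed(["Hello world", "hello WORLD"])
puts vectors[0] == vectors[1]  # same words, same vector
```

Vectors like these carry no real semantics, which is why an embedder of this kind suits tests and local development rather than production retrieval.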

Options

Option               Default  Description
embedder             Null     Embedder instance to use for generating embeddings
threshold            0.75     Cosine similarity threshold for detecting boundaries
max_tokens           512      Maximum tokens per chunk (estimated at ~4 chars/token)
overlap_sentences    2        Number of sentences to overlap between chunks
respect_structure    true     Treat paragraph breaks and markdown headings as hard chunk boundaries
extra_abbreviations  []       Additional abbreviations the sentence splitter should not split on

chunks = SemanticTextChunker.chunk(text,
  embedder: embedder,
  threshold: 0.8,
  max_tokens: 1024,
  overlap_sentences: 3,
  respect_structure: true,
  extra_abbreviations: ["Inc", "Ltd"]
)

Metadata

Prepend metadata to each chunk for better retrieval context:

chunks = SemanticTextChunker.chunk(text,
  embedder: embedder,
  title: "The Great Gatsby",
  author: "F. Scott Fitzgerald",
  chapter: "Chapter 1",
  section: "Opening",
  source: "gutenberg.org"
)

Each chunk will be prefixed with:

Title: The Great Gatsby
Author: F. Scott Fitzgerald
Chapter: Chapter 1
Section: Opening
Source: gutenberg.org

<chunk text>
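
The prefix format above can be reproduced with a few lines of plain Ruby. This is a sketch only: with_metadata and METADATA_KEYS are hypothetical names, and skipping fields that were not supplied is an assumption, not documented gem behavior.

```ruby
# Hypothetical helper that mimics the documented prefix format.
METADATA_KEYS = %i[title author chapter section source].freeze

def with_metadata(chunk, **metadata)
  # Build "Key: value" lines in a fixed order, skipping absent fields (assumed).
  header = METADATA_KEYS
    .select { |key| metadata[key] }
    .map { |key| "#{key.to_s.capitalize}: #{metadata[key]}" }
  return chunk if header.empty?

  "#{header.join("\n")}\n\n#{chunk}"
end

puts with_metadata("Chunk text goes here.",
  title: "The Great Gatsby",
  source: "gutenberg.org")
```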

Custom Embedders

Create your own embedder by subclassing SemanticTextChunker::Embedders::Base:

class MyEmbedder < SemanticTextChunker::Embedders::Base
  def embed(texts)
    # texts is an array of strings
    # Return an array of embedding vectors (arrays of floats)
    texts.map { |t| your_embedding_logic(t) }
  end
end

The base class provides a cosine_similarity method used for boundary detection.
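
For reference, cosine similarity is the dot product of two vectors divided by the product of their magnitudes. The standalone function below is a sketch of that formula, not the gem's own method; returning 0.0 for a zero-magnitude vector is an assumption made here to avoid dividing by zero.

```ruby
# Cosine similarity between two equal-length vectors (arrays of numbers).
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if norm_a.zero? || norm_b.zero?  # assumed convention for zero vectors

  dot / (norm_a * norm_b)
end

puts cosine_similarity([1.0, 0.0], [1.0, 0.0])  # => 1.0 (same direction)
puts cosine_similarity([1.0, 0.0], [0.0, 1.0])  # => 0.0 (orthogonal)
```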

Sentence Splitting

Sentences are detected with punctuation-aware rules that:

  • Keep common abbreviations intact (Mr., Dr., U.S.A., e.g., etc.)
  • Keep decimal numbers intact (3.14, v1.2.3)
  • Split dialogue ending in a closing quote ("Stop!" He ran.)
  • Start new sentences on digits or opening quotes, not just capital letters

To recognize domain-specific abbreviations, pass extra_abbreviations (also accepted directly by SemanticTextChunker.chunk):

splitter = SemanticTextChunker::Splitters::SentenceSplitter.new(
  extra_abbreviations: ["Inc", "Ltd", "cf", "al"]
)
splitter.split("Acme Inc. shipped it. Done.")
# => ["Acme Inc. shipped it.", "Done."]
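
The idea behind abbreviation-aware splitting can be sketched without the gem. The function below is a simplified illustration, not the gem's SentenceSplitter: split_sentences and its default abbreviation list are hypothetical, it works token by token, and it normalizes whitespace between words.

```ruby
# Hypothetical default list; the gem's built-in list is not reproduced here.
DEFAULT_ABBREVIATIONS = %w[Mr Mrs Dr].freeze
# Sentence-ending punctuation, optionally followed by closing quotes or brackets.
SENTENCE_END = /[.!?]["')\]]*\z/

def split_sentences(text, extra_abbreviations: [])
  abbreviations = DEFAULT_ABBREVIATIONS + extra_abbreviations
  sentences = []
  current = []

  text.split.each do |token|
    current << token
    next unless token.match?(SENTENCE_END)

    # "Inc." ends in a period but is a known abbreviation, so keep going.
    word = token.sub(SENTENCE_END, "")
    next if token.end_with?(".") && abbreviations.include?(word)

    sentences << current.join(" ")
    current = []
  end

  sentences << current.join(" ") unless current.empty?
  sentences
end

p split_sentences("Acme Inc. shipped v1.2.3 today. Done.", extra_abbreviations: ["Inc"])
# => ["Acme Inc. shipped v1.2.3 today.", "Done."]
```

Decimals and version strings survive because their periods sit inside a token rather than at its end.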

Structure-Aware Chunking

By default (respect_structure: true), the chunker respects document structure so that chunks never blur across obvious boundaries:

  • Paragraph breaks (blank lines) end a chunk: two paragraphs are never merged into one.
  • Markdown headings (# ... through ###### ...) start a new section. A standalone heading is attached to the content that follows it, so each section's chunk carries its heading for context.
  • Overlap never crosses a structural boundary, so a section's chunk won't leak the tail of the previous section.

Semantic similarity and the token limit still apply within each structural block. Set respect_structure: false to disable this and chunk purely by similarity and token count.
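
The rules above can be illustrated with a small line-based splitter. This is a sketch under simplifying assumptions, not the gem's implementation: structural_blocks and HEADING_LINE are hypothetical names, and only ATX-style headings (# through ######) are recognized.

```ruby
# One to six "#" characters followed by whitespace and some text.
HEADING_LINE = /\A[#]{1,6}\s+\S/

def structural_blocks(text)
  blocks = []
  current = []
  flush = lambda do
    blocks << current.join("\n") unless current.empty?
    current = []
  end

  text.each_line(chomp: true) do |line|
    if line.strip.empty?
      # A blank line ends the block, unless the block so far is only a
      # heading still waiting for the content that follows it.
      flush.call unless current.all? { |l| l.match?(HEADING_LINE) }
    elsif line.match?(HEADING_LINE)
      flush.call          # a heading always starts a new block
      current << line
    else
      current << line
    end
  end
  flush.call
  blocks
end

doc = "# Intro\n\nFirst para.\n\nSecond para.\n\n## Next\nMore."
p structural_blocks(doc)
# => ["# Intro\nFirst para.", "Second para.", "## Next\nMore."]
```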

How It Works

  1. Structure splitting - When respect_structure is enabled, text is broken into blocks on paragraph breaks and markdown headings. These are hard boundaries: chunks are never merged across them
  2. Sentence splitting - Each block is split into sentences using punctuation-aware rules that handle abbreviations (Mr., Dr., U.S., etc.), decimals, and dialogue
  3. Embedding - Each sentence is embedded using the configured embedder
  4. Boundary detection - Consecutive sentences are grouped into a chunk until one of three things happens: a structural boundary is reached, the cosine similarity between the accumulated chunk embedding and the next sentence drops below the threshold, or adding the next sentence would exceed the token limit
  5. Chunk building - Sentences are assembled into chunks with configurable overlap for context continuity (overlap never crosses a structural boundary)
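
The core of steps 3 and 4 can be sketched in plain Ruby. This is an illustration, not the gem's code: group_sentences, estimate_tokens, cosine, and mean_vector are hypothetical helpers, the chunk embedding is assumed to be the mean of its sentence vectors, and structure handling and overlap are left out for brevity.

```ruby
# Rough token estimate at ~4 characters per token.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  na  = Math.sqrt(a.sum { |x| x * x })
  nb  = Math.sqrt(b.sum { |x| x * x })
  na.zero? || nb.zero? ? 0.0 : dot / (na * nb)
end

# Element-wise mean of several vectors (assumed accumulation strategy).
def mean_vector(vectors)
  vectors.transpose.map { |column| column.sum / column.length.to_f }
end

def group_sentences(sentences, vectors, threshold: 0.75, max_tokens: 512)
  chunks = []
  current = []  # indices of the sentences in the open chunk

  sentences.each_index do |i|
    unless current.empty?
      chunk_vec = mean_vector(current.map { |j| vectors[j] })
      candidate = (current + [i]).map { |j| sentences[j] }.join(" ")
      too_different = cosine(chunk_vec, vectors[i]) < threshold
      too_long      = estimate_tokens(candidate) > max_tokens
      if too_different || too_long
        chunks << current.map { |j| sentences[j] }.join(" ")
        current = []
      end
    end
    current << i
  end

  chunks << current.map { |j| sentences[j] }.join(" ") unless current.empty?
  chunks
end

sentences = ["Cats purr.", "Cats nap a lot.", "Stocks fell today."]
vectors   = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hand-made toy embeddings
p group_sentences(sentences, vectors)
# => ["Cats purr. Cats nap a lot.", "Stocks fell today."]
```

Raising the threshold makes the grouping stricter and yields more, smaller chunks; lowering it lets loosely related sentences stay together.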

License

MIT