SemanticTextChunker
Embedding-aware semantic chunking for Ruby RAG pipelines. Splits text into coherent chunks by detecting topic boundaries using embedding similarity, rather than blindly splitting on character count.
Installation
Add to your Gemfile:
gem "semantic_text_chunker"
Then run:
bundle install
Or install directly:
gem install semantic_text_chunker
Quick Start
require "semantic_text_chunker"
text = "Your long document text here..."
# Using OpenAI embeddings
chunks = SemanticTextChunker.chunk(text,
embedder: SemanticTextChunker::Embedders::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
)
chunks.each { |chunk| puts chunk, "---" }
Embedders
OpenAI
embedder = SemanticTextChunker::Embedders::OpenAI.new(
api_key: ENV["OPENAI_API_KEY"],
model: "text-embedding-3-small" # default
)
Cohere
embedder = SemanticTextChunker::Embedders::Cohere.new(
api_key: ENV["COHERE_API_KEY"],
model: "embed-english-v3.0" # default
)
OpenRouter
embedder = SemanticTextChunker::Embedders::OpenRouter.new(
api_key: ENV["OPENROUTER_API_KEY"],
model: "openai/text-embedding-3-small" # default
)
Null (no API required)
A hash-based embedder useful for testing and development. No external API calls needed.
embedder = SemanticTextChunker::Embedders::Null.new
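The README does not say how the Null embedder derives its vectors. Purely as an illustration of what a deterministic, hash-based embedder can look like, here is a minimal sketch. The class name `ToyHashEmbedder`, the `embed` method name, and the 8-dimensional vector size are all assumptions made for this example, not the gem's implementation.

```ruby
require "digest"

# Hypothetical stand-in for illustration; not the gem's Null embedder.
class ToyHashEmbedder
  DIMENSIONS = 8 # assumed vector size

  # texts: array of strings. Returns one vector (array of floats) per string.
  def embed(texts)
    texts.map do |text|
      bytes = Digest::SHA256.digest(text).bytes.first(DIMENSIONS)
      # Map each byte (0..255) into the range -1.0..1.0.
      bytes.map { |b| (b / 127.5) - 1.0 }
    end
  end
end

vectors = ToyHashEmbedder.new.embed(["hello", "hello", "world"])
```

Identical inputs always map to identical vectors, which keeps tests repeatable, but the vectors carry no semantic meaning, so similarity scores from an embedder like this say nothing about topic.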
Options
| Option | Default | Description |
|---|---|---|
| embedder | Null | Embedder instance to use for generating embeddings |
| threshold | 0.75 | Cosine similarity threshold for detecting boundaries |
| max_tokens | 512 | Maximum tokens per chunk (estimated at ~4 chars/token) |
| overlap_sentences | 2 | Number of sentences to overlap between chunks |
| respect_structure | true | Treat paragraph breaks and markdown headings as hard chunk boundaries |
| extra_abbreviations | [] | Additional abbreviations the sentence splitter should not split on |
chunks = SemanticTextChunker.chunk(text,
embedder: embedder,
threshold: 0.8,
max_tokens: 1024,
overlap_sentences: 3,
respect_structure: true,
extra_abbreviations: ["Inc", "Ltd"]
)
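The max_tokens limit is described as an estimate at roughly 4 characters per token. A minimal sketch of that kind of estimate follows. The helper names `estimate_tokens` and `fits?` are hypothetical, and rounding up is an assumption; the gem may count differently.

```ruby
CHARS_PER_TOKEN = 4.0 # from the "~4 chars/token" note in the options table

# Hypothetical helper: rough token count from character length, rounded up.
def estimate_tokens(text)
  (text.length / CHARS_PER_TOKEN).ceil
end

# Hypothetical helper: would this text fit in a chunk of max_tokens?
def fits?(text, max_tokens: 512)
  estimate_tokens(text) <= max_tokens
end
```

Under this estimate, the default of 512 tokens corresponds to about 2,048 characters per chunk.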
Metadata
Prepend metadata to each chunk for better retrieval context:
chunks = SemanticTextChunker.chunk(text,
embedder: embedder,
title: "The Great Gatsby",
author: "F. Scott Fitzgerald",
chapter: "Chapter 1",
section: "Opening",
source: "gutenberg.org"
)
Each chunk will be prefixed with:
Title: The Great Gatsby
Author: F. Scott Fitzgerald
Chapter: Chapter 1
Section: Opening
Source: gutenberg.org
<chunk text>
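To make the prefix format concrete, the following sketch builds the same header for a single chunk. The helper `with_metadata` is hypothetical, and skipping fields that were not supplied is an assumption; only the field names and their order come from the example above.

```ruby
# Field order follows the example output above.
METADATA_FIELDS = %i[title author chapter section source].freeze

# Hypothetical helper: prepend "Label: value" lines for each supplied field.
def with_metadata(chunk, **metadata)
  header = METADATA_FIELDS
    .select { |field| metadata[field] }
    .map { |field| "#{field.to_s.capitalize}: #{metadata[field]}" }
  (header + [chunk]).join("\n")
end

prefixed = with_metadata("In my younger and more vulnerable years...",
  title: "The Great Gatsby", author: "F. Scott Fitzgerald")
```

Here `prefixed` holds a `Title:` line, an `Author:` line, and then the chunk text, each on its own line.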
Custom Embedders
Create your own embedder by subclassing SemanticTextChunker::Embedders::Base:
class MyEmbedder < SemanticTextChunker::Embedders::Base
def (texts)
# texts is an array of strings
# Return an array of embedding vectors (arrays of floats)
texts.map { |t| (t) }
end
end
The base class provides a cosine_similarity method used for boundary detection.
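For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A standalone sketch is shown below; it is not the base class's code, and returning 0.0 for a zero-length vector is an assumption made here to avoid division by zero.

```ruby
# Standalone sketch of cosine similarity between two equal-length vectors.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # Assumed convention: a zero vector is treated as similar to nothing.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end
```

The result ranges from -1.0 (opposite direction) through 0.0 (orthogonal) to 1.0 (same direction), which is why a threshold such as 0.75 can serve as a cutoff for "still on the same topic".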
Sentence Splitting
Sentences are detected with punctuation-aware rules that:
- Keep common abbreviations intact (Mr., Dr., U.S.A., e.g., etc.)
- Keep decimal numbers intact (3.14, v1.2.3)
- Split dialogue ending in a closing quote ("Stop!" He ran.)
- Start new sentences on digits or opening quotes, not just capital letters
To recognize domain-specific abbreviations, pass extra_abbreviations (also accepted
directly by SemanticTextChunker.chunk):
splitter = SemanticTextChunker::Splitters::SentenceSplitter.new(
extra_abbreviations: ["Inc", "Ltd", "cf", "al"]
)
splitter.split("Acme Inc. shipped it. Done.")
# => ["Acme Inc. shipped it.", "Done."]
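The abbreviation handling can be illustrated with a much simpler splitter than the gem's. The sketch below splits on whitespace that follows sentence punctuation, then re-joins a piece when it ends in a known abbreviation. The method `split_sentences` and its short built-in abbreviation list are assumptions for this example, and it does not attempt the dialogue or capitalization rules listed above.

```ruby
# Assumed, deliberately short default list for the sketch.
ABBREVIATIONS = %w[Mr Mrs Dr].freeze

# Hypothetical simplified splitter; not the gem's SentenceSplitter.
def split_sentences(text, extra_abbreviations: [])
  known = ABBREVIATIONS + extra_abbreviations
  sentences = []
  current = +""
  # Split on whitespace preceded by ., ! or ? (decimals like 3.14 have no
  # whitespace after the dot, so they stay intact).
  text.split(/(?<=[.!?])\s+/).each do |piece|
    current << " " unless current.empty?
    current << piece
    last_word = piece[/(\w+)\.\z/, 1]
    # A trailing abbreviation means the sentence continues in the next piece.
    next if last_word && known.include?(last_word)

    sentences << current
    current = +""
  end
  sentences << current unless current.empty?
  sentences
end
```

Without "Inc" in the list, "Acme Inc. shipped it." would be cut after "Inc."; passing it through `extra_abbreviations` keeps the sentence whole.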
Structure-Aware Chunking
By default (respect_structure: true), the chunker respects document structure so that
chunks never blur across obvious boundaries:
- Paragraph breaks (blank lines) end a chunk: two paragraphs are never merged into one.
- Markdown headings (# ... through ###### ...) start a new section. A standalone heading is attached to the content that follows it, so each section's chunk carries its heading for context.
- Overlap never crosses a structural boundary, so a section's chunk won't leak the tail of the previous section.
Semantic similarity and the token limit still apply within each structural block. Set
respect_structure: false to disable this and chunk purely by similarity and token count.
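As a rough illustration of the structural pass, the sketch below splits text on blank lines and attaches a standalone heading to the paragraph that follows it. The method `structural_blocks` is hypothetical, and details such as how consecutive headings are handled are assumptions rather than the gem's behavior.

```ruby
# One to six "#" characters followed by whitespace at the start of a block.
HEADING = /\A[#]{1,6}\s/

# Hypothetical sketch: split into blocks that act as hard boundaries.
def structural_blocks(text)
  blocks = []
  pending = [] # standalone headings waiting for their content
  text.split(/\n\s*\n/).map(&:strip).reject(&:empty?).each do |paragraph|
    if paragraph.match?(HEADING) && !paragraph.include?("\n")
      pending << paragraph
    else
      # Attach any waiting heading(s) to the content that follows.
      blocks << (pending + [paragraph]).join("\n")
      pending = []
    end
  end
  blocks << pending.join("\n") unless pending.empty?
  blocks
end

doc = "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n## Details\n\nThird."
blocks = structural_blocks(doc)
```

Each returned block would then be sentence-split and chunked on its own, so similarity grouping and overlap never reach across a block edge.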
How It Works
- Structure splitting - Text is broken into blocks on paragraph breaks and markdown headings, which become hard boundaries that chunks are never merged across (when respect_structure is enabled)
- Sentence splitting - Each block is split into sentences using punctuation-aware rules that handle abbreviations (Mr., Dr., U.S., etc.), decimals, and dialogue
- Embedding - Each sentence is embedded using the configured embedder
- Boundary detection - Consecutive sentences are grouped. A new chunk boundary is created at a structural boundary, when the cosine similarity between the accumulated chunk embedding and the next sentence drops below the threshold, or when the token limit is exceeded
- Chunk building - Sentences are assembled into chunks with configurable overlap for context continuity (overlap never crosses a structural boundary)
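The boundary detection step can be sketched for a single structural block as a greedy pass over sentence vectors. This is an illustration under stated assumptions, not the gem's code: the "accumulated chunk embedding" is taken to be the mean of the sentence vectors in the open chunk, tokens are estimated at 4 characters each, overlap is left out, and `group_sentences` and `mean_vector` are hypothetical names.

```ruby
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norms = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norms.zero? ? 0.0 : dot / norms
end

# Assumption: the chunk embedding is the element-wise mean of its sentences.
def mean_vector(vectors)
  vectors.transpose.map { |column| column.sum / column.length.to_f }
end

# Greedy grouping: close the open chunk when the next sentence is too
# dissimilar or would push the estimated token count past max_tokens.
def group_sentences(sentences, vectors, threshold: 0.75, max_tokens: 512)
  chunks = []
  current = [] # indices of sentences in the open chunk
  sentences.each_index do |i|
    unless current.empty?
      centroid = mean_vector(current.map { |j| vectors[j] })
      chars = current.sum { |j| sentences[j].length } + sentences[i].length
      if cosine_similarity(centroid, vectors[i]) < threshold || chars / 4.0 > max_tokens
        chunks << current.map { |j| sentences[j] }.join(" ")
        current = []
      end
    end
    current << i
  end
  chunks << current.map { |j| sentences[j] }.join(" ") unless current.empty?
  chunks
end

sentences = ["Cats purr.", "Cats nap.", "Stocks fell.", "Bonds rose."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]] # toy 2-D embeddings
chunks = group_sentences(sentences, vectors)
# => ["Cats purr. Cats nap.", "Stocks fell. Bonds rose."]
```

With these toy vectors the third sentence scores about 0.05 against the mean of the first two, well under the 0.75 threshold, so a boundary falls between the cat sentences and the finance sentences.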
License
MIT