SemanticTextChunker
Embedding-aware semantic chunking for Ruby RAG pipelines. Splits text into coherent chunks by detecting topic boundaries using embedding similarity, rather than blindly splitting on character count.
Installation
Add to your Gemfile:
gem "semantic_text_chunker"
Then run:
bundle install
Or install directly:
gem install semantic_text_chunker
Quick Start
require "semantic_text_chunker"
text = "Your long document text here..."
# Using OpenAI embeddings
chunks = SemanticTextChunker.chunk(text,
  embedder: SemanticTextChunker::Embedders::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])
)
chunks.each { |chunk| puts chunk, "---" }
Embedders
OpenAI
embedder = SemanticTextChunker::Embedders::OpenAI.new(
  api_key: ENV["OPENAI_API_KEY"],
  model: "text-embedding-3-small" # default
)
Cohere
embedder = SemanticTextChunker::Embedders::Cohere.new(
  api_key: ENV["COHERE_API_KEY"],
  model: "embed-english-v3.0" # default
)
OpenRouter
embedder = SemanticTextChunker::Embedders::OpenRouter.new(
  api_key: ENV["OPENROUTER_API_KEY"],
  model: "openai/text-embedding-3-small" # default
)
Null (no API required)
A hash-based embedder useful for testing and development. No external API calls needed.
embedder = SemanticTextChunker::Embedders::Null.new
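To give a feel for what a hash-based embedder can look like, here is a minimal sketch. It is not the gem's Null implementation: the class name `ToyHashEmbedder`, the `embed` method name, the MD5 bucketing, and the 16-dimension default are all assumptions made for illustration.

```ruby
require "digest"

# Hypothetical stand-in, NOT the gem's Null embedder.
class ToyHashEmbedder
  def initialize(dimensions: 16)
    @dimensions = dimensions
  end

  # Assumed interface: array of strings in, array of float vectors out.
  def embed(texts)
    texts.map { |text| vector_for(text) }
  end

  private

  # Hash each lowercased word into a bucket, count hits, then L2-normalize.
  def vector_for(text)
    counts = Array.new(@dimensions, 0.0)
    text.downcase.scan(/\w+/) do |word|
      bucket = Digest::MD5.hexdigest(word).to_i(16) % @dimensions
      counts[bucket] += 1.0
    end
    norm = Math.sqrt(counts.sum { |c| c * c })
    norm.zero? ? counts : counts.map { |c| c / norm }
  end
end

vectors = ToyHashEmbedder.new.embed(["The cat sat.", "the cat sat"])
puts vectors.first == vectors.last # prints true: same words, same vector
```

Because the vectors depend only on the words present, results are deterministic across runs, which is the property that makes this kind of embedder convenient in tests.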
Options
| Option | Default | Description |
|---|---|---|
| embedder | Null | Embedder instance to use for generating embeddings |
| threshold | 0.75 | Cosine similarity threshold for detecting boundaries |
| max_tokens | 512 | Maximum tokens per chunk (estimated at ~4 chars/token) |
| overlap_sentences | 2 | Number of sentences to overlap between chunks |
chunks = SemanticTextChunker.chunk(text,
  embedder: embedder,
  threshold: 0.8,
  max_tokens: 1024,
  overlap_sentences: 3
)
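The ~4 chars/token estimate behind max_tokens can be illustrated with a short sketch. The helper names `estimate_tokens` and `fits?` are hypothetical, and the gem's internal estimator may round differently; this only shows the arithmetic.

```ruby
# Hypothetical helpers; the gem's own estimator may differ.
CHARS_PER_TOKEN = 4

# Rough token count: one token per four characters, rounded up.
def estimate_tokens(text)
  (text.length / CHARS_PER_TOKEN.to_f).ceil
end

# True when the estimated token count is within the limit.
def fits?(text, max_tokens: 512)
  estimate_tokens(text) <= max_tokens
end

puts estimate_tokens("a" * 2048) # prints 512
puts fits?("a" * 2049)           # prints false (estimated at 513 tokens)
```

Under this estimate, the default max_tokens of 512 corresponds to roughly 2,048 characters per chunk.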
Metadata
Prepend metadata to each chunk for better retrieval context:
chunks = SemanticTextChunker.(text,
  embedder: embedder,
  title: "The Great Gatsby",
  author: "F. Scott Fitzgerald",
  chapter: "Chapter 1",
  section: "Opening",
  source: "gutenberg.org"
)
Each chunk will be prefixed with:
Title: The Great Gatsby
Author: F. Scott Fitzgerald
Chapter: Chapter 1
Section: Opening
Source: gutenberg.org
<chunk text>
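The prefix format above can be reproduced with a few lines of Ruby. This is a sketch only: `prepend_metadata` is a hypothetical helper, and skipping blank fields and joining with single newlines are assumptions rather than documented behavior of the gem.

```ruby
# Hypothetical helper showing the "Key: value" prefix format.
def prepend_metadata(chunk, metadata)
  header = metadata
    .reject { |_key, value| value.nil? || value.to_s.empty? } # assumption: blank fields are skipped
    .map { |key, value| "#{key.to_s.capitalize}: #{value}" }
  (header + [chunk]).join("\n")
end

puts prepend_metadata(
  "Chunk text here.",
  { title: "The Great Gatsby", author: "F. Scott Fitzgerald", chapter: nil }
)
# prints:
# Title: The Great Gatsby
# Author: F. Scott Fitzgerald
# Chunk text here.
```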
Custom Embedders
Create your own embedder by subclassing SemanticTextChunker::Embedders::Base:
class MyEmbedder < SemanticTextChunker::Embedders::Base
def (texts)
# texts is an array of strings
# Return an array of embedding vectors (arrays of floats)
texts.map { |t| (t) }
end
end
The base class provides a cosine_similarity method used for boundary detection.
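For reference, cosine similarity between two vectors is their dot product divided by the product of their lengths. The stand-alone function below is a sketch of that formula, not the base class's actual code; returning 0.0 for a zero-length vector is an assumption made here to avoid dividing by zero.

```ruby
# Stand-alone sketch; the gem's Base#cosine_similarity may be implemented differently.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero? # assumption: treat zero vectors as dissimilar

  dot / (norm_a * norm_b)
end

puts cosine_similarity([1.0, 0.0], [1.0, 0.0]) # prints 1.0 (same direction)
puts cosine_similarity([1.0, 0.0], [0.0, 1.0]) # prints 0.0 (orthogonal)
```

With the default threshold of 0.75, the first pair would stay in the same chunk and the second pair would trigger a boundary.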
How It Works
- Sentence splitting - Text is split into sentences using punctuation-aware rules that handle abbreviations (Mr., Dr., U.S., etc.)
- Embedding - Each sentence is embedded using the configured embedder
- Boundary detection - Consecutive sentences are grouped. A new chunk boundary is created when the cosine similarity between the accumulated chunk embedding and the next sentence drops below the threshold, or when the token limit is exceeded
- Chunk building - Sentences are assembled into chunks with configurable overlap for context continuity
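The four steps above can be sketched end to end in plain Ruby. This is an illustration under stated assumptions, not the gem's source: the abbreviation list is shortened, the "accumulated chunk embedding" is assumed to be the mean of the sentence vectors in the open chunk, overlap is assumed to come from the sentences immediately preceding a chunk, and `toy_embed` is a hypothetical two-topic keyword counter standing in for a real embedder.

```ruby
ABBREVIATIONS = %w[Mr. Mrs. Dr. U.S.].freeze

# Step 1: split after ., ! or ?, then re-join pieces that ended in a known abbreviation.
def split_sentences(text)
  pieces = text.strip.split(/(?<=[.!?])\s+/)
  pieces.each_with_object([]) do |piece, sentences|
    if sentences.any? && ABBREVIATIONS.any? { |abbr| sentences.last.end_with?(abbr) }
      sentences[-1] = "#{sentences.last} #{piece}"
    else
      sentences << piece
    end
  end
end

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norms = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |y| y * y })
  norms.zero? ? 0.0 : dot / norms
end

# Assumption: the accumulated chunk embedding is the element-wise mean.
def centroid(vectors)
  vectors.transpose.map { |column| column.sum / column.length.to_f }
end

# Steps 3 and 4: close a chunk when similarity drops or the token estimate is exceeded,
# then build each chunk's text with leading overlap from the preceding sentences.
def semantic_chunks(sentences, vectors, threshold: 0.75, max_tokens: 512, overlap_sentences: 2)
  groups = []
  current = [] # indices of sentences in the open chunk
  sentences.each_index do |i|
    if current.any?
      similarity = cosine(centroid(current.map { |j| vectors[j] }), vectors[i])
      chars = current.sum { |j| sentences[j].length } + sentences[i].length
      if similarity < threshold || (chars / 4.0).ceil > max_tokens
        groups << current
        current = []
      end
    end
    current << i
  end
  groups << current if current.any?

  groups.map do |group|
    start = [group.first - overlap_sentences, 0].max
    sentences[start..group.last].join(" ")
  end
end

# Step 2 stand-in: a toy embedder that counts mentions of two hand-picked topics.
toy_embed = lambda do |sentence|
  words = sentence.downcase.scan(/\w+/)
  [words.count { |w| %w[cat cats purr].include?(w) }.to_f,
   words.count { |w| %w[stock stocks market].include?(w) }.to_f]
end

text = "Dr. Lee owns a cat. The cat likes to purr. The stock market fell. Stocks may recover."
sentences = split_sentences(text)
vectors = sentences.map(&toy_embed)
semantic_chunks(sentences, vectors, overlap_sentences: 0).each { |chunk| puts chunk, "---" }
```

With overlap disabled, the example prints two chunks: the two cat sentences, then the two stock sentences. Setting `overlap_sentences: 1` would make the second chunk start with "The cat likes to purr." for context continuity.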
License
MIT