Architecture Guide
This guide explains the internal architecture of TokenKit, design decisions, and how the Ruby and Rust components work together.
Overview
TokenKit is a hybrid Ruby/Rust gem that leverages Rust's performance for tokenization while providing a friendly Ruby API. The architecture prioritizes:
- Performance: Rust implementation with minimal FFI overhead
- Safety: Thread-safe by design with proper error handling
- Flexibility: Multiple tokenization strategies with unified API
- Maintainability: Clear separation of concerns and trait-based design
┌─────────────────┐
│ Ruby Layer │ lib/tokenkit.rb, lib/tokenkit/*.rb
├─────────────────┤
│ Magnus Bridge │ FFI boundary (automatic serialization)
├─────────────────┤
│ Rust Layer │ ext/tokenkit/src/*.rs
└─────────────────┘
Component Architecture
Ruby Layer (lib/)
lib/
├── tokenkit.rb # Main module and API
├── tokenkit/
│ ├── version.rb # Version constant
│ ├── tokenizer.rb # Instance-based tokenizer
│ └── configuration.rb # Config object with accessors
Key Components:
TokenKit Module (
lib/tokenkit.rb):- Public API methods:
tokenize,configure,reset - Delegates to Rust via Magnus
- Handles option merging for per-call overrides
- Public API methods:
Configuration (
lib/tokenkit/configuration.rb):- Ruby wrapper around config hash from Rust
- Provides convenient accessors and predicates
- Immutable once created
Tokenizer Class (
lib/tokenkit/tokenizer.rb):- Instance-based API for specific configurations
- Useful for bulk processing with different settings
- Wraps module-level functions
Rust Layer (ext/tokenkit/src/)
ext/tokenkit/src/
├── lib.rs # Magnus bindings and caching
├── config.rs # Configuration structs
├── error.rs # Error types with thiserror
├── tokenizer/
│ ├── mod.rs # Trait definition and factory
│ ├── base.rs # Common functionality
│ ├── unicode.rs # Unicode word boundaries
│ ├── whitespace.rs # Simple whitespace splitting
│ ├── pattern.rs # Regex-based tokenization
│ └── ... # Other tokenizer implementations
Key Components:
Entry Point (
lib.rs):- Magnus function exports
- Tokenizer cache management
- Configuration parsing and validation
- Ruby ↔ Rust type conversion
Configuration (
config.rs):TokenizerConfigstruct with all settingsTokenizerStrategyenum for strategy-specific options- Serde serialization for debugging
Error Handling (
error.rs):TokenizerErrorenum with thiserror- Automatic conversion to Ruby exceptions
- Detailed error messages
Tokenizer Trait (
tokenizer/mod.rs):pub trait Tokenizer: Send + Sync { fn tokenize(&self, text: &str) -> Vec<String>; }- Simple, focused interface
- Thread-safe (
Send + Sync) - Returns owned strings for Ruby
Base Functionality (
tokenizer/base.rs):BaseTokenizerFieldsfor common state- Regex compilation for preserve_patterns
- Shared helper functions
Design Patterns
1. Strategy Pattern
Each tokenization strategy implements the Tokenizer trait:
// Factory function creates appropriate tokenizer
pub fn from_config(config: TokenizerConfig) -> Result<Box<dyn Tokenizer>> {
match config.strategy {
TokenizerStrategy::Unicode => Ok(Box::new(UnicodeTokenizer::new(config))),
TokenizerStrategy::Whitespace => Ok(Box::new(WhitespaceTokenizer::new(config))),
// ... other strategies
}
}
2. Composition over Inheritance
Tokenizers compose BaseTokenizerFields for common functionality:
pub struct UnicodeTokenizer {
base: BaseTokenizerFields,
}
impl UnicodeTokenizer {
pub fn new(config: TokenizerConfig) -> Self {
Self {
base: BaseTokenizerFields::new(config),
}
}
}
3. Lazy Initialization with Caching
Tokenizer instances are created lazily and cached:
static DEFAULT_CACHE: Lazy<Mutex<TokenizerCache>> = Lazy::new(|| {
Mutex::new(TokenizerCache {
config: TokenizerConfig::default(),
tokenizer: None, // Created on first use
})
});
fn tokenize(text: String) -> Result<Vec<String>, Error> {
let mut cache = DEFAULT_CACHE.lock()?;
if cache.tokenizer.is_none() {
cache.tokenizer = Some(from_config(cache.config.clone())?);
}
Ok(cache.tokenizer.as_ref().unwrap().tokenize(&text))
}
4. Builder Pattern for Configuration
Configuration uses a builder-like pattern in Ruby:
TokenKit.configure do |config|
config.strategy = :unicode
config.lowercase = true
config.preserve_patterns = [...]
end
Pattern Preservation Architecture
Pattern preservation is a key feature that required careful design:
Problem
Different tokenizers split text differently, but preserve_patterns must work consistently across all strategies.
Solution
Strategy-aware pattern preservation:
// Each tokenizer can provide custom tokenization for pattern gaps
pub fn apply_preserve_patterns_with_tokenizer<F>(
tokens: Vec<String>,
preserve_patterns: &[Regex],
original_text: &str,
config: &TokenizerConfig,
tokenizer_fn: F,
) -> Vec<String>
where
F: Fn(&str) -> Vec<String>
Example for CharGroup tokenizer:
// CharGroup uses its own delimiter logic for consistency
let tokens_with_patterns = apply_preserve_patterns_with_tokenizer(
tokens,
self.base.preserve_patterns(),
text,
&self.base.config,
|text| self.tokenize_text(text), // Uses CharGroup's split logic
);
Thread Safety
TokenKit is thread-safe through careful design:
Global State Protection
// Single mutex protects configuration and tokenizer
static DEFAULT_CACHE: Lazy<Mutex<TokenizerCache>> = Lazy::new(...)
Tokenizer Trait Requirements
pub trait Tokenizer: Send + Sync {
// Send: Can be transferred between threads
// Sync: Can be shared between threads
}
Immutable Tokenizers
Once created, tokenizers are immutable. Configuration changes create new instances:
fn configure(config_hash: RHash) -> Result<(), Error> {
let config = parse_config_from_hash(config_hash)?;
let mut cache = DEFAULT_CACHE.lock()?;
cache.config = config;
cache.tokenizer = None; // Invalidate old tokenizer
Ok(())
}
Error Handling
Errors flow from Rust to Ruby with proper type conversion:
Rust Side
#[derive(Error, Debug)]
pub enum TokenizerError {
#[error("Invalid regex pattern '{pattern}': {error}")]
InvalidRegex { pattern: String, error: String },
#[error("Invalid n-gram configuration: min_gram ({min}) must be > 0 and <= max_gram ({max})")]
InvalidNgramConfig { min: usize, max: usize },
// ...
}
Conversion to Ruby
impl From<TokenizerError> for magnus::Error {
fn from(e: TokenizerError) -> Self {
match e {
TokenizerError::InvalidRegex { .. } => {
magnus::Error::new(regexp_error_class(), e.to_string())
}
TokenizerError::InvalidNgramConfig { .. } => {
magnus::Error::new(arg_error_class(), e.to_string())
}
// ...
}
}
}
Ruby Side
begin
TokenKit.configure do |c|
c.strategy = :pattern
c.regex = "[invalid("
end
rescue RegexpError => e
puts "Invalid regex: #{e.}"
end
Performance Optimizations
1. Cached Tokenizer Instances
Eliminated regex recompilation on every tokenization:
- 110x speedup for pattern-heavy workloads
- Tokenizer created once, reused many times
- Cache invalidated only on configuration change
2. Zero-Copy String Slicing
Where possible, we work with string slices:
let before = &original_text[pos..start]; // No allocation
3. Capacity Hints
Pre-allocate vectors with estimated sizes:
let mut result = Vec::with_capacity(tokens.len() + preserved_spans.len());
4. Lazy String Allocation
Store indices instead of strings until needed:
// Store only positions during pattern matching
let mut preserved_spans: Vec<(usize, usize)> = Vec::with_capacity(32);
// Allocate strings only when building result
result.push(original_text[start..end].to_string());
Magnus Bridge
Magnus provides seamless Ruby ↔ Rust interop:
Type Conversions
// Ruby String → Rust String
fn tokenize(text: String) -> Result<Vec<String>, Error>
// Ruby Hash → Rust Config
fn configure(config_hash: RHash) -> Result<(), Error>
// Rust Vec<String> → Ruby Array
Ok(tokenizer.tokenize(&text)) // Automatic conversion
Function Export
#[magnus::init]
fn init(_ruby: &magnus::Ruby) -> Result<(), Error> {
let module = define_module("TokenKit")?;
module.define_module_function("_tokenize", function!(tokenize, 1))?;
module.define_module_function("_configure", function!(configure, 1))?;
// ...
}
Memory Management
Magnus handles memory management automatically:
- Rust strings converted to Ruby strings
- Ruby GC manages Ruby objects
- Rust objects dropped when out of scope
Testing Strategy
Ruby Tests (spec/)
- Integration tests for public API
- Test all tokenization strategies
- Verify Ruby-specific behavior
- Edge cases and error conditions
Rust Tests
Currently, we rely on Ruby tests for coverage. Future work could add:
- Unit tests for tokenizer implementations
- Property-based testing for pattern preservation
- Benchmarks in Rust
Coverage
- 94.12% line coverage
- 87.8% branch coverage
- All tokenizers thoroughly tested
- Error paths verified
Build System
Compilation
rake compileinvokes cargo- Cargo builds with release optimizations:
toml [profile.release] lto = true # Link-time optimization codegen-units = 1 # Better optimization - Shared library copied to lib/
Cross-Platform
- macOS:
.bundleextension - Linux:
.soextension - Windows:
.dllextension
Magnus handles platform differences automatically.
Future Architecture Improvements
1. Parallel Tokenization
For very large texts, parallelize using Rayon:
use rayon::prelude::*;
large_text.par_lines()
.flat_map(|line| tokenizer.tokenize(line))
.collect()
2. Streaming API
For huge documents:
pub trait StreamingTokenizer {
fn tokenize_stream(&self, reader: impl BufRead) -> impl Iterator<Item = String>;
}
3. Custom Allocators
Investigate jemalloc or mimalloc for better performance:
[dependencies]
tikv-jemallocator = "0.5"
4. SIMD Optimizations
Use SIMD for character scanning:
use std::simd::*;
// Vectorized character matching
5. Regex Set Optimization
When multiple patterns, use regex::RegexSet:
use regex::RegexSet;
let patterns = RegexSet::new(&[...]).unwrap();
let matches: Vec<_> = patterns.matches(text).into_iter().collect();
Conclusion
TokenKit's architecture balances performance, safety, and usability through:
- Clean separation between Ruby API and Rust implementation
- Trait-based design for extensibility
- Intelligent caching for performance
- Comprehensive error handling
- Thread-safe by design
The hybrid Ruby/Rust approach provides the best of both worlds: Ruby's elegant API and Rust's performance.