Performance Guide

TokenKit is optimized for high-throughput tokenization with minimal memory overhead. This guide covers performance characteristics, optimization techniques, and best practices.

Performance Benchmarks

Baseline Performance

TokenKit can process ~100,000 documents per second for basic Unicode tokenization on modern hardware (Apple M-series, Intel i7+).

Tokenizer	Operations/sec	Relative Speed (vs. Unicode)	Use Case
Lowercase	870,000	1.0x (baseline)	Case normalization
Whitespace	850,000	1.02x slower	Simple splitting
Unicode	870,000	1.0x (baseline)	Recommended default
Letter	850,000	1.02x slower	Aggressive splitting
Pattern (simple)	500,000	1.74x slower	Custom patterns
URL/Email	400,000	2.17x slower	Web content
EdgeNgram	388,000	2.24x slower	Autocomplete
Ngram	350,000	2.49x slower	Fuzzy matching
CharGroup	400,000	2.17x slower	CSV parsing
PathHierarchy	300,000	2.90x slower	Path navigation
Pattern (complex)	24,000	36x slower	Complex regex
Grapheme	200,000	4.35x slower	Emoji handling
Sentence	150,000	5.80x slower	Sentence splitting
Keyword	1,000,000	1.15x faster (fastest)	No splitting

Pattern Preservation Impact

Pattern preservation adds overhead that grows with both the number and the complexity of the patterns:

Configuration	Ops/sec	Impact
No patterns	870,000	Baseline
1 simple pattern	600,000	-31%
4 patterns	409,000	-53%
10 complex patterns	150,000	-83%
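
The Relative Speed and Impact columns in the tables above follow directly from the ops/sec figures. As a small worked sketch in plain Ruby (the helper names are illustrative and not part of TokenKit), taking 870,000 ops/sec as the baseline:

```ruby
# Baseline throughput in operations per second (the Unicode row).
BASELINE_OPS = 870_000

# How many times slower than the baseline a tokenizer is.
# Values below 1.0 mean it is faster than the baseline.
def slowdown_factor(ops_per_sec, baseline: BASELINE_OPS)
  (baseline.to_f / ops_per_sec).round(2)
end

# Percentage change in throughput relative to the baseline (negative = slower).
def impact_percent(ops_per_sec, baseline: BASELINE_OPS)
  ((ops_per_sec - baseline) * 100.0 / baseline).round
end

puts slowdown_factor(500_000)  # Pattern (simple)    -> 1.74
puts slowdown_factor(24_000)   # Pattern (complex)   -> 36.25
puts impact_percent(409_000)   # 4 preserve patterns -> -53
```

The same two formulas can be applied to your own benchmark output to compare it against the figures in this guide.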

Optimization Techniques

1. Tokenizer Instance Caching (110x speedup)

Problem: Creating a new tokenizer and compiling regexes on every call.

Solution: Cache tokenizer instances and invalidate only on configuration changes.

// Before: Created fresh tokenizer every time
// (`Error` stands in for the crate's error type)
fn tokenize(config: &TokenizerConfig, text: &str) -> Result<Vec<String>, Error> {
    let tokenizer = from_config(config)?;  // Recompiled regexes on every call!
    Ok(tokenizer.tokenize(text))
}

// After: Cached tokenizer instance, shared behind a mutex
static DEFAULT_CACHE: Lazy<Mutex<TokenizerCache>> = Lazy::new(|| {
    Mutex::new(TokenizerCache {
        config: TokenizerConfig::default(),
        tokenizer: None,  // Built on first use, then reused until the config changes
    })
});

Impact:

With preserve patterns: 3,638 → 409,472 ops/sec (110x faster)
Without patterns: 500,000 → 870,000 ops/sec (1.74x faster)
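
The cache-and-invalidate idea can be sketched in plain Ruby. This is only an illustration of the technique, not TokenKit's internals: the `CachedTokenizer` class, its `compile_count` counter, and the fallback `/\w+/` token pattern are all hypothetical.

```ruby
# Rebuilds the compiled matcher only when the configuration changes.
class CachedTokenizer
  attr_reader :compile_count

  def initialize
    @sources = nil        # configuration the cached matcher was built from
    @matcher = nil        # compiled Regexp, reused across calls
    @compile_count = 0
  end

  # preserve_sources: array of regex source strings, e.g. ['\d+\.\d+']
  def tokenize(text, preserve_sources)
    if @matcher.nil? || @sources != preserve_sources
      # Cache miss: compile once and remember the configuration used.
      preserved = preserve_sources.map { |src| Regexp.new(src) }
      @matcher = Regexp.union(*preserved, /\w+/)
      @sources = preserve_sources.dup
      @compile_count += 1
    end
    text.scan(@matcher)
  end
end

tokenizer = CachedTokenizer.new
3.times { tokenizer.tokenize("pi is 3.14", ['\d+\.\d+']) }
puts tokenizer.compile_count           # 1: compiled once, reused twice
tokenizer.tokenize("pi is 3.14", [])   # configuration changed: rebuild
puts tokenizer.compile_count           # 2
```

The comparison against the stored configuration is what keeps the cache correct: any change forces a rebuild, while identical calls skip compilation entirely.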

2. Reduced String Allocations (20-30% improvement)

Problem: Creating intermediate string copies during pattern preservation.

Solution: Work with indices, allocate strings only when needed.

// Before: Stored strings eagerly
let mut preserved_spans: Vec<(usize, usize, String)> = Vec::new();
for mat in pattern.find_iter(text) {
    preserved_spans.push((mat.start(), mat.end(), mat.as_str().to_string()));
}

// After: Store indices, extract strings lazily
let mut preserved_spans: Vec<(usize, usize)> = Vec::with_capacity(32);
for mat in pattern.find_iter(text) {
    preserved_spans.push((mat.start(), mat.end()));
}
// Extract each string only when building the final result
for &(start, end) in &preserved_spans {
    result.push(text[start..end].to_string());
}

3. In-Place Post-Processing

Problem: Creating new vectors for lowercase and punctuation removal.

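Solution: Update the tokens in place instead of collecting into a second vector. Each token is still replaced by a newly lowercased string; the saving is the extra vector.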

// Before: Created new vector
tokens = tokens.into_iter().map(|t| t.to_lowercase()).collect();

// After: Modify in place
for token in tokens.iter_mut() {
    *token = token.to_lowercase();
}

4. Pre-Allocated Vectors

Problem: Dynamic vector growth causes reallocations.

Solution: Pre-allocate with estimated capacity.

// Estimate result size
let mut result = Vec::with_capacity(tokens.len() + preserved_spans.len());

5. Optimized Sorting

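Problem: A stable sort preserves the relative order of equal elements, which span sorting does not need, and Rust's stable sort_by allocates temporary storage to do so.

Solution: Use sort_unstable_by, which sorts in place without that extra allocation.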

// Before
spans.sort_by(|a, b| a.0.cmp(&b.0));

// After
spans.sort_unstable_by(|a, b| a.0.cmp(&b.0));

Running Benchmarks

TokenKit includes comprehensive benchmarks to measure performance:

# Install benchmark gems
bundle add benchmark-ips benchmark-memory

# Run all benchmarks
ruby benchmarks/tokenizer_benchmark.rb

# Run specific benchmark suites
ruby benchmarks/tokenizer_benchmark.rb tokenizers  # Strategy comparison
ruby benchmarks/tokenizer_benchmark.rb config      # Configuration impact
ruby benchmarks/tokenizer_benchmark.rb size        # Text size scaling
ruby benchmarks/tokenizer_benchmark.rb memory      # Memory usage

Creating Custom Benchmarks

require 'benchmark/ips'
require 'tokenkit'

text = "Your sample text here"

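# Build tokenizers once, outside the timed blocks, so the benchmark measures
# tokenization rather than reconfiguration (which invalidates the cache).
# Assumes Tokenizer.new accepts the same options as TokenKit.configure.
unicode = TokenKit::Tokenizer.new(strategy: :unicode)
pattern = TokenKit::Tokenizer.new(strategy: :pattern, regex: /\w+/)

Benchmark.ips do |x|
  x.config(time: 5, warmup: 2)

  x.report("Unicode") { unicode.tokenize(text) }
  x.report("Pattern") { pattern.tokenize(text) }

  x.compare!
end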

Performance Best Practices

1. Choose the Right Tokenizer

Default to Unicode: Best balance of correctness and performance
Use Whitespace: When you know text is already well-formatted
Avoid Complex Patterns: Each regex pattern has compilation and matching overhead

2. Minimize Pattern Preservation

# Bad: Many overlapping patterns
config.preserve_patterns = [
  /\d+/,
  /\d+mg/,
  /\d+ug/,
  /\d+ml/
]

# Good: Single comprehensive pattern
# (the unit is optional, so bare numbers matched by /\d+/ above are still preserved)
config.preserve_patterns = [
  /\d+(?:mg|ug|ml)?/
]
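
A merged pattern is only a drop-in replacement if it still matches everything the separate patterns did. A quick check in plain Ruby, using only the standard `String#scan` (the sample text is made up for illustration):

```ruby
text = "Take 500mg twice, 20 drops each"

# Requires a unit suffix, so bare numbers are no longer matched.
with_unit = /\d+(?:mg|ug|ml)/
# Unit suffix is optional, so bare numbers are still matched.
unit_optional = /\d+(?:mg|ug|ml)?/

p text.scan(with_unit)      # ["500mg"]
p text.scan(unit_optional)  # ["500mg", "20"]
```

The non-capturing group `(?:...)` matters here: with a capturing group, `scan` returns the captured unit rather than the whole match.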

3. Reuse Tokenizer Instances

# Good: Configure once, use many times
TokenKit.configure do |config|
  config.strategy = :unicode
  config.preserve_patterns = [...]
end

documents.each do |doc|
  tokens = TokenKit.tokenize(doc)  # Uses cached instance
end

# Avoid: Reconfiguring repeatedly
documents.each do |doc|
  TokenKit.configure { |c| c.strategy = :unicode }  # Invalidates cache!
  tokens = TokenKit.tokenize(doc)
end

4. Use Instance API for Bulk Processing

# For bulk processing with different configurations
tokenizer = TokenKit::Tokenizer.new(
  strategy: :unicode,
  preserve_patterns: [...]
)

# Reuse the same instance
documents.map { |doc| tokenizer.tokenize(doc) }

5. Consider Memory vs Speed Tradeoffs

N-gram tokenizers: Generate many tokens, higher memory usage
Pattern preservation: Increases memory for regex storage
Remove punctuation: Reduces token count, saves memory

Thread Safety and Concurrency

TokenKit is thread-safe and can be used in concurrent environments:

# Safe: Each thread uses the global cached tokenizer
threads = 10.times.map do
  Thread.new do
    100.times do
      TokenKit.tokenize("some text")
    end
  end
end
threads.each(&:join)

Performance in concurrent environments:

Single-threaded: ~870k ops/sec
Multi-threaded (10 threads): ~850k ops/sec (minimal overhead)
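
To reproduce this kind of comparison on your own hardware, a minimal harness along the following lines may help. It is a sketch, not part of TokenKit: `throughput` is a hypothetical helper, and `TOKENIZE` is a stand-in lambda so the example runs without the gem; swap in `TokenKit.tokenize` to measure the real thing.

```ruby
require 'benchmark'

# Stand-in for TokenKit.tokenize so the sketch runs without the gem installed.
TOKENIZE = ->(text) { text.split }

# Splits total_calls evenly across thread_count threads and returns
# the aggregate throughput in calls per second.
def throughput(thread_count:, total_calls:, &work)
  per_thread = total_calls / thread_count
  elapsed = Benchmark.realtime do
    threads = Array.new(thread_count) do
      Thread.new { per_thread.times { work.call } }
    end
    threads.each(&:join)
  end
  (per_thread * thread_count) / elapsed
end

single = throughput(thread_count: 1, total_calls: 100_000) { TOKENIZE.call("some text here") }
multi = throughput(thread_count: 10, total_calls: 100_000) { TOKENIZE.call("some text here") }
puts format("1 thread: %.0f ops/sec, 10 threads: %.0f ops/sec", single, multi)
```

On CRuby the global VM lock generally keeps Ruby code from running in parallel, so similar single- and multi-threaded totals are the expected outcome unless a native extension releases the lock; this guide does not state whether TokenKit's does.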

Memory Usage

Memory usage varies by tokenizer and options:

Configuration	Memory/Operation	Notes
Basic Unicode	~500 bytes	Minimal overhead
With preserve patterns	~1-2 KB	Regex storage
EdgeNgram (max=10)	~2-5 KB	Multiple tokens generated
Ngram (min=2, max=3)	~3-8 KB	Many substring tokens

Memory Profiling

require 'benchmark/memory'
require 'tokenkit'

# Configure once, outside the measured block, so setup is not counted
TokenKit.configure { |c| c.strategy = :unicode }

Benchmark.memory do |x|
  x.report("Unicode") do
    100.times { TokenKit.tokenize("sample text") }
  end

  x.compare!
end
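
If you prefer not to add a gem, Ruby's built-in `GC.stat` gives a rough count of Ruby-level object allocations. The sketch below uses a hypothetical `allocated_objects` helper and a stand-in lambda in place of `TokenKit.tokenize`; note that it only sees Ruby objects, not memory allocated inside a native extension.

```ruby
# Counts Ruby objects allocated while the block runs.
def allocated_objects
  GC.disable                                  # avoid a GC run mid-measurement
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

tokenize = ->(text) { text.split }            # stand-in tokenizer

count = allocated_objects { 100.times { tokenize.call("sample text") } }
puts "#{count} objects allocated for 100 calls"
```

Dividing the count by the number of calls gives an approximate objects-per-operation figure that can be tracked across configurations.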

Compilation Optimizations

The Rust extension is compiled with aggressive optimizations:

[profile.release]
lto = true           # Link-time optimization
codegen-units = 1    # Single codegen unit for better optimization

These settings increase compile time but improve runtime performance by ~15-20%.

Platform-Specific Notes

macOS (Apple Silicon)

Best performance on M1/M2/M3 chips
Native ARM64 compilation
~10-15% faster than Intel Macs

Linux

Consistent performance across distributions
Ensure Rust toolchain is up-to-date
Consider using jemalloc for better memory allocation

Windows

Slightly slower file I/O may affect benchmarks
Use native Windows paths for PathHierarchy tokenizer

Troubleshooting Performance Issues

Slow Tokenization

Check pattern complexity:
```
puts TokenKit.config.preserve_patterns
```

Verify caching is working:

# This should be fast after first call
1000.times { TokenKit.tokenize("test") }

Profile your patterns:
```
require 'benchmark'

patterns = [/pattern1/, /pattern2/, ...]
text = "your text"

patterns.each do |pattern|
  time = Benchmark.realtime do
    1000.times { pattern.match(text) }
  end
  puts "#{pattern}: #{time}s"
end
```


High Memory Usage

Reduce n-gram sizes:

config.max_gram = 5  # Instead of 10

Limit preserve patterns:

# Only essential patterns
config.preserve_patterns = [/critical_pattern/]

Use streaming for large documents:

# Process in chunks
text.each_line do |line|
  tokens = TokenKit.tokenize(line)
  process_tokens(tokens)
end
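
When the text comes from a file, reading it line by line avoids holding the whole document in memory at once. A minimal sketch using the standard `File.foreach`, with a hypothetical `count_tokens` helper and a stand-in lambda in place of `TokenKit.tokenize`:

```ruby
require 'tempfile'

tokenize = ->(line) { line.split }   # stand-in for TokenKit.tokenize

# Returns the total token count without loading the whole file into memory.
def count_tokens(path, tokenizer)
  total = 0
  File.foreach(path) do |line|       # yields one line at a time
    total += tokenizer.call(line).size
  end
  total
end

# Demonstrate on a small temporary file.
Tempfile.create('corpus') do |file|
  file.write("first line of text\nsecond line\n")
  file.flush
  puts count_tokens(file.path, tokenize)   # 6
end
```

Only one line and its tokens are alive at a time, so peak memory depends on the longest line rather than the file size.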

Future Optimizations

Planned performance improvements:

SIMD vectorization for character scanning
Parallel tokenization for very large texts
Lazy pattern compilation for rarely-used patterns
Memory pooling for reduced allocations
Regex set optimization for multiple patterns

Summary

TokenKit achieves high performance through:

Intelligent caching: 110x speedup for pattern-heavy workloads
Minimal allocations: 20-30% throughput improvement
Optimized algorithms: Using efficient Rust implementations
Smart defaults: Unicode tokenizer balances speed and correctness

For most use cases, the default Unicode tokenizer with minimal preserve patterns provides the best performance. Configure once at application startup and let TokenKit's caching handle the rest.