Status
Kotoshu is v0.3.0 — building on the 0.2 cut, this release adds a strict
two-stage resource model (explicit setup, cache-only hot path), XDG base
directory layout, SHA-256 integrity verification, SARIF output, an
--interactive review loop, and ONNX model pipeline wiring.
What works in 0.3
-
Two-stage resource model —
Kotoshu.setup(:en) then Kotoshu.correct?("hello"). The hot path is cache-only and raises ResourceNotSetupError on miss; downloads are never implicit. -
Kotoshu.check(text, language: "en")/Kotoshu.suggest("helo")— full document check and suggestions -
Kotoshu.spellchecker_for(lang, strict: true)— re-raise on optional-resource failures -
kotoshu check FILE CLI with these flags: -
--language en|de|es|fr|pt|ru|auto(default:auto) -
--format text|json|sarif -
--offline— use only cached resources, never download -
--strict— exit 3 if any optional resource (frequency, model) can’t load -
--interactive— review each error after the check -
--verbose
-
-
kotoshu setup LANGUAGE [LANGUAGE …] — pre-warm spelling + frequency + ONNX caches for offline use (fetch is kept as a hidden deprecated alias) -
Local-source setup:
kotoshu setup en --aff path/to.en.aff --dic path/to.en.dic or kotoshu setup en --from /path/to/dict/dir/ -
Exit codes:
0 clean, 1 errors found, 2 usage error, 3 resource setup failed -
SHA-256 integrity verification (manifest-based, with graceful degradation when manifest is absent)
-
Offline mode via
KOTOSHU_OFFLINE=1 or --offline -
XDG base directory layout — caches in
$XDG_CACHE_HOME/kotoshu/, config in $XDG_CONFIG_HOME/kotoshu/, data in $XDG_DATA_HOME/kotoshu/ (overridable via KOTOSHU_CACHE_PATH, KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH)
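The directory resolution in the last bullet can be sketched roughly as follows. This is an illustration, not Kotoshu's code: the helper name `resolve_dirs` and the precedence (KOTOSHU_* override, then the XDG variable, then the XDG default) are assumptions, and the `~/.config` and `~/.local/share` fallbacks come from the XDG Base Directory specification, whose data variable is `XDG_DATA_HOME`.

```ruby
# Hypothetical helper: resolves Kotoshu-style directories from an env hash.
# Assumed precedence: KOTOSHU_*_PATH > XDG_*_HOME/kotoshu > XDG default/kotoshu.
def resolve_dirs(env = ENV, home: Dir.home)
  specs = {
    cache:  ["KOTOSHU_CACHE_PATH",  "XDG_CACHE_HOME",  File.join(home, ".cache")],
    config: ["KOTOSHU_CONFIG_PATH", "XDG_CONFIG_HOME", File.join(home, ".config")],
    data:   ["KOTOSHU_DATA_PATH",   "XDG_DATA_HOME",   File.join(home, ".local", "share")]
  }
  specs.transform_values do |(override_var, xdg_var, default_base)|
    override = env[override_var].to_s
    next override unless override.empty? # explicit override wins as-is

    xdg = env[xdg_var].to_s
    File.join(xdg.empty? ? default_base : xdg, "kotoshu")
  end
end

dirs = resolve_dirs({ "XDG_CACHE_HOME" => "/tmp/xdg" }, home: "/home/me")
dirs[:cache]  # => "/tmp/xdg/kotoshu"
dirs[:config] # => "/home/me/.config/kotoshu"
```

Passing the environment as a plain hash keeps the function easy to test without touching the real `ENV`.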
Planned for 0.4+
-
--output(file output redirection) -
ONNX semantic reranking as default path
-
≥30 language modules wired
-
Grammar rule packs
-
CJK and RTL language support
See the 0.2 cut plan, the 0.3 tasks under
TODO.impl/, and the vision for the path to 1.0.
Purpose
Kotoshu 「言修」 is a pure-Ruby spell checker that aims to work for every language by dynamically downloading the right combination of dictionary, frequency data, and embedding model on demand.
The current release pairs a Ruby port of the Hunspell algorithm (traditional morphological lookup + affix rules) with optional FastText word embeddings converted to ONNX for context-aware reranking.
|
Note
|
The semantic (ONNX) path is an optional feature. gem install kotoshu
works without onnxruntime; install it separately (gem install onnxruntime)
to enable context-aware reranking. Set KOTOSHU_NO_ONNX=1 to opt back out.
|
Features
|
Note
|
The list below describes the design vision. See [status] for exactly what works in 0.3 and what is planned for 0.4+. |
-
Semantic error detection using word embeddings (opt-in via Ruby API in 0.2)
-
Interactive review mode with full navigation (shipped in 0.3; navigation-only for now)
-
Batch processing for CI/CD (JSON since 0.2; SARIF added in 0.3)
-
Multiple analysis models (Hunspell, FastText, Hybrid) (Hunspell path only in 0.2)
Architecture
Kotoshu is organized into three layers (interface, analysis, and model):
╔═══════════════════════════════════════════════════════════════════╗
║ Kotoshu Semantic Architecture ║
╠═══════════════════════════════════════════════════════════════════╣
║ ║
║ ┌─────────────────────────────────────────────────────────────┐ ║
║ │ Interface Layer │ ║
║ │ ┌─────────────────────┐ ┌─────────────────────────────┐ │ ║
║ │ │ CLI (Thor) │ │ Ruby API │ │ ║
║ │ │ lib/kotoshu/cli/ │ │ Kotoshu module methods │ │ ║
║ │ └──────────┬──────────┘ └───────────┬─────────────────┘ │ ║
║ │ │ Auto Language Detect │ │ ║
║ └─────────────┼──────────────────────────┼─────────────────────┘ ║
║ │ │ ║
║ ▼ ▼ ║
║ ┌─────────────────────────────────────────────────────────────┐ ║
║ │ Analysis Layer │ ║
║ │ ┌──────────────┐ ┌─────────────┐ ┌───────────────────┐ │ ║
║ │ │ Hunspell │ │ FastText │ │ Hybrid (Best!) │ │ ║
║ │ │ Dictionary │ │ Embeddings │ │ Combined │ │ ║
║ │ │ (Traditional)│ │ (ONNX) │ │ Approach │ │ ║
║ │ └──────────────┘ └─────────────┘ └───────────────────┘ │ ║
║ └───────────────────────────┬─────────────────────────────────┘ ║
║ │ ║
║ ┌───────────────────────────▼─────────────────────────────────┐ ║
║ │ Model Layer (ONNX) │ ║
║ │ ┌──────────────────────────────────────────────────────┐ │ ║
║ │ │ ONNX Runtime → Fast Embedding Lookup │ │ ║
║ │ │ Semantic Similarity → Context-Aware Suggestions │ │ ║
║ │ │ Nearest Neighbor Search → Smart Corrections │ │ ║
║ │ └──────────────────────────────────────────────────────┘ │ ║
║ └─────────────────────────────────────────────────────────────┘ ║
║ ║
╚═══════════════════════════════════════════════════════════════════╝
Key Components
-
Kotoshu::Models::OnnxModel: ONNX-based word embedding model for fast semantic similarity and nearest neighbor search. -
Kotoshu::Analyzers::SemanticAnalyzer: Unified semantic error detection using word embeddings (no artificial spelling/grammar split). -
Kotoshu::Language::LanguageIdentifier: Automatic language detection using FastText LID model (176 languages). -
Kotoshu::Cli::InteractiveReviewer: Interactive CLI for error review with full navigation (forward, backward, jump, skip, accept). -
Kotoshu::Dictionary::Hunspell: Traditional Hunspell dictionary backend for morphological analysis and affix rules.
Why ONNX?
ONNX Runtime provides:
-
Performance: C++ implementation, 10-100x faster than pure Ruby
-
Portability: Works on CPU, GPU, TPU, mobile devices
-
Optimization: Automatic graph optimization and quantization
-
Interoperability: Models can be trained in Python, deployed in Ruby
Kotoshu uses FastText models converted to ONNX format for semantic spell checking.
Semantic Analysis
Unlike traditional spell checkers that only check dictionary membership and edit distance, Kotoshu uses semantic similarity to:
-
Detect contextually appropriate corrections ("desert" vs "dessert")
-
Handle out-of-vocabulary words via subword embeddings
-
Provide ranked suggestions based on semantic similarity
-
Support compound words and morphological variations
Kotoshu.setup(:en, want: %i[spelling model]) # one-time per language
# Traditional: knows "helo" is wrong and lists edit-distance candidates
Kotoshu.suggest("helo").to_words
# => ["hello", "help", "held", "hell", "hole"]
# Semantic: reranks candidates by context similarity
model = Kotoshu::Models::OnnxModel.from_github("en")
analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
analyzer.suggest_corrections("helo", context: "I said helo to the world").map(&:word)
# => ["hello"] # "hello" makes more sense in greeting context
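The reranking step can be illustrated without ONNX at all. The sketch below is a toy under stated assumptions: the 3-dimensional vectors are made up (real FastText vectors are 300-dimensional), and `cosine`, `context_vector`, and `rerank` are hypothetical names, not Kotoshu's API. It averages the context words' vectors and sorts candidates by cosine similarity to that average.

```ruby
# Toy embeddings (hypothetical values chosen for illustration only).
EMBEDDINGS = {
  "hello" => [0.9, 0.1, 0.0],
  "help"  => [0.1, 0.9, 0.1],
  "hole"  => [0.0, 0.2, 0.9],
  "said"  => [0.8, 0.2, 0.1],
  "world" => [0.7, 0.1, 0.3]
}.freeze

def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.(a) * norm.(b))
end

# Mean vector of the context words that have an embedding.
def context_vector(words)
  known = words.filter_map { |w| EMBEDDINGS[w] }
  return nil if known.empty?

  known.transpose.map { |column| column.sum / known.size }
end

# Most similar candidate first; candidates without a vector sort last.
def rerank(candidates, context_words)
  ctx = context_vector(context_words)
  return candidates if ctx.nil?

  candidates.sort_by { |c| EMBEDDINGS.key?(c) ? -cosine(EMBEDDINGS[c], ctx) : 1.0 }
end

rerank(%w[help hole hello], %w[i said to the world])
# => ["hello", "help", "hole"]
```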
|
Note
|
The semantic path requires the optional onnxruntime gem. See
Requirements.
|
Multi-Language Support
Kotoshu supports 6 languages with full semantic analysis:
-
de - German (Deutsch)
-
en - English
-
es - Spanish (Español)
-
fr - French (Français)
-
pt - Portuguese (Português)
-
ru - Russian (Русский)
Automatic language detection is enabled by default:
# Language auto-detected from document content
kotoshu check document.txt
# Detected: en (95% confidence)
# Analyzing document.txt (language: en)...
# Explicit language specification
kotoshu check document.txt --language de
ONNX Models
Kotoshu uses FastText crawl vectors converted to ONNX format:
-
Source: FastText Crawl Vectors
-
Format: ONNX with optimized runtime
-
Vocabulary: 2 million words per language (full coverage)
-
Dimension: 300-dimensional word vectors
-
Size: ~2.4GB per language
FastText File Formats
FastText provides two file formats. Kotoshu uses the .vec format for ONNX conversion.
| Aspect | .vec (Text) | .bin (Binary) |
|---|---|---|
| Content | Word vectors only (pre-computed embeddings) | Full FastText model (trained model) |
| Structure | Text: one word + 300 floats per line | Binary: complete model with matrices |
| File Size | ~1.3GB compressed (~2.4GB uncompressed) | ~1.8GB compressed (~4.8GB uncompressed) |
| Train New Words | ✗ No (static lookup only) | ✓ Yes (can train/OOV with subword info) |
| Subword Embeddings | ✗ No | ✓ Yes (n-gram character embeddings) |
| ONNX Converter | ✓ Supported (what we use) | ✗ Not supported |
| Use Case | Simple word vector lookup for spell checking | Full FastText functionality (training, OOV) |
Kotoshu uses .vec files because:
-
Simpler extraction: Just word → vector mapping
-
No subword complexity needed: Dictionary-based spell checking doesn’t require OOV generation
-
Smaller ONNX models: ~2.4GB vs ~4.8GB
-
Faster conversion: Direct serialization to ONNX
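To make the "word → vector mapping" concrete, here is a small reader for the .vec text layout: a header line with the entry count and dimension, then one word followed by its floats per line. The function name `parse_vec` and the tiny 3-dimensional sample are illustrative assumptions; this is not the upstream converter.

```ruby
require "stringio"

# Parses FastText .vec text: a "<count> <dim>" header, then one
# "word v1 v2 ... vN" line per entry. Returns { word => [Float, ...] }.
def parse_vec(io)
  count, dim = io.readline.split.map { |n| Integer(n) }
  vectors = {}
  io.each_line do |line|
    word, *values = line.chomp.split(" ")
    next if word.nil? # skip blank lines
    raise ArgumentError, "expected #{dim} values for #{word}" unless values.size == dim

    vectors[word] = values.map { |v| Float(v) }
  end
  raise ArgumentError, "header declared #{count} words, found #{vectors.size}" unless vectors.size == count

  vectors
end

sample = StringIO.new("2 3\nhello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")
vectors = parse_vec(sample)
vectors["world"] # => [-0.4, 0.5, 0.6]
```

For a real 2-million-word file you would stream into a packed float buffer rather than a Ruby hash of arrays; the hash is used here only for readability.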
# Set up a language with spelling + ONNX semantic model
kotoshu setup en --want spelling,model
# List what's set up in the cache
kotoshu setup --list
# Re-validate cached resources
kotoshu cache validate
|
Note
|
FastText .vec → ONNX conversion is done upstream in the
kotoshu/models-fasttext-onnx
repo. The CLI downloads pre-converted artifacts; users do not run
conversion locally.
|
Interactive Mode
|
Note
|
Interactive mode shipped in 0.3.0. It is navigation-only — the session records which suggestions the user accepted but does not rewrite the source file yet. |
kotoshu check README.md --interactive
Features in 0.3:
-
Navigate: [n] / Enter next, [p] previous, [l] list
-
Accept: [1-9] record suggestion N for the current error
-
Skip: [s] skip the current error
-
Quit: [q] exit the review loop
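The key bindings above amount to a small state machine. The sketch below is one possible shape for it, not the code behind Kotoshu::Cli::InteractiveReviewer: `ReviewState`, `step`, and the error hashes are hypothetical, and the `[l]` list key is omitted because it only prints. As in 0.3, accepting a suggestion records the choice and leaves the source file alone.

```ruby
# Hypothetical review session state; `errors` is an array of hashes
# with :word and :suggestions.
ReviewState = Struct.new(:index, :accepted, :skipped, :done)

def step(state, errors, key)
  last = errors.size - 1
  case key
  when "n", "" # "" stands for a bare Enter
    state.index = [state.index + 1, last].min
  when "p"
    state.index = [state.index - 1, 0].max
  when "s"
    state.skipped << state.index
    state.index = [state.index + 1, last].min
  when "q"
    state.done = true
  when /\A[1-9]\z/
    choice = errors[state.index][:suggestions][key.to_i - 1]
    state.accepted[state.index] = choice if choice # ignore out-of-range picks
  end
  state
end

errors = [
  { word: "helo", suggestions: %w[hello help] },
  { word: "wrld", suggestions: %w[world] }
]
state = ReviewState.new(0, {}, [], false)
%w[1 n 1 q].each { |key| step(state, errors, key) }
# state.accepted now maps error 0 to "hello" and error 1 to "world"
```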
Batch Processing
For CI/CD and automation, Kotoshu supports JSON and SARIF output in 0.3;
--output file redirection is planned for 0.4+.
# JSON output to stdout (supported in 0.3)
kotoshu check README.md --format json
# SARIF 2.1.0 output (supported in 0.3)
kotoshu check README.md --format sarif
# Exit code for CI
kotoshu check README.md
echo $? # 0 if no errors, 1 if errors found
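A CI step might post-process the SARIF output along these lines. The payload below is a hand-written SARIF 2.1.0 sample, not captured Kotoshu output: the rule id, message text, and positions are assumptions, and `sarif_findings` is a hypothetical helper. Only the standard `runs[].results[].locations[]` layout from the SARIF specification is relied on.

```ruby
require "json"

# Hypothetical SARIF 2.1.0 payload (illustrative values only).
sarif = JSON.parse(<<~JSON)
  {
    "version": "2.1.0",
    "runs": [{
      "tool": { "driver": { "name": "kotoshu" } },
      "results": [{
        "ruleId": "spelling",
        "message": { "text": "Unknown word: helo" },
        "locations": [{
          "physicalLocation": {
            "artifactLocation": { "uri": "README.md" },
            "region": { "startLine": 3, "startColumn": 7 }
          }
        }]
      }]
    }]
  }
JSON

# Flatten every result into a "file:line:column: message" string.
def sarif_findings(doc)
  doc.fetch("runs", []).flat_map do |run|
    run.fetch("results", []).map do |result|
      loc = result.dig("locations", 0, "physicalLocation") || {}
      region = loc["region"] || {}
      uri = loc.dig("artifactLocation", "uri")
      "#{uri}:#{region['startLine']}:#{region['startColumn']}: #{result.dig('message', 'text')}"
    end
  end
end

findings = sarif_findings(sarif)
puts findings # prints "README.md:3:7: Unknown word: helo"
exit_code = findings.empty? ? 0 : 1 # mirrors the CLI's 0/1 convention
```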
Document Formats
Kotoshu supports structured documents with AST parsing:
-
Plain text: Line-based error detection
-
Markdown: AST-based using Kramdown parser
-
AsciiDoc: AST-based using Asciidoctor parser
Structured documents preserve node paths for precise error location.
Analysis Models
|
Note
|
In 0.2, the CLI runs the Hunspell traditional path only.
The --model flag and FastText/Hybrid paths are planned for 0.3+.
The Ruby API can opt into the semantic path today via
Kotoshu::Models::OnnxModel (auto-available when onnxruntime is installed).
|
Kotoshu is designed to support three analysis models:
| Model | Description | Best For |
|---|---|---|
| hunspell | Traditional dictionary-based with morphological rules | Fast checking, compound words, languages with complex morphology |
| fasttext | Pure semantic embeddings via ONNX | Context awareness, out-of-vocabulary words, semantic similarity |
| hybrid | Hunspell candidates + FastText reranking (recommended) | Maximum accuracy, best of both worlds |
# Fast dictionary-based checking (default in 0.2)
kotoshu check document.txt # 0.2: Hunspell path
# Semantic / hybrid paths: planned for 0.3
# kotoshu check document.txt --model fasttext
# kotoshu check document.txt --model hybrid
Installation
Add this line to your application’s Gemfile:
gem 'kotoshu'
And then execute:
bundle install
Or install it yourself as:
gem install kotoshu
|
Note
|
onnxruntime is an optional dependency. Install it separately
(gem install onnxruntime) to enable semantic analysis; the
traditional Hunspell path works without it.
|
Quick Start
# One-time per language: download spelling dictionary from
# github.com/kotoshu/dictionaries (idempotent, ~5 MB)
kotoshu setup en
# Then check files instantly, cache-only
kotoshu check README.md
Or skip the explicit setup — the CLI will prompt interactively the
first time you check a file in a non-cached language (TTY only; in
non-TTY or KOTOSHU_OFFLINE=1 mode it exits with code 3).
# Check a file (uses --language, or auto-detects from content)
kotoshu check README.md
# Explicit language
kotoshu check README.md --language en
# JSON output for programmatic use
kotoshu check README.md --format json
# Offline mode — use only cached dictionaries, never download
kotoshu check README.md --offline
# Check stdin
echo "helo wrld" | kotoshu check
Exit codes: 0 (no errors), 1 (errors found), 2 (usage error),
3 (language not set up — run kotoshu setup LANG, or run kotoshu check
in a TTY to be prompted).
require 'kotoshu'
# Stage 1: set up the language once (downloads from github.com/kotoshu/dictionaries)
Kotoshu.setup(:en)
# Stage 2: hot-path checks are cache-only and never touch the network
Kotoshu.correct?("hello") # => true
Kotoshu.correct?("helo") # => false
# Suggestions return a SuggestionSet; call #to_words for an Array
Kotoshu.suggest("helo").to_words # => ["hello", "help", "held", ...]
# Check a document
result = Kotoshu.check("Hello wrold")
result.errors.map(&:word) # => ["wrold"]
# Each error carries position + suggestions
result = Kotoshu.check_file("README.md")
result.errors.each do |error|
puts "#{error.word} at offset #{error.position}: #{error.top_suggestions(3).join(', ')}"
end
# Semantic analysis is optional — requires the onnxruntime gem
# (gem install onnxruntime). Skip this block if you only want Hunspell.
if Kotoshu::Models::OnnxModel::ONNX_LOADED
Kotoshu.setup(:en, want: %i[spelling model])
model = Kotoshu::Models::OnnxModel.from_github('en')
analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
analyzer.analyze(Kotoshu.check("Hello wrold"))
end
|
Note
|
The library API is strict: calls like Kotoshu.correct? raise
Kotoshu::ResourceNotSetupError until you’ve run Kotoshu.setup. This
prevents surprise downloads on metered networks. The CLI (kotoshu check)
intercepts the error and prompts to download interactively.
|
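The strict two-stage contract is easy to picture with a stand-in class. Everything below is hypothetical (`TinyChecker`, its in-memory "cache", and the locally defined error class); it only mirrors the behaviour described above, where lookups never fetch and a miss raises.

```ruby
require "set"

# Stand-in for Kotoshu::ResourceNotSetupError, defined locally for the sketch.
class ResourceNotSetupError < StandardError; end

class TinyChecker
  def initialize
    @dictionaries = {} # language => Set of known words (plays the cache)
  end

  # Stage 1: explicit setup. Here the "download" is just a word list.
  def setup(language, words)
    @dictionaries[language.to_sym] = Set.new(words)
    self
  end

  # Stage 2: cache-only lookup. A miss raises instead of fetching anything.
  def correct?(word, language: :en)
    dictionary = @dictionaries.fetch(language.to_sym) do
      raise ResourceNotSetupError, "run setup(:#{language}) first"
    end
    dictionary.include?(word.downcase)
  end
end

checker = TinyChecker.new
begin
  checker.correct?("hello")
rescue ResourceNotSetupError => e
  e.message # => "run setup(:en) first"
end

checker.setup(:en, %w[hello world])
checker.correct?("hello") # => true
checker.correct?("helo")  # => false
```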
Requirements
-
Ruby 3.1+
-
onnxruntime gem (optional — enables semantic spell checking; install separately with gem install onnxruntime) -
Python 3 + fasttext (optional, only if you want to convert
.vec → .onnx upstream)
Resource Caching and Language Support
Kotoshu manages three kinds of cached resources: dictionaries,
frequency lists, and embedding models. Resources are downloaded explicitly via
Kotoshu.setup (or kotoshu setup) and cached under the XDG base directory
layout (~/.cache/kotoshu/ by default; override via KOTOSHU_CACHE_PATH,
KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH, or the XDG_*_HOME vars).
Cache Architecture
┌────────────────────────────────────────────────────────────────────────────┐
│ BaseCache (Abstract) │
│ ┌────────────────────────────────────────────────────────────────────┐ │
│ │ Common: download, metadata, validation, stats, TTL management │ │
│ └────────────────────────────────────────────────────────────────────┘ │
└────────────────────┬───────────────────┬────────────────────┬──────────────┘
│ │ │
┌────────────▼────────┐ ┌──────▼──────┐ ┌───────▼─────────┐
│ LanguageCache │ │ModelCache │ │ FrequencyCache │
│ (Dictionaries) │ │ (Embeddings)│ │ (Kelly Lists) │
└─────────────────────┘ └─────────────┘ └─────────────────┘
│ │ │
┌────────────▼────────┐ ┌──────▼──────┐ ┌───────▼─────────┐
│ ~/.cache/kotoshu/ │ │~/.cache/ │ │ ~/.cache/kotoshu/│
│ languages/ │ │ kotoshu/ │ │frequency-lists/ │
│ │ │ models/ │ │ │
└─────────────────────┘ └─────────────┘ └─────────────────┘
Cache Types
LanguageCache (Dictionaries)
Manages Hunspell dictionaries and grammar rules for spell checking.
-
Cache Path:
~/.cache/kotoshu/languages/{code}/ -
TTL: 7 days (604,800 seconds)
-
Source: kotoshu/dictionaries
-
Resources per language:
-
spelling/: Hunspell dictionary (index.dic, index.aff) -
grammar/: Grammar rules (rules.yaml), future -
frequency/: Frequency data, deprecated; use FrequencyCache
-
# Access via cache
cache = Kotoshu::Cache::LanguageCache.new
dict = cache.get_spelling('en')
# Result:
# {
# aff_path: "~/.cache/kotoshu/languages/en/spelling/index.aff",
# dic_path: "~/.cache/kotoshu/languages/en/spelling/index.dic",
# cached: true,
# metadata: { ... }
# }
FrequencyCache (Kelly Project)
Manages Kelly Project frequency lists for intelligent suggestion ranking.
-
Cache Path:
~/.cache/kotoshu/frequency-lists/{code}/ -
TTL: 7 days (604,800 seconds)
-
Source: kotoshu/frequency-list-kelly
-
Format: JSON with tiered word frequency data
{
"metadata": {
"language": "en",
"source": "Kelly Project (University of Leeds)",
"total_words_analyzed": 1500000
},
"tiers": {
"top_50": {
"words": ["the", "be", "to", "of", "and", ...],
"info": "Most common 50 words"
},
"top_200": {
"words": ["will", "my", "one", "all", ...],
"info": "Most common 200 words"
},
"top_1000": {
"words": ["however", "although", ...],
"info": "Most common 1000 words"
}
}
}
# Access via cache
cache = Kotoshu::Cache::FrequencyCache.new
freq_data = cache.get('en', force_download: true)
# Result:
# {
# frequency_path: "~/.cache/kotoshu/frequency-lists/en/frequency.json",
# tiers: {
# top_50: Set<...>,
# top_200: Set<...>,
# top_1000: Set<...>
# },
# metadata: { ... }
# }
# Integrated into EditDistanceStrategy
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
language_code: 'en'
)
strategy.frequency_bonus('the') # => 200 (top 50)
strategy.frequency_bonus('hello') # => 100 (top 200)
strategy.frequency_bonus('xyz') # => 0 (not in lists)
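How a tier lookup could produce those bonuses is sketched below. It is a guess at the mechanism, not the EditDistanceStrategy source: the word lists are shortened samples in the JSON layout shown earlier, `load_tiers` and the free-standing `frequency_bonus` are hypothetical, and only the 200 and 100 values come from the example above.

```ruby
require "json"
require "set"

# Shortened sample in the tiered layout; "hello" is placed in top_200
# to match the example above.
FREQUENCY_JSON = <<~JSON
  {
    "tiers": {
      "top_50":  { "words": ["the", "be", "to", "of", "and"] },
      "top_200": { "words": ["will", "my", "one", "all", "hello"] }
    }
  }
JSON

# Bonus per tier, checked from most to least frequent. Values for any
# further tier would be an assumption, so none is listed.
TIER_BONUS = { "top_50" => 200, "top_200" => 100 }.freeze

def load_tiers(json)
  JSON.parse(json).fetch("tiers").transform_values { |tier| Set.new(tier.fetch("words")) }
end

def frequency_bonus(word, tiers)
  TIER_BONUS.each do |name, bonus|
    return bonus if tiers.fetch(name, Set.new).include?(word)
  end
  0
end

tiers = load_tiers(FREQUENCY_JSON)
frequency_bonus("the", tiers)   # => 200
frequency_bonus("hello", tiers) # => 100
frequency_bonus("xyz", tiers)   # => 0
```

Storing each tier as a Set keeps membership checks constant-time, which matters when every candidate suggestion is scored.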
ModelCache (Embedding Models)
Manages FastText and ONNX embedding models for semantic spell checking.
-
Cache Path:
~/.cache/kotoshu/models/{code}/models/{type}/ -
TTL: 30 days (2,592,000 seconds)
-
Sources:
-
FastText (.vec): Facebook CDN (dl.fbaipublicfiles.com)
-
ONNX (.onnx): Converted locally from FastText models
-
-
Supported Types:
-
fasttext: FastText word vectors (.vec files, 300D) - Downloaded from Facebook CDN -
onnx: ONNX-converted models (.onnx files) - Auto-converted from FastText
-
|
Note
|
ONNX models are automatically converted from FastText models on first use.
The conversion uses lib/kotoshu/scripts/fasttext_to_onnx.py and requires Python 3 with
numpy and onnx packages installed.
|
| Language | FastText File | ONNX File |
|---|---|---|
| de (German) | cc.de.300.vec | fasttext.de.onnx |
| en (English) | cc.en.300.vec | fasttext.en.onnx |
| es (Spanish) | cc.es.300.vec | fasttext.es.onnx |
| fr (French) | cc.fr.300.vec | fasttext.fr.onnx |
| pt (Portuguese) | cc.pt.300.vec | fasttext.pt.onnx |
| ru (Russian) | cc.ru.300.vec | fasttext.ru.onnx |
CLI Cache Management
Kotoshu provides CLI commands for managing cached resources:
# List all cached resources
kotoshu cache list
# List specific cache type
kotoshu cache list language
kotoshu cache list model
kotoshu cache list frequency
# Show cache statistics
kotoshu cache status
# Show detailed status (verbose)
kotoshu cache status --verbose
# Download a resource
kotoshu cache download language en
kotoshu cache download model en:fasttext
kotoshu cache download frequency en
# Get information about a resource
kotoshu cache info language en
kotoshu cache info model en:fasttext
kotoshu cache info frequency en
# Purge cached data
kotoshu cache purge all
kotoshu cache purge language en
kotoshu cache purge frequency
# Clean expired entries
kotoshu cache clean
Cache Statistics
Each cache type tracks statistics:
-
Hits: Number of cache hits (resource found locally)
-
Misses: Number of cache misses (had to download)
-
Hit Rate: Percentage of cache hits
-
Size: Total disk space used
-
Cached Resources: Number of resources cached
$ kotoshu cache status
======================================================================
Kotoshu Cache Status
======================================================================
Language Cache:
Directory: /Users/username/.cache/kotoshu/languages
Resources cached: 2
Size: 2.45 MB
Hits: 15, Misses: 2
Hit rate: 88.2%
Frequency Cache:
Directory: /Users/username/.cache/kotoshu/frequency-lists
Resources cached: 1
Size: 815.84 KB
Hits: 42, Misses: 1
Hit rate: 97.7%
Model Cache:
Directory: /Users/username/.cache/kotoshu/models
Resources cached: 0
Size: 0 B
Hits: 0, Misses: 0
Hit rate: 0.0%
Total:
Total size: 3.26 MB
Overall hit rate: 95.0%
======================================================================
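The per-cache hit rates in the sample output follow from hits / (hits + misses). A minimal sketch, assuming one-decimal rounding and 0.0 for a cache that has seen no lookups (both inferred from the output above, not from the source):

```ruby
# Hit rate as a percentage, rounded to one decimal place.
# Returns 0.0 when there were no lookups, avoiding a division by zero.
def hit_rate(hits, misses)
  total = hits + misses
  return 0.0 if total.zero?

  (hits * 100.0 / total).round(1)
end

hit_rate(15, 2) # => 88.2  (Language Cache)
hit_rate(42, 1) # => 97.7  (Frequency Cache)
hit_rate(0, 0)  # => 0.0   (Model Cache)
```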
Language Support Matrix
Kotoshu provides multi-language support with varying feature availability.
| Language | Dictionary | Hunspell Affix Rules | Kelly Frequency | FastText Model | ONNX Model | Notes |
|---|---|---|---|---|---|---|
| de (German) | ✓ (75,873 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTZ keyboard support |
| en (English) | ✓ (49,568 words) | ✓ | ✓ (815 KB) | ✓ (4.3 GB) | ✓ (~460 MB) | QWERTY keyboard support |
| es (Spanish) | ✓ (57,344 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTY keyboard support |
| fr (French) | ✓ (84,310 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | AZERTY keyboard support |
| pt (Portuguese) | ✓ (312,368 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTY keyboard support |
| ru (Russian) | ✓ (146,269 words) | ✓ | ✓ (780 KB) | ✓ (2.5 GB) | ✓ (~230 MB) | JCUKEN keyboard support |
| ar (Arabic) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| zh (Chinese) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| el (Greek) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| it (Italian) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| no (Norwegian) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| sv (Swedish) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| Language | Word Count | License | Source |
|---|---|---|---|
| de (German) | 75,873 | GPL | igerman98 |
| en (English) | 49,568 | LGPL/MPL/GPL | SCOWL |
| es (Spanish) | 57,344 | GPL | LibreOffice |
| fr (French) | 84,310 | MPL 2.0 | Grammalecte |
| pt (Portuguese) | 312,368 | LGPLv3 + MPL | VERO |
| ru (Russian) | 146,269 | BSD-style | Alexander Lebedev |
| Language | Size | Coverage |
|---|---|---|
| ar (Arabic) | ~750 KB | Top 1000 words |
| zh (Chinese) | ~800 KB | Top 1000 words |
| en (English) | 815 KB | Top 1000 words |
| el (Greek) | ~780 KB | Top 1000 words |
| it (Italian) | ~790 KB | Top 1000 words |
| no (Norwegian) | ~770 KB | Top 1000 words |
| ru (Russian) | 780 KB | Top 1000 words |
| sv (Swedish) | ~775 KB | Top 1000 words |
|
Note
|
Kelly frequency lists provide the top 1000 most common words from the Kelly Project (University of Leeds & University of Gothenburg). Languages not listed here require external frequency data sources. |
Programmatic Usage
Using Language Cache
require 'kotoshu/cache/language_cache'
cache = Kotoshu::Cache::LanguageCache.new
# Get spelling dictionary
dict = cache.get_spelling('en')
puts "Dictionary: #{dict[:dic_path]}"
puts "Words: #{File.readlines(dict[:dic_path]).count}"
# Get available languages
cache.available_languages # => ["de", "en", "es", "fr", "pt", "ru"]
# Check if resource is cached
cache.available?('en:spelling') # => true
# Get language info
info = cache.language_info('en')
puts "Language: #{info[:name]}"
puts "Words: #{info[:word_count]}"
puts "License: #{info[:license]}"
Using Frequency Cache
require 'kotoshu/cache/frequency_cache'
cache = Kotoshu::Cache::FrequencyCache.new
# Get frequency data
freq_data = cache.get('en')
# Access frequency tiers
top_50 = freq_data[:tiers][:top_50]
top_50.include?('the') # => true
top_50.include?('hello') # => false (it is in top 200)
# Get available languages
cache.available_languages # => ["ar", "zh", "en", "el", "it", "no", "ru", "sv"]
Integration with Suggestion Strategies
require 'kotoshu/suggestions/strategies/edit_distance_strategy'
# Frequency bonuses automatically applied
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
language_code: 'en'
)
# Suggestions are ranked by frequency
suggestions = strategy.suggest('helo', max_results: 5)
# => [
# { word: "hello", score: 1200 }, # High frequency word
# { word: "help", score: 1150 }, # Medium frequency word
# ...
# ]
Cache TTL and Expiration
All cached resources have a Time-To-Live (TTL) and automatically expire:
-
LanguageCache: 7 days (dictionaries change infrequently)
-
FrequencyCache: 7 days (frequency lists are stable)
-
ModelCache: 30 days (models are large and change rarely)
Expired resources are automatically re-downloaded on next access.
cache = Kotoshu::Cache::FrequencyCache.new
# Force re-download (ignores cache)
freq_data = cache.get('en', force_download: true)
# Clean expired entries manually
cache.clean
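The expiry rule itself is simple arithmetic on the download timestamp. The sketch below uses the TTLs listed above; the `expired?` helper, the `TTL_SECONDS` table, and the choice to treat an entry as expired once its age reaches the TTL are assumptions for illustration.

```ruby
# TTLs from the list above, in seconds.
TTL_SECONDS = {
  language:  7 * 24 * 60 * 60,  # 604_800
  frequency: 7 * 24 * 60 * 60,  # 604_800
  model:     30 * 24 * 60 * 60  # 2_592_000
}.freeze

# Hypothetical check: an entry is expired once its age reaches the TTL.
# `now` is injectable so the rule can be tested without waiting.
def expired?(kind, downloaded_at, now: Time.now)
  (now - downloaded_at) >= TTL_SECONDS.fetch(kind)
end

now = Time.at(1_700_000_000)
eight_days_ago = now - 8 * 86_400
expired?(:language, eight_days_ago, now: now) # => true
expired?(:model,    eight_days_ago, now: now) # => false
```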
Manual Cache Management
cache = Kotoshu::Cache::LanguageCache.new
# Clear specific resource
cache.clear('en:spelling')
# Clear all resources
cache.clear_all
# Check if resource exists
cache.available?('en:spelling') # => true after download
# Get statistics
stats = cache.stats
puts "Hit rate: #{stats[:hit_rate] * 100}%"
puts "Size: #{stats[:size_bytes]} bytes"
GitHub Repository Structure
The kotoshu/dictionaries repository follows this structure:
kotoshu/dictionaries/
├── en/
│ ├── spelling/
│ │ ├── index.dic # Hunspell dictionary
│ │ ├── index.aff # Hunspell affix rules
│ │ └── metadata.json # Version info
│ ├── grammar/
│ │ └── rules.yaml # Grammar rules (future)
│ └── models/
│ ├── fasttext/
│ │ └── cc.en.300.vec # FastText vectors
│ └── onnx/
│ └── fasttext.en.onnx # ONNX model
├── de/
│ └── ... (same structure)
└── README.md
kotoshu/frequency-list-kelly/
├── data/
│ ├── en.json # Kelly frequency data
│ ├── ru.json
│ └── ...
└── README.md
Adding New Languages
To add support for a new language:
-
Dictionary: Add Hunspell dictionary to
kotoshu/dictionaries/{code}/spelling/ -
Frequency: Add Kelly frequency data to
kotoshu/frequency-list-kelly/data/{code}.json -
Register: Add to
AVAILABLE_LANGUAGES in LanguageCache -
Test: Run integration tests to verify
See CONTRIBUTING.adoc for detailed guidelines.
Model Repository
ONNX models are hosted at: kotoshu/dictionaries
Download and setup:
# Preferred: let kotoshu fetch and verify the model
kotoshu setup en --want spelling,model
# Manual clone (advanced; bypasses manifest verification)
git clone https://github.com/kotoshu/dictionaries.git ~/src/kotoshu/dictionaries
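If you do clone manually, you can still check files against a SHA-256 manifest yourself. The sketch below is an assumption-laden illustration: the manifest shape (`{ "relative/path" => "hex digest" }`), the `verify_file` helper, and the three result symbols are hypothetical, with `:unverified` standing in for the graceful degradation applied when no manifest is present.

```ruby
require "digest/sha2"
require "tempfile"

# Returns :ok, :mismatch, or :unverified (no manifest, or no entry for key).
def verify_file(path, manifest, key)
  expected = manifest && manifest[key]
  return :unverified if expected.nil?

  Digest::SHA256.file(path).hexdigest == expected.downcase ? :ok : :mismatch
end

Tempfile.create("index.dic") do |file|
  file.write("2\nhello\nworld\n")
  file.flush
  key = "en/spelling/index.dic"
  digest = Digest::SHA256.file(file.path).hexdigest

  verify_file(file.path, { key => digest }, key)   # => :ok
  verify_file(file.path, { key => "0" * 64 }, key) # => :mismatch
  verify_file(file.path, nil, key)                 # => :unverified
end
```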
License
BSD 2-Clause — see the LICENSE file for details.
Bundled dictionaries and frequency lists carry their own licenses; see the
per-language license files in kotoshu/dictionaries.
Contributing
Contributions are welcome! Please see CONTRIBUTING.adoc for guidelines.
Acknowledgments
-
FastText: Facebook Research
-
ONNX Runtime: Microsoft
-
Hunspell: László Németh