File: README — Documentation by YARD 0.9.38

Status

Kotoshu is v0.3.0 — building on the 0.2 cut, this release adds a strict two-stage resource model (explicit setup, cache-only hot path), XDG base directory layout, SHA-256 integrity verification, SARIF output, an --interactive review loop, and ONNX model pipeline wiring.

What works in 0.3

Two-stage resource model — Kotoshu.setup(:en) then Kotoshu.correct?("hello"). The hot path is cache-only and raises ResourceNotSetupError on miss; downloads are never implicit.
Kotoshu.check(text, language: "en") / Kotoshu.suggest("helo") — full document check and suggestions
Kotoshu.spellchecker_for(lang, strict: true) — re-raise on optional-resource failures
kotoshu check FILE CLI with these flags:
- --language en|de|es|fr|pt|ru|auto (default: auto)
- --format text|json|sarif
- --offline — use only cached resources, never download
- --strict — exit 3 if any optional resource (frequency, model) can’t load
- --interactive — review each error after the check
- --verbose
kotoshu setup LANGUAGE [LANGUAGE …] — pre-warm spelling + frequency + ONNX caches for offline use (fetch is kept as a hidden deprecated alias)
Local-source setup: kotoshu setup en --aff path/to.en.aff --dic path/to.en.dic or kotoshu setup en --from /path/to/dict/dir/
Exit codes: 0 clean, 1 errors found, 2 usage error, 3 resource setup failed
SHA-256 integrity verification (manifest-based, with graceful degradation when manifest is absent)
Offline mode via KOTOSHU_OFFLINE=1 or --offline
XDG base directory layout: caches in $XDG_CACHE_HOME/kotoshu/, config in $XDG_CONFIG_HOME/kotoshu/, data in $XDG_DATA_HOME/kotoshu/ (overridable via KOTOSHU_CACHE_PATH, KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH)
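
The manifest-based integrity check with graceful degradation listed above could look roughly like the following. This is a minimal sketch, not Kotoshu's implementation: the `verify_resource` helper, its return symbols, and the manifest layout (a JSON object mapping file names to hex digests) are assumptions made for illustration.

```ruby
require "digest"
require "json"

# Hypothetical helper: compare a file against a SHA-256 manifest.
# Returns :verified, :mismatch, or :unverified (manifest or entry absent).
def verify_resource(path, manifest_path)
  # Graceful degradation: no manifest means nothing to verify against.
  return :unverified unless File.exist?(manifest_path)

  manifest = JSON.parse(File.read(manifest_path))
  expected = manifest[File.basename(path)]
  return :unverified if expected.nil?

  actual = Digest::SHA256.file(path).hexdigest
  actual == expected ? :verified : :mismatch
end
```

A caller would typically treat `:mismatch` as a hard failure and `:unverified` as a warning, which is one way to read "graceful degradation when manifest is absent".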

Planned for 0.4+

--output (file output redirection)
ONNX semantic reranking as default path
≥30 language modules wired
Grammar rule packs
CJK and RTL language support

See the 0.2 cut plan, the 0.3 tasks under TODO.impl/, and the vision for the path to 1.0.

Purpose

Kotoshu 「言修」 is a pure-Ruby spell checker that aims to work for every language by dynamically downloading the right combination of dictionary, frequency data, and embedding model on demand.

The current release pairs a Ruby port of the Hunspell algorithm (traditional morphological lookup + affix rules) with optional FastText word embeddings converted to ONNX for context-aware reranking.

Note	The semantic (ONNX) path is an optional feature. `gem install kotoshu` works without `onnxruntime`; install it separately (`gem install onnxruntime`) to enable context-aware reranking. Set `KOTOSHU_NO_ONNX=1` to opt back out.

Features

Note	The list below describes the design vision. See Status for exactly what works in 0.3 and what is planned for 0.4+.

Multi-language support with automatic detection
Semantic error detection using word embeddings (opt-in via Ruby API in 0.2)
Interactive review mode with full navigation (shipped in 0.3 as navigation-only; source files are not rewritten yet)
Batch processing for CI/CD (JSON and SARIF in 0.3)
Fast ONNX inference via ONNX Runtime
Support for Markdown, AsciiDoc, and plain text
Multiple analysis models (Hunspell, FastText, Hybrid) (Hunspell path only in 0.2)

Architecture

Kotoshu is built on a modern, semantic architecture:

Architecture overview

╔═══════════════════════════════════════════════════════════════════╗
║                    Kotoshu Semantic Architecture                ║
╠═══════════════════════════════════════════════════════════════════╣
║                                                                   ║
║  ┌─────────────────────────────────────────────────────────────┐  ║
║  │                     Interface Layer                          │  ║
║  │  ┌─────────────────────┐  ┌─────────────────────────────┐   │  ║
║  │  │   CLI (Thor)        │  │      Ruby API               │   │  ║
║  │  │   lib/kotoshu/cli/  │  │   Kotoshu module methods    │   │  ║
║  │  └──────────┬──────────┘  └───────────┬─────────────────┘   │  ║
║  │             │  Auto Language Detect   │                     │  ║
║  └─────────────┼──────────────────────────┼─────────────────────┘  ║
║                │                          │                        ║
║                ▼                          ▼                        ║
║  ┌─────────────────────────────────────────────────────────────┐  ║
║  │                   Analysis Layer                             │  ║
║  │  ┌──────────────┐  ┌─────────────┐  ┌───────────────────┐  │  ║
║  │  │    Hunspell  │  │  FastText   │  │  Hybrid (Best!)   │  │  ║
║  │  │  Dictionary  │  │  Embeddings │  │  Combined         │  │  ║
║  │  │  (Traditional)│  │  (ONNX)    │  │  Approach         │  │  ║
║  │  └──────────────┘  └─────────────┘  └───────────────────┘  │  ║
║  └───────────────────────────┬─────────────────────────────────┘  ║
║                              │                                    ║
║  ┌───────────────────────────▼─────────────────────────────────┐  ║
║  │                  Model Layer (ONNX)                          │  ║
║  │  ┌──────────────────────────────────────────────────────┐  │  ║
║  │  │  ONNX Runtime → Fast Embedding Lookup                 │  │  ║
║  │  │  Semantic Similarity → Context-Aware Suggestions      │  │  ║
║  │  │  Nearest Neighbor Search → Smart Corrections          │  │  ║
║  │  └──────────────────────────────────────────────────────┘  │  ║
║  └─────────────────────────────────────────────────────────────┘  ║
║                                                                   ║
╚═══════════════════════════════════════════════════════════════════╝

Key Components

Kotoshu::Models::OnnxModel: ONNX-based word embedding model for fast semantic similarity and nearest neighbor search.
Kotoshu::Analyzers::SemanticAnalyzer: Unified semantic error detection using word embeddings (no artificial spelling/grammar split).
Kotoshu::Language::LanguageIdentifier: Automatic language detection using FastText LID model (127 languages).
Kotoshu::Cli::InteractiveReviewer: Interactive CLI for error review with full navigation (forward, backward, jump, skip, accept).
Kotoshu::Dictionary::Hunspell: Traditional Hunspell dictionary backend for morphological analysis and affix rules.

Why ONNX?

ONNX Runtime provides:

Performance: C++ implementation, 10-100x faster than pure Ruby
Portability: Works on CPU, GPU, TPU, mobile devices
Optimization: Automatic graph optimization and quantization
Interoperability: Models can be trained in Python, deployed in Ruby

Kotoshu uses FastText models converted to ONNX format for semantic spell checking.

Semantic Analysis

Unlike traditional spell checkers that only check dictionary membership and edit distance, Kotoshu uses semantic similarity to:

Detect contextually appropriate corrections ("desert" vs "dessert")
Handle out-of-vocabulary words via subword embeddings
Provide ranked suggestions based on semantic similarity
Support compound words and morphological variations

Example 1. Usage example

Kotoshu.setup(:en, want: %i[spelling model])  # one-time per language

# Traditional: knows "helo" is wrong and lists edit-distance candidates
Kotoshu.suggest("helo").to_words
# => ["hello", "help", "held", "hell", "hole"]

# Semantic: reranks candidates by context similarity
model = Kotoshu::Models::OnnxModel.from_github("en")
analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
analyzer.suggest_corrections("helo", context: "I said helo to the world").map(&:word)
# => ["hello"]  # "hello" makes more sense in greeting context

Note	The semantic path requires the optional `onnxruntime` gem. See Requirements.
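
To make the reranking idea concrete, here is a self-contained sketch of ordering candidates by cosine similarity to an averaged context vector. It does not use Kotoshu or ONNX Runtime: the `cosine`, `context_vector`, and `rerank` helpers and the three-dimensional toy embeddings are all hypothetical, chosen only to illustrate the technique.

```ruby
# Toy 3-dimensional embeddings (invented values, for illustration only).
EMBEDDINGS = {
  "hello" => [0.9, 0.1, 0.0],
  "help"  => [0.1, 0.9, 0.0],
  "hole"  => [0.0, 0.1, 0.9],
  "said"  => [0.8, 0.2, 0.1],
  "world" => [0.7, 0.1, 0.2]
}.freeze

# Cosine similarity between two equal-length numeric vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  norm.zero? ? 0.0 : dot / norm
end

# Average the vectors of the context words that have an embedding.
def context_vector(words, embeddings)
  vectors = words.filter_map { |w| embeddings[w] }
  return nil if vectors.empty?

  vectors.transpose.map { |column| column.sum / vectors.size }
end

# Sort candidates by similarity to the context, most similar first.
# Candidates without an embedding are placed last.
def rerank(candidates, context_words, embeddings)
  ctx = context_vector(context_words, embeddings)
  return candidates if ctx.nil?

  candidates.sort_by do |c|
    embeddings.key?(c) ? -cosine(embeddings[c], ctx) : Float::INFINITY
  end
end

p rerank(%w[hole help hello], %w[i said to the world], EMBEDDINGS)
```

With these toy vectors the context words "said" and "world" pull the averaged vector toward "hello", so it moves to the front of the list.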

Multi-Language Support

Kotoshu supports 6 languages with full semantic analysis:

de - German (Deutsch)
en - English
es - Spanish (Español)
fr - French (Français)
pt - Portuguese (Português)
ru - Russian (Русский)

Automatic language detection is enabled by default:

Example 2. Usage example

# Language auto-detected from document content
kotoshu check document.txt
# Detected: en (95% confidence)
# Analyzing document.txt (language: en)...

# Explicit language specification
kotoshu check document.txt --language de

ONNX Models

Kotoshu uses FastText crawl vectors converted to ONNX format:

Source: FastText Crawl Vectors
Format: ONNX with optimized runtime
Vocabulary: 2 million words per language (full coverage)
Dimension: 300-dimensional word vectors
Size: ~2.4GB per language

FastText File Formats

FastText provides two file formats. Kotoshu uses the .vec format for ONNX conversion.

Aspect	`.vec` (Text)	`.bin` (Binary)
Content	Word vectors only (pre-computed embeddings)	Full FastText model (trained model)
Structure	Text: one word + 300 floats per line	Binary: complete model with matrices
File Size	~1.3GB compressed (~2.4GB uncompressed)	~1.8GB compressed (~4.8GB uncompressed)
Train New Words	✗ No (static lookup only)	✓ Yes (can train/OOV with subword info)
Subword Embeddings	✗ No	✓ Yes (n-gram character embeddings)
ONNX Converter	✓ Supported (what we use)	✗ Not supported
Use Case	Simple word vector lookup for spell checking	Full FastText functionality (training, OOV)

Kotoshu uses .vec files because:

Simpler extraction: Just word → vector mapping
No subword complexity needed: Dictionary-based spell checking doesn’t require OOV generation
Smaller ONNX models: ~2.4GB vs ~4.8GB
Faster conversion: Direct serialization to ONNX
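
As an illustration of the "word → vector mapping" that a `.vec` file contains, the sketch below reads the text format: a header line with the vocabulary size and dimension, then one word followed by its floats per line. The `read_vec` helper is hypothetical and is not part of Kotoshu; the sample uses 3 dimensions instead of 300 to stay readable.

```ruby
require "stringio"

# Read a FastText .vec stream into a Hash of word => Array<Float>.
# `limit` optionally stops after that many words (useful for 2M-word files).
def read_vec(io, limit: nil)
  count, dimension = io.gets.split.map(&:to_i) # header: "<count> <dimension>"
  vectors = {}
  io.each_line do |line|
    break if limit && vectors.size >= limit

    word, *values = line.chomp.split(" ")
    next if values.size != dimension # skip malformed rows

    vectors[word] = values.map(&:to_f)
  end
  { count: count, dimension: dimension, vectors: vectors }
end

sample = StringIO.new("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
model = read_vec(sample)
puts model[:vectors].keys.inspect
```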

Example 3. Model management

# Set up a language with spelling + ONNX semantic model
kotoshu setup en --want spelling,model

# List what's set up in the cache
kotoshu setup --list

# Re-validate cached resources
kotoshu cache validate

Note	FastText `.vec` → ONNX conversion is done upstream in the `kotoshu/models-fasttext-onnx` repo. The CLI downloads pre-converted artifacts; users do not run conversion locally.

Interactive Mode

Note	Interactive mode shipped in 0.3.0. It is navigation-only — the session records which suggestions the user accepted but does not rewrite the source file yet.

kotoshu check README.md --interactive

Features in 0.3:

Navigate: [n] / Enter next, [p] previous, [l] list
Accept: [1-9] record suggestion N for the current error
Skip: [s] skip the current error
Quit: [q] exit the review loop

Batch Processing

For CI/CD and automation, Kotoshu supports JSON and SARIF output in 0.3; --output file redirection is planned for 0.4+.

Example 4. JSON output for CI/CD

# JSON output to stdout (supported in 0.3)
kotoshu check README.md --format json

# SARIF 2.1.0 output (supported in 0.3)
kotoshu check README.md --format sarif

# Exit code for CI
kotoshu check README.md
echo $?  # 0 if no errors, 1 if errors found

Document Formats

Kotoshu supports structured documents with AST parsing:

Plain text: Line-based error detection
Markdown: AST-based using Kramdown parser
AsciiDoc: AST-based using Asciidoctor parser

Structured documents preserve node paths for precise error location.

Analysis Models

Note	In 0.2, the CLI runs the Hunspell traditional path only. The `--model` flag and FastText/Hybrid paths are planned for 0.3+. The Ruby API can opt into the semantic path today via `Kotoshu::Models::OnnxModel` (auto-available when `onnxruntime` is installed).

Kotoshu is designed to support three analysis models:

Table 1. Dictionary backend comparison
Model	Description	Best For
hunspell	Traditional dictionary-based with morphological rules	Fast checking, compound words, languages with complex morphology
fasttext	Pure semantic embeddings via ONNX	Context awareness, out-of-vocabulary words, semantic similarity
hybrid	Hunspell candidates + FastText reranking (recommended)	Maximum accuracy, best of both worlds

Example 5. Intended usage (0.3+)

# Fast dictionary-based checking (default in 0.2)
kotoshu check document.txt                # 0.2: Hunspell path

# Semantic / hybrid paths: planned for 0.3
# kotoshu check document.txt --model fasttext
# kotoshu check document.txt --model hybrid

Installation

Add this line to your application’s Gemfile:

gem 'kotoshu'

And then execute:

bundle install

Or install it yourself as:

gem install kotoshu

Note	`onnxruntime` is an optional dependency. Install it separately (`gem install onnxruntime`) to enable semantic analysis; the traditional Hunspell path works without it.

Quick Start

# One-time per language: download spelling dictionary from
# github.com/kotoshu/dictionaries (idempotent, ~5 MB)
kotoshu setup en

# Then check files instantly, cache-only
kotoshu check README.md

Or skip the explicit setup — the CLI will prompt interactively the first time you check a file in a non-cached language (TTY only; in non-TTY or KOTOSHU_OFFLINE=1 mode it exits with code 3).

Command-line usage

# Check a file (uses --language, or auto-detects from content)
kotoshu check README.md

# Explicit language
kotoshu check README.md --language en

# JSON output for programmatic use
kotoshu check README.md --format json

# Offline mode — use only cached dictionaries, never download
kotoshu check README.md --offline

# Check stdin
echo "helo wrld" | kotoshu check

Exit codes: 0 (no errors), 1 (errors found), 2 (usage error), 3 (language not set up — run kotoshu setup LANG, or run kotoshu check in a TTY to be prompted).
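
A CI wrapper can branch on these exit codes. The sketch below maps each documented code to a build outcome; the `ci_outcome` helper and the policy it encodes (spelling errors only warn, anything else non-zero fails) are assumptions for illustration, not behavior provided by Kotoshu.

```ruby
# Exit codes as documented above.
EXIT_MEANINGS = {
  0 => :clean,
  1 => :errors_found,
  2 => :usage_error,
  3 => :resource_setup_failed
}.freeze

# Hypothetical policy: decide what a CI step should do for an exit code.
def ci_outcome(exit_code)
  case EXIT_MEANINGS.fetch(exit_code, :unknown)
  when :clean then :pass
  when :errors_found then :warn
  else :fail # usage errors, missing resources, or unknown codes
  end
end

# Usage sketch (requires kotoshu on PATH):
#   system("kotoshu", "check", "README.md", "--offline")
#   outcome = ci_outcome($?.exitstatus)
```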

Ruby API usage

require 'kotoshu'

# Stage 1: set up the language once (downloads from github.com/kotoshu/dictionaries)
Kotoshu.setup(:en)

# Stage 2: hot-path checks are cache-only and never touch the network
Kotoshu.correct?("hello")  # => true
Kotoshu.correct?("helo")   # => false

# Suggestions return a SuggestionSet; call #to_words for an Array
Kotoshu.suggest("helo").to_words  # => ["hello", "help", "held", ...]

# Check a document
result = Kotoshu.check("Hello wrold")
result.errors.map(&:word)  # => ["wrold"]

# Each error carries position + suggestions
result = Kotoshu.check_file("README.md")
result.errors.each do |error|
  puts "#{error.word} at offset #{error.position}: #{error.top_suggestions(3).join(', ')}"
end

# Semantic analysis is optional — requires the onnxruntime gem
# (gem install onnxruntime). Skip this block if you only want Hunspell.
if Kotoshu::Models::OnnxModel::ONNX_LOADED
  Kotoshu.setup(:en, want: %i[spelling model])
  model = Kotoshu::Models::OnnxModel.from_github('en')
  analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
  analyzer.analyze(Kotoshu.check("Hello wrold"))
end

Note	The library API is strict: calls like `Kotoshu.correct?` raise `Kotoshu::ResourceNotSetupError` until you’ve run `Kotoshu.setup`. This prevents surprise downloads on metered networks. The CLI (`kotoshu check`) intercepts the error and prompts to download interactively.
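
The strict two-stage contract can be pictured with a toy model: only `setup` may load data, and the hot path reads from memory and raises on a miss. Everything in this sketch (the `TwoStageChecker` class, its injected loader, and the local `ResourceNotSetupError` stand-in) is hypothetical and only mirrors the behavior described above.

```ruby
require "set"

# Local stand-in for the library's error class (illustration only).
class ResourceNotSetupError < StandardError; end

# Toy two-stage checker: stage 1 loads, stage 2 is cache-only.
class TwoStageChecker
  # loader: a callable taking a language and returning an Array of words.
  def initialize(loader)
    @loader = loader
    @cache = {}
  end

  # Stage 1: idempotent; the only place where the loader is invoked.
  def setup(language)
    @cache[language] ||= @loader.call(language).to_set
    self
  end

  # Stage 2: never calls the loader; raises if setup was skipped.
  def correct?(word, language:)
    words = @cache.fetch(language) do
      raise ResourceNotSetupError, "run setup(#{language.inspect}) first"
    end
    words.include?(word.downcase)
  end
end

checker = TwoStageChecker.new(->(_lang) { %w[hello world] })
checker.setup(:en)
puts checker.correct?("Hello", language: :en)
```

Keeping the loader out of `correct?` is what makes a surprise download impossible on the hot path.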

Requirements

Ruby 3.1+
onnxruntime gem (optional — enables semantic spell checking; install separately with gem install onnxruntime)
Python 3 + fasttext (optional, only if you want to convert .vec → .onnx upstream)

Resource Caching and Language Support

Kotoshu manages dictionaries, frequency lists, and embedding models through three caches (LanguageCache, FrequencyCache, ModelCache) that share a common base class. Resources are downloaded explicitly via Kotoshu.setup (or kotoshu setup) and cached under the XDG base directory layout (~/.cache/kotoshu/ by default; override via KOTOSHU_CACHE_PATH, KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH, or the XDG_*_HOME vars).
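
The override chain for the cache location might resolve as sketched below. The precedence (KOTOSHU_CACHE_PATH first, then XDG_CACHE_HOME, then ~/.cache) and the idea that KOTOSHU_CACHE_PATH replaces the whole kotoshu directory are assumptions; the `cache_dir` helper is hypothetical.

```ruby
# Resolve the cache directory from an environment-like Hash.
# Empty variables are treated as unset, as the XDG specification suggests.
def cache_dir(env = ENV, home: Dir.home)
  override = env["KOTOSHU_CACHE_PATH"]
  return override unless override.nil? || override.empty?

  xdg = env["XDG_CACHE_HOME"]
  base = xdg.nil? || xdg.empty? ? File.join(home, ".cache") : xdg
  File.join(base, "kotoshu")
end

puts cache_dir({}, home: "/home/alice")
puts cache_dir({ "XDG_CACHE_HOME" => "/tmp/xdg" }, home: "/home/alice")
```

Passing the environment as an argument keeps the function easy to test without mutating ENV.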

Cache Architecture

Cache System Class Diagram

┌────────────────────────────────────────────────────────────────────────────┐
│                              BaseCache (Abstract)                        │
│  ┌────────────────────────────────────────────────────────────────────┐   │
│  │ Common: download, metadata, validation, stats, TTL management     │   │
│  └────────────────────────────────────────────────────────────────────┘   │
└────────────────────┬───────────────────┬────────────────────┬──────────────┘
                     │                   │                    │
        ┌────────────▼────────┐  ┌──────▼──────┐  ┌───────▼─────────┐
        │   LanguageCache     │  │ModelCache   │  │ FrequencyCache  │
        │  (Dictionaries)     │  │ (Embeddings)│  │  (Kelly Lists)  │
        └─────────────────────┘  └─────────────┘  └─────────────────┘
                     │                   │                    │
        ┌────────────▼────────┐  ┌──────▼──────┐  ┌───────▼─────────┐
        │ ~/.cache/kotoshu/  │  │~/.cache/    │  │ ~/.cache/kotoshu/│
        │   languages/       │  │  kotoshu/   │  │frequency-lists/ │
        │                    │  │  models/    │  │                 │
        └─────────────────────┘  └─────────────┘  └─────────────────┘

Cache Types

LanguageCache (Dictionaries)

Manages Hunspell dictionaries and grammar rules for spell checking.

Cache Path: ~/.cache/kotoshu/languages/{code}/
TTL: 7 days (604,800 seconds)
Source: kotoshu/dictionaries
Resources per language:
- spelling/: Hunspell dictionary (index.dic, index.aff)
- grammar/: Grammar rules (rules.yaml), future
- frequency/: Frequency data, deprecated; use FrequencyCache

Usage

# Access via cache
cache = Kotoshu::Cache::LanguageCache.new
dict = cache.get_spelling('en')

# Result:
# {
#   aff_path: "~/.cache/kotoshu/languages/en/spelling/index.aff",
#   dic_path: "~/.cache/kotoshu/languages/en/spelling/index.dic",
#   cached: true,
#   metadata: { ... }
# }

FrequencyCache (Kelly Project)

Manages Kelly Project frequency lists for intelligent suggestion ranking.

Cache Path: ~/.cache/kotoshu/frequency-lists/{code}/
TTL: 7 days (604,800 seconds)
Source: kotoshu/frequency-list-kelly
Format: JSON with tiered word frequency data

Kelly Frequency Data Structure

{
  "metadata": {
    "language": "en",
    "source": "Kelly Project (University of Leeds)",
    "total_words_analyzed": 1500000
  },
  "tiers": {
    "top_50": {
      "words": ["the", "be", "to", "of", "and", ...],
      "info": "Most common 50 words"
    },
    "top_200": {
      "words": ["will", "my", "one", "all", ...],
      "info": "Most common 200 words"
    },
    "top_1000": {
      "words": ["however", "although", ...],
      "info": "Most common 1000 words"
    }
  }
}
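
A tiered document of this shape can be turned into lookup sets with a few lines of Ruby. This is a sketch under assumptions: `load_tiers` and `tier_of` are hypothetical helpers, and the sample JSON is a shortened stand-in for a real Kelly file.

```ruby
require "json"
require "set"

SAMPLE_KELLY_JSON = <<~JSON
  {
    "metadata": { "language": "en" },
    "tiers": {
      "top_50":  { "words": ["the", "be", "to"], "info": "Most common 50 words" },
      "top_200": { "words": ["will", "my", "one"], "info": "Most common 200 words" }
    }
  }
JSON

# Convert the "tiers" object into { tier_name (Symbol) => Set of words }.
def load_tiers(json_text)
  data = JSON.parse(json_text)
  data.fetch("tiers").to_h do |name, tier|
    [name.to_sym, Set.new(tier.fetch("words"))]
  end
end

# Return the first tier that contains the word, or nil if none does.
def tier_of(word, tiers)
  tiers.find { |_name, words| words.include?(word) }&.first
end

tiers = load_tiers(SAMPLE_KELLY_JSON)
puts tier_of("will", tiers).inspect
```

Sets give constant-time membership checks, which matters when every suggestion candidate is looked up during ranking.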

Usage

# Access via cache
cache = Kotoshu::Cache::FrequencyCache.new
freq_data = cache.get('en', force_download: true)

# Result:
# {
#   frequency_path: "~/.cache/kotoshu/frequency-lists/en/frequency.json",
#   tiers: {
#     top_50: Set<...>,
#     top_200: Set<...>,
#     top_1000: Set<...>
#   },
#   metadata: { ... }
# }

# Integrated into EditDistanceStrategy
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
  language_code: 'en'
)
strategy.frequency_bonus('the')   # => 200 (top 50)
strategy.frequency_bonus('hello') # => 100 (top 200)
strategy.frequency_bonus('xyz')   # => 0 (not in lists)

ModelCache (Embedding Models)

Manages FastText and ONNX embedding models for semantic spell checking.

Cache Path: ~/.cache/kotoshu/models/{code}/models/{type}/
TTL: 30 days (2,592,000 seconds)
Sources:
- FastText (.vec): Facebook CDN (dl.fbaipublicfiles.com)
- ONNX (.onnx): Converted locally from FastText models
Supported Types:
- fasttext: FastText word vectors (.vec files, 300D) - Downloaded from Facebook CDN
- onnx: ONNX-converted models (.onnx files) - Auto-converted from FastText

Note	ONNX models are automatically converted from FastText models on first use. The conversion uses `lib/kotoshu/scripts/fasttext_to_onnx.py` and requires Python 3 with `numpy` and `onnx` packages installed.

Table 2. Model Files by Language
Language	FastText File	ONNX File

CLI Cache Management

Kotoshu provides CLI commands for managing cached resources:

# List all cached resources
kotoshu cache list

# List specific cache type
kotoshu cache list language
kotoshu cache list model
kotoshu cache list frequency

# Show cache statistics
kotoshu cache status

# Show detailed status (verbose)
kotoshu cache status --verbose

# Download a resource
kotoshu cache download language en
kotoshu cache download model en:fasttext
kotoshu cache download frequency en

# Get information about a resource
kotoshu cache info language en
kotoshu cache info model en:fasttext
kotoshu cache info frequency en

# Purge cached data
kotoshu cache purge all
kotoshu cache purge language en
kotoshu cache purge frequency

# Clean expired entries
kotoshu cache clean

Cache Statistics

Each cache type tracks statistics:

Hits: Number of cache hits (resource found locally)
Misses: Number of cache misses (had to download)
Hit Rate: Percentage of cache hits
Size: Total disk space used
Cached Resources: Number of resources cached
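
The hit rate is the usual ratio of hits to total lookups. A minimal sketch, assuming a hypothetical `hit_rate` helper that rounds to one decimal place and reports 0.0 for an unused cache:

```ruby
# Percentage of lookups served from the local cache, rounded to 0.1.
def hit_rate(hits, misses)
  total = hits + misses
  return 0.0 if total.zero? # an unused cache has no meaningful ratio

  (hits.to_f / total * 100).round(1)
end

puts hit_rate(15, 2)
puts hit_rate(0, 0)
```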

$ kotoshu cache status
======================================================================
Kotoshu Cache Status
======================================================================

Language Cache:
  Directory: /Users/username/.cache/kotoshu/languages
  Resources cached: 2
  Size: 2.45 MB
  Hits: 15, Misses: 2
  Hit rate: 88.2%

Frequency Cache:
  Directory: /Users/username/.cache/kotoshu/frequency-lists
  Resources cached: 1
  Size: 815.84 KB
  Hits: 42, Misses: 1
  Hit rate: 97.7%

Model Cache:
  Directory: /Users/username/.cache/kotoshu/models
  Resources cached: 0
  Size: 0 B
  Hits: 0, Misses: 0
  Hit rate: 0.0%

Total:
  Total size: 3.26 MB
  Overall hit rate: 93.5%
======================================================================

Language Support Matrix

Kotoshu provides multi-language support with varying feature availability.

Table 3. Complete Language Support Matrix
Language	Dictionary	Hunspell Affix Rules	Kelly Frequency	FastText Model	ONNX Model	Notes
de (German)	✓ (75,873 words)	✓	✗	✓ (2.5 GB)	✓ (~230 MB)	QWERTZ keyboard support
en (English)	✓ (49,568 words)	✓	✓ (815 KB)	✓ (4.3 GB)	✓ (~460 MB)	QWERTY keyboard support
es (Spanish)	✓ (57,344 words)	✓	✗	✓ (2.5 GB)	✓ (~230 MB)	QWERTY keyboard support
fr (French)	✓ (84,310 words)	✓	✗	✓ (2.5 GB)	✓ (~230 MB)	AZERTY keyboard support
pt (Portuguese)	✓ (312,368 words)	✓	✗	✓ (2.5 GB)	✓ (~230 MB)	QWERTY keyboard support
ru (Russian)	✓ (146,269 words)	✓	✓ (780 KB)	✓ (2.5 GB)	✓ (~230 MB)	JCUKEN keyboard support
ar (Arabic)	✗	✗	✓	✗	✗	Kelly frequency only
zh (Chinese)	✗	✗	✓	✗	✗	Kelly frequency only
el (Greek)	✗	✗	✓	✗	✗	Kelly frequency only
it (Italian)	✗	✗	✓	✗	✗	Kelly frequency only
no (Norwegian)	✗	✗	✓	✗	✗	Kelly frequency only
sv (Swedish)	✗	✗	✓	✗	✗	Kelly frequency only

Table 4. Dictionary Sources
Language	Word Count	License	Source

Table 5. Kelly Frequency Lists
Language	Size	Coverage

Note	Kelly frequency lists provide the top 1000 most common words from the Kelly Project (University of Leeds & University of Gothenburg). Languages not listed here require external frequency data sources.

Programmatic Usage

Using Language Cache

require 'kotoshu/cache/language_cache'

cache = Kotoshu::Cache::LanguageCache.new

# Get spelling dictionary
dict = cache.get_spelling('en')
puts "Dictionary: #{dict[:dic_path]}"
puts "Words: #{File.readlines(dict[:dic_path]).count}"

# Get available languages
cache.available_languages  # => ["de", "en", "es", "fr", "pt", "ru"]

# Check if resource is cached
cache.available?('en:spelling')  # => true

# Get language info
info = cache.language_info('en')
puts "Language: #{info[:name]}"
puts "Words: #{info[:word_count]}"
puts "License: #{info[:license]}"

Using Frequency Cache

require 'kotoshu/cache/frequency_cache'

cache = Kotoshu::Cache::FrequencyCache.new

# Get frequency data
freq_data = cache.get('en')

# Access frequency tiers
top_50 = freq_data[:tiers][:top_50]
top_50.include?('the')    # => true
top_50.include?('hello')  # => false ("hello" is in the top 200 tier)
freq_data[:tiers][:top_200].include?('hello')  # => true

# Get available languages
cache.available_languages  # => ["ar", "zh", "en", "el", "it", "no", "ru", "sv"]

Integration with Suggestion Strategies

require 'kotoshu/suggestions/strategies/edit_distance_strategy'

# Frequency bonuses automatically applied
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
  language_code: 'en'
)

# Suggestions are ranked by frequency
suggestions = strategy.suggest('helo', max_results: 5)
# => [
#   { word: "hello", score: 1200 },  # High frequency word
#   { word: "help", score: 1150 },   # Medium frequency word
#   ...
# ]
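
One way to combine edit distance with a frequency bonus is sketched below. It is not the library's EditDistanceStrategy: the scoring formula (1000 minus 100 per edit, plus the tier bonus) is an assumption, and `levenshtein`, `frequency_bonus`, and `rank` are hypothetical helpers. Only the bonus values 200 (top 50) and 100 (top 200) come from the example earlier in this README.

```ruby
require "set"

TIER_BONUS = { top_50: 200, top_200: 100 }.freeze

# Classic dynamic-programming Levenshtein distance (two-row variant).
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Bonus of the first tier containing the word, or 0 if none does.
def frequency_bonus(word, tiers)
  tier, = TIER_BONUS.find { |name, _| tiers.fetch(name, Set.new).include?(word) }
  tier ? TIER_BONUS[tier] : 0
end

# Keep candidates within max_distance edits, best score first.
def rank(word, dictionary, tiers, max_distance: 2)
  scored = dictionary.filter_map do |candidate|
    distance = levenshtein(word, candidate)
    next if distance > max_distance

    { word: candidate, score: 1000 - 100 * distance + frequency_bonus(candidate, tiers) }
  end
  scored.sort_by { |s| [-s[:score], s[:word]] }
end

tiers = { top_200: Set["hello"] }
puts rank("helo", %w[hole help hello world], tiers).map { |s| s[:word] }.inspect
```

Sorting on the pair of negated score and word keeps the output deterministic when two candidates tie.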

Cache TTL and Expiration

All cached resources have a Time-To-Live (TTL) and automatically expire:

LanguageCache: 7 days (dictionaries change infrequently)
FrequencyCache: 7 days (frequency lists are stable)
ModelCache: 30 days (models are large and change rarely)

Expired resources are automatically re-downloaded on next access.

cache = Kotoshu::Cache::FrequencyCache.new

# Force re-download (ignores cache)
freq_data = cache.get('en', force_download: true)

# Clean expired entries manually
cache.clean
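
The expiry rule itself is a timestamp comparison. The sketch below restates the TTLs listed above and checks an entry's age against them; the `expired?` helper is hypothetical, and treating an entry as expired only once its age strictly exceeds the TTL is an assumption.

```ruby
# TTLs in seconds, matching the values listed above.
TTL_SECONDS = {
  language: 7 * 24 * 60 * 60,  # 604_800
  frequency: 7 * 24 * 60 * 60, # 604_800
  model: 30 * 24 * 60 * 60     # 2_592_000
}.freeze

# True when the entry is older than the TTL of its cache type.
def expired?(cache_type, downloaded_at, now: Time.now)
  now - downloaded_at > TTL_SECONDS.fetch(cache_type)
end

downloaded = Time.now - (8 * 24 * 60 * 60) # eight days ago
puts expired?(:language, downloaded)
puts expired?(:model, downloaded)
```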

Manual Cache Management

cache = Kotoshu::Cache::LanguageCache.new

# Clear specific resource
cache.clear('en:spelling')

# Clear all resources
cache.clear_all

# Check if resource exists
cache.available?('en:spelling')  # => true after download

# Get statistics
stats = cache.stats
puts "Hit rate: #{stats[:hit_rate] * 100}%"
puts "Size: #{stats[:size_bytes]} bytes"

GitHub Repository Structure

The kotoshu/dictionaries repository follows this structure:

kotoshu/dictionaries/
├── en/
│   ├── spelling/
│   │   ├── index.dic          # Hunspell dictionary
│   │   ├── index.aff          # Hunspell affix rules
│   │   └── metadata.json      # Version info
│   ├── grammar/
│   │   └── rules.yaml         # Grammar rules (future)
│   └── models/
│       ├── fasttext/
│       │   └── cc.en.300.vec  # FastText vectors
│       └── onnx/
│           └── fasttext.en.onnx # ONNX model
├── de/
│   └── ... (same structure)
└── README.md

kotoshu/frequency-list-kelly/
├── data/
│   ├── en.json               # Kelly frequency data
│   ├── ru.json
│   └── ...
└── README.md

Adding New Languages

To add support for a new language:

Dictionary: Add Hunspell dictionary to kotoshu/dictionaries/{code}/spelling/
Frequency: Add Kelly frequency data to kotoshu/frequency-list-kelly/data/{code}.json
Register: Add to AVAILABLE_LANGUAGES in LanguageCache
Test: Run integration tests to verify

See CONTRIBUTING.adoc for detailed guidelines.

Model Repository

ONNX models are hosted at: kotoshu/dictionaries

Download and setup:

# Preferred: let kotoshu fetch and verify the model
kotoshu setup en --want spelling,model

# Manual clone (advanced; bypasses manifest verification)
git clone https://github.com/kotoshu/dictionaries.git ~/src/kotoshu/dictionaries

License

BSD 2-Clause — see the LICENSE file for details.

Bundled dictionaries and frequency lists carry their own licenses; see the per-language license files in kotoshu/dictionaries.

Contributing

Contributions are welcome! Please see CONTRIBUTING.adoc for guidelines.

Acknowledgments

FastText: Facebook Research
ONNX Runtime: Microsoft
Hunspell: László Németh