
Status

Kotoshu is v0.3.0 — building on the 0.2 cut, this release adds a strict two-stage resource model (explicit setup, cache-only hot path), XDG base directory layout, SHA-256 integrity verification, SARIF output, an --interactive review loop, and ONNX model pipeline wiring.

What works in 0.3

  • Two-stage resource model — Kotoshu.setup(:en) then Kotoshu.correct?("hello"). The hot path is cache-only and raises ResourceNotSetupError on miss; downloads are never implicit.

  • Kotoshu.check(text, language: "en") / Kotoshu.suggest("helo") — full document check and suggestions

  • Kotoshu.spellchecker_for(lang, strict: true) — re-raise on optional-resource failures

  • kotoshu check FILE CLI with these flags:

    • --language en|de|es|fr|pt|ru|auto (default: auto)

    • --format text|json|sarif

    • --offline — use only cached resources, never download

    • --strict — exit 3 if any optional resource (frequency, model) can’t load

    • --interactive — review each error after the check

    • --verbose

  • kotoshu setup LANGUAGE [LANGUAGE …​] — pre-warm spelling + frequency + ONNX caches for offline use (fetch is kept as a hidden deprecated alias)

  • Local-source setup: kotoshu setup en --aff path/to.en.aff --dic path/to.en.dic or kotoshu setup en --from /path/to/dict/dir/

  • Exit codes: 0 clean, 1 errors found, 2 usage error, 3 resource setup failed

  • SHA-256 integrity verification (manifest-based, with graceful degradation when manifest is absent)

  • Offline mode via KOTOSHU_OFFLINE=1 or --offline

  • XDG base directory layout: caches in $XDG_CACHE_HOME/kotoshu/, config in $XDG_CONFIG_HOME/kotoshu/, data in $XDG_DATA_HOME/kotoshu/ (overridable via KOTOSHU_CACHE_PATH, KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH)
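
The manifest-based integrity check with graceful degradation listed above can be pictured roughly as follows. This is a minimal sketch, not Kotoshu's implementation: the manifest shape (a hash from file name to hex digest) and the `verify_resource` helper are assumptions made for illustration.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compares a file's SHA-256 against a manifest entry.
# Returns :ok, :mismatch, or :unverified (no manifest, or no entry for the file).
def verify_resource(path, manifest)
  expected = manifest && manifest[File.basename(path)]
  return :unverified if expected.nil?

  actual = Digest::SHA256.file(path).hexdigest
  actual == expected.downcase ? :ok : :mismatch
end

# Demo with a throwaway file standing in for a cached dictionary.
file = Tempfile.new(['index', '.dic'])
file.write("1\nhello\n")
file.flush

name = File.basename(file.path)
good_manifest = { name => Digest::SHA256.hexdigest("1\nhello\n") }
bad_manifest  = { name => '0' * 64 }

puts verify_resource(file.path, good_manifest) # ok
puts verify_resource(file.path, bad_manifest)  # mismatch
puts verify_resource(file.path, nil)           # unverified
```

Returning a distinct `:unverified` value, rather than failing, is one way to model degradation when the manifest is absent while still letting a strict caller treat it as an error.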

Planned for 0.4+

  • --output (file output redirection)

  • ONNX semantic reranking as default path

  • ≥30 language modules wired

  • Grammar rule packs

  • CJK and RTL language support

See the 0.2 cut plan, the 0.3 tasks under TODO.impl/, and the vision for the path to 1.0.

Purpose

Kotoshu 「言修」 is a pure-Ruby spell checker that aims to work for every language by dynamically downloading the right combination of dictionary, frequency data, and embedding model on demand.

The current release pairs a Ruby port of the Hunspell algorithm (traditional morphological lookup + affix rules) with optional FastText word embeddings converted to ONNX for context-aware reranking.

Note
The semantic (ONNX) path is an optional feature. gem install kotoshu works without onnxruntime; install it separately (gem install onnxruntime) to enable context-aware reranking. Set KOTOSHU_NO_ONNX=1 to opt back out.

Features

Note
The list below describes the design vision. See Status for exactly what works in 0.3 and what is planned for 0.4+.

Architecture

Kotoshu is built on a modern, semantic architecture:

Architecture overview
╔═══════════════════════════════════════════════════════════════════╗
║                    Kotoshu Semantic Architecture                ║
╠═══════════════════════════════════════════════════════════════════╣
║                                                                   ║
║  ┌─────────────────────────────────────────────────────────────┐  ║
║  │                     Interface Layer                          │  ║
║  │  ┌─────────────────────┐  ┌─────────────────────────────┐   │  ║
║  │  │   CLI (Thor)        │  │      Ruby API               │   │  ║
║  │  │   lib/kotoshu/cli/  │  │   Kotoshu module methods    │   │  ║
║  │  └──────────┬──────────┘  └───────────┬─────────────────┘   │  ║
║  │             │  Auto Language Detect   │                     │  ║
║  └─────────────┼──────────────────────────┼─────────────────────┘  ║
║                │                          │                        ║
║                ▼                          ▼                        ║
║  ┌─────────────────────────────────────────────────────────────┐  ║
║  │                   Analysis Layer                             │  ║
║  │  ┌──────────────┐  ┌─────────────┐  ┌───────────────────┐  │  ║
║  │  │    Hunspell  │  │  FastText   │  │  Hybrid (Best!)   │  │  ║
║  │  │  Dictionary  │  │  Embeddings │  │  Combined         │  │  ║
║  │  │  (Traditional)│  │  (ONNX)    │  │  Approach         │  │  ║
║  │  └──────────────┘  └─────────────┘  └───────────────────┘  │  ║
║  └───────────────────────────┬─────────────────────────────────┘  ║
║                              │                                    ║
║  ┌───────────────────────────▼─────────────────────────────────┐  ║
║  │                  Model Layer (ONNX)                          │  ║
║  │  ┌──────────────────────────────────────────────────────┐  │  ║
║  │  │  ONNX Runtime → Fast Embedding Lookup                 │  │  ║
║  │  │  Semantic Similarity → Context-Aware Suggestions      │  │  ║
║  │  │  Nearest Neighbor Search → Smart Corrections          │  │  ║
║  │  └──────────────────────────────────────────────────────┘  │  ║
║  └─────────────────────────────────────────────────────────────┘  ║
║                                                                   ║
╚═══════════════════════════════════════════════════════════════════╝

Key Components

  • Kotoshu::Models::OnnxModel: ONNX-based word embedding model for fast semantic similarity and nearest neighbor search.

  • Kotoshu::Analyzers::SemanticAnalyzer: Unified semantic error detection using word embeddings (no artificial spelling/grammar split).

  • Kotoshu::Language::LanguageIdentifier: Automatic language detection using FastText LID model (127 languages).

  • Kotoshu::Cli::InteractiveReviewer: Interactive CLI for error review with full navigation (forward, backward, jump, skip, accept).

  • Kotoshu::Dictionary::Hunspell: Traditional Hunspell dictionary backend for morphological analysis and affix rules.

Why ONNX?

ONNX Runtime provides:

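  • Performance: C++ implementation, typically far faster than equivalent pure-Ruby numeric code (the project cites 10-100x; actual gains depend on model and hardware)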

  • Portability: Works on CPU, GPU, TPU, mobile devices

  • Optimization: Automatic graph optimization and quantization

  • Interoperability: Models can be trained in Python, deployed in Ruby

Kotoshu uses FastText models converted to ONNX format for semantic spell checking.

Semantic Analysis

Unlike traditional spell checkers that only check dictionary membership and edit distance, Kotoshu uses semantic similarity to:

  • Detect contextually appropriate corrections ("desert" vs "dessert")

  • Handle out-of-vocabulary words via subword embeddings

  • Provide ranked suggestions based on semantic similarity

  • Support compound words and morphological variations

Example 1. Usage example
Kotoshu.setup(:en, want: %i[spelling model])  # one-time per language

# Traditional: knows "helo" is wrong and lists edit-distance candidates
Kotoshu.suggest("helo").to_words
# => ["hello", "help", "held", "hell", "hole"]

# Semantic: reranks candidates by context similarity
model = Kotoshu::Models::OnnxModel.from_github("en")
analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
analyzer.suggest_corrections("helo", context: "I said helo to the world").map(&:word)
# => ["hello"]  # "hello" makes more sense in greeting context
Note
The semantic path requires the optional onnxruntime gem. See Requirements.
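
To make "reranks candidates by context similarity" concrete, here is a small self-contained sketch using cosine similarity over toy vectors. It does not call Kotoshu or ONNX Runtime; the `EMBEDDINGS` table, its 3-dimensional values, and the `rerank` helper are all invented for illustration (real FastText vectors have 300 dimensions).

```ruby
# Toy embeddings, hand-picked so that greeting-related words point the same way.
EMBEDDINGS = {
  'hello' => [0.9, 0.1, 0.0],
  'help'  => [0.1, 0.9, 0.0],
  'hole'  => [0.0, 0.1, 0.9],
  'said'  => [0.8, 0.2, 0.1],
  'world' => [0.7, 0.1, 0.2]
}.freeze

# Cosine similarity between two equal-length vectors.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# Component-wise mean of a list of vectors.
def mean_vector(vectors)
  vectors.transpose.map { |column| column.sum / column.size }
end

# Orders candidates by similarity to the averaged context vector.
# Context words without an embedding are ignored.
def rerank(candidates, context_words)
  known = context_words.filter_map { |word| EMBEDDINGS[word] }
  return candidates if known.empty?

  context = mean_vector(known)
  candidates.sort_by { |word| -cosine(EMBEDDINGS.fetch(word), context) }
end

puts rerank(%w[help hole hello], %w[i said to the world]).inspect
# ["hello", "help", "hole"]
```

Averaging the context vectors is the simplest possible context model; it is used here only to show why a greeting context pulls "hello" to the top.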

Multi-Language Support

Kotoshu supports 6 languages with full semantic analysis:

  • de - German (Deutsch)

  • en - English

  • es - Spanish (Español)

  • fr - French (Français)

  • pt - Portuguese (Português)

  • ru - Russian (Русский)

Automatic language detection is enabled by default:

Example 2. Usage example
# Language auto-detected from document content
kotoshu check document.txt
# Detected: en (95% confidence)
# Analyzing document.txt (language: en)...

# Explicit language specification
kotoshu check document.txt --language de

ONNX Models

Kotoshu uses FastText crawl vectors converted to ONNX format:

  • Source: FastText Crawl Vectors

  • Format: ONNX with optimized runtime

  • Vocabulary: 2 million words per language (full coverage)

  • Dimension: 300-dimensional word vectors

  • Size: ~2.4GB per language

FastText File Formats

FastText provides two file formats. Kotoshu uses the .vec format for ONNX conversion.

| Aspect | .vec (Text) | .bin (Binary) |
| Content | Word vectors only (pre-computed embeddings) | Full FastText model (trained model) |
| Structure | Text: one word + 300 floats per line | Binary: complete model with matrices |
| File Size | ~1.3GB compressed (~2.4GB uncompressed) | ~1.8GB compressed (~4.8GB uncompressed) |
| Train New Words | ✗ No (static lookup only) | ✓ Yes (can train/OOV with subword info) |
| Subword Embeddings | ✗ No | ✓ Yes (n-gram character embeddings) |
| ONNX Converter | ✓ Supported (what we use) | ✗ Not supported |
| Use Case | Simple word vector lookup for spell checking | Full FastText functionality (training, OOV) |

Kotoshu uses .vec files because:

  • Simpler extraction: Just word → vector mapping

  • No subword complexity needed: Dictionary-based spell checking doesn’t require OOV generation

  • Smaller ONNX models: ~2.4GB vs ~4.8GB

  • Faster conversion: Direct serialization to ONNX
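
The "word → vector mapping" in a .vec file is easy to see in code. The sketch below parses the fastText text format (a header line with the word count and dimension, then one word plus its floats per line) from an in-memory sample. The `parse_vec` helper is hypothetical and is not part of Kotoshu, which downloads pre-converted ONNX artifacts instead.

```ruby
require 'stringio'

# Parses the fastText .vec text format into a Hash of word => Array<Float>.
# Header line: "<count> <dim>"; each following line: word followed by <dim> floats.
def parse_vec(io)
  count, dim = io.gets.split.map(&:to_i)
  vectors = {}

  io.each_line do |line|
    word, *values = line.chomp.split(' ')
    next if word.nil? # skip blank lines
    raise ArgumentError, "bad dimension for #{word}" unless values.size == dim

    vectors[word] = values.map(&:to_f)
  end

  raise ArgumentError, 'header count mismatch' unless vectors.size == count
  vectors
end

sample = StringIO.new(<<~VEC)
  2 3
  hello 0.1 0.2 0.3
  world -0.4 0.5 0.6
VEC

vectors = parse_vec(sample)
puts vectors['world'].inspect # [-0.4, 0.5, 0.6]
```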

Example 3. Model management
# Set up a language with spelling + ONNX semantic model
kotoshu setup en --want spelling,model

# List what's set up in the cache
kotoshu setup --list

# Re-validate cached resources
kotoshu cache validate
Note
FastText .vec → ONNX conversion is done upstream in the `kotoshu/models-fasttext-onnx` repo. The CLI downloads pre-converted artifacts; users do not run conversion locally.

Interactive Mode

Note
Interactive mode shipped in 0.3.0. It is navigation-only — the session records which suggestions the user accepted but does not rewrite the source file yet.
kotoshu check README.md --interactive

Features in 0.3:

  • Navigate: [n] / Enter next, [p] previous, [l] list

  • Accept: [1-9] record suggestion N for the current error

  • Skip: [s] skip the current error

  • Quit: [q] exit the review loop
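
As a rough mental model of the navigation-only loop, the sketch below tracks a cursor over the error list and records accepted suggestions without touching the source text. `ReviewSession` and its fields are hypothetical names, not the API of Kotoshu::Cli::InteractiveReviewer, and the [l] list command is left out for brevity.

```ruby
# Hypothetical in-memory model of a review session.
ReviewSession = Struct.new(:errors, :index, :accepted, :skipped, :done) do
  def self.start(errors)
    new(errors, 0, {}, [], false)
  end

  # Applies one key press and returns the session.
  def handle(key)
    case key
    when 'n', ''
      advance
    when 'p'
      self.index = [index - 1, 0].max
    when 's'
      skipped << index
      advance
    when 'q'
      self.done = true
    when /\A[1-9]\z/
      # Record suggestion N for the current error; the file is not rewritten.
      choice = errors[index][:suggestions][key.to_i - 1]
      accepted[index] = choice if choice
    end
    self
  end

  private

  # Moves to the next error, stopping at the last one.
  def advance
    self.index = [index + 1, errors.size - 1].min
  end
end

session = ReviewSession.start([
  { word: 'helo', suggestions: %w[hello help] },
  { word: 'wrld', suggestions: %w[world] }
])
%w[1 n s q].each { |key| session.handle(key) }
puts session.accepted[0] # hello
```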

Batch Processing

For CI/CD and automation, Kotoshu supports JSON and SARIF output in 0.3; --output file redirection is planned for 0.4+.

Example 4. JSON output for CI/CD
# JSON output to stdout (supported in 0.3)
kotoshu check README.md --format json

# SARIF 2.1.0 output (supported in 0.3)
kotoshu check README.md --format sarif

# Exit code for CI
kotoshu check README.md
echo $?  # 0 if no errors, 1 if errors found
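
For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 log in Ruby. The top-level field names (`version`, `runs`, `tool.driver.name`, `results`, `locations`) follow the SARIF specification; the `ruleId` value, the message wording, and the input hash shape are assumptions and may differ from what `kotoshu check --format sarif` actually emits.

```ruby
require 'json'

# Builds a minimal SARIF 2.1.0 log from a list of spelling errors.
# Each error is assumed to be a Hash with :word, :line and :column.
def to_sarif(file, errors)
  results = errors.map do |error|
    {
      'ruleId' => 'spelling', # assumed rule identifier
      'level' => 'warning',
      'message' => { 'text' => "Unknown word: #{error[:word]}" },
      'locations' => [{
        'physicalLocation' => {
          'artifactLocation' => { 'uri' => file },
          'region' => { 'startLine' => error[:line], 'startColumn' => error[:column] }
        }
      }]
    }
  end

  {
    'version' => '2.1.0',
    'runs' => [{ 'tool' => { 'driver' => { 'name' => 'kotoshu' } }, 'results' => results }]
  }
end

log = to_sarif('README.md', [{ word: 'wrold', line: 1, column: 7 }])
puts JSON.pretty_generate(log)
```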

Document Formats

Kotoshu supports structured documents with AST parsing:

  • Plain text: Line-based error detection

  • Markdown: AST-based using Kramdown parser

  • AsciiDoc: AST-based using Asciidoctor parser

Structured documents preserve node paths for precise error location.
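
For the plain-text case, locating an error comes down to turning a character offset into a line and column. The helper below is a small illustration of that conversion, assuming 1-based lines and columns; `line_and_column` is a hypothetical name, and the AST node paths used for Markdown and AsciiDoc are not modelled here.

```ruby
# Converts a 0-based character offset into a 1-based [line, column] pair.
def line_and_column(text, offset)
  before = text[0, offset]
  line = before.count("\n") + 1
  last_newline = before.rindex("\n")
  column = last_newline ? offset - last_newline : offset + 1
  [line, column]
end

text = "Hello wrold\nSecond lnie here\n"
puts line_and_column(text, text.index('wrold')).inspect # [1, 7]
puts line_and_column(text, text.index('lnie')).inspect  # [2, 8]
```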

Analysis Models

Note
In 0.2, the CLI runs the Hunspell traditional path only. The --model flag and FastText/Hybrid paths are planned for 0.3+. The Ruby API can opt into the semantic path today via Kotoshu::Models::OnnxModel (auto-available when onnxruntime is installed).

Kotoshu is designed to support three analysis models:

Table 1. Dictionary backend comparison
| Model | Description | Best For |
| hunspell | Traditional dictionary-based with morphological rules | Fast checking, compound words, languages with complex morphology |
| fasttext | Pure semantic embeddings via ONNX | Context awareness, out-of-vocabulary words, semantic similarity |
| hybrid | Hunspell candidates + FastText reranking (recommended) | Maximum accuracy, best of both worlds |

Example 5. Intended usage (0.3+)
# Fast dictionary-based checking (default in 0.2)
kotoshu check document.txt                # 0.2: Hunspell path

# Semantic / hybrid paths: planned for 0.3
# kotoshu check document.txt --model fasttext
# kotoshu check document.txt --model hybrid

Installation

Add this line to your application’s Gemfile:

gem 'kotoshu'

And then execute:

bundle install

Or install it yourself as:

gem install kotoshu
Note
onnxruntime is an optional dependency. Install it separately (gem install onnxruntime) to enable semantic analysis; the traditional Hunspell path works without it.

Quick Start

# One-time per language: download spelling dictionary from
# github.com/kotoshu/dictionaries (idempotent, ~5 MB)
kotoshu setup en

# Then check files instantly, cache-only
kotoshu check README.md

Or skip the explicit setup — the CLI will prompt interactively the first time you check a file in a non-cached language (TTY only; in non-TTY or KOTOSHU_OFFLINE=1 mode it exits with code 3).

Command-line usage
# Check a file (uses --language, or auto-detects from content)
kotoshu check README.md

# Explicit language
kotoshu check README.md --language en

# JSON output for programmatic use
kotoshu check README.md --format json

# Offline mode — use only cached dictionaries, never download
kotoshu check README.md --offline

# Check stdin
echo "helo wrld" | kotoshu check

Exit codes: 0 (no errors), 1 (errors found), 2 (usage error), 3 (language not set up — run kotoshu setup LANG, or run kotoshu check in a TTY to be prompted).
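
A script that wraps the CLI can branch on these exit codes. The sketch below maps the four documented codes to an outcome; the `ci_outcome` helper and its `:pass`/`:warn`/`:fail`/`:abort` labels are illustrative choices, not part of Kotoshu.

```ruby
# Exit codes as documented above.
EXIT_MEANINGS = {
  0 => :clean,
  1 => :errors_found,
  2 => :usage_error,
  3 => :resource_not_set_up
}.freeze

# Decides what a CI step should do with a given exit status.
def ci_outcome(status, fail_on_errors: true)
  case EXIT_MEANINGS.fetch(status, :unknown)
  when :clean then :pass
  when :errors_found then fail_on_errors ? :fail : :warn
  else :abort # usage error, missing resources, or an unexpected code
  end
end

# In a real script the status would come from the CLI, for example:
#   system('kotoshu', 'check', 'README.md', '--offline')
#   status = $?.exitstatus
puts ci_outcome(0) # pass
puts ci_outcome(1) # fail
puts ci_outcome(3) # abort
```

Treating code 3 separately from code 1 lets a pipeline distinguish "the document has typos" from "the runner has no dictionaries cached".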

Ruby API usage
require 'kotoshu'

# Stage 1: set up the language once (downloads from github.com/kotoshu/dictionaries)
Kotoshu.setup(:en)

# Stage 2: hot-path checks are cache-only and never touch the network
Kotoshu.correct?("hello")  # => true
Kotoshu.correct?("helo")   # => false

# Suggestions return a SuggestionSet; call #to_words for an Array
Kotoshu.suggest("helo").to_words  # => ["hello", "help", "held", ...]

# Check a document
result = Kotoshu.check("Hello wrold")
result.errors.map(&:word)  # => ["wrold"]

# Each error carries position + suggestions
result = Kotoshu.check_file("README.md")
result.errors.each do |error|
  puts "#{error.word} at offset #{error.position}: #{error.top_suggestions(3).join(', ')}"
end

# Semantic analysis is optional — requires the onnxruntime gem
# (gem install onnxruntime). Skip this block if you only want Hunspell.
if Kotoshu::Models::OnnxModel::ONNX_LOADED
  Kotoshu.setup(:en, want: %i[spelling model])
  model = Kotoshu::Models::OnnxModel.from_github('en')
  analyzer = Kotoshu::Analyzers::SemanticAnalyzer.new(model)
  analyzer.analyze(Kotoshu.check("Hello wrold"))
end
Note
The library API is strict: calls like Kotoshu.correct? raise Kotoshu::ResourceNotSetupError until you’ve run Kotoshu.setup. This prevents surprise downloads on metered networks. The CLI (kotoshu check) intercepts the error and prompts to download interactively.

Requirements

  • Ruby 3.1+

  • onnxruntime gem (optional — enables semantic spell checking; install separately with gem install onnxruntime)

  • Python 3 + fasttext (optional, only needed to convert .vec files to .onnx upstream)

Resource Caching and Language Support

Kotoshu uses a sophisticated multi-layer caching system to manage dictionaries, frequency lists, and embedding models. Resources are downloaded explicitly via Kotoshu.setup (or kotoshu setup) and cached under the XDG base directory layout (~/.cache/kotoshu/ by default; override via KOTOSHU_CACHE_PATH, KOTOSHU_CONFIG_PATH, KOTOSHU_DATA_PATH, or the XDG_*_HOME vars).
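
The directory resolution described above might look like the following. This is a sketch under stated assumptions: the precedence (KOTOSHU_* override first, then the XDG variable, then the XDG default under the home directory) is inferred from the text rather than taken from the source, and `kotoshu_dirs` is a hypothetical helper. XDG_DATA_HOME is the data variable defined by the XDG Base Directory specification.

```ruby
# Per directory kind: Kotoshu override, XDG variable, XDG default relative to $HOME.
XDG_DEFAULTS = {
  cache:  ['KOTOSHU_CACHE_PATH',  'XDG_CACHE_HOME',  '.cache'],
  config: ['KOTOSHU_CONFIG_PATH', 'XDG_CONFIG_HOME', '.config'],
  data:   ['KOTOSHU_DATA_PATH',   'XDG_DATA_HOME',   File.join('.local', 'share')]
}.freeze

# Treats empty strings as unset, as the XDG specification requires.
def present(value)
  value unless value.nil? || value.empty?
end

# Resolves the three directories from an environment-like Hash.
def kotoshu_dirs(env, home)
  XDG_DEFAULTS.transform_values do |(override, xdg_var, fallback)|
    present(env[override]) ||
      File.join(present(env[xdg_var]) || File.join(home, fallback), 'kotoshu')
  end
end

puts kotoshu_dirs({}, '/home/u')[:cache]                                # /home/u/.cache/kotoshu
puts kotoshu_dirs({ 'XDG_CACHE_HOME' => '/tmp/c' }, '/home/u')[:cache]   # /tmp/c/kotoshu
puts kotoshu_dirs({ 'KOTOSHU_CACHE_PATH' => '/opt/k' }, '/home/u')[:cache] # /opt/k
```

Passing the environment in as a Hash (rather than reading ENV directly) keeps the function easy to test; a caller would pass `ENV.to_h` and `Dir.home`.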

Cache Architecture

Cache System Class Diagram
┌────────────────────────────────────────────────────────────────────────────┐
│                              BaseCache (Abstract)                        │
│  ┌────────────────────────────────────────────────────────────────────┐   │
│  │ Common: download, metadata, validation, stats, TTL management     │   │
│  └────────────────────────────────────────────────────────────────────┘   │
└────────────────────┬───────────────────┬────────────────────┬──────────────┘
                     │                   │                    │
        ┌────────────▼────────┐  ┌──────▼──────┐  ┌───────▼─────────┐
        │   LanguageCache     │  │ModelCache   │  │ FrequencyCache  │
        │  (Dictionaries)     │  │ (Embeddings)│  │  (Kelly Lists)  │
        └─────────────────────┘  └─────────────┘  └─────────────────┘
                     │                   │                    │
        ┌────────────▼────────┐  ┌──────▼──────┐  ┌───────▼─────────┐
        │ ~/.cache/kotoshu/  │  │~/.cache/    │  │ ~/.cache/kotoshu/│
        │   languages/       │  │  kotoshu/   │  │frequency-lists/ │
        │                    │  │  models/    │  │                 │
        └─────────────────────┘  └─────────────┘  └─────────────────┘

Cache Types

LanguageCache (Dictionaries)

Manages Hunspell dictionaries and grammar rules for spell checking.

  • Cache Path: ~/.cache/kotoshu/languages/{code}/

  • TTL: 7 days (604,800 seconds)

  • Source: kotoshu/dictionaries

  • Resources per language:

    • spelling/: Hunspell dictionary (index.dic, index.aff)

    • grammar/: Grammar rules (rules.yaml), planned for a future release

    • frequency/: Frequency data, deprecated; use FrequencyCache instead

Usage
# Access via cache
cache = Kotoshu::Cache::LanguageCache.new
dict = cache.get_spelling('en')

# Result:
# {
#   aff_path: "~/.cache/kotoshu/languages/en/spelling/index.aff",
#   dic_path: "~/.cache/kotoshu/languages/en/spelling/index.dic",
#   cached: true,
#   metadata: { ... }
# }

FrequencyCache (Kelly Project)

Manages Kelly Project frequency lists for intelligent suggestion ranking.

  • Cache Path: ~/.cache/kotoshu/frequency-lists/{code}/

  • TTL: 7 days (604,800 seconds)

  • Source: kotoshu/frequency-list-kelly

  • Format: JSON with tiered word frequency data

Kelly Frequency Data Structure
{
  "metadata": {
    "language": "en",
    "source": "Kelly Project (University of Leeds)",
    "total_words_analyzed": 1500000
  },
  "tiers": {
    "top_50": {
      "words": ["the", "be", "to", "of", "and", ...],
      "info": "Most common 50 words"
    },
    "top_200": {
      "words": ["will", "my", "one", "all", ...],
      "info": "Most common 200 words"
    },
    "top_1000": {
      "words": ["however", "although", ...],
      "info": "Most common 1000 words"
    }
  }
}
Usage
# Access via cache
cache = Kotoshu::Cache::FrequencyCache.new
freq_data = cache.get('en', force_download: true)

# Result:
# {
#   frequency_path: "~/.cache/kotoshu/frequency-lists/en/frequency.json",
#   tiers: {
#     top_50: Set<...>,
#     top_200: Set<...>,
#     top_1000: Set<...>
#   },
#   metadata: { ... }
# }

# Integrated into EditDistanceStrategy
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
  language_code: 'en'
)
strategy.frequency_bonus('the')   # => 200 (top 50)
strategy.frequency_bonus('hello') # => 100 (top 200)
strategy.frequency_bonus('xyz')   # => 0 (not in lists)
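
The tier lookup behind those bonus values can be sketched without the gem. The version below uses only the two bonuses shown above (200 for top 50, 100 for top 200); the bonus for the top_1000 tier is not stated here, so it is left out. The tier contents and the `rank_by_frequency` helper are made up for the example.

```ruby
require 'set'

# Bonuses taken from the examples above; other tiers are intentionally omitted.
TIER_BONUS = { top_50: 200, top_200: 100 }.freeze

# Returns the bonus of the first (most frequent) tier that contains the word.
def frequency_bonus(word, tiers)
  TIER_BONUS.each do |tier, bonus|
    return bonus if tiers.fetch(tier, Set.new).include?(word.downcase)
  end
  0
end

# Sorts candidates by descending bonus, keeping the original order on ties.
def rank_by_frequency(candidates, tiers)
  candidates.sort_by.with_index { |word, i| [-frequency_bonus(word, tiers), i] }
end

tiers = {
  top_50:  Set['the', 'be', 'to'],
  top_200: Set['hello', 'will', 'my']
}

puts frequency_bonus('the', tiers)                          # 200
puts frequency_bonus('xyz', tiers)                          # 0
puts rank_by_frequency(%w[xyz hello the], tiers).inspect    # ["the", "hello", "xyz"]
```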

ModelCache (Embedding Models)

Manages FastText and ONNX embedding models for semantic spell checking.

  • Cache Path: ~/.cache/kotoshu/models/{code}/models/{type}/

  • TTL: 30 days (2,592,000 seconds)

  • Sources:

  • Supported Types:

    • fasttext: FastText word vectors (.vec files, 300D) - Downloaded from Facebook CDN

    • onnx: ONNX-converted models (.onnx files) - Auto-converted from FastText

Note
ONNX models are automatically converted from FastText models on first use. The conversion uses lib/kotoshu/scripts/fasttext_to_onnx.py and requires Python 3 with numpy and onnx packages installed.
Table 2. Model Files by Language

| Language | FastText File | ONNX File |
| de (German) | cc.de.300.vec | fasttext.de.onnx |
| en (English) | cc.en.300.vec | fasttext.en.onnx |
| es (Spanish) | cc.es.300.vec | fasttext.es.onnx |
| fr (French) | cc.fr.300.vec | fasttext.fr.onnx |
| pt (Portuguese) | cc.pt.300.vec | fasttext.pt.onnx |
| ru (Russian) | cc.ru.300.vec | fasttext.ru.onnx |

CLI Cache Management

Kotoshu provides CLI commands for managing cached resources:

# List all cached resources
kotoshu cache list

# List specific cache type
kotoshu cache list language
kotoshu cache list model
kotoshu cache list frequency

# Show cache statistics
kotoshu cache status

# Show detailed status (verbose)
kotoshu cache status --verbose

# Download a resource
kotoshu cache download language en
kotoshu cache download model en:fasttext
kotoshu cache download frequency en

# Get information about a resource
kotoshu cache info language en
kotoshu cache info model en:fasttext
kotoshu cache info frequency en

# Purge cached data
kotoshu cache purge all
kotoshu cache purge language en
kotoshu cache purge frequency

# Clean expired entries
kotoshu cache clean

Cache Statistics

Each cache type tracks statistics:

  • Hits: Number of cache hits (resource found locally)

  • Misses: Number of cache misses (had to download)

  • Hit Rate: Percentage of cache hits

  • Size: Total disk space used

  • Cached Resources: Number of resources cached

$ kotoshu cache status
======================================================================
Kotoshu Cache Status
======================================================================

Language Cache:
  Directory: /Users/username/.cache/kotoshu/languages
  Resources cached: 2
  Size: 2.45 MB
  Hits: 15, Misses: 2
  Hit rate: 88.2%

Frequency Cache:
  Directory: /Users/username/.cache/kotoshu/frequency-lists
  Resources cached: 1
  Size: 815.84 KB
  Hits: 42, Misses: 1
  Hit rate: 97.7%

Model Cache:
  Directory: /Users/username/.cache/kotoshu/models
  Resources cached: 0
  Size: 0 B
  Hits: 0, Misses: 0
  Hit rate: 0.0%

Total:
  Total size: 3.26 MB
  Overall hit rate: 95.0%
======================================================================
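
The per-cache hit rates in this output follow directly from the hit and miss counters. A minimal sketch of the calculation, with a hypothetical `hit_rate` helper:

```ruby
# Percentage of lookups served from the cache, rounded to one decimal place.
def hit_rate(hits, misses)
  total = hits + misses
  return 0.0 if total.zero? # an unused cache reports 0.0%, as in the Model Cache above

  (hits * 100.0 / total).round(1)
end

puts hit_rate(15, 2) # 88.2
puts hit_rate(42, 1) # 97.7
puts hit_rate(0, 0)  # 0.0
```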

Language Support Matrix

Kotoshu provides multi-language support with varying feature availability.

Table 3. Complete Language Support Matrix
| Language | Dictionary | Hunspell Affix Rules | Kelly Frequency | FastText Model | ONNX Model | Notes |
| de (German) | ✓ (75,873 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTZ keyboard support |
| en (English) | ✓ (49,568 words) | ✓ | ✓ (815 KB) | ✓ (4.3 GB) | ✓ (~460 MB) | QWERTY keyboard support |
| es (Spanish) | ✓ (57,344 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTY keyboard support |
| fr (French) | ✓ (84,310 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | AZERTY keyboard support |
| pt (Portuguese) | ✓ (312,368 words) | ✓ | ✗ | ✓ (2.5 GB) | ✓ (~230 MB) | QWERTY keyboard support |
| ru (Russian) | ✓ (146,269 words) | ✓ | ✓ (780 KB) | ✓ (2.5 GB) | ✓ (~230 MB) | JCUKEN keyboard support |
| ar (Arabic) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| zh (Chinese) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| el (Greek) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| it (Italian) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| no (Norwegian) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |
| sv (Swedish) | ✗ | ✗ | ✓ | ✗ | ✗ | Kelly frequency only |

Table 4. Dictionary Sources
| Language | Word Count | License | Source |
| de (German) | 75,873 | GPL | igerman98 |
| en (English) | 49,568 | LGPL/MPL/GPL | SCOWL |
| es (Spanish) | 57,344 | GPL | LibreOffice |
| fr (French) | 84,310 | MPL 2.0 | Grammalecte |
| pt (Portuguese) | 312,368 | LGPLv3 + MPL | VERO |
| ru (Russian) | 146,269 | BSD-style | Alexander Lebedev |

Table 5. Kelly Frequency Lists
| Language | Size | Coverage |
| ar (Arabic) | ~750 KB | Top 1000 words |
| zh (Chinese) | ~800 KB | Top 1000 words |
| en (English) | 815 KB | Top 1000 words |
| el (Greek) | ~780 KB | Top 1000 words |
| it (Italian) | ~790 KB | Top 1000 words |
| no (Norwegian) | ~770 KB | Top 1000 words |
| ru (Russian) | 780 KB | Top 1000 words |
| sv (Swedish) | ~775 KB | Top 1000 words |

Note
Kelly frequency lists provide the top 1000 most common words from the Kelly Project (University of Leeds & University of Gothenburg). Languages not listed here require external frequency data sources.

Programmatic Usage

Using Language Cache

require 'kotoshu/cache/language_cache'

cache = Kotoshu::Cache::LanguageCache.new

# Get spelling dictionary
dict = cache.get_spelling('en')
puts "Dictionary: #{dict[:dic_path]}"
puts "Entries: #{File.foreach(dict[:dic_path]).count - 1}"  # the first line of a .dic file is the entry-count header

# Get available languages
cache.available_languages  # => ["de", "en", "es", "fr", "pt", "ru"]

# Check if resource is cached
cache.available?('en:spelling')  # => true

# Get language info
info = cache.language_info('en')
puts "Language: #{info[:name]}"
puts "Words: #{info[:word_count]}"
puts "License: #{info[:license]}"

Using Frequency Cache

require 'kotoshu/cache/frequency_cache'

cache = Kotoshu::Cache::FrequencyCache.new

# Get frequency data
freq_data = cache.get('en')

# Access frequency tiers
top_50 = freq_data[:tiers][:top_50]
top_50.include?('the')  # => true
freq_data[:tiers][:top_200].include?('hello')  # => true (in top 200)

# Get available languages
cache.available_languages  # => ["ar", "zh", "en", "el", "it", "no", "ru", "sv"]

Integration with Suggestion Strategies

require 'kotoshu/suggestions/strategies/edit_distance_strategy'

# Frequency bonuses automatically applied
strategy = Kotoshu::Suggestions::Strategies::EditDistanceStrategy.new(
  language_code: 'en'
)

# Suggestions are ranked by frequency
suggestions = strategy.suggest('helo', max_results: 5)
# => [
#   { word: "hello", score: 1200 },  # High frequency word
#   { word: "help", score: 1150 },   # Medium frequency word
#   ...
# ]

Cache TTL and Expiration

All cached resources have a Time-To-Live (TTL) and automatically expire:

  • LanguageCache: 7 days (dictionaries change infrequently)

  • FrequencyCache: 7 days (frequency lists are stable)

  • ModelCache: 30 days (models are large and change rarely)

Expired resources are automatically re-downloaded on next access.

cache = Kotoshu::Cache::FrequencyCache.new

# Force re-download (ignores cache)
freq_data = cache.get('en', force_download: true)

# Clean expired entries manually
cache.clean
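
The expiry rule itself is a simple age comparison. The sketch below encodes the three TTLs listed above and checks a download timestamp against them; `expired?` and the `TTL_SECONDS` table are hypothetical names, not the BaseCache API.

```ruby
SECONDS_PER_DAY = 86_400

# TTLs as listed above: 7 days for dictionaries and frequency lists, 30 days for models.
TTL_SECONDS = {
  language:  7 * SECONDS_PER_DAY,  # 604,800
  frequency: 7 * SECONDS_PER_DAY,  # 604,800
  model:     30 * SECONDS_PER_DAY  # 2,592,000
}.freeze

# True when a resource of the given kind is older than its TTL.
def expired?(kind, downloaded_at, now: Time.now)
  now - downloaded_at > TTL_SECONDS.fetch(kind)
end

now = Time.at(1_700_000_000)
eight_days_ago = now - 8 * SECONDS_PER_DAY

puts expired?(:language, eight_days_ago, now: now) # true
puts expired?(:model, eight_days_ago, now: now)    # false
```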

Manual Cache Management

cache = Kotoshu::Cache::LanguageCache.new

# Clear specific resource
cache.clear('en:spelling')

# Clear all resources
cache.clear_all

# Check if resource exists
cache.available?('en:spelling')  # => true after download

# Get statistics
stats = cache.stats
puts "Hit rate: #{stats[:hit_rate] * 100}%"
puts "Size: #{stats[:size_bytes]} bytes"

GitHub Repository Structure

The kotoshu/dictionaries repository follows this structure:

kotoshu/dictionaries/
├── en/
│   ├── spelling/
│   │   ├── index.dic          # Hunspell dictionary
│   │   ├── index.aff          # Hunspell affix rules
│   │   └── metadata.json      # Version info
│   ├── grammar/
│   │   └── rules.yaml         # Grammar rules (future)
│   └── models/
│       ├── fasttext/
│       │   └── cc.en.300.vec  # FastText vectors
│       └── onnx/
│           └── fasttext.en.onnx # ONNX model
├── de/
│   └── ... (same structure)
└── README.md

kotoshu/frequency-list-kelly/
├── data/
│   ├── en.json               # Kelly frequency data
│   ├── ru.json
│   └── ...
└── README.md

Adding New Languages

To add support for a new language:

  1. Dictionary: Add Hunspell dictionary to kotoshu/dictionaries/{code}/spelling/

  2. Frequency: Add Kelly frequency data to kotoshu/frequency-list-kelly/data/{code}.json

  3. Register: Add to AVAILABLE_LANGUAGES in LanguageCache

  4. Test: Run integration tests to verify

See CONTRIBUTING.adoc for detailed guidelines.

Model Repository

ONNX models are hosted at: kotoshu/dictionaries

Download and setup:

# Preferred: let kotoshu fetch and verify the model
kotoshu setup en --want spelling,model

# Manual clone (advanced; bypasses manifest verification)
git clone https://github.com/kotoshu/dictionaries.git ~/src/kotoshu/dictionaries

License

BSD 2-Clause — see the LICENSE file for details.

Bundled dictionaries and frequency lists carry their own licenses; see the per-language license files in kotoshu/dictionaries.

Contributing

Contributions are welcome! Please see CONTRIBUTING.adoc for guidelines.

Acknowledgments