Module: Kotoshu::Algorithms::NgramSuggest

Defined in:: lib/kotoshu/algorithms/ngram_suggest.rb

Overview

N-gram based suggestion algorithm.

Ported from Spylls (Python) ngram_suggest.py

This is the core Hunspell suggestion algorithm that uses n-gram similarity to rank and filter spelling corrections.

The algorithm works in three stages:

root_score: Quick n-gram score + left common substring
rough_affix_score: Affixed form n-gram score
precise_affix_score: Full scoring with LCS, bigrams, etc.

Constant Summary collapse

MAX_ROOTS = Maximum number of root words to consider in first pass

MAX_GUESSES = Maximum number of suggestions to generate

Class Method Summary collapse

.detect_threshold(word) ⇒ Float

Calculate minimum threshold for passable suggestions.
.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ Object

Filter guesses by score into quality buckets.
.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ Array<String>

Generate all possible affixed forms for a dictionary word.
.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ Float

Stage 3 scoring: Full precise scoring.
.root_score(word1, word2) ⇒ Float

Stage 1 scoring: 3-gram score + left common substring.
.rough_affix_score(word1, word2) ⇒ Float

Stage 2 scoring: N-gram score with n=len(word1) + left common substring.
.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ Object

Main entry point for n-gram based suggestions.

Class Method Details

.detect_threshold(word) ⇒ `Float`

Calculate minimum threshold for passable suggestions.

Mangles the word in 3 different ways (replacing each 4th char with ‘*’) and scores them to generate a minimum acceptable score.

Parameters:

word (String) —

The misspelled word

Returns:

(Float) —

Minimum threshold score

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 191

def detect_threshold(word)
  thresh = 0.0

  (1..3).each do |start_pos|
    mangled = word.chars.map.with_index do |char, pos|
      ((pos - start_pos) % 4).zero? && pos >= start_pos ? "*" : char
    end.join

    thresh += StringMetrics.ngram(word.length, word, mangled, any_mismatch: true)
  end

  # Take average of the three scores and subtract 1
  (thresh / 3.0) - 1
end

.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ `Object`

Filter guesses by score into quality buckets.

Score buckets:

> 1000: Very good (same word, different casing)
1000 to -100: Normal suggestions
< -100: Questionable (too different)

Stops yielding when:

A very good suggestion was found and then a normal one
A questionable suggestion was found (only yields one)

Parameters:

guesses (Array<Array>) —

Array of [score, value] pairs
known (Set<String>) —

Already suggested words
onlymaxdiff (Boolean) (defaults to: true) —

Whether to exclude questionable

Yields:

(String) —

Each filtered suggestion

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 242

def filter_guesses(guesses, known:, onlymaxdiff: true)
  seen = false
  found = 0

  guesses.each do |score, value|
    # Stop if we saw very good and now have normal suggestions
    return if seen && score <= 1000

    if score > 1000
      # Very good suggestion - set flag to only accept other very good ones
      seen = true
    elsif score < -100
      # Questionable suggestion
      # Stop if we already found good ones, or if we're excluding questionable
      return if found.positive? || onlymaxdiff
      seen = true
    end

    # Skip if this word was already suggested
    next if known.any? { |known_word| value.include?(known_word) }

    found += 1
    yield value
  end
end

.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ `Array<String>`

Generate all possible affixed forms for a dictionary word.

Parameters:

word_entry (Hash) —

Dictionary word with stem and flags
all_prefixes (Hash) —

Available prefixes
all_suffixes (Hash) —

Available suffixes
similar_to (String) —

Original misspelling (for filtering)

Returns:

(Array<String>) —

Generated forms

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 213

def forms_for(word_entry, all_prefixes, all_suffixes, similar_to:)
  stem = word_entry[:stem] || word_entry
  flags = word_entry[:flags] || []

  # Base form without affixes
  res = [stem]

  # Generate suffix forms
  # Simplified: just return base form for now
  # Full implementation would parse affix flags and apply them

  res
end

.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ `Float`

Stage 3 scoring: Full precise scoring.

Returns one of three “score groups”:

> 1000: Very good (same word, different casing)
< -100: Questionable (too different)
-100 to 1000: Normal suggestion

Parameters:

word1 (String) —

Misspelled word
word2 (String) —

Possible suggestion
diff_factor (Float) —

Factor based on MAXDIFF (0-2)
base (Float) —

Base score from stage 2
has_phonetic (Boolean) (defaults to: false) —

Whether PHONE table exists

Returns:

(Float) —

Precise affix score

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 141

def precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false)
  # Use lowercase for LCS to catch case-only differences
  word1_lower = word1.downcase
  word2_lower = word2.downcase

  lcs = StringMetrics.lcslen(word1_lower, word2_lower)

  # Same characters with different casing -- "very good" suggestion
  if word1.length == word2.length && word1.length == lcs
    return base + 2000
  end

  # Score is: 2 * LCS - length difference
  result = 2 * lcs - (word1.length - word2.length).abs

  # Add common start substring length
  result += StringMetrics.leftcommonsubstring(word1_lower, word2_lower)

  # Add 1 if any characters match at same positions
  result += 1 if StringMetrics.commoncharacters(word1_lower, word2_lower) > 0

  # Add regular 4-gram score
  result += StringMetrics.ngram(4, word1_lower, word2_lower, any_mismatch: true)

  # Add weighted bigrams (both directions)
  bigrams = (
    StringMetrics.ngram(2, word1_lower, word2_lower, any_mismatch: true, weighted: true) +
    StringMetrics.ngram(2, word2_lower, word1_lower, any_mismatch: true, weighted: true)
  )
  result += bigrams

  # Apply "questionable" threshold based on diff_factor and has_phonetic
  questionable_limit = if has_phonetic
                        word2.length * diff_factor
                      else
                        (word1.length + word2.length) * diff_factor
                      end

  result -= 1000 if bigrams < questionable_limit

  result
end

.root_score(word1, word2) ⇒ `Float`

Stage 1 scoring: 3-gram score + left common substring.

Parameters:

word1 (String) —

Misspelled word
word2 (String) —

Possible suggestion

Returns:

(Float) —

Root score

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 107

def root_score(word1, word2)
  # Use lowercase for comparison as per Hunspell
  word2_lower = word2.downcase

  StringMetrics.ngram(3, word1, word2_lower, longer_worse: true) +
    StringMetrics.leftcommonsubstring(word1, word2_lower).to_f
end

.rough_affix_score(word1, word2) ⇒ `Float`

Stage 2 scoring: N-gram score with n=len(word1) + left common substring.

Parameters:

word1 (String) —

Misspelled word
word2 (String) —

Possible suggestion

Returns:

(Float) —

Rough affix score

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 120

def rough_affix_score(word1, word2)
  # Use lowercase for comparison as per Hunspell
  word2_lower = word2.downcase

  StringMetrics.ngram(word1.length, word1, word2_lower, any_mismatch: true) +
    StringMetrics.leftcommonsubstring(word1, word2_lower).to_f
end

.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ `Object`

Main entry point for n-gram based suggestions.

This is a simplified version that works with basic dictionary structures. Full implementation would need affix flag parsing and Word model objects.

Parameters:

misspelling (String) —

The misspelled word
dictionary_words (Array<Hash>) —

Dictionary entries with stem and flags
prefixes (Hash) (defaults to: {}) —

Prefix flags to prefix objects mapping
suffixes (Hash) (defaults to: {}) —

Suffix flags to suffix objects mapping
known (Set<String>) (defaults to: Set.new) —

Already suggested words (to avoid duplicates)
maxdiff (Integer) (defaults to: 2) —

MAXDIFF value from aff file (0-10)
onlymaxdiff (Boolean) (defaults to: true) —

ONLYMAXDIFF flag
has_phonetic (Boolean) (defaults to: false) —

Whether PHONE table exists in aff file

Yields:

(String) —

Each suggestion

# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 38

def suggest(misspelling,
            dictionary_words:,
            prefixes: {},
            suffixes: {},
            known: Set.new,
            maxdiff: 2,
            onlymaxdiff: true,
            has_phonetic: false,
            &block)

  # Stage 1: Find best root candidates by n-gram score
  root_scores = []

  dictionary_words.each do |word_entry|
    stem = word_entry[:stem] || word_entry

    # Skip words with length difference > 4
    next if (stem.length - misspelling.length).abs > 4

    score = root_score(misspelling, stem)

    # Use heap to keep only MAX_ROOTS best results
    if root_scores.size >= MAX_ROOTS
      # Keep only the best scores
      root_scores = root_scores.sort.reverse.first(MAX_ROOTS)
    end

    root_scores << [score, word_entry] if score > 0
  end

  # Stage 2: Generate affixed forms and score them
  threshold = detect_threshold(misspelling)
  guess_scores = []

  # Sort by score descending
  root_scores.sort_by { |score, _| -score }.first(MAX_ROOTS).each do |(_, root_entry)|
    root = root_entry[:stem] || root_entry

    # Generate forms with suffixes
    forms = forms_for(root_entry, prefixes, suffixes, similar_to: misspelling)

    forms.each do |form|
      score = rough_affix_score(misspelling, form.to_s.downcase)
      next unless score > threshold

      guess_scores << [score, form.to_s, form.to_s]
    end
  end

  # Limit to MAX_GUESSES and sort by score
  guesses = guess_scores.sort.reverse.first(MAX_GUESSES)

  # Stage 3: Calculate precise scores
  fact = maxdiff >= 0 ? (10.0 - maxdiff) / 5.0 : 1.0

  guesses2 = guesses.map do |score, compared, real|
    [precise_affix_score(misspelling, compared.to_s.downcase,
                        fact, base: score, has_phonetic: has_phonetic), real.to_s]
  end.sort.reverse

  # Stage 4: Filter and yield suggestions
  filter_guesses(guesses2, known: known, onlymaxdiff: onlymaxdiff, &block)
end

Module: Kotoshu::Algorithms::NgramSuggest

Overview

Constant Summary collapse

Class Method Summary collapse

Class Method Details

.detect_threshold(word) ⇒ Float

.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ Object

.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ Array<String>

.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ Float

.root_score(word1, word2) ⇒ Float

.rough_affix_score(word1, word2) ⇒ Float

.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ Object

.detect_threshold(word) ⇒ `Float`

.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ `Object`

.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ `Array<String>`

.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ `Float`

.root_score(word1, word2) ⇒ `Float`

.rough_affix_score(word1, word2) ⇒ `Float`

.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ `Object`