Module: Kotoshu::Algorithms::NgramSuggest

Defined in:
lib/kotoshu/algorithms/ngram_suggest.rb

Overview

N-gram based suggestion algorithm.

Ported from Spylls (Python) ngram_suggest.py

This is the core Hunspell suggestion algorithm that uses n-gram similarity to rank and filter spelling corrections.

The algorithm works in three stages:

  1. root_score: Quick n-gram score + left common substring

  2. rough_affix_score: Affixed form n-gram score

  3. precise_affix_score: Full scoring with LCS, bigrams, etc.

Constant Summary collapse

MAX_ROOTS =

Maximum number of root words to consider in first pass

100
MAX_GUESSES =

Maximum number of suggestions to generate

200

Class Method Summary collapse

Class Method Details

.detect_threshold(word) ⇒ Float

Calculate minimum threshold for passable suggestions.

Mangles the word in 3 different ways (replacing each 4th char with ‘*’) and scores them to generate a minimum acceptable score.

Parameters:

  • word (String)

    The misspelled word

Returns:

  • (Float)

    Minimum threshold score



191
192
193
194
195
196
197
198
199
200
201
202
203
204
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 191

def detect_threshold(word)
  thresh = 0.0

  (1..3).each do |start_pos|
    mangled = word.chars.map.with_index do |char, pos|
      ((pos - start_pos) % 4).zero? && pos >= start_pos ? "*" : char
    end.join

    thresh += StringMetrics.ngram(word.length, word, mangled, any_mismatch: true)
  end

  # Take average of the three scores and subtract 1
  (thresh / 3.0) - 1
end

.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ Object

Filter guesses by score into quality buckets.

Score buckets:

  • > 1000: Very good (same word, different casing)

  • 1000 to -100: Normal suggestions

  • < -100: Questionable (too different)

Stops yielding when:

  • A very good suggestion was found and then a normal one

  • A questionable suggestion was found (only yields one)

Parameters:

  • guesses (Array<Array>)

    Array of [score, value] pairs

  • known (Set<String>)

    Already suggested words

  • onlymaxdiff (Boolean) (defaults to: true)

    Whether to exclude questionable

Yields:

  • (String)

    Each filtered suggestion



242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 242

def filter_guesses(guesses, known:, onlymaxdiff: true)
  seen = false
  found = 0

  guesses.each do |score, value|
    # Stop if we saw very good and now have normal suggestions
    return if seen && score <= 1000

    if score > 1000
      # Very good suggestion - set flag to only accept other very good ones
      seen = true
    elsif score < -100
      # Questionable suggestion
      # Stop if we already found good ones, or if we're excluding questionable
      return if found.positive? || onlymaxdiff
      seen = true
    end

    # Skip if this word was already suggested
    next if known.any? { |known_word| value.include?(known_word) }

    found += 1
    yield value
  end
end

.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ Array<String>

Generate all possible affixed forms for a dictionary word.

Parameters:

  • word_entry (Hash)

    Dictionary word with stem and flags

  • all_prefixes (Hash)

    Available prefixes

  • all_suffixes (Hash)

    Available suffixes

  • similar_to (String)

    Original misspelling (for filtering)

Returns:

  • (Array<String>)

    Generated forms



213
214
215
216
217
218
219
220
221
222
223
224
225
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 213

def forms_for(word_entry, all_prefixes, all_suffixes, similar_to:)
  stem = word_entry[:stem] || word_entry
  flags = word_entry[:flags] || []

  # Base form without affixes
  res = [stem]

  # Generate suffix forms
  # Simplified: just return base form for now
  # Full implementation would parse affix flags and apply them

  res
end

.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ Float

Stage 3 scoring: Full precise scoring.

Returns one of three “score groups”:

  • > 1000: Very good (same word, different casing)

  • < -100: Questionable (too different)

  • -100 to 1000: Normal suggestion

Parameters:

  • word1 (String)

    Misspelled word

  • word2 (String)

    Possible suggestion

  • diff_factor (Float)

    Factor based on MAXDIFF (0-2)

  • base (Float)

    Base score from stage 2

  • has_phonetic (Boolean) (defaults to: false)

    Whether PHONE table exists

Returns:

  • (Float)

    Precise affix score



141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 141

def precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false)
  # Use lowercase for LCS to catch case-only differences
  word1_lower = word1.downcase
  word2_lower = word2.downcase

  lcs = StringMetrics.lcslen(word1_lower, word2_lower)

  # Same characters with different casing -- "very good" suggestion
  if word1.length == word2.length && word1.length == lcs
    return base + 2000
  end

  # Score is: 2 * LCS - length difference
  result = 2 * lcs - (word1.length - word2.length).abs

  # Add common start substring length
  result += StringMetrics.leftcommonsubstring(word1_lower, word2_lower)

  # Add 1 if any characters match at same positions
  result += 1 if StringMetrics.commoncharacters(word1_lower, word2_lower) > 0

  # Add regular 4-gram score
  result += StringMetrics.ngram(4, word1_lower, word2_lower, any_mismatch: true)

  # Add weighted bigrams (both directions)
  bigrams = (
    StringMetrics.ngram(2, word1_lower, word2_lower, any_mismatch: true, weighted: true) +
    StringMetrics.ngram(2, word2_lower, word1_lower, any_mismatch: true, weighted: true)
  )
  result += bigrams

  # Apply "questionable" threshold based on diff_factor and has_phonetic
  questionable_limit = if has_phonetic
                        word2.length * diff_factor
                      else
                        (word1.length + word2.length) * diff_factor
                      end

  result -= 1000 if bigrams < questionable_limit

  result
end

.root_score(word1, word2) ⇒ Float

Stage 1 scoring: 3-gram score + left common substring.

Parameters:

  • word1 (String)

    Misspelled word

  • word2 (String)

    Possible suggestion

Returns:

  • (Float)

    Root score



107
108
109
110
111
112
113
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 107

def root_score(word1, word2)
  # Use lowercase for comparison as per Hunspell
  word2_lower = word2.downcase

  StringMetrics.ngram(3, word1, word2_lower, longer_worse: true) +
    StringMetrics.leftcommonsubstring(word1, word2_lower).to_f
end

.rough_affix_score(word1, word2) ⇒ Float

Stage 2 scoring: N-gram score with n=len(word1) + left common substring.

Parameters:

  • word1 (String)

    Misspelled word

  • word2 (String)

    Possible suggestion

Returns:

  • (Float)

    Rough affix score



120
121
122
123
124
125
126
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 120

def rough_affix_score(word1, word2)
  # Use lowercase for comparison as per Hunspell
  word2_lower = word2.downcase

  StringMetrics.ngram(word1.length, word1, word2_lower, any_mismatch: true) +
    StringMetrics.leftcommonsubstring(word1, word2_lower).to_f
end

.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ Object

Main entry point for n-gram based suggestions.

This is a simplified version that works with basic dictionary structures. Full implementation would need affix flag parsing and Word model objects.

Parameters:

  • misspelling (String)

    The misspelled word

  • dictionary_words (Array<Hash>)

    Dictionary entries with stem and flags

  • prefixes (Hash) (defaults to: {})

    Prefix flags to prefix objects mapping

  • suffixes (Hash) (defaults to: {})

    Suffix flags to suffix objects mapping

  • known (Set<String>) (defaults to: Set.new)

    Already suggested words (to avoid duplicates)

  • maxdiff (Integer) (defaults to: 2)

    MAXDIFF value from aff file (0-10)

  • onlymaxdiff (Boolean) (defaults to: true)

    ONLYMAXDIFF flag

  • has_phonetic (Boolean) (defaults to: false)

    Whether PHONE table exists in aff file

Yields:

  • (String)

    Each suggestion



38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 38

def suggest(misspelling,
            dictionary_words:,
            prefixes: {},
            suffixes: {},
            known: Set.new,
            maxdiff: 2,
            onlymaxdiff: true,
            has_phonetic: false,
            &block)

  # Stage 1: Find best root candidates by n-gram score
  root_scores = []

  dictionary_words.each do |word_entry|
    stem = word_entry[:stem] || word_entry

    # Skip words with length difference > 4
    next if (stem.length - misspelling.length).abs > 4

    score = root_score(misspelling, stem)

    # Use heap to keep only MAX_ROOTS best results
    if root_scores.size >= MAX_ROOTS
      # Keep only the best scores
      root_scores = root_scores.sort.reverse.first(MAX_ROOTS)
    end

    root_scores << [score, word_entry] if score > 0
  end

  # Stage 2: Generate affixed forms and score them
  threshold = detect_threshold(misspelling)
  guess_scores = []

  # Sort by score descending
  root_scores.sort_by { |score, _| -score }.first(MAX_ROOTS).each do |(_, root_entry)|
    root = root_entry[:stem] || root_entry

    # Generate forms with suffixes
    forms = forms_for(root_entry, prefixes, suffixes, similar_to: misspelling)

    forms.each do |form|
      score = rough_affix_score(misspelling, form.to_s.downcase)
      next unless score > threshold

      guess_scores << [score, form.to_s, form.to_s]
    end
  end

  # Limit to MAX_GUESSES and sort by score
  guesses = guess_scores.sort.reverse.first(MAX_GUESSES)

  # Stage 3: Calculate precise scores
  fact = maxdiff >= 0 ? (10.0 - maxdiff) / 5.0 : 1.0

  guesses2 = guesses.map do |score, compared, real|
    [precise_affix_score(misspelling, compared.to_s.downcase,
                        fact, base: score, has_phonetic: has_phonetic), real.to_s]
  end.sort.reverse

  # Stage 4: Filter and yield suggestions
  filter_guesses(guesses2, known: known, onlymaxdiff: onlymaxdiff, &block)
end