Module: Kotoshu::Algorithms::NgramSuggest
- Defined in:
- lib/kotoshu/algorithms/ngram_suggest.rb
Overview
N-gram based suggestion algorithm.
Ported from Spylls (Python) ngram_suggest.py
This is the core Hunspell suggestion algorithm that uses n-gram similarity to rank and filter spelling corrections.
The algorithm works in three stages:
-
root_score: Quick n-gram score + left common substring
-
rough_affix_score: Affixed form n-gram score
-
precise_affix_score: Full scoring with LCS, bigrams, etc.
Constant Summary collapse
- MAX_ROOTS =
Maximum number of root words to consider in first pass
100- MAX_GUESSES =
Maximum number of suggestions to generate
200
Class Method Summary collapse
-
.detect_threshold(word) ⇒ Float
Calculate minimum threshold for passable suggestions.
-
.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ Object
Filter guesses by score into quality buckets.
-
.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ Array<String>
Generate all possible affixed forms for a dictionary word.
-
.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ Float
Stage 3 scoring: Full precise scoring.
-
.root_score(word1, word2) ⇒ Float
Stage 1 scoring: 3-gram score + left common substring.
-
.rough_affix_score(word1, word2) ⇒ Float
Stage 2 scoring: N-gram score with n=len(word1) + left common substring.
-
.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ Object
Main entry point for n-gram based suggestions.
Class Method Details
.detect_threshold(word) ⇒ Float
Calculate minimum threshold for passable suggestions.
Mangles the word in 3 different ways (replacing each 4th char with ‘*’) and scores them to generate a minimum acceptable score.
191 192 193 194 195 196 197 198 199 200 201 202 203 204 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 191 def detect_threshold(word) thresh = 0.0 (1..3).each do |start_pos| mangled = word.chars.map.with_index do |char, pos| ((pos - start_pos) % 4).zero? && pos >= start_pos ? "*" : char end.join thresh += StringMetrics.ngram(word.length, word, mangled, any_mismatch: true) end # Take average of the three scores and subtract 1 (thresh / 3.0) - 1 end |
.filter_guesses(guesses, known:, onlymaxdiff: true) {|String| ... } ⇒ Object
Filter guesses by score into quality buckets.
Score buckets:
-
> 1000: Very good (same word, different casing)
-
1000 to -100: Normal suggestions
-
< -100: Questionable (too different)
Stops yielding when:
-
A very good suggestion was found and then a normal one
-
A questionable suggestion was found (only yields one)
242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 242 def filter_guesses(guesses, known:, onlymaxdiff: true) seen = false found = 0 guesses.each do |score, value| # Stop if we saw very good and now have normal suggestions return if seen && score <= 1000 if score > 1000 # Very good suggestion - set flag to only accept other very good ones seen = true elsif score < -100 # Questionable suggestion # Stop if we already found good ones, or if we're excluding questionable return if found.positive? || onlymaxdiff seen = true end # Skip if this word was already suggested next if known.any? { |known_word| value.include?(known_word) } found += 1 yield value end end |
.forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) ⇒ Array<String>
Generate all possible affixed forms for a dictionary word.
213 214 215 216 217 218 219 220 221 222 223 224 225 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 213 def forms_for(word_entry, all_prefixes, all_suffixes, similar_to:) stem = word_entry[:stem] || word_entry flags = word_entry[:flags] || [] # Base form without affixes res = [stem] # Generate suffix forms # Simplified: just return base form for now # Full implementation would parse affix flags and apply them res end |
.precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) ⇒ Float
Stage 3 scoring: Full precise scoring.
Returns one of three “score groups”:
-
> 1000: Very good (same word, different casing)
-
< -100: Questionable (too different)
-
-100 to 1000: Normal suggestion
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 141 def precise_affix_score(word1, word2, diff_factor, base:, has_phonetic: false) # Use lowercase for LCS to catch case-only differences word1_lower = word1.downcase word2_lower = word2.downcase lcs = StringMetrics.lcslen(word1_lower, word2_lower) # Same characters with different casing -- "very good" suggestion if word1.length == word2.length && word1.length == lcs return base + 2000 end # Score is: 2 * LCS - length difference result = 2 * lcs - (word1.length - word2.length).abs # Add common start substring length result += StringMetrics.leftcommonsubstring(word1_lower, word2_lower) # Add 1 if any characters match at same positions result += 1 if StringMetrics.commoncharacters(word1_lower, word2_lower) > 0 # Add regular 4-gram score result += StringMetrics.ngram(4, word1_lower, word2_lower, any_mismatch: true) # Add weighted bigrams (both directions) bigrams = ( StringMetrics.ngram(2, word1_lower, word2_lower, any_mismatch: true, weighted: true) + StringMetrics.ngram(2, word2_lower, word1_lower, any_mismatch: true, weighted: true) ) result += bigrams # Apply "questionable" threshold based on diff_factor and has_phonetic questionable_limit = if has_phonetic word2.length * diff_factor else (word1.length + word2.length) * diff_factor end result -= 1000 if bigrams < questionable_limit result end |
.root_score(word1, word2) ⇒ Float
Stage 1 scoring: 3-gram score + left common substring.
107 108 109 110 111 112 113 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 107 def root_score(word1, word2) # Use lowercase for comparison as per Hunspell word2_lower = word2.downcase StringMetrics.ngram(3, word1, word2_lower, longer_worse: true) + StringMetrics.leftcommonsubstring(word1, word2_lower).to_f end |
.rough_affix_score(word1, word2) ⇒ Float
Stage 2 scoring: N-gram score with n=len(word1) + left common substring.
120 121 122 123 124 125 126 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 120 def rough_affix_score(word1, word2) # Use lowercase for comparison as per Hunspell word2_lower = word2.downcase StringMetrics.ngram(word1.length, word1, word2_lower, any_mismatch: true) + StringMetrics.leftcommonsubstring(word1, word2_lower).to_f end |
.suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false) {|String| ... } ⇒ Object
Main entry point for n-gram based suggestions.
This is a simplified version that works with basic dictionary structures. Full implementation would need affix flag parsing and Word model objects.
38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 |
# File 'lib/kotoshu/algorithms/ngram_suggest.rb', line 38 def suggest(misspelling, dictionary_words:, prefixes: {}, suffixes: {}, known: Set.new, maxdiff: 2, onlymaxdiff: true, has_phonetic: false, &block) # Stage 1: Find best root candidates by n-gram score root_scores = [] dictionary_words.each do |word_entry| stem = word_entry[:stem] || word_entry # Skip words with length difference > 4 next if (stem.length - misspelling.length).abs > 4 score = root_score(misspelling, stem) # Use heap to keep only MAX_ROOTS best results if root_scores.size >= MAX_ROOTS # Keep only the best scores root_scores = root_scores.sort.reverse.first(MAX_ROOTS) end root_scores << [score, word_entry] if score > 0 end # Stage 2: Generate affixed forms and score them threshold = detect_threshold(misspelling) guess_scores = [] # Sort by score descending root_scores.sort_by { |score, _| -score }.first(MAX_ROOTS).each do |(_, root_entry)| root = root_entry[:stem] || root_entry # Generate forms with suffixes forms = forms_for(root_entry, prefixes, suffixes, similar_to: misspelling) forms.each do |form| score = rough_affix_score(misspelling, form.to_s.downcase) next unless score > threshold guess_scores << [score, form.to_s, form.to_s] end end # Limit to MAX_GUESSES and sort by score guesses = guess_scores.sort.reverse.first(MAX_GUESSES) # Stage 3: Calculate precise scores fact = maxdiff >= 0 ? (10.0 - maxdiff) / 5.0 : 1.0 guesses2 = guesses.map do |score, compared, real| [precise_affix_score(misspelling, compared.to_s.downcase, fact, base: score, has_phonetic: has_phonetic), real.to_s] end.sort.reverse # Stage 4: Filter and yield suggestions filter_guesses(guesses2, known: known, onlymaxdiff: onlymaxdiff, &block) end |