Module: Kotoshu::Algorithms::PhonetSuggest
- Defined in:
- lib/kotoshu/algorithms/phonet_suggest.rb
Overview
The phonetic suggestion algorithm provides suggestions based on phonetic (pronunciation) similarity.
Ported from Spylls (Python) phonet_suggest.py.
Requires the .aff file to define a PHONE table (extremely rare in known dictionaries).
Internally:
- Selects words from the dictionary similarly to ngram_suggest (and reuses its root_score)
- Scores their phonetic representations (calculated with metaphone) against the phonetic representation of the misspelling
- Chooses the most similar ones with final_score (a string-similarity comparison built on lcslen and leftcommonsubstring)
Constant Summary
- MAX_ROOTS = 100
Class Method Summary
- .final_score(word1, word2) ⇒ Float
  Calculate score of suggestion against misspelling.
- .match_rule(rule, word, pos) ⇒ Integer?
  Check if a rule matches at the given position.
- .metaphone(table, word) ⇒ String
  Metaphone calculation.
- .suggest(misspelling, dictionary_words:, table:) {|String| ... } ⇒ Object
  Main entry point for phonetic suggestions.
Class Method Details
.final_score(word1, word2) ⇒ Float
Calculate score of suggestion against misspelling.
    # File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 95
    def final_score(word1, word2)
      2 * StringMetrics.lcslen(word1, word2) -
        (word1.length - word2.length).abs +
        StringMetrics.leftcommonsubstring(word1, word2)
    end
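To make the formula concrete, here is a minimal sketch with stand-ins for the two StringMetrics helpers. It assumes that lcslen returns the length of the longest common subsequence and that leftcommonsubstring returns the length of the shared prefix; StringMetricsSketch and final_score_sketch are hypothetical names, not part of Kotoshu.

```ruby
# Hypothetical stand-ins for Kotoshu's StringMetrics helpers (behavior assumed).
module StringMetricsSketch
  module_function

  # Length of the longest common subsequence, computed row by row.
  def lcslen(a, b)
    prev = Array.new(b.length + 1, 0)
    a.each_char do |ca|
      curr = [0]
      b.each_char.with_index(1) do |cb, j|
        curr << (ca == cb ? prev[j - 1] + 1 : [prev[j], curr[j - 1]].max)
      end
      prev = curr
    end
    prev.last
  end

  # Number of leading characters the two strings share.
  def leftcommonsubstring(a, b)
    a.chars.zip(b.chars).take_while { |x, y| x == y }.length
  end
end

# Same arithmetic as final_score, using the stand-ins above.
def final_score_sketch(word1, word2)
  2 * StringMetricsSketch.lcslen(word1, word2) -
    (word1.length - word2.length).abs +
    StringMetricsSketch.leftcommonsubstring(word1, word2)
end

puts final_score_sketch('fone', 'phone')  # 2 * 3 - 1 + 0
puts final_score_sketch('phone', 'phone') # 2 * 5 - 0 + 5
```

Under these assumptions, a longer shared subsequence and a longer shared prefix raise the score, while a length difference lowers it.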
.match_rule(rule, word, pos) ⇒ Integer?
Check if a rule matches at the given position.
    # File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 147
    def match_rule(rule, word, pos)
      # Check start constraint
      return nil if rule[:start] && pos > 0

      # Try to match
      match_data = if rule[:end]
                     # Full match from position
                     rule[:search].match(word[pos..])
                   else
                     # Regular match from position
                     rule[:search].match(word, pos)
                   end

      return nil unless match_data

      match_data.to_s.length
    end
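The following sketch exercises the same logic on a few invented rules. As written, the method does not anchor the pattern itself, so whether a match must begin exactly at pos depends on the rule's Regexp. The rules below use \G (which pins a pattern to the position where matching starts) as an assumption about how rules might be compiled; the rule contents are hypothetical.

```ruby
# Copy of the documented logic so the example runs on its own.
def match_rule(rule, word, pos)
  return nil if rule[:start] && pos > 0

  match_data =
    if rule[:end]
      rule[:search].match(word[pos..])
    else
      rule[:search].match(word, pos)
    end
  return nil unless match_data

  match_data.to_s.length
end

# Hypothetical rules, invented for illustration.
ph_rule    = { search: /\GPH/,   replacement: 'F', start: false, end: false }
start_rule = { search: /\GKN/,   replacement: 'N', start: true,  end: false }
end_rule   = { search: /\GE\z/,  replacement: '',  start: false, end: true }

p match_rule(ph_rule, 'PHONE', 0)      # "PH" found at position 0
p match_rule(ph_rule, 'PHONE', 1)      # no "PH" at position 1
p match_rule(start_rule, 'UNKNOWN', 2) # start-only rule rejected because pos > 0
p match_rule(end_rule, 'PHONE', 4)     # "E" is the final character
```

The return value is the number of characters consumed, which the caller uses to advance its position; nil signals that the rule does not apply.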
.metaphone(table, word) ⇒ String
Metaphone calculation.
Production in Kotoshu is currently implemented naively as just “search and replace” for rules. To see what potentially should be done, look at aspell’s original description: aspell.net/man-html/Phonetic-Code.html
    # File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 111
    def metaphone(table, word)
      return word if table.nil? || table.empty?

      rules = table[:rules] || {}
      pos = 0
      word_upper = word.upcase
      result = +''

      while pos < word_upper.length
        char = word_upper[pos]
        matched = false

        # Get rules for this character
        char_rules = rules[char] || []

        char_rules.each do |rule|
          match_result = match_rule(rule, word_upper, pos)
          next unless match_result

          result += rule[:replacement]
          pos += match_result
          matched = true
          break
        end

        pos += 1 unless matched
      end

      result
    end
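A compact, self-contained sketch of the same search-and-replace loop may help show what the listing does with a small table. PhonetSketch and the rules below are hypothetical, and the \G anchors are an assumption about how rule patterns are compiled. Note that, as in the listing, characters with no matching rule contribute nothing to the result.

```ruby
# Condensed re-statement of the documented logic, for illustration only.
module PhonetSketch
  module_function

  def match_rule(rule, word, pos)
    return nil if rule[:start] && pos > 0

    match_data = rule[:end] ? rule[:search].match(word[pos..]) : rule[:search].match(word, pos)
    match_data && match_data.to_s.length
  end

  def metaphone(table, word)
    return word if table.nil? || table.empty?

    rules = table[:rules] || {}
    upper = word.upcase
    result = +''
    pos = 0
    while pos < upper.length
      consumed = nil
      (rules[upper[pos]] || []).each do |rule|
        consumed = match_rule(rule, upper, pos)
        next unless consumed

        result << rule[:replacement]
        break
      end
      # Advance past the match, or skip one character when no rule applied.
      pos += consumed || 1
    end
    result
  end
end

# Invented rules: "PH" sounds like "F"; vowels have no rules and are dropped.
table = {
  rules: {
    'P' => [{ search: /\GPH/, replacement: 'F' }, { search: /\GP/, replacement: 'P' }],
    'F' => [{ search: /\GF/, replacement: 'F' }],
    'N' => [{ search: /\GN/, replacement: 'N' }]
  }
}

puts PhonetSketch.metaphone(table, 'phone')
puts PhonetSketch.metaphone(table, 'fone')
```

With this table both spellings reduce to the same code, which is what lets the phonetic comparison rank "phone" highly for the misspelling "fone".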
.suggest(misspelling, dictionary_words:, table:) {|String| ... } ⇒ Object
Main entry point for phonetic suggestions.
Note that both this method and NgramSuggest.suggest iterate through the whole dictionary. Hunspell optimizes by doing it all in one loop; Spylls (and Kotoshu) split them for clarity.
The table structure should have:
- :rules => Hash mapping each first character to an array of rule hashes. Each rule has :search (Regexp), :replacement (String), :start (Boolean), and :end (Boolean).
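As an illustration of that shape, the sketch below builds a small table by hand and checks it against the four documented keys. The rules are invented, the \G anchors are an assumption about how patterns are compiled, and well_formed_table? is a hypothetical helper, not a Kotoshu method.

```ruby
# Hypothetical table matching the documented structure.
table = {
  rules: {
    'P' => [
      { search: /\GPH/, replacement: 'F', start: false, end: false },
      { search: /\GP/,  replacement: 'P', start: false, end: false }
    ],
    'K' => [
      { search: /\GKN/, replacement: 'N', start: true, end: false }
    ],
    'E' => [
      { search: /\GE\z/, replacement: '', start: false, end: true }
    ]
  }
}

REQUIRED_RULE_KEYS = %i[search replacement start end].freeze

# True when every rule carries the four documented keys with the documented types.
def well_formed_table?(table)
  table[:rules].all? do |first_char, rules|
    first_char.is_a?(String) && rules.all? do |rule|
      REQUIRED_RULE_KEYS.all? { |key| rule.key?(key) } &&
        rule[:search].is_a?(Regexp) &&
        rule[:replacement].is_a?(String) &&
        [true, false].include?(rule[:start]) &&
        [true, false].include?(rule[:end])
    end
  end
end

puts well_formed_table?(table)
```

Keying the hash by first character means only rules that could start at the current character need to be tried.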
    # File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 37
    def suggest(misspelling, dictionary_words:, table:, &block)
      misspelling_lower = misspelling.downcase
      misspelling_ph = metaphone(table, misspelling_lower)

      scores = []

      # First, select words from dictionary whose stems are similar to misspelling
      # This cycle is exactly the same as the first cycle in ngram_suggest
      dictionary_words.each do |word|
        stem = word[:stem] || word

        # Skip words with length difference > 3
        next if (stem.length - misspelling.length).abs > 3

        # First, calculate "regular" similarity score, just like in ngram_suggest
        nscore = NgramSuggest.root_score(misspelling_lower, stem)

        # Check alternative spellings if available
        if word[:alt_spellings]
          word[:alt_spellings].each do |variant|
            nscore = [nscore, NgramSuggest.root_score(misspelling_lower, variant)].max
          end
        end

        next if nscore <= 2

        # Calculate metaphone score
        word_ph = metaphone(table, stem.downcase)
        score = 2 * StringMetrics.ngram(3, misspelling_ph, word_ph, longer_worse: true)

        # Use heap-like behavior: keep only MAX_ROOTS best results
        if scores.size >= MAX_ROOTS
          # Remove the worst score if we're at capacity
          scores.sort!.shift if scores.first && scores.first[0] < score
        end

        scores << [score, stem] if scores.size < MAX_ROOTS || scores.empty? || score > scores.first[0]
      end

      # Sort by score descending
      guesses = scores.sort.reverse

      # Finally, sort suggestions by simplistic string similarity metric
      guesses2 = guesses.map do |score, word|
        final_scr = final_score(misspelling_lower, word.downcase)
        [score + final_scr, word]
      end.sort.reverse

      guesses2.each do |_, sug|
        yield sug
      end
    end
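The "keep only MAX_ROOTS best results" step can be looked at in isolation. The sketch below is one possible way to express a bounded best-N selection, not the code Kotoshu runs: keep_best and MAX_ROOTS_SKETCH are hypothetical names, the scores are invented, and the worst entry is found with a linear scan rather than a heap.

```ruby
MAX_ROOTS_SKETCH = 3 # small cap for illustration; the module defines MAX_ROOTS = 100

# Keep at most `limit` [score, word] pairs, evicting the lowest score
# only when the incoming candidate beats it.
def keep_best(scores, candidate, limit)
  if scores.size < limit
    scores << candidate
  else
    worst = scores.min_by(&:first)
    if candidate[0] > worst[0]
      scores.delete_at(scores.index(worst))
      scores << candidate
    end
  end
  scores
end

scores = []
[[4, 'phone'], [1, 'bone'], [6, 'fond'], [2, 'font'], [5, 'foam']].each do |pair|
  keep_best(scores, pair, MAX_ROOTS_SKETCH)
end

# Highest score first, mirroring the descending sort before final scoring.
best_first = scores.sort_by { |score, _| -score }.map(&:last)
p best_first
```

For a cap of 100 a linear scan per candidate is cheap; a real priority queue would only matter for much larger caps.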