Module: Kotoshu::Algorithms::PhonetSuggest

Defined in:
lib/kotoshu/algorithms/phonet_suggest.rb

Overview

The phonetic suggestion algorithm provides suggestions based on phonetic (pronunciation) similarity.

Ported from Spylls (Python) phonet_suggest.py

Requires the .aff file to define a PHONE table (extremely rare in known dictionaries).

Internally:

  1. Selects words from dictionary similarly to ngram_suggest (and reuses its root_score)

  2. Scores their phonetic representations (calculated with metaphone) against the phonetic representation of the misspelling, using an n-gram comparison

  3. Re-ranks the best candidates by adding final_score, a simplistic string similarity metric

Constant Summary

MAX_ROOTS = 100

Class Method Summary

Class Method Details

.final_score(word1, word2) ⇒ Float

Calculate score of suggestion against misspelling.

Parameters:

  • word1 (String)

    Misspelling

  • word2 (String)

    Candidate suggestion

Returns:

  • (Float)

    Final score



# File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 95

def final_score(word1, word2)
  2 * StringMetrics.lcslen(word1, word2) -
    (word1.length - word2.length).abs +
    StringMetrics.leftcommonsubstring(word1, word2)
end
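
To make the arithmetic concrete, here is a minimal sketch with stand-in helpers. `lcs_length` and `common_prefix_length` are hypothetical substitutes for StringMetrics.lcslen and StringMetrics.leftcommonsubstring. They assume those helpers return the longest-common-subsequence length and the length of the shared prefix, which may not match the real implementations exactly.

```ruby
# Hypothetical stand-in for StringMetrics.lcslen (assumed semantics):
# length of the longest common subsequence, by dynamic programming.
def lcs_length(a, b)
  prev = Array.new(b.length + 1, 0)
  a.each_char do |ca|
    curr = [0]
    b.each_char.with_index do |cb, j|
      curr << (ca == cb ? prev[j] + 1 : [prev[j + 1], curr[j]].max)
    end
    prev = curr
  end
  prev.last
end

# Hypothetical stand-in for StringMetrics.leftcommonsubstring (assumed
# semantics): number of leading characters the two strings share.
def common_prefix_length(a, b)
  a.chars.zip(b.chars).take_while { |x, y| x == y }.length
end

# Same formula as final_score, built on the stand-ins above.
def final_score_sketch(word1, word2)
  2 * lcs_length(word1, word2) -
    (word1.length - word2.length).abs +
    common_prefix_length(word1, word2)
end

final_score_sketch('fonetic', 'phonetic') # => 11  (2 * 6 - 1 + 0)
final_score_sketch('cat', 'cats')         # => 8   (2 * 3 - 1 + 3)
```

Under these assumptions, a shared prefix is rewarded on top of overall overlap, while a length difference is penalized.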

.match_rule(rule, word, pos) ⇒ Integer?

Check if a rule matches at the given position.

Parameters:

  • rule (Hash)

    Rule hash with :search (Regexp), :start, :end

  • word (String)

    The word to match against

  • pos (Integer)

    Position in word

Returns:

  • (Integer, nil)

    Length of match, or nil if no match



# File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 147

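def match_rule(rule, word, pos)
  # A start-anchored rule applies only at the beginning of the word
  return nil if rule[:start] && pos > 0

  # Regexp#match(str, pos) searches forward from pos, so accept the
  # match only if it begins exactly at pos
  match_data = rule[:search].match(word, pos)
  return nil unless match_data && match_data.begin(0) == pos

  # An end-anchored rule must consume the rest of the word
  return nil if rule[:end] && match_data.end(0) != word.length

  match_data[0].length
end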
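
One detail worth keeping in mind when reading this method: Ruby's Regexp#match(str, pos) starts searching at pos but may find its match later in the string. The short sketch below illustrates that behavior on a made-up pattern. `match_length_at` is a hypothetical helper, not part of Kotoshu.

```ruby
search = /PH/
word = 'ALPHA'

# The search starts at index 0, but the match is found at index 2.
m = search.match(word, 0)
m.begin(0)   # => 2
m[0].length  # => 2

# Hypothetical helper: report a length only when the match begins at pos.
match_length_at = lambda do |regexp, string, pos|
  md = regexp.match(string, pos)
  md && md.begin(0) == pos ? md[0].length : nil
end

match_length_at.call(search, word, 0) # => nil
match_length_at.call(search, word, 2) # => 2
```

Checking MatchData#begin is one way to tell "matches at this position" apart from "matches somewhere after this position".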

.metaphone(table, word) ⇒ String

Metaphone calculation.

Kotoshu currently implements this naively, as a plain search-and-replace over the rules. Characters for which no rule matches contribute nothing to the result. For what a fuller implementation should do, see aspell's original description: aspell.net/man-html/Phonetic-Code.html

Parameters:

  • table (Hash)

    Phone table with :rules hash

  • word (String)

    Word to calculate metaphone for

Returns:

  • (String)

    Metaphone representation



# File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 111

def metaphone(table, word)
  return word if table.nil? || table.empty?

  rules = table[:rules] || {}
  pos = 0
  word_upper = word.upcase
  result = +''

  while pos < word_upper.length
    char = word_upper[pos]
    matched = false

    # Get rules for this character
    char_rules = rules[char] || []
    char_rules.each do |rule|
      match_result = match_rule(rule, word_upper, pos)
      next unless match_result

      result += rule[:replacement]
      pos += match_result
      matched = true
      break
    end

    pos += 1 unless matched
  end

  result
end
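
The loop can be tried in isolation with a self-contained sketch on a toy table. The table contents and `toy_metaphone` are illustrative only: the rules are invented for the example and do not come from a real .aff file, and the rule matching is inlined rather than delegated to match_rule.

```ruby
# Toy table in the documented shape: first character => array of rule hashes.
toy_table = {
  rules: {
    'P' => [{ search: /PH/, replacement: 'F', start: false, end: false }],
    'C' => [{ search: /C/,  replacement: 'K', start: false, end: false }],
    'K' => [{ search: /KN/, replacement: 'N', start: true,  end: false }]
  }
}

def toy_metaphone(table, word)
  rules = table[:rules]
  upper = word.upcase
  result = +''
  pos = 0
  while pos < upper.length
    hit = nil
    (rules[upper[pos]] || []).each do |rule|
      next if rule[:start] && pos > 0
      m = rule[:search].match(upper, pos)
      # Accept only non-empty matches that begin exactly at pos.
      next unless m && m.begin(0) == pos && !m[0].empty?
      hit = [rule[:replacement], m[0].length]
      break
    end
    if hit
      result << hit[0]
      pos += hit[1]
    else
      pos += 1 # no rule applies: the character is skipped
    end
  end
  result
end

toy_metaphone(toy_table, 'phonic') # => "FK"
toy_metaphone(toy_table, 'knack')  # => "NK"
```

Because unmatched characters are skipped, the output of this sketch contains only replacement strings. With such a sparse table, most letters disappear from the result.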

.suggest(misspelling, dictionary_words:, table:) {|String| ... } ⇒ Object

Main entry point for phonetic suggestions.

Note that both this method and NgramSuggest.suggest iterate through the whole dictionary. Hunspell optimizes by doing both in a single loop; Spylls and Kotoshu keep them separate for clarity.

The table structure should have:

  • :rules => Hash mapping the first character to an array of rule hashes. Each rule has:

    :search (Regexp), :replacement (String), :start (Boolean), :end (Boolean)
    

Parameters:

  • misspelling (String)

    The misspelled word

  • dictionary_words (Array<Hash>)

    Dictionary entries with stem and flags

  • table (Hash)

    Phone table with :rules hash mapping first char to rule list

Yields:

  • (String)

    Each suggestion



# File 'lib/kotoshu/algorithms/phonet_suggest.rb', line 37

def suggest(misspelling, dictionary_words:, table:, &block)
  misspelling_lower = misspelling.downcase
  misspelling_ph = metaphone(table, misspelling_lower)

  scores = []

  # First, select words from dictionary whose stems are similar to misspelling
  # This cycle is exactly the same as the first cycle in ngram_suggest
  dictionary_words.each do |word|
    stem = word[:stem] || word

    # Skip words with length difference > 3
    next if (stem.length - misspelling.length).abs > 3

    # First, calculate "regular" similarity score, just like in ngram_suggest
    nscore = NgramSuggest.root_score(misspelling_lower, stem)

    # Check alternative spellings if available
    if word[:alt_spellings]
      word[:alt_spellings].each do |variant|
        nscore = [nscore, NgramSuggest.root_score(misspelling_lower, variant)].max
      end
    end

    next if nscore <= 2

    # Calculate metaphone score
    word_ph = metaphone(table, stem.downcase)
    score = 2 * StringMetrics.ngram(3, misspelling_ph, word_ph, longer_worse: true)

    # Keep only the MAX_ROOTS best results, evicting the current worst when full
    if scores.size < MAX_ROOTS
      scores << [score, stem]
    else
      worst_index = scores.each_index.min_by { |i| scores[i][0] }
      scores[worst_index] = [score, stem] if scores[worst_index][0] < score
    end
  end

  # Sort by score descending
  guesses = scores.sort.reverse

  # Finally, sort suggestions by simplistic string similarity metric
  guesses2 = guesses.map do |score, word|
    final_scr = final_score(misspelling_lower, word.downcase)
    [score + final_scr, word]
  end.sort.reverse

  guesses2.each do |_, sug|
    yield sug
  end
end
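
The bounded selection inside the main loop can be explored on its own. The following is a minimal sketch on made-up scores: `keep_best` is a hypothetical helper, and a linear scan for the minimum stands in for a real heap, which is reasonable only because the limit is small.

```ruby
# Keep the `limit` highest-scoring [score, word] pairs from a sequence,
# then return them best-first.
def keep_best(pairs, limit)
  best = []
  pairs.each do |pair|
    if best.size < limit
      best << pair
    else
      # Find the weakest kept entry and replace it if the newcomer is better.
      worst = best.each_index.min_by { |i| best[i][0] }
      best[worst] = pair if best[worst][0] < pair[0]
    end
  end
  best.sort.reverse
end

keep_best([[3, 'cat'], [9, 'cot'], [1, 'cut'], [7, 'coat']], 2)
# => [[9, "cot"], [7, "coat"]]
```

Each pair is an Array, so sort compares scores first and falls back to the word on ties.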