Class: Kotoshu::Suggestions::Strategies::SymSpellStrategy
- Inherits:
-
BaseStrategy
- Object
- BaseStrategy
- Kotoshu::Suggestions::Strategies::SymSpellStrategy
- Defined in:
- lib/kotoshu/suggestions/strategies/symspell_strategy.rb
Overview
SymSpell suggestion strategy.
Uses deletion distance algorithm for fast approximate string matching. Pre-computes deletion variants for all dictionary words, enabling O(1) lookup for common misspellings.
This is 10-100x faster than EditDistanceStrategy for large dictionaries.
The algorithm works by:
-
Pre-computing single deletion variants for each dictionary word
-
Looking up input word’s deletion variants in the pre-computed map
-
Distance is inferred from the deletion level
Constant Summary collapse
- DEFAULT_MAX_DELETION_DISTANCE =
Maximum deletion distance to consider
2- DEFAULT_MAX_DICTIONARY_SIZE =
Maximum dictionary words to process (increased for better coverage)
500_000- DEFAULT_HANDLE_TRANSPOSITIONS =
Enable transposition handling (slower pre-computation, better accuracy)
true
Instance Attribute Summary
Attributes inherited from BaseStrategy
Instance Method Summary collapse
-
#deletion_distance(str1, str2) ⇒ Integer
Calculate deletion distance between two words.
-
#generate(context) ⇒ SuggestionSet
Generate suggestions using deletion distance.
-
#generate_transpositions(word) ⇒ Array<String>
Generate all adjacent transposition variants of a word.
-
#initialize(dictionary:, name: :symspell, **config) ⇒ SymSpellStrategy
constructor
Create a new SymSpell strategy.
-
#precompute! ⇒ Object
Pre-compute deletion variants for all dictionary words.
Methods inherited from BaseStrategy
#calculate_ngram_similarity, #create_suggestion, #create_suggestion_set, #enabled?, #generate_ngrams, #get_config, #handles?, #has_config?, #max_results, #priority, #to_s
Constructor Details
#initialize(dictionary:, name: :symspell, **config) ⇒ SymSpellStrategy
Create a new SymSpell strategy.
37 38 39 40 41 42 43 44 45 46 |
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 37 def initialize(dictionary:, name: :symspell, **config) super(name: name, **config) @dictionary = dictionary @max_deletion_distance = config.fetch(:max_deletion_distance, DEFAULT_MAX_DELETION_DISTANCE) @max_dictionary_size = config.fetch(:max_dictionary_size, DEFAULT_MAX_DICTIONARY_SIZE) @handle_transpositions = config.fetch(:handle_transpositions, DEFAULT_HANDLE_TRANSPOSITIONS) @deletes = Hash.new { |h, k| h[k] = [] } # deletion_variant -> [original_words] @words = Set.new precompute! end |
Instance Method Details
#deletion_distance(str1, str2) ⇒ Integer
Calculate deletion distance between two words.
For SymSpell, this is the length of their longest common subsequence based distance (minimum deletions to make them equal).
157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 |
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 157 def deletion_distance(str1, str2) return str2.length if str1.empty? return str1.length if str2.empty? return 0 if str1 == str2 # Simple approach: find if one can be transformed to the other # by only deletions (check if str1 is subsequence of str2 or vice versa) if is_subsequence?(str1, str2) str2.length - str1.length elsif is_subsequence?(str2, str1) str1.length - str2.length else # Fallback to edit distance approximation # This shouldn't happen often with proper SymSpell usage lcs_len = longest_common_subsequence_length(str1, str2) str1.length + str2.length - 2 * lcs_len end end |
#generate(context) ⇒ SuggestionSet
Generate suggestions using deletion distance.
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 52 def generate(context) word = context.word max_dist = get_config(:max_deletion_distance, @max_deletion_distance) # Normalize to lowercase for case-insensitive matching word_lower = word.downcase # Check if word is in dictionary return SuggestionSet.empty if @words.include?(word_lower) # Collect candidates with their distances candidates = {} checked = Set.new([word_lower]) # First, check if the input word is a deletion variant of any dictionary word @deletes[word_lower].each do |dict_word| candidates[dict_word] ||= 1 end # If transpositions are enabled, check them too if @handle_transpositions generate_transpositions(word_lower).each do |transposed| @deletes[transposed].each do |dict_word| candidates[dict_word] ||= 1 end end end # Generate deletion variants and check for matches max_dist.times do |dist| generate_deletions_from_set(checked).each do |variant| next if checked.include?(variant) checked.add(variant) # Check if variant is directly in dictionary candidates[variant] = dist + 1 if @words.include?(variant) # Check if variant maps to dictionary words @deletes[variant].each do |dict_word| # Distance = deletions from input + deletions from dict_word # Both reach the same variant candidates[dict_word] ||= dist + 2 end end end # Sort by distance and create suggestions sorted_words = candidates.sort_by { |_, dist| dist }.map(&:first) create_suggestion_set(sorted_words, distances: candidates, original_word: context.word) end |
#generate_transpositions(word) ⇒ Array<String>
Generate all adjacent transposition variants of a word.
For example, “world” → [“owrld”, “wrold”, “wolrd”, “wordl”]
137 138 139 140 141 142 143 144 145 146 147 |
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 137 def generate_transpositions(word) variants = [] word.chars.each_with_index do |_, i| next if i == word.length - 1 # Can't swap last character variant = word.dup variant[i], variant[i + 1] = variant[i + 1], variant[i] variants << variant unless variant == word end variants end |
#precompute! ⇒ Object
Pre-compute deletion variants for all dictionary words.
This is called during initialization and builds the index.
107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 |
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 107 def precompute! words = dictionary_words(@dictionary) words.first(@max_dictionary_size).each do |word| next if word.nil? || word.empty? word_lower = word.downcase @words.add(word_lower) # Generate only single deletion variants for efficiency # Multiple deletions are handled during lookup generate_single_deletions(word_lower).each do |variant| @deletes[variant] << word_lower end # Generate transposition variants if enabled if @handle_transpositions generate_transpositions(word_lower).each do |variant| @deletes[variant] << word_lower end end end end |