Class: Kotoshu::Suggestions::Strategies::SymSpellStrategy

Inherits:
BaseStrategy
  • Object
show all
Defined in:
lib/kotoshu/suggestions/strategies/symspell_strategy.rb

Overview

SymSpell suggestion strategy.

Uses deletion distance algorithm for fast approximate string matching. Pre-computes deletion variants for all dictionary words, enabling O(1) lookup for common misspellings.

This is 10-100x faster than EditDistanceStrategy for large dictionaries.

The algorithm works by:

  1. Pre-computing single deletion variants for each dictionary word

  2. Looking up input word’s deletion variants in the pre-computed map

  3. Distance is inferred from the deletion level

Constant Summary collapse

DEFAULT_MAX_DELETION_DISTANCE =

Maximum deletion distance to consider

2
DEFAULT_MAX_DICTIONARY_SIZE =

Maximum dictionary words to process (increased for better coverage)

500_000
DEFAULT_HANDLE_TRANSPOSITIONS =

Enable transposition handling (slower pre-computation, better accuracy)

true

Instance Attribute Summary

Attributes inherited from BaseStrategy

#config, #name

Instance Method Summary collapse

Methods inherited from BaseStrategy

#calculate_ngram_similarity, #create_suggestion, #create_suggestion_set, #enabled?, #generate_ngrams, #get_config, #handles?, #has_config?, #max_results, #priority, #to_s

Constructor Details

#initialize(dictionary:, name: :symspell, **config) ⇒ SymSpellStrategy

Create a new SymSpell strategy.

Parameters:

  • dictionary (Object)

    Dictionary to use for suggestions

  • name (String, Symbol) (defaults to: :symspell)

    Strategy name

  • config (Hash)

    Configuration options

Options Hash (**config):

  • max_deletion_distance (Integer)

    Maximum deletion distance (default: 2)

  • max_results (Integer)

    Maximum results to return (default: 10)

  • max_dictionary_size (Integer)

    Maximum words to process (default: 500_000)

  • handle_transpositions (Boolean)

    Generate transposition variants (default: true)



37
38
39
40
41
42
43
44
45
46
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 37

def initialize(dictionary:, name: :symspell, **config)
  super(name: name, **config)
  @dictionary = dictionary
  @max_deletion_distance = config.fetch(:max_deletion_distance, DEFAULT_MAX_DELETION_DISTANCE)
  @max_dictionary_size = config.fetch(:max_dictionary_size, DEFAULT_MAX_DICTIONARY_SIZE)
  @handle_transpositions = config.fetch(:handle_transpositions, DEFAULT_HANDLE_TRANSPOSITIONS)
  @deletes = Hash.new { |h, k| h[k] = [] } # deletion_variant -> [original_words]
  @words = Set.new
  precompute!
end

Instance Method Details

#deletion_distance(str1, str2) ⇒ Integer

Calculate deletion distance between two words.

For SymSpell, this is the length of their longest common subsequence based distance (minimum deletions to make them equal).

Parameters:

  • str1 (String)

    First word

  • str2 (String)

    Second word

Returns:

  • (Integer)

    Deletion distance



157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 157

def deletion_distance(str1, str2)
  return str2.length if str1.empty?
  return str1.length if str2.empty?
  return 0 if str1 == str2

  # Simple approach: find if one can be transformed to the other
  # by only deletions (check if str1 is subsequence of str2 or vice versa)
  if is_subsequence?(str1, str2)
    str2.length - str1.length
  elsif is_subsequence?(str2, str1)
    str1.length - str2.length
  else
    # Fallback to edit distance approximation
    # This shouldn't happen often with proper SymSpell usage
    lcs_len = longest_common_subsequence_length(str1, str2)
    str1.length + str2.length - 2 * lcs_len
  end
end

#generate(context) ⇒ SuggestionSet

Generate suggestions using deletion distance.

Parameters:

  • context (Context)

    The suggestion context

Returns:



52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 52

def generate(context)
  word = context.word
  max_dist = get_config(:max_deletion_distance, @max_deletion_distance)

  # Normalize to lowercase for case-insensitive matching
  word_lower = word.downcase

  # Check if word is in dictionary
  return SuggestionSet.empty if @words.include?(word_lower)

  # Collect candidates with their distances
  candidates = {}
  checked = Set.new([word_lower])

  # First, check if the input word is a deletion variant of any dictionary word
  @deletes[word_lower].each do |dict_word|
    candidates[dict_word] ||= 1
  end

  # If transpositions are enabled, check them too
  if @handle_transpositions
    generate_transpositions(word_lower).each do |transposed|
      @deletes[transposed].each do |dict_word|
        candidates[dict_word] ||= 1
      end
    end
  end

  # Generate deletion variants and check for matches
  max_dist.times do |dist|
    generate_deletions_from_set(checked).each do |variant|
      next if checked.include?(variant)

      checked.add(variant)

      # Check if variant is directly in dictionary
      candidates[variant] = dist + 1 if @words.include?(variant)

      # Check if variant maps to dictionary words
      @deletes[variant].each do |dict_word|
        # Distance = deletions from input + deletions from dict_word
        # Both reach the same variant
        candidates[dict_word] ||= dist + 2
      end
    end
  end

  # Sort by distance and create suggestions
  sorted_words = candidates.sort_by { |_, dist| dist }.map(&:first)
  create_suggestion_set(sorted_words, distances: candidates, original_word: context.word)
end

#generate_transpositions(word) ⇒ Array<String>

Generate all adjacent transposition variants of a word.

For example, “world” → [“owrld”, “wrold”, “wolrd”, “wordl”]

Parameters:

  • word (String)

    The word

Returns:

  • (Array<String>)

    Array of variants with adjacent characters swapped



137
138
139
140
141
142
143
144
145
146
147
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 137

def generate_transpositions(word)
  variants = []
  word.chars.each_with_index do |_, i|
    next if i == word.length - 1 # Can't swap last character

    variant = word.dup
    variant[i], variant[i + 1] = variant[i + 1], variant[i]
    variants << variant unless variant == word
  end
  variants
end

#precompute!Object

Pre-compute deletion variants for all dictionary words.

This is called during initialization and builds the index.



107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
# File 'lib/kotoshu/suggestions/strategies/symspell_strategy.rb', line 107

def precompute!
  words = dictionary_words(@dictionary)

  words.first(@max_dictionary_size).each do |word|
    next if word.nil? || word.empty?

    word_lower = word.downcase
    @words.add(word_lower)

    # Generate only single deletion variants for efficiency
    # Multiple deletions are handled during lookup
    generate_single_deletions(word_lower).each do |variant|
      @deletes[variant] << word_lower
    end

    # Generate transposition variants if enabled
    if @handle_transpositions
      generate_transpositions(word_lower).each do |variant|
        @deletes[variant] << word_lower
      end
    end
  end
end