Module: Kotoshu::StringMetrics

Defined in:
lib/kotoshu/string_metrics.rb

Overview

String similarity metrics for spell checking.

Ported from Spylls (Python) string_metrics.py

These metrics are used for:

  • Computing word similarity

  • Ranking suggestions

  • N-gram based scoring

Class Method Summary collapse

Class Method Details

.commoncharacters(s1, s2) ⇒ Integer

Number of occurrences of the exactly same characters in exactly same position.

Examples:

Kotoshu::StringMetrics.commoncharacters("hello", "hallo")  # => 4 ('h', 'l', 'l', 'o' match)

Parameters:

  • s1 (String)

    First string

  • s2 (String)

    Second string

Returns:

  • (Integer)

    Count of matching characters at same positions



21
22
23
24
25
26
27
28
# File 'lib/kotoshu/string_metrics.rb', line 21

def self.commoncharacters(s1, s2)
  return 0 if s1.nil? || s2.nil?

  # Zip strings and count matching character pairs
  [s1.length, s2.length].min.times.count do |i|
    s1[i] == s2[i]
  end
end

.lcslen(s1, s2) ⇒ Integer

Calculate LCS (Longest Common Subsequence) length.

Classic dynamic programming algorithm. This is different from longest common substring - subsequence doesn’t require contiguity.

Examples:

Kotoshu::StringMetrics.lcslen("AGGTAB", "GXTXAYB")  # => 4 ("GTAB")

Parameters:

  • s1 (String)

    First string

  • s2 (String)

    Second string

Returns:

  • (Integer)

    Length of longest common subsequence



126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
# File 'lib/kotoshu/string_metrics.rb', line 126

def self.lcslen(s1, s2)
  return 0 if s1.nil? || s2.nil? || s1.empty? || s2.empty?

  m = s1.length
  n = s2.length

  # Create DP table
  # Using a 2D array for clarity, though we could optimize space
  c = Array.new(m + 1) { Array.new(n + 1, 0) }

  (0...m).each do |i|
    (0...n).each do |j|
      if s1[i] == s2[j]
        # Characters match - extend diagonal
        c[i + 1][j + 1] = c[i][j] + 1
      elsif c[i][j + 1] >= c[i + 1][j]
        # Take max from top or left
        c[i + 1][j + 1] = c[i][j + 1]
      else
        c[i + 1][j + 1] = c[i + 1][j]
      end
    end
  end

  c[m][n]
end

.leftcommonsubstring(s1, s2) ⇒ Integer

Size of the common start of two strings.

Examples:

Kotoshu::StringMetrics.leftcommonsubstring("foo", "bar")       # => 0
Kotoshu::StringMetrics.leftcommonsubstring("built", "build")   # => 4
Kotoshu::StringMetrics.leftcommonsubstring("cat", "cats")      # => 3

Parameters:

  • s1 (String)

    First string

  • s2 (String)

    Second string

Returns:

  • (Integer)

    Length of common prefix



40
41
42
43
44
45
46
47
48
49
50
# File 'lib/kotoshu/string_metrics.rb', line 40

def self.leftcommonsubstring(s1, s2)
  return 0 if s1.nil? || s2.nil?

  # Find first position where characters differ
  s1.chars.zip(s2.chars).each_with_index do |(c1, c2), i|
    return i if c1 != c2
  end

  # All characters matched up to shorter string length
  [s1.length, s2.length].min
end

.ngram(max_ngram_size, s1, s2, weighted: false, any_mismatch: false, longer_worse: false) ⇒ Integer

Calculate n-gram similarity between two strings.

Calculates how many n-grams of s1 are contained in s2 (the more the number, the more words are similar).

Examples:

Kotoshu::StringMetrics.ngram(4, "hello", "help")                 # => 6
Kotoshu::StringMetrics.ngram(4, "teachings", "teaching")       # => higher score

Parameters:

  • max_ngram_size (Integer)

    Maximum n-gram size to check

  • s1 (String)

    String to compare

  • s2 (String)

    String to compare

  • weighted (Boolean) (defaults to: false)

    Subtract from result for ngrams NOT contained

  • any_mismatch (Boolean) (defaults to: false)

    Add penalty for any string length difference

  • longer_worse (Boolean) (defaults to: false)

    Add penalty when second string is longer

Returns:

  • (Integer)

    N-gram similarity score (higher is more similar)



68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
# File 'lib/kotoshu/string_metrics.rb', line 68

def self.ngram(max_ngram_size, s1, s2, weighted: false, any_mismatch: false, longer_worse: false)
  l2 = s2.length
  return 0 if l2.zero?

  l1 = s1.length
  nscore = 0

  # For all sizes of ngram up to desired...
  (1..max_ngram_size).each do |ngram_size|
    ns = 0

    # Check every position in the first string
    (0..(l1 - ngram_size)).each do |pos|
      ngram = s1[pos, ngram_size]

      # If the ngram is present in ANY place in second string, increase score
      if s2.include?(ngram)
        ns += 1
      elsif weighted
        # For "weighted" ngrams, decrease score if ngram is not found
        ns -= 1
        # Decrease once more if it was the beginning or end of first string
        ns -= 1 if pos.zero? || pos + ngram_size == l1
      end
    end

    nscore += ns

    # There is no need to check for 4-gram if there were only one 3-gram
    break if ns < 2 && !weighted
  end

  # Calculate penalty based on settings
  penalty = if longer_worse
              # Add penalty when second string is longer
              (l2 - l1) - 2
            elsif any_mismatch
              # Add penalty for any string length difference
              (l2 - l1).abs - 2
            else
              0
            end

  # Apply penalty if positive
  penalty > 0 ? nscore - penalty : nscore
end