Module: TextMetrics::Levenshtein

Defined in:
lib/text_metrics/levenshtein.rb

Overview

Levenshtein edit distance between two strings, plus a normalized similarity score. Comparison is a property of two texts, so it lives here rather than on a single-text analyzer.

Class Method Summary collapse

Class Method Details

.distance(first, second) ⇒ Object

Raw edit distance: the number of single-character insertions, deletions or substitutions needed to turn first into second. Case-sensitive.



11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# File 'lib/text_metrics/levenshtein.rb', line 11

def distance(first, second)
  first = first.to_s
  second = second.to_s
  m = first.length
  n = second.length

  return n if m.zero?
  return m if n.zero?

  matrix = Array.new(m + 1) { Array.new(n + 1, 0) }
  (0..m).each { |i| matrix[i][0] = i }
  (0..n).each { |j| matrix[0][j] = j }

  (1..m).each do |i|
    (1..n).each do |j|
      cost = (first[i - 1] == second[j - 1]) ? 0 : 1
      matrix[i][j] = [
        matrix[i - 1][j] + 1,        # deletion
        matrix[i][j - 1] + 1,        # insertion
        matrix[i - 1][j - 1] + cost  # substitution
      ].min
    end
  end

  matrix[m][n]
end

.similarity(first, second) ⇒ Object

Similarity as a 0–100 score: 100.0 means identical, 0.0 means nothing in common.



39
40
41
42
43
44
# File 'lib/text_metrics/levenshtein.rb', line 39

def similarity(first, second)
  max_length = [first.to_s.length, second.to_s.length].max
  return 100.0 if max_length.zero?

  ((max_length - distance(first, second)).to_f / max_length * 100).round(2)
end