Class: Jekyll::L10n::PoFuzzyMatcher

Inherits:
Object
Defined in:
lib/jekyll-l10n/po_file/fuzzy_matcher.rb

Overview

Finds the closest matching old PO entry for a new msgid using normalized Levenshtein similarity. Mirrors GNU msgmerge fuzzy-matching behaviour.

Key responsibilities:

  • Compute normalized edit-distance similarity between two strings

  • Select the best-scoring candidate from a pool of orphaned old entries

  • Return the matched old msgid and its msgstr for use as a fuzzy hint

Constant Summary

THRESHOLD = Constants::DEFAULT_FUZZY_THRESHOLD

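Class Method Summary

  • .find_match(new_msgid, candidates, threshold: THRESHOLD) ⇒ Hash?

    Find the best fuzzy match for new_msgid among candidates.

  • .msgstr_from_entry(entry) ⇒ String

    Extract msgstr from a PO entry that is either a plain String or a metadata Hash.

  • .similarity(str_a, str_b) ⇒ Float

    Normalized Levenshtein similarity between two strings.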

Class Method Details

.find_match(new_msgid, candidates, threshold: THRESHOLD) ⇒ Hash?

Find the best fuzzy match for new_msgid among candidates.

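Returns nil when candidates is empty, when new_msgid is nil, or when new_msgid is longer than Constants::MAX_FUZZY_MSGID_LENGTH. Long strings are skipped because they are usually unique HTML fragments with no useful near-duplicate, and because Levenshtein distance costs O(n²) time. Before invoking Levenshtein, candidates are pre-filtered to the length range in which a similarity of at least threshold is mathematically possible: between ceil(len × threshold) and floor(len / threshold) characters, where len is the length of new_msgid.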

Parameters:

  • new_msgid (String)

    the new source string to match

  • candidates (Hash)

    { old_msgid => entry } where entry is either a String msgstr or a Hash with :msgstr key

  • threshold (Float) (defaults to: THRESHOLD)

    minimum similarity score to accept (0.0–1.0)

Returns:

  • (Hash, nil)

    { msgid: String, msgstr: String } or nil if no match



# File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 37

def self.find_match(new_msgid, candidates, threshold: THRESHOLD)
  return nil if candidates.empty?
  return nil if new_msgid.nil? || new_msgid.length > Constants::MAX_FUZZY_MSGID_LENGTH

  len      = new_msgid.length
  min_feas = (len * threshold).ceil
  max_feas = threshold.positive? ? (len / threshold).floor : Float::INFINITY

  best = best_candidate(new_msgid, candidates, min_feas, max_feas, threshold)
  return nil unless best

  { msgid: best[:msgid], msgstr: msgstr_from_entry(best[:entry]) }
end
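
The length pre-filter in find_match can be illustrated in isolation. The sketch below is not part of the library: feasible_length_range is a hypothetical helper name, and the threshold of 0.5 is an assumed value chosen only to keep the arithmetic exact (the real default comes from Constants::DEFAULT_FUZZY_THRESHOLD, whose value is not shown here).

```ruby
# Hypothetical helper (not part of jekyll-l10n): returns the inclusive bounds
# on candidate length that could still reach `threshold` similarity against a
# string of length `len`. Mirrors the min_feas / max_feas computation above.
def feasible_length_range(len, threshold)
  # Shorter candidate: similarity <= cand_len / len, so cand_len >= len * threshold.
  min_feas = (len * threshold).ceil
  # Longer candidate: similarity <= len / cand_len, so cand_len <= len / threshold.
  max_feas = threshold.positive? ? (len / threshold).floor : Float::INFINITY
  [min_feas, max_feas]
end

# Assumed threshold of 0.5, for illustration only.
min_len, max_len = feasible_length_range(10, 0.5)
puts "score only candidates with #{min_len}..#{max_len} characters"
# prints "score only candidates with 5..20 characters"
```

Any candidate outside that window has a length ratio below the threshold, so it can be rejected without running the edit-distance computation at all.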

.msgstr_from_entry(entry) ⇒ String

Extract msgstr from a PO entry that is either a plain String or a metadata Hash.

Parameters:

  • entry (String, Hash)

    PO entry value

Returns:

  • (String)


# File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 21

def self.msgstr_from_entry(entry)
  entry.is_a?(Hash) ? entry[:msgstr].to_s : entry.to_s
end
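
As a quick illustration of the two entry shapes, the sketch below copies the one-line extraction into a standalone method so it runs without the gem. The :fuzzy key in the metadata Hash is a hypothetical extra field used only for this example; only :msgstr is documented above.

```ruby
# Standalone copy of the extraction logic shown above, so this sketch runs
# without loading jekyll-l10n.
def msgstr_from_entry(entry)
  entry.is_a?(Hash) ? entry[:msgstr].to_s : entry.to_s
end

plain_entry    = "Bonjour"                        # entry stored as a bare msgstr
metadata_entry = { msgstr: "Bonjour", fuzzy: true } # :fuzzy is a hypothetical key

puts msgstr_from_entry(plain_entry)      # prints "Bonjour"
puts msgstr_from_entry(metadata_entry)   # prints "Bonjour"
puts msgstr_from_entry(nil).inspect      # prints "" (nil.to_s is the empty string)
```

Because both branches end in to_s, a nil entry or a Hash without a :msgstr key yields an empty String rather than nil.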

.similarity(str_a, str_b) ⇒ Float

Normalized Levenshtein similarity between two strings.

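Returns 1.0 for identical strings and 0.0 when either string is empty. Also returns 0.0 immediately when the length ratio min_len/max_len falls below the class constant THRESHOLD (not a per-call threshold). Two strings of different lengths need at least max_len − min_len edits, so the maximum achievable similarity is min_len/max_len; when that ceiling is below THRESHOLD, no acceptable match is possible and the O(n²) Levenshtein computation is skipped.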

Parameters:

  • str_a (String)
  • str_b (String)

Returns:

  • (Float)

    0.0 (completely different) to 1.0 (identical)



# File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 80

def self.similarity(str_a, str_b)
  return 1.0 if str_a == str_b
  return 0.0 if str_a.empty? || str_b.empty?

  max_len = [str_a.length, str_b.length].max
  min_len = [str_a.length, str_b.length].min
  return 0.0 if min_len.to_f / max_len < THRESHOLD

  dist = levenshtein(str_a, str_b)
  1.0 - (dist.to_f / max_len)
end
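
The levenshtein helper called above is not shown on this page. The following is a minimal sketch of the same scoring, assuming a standard two-row dynamic-programming edit distance in its place. The names levenshtein_sketch and similarity_sketch are hypothetical, and the sketch takes the threshold as an explicit argument (0.5 is an assumed value) instead of reading the THRESHOLD constant.

```ruby
# Assumed stand-in for the library's `levenshtein` helper: classic two-row
# dynamic-programming edit distance (insert, delete, substitute all cost 1).
def levenshtein_sketch(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Same steps as .similarity above, with the threshold passed in explicitly.
def similarity_sketch(str_a, str_b, threshold)
  return 1.0 if str_a == str_b
  return 0.0 if str_a.empty? || str_b.empty?

  max_len = [str_a.length, str_b.length].max
  min_len = [str_a.length, str_b.length].min
  # Length-ratio short-circuit: the score can never exceed min_len / max_len.
  return 0.0 if min_len.to_f / max_len < threshold

  1.0 - (levenshtein_sketch(str_a, str_b).to_f / max_len)
end

puts levenshtein_sketch("kitten", "sitting")          # prints 3
puts similarity_sketch("ab", "abcdefgh", 0.5)         # prints 0.0 (ratio 0.25 is below 0.5)
puts similarity_sketch("Save file", "Save files", 0.5).round(2)
```

In the last call the strings differ by one inserted character and the longer string has 10 characters, so the score is 1 − 1/10.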