Class: Jekyll::L10n::PoFuzzyMatcher
- Inherits: Object
- Defined in: lib/jekyll-l10n/po_file/fuzzy_matcher.rb
Overview
Finds the closest matching old PO entry for a new msgid using normalized Levenshtein similarity. Mirrors GNU msgmerge fuzzy-matching behaviour.
Key responsibilities:
- Compute normalized edit-distance similarity between two strings
- Select the best-scoring candidate from a pool of orphaned old entries
- Return the matched old msgid and its msgstr for use as a fuzzy hint
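The selection step can be pictured with a small standalone sketch. It is not the library's code: `pick_best` is a hypothetical helper, `candidates` is assumed to be a Hash of old msgid => entry, and the positional scorer is a deliberately simple stand-in for the Levenshtein-based similarity.

```ruby
# Hypothetical sketch: pick the highest-scoring old entry at or above a threshold.
# The scoring function is supplied as a block so any similarity measure can be used.
def pick_best(new_msgid, candidates, threshold)
  candidates
    .map { |msgid, entry| { msgid: msgid, entry: entry, score: yield(new_msgid, msgid) } }
    .select { |c| c[:score] >= threshold }
    .max_by { |c| c[:score] }
end

# Stand-in scorer (NOT Levenshtein): share of positions holding the same character.
position_score = lambda do |a, b|
  a.chars.zip(b.chars).count { |x, y| x == y }.to_f / [a.length, b.length].max
end

old_entries = {
  "Save file" => "Enregistrer le fichier",
  "Open file" => "Ouvrir le fichier"
}

best = pick_best("Save files", old_entries, 0.6) { |a, b| position_score.call(a, b) }
puts best[:msgid] if best
```

When no candidate reaches the threshold, the sketch returns nil, matching the `Hash?` return type documented for `find_match`.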
Constant Summary
- THRESHOLD = Constants::DEFAULT_FUZZY_THRESHOLD
Class Method Summary
- .find_match(new_msgid, candidates, threshold: THRESHOLD) ⇒ Hash?
  Find the best fuzzy match for new_msgid among candidates.
- .msgstr_from_entry(entry) ⇒ String
  Extract msgstr from a PO entry that is either a plain String or a metadata Hash.
- .similarity(str_a, str_b) ⇒ Float
  Normalized Levenshtein similarity between two strings.
Class Method Details
.find_match(new_msgid, candidates, threshold: THRESHOLD) ⇒ Hash?
Find the best fuzzy match for new_msgid among candidates.
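Returns nil when candidates is empty, when new_msgid is nil, or when new_msgid is longer than Constants::MAX_FUZZY_MSGID_LENGTH. Long strings are skipped because they tend to be unique HTML fragments with no useful near-duplicate, and Levenshtein is O(n²). Before invoking Levenshtein, candidates are pre-filtered to the only length range in which similarity ≥ threshold is mathematically possible: from ⌈len × threshold⌉ to ⌊len / threshold⌋, where len is the length of new_msgid (no upper bound when threshold is 0).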
    # File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 37

    def self.find_match(new_msgid, candidates, threshold: THRESHOLD)
      return nil if candidates.empty?
      return nil if new_msgid.nil? || new_msgid.length > Constants::MAX_FUZZY_MSGID_LENGTH

      len = new_msgid.length
      min_feas = (len * threshold).ceil
      max_feas = threshold.positive? ? (len / threshold).floor : Float::INFINITY

      best = best_candidate(new_msgid, candidates, min_feas, max_feas, threshold)
      return nil unless best

      { msgid: best[:msgid], msgstr: msgstr_from_entry(best[:entry]) }
    end
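The length window computed by `min_feas` and `max_feas` can be checked in isolation. The sketch below copies only that arithmetic into a hypothetical helper, `feasible_length_range`, which is not part of the library; the threshold of 0.5 is an illustrative value, not necessarily `DEFAULT_FUZZY_THRESHOLD`.

```ruby
# Hypothetical helper mirroring the min_feas / max_feas arithmetic in find_match.
# Returns the candidate-length bounds outside of which similarity >= threshold
# cannot be reached, because similarity is capped at min_len / max_len.
def feasible_length_range(len, threshold)
  min_feas = (len * threshold).ceil
  max_feas = threshold.positive? ? (len / threshold).floor : Float::INFINITY
  [min_feas, max_feas]
end

# Assumed inputs: a 20-character msgid and an illustrative threshold of 0.5.
low, high = feasible_length_range(20, 0.5)
puts "candidate lengths worth scoring: #{low}..#{high}"

# A zero threshold disables the upper bound entirely.
_, unbounded = feasible_length_range(20, 0.0)
puts unbounded
```

With these inputs only candidates between 10 and 40 characters long would reach the Levenshtein step; everything else is rejected by a cheap length comparison.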
.msgstr_from_entry(entry) ⇒ String
Extract msgstr from a PO entry that is either a plain String or a metadata Hash.
    # File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 21

    def self.msgstr_from_entry(entry)
      entry.is_a?(Hash) ? entry[:msgstr].to_s : entry.to_s
    end
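To see how both entry shapes are handled, the one-line body can be copied into a standalone method and exercised directly. This is an illustration only; the `:fuzzy` key in the metadata Hash is a hypothetical example of extra metadata, not a documented field.

```ruby
# Standalone copy of the documented one-liner, for illustration only.
def msgstr_from_entry(entry)
  entry.is_a?(Hash) ? entry[:msgstr].to_s : entry.to_s
end

plain    = msgstr_from_entry("Bonjour")                        # plain String entry
metadata = msgstr_from_entry({ msgstr: "Bonjour", fuzzy: true }) # metadata Hash (:fuzzy is hypothetical)
missing  = msgstr_from_entry({ fuzzy: true })                  # Hash without :msgstr
absent   = msgstr_from_entry(nil)                              # nil entry

p [plain, metadata, missing, absent]
```

Because of the trailing `to_s` calls, a missing `:msgstr` key or a nil entry yields an empty String rather than nil, which is consistent with the documented `String` return type.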
.similarity(str_a, str_b) ⇒ Float
Normalized Levenshtein similarity between two strings.
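Returns 1.0 for identical strings and 0.0 when either string is empty. Also returns 0.0 immediately when the length ratio min_len / max_len falls below THRESHOLD: the edit distance is at least max_len - min_len, so the similarity can never exceed min_len / max_len, and the O(n²) Levenshtein computation is skipped. This check uses the THRESHOLD constant, not the threshold keyword passed to find_match.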
    # File 'lib/jekyll-l10n/po_file/fuzzy_matcher.rb', line 80

    def self.similarity(str_a, str_b)
      return 1.0 if str_a == str_b
      return 0.0 if str_a.empty? || str_b.empty?

      max_len = [str_a.length, str_b.length].max
      min_len = [str_a.length, str_b.length].min
      return 0.0 if min_len.to_f / max_len < THRESHOLD

      dist = levenshtein(str_a, str_b)
      1.0 - (dist.to_f / max_len)
    end
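The `levenshtein` helper is not shown on this page. As a rough, self-contained illustration of the formula, the sketch below pairs the documented logic with a textbook two-row Levenshtein implementation. Both the helper body and the `THRESHOLD` value of 0.6 are assumptions made for the example, not the library's actual code or default.

```ruby
THRESHOLD = 0.6 # assumed value; the library uses Constants::DEFAULT_FUZZY_THRESHOLD

# Textbook two-row dynamic-programming edit distance (assumed implementation).
def levenshtein(str_a, str_b)
  prev = (0..str_b.length).to_a
  str_a.each_char.with_index(1) do |char_a, i|
    curr = [i]
    str_b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Same steps as the documented method, written as a top-level function.
def similarity(str_a, str_b)
  return 1.0 if str_a == str_b
  return 0.0 if str_a.empty? || str_b.empty?

  max_len = [str_a.length, str_b.length].max
  min_len = [str_a.length, str_b.length].min
  return 0.0 if min_len.to_f / max_len < THRESHOLD

  1.0 - (levenshtein(str_a, str_b).to_f / max_len)
end

puts similarity("kitten", "sitting") # distance 3 over max length 7
puts similarity("ab", "abcdefgh")    # length ratio 0.25 is below THRESHOLD
```

The second call never reaches `levenshtein`: a length ratio of 2/8 already rules out any score of 0.6 or higher, which is the shortcut described above.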