Module: Eco::Data::FuzzyMatch::NGramsScore
- Included in: ClassMethods
- Defined in: lib/eco/data/fuzzy_match/ngrams_score.rb
Instance Method Summary
- #ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score
A score is kept of the ngram combinations of str2 that are also found in str1.
- #words_ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score
Splits both strings into words, pairs the words by best ngrams_score match, and merges the scores of the pairs.
Instance Method Details
#ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score
Note:
This algorithm is best suited for matching sentences, or 'firstname lastname' compared with 'lastname firstname' combinations.
A score is kept of the ngram combinations of str2 that are also found in str1.
# File 'lib/eco/data/fuzzy_match/ngrams_score.rb', line 42

def ngrams_score(str1, str2, range: 3..5, normalized: false)
  str1, str2 = normalize_string([str1, str2]) unless normalized
  len1 = str1 && str1.length; len2 = str2 && str2.length

  Score.new(0, len1 || 0).tap do |score|
    next if !str2 || !str1
    next if str2.empty? || str1.empty?
    score.total = len1
    next score.increase(score.total) if str1 == str2
    next if str1.length < 2 || str2.length < 2

    grams       = word_ngrams(str2, range, normalized: true)
    grams_count = grams.length
    next unless grams_count > 0

    if range.is_a?(Integer)
      # Single ngram length: every ngram carries the same weight
      item_weight = score.total.to_f / grams_count
      matches     = grams.select {|gram| str1.include?(gram)}.length
      score.increase(matches * item_weight)
    else
      # Several ngram lengths: split the weight evenly across length groups
      groups       = grams.group_by {|gram| gram.length}
      sorted_lens  = groups.keys.sort.reverse
      lens         = sorted_lens.length
      group_weight = (1.0 / lens).round(3)

      groups.each do |len, grams|
        len_max_score = score.total * group_weight
        item_weight   = len_max_score / grams_count
        matches       = grams.select {|gram| str1.include?(gram)}.length
        #pp "(#{len}) match: #{matches} (of #{grams.length} of total #{grams_count}) || max_score: #{len_max_score} (over #{score.total})"
        score.increase(matches * item_weight)
      end
    end
  end
end
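The core idea of the method above, counting how many ngrams of str2 also occur in str1, can be sketched in isolation. This is a simplified, standalone illustration and not the library's implementation: char_ngrams and simple_ngrams_score are hypothetical names, the ngrams are taken over the whole string (which may differ from what word_ngrams does), and the result is a plain 0.0..1.0 ratio rather than a Score object.

```ruby
# Hypothetical helper (not part of Eco::Data::FuzzyMatch): all distinct
# character ngrams of `str` for every length in `range`.
def char_ngrams(str, range)
  range.flat_map do |n|
    # each_cons yields nothing when n is longer than the string
    str.chars.each_cons(n).map(&:join)
  end.uniq
end

# Hypothetical simplified score: the fraction of str2's ngrams that
# appear somewhere in str1. Every ngram has the same weight here.
def simple_ngrams_score(str1, str2, range: 3..5)
  return 0.0 if str1.nil? || str2.nil? || str1.empty? || str2.empty?
  return 1.0 if str1 == str2

  grams = char_ngrams(str2, range)
  return 0.0 if grams.empty?

  matches = grams.count { |gram| str1.include?(gram) }
  matches.to_f / grams.length
end

swapped   = simple_ngrams_score("smith john", "john smith")
unrelated = simple_ngrams_score("abc", "xyz")
```

In this sketch a swapped 'lastname firstname' pair still scores well above an unrelated string, because the ngrams inside each word survive the reordering; only the ngrams that span the space are lost.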
#words_ngrams_score(str1, str2, range: 3..5, normalized: false) ⇒ Score
It does the following:
- Splits both strings into words
- Pairs all words by best ngrams_score match
- Gives 0 score to those words of str2 that lost their pair (a word of str1 cannot be paired twice)
- Merges the ngrams_score of all the paired words of str2 against their str1 word pair
# File 'lib/eco/data/fuzzy_match/ngrams_score.rb', line 13

def words_ngrams_score(str1, str2, range: 3..5, normalized: false)
  str1, str2 = normalize_string([str1, str2]) unless normalized
  len1 = str1 && str1.length; len2 = str2 && str2.length

  Score.new(0, 0).tap do |score|
    next if !str2 || !str1
    next score.increase_total(len1) if str2.empty? || str1.empty?
    if str1 == str2
      score.total = len1
      score.increase(score.total)
    end
    # Check both strings (the second operand previously repeated str1)
    if str1.length < 2 || str2.length < 2
      score.increase_total(len1)
    end

    # Pair the words of both strings by best ngrams_score and merge each pair's score
    pairs = paired_words(str1, str2, normalized: true) do |needle, item|
      ngrams_score(needle, item, range: range, normalized: true)
    end.each do |sub_str1, data|
      item, iscore = data
      score.merge!(iscore)
    end
  end
end
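The pairing steps described for this method can be illustrated with a small standalone sketch. It is an assumption-laden simplification, not how paired_words or Score#merge! actually work: bigram_overlap, pair_words and merged_score are hypothetical names, the pairing is greedy in str2 word order, and the merge is taken to be a plain average of the per-word scores.

```ruby
# Hypothetical stand-in for ngrams_score: share of the needle's distinct
# 2-character grams that occur in the candidate word.
def bigram_overlap(needle, word)
  return 1.0 if needle == word

  grams = needle.chars.each_cons(2).map(&:join).uniq
  return 0.0 if grams.empty?

  grams.count { |gram| word.include?(gram) }.to_f / grams.length
end

# Greedy one-to-one pairing (an assumption): each word of str2 takes the
# best-scoring word of str1 that is still free. A word of str1 cannot be
# paired twice, so a word of str2 left without a pair scores 0.0.
def pair_words(str1, str2)
  free = str1.split
  str2.split.map do |word|
    best  = free.max_by { |candidate| bigram_overlap(word, candidate) }
    score = best ? bigram_overlap(word, best) : 0.0
    if score > 0
      free.delete_at(free.index(best))
      [word, best, score]
    else
      [word, nil, 0.0]
    end
  end
end

# Assumed merge rule: average of the per-word scores of str2.
def merged_score(pairs)
  return 0.0 if pairs.empty?

  pairs.sum { |_, _, score| score } / pairs.length
end

reordered = merged_score(pair_words("smith john", "john smith"))
unpaired  = merged_score(pair_words("john", "john johnny"))
```

Here the reordered name pairs every word exactly, while in the second call "johnny" loses its pair to "john" and contributes 0.0, which pulls the merged score down.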