Module: Licensee::ContentHelper::SimilarityMethods
- Included in: Licensee::ContentHelper
- Defined in: lib/licensee/content_helper/similarity_methods.rb
Overview
Mixin providing wordset-based similarity scoring.
Instance Method Summary
- #bigram_similarity(other) ⇒ Object
  Given another license or project file, calculates the Dice coefficient over bigrams (consecutive word pairs).
- #similarity(other) ⇒ Object
  Given another license or project file, calculates the similarity as a percentage of words in common, minus a tiny penalty that increases with size difference between licenses so that false positives for long licenses are ruled out by this score alone.
Instance Method Details
#bigram_similarity(other) ⇒ Object
Given another license or project file, calculates the Dice coefficient over bigrams (consecutive word pairs). Unlike wordset similarity this is sensitive to word order, making it resistant to adversarial scrambling where all the correct words appear but in the wrong sequence.
# File 'lib/licensee/content_helper/similarity_methods.rb', line 20

def bigram_similarity(other)
  my_bigrams = bigrams
  other_bigrams = other.bigrams
  total = my_bigrams.size + other_bigrams.size
  return 0.0 if total.zero?

  overlap = (my_bigrams & other_bigrams).size
  (overlap * 200.0) / total
end
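The order sensitivity described for #bigram_similarity can be illustrated with a small standalone sketch. It does not use Licensee itself: `toy_bigrams` and `toy_bigram_similarity` are hypothetical helpers, and the tokenization (lowercasing and splitting on non-alphanumerics) is an assumption, since the real `bigrams` helper is not shown on this page. Only the Dice arithmetic mirrors the listing above.

```ruby
require 'set'

# Hypothetical stand-in for the `bigrams` helper (not Licensee's code):
# the set of consecutive word pairs in a text.
def toy_bigrams(text)
  words = text.downcase.scan(/[a-z0-9]+/)
  words.each_cons(2).to_set
end

# Dice coefficient over the two bigram sets, scaled to 0..100,
# using the same arithmetic as the listing above.
def toy_bigram_similarity(text_a, text_b)
  mine = toy_bigrams(text_a)
  theirs = toy_bigrams(text_b)
  total = mine.size + theirs.size
  return 0.0 if total.zero?

  ((mine & theirs).size * 200.0) / total
end

original  = 'permission is hereby granted free of charge'
scrambled = 'charge of free granted hereby is permission'

puts toy_bigram_similarity(original, original)  # identical text: 100.0
puts toy_bigram_similarity(original, scrambled) # same words, reversed order: 0.0
```

Both strings contain exactly the same words, so a pure wordset comparison could not tell them apart; the bigram score drops to zero because no ordered word pair survives the reversal.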
#similarity(other) ⇒ Object
Given another license or project file, calculates the similarity as a percentage of words in common, minus a tiny penalty that increases with size difference between licenses so that false positives for long licenses are ruled out by this score alone.
# File 'lib/licensee/content_helper/similarity_methods.rb', line 11

def similarity(other)
  overlap = (wordset_fieldless & other.wordset).size
  (overlap * 200.0) / similarity_denominator(other)
end
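As a rough illustration of the wordset score and its size penalty, the sketch below computes a Dice-style overlap and enlarges the denominator as the two texts diverge in length. This is not Licensee's implementation: `similarity_denominator` is not shown on this page, so the penalty formula, the `penalty_weight` parameter, and the `toy_wordset` tokenizer are all assumptions made for the example.

```ruby
require 'set'

# Hypothetical tokenizer (not Licensee's `wordset`): the set of unique words.
def toy_wordset(text)
  text.downcase.scan(/[a-z0-9]+/).to_set
end

# Dice-style overlap scaled to 0..100. The penalty term is an assumption:
# a fraction of the word-count difference is added to the denominator,
# so a larger size difference lowers the score.
def toy_similarity(text_a, text_b, penalty_weight: 0.25)
  mine = toy_wordset(text_a)
  theirs = toy_wordset(text_b)
  length_delta = (text_a.split.size - text_b.split.size).abs
  denominator = mine.size + theirs.size + (length_delta * penalty_weight)
  return 0.0 if denominator.zero?

  ((mine & theirs).size * 200.0) / denominator
end

short_text = 'permission is hereby granted free of charge'
long_text  = "#{short_text} to any person obtaining a copy"

puts toy_similarity(short_text, short_text) # identical texts, no penalty: 100.0
puts toy_similarity(short_text, long_text)  # below the 70.0 an unpenalized score would give
```

With `penalty_weight: 0.0` the second comparison scores exactly 70.0 (7 shared words out of 7 + 13), so the difference between the two results isolates the effect of the assumed penalty.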