Module: Licensee::ContentHelper::SimilarityMethods

Included in:
Licensee::ContentHelper
Defined in:
lib/licensee/content_helper/similarity_methods.rb

Overview

Mixin providing wordset-based and bigram-based similarity scoring.

Instance Method Summary

Instance Method Details

#bigram_similarity(other) ⇒ Object

Given another license or project file, calculates the Dice coefficient over bigrams (consecutive word pairs), expressed as a percentage from 0.0 to 100.0. Unlike wordset similarity, this score is sensitive to word order, which makes it resistant to adversarial scrambling, where all the correct words appear but in the wrong sequence. Returns 0.0 when neither side has any bigrams.



# File 'lib/licensee/content_helper/similarity_methods.rb', line 20

def bigram_similarity(other)
  my_bigrams = bigrams
  other_bigrams = other.bigrams
  total = my_bigrams.size + other_bigrams.size
  return 0.0 if total.zero?

  overlap = (my_bigrams & other_bigrams).size
  (overlap * 200.0) / total
end
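
As a rough illustration of why word order matters here, the following standalone sketch compares a wordset score with a bigram score on a scrambled sentence. The helpers `words`, `bigram_set`, and `dice_percent` are hypothetical names made up for this example, and the tokenizer is a simplification; none of them are part of Licensee's API, and the real `bigrams` method may normalize text differently.

```ruby
require 'set'

# Hypothetical tokenizer: lowercase word tokens (not Licensee's normalization).
def words(text)
  text.downcase.scan(/[a-z0-9']+/)
end

# Hypothetical helper: the set of consecutive word pairs in the text.
def bigram_set(text)
  words(text).each_cons(2).to_set
end

# Dice coefficient scaled to a percentage, mirroring the formula in the
# listing above: (overlap * 200.0) / (|a| + |b|), with 0.0 for two empty sets.
def dice_percent(set_a, set_b)
  total = set_a.size + set_b.size
  return 0.0 if total.zero?

  ((set_a & set_b).size * 200.0) / total
end

original  = 'permission is hereby granted free of charge'
scrambled = 'charge of free granted hereby is permission'

# Same words on both sides, so the wordset score is perfect.
wordset_score = dice_percent(words(original).to_set, words(scrambled).to_set)

# Reversing the sentence reverses every pair, so no bigram is shared.
bigram_score = dice_percent(bigram_set(original), bigram_set(scrambled))

puts format('wordset: %.1f, bigram: %.1f', wordset_score, bigram_score)
```

In this toy case the wordset score stays at its maximum while the bigram score drops to zero, which is the behavior the description above relies on.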

#similarity(other) ⇒ Object

Given another license or project file, calculates the similarity as the percentage of words the two have in common, minus a small penalty that grows with the size difference between them, so that this score alone is enough to rule out false positives for long licenses.



# File 'lib/licensee/content_helper/similarity_methods.rb', line 11

def similarity(other)
  overlap = (wordset_fieldless & other.wordset).size
  (overlap * 200.0) / similarity_denominator(other)
end
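
The body of `similarity_denominator` is not shown on this page, so the sketch below only illustrates the general idea of a size penalty folded into the denominator. The method `penalized_similarity`, the `penalty_divisor` parameter, and the penalty formula itself are all assumptions invented for this example and should not be read as Licensee's actual calculation.

```ruby
require 'set'

# Hypothetical scoring: Dice-style percentage whose denominator is inflated
# by a penalty proportional to the size difference between the two wordsets.
# This is NOT Licensee's similarity_denominator; the formula is assumed.
def penalized_similarity(words_a, words_b, penalty_divisor: 4.0)
  overlap = (words_a & words_b).size
  size_delta = (words_a.size - words_b.size).abs
  denominator = words_a.size + words_b.size + (size_delta / penalty_divisor)
  return 0.0 if denominator.zero?

  (overlap * 200.0) / denominator
end

short_text = Set.new(%w[permission is hereby granted])
long_text  = Set.new(%w[permission is hereby granted free of charge to])

# Identical wordsets: no size difference, so no penalty applies.
same_score = penalized_similarity(short_text, short_text)

# The long text contains every word of the short one, but the size gap
# pushes the score below the unpenalized value of 800.0 / 12.
mixed_score = penalized_similarity(short_text, long_text)
unpenalized = (short_text & long_text).size * 200.0 / (short_text.size + long_text.size)

puts format('same: %.2f, mixed: %.2f, unpenalized: %.2f', same_score, mixed_score, unpenalized)
```

Placing the penalty in the denominator keeps the score within 0 to 100 and leaves identical texts unaffected, which is one plausible reason to structure the calculation this way.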