Class: Canon::TreeDiff::Matchers::SimilarityMatcher

Inherits:
Object
Defined in:
lib/canon/tree_diff/matchers/similarity_matcher.rb

Overview

SimilarityMatcher pairs nodes that are similar but not identical, working on the nodes left unmatched by an earlier matching phase.

Based on the JATS-diff (2022) approach:

  • Use Jaccard index for content similarity

  • Configurable similarity threshold (default 0.95)

  • Group candidates by signature for efficiency

  • Extend matches for unmatched nodes
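
The Jaccard index and threshold check listed above can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the helper names `jaccard_index`, `tokenize`, and `similar?` are hypothetical, and the word-level tokenization is an assumption, since this page does not show how node content is tokenized.

```ruby
require "set"

# Jaccard index of two token collections: |A ∩ B| / |A ∪ B|.
# Two empty collections are treated as identical (1.0) by convention here.
def jaccard_index(tokens_a, tokens_b)
  a = tokens_a.to_set
  b = tokens_b.to_set
  union_size = (a | b).size
  return 1.0 if union_size.zero?

  (a & b).size.to_f / union_size
end

# Hypothetical tokenizer (assumption): lowercase, then split into word tokens.
def tokenize(text)
  text.downcase.scan(/\w+/)
end

# True when the similarity reaches the threshold (0.95 mirrors the documented default).
def similar?(text_a, text_b, threshold: 0.95)
  jaccard_index(tokenize(text_a), tokenize(text_b)) >= threshold
end
```

With word-level tokens, a single changed word in a short text lowers the score sharply, so a threshold of 0.95 only accepts near-identical content.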

Features:

  • Handles text-centric documents

  • Fuzzy matching for similar but not identical nodes

  • Threshold-based filtering

  • Efficient signature-based grouping

Constructor Details

#initialize(tree1, tree2, matching, threshold: 0.95) ⇒ SimilarityMatcher

Initialize matcher with two trees and existing matching

Parameters:

  • tree1 (TreeNode)

    First tree root

  • tree2 (TreeNode)

    Second tree root

  • matching (Core::Matching)

    Existing matching from previous phase

  • threshold (Float) (defaults to: 0.95)

    Similarity threshold (0.0 to 1.0)



# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 32

def initialize(tree1, tree2, matching, threshold: 0.95)
  @tree1 = tree1
  @tree2 = tree2
  @matching = matching
  @threshold = threshold
end

Instance Attribute Details

#matching ⇒ Object (readonly)

Returns the value of attribute matching.



# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 24

def matching
  @matching
end

#threshold ⇒ Object (readonly)

Returns the value of attribute threshold.



# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 24

def threshold
  @threshold
end

#tree1 ⇒ Object (readonly)

Returns the value of attribute tree1.



# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 24

def tree1
  @tree1
end

#tree2 ⇒ Object (readonly)

Returns the value of attribute tree2.



# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 24

def tree2
  @tree2
end

Instance Method Details

#match ⇒ Core::Matching

Perform similarity-based matching

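Returns:

  • (Core::Matching)

    The matching supplied to the constructor, returned after all signature groups have been processed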

# File 'lib/canon/tree_diff/matchers/similarity_matcher.rb', line 42

def match
  # Get unmatched nodes from both trees
  all_nodes1 = collect_nodes(tree1)
  all_nodes2 = collect_nodes(tree2)

  unmatched1 = @matching.unmatched1(all_nodes1)
  unmatched2 = @matching.unmatched2(all_nodes2)

  # Group unmatched nodes by signature for efficiency
  groups1 = group_by_signature(unmatched1)
  groups2 = group_by_signature(unmatched2)

  # For each signature group, find similar matches
  groups2.each do |sig, nodes2|
    # Find corresponding group in tree1
    nodes1 = groups1[sig] || []
    next if nodes1.empty?

    # Match nodes within this signature group
    match_group(nodes1, nodes2)
  end

  @matching
end
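
The private helpers that `match` calls (`collect_nodes`, `group_by_signature`, `match_group`) are not shown on this page. The sketch below illustrates one way the group-then-match flow could work, under stated assumptions: the `Node` struct is a stand-in for TreeNode, the signature is assumed to be the node label alone, and the greedy best-candidate pairing is an illustrative choice rather than the library's confirmed strategy. All names are hypothetical.

```ruby
require "set"

# Stand-in node type (assumption); the real TreeNode API is not shown here.
Node = Struct.new(:label, :text)

# Word-level Jaccard similarity between two strings.
def text_similarity(a, b)
  sa = a.downcase.scan(/\w+/).to_set
  sb = b.downcase.scan(/\w+/).to_set
  union = (sa | sb).size
  union.zero? ? 1.0 : (sa & sb).size.to_f / union
end

# Assumed signature: the node label only.
def group_nodes_by_label(nodes)
  nodes.group_by(&:label)
end

# Greedy pairing inside one signature group: each node from tree 2 takes
# the most similar still-available node from tree 1, if it meets the threshold.
def pair_within_group(nodes1, nodes2, threshold)
  available = nodes1.dup
  pairs = []
  nodes2.each do |n2|
    idx = available.each_index.max_by { |i| text_similarity(available[i].text, n2.text) }
    next if idx.nil? || text_similarity(available[idx].text, n2.text) < threshold

    pairs << [available.delete_at(idx), n2]
  end
  pairs
end

# Only nodes sharing a signature are compared, which avoids scoring every
# cross-tree pair of unmatched nodes.
def similarity_pairs(unmatched1, unmatched2, threshold: 0.95)
  groups1 = group_nodes_by_label(unmatched1)
  group_nodes_by_label(unmatched2).flat_map do |sig, nodes2|
    pair_within_group(groups1.fetch(sig, []), nodes2, threshold)
  end
end
```

Grouping first keeps the comparison count proportional to the sizes of the individual groups instead of the product of both unmatched sets.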