Module: Scrapetor::Fingerprint

Defined in:
lib/scrapetor/fingerprint.rb

Overview

Structural fingerprint of a DOM subtree. Phase 1: tag-bigram rolling hash over the top `depth` levels. Phase 2+: tag bigrams + attribute-presence hash + child-shape hash.

Constant Summary

MASK =
0xFFFFFFFFFFFFFFFF

Class Method Summary

Class Method Details

.structural(node, depth: 4) ⇒ Object

# File 'lib/scrapetor/fingerprint.rb', line 10

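def self.structural(node, depth: 4)
  # Unwrap wrapper objects that expose the underlying DOM node.
  backing = node.respond_to?(:backing_node) ? node.backing_node : node
  h = 0
  # Fold each descendant tag name, in document order, into a 64-bit rolling hash.
  # Note: String#hash is seeded per Ruby process, so fingerprints are only
  # comparable within the same process.
  walk(backing, depth) do |tag|
    h = (h * 1_315_423_911 + tag.hash) & MASK
  end
  h
end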
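
The following is a minimal sketch of how the rolling hash behaves, not the library's own test code. It assumes a hypothetical `DemoNode` struct standing in for a real DOM node (anything that responds to `children`, `element?`, and `name`), and it re-declares the logic shown above under the hypothetical name `DemoFingerprint` so that the snippet runs without the gem installed.

```ruby
# Hypothetical stand-in for a DOM node; not part of Scrapetor.
DemoNode = Struct.new(:name, :children) do
  def element?
    true
  end
end

# Local copy of the logic shown above, under a hypothetical module name,
# so the example is self-contained.
module DemoFingerprint
  MASK = 0xFFFFFFFFFFFFFFFF

  def self.structural(node, depth: 4)
    backing = node.respond_to?(:backing_node) ? node.backing_node : node
    h = 0
    walk(backing, depth) { |tag| h = (h * 1_315_423_911 + tag.hash) & MASK }
    h
  end

  def self.walk(node, depth, &block)
    return if depth <= 0
    return unless node.respond_to?(:children)
    node.children.each do |c|
      next unless c.respond_to?(:element?) && c.element?
      block.call(c.name)
      walk(c, depth - 1, &block)
    end
  end
end

# Builds <div><ul> with three identical items of the given tag.
def demo_list(item_tag)
  items = Array.new(3) { DemoNode.new(item_tag, []) }
  DemoNode.new("div", [DemoNode.new("ul", items)])
end

a = DemoFingerprint.structural(demo_list("li"))
b = DemoFingerprint.structural(demo_list("li"))
c = DemoFingerprint.structural(demo_list("p"))

puts a == b  # true: same tag sequence, same fingerprint within this process
puts a == c  # false unless the two tag-name hashes happen to collide
```

Two details are worth noting. The root node's own tag never enters the hash, because only descendants are visited. And since `String#hash` varies between Ruby processes, a fingerprint computed this way is suited to in-process comparison rather than to storage.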

.walk(nlx, depth, &block) ⇒ Object

# File 'lib/scrapetor/fingerprint.rb', line 19

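def self.walk(nlx, depth, &block)
  # Stop once the depth budget is spent or the node cannot have children.
  return if depth <= 0
  return unless nlx.respond_to?(:children)
  nlx.children.each do |c|
    # Skip text, comment, and other non-element nodes.
    next unless c.respond_to?(:element?) && c.element?
    # Yield the child's tag name, then descend (pre-order traversal).
    block.call(c.name)
    walk(c, depth - 1, &block)
  end
end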
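
To make the traversal order and the depth cutoff concrete, here is a small sketch. It assumes a hypothetical `WalkNode` struct in place of a real DOM node and copies the traversal above into a standalone method named `demo_walk`. Both names are illustrative only.

```ruby
# Hypothetical stand-in node; `element` flags whether it counts as an element.
WalkNode = Struct.new(:name, :children, :element) do
  def element?
    element
  end
end

# Local copy of the traversal shown above so the snippet runs standalone.
def demo_walk(node, depth, &block)
  return if depth <= 0
  return unless node.respond_to?(:children)
  node.children.each do |c|
    next unless c.respond_to?(:element?) && c.element?
    block.call(c.name)
    demo_walk(c, depth - 1, &block)
  end
end

# Helper that builds an element node with the given children.
def el(name, *kids)
  WalkNode.new(name, kids, true)
end

# Collects the tag names yielded for a given depth budget.
def tags_seen(node, depth)
  seen = []
  demo_walk(node, depth) { |t| seen << t }
  seen
end

text = WalkNode.new("text", [], false) # non-element node, skipped by the walk
tree = el("html", el("body", text, el("ul", el("li"), el("li")), el("p")))

p tags_seen(tree, 1)  # => ["body"]
p tags_seen(tree, 2)  # => ["body", "ul", "p"]
p tags_seen(tree, 4)  # => ["body", "ul", "li", "li", "p"]
```

The yielded sequence is a flat pre-order list with no depth markers, so two differently shaped subtrees that produce the same sequence would receive the same Phase 1 fingerprint.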