Class: Html2rss::AutoSource::Scraper::SemanticHtml::Deduplicator

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb

Overview

Collapses nested containers and deduplicates entries that point to the same destination. Within each destination group, ties are resolved in order by final score, quality score, payload richness, and finally position.

Constructor Details

#initialize(url, extractor) ⇒ Deduplicator

Returns a new instance of Deduplicator.

Parameters:

  • url (String, Html2rss::Url)

    base url used to resolve relative hrefs

  • extractor (Class)

    extractor class used to materialize articles



# File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 13

def initialize(url, extractor)
  @url = url
  @extractor = extractor
  @article_cache = {}.compare_by_identity
end
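
The constructor keys its article cache by object identity rather than by value. The following is a minimal sketch of what `compare_by_identity` changes, assuming a hypothetical `IdentityDemo` struct as a stand-in for a scraper entry (it is not the library's `Entry` class). The motivation suggested here, keeping value-equal entries in separate cache slots, is an inference from the code, not a documented guarantee.

```ruby
# Hypothetical stand-in for a scraper entry; not the library's Entry class.
IdentityDemo = Struct.new(:href)

a = IdentityDemo.new("/post")
b = IdentityDemo.new("/post") # equal by value, but a different object

# A regular Hash uses #hash and #eql?, which Struct defines by member values.
by_value = {}
by_value[a] = :first
by_value[b] = :second

# An identity Hash treats every distinct object as a distinct key.
by_identity = {}.compare_by_identity
by_identity[a] = :first
by_identity[b] = :second

puts by_value.size    # value-equal keys share one slot
puts by_identity.size # each object keeps its own slot
```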

Instance Method Details

#article_for(entry) ⇒ Hash?

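Returns the article hash for the entry. If the entry already carries an article, that article is returned directly. Otherwise the article is materialized with the extractor and memoized per entry object, so a nil extraction result is cached as well and the extractor is not run again for the same entry.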

Parameters:

  • entry (Entry)

    scraper entry

Returns:

  • (Hash, nil)

    article payload



# File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 36

def article_for(entry)
  return entry.article if entry.article

  @article_cache.fetch(entry) do
    @article_cache[entry] = @extractor.new(
      entry.container, base_url: @url, selected_anchor: entry.selected_anchor
    ).call
  end
end
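
The `fetch`-with-block pattern in `article_for` behaves differently from the more common `||=` memoization when the computed value is nil. A small sketch of that difference, using a hypothetical counting lambda in place of the real extractor:

```ruby
# Count how often the (hypothetical) extraction runs under each strategy.
fetch_calls = 0
or_equals_calls = 0

fetch_cache = {}.compare_by_identity
or_equals_cache = {}.compare_by_identity
key = Object.new

# Strategy used in article_for: the block runs only when the key is absent.
via_fetch = lambda do |k|
  fetch_cache.fetch(k) do
    fetch_calls += 1
    fetch_cache[k] = nil # pretend the extractor found nothing
  end
end

# Common alternative: re-runs whenever the stored value is nil or false.
via_or_equals = lambda do |k|
  or_equals_cache[k] ||= begin
    or_equals_calls += 1
    nil
  end
end

2.times { via_fetch.call(key) }
2.times { via_or_equals.call(key) }

puts fetch_calls     # nil result was cached after the first call
puts or_equals_calls # nil result triggered a second computation
```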

#call(entries) ⇒ Array<Entry>

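Groups the given entries by destination, collapses nested containers within each group, and keeps the strongest entry of each group as decided by #stronger_entry?. Groups that are empty after collapsing contribute no entry.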

Parameters:

  • entries (Array<Entry>)

    list of scraper entries

Returns:

  • (Array<Entry>)

    deduplicated list of scraper entries



# File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 23

def call(entries)
  destination_groups(entries).filter_map do |group|
    collapsed_group = collapse_nested_destination_group(group)
    collapsed_group.reduce do |best, entry|
      stronger_entry?(entry, best) ? entry : best
    end
  end
end
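
The group-then-reduce shape of `#call` can be sketched in isolation. This example assumes a hypothetical `PipelineEntry` struct, groups with `group_by` on a made-up `destination` field, and compares only `final_score`; the private helpers `destination_groups` and `collapse_nested_destination_group` are not shown in this page, so their real behavior is not reproduced here.

```ruby
# Hypothetical entry shape for illustration only.
PipelineEntry = Struct.new(:destination, :final_score)

entries = [
  PipelineEntry.new("/a", 10),
  PipelineEntry.new("/b", 5),
  PipelineEntry.new("/a", 30)
]

# Assumption: grouping by a destination key stands in for destination_groups.
groups = entries.group_by(&:destination).values

# Keep the best entry per group; reduce without an initial value returns nil
# for an empty group, and filter_map drops that nil.
winners = groups.filter_map do |group|
  group.reduce { |best, entry| entry.final_score > best.final_score ? entry : best }
end

summary = winners.map { |e| [e.destination, e.final_score] }
empty_result = [[]].filter_map { |group| group.reduce { |best, _entry| best } }

p summary
p empty_result
```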

#stronger_entry?(left, right) ⇒ Boolean

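Determines whether left outranks right. The entries are compared by final score, then by quality score. If both are tied, their articles are materialized: when either article is missing, left wins only if right's article is missing. Otherwise the richer payload wins, and a remaining tie goes to the entry with the lower position.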

Parameters:

  • left (Entry)

    left entry

  • right (Entry)

    right entry

Returns:

  • (Boolean)

    true if left is stronger than right



# File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 51

def stronger_entry?(left, right) # rubocop:disable Metrics/AbcSize
  final_delta = left.final_score <=> right.final_score
  return final_delta.positive? unless final_delta.zero?

  quality_delta = left.quality_score <=> right.quality_score
  return quality_delta.positive? unless quality_delta.zero?

  left_article = article_for(left)
  right_article = article_for(right)
  return !right_article if left_article.nil? || right_article.nil?

  richness_delta = payload_richness_signature(left_article) <=> payload_richness_signature(right_article)
  richness_delta.zero? ? left.position < right.position : richness_delta.positive?
end
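
The precedence cascade in `stronger_entry?` can be illustrated with a simplified sketch. It assumes a hypothetical `RankedEntry` struct whose `richness` field is a plain integer standing in for the result of `payload_richness_signature` (whose actual return type is not shown here), and it omits the missing-article branch.

```ruby
# Hypothetical entry shape; richness is an assumed stand-in for the
# payload richness signature.
RankedEntry = Struct.new(:final_score, :quality_score, :richness, :position)

# Returns true when left outranks right. Criteria are checked in order and
# the first non-zero comparison decides; position breaks a full tie.
def stronger?(left, right)
  criteria = %i[final_score quality_score richness]
  criteria.each do |criterion|
    delta = left.public_send(criterion) <=> right.public_send(criterion)
    return delta.positive? unless delta.zero?
  end
  left.position < right.position
end

high_final = RankedEntry.new(10, 1, 1, 5)
high_rest  = RankedEntry.new(9, 9, 9, 0)
tied_late  = RankedEntry.new(10, 2, 3, 3)
tied_early = RankedEntry.new(10, 2, 3, 1)

puts stronger?(high_final, high_rest) # final score decides before anything else
puts stronger?(tied_late, tied_early) # full tie: the earlier position wins
```

Because each criterion only matters when all earlier ones are tied, a higher final score always outweighs any advantage in quality score or richness.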