Class: Html2rss::AutoSource::Scraper::SemanticHtml::Deduplicator
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::SemanticHtml::Deduplicator
- Defined in: lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb
Overview
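Collapses nested containers and deduplicates entries that point to the same destination. When several entries compete for one destination, the winner is chosen by final score, then quality score, then the payload richness of the extracted article, and finally by the lower position.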
Instance Method Summary
- #article_for(entry) ⇒ Hash?
  Returns the materialized article hash for the entry, using the cache.
- #call(entries) ⇒ Array<Entry>
  Collapses and deduplicates the given entries.
- #initialize(url, extractor) ⇒ Deduplicator (constructor)
  A new instance of Deduplicator.
- #stronger_entry?(left, right) ⇒ Boolean
  Compares two entries to determine which is stronger.
Constructor Details
#initialize(url, extractor) ⇒ Deduplicator
Returns a new instance of Deduplicator.
    # File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 13
    def initialize(url, extractor)
      @url = url
      @extractor = extractor
      @article_cache = {}.compare_by_identity
    end
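The constructor makes `@article_cache` an identity-keyed hash. The following is a minimal sketch of what `compare_by_identity` changes for lookups. `IdentityDemoEntry` is a hypothetical stand-in used only for illustration and is not part of html2rss.

```ruby
# Hypothetical stand-in for an entry; not the library's Entry class.
IdentityDemoEntry = Struct.new(:href)

first  = IdentityDemoEntry.new('/post')
second = IdentityDemoEntry.new('/post') # equal by value, but a distinct object

by_value    = {}
by_identity = {}.compare_by_identity

by_value[first]    = :cached
by_identity[first] = :cached

# A default Hash looks keys up with #hash and #eql?, so the equal struct matches.
value_hit = by_value.key?(second)

# An identity-keyed Hash matches only the very same object.
identity_hit  = by_identity.key?(second)
identity_self = by_identity.key?(first)
```

With an identity-keyed cache, two entries that happen to compare equal never share a cached article.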
Instance Method Details
#article_for(entry) ⇒ Hash?
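Returns the materialized article hash for the entry. If the entry already carries an article, that article is returned directly; otherwise the article is extracted once and cached by entry identity.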
    # File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 36
    def article_for(entry)
      return entry.article if entry.article

      @article_cache.fetch(entry) do
        @article_cache[entry] = @extractor.new(
          entry.container,
          base_url: @url,
          selected_anchor: entry.selected_anchor
        ).call
      end
    end
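The caching pattern in `article_for` can be tried in isolation. The sketch below copies the method body into a hypothetical `ArticleCacheSketch` class and pairs it with a hypothetical `CountingExtractor` and `CacheDemoEntry`; none of these names exist in html2rss, and the extractor only mimics the constructor signature visible in the listing.

```ruby
# Hypothetical entry and extractor; names are illustrative, not html2rss API.
CacheDemoEntry = Struct.new(:container, :selected_anchor, :article)

class CountingExtractor
  class << self
    attr_accessor :calls
  end
  self.calls = 0

  def initialize(container, base_url:, selected_anchor:)
    @container = container
    @base_url = base_url
    @selected_anchor = selected_anchor
  end

  # Returns nil for an empty container to mimic a failed extraction.
  def call
    self.class.calls += 1
    return nil if @container.empty?

    { title: @container, url: "#{@base_url}#{@selected_anchor}" }
  end
end

class ArticleCacheSketch
  def initialize(url, extractor)
    @url = url
    @extractor = extractor
    @article_cache = {}.compare_by_identity
  end

  def article_for(entry)
    return entry.article if entry.article

    # Hash#fetch runs the block only when the key is absent, so a stored nil
    # result is served from the cache as well.
    @article_cache.fetch(entry) do
      @article_cache[entry] = @extractor.new(
        entry.container, base_url: @url, selected_anchor: entry.selected_anchor
      ).call
    end
  end
end

cache = ArticleCacheSketch.new('https://example.com', CountingExtractor)
entry = CacheDemoEntry.new('Hello', '/hello', nil)
empty = CacheDemoEntry.new('', '/none', nil)

first_result  = cache.article_for(entry)
second_result = cache.article_for(entry) # served from the cache
cache.article_for(empty)
cache.article_for(empty)                 # the nil result is cached too
```

Using `fetch` with a block rather than `||=` means an extraction that yields `nil` is not repeated on the next lookup.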
#call(entries) ⇒ Array<Entry>
Groups the given entries by destination, collapses nested containers within each group, and keeps the strongest entry of each group.
    # File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 23
    def call(entries)
      destination_groups(entries).filter_map do |group|
        collapsed_group = collapse_nested_destination_group(group)
        collapsed_group.reduce do |best, entry|
          stronger_entry?(entry, best) ? entry : best
        end
      end
    end
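The group-then-reduce shape of `call` can be sketched without the library. The helpers `destination_groups` and `collapse_nested_destination_group` are not shown on this page, so the sketch assumes grouping by an exact `url` field and omits nested-container collapsing entirely. `GroupDemoEntry` and `DedupeSketch` are hypothetical names, and the comparison is a simplified stand-in for `stronger_entry?`.

```ruby
# Hypothetical entry shape; the real Entry has more fields.
GroupDemoEntry = Struct.new(:url, :final_score, :position)

module DedupeSketch
  module_function

  # Simplified comparison: higher score wins, earlier position breaks ties.
  def stronger?(left, right)
    delta = left.final_score <=> right.final_score
    delta.zero? ? left.position < right.position : delta.positive?
  end

  # Assumption: grouping is by exact URL; nested-container collapsing is omitted.
  def call(entries)
    entries.group_by(&:url).values.filter_map do |group|
      # reduce without an initial value returns nil for an empty group,
      # and filter_map then drops that group.
      group.reduce { |best, entry| stronger?(entry, best) ? entry : best }
    end
  end
end

teaser = GroupDemoEntry.new('/a', 1, 0)
card   = GroupDemoEntry.new('/a', 3, 1)
other  = GroupDemoEntry.new('/b', 2, 2)

winners = DedupeSketch.call([teaser, card, other])
```

Here `teaser` and `card` share a destination, so only the higher-scoring `card` survives alongside `other`.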
#stronger_entry?(left, right) ⇒ Boolean
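Returns true when left is stronger than right. Entries are compared by final score, then by quality score, then by the payload richness of their materialized articles; any remaining tie goes to the entry with the lower position. If either article is missing, left counts as stronger exactly when right has no article.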
    # File 'lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb', line 51
    def stronger_entry?(left, right) # rubocop:disable Metrics/AbcSize
      final_delta = left.final_score <=> right.final_score
      return final_delta.positive? unless final_delta.zero?

      quality_delta = left.quality_score <=> right.quality_score
      return quality_delta.positive? unless quality_delta.zero?

      left_article = article_for(left)
      right_article = article_for(right)
      return !right_article if left_article.nil? || right_article.nil?

      richness_delta = payload_richness_signature(left_article) <=> payload_richness_signature(right_article)
      richness_delta.zero? ? left.position < right.position : richness_delta.positive?
    end
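One way to read the precedence in `stronger_entry?` is as a lexicographic comparison of sort keys. The sketch below is an illustration under stated assumptions, not the library's implementation: `RankDemoEntry` and `RankSketch` are hypothetical, richness is treated as a precomputed integer (the real value comes from the private `payload_richness_signature`, which is not shown on this page), and the missing-article case is left out.

```ruby
# Hypothetical entry with a precomputed richness value; in the source the
# richness is derived from the extracted article.
RankDemoEntry = Struct.new(:final_score, :quality_score, :richness, :position)

module RankSketch
  module_function

  # Lexicographic key: earlier elements dominate later ones. Position is
  # negated so that a lower position yields a larger key.
  def key(entry)
    [entry.final_score, entry.quality_score, entry.richness, -entry.position]
  end

  # True only when left ranks strictly above right.
  def stronger?(left, right)
    (key(left) <=> key(right)).positive?
  end
end

high_final   = RankDemoEntry.new(5, 0, 0, 9)
high_quality = RankDemoEntry.new(4, 9, 9, 0)
rich         = RankDemoEntry.new(4, 9, 10, 3)
early        = RankDemoEntry.new(4, 9, 10, 1)

final_beats_rest   = RankSketch.stronger?(high_final, high_quality)
richness_breaks    = RankSketch.stronger?(rich, high_quality)
position_breaks    = RankSketch.stronger?(early, rich)
later_position     = RankSketch.stronger?(rich, early)
identical_is_false = RankSketch.stronger?(early, early)
```

Unlike this eager sketch, the listed method compares step by step and only calls `article_for` once both scores tie, so articles are extracted only when they are needed to break a tie.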