Class: Html2rss::Articles::Deduplicator

Inherits:
Object
Defined in:
lib/html2rss/articles/deduplicator.rb

Overview

Deduplicates a list of articles while preserving their original order.

The deduplicator prefers each article’s URL (combined with its ID when available) to determine uniqueness. When no URL is present, it falls back to the article ID, then to the GUID enriched with title and description metadata. If none of these identifiers is available, it falls back to the article object’s hash, which preserves the original entry.
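The fallback order described above can be sketched as follows. This is an illustration only: `SketchArticle` and `sketch_fingerprint` are hypothetical names, the `#` separator is an assumption, and the library's private `deduplication_fingerprint_for` may compose its fingerprints differently.

```ruby
# Hypothetical stand-in for an article; the real class is
# Html2rss::RssBuilder::Article and may expose different accessors.
SketchArticle = Struct.new(:url, :id, :guid, :title, :description, keyword_init: true)

# Assumed fallback order, following the overview text:
# URL (plus ID), then ID, then GUID with title/description, then Object#hash.
def sketch_fingerprint(article)
  url = article.url
  id = article.id
  guid = article.guid

  return [url.to_s, id].compact.join('#') if url
  return id.to_s if id
  return [guid, article.title, article.description].compact.join('#') if guid

  # No identifier at all: fall back to the object's own hash.
  article.hash
end

with_url  = SketchArticle.new(url: 'https://example.com/a', id: '1')
id_only   = SketchArticle.new(id: '1')
guid_only = SketchArticle.new(guid: 'g-1', title: 'Hello')
bare      = SketchArticle.new

puts sketch_fingerprint(with_url)   # URL combined with ID
puts sketch_fingerprint(id_only)    # ID alone
puts sketch_fingerprint(guid_only)  # GUID enriched with title
puts sketch_fingerprint(bare) == bare.hash
```

Note that under this sketch an article with a URL and an article with only the same ID yield different fingerprints, so they would not be treated as duplicates of each other.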

Instance Method Summary

Constructor Details

#initialize(articles) ⇒ Deduplicator

Returns a new instance of Deduplicator.

Parameters:

  • articles

    the collection of articles to deduplicate

Raises:

  • (ArgumentError)

    if articles is nil or false



# File 'lib/html2rss/articles/deduplicator.rb', line 20

def initialize(articles)
  raise ArgumentError, 'articles must be provided' unless articles

  @articles = articles
end
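
As a minimal sketch of what the guard accepts and rejects, the stand-in class below copies only the constructor shown above so that it runs without the html2rss gem. `GuardSketch` and `guard_outcome` are hypothetical names introduced for this example.

```ruby
# Stand-in that mirrors only the constructor from the listing above.
class GuardSketch
  def initialize(articles)
    raise ArgumentError, 'articles must be provided' unless articles

    @articles = articles
  end
end

# Returns :accepted when construction succeeds, otherwise the error message.
def guard_outcome(value)
  GuardSketch.new(value)
  :accepted
rescue ArgumentError => e
  e.message
end

puts guard_outcome(nil).inspect  # nil is rejected
puts guard_outcome([]).inspect   # an empty collection is truthy, so it passes
```

The guard tests truthiness only, so an empty array passes; it does not check that the argument is enumerable.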

Instance Method Details

#call ⇒ Array<Html2rss::RssBuilder::Article>

Returns the list of unique articles, preserving the order of the original collection and keeping the first occurrence of a duplicate.

Returns:

  • (Array<Html2rss::RssBuilder::Article>)



# File 'lib/html2rss/articles/deduplicator.rb', line 30

def call
  seen = Set.new

  articles.filter do |article|
    fingerprint = deduplication_fingerprint_for(article) || article.hash
    seen.add?(fingerprint)
  end
end
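
The listing relies on `Set#add?`, which returns the set itself for a new element and `nil` for one already present, so its result can serve directly as the filter predicate. The sketch below shows the same idiom on plain hashes. `first_occurrences` and the sample entries are hypothetical and not part of html2rss.

```ruby
require 'set'

# Hypothetical helper: keeps the first item for each key, in original order.
# The caller's block computes the key used for comparison.
def first_occurrences(items)
  seen = Set.new
  # Set#add? is truthy for a new key and nil for a repeat.
  items.filter { |item| seen.add?(yield(item)) }
end

entries = [
  { url: 'https://example.com/a', title: 'first' },
  { url: 'https://example.com/b', title: 'second' },
  { url: 'https://example.com/a', title: 'repeat' }
]

unique = first_occurrences(entries) { |entry| entry[:url] }
puts unique.map { |entry| entry[:title] }.inspect
```

Because `filter` walks the collection once and set lookups are constant time on average, this keeps the original order without comparing every pair of items.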