Class: Html2rss::AutoSource::Cleanup

Inherits:
Object
Defined in:
lib/html2rss/auto_source/cleanup.rb

Overview

Cleanup cleans up the extracted article list. It drops invalid entries, deduplicates by URL, and filters the remaining articles by URL scheme, host, and title length.

Constant Summary

DEFAULT_CONFIG =

Default cleanup behavior for auto-sourced article lists.

{
  keep_different_domain: false,
  min_words_title: 3
}.freeze

VALID_SCHEMES =

Allowed URL schemes for article filtering.

%w[http https].to_set.freeze

Class Method Details

.call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Array<Article>

Returns cleaned article list.

Parameters:

  • articles (Array<Article>)

    extracted article candidates

  • url (Html2rss::Url)

    feed source URL used for same-host filtering

  • keep_different_domain (Boolean)

    whether to keep off-domain entries

  • min_words_title (Integer)

    minimum word count for title filtering

Returns:

  • (Array<Article>)

    cleaned article list



# File 'lib/html2rss/auto_source/cleanup.rb', line 25

def call(articles, url:, keep_different_domain:, min_words_title:)
  Log.debug "Cleanup: start with #{articles.size} articles"

  articles.select!(&:valid?)

  deduplicate_by!(articles, :url)

  keep_only_http_urls!(articles)
  reject_different_domain!(articles, url) unless keep_different_domain
  keep_only_with_min_words_title!(articles, min_words_title:)

  Log.debug "Cleanup: end with #{articles.size} articles"
  articles
end
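The order of the steps above can be sketched in a self-contained form. This is a simplified, non-mutating approximation, not the library's code: `PipelineArticle`, `PIPELINE_DEFAULTS`, `PIPELINE_SCHEMES`, and `sketch_cleanup` are hypothetical names, `URI` stands in for `Html2rss::Url`, and both the `valid?` rule and the whitespace-based word count are assumptions.

```ruby
require "set"
require "uri"

# Hypothetical stand-in for the real article class.
PipelineArticle = Struct.new(:url, :title) do
  # Assumption: the real Article#valid? is not shown on this page;
  # here an article counts as valid when it has a URL.
  def valid?
    !url.nil?
  end
end

# Same keys and values as DEFAULT_CONFIG, so they can be splatted as keywords.
PIPELINE_DEFAULTS = { keep_different_domain: false, min_words_title: 3 }.freeze
PIPELINE_SCHEMES = %w[http https].to_set.freeze

# Non-mutating approximation of the filter order used by .call.
def sketch_cleanup(articles, url:, keep_different_domain:, min_words_title:)
  kept = articles.select(&:valid?)
  kept = kept.uniq { |article| article.url.to_s } # first occurrence wins
  kept = kept.select { |article| PIPELINE_SCHEMES.include?(article.url.scheme) }
  kept = kept.select { |article| article.url.host == url.host } unless keep_different_domain
  # Assumption: words are counted by splitting on whitespace.
  kept.select { |article| article.title.nil? || article.title.split.size >= min_words_title }
end

base = URI("https://example.com/")
articles = [
  PipelineArticle.new(URI("https://example.com/a"), "A headline long enough"),
  PipelineArticle.new(URI("https://example.com/a"), "A headline long enough again"),
  PipelineArticle.new(URI("https://other.org/x"), "Another long enough headline"),
  PipelineArticle.new(URI("https://example.com/b"), "Too short"),
  PipelineArticle.new(nil, "No URL at all here")
]

p sketch_cleanup(articles, url: base, **PIPELINE_DEFAULTS).map(&:title)
# => ["A headline long enough"]
```

Passing `keep_different_domain: true` in this sketch would additionally keep the `other.org` entry, because the host filter is skipped.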

.deduplicate_by!(articles, key) ⇒ Array<Article>

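Deduplicates articles by a given key, keeping the first article seen for each value. Articles whose value for the key is nil are removed.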

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • key (Symbol)

    The key to deduplicate by.

Returns:

  • (Array<Article>, nil)

    the mutated articles array, or nil if no article was removed (Array#reject! returns nil when it makes no changes)



# File 'lib/html2rss/auto_source/cleanup.rb', line 46

def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end
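A small sketch of how this method behaves on sample data. The method body is copied from the listing above into a top-level method; `DedupArticle` is a hypothetical stand-in for the real article class, and plain strings are used as URLs for brevity.

```ruby
# Hypothetical stand-in for the real article class.
DedupArticle = Struct.new(:url, :title)

# Copied from the listing above so the example runs on its own.
def deduplicate_by!(articles, key)
  seen = {}
  articles.reject! do |article|
    value = article.public_send(key)
    # Reject nil values outright; otherwise reject when the value was seen
    # before, and record it as seen either way.
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
end

articles = [
  DedupArticle.new("https://example.com/a", "First"),
  DedupArticle.new("https://example.com/a", "Duplicate of first"),
  DedupArticle.new(nil, "No URL"),
  DedupArticle.new("https://example.com/b", "Second")
]

deduplicate_by!(articles, :url)
p articles.map(&:title) # => ["First", "Second"]

# A second pass removes nothing, so Array#reject! returns nil.
p deduplicate_by!(articles, :url) # => nil
```

Because of the nil return on a no-op pass, callers should rely on the mutated array rather than on the return value, which is what `.call` does.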

.keep_only_http_urls!(articles) ⇒ Array<Article>

Keeps only articles with HTTP or HTTPS URLs.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

Returns:

  • (Array<Article>, nil)

    the mutated articles array, or nil if no article was removed (Array#select! returns nil when it makes no changes)



# File 'lib/html2rss/auto_source/cleanup.rb', line 59

def keep_only_http_urls!(articles)
  articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
end
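The scheme filter can be tried in isolation with the sketch below. It assumes that an article's `url` responds to `scheme` the way Ruby's `URI` objects do. `SchemeArticle` and `SKETCH_VALID_SCHEMES` are hypothetical names used only for this example.

```ruby
require "set"
require "uri"

# Hypothetical stand-in for the real article class.
SchemeArticle = Struct.new(:url, :title)

# Mirrors VALID_SCHEMES from the constant summary.
SKETCH_VALID_SCHEMES = %w[http https].to_set.freeze

# Same logic as the listing above, using the sketch constant.
def keep_only_http_urls!(articles)
  articles.select! { |article| SKETCH_VALID_SCHEMES.include?(article.url&.scheme) }
end

articles = [
  SchemeArticle.new(URI("https://example.com/post"), "Kept"),
  SchemeArticle.new(URI("mailto:editor@example.com"), "Dropped: mailto"),
  SchemeArticle.new(URI("ftp://example.com/file"), "Dropped: ftp"),
  SchemeArticle.new(nil, "Dropped: no URL")
]

keep_only_http_urls!(articles)
p articles.map(&:title) # => ["Kept"]
```

The safe navigation operator turns a missing URL into a nil scheme, which is not in the set, so articles without a URL are filtered out as well.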

.keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Array<Article>

Keeps only articles whose title has at least `min_words_title` words. Articles without a title (nil) are kept.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • min_words_title (Integer)

    The minimum number of words in the title.

Returns:

  • (Array<Article>, nil)

    the mutated articles array, or nil if no article was removed (Array#select! returns nil when it makes no changes)



# File 'lib/html2rss/auto_source/cleanup.rb', line 80

def keep_only_with_min_words_title!(articles, min_words_title:)
  articles.select! do |article|
    article.title ? word_count_at_least?(article.title, min_words_title) : true
  end
end
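The title filter depends on a `word_count_at_least?` helper that is not shown on this page. The sketch below substitutes a hypothetical version that splits on whitespace, so results for unusual titles may differ from the real helper. `TitledArticle` is likewise a stand-in for the real article class.

```ruby
# Hypothetical stand-in for the real article class.
TitledArticle = Struct.new(:url, :title)

# Assumption: the real helper is not documented here; this version simply
# counts whitespace-separated tokens.
def word_count_at_least?(text, min_words)
  text.split.size >= min_words
end

# Same logic as the listing above.
def keep_only_with_min_words_title!(articles, min_words_title:)
  articles.select! do |article|
    article.title ? word_count_at_least?(article.title, min_words_title) : true
  end
end

articles = [
  TitledArticle.new("https://example.com/a", "Ruby 3.4 released today"),
  TitledArticle.new("https://example.com/b", "Breaking news"),
  TitledArticle.new("https://example.com/c", ""),
  TitledArticle.new("https://example.com/d", nil)
]

keep_only_with_min_words_title!(articles, min_words_title: 3)
p articles.map(&:title) # => ["Ruby 3.4 released today", nil]
```

Note the asymmetry: an empty string is truthy in Ruby, so it is counted (zero words) and dropped, while a nil title skips the check and is kept.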

.reject_different_domain!(articles, base_url) ⇒ Array<Article>

Rejects articles whose URL host differs from the host of the source URL. Articles without a URL are rejected as well.

Parameters:

  • articles (Array<Article>)

    The list of articles to process.

  • base_url (Html2rss::Url)

    The source URL to compare against.

Returns:

  • (Array<Article>, nil)

    the mutated articles array, or nil if no article was removed (Array#select! returns nil when it makes no changes)



# File 'lib/html2rss/auto_source/cleanup.rb', line 69

def reject_different_domain!(articles, base_url)
  base_host = base_url.host
  articles.select! { |article| article.url&.host == base_host }
end
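The host comparison above is an exact equality check, which the following sketch illustrates. It assumes that `host` on the URL objects behaves like `URI#host` in Ruby's standard library; whether `Html2rss::Url#host` normalizes hosts differently is not shown on this page. `HostArticle` is a hypothetical stand-in for the real article class.

```ruby
require "uri"

# Hypothetical stand-in for the real article class.
HostArticle = Struct.new(:url, :title)

# Same logic as the listing above.
def reject_different_domain!(articles, base_url)
  base_host = base_url.host
  articles.select! { |article| article.url&.host == base_host }
end

base = URI("https://example.com/news")
articles = [
  HostArticle.new(URI("https://example.com/a"), "Same host"),
  HostArticle.new(URI("https://www.example.com/b"), "Subdomain"),
  HostArticle.new(URI("https://other.org/c"), "Other site"),
  HostArticle.new(nil, "No URL")
]

reject_different_domain!(articles, base)
p articles.map(&:title) # => ["Same host"]
```

Under this assumption, `www.example.com` and `example.com` count as different hosts, so feeds that link to a subdomain may need `keep_different_domain: true`.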