Class: Html2rss::AutoSource::Cleanup
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Cleanup
- Defined in: lib/html2rss/auto_source/cleanup.rb
Overview
Cleanup cleans up the extracted articles. It applies several strategies to filter and refine the article list.
Constant Summary
- DEFAULT_CONFIG = { keep_different_domain: false, min_words_title: 3 }.freeze
  Default cleanup behavior for auto-sourced article lists.
- VALID_SCHEMES = %w[http https].to_set.freeze
  Allowed URL schemes for article filtering.
Class Method Summary
- .call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Array<Article>
  Cleaned article list.
- .deduplicate_by!(articles, key) ⇒ Array<Article>
  Deduplicates articles by a given key.
- .keep_only_http_urls!(articles) ⇒ Array<Article>
  Keeps only articles with HTTP or HTTPS URLs.
- .keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Array<Article>
  Keeps only articles with a title that is present and has at least `min_words_title` words.
- .reject_different_domain!(articles, base_url) ⇒ Array<Article>
  Rejects articles that have a URL not on the same domain as the source.
Class Method Details
.call(articles, url:, keep_different_domain:, min_words_title:) ⇒ Array<Article>
Returns cleaned article list.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 25
    def call(articles, url:, keep_different_domain:, min_words_title:)
      Log.debug "Cleanup: start with #{articles.size} articles"
      articles.select!(&:valid?)
      deduplicate_by!(articles, :url)
      keep_only_http_urls!(articles)
      reject_different_domain!(articles, url) unless keep_different_domain
      keep_only_with_min_words_title!(articles, min_words_title:)
      Log.debug "Cleanup: end with #{articles.size} articles"
      articles
    end
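The order of the filtering steps in .call can be illustrated with a small standalone sketch. It does not use the library: `DemoArticle` and `demo_cleanup` are hypothetical names, the `valid?` rule (an article is valid when it has a URL) is an assumption, and URLs are assumed to be stdlib `URI` objects. Unlike the real method, the sketch returns a new array instead of mutating its argument.

```ruby
require 'uri'

# Hypothetical stand-in for the library's Article; only the fields the
# cleanup steps read are modeled here.
DemoArticle = Struct.new(:title, :url, keyword_init: true) do
  # Assumption: an article counts as valid when it has a URL.
  def valid?
    !url.nil?
  end
end

# Applies the same steps, in the same order, as .call above.
def demo_cleanup(articles, url:, keep_different_domain:, min_words_title:)
  articles = articles.select(&:valid?)
  articles = articles.uniq(&:url) # deduplicate by URL
  articles = articles.select { |a| %w[http https].include?(a.url.scheme) }
  articles = articles.select { |a| a.url.host == url.host } unless keep_different_domain
  # Articles without a title pass this step, mirroring the source.
  articles.select { |a| a.title.nil? || a.title.split.size >= min_words_title }
end

articles = [
  DemoArticle.new(title: 'A long enough title', url: URI('https://example.com/a')),
  DemoArticle.new(title: 'A long enough title', url: URI('https://example.com/a')),
  DemoArticle.new(title: 'Short', url: URI('https://example.com/b')),
  DemoArticle.new(title: 'Another long enough title', url: URI('https://other.test/c'))
]

kept = demo_cleanup(articles, url: URI('https://example.com/'),
                              keep_different_domain: false, min_words_title: 3)
puts kept.map { |a| a.url.to_s }
```

Only the first article survives here: the duplicate is dropped, the one-word title fails the word count, and the last URL is on a different host.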
.deduplicate_by!(articles, key) ⇒ Array<Article>
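Deduplicates articles by a given key, keeping the first article seen for each value. Articles whose value for the key is nil are rejected as well.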
    # File 'lib/html2rss/auto_source/cleanup.rb', line 46
    def deduplicate_by!(articles, key)
      seen = {}
      articles.reject! do |article|
        value = article.public_send(key)
        value.nil? || seen.key?(value).tap { seen[value] = true }
      end
    end
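The `reject!` plus `tap` idiom above is compact, so a standalone sketch on plain hashes may help. `dedupe_by_key!` is a hypothetical name and the items are hashes rather than articles. Note that `Array#reject!` returns nil when nothing was removed, which is why this sketch returns the array explicitly.

```ruby
# Same reject!/tap idiom as deduplicate_by!, applied to plain hashes.
def dedupe_by_key!(items, key)
  seen = {}
  items.reject! do |item|
    value = item[key]
    # seen.key?(value) is evaluated first; tap then records the value and
    # passes the earlier true/false result through unchanged.
    value.nil? || seen.key?(value).tap { seen[value] = true }
  end
  items # reject! returns nil when nothing was removed
end

items = [{ id: 1 }, { id: 2 }, { id: 1 }, { id: nil }, { id: 2 }]
p dedupe_by_key!(items, :id)
```

The first occurrence of each value is kept, later duplicates are removed, and entries with a nil value are removed too.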
.keep_only_http_urls!(articles) ⇒ Array<Article>
Keeps only articles with HTTP or HTTPS URLs.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 59
    def keep_only_http_urls!(articles)
      articles.select! { |article| VALID_SCHEMES.include?(article.url&.scheme) }
    end
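As a rough illustration of the scheme check, the following sketch applies the same set lookup to stdlib `URI` objects. `ALLOWED_SCHEMES` and `http_url?` are hypothetical names, and the use of stdlib `URI` is an assumption; the library may use its own URL type.

```ruby
require 'set'
require 'uri'

# Same contents as VALID_SCHEMES above, under a hypothetical name.
ALLOWED_SCHEMES = %w[http https].to_set.freeze

# True only for URLs whose scheme is http or https. The safe navigation
# operator makes a nil URL yield a nil scheme, which the set does not contain.
def http_url?(url)
  ALLOWED_SCHEMES.include?(url&.scheme)
end

urls = [
  URI('https://example.com/post'),
  URI('mailto:someone@example.com'),
  URI('/relative/path'),
  nil
]
p urls.map { |url| http_url?(url) }
```

Relative URLs have no scheme, so under this check they are filtered out along with `mailto:` links and missing URLs.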
.keep_only_with_min_words_title!(articles, min_words_title:) ⇒ Array<Article>
Keeps only articles whose title has at least `min_words_title` words. As the source below shows, articles without a title are kept rather than rejected.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 80
    def keep_only_with_min_words_title!(articles, min_words_title:)
      articles.select! do |article|
        article.title ? word_count_at_least?(article.title, min_words_title) : true
      end
    end
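The helper `word_count_at_least?` is not shown on this page. The sketch below assumes it counts whitespace-separated words; `enough_words?` and `keep_title?` are hypothetical names used only to illustrate how the ternary treats missing, empty, and short titles.

```ruby
# Hypothetical helper: the real word_count_at_least? is not shown here.
# Assumption: words are whitespace-separated tokens.
def enough_words?(title, min_words)
  title.split.size >= min_words
end

# Mirrors the ternary in keep_only_with_min_words_title!: a nil title
# passes, any other title must meet the word count.
def keep_title?(title, min_words_title)
  title ? enough_words?(title, min_words_title) : true
end

['Breaking news today', 'Home', '', nil].each do |title|
  puts "#{title.inspect}: #{keep_title?(title, 3)}"
end
```

An empty string is truthy in Ruby, so it goes through the word count and is rejected, while nil skips the check and is kept.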
.reject_different_domain!(articles, base_url) ⇒ Array<Article>
Rejects articles whose URL host differs from the host of the source URL. Articles without a URL are rejected as well.
    # File 'lib/html2rss/auto_source/cleanup.rb', line 69
    def reject_different_domain!(articles, base_url)
      base_host = base_url.host
      articles.select! { |article| article.url&.host == base_host }
    end
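Because the comparison above is an exact match on `host`, a subdomain counts as a different domain. A minimal sketch, assuming stdlib `URI` objects and using the hypothetical name `same_host?`:

```ruby
require 'uri'

# Exact host comparison, as in reject_different_domain!.
# A nil URL never matches because nil is not equal to any host string.
def same_host?(article_url, base_url)
  article_url&.host == base_url.host
end

base = URI('https://example.com/')
[
  URI('https://example.com/news/1'),
  URI('https://www.example.com/news/1'),
  URI('https://other.test/news/1'),
  nil
].each do |url|
  puts "#{url.inspect}: #{same_host?(url, base)}"
end
```

Under this assumption, a feed whose source URL is `example.com` would lose links that point to `www.example.com` unless `keep_different_domain` is enabled.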