Class: Html2rss::AutoSource

Inherits:
Object
Defined in:
lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The AutoSource class automatically extracts articles from a given URL using a collection of Scrapers. These scrapers analyze and parse popular structured data formats, such as Schema.org, microdata, and Open Graph, to identify article elements and compile them into unified articles.

Scrapers supporting plain HTML are also available for sites without structured data, though results may vary based on page markup.
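As an illustration of how format detection can work, the following sketch probes raw HTML for telltale markers of each structured-data format. This is a simplified, hypothetical example for orientation only; the real scrapers parse the DOM rather than matching strings:

```ruby
# Simplified sketch: guess which structured-data formats a page appears to use.
# Hypothetical helper, not part of Html2rss::AutoSource's API.
def detect_formats(html)
  formats = []
  formats << :schema        if html.include?('application/ld+json') # JSON-LD script tag
  formats << :microdata     if html.include?('itemscope')           # microdata attribute
  formats << :semantic_html if html.match?(/<article\b/i)           # semantic HTML element
  formats
end

html = <<~HTML
  <html><body>
    <script type="application/ld+json">{"@type":"ItemList"}</script>
    <article itemscope><h2><a href="/post">Post</a></h2></article>
  </body></html>
HTML

detect_formats(html) # => [:schema, :microdata, :semantic_html]
```

A page matching several formats is normal; the library runs every applicable scraper and merges the results.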

Defined Under Namespace

Modules: Scraper
Classes: Cleanup

Constant Summary

DEFAULT_CONFIG =

Default auto-source configuration for scraper and cleanup behavior.

{
  scraper: {
    wordpress_api: {
      enabled: true
    },
    schema: {
      enabled: true
    },
    microdata: {
      enabled: true
    },
    json_state: {
      enabled: true
    },
    semantic_html: {
      enabled: true
    },
    html: {
      enabled: true,
      minimum_selector_frequency: Scraper::Html::DEFAULT_MINIMUM_SELECTOR_FREQUENCY,
      use_top_selectors: Scraper::Html::DEFAULT_USE_TOP_SELECTORS
    }
  },
  cleanup: Cleanup::DEFAULT_CONFIG
}.freeze
Config =

Runtime schema used to validate auto-source config values.

Dry::Schema.Params do
  optional(:scraper).hash(&SCRAPER_CONFIG)

  optional(:cleanup).hash do
    optional(:keep_different_domain).filled(:bool)
    optional(:min_words_title).filled(:integer, gt?: 0)
  end
end
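Because `DEFAULT_CONFIG` is a nested, frozen hash, user-supplied options are typically layered over it rather than replacing it wholesale. A stdlib-only sketch of such a deep merge (a hypothetical helper, not part of the library's API):

```ruby
# Hypothetical deep_merge helper: override values win over defaults,
# and nested hashes are merged recursively instead of being replaced.
def deep_merge(defaults, overrides)
  defaults.merge(overrides) do |_key, default_value, override_value|
    if default_value.is_a?(Hash) && override_value.is_a?(Hash)
      deep_merge(default_value, override_value)
    else
      override_value
    end
  end
end

defaults  = { scraper: { html: { enabled: true }, schema: { enabled: true } } }
user_opts = { scraper: { html: { enabled: false } } }

deep_merge(defaults, user_opts)
# => { scraper: { html: { enabled: false }, schema: { enabled: true } } }
```

Note that the merge never mutates `defaults`, which matters here since `DEFAULT_CONFIG` is frozen.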

Instance Method Summary

Constructor Details

#initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void

Parameters:

Options Hash (opts):

  • :scraper (Hash)

    scraper configuration map

  • :cleanup (Hash)

    cleanup configuration map



# File 'lib/html2rss/auto_source.rb', line 88

def initialize(response, opts = DEFAULT_CONFIG, request_session: nil)
  @parsed_body = response.parsed_body
  @url = response.url
  @opts = opts
  @request_session = request_session
end

Instance Method Details

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into `RssBuilder::Article` objects.

The contributor-facing flow is:

  1. choose scraper instances that match the page

  2. let each scraper collect its own candidates

  3. clean and deduplicate the merged article list

Scrapers with expensive precomputation, such as `SemanticHtml`, keep that state on the instance so detection and extraction can reuse the same work.
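The three-step flow above can be sketched in plain Ruby. The scraper objects below are hypothetical stand-ins for the real scraper classes, each exposing an applicability check and a scrape step:

```ruby
# Sketch of the articles pipeline: pick matching scrapers, run each,
# then deduplicate the merged candidate list by URL.
Candidate = Struct.new(:url, :title)

scrapers = [
  { applicable: ->(html) { html.include?('itemscope') },
    scrape:     ->(_html) { [Candidate.new('/a', 'A'), Candidate.new('/b', 'B')] } },
  { applicable: ->(html) { html.include?('<article') },
    scrape:     ->(_html) { [Candidate.new('/b', 'B'), Candidate.new('/c', 'C')] } }
]

html = '<article itemscope></article>'

candidates = scrapers
             .select { |s| s[:applicable].call(html) }  # 1. choose matching scrapers
             .flat_map { |s| s[:scrape].call(html) }    # 2. each collects candidates
             .uniq(&:url)                               # 3. deduplicate merged list

candidates.map(&:url) # => ["/a", "/b", "/c"]
```

The actual cleanup step (`Cleanup`) applies more than URL deduplication, e.g. the `keep_different_domain` and `min_words_title` options shown in the `Config` schema above.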

Returns:

(Array<Html2rss::RssBuilder::Article>) the cleaned, deduplicated article candidates; an empty array when no scraper matched

# File 'lib/html2rss/auto_source.rb', line 109

def articles
  @articles ||= extract_articles
rescue Html2rss::AutoSource::Scraper::NoScraperFound => error
  Log.warn "#{self.class}: no scraper matched #{url} (#{error.message})"
  []
end