Class: Html2rss::AutoSource
- Inherits: Object
- Defined in:
- lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
The AutoSource class automatically extracts articles from a given URL by utilizing a collection of Scrapers. These scrapers analyze and parse popular structured data formats, such as schema.org, microdata, and Open Graph, to identify article elements and compile them into unified articles.
Scrapers supporting plain HTML are also available for sites without structured data, though results may vary based on page markup.
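A minimal sketch of that selection idea, assuming a hypothetical `articles?` / `each_article` scraper interface (the real scrapers live under `Html2rss::AutoSource::Scraper` and may differ): every scraper that recognizes the page contributes candidates, and duplicates are merged.

```ruby
# Sketch only: the class names echo the scrapers listed above, but the
# `articles?` / `each_article` interface here is an assumption for
# illustration, not the gem's verified API.
class SchemaScraper
  def self.articles?(body)
    body.include?('application/ld+json')
  end

  def initialize(body)
    @body = body
  end

  def each_article
    [{ title: 'From JSON-LD', url: 'https://example.com/a' }]
  end
end

class SemanticHtmlScraper
  def self.articles?(body)
    body.include?('<article')
  end

  def initialize(body)
    @body = body
  end

  def each_article
    [{ title: 'From <article> tags', url: 'https://example.com/a' }]
  end
end

SCRAPERS = [SchemaScraper, SemanticHtmlScraper].freeze

def extract_articles(body)
  matching = SCRAPERS.select { |scraper| scraper.articles?(body) }
  raise 'no scraper matched the page' if matching.empty?

  # Every matching scraper contributes candidates; duplicates (same URL)
  # collapse into one unified article.
  matching.flat_map { |scraper| scraper.new(body).each_article }
          .uniq { |article| article[:url] }
end
```

A page carrying both JSON-LD and `<article>` markup matches both scrapers, yet the shared URL deduplicates their candidates into a single article.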
Defined Under Namespace
Modules: Scraper
Classes: Cleanup
Constant Summary
- DEFAULT_CONFIG =
Default auto-source configuration shipped for scraper and cleanup behavior.
{
  scraper: {
    wordpress_api: { enabled: true },
    schema: { enabled: true },
    microdata: { enabled: true },
    json_state: { enabled: true },
    semantic_html: { enabled: true },
    html: {
      enabled: true,
      minimum_selector_frequency: Scraper::Html::DEFAULT_MINIMUM_SELECTOR_FREQUENCY,
      use_top_selectors: Scraper::Html::DEFAULT_USE_TOP_SELECTORS
    }
  },
  cleanup: Cleanup::DEFAULT_CONFIG
}.freeze
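Callers typically override only a few of these keys. A plain-Ruby sketch of deep-merging an override into a `DEFAULT_CONFIG`-shaped hash (how html2rss itself merges options is not shown here; the `DEFAULTS` hash below is a simplified stand-in):

```ruby
# A trimmed stand-in for DEFAULT_CONFIG, used to illustrate the merge.
DEFAULTS = {
  scraper: {
    schema:        { enabled: true },
    semantic_html: { enabled: true },
    html:          { enabled: true, minimum_selector_frequency: 2 }
  },
  cleanup: { keep_different_domain: false }
}.freeze

# Recursively merge nested hashes so an override replaces only the keys
# it names, keeping every sibling default intact.
def deep_merge(base, override)
  base.merge(override) do |_key, old_value, new_value|
    if old_value.is_a?(Hash) && new_value.is_a?(Hash)
      deep_merge(old_value, new_value)
    else
      new_value
    end
  end
end

# Disable only the HTML scraper; its other defaults survive the merge.
opts = deep_merge(DEFAULTS, { scraper: { html: { enabled: false } } })
```

A shallow `Hash#merge` would instead replace the whole `scraper:` sub-hash, silently dropping the other scrapers' settings, which is why the block form recurses into nested hashes.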
- Config =
Runtime schema used to validate auto-source config values.
Dry::Schema.Params do
  optional(:scraper).hash(&SCRAPER_CONFIG)
  optional(:cleanup).hash do
    optional(:keep_different_domain).filled(:bool)
    optional(:min_words_title).filled(:integer, gt?: 0)
  end
end
Instance Method Summary
- #articles ⇒ Array<Html2rss::RssBuilder::Article>
Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into `RssBuilder::Article` objects.
- #initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void constructor
Constructor Details
#initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void
# File 'lib/html2rss/auto_source.rb', line 88

def initialize(response, opts = DEFAULT_CONFIG, request_session: nil)
  @parsed_body = response.parsed_body
  @url = response.url
  @opts = opts
  @request_session = request_session
end
Instance Method Details
#articles ⇒ Array<Html2rss::RssBuilder::Article>
Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into `RssBuilder::Article` objects.
The contributor-facing flow is:
- choose scraper instances that match the page
- let each scraper collect its own candidates
- clean and deduplicate the merged article list
Scrapers with expensive precomputation, such as `SemanticHtml`, keep that state on the instance so detection and extraction can reuse the same work.
# File 'lib/html2rss/auto_source.rb', line 109

def articles
  @articles ||= extract_articles
rescue Html2rss::AutoSource::Scraper::NoScraperFound => error
  Log.warn "#{self.class}: no scraper matched #{url} (#{error.message})"
  []
end