Module: Html2rss::AutoSource::Scraper
- Defined in:
- lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement an `each` method that yields article hashes. Each scraper should also implement an `articles?` method that returns true if the scraper can potentially be used to extract articles from the given HTML.
Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction. Scrapers run in parallel threads, so implementations must avoid shared mutable state and degrade by returning no articles when a follow-up would be unsafe or unsupported.
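As a sketch of this contract, consider a hypothetical scraper. The class name and its crude string-based matching are invented for illustration (the real scrapers operate on a parsed document); only the `articles?`/`each` interface comes from this page:

```ruby
# Hypothetical scraper illustrating the documented contract:
# `.articles?` for cheap class-level detection, `#each` yielding Hashes.
class CommentLinkScraper
  # Class-level detection: intentionally shallow, per the overview above.
  def self.articles?(parsed_body)
    parsed_body.include?('<article')
  end

  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one Hash per article candidate; returns an Enumerator
  # when called without a block.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.scan(%r{<article[^>]*>(.*?)</article>}m) do |(inner)|
      yield({ title: inner.strip[0, 80], url: @url })
    end
  end
end
```

Because the scrapers run in parallel threads, an implementation like this keeps all state in per-instance variables and simply yields nothing when it finds no matches.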
Defined Under Namespace
Classes: Html, JsonState, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi
Constant Summary
- APP_SHELL_ROOT_SELECTORS =
  Root markers indicating likely app-shell/client-rendered surfaces.
  '#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'
- APP_SHELL_MAX_ANCHORS =
  Maximum anchors tolerated before app-shell detection is considered unlikely.
  2
- APP_SHELL_MAX_VISIBLE_TEXT_LENGTH =
  Maximum visible text length tolerated for app-shell classification.
  220
- SCRAPERS =
  Ordered scraper classes considered during auto-source extraction.
  [WordpressApi, Schema, Microdata, JsonState, SemanticHtml, Html].freeze
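The three APP_SHELL_* constants above suggest a combined heuristic: a root marker plus very few anchors plus almost no visible text. The sketch below is an assumption about how they could be combined, not the library's implementation; the predicate name is invented, and the regex checks stand in for real CSS matching against APP_SHELL_ROOT_SELECTORS:

```ruby
# Thresholds mirror the documented constants; the combination logic
# and predicate name are assumptions for illustration.
APP_SHELL_MAX_ANCHORS = 2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = 220

def likely_app_shell?(html)
  # Stand-in for matching APP_SHELL_ROOT_SELECTORS with a CSS engine.
  has_root_marker =
    html.match?(/id=["'](?:app|root|__next)["']|data-reactroot|ng-app|app-shell/)

  # Count anchor tags and measure the text left after stripping markup.
  anchor_count = html.scan(/<a[\s>]/).size
  visible_text = html.gsub(/<[^>]+>/, ' ').strip

  has_root_marker &&
    anchor_count <= APP_SHELL_MAX_ANCHORS &&
    visible_text.length <= APP_SHELL_MAX_VISIBLE_TEXT_LENGTH
end
```

A bare `<div id="root"></div>` shell would satisfy all three conditions, while a server-rendered page full of links and text fails the anchor or text-length bound.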
Class Method Summary
- .from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>
  Returns an array of scraper classes that claim to find articles in the parsed body.
- .instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>
  Returns scraper instances ready for extraction.
Class Method Details
.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>
Returns an array of scraper classes that claim to find articles in the parsed body.
# File 'lib/html2rss/auto_source/scraper.rb', line 82

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
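The two-stage filter in `.from` (enabled via opts, then `articles?`) can be illustrated with self-contained stubs. StubSchema, StubHtml, their `:schema`/`:html` option keys, and the detection rules are invented for this example; only the opts shape and filtering order mirror the code above:

```ruby
# Stub scrapers standing in for the real SCRAPERS list.
class StubSchema
  def self.options_key; :schema; end
  def self.articles?(parsed_body); parsed_body.include?('ld+json'); end
end

class StubHtml
  def self.options_key; :html; end
  def self.articles?(_parsed_body); true; end
end

STUB_SCRAPERS = [StubSchema, StubHtml].freeze

# Mirrors `.from`: first drop scrapers disabled in opts, then drop
# scrapers whose `articles?` rejects the body; raise if none remain.
def scrapers_from(parsed_body, opts)
  scrapers = STUB_SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise 'no scraper found' if scrapers.empty?

  scrapers
end
```

With both stubs enabled, a page containing JSON-LD selects both scrapers, while a plain page falls through to the catch-all stub only.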
.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>
Returns scraper instances ready for extraction. `instances_for` is the main entrypoint for extraction. It lets a scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.
# File 'lib/html2rss/auto_source/scraper.rb', line 108

def self.instances_for(parsed_body, url:, request_session: nil,
                       opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  instances = SCRAPERS.filter_map do |scraper|
    next unless opts.dig(scraper.options_key, :enabled)

    instance = scraper.new(parsed_body, url:, request_session:, **opts.fetch(scraper.options_key, {}))
    next unless extractable_instance?(instance, parsed_body)

    instance
  end

  raise no_scraper_found_for(parsed_body) if instances.empty?

  instances
end
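The instance-based matching this method enables can be sketched with a hypothetical scraper: selection state is built once in the constructor, an instance-level check reuses it to answer whether extraction is worthwhile, and the same instance is then kept for extraction. The class name and its href-scanning logic are assumptions; only the pattern comes from the documentation above:

```ruby
# Hypothetical scraper showing why instance-based matching pays off:
# the expensive selection work happens once, at construction, and both
# detection and extraction reuse the precomputed result.
class StateCarryingScraper
  def initialize(parsed_body, url:)
    @url = url
    # Precompute the candidate links a single time.
    @matches = parsed_body.scan(/href="([^"]+)"/).flatten
  end

  # Instance-level detection: no re-parsing, just a state check.
  def articles?
    !@matches.empty?
  end

  # Extraction yields from the same precomputed state.
  def each
    return enum_for(:each) unless block_given?

    @matches.each { |href| yield({ url: href }) }
  end
end
```

Class-level `articles?` detection would have to redo the scan during extraction; carrying the instance forward avoids that duplicate work.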