Module: Html2rss::AutoSource::Scraper

Defined in: lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement an `each` method that yields article hashes, and an `articles?` method that returns true if the scraper can potentially extract articles from the given HTML.
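The contract described above can be sketched as follows. This is a minimal illustration and not one of the bundled scrapers: `ExampleScraper`, the `Link` and `Page` stand-ins (used here in place of a Nokogiri document), the constructor signature, and the `:title`/`:url` hash keys are all assumptions made for the example.

```ruby
# Hypothetical stand-ins for a parsed document; the real scrapers receive a
# Nokogiri document instead.
Link = Struct.new(:href, :text)
Page = Struct.new(:links)

# ExampleScraper is illustrative only and is not part of html2rss.
class ExampleScraper
  include Enumerable

  # Cheap detection: true when the page might contain articles.
  def self.articles?(parsed_body)
    parsed_body.links.any? { |link| !link.text.strip.empty? }
  end

  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one hash per article; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.links.each do |link|
      title = link.text.strip
      next if title.empty?

      yield({ title: title, url: "#{@url}#{link.href}" })
    end
  end
end

page = Page.new([Link.new('/a', ' First post '), Link.new('/b', '')])
articles = ExampleScraper.articles?(page) ? ExampleScraper.new(page, url: 'https://example.com').to_a : []
```

Including `Enumerable` is a convenience of this sketch: once `each` exists, callers get `to_a`, `map`, and similar helpers without extra code.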

Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction. Scrapers run in parallel threads, so implementations must avoid shared mutable state and degrade by returning no articles when a follow-up would be unsafe or unsupported.
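One way to picture the threading constraint is the sketch below. It is not the library's scheduler: `SteadyScraper`, `FlakyScraper`, `UnsupportedFollowUp`, and `collect_articles` are hypothetical names. The sketch only shows the two properties named above, namely no shared mutable state between threads and degrading to an empty result when a scraper cannot proceed.

```ruby
# Hypothetical error used to simulate an unsupported follow-up.
class UnsupportedFollowUp < StandardError; end

# Illustrative scrapers; neither is part of html2rss.
class SteadyScraper
  def initialize(items)
    @items = items.dup.freeze # private, immutable copy
  end

  def each(&block)
    @items.each(&block)
  end
end

class FlakyScraper
  def each
    raise UnsupportedFollowUp, 'follow-up not supported'
  end
end

# Runs each scraper in its own thread. Results travel back through
# Thread#value, so no collection is mutated from more than one thread.
def collect_articles(scrapers)
  threads = scrapers.map do |scraper|
    Thread.new do
      found = [] # local to this thread
      scraper.each { |article| found << article }
      found
    rescue StandardError
      [] # degrade to "no articles" instead of failing the whole run
    end
  end
  threads.flat_map(&:value)
end

articles = collect_articles([SteadyScraper.new([{ title: 'A' }]), FlakyScraper.new])
```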

Defined Under Namespace

Classes: Html, JsonState, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi

Constant Summary

APP_SHELL_ROOT_SELECTORS = Root markers indicating likely app-shell/client-rendered surfaces.

'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'

APP_SHELL_MAX_ANCHORS = Maximum anchors tolerated before app-shell detection is considered unlikely.

APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = Maximum visible text length tolerated for app-shell classification.

SCRAPERS = Ordered scraper classes considered during auto-source extraction.

[
  WordpressApi,
  Schema,
  Microdata,
  JsonState,
  SemanticHtml,
  Html
].freeze

Class Method Summary

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>

Returns an array of scraper classes that claim to find articles in the parsed body.

.instances_for(parsed_body, url:, request_session: nil, opts: ) ⇒ Array<Object>

Returns scraper instances ready for extraction.

Class Method Details

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ `Array<Class>`

Returns an array of scraper classes that claim to find articles in the parsed body.

Parameters:

parsed_body (Nokogiri::HTML::Document) — The parsed HTML body.

opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) — The options hash.

Options Hash (opts):

:wordpress_api (Hash) — scraper toggle and configuration
:schema (Hash) — scraper toggle and configuration
:microdata (Hash) — scraper toggle and configuration
:json_state (Hash) — scraper toggle and configuration
:semantic_html (Hash) — scraper toggle and configuration
:html (Hash) — scraper toggle and configuration

Returns:

(Array<Class>) — An array of scraper classes that can handle the parsed body.

# File 'lib/html2rss/auto_source/scraper.rb', line 82

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
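The two-step filter in the listing above (enabled in `opts`, then `articles?`) can be tried in isolation with stub classes. Everything in this sketch is a stand-in: `StubSchema`, `StubHtml`, `STUB_SCRAPERS`, `select_scrapers`, and the local `NoScraperFound` are hypothetical, and the parsed body is a plain string rather than a Nokogiri document. The `{ key => { enabled: ... } }` shape of `opts` follows from the `opts.dig(scraper.options_key, :enabled)` call.

```ruby
# Local stand-in for the module's NoScraperFound error.
class NoScraperFound < StandardError; end

# Stub scrapers; the detection rules here are invented for the example.
class StubSchema
  def self.options_key
    :schema
  end

  def self.articles?(parsed_body)
    parsed_body.include?('application/ld+json')
  end
end

class StubHtml
  def self.options_key
    :html
  end

  def self.articles?(parsed_body)
    parsed_body.include?('<a ')
  end
end

STUB_SCRAPERS = [StubSchema, StubHtml].freeze

# Keeps scrapers that are enabled in opts and that claim to find articles.
def select_scrapers(parsed_body, opts)
  scrapers = STUB_SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }
  raise NoScraperFound, 'no enabled scraper matched' if scrapers.empty?

  scrapers
end

body = '<script type="application/ld+json">{}</script><a href="/post">Post</a>'
opts = { schema: { enabled: true }, html: { enabled: false } }
selected = select_scrapers(body, opts)
```

Because `Hash#dig` returns `nil` for a missing key, a scraper whose key is absent from `opts` is treated the same as one that is explicitly disabled.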

.instances_for(parsed_body, url:, request_session: nil, opts: ) ⇒ `Array<Object>`

Returns scraper instances ready for extraction. `instances_for` is the main entry point for extraction. It lets a scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.
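The idea of matching and extracting with the same instance can be sketched as below. This is an assumption-laden illustration, not the body of `instances_for`: `CachedSelectionScraper`, its instance-level `articles?`, the `selection_runs` counter, and `matching_instances` are hypothetical, and a regular expression over a string stands in for an expensive selection pass.

```ruby
# Hypothetical scraper; method names are assumptions, not the html2rss API.
class CachedSelectionScraper
  attr_reader :selection_runs

  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
    @selection_runs = 0
  end

  # Instance-level detection reuses the memoized selection.
  def articles?
    candidates.any?
  end

  def each(&block)
    return enum_for(:each) unless block

    candidates.each { |title| block.call({ title: title, url: @url }) }
  end

  private

  # Stands in for an expensive selection pass; computed once per instance.
  def candidates
    @candidates ||= begin
      @selection_runs += 1
      @parsed_body.scan(%r{<h2>(.*?)</h2>}).flatten
    end
  end
end

# Builds instances, keeps the ones that match, and returns those same objects.
def matching_instances(scraper_classes, parsed_body, url:)
  scraper_classes.map { |klass| klass.new(parsed_body, url: url) }.select(&:articles?)
end

body = '<h2>One</h2><h2>Two</h2>'
instances = matching_instances([CachedSelectionScraper], body, url: 'https://example.com')
titles = instances.flat_map { |scraper| scraper.each.map { |article| article[:title] } }
```

In this sketch the selection runs once even though both detection and extraction need it, which is the benefit the description attributes to instance-based matching.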