Module: Html2rss::AutoSource::Scraper

Defined in: lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/link_heuristics.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement an `each` method that yields article hashes, and an `articles?` method that returns true if the scraper can potentially extract articles from the given HTML.
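The contract described above can be sketched with a toy scraper. This is an illustration only, not code from html2rss: the class name `ExampleListScraper`, the constructor signature, the article hash keys, and the simplified `parsed_body` (an Array of Hashes instead of a parsed HTML document) are all assumptions.

```ruby
# Hypothetical scraper, for illustration only; not part of html2rss.
class ExampleListScraper
  include Enumerable

  # Cheap detection: does the input look like it contains articles?
  # `parsed_body` is simplified to an Array of Hashes in this sketch.
  def self.articles?(parsed_body)
    parsed_body.any? { |node| node[:type] == 'article' }
  end

  # The constructor signature is an assumption made for this sketch.
  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one article hash per matching node.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.each do |node|
      next unless node[:type] == 'article'

      yield({ title: node[:title], url: "#{@url}#{node[:path]}" })
    end
  end
end

body = [
  { type: 'nav', title: 'Home', path: '/' },
  { type: 'article', title: 'First post', path: '/posts/1' }
]

can_scrape = ExampleListScraper.articles?(body)
articles = ExampleListScraper.new(body, url: 'https://example.com').to_a
```

Keeping `articles?` cheap and leaving the real work to `each` mirrors the split the overview describes: detection first, extraction only for scrapers that claim a match.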

Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction.

Defined Under Namespace

Classes: Html, JsonState, LinkHeuristics, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi

Constant Summary

APP_SHELL_ROOT_SELECTORS = Root markers indicating likely app-shell/client-rendered surfaces.

'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'

APP_SHELL_MAX_ANCHORS = Maximum anchors tolerated before app-shell detection is considered unlikely.

APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = Maximum visible text length tolerated for app-shell classification.

SCRAPERS = Ordered scraper classes considered during auto-source extraction.

[
  WordpressApi,
  Schema,
  Microdata,
  JsonState,
  SemanticHtml,
  Html
].freeze

Class Method Summary

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>

Returns an array of scraper classes that claim to find articles in the parsed body.

.instances_for(parsed_body, url:, request_session: nil, opts: ) ⇒ Array<Object>

Returns scraper instances ready for extraction.

Class Method Details

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ `Array<Class>`

Returns an array of scraper classes that claim to find articles in the parsed body.

Parameters:

parsed_body (Nokogiri::HTML::Document) —

The parsed HTML body.
opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) —

The options hash.

Options Hash (opts):

:wordpress_api (Hash) —

scraper toggle and configuration
:schema (Hash) —

scraper toggle and configuration
:microdata (Hash) —

scraper toggle and configuration
:json_state (Hash) —

scraper toggle and configuration
:semantic_html (Hash) —

scraper toggle and configuration
:html (Hash) —

scraper toggle and configuration

Returns:

(Array<Class>) —

An array of scraper classes that can handle the parsed body.

# File 'lib/html2rss/auto_source/scraper.rb', line 79

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
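
The two-stage filter in `.from` (enabled in the options, then claiming a match) can be tried in isolation with stand-in classes. The sketch below is an approximation under stated assumptions: `StubSchema`, `StubHtml`, `STUB_SCRAPERS`, `select_scrapers`, and the string-based `articles?` checks are hypothetical, and the error is raised directly because the `no_scraper_found_for` helper is not shown here.

```ruby
# Stand-ins for real scrapers; names, option keys, and checks are hypothetical.
class NoScraperFound < StandardError; end

class StubSchema
  def self.options_key
    :schema
  end

  def self.articles?(body)
    body.include?('application/ld+json')
  end
end

class StubHtml
  def self.options_key
    :html
  end

  def self.articles?(body)
    body.include?('<a ')
  end
end

# Order matters: earlier entries come first in the result.
STUB_SCRAPERS = [StubSchema, StubHtml].freeze

# Mirrors the shape of `.from`: filter by the :enabled toggle, then by detection.
def select_scrapers(body, opts)
  scrapers = STUB_SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(body) }

  raise NoScraperFound, 'no scraper matched the body' if scrapers.empty?

  scrapers
end

body = '<script type="application/ld+json">{}</script><a href="/x">x</a>'
opts = { schema: { enabled: true }, html: { enabled: false } }

matched = select_scrapers(body, opts)
```

Because a disabled scraper is dropped before `articles?` is called, toggling `:enabled` also skips that scraper's detection cost.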

.instances_for(parsed_body, url:, request_session: nil, opts: ) ⇒ `Array<Object>`

Returns scraper instances ready for extraction. `instances_for` is the main entry point for extraction. It lets a scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.
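
One way to picture instance-based matching is a scraper that memoizes its selection pass, so detection and extraction share the result. The sketch below is not the html2rss implementation: `CachingScraper`, `matching_instances`, the instance-level `articles?`, and the Array-of-Strings `parsed_body` are assumptions chosen to keep the example self-contained.

```ruby
# Hypothetical scraper illustrating instance-based matching; not html2rss code.
class CachingScraper
  include Enumerable

  # Counts how often the (pretend) expensive selection pass ran.
  attr_reader :selection_runs

  def initialize(parsed_body, url:)
    @parsed_body = parsed_body
    @url = url
    @selection_runs = 0
  end

  # Instance-level detection: reuses the memoized candidates.
  def articles?
    candidates.any?
  end

  # Yields article hashes built from the same memoized candidates.
  def each(&block)
    return enum_for(:each) unless block

    candidates.each { |title| block.call({ title: title, url: @url }) }
  end

  private

  # Stands in for an expensive selection pass; memoized so it runs once.
  def candidates
    @candidates ||= begin
      @selection_runs += 1
      @parsed_body.select { |line| line.start_with?('article:') }
                  .map { |line| line.delete_prefix('article:').strip }
    end
  end
end

# Hypothetical counterpart of `instances_for`: build instances, keep the ones that match.
def matching_instances(scraper_classes, parsed_body, url:)
  scraper_classes.map { |klass| klass.new(parsed_body, url: url) }
                 .select(&:articles?)
end

body = ['nav: Home', 'article: Hello', 'article: World']
instances = matching_instances([CachingScraper], body, url: 'https://example.com')
scraper = instances.first
titles = scraper.map { |article| article[:title] }
```

In this sketch the selection pass runs once even though both `articles?` and `each` depend on it, which is the benefit the description attributes to carrying state on the instance.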