Module: Html2rss::AutoSource::Scraper

Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement an `each` method that yields article hashes, and an `articles?` method that returns true if the scraper can potentially be used to extract articles from the given HTML.
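The contract described above can be sketched with a small stand-alone class. `HeadlineScraper`, `FakeDocument`, `FakeLink`, the `'h2 a'` selector, and the `:title`/`:url` hash keys are all hypothetical and chosen for illustration; the stand-ins only mimic the shape of a parsed document so the sketch runs without Nokogiri.

```ruby
require 'uri'

# Stand-ins for a parsed document and its link nodes (not part of html2rss).
FakeLink = Struct.new(:text, :href)

class FakeDocument
  def initialize(links)
    @links = links
  end

  # Mimics a CSS query; the selector is ignored in this stand-in.
  def css(_selector)
    @links
  end
end

# Hypothetical scraper following the each / articles? contract.
class HeadlineScraper
  include Enumerable

  def self.options_key
    :headline
  end

  # Shallow detection: only checks that at least one candidate exists.
  def self.articles?(parsed_body)
    parsed_body.css('h2 a').any?
  end

  def initialize(parsed_body, url:, **_opts)
    @parsed_body = parsed_body
    @url = url
  end

  # Yields one article hash per candidate link.
  def each
    return enum_for(:each) unless block_given?

    @parsed_body.css('h2 a').each do |link|
      yield({ title: link.text, url: URI.join(@url, link.href).to_s })
    end
  end
end

doc = FakeDocument.new([FakeLink.new('Hello', '/posts/1')])
HeadlineScraper.articles?(doc)                                  # detection step
HeadlineScraper.new(doc, url: 'https://example.com/blog/').to_a # extraction step
```

Including `Enumerable` is a design choice of the sketch, not something the source requires; it makes `to_a`, `map`, and `first` available once `each` exists.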

Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction. Scrapers run in parallel threads, so implementations must avoid shared mutable state and degrade by returning no articles when a follow-up would be unsafe or unsupported.
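A minimal sketch of the two constraints above, assuming each scraper runs in its own thread: state lives only on the instance, and a scraper that cannot perform its follow-up yields nothing instead of raising. `FollowUpScraper` and `collect_in_threads` are hypothetical names, and treating a missing session as the "unsupported" case is an assumption made for illustration.

```ruby
# Hypothetical scraper that needs a session for its follow-up step.
class FollowUpScraper
  include Enumerable

  def initialize(seed_titles, request_session: nil)
    @seed_titles = seed_titles.dup.freeze # per-instance and immutable
    @request_session = request_session
  end

  def each
    return enum_for(:each) unless block_given?
    # Degrade quietly: without a session the follow-up is unsupported.
    return if @request_session.nil?

    @seed_titles.each { |title| yield({ title: title }) }
  end
end

# Runs every scraper in its own thread and merges the results in order.
def collect_in_threads(scrapers)
  scrapers.map { |scraper| Thread.new { scraper.to_a } }.flat_map(&:value)
end

with_session = FollowUpScraper.new(['A'], request_session: Object.new)
without_session = FollowUpScraper.new(['B'])
collect_in_threads([with_session, without_session])
```

Because no scraper writes to anything outside its own instance variables, the threads need no locking in this sketch.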

Defined Under Namespace

Classes: Html, JsonState, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi

Constant Summary

APP_SHELL_ROOT_SELECTORS =

Root markers indicating likely app-shell/client-rendered surfaces.

'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'
APP_SHELL_MAX_ANCHORS =

Maximum anchors tolerated before app-shell detection is considered unlikely.

2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH =

Maximum visible text length tolerated for app-shell classification.

220
SCRAPERS =

Ordered scraper classes considered during auto-source extraction.

[
  WordpressApi,
  Schema,
  Microdata,
  JsonState,
  SemanticHtml,
  Html
].freeze
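How the three APP_SHELL_* constants might combine is not shown in this section. The following is a sketch under the assumption that a page is treated as a likely app shell only when a root marker is present and both limits are respected. `AppShellSketch` and `app_shell_like?` are hypothetical names, and the inputs are passed as plain values so the example runs without a parsed document.

```ruby
module AppShellSketch
  # Values copied from the constants documented above.
  APP_SHELL_MAX_ANCHORS = 2
  APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = 220

  module_function

  # Assumed combination: a root marker plus very little visible content.
  def app_shell_like?(root_marker_present:, anchor_count:, visible_text_length:)
    root_marker_present &&
      anchor_count <= APP_SHELL_MAX_ANCHORS &&
      visible_text_length <= APP_SHELL_MAX_VISIBLE_TEXT_LENGTH
  end
end

# A bare "#root" container with one link and a short loading message.
AppShellSketch.app_shell_like?(root_marker_present: true, anchor_count: 1, visible_text_length: 40)
```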

Class Method Summary

Class Method Details

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>

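Returns an array of scraper classes that claim to find articles in the parsed body. Only scrapers whose `:enabled` option is truthy are considered, and the error built by `no_scraper_found_for` is raised when none of them match.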

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Options Hash (opts):

  • :wordpress_api (Hash)

    scraper toggle and configuration

  • :schema (Hash)

    scraper toggle and configuration

  • :microdata (Hash)

    scraper toggle and configuration

  • :json_state (Hash)

    scraper toggle and configuration

  • :semantic_html (Hash)

    scraper toggle and configuration

  • :html (Hash)

    scraper toggle and configuration

Returns:

  • (Array<Class>)

    An array of scraper classes that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 82

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
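The two-step filter in `.from` can be reproduced with stand-in classes to show how the options hash drives selection. Everything prefixed with `Sketch` or `sketch_` is hypothetical; only the `opts.dig(options_key, :enabled)` lookup and the `articles?` call are taken from the listing above.

```ruby
# Stand-in for the NoScraperFound class listed in this namespace.
class SketchNoScraperFound < StandardError; end

# Stand-in scraper classes with fixed detection results.
class SketchMatching
  def self.options_key
    :matching
  end

  def self.articles?(_parsed_body)
    true
  end
end

class SketchNonMatching
  def self.options_key
    :non_matching
  end

  def self.articles?(_parsed_body)
    false
  end
end

SKETCH_SCRAPERS = [SketchMatching, SketchNonMatching].freeze

# Mirrors .from: keep enabled scrapers first, then those that detect articles.
def sketch_from(parsed_body, opts)
  scrapers = SKETCH_SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers = scrapers.select { |scraper| scraper.articles?(parsed_body) }
  raise SketchNoScraperFound, 'no scraper matched' if scrapers.empty?

  scrapers
end

opts = { matching: { enabled: true }, non_matching: { enabled: true } }
sketch_from(:any_body, opts) # only SketchMatching survives both filters

begin
  # A missing key makes dig return nil, so every scraper counts as disabled.
  sketch_from(:any_body, {})
rescue SketchNoScraperFound => e
  puts "raised: #{e.message}"
end
```

Because `Hash#dig` returns nil for an absent key, a scraper that is left out of the options hash is skipped in the same way as one with `enabled: false`.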

.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>

Returns scraper instances ready for extraction. `instances_for` is the main entrypoint for extraction. It lets a scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • url (String, Html2rss::Url)

    The page URL.

  • request_session (Html2rss::RequestSession, nil) (defaults to: nil)

    Session shared across follow-up requests.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Options Hash (opts:):

  • :wordpress_api (Hash)

    scraper toggle and configuration

  • :schema (Hash)

    scraper toggle and configuration

  • :microdata (Hash)

    scraper toggle and configuration

  • :json_state (Hash)

    scraper toggle and configuration

  • :semantic_html (Hash)

    scraper toggle and configuration

  • :html (Hash)

    scraper toggle and configuration

Returns:

  • (Array<Object>)

    An array of scraper instances that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 108

def self.instances_for(parsed_body, url:, request_session: nil,
                       opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  instances = SCRAPERS.filter_map do |scraper|
    next unless opts.dig(scraper.options_key, :enabled)

    instance = scraper.new(parsed_body, url:, request_session:, **opts.fetch(scraper.options_key, {}))
    next unless extractable_instance?(instance, parsed_body)

    instance
  end

  raise no_scraper_found_for(parsed_body) if instances.empty?

  instances
end
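The benefit of instance-based matching can be sketched as follows. `MemoizingScraper`, its `extractable?` method, and `sketch_extractable?` are hypothetical: the real `extractable_instance?` helper is not shown in this section, so the fallback to a class-level `articles?` is an assumption. The parsed body is a plain array of hashes here so the example runs without Nokogiri.

```ruby
# Hypothetical scraper that computes its selection once and reuses it.
class MemoizingScraper
  include Enumerable

  attr_reader :selection_runs

  # **_opts absorbs the per-scraper options, including enabled: true,
  # which instances_for splats into the constructor.
  def initialize(parsed_body, url:, request_session: nil, **_opts)
    @parsed_body = parsed_body
    @url = url
    @request_session = request_session
    @selection_runs = 0
  end

  # Instance-level match check (assumed name).
  def extractable?
    !candidates.empty?
  end

  def each(&block)
    return enum_for(:each) unless block

    candidates.each(&block)
  end

  private

  # The "expensive" selection, computed once per instance.
  def candidates
    @candidates ||= begin
      @selection_runs += 1
      @parsed_body.select { |entry| entry.key?(:title) }
    end
  end
end

# Assumed shape of the match check: prefer the instance, else ask the class.
def sketch_extractable?(instance, parsed_body)
  return instance.extractable? if instance.respond_to?(:extractable?)

  instance.class.articles?(parsed_body)
end

body = [{ title: 'A' }, { note: 'not an article' }]
scraper = MemoizingScraper.new(body, url: 'https://example.com', enabled: true)
scraper.to_a if sketch_extractable?(scraper, body)
scraper.selection_runs # selection ran once for both matching and extraction
```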