Module: Html2rss::AutoSource::Scraper

Defined in:
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/link_heuristics.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The Scraper module contains all scrapers that can be used to extract articles. Each scraper should implement an `each` method that yields article hashes. Each scraper should also implement an `articles?` method that returns true if the scraper can potentially be used to extract articles from the given HTML.

Detection is intentionally shallow for most scrapers, but instance-based matching is available for scrapers that need to carry expensive selection state forward into extraction.
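
The contract described above can be sketched with a toy scraper. This is an illustration only: `ToyHeadlineScraper`, its regular expression, and the use of a raw HTML string (instead of a parsed Nokogiri document) are assumptions made to keep the example self-contained, not part of html2rss.

```ruby
require 'uri'

# Hypothetical scraper illustrating the contract: a cheap class-level
# `articles?` check plus an instance-level `each` that yields article hashes.
class ToyHeadlineScraper
  include Enumerable

  # Assumed markup shape for this sketch: <h2><a href="...">Title</a></h2>
  HEADLINE = %r{<h2><a href="([^"]+)">([^<]+)</a></h2>}

  # Shallow detection: only answers whether extraction could plausibly work.
  def self.articles?(html)
    html.match?(HEADLINE)
  end

  def initialize(html, url:)
    @html = html
    @url = url
  end

  # Yields one article hash per matched headline.
  def each
    return enum_for(:each) unless block_given?

    @html.scan(HEADLINE) do |href, title|
      yield({ title: title, url: URI.join(@url, href).to_s })
    end
  end
end
```

Because the class includes `Enumerable`, callers of this sketch can use `to_a`, `map`, or `first` on an instance without any further code.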

Defined Under Namespace

Classes: Html, JsonState, LinkHeuristics, Microdata, NoScraperFound, Schema, SemanticHtml, WordpressApi

Constant Summary

APP_SHELL_ROOT_SELECTORS =

Root markers indicating likely app-shell/client-rendered surfaces.

'#app, #root, #__next, [data-reactroot], [ng-app], [id*="app-shell"]'
APP_SHELL_MAX_ANCHORS =

Maximum anchors tolerated before app-shell detection is considered unlikely.

2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH =

Maximum visible text length tolerated for app-shell classification.

220
SCRAPERS =

Ordered scraper classes considered during auto-source extraction.

[
  WordpressApi,
  Schema,
  Microdata,
  JsonState,
  SemanticHtml,
  Html
].freeze
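
One way the two app-shell thresholds above could combine is sketched below. The predicate `likely_app_shell?`, its keyword arguments, and the inclusive comparisons are assumptions for illustration; this page does not show how html2rss actually applies these constants.

```ruby
# Values copied from the constants documented above.
APP_SHELL_MAX_ANCHORS = 2
APP_SHELL_MAX_VISIBLE_TEXT_LENGTH = 220

# Hypothetical predicate: treats a page as a likely app shell when a root
# marker is present and the page has few anchors and little visible text.
# The inputs are precomputed counts so the sketch needs no HTML parser.
def likely_app_shell?(root_marker_present:, anchor_count:, visible_text_length:)
  root_marker_present &&
    anchor_count <= APP_SHELL_MAX_ANCHORS &&
    visible_text_length <= APP_SHELL_MAX_VISIBLE_TEXT_LENGTH
end
```

Under these assumptions, a page with a `#root` element, one anchor, and 40 characters of visible text would be classified as an app shell, while the same page with 30 anchors would not.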

Class Method Details

.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Class>

Returns an array of enabled scraper classes that claim to find articles in the parsed body. Raises if no enabled scraper matches.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Options Hash (opts):

  • :wordpress_api (Hash)

    scraper toggle and configuration

  • :schema (Hash)

    scraper toggle and configuration

  • :microdata (Hash)

    scraper toggle and configuration

  • :json_state (Hash)

    scraper toggle and configuration

  • :semantic_html (Hash)

    scraper toggle and configuration

  • :html (Hash)

    scraper toggle and configuration

Returns:

  • (Array<Class>)

    An array of scraper classes that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 79

def self.from(parsed_body, opts = Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  scrapers = SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  scrapers.select! { |scraper| scraper.articles?(parsed_body) }

  raise no_scraper_found_for(parsed_body) if scrapers.empty?

  scrapers
end
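
The two-stage filter in `.from` (enabled first, then `articles?`) can be tried in isolation with stand-in classes. `FakeSchema`, `FakeHtml`, `FAKE_SCRAPERS`, `matching_scrapers`, and the `ArgumentError` are hypothetical names used only for this sketch; the real method uses `SCRAPERS` and its own error helper.

```ruby
# Stand-in scraper classes; their matching rules are invented for the sketch.
class FakeSchema
  def self.options_key
    :schema
  end

  def self.articles?(body)
    body.include?('application/ld+json')
  end
end

class FakeHtml
  def self.options_key
    :html
  end

  def self.articles?(body)
    body.include?('<a ')
  end
end

FAKE_SCRAPERS = [FakeSchema, FakeHtml].freeze

# Mirrors the shape of `.from`: keep enabled scrapers, then keep those whose
# shallow `articles?` check passes, preserving the order of the list.
def matching_scrapers(body, opts)
  enabled = FAKE_SCRAPERS.select { |scraper| opts.dig(scraper.options_key, :enabled) }
  matches = enabled.select { |scraper| scraper.articles?(body) }

  # Stand-in error; the real code raises via `no_scraper_found_for`.
  raise ArgumentError, 'no scraper found' if matches.empty?

  matches
end
```

Note that `opts.dig(key, :enabled)` returns `nil` when the scraper's key is absent from `opts`, so in this sketch, as in the source above, a scraper with no entry in the options hash is treated as disabled.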

.instances_for(parsed_body, url:, request_session: nil, opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper]) ⇒ Array<Object>

Returns scraper instances ready for extraction. `instances_for` is the main entrypoint for extraction. It lets a scraper decide whether it matches using the same instance that will later yield article hashes, which keeps precomputed state close to the scraper that owns it.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    The parsed HTML body.

  • url (String, Html2rss::Url)

    The page URL.

  • request_session (Html2rss::RequestSession, nil) (defaults to: nil)

    Shared follow-up session.

  • opts (Hash) (defaults to: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])

    The options hash.

Options Hash (opts:):

  • :wordpress_api (Hash)

    scraper toggle and configuration

  • :schema (Hash)

    scraper toggle and configuration

  • :microdata (Hash)

    scraper toggle and configuration

  • :json_state (Hash)

    scraper toggle and configuration

  • :semantic_html (Hash)

    scraper toggle and configuration

  • :html (Hash)

    scraper toggle and configuration

Returns:

  • (Array<Object>)

    An array of scraper instances that can handle the parsed body.



# File 'lib/html2rss/auto_source/scraper.rb', line 105

def self.instances_for(parsed_body, url:, request_session: nil,
                       opts: Html2rss::AutoSource::DEFAULT_CONFIG[:scraper])
  instances = SCRAPERS.filter_map do |scraper|
    next unless opts.dig(scraper.options_key, :enabled)

    instance = scraper.new(parsed_body, url:, request_session:, **opts.fetch(scraper.options_key, {}))
    next unless extractable_instance?(instance, parsed_body)

    instance
  end

  raise no_scraper_found_for(parsed_body) if instances.empty?

  instances
end
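
Instance-based matching can be sketched the same way. Here `FakeAnchorScraper`, its `extractable?` method, the `:min_title_length` option, and `build_instances` are all hypothetical; in particular, the private `extractable_instance?` helper used above is not shown on this page, so the check below is only a stand-in for it.

```ruby
# Hypothetical scraper that does its (pretend) expensive selection once in
# the constructor and reuses it for both matching and extraction.
class FakeAnchorScraper
  include Enumerable

  def self.options_key
    :semantic_html
  end

  # `**config` absorbs the whole per-scraper hash, including :enabled,
  # because the caller splats it into the constructor.
  def initialize(body, url:, request_session: nil, **config)
    @url = url
    @request_session = request_session
    @min_length = config.fetch(:min_title_length, 1)
    @titles = body.scan(%r{<a[^>]*>([^<]+)</a>}).flatten
  end

  # Stand-in for whatever the real matching check inspects.
  def extractable?
    @titles.any? { |title| title.length >= @min_length }
  end

  def each
    return enum_for(:each) unless block_given?

    @titles.each do |title|
      yield({ title: title, url: @url }) if title.length >= @min_length
    end
  end
end

# Mirrors the shape of `instances_for` with explicit keyword arguments.
def build_instances(body, url:, opts:, scrapers: [FakeAnchorScraper])
  instances = scrapers.filter_map do |scraper|
    next unless opts.dig(scraper.options_key, :enabled)

    instance = scraper.new(body, url: url, **opts.fetch(scraper.options_key, {}))
    next unless instance.extractable?

    instance
  end

  # Stand-in error; the real code raises via `no_scraper_found_for`.
  raise 'no scraper found' if instances.empty?

  instances
end
```

The point of the design is visible in the constructor: the anchor scan runs once, and both `extractable?` and `each` read the cached result instead of repeating the selection.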