Class: Html2rss::AutoSource

Inherits:
Object
Defined in:
lib/html2rss/auto_source.rb,
lib/html2rss/auto_source/cleanup.rb,
lib/html2rss/auto_source/scraper.rb,
lib/html2rss/auto_source/scraper/html.rb,
lib/html2rss/auto_source/scraper/schema.rb,
lib/html2rss/auto_source/scraper/microdata.rb,
lib/html2rss/auto_source/scraper/json_state.rb,
lib/html2rss/auto_source/scraper/schema/thing.rb,
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/wordpress_api.rb,
lib/html2rss/auto_source/scraper/schema/item_list.rb,
lib/html2rss/auto_source/scraper/schema/list_item.rb,
lib/html2rss/auto_source/scraper/wordpress_api/page_scope.rb,
lib/html2rss/auto_source/scraper/schema/category_extractor.rb,
lib/html2rss/auto_source/scraper/wordpress_api/posts_endpoint.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

The AutoSource class automatically extracts articles from a given URL using a collection of Scrapers. These scrapers analyze and parse popular structured data formats, such as Schema.org, microdata, and Open Graph, to identify article elements and compile them into unified articles.

Scrapers supporting plain HTML are also available for sites without structured data, though results may vary based on page markup.
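As an illustration of how format detection can work, the following sketch probes raw HTML for telltale markers of each structured-data format. This is a simplified, hypothetical example for orientation only; the real scrapers parse the DOM rather than matching strings:

```ruby
# Simplified sketch: guess which structured-data formats a page appears to use.
# Hypothetical helper, not part of Html2rss::AutoSource's API.
def detect_formats(html)
  formats = []
  formats << :schema        if html.include?('application/ld+json') # JSON-LD script tag
  formats << :microdata     if html.include?('itemscope')           # microdata attribute
  formats << :semantic_html if html.match?(/<article\b/i)           # semantic HTML element
  formats
end

html = <<~HTML
  <html><body>
    <script type="application/ld+json">{"@type":"ItemList"}</script>
    <article itemscope><h2><a href="/post">Post</a></h2></article>
  </body></html>
HTML

detect_formats(html) # => [:schema, :microdata, :semantic_html]
```

A page matching several formats is normal; the library runs every applicable scraper and merges the results.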

Defined Under Namespace

Modules: Scraper
Classes: Cleanup

Constant Summary

DEFAULT_CONFIG =

Default auto-source configuration for scraper and cleanup behavior.

{
  scraper: {
    wordpress_api: {
      enabled: true
    },
    schema: {
      enabled: true
    },
    microdata: {
      enabled: true
    },
    json_state: {
      enabled: true
    },
    semantic_html: {
      enabled: true
    },
    html: {
      enabled: true,
      minimum_selector_frequency: Scraper::Html::DEFAULT_MINIMUM_SELECTOR_FREQUENCY,
      use_top_selectors: Scraper::Html::DEFAULT_USE_TOP_SELECTORS
    }
  },
  cleanup: Cleanup::DEFAULT_CONFIG
}.freeze
Config =

Runtime schema used to validate auto-source config values.

Dry::Schema.Params do
  optional(:scraper).hash(&SCRAPER_CONFIG)

  optional(:cleanup).hash do
    optional(:keep_different_domain).filled(:bool)
    optional(:min_words_title).filled(:integer, gt?: 0)
  end
end
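Because `DEFAULT_CONFIG` is a nested, frozen hash, user-supplied options are typically layered over it rather than replacing it wholesale. A stdlib-only sketch of such a deep merge (a hypothetical helper, not part of the library's API):

```ruby
# Hypothetical deep_merge helper: override values win over defaults,
# and nested hashes are merged recursively instead of being replaced.
def deep_merge(defaults, overrides)
  defaults.merge(overrides) do |_key, default_value, override_value|
    if default_value.is_a?(Hash) && override_value.is_a?(Hash)
      deep_merge(default_value, override_value)
    else
      override_value
    end
  end
end

defaults  = { scraper: { html: { enabled: true }, schema: { enabled: true } } }
user_opts = { scraper: { html: { enabled: false } } }

deep_merge(defaults, user_opts)
# => { scraper: { html: { enabled: false }, schema: { enabled: true } } }
```

Note that the merge never mutates `defaults`, which matters here since `DEFAULT_CONFIG` is frozen.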

Instance Method Summary

Constructor Details

#initialize(response, opts = DEFAULT_CONFIG, request_session: nil) ⇒ void

Parameters:

Options Hash (opts):

  • :scraper (Hash)

    scraper configuration map

  • :cleanup (Hash)

    cleanup configuration map



# File 'lib/html2rss/auto_source.rb', line 88

def initialize(response, opts = DEFAULT_CONFIG, request_session: nil)
  @parsed_body = response.parsed_body
  @url = response.url
  @opts = opts
  @request_session = request_session
end

Instance Method Details

#articles ⇒ Array<Html2rss::RssBuilder::Article>

Extracts article candidates by selecting every scraper that can explain the page shape, running those scrapers, and normalizing the resulting hashes into `RssBuilder::Article` objects.

The contributor-facing flow is:

  1. choose scraper instances that match the page

  2. let each scraper collect its own candidates

  3. clean and deduplicate the merged article list

Scrapers with expensive precomputation, such as `SemanticHtml`, keep that state on the instance so detection and extraction can reuse the same work.
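The three-step flow above can be sketched in plain Ruby. The scraper objects below are hypothetical stand-ins for the real scraper classes, each exposing an applicability check and a scrape step:

```ruby
# Sketch of the articles pipeline: pick matching scrapers, run each,
# then deduplicate the merged candidate list by URL.
Candidate = Struct.new(:url, :title)

scrapers = [
  { applicable: ->(html) { html.include?('itemscope') },
    scrape:     ->(_html) { [Candidate.new('/a', 'A'), Candidate.new('/b', 'B')] } },
  { applicable: ->(html) { html.include?('<article') },
    scrape:     ->(_html) { [Candidate.new('/b', 'B'), Candidate.new('/c', 'C')] } }
]

html = '<article itemscope></article>'

candidates = scrapers
             .select { |s| s[:applicable].call(html) }  # 1. choose matching scrapers
             .flat_map { |s| s[:scrape].call(html) }    # 2. each collects candidates
             .uniq(&:url)                               # 3. deduplicate merged list

candidates.map(&:url) # => ["/a", "/b", "/c"]
```

The actual cleanup step (`Cleanup`) applies more than URL deduplication, e.g. the `keep_different_domain` and `min_words_title` options shown in the `Config` schema above.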

Returns:

(Array<Html2rss::RssBuilder::Article>) the cleaned, deduplicated article candidates; an empty array when no scraper matched

# File 'lib/html2rss/auto_source.rb', line 109

def articles
  @articles ||= extract_articles
rescue Html2rss::AutoSource::Scraper::NoScraperFound => error
  Log.warn "#{self.class}: no scraper matched #{url} (#{error.message})"
  []
end