Class: Html2rss::AutoSource::Scraper::SemanticHtml

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

Scrapes semantic containers by choosing one primary content link per block before extraction.

This scraper is intentionally container-first:

  1. collect candidate semantic containers once

  2. select the strongest content-like anchor within each container

  3. extract fields from the container while honoring that anchor choice

The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
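The three steps above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the scraper's actual heuristics: `Anchor`, `Container`, `UTILITY_PATTERN`, `anchor_score`, `primary_anchor`, and `extract` are all hypothetical names, the score weights are invented for the example, and simple Structs stand in for the Nokogiri nodes the real scraper works on.

```ruby
# Hypothetical stand-ins for parsed DOM nodes (the real scraper uses Nokogiri).
Anchor = Struct.new(:href, :text, :in_heading, keyword_init: true)
Container = Struct.new(:tag, :anchors, keyword_init: true)

# Assumed utility-link patterns; not taken from html2rss.
UTILITY_PATTERN = %r{/(share|login|comments|tag|author)\b}

# Scores an anchor; higher means more content-like. Weights are illustrative.
def anchor_score(anchor)
  return -1 if anchor.href.to_s.empty? || anchor.href.match?(UTILITY_PATTERN)

  score = 0
  score += 2 if anchor.in_heading
  score += 1 if anchor.text.to_s.strip.length >= 15
  score
end

# Step 2: pick the strongest eligible anchor in a container, or nil.
def primary_anchor(container)
  best = container.anchors.max_by { |anchor| anchor_score(anchor) }
  best if best && anchor_score(best).positive?
end

# Steps 1-3: one pass over the containers; blocks without a primary
# anchor are dropped, which is where the lower recall comes from.
def extract(containers)
  containers.filter_map do |container|
    anchor = primary_anchor(container)
    next unless anchor

    { url: anchor.href, title: anchor.text.strip }
  end
end

teaser = Container.new(tag: 'article', anchors: [
  Anchor.new(href: '/share?u=1', text: 'Share', in_heading: false),
  Anchor.new(href: '/news/launch', text: 'Rocket launch delayed by weather', in_heading: true),
  Anchor.new(href: '/news/launch', text: '', in_heading: false) # image overlay duplicate
])
utility_only = Container.new(tag: 'article', anchors: [
  Anchor.new(href: '/login', text: 'Sign in', in_heading: false)
])

p extract([teaser, utility_only])
```

In this sketch the teaser card resolves to its headline link even though a share link and a duplicate overlay link sit in the same block, while the utility-only block produces no item at all.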

Defined Under Namespace

Classes: AnchorSelector, Deduplicator, Entry


Constructor Details

#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml

Returns a new instance of SemanticHtml.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

  • url (String, Html2rss::Url)

    base url

  • extractor (Class) (defaults to: HtmlExtractor)

    extractor class used for article extraction

  • _opts (Hash)

    scraper-specific options

Options Hash (**_opts):

  • :_reserved (Object)

    reserved for future scraper-specific options



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 53

def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @link_heuristics = LinkHeuristics.new(url)
  @anchor_selector = AnchorSelector.new(url)
end
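The keyword layout of the constructor (a required `url:`, an injectable `extractor:` class with a default, and a `**_opts` catch-all) might be exercised as in the sketch below. Everything here is hypothetical: `SketchScraper`, `DefaultExtractor`, and `UpcasingExtractor` are illustrative stand-ins, and the way the extractor is instantiated and called is an assumption, since this page does not show how `HtmlExtractor` is invoked.

```ruby
# Hypothetical default extractor; stands in for HtmlExtractor.
class DefaultExtractor
  def initialize(node, base_url:)
    @node = node
    @base_url = base_url
  end

  # Assumed interface: returns one article hash per node.
  def call
    { title: @node[:title], url: "#{@base_url}#{@node[:path]}" }
  end
end

# Hypothetical replacement that a caller or a test could inject.
class UpcasingExtractor < DefaultExtractor
  def call
    super.merge(title: @node[:title].upcase)
  end
end

# Mirrors the keyword layout of the documented constructor.
class SketchScraper
  def initialize(nodes, url:, extractor: DefaultExtractor, **_opts)
    @nodes = nodes
    @url = url
    @extractor = extractor
  end

  def articles
    @nodes.map { |node| @extractor.new(node, base_url: @url).call }
  end
end

nodes = [{ title: 'hello', path: '/posts/1' }]

# Unknown options are accepted and ignored by the **_opts catch-all.
p SketchScraper.new(nodes, url: 'https://example.com', future_flag: true).articles
p SketchScraper.new(nodes, url: 'https://example.com', extractor: UpcasingExtractor).articles
```

Passing the extractor as a class rather than an instance keeps the default cheap and lets callers swap extraction behavior without subclassing the scraper.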

Instance Attribute Details

#parsed_body ⇒ Object (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 61

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ Boolean

Returns true when at least one semantic container has an eligible anchor.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed HTML document

Returns:

  • (Boolean)

    true when at least one semantic container has an eligible anchor



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 42

def self.articles?(parsed_body)
  return false unless parsed_body

  new(parsed_body, url: 'https://example.com').extractable?
end
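The listing shows a class-level predicate that guards against a missing document and then delegates to a throwaway instance built with a placeholder URL. A minimal sketch of that shape follows; `SketchDetector` is a hypothetical class, an array of hashes stands in for the parsed document, and its notion of "extractable" (any block with a non-empty `:href`) is an assumption made only for the example.

```ruby
class SketchDetector
  # Same placeholder value the listing passes to the constructor.
  PLACEHOLDER_URL = 'https://example.com'

  # Class-level check: no caller-supplied URL is needed for detection.
  def self.articles?(document)
    return false unless document

    new(document, url: PLACEHOLDER_URL).extractable?
  end

  def initialize(document, url:)
    @document = document
    @url = url
  end

  # Assumed stand-in rule: a block counts when it carries a non-empty href.
  def extractable?
    @document.any? { |block| !block[:href].to_s.empty? }
  end
end

p SketchDetector.articles?(nil)
p SketchDetector.articles?([{ href: '/posts/1' }])
```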

.options_key ⇒ Symbol

Returns config key used to enable or configure this scraper.

Returns:

  • (Symbol)

    config key used to enable or configure this scraper



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 38

def self.options_key = :semantic_html

Instance Method Details

#each {|article_hash| ... } ⇒ Enumerator<Hash>

Yields extracted article hashes for each semantic container that survives anchor selection.

Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.

Yield Parameters:

  • article_hash (Hash)

    extracted article hash

Returns:

  • (Enumerator<Hash>)


# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 73

def each
  return enum_for(:each) unless block_given?

  ranked_entries.each { yield _1.article }
end

#extractable? ⇒ Boolean

Reports whether the page contains at least one semantic container with a selectable primary anchor.

Returns:

  • (Boolean)

    true when at least one candidate container yields a primary anchor



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 84

def extractable?
  extractable_entries.any?
end
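The interplay between #extractable? and #each described above (detection and extraction reading from one memoized list, with #each returning an Enumerator when no block is given) can be sketched as follows. `SketchFeed`, `entries_with_anchor`, and the `rank_calls` counter are hypothetical; the counter exists only to make the memoization observable, and selecting blocks by a truthy `:href` is an assumed stand-in for anchor ranking.

```ruby
class SketchFeed
  include Enumerable

  # Counts how often the (assumed) expensive ranking step actually runs.
  attr_reader :rank_calls

  def initialize(blocks)
    @blocks = blocks
    @rank_calls = 0
  end

  # Detection: reads the memoized list.
  def extractable?
    entries_with_anchor.any?
  end

  # Extraction: reads the same memoized list, so ranking is not repeated.
  def each
    return enum_for(:each) unless block_given?

    entries_with_anchor.each { |entry| yield entry }
  end

  private

  # Memoized with ||=; an empty array is truthy in Ruby, so an empty
  # result is cached too instead of being recomputed.
  def entries_with_anchor
    @entries_with_anchor ||= begin
      @rank_calls += 1
      @blocks.select { |block| block[:href] }
    end
  end
end

feed = SketchFeed.new([{ href: '/a' }, { href: nil }])
feed.extractable?
feed.each { |entry| puts entry[:href] }
puts feed.rank_calls
```

Because the class includes Enumerable and defines #each, methods such as `map`, `first`, and `to_a` become available on the sketch as well.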