Class: Html2rss::AutoSource::Scraper::SemanticHtml::AnchorSelector

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

Selects the best content-like anchor from a semantic container.

The selector turns raw DOM anchors into ranked facts so semantic scraping can reason about link intent instead of DOM order. It favors heading-aligned article links and suppresses utility links, duplicate destinations, and weak textless affordances.
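As a rough illustration of what "ranked facts" could look like, the sketch below scores one anchor from a few boolean observations. The AnchorFacts struct, its fields, and the weights are all hypothetical and chosen for this example; the page does not document the real fact objects or how html2rss scores them.

```ruby
# Hypothetical facts about a single anchor. The real selector's fact
# objects and field names are not documented on this page.
AnchorFacts = Struct.new(:text, :inside_heading, :utility, keyword_init: true)

# Illustrative weights only, not the library's actual scoring.
def sketch_score(facts)
  score = 0
  score += 10 if facts.inside_heading     # heading-aligned links are favored
  score -= 10 if facts.utility            # share/login/comment links are suppressed
  score -= 5 if facts.text.strip.empty?   # textless anchors are weak candidates
  score
end

headline = AnchorFacts.new(text: 'Story headline', inside_heading: true, utility: false)
share    = AnchorFacts.new(text: 'Share', inside_heading: false, utility: true)
icon     = AnchorFacts.new(text: '  ', inside_heading: false, utility: false)

sketch_score(headline) # => 10
sketch_score(share)    # => -10
sketch_score(icon)     # => -5
```

Reducing each anchor to a numeric score is what lets the selection ignore DOM order: the first link in a card only wins if its facts rank highest.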

Constant Summary

HEADING_SELECTOR = HtmlExtractor::HEADING_TAGS.join(',').freeze

Comma-separated heading selector used for heading/anchor matching.
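A minimal sketch of how joining a tag list yields a CSS selector group. The tag list here is an assumption for illustration; the actual contents of HtmlExtractor::HEADING_TAGS are not shown on this page.

```ruby
# Assumed tag list; the real HtmlExtractor::HEADING_TAGS may differ.
SKETCH_HEADING_TAGS = %w[h1 h2 h3 h4 h5 h6].freeze

# Joining with ',' produces a CSS selector group matching any listed tag.
SKETCH_HEADING_SELECTOR = SKETCH_HEADING_TAGS.join(',').freeze

SKETCH_HEADING_SELECTOR # => "h1,h2,h3,h4,h5,h6"
```

A selector group like this can be passed to a single CSS query, so all heading levels are matched in one pass instead of one query per tag.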

Instance Method Summary

Constructor Details

#initialize(base_url) ⇒ AnchorSelector

Returns a new instance of AnchorSelector.

Parameters:

  • base_url (String, Html2rss::Url)

    page URL used to normalize href destinations



# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 19

def initialize(base_url)
  @link_heuristics = LinkHeuristics.new(base_url)
end

Instance Method Details

#primary_anchor_for(container) ⇒ Nokogiri::XML::Element?

Chooses the single anchor that best represents the story contained in a semantic block.

Ranking is scoped to one container at a time. That keeps the logic local, makes duplicate links to the same destination collapse into one candidate, and avoids page-wide heuristics leaking across cards.

Parameters:

  • container (Nokogiri::XML::Element)

    semantic container being evaluated

Returns:

  • (Nokogiri::XML::Element, nil)

    selected primary anchor or nil when none qualify



# File 'lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb', line 33

def primary_anchor_for(container)
  facts_for(container).max_by(&:score)&.anchor
end