Class: Html2rss::AutoSource::Scraper::LinkHeuristics

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Shared link-level heuristics used by scraper-local selection and scoring. Centralizing them keeps URL normalization and route/text classification consistent without moving scraper policy into higher-level orchestration.

Defined Under Namespace

Classes: DestinationFacts, HrefExtractor, PathClassifier, TextClassifier

Instance Method Summary

Constructor Details

#initialize(base_url) ⇒ LinkHeuristics

Returns a new instance of LinkHeuristics.

Parameters:

  • base_url (String, Html2rss::Url)

    page URL used to resolve relative hrefs



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 351

def initialize(base_url)
  @base_url = base_url
  @text_classifier = TextClassifier.new
end

Instance Method Details

#destination_facts(anchor_or_href) ⇒ DestinationFacts?

Builds normalized destination facts for an anchor element or href string.

Parameters:

  • anchor_or_href (Nokogiri::XML::Element, String, #to_s)

    anchor element or href-like value

Returns:

  • (DestinationFacts, nil)

    normalized destination facts, or nil for blank/invalid URLs



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 360

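def destination_facts(anchor_or_href)
  # Reuse facts already cached for this exact argument.
  return node_facts[anchor_or_href] if node_facts.key?(anchor_or_href)

  href = HrefExtractor.call(anchor_or_href)
  return unless href

  # Facts are memoized per href, so anchors sharing a destination share one result.
  res = memoized_destination_facts(href)

  # Only Nokogiri nodes are cached by node; other href-like values rely on the href memo.
  node_facts[anchor_or_href] = res if anchor_or_href.is_a?(Nokogiri::XML::Node)
  res
rescue ArgumentError
  # An ArgumentError is treated as an invalid destination and yields nil.
  nil
end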
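
The lookup above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: `SketchFacts` and `SketchLinkHeuristics` are hypothetical stand-ins, the fields on the facts object are assumed, and URL resolution uses Ruby's standard `uri` library in place of `Html2rss::Url`. For that reason the sketch rescues `URI::InvalidURIError` where the method above rescues `ArgumentError`. The per-node cache is omitted; only the per-href memoization and the nil-for-blank/invalid contract are shown.

```ruby
require "uri"

# Hypothetical stand-in for DestinationFacts; the real class's fields are not shown here.
SketchFacts = Struct.new(:url, :host, :path)

# Hypothetical stand-in for LinkHeuristics, limited to href-level memoization.
class SketchLinkHeuristics
  def initialize(base_url)
    @base_url = URI(base_url.to_s)
    @facts_by_href = {} # href string => SketchFacts
  end

  # Returns facts for an href-like value, or nil when it is blank or unparsable.
  def destination_facts(href)
    href = href.to_s.strip
    return if href.empty?

    # Build the facts once per distinct href and reuse them afterwards.
    @facts_by_href.fetch(href) { @facts_by_href[href] = build_facts(href) }
  rescue URI::InvalidURIError
    nil
  end

  private

  # Resolves the href against the page URL (URI#merge handles relative references).
  def build_facts(href)
    url = @base_url.merge(href)
    SketchFacts.new(url.to_s, url.host, url.path)
  end
end

heuristics = SketchLinkHeuristics.new("https://example.com/news/")
facts = heuristics.destination_facts("article-1")
puts facts.url
```

Calling `destination_facts` twice with the same href in this sketch returns the same object, which is the property that lets several anchors pointing at one destination share a single result.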

#recommended_text?(text) ⇒ Boolean

Returns true when text identifies recommendation chrome.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text identifies recommendation chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 384

def recommended_text?(text) = @text_classifier.recommended?(text)

#utility_prefix_text?(text) ⇒ Boolean

Returns true when text begins with a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text begins with a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 380

def utility_prefix_text?(text) = @text_classifier.utility_prefix?(text)

#utility_text?(text) ⇒ Boolean

Returns true when text matches a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text matches a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 376

def utility_text?(text) = @text_classifier.utility?(text)
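
The three text predicates delegate to `TextClassifier`, whose rules are not shown on this page. The following sketch shows one way such a classifier could distinguish an exact utility label, a utility prefix, and recommendation chrome. Everything in it is an assumption made for illustration: the class name `SketchTextClassifier`, the label list, the recommendation pattern, and the normalization step are hypothetical and do not reflect the library's actual vocabulary.

```ruby
# Hypothetical classifier; the labels and patterns are assumptions, not the library's rules.
class SketchTextClassifier
  UTILITY_LABELS = ["share", "print", "log in", "subscribe"].freeze
  # Matches a utility label at the start of the text, ending at a word boundary.
  UTILITY_PREFIX = /\A#{Regexp.union(UTILITY_LABELS)}\b/
  # Matches assumed recommendation phrases at the start of the text.
  RECOMMENDED = /\A(?:recommended|related|more from)\b/

  # True when the whole text is a utility label.
  def utility?(text) = UTILITY_LABELS.include?(normalize(text))

  # True when the text begins with a utility label.
  def utility_prefix?(text) = UTILITY_PREFIX.match?(normalize(text))

  # True when the text begins with a recommendation phrase.
  def recommended?(text) = RECOMMENDED.match?(normalize(text))

  private

  # Accepts any #to_s value, trims it, lowercases it, and collapses whitespace runs.
  def normalize(text) = text.to_s.strip.downcase.gsub(/\s+/, " ")
end

classifier = SketchTextClassifier.new
puts classifier.utility?("Share")
puts classifier.utility_prefix?("Share this article")
puts classifier.recommended?("Recommended for you")
```

Separating the exact match from the prefix match lets a caller treat a bare label such as "Share" differently from longer text that merely starts with one, while the word-boundary check keeps ordinary words like "Shareholders" from being treated as utility text.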