Class: Html2rss::AutoSource::Scraper::LinkHeuristics

Inherits:
Object
Defined in:
lib/html2rss/auto_source/scraper/link_heuristics.rb

Overview

Shared link-level heuristics used by scraper-local selection and scoring. Centralizing them keeps normalization and route/text classification consistent without moving scraper policy into higher-level orchestration.

Defined Under Namespace

Classes: ConfidenceClassifier, DestinationFacts, HrefExtractor, LeadingSegments, PathClassifier, PostSuffixClassifier, TextClassifier

Instance Method Summary

Constructor Details

#initialize(base_url) ⇒ LinkHeuristics

Returns a new instance of LinkHeuristics.

Parameters:

  • base_url (String, Html2rss::Url)

    page URL used to resolve relative hrefs



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 414

def initialize(base_url)
  @base_url = base_url
  @text_classifier = TextClassifier.new
end

Instance Method Details

#destination_facts(anchor_or_href) ⇒ DestinationFacts?

Builds normalized destination facts for an anchor element or href string.

Parameters:

  • anchor_or_href (Nokogiri::XML::Element, String, #to_s)

    anchor element or href-like value

Returns:

  • (DestinationFacts, nil)

    normalized destination facts, or nil for blank/invalid URLs



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 423

def destination_facts(anchor_or_href)
  href = HrefExtractor.call(anchor_or_href)
  return unless href

  url = Html2rss::Url.from_relative(href, @base_url)
  DestinationFacts.build(url)
rescue ArgumentError
  nil
end
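
The resolve-or-nil contract described above can be sketched without the library. This is a minimal illustration, not the html2rss implementation: `resolve_href` is a hypothetical helper, and it uses Ruby's stdlib `URI.join` in place of `Html2rss::Url.from_relative`, so it rescues `URI::InvalidURIError` rather than `ArgumentError`. The real method may normalize URLs differently.

```ruby
require "uri"

# Hypothetical stand-in for the resolution step in #destination_facts.
# Assumption: stdlib URI is used here instead of Html2rss::Url.
def resolve_href(href, base_url)
  href = href.to_s.strip
  return nil if href.empty? # blank hrefs yield nil

  URI.join(base_url.to_s, href).to_s
rescue URI::InvalidURIError
  nil # unparsable hrefs yield nil instead of raising
end

base = "https://example.com/blog/index.html"

resolve_href("/posts/1", base)  # => "https://example.com/posts/1"
resolve_href("archive/2", base) # => "https://example.com/blog/archive/2"
resolve_href("   ", base)       # => nil
resolve_href("bad href", base)  # => nil
```

Returning nil for both blank and invalid input lets callers filter candidates with a single `compact` or nil check instead of handling exceptions per anchor.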

#recommended_text?(text) ⇒ Boolean

Returns true when text identifies recommendation chrome.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text identifies recommendation chrome



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 443

def recommended_text?(text) = @text_classifier.recommended?(text)

#utility_prefix_text?(text) ⇒ Boolean

Returns true when text begins with a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text begins with a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 439

def utility_prefix_text?(text) = @text_classifier.utility_prefix?(text)

#utility_text?(text) ⇒ Boolean

Returns true when text matches a utility label.

Parameters:

  • text (String, #to_s)

    visible anchor text

Returns:

  • (Boolean)

    true when text matches a utility label



# File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 435

def utility_text?(text) = @text_classifier.utility?(text)
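
The difference between a full match (`utility_text?`) and a prefix match (`utility_prefix_text?`) can be shown with a small sketch. Everything here is an assumption for illustration: `SketchTextClassifier`, `SketchHeuristics`, the label list, and the normalization rule are hypothetical and do not reflect the library's actual vocabulary or matching logic. Only the delegation shape mirrors the methods shown above.

```ruby
# Hypothetical stand-in for TextClassifier. The labels are invented
# for illustration; the real classifier's vocabulary is not shown here.
class SketchTextClassifier
  UTILITY_LABELS = ["share", "comments", "read more"].freeze

  # Full match: the normalized text is exactly one of the labels.
  def utility?(text) = UTILITY_LABELS.include?(normalize(text))

  # Prefix match: the normalized text starts with one of the labels.
  def utility_prefix?(text)
    normalized = normalize(text)
    UTILITY_LABELS.any? { |label| normalized.start_with?(label) }
  end

  private

  # Assumed normalization: stringify, trim, and lowercase.
  def normalize(text) = text.to_s.strip.downcase
end

# Thin wrapper that delegates each predicate to the classifier.
class SketchHeuristics
  def initialize
    @text_classifier = SketchTextClassifier.new
  end

  def utility_text?(text) = @text_classifier.utility?(text)
  def utility_prefix_text?(text) = @text_classifier.utility_prefix?(text)
end

heuristics = SketchHeuristics.new
heuristics.utility_text?("Share")                      # => true
heuristics.utility_text?("Share this article")         # => false
heuristics.utility_prefix_text?("Share this article")  # => true
```

Keeping the predicates as one-line delegations means the matching rules live in one classifier object while callers depend only on the heuristics facade.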