Class: Html2rss::AutoSource::Scraper::LinkHeuristics
- Inherits: Object
  - Object
    - Html2rss::AutoSource::Scraper::LinkHeuristics
- Defined in: lib/html2rss/auto_source/scraper/link_heuristics.rb
Overview
Shared link-level heuristics used by scraper-local selection and scoring. Keeping them in one place makes normalization and route/text classification consistent without moving scraper policy into higher-level orchestration.
Defined Under Namespace
Classes: DestinationFacts, HrefExtractor, PathClassifier, TextClassifier
Instance Method Summary
- #destination_facts(anchor_or_href) ⇒ DestinationFacts?
  Builds normalized destination facts for an anchor element or href string.
- #initialize(base_url) ⇒ LinkHeuristics (constructor)
  A new instance of LinkHeuristics.
- #recommended_text?(text) ⇒ Boolean
  True when text identifies recommendation chrome.
- #utility_prefix_text?(text) ⇒ Boolean
  True when text begins with a utility label.
- #utility_text?(text) ⇒ Boolean
  True when text matches a utility label.
Constructor Details
#initialize(base_url) ⇒ LinkHeuristics
Returns a new instance of LinkHeuristics.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 351

    def initialize(base_url)
      @base_url = base_url
      @text_classifier = TextClassifier.new
    end
Instance Method Details
#destination_facts(anchor_or_href) ⇒ DestinationFacts?
Builds normalized destination facts for an anchor element or href string.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 360

    def destination_facts(anchor_or_href)
      return node_facts[anchor_or_href] if node_facts.key?(anchor_or_href)

      href = HrefExtractor.call(anchor_or_href)
      return unless href

      res = memoized_destination_facts(href)
      node_facts[anchor_or_href] = res if anchor_or_href.is_a?(Nokogiri::XML::Node)
      res
    rescue ArgumentError
      nil
    end
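The method above appears to use two caches: one keyed by the input object (node_facts) and one keyed by the extracted href (memoized_destination_facts). The following is a minimal, standard-library-only sketch of that pattern, not the library's implementation. SketchFacts, SketchLinkCache, extract_href, and build_facts are hypothetical names. A plain Hash with an 'href' key stands in for a Nokogiri anchor node, and the fields of the real DestinationFacts are not shown on this page, so url and path are assumptions. The sketch also rescues URI::Error, which the original does not.

```ruby
require 'uri'

# Hypothetical stand-in for DestinationFacts; the real fields are not documented here.
SketchFacts = Struct.new(:url, :path)

# Hypothetical sketch of two-level memoization: per input object and per href.
class SketchLinkCache
  def initialize(base_url)
    @base_url = base_url
    @by_input = {}.compare_by_identity # stands in for node_facts
    @by_href = {}                      # stands in for memoized_destination_facts
  end

  # Accepts an href String or a Hash with an 'href' key (stand-in for an anchor node).
  def destination_facts(input)
    return @by_input[input] if @by_input.key?(input)

    href = extract_href(input)
    return unless href

    facts = (@by_href[href] ||= build_facts(href))
    # Only non-string inputs are cached by identity, mirroring the node check above.
    @by_input[input] = facts unless input.is_a?(String)
    facts
  rescue ArgumentError, URI::Error
    nil
  end

  private

  # Returns a stripped href, or nil when the input carries no usable href.
  def extract_href(input)
    raw = input.is_a?(String) ? input : input['href']
    raw = raw.to_s.strip
    raw.empty? ? nil : raw
  end

  # Resolves the href against the base URL.
  def build_facts(href)
    uri = URI.join(@base_url, href)
    SketchFacts.new(uri.to_s, uri.path)
  end
end

cache = SketchLinkCache.new('https://example.com/news/')
from_string = cache.destination_facts('/articles/1')
from_anchor = cache.destination_facts({ 'href' => '/articles/1' })
puts from_string.url
puts from_anchor.equal?(from_string)
```

In this sketch, an anchor and a bare href string that point to the same destination share one facts object, and an input without a usable href yields nil rather than raising.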
#recommended_text?(text) ⇒ Boolean
Returns true when text identifies recommendation chrome.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 384

    def recommended_text?(text) = @text_classifier.recommended?(text)
#utility_prefix_text?(text) ⇒ Boolean
Returns true when text begins with a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 380

    def utility_prefix_text?(text) = @text_classifier.utility_prefix?(text)
#utility_text?(text) ⇒ Boolean
Returns true when text matches a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 376

    def utility_text?(text) = @text_classifier.utility?(text)
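All three text predicates delegate to TextClassifier, whose label lists and matching rules are not shown on this page. The sketch below illustrates one way such a classifier could distinguish an exact match, a prefix match, and recommendation chrome. SketchTextClassifier, its label lists, and the normalize helper are hypothetical and chosen only for illustration. It assumes Ruby 3.0 or later for the endless method syntax, which the source also uses.

```ruby
# Hypothetical classifier; the labels below are illustrative, not the library's lists.
class SketchTextClassifier
  UTILITY_LABELS = ['share', 'print', 'log in', 'subscribe'].freeze
  RECOMMENDED_LABELS = ['recommended', 'you may also like'].freeze

  # True when the whole normalized text equals a utility label.
  def utility?(text) = UTILITY_LABELS.include?(normalize(text))

  # True when the normalized text starts with a utility label.
  def utility_prefix?(text)
    normalized = normalize(text)
    UTILITY_LABELS.any? { |label| normalized.start_with?(label) }
  end

  # True when the normalized text contains a recommendation label.
  def recommended?(text)
    normalized = normalize(text)
    RECOMMENDED_LABELS.any? { |label| normalized.include?(label) }
  end

  private

  # Trims, lowercases, and collapses whitespace; nil becomes an empty string.
  def normalize(text) = text.to_s.strip.downcase.gsub(/\s+/, ' ')
end

classifier = SketchTextClassifier.new
puts classifier.utility?('  Share ')
puts classifier.utility_prefix?('Share this story')
puts classifier.recommended?('Recommended for you')
```

A plain start_with? check also matches unrelated words that happen to begin with a label (for example "Shareholder news"), so a real classifier would likely need a word-boundary check.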