Class: Html2rss::AutoSource::Scraper::LinkHeuristics
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::LinkHeuristics
- Defined in: lib/html2rss/auto_source/scraper/link_heuristics.rb
Overview
Shared link-level heuristics used by scraper-local selection and scoring. This keeps normalization and route/text classification consistent without moving scraper policy into higher-level orchestration.
Defined Under Namespace
Classes: ConfidenceClassifier, DestinationFacts, HrefExtractor, LeadingSegments, PathClassifier, PostSuffixClassifier, TextClassifier
Instance Method Summary
- #destination_facts(anchor_or_href) ⇒ DestinationFacts?
  Builds normalized destination facts for an anchor element or href string.
- #initialize(base_url) ⇒ LinkHeuristics (constructor)
  A new instance of LinkHeuristics.
- #recommended_text?(text) ⇒ Boolean
  True when text identifies recommendation chrome.
- #utility_prefix_text?(text) ⇒ Boolean
  True when text begins with a utility label.
- #utility_text?(text) ⇒ Boolean
  True when text matches a utility label.
Constructor Details
#initialize(base_url) ⇒ LinkHeuristics
Returns a new instance of LinkHeuristics.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 414
    def initialize(base_url)
      @base_url = base_url
      @text_classifier = TextClassifier.new
    end
Instance Method Details
#destination_facts(anchor_or_href) ⇒ DestinationFacts?
Builds normalized destination facts for an anchor element or href string.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 423
    def destination_facts(anchor_or_href)
      href = HrefExtractor.call(anchor_or_href)
      return unless href

      url = Html2rss::Url.from_relative(href, @base_url)
      DestinationFacts.build(url)
    rescue ArgumentError
      nil
    end
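The method above returns nil in two cases: when no href can be extracted, and when resolving the href against the base URL raises ArgumentError. The same resolve-or-nil shape can be sketched with Ruby's standard library alone. This is only an illustration under stated assumptions: `resolve_href` is a hypothetical helper, `URI.join` stands in for `Html2rss::Url.from_relative`, and the stdlib signals failure with `URI::Error` rather than ArgumentError.

```ruby
require 'uri'

# Hypothetical stand-in for the resolve-or-nil flow; not the html2rss implementation.
# Returns an absolute URI, or nil when the href is missing or cannot be resolved.
def resolve_href(href, base_url)
  return nil if href.nil? || href.strip.empty?

  # URI.join resolves a relative reference against an absolute base URL.
  URI.join(base_url, href.strip)
rescue URI::Error
  # Covers URI::InvalidURIError (unparsable href) and URI::BadURIError (relative base).
  nil
end

base = 'https://example.com/blog/'
puts resolve_href('posts/1', base)
puts resolve_href('/about', base)
puts resolve_href('bad href with spaces', base).inspect
```

Returning nil instead of raising lets a caller skip unusable links with a simple guard, which matches the `DestinationFacts?` return type documented above.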
#recommended_text?(text) ⇒ Boolean
Returns true when text identifies recommendation chrome.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 443
    def recommended_text?(text) = @text_classifier.recommended?(text)
#utility_prefix_text?(text) ⇒ Boolean
Returns true when text begins with a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 439
    def utility_prefix_text?(text) = @text_classifier.utility_prefix?(text)
#utility_text?(text) ⇒ Boolean
Returns true when text matches a utility label.
    # File 'lib/html2rss/auto_source/scraper/link_heuristics.rb', line 435
    def utility_text?(text) = @text_classifier.utility?(text)
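The difference between "matches a utility label" and "begins with a utility label" may be easier to see in a small sketch. Everything here is an assumption made for illustration: the label list, the normalization, and the helper names `utility_label?` and `utility_prefix_label?` are hypothetical, and the real TextClassifier vocabulary and matching rules are not shown on this page.

```ruby
# Hypothetical label list; the real TextClassifier vocabulary is not documented here.
UTILITY_LABELS = ['share', 'log in', 'subscribe'].freeze

# Trim, lowercase, and collapse internal whitespace before comparing.
def normalize(text)
  text.to_s.strip.downcase.gsub(/\s+/, ' ')
end

# True when the whole text is exactly one utility label.
def utility_label?(text)
  UTILITY_LABELS.include?(normalize(text))
end

# True when the text starts with a utility label that ends at a word boundary.
def utility_prefix_label?(text)
  normalized = normalize(text)
  UTILITY_LABELS.any? { |label| normalized.match?(/\A#{Regexp.escape(label)}\b/) }
end

puts utility_label?('  Share ')
puts utility_label?('Share this article')
puts utility_prefix_label?('Share this article')
puts utility_prefix_label?('Shareholders meet')
```

The word-boundary check in the prefix variant is a design choice of this sketch: it keeps a headline such as "Shareholders meet" from being treated as a "Share" control.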