Class: Html2rss::AutoSource::Scraper::Html
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/html2rss/auto_source/scraper/html.rb
Overview
Scrapes article-like blocks from plain HTML by looking for repeated link structures when richer structured data is unavailable.
The approach is intentionally heuristic:
- collect repeated anchor paths
- walk upward to a shared container shape
- extract the best anchor found inside each container
This scraper is broader and noisier than `SemanticHtml`, so it acts as a fallback for pages without stronger semantic signals.
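The first step of the approach above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the method name `top_selectors` and the sample paths are hypothetical, the input paths are assumed to be index-free already, and the keyword defaults simply mirror the documented values of DEFAULT_MINIMUM_SELECTOR_FREQUENCY and DEFAULT_USE_TOP_SELECTORS.

```ruby
# Hypothetical helper: picks the anchor paths that repeat often enough to
# look like a list. Defaults mirror the documented constants (2 and 5).
def top_selectors(paths, min_frequency: 2, top: 5)
  paths.tally                                               # path => occurrences
       .select { |_path, count| count >= min_frequency }    # drop one-off paths
       .max_by(top) { |_path, count| count }                # most frequent first
       .map(&:first)                                        # keep only the path
end

# Assumed sample input: index-free paths of every anchor on a page.
paths = [
  '/html/body/main/div/a',
  '/html/body/main/div/a',
  '/html/body/main/div/a',
  '/html/body/footer/a',
  '/html/body/nav/ul/li/a',
  '/html/body/nav/ul/li/a'
]

selectors = top_selectors(paths)
```

In this sample the footer link occurs only once, so it is dropped as noise, while the repeated teaser and navigation paths survive as candidates for container extraction.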
Constant Summary
- DETECTION_BASE_URL = 'https://example.com'
  Absolute base URL used when probe-time detection needs to normalize relative hrefs.
- DEFAULT_MINIMUM_SELECTOR_FREQUENCY = 2
  Minimum selector frequency required to treat a path as a stable list signal.
- DEFAULT_USE_TOP_SELECTORS = 5
  Number of most frequent selectors kept for container extraction.
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
  Returns the value of attribute parsed_body.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.
- .options_key ⇒ Symbol
  Config key used to enable or configure this scraper.
- .simplify_xpath(xpath) ⇒ String
  Simplify an XPath selector by removing the index notation.
Instance Method Summary
- #article_tag_condition?(node) ⇒ Boolean
  Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.
- #each {|The| ... } ⇒ Enumerator
  Enumerator for the scraped articles.
- #extractable? ⇒ Boolean
  True when the scraper can likely extract articles.
- #initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html (constructor)
  A new instance of Html.
Constructor Details
#initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html
Returns a new instance of Html.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 60
    def initialize(parsed_body, url:, extractor: HtmlExtractor, **opts)
      @parsed_body = parsed_body
      @url = url
      @extractor = extractor
      @opts = opts
      @link_heuristics = LinkHeuristics.new(url)
      @ignored_cache = {}.compare_by_identity
    end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 69
    def parsed_body
      @parsed_body
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 40
    def self.articles?(parsed_body)
      new(parsed_body, url: DETECTION_BASE_URL).any?
    end
.options_key ⇒ Symbol
Returns the config key used to enable or configure this scraper.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 31
    def self.options_key = :html
.simplify_xpath(xpath) ⇒ String
Simplify an XPath selector by removing the index notation. This keeps repeated anchor paths comparable across sibling blocks.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 50
    def self.simplify_xpath(xpath)
      HtmlExtractor::ListCandidates.simplify_xpath(xpath)
    end
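The delegated implementation in `HtmlExtractor::ListCandidates` is not shown on this page. As an illustration only, removing index notation might look like the sketch below; the method name `simplify_xpath_sketch` and the regular expression are assumptions, not the library's code.

```ruby
# Hypothetical stand-in for the delegated implementation: strips positional
# predicates such as "[2]" so paths of sibling blocks compare equal.
def simplify_xpath_sketch(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

# Two anchors that sit in neighbouring list items of the same list.
first  = simplify_xpath_sketch('/html/body/div[1]/ul/li[1]/a')
second = simplify_xpath_sketch('/html/body/div[1]/ul/li[2]/a')

same_shape = first == second
```

Once the indexes are gone, both anchors share a single path, which is what allows repeated structures to be counted as one selector.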
Instance Method Details
#article_tag_condition?(node) ⇒ Boolean
Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.
The predicate prefers containers that add surrounding link context, which helps the scraper move from a leaf anchor toward a repeated teaser/card wrapper.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 96
    def article_tag_condition?(node)
      # Ignore tags that are below ignored DOM chrome.
      return false if HtmlExtractor.ignored_container_path?(node, @ignored_cache)
      return true if %w[body html].include?(node.name)
      return false unless (parent = node.parent)

      anchor_count(parent) > anchor_count(node)
    end
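To see how such a predicate can move from a leaf anchor to a card wrapper, the sketch below replays the comparison on a toy tree. Everything here is hypothetical: `ToyNode`, `build`, `boundary?`, and `container_for` are illustration names, `anchor_count` is assumed to count anchors at or below a node, and the ignored-container check is left out.

```ruby
# Toy DOM node; the real scraper works on parsed HTML nodes instead.
ToyNode = Struct.new(:name, :children, :parent)

# Builds a node and wires each child's parent pointer.
def build(name, *children)
  node = ToyNode.new(name, children, nil)
  children.each { |child| child.parent = node }
  node
end

# Assumed behaviour: number of anchors at or below the node.
def anchor_count(node)
  own = node.name == 'a' ? 1 : 0
  own + node.children.sum { |child| anchor_count(child) }
end

# Mirrors the documented predicate, minus the ignored-container check.
def boundary?(node)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  anchor_count(parent) > anchor_count(node)
end

# Walks upward from a leaf until the predicate accepts a node.
def container_for(node)
  node = node.parent until node.nil? || boundary?(node)
  node
end

# body > ul > (li > h2 > a) x 2
first_link  = build('a')
second_link = build('a')
first_card  = build('li', build('h2', first_link))
second_card = build('li', build('h2', second_link))
build('body', build('ul', first_card, second_card))

container = container_for(first_link)
```

In this toy tree the walk stops at the `li`, because its parent `ul` is the first ancestor that contains more anchors than the node itself. That is the "surrounding link context" the predicate prefers.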
#each {|The| ... } ⇒ Enumerator
Returns an Enumerator for the scraped articles when no block is given.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 74
    def each
      return enum_for(:each) unless block_given?

      articles.each { yield _1 }
    end
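The `enum_for` guard combined with `include Enumerable` is a common Ruby pattern. The sketch below shows it on a hypothetical `TeaserList` class with made-up titles; it is not part of html2rss and only illustrates why a single `#each` is enough to provide methods such as `any?` and `map`.

```ruby
# Hypothetical class used only to illustrate the pattern.
class TeaserList
  include Enumerable

  def initialize(titles)
    @titles = titles
  end

  # Same shape as the documented #each: without a block, hand back an
  # Enumerator; with a block, yield every element.
  def each
    return enum_for(:each) unless block_given?

    @titles.each { |title| yield title }
  end
end

list = TeaserList.new(['First post', 'Second post'])

enumerator = list.each        # no block given, so an Enumerator is returned
upcased    = list.map(&:upcase) # Enumerable#map is built on #each
present    = list.any?          # Enumerable#any? is built on #each as well
```

The same mechanism is what lets `.articles?` call `any?` directly on a freshly constructed scraper instance.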
#extractable? ⇒ Boolean
Returns true when the scraper can likely extract articles.
    # File 'lib/html2rss/auto_source/scraper/html.rb', line 82
    def extractable?
      articles.any?
    end