Class: Html2rss::AutoSource::Scraper::SemanticHtml
- Inherits: Object
  - Object
    - Html2rss::AutoSource::Scraper::SemanticHtml
- Includes: Enumerable
- Defined in:
  lib/html2rss/auto_source/scraper/semantic_html.rb,
  lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
  lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
Scrapes semantic containers by choosing one primary content link per block before extraction.
This scraper is intentionally container-first:
- collect candidate semantic containers once
- select the strongest content-like anchor within each container
- extract fields from the container while honoring that anchor choice
The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
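The container-first flow described above can be sketched in plain Ruby. This is a minimal illustration, not the library's implementation: the `Anchor` and `Container` structs, the `UTILITY_WORDS` list, and the scoring weights are all hypothetical stand-ins for parsed HTML nodes and for whatever ranking `AnchorSelector` actually applies.

```ruby
# Hypothetical stand-ins for parsed HTML nodes (not part of html2rss).
Anchor = Struct.new(:href, :text, keyword_init: true)
Container = Struct.new(:heading, :anchors, keyword_init: true)

# Assumed list of utility link labels; purely illustrative.
UTILITY_WORDS = %w[share comment comments login subscribe].freeze

# Scores an anchor; higher means more content-like. Weights are made up.
def anchor_score(anchor, heading)
  text = anchor.text.to_s.strip
  return 0 if anchor.href.to_s.empty? || anchor.href.start_with?('#')
  return 0 if UTILITY_WORDS.include?(text.downcase)

  score = 1
  score += 2 if !text.empty? && text == heading # anchor wraps the headline
  score += 1 if text.split.size >= 3            # longer text looks like a title
  score
end

# Picks one primary anchor per container, or nil when none is eligible.
def primary_anchor(container)
  best = container.anchors.max_by { |anchor| anchor_score(anchor, container.heading) }
  best if best && anchor_score(best, container.heading).positive?
end

# Container-first extraction: containers without a primary anchor are dropped.
def extract_articles(containers)
  containers.filter_map do |container|
    anchor = primary_anchor(container)
    next unless anchor

    { title: container.heading, url: anchor.href }
  end
end

containers = [
  Container.new(
    heading: 'Example headline story',
    anchors: [
      Anchor.new(href: '/news/example-story', text: 'Example headline story'),
      Anchor.new(href: '/news/example-story#comments', text: 'Comments'),
      Anchor.new(href: '/share?u=/news/example-story', text: 'Share')
    ]
  ),
  Container.new(heading: 'Sidebar', anchors: [Anchor.new(href: '#', text: 'Login')])
]

articles = extract_articles(containers)
```

Here the teaser card yields one article that points at the headline link, while the sidebar block is dropped because none of its anchors is eligible. That is the recall-for-quality trade-off the overview mentions.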
Defined Under Namespace
Classes: AnchorSelector, Deduplicator, Entry
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
  Returns the value of attribute parsed_body.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  True when at least one semantic container has an eligible anchor.
- .options_key ⇒ Symbol
  Config key used to enable or configure this scraper.
Instance Method Summary
- #each {|article_hash| ... } ⇒ Enumerator<Hash>
  Yields extracted article hashes for each semantic container that survives anchor selection.
- #extractable? ⇒ Boolean
  Reports whether the page contains at least one semantic container with a selectable primary anchor.
- #initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml (constructor)
  A new instance of SemanticHtml.
Constructor Details
#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml
Returns a new instance of SemanticHtml.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 53
    def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
      @parsed_body = parsed_body
      @url = url
      @extractor = extractor
      @link_heuristics = LinkHeuristics.new(url)
      @anchor_selector = AnchorSelector.new(url)
    end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 61
    def parsed_body
      @parsed_body
    end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns true when at least one semantic container has an eligible anchor.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 42
    def self.articles?(parsed_body)
      return false unless parsed_body

      new(parsed_body, url: 'https://example.com').extractable?
    end
.options_key ⇒ Symbol
Returns the config key used to enable or configure this scraper.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 38
    def self.options_key = :semantic_html
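As a rough illustration of how a per-scraper config key might be consumed, the sketch below selects scraper classes by looking up their `options_key` in a config hash. Everything except the `options_key` method name and the `:semantic_html` symbol is an assumption: the class names, the `SCRAPERS` list, the `enabled_scrapers` helper, and the `{ key => { enabled: ... } }` config shape are hypothetical and not confirmed by this page.

```ruby
# Hypothetical scraper classes; only the options_key convention mirrors the page.
class SemanticHtmlLike
  def self.options_key
    :semantic_html
  end
end

class OtherScraperLike
  def self.options_key
    :other_scraper
  end
end

# Assumed registry of available scrapers.
SCRAPERS = [SemanticHtmlLike, OtherScraperLike].freeze

# Returns the scrapers whose config entry is present and enabled.
# The { key => { enabled: true } } shape is an assumption for this sketch.
def enabled_scrapers(config)
  SCRAPERS.select { |scraper| config.dig(scraper.options_key, :enabled) }
end

config = {
  semantic_html: { enabled: true },
  other_scraper: { enabled: false }
}

enabled = enabled_scrapers(config)
```

Keying configuration by a symbol the class itself reports keeps the lookup table and the scraper list from drifting apart.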
Instance Method Details
#each {|article_hash| ... } ⇒ Enumerator<Hash>
Yields extracted article hashes for each semantic container that survives anchor selection.
Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 73
    def each
      return enum_for(:each) unless block_given?

      ranked_entries.each { yield _1.article }
    end
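The memoization described above, combined with the `enum_for` guard, can be sketched as follows. This is a simplified model under stated assumptions: `MemoizedScraper`, its `entries` helper, and the `ranking_runs` counter are hypothetical names that exist only to make the single ranking pass observable. The real class uses its own private helpers.

```ruby
# Hypothetical miniature of the pattern; not the library's class.
class MemoizedScraper
  include Enumerable

  # Counts how often the expensive ranking step ran (illustration only).
  attr_reader :ranking_runs

  def initialize(blocks)
    @blocks = blocks
    @ranking_runs = 0
  end

  # Detection: true when at least one entry survived ranking.
  def extractable?
    entries.any?
  end

  # Extraction: returns an Enumerator when called without a block.
  def each
    return enum_for(:each) unless block_given?

    entries.each { |entry| yield entry }
  end

  private

  # Expensive step, computed once and shared by detection and extraction.
  def entries
    @entries ||= begin
      @ranking_runs += 1
      @blocks.reject { |block| block[:url].nil? }
    end
  end
end

scraper = MemoizedScraper.new([{ url: '/a' }, { url: nil }, { url: '/b' }])
detected = scraper.extractable?               # first call runs the ranking
urls = scraper.map { |entry| entry[:url] }    # reuses the memoized entries
lazy = scraper.each                           # no block, so an Enumerator is returned
```

Because `each` is defined and `Enumerable` is included, methods such as `map`, `select`, and `first` are available on the scraper without further code.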
#extractable? ⇒ Boolean
Reports whether the page contains at least one semantic container with a selectable primary anchor.
    # File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 84
    def extractable?
      extractable_entries.any?
    end