Class: Html2rss::AutoSource::Scraper::SemanticHtml
- Inherits: Object
  - Object
  - Html2rss::AutoSource::Scraper::SemanticHtml
- Includes: Enumerable
- Defined in:
  lib/html2rss/auto_source/scraper/semantic_html.rb,
  lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
  lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb
Overview
Scrapes semantic containers by choosing one primary content link per block before extraction.
This scraper is intentionally container-first:
- collect candidate semantic containers once
- select the strongest content-like anchor within each container
- extract fields from the container while honoring that anchor choice
The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
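The container-first flow described above can be sketched in plain Ruby. This is only an illustration under stated assumptions: `Anchor`, `Container`, `UTILITY_PATH`, `score`, `primary_anchor`, and `extract` are hypothetical names, and the scoring rule is a toy stand-in for the real ranking done by AnchorSelector, which is not shown on this page.

```ruby
# Hypothetical stand-ins for parsed DOM nodes; the real scraper works on a parsed HTML body.
Anchor = Struct.new(:href, :text, :in_heading, keyword_init: true)
Container = Struct.new(:tag, :anchors, keyword_init: true)

# Assumed utility-path pattern, for illustration only.
UTILITY_PATH = %r{/(?:share|login|comments|tag)(?:/|\z)}

# Toy scoring rule (an assumption): heading links and longer link text rank higher,
# and utility-looking paths are rejected outright by returning nil.
def score(anchor)
  return nil if anchor.href.match?(UTILITY_PATH)

  (anchor.in_heading ? 10 : 0) + anchor.text.strip.length
end

# Pick the strongest content-like anchor in a container, or nil when none is eligible.
def primary_anchor(container)
  container.anchors
           .filter_map { |anchor| (points = score(anchor)) && [points, anchor] }
           .max_by(&:first)
           &.last
end

# Walk the containers once, drop those without a primary anchor, and build
# each article hash while honoring the chosen anchor.
def extract(containers)
  containers.filter_map do |container|
    anchor = primary_anchor(container)
    next unless anchor

    { url: anchor.href, title: anchor.text.strip }
  end
end

containers = [
  Container.new(tag: 'article', anchors: [
    Anchor.new(href: '/share/123', text: 'Share', in_heading: false),
    Anchor.new(href: '/news/big-story', text: 'Big story', in_heading: true),
    Anchor.new(href: '/news/big-story', text: '', in_heading: false) # image overlay duplicate
  ]),
  Container.new(tag: 'li', anchors: [
    Anchor.new(href: '/login', text: 'Sign in', in_heading: false)
  ])
]

articles = extract(containers)
```

In this sketch the second container yields nothing because its only link looks like a utility link, which mirrors the lower-recall, higher-link-quality trade-off described above.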
Defined Under Namespace
Classes: AnchorSelector, Deduplicator, Entry
Constant Summary
- CONTENT_REGEXP =
  Regexp to match content-related tokens.
    begin
      words = LinkHeuristics::PathClassifier::SEGMENT_SETS.fetch(:content)
      /(?:^|\s|[-_])(#{Regexp.union(words.to_a).source})(?:\s|[-_]|$)/i
    end.freeze
- JUNK_REGEXP =
  Regexp to match junk/utility-related tokens.
    begin
      words = LinkHeuristics::PathClassifier::SEGMENT_SETS.fetch(:utility)
      /(?:^|\s|[-_])(#{Regexp.union(words.to_a).source})(?:\s|[-_]|$)/i
    end.freeze
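To see what this regexp construction accepts, here is a small sketch that builds the same pattern from a made-up word list. The words and the helper name `build_token_regexp` are assumptions for illustration; the real lists live in LinkHeuristics::PathClassifier::SEGMENT_SETS and are not shown on this page.

```ruby
require 'set'

# Same construction as the constants: a word only counts when it is delimited
# by a line edge, whitespace, a hyphen, or an underscore.
def build_token_regexp(words)
  /(?:^|\s|[-_])(#{Regexp.union(words.to_a).source})(?:\s|[-_]|$)/i
end

# Hypothetical word set, not the library's actual list.
content_words = Set['article', 'post', 'story']
token_regexp = build_token_regexp(content_words)

token_regexp.match?('post-card')       # matches: "post" is followed by "-"
token_regexp.match?('Teaser Article')  # matches: case-insensitive, space-delimited
token_regexp.match?('main_story_link') # matches: delimited by underscores
token_regexp.match?('postscript')      # no match: "post" is only a prefix here
```

The delimiter groups are what keep substrings such as "postscript" from counting as a content token, which matters when the pattern is applied to class names or path segments.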
Instance Attribute Summary
- #parsed_body ⇒ Object (readonly)
  Returns the value of attribute parsed_body.
Class Method Summary
- .articles?(parsed_body) ⇒ Boolean
  True when at least one semantic container has an eligible anchor.
- .options_key ⇒ Symbol
  Config key used to enable or configure this scraper.
Instance Method Summary
- #each {|article_hash| ... } ⇒ Enumerator<Hash>
  Yields extracted article hashes for each semantic container that survives anchor selection.
- #extractable? ⇒ Boolean
  Reports whether the page contains at least one semantic container with a selectable primary anchor.
- #initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml (constructor)
  A new instance of SemanticHtml.
Constructor Details
#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml
Returns a new instance of SemanticHtml.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 65
def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @link_heuristics = LinkHeuristics.new(url)
  @anchor_selector = AnchorSelector.new(url)
end
Instance Attribute Details
#parsed_body ⇒ Object (readonly)
Returns the value of attribute parsed_body.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 73
def parsed_body
  @parsed_body
end
Class Method Details
.articles?(parsed_body) ⇒ Boolean
Returns true when at least one semantic container has an eligible anchor.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 54
def self.articles?(parsed_body)
  return false unless parsed_body

  new(parsed_body, url: 'https://example.com').extractable?
end
.options_key ⇒ Symbol
Returns the config key used to enable or configure this scraper.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 50
def self.options_key = :semantic_html
Instance Method Details
#each {|article_hash| ... } ⇒ Enumerator<Hash>
Yields extracted article hashes for each semantic container that survives anchor selection.
Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 85
def each
  return enum_for(:each) unless block_given?

  ranked_entries.each { yield _1.article }
end
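The combination of `enum_for`, `Enumerable`, and a memoized entry list can be illustrated with a small standalone class. This is a sketch, not the library's code: `MemoizedScraper`, its `entries` helper, and the `ranking_runs` counter are hypothetical, and the private helpers that SemanticHtml actually uses (`ranked_entries`, `extractable_entries`) are not shown on this page.

```ruby
# Hypothetical miniature of the pattern; not part of html2rss.
class MemoizedScraper
  include Enumerable

  # Counts how often the (pretend) ranking step ran, to make memoization visible.
  attr_reader :ranking_runs

  def initialize(blocks)
    @blocks = blocks
    @ranking_runs = 0
  end

  # Detection: true when at least one block survives selection.
  def extractable?
    entries.any?
  end

  # Extraction: without a block, return an Enumerator so callers can chain.
  def each
    return enum_for(:each) unless block_given?

    entries.each { |entry| yield entry }
  end

  private

  # Memoized so that detection and extraction rank the blocks only once.
  def entries
    @entries ||= begin
      @ranking_runs += 1
      @blocks.reject { |block| block[:url].nil? }
    end
  end
end

scraper = MemoizedScraper.new([{ url: '/a', title: 'A' }, { url: nil, title: 'widget' }])
scraper.extractable?                         # runs the ranking step
scraper.map { |article| article[:title] }    # reuses the memoized entries
scraper.each                                 # returns an Enumerator when no block is given
```

Because the class defines `each` and includes `Enumerable`, methods such as `map`, `select`, and `first` come for free, and the memoized list means a caller that checks `extractable?` before iterating pays for ranking only once.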
#extractable? ⇒ Boolean
Reports whether the page contains at least one semantic container with a selectable primary anchor.
# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 96
def extractable?
  extractable_entries.any?
end