Class: Html2rss::AutoSource::Scraper::SemanticHtml

Inherits: Object

Includes: Enumerable

Defined in: lib/html2rss/auto_source/scraper/semantic_html.rb,
lib/html2rss/auto_source/scraper/semantic_html/deduplicator.rb,
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb

Overview

Scrapes semantic containers by choosing one primary content link per block before extraction.

This scraper is intentionally container-first:

1. collect candidate semantic containers once
2. select the strongest content-like anchor within each container
3. extract fields from the container while honoring that anchor choice

The result is lower recall on weak-signal blocks, but much better link quality on modern teaser cards that mix headlines, utility links, and duplicate image overlays.
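The container-first flow can be sketched with plain Ruby objects. This is only an illustration under stated assumptions: `Anchor`, `UTILITY_WORDS`, `score`, `primary_anchor`, and `extract` are hypothetical names, and the scoring rule is invented for the example. The real class works on Nokogiri nodes and delegates the choice to `AnchorSelector`.

```ruby
# Hypothetical stand-ins: a container is a Hash holding a list of anchors.
Anchor = Struct.new(:href, :text, keyword_init: true)

# Assumed utility words, for illustration only.
UTILITY_WORDS = %w[share login comments].freeze

# Invented scoring rule: utility-looking links lose, longer link text wins.
def score(anchor)
  return -1 if UTILITY_WORDS.any? { |word| anchor.href.include?(word) }

  anchor.text.strip.length
end

# Step 2: pick the strongest anchor, or nil when nothing looks like content.
def primary_anchor(anchors)
  best = anchors.max_by { |anchor| score(anchor) }
  best if best && score(best).positive?
end

# Steps 1 to 3: walk the containers once, choose one anchor per container,
# and build an article hash that honors that choice.
def extract(containers)
  containers.filter_map do |container|
    anchor = primary_anchor(container[:anchors])
    next unless anchor

    { title: anchor.text.strip, url: anchor.href }
  end
end

containers = [
  { anchors: [Anchor.new(href: '/share?u=1', text: 'Share'),
              Anchor.new(href: '/news/big-story', text: 'Big story headline'),
              Anchor.new(href: '/news/big-story', text: '')] },
  { anchors: [Anchor.new(href: '/login', text: 'Sign in')] }
]

articles = extract(containers)
```

In this sketch the second container is dropped because its only link looks like a utility link, which mirrors the recall trade-off described above.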

Defined Under Namespace

Classes: AnchorSelector, Deduplicator, Entry

Constant Summary

CONTENT_REGEXP = Regexp to match content-related tokens.

begin
  words = LinkHeuristics::PathClassifier::SEGMENT_SETS.fetch(:content)
  /(?:^|\s|[-_])(#{Regexp.union(words.to_a).source})(?:\s|[-_]|$)/i
end.freeze

JUNK_REGEXP = Regexp to match junk/utility-related tokens.

begin
  words = LinkHeuristics::PathClassifier::SEGMENT_SETS.fetch(:utility)
  /(?:^|\s|[-_])(#{Regexp.union(words.to_a).source})(?:\s|[-_]|$)/i
end.freeze
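To see how a pattern of this shape behaves, here is a small sketch that builds one from an assumed word set. The words `article`, `post`, and `story` and the helper `content_token` are made up for the example; the real sets come from `LinkHeuristics::PathClassifier::SEGMENT_SETS`.

```ruby
require 'set'

# Assumed word set, for illustration only.
content_words = Set['article', 'post', 'story']

# Same shape as CONTENT_REGEXP: a word only counts as a token when it is
# delimited by the start or end of a line, whitespace, '-' or '_'.
CONTENT_TOKEN = /(?:^|\s|[-_])(#{Regexp.union(content_words.to_a).source})(?:\s|[-_]|$)/i

# Hypothetical helper: returns the matched token, or nil when there is none.
def content_token(value)
  value[CONTENT_TOKEN, 1]
end

content_token('teaser-article') # token delimited by '-' and end of string
content_token('Post card')      # case-insensitive, delimited by whitespace
content_token('particles')      # nil: "article" is embedded in a longer word
```

The delimiter groups are what keep substrings inside longer words from counting as tokens.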

Instance Attribute Summary

#parsed_body ⇒ Object readonly

Returns the value of attribute parsed_body.

Class Method Summary

.articles?(parsed_body) ⇒ Boolean

True when at least one semantic container has an eligible anchor.

.options_key ⇒ Symbol

Config key used to enable or configure this scraper.

Instance Method Summary

#each {|article_hash| ... } ⇒ Enumerator<Hash>

Yields extracted article hashes for each semantic container that survives anchor selection.

#extractable? ⇒ Boolean

Reports whether the page contains at least one semantic container with a selectable primary anchor.

#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ SemanticHtml constructor

A new instance of SemanticHtml.

Constructor Details

#initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts) ⇒ `SemanticHtml`

Returns a new instance of SemanticHtml.

Parameters:

parsed_body (Nokogiri::HTML::Document): parsed HTML document
url (String, Html2rss::Url): base URL
extractor (Class) (defaults to: HtmlExtractor): extractor class used for article extraction
_opts (Hash): scraper-specific options

Options Hash (**_opts):

:_reserved (Object): reserved for future scraper-specific options

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 65

def initialize(parsed_body, url:, extractor: HtmlExtractor, **_opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @link_heuristics = LinkHeuristics.new(url)
  @anchor_selector = AnchorSelector.new(url)
end

Instance Attribute Details

#parsed_body ⇒ `Object` (readonly)

Returns the value of attribute parsed_body.



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 73

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ `Boolean`

Returns true when at least one semantic container has an eligible anchor.

Parameters:

parsed_body (Nokogiri::HTML::Document): parsed HTML document

Returns:

(Boolean): true when at least one semantic container has an eligible anchor

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 54

def self.articles?(parsed_body)
  return false unless parsed_body

  new(parsed_body, url: 'https://example.com').extractable?
end

.options_key ⇒ `Symbol`

Returns config key used to enable or configure this scraper.

Returns:

(Symbol): config key used to enable or configure this scraper

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 50

def self.options_key = :semantic_html

Instance Method Details

#each {|article_hash| ... } ⇒ `Enumerator<Hash>`

Yields extracted article hashes for each semantic container that survives anchor selection.

Detection and extraction share the same memoized entry list so this scraper does not rerun anchor ranking once a page has already been accepted as extractable.

Yield Parameters:

article_hash (Hash): extracted article hash

Returns:

(Enumerator<Hash>)

# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 85

def each
  return enum_for(:each) unless block_given?

  ranked_entries.each { yield _1.article }
end
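The pattern used by `#each` (return an Enumerator when no block is given, and let detection and extraction read from one memoized list) can be illustrated with a miniature class. Everything here except the method names `each` and `extractable?` is hypothetical; `TinyScraper`, `entries`, and `ranking_runs` are not part of html2rss.

```ruby
# Hypothetical miniature, not the html2rss implementation.
class TinyScraper
  include Enumerable

  # Counts how often the (pretend) expensive ranking pass ran.
  attr_reader :ranking_runs

  def initialize(blocks)
    @blocks = blocks
    @ranking_runs = 0
  end

  # Detection: true when at least one entry survives.
  def extractable?
    entries.any?
  end

  # Extraction: yields each entry, or returns an Enumerator without a block.
  def each
    return enum_for(:each) unless block_given?

    entries.each { |entry| yield entry }
  end

  private

  # Memoized so detection and extraction share a single ranking pass.
  def entries
    @entries ||= begin
      @ranking_runs += 1
      @blocks.compact.map { |title| { title: title } }
    end
  end
end

scraper = TinyScraper.new(['First', nil, 'Second'])
scraper.extractable?                               # builds the entry list
titles = scraper.map { |article| article[:title] } # reuses the same list
enumerator = scraper.each                          # no block given
```

Because `Enumerable` is included, methods such as `map`, `select`, and `first` come for free once `each` is defined.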

#extractable? ⇒ `Boolean`

Reports whether the page contains at least one semantic container with a selectable primary anchor.

Returns:

(Boolean): true when at least one candidate container yields a primary anchor



# File 'lib/html2rss/auto_source/scraper/semantic_html.rb', line 96

def extractable?
  extractable_entries.any?
end
