Class: Html2rss::AutoSource::Scraper::Html

Inherits: Object

Object
Html2rss::AutoSource::Scraper::Html

Includes: Enumerable

Defined in: lib/html2rss/auto_source/scraper/html.rb

Overview

Scrapes article-like blocks from plain HTML by looking for repeated link structures when richer structured data is unavailable.

The approach is intentionally heuristic:

collect repeated anchor paths
walk upward to a shared container shape
extract the best anchor found inside each container

This scraper is broader and noisier than `SemanticHtml`, so it acts as a fallback for pages without stronger semantic signals.
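The first step of the heuristic, keeping only anchor paths that repeat often enough, can be sketched in plain Ruby. This is a rough illustration rather than the library's implementation: `stable_selectors` is a hypothetical helper, the input paths are assumed to be index-free already, and the threshold values are examples, not the actual defaults.

```ruby
# Hypothetical helper, not part of html2rss: keep the most frequent anchor
# paths that occur at least `minimum_frequency` times.
def stable_selectors(simplified_paths, minimum_frequency: 2, top: 5)
  simplified_paths
    .tally                                             # path => occurrence count
    .select { |_path, count| count >= minimum_frequency }
    .max_by(top) { |_path, count| count }              # most frequent first
    .map(&:first)                                      # keep only the paths
end

# Example input: three teaser links, two navigation links, one footer link.
paths = [
  '/html/body/main/div/h2/a',
  '/html/body/main/div/h2/a',
  '/html/body/main/div/h2/a',
  '/html/body/nav/ul/li/a',
  '/html/body/nav/ul/li/a',
  '/html/body/footer/a'
]

top_selector = stable_selectors(paths, minimum_frequency: 2, top: 1)
all_stable   = stable_selectors(paths)
```

Here the footer path is dropped because it occurs only once, and limiting `top` to 1 keeps only the teaser path. The two keyword arguments play the roles described for `:minimum_selector_frequency` and `:use_top_selectors`.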

Constant Summary

DETECTION_BASE_URL = 'https://example.com'

Absolute base URL used when probe-time detection needs to normalize relative hrefs.

DEFAULT_MINIMUM_SELECTOR_FREQUENCY

Minimum selector frequency required to treat a path as a stable list signal.

DEFAULT_USE_TOP_SELECTORS

Number of most frequent selectors kept for container extraction.

Instance Attribute Summary

#parsed_body ⇒ Object readonly

Returns the value of attribute parsed_body.

Class Method Summary

.articles?(parsed_body) ⇒ Boolean

Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.

.options_key ⇒ Symbol

Config key used to enable or configure this scraper.

.simplify_xpath(xpath) ⇒ String

Simplify an XPath selector by removing the index notation.

Instance Method Summary

#article_tag_condition?(node) ⇒ Boolean

Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.

#each {|article| ... } ⇒ Enumerator

Yields each scraped article, or returns an Enumerator for the scraped articles when no block is given.

#extractable? ⇒ Boolean

True when the scraper can likely extract articles.

#initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ Html constructor

A new instance of Html.

Constructor Details

#initialize(parsed_body, url:, extractor: HtmlExtractor, **opts) ⇒ `Html`

Returns a new instance of Html.

Parameters:

parsed_body (Nokogiri::HTML::Document) —

The parsed HTML document.

url (String) —

The base URL.

extractor (Class) (defaults to: HtmlExtractor) —

The extractor class to handle article extraction.

opts (Hash) —

Additional options.

Options Hash (**opts):

:minimum_selector_frequency (Integer) —

minimum count before a selector is considered stable

:use_top_selectors (Integer) —

number of top selectors to keep

# File 'lib/html2rss/auto_source/scraper/html.rb', line 60

def initialize(parsed_body, url:, extractor: HtmlExtractor, **opts)
  @parsed_body = parsed_body
  @url = url
  @extractor = extractor
  @opts = opts
  @link_heuristics = LinkHeuristics.new(url)
  @ignored_cache = {}.compare_by_identity
end

Instance Attribute Details

#parsed_body ⇒ `Object` (readonly)

Returns the value of attribute parsed_body.




# File 'lib/html2rss/auto_source/scraper/html.rb', line 69

def parsed_body
  @parsed_body
end

Class Method Details

.articles?(parsed_body) ⇒ `Boolean`

Probes whether the document appears to contain repeated anchor structures that this fallback scraper can cluster into article-like containers.

Parameters:

parsed_body (Nokogiri::HTML::Document) —

parsed HTML document

Returns:

(Boolean) —

true when the scraper can likely extract articles




# File 'lib/html2rss/auto_source/scraper/html.rb', line 40

def self.articles?(parsed_body)
  new(parsed_body, url: DETECTION_BASE_URL).any?
end

.options_key ⇒ `Symbol`

Returns the config key used to enable or configure this scraper.

Returns:

(Symbol) —

config key used to enable or configure this scraper

# File 'lib/html2rss/auto_source/scraper/html.rb', line 31

def self.options_key = :html

.simplify_xpath(xpath) ⇒ `String`

Simplify an XPath selector by removing the index notation. This keeps repeated anchor paths comparable across sibling blocks.

Parameters:

xpath (String) —

original XPath

Returns:

(String) —

XPath without positional indexes




# File 'lib/html2rss/auto_source/scraper/html.rb', line 50

def self.simplify_xpath(xpath)
  HtmlExtractor::ListCandidates.simplify_xpath(xpath)
end
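
The effect of removing index notation can be illustrated without the library. The sketch below is an assumption about the behavior, not the code behind `HtmlExtractor::ListCandidates.simplify_xpath`: `strip_xpath_indexes` is a hypothetical stand-in that strips numeric positional predicates such as `[2]`.

```ruby
# Hypothetical stand-in for simplify_xpath: remove numeric positional
# predicates so that sibling paths share one comparable shape.
def strip_xpath_indexes(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

first  = strip_xpath_indexes('/html/body/main/div[1]/h2/a')
second = strip_xpath_indexes('/html/body/main/div[2]/h2/a')

# Both anchors now map to the same path, so they can be counted together.
same_shape = first == second
```

Once sibling anchors map to the same string, counting how often each path occurs becomes a plain frequency tally.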

Instance Method Details

#article_tag_condition?(node) ⇒ `Boolean`

Decides whether a traversed node has reached a useful article-like boundary for the generic HTML scraper.

The predicate prefers containers that add surrounding link context, which helps the scraper move from a leaf anchor toward a repeated teaser/card wrapper.

Parameters:

node (Nokogiri::XML::Node) —

candidate boundary node

Returns:

(Boolean) —

true when the node is a good extraction boundary

# File 'lib/html2rss/auto_source/scraper/html.rb', line 96

def article_tag_condition?(node)
  # Ignore tags that are below ignored DOM chrome.
  return false if HtmlExtractor.ignored_container_path?(node, @ignored_cache)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  anchor_count(parent) > anchor_count(node)
end
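
To show why comparing anchor counts with the parent finds a card-like wrapper, the sketch below replays the predicate on a toy tree. Everything here is an assumption made for illustration: `ToyNode` stands in for `Nokogiri::XML::Node`, anchor counts are stored rather than computed, the ignored-container check is omitted, and `climb_to_boundary` is a hypothetical caller, not a method of this class.

```ruby
# Toy node with a precomputed anchor count (stand-in for a Nokogiri node).
ToyNode = Struct.new(:name, :anchor_count, :parent)

# Same shape as the predicate above, without the ignored-container check.
def boundary?(node)
  return true if %w[body html].include?(node.name)
  return false unless (parent = node.parent)

  parent.anchor_count > node.anchor_count
end

# Hypothetical caller: walk upward from a node until the predicate accepts it.
def climb_to_boundary(node)
  node = node.parent while node && !boundary?(node)
  node
end

body    = ToyNode.new('body', 5, nil)
list    = ToyNode.new('ul', 3, body)   # holds three teaser cards
card    = ToyNode.new('li', 1, list)   # one teaser with a single link
heading = ToyNode.new('h2', 1, card)
anchor  = ToyNode.new('a', 1, heading)

boundary = climb_to_boundary(anchor)
```

The walk passes the `a` and `h2` nodes because their parents contain no additional anchors, and stops at the `li`, the largest wrapper that still belongs to a single item.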

#each {|article| ... } ⇒ `Enumerator`

Yields each scraped article to the block. Without a block, returns an Enumerator for the scraped articles.

Yield Parameters:

article (Hash) —

scraped article hash

Returns:

(Enumerator) —

Enumerator for the scraped articles when no block is given

# File 'lib/html2rss/auto_source/scraper/html.rb', line 74

def each
  return enum_for(:each) unless block_given?

  articles.each { yield _1 }
end
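
Because the class includes `Enumerable` and defines `#each`, methods such as `any?` (used by `.articles?`) and `map` are available without further code. The sketch below demonstrates the pattern with a hypothetical `ToyScraper` that wraps a plain array; it is not the scraper itself, and the article hashes are made-up sample data.

```ruby
# Hypothetical collection with the same #each shape as the scraper.
class ToyScraper
  include Enumerable

  def initialize(articles)
    @articles = articles
  end

  # Yields each article, or returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    @articles.each { |article| yield article }
  end
end

scraper = ToyScraper.new([{ title: 'First', url: '/1' }, { title: 'Second', url: '/2' }])

titles        = scraper.map { |article| article[:title] } # provided by Enumerable
first_article = scraper.each.next                         # external enumeration
has_articles  = scraper.any?
```

The `enum_for(:each)` guard is what makes the block-free call `scraper.each.next` work.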

#extractable? ⇒ `Boolean`

Returns true when the scraper can likely extract articles.

Returns:

(Boolean) —

true when the scraper can likely extract articles




# File 'lib/html2rss/auto_source/scraper/html.rb', line 82

def extractable?
  articles.any?
end
