Class: Html2rss::HtmlExtractor::SemanticContainers

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/semantic_containers.rb

Overview

Collects semantic content containers from a parsed HTML document.

Constant Summary

SELECTORS =

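Candidate selectors used to locate extractable semantic content blocks. Each selector matches only the innermost element of its tag, that is, an element with no descendant of the same tag.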

[
  'article:not(:has(article))',
  'section:not(:has(section))',
  'li:not(:has(li))',
  'tr:not(:has(tr))',
  'div:not(:has(div))'
].freeze
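The `tag:not(:has(tag))` shape of these selectors can be illustrated without a parser. The following is a minimal sketch in plain Ruby, not code from html2rss: `Node` and `innermost` are hypothetical names, and the tiny tree stands in for a parsed document.

```ruby
# Hypothetical stand-in for a parsed element; not part of html2rss or Nokogiri.
Node = Struct.new(:tag, :children) do
  # Every node below this one, depth-first.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

# Mirrors the meaning of `tag:not(:has(tag))`: keep elements of the given tag
# that contain no descendant with the same tag.
def innermost(root, tag)
  [root, *root.descendants].select do |node|
    node.tag == tag && node.descendants.none? { |d| d.tag == tag }
  end
end

inner_a = Node.new('article', [])
inner_b = Node.new('article', [Node.new('p', [])])
outer   = Node.new('article', [inner_a, Node.new('div', [inner_b])])
root    = Node.new('body', [outer])

# The outer article wraps other articles, so only the two inner ones remain.
innermost(root, 'article').size # => 2
```

Under this reading, a wrapper element is never returned together with the same-tag elements it contains; only the leaves of each tag family become candidates.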

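Class Method Summary

  • .call(parsed_body) ⇒ Array<Nokogiri::XML::Node>

    Returns candidate semantic containers.

Instance Method Summary

  • #call ⇒ Array<Nokogiri::XML::Node>

    Returns candidate semantic containers.

  • #initialize(parsed_body) ⇒ SemanticContainers

    Returns a new instance of SemanticContainers.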

Constructor Details

#initialize(parsed_body) ⇒ SemanticContainers

Returns a new instance of SemanticContainers.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed document



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 24

def initialize(parsed_body)
  @parsed_body = parsed_body
end

Class Method Details

.call(parsed_body) ⇒ Array<Nokogiri::XML::Node>

Returns candidate semantic containers.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed document

Returns:

  • (Array<Nokogiri::XML::Node>)

    candidate semantic containers



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 19

def self.call(parsed_body)
  new(parsed_body).call
end
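The class-level `.call` is a thin wrapper that builds an instance and invokes `#call` on it. A minimal sketch of that delegation pattern, using a hypothetical `WordCounter` class that is not part of html2rss:

```ruby
# Hypothetical class; it only illustrates the `.call` -> `new(...).call` shape.
class WordCounter
  # Class-level entry point: callers need no instance of their own.
  def self.call(text)
    new(text).call
  end

  def initialize(text)
    @text = text
  end

  # Instance-level work, reading the state captured by the constructor.
  def call
    @text.split.size
  end
end

WordCounter.call('one two three') # => 3
```

The same shape lets `SemanticContainers.call(parsed_body)` be used as a one-liner while the instance keeps `parsed_body` in an instance variable.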

Instance Method Details

#call ⇒ Array<Nokogiri::XML::Node>

Returns candidate semantic containers, deepest first; containers at the same depth keep document order.

Returns:

  • (Array<Nokogiri::XML::Node>)

    candidate semantic containers



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 29

def call
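  # Identity-keyed hash handed to every ignored_container_path? check,
  # so keys are compared by object identity rather than by #==.
  cache = {}.compare_by_identity
  candidates = @parsed_body.css(SELECTORS.join(',')).reject do |node|
    HtmlExtractor.ignored_container_path?(node, cache)
  end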

  # Preserve the original post-order traversal intent (specific-first)
  # by sorting candidates by depth (descending) while keeping original document
  # order for nodes at the same depth.
  candidates.each_with_index
            .sort_by { |node, index| [-node.ancestors.size, index] }
            .map!(&:first)
end
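The ordering step and the identity-keyed hash can be tried in isolation. This is a minimal sketch under stated assumptions: `Candidate` and `deepest_first` are hypothetical names, and a plain `depth` number stands in for `node.ancestors.size`.

```ruby
# Hypothetical (name, depth) pairs standing in for nodes and their ancestor counts.
Candidate = Struct.new(:name, :depth)

# Deepest first. Equal depths keep their input order because the original
# index is the secondary sort key; sort_by alone is not guaranteed stable.
def deepest_first(candidates)
  candidates.each_with_index
            .sort_by { |candidate, index| [-candidate.depth, index] }
            .map(&:first)
end

input = [
  Candidate.new('section', 2),
  Candidate.new('li', 4),
  Candidate.new('div', 2),
  Candidate.new('tr', 4)
]
deepest_first(input).map(&:name) # => ["li", "tr", "section", "div"]

# Identity-keyed hash: equal-looking but distinct objects get separate entries.
seen = {}.compare_by_identity
first  = Candidate.new('li', 3)
second = Candidate.new('li', 3)
seen[first] = true
seen.key?(second) # => false, although first == second
```

Carrying the index into the sort key is what makes the tie-break deterministic. An identity-keyed hash avoids treating two distinct but equal-looking nodes as the same cache entry.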