Class: Html2rss::HtmlExtractor::SemanticContainers

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/semantic_containers.rb

Overview

Collects semantic content containers from a parsed HTML document.

Constant Summary

SELECTORS =

Candidate selectors used to locate extractable semantic content blocks.

[
  'article:not(:has(article))',
  'section:not(:has(section))',
  'li:not(:has(li))',
  'tr:not(:has(tr))',
  'div:not(:has(div))'
].freeze
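
Each selector pairs a tag with `:not(:has(<same tag>))`, so it should match only the innermost element when elements of the same tag are nested. The sketch below illustrates that rule in plain Ruby. It is not html2rss code: `Element` and `innermost` are hypothetical stand-ins, and no Nokogiri is involved.

```ruby
# Hypothetical stand-in for a parsed element; not part of html2rss or Nokogiri.
Element = Struct.new(:name, :text, :children) do
  # Every element nested anywhere below this one, in document order.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

# Mirrors the intent of "name:not(:has(name))": keep elements called `name`
# that contain no nested element of the same name.
def innermost(root, name)
  [root, *root.descendants].select do |element|
    element.name == name && element.descendants.none? { |d| d.name == name }
  end
end

inner_a = Element.new('li', 'first', [])
inner_b = Element.new('li', 'second', [])
outer   = Element.new('li', 'wrapper', [Element.new('ul', nil, [inner_a, inner_b])])
root    = Element.new('ul', nil, [outer])

matches = innermost(root, 'li')
```

Here `matches` holds only the two nested items. The outer `li` is skipped because it wraps other `li` elements.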

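Class Method Summary

  • .call(parsed_body) ⇒ Array<Nokogiri::XML::Node>

    Returns candidate semantic containers.

Instance Method Summary

  • #call ⇒ Array<Nokogiri::XML::Node>

    Returns candidate semantic containers.

  • #initialize(parsed_body) ⇒ SemanticContainers

    Returns a new instance of SemanticContainers.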

Constructor Details

#initialize(parsed_body) ⇒ SemanticContainers

Returns a new instance of SemanticContainers.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed document



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 24

def initialize(parsed_body)
  @parsed_body = parsed_body
end

Class Method Details

.call(parsed_body) ⇒ Array<Nokogiri::XML::Node>

Returns candidate semantic containers.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed document

Returns:

  • (Array<Nokogiri::XML::Node>)

    candidate semantic containers



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 19

def self.call(parsed_body)
  new(parsed_body).call
end

Instance Method Details

#call ⇒ Array<Nokogiri::XML::Node>

Returns candidate semantic containers.

Returns:

  • (Array<Nokogiri::XML::Node>)

    candidate semantic containers



# File 'lib/html2rss/html_extractor/semantic_containers.rb', line 29

def call
  containers = SELECTORS.each_with_object([]) do |selector, memo|
    collect_selector_containers(selector, memo)
  end

  containers.sort_by { document_order.fetch(_1) }
end
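
The method gathers matches selector by selector, then restores document order. Below is a minimal sketch of that two-step flow, using plain strings in place of Nokogiri nodes. The names `document_nodes` and `matches_by_selector` and the position lookup are assumptions made for illustration, since the private helpers `collect_selector_containers` and `document_order` are not shown on this page.

```ruby
# Hypothetical stand-ins: strings play the role of nodes, listed in document order.
document_nodes = %w[header article-1 item-1 article-2 item-2]

# Hypothetical result of running each selector against the document.
matches_by_selector = {
  'article' => %w[article-1 article-2],
  'li'      => %w[item-1 item-2]
}

# Node => position lookup, built once so sorting does not rescan the document.
document_order = document_nodes.each_with_index.to_h

# Step 1: accumulate matches selector by selector (grouped by selector).
containers = matches_by_selector.each_with_object([]) do |(_selector, nodes), memo|
  memo.concat(nodes)
end

# Step 2: reorder by position; fetch raises KeyError for an unknown node
# instead of silently sorting against nil.
ordered = containers.sort_by { |node| document_order.fetch(node) }
```

After step 1 the list is grouped by selector (`article-1, article-2, item-1, item-2`). After step 2 it follows the document (`article-1, item-1, article-2, item-2`).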