Class: Html2rss::HtmlExtractor::SemanticContainers
- Inherits: Object
- Object
- Html2rss::HtmlExtractor::SemanticContainers
- Defined in: lib/html2rss/html_extractor/semantic_containers.rb
Overview
Collects semantic content containers from a parsed HTML document.
Constant Summary
- SELECTORS =
  Candidate selectors used to locate extractable semantic content blocks.
    [
      'article:not(:has(article))',
      'section:not(:has(section))',
      'li:not(:has(li))',
      'tr:not(:has(tr))',
      'div:not(:has(div))'
    ].freeze
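Each selector has the shape tag:not(:has(tag)), so it matches only the innermost element of a given tag. The following is a minimal sketch of that rule in plain Ruby, not the library's implementation: TreeNode, CONTAINER_TAGS, and innermost_containers are hypothetical names introduced for illustration, and a real document would be parsed and queried through Nokogiri.

```ruby
# Hypothetical minimal tree node; stands in for a parsed HTML element.
TreeNode = Struct.new(:tag, :children) do
  # All nodes below this one, depth-first in document order.
  def descendants
    children.flat_map { |child| [child, *child.descendants] }
  end
end

# Tags covered by SELECTORS (assumed list, copied from the constant above).
CONTAINER_TAGS = %w[article section li tr div].freeze

# Mirrors "tag:not(:has(tag))": keep a node only if none of its
# descendants shares its tag.
def innermost_containers(root)
  [root, *root.descendants].select do |node|
    CONTAINER_TAGS.include?(node.tag) &&
      node.descendants.none? { |descendant| descendant.tag == node.tag }
  end
end

inner = TreeNode.new('div', [TreeNode.new('p', [])])
outer = TreeNode.new('div', [inner, TreeNode.new('li', [])])

# The outer div is dropped because it contains another div.
puts innermost_containers(outer).map(&:tag).inspect # => ["div", "li"]
```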
Class Method Summary
- .call(parsed_body) ⇒ Array<Nokogiri::XML::Node>
  Candidate semantic containers.
Instance Method Summary
- #call ⇒ Array<Nokogiri::XML::Node>
  Candidate semantic containers.
- #initialize(parsed_body) ⇒ SemanticContainers
  constructor
  A new instance of SemanticContainers.
Constructor Details
#initialize(parsed_body) ⇒ SemanticContainers
Returns a new instance of SemanticContainers.
    # File 'lib/html2rss/html_extractor/semantic_containers.rb', line 24
    def initialize(parsed_body)
      @parsed_body = parsed_body
    end
Class Method Details
.call(parsed_body) ⇒ Array<Nokogiri::XML::Node>
Returns candidate semantic containers.
    # File 'lib/html2rss/html_extractor/semantic_containers.rb', line 19
    def self.call(parsed_body)
      new(parsed_body).call
    end
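The class method is a thin wrapper: it builds an instance and immediately invokes #call on it. As a rough illustration of that shape only, here is a self-contained sketch with a hypothetical WordLengths class; it is unrelated to html2rss and merely mirrors the new(...).call delegation.

```ruby
# Hypothetical service object using the same .call -> new(...).call shape.
class WordLengths
  # Convenience entry point: build an instance and run it in one step.
  def self.call(text)
    new(text).call
  end

  def initialize(text)
    @text = text
  end

  # Does the actual work using the state captured by the constructor.
  def call
    @text.split.map(&:length)
  end
end

puts WordLengths.call('semantic content containers').inspect # => [8, 7, 10]
```

Callers then need only the one-line class-level call, while the instance keeps its input in an instance variable, as SemanticContainers does with @parsed_body.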
Instance Method Details
#call ⇒ Array<Nokogiri::XML::Node>
Returns candidate semantic containers.
    # File 'lib/html2rss/html_extractor/semantic_containers.rb', line 29
    def call
      cache = {}.compare_by_identity
      candidates = @parsed_body.css(SELECTORS.join(',')).reject do |node|
        HtmlExtractor.ignored_container_path?(node, cache)
      end

      # Preserve the original post-order traversal intent (specific-first)
      # by sorting candidates by depth (descending) while keeping original document
      # order for nodes at the same depth.
      candidates.each_with_index
                .sort_by { |node, index| [-node.ancestors.size, index] }
                .map!(&:first)
    end
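The ordering step can be tried in isolation. The sketch below assumes a hypothetical FakeNode struct that exposes only a name and an ancestors list (real candidates are Nokogiri nodes), and a hypothetical helper sort_specific_first that applies the same [-depth, index] sort key. Pairing each node with its original index is what keeps the order of equally deep nodes unchanged.

```ruby
# Hypothetical stand-in for a parsed node: a label plus its ancestor chain.
FakeNode = Struct.new(:name, :ancestors)

# Deepest nodes first; ties keep their original (document) order because
# the original index is the second element of the sort key.
def sort_specific_first(nodes)
  nodes.each_with_index
       .sort_by { |node, index| [-node.ancestors.size, index] }
       .map(&:first)
end

shallow_div = FakeNode.new('div#a', %w[body html])
first_item  = FakeNode.new('li#1', %w[ul div body html])
section     = FakeNode.new('section#s', %w[main body html])
second_item = FakeNode.new('li#2', %w[ul div body html])

sorted = sort_specific_first([shallow_div, first_item, section, second_item])
puts sorted.map(&:name).inspect
# => ["li#1", "li#2", "section#s", "div#a"]
```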