Class: Html2rss::HtmlExtractor::ListCandidates

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/list_candidates.rb

Overview

Builds repeated-list article container candidates from generic HTML.

Constructor Details

#initialize(parsed_body, minimum_selector_frequency:, use_top_selectors:) ⇒ ListCandidates

Returns a new instance of ListCandidates.

Parameters:

  • parsed_body (Nokogiri::HTML::Document)

    parsed document

  • minimum_selector_frequency (Integer)

    minimum repeated anchor path count

  • use_top_selectors (Integer)

    number of frequent anchor paths to inspect



# File 'lib/html2rss/html_extractor/list_candidates.rb', line 20

def initialize(parsed_body, minimum_selector_frequency:, use_top_selectors:)
  @parsed_body = parsed_body
  @minimum_selector_frequency = minimum_selector_frequency
  @use_top_selectors = use_top_selectors
end
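
As a rough illustration of what the two keyword arguments control, the sketch below tallies index-free anchor paths, keeps those that occur at least `minimum_selector_frequency` times, and then takes the `use_top_selectors` most frequent ones. The `top_anchor_paths` helper and the sample paths are hypothetical. The selection logic inside `list_candidates.rb` is not shown on this page and may differ.

```ruby
# Hypothetical helper, not part of html2rss: one plausible reading of the two
# constructor options, using plain strings instead of Nokogiri nodes.
def top_anchor_paths(anchor_paths, minimum_selector_frequency:, use_top_selectors:)
  anchor_paths
    .map { |path| path.gsub(/\[\d+\]/, '') } # drop positional indexes
    .tally                                    # path => occurrence count
    .select { |_path, count| count >= minimum_selector_frequency }
    .sort_by { |_path, count| -count }        # most frequent first
    .first(use_top_selectors)
    .map(&:first)                             # keep only the path strings
end

# Sample anchor XPaths (made up for this example).
paths = [
  '/html/body/main/ul/li[1]/a',
  '/html/body/main/ul/li[2]/a',
  '/html/body/main/ul/li[3]/a',
  '/html/body/nav/a[1]',
  '/html/body/nav/a[2]',
  '/html/body/footer/a'
]

selected = top_anchor_paths(paths, minimum_selector_frequency: 2, use_top_selectors: 1)
puts selected.inspect
```

With these inputs, the footer link is dropped for appearing only once, and the navigation links are dropped because only the single most frequent path is kept.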

Class Method Details

.simplify_xpath(xpath) ⇒ String

Simplify an XPath selector by removing index notation.

Parameters:

  • xpath (String)

    original XPath

Returns:

  • (String)

    XPath without positional indexes



# File 'lib/html2rss/html_extractor/list_candidates.rb', line 13

def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end
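
A short usage sketch may help show why the indexes are stripped. It copies the one-line method body into a standalone lambda so that it runs without the gem. The sample XPaths are made up for illustration.

```ruby
# Standalone copy of the method body shown above, so this runs without html2rss.
simplify_xpath = ->(xpath) { xpath.gsub(/\[\d+\]/, '') }

first_item  = simplify_xpath.call('/html/body/div[2]/ul/li[1]/a')
second_item = simplify_xpath.call('/html/body/div[2]/ul/li[2]/a')

# Sibling anchors collapse to the same index-free path, so they can be counted as repeats.
puts first_item
puts first_item == second_item

# Only purely numeric predicates match the pattern; attribute predicates are left alone.
with_attribute = simplify_xpath.call('//div[@id="main"]/a[3]')
puts with_attribute
```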

Instance Method Details

#each_article_tag(anchor_filter:, boundary_condition:) {|article_tag, selected_anchor| ... } ⇒ Enumerator

Parameters:

  • anchor_filter (#call)

    predicate for scraper-specific anchor eligibility

  • boundary_condition (#call)

    predicate for article container boundary

Yield Parameters:

  • article_tag (Nokogiri::XML::Node)

    candidate article container

  • selected_anchor (Nokogiri::XML::Node)

    anchor that made the container eligible

Returns:

  • (Enumerator)

    when no block is given


# File 'lib/html2rss/html_extractor/list_candidates.rb', line 32

def each_article_tag(anchor_filter:, boundary_condition:)
  return enum_for(:each_article_tag, anchor_filter:, boundary_condition:) unless block_given?

  (anchor_filter:, boundary_condition:).each { yield _1[:article_tag], _1[:selected_anchor] }
end
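
The method follows the common Ruby pattern of yielding to a block when one is given and returning an `Enumerator` otherwise. The sketch below shows that pattern in isolation. `PairSource`, `each_pair_filtered`, and the sample data are hypothetical stand-ins, and plain strings take the place of Nokogiri nodes. Only the `enum_for` / `block_given?` structure mirrors the method above.

```ruby
# Hypothetical stand-in class; not part of html2rss.
class PairSource
  def initialize(pairs)
    @pairs = pairs
  end

  # Yields two values per match, or returns an Enumerator when called without a block.
  def each_pair_filtered(filter:)
    return enum_for(:each_pair_filtered, filter: filter) unless block_given?

    @pairs.select { |pair| filter.call(pair) }
          .each { |pair| yield pair[:article_tag], pair[:selected_anchor] }
  end
end

source = PairSource.new([
  { article_tag: 'li#a', selected_anchor: '/posts/1' },
  { article_tag: 'li#b', selected_anchor: '#' },
  { article_tag: 'li#c', selected_anchor: '/posts/3' }
])

# Any object responding to #call works as the predicate.
real_link = ->(pair) { pair[:selected_anchor].start_with?('/') }

# Block form: values arrive as two block parameters.
with_block = []
source.each_pair_filtered(filter: real_link) { |tag, anchor| with_block << [tag, anchor] }

# Enumerator form: each yielded pair is packed into a two-element array.
without_block = source.each_pair_filtered(filter: real_link).to_a

puts with_block.inspect
puts without_block == with_block
```

Returning an enumerator lets callers chain `map`, `first`, or `lazy` without the method needing to know how its results will be consumed.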