Class: Html2rss::HtmlExtractor::ListCandidates

Inherits: Object

Defined in: lib/html2rss/html_extractor/list_candidates.rb

Overview

Builds repeated-list article container candidates from generic HTML.

Class Method Summary

.simplify_xpath(xpath) ⇒ String

Simplify an XPath selector by removing index notation.

Instance Method Summary

#each_article_tag(anchor_filter:, boundary_condition:) {|article_tag, selected_anchor| ... } ⇒ Enumerator

Yields each candidate article container together with the anchor that made it eligible.

#initialize(parsed_body, minimum_selector_frequency:, use_top_selectors:) ⇒ ListCandidates (constructor)

A new instance of ListCandidates.

Constructor Details

#initialize(parsed_body, minimum_selector_frequency:, use_top_selectors:) ⇒ `ListCandidates`

Returns a new instance of ListCandidates.

Parameters:

parsed_body (Nokogiri::HTML::Document) — parsed document
minimum_selector_frequency (Integer) — minimum repeated anchor path count
use_top_selectors (Integer) — number of frequent anchor paths to inspect

# File 'lib/html2rss/html_extractor/list_candidates.rb', line 20

def initialize(parsed_body, minimum_selector_frequency:, use_top_selectors:)
  @parsed_body = parsed_body
  @minimum_selector_frequency = minimum_selector_frequency
  @use_top_selectors = use_top_selectors
end
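
The two keyword parameters are described above only as a minimum count and a number of frequent paths to inspect. The sketch below illustrates one plausible reading of those descriptions using plain strings. It is not taken from the library's implementation: the sample paths, the local `simplify_xpath` copy, and the tally-filter-rank pipeline are all assumptions made for illustration.

```ruby
# Local copy of the documented one-liner so the sketch runs without html2rss.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

# Hypothetical anchor paths, as they might look for links found in a page.
anchor_paths = [
  '/html/body/ul/li[1]/a',
  '/html/body/ul/li[2]/a',
  '/html/body/ul/li[3]/a',
  '/html/body/nav/a[1]',
  '/html/body/nav/a[2]',
  '/html/body/footer/a'
]

minimum_selector_frequency = 2 # drop paths seen fewer than 2 times
use_top_selectors = 1          # keep only the single most frequent path

# Assumed pipeline: normalize, count, filter by frequency, keep the top N.
frequent_paths = anchor_paths
  .map { |path| simplify_xpath(path) }
  .tally
  .select { |_path, count| count >= minimum_selector_frequency }
  .max_by(use_top_selectors) { |_path, count| count }
  .map(&:first)

puts frequent_paths.inspect
```

Under these assumptions, raising `minimum_selector_frequency` discards rarely repeated paths, while raising `use_top_selectors` widens the set of repeated paths that are considered.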

Class Method Details

.simplify_xpath(xpath) ⇒ `String`

Simplify an XPath selector by removing index notation.

Parameters:

xpath (String) — original XPath

Returns:

(String) — XPath without positional indexes

# File 'lib/html2rss/html_extractor/list_candidates.rb', line 13

def self.simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end
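
As a quick illustration of the substitution above, the following sketch applies the same regular expression to a few sample paths. The method is copied locally so the snippet runs without the gem, and the sample XPath strings are made up for the example.

```ruby
# Local copy of the documented method body, for a self-contained example.
def simplify_xpath(xpath)
  xpath.gsub(/\[\d+\]/, '')
end

# Positional indexes such as [2] and [3] are removed.
indexed = '/html/body/div[2]/ul/li[3]/a'
puts simplify_xpath(indexed)

# Only purely numeric predicates match the pattern; attribute predicates stay.
with_attribute = '//div[@class="post"][1]/a'
puts simplify_xpath(with_attribute)
```

Because the pattern matches digits only, sibling elements that differ solely by position collapse to the same string, which is what makes repeated paths countable.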

Instance Method Details

#each_article_tag(anchor_filter:, boundary_condition:) {|article_tag, selected_anchor| ... } ⇒ `Enumerator`

Parameters:

anchor_filter (#call) — predicate for scraper-specific anchor eligibility
boundary_condition (#call) — predicate for article container boundary

Yield Parameters:

article_tag (Nokogiri::XML::Node) — candidate article container
selected_anchor (Nokogiri::XML::Node) — anchor that made the container eligible

Returns:

(Enumerator) — returned when no block is given

# File 'lib/html2rss/html_extractor/list_candidates.rb', line 32

def each_article_tag(anchor_filter:, boundary_condition:)
  return enum_for(:each_article_tag, anchor_filter:, boundary_condition:) unless block_given?

  article_tags(anchor_filter:, boundary_condition:).each { yield _1[:article_tag], _1[:selected_anchor] }
end
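
The method follows the common Ruby pattern of returning an Enumerator when called without a block. The sketch below reproduces that pattern in isolation. `PairSource`, `each_pair`, and the string values standing in for Nokogiri nodes are hypothetical, and the way `anchor_filter` is applied here is an assumption, not the library's logic.

```ruby
# Hypothetical stand-in for ListCandidates; strings replace Nokogiri nodes.
class PairSource
  def initialize(pairs)
    @pairs = pairs
  end

  # Yields two values per entry, or returns an Enumerator without a block.
  def each_pair(anchor_filter:)
    return enum_for(:each_pair, anchor_filter: anchor_filter) unless block_given?

    @pairs.each do |pair|
      # Assumed use of the predicate: skip anchors it rejects.
      next unless anchor_filter.call(pair[:selected_anchor])

      yield pair[:article_tag], pair[:selected_anchor]
    end
  end
end

source = PairSource.new([
  { article_tag: 'li#first',  selected_anchor: '/posts/1' },
  { article_tag: 'li#second', selected_anchor: '/about' },
  { article_tag: 'li#third',  selected_anchor: '/posts/3' }
])

# Any object responding to #call works as the predicate; a lambda is simplest.
posts_only = ->(href) { href.start_with?('/posts/') }

# Block form: both yielded values arrive as block parameters.
source.each_pair(anchor_filter: posts_only) do |tag, anchor|
  puts "#{tag} -> #{anchor}"
end

# Blockless form: the Enumerator can be chained with Enumerable methods.
hrefs = source.each_pair(anchor_filter: posts_only).map { |_tag, anchor| anchor }
puts hrefs.inspect
```

Returning an Enumerator lets callers use `map`, `first`, or `lazy` on the candidates without the method having to build and return an intermediate array itself.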
