Class: Html2rss::HtmlExtractor::SemanticAnchorCandidates::Context

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/semantic_anchor_candidates.rb

Overview

Shared context for all anchors in one semantic container.

Constant Summary

UTILITY_LANDMARK_TAGS =

Ancestor tags that usually indicate navigation/utility regions.

%w[nav aside footer menu].freeze

Instance Method Summary

Constructor Details

#initialize(container, link_heuristics:) ⇒ Context

Returns a new instance of Context.

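Parameters:

  • container

    semantic container node that is searched for the heading

  • link_heuristics

    collaborator that answers #destination_facts and #utility_text?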

# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 39

def initialize(container, link_heuristics:)
  @container = container
  @link_heuristics = link_heuristics
end

Instance Method Details

#destination_facts(anchor) ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?

Returns destination facts.

Parameters:

  • anchor (Nokogiri::XML::Node)

    anchor candidate

Returns:

  • (Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts, nil)

    destination facts

# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 64

def destination_facts(anchor)
  @link_heuristics.destination_facts(anchor)
end

#heading ⇒ Nokogiri::XML::Node?

Returns heading used to identify title anchors.

Returns:

  • (Nokogiri::XML::Node, nil)

    heading used to identify title anchors



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 45

def heading
  @heading ||= @container.at_css(HtmlExtractor::HEADING_TAGS.join(','))
end
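
Because `||=` only caches truthy values, a container without any heading would repeat the CSS lookup on every call. The following is a minimal sketch of that behavior, not the library's code: `CountingContainer` and `HeadingSketch` are hypothetical stand-ins, and the selector string is an assumption, since the contents of `HtmlExtractor::HEADING_TAGS` are not shown on this page.

```ruby
# Hypothetical stand-in for a container node: counts how often at_css runs.
class CountingContainer
  attr_reader :lookups

  def initialize(result)
    @result = result
    @lookups = 0
  end

  def at_css(_selector)
    @lookups += 1
    @result
  end
end

# Minimal copy of the memoized reader. The selector is an assumption.
class HeadingSketch
  def initialize(container)
    @container = container
  end

  def heading
    @heading ||= @container.at_css('h1,h2,h3,h4,h5,h6')
  end
end

with_heading = CountingContainer.new(:h2_node)
sketch = HeadingSketch.new(with_heading)
2.times { sketch.heading }
puts with_heading.lookups    # 1: the truthy result is cached

without_heading = CountingContainer.new(nil)
sketch = HeadingSketch.new(without_heading)
2.times { sketch.heading }
puts without_heading.lookups # 2: nil is never cached, so the lookup repeats
```

Whether the repeated lookup matters in practice depends on how often `#heading` is called per container.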

#heading_text ⇒ String

Returns visible heading text.

Returns:

  • (String)

    visible heading text



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 50

def heading_text
  @heading_text ||= visible_text(heading)
end

#utility_landmark?(ancestors) ⇒ Boolean

Returns true when the anchor lives inside navigation chrome.

Parameters:

  • ancestors (Array<Nokogiri::XML::Node>)

Returns:

  • (Boolean)

    true when the anchor lives inside navigation chrome



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 76

def utility_landmark?(ancestors)
  ancestors.any? { |node| UTILITY_LANDMARK_TAGS.include?(node.name) }
end
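
The landmark check only reads `#name` from each ancestor, so it can be exercised without parsing any HTML. A small sketch under that assumption: `FakeNode` is a hypothetical stand-in for `Nokogiri::XML::Node`, and the predicate is lifted out of the class for illustration.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node; only #name is needed here.
FakeNode = Struct.new(:name)

UTILITY_LANDMARK_TAGS = %w[nav aside footer menu].freeze

# Same predicate as Context#utility_landmark?, defined at top level.
def utility_landmark?(ancestors)
  ancestors.any? { |node| UTILITY_LANDMARK_TAGS.include?(node.name) }
end

in_nav     = %w[li ul nav body html].map { |tag| FakeNode.new(tag) }
in_article = %w[h2 article main body html].map { |tag| FakeNode.new(tag) }

puts utility_landmark?(in_nav)     # true
puts utility_landmark?(in_article) # false
puts utility_landmark?([])         # false
```

The comparison is an exact string match on the tag name, so the result depends on the parser reporting element names in lowercase.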

#utility_text?(text) ⇒ Boolean

Returns true when text is utility chrome.

Parameters:

  • text (String)

    visible anchor text

Returns:

  • (Boolean)

    true when text is utility chrome



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 70

def utility_text?(text)
  @link_heuristics.utility_text?(text)
end
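
Both `#utility_text?` and `#destination_facts` forward to the `link_heuristics` object passed to the constructor, so any object responding to those two methods can be injected. The sketch below illustrates the delegation with hypothetical names: `StubHeuristics` and its word list are invented for the example, and `ContextSketch` is a trimmed copy of the class, not the real one.

```ruby
# Hypothetical collaborator. The word list is invented for this example.
class StubHeuristics
  UTILITY_WORDS = ['share', 'log in', 'subscribe'].freeze

  def utility_text?(text)
    UTILITY_WORDS.include?(text.to_s.strip.downcase)
  end

  def destination_facts(_anchor)
    nil # the documented return type allows nil
  end
end

# Trimmed copy of Context: constructor plus the two delegating methods.
class ContextSketch
  def initialize(container, link_heuristics:)
    @container = container
    @link_heuristics = link_heuristics
  end

  def destination_facts(anchor)
    @link_heuristics.destination_facts(anchor)
  end

  def utility_text?(text)
    @link_heuristics.utility_text?(text)
  end
end

context = ContextSketch.new(:container, link_heuristics: StubHeuristics.new)
puts context.utility_text?(' Share ')           # true
puts context.utility_text?('Ruby 3.4 released') # false
p context.destination_facts(:anchor)            # nil
```

Keeping the heuristics behind a keyword argument makes it straightforward to swap in a stub like this when testing candidate selection in isolation.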

#visible_text(node) ⇒ String

Returns visible text for the node.

Parameters:

  • node (Nokogiri::XML::Node, nil)

    node to extract text from

Returns:

  • (String)

    visible text for the node



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 56

def visible_text(node)
  return '' unless node

  HtmlExtractor.extract_visible_text(node).to_s.strip
end