Class: Html2rss::HtmlExtractor::SemanticAnchorCandidates::Candidate

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/semantic_anchor_candidates.rb

Overview

One anchor plus the facts needed to decide whether it represents content.

Constructor Details

#initialize(anchor, context) ⇒ Candidate

Returns a new instance of Candidate.

Parameters:

  • anchor (Nokogiri::XML::Node)

    anchor candidate

  • context (Context)

    semantic container context



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 87

def initialize(anchor, context)
  @anchor = anchor
  @context = context
end

Instance Attribute Details

#anchor ⇒ Object (readonly)

Returns the value of attribute anchor.



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 83

def anchor
  @anchor
end

Instance Method Details

#anchor_identity_attributes ⇒ Hash

Returns anchor identity attributes used to build AnchorFacts.

Returns:

  • (Hash)

    anchor identity attributes used to build AnchorFacts



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 112

def anchor_identity_attributes
  {
    anchor:,
    text:,
    url: destination_facts.url,
    destination: destination_facts.destination,
    segments: destination_facts.segments
  }
end

#anchor_signal_attributes ⇒ Hash

Returns anchor signal attributes used to build AnchorFacts.

Returns:

  • (Hash)

    anchor signal attributes used to build AnchorFacts



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 123

def anchor_signal_attributes
  {
    meaningful_text: meaningful_text?,
    content_like_destination: content_like_destination?,
    heading_anchor: heading_anchor?,
    heading_text_match: heading_text_match?
  }
end
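
The two attribute hashes use Ruby's shorthand hash syntax (`anchor:,` and `text:,`), where a key with no value takes the value of the local variable or method of the same name. This syntax requires Ruby 3.1 or later. A minimal sketch follows; `ShorthandSketch` and its values are hypothetical, and merging the two hashes is only an assumption about how a consumer such as AnchorFacts might combine them.

```ruby
# Hypothetical stand-in class; not part of html2rss.
class ShorthandSketch
  def text
    "Release notes"
  end

  def url
    "https://example.com/releases"
  end

  def meaningful_text?
    text.scan(/\p{Alnum}+/).any?
  end

  # `text:` and `url:` expand to `text: text` and `url: url`.
  def identity_attributes
    { text:, url: }
  end

  # Predicate names end in `?`, so they cannot use the shorthand.
  def signal_attributes
    { meaningful_text: meaningful_text? }
  end
end

sketch = ShorthandSketch.new
# Assumption: a consumer merges both hashes into one set of facts.
combined = sketch.identity_attributes.merge(sketch.signal_attributes)
```

The signal hash spells out each key because a method name ending in `?` is not a valid shorthand key.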

#content_like_destination? ⇒ Boolean

Returns true when the destination route has content signals.

Returns:

  • (Boolean)

    true when the destination route has content signals



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 138

def content_like_destination?
  destination_facts.content_path
end

#destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?

Returns the destination facts for the anchor, or nil when the context has none.



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 102

def destination_facts
  @destination_facts ||= @context.destination_facts(@anchor)
end
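
The `||=` memoization above caches only truthy results: when the context returns nil, the lookup runs again on every call. A minimal sketch of that behavior follows; `CountingContext` and `MemoSketch` are hypothetical stand-ins, not html2rss classes.

```ruby
# Hypothetical context that counts how often it is asked for facts.
class CountingContext
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls = 0
  end

  def destination_facts(_anchor)
    @calls += 1
    @result
  end
end

# Hypothetical candidate that mirrors the memoized reader above.
class MemoSketch
  def initialize(context)
    @context = context
  end

  def destination_facts
    @destination_facts ||= @context.destination_facts(:anchor)
  end
end

hit = CountingContext.new({ url: "https://example.com/a" })
hit_candidate = MemoSketch.new(hit)
2.times { hit_candidate.destination_facts }
hit_calls = hit.calls   # the truthy result is cached after the first call

miss = CountingContext.new(nil)
miss_candidate = MemoSketch.new(miss)
2.times { miss_candidate.destination_facts }
miss_calls = miss.calls # a nil result is never cached, so both calls hit the context
```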

#facts ⇒ AnchorFacts?

Returns ranked anchor facts when the anchor is eligible.

Returns:

  • (AnchorFacts, nil)

    ranked anchor facts when the anchor is eligible



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 93

def facts
  return unless destination_facts
  return if utility_text_suppressed? || ineligible_anchor?
  return unless representative_content_anchor?

  AnchorFacts.from_candidate(self)
end
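
The guard clauses in `facts` can be read as a short pipeline: no destination, a suppressed or ineligible anchor, or a non-representative anchor each yield nil. The sketch below reproduces only that control flow. `GuardSketch` is hypothetical, its boolean fields stand in for predicates this page does not document, and the returned hash stands in for `AnchorFacts.from_candidate(self)`.

```ruby
# Hypothetical stand-in; the boolean members replace predicates not shown on this page.
GuardSketch = Struct.new(:destination_facts, :suppressed, :ineligible, :representative,
                         keyword_init: true) do
  def facts
    return unless destination_facts
    return if suppressed || ineligible
    return unless representative

    { destination: destination_facts } # placeholder for AnchorFacts.from_candidate(self)
  end
end

defaults = { destination_facts: "/posts/1", suppressed: false, ineligible: false, representative: true }

eligible        = GuardSketch.new(**defaults)
no_destination  = GuardSketch.new(**defaults, destination_facts: nil)
suppressed      = GuardSketch.new(**defaults, suppressed: true)
unrepresentative = GuardSketch.new(**defaults, representative: false)

results = [eligible, no_destination, suppressed, unrepresentative].map(&:facts)
```

Only the first candidate produces facts; the other three return nil at the first guard they fail.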

#heading_anchor? ⇒ Boolean

Returns true when the anchor is inside the selected heading.

Returns:

  • (Boolean)

    true when the anchor is inside the selected heading



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 143

def heading_anchor?
  heading = @context.heading

  heading && @anchor.ancestors.include?(heading)
end
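
The ancestor check above can be sketched without Nokogiri by using a small tree of stand-in nodes. `SketchNode` is hypothetical and models only `parent` and `ancestors`. Because the method is written as `heading && ...`, it returns nil rather than false when no heading is selected.

```ruby
# Hypothetical node type that models only what the check needs.
class SketchNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end

  # Walks up the tree, nearest ancestor first.
  def ancestors
    chain = []
    node = parent
    while node
      chain << node
      node = node.parent
    end
    chain
  end
end

def heading_anchor?(anchor, heading)
  heading && anchor.ancestors.include?(heading)
end

article     = SketchNode.new("article")
heading     = SketchNode.new("h2", article)
title_link  = SketchNode.new("a", heading)
footer_link = SketchNode.new("a", article)

inside     = heading_anchor?(title_link, heading)  # anchor sits inside the heading
outside    = heading_anchor?(footer_link, heading) # anchor is a sibling of the heading
no_heading = heading_anchor?(title_link, nil)      # no heading selected
```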

#heading_text_match? ⇒ Boolean

Returns true when anchor text exactly matches heading text.

Returns:

  • (Boolean)

    true when anchor text exactly matches heading text



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 150

def heading_text_match?
  heading_text = @context.heading_text

  meaningful_text? &&
    heading_text.scan(/\p{Alnum}+/).any? &&
    heading_text == text
end
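
The comparison above is a strict string equality, so differences in case or surrounding whitespace prevent a match. A minimal standalone sketch follows; it takes both strings as arguments instead of reading them from a context, which is an assumption made for illustration.

```ruby
WORDS = /\p{Alnum}+/

# Standalone version of the check: both strings must contain words and be identical.
def heading_text_match?(text, heading_text)
  text.scan(WORDS).any? &&
    heading_text.scan(WORDS).any? &&
    heading_text == text
end

exact      = heading_text_match?("Launch notes", "Launch notes")
case_diff  = heading_text_match?("Launch notes", "launch notes")
space_diff = heading_text_match?("Launch notes", "Launch notes ")
symbols    = heading_text_match?("»", "»") # identical, but contains no words
```

Only `exact` is true. The last case shows why the word check matters: two identical strings made only of symbols still do not count as a match.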

#meaningful_text? ⇒ Boolean

Returns true when visible anchor text has words.

Returns:

  • (Boolean)

    true when visible anchor text has words



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 133

def meaningful_text?
  text.scan(/\p{Alnum}+/).any?
end
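
The `\p{Alnum}+` pattern matches runs of Unicode letters and digits, so text made only of punctuation, arrows, or whitespace is not considered meaningful. A minimal standalone sketch follows; the sample strings are illustrative only.

```ruby
# Standalone version of the check, taking the text as an argument.
def meaningful_text?(text)
  text.scan(/\p{Alnum}+/).any?
end

samples = {
  "Read more" => meaningful_text?("Read more"),
  "Über uns"  => meaningful_text?("Über uns"), # non-ASCII letters count
  "2024"      => meaningful_text?("2024"),     # digits count
  " » → "     => meaningful_text?(" » → "),    # symbols and spaces only
  ""          => meaningful_text?("")
}
```

The first three samples are meaningful; the symbol-only and empty strings are not.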

#text ⇒ String

Returns visible anchor text.

Returns:

  • (String)

    visible anchor text



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 107

def text
  @text ||= @context.visible_text(@anchor)
end