Class: Html2rss::HtmlExtractor::SemanticAnchorCandidates::Candidate

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/semantic_anchor_candidates.rb

Overview

One anchor plus the facts needed to decide whether it represents content.

Constructor Details

#initialize(anchor, context) ⇒ Candidate

Returns a new instance of Candidate.

Parameters:

  • anchor (Nokogiri::XML::Node)

    anchor candidate

  • context (Context)

    semantic container context



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 83

def initialize(anchor, context)
  @anchor = anchor
  @context = context
end

Instance Attribute Details

#anchor ⇒ Object (readonly)

Returns the value of attribute anchor.



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 79

def anchor
  @anchor
end

Instance Method Details

#anchor_identity_attributes ⇒ Hash

Returns anchor identity attributes used to build AnchorFacts.

Returns:

  • (Hash)

    anchor identity attributes used to build AnchorFacts



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 108

def anchor_identity_attributes
  {
    anchor:,
    text:,
    url: destination_facts.url,
    destination: destination_facts.destination,
    segments: destination_facts.segments
  }
end
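
The `anchor:` and `text:` entries above rely on the hash value shorthand available since Ruby 3.1, where a key with no value is filled from the local variable or method of the same name. A minimal sketch of that behaviour follows; `ShorthandDemo` and its return values are hypothetical stand-ins, not part of html2rss.

```ruby
# Hypothetical stand-in class; only demonstrates the Ruby 3.1+ hash shorthand.
class ShorthandDemo
  def anchor = '<a href="/posts/1">'
  def text = 'First post'

  # `anchor:` expands to `anchor: anchor`, which calls the method above.
  def shorthand
    { anchor:, text: }
  end

  # The explicit spelling, for comparison.
  def explicit
    { anchor: anchor, text: text }
  end
end

demo = ShorthandDemo.new
same = demo.shorthand == demo.explicit
```

Both methods build the same hash, so `same` is `true`.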

#anchor_signal_attributes ⇒ Hash

Returns anchor signal attributes used to build AnchorFacts.

Returns:

  • (Hash)

    anchor signal attributes used to build AnchorFacts



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 119

def anchor_signal_attributes
  {
    meaningful_text: meaningful_text?,
    content_like_destination: content_like_destination?,
    heading_anchor: heading_anchor?,
    heading_text_match: heading_text_match?
  }
end

#content_like_destination? ⇒ Boolean

Returns true when the destination route has content signals.

Returns:

  • (Boolean)

    true when the destination route has content signals



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 134

def content_like_destination?
  destination_facts.content_path
end

#destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?

Returns the destination facts for the anchor, looked up through the context and memoized, or nil when none are available.



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 98

def destination_facts
  @destination_facts ||= @context.destination_facts(@anchor)
end
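
Because the return type is nilable, it may help to recall how `||=` memoization treats falsy results: a truthy value is cached after the first call, while `nil` (or `false`) causes the right-hand side to run again on every call. The sketch below illustrates this with a hypothetical `MemoDemo` class; it does not use the real `Context`.

```ruby
# Hypothetical class that counts how often the memoized lookup actually runs.
class MemoDemo
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls = 0
  end

  # Same `||=` pattern as in the method above.
  def value
    @value ||= lookup
  end

  private

  def lookup
    @calls += 1
    @result
  end
end

hit = MemoDemo.new(:facts)
3.times { hit.value }   # lookup runs once, then the cached value is reused

miss = MemoDemo.new(nil)
3.times { miss.value }  # nil is never cached, so lookup runs each time
```

After this runs, `hit.calls` is 1 and `miss.calls` is 3.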

#facts ⇒ AnchorFacts?

Returns ranked anchor facts when the anchor is eligible.

Returns:

  • (AnchorFacts, nil)

    ranked anchor facts when the anchor is eligible



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 89

def facts
  return unless destination_facts
  return if utility_text_suppressed? || ineligible_anchor?
  return unless representative_content_anchor?

  AnchorFacts.from_candidate(self)
end
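
Since `#facts` returns `nil` for anchors that fail any guard, a caller can plausibly collect the eligible ones with `filter_map`. The following is a sketch under that assumption; `StubCandidate` is a hypothetical stand-in that mimics only the nil-or-facts contract, and the hash it returns is not the real `AnchorFacts`.

```ruby
# Hypothetical stand-in: returns a facts-like hash when eligible, nil otherwise.
StubCandidate = Struct.new(:label, :eligible) do
  def facts
    return unless eligible

    { label: label }
  end
end

candidates = [
  StubCandidate.new('Read the full story', true),
  StubCandidate.new('Share', false),
  StubCandidate.new('Release notes', true)
]

# filter_map drops the nil results and keeps the rest in order.
eligible_facts = candidates.filter_map(&:facts)
```

Here `eligible_facts` holds two entries, for 'Read the full story' and 'Release notes'.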

#heading_anchor? ⇒ Boolean

Returns true when the anchor is inside the selected heading.

Returns:

  • (Boolean)

    true when the anchor is inside the selected heading



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 139

def heading_anchor?
  heading = @context.heading
  return false unless heading

  curr = @anchor
  container = @context.container
  while curr.respond_to?(:parent)
    return true if curr == heading
    break if curr == container

    curr = curr.parent
  end
  false
end
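
The loop above walks from the anchor up through its ancestors, succeeds when it reaches the heading, and stops at the container. A minimal sketch of the same walk on a toy tree follows; `DemoNode` and `inside_heading?` are hypothetical, and the sketch compares nodes by identity with `equal?` because `Struct#==` compares members rather than identity.

```ruby
# Hypothetical node type with just a name and a parent pointer.
DemoNode = Struct.new(:name, :parent)

# Walks upward from `anchor`; true if `heading` is reached before `container`.
def inside_heading?(anchor, heading, container)
  return false unless heading

  curr = anchor
  while curr.respond_to?(:parent) # nil ends the walk at the top of the tree
    return true if curr.equal?(heading)
    break if curr.equal?(container)

    curr = curr.parent
  end
  false
end

# article > h2 > a   and   article > p > a
article    = DemoNode.new('article', nil)
h2         = DemoNode.new('h2', article)
title_link = DemoNode.new('a', h2)
paragraph  = DemoNode.new('p', article)
more_link  = DemoNode.new('a', paragraph)

in_heading  = inside_heading?(title_link, h2, article)
out_heading = inside_heading?(more_link, h2, article)
```

`in_heading` is `true` because the walk reaches `h2`, and `out_heading` is `false` because the walk hits the container first.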

#heading_text_match? ⇒ Boolean

Returns true when anchor text exactly matches heading text.

Returns:

  • (Boolean)

    true when anchor text exactly matches heading text



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 155

def heading_text_match?
  heading_text = @context.heading_text

  meaningful_text? &&
    heading_text.match?(/\p{Alnum}/) &&
    heading_text == text
end
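
The comparison above is a plain `==` on strings, so it is case-sensitive and does no normalization of its own. The sketch below restates the three conditions as a free-standing function to show which inputs pass; `exact_heading_match?` is a hypothetical name, and the sample strings are invented.

```ruby
# Hypothetical restatement of the three conditions on plain strings.
def exact_heading_match?(anchor_text, heading_text)
  anchor_text.match?(/\p{Alnum}/) &&
    heading_text.match?(/\p{Alnum}/) &&
    heading_text == anchor_text
end

identical  = exact_heading_match?('Ruby 3.4 released', 'Ruby 3.4 released')
case_diff  = exact_heading_match?('ruby 3.4 released', 'Ruby 3.4 released')
no_words   = exact_heading_match?('>>', '>>')
other_text = exact_heading_match?('Read more', 'Ruby 3.4 released')
```

Only `identical` is `true`. The other three fail on case, on the lack of any letter or digit, and on differing text respectively.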

#meaningful_text? ⇒ Boolean

Returns true when visible anchor text has words.

Returns:

  • (Boolean)

    true when visible anchor text has words



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 129

def meaningful_text?
  @meaningful_text ||= text.match?(/\p{Alnum}/)
end
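
The pattern `/\p{Alnum}/` matches any Unicode letter or digit, so "has words" here means "contains at least one such character" rather than ASCII only. A short sketch with invented sample strings:

```ruby
# Same pattern as in the method above.
wordlike = /\p{Alnum}/

samples = ['Read more', 'Über uns', '2024', '   ', '>>', '...']

# Keep only the strings that contain at least one letter or digit.
meaningful = samples.select { |s| s.match?(wordlike) }
```

`meaningful` keeps 'Read more', 'Über uns' and '2024'. The whitespace-only and punctuation-only strings are dropped.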

#text ⇒ String

Returns visible anchor text.

Returns:

  • (String)

    visible anchor text



# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 103

def text
  @text ||= @context.visible_text(@anchor)
end