Class: Html2rss::HtmlExtractor::SemanticAnchorCandidates::Candidate
- Inherits:
- Object
- Defined in:
- lib/html2rss/html_extractor/semantic_anchor_candidates.rb
Overview
One anchor plus the facts needed to decide whether it represents content.
Instance Attribute Summary
- #anchor ⇒ Object (readonly)
  Returns the value of attribute anchor.
Instance Method Summary
- #anchor_identity_attributes ⇒ Hash
  Anchor identity attributes used to build AnchorFacts.
- #anchor_signal_attributes ⇒ Hash
  Anchor signal attributes used to build AnchorFacts.
- #content_like_destination? ⇒ Boolean
  True when the destination route has content signals.
- #destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?
  Destination facts.
- #facts ⇒ AnchorFacts?
  Ranked anchor facts when the anchor is eligible.
- #heading_anchor? ⇒ Boolean
  True when the anchor is inside the selected heading.
- #heading_text_match? ⇒ Boolean
  True when anchor text exactly matches heading text.
- #initialize(anchor, context) ⇒ Candidate (constructor)
  A new instance of Candidate.
- #meaningful_text? ⇒ Boolean
  True when visible anchor text has words.
- #text ⇒ String
  Visible anchor text.
Constructor Details
#initialize(anchor, context) ⇒ Candidate
Returns a new instance of Candidate.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 83
def initialize(anchor, context)
  @anchor = anchor
  @context = context
end
Instance Attribute Details
#anchor ⇒ Object (readonly)
Returns the value of attribute anchor.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 79
def anchor
  @anchor
end
Instance Method Details
#anchor_identity_attributes ⇒ Hash
Returns anchor identity attributes used to build AnchorFacts.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 108
def anchor_identity_attributes
  {
    anchor:,
    text:,
    url: destination_facts.url,
    destination: destination_facts.destination,
    segments: destination_facts.segments
  }
end
#anchor_signal_attributes ⇒ Hash
Returns anchor signal attributes used to build AnchorFacts.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 119
def anchor_signal_attributes
  {
    meaningful_text: meaningful_text?,
    content_like_destination: content_like_destination?,
    heading_anchor: heading_anchor?,
    heading_text_match: heading_text_match?
  }
end
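The two hashes split one candidate into identity fields and signal fields. A minimal sketch of how they might be combined into a single value object follows. `SketchFacts` is a hypothetical stand-in, because `AnchorFacts.from_candidate` is not shown on this page, and the sample values (including the types used for `url`, `destination`, and `segments`) are assumptions made only for illustration.

```ruby
# Hypothetical stand-in for AnchorFacts; the real class is not shown here.
SketchFacts = Struct.new(
  :anchor, :text, :url, :destination, :segments,
  :meaningful_text, :content_like_destination, :heading_anchor, :heading_text_match,
  keyword_init: true
)

# Sample data shaped like the two hashes; every value is an assumption.
identity = {
  anchor: :anchor_node,
  text: 'Read more',
  url: 'https://example.com/posts/1',
  destination: '/posts/1',
  segments: %w[posts 1]
}
signals = {
  meaningful_text: true,
  content_like_destination: true,
  heading_anchor: false,
  heading_text_match: false
}

# Double-splat both hashes into one keyword-initialized struct.
facts = SketchFacts.new(**identity, **signals)
```

The `anchor:` and `text:` entries in `anchor_identity_attributes` use the hash shorthand that omits the value, which requires Ruby 3.1 or later.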
#content_like_destination? ⇒ Boolean
Returns true when the destination route has content signals.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 134
def content_like_destination?
  destination_facts.content_path
end
#destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?
Returns the destination facts that the context resolves for this anchor, or nil when none are available.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 98
def destination_facts
  @destination_facts ||= @context.destination_facts(@anchor)
end
#facts ⇒ AnchorFacts?
Returns ranked anchor facts when the anchor is eligible, and nil otherwise.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 89
def facts
  return unless destination_facts
  return if utility_text_suppressed? || ineligible_anchor?
  return unless representative_content_anchor?

  AnchorFacts.from_candidate(self)
end
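The method is a chain of guard clauses: any failed check ends the call with a bare `return`, which yields nil. The sketch below reproduces only that control flow. `GuardSketch` and its boolean flags are hypothetical, standing in for the private predicates (`utility_text_suppressed?`, `ineligible_anchor?`, `representative_content_anchor?`) whose bodies are not shown on this page.

```ruby
# Hypothetical stand-in: plain boolean flags replace the private predicates
# that #facts calls, since their implementations are not shown here.
class GuardSketch
  def initialize(destination_facts:, suppressed: false, ineligible: false, representative: true)
    @destination_facts = destination_facts
    @suppressed = suppressed
    @ineligible = ineligible
    @representative = representative
  end

  # Mirrors the guard order of #facts; each early `return` yields nil.
  def facts
    return unless @destination_facts
    return if @suppressed || @ineligible
    return unless @representative

    { url: @destination_facts.fetch(:url) }
  end
end

dest = { url: 'https://example.com/posts/1' }
eligible = GuardSketch.new(destination_facts: dest).facts
no_destination = GuardSketch.new(destination_facts: nil).facts
suppressed = GuardSketch.new(destination_facts: dest, suppressed: true).facts
unrepresentative = GuardSketch.new(destination_facts: dest, representative: false).facts
```

Checking `destination_facts` first matters in the documented class as well: `#content_like_destination?` calls a method on that object and would raise `NoMethodError` if it were nil.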
#heading_anchor? ⇒ Boolean
Returns true when the anchor is inside the selected heading.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 139
def heading_anchor?
  heading = @context.heading
  return false unless heading

  curr = @anchor
  container = @context.container
  while curr.respond_to?(:parent)
    return true if curr == heading
    break if curr == container

    curr = curr.parent
  end
  false
end
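The loop climbs from the anchor through its ancestors and stops as soon as it reaches either the heading (match) or the container (no match). A self-contained sketch of the same walk is shown below. `SketchNode` and `inside_heading?` are hypothetical names; any object that responds to `#parent` would work in place of the node class.

```ruby
# Hypothetical minimal node: just a name and a parent link.
class SketchNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end
end

# Walks from the anchor towards the root, stopping at the container.
def inside_heading?(anchor, heading, container)
  return false unless heading

  curr = anchor
  while curr.respond_to?(:parent)
    return true if curr == heading
    break if curr == container

    curr = curr.parent
  end
  false
end

# article > h2 > a   and   article > p > a
article = SketchNode.new('article')
h2 = SketchNode.new('h2', article)
title_link = SketchNode.new('a', h2)
paragraph = SketchNode.new('p', article)
body_link = SketchNode.new('a', paragraph)

in_heading = inside_heading?(title_link, h2, article) # anchor nested in the heading
in_body = inside_heading?(body_link, h2, article)     # walk ends at the container
no_heading = inside_heading?(title_link, nil, article) # no heading selected
```

The `respond_to?(:parent)` condition also ends the loop at the top of the tree, because the root's parent is nil and nil does not respond to `parent`.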
#heading_text_match? ⇒ Boolean
Returns true when the anchor text contains at least one alphanumeric character and exactly matches the heading text.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 155
def heading_text_match?
  heading_text = @context.heading_text

  meaningful_text? &&
    heading_text.match?(/\p{Alnum}/) &&
    heading_text == text
end
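The comparison is strict string equality, guarded so that two strings made only of punctuation or symbols never count as a match. The sketch below restates it as a standalone method for experimentation. The two arguments are assumptions standing in for the candidate's `#text` and the context's `#heading_text`.

```ruby
# Standalone version of the comparison; `text` and `heading_text` stand in
# for the values the candidate and its context would supply.
def heading_text_match?(text, heading_text)
  text.match?(/\p{Alnum}/) &&
    heading_text.match?(/\p{Alnum}/) &&
    heading_text == text
end

same = heading_text_match?('Ruby 3.4 released', 'Ruby 3.4 released')
case_differs = heading_text_match?('Ruby 3.4 released', 'Ruby 3.4 Released')
symbols_only = heading_text_match?('>>', '>>')
```

Because `==` on strings is case-sensitive, `case_differs` is false even though the two strings differ by a single letter's case.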
#meaningful_text? ⇒ Boolean
Returns true when the visible anchor text contains at least one alphanumeric character.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 129
def meaningful_text?
  @meaningful_text ||= text.match?(/\p{Alnum}/)
end
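`||=` assigns only when the left-hand side is nil or false, so a true result is cached while a false result is recomputed on each call. The sketch below makes that visible with a call counter. `MemoSketch` and `meaningful_cached?` are hypothetical names, and the `defined?` variant is one common alternative, not something this page shows the library doing.

```ruby
# Hypothetical class that counts how often the regex check actually runs.
class MemoSketch
  attr_reader :calls

  def initialize(text)
    @text = text
    @calls = 0
  end

  # Same memoization shape as #meaningful_text?: caches only truthy results.
  def meaningful_text?
    @meaningful_text ||= compute
  end

  # Alternative that also caches a false result.
  def meaningful_cached?
    return @cached if defined?(@cached)

    @cached = compute
  end

  private

  def compute
    @calls += 1
    @text.match?(/\p{Alnum}/)
  end
end

truthy = MemoSketch.new('News')
2.times { truthy.meaningful_text? }

falsy = MemoSketch.new('...')
2.times { falsy.meaningful_text? }

cached = MemoSketch.new('...')
2.times { cached.meaningful_cached? }
```

In the documented class the repeated check is cheap, since `#text` is itself memoized and only the regex match runs again.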
#text ⇒ String
Returns visible anchor text.
# File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 103
def text
  @text ||= @context.visible_text(@anchor)
end