Class: Html2rss::HtmlExtractor::SemanticAnchorCandidates::Candidate
- Inherits: Object
  - Object
  - Html2rss::HtmlExtractor::SemanticAnchorCandidates::Candidate
- Defined in: lib/html2rss/html_extractor/semantic_anchor_candidates.rb
Overview
One anchor plus the facts needed to decide whether it represents content.
Instance Attribute Summary
- #anchor ⇒ Object (readonly)
  Returns the value of attribute anchor.
Instance Method Summary
- #anchor_identity_attributes ⇒ Hash
  Anchor identity attributes used to build AnchorFacts.
- #anchor_signal_attributes ⇒ Hash
  Anchor signal attributes used to build AnchorFacts.
- #content_like_destination? ⇒ Boolean
  True when the destination route has content signals.
- #destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?
  Destination facts.
- #facts ⇒ AnchorFacts?
  Ranked anchor facts when the anchor is eligible.
- #heading_anchor? ⇒ Boolean
  True when the anchor is inside the selected heading.
- #heading_text_match? ⇒ Boolean
  True when anchor text exactly matches heading text.
- #initialize(anchor, context) ⇒ Candidate (constructor)
  A new instance of Candidate.
- #meaningful_text? ⇒ Boolean
  True when visible anchor text has words.
- #text ⇒ String
  Visible anchor text.
Constructor Details
#initialize(anchor, context) ⇒ Candidate
Returns a new instance of Candidate.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 87

    def initialize(anchor, context)
      @anchor = anchor
      @context = context
    end
Instance Attribute Details
#anchor ⇒ Object (readonly)
Returns the value of attribute anchor.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 83

    def anchor
      @anchor
    end
Instance Method Details
#anchor_identity_attributes ⇒ Hash
Returns anchor identity attributes used to build AnchorFacts.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 112

    def anchor_identity_attributes
      {
        anchor:,
        text:,
        url: destination_facts.url,
        destination: destination_facts.destination,
        segments: destination_facts.segments
      }
    end
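The `anchor:` and `text:` entries in the hash above use Ruby's shorthand hash syntax (Ruby 3.1 and newer), where a key with no value is filled in by calling the method or reading the local variable of the same name. A minimal sketch of that equivalence follows; the class `ShorthandSketch` and its return values are hypothetical and not part of html2rss.

```ruby
# Hypothetical class, only here to show the shorthand hash syntax.
# Requires Ruby 3.1 or newer.
class ShorthandSketch
  def anchor
    :anchor_node # placeholder for a parsed anchor node
  end

  def text
    "Budget vote delayed"
  end

  # `anchor:` with no value calls #anchor, `text:` calls #text.
  def shorthand
    { anchor:, text: }
  end

  # The same hash written out in full.
  def longhand
    { anchor: anchor, text: text }
  end
end

sketch = ShorthandSketch.new
puts sketch.shorthand == sketch.longhand
```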
#anchor_signal_attributes ⇒ Hash
Returns anchor signal attributes used to build AnchorFacts.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 123

    def anchor_signal_attributes
      {
        meaningful_text: meaningful_text?,
        content_like_destination: content_like_destination?,
        heading_anchor: heading_anchor?,
        heading_text_match: heading_text_match?
      }
    end
#content_like_destination? ⇒ Boolean
Returns true when the destination route has content signals.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 138

    def content_like_destination?
      destination_facts.content_path
    end
#destination_facts ⇒ Html2rss::AutoSource::Scraper::LinkHeuristics::DestinationFacts?
Returns destination facts.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 102

    def destination_facts
      @destination_facts ||= @context.destination_facts(@anchor)
    end
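The `||=` memoization in this method caches only truthy results. Since the return type is nilable, a nil result is not stored and the context is asked again on each call. The sketch below illustrates that behaviour with two hypothetical stand-ins, `CountingContext` and `MemoSketch`; neither is part of html2rss, and the hash used as a result is just a placeholder for a DestinationFacts object.

```ruby
# Hypothetical stand-in for the context; it counts how often it is asked.
class CountingContext
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls = 0
  end

  def destination_facts(_anchor)
    @calls += 1
    @result
  end
end

# Hypothetical holder that memoizes the same way #destination_facts does.
class MemoSketch
  def initialize(anchor, context)
    @anchor = anchor
    @context = context
  end

  def destination_facts
    @destination_facts ||= @context.destination_facts(@anchor)
  end
end

# A truthy result is looked up once and then served from the cache.
found = CountingContext.new({ url: "https://example.com/posts/1" })
holder = MemoSketch.new(:anchor, found)
3.times { holder.destination_facts }
puts "lookups with a result: #{found.calls}"

# A nil result is never cached, so every call reaches the context again.
missing = CountingContext.new(nil)
holder = MemoSketch.new(:anchor, missing)
3.times { holder.destination_facts }
puts "lookups without a result: #{missing.calls}"
```

Whether the repeated lookup matters depends on how expensive the context call is, which the listing above does not show.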
#facts ⇒ AnchorFacts?
Returns ranked anchor facts when the anchor is eligible.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 93

    def facts
      return unless destination_facts
      return if utility_text_suppressed? || ineligible_anchor?
      return unless representative_content_anchor?

      AnchorFacts.from_candidate(self)
    end
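The method is a chain of guard clauses: any failed check returns nil, and only an anchor that passes all of them is turned into AnchorFacts. The three predicates it calls are not documented on this page, so the sketch below replaces each with a plain boolean field. `GuardSketch`, its field names, and the `:anchor_facts` placeholder are all hypothetical; only the order of the checks is taken from the listing.

```ruby
# Hypothetical stand-in: each predicate is a boolean field so the guard
# order from #facts can be exercised without html2rss.
GuardSketch = Struct.new(:destination_facts, :utility_text, :ineligible,
                         :representative, keyword_init: true) do
  def facts
    return unless destination_facts
    return if utility_text || ineligible
    return unless representative

    :anchor_facts # placeholder for AnchorFacts.from_candidate(self)
  end
end

# An anchor that passes every check.
eligible = { destination_facts: {}, utility_text: false,
             ineligible: false, representative: true }

# Flip one field at a time to see which guard rejects the anchor.
cases = {
  "eligible"             => eligible,
  "no destination facts" => eligible.merge(destination_facts: nil),
  "utility text"         => eligible.merge(utility_text: true),
  "ineligible anchor"    => eligible.merge(ineligible: true),
  "not representative"   => eligible.merge(representative: false)
}

cases.each do |label, fields|
  puts "#{label}: #{GuardSketch.new(**fields).facts.inspect}"
end
```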
#heading_anchor? ⇒ Boolean
Returns true when the anchor is inside the selected heading.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 143

    def heading_anchor?
      heading = @context.heading

      heading && @anchor.ancestors.include?(heading)
    end
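The check asks whether the selected heading appears among the anchor's ancestors. Because it is written as `heading && ...`, it evaluates to nil (falsy) rather than false when the context has no heading. A minimal sketch with a hypothetical `SketchNode` class standing in for parsed HTML nodes, and a hypothetical helper `inside_heading?` in place of the instance method:

```ruby
# Hypothetical minimal node: a name and a parent link. Only #ancestors
# matters for this check.
class SketchNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end

  # All enclosing nodes, nearest first.
  def ancestors
    parent ? [parent, *parent.ancestors] : []
  end
end

# Hypothetical stand-alone version of the expression in #heading_anchor?.
def inside_heading?(anchor, heading)
  heading && anchor.ancestors.include?(heading)
end

article     = SketchNode.new("article")
heading     = SketchNode.new("h2", article)
title_link  = SketchNode.new("a", heading) # <h2><a>...</a></h2>
footer_link = SketchNode.new("a", article) # sibling of the heading

p inside_heading?(title_link, heading)
p inside_heading?(footer_link, heading)
p inside_heading?(title_link, nil)
```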
#heading_text_match? ⇒ Boolean
Returns true when anchor text exactly matches heading text.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 150

    def heading_text_match?
      heading_text = @context.heading_text

      meaningful_text? &&
        heading_text.scan(/\p{Alnum}+/).any? &&
        heading_text == text
    end
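The comparison is a strict `==`, so the two strings must be identical character for character, and both must contain at least one word. The sketch below restates that rule as a stand-alone function. The name `exact_heading_match?` and the sample strings are hypothetical, and the sketch makes no assumption about how the context normalizes text before the comparison.

```ruby
WORD = /\p{Alnum}+/

# Hypothetical stand-alone version of the comparison: both strings must
# contain words, and they must be equal character for character.
def exact_heading_match?(text, heading_text)
  text.scan(WORD).any? && heading_text.scan(WORD).any? && heading_text == text
end

pairs = [
  ["Budget vote delayed", "Budget vote delayed"], # identical
  ["Budget vote delayed", "Budget Vote Delayed"], # differs in case
  ["Read more",           "Budget vote delayed"], # different text
  ["***",                 "***"]                  # equal, but no words
]

pairs.each do |text, heading_text|
  puts "#{text.inspect} vs #{heading_text.inspect}: " \
       "#{exact_heading_match?(text, heading_text)}"
end
```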
#meaningful_text? ⇒ Boolean
Returns true when visible anchor text has words.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 133

    def meaningful_text?
      text.scan(/\p{Alnum}+/).any?
    end
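"Has words" here means the text contains at least one run of Unicode letters or digits. The sketch below applies the same `\p{Alnum}+` scan to a few sample strings. The helper name `words?` and the samples are illustrative, not part of html2rss.

```ruby
# Hypothetical helper mirroring the check in #meaningful_text?:
# true when the string contains at least one run of letters or digits.
def words?(text)
  text.scan(/\p{Alnum}+/).any?
end

# Plain words, digits only, non-ASCII letters, a lone symbol,
# whitespace only, and the empty string.
samples = ["Read the full story", "2024", "Über uns", "→", "  ", ""]

samples.each do |sample|
  puts format("%-22s %s", sample.inspect, words?(sample))
end
```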
#text ⇒ String
Returns visible anchor text.
    # File 'lib/html2rss/html_extractor/semantic_anchor_candidates.rb', line 107

    def text
      @text ||= @context.visible_text(@anchor)
    end