Class: Html2rss::HtmlExtractor
- Inherits: Object
  - Object
    - Html2rss::HtmlExtractor
- Defined in:
- lib/html2rss/html_extractor.rb,
lib/html2rss/html_extractor/id_generator.rb,
lib/html2rss/html_extractor/date_extractor.rb,
lib/html2rss/html_extractor/text_extractor.rb,
lib/html2rss/html_extractor/image_extractor.rb,
lib/html2rss/html_extractor/list_candidates.rb,
lib/html2rss/html_extractor/heading_extractor.rb,
lib/html2rss/html_extractor/enclosure_extractor.rb,
lib/html2rss/html_extractor/semantic_containers.rb,
lib/html2rss/html_extractor/semantic_anchor_candidates.rb
Overview
HtmlExtractor extracts article details (headline, URL, images, etc.) from an article_tag.
Defined Under Namespace
Classes: DateExtractor, EnclosureExtractor, HeadingExtractor, IdGenerator, ImageExtractor, ListCandidates, SemanticAnchorCandidates, SemanticContainers, TextExtractor
Constant Summary
- HEADING_TAGS =
  Heading tags used to prioritize title extraction.
  %w[h1 h2 h3 h4 h5 h6].freeze
- IGNORED_CONTAINER_TAGS =
  Element tags that indicate ignored DOM chrome when found in a container path.
  %w[nav footer header svg script style].to_set.freeze
- MAIN_ANCHOR_SELECTOR =
  Anchor selector used to identify the canonical article link element.
  begin
    buf = +'a[href]:not([href=""])'
    %w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
      buf << %[:not([href^="#{prefix}"])]
    end
    buf.freeze
  end
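To make the effect of MAIN_ANCHOR_SELECTOR easier to see, here is a minimal sketch that rebuilds the same selector string and mirrors its filtering in plain Ruby. The method names `rejected_href_prefixes`, `build_main_anchor_selector`, and `eligible_href?` are hypothetical and not part of Html2rss; the predicate only approximates what a CSS engine would match for the `href` value.

```ruby
# Prefixes copied from the constant definition above.
def rejected_href_prefixes
  %w[# javascript: mailto: tel: file:// sms: data:]
end

# Same construction as MAIN_ANCHOR_SELECTOR, wrapped in a (hypothetical) method.
def build_main_anchor_selector
  buf = +'a[href]:not([href=""])'
  rejected_href_prefixes.each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end

# Hypothetical plain-Ruby mirror of the selector's href conditions:
# the attribute must exist, be non-empty, and not start with a rejected prefix.
def eligible_href?(href)
  return false if href.nil? || href.empty?

  rejected_href_prefixes.none? { |prefix| href.start_with?(prefix) }
end

puts build_main_anchor_selector
puts eligible_href?('/articles/1')
puts eligible_href?('mailto:someone@example.com')
```

Building the selector once and freezing it means every lookup can reuse the same string instead of reassembling it per call.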
Class Method Summary
- .extract_visible_text(tag, separator: ' ', exclude_nodes: nil) ⇒ String?
  Extracts visible text from a given node and its children.
- .ignored_container_path?(node, cache = nil) ⇒ Boolean
  True when the node belongs to ignored DOM chrome.
- .main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?
  First eligible descendant anchor.
Instance Method Summary
- #call ⇒ Hash{Symbol => Object}
  Extracted article attributes.
- #initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false) ⇒ HtmlExtractor (constructor)
  A new instance of HtmlExtractor.
Constructor Details
#initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false) ⇒ HtmlExtractor
Returns a new instance of HtmlExtractor. Raises ArgumentError if article_tag is nil or false.
# File 'lib/html2rss/html_extractor.rb', line 76
def initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false)
  raise ArgumentError, 'article_tag is required' unless article_tag

  @article_tag = article_tag
  @base_url = base_url
  @selected_anchor = selected_anchor
  @fallback_anchorless = fallback_anchorless
end
Class Method Details
.extract_visible_text(tag, separator: ' ', exclude_nodes: nil) ⇒ String?
Extracts visible text from a given node and its children. Delegates to TextExtractor.
# File 'lib/html2rss/html_extractor.rb', line 33
def extract_visible_text(tag, separator: ' ', exclude_nodes: nil)
  TextExtractor.call(tag, separator:, exclude_nodes:)
end
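The delegation above relies on Ruby 3.1+ keyword shorthand, where `separator:, exclude_nodes:` forwards the local values under the same names. The sketch below illustrates only that forwarding pattern. `SketchTextExtractor` is a hypothetical stand-in that joins an array of strings; the real TextExtractor operates on Nokogiri nodes, and returning nil for empty input is an assumption suggested by the `String?` return type, not confirmed behavior.

```ruby
# Hypothetical stand-in for TextExtractor: joins strings instead of walking a DOM.
class SketchTextExtractor
  def self.call(words, separator: ' ', exclude_nodes: nil)
    kept = words - Array(exclude_nodes)
    text = kept.join(separator).strip
    # Assumption: an empty result is reported as nil rather than ''.
    text.empty? ? nil : text
  end
end

def extract_visible_text(tag, separator: ' ', exclude_nodes: nil)
  # `separator:, exclude_nodes:` is shorthand for
  # `separator: separator, exclude_nodes: exclude_nodes` (Ruby 3.1+).
  SketchTextExtractor.call(tag, separator:, exclude_nodes:)
end

puts extract_visible_text(%w[Breaking news], separator: ' / ')
```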
.ignored_container_path?(node, cache = nil) ⇒ Boolean
Returns true when the node belongs to ignored DOM chrome.
# File 'lib/html2rss/html_extractor.rb', line 50
def ignored_container_path?(node, cache = nil)
  return cache[node] if cache&.key?(node)

  res = walk_ignored_container_path?(node)
  cache[node] = res if cache
  res
end
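The optional cache memoizes the result per node, including false results, which is why the lookup uses `key?` rather than the truthiness of `cache[node]`. The following sketch shows that pattern in isolation. `SketchNode` is a hypothetical stand-in for a DOM node, and the body of `walk_ignored_container_path?` is an assumption (the source does not show it): here it simply checks the node and its ancestors against the ignored tag names.

```ruby
require 'set'

# Hypothetical stand-in for a DOM node: a tag name plus a parent link.
class SketchNode
  attr_reader :name, :parent

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
  end
end

SKETCH_IGNORED_TAGS = %w[nav footer header svg script style].to_set.freeze

# Assumed behavior: true when the node or any ancestor has an ignored tag name.
def walk_ignored_container_path?(node)
  current = node
  while current
    return true if SKETCH_IGNORED_TAGS.include?(current.name)

    current = current.parent
  end
  false
end

# Same memoization shape as the method above.
def ignored_container_path?(node, cache = nil)
  return cache[node] if cache&.key?(node)

  res = walk_ignored_container_path?(node)
  cache[node] = res if cache
  res
end

body = SketchNode.new('body')
nav_link = SketchNode.new('a', SketchNode.new('nav', body))
article = SketchNode.new('article', body)

cache = {}
puts ignored_container_path?(nav_link, cache) # walks up to <nav>
puts ignored_container_path?(article, cache)  # walks to the root
puts cache.size
```

Passing one shared hash while scanning many nodes of the same document avoids repeating the ancestor walk for nodes that have already been checked.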
.main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?
Returns first eligible descendant anchor.
# File 'lib/html2rss/html_extractor.rb', line 40
def main_anchor_for(article_tag)
  return article_tag if article_tag.name == 'a' && article_tag.matches?(MAIN_ANCHOR_SELECTOR)

  article_tag.at_css(MAIN_ANCHOR_SELECTOR)
end
Instance Method Details
#call ⇒ Hash{Symbol => Object}
Returns extracted article attributes.
# File 'lib/html2rss/html_extractor.rb', line 86
def call
  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at,
    enclosures: extract_enclosures,
    categories: extract_categories
  }
end