Class: Html2rss::HtmlExtractor
- Inherits: Object
  - Object
  - Html2rss::HtmlExtractor
- Defined in:
  - lib/html2rss/html_extractor.rb
  - lib/html2rss/html_extractor/date_extractor.rb
  - lib/html2rss/html_extractor/image_extractor.rb
  - lib/html2rss/html_extractor/list_candidates.rb
  - lib/html2rss/html_extractor/enclosure_extractor.rb
  - lib/html2rss/html_extractor/semantic_containers.rb
  - lib/html2rss/html_extractor/semantic_anchor_candidates.rb
Overview
HtmlExtractor is responsible for extracting details (headline, url, images, etc.) from an article_tag.
Defined Under Namespace
Modules: Extractors
Classes: DateExtractor, EnclosureExtractor, ImageExtractor, ListCandidates, SemanticAnchorCandidates, SemanticContainers
Constant Summary
- INVISIBLE_CONTENT_TAGS =
Tags ignored when extracting visible text content from article containers.
%w[svg script noscript style template].to_set.freeze
- IGNORED_CONTAINER_PATH =
Element path pattern ignored when traversing candidate article containers.
/(nav|footer|header|svg|script|style)/i
- HEADING_TAGS =
Heading tags used to prioritize title extraction.
%w[h1 h2 h3 h4 h5 h6].freeze
- NON_HEADLINE_SELECTOR =
Selector used to derive non-headline description nodes.
(HEADING_TAGS.map { |tag| ":not(#{tag})" } + INVISIBLE_CONTENT_TAGS.to_a).freeze
- MAIN_ANCHOR_SELECTOR =
Anchor selector used to identify the canonical article link element.
begin
  buf = +'a[href]:not([href=""])'
  %w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end
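To make the resulting selector easier to read, the snippet below rebuilds the string the same way the constant definition does and prints it. It is a standalone sketch: the local variable name `selector` is chosen here for illustration and is not part of the library.

```ruby
# Rebuild the selector string step by step, mirroring MAIN_ANCHOR_SELECTOR.
# `selector` is an illustrative local name, not a library identifier.
selector = +'a[href]:not([href=""])'

# Each prefix adds one :not([href^="..."]) clause, excluding fragment,
# script, and non-web URI schemes from the match.
%w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
  selector << %[:not([href^="#{prefix}"])]
end
selector.freeze

puts selector
```

The printed value is a single CSS selector: an anchor with a non-empty `href` that does not start with any of the listed prefixes.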
Class Method Summary
- .extract_visible_text(tag, separator: ' ') ⇒ String?
Extracts visible text from a given node and its children.
- .ignored_container_path?(node) ⇒ Boolean
True when the node belongs to ignored DOM chrome.
- .main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?
First eligible descendant anchor.
Instance Method Summary
- #call ⇒ Hash{Symbol => Object}
Extracted article attributes.
- #initialize(article_tag, base_url:, selected_anchor:) ⇒ HtmlExtractor (constructor)
A new instance of HtmlExtractor.
Constructor Details
#initialize(article_tag, base_url:, selected_anchor:) ⇒ HtmlExtractor
Returns a new instance of HtmlExtractor. Raises ArgumentError if article_tag is nil or false.
# File 'lib/html2rss/html_extractor.rb', line 57
def initialize(article_tag, base_url:, selected_anchor:)
  raise ArgumentError, 'article_tag is required' unless article_tag

  @article_tag = article_tag
  @base_url = base_url
  @selected_anchor = selected_anchor
end
Class Method Details
.extract_visible_text(tag, separator: ' ') ⇒ String?
Extracts visible text from a given node and its children. Returns nil when the node has no visible text.
# File 'lib/html2rss/html_extractor.rb', line 33
def extract_visible_text(tag, separator: ' ')
  parts = tag.children.filter_map do |child|
    next unless visible_child?(child)

    raw_text = child.children.empty? ? child.text : extract_visible_text(child)
    text = raw_text&.strip
    text unless text.to_s.empty?
  end

  parts.join(separator).squeeze(' ').strip unless parts.empty?
end
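The traversal can be tried without Nokogiri using a small stand-in node type. In the sketch below, `Node`, `text_node`, and `element` are hypothetical helpers introduced only for illustration, and `visible_child?` is an assumption: the real helper is not shown on this page, so it is approximated here as "tag name is not in INVISIBLE_CONTENT_TAGS".

```ruby
require 'set'

# Tag names treated as invisible, copied from the constant summary.
INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

# Hypothetical stand-in for a Nokogiri node: tag name, text, and children.
Node = Struct.new(:name, :text, :children)

# Builds a leaf node that carries text only.
def text_node(text)
  Node.new('text', text, [])
end

# Builds an element whose text is the concatenation of its children's text.
def element(name, *children)
  Node.new(name, children.map(&:text).join, children)
end

# ASSUMPTION: the library's visible_child? is not documented here; this
# approximation only filters by tag name.
def visible_child?(child)
  !INVISIBLE_CONTENT_TAGS.include?(child.name)
end

# Same body as the documented method.
def extract_visible_text(tag, separator: ' ')
  parts = tag.children.filter_map do |child|
    next unless visible_child?(child)

    raw_text = child.children.empty? ? child.text : extract_visible_text(child)
    text = raw_text&.strip
    text unless text.to_s.empty?
  end

  parts.join(separator).squeeze(' ').strip unless parts.empty?
end

article = element('article',
                  element('h2', text_node('Hello   World')),
                  element('script', text_node('alert(1)')),
                  element('p', text_node(' Read '), element('em', text_node('more'))))

plain = extract_visible_text(article)
piped = extract_visible_text(article, separator: ' | ')
empty = extract_visible_text(element('div'))

puts plain      # Hello World Read more
puts piped      # Hello World | Read more
p empty         # nil
```

Note that the recursive call does not forward `separator`, so the custom separator only joins the direct children of the node passed in; nested text is always joined with a single space.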
.ignored_container_path?(node) ⇒ Boolean
Returns true when the node belongs to ignored DOM chrome.
# File 'lib/html2rss/html_extractor.rb', line 96
def ignored_container_path?(node)
  path = node.respond_to?(:path) ? node.path : node.to_s
  path.match?(IGNORED_CONTAINER_PATH)
end
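The check is a plain regex match against the node's path, which can be illustrated with path strings alone. In this sketch, `FakeNode` is a hypothetical stand-in that exposes only `#path`, and the example paths are made up for illustration.

```ruby
# Pattern copied from the constant summary.
IGNORED_CONTAINER_PATH = /(nav|footer|header|svg|script|style)/i

# Same body as the documented method.
def ignored_container_path?(node)
  path = node.respond_to?(:path) ? node.path : node.to_s
  path.match?(IGNORED_CONTAINER_PATH)
end

# Hypothetical stand-in for a Nokogiri node; only #path is needed here.
FakeNode = Struct.new(:path)

in_nav     = ignored_container_path?(FakeNode.new('/html/body/nav/ul/li[2]/a'))
in_article = ignored_container_path?(FakeNode.new('/html/body/main/article[1]'))
# Objects without #path fall back to #to_s, so a plain string also works.
in_footer  = ignored_container_path?('/html/body/FOOTER/div')

p in_nav      # true
p in_article  # false
p in_footer   # true
```

Because the pattern is unanchored and case-insensitive, it matches any of the listed names anywhere in the path, not only as a whole path segment.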
.main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?
Returns the first eligible descendant anchor, or article_tag itself when it is already an eligible anchor.
# File 'lib/html2rss/html_extractor.rb', line 87
def main_anchor_for(article_tag)
  return article_tag if article_tag.name == 'a' && article_tag.matches?(MAIN_ANCHOR_SELECTOR)

  article_tag.at_css(MAIN_ANCHOR_SELECTOR)
end
Instance Method Details
#call ⇒ Hash{Symbol => Object}
Returns extracted article attributes.
# File 'lib/html2rss/html_extractor.rb', line 66
def call
  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at,
    enclosures: extract_enclosures,
    categories: extract_categories
  }
end