Class: Html2rss::HtmlExtractor

Inherits: Object

Defined in: lib/html2rss/html_extractor.rb,
lib/html2rss/html_extractor/id_generator.rb,
lib/html2rss/html_extractor/date_extractor.rb,
lib/html2rss/html_extractor/text_extractor.rb,
lib/html2rss/html_extractor/image_extractor.rb,
lib/html2rss/html_extractor/list_candidates.rb,
lib/html2rss/html_extractor/heading_extractor.rb,
lib/html2rss/html_extractor/enclosure_extractor.rb,
lib/html2rss/html_extractor/semantic_containers.rb,
lib/html2rss/html_extractor/semantic_anchor_candidates.rb

Overview

HtmlExtractor extracts article details (title, URL, image, description, etc.) from an article_tag, an article-like container node.

Defined Under Namespace

Classes: DateExtractor, EnclosureExtractor, HeadingExtractor, IdGenerator, ImageExtractor, ListCandidates, SemanticAnchorCandidates, SemanticContainers, TextExtractor

Constant Summary

HEADING_TAGS: Heading tags used to prioritize title extraction.

%w[h1 h2 h3 h4 h5 h6].freeze

IGNORED_CONTAINER_TAGS: Element tags that indicate ignored DOM chrome when found in a container path.

%w[nav footer header svg script style].to_set.freeze

MAIN_ANCHOR_SELECTOR: Anchor selector used to identify the canonical article link element.

begin
  buf = +'a[href]:not([href=""])'
  %w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end
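To make the resulting selector easier to inspect, the following sketch rebuilds the string the same way the constant does and prints it. The method name `build_main_anchor_selector` and the constant `EXCLUDED_PREFIXES` are illustrative names introduced here and are not part of Html2rss.

```ruby
# Illustrative only: EXCLUDED_PREFIXES and build_main_anchor_selector are
# hypothetical names, not part of the Html2rss API.
EXCLUDED_PREFIXES = %w[# javascript: mailto: tel: file:// sms: data:].freeze

# Starts with "any anchor with a non-empty href" and appends one
# :not([href^="..."]) clause per excluded prefix.
def build_main_anchor_selector(prefixes = EXCLUDED_PREFIXES)
  buf = +'a[href]:not([href=""])'
  prefixes.each { |prefix| buf << %[:not([href^="#{prefix}"])] }
  buf.freeze
end

selector = build_main_anchor_selector
puts selector
```

Each excluded prefix contributes one `:not(...)` clause, so fragment links and non-navigational schemes such as `mailto:` never match.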

Class Method Summary

.extract_visible_text(tag, separator: ' ', exclude_nodes: nil) ⇒ String?

Extracts visible text from a given node and its children.

.ignored_container_path?(node, cache = nil) ⇒ Boolean

True when the node belongs to ignored DOM chrome.

.main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?

First eligible descendant anchor.

Instance Method Summary

#call ⇒ Hash{Symbol => Object}

Extracted article attributes.

#initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false) ⇒ HtmlExtractor (constructor)

A new instance of HtmlExtractor.

Constructor Details

#initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false) ⇒ `HtmlExtractor`

Returns a new instance of HtmlExtractor.

Parameters:

article_tag (Nokogiri::XML::Node): article-like container to extract from

base_url (String, Html2rss::Url): base URL used to resolve relative links

selected_anchor (Nokogiri::XML::Node, nil): explicit primary anchor for the container

fallback_anchorless (Boolean) (defaults to: false): whether to fall back to anchorless extraction

Raises:

(ArgumentError): when article_tag is nil

# File 'lib/html2rss/html_extractor.rb', line 76

def initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false)
  raise ArgumentError, 'article_tag is required' unless article_tag

  @article_tag = article_tag
  @base_url = base_url
  @selected_anchor = selected_anchor
  @fallback_anchorless = fallback_anchorless
end

Class Method Details

.extract_visible_text(tag, separator: ' ', exclude_nodes: nil) ⇒ `String?`

Extracts visible text from a given node and its children. Delegates to TextExtractor.

Parameters:

tag (Nokogiri::XML::Node): the node from which to extract visible text

separator (String) (defaults to: ' '): separator used to join text fragments

exclude_nodes (Array<Nokogiri::XML::Node>, nil) (defaults to: nil): nodes to exclude from extraction

Returns:

(String, nil): the concatenated visible text, or nil if none is found

# File 'lib/html2rss/html_extractor.rb', line 33

def extract_visible_text(tag, separator: ' ', exclude_nodes: nil)
  TextExtractor.call(tag, separator:, exclude_nodes:)
end

.ignored_container_path?(node, cache = nil) ⇒ `Boolean`

Returns true when the node belongs to ignored DOM chrome.

Parameters:

node (Nokogiri::XML::Node)

cache (Hash, nil) (defaults to: nil): identity cache used to store results (must use compare_by_identity)

Returns:

(Boolean): true when the node belongs to ignored DOM chrome

# File 'lib/html2rss/html_extractor.rb', line 50

def ignored_container_path?(node, cache = nil)
  return cache[node] if cache&.key?(node)

  res = walk_ignored_container_path?(node)
  cache[node] = res if cache
  res
end
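The caching pattern above can be sketched without Nokogiri. In this minimal example, `FakeNode`, `IGNORED`, `walk_ignored?`, and `ignored_path?` are hypothetical stand-ins; in particular, the real `walk_ignored_container_path?` is not shown on this page, so the walk below (checking the node and then each ancestor) is an assumption about its behavior.

```ruby
require 'set'

# Hypothetical stand-ins for illustration; not part of the Html2rss API.
IGNORED = %w[nav footer header svg script style].to_set.freeze
FakeNode = Struct.new(:name, :parent)

# Assumed behavior: true if the node or any ancestor has an ignored tag name.
def walk_ignored?(node)
  current = node
  while current
    return true if IGNORED.include?(current.name)

    current = current.parent
  end
  false
end

# Same memoization shape as ignored_container_path?: read the cache first,
# compute on a miss, and store the result only when a cache was supplied.
def ignored_path?(node, cache = nil)
  return cache[node] if cache&.key?(node)

  res = walk_ignored?(node)
  cache[node] = res if cache
  res
end

nav   = FakeNode.new('nav', nil)
link  = FakeNode.new('a', nav)
plain = FakeNode.new('a', nil)
twin  = FakeNode.new('a', nil) # equal to plain by value, but a different object

cache = {}.compare_by_identity
results = [link, plain, twin].map { |node| ignored_path?(node, cache) }
```

Because the cache compares keys by identity, `plain` and `twin` get separate entries even though they are equal by value. With an ordinary Hash they would share one entry, which is why the cache is documented as requiring compare_by_identity.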

.main_anchor_for(article_tag) ⇒ `Nokogiri::XML::Node?`

Returns the first eligible descendant anchor, or article_tag itself when it is already an eligible anchor.

Parameters:

article_tag (Nokogiri::XML::Node): article-like container to search within

Returns:

(Nokogiri::XML::Node, nil): first eligible descendant anchor, or nil when none matches

# File 'lib/html2rss/html_extractor.rb', line 40

def main_anchor_for(article_tag)
  return article_tag if article_tag.name == 'a' && article_tag.matches?(MAIN_ANCHOR_SELECTOR)

  article_tag.at_css(MAIN_ANCHOR_SELECTOR)
end
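What counts as "eligible" is decided by MAIN_ANCHOR_SELECTOR. As a rough plain-Ruby approximation of that selector's href rules, the sketch below uses a hypothetical helper, `eligible_href?`, which is not part of Html2rss and assumes the same case-sensitive prefix matching as the CSS `^=` operator.

```ruby
# Hypothetical helper for illustration; Html2rss itself relies on the CSS
# selector rather than a method like this.
EXCLUDED_HREF_PREFIXES = %w[# javascript: mailto: tel: file:// sms: data:].freeze

# An href qualifies when it is present, non-empty, and does not start with
# any excluded prefix.
def eligible_href?(href)
  return false if href.nil? || href.empty?

  EXCLUDED_HREF_PREFIXES.none? { |prefix| href.start_with?(prefix) }
end

hrefs = ['/articles/1', '#top', 'mailto:team@example.com', '', 'https://example.com/post']
eligible = hrefs.select { |href| eligible_href?(href) }
```

Only the relative path and the absolute https URL survive the filter in this sample list.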

Instance Method Details

#call ⇒ `Hash{Symbol => Object}`

Returns extracted article attributes.

Returns:

(Hash{Symbol => Object}): extracted article attributes

# File 'lib/html2rss/html_extractor.rb', line 86

def call
  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at,
    enclosures: extract_enclosures,
    categories: extract_categories
  }
end
