Class: Html2rss::HtmlExtractor

Inherits:
Object
Defined in:
lib/html2rss/html_extractor.rb,
lib/html2rss/html_extractor/id_generator.rb,
lib/html2rss/html_extractor/date_extractor.rb,
lib/html2rss/html_extractor/text_extractor.rb,
lib/html2rss/html_extractor/image_extractor.rb,
lib/html2rss/html_extractor/list_candidates.rb,
lib/html2rss/html_extractor/heading_extractor.rb,
lib/html2rss/html_extractor/enclosure_extractor.rb,
lib/html2rss/html_extractor/semantic_containers.rb,
lib/html2rss/html_extractor/semantic_anchor_candidates.rb

Overview

HtmlExtractor extracts article details (headline, URL, images, etc.) from an article_tag.

Defined Under Namespace

Classes: DateExtractor, EnclosureExtractor, HeadingExtractor, IdGenerator, ImageExtractor, ListCandidates, SemanticAnchorCandidates, SemanticContainers, TextExtractor

Constant Summary

HEADING_TAGS =

Heading tags used to prioritize title extraction.

%w[h1 h2 h3 h4 h5 h6].freeze
IGNORED_CONTAINER_TAGS =

Element tags that indicate ignored DOM chrome when found in a container path.

%w[nav footer header svg script style].to_set.freeze
MAIN_ANCHOR_SELECTOR =

Anchor selector used to identify the canonical article link element.

begin
  buf = +'a[href]:not([href=""])'
  %w[# javascript: mailto: tel: file:// sms: data:].each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end
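
To make the resulting selector concrete, the sketch below rebuilds it in plain Ruby and prints the final string. The names `EXCLUDED_PREFIXES`, `SELECTOR`, and `eligible_href?` are introduced here for illustration only and are not part of Html2rss; the predicate is an assumed plain-Ruby restatement of the same prefix rule, not how the library matches anchors.

```ruby
# Prefixes copied from the constant definition above.
EXCLUDED_PREFIXES = %w[# javascript: mailto: tel: file:// sms: data:].freeze

# Rebuild the selector the same way: start with "any anchor that has a
# non-empty href", then append one :not(...) clause per excluded prefix.
SELECTOR = begin
  buf = +'a[href]:not([href=""])'
  EXCLUDED_PREFIXES.each do |prefix|
    buf << %[:not([href^="#{prefix}"])]
  end
  buf.freeze
end

puts SELECTOR

# Hypothetical helper (not part of Html2rss): the same rule expressed as a
# predicate on an href string instead of a CSS selector.
def eligible_href?(href)
  return false if href.nil? || href.empty?

  EXCLUDED_PREFIXES.none? { |prefix| href.start_with?(prefix) }
end

puts eligible_href?('/articles/1')
puts eligible_href?('mailto:editor@example.com')
```

Under this rule, relative and absolute article links pass, while fragment-only links and non-navigational schemes such as `mailto:` are rejected.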

Constructor Details

#initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false) ⇒ HtmlExtractor

Returns a new instance of HtmlExtractor.

Parameters:

  • article_tag (Nokogiri::XML::Node)

    article-like container to extract from

  • base_url (String, Html2rss::Url)

    base url used to resolve relative links

  • selected_anchor (Nokogiri::XML::Node, nil)

    explicit primary anchor for the container

  • fallback_anchorless (Boolean) (defaults to: false)

    whether to fall back to anchorless extraction

Raises:

  • (ArgumentError)

    if article_tag is missing (nil or false)


# File 'lib/html2rss/html_extractor.rb', line 84

def initialize(article_tag, base_url:, selected_anchor:, fallback_anchorless: false)
  raise ArgumentError, 'article_tag is required' unless article_tag

  @article_tag = article_tag
  @base_url = base_url
  @selected_anchor = selected_anchor
  @fallback_anchorless = fallback_anchorless
end

Class Method Details

.extract_visible_text(tag, separator: ' ', exclude_nodes: nil) ⇒ String?

Extracts visible text from a given node and its children. Delegates to TextExtractor.

Parameters:

  • tag (Nokogiri::XML::Node)

    the node from which to extract visible text

  • separator (String) (defaults to: ' ')

    separator used to join text fragments (default is a space)

  • exclude_nodes (Array<Nokogiri::XML::Node>, nil) (defaults to: nil)

    nodes to exclude from extraction

Returns:

  • (String, nil)

    the concatenated visible text, or nil if none is found



# File 'lib/html2rss/html_extractor.rb', line 33

def extract_visible_text(tag, separator: ' ', exclude_nodes: nil)
  TextExtractor.call(tag, separator:, exclude_nodes:)
end

.ignored_container_path?(node, cache = nil) ⇒ Boolean

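Checks whether the node or any of its ancestors is one of the IGNORED_CONTAINER_TAGS. When a cache is given, the result is stored for every node visited during the ancestor walk so that later lookups can stop early.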

Parameters:

  • node (Nokogiri::XML::Node)
  • cache (Hash, nil) (defaults to: nil)

    identity cache used to store results (must use compare_by_identity)

Returns:

  • (Boolean)

    true when the node belongs to ignored DOM chrome



# File 'lib/html2rss/html_extractor.rb', line 51

def ignored_container_path?(node, cache = nil)
  return cache[node] if cache&.key?(node)

  curr = node
  visited = []
  is_ignored = false

  while curr.respond_to?(:parent) && curr
    if cache&.key?(curr)
      is_ignored = cache[curr]
      break
    end

    if IGNORED_CONTAINER_TAGS.include?(curr.name)
      is_ignored = true
      break
    end

    visited << curr
    curr = curr.parent
  end
  visited.each { |n| cache[n] = is_ignored } if cache

  is_ignored
end
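
The effect of sharing one identity cache across several lookups can be sketched without Nokogiri. The example below is an assumption-laden illustration: `FakeNode`, `IGNORED_TAGS`, and `ignored_path?` are hypothetical stand-ins, and the walk is a simplified recursive restatement rather than the method shown above (for instance, it also caches the ignored ancestor itself).

```ruby
require 'set'

# Stand-in for Nokogiri::XML::Node; only #name and #parent are needed.
FakeNode = Struct.new(:name, :parent)

# Same tag list as IGNORED_CONTAINER_TAGS.
IGNORED_TAGS = %w[nav footer header svg script style].to_set.freeze

# Simplified recursive walk: a node is ignored when it, or any ancestor,
# has an ignored tag name. Every node visited is memoized in the cache.
def ignored_path?(node, cache)
  return false if node.nil?
  return cache[node] if cache.key?(node)

  cache[node] = IGNORED_TAGS.include?(node.name) || ignored_path?(node.parent, cache)
end

body    = FakeNode.new('body', nil)
nav     = FakeNode.new('nav', body)
list    = FakeNode.new('ul', nav)
nav_a   = FakeNode.new('a', list)
article = FakeNode.new('article', body)
art_a   = FakeNode.new('a', article)

# Keys are compared by object identity, as the cache parameter requires.
cache = {}.compare_by_identity

puts ignored_path?(nav_a, cache) # anchor inside <nav>
puts ignored_path?(art_a, cache) # anchor inside <article>
puts ignored_path?(list, cache)  # answered directly from the cache
```

Reusing the same cache for all candidate nodes of a document means each ancestor chain is walked at most once.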

.main_anchor_for(article_tag) ⇒ Nokogiri::XML::Node?

Returns the container itself when it is an eligible anchor, otherwise its first eligible descendant anchor.

Parameters:

  • article_tag (Nokogiri::XML::Node)

    article-like container to search within

Returns:

  • (Nokogiri::XML::Node, nil)

    the container itself if it is an eligible anchor, otherwise its first eligible descendant anchor, or nil if none matches



# File 'lib/html2rss/html_extractor.rb', line 40

def main_anchor_for(article_tag)
  return article_tag if article_tag.name == 'a' && article_tag.matches?(MAIN_ANCHOR_SELECTOR)

  article_tag.at_css(MAIN_ANCHOR_SELECTOR)
end

Instance Method Details

#call ⇒ Hash{Symbol => Object}

Returns extracted article attributes.

Returns:

  • (Hash{Symbol => Object})

    extracted article attributes



# File 'lib/html2rss/html_extractor.rb', line 94

def call
  {
    title: extract_title,
    url: extract_url,
    image: extract_image,
    description: extract_description,
    id: generate_id,
    published_at: extract_published_at,
    enclosures: extract_enclosures,
    categories: extract_categories
  }
end