Class: Html2rss::HtmlExtractor::TextExtractor

Inherits:
Object
Defined in:
lib/html2rss/html_extractor/text_extractor.rb

Overview

TextExtractor extracts visible text from DOM elements, preserving lists and block spacing while normalizing whitespace.

Constant Summary

BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze

HTML block elements that trigger line breaks or special formatting.

INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

Tags ignored when extracting visible text content.
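Both constants are frozen sets, which makes tag-name lookups cheap. The sketch below shows how such sets might be consulted. `classify_tag` is a hypothetical helper written for illustration and is not part of html2rss; the two constants are copied from the definitions above.

```ruby
require 'set'

# Copies of the two constants documented above.
BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze
INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

# Hypothetical helper (not part of html2rss): classify a tag name
# by looking it up in the two sets.
def classify_tag(name)
  key = name.to_s.downcase
  return :invisible if INVISIBLE_CONTENT_TAGS.include?(key)
  return :block if BLOCK_TAGS.include?(key)

  :inline
end

classify_tag('script') # => :invisible
classify_tag('LI')     # => :block
classify_tag('span')   # => :inline
```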

Class Method Summary

Class Method Details

.call(tag, separator: ' ', exclude_nodes: nil) ⇒ String?

Returns the concatenated visible text, or nil if none is found.

Parameters:

  • tag (Nokogiri::XML::Node)

    the node from which to extract visible text

  • separator (String) (defaults to: ' ')

    separator used to join text fragments

  • exclude_nodes (Array<Nokogiri::XML::Node>, nil) (defaults to: nil)

    nodes to exclude from extraction

Returns:

  • (String, nil)

    the concatenated visible text, or nil if none is found



# File 'lib/html2rss/html_extractor/text_extractor.rb', line 20

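def call(tag, separator: ' ', exclude_nodes: nil)
  # Text node: collapse every whitespace run into a single space and trim.
  return tag.text.gsub(/\s+/, ' ').strip if tag.respond_to?(:text?) && tag.text?

  # Otherwise collect text fragments from the children (iterate_children is not shown here).
  parts = iterate_children(tag, separator, exclude_nodes)
  return if parts.empty? # nothing visible: return nil

  # Squeeze repeated spaces only, so any line breaks in the fragments are kept.
  parts.join.squeeze(' ').strip
end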
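
The two return paths normalize whitespace differently, which is easy to miss. The following minimal sketch reproduces only the string operations from the method above, using plain Ruby strings. The helper names and the sample `parts` array are assumptions for illustration; the actual fragments produced by `iterate_children` are not shown in this page.

```ruby
# Text-node path: every whitespace run (including newlines and tabs)
# becomes a single space.
def normalize_text_node(text)
  text.gsub(/\s+/, ' ').strip
end

# Element path: only repeated spaces are squeezed, so newlines that
# appear in the joined fragments survive.
def normalize_joined_parts(parts)
  parts.join.squeeze(' ').strip
end

normalize_text_node("  Hello \n\t world  ")
# => "Hello world"

# Hypothetical fragments; the real ones come from iterate_children.
normalize_joined_parts(["First  item", "\n", "Second   item "])
# => "First item\nSecond item"
```

Because `String#squeeze(' ')` touches only space characters, line breaks between fragments are left intact on the element path, whereas the text-node path flattens them.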