Class: Html2rss::HtmlExtractor::TextExtractor

Inherits:

Object

Defined in: lib/html2rss/html_extractor/text_extractor.rb

Overview

TextExtractor extracts the visible text from DOM elements, preserving lists and block spacing while normalizing whitespace.

Constant Summary

BLOCK_TAGS =

HTML block elements that trigger line breaks or special formatting.

%w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze

INVISIBLE_CONTENT_TAGS =

Tags ignored when extracting visible text content.

%w[svg script noscript style template].to_set.freeze
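As a rough illustration of how two frozen sets like these can drive a per-tag decision, here is a minimal sketch. The `TagSketch` module and its `classify` helper are hypothetical names made up for this example and are not part of html2rss; the constants are copied from the summary above.

```ruby
require 'set'

# Hypothetical namespace for illustration only; not part of html2rss.
module TagSketch
  BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze
  INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

  # Classifies a lowercase tag name using set membership.
  # Invisible tags are checked first so they are never treated as text sources.
  def self.classify(name)
    return :invisible if INVISIBLE_CONTENT_TAGS.include?(name)
    return :block if BLOCK_TAGS.include?(name)

    :inline
  end
end
```

Using a `Set` rather than an `Array` keeps the membership check cheap when it runs once per node, and freezing it guards against accidental mutation.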

Class Method Summary

.call(tag, separator: ' ', exclude_nodes: nil) ⇒ String?

The concatenated visible text, or nil if none is found.

Class Method Details

.call(tag, separator: ' ', exclude_nodes: nil) ⇒ String?

Returns the concatenated visible text, or nil if none is found.

Parameters:

tag (Nokogiri::XML::Node): the node from which to extract visible text

separator (String) (defaults to: ' '): separator used to join text fragments

exclude_nodes (Array<Nokogiri::XML::Node>, nil) (defaults to: nil): nodes to exclude from extraction

Returns:

(String, nil): the concatenated visible text, or nil if none is found

# File 'lib/html2rss/html_extractor/text_extractor.rb', line 20

def call(tag, separator: ' ', exclude_nodes: nil)
  return tag.text.gsub(/\s+/, ' ').strip if tag.respond_to?(:text?) && tag.text?

  parts = iterate_children(tag, separator, exclude_nodes)
  return if parts.empty?

  parts.join.squeeze(' ').strip
end
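
The two return paths above normalize whitespace differently, which is easy to miss. The sketch below reproduces just the string handling on plain Ruby strings, with no Nokogiri involved. The helper names `normalize_text_node` and `join_parts` are hypothetical, and the `parts` arrays are made-up stand-ins for whatever `iterate_children` returns, which this page does not show.

```ruby
# Hypothetical helpers mirroring the string handling in .call; not part of html2rss.

# Text-node path: every whitespace run (spaces, tabs, newlines) becomes one space.
def normalize_text_node(text)
  text.gsub(/\s+/, ' ').strip
end

# Element path: fragments are joined, then only repeated *spaces* are squeezed,
# so newlines inside the result survive. Returns nil when there are no fragments.
def join_parts(parts)
  return if parts.empty?

  parts.join.squeeze(' ').strip
end

normalize_text_node("  Hello \n\t world ") # => "Hello world"
join_parts(['Item one', ' ', ' Item two']) # => "Item one Item two"
join_parts(['a', "\n", '  b'])             # => "a\n b"
join_parts([])                             # => nil
```

Because `squeeze(' ')` touches only space characters, any line breaks among the fragments are kept in the output, while the text-node path flattens everything onto one line.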