Class: Html2rss::HtmlExtractor::TextExtractor
- Inherits: Object
  - Object
    - Html2rss::HtmlExtractor::TextExtractor
- Defined in: lib/html2rss/html_extractor/text_extractor.rb
Overview
TextExtractor extracts visible text from DOM elements, preserving lists and block spacing while normalizing whitespace.
Constant Summary
- BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze
  HTML block elements that trigger line breaks or special formatting.
- INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze
  Tags ignored when extracting visible text content.
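As an illustration of why these constants are frozen sets, the sketch below classifies a tag name with constant-time membership checks. The `classify_tag` helper is hypothetical and not part of the library; how `TextExtractor` actually consults the two sets is not shown on this page, so treat this as one plausible use rather than the real implementation.

```ruby
require 'set'

# Same values as the constants documented above.
BLOCK_TAGS = %w[p div li ul ol h1 h2 h3 h4 h5 h6 tr br].to_set.freeze
INVISIBLE_CONTENT_TAGS = %w[svg script noscript style template].to_set.freeze

# Hypothetical helper (an assumption, not library API): maps a tag name
# to :invisible, :block, or :inline using Set#include? lookups.
def classify_tag(name)
  name = name.to_s.downcase
  return :invisible if INVISIBLE_CONTENT_TAGS.include?(name)
  return :block if BLOCK_TAGS.include?(name)

  :inline
end

puts classify_tag('script') # invisible
puts classify_tag('LI')     # block
puts classify_tag('span')   # inline
```

Freezing the sets guards against accidental mutation at runtime, and a `Set` keeps each lookup independent of how many tag names are listed.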
Class Method Summary
- .call(tag, separator: ' ', exclude_nodes: nil) ⇒ String?
  The concatenated visible text, or nil if none is found.
Class Method Details
.call(tag, separator: ' ', exclude_nodes: nil) ⇒ String?
Returns the concatenated visible text, or nil if none is found.
    # File 'lib/html2rss/html_extractor/text_extractor.rb', line 20

    def call(tag, separator: ' ', exclude_nodes: nil)
      return tag.text.gsub(/\s+/, ' ').strip if tag.respond_to?(:text?) && tag.text?

      parts = iterate_children(tag, separator, exclude_nodes)
      return if parts.empty?

      parts.join.squeeze(' ').strip
    end
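The two return paths in `.call` normalize whitespace differently, which a plain-Ruby sketch can make concrete. The helper names `normalize_text_node` and `join_parts` are hypothetical, and the sample `parts` array is an assumption: the output of `iterate_children` is not documented on this page.

```ruby
# Text-node path: every whitespace run (spaces, tabs, newlines)
# collapses to a single space, then the ends are trimmed.
def normalize_text_node(text)
  text.gsub(/\s+/, ' ').strip
end

# Element path: join the collected parts, squeeze only repeated spaces
# (newlines survive), then trim. Returns nil for an empty list,
# mirroring the `return if parts.empty?` guard.
def join_parts(parts)
  return if parts.empty?

  parts.join.squeeze(' ').strip
end

p normalize_text_node("  Hello \n\t world  ")        # "Hello world"
p join_parts(['Item one', "\n", 'Item  two '])       # "Item one\nItem two"
p join_parts([])                                     # nil
```

Because `squeeze(' ')` only touches runs of spaces, any line breaks present in the joined parts are kept, which is consistent with the overview's claim that block spacing is preserved.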