Module: Jekyll::L10n::HtmlTextUtils
Defined in: lib/jekyll-l10n/utils/html_text_utils.rb
Overview
Utilities for extracting and manipulating HTML text content.
HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.
Key responsibilities:
- Extract text with inline HTML tags preserved
- Remove block-level elements from cloned nodes
- Remove empty icon tags
- Decode HTML entities to plain text
- Validate extracted text content
Constant Summary
- CONTENT_ELEMENTS = %w[p h1 h2 h3 h4 h5 h6 li dd dt blockquote figcaption button span a label].freeze
  Extended content elements for text extraction (includes inline elements).
- CONTAINER_ELEMENTS = HtmlElements::CONTAINER_ELEMENTS
- ALL_BLOCK_ELEMENTS = (CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze
Class Method Summary
- .decode_html_entities(text) ⇒ String
  Decode HTML entities to plain text.
- .extract_and_validate_text(node) ⇒ String?
  Extract and validate text from a node.
- .extract_with_inline_tags(node) ⇒ String
  Extract text with inline tags preserved.
- .extractable?(node) ⇒ Boolean
  Check if a node is extractable (content element).
- .remove_block_elements(node) ⇒ void
  Remove block-level elements from a node.
- .remove_block_elements_from_node(node) ⇒ void
  Remove block-level elements from a cloned node.
- .remove_empty_icon_tags(node) ⇒ void
  Remove empty icon tags from a node.
Class Method Details
.decode_html_entities(text) ⇒ String
Decode HTML entities to plain text.
Converts HTML entities (&amp;amp;, &amp;lt;, etc.) to their plain-text equivalents. Calls CGI.unescape_html and, if that raises a StandardError, falls back to manual replacement of a small set of common entities.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 40
def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
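As a minimal illustration of the primary code path, the snippet below calls CGI.unescape_html from Ruby's standard library directly. The sample strings are made up for this example and do not come from the library's tests.

```ruby
require 'cgi'

# Encoded markup as it might appear in an element's inner HTML
# (sample input invented for illustration).
encoded = '&lt;strong&gt;Terms &amp; Conditions&lt;/strong&gt;'

# CGI.unescape_html turns the named entities back into literal characters.
decoded = CGI.unescape_html(encoded)

# Quote entities, including the numeric form &#39;, are handled as well.
quoted = CGI.unescape_html('&quot;it&#39;s&quot;')

puts decoded
puts quoted
```

Note that the decoded string can contain literal angle brackets, so it should be treated as markup rather than as safe plain text.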
.extract_and_validate_text(node) ⇒ String?
Extract and validate text from a node.
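Extracts text from the node if it is a content element (see extractable?), then checks the result with TextValidator.valid?, which enforces the minimum length requirements. Returns nil if either check fails.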
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 114
def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
.extract_with_inline_tags(node) ⇒ String
Extract text with inline tags preserved.
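Clones the element, replaces block elements with their children, removes empty icon tags, normalizes whitespace, and decodes HTML entities. Because the result is built from the clone's inner HTML, inline tags are kept in the returned string, which is then ready for translation. The original node is not modified.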
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 97
def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text = decode_html_entities(text)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end
.extractable?(node) ⇒ Boolean
Check if a node is extractable (content element).
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 125
def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end
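The check can be tried without an HTML parser by using a stand-in node. In this sketch, FakeNode is a hypothetical Struct that only mimics the two methods the predicate relies on (element? and name). The constant and the method body are copied from this page, but defined at the top level rather than inside the module.

```ruby
# Hypothetical stand-in for a parsed HTML node. Real nodes come from the
# HTML parser; only #element? and #name matter for this check.
FakeNode = Struct.new(:name, :is_element) do
  def element?
    is_element
  end
end

# Same list as HtmlTextUtils::CONTENT_ELEMENTS above.
CONTENT_ELEMENTS = %w[
  p h1 h2 h3 h4 h5 h6 li dd dt blockquote figcaption button span a label
].freeze

# Same logic as HtmlTextUtils.extractable?.
def extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

results = {
  paragraph: extractable?(FakeNode.new('p', true)),    # content element
  div:       extractable?(FakeNode.new('div', true)),  # element, but not in the list
  text:      extractable?(FakeNode.new('text', false)) # not an element at all
}

p results
```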
.remove_block_elements(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a node.
Convenience wrapper that delegates to remove_block_elements_from_node.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 72
def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end
.remove_block_elements_from_node(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a cloned node.
Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 58
def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end
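The flattening idea, where a block element is replaced by its own children, can be sketched on a toy tree without a real parser. Everything in this example is hypothetical: Node, flatten_blocks, to_html, and the BLOCK_TAGS list are illustration-only names, and the actual contents of HtmlElements::BLOCK_ELEMENTS are not shown on this page.

```ruby
# Toy element: a tag name plus an array of children (Strings or Nodes).
Node = Struct.new(:name, :children)

# Assumed block tags, for illustration only.
BLOCK_TAGS = %w[div section].freeze

# Recursively replace every descendant whose tag is in block_tags with
# that descendant's children, leaving inline elements and text in place.
def flatten_blocks(node, block_tags)
  node.children = node.children.flat_map do |child|
    next [child] unless child.is_a?(Node) # plain text stays as-is

    flatten_blocks(child, block_tags)
    block_tags.include?(child.name) ? child.children : [child]
  end
  node
end

# Serialize the toy tree so the effect is easy to see.
def to_html(node)
  return node unless node.is_a?(Node)

  "<#{node.name}>#{node.children.map { |c| to_html(c) }.join}</#{node.name}>"
end

tree = Node.new('li', ['Read ', Node.new('div', [Node.new('em', ['the']), ' docs'])])
before = to_html(tree)
flatten_blocks(tree, BLOCK_TAGS)
after = to_html(tree)

puts before
puts after
```

The div wrapper disappears while the inline em element and the surrounding text survive, which is the property the extraction step depends on.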
.remove_empty_icon_tags(node) ⇒ void
This method returns an undefined value.
Remove empty icon tags from a node.
Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 83
def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end