Module: Jekyll::L10n::HtmlTextUtils
Defined in: lib/jekyll-l10n/utils/html_text_utils.rb
Overview
Utilities for extracting and manipulating HTML text content.
HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.
Key responsibilities:
- Extract text with inline HTML tags preserved
- Remove block-level elements from cloned nodes
- Remove empty icon tags
- Decode HTML entities to plain text
- Validate extracted text content
Constant Summary
- CONTENT_ELEMENTS = %w[p h1 h2 h3 h4 h5 h6 li dd dt blockquote figcaption button span a label].freeze
  Extended content elements for text extraction (includes inline elements)
- CONTAINER_ELEMENTS = HtmlElements::CONTAINER_ELEMENTS
- ALL_BLOCK_ELEMENTS = (CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze
Class Method Summary
- .decode_html_entities(text) ⇒ String
  Decode HTML entities to plain text.
- .extract_and_validate_text(node) ⇒ String?
  Extract and validate text from a node.
- .extract_with_inline_tags(node) ⇒ String
  Extract text with inline tags preserved.
- .extractable?(node) ⇒ Boolean
  Check if a node is extractable (content element).
- .remove_block_elements(node) ⇒ void
  Remove block-level elements from a node.
- .remove_block_elements_from_node(node) ⇒ void
  Remove block-level elements from a cloned node.
- .remove_empty_icon_tags(node) ⇒ void
  Remove empty icon tags from a node.
Class Method Details
.decode_html_entities(text) ⇒ String
Decode HTML entities to plain text.
Converts HTML entities (&amp;, &lt;, etc.) to their plain-text equivalents. Calls CGI.unescape_html and, if that raises a StandardError, falls back to manual string replacement.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 40
def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
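The two decoding paths can differ on input that has been escaped twice. The sketch below is illustrative only: fallback_decode is a hypothetical stand-in that mirrors the chained gsub fallback (the entity names in it are assumed), and the sample strings are invented. CGI.unescape_html scans the string once, whereas a chain that replaces &amp; first can decode the result a second time.

```ruby
require 'cgi'

# Hypothetical stand-in mirroring the chained-gsub fallback; entity names are assumed.
def fallback_decode(text)
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end

single = 'Tom &amp; Jerry &lt;3'
double = '&amp;lt;b&amp;gt;' # the literal text "&lt;b&gt;", escaped once more

cgi_single      = CGI.unescape_html(single) # single pass over the string
cgi_double      = CGI.unescape_html(double) # stays at one level of decoding
fallback_double = fallback_decode(double)   # "&amp;" goes first, so "&lt;" is decoded again

puts cgi_single
puts cgi_double
puts fallback_double
```

If double-escaped input matters, moving the &amp; replacement to the end of the chain is one way to keep the fallback to a single level of decoding.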
.extract_and_validate_text(node) ⇒ String?
Extract and validate text from a node.
Extracts text from the node if it is a content element, then checks that the text meets the minimum length requirements. Returns nil if the node is not extractable or the text fails validation.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 115
def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
.extract_with_inline_tags(node) ⇒ String
Extract text with inline tags preserved.
Extracts text from an element, removes block elements and empty icons, and normalizes whitespace. HTML entities (e.g. &lt;, &gt;) are preserved verbatim, so entity-encoded content inside inline elements (such as &lt;p&gt;) is written to PO msgids as-is and does not become a live HTML tag when the msgstr is later injected via inner_html.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 99
def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)
  text = TextNormalizer.normalize(clone.inner_html)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end
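To see why the entities are left encoded, the following sketch compares the tags found in a preserved string with those found after decoding. It uses only the standard library; the sample msgid is invented, and the regex tag scan is a rough illustration rather than how the gem inspects markup.

```ruby
require 'cgi'

# Hypothetical inner HTML of a paragraph that talks about the <p> tag.
msgid = 'Wrap each paragraph in <code>&lt;p&gt;</code> tags.'

# Preserved form: the only real tags are <code> and </code>.
preserved_tags = msgid.scan(/<[^>]+>/)

# Decoded form: "&lt;p&gt;" has turned into markup that a parser would treat as a tag.
decoded      = CGI.unescape_html(msgid)
decoded_tags = decoded.scan(/<[^>]+>/)

p preserved_tags
p decoded_tags
```

Injecting the decoded string as markup would open a real paragraph element inside the code element, which is the failure the verbatim handling avoids.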
.extractable?(node) ⇒ Boolean
Check if a node is extractable (content element).
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 126
def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end
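As a rough illustration of the check, the sketch below re-declares the content element list and runs the same predicate against a stand-in node. FakeNode is hypothetical and only implements the two methods the predicate calls; in the gem, the argument would be a parsed HTML node.

```ruby
CONTENT_ELEMENTS = %w[p h1 h2 h3 h4 h5 h6 li dd dt blockquote figcaption button span a label].freeze

# Hypothetical stand-in for a parsed node: exposes only name and element?.
FakeNode = Struct.new(:name, :element) do
  def element?
    element
  end
end

# Same predicate as the module method, written as a top-level helper for the demo.
def extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

span_ok = extractable?(FakeNode.new('span', true))  # inline element in the list
div_ok  = extractable?(FakeNode.new('div', true))   # element, but not in the list
text_ok = extractable?(FakeNode.new('text', false)) # not an element at all

p [span_ok, div_ok, text_ok]
```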
.remove_block_elements(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a node.
Convenience wrapper that delegates to remove_block_elements_from_node.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 72
def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end
.remove_block_elements_from_node(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a cloned node.
Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 58
def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end
.remove_empty_icon_tags(node) ⇒ void
This method returns an undefined value.
Remove empty icon tags from a node.
Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 83
def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end