Module: Jekyll::L10n::HtmlTextUtils
- Defined in: lib/jekyll-l10n/utils/html_text_utils.rb
Overview
Utilities for extracting and manipulating HTML text content.
HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.
Key responsibilities:
- Extract text with inline HTML tags preserved
- Remove block-level elements from cloned nodes
- Remove empty icon tags
- Decode HTML entities to plain text
- Validate extracted text content
Constant Summary
- CONTENT_ELEMENTS =
Extended content elements for text extraction (includes inline elements)
%w[ p h1 h2 h3 h4 h5 h6 li dd dt blockquote figcaption button span a label th td caption ].freeze
- CONTAINER_ELEMENTS =
HtmlElements::CONTAINER_ELEMENTS
- ALL_BLOCK_ELEMENTS =
(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze
- LAYOUT_ONLY_ELEMENTS =
Layout-only block elements: in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS, and not pre (which remove_code_blocks safely strips before flattening). Flattening these destroys structural nesting (ul/li, table rows, etc.) and must never be attempted.
(HtmlElements::BLOCK_ELEMENTS - ALL_BLOCK_ELEMENTS - %w[pre]).freeze
- PLACEHOLDERED_INLINE_TAGS =
Inline elements whose content is translatable; replaced with <g id="N">.
%w[a span em strong b u abbr mark label].freeze
- LITERAL_INLINE_TAGS =
Inline elements with literal content — tag kept, non-translatable attrs stripped.
%w[code var kbd samp].freeze
- STRIP_FROM_LITERAL_TAGS =
Attributes stripped from literal-content elements before extraction.
%w[class style id].freeze
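To make the relationship between these constants concrete, here is a minimal sketch of the same array arithmetic. The element lists are abbreviated, made-up stand-ins: the real contents of HtmlElements::BLOCK_ELEMENTS and HtmlElements::CONTAINER_ELEMENTS are not shown on this page.

```ruby
# Hypothetical, abbreviated stand-ins; the real lists live in HtmlElements.
block_elements     = %w[p h1 li div section ul ol table pre]
container_elements = %w[div section]
content_elements   = %w[p h1 li span a]

# Same derivation as ALL_BLOCK_ELEMENTS and LAYOUT_ONLY_ELEMENTS above.
all_block_elements   = content_elements + container_elements
layout_only_elements = block_elements - all_block_elements - %w[pre]

puts layout_only_elements.inspect
```

With these stand-in lists, only ul, ol and table remain: block elements that are neither content nor container elements, with pre excluded explicitly.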
Class Method Summary
- .decode_html_entities(text) ⇒ String
Decode HTML entities to plain text.
- .extract_and_validate_text(node) ⇒ String?
Extract and validate text from a node.
- .extract_with_inline_tags(node) ⇒ String
Extract text with inline tags preserved.
- .extractable?(node) ⇒ Boolean
Check if a node is extractable (content element).
- .layout_block_children?(node) ⇒ Boolean
Check if a node has any direct child that is a layout-only block element.
- .remove_block_elements(node) ⇒ void
Remove block-level elements from a node.
- .remove_block_elements_from_node(node) ⇒ void
Remove block-level elements from a cloned node.
- .remove_code_blocks(node) ⇒ void
Remove preformatted code blocks from a node.
- .remove_empty_icon_tags(node) ⇒ void
Remove empty icon tags from a node.
- .replace_inline_elements_with_g_placeholders(node) ⇒ void
Replace top-level translatable inline elements with <g id="N"> placeholders.
- .strip_attributes_from_literal_elements(node) ⇒ void
Strip non-translatable attributes from literal-content inline elements.
- .top_level_placeholdered_inline_elements(node) ⇒ Object
Class Method Details
.decode_html_entities(text) ⇒ String
Decode HTML entities to plain text.
Converts HTML entities (&amp;, &lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html and falls back to manual replacement if that call raises a StandardError.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 41
def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
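The sketch below illustrates the decoding behaviour using only Ruby's standard library. The helpers decode_single_pass and decode_chained are hypothetical and are not part of HtmlTextUtils; they exist only to contrast a one-pass replacement with chained gsub calls that begin with &amp;.

```ruby
require 'cgi'

# Hypothetical helper: decode in one pass, so text produced by a replacement
# is never scanned again.
def decode_single_pass(text)
  table = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>', '&quot;' => '"', '&#39;' => "'" }
  text.gsub(/&(?:amp|lt|gt|quot|#39);/, table)
end

# Hypothetical helper: chained replacements that start with &amp; re-scan the
# output of the first step, so a doubly escaped entity is decoded twice.
def decode_chained(text)
  text.gsub('&amp;', '&').gsub('&lt;', '<')
end

plain   = CGI.unescape_html('Tom &amp; Jerry &lt;3') # "Tom & Jerry <3"
literal = CGI.unescape_html('&amp;lt;')              # "&lt;"
single  = decode_single_pass('&amp;lt;')             # "&lt;"
chained = decode_chained('&amp;lt;')                 # "<"

puts plain, literal, single, chained
```

For ordinary input all three approaches agree. They differ only on doubly escaped text such as &amp;lt;, which matters if source pages deliberately display entity names.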
.extract_and_validate_text(node) ⇒ String?
Extract and validate text from a node.
Extracts text from the node if it is a content element, then checks that the text meets the minimum length requirements. Returns nil if the node is not extractable or the text fails validation.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 205
def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
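The guard-then-validate flow can be sketched without a parser. Everything here is a stand-in: the Struct plays the role of a parsed node, inner_text replaces the real inline-tag extraction, and the minimum length of 2 is an assumed threshold, since the actual TextValidator rules are not shown on this page.

```ruby
# Stand-in node type (assumption): the real method receives a parser node.
node_class = Struct.new(:name, :inner_text) do
  def element?
    true
  end
end

content_elements = %w[p h1 li] # abbreviated list
min_length = 2                 # assumed threshold, not the real validator rule

extract_and_validate = lambda do |node|
  # Guard: only content elements are considered at all.
  next nil unless node.element? && content_elements.include?(node.name)

  text = node.inner_text.strip
  text.length >= min_length ? text : nil
end

heading   = extract_and_validate.call(node_class.new('h1', ' Getting started '))
wrapper   = extract_and_validate.call(node_class.new('div', 'Layout only'))
too_short = extract_and_validate.call(node_class.new('p', ' x '))

p heading, wrapper, too_short
```

Because both failure modes return nil, a caller can map this over many nodes and call compact to keep only translatable strings.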
.extract_with_inline_tags(node) ⇒ String
Extract text with inline tags preserved.
Extracts text from a clone of the element: removes <pre> blocks, flattens block elements, removes empty icons, replaces translatable inline elements with <g id="N"> placeholders, and strips non-translatable attributes from literal elements (<code> etc.). HTML entities (e.g. &lt;, &gt;) are preserved verbatim.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 130
def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_code_blocks(clone)
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)
  replace_inline_elements_with_g_placeholders(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end
.extractable?(node) ⇒ Boolean
Check if a node is extractable (content element).
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 216
def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end
.layout_block_children?(node) ⇒ Boolean
Check if a node has any direct child that is a layout-only block element.
Layout-only elements (ul, ol, dl, table, form, etc.) are in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS. When present as direct children, remove_block_elements_from_node would flatten them, destroying structural nesting (dropdown menus, nested lists, table rows). Callers should skip extraction for such nodes.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 194
def self.layout_block_children?(node)
  node.children.any? { |c| c.element? && LAYOUT_ONLY_ELEMENTS.include?(c.name) }
end
.remove_block_elements(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a node.
Convenience wrapper that delegates to remove_block_elements_from_node.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 89
def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end
.remove_block_elements_from_node(node) ⇒ void
This method returns an undefined value.
Remove block-level elements from a cloned node.
Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 75
def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end
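The flattening idea, replacing a block element with its own children, can be illustrated on a toy tree in plain Ruby. This is a sketch only: the nested-array node shape and the flatten_blocks and render helpers are hypothetical, whereas the method above operates on parser nodes through xpath and replace.

```ruby
# Toy tree: [tag, *children], where each child is a String or another node.

# Return a copy of `node` in which every descendant whose tag is listed in
# `block_tags` has been replaced by its own (already flattened) children.
# The root itself is never flattened, mirroring the ".//tag" descendant search.
def flatten_blocks(node, block_tags)
  tag, *children = node
  flat = children.flat_map do |child|
    next [child] if child.is_a?(String)

    flattened = flatten_blocks(child, block_tags)
    block_tags.include?(child.first) ? flattened.drop(1) : [flattened]
  end
  [tag, *flat]
end

# Serialize the toy tree back to markup.
def render(node)
  return node if node.is_a?(String)

  tag, *children = node
  "<#{tag}>#{children.map { |c| render(c) }.join}</#{tag}>"
end

tree = ['li', 'Intro ', ['div', ['p', 'Nested ', ['em', 'inline'], ' text']], ' outro']
result = render(flatten_blocks(tree, %w[div p]))

puts result
# <li>Intro Nested <em>inline</em> text outro</li>
```

The div and p wrappers disappear while the inline em survives. That is also why layout-only children such as ul or table must be detected beforehand: flattening them would merge their items into a single run of text.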
.remove_code_blocks(node) ⇒ void
This method returns an undefined value.
Remove preformatted code blocks from a node.
Removes all <pre> elements entirely. With highlighter: none in Jekyll config, fenced code blocks produce plain <pre><code> as direct children of content elements — no Rouge wrappers. Removing <pre> before extraction ensures raw code never appears in PO msgids.
Must run before remove_block_elements_from_node so that <code> inside <pre> is gone before the general flattening pass.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 64
def self.remove_code_blocks(node)
  node.css('pre').each(&:remove)
end
.remove_empty_icon_tags(node) ⇒ void
This method returns an undefined value.
Remove empty icon tags from a node.
Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 100
def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end
.replace_inline_elements_with_g_placeholders(node) ⇒ void
This method returns an undefined value.
Replace top-level translatable inline elements with <g id="N"> placeholders.
Only top-level inline elements are replaced; elements nested inside another inline element are preserved as content of the outer <g>. Inline elements with no text content are removed. Literal-content elements (<code>, <var>, etc.) are not placeholdered; their non-translatable attributes are stripped instead.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 150
def self.replace_inline_elements_with_g_placeholders(node)
  g_index = 0
  top_level_placeholdered_inline_elements(node).each do |el|
    if el.text.strip.empty?
      el.remove
    else
      g_index += 1
      g = node.document.create_element('g')
      g['id'] = g_index.to_s
      el.children.each { |child| g.add_child(child.dup) }
      el.replace(g)
    end
  end
  strip_attributes_from_literal_elements(node)
end
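The numbering and nesting rules are easier to see on a small example. The sketch below reimplements the idea on a toy tree in plain Ruby. The [tag, attrs, *children] node shape and the helpers text_of, placeholder and render are hypothetical, and for brevity the sketch leaves literal elements untouched rather than stripping their attributes.

```ruby
# Toy node: [tag, attrs_hash, *children]; children are Strings or nodes.

# Concatenated text content of a node.
def text_of(node)
  return node if node.is_a?(String)

  node.drop(2).map { |c| text_of(c) }.join
end

# Replace each outermost translatable inline element with a numbered <g>.
# Content nested inside a placeholdered or literal element is not visited.
def placeholder(children, counter)
  inline  = %w[a span em strong b u abbr mark label]
  literal = %w[code var kbd samp]
  children.filter_map do |child|
    next child if child.is_a?(String)

    tag, attrs, *kids = child
    if inline.include?(tag)
      next if text_of(child).strip.empty? # empty inline element is dropped

      counter[:n] += 1
      ['g', { 'id' => counter[:n].to_s }, *kids]
    elsif literal.include?(tag)
      child # kept as-is in this sketch
    else
      [tag, attrs, *placeholder(kids, counter)]
    end
  end
end

# Serialize the toy tree back to markup.
def render(node)
  return node if node.is_a?(String)

  tag, attrs, *kids = node
  attr_text = attrs.map { |k, v| %( #{k}="#{v}") }.join
  "<#{tag}#{attr_text}>#{kids.map { |k| render(k) }.join}</#{tag}>"
end

link = ['a', { 'href' => '/docs' }, 'the ', ['em', {}, 'full'], ' guide']
tree = ['p', {}, 'Read ', link, ', run ', ['code', {}, 'make'], ' ',
        ['strong', {}, 'now'], ['span', { 'class' => 'icon' }], '.']

tag, attrs, *kids = tree
result = render([tag, attrs, *placeholder(kids, { n: 0 })])

puts result
# <p>Read <g id="1">the <em>full</em> guide</g>, run <code>make</code> <g id="2">now</g>.</p>
```

The link's href never reaches the translatable string, the nested em stays readable inside the first placeholder, and the empty span is dropped without consuming an id.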
.strip_attributes_from_literal_elements(node) ⇒ void
This method returns an undefined value.
Strip non-translatable attributes from literal-content inline elements.
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 178
def self.strip_attributes_from_literal_elements(node)
  node.css(LITERAL_INLINE_TAGS.join(',')).each do |el|
    STRIP_FROM_LITERAL_TAGS.each { |attr| el.remove_attribute(attr) }
  end
end
.top_level_placeholdered_inline_elements(node) ⇒ Object
# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 166
def self.top_level_placeholdered_inline_elements(node)
  node.css(PLACEHOLDERED_INLINE_TAGS.join(',')).reject do |el|
    el.ancestors.any? do |a|
      PLACEHOLDERED_INLINE_TAGS.include?(a.name) || LITERAL_INLINE_TAGS.include?(a.name)
    end
  end
end