Module: Jekyll::L10n::HtmlTextUtils

Defined in:
lib/jekyll-l10n/utils/html_text_utils.rb

Overview

Utilities for extracting and manipulating HTML text content.

HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.

Key responsibilities:

  • Extract text with inline HTML tags preserved

  • Remove block-level elements from cloned nodes

  • Remove empty icon tags

  • Decode HTML entities to plain text

  • Validate extracted text content

Constant Summary

CONTENT_ELEMENTS =

Extended content elements for text extraction (includes inline elements)

%w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
  th td caption
].freeze
CONTAINER_ELEMENTS =
HtmlElements::CONTAINER_ELEMENTS
ALL_BLOCK_ELEMENTS =
(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze
LAYOUT_ONLY_ELEMENTS =

Layout-only block elements: in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS, and not pre (which remove_code_blocks safely strips before flattening). Flattening these destroys structural nesting (ul/li, table rows, etc.) and must never be attempted.

(HtmlElements::BLOCK_ELEMENTS - ALL_BLOCK_ELEMENTS - %w[pre]).freeze
PLACEHOLDERED_INLINE_TAGS =

Inline elements whose content is translatable; each is replaced with a <g id="N"> placeholder.

%w[a span em strong b u abbr mark label].freeze
LITERAL_INLINE_TAGS =

Inline elements with literal content: the tag is kept and its non-translatable attributes are stripped.

%w[code var kbd samp].freeze
STRIP_FROM_LITERAL_TAGS =

Attributes stripped from literal-content elements before extraction.

%w[class style id].freeze
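
The way ALL_BLOCK_ELEMENTS and LAYOUT_ONLY_ELEMENTS are derived is plain array arithmetic. The sketch below reproduces it with small stand-in lists. The values used for HtmlElements::BLOCK_ELEMENTS and HtmlElements::CONTAINER_ELEMENTS are assumptions for illustration only, since this page does not show them.

```ruby
# Stand-in values: the real lists live in HtmlElements and are not shown here.
ASSUMED_BLOCK_ELEMENTS     = %w[p div section ul ol table pre blockquote].freeze
ASSUMED_CONTAINER_ELEMENTS = %w[div section].freeze

# Copied from CONTENT_ELEMENTS on this page.
SKETCH_CONTENT = %w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
  th td caption
].freeze

SKETCH_ALL_BLOCK = (SKETCH_CONTENT + ASSUMED_CONTAINER_ELEMENTS).freeze

# Array#- removes every element that also appears in the right-hand array,
# so what remains is "block, but neither content, container, nor pre".
SKETCH_LAYOUT_ONLY = (ASSUMED_BLOCK_ELEMENTS - SKETCH_ALL_BLOCK - %w[pre]).freeze

p SKETCH_LAYOUT_ONLY # => ["ul", "ol", "table"]
```

With these assumed inputs, only the purely structural tags survive the subtraction, which matches the intent described for LAYOUT_ONLY_ELEMENTS.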

Class Method Details

.decode_html_entities(text) ⇒ String

Decode HTML entities to plain text.

Converts HTML entities (&amp;, &lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html, and falls back to manual replacement of five common entities (&amp;, &lt;, &gt;, &quot;, &#39;) if that call raises an error.

Parameters:

  • text (String)

    Text with HTML entities

Returns:

  • (String)

    Text with entities decoded



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 41

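def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue LoadError, StandardError
  # LoadError is not a StandardError, so it is rescued explicitly in case
  # the cgi library cannot be loaded. &amp; is decoded last so that input
  # such as "&amp;lt;" is not decoded twice.
  text.gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
      .gsub('&amp;', '&')
end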
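
In a manual fallback of this kind, the order of the replacements matters. The sketch below compares CGI.unescape_html with two hypothetical helpers (decode_amp_first and decode_amp_last are names invented for this illustration, not part of the gem) on text that has been escaped once.

```ruby
require 'cgi'

# Hypothetical fallback that decodes &amp; before the other entities.
def decode_amp_first(text)
  text.gsub('&amp;', '&').gsub('&lt;', '<').gsub('&gt;', '>')
end

# Hypothetical fallback that decodes &amp; after the other entities.
def decode_amp_last(text)
  text.gsub('&lt;', '<').gsub('&gt;', '>').gsub('&amp;', '&')
end

# The literal text "&lt;b&gt;" after one round of escaping.
ESCAPED_ONCE = '&amp;lt;b&amp;gt;'

puts CGI.unescape_html(ESCAPED_ONCE) # &lt;b&gt;  (single pass)
puts decode_amp_last(ESCAPED_ONCE)   # &lt;b&gt;  (matches CGI)
puts decode_amp_first(ESCAPED_ONCE)  # <b>        (decoded twice)
```

Decoding &amp; first creates new &lt; and &gt; sequences that the later replacements then decode again, so a chained fallback agrees with CGI only when &amp; is handled last.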

.extract_and_validate_text(node) ⇒ String?

Extract and validate text from a node.

Extracts text from the node if it is a content element (see extractable?), then validates that the text meets minimum length requirements using TextValidator.valid?.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to extract from

Returns:

  • (String, nil)

    Validated text, or nil if not extractable or invalid



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 205

def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end

.extract_with_inline_tags(node) ⇒ String

Extract text with inline tags preserved.

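Works on a copy of the element, so the original node is not modified. Removes <pre> code blocks, flattens block elements, removes empty icon tags, replaces translatable inline elements with <g id="N"> placeholders, and strips non-translatable attributes from literal elements (<code> etc.). The resulting inner HTML is normalized with TextNormalizer and stripped. HTML entities (e.g. &lt;, &gt;) are preserved verbatim.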

Parameters:

  • node (Nokogiri::XML::Node)

    Element to extract from

Returns:

  • (String)

    Extracted and normalized text



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 130

def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_code_blocks(clone)
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)
  replace_inline_elements_with_g_placeholders(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end

.extractable?(node) ⇒ Boolean

Check if a node is extractable (content element).

Parameters:

  • node (Nokogiri::XML::Node)

    Node to check

Returns:

  • (Boolean)

    True if node is a content element



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 216

def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

.layout_block_children?(node) ⇒ Boolean

Check if a node has any direct child that is a layout-only block element.

Layout-only elements (ul, ol, dl, table, form, etc.) are in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS. When present as direct children, remove_block_elements_from_node would flatten them, destroying structural nesting (dropdown menus, nested lists, table rows). Callers should skip extraction for such nodes.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to inspect

Returns:

  • (Boolean)

    true if any direct child is a layout-only block element



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 194

def self.layout_block_children?(node)
  node.children.any? { |c| c.element? && LAYOUT_ONLY_ELEMENTS.include?(c.name) }
end

.remove_block_elements(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a node.

Convenience wrapper that delegates to remove_block_elements_from_node.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 89

def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end

.remove_block_elements_from_node(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a cloned node.

Replaces every descendant element whose tag is listed in HtmlElements::BLOCK_ELEMENTS with its children (flattening the structure). The node itself is never replaced. Used to extract text while preserving inline elements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 75

def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end

.remove_code_blocks(node) ⇒ void

This method returns an undefined value.

Remove preformatted code blocks from a node.

Removes all <pre> elements entirely. With highlighter: none in Jekyll config, fenced code blocks produce plain <pre><code> as direct children of content elements — no Rouge wrappers. Removing <pre> before extraction ensures raw code never appears in PO msgids.

Must run before remove_block_elements_from_node so that <code> inside <pre> is gone before the general flattening pass.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 64

def self.remove_code_blocks(node)
  node.css('pre').each(&:remove)
end

.remove_empty_icon_tags(node) ⇒ void

This method returns an undefined value.

Remove empty icon tags from a node.

Removes all <i> (icon) elements that contain no text other than whitespace. Used to clean up external link icon markers before text extraction.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 100

def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end

.replace_inline_elements_with_g_placeholders(node) ⇒ void

This method returns an undefined value.

Replace top-level translatable inline elements with <g id="N"> placeholders.

Only top-level inline elements are replaced. Elements nested inside another inline element are preserved as content of the outer <g>. Top-level elements whose text is empty are removed rather than placeholdered, and ids are numbered from 1 in document order. Literal-content elements (<code>, <var>, etc.) are not placeholdered; their non-translatable attributes are stripped instead.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 150

def self.replace_inline_elements_with_g_placeholders(node)
  g_index = 0
  top_level_placeholdered_inline_elements(node).each do |el|
    if el.text.strip.empty?
      el.remove
    else
      g_index += 1
      g = node.document.create_element('g')
      g['id'] = g_index.to_s
      el.children.each { |child| g.add_child(child.dup) }
      el.replace(g)
    end
  end
  strip_attributes_from_literal_elements(node)
end

.strip_attributes_from_literal_elements(node) ⇒ void

This method returns an undefined value.

Strip non-translatable attributes from literal-content inline elements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 178

def self.strip_attributes_from_literal_elements(node)
  node.css(LITERAL_INLINE_TAGS.join(',')).each do |el|
    STRIP_FROM_LITERAL_TAGS.each { |attr| el.remove_attribute(attr) }
  end
end

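.top_level_placeholdered_inline_elements(node) ⇒ Array<Nokogiri::XML::Node>

Select the translatable inline elements that are not nested inside another inline element.

Finds every descendant whose tag is in PLACEHOLDERED_INLINE_TAGS and rejects those that have an ancestor in PLACEHOLDERED_INLINE_TAGS or LITERAL_INLINE_TAGS.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to search

Returns:

  • (Array<Nokogiri::XML::Node>)

    Top-level placeholdered inline elements, in document order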



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 166

def self.top_level_placeholdered_inline_elements(node)
  node.css(PLACEHOLDERED_INLINE_TAGS.join(',')).reject do |el|
    el.ancestors.any? do |a|
      PLACEHOLDERED_INLINE_TAGS.include?(a.name) || LITERAL_INLINE_TAGS.include?(a.name)
    end
  end
end