Module: Jekyll::L10n::HtmlTextUtils

Defined in: lib/jekyll-l10n/utils/html_text_utils.rb

Overview

Utilities for extracting and manipulating HTML text content.

HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.

Key responsibilities:

Extract text with inline HTML tags preserved
Remove block-level elements from cloned nodes
Remove empty icon tags
Decode HTML entities to plain text
Validate extracted text content

Constant Summary

CONTENT_ELEMENTS = Extended content elements for text extraction (includes inline elements)

%w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
  th td caption
].freeze

CONTAINER_ELEMENTS =

HtmlElements::CONTAINER_ELEMENTS

ALL_BLOCK_ELEMENTS =

(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze

LAYOUT_ONLY_ELEMENTS = Layout-only block elements: in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS, and not pre (which remove_code_blocks safely strips before flattening). Flattening these destroys structural nesting (ul/li, table rows, etc.) and must never be attempted.

(HtmlElements::BLOCK_ELEMENTS - ALL_BLOCK_ELEMENTS - %w[pre]).freeze
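The set arithmetic behind LAYOUT_ONLY_ELEMENTS can be sketched in isolation. Only content_elements below is copied from this page; the container_elements and block_elements lists are assumed stand-ins for HtmlElements::CONTAINER_ELEMENTS and HtmlElements::BLOCK_ELEMENTS, whose real values are not shown here and may differ.

```ruby
# Copied from CONTENT_ELEMENTS in the constant summary above.
content_elements = %w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
  th td caption
].freeze

# Assumed stand-ins for the HtmlElements constants (hypothetical values).
container_elements = %w[div section article].freeze
block_elements = %w[p div section article li ul ol dl table form pre].freeze

all_block_elements = (content_elements + container_elements).freeze

# Block-level tags that are neither content nor container, minus pre,
# which remove_code_blocks strips before flattening.
layout_only_elements = (block_elements - all_block_elements - %w[pre]).freeze

puts layout_only_elements.inspect
```

With these assumed inputs, only the structural tags (lists, tables, forms) are left over, which matches the kind of element the description says must never be flattened.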

PLACEHOLDERED_INLINE_TAGS = Inline elements whose content is translatable; each is replaced with <g id="N">.

%w[a span em strong b u abbr mark label].freeze

LITERAL_INLINE_TAGS = Inline elements with literal content — tag kept, non-translatable attrs stripped.

%w[code var kbd samp].freeze

STRIP_FROM_LITERAL_TAGS = Attributes stripped from literal-content elements before extraction.

%w[class style id].freeze

Class Method Summary

.decode_html_entities(text) ⇒ String

Decode HTML entities to plain text.
.extract_and_validate_text(node) ⇒ String?

Extract and validate text from a node.
.extract_with_inline_tags(node) ⇒ String

Extract text with inline tags preserved.
.extractable?(node) ⇒ Boolean

Check if a node is extractable (content element).
.layout_block_children?(node) ⇒ Boolean

Check if a node has any direct child that is a layout-only block element.
.remove_block_elements(node) ⇒ void

Remove block-level elements from a node.
.remove_block_elements_from_node(node) ⇒ void

Remove block-level elements from a cloned node.
.remove_code_blocks(node) ⇒ void

Remove preformatted code blocks from a node.
.remove_empty_icon_tags(node) ⇒ void

Remove empty icon tags from a node.
.replace_inline_elements_with_g_placeholders(node) ⇒ void

Replace top-level translatable inline elements with <g id="N"> placeholders.
.strip_attributes_from_literal_elements(node) ⇒ void

Strip non-translatable attributes from literal-content inline elements.
.top_level_placeholdered_inline_elements(node) ⇒ Object

Class Method Details

.decode_html_entities(text) ⇒ `String`

Decode HTML entities to plain text.

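Converts HTML entities (&amp;amp;, &amp;lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html and, if that call raises a StandardError, falls back to manual replacement of the five common entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;#39;).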

Parameters:

text (String) —

Text with HTML entities

Returns:

(String) —

Text with entities decoded

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 41

def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
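As a rough usage sketch, the two decoding paths can be compared directly. The manual lambda below reproduces the gsub chain from the rescue branch; the sample strings are made up for illustration.

```ruby
require 'cgi'

encoded = 'Tom &amp; Jerry &lt;3 &quot;cheese&quot; &#39;n&#39; crackers'
decoded = CGI.unescape_html(encoded)
puts decoded

# The gsub chain from the rescue branch, reproduced for comparison.
manual = lambda do |text|
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end

# Double-encoded input: CGI decodes one level in a single pass, while the
# chained gsubs decode &amp; first and then decode the result again.
double = '&amp;lt;b&amp;gt;'
puts CGI.unescape_html(double)
puts manual.call(double)
```

For singly encoded text the two paths agree. They can differ on double-encoded input, because the fallback replaces &amp;amp; before the other entities, so callers that care about that case may want to avoid relying on the fallback.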

.extract_and_validate_text(node) ⇒ `String`?

Extract and validate text from a node.

Extracts text from the node if it is a content element, then validates that the text meets minimum length requirements (via TextValidator.valid?). Returns nil when either check fails.

Parameters:

node (Nokogiri::XML::Node) —

Node to extract from

Returns:

(String, nil) —

Validated text, or nil if not extractable or invalid

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 205

def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end

.extract_with_inline_tags(node) ⇒ `String`

Extract text with inline tags preserved.

Extracts text from a copy of the element: removes preformatted code blocks, flattens block elements, removes empty icons, replaces translatable inline elements with <g id="N"> placeholders, and strips non-translatable attributes from literal elements (<code> etc.). HTML entities (e.g. &amp;lt;, &amp;gt;) are preserved verbatim.

Parameters:

node (Nokogiri::XML::Node) —

Element to extract from

Returns:

(String) —

Extracted and normalized text

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 130

def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_code_blocks(clone)
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)
  replace_inline_elements_with_g_placeholders(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end

.extractable?(node) ⇒ `Boolean`

Check if a node is extractable (content element).

Parameters:

node (Nokogiri::XML::Node) —

Node to check

Returns:

(Boolean) —

True if node is a content element




# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 216

def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

.layout_block_children?(node) ⇒ `Boolean`

Check if a node has any direct child that is a layout-only block element.

Layout-only elements (ul, ol, dl, table, form, etc.) are in HtmlElements::BLOCK_ELEMENTS but not in ALL_BLOCK_ELEMENTS. When present as direct children, remove_block_elements_from_node would flatten them, destroying structural nesting (dropdown menus, nested lists, table rows). Callers should skip extraction for such nodes.

Parameters:

node (Nokogiri::XML::Node) —

Node to inspect

Returns:

(Boolean) —

true if any direct child is a layout-only block element




# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 194

def self.layout_block_children?(node)
  node.children.any? { |c| c.element? && LAYOUT_ONLY_ELEMENTS.include?(c.name) }
end

.remove_block_elements(node) ⇒ `void`

This method returns an undefined value.

Remove block-level elements from a node.

Convenience wrapper that delegates to remove_block_elements_from_node.

Parameters:

node (Nokogiri::XML::Node) —

Node to process




# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 89

def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end

.remove_block_elements_from_node(node) ⇒ `void`

This method returns an undefined value.

Remove block-level elements from a cloned node.

Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 75

def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end

.remove_code_blocks(node) ⇒ `void`

This method returns an undefined value.

Remove preformatted code blocks from a node.

Removes all <pre> elements entirely. With highlighter: none in Jekyll config, fenced code blocks produce plain <pre><code> as direct children of content elements — no Rouge wrappers. Removing <pre> before extraction ensures raw code never appears in PO msgids.

Must run before remove_block_elements_from_node so that <code> inside <pre> is gone before the general flattening pass.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)




# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 64

def self.remove_code_blocks(node)
  node.css('pre').each(&:remove)
end

.remove_empty_icon_tags(node) ⇒ `void`

This method returns an undefined value.

Remove empty icon tags from a node.

Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 100

def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end

.replace_inline_elements_with_g_placeholders(node) ⇒ `void`

This method returns an undefined value.

Replace top-level translatable inline elements with <g id="N"> placeholders.

Only top-level inline elements are replaced — elements nested inside another inline element are preserved as content of the outer <g>. Literal-content elements (<code>, <var>, etc.) are not placeholdered; their non-translatable attributes are stripped instead.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 150

def self.replace_inline_elements_with_g_placeholders(node)
  g_index = 0
  top_level_placeholdered_inline_elements(node).each do |el|
    if el.text.strip.empty?
      el.remove
    else
      g_index += 1
      g = node.document.create_element('g')
      g['id'] = g_index.to_s
      el.children.each { |child| g.add_child(child.dup) }
      el.replace(g)
    end
  end
  strip_attributes_from_literal_elements(node)
end

.strip_attributes_from_literal_elements(node) ⇒ `void`

This method returns an undefined value.

Strip non-translatable attributes from literal-content inline elements.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 178

def self.strip_attributes_from_literal_elements(node)
  node.css(LITERAL_INLINE_TAGS.join(',')).each do |el|
    STRIP_FROM_LITERAL_TAGS.each { |attr| el.remove_attribute(attr) }
  end
end

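.top_level_placeholdered_inline_elements(node) ⇒ `Object`

Select the placeholdered inline elements under a node that are not nested inside another placeholdered or literal inline element.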

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 166

def self.top_level_placeholdered_inline_elements(node)
  node.css(PLACEHOLDERED_INLINE_TAGS.join(',')).reject do |el|
    el.ancestors.any? do |a|
      PLACEHOLDERED_INLINE_TAGS.include?(a.name) || LITERAL_INLINE_TAGS.include?(a.name)
    end
  end
end
