Module: Jekyll::L10n::HtmlTextUtils

Defined in: lib/jekyll-l10n/utils/html_text_utils.rb

Overview

Utilities for extracting and manipulating HTML text content.

HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.

Key responsibilities:

Extract text with inline HTML tags preserved
Remove block-level elements from cloned nodes
Remove empty icon tags
Decode HTML entities to plain text
Validate extracted text content

Constant Summary

CONTENT_ELEMENTS =

Extended content elements for text extraction (includes inline elements)

%w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze

CONTAINER_ELEMENTS =

HtmlElements::CONTAINER_ELEMENTS

ALL_BLOCK_ELEMENTS =

(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze

Class Method Summary

.decode_html_entities(text) ⇒ String

Decode HTML entities to plain text.

.extract_and_validate_text(node) ⇒ String?

Extract and validate text from a node.

.extract_with_inline_tags(node) ⇒ String

Extract text with inline tags preserved.

.extractable?(node) ⇒ Boolean

Check if a node is extractable (content element).

.remove_block_elements(node) ⇒ void

Remove block-level elements from a node.

.remove_block_elements_from_node(node) ⇒ void

Remove block-level elements from a cloned node.

.remove_empty_icon_tags(node) ⇒ void

Remove empty icon tags from a node.

Class Method Details

.decode_html_entities(text) ⇒ `String`

Decode HTML entities to plain text.

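Converts HTML entities (&amp;amp;, &amp;lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html and, if that call raises a StandardError, falls back to manual replacement of the five common entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;#39;).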

Parameters:

text (String) —

Text with HTML entities

Returns:

(String) —

Text with entities decoded

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 40

def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
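
The behaviour described above can be checked with a small sketch that decodes a sample string with CGI.unescape_html and compares it with a hand-written replacement chain. The `manual_decode` helper and the sample strings are illustrative assumptions, not part of the library. The helper decodes `&amp;` last, so an already-escaped entity such as `&amp;lt;` is decoded only once; a chain that replaces `&amp;` first would decode it twice.

```ruby
require 'cgi'

# Hypothetical stand-in for a manual fallback (not the library's code).
# '&amp;' is handled last so that '&amp;lt;' becomes '&lt;', not '<'.
def manual_decode(text)
  text.gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
      .gsub('&amp;', '&')
end

sample = 'Tom &amp; Jerry &lt;3 &quot;cheese&quot; &#39;ok&#39;'

decoded  = CGI.unescape_html(sample) # standard-library decoder
fallback = manual_decode(sample)     # hand-written chain

puts decoded
puts fallback == decoded
```

For the five entities covered here, both paths should agree; the manual chain does not handle other named or numeric entities.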

.extract_and_validate_text(node) ⇒ `String`?

Extract and validate text from a node.

Extracts text from the node if it is a content element, then validates that the text meets minimum length requirements. Returns nil when either check fails.

Parameters:

node (Nokogiri::XML::Node) —

Node to extract from

Returns:

(String, nil) —

Validated text, or nil if not extractable or invalid

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 114

def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
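
The "validated text or nil" contract makes it easy for callers to drop rejected nodes. The sketch below illustrates that pattern on plain strings. `MIN_LENGTH`, `valid_text?`, and `validated` are hypothetical stand-ins: the real rules live in TextValidator and are not shown on this page.

```ruby
# Hypothetical minimum length; TextValidator's actual threshold is an assumption here.
MIN_LENGTH = 2

# Stand-in validator: non-nil text with at least MIN_LENGTH non-padding characters.
def valid_text?(text)
  !text.nil? && text.strip.length >= MIN_LENGTH
end

# Mirrors the "text or nil" return shape for already-extracted strings.
def validated(text)
  valid_text?(text) ? text : nil
end

candidates = ['Read the docs', ' ', 'x', 'OK']

# filter_map keeps only the non-nil results, so rejected entries disappear.
kept = candidates.filter_map { |text| validated(text) }

puts kept.inspect
```

Returning nil rather than an empty string lets callers use `filter_map` or `compact` without a separate validity check.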

.extract_with_inline_tags(node) ⇒ `String`

Extract text with inline tags preserved.

Clones the element, removes block elements and empty icon tags from the clone, normalizes whitespace, and decodes HTML entities. The result keeps inline markup (it is built from the clone's inner HTML) and is suitable for translation.

Parameters:

node (Nokogiri::XML::Node) —

Element to extract from

Returns:

(String) —

Extracted and normalized text

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 97

def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text = decode_html_entities(text)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end
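
The normalize, decode, normalize sequence above can be illustrated on a plain string, without Nokogiri. In this sketch `collapse_whitespace` is an assumed stand-in for TextNormalizer.normalize (whose real behaviour is not shown on this page), and `inner_html` is a made-up sample of what a cleaned clone might serialize to.

```ruby
require 'cgi'

# Assumed stand-in for TextNormalizer.normalize: collapse whitespace runs to one space.
def collapse_whitespace(html)
  html.gsub(/\s+/, ' ')
end

# Made-up inner HTML with an inline tag, an entity, and ragged whitespace.
inner_html = "\n  Read the <strong>docs</strong> &amp;\n  guides  "

text = collapse_whitespace(inner_html)  # first normalization pass
text = CGI.unescape_html(text)          # '&amp;' becomes '&'
text = collapse_whitespace(text).strip  # second pass, then trim the ends

puts text
```

Note that the inline `<strong>` tag survives while the entity is decoded. One consequence of decoding after serialization is that escaped markup such as `&lt;b&gt;` would come out as a literal `<b>` in the result.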

.extractable?(node) ⇒ `Boolean`

Check if a node is extractable (content element).

Parameters:

node (Nokogiri::XML::Node) —

Node to check

Returns:

(Boolean) —

True if node is a content element



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 125

def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end
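
Because the check only calls `element?` and `name`, it can be exercised without parsing any HTML. The sketch below copies the CONTENT_ELEMENTS list from this page and uses `FakeNode`, a hypothetical Struct standing in for a Nokogiri node.

```ruby
CONTENT_ELEMENTS = %w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze

# Hypothetical stand-in for a Nokogiri node: only the two methods the check uses.
FakeNode = Struct.new(:name, :element) do
  def element?
    element
  end
end

# Same predicate as the listing above, written as a top-level method.
def extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

paragraph = FakeNode.new('p', true)     # content element
container = FakeNode.new('div', true)   # element, but not in the list
text_node = FakeNode.new('text', false) # not an element at all

results = [paragraph, container, text_node].map { |node| extractable?(node) }

puts results.inspect
```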

.remove_block_elements(node) ⇒ `void`

This method returns an undefined value.

Remove block-level elements from a node.

Convenience wrapper that delegates to remove_block_elements_from_node.

Parameters:

node (Nokogiri::XML::Node) —

Node to process



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 72

def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end

.remove_block_elements_from_node(node) ⇒ `void`

This method returns an undefined value.

Remove block-level elements from a cloned node.

Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 58

def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end

.remove_empty_icon_tags(node) ⇒ `void`

This method returns an undefined value.

Remove empty icon tags from a node.

Removes all <i> (icon) elements that contain no text other than whitespace. Used to clean up external link icon markers before text extraction.

Parameters:

node (Nokogiri::XML::Node) —

Node to process (modified in place)

# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 83

def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end
