Module: Jekyll::L10n::HtmlTextUtils

Defined in:
lib/jekyll-l10n/utils/html_text_utils.rb

Overview

Utilities for extracting and manipulating HTML text content.

HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.

Key responsibilities:

  • Extract text with inline HTML tags preserved

  • Remove block-level elements from cloned nodes

  • Remove empty icon tags

  • Decode HTML entities to plain text

  • Validate extracted text content

Constant Summary

CONTENT_ELEMENTS =

Extended content elements for text extraction (includes inline elements)

%w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze
CONTAINER_ELEMENTS =
HtmlElements::CONTAINER_ELEMENTS
ALL_BLOCK_ELEMENTS =
(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze

Class Method Summary

Class Method Details

.decode_html_entities(text) ⇒ String

Decode HTML entities to plain text.

Converts HTML entities (&amp;, &lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html and, if that call raises a StandardError, falls back to manually replacing five common entities (&amp;, &lt;, &gt;, &quot;, &#39;).

Parameters:

  • text (String)

    Text with HTML entities

Returns:

  • (String)

    Text with entities decoded



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 40

def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
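
To illustrate the decoding step, the sketch below calls CGI.unescape_html from Ruby's standard library and compares it with a replacement chain modeled on the fallback in the listing above. The helper name manual_decode and the sample strings are invented for this example; this is not the library's own code.

```ruby
require 'cgi'

# Hypothetical helper modeled on the manual fallback chain.
def manual_decode(text)
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end

encoded = 'Tom &amp; Jerry say &quot;1 &lt; 2&quot; &#39;always&#39;'

puts CGI.unescape_html(encoded)  # Tom & Jerry say "1 < 2" 'always'
puts manual_decode(encoded)      # same result for these five entities

# The two paths can differ on nested input: CGI decodes in a single pass,
# while the chain decodes '&amp;' first and then matches the new '&lt;'.
nested = '&amp;lt;'
puts CGI.unescape_html(nested)   # &lt;
puts manual_decode(nested)       # <
```

For typical page text the two paths agree. The nested case only matters when the input contains an entity that was itself escaped.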

.extract_and_validate_text(node) ⇒ String?

Extract and validate text from a node.

Extracts text from the node if it is a content element, then checks that the text meets the minimum length requirements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to extract from

Returns:

  • (String, nil)

    Validated text, or nil if not extractable or invalid



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 114

def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end

.extract_with_inline_tags(node) ⇒ String

Extract text with inline tags preserved.

Clones the element, removes block elements and empty icon tags from the clone, normalizes whitespace, and decodes HTML entities. Returns a normalized string, with inline tags kept as markup, suitable for translation.

Parameters:

  • node (Nokogiri::XML::Node)

    Element to extract from

Returns:

  • (String)

    Extracted and normalized text



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 97

def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text = decode_html_entities(text)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end
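
The last three lines of the listing above (normalize, decode, normalize again and strip) can be sketched on a plain inner_html string using only the standard library. The helper normalize_whitespace is a hypothetical stand-in for TextNormalizer.normalize, whose real behavior is not shown in this document, and finish_extraction is a name invented for this example.

```ruby
require 'cgi'

# Hypothetical stand-in for TextNormalizer.normalize: collapses runs of
# whitespace into single spaces. The real normalizer may behave differently.
def normalize_whitespace(text)
  text.gsub(/\s+/, ' ')
end

# Applies the steps in the same order as the listing above:
# normalize, decode entities, then normalize again and strip.
def finish_extraction(inner_html)
  text = normalize_whitespace(inner_html)
  text = CGI.unescape_html(text)
  normalize_whitespace(text).strip
end

inner_html = "  Read the\n   <a href=\"/docs\">docs</a> &amp;&#32;  FAQ "
puts finish_extraction(inner_html)
# Read the <a href="/docs">docs</a> & FAQ
```

The inline <a> tag survives as markup. The second normalization pass is useful here because decoding &#32; introduces a space that the first pass could not see.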

.extractable?(node) ⇒ Boolean

Check if a node is extractable (content element).

Parameters:

  • node (Nokogiri::XML::Node)

    Node to check

Returns:

  • (Boolean)

    True if node is a content element



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 125

def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end
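
The check can be tried without Nokogiri using a small stand-in object. In the sketch below, CONTENT_ELEMENTS is copied from the Constant Summary, while FakeNode is a hypothetical substitute that models only the two methods the predicate calls (element? and name).

```ruby
# CONTENT_ELEMENTS as listed in the Constant Summary.
CONTENT_ELEMENTS = %w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze

# Hypothetical stand-in for a Nokogiri node; not part of the library.
FakeNode = Struct.new(:name, :is_element) do
  def element?
    is_element
  end
end

def extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

puts extractable?(FakeNode.new('li', true))    # true
puts extractable?(FakeNode.new('div', true))   # false: div is not in CONTENT_ELEMENTS
puts extractable?(FakeNode.new('text', false)) # false: not an element node
```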

.remove_block_elements(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a node.

Alias for remove_block_elements_from_node for convenience.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 72

def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end

.remove_block_elements_from_node(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a cloned node.

Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 58

def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end

.remove_empty_icon_tags(node) ⇒ void

This method returns an undefined value.

Remove empty icon tags from a node.

Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 83

def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end