Module: Jekyll::L10n::HtmlTextUtils

Defined in:
lib/jekyll-l10n/utils/html_text_utils.rb

Overview

Utilities for extracting and manipulating HTML text content.

HtmlTextUtils provides helpers for extracting text from HTML elements while preserving inline formatting, removing block-level elements, decoding HTML entities, and cleaning up icon tags. These utilities support the extraction and translation pipelines.

Key responsibilities:

  • Extract text with inline HTML tags preserved

  • Remove block-level elements from cloned nodes

  • Remove empty icon tags

  • Decode HTML entities to plain text

  • Validate extracted text content

Constant Summary

CONTENT_ELEMENTS =

Extended content elements for text extraction (includes inline elements)

%w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze
CONTAINER_ELEMENTS =
HtmlElements::CONTAINER_ELEMENTS
ALL_BLOCK_ELEMENTS =
(CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze
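
The constants compose by plain array concatenation. The following sketch reproduces that composition outside the gem. The `CONTAINER_ELEMENTS` values here are placeholder assumptions, since the real list lives in `HtmlElements` and is not shown on this page.

```ruby
# Copied from the constant listing above.
CONTENT_ELEMENTS = %w[
  p h1 h2 h3 h4 h5 h6
  li dd dt blockquote figcaption
  button span a label
].freeze

# ASSUMPTION: placeholder values; the real list is HtmlElements::CONTAINER_ELEMENTS.
CONTAINER_ELEMENTS = %w[div section article].freeze

ALL_BLOCK_ELEMENTS = (CONTENT_ELEMENTS + CONTAINER_ELEMENTS).freeze

puts ALL_BLOCK_ELEMENTS.size             # 19 with the three placeholder containers
puts ALL_BLOCK_ELEMENTS.include?('span') # true: inline tags are part of the list
puts ALL_BLOCK_ELEMENTS.frozen?          # true
```

Despite its name, the combined list also contains the inline tags from `CONTENT_ELEMENTS` (`span`, `a`, `label`, `button`).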

Class Method Summary

Class Method Details

.decode_html_entities(text) ⇒ String

Decode HTML entities to plain text.

Converts HTML entities (&amp;, &lt;, etc.) to their plain-text equivalents. Uses CGI.unescape_html when available and falls back to manual replacement if that call raises an error.

Parameters:

  • text (String)

    Text with HTML entities

Returns:

  • (String)

    Text with entities decoded



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 40

def self.decode_html_entities(text)
  require 'cgi'
  CGI.unescape_html(text)
rescue StandardError
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end
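
As a rough illustration of the two decoding paths, the sketch below calls the standard library directly and compares it with a sequential replacement in the same order as the fallback branch. The method names `decode_with_cgi` and `decode_with_gsub` are hypothetical and exist only for this example.

```ruby
require 'cgi'

# Single-pass decoding via the standard library.
def decode_with_cgi(text)
  CGI.unescape_html(text)
end

# Sequential replacement, in the same order as the fallback branch.
def decode_with_gsub(text)
  text.gsub('&amp;', '&')
      .gsub('&lt;', '<')
      .gsub('&gt;', '>')
      .gsub('&quot;', '"')
      .gsub('&#39;', "'")
end

puts decode_with_cgi('Tom &amp; Jerry &lt;3')  # Tom & Jerry <3
puts decode_with_gsub('Tom &amp; Jerry &lt;3') # Tom & Jerry <3

# Double-encoded input is where the two paths diverge:
puts decode_with_cgi('&amp;lt;')  # &lt;
puts decode_with_gsub('&amp;lt;') # <
```

Because the sequential version decodes `&amp;` first, a double-encoded entity is unwrapped twice, whereas the CGI call decodes it only once.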

.extract_and_validate_text(node) ⇒ String?

Extract and validate text from a node.

Extracts text from the node if it is a content element, then validates that the text meets minimum length requirements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to extract from

Returns:

  • (String, nil)

    Validated text, or nil if not extractable or invalid



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 115

def self.extract_and_validate_text(node)
  return nil unless extractable?(node)

  text = extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end

.extract_with_inline_tags(node) ⇒ String

Extract text with inline tags preserved.

Extracts text from an element, removes block elements and empty icons, and normalizes whitespace. HTML entities (e.g. &lt;, &gt;) are preserved verbatim. As a result, entity-encoded content inside inline elements (such as &lt;p&gt;) is written to PO msgids as-is and does not become a live HTML tag when the msgstr is later injected via inner_html.

Parameters:

  • node (Nokogiri::XML::Node)

    Element to extract from

Returns:

  • (String)

    Extracted and normalized text



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 99

def self.extract_with_inline_tags(node)
  clone = node.dup
  remove_block_elements_from_node(clone)
  remove_empty_icon_tags(clone)

  text = TextNormalizer.normalize(clone.inner_html)
  text&.then { |t| TextNormalizer.normalize(t).strip }
end

.extractable?(node) ⇒ Boolean

Check if a node is extractable (content element).

Parameters:

  • node (Nokogiri::XML::Node)

    Node to check

Returns:

  • (Boolean)

    True if node is a content element



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 126

def self.extractable?(node)
  node.element? && CONTENT_ELEMENTS.include?(node.name)
end

.remove_block_elements(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a node.

Convenience wrapper that delegates to remove_block_elements_from_node.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 72

def self.remove_block_elements(node)
  remove_block_elements_from_node(node)
end

.remove_block_elements_from_node(node) ⇒ void

This method returns an undefined value.

Remove block-level elements from a cloned node.

Replaces block-level element nodes with their children (flattening structure). Used to extract text while preserving inline elements.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 58

def self.remove_block_elements_from_node(node)
  HtmlElements::BLOCK_ELEMENTS.each do |tag|
    node.xpath(".//#{tag}").each do |elem|
      elem.replace(elem.children)
    end
  end
end

.remove_empty_icon_tags(node) ⇒ void

This method returns an undefined value.

Remove empty icon tags from a node.

Removes all <i> (icon) elements that contain no text. Used to clean up external link icon markers before text extraction.

Parameters:

  • node (Nokogiri::XML::Node)

    Node to process (modified in place)



# File 'lib/jekyll-l10n/utils/html_text_utils.rb', line 83

def self.remove_empty_icon_tags(node)
  node.xpath('.//i').each do |elem|
    elem.remove if elem.text.strip.empty?
  end
end