Module: Jekyll::L10n::BlockTextExtractor

Defined in: lib/jekyll-l10n/translation/block_text_extractor.rb

Overview

Extracts normalized text from block-level HTML elements.

BlockTextExtractor extracts the complete content of a block element as normalized text, after removing nested block-level elements and empty icon tags. The result is matched against block-level translations, where the entire element has a single translation rather than one translation per text node.

Key responsibilities:

Extract text from extractable block elements
Remove nested block elements from text
Remove empty icon tags (external link markers)
Normalize and validate extracted text
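
The responsibilities above can be pictured with a small, self-contained sketch. It does not use Nokogiri or the gem's own helpers: SketchNode, SKETCH_BLOCK_TAGS, sketch_inner_html, and sketch_block_text are hypothetical names, and the tag list, the whitespace rule, and the non-empty check are assumptions standing in for extractable?, TextNormalizer, and TextValidator, whose real behavior is not shown on this page.

```ruby
# Hypothetical stand-in for a DOM element: a tag name plus children,
# where each child is either a String (text) or another SketchNode.
SketchNode = Struct.new(:name, :children)

# Assumed set of block-level tags; the gem's real list may differ.
SKETCH_BLOCK_TAGS = %w[p div li ul ol blockquote h1 h2 h3].freeze

# Serialize the children, dropping nested blocks and empty <i> icon tags.
# Inline elements are kept, much as inner_html would keep them.
def sketch_inner_html(node)
  node.children.map do |child|
    next child if child.is_a?(String)
    next "" if SKETCH_BLOCK_TAGS.include?(child.name)      # nested block
    next "" if child.name == "i" && child.children.empty?  # empty icon tag

    "<#{child.name}>#{sketch_inner_html(child)}</#{child.name}>"
  end.join
end

# Follow the documented flow: guard, strip, normalize, validate.
def sketch_block_text(node)
  return nil unless SKETCH_BLOCK_TAGS.include?(node.name)

  # Assumed normalization: collapse whitespace runs, then trim.
  text = sketch_inner_html(node).gsub(/\s+/, " ").strip
  # Assumed validation: reject empty results.
  text.empty? ? nil : text
end

item = SketchNode.new("li", [
  "Read the  ",
  SketchNode.new("a", ["guide"]),
  SketchNode.new("i", []),
  SketchNode.new("ul", [SketchNode.new("li", ["nested"])]),
])

sketch_block_text(item) # => "Read the <a>guide</a>"
```

The nested list and the empty icon tag disappear from the result, while the inline link survives. That is the property that lets one translation cover the whole list item.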

Examples:

text = BlockTextExtractor.extract(paragraph_node)
# Returns normalized text from paragraph, useful for finding block translations

Class Method Summary

.extract(node) ⇒ String?

Extract normalized block text from an element.

.extractable?(node) ⇒ Boolean

Class Method Details

.extract(node) ⇒ `String?`

Extract normalized block text from an element.

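Returns nil if the element is not extractable or if the extracted text fails validation. Otherwise, the method clones the node, removes nested block elements and empty icon tags from the clone, normalizes whitespace in the remaining inner HTML, and validates the result. The original node is left unmodified. HTML entities are preserved verbatim so that the result matches the keys produced by the extraction pipeline.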

Parameters:

node (Nokogiri::XML::Element): DOM element to extract from

Returns:

(String, nil): Normalized text from the element, or nil if not valid
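
As a rough illustration of how a caller might consume this return value, the sketch below looks up a block translation by key. SKETCH_BLOCK_TRANSLATIONS and lookup_block_translation are hypothetical and not part of the gem; the table contents are invented for the example. It assumes only what the description states: the key may be nil, and entities such as &amp; are kept exactly as written.

```ruby
# Hypothetical translation table keyed by normalized block text.
# Keys keep entities such as &amp; exactly as they appear in the markup.
SKETCH_BLOCK_TRANSLATIONS = {
  "Terms &amp; conditions" => "Nutzungsbedingungen",
}.freeze

# Hypothetical helper: treat a nil key (element not extractable, or text
# that failed validation) as "no block translation available".
def lookup_block_translation(key)
  return nil if key.nil?

  SKETCH_BLOCK_TRANSLATIONS[key]
end

lookup_block_translation("Terms &amp; conditions") # => "Nutzungsbedingungen"
lookup_block_translation("Terms & conditions")     # => nil (decoded key misses)
lookup_block_translation(nil)                      # => nil
```

Under these assumptions, decoding entities before the lookup would turn a hit into a miss, which is presumably why the extractor leaves them untouched.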

# File 'lib/jekyll-l10n/translation/block_text_extractor.rb', line 38

def extract(node)
  return nil unless extractable?(node)

  clone = node.dup
  HtmlTextUtils.remove_block_elements(clone)
  HtmlTextUtils.remove_empty_icon_tags(clone)

  text = TextNormalizer.normalize(clone.inner_html).strip

  TextValidator.valid?(text) ? text : nil
end

.extractable?(node) ⇒ `Boolean`