Module: Jekyll::L10n::DomTextExtractor

Extended by: DomTextExtractor

Included in: DomTextExtractor

Defined in: lib/jekyll-l10n/extraction/dom_text_extractor.rb

Overview

Extracts text content from HTML elements for translation.

DomTextExtractor identifies content-bearing HTML elements (paragraphs, headings, list items, etc.) and extracts their text content while preserving inline HTML tags. It validates extracted text and generates file location references for debugging. Text is extracted from elements that contain text nodes or inline elements, but not from elements containing only block-level children.

Key responsibilities:

Identify extractable content elements (p, h1-h6, li, blockquote, etc.)
Extract text while preserving inline HTML structure
Skip elements containing only block-level children
Validate extracted text (minimum length, non-numeric)
Generate file location references for extracted strings

Examples:

entry = DomTextExtractor.extract(node, 'docs/index.html', '_site')
# Returns hash with :msgid, :msgstr, :reference if valid text found

Instance Method Summary

#extract(node, file_path, dest) ⇒ Hash?

Extract text content from an HTML element.

#extract_block_text(node) ⇒ Object
#extractable?(node) ⇒ Boolean
#only_contains_block_elements?(node) ⇒ Boolean

Instance Method Details

#extract(node, file_path, dest) ⇒ `Hash`?

Extract text content from an HTML element.

Returns nil if the element is not extractable (not a content element), if it contains only block-level children, or if the extracted text fails validation (too short, numeric-only, etc.). For valid text, returns a hash with the msgid, an empty msgstr, and a file location reference for debugging.

Parameters:

node (Nokogiri::XML::Element): DOM element to extract from
file_path (String): Source file path (for file location reference)
dest (String): Destination directory (for file location reference)

Returns:

(Hash, nil): Hash with :msgid, :msgstr, and :reference if valid text is found; nil if the element is not extractable or the text fails validation

# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 44

def extract(node, file_path, dest)
  return nil unless extractable?(node)

  text = extract_block_text(node)
  return nil if text.nil?

  reference = XPathReferenceGenerator.generate(node, file_path, dest)
  { msgid: text, msgstr: '', reference: reference }
end
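
Because extract returns nil for skipped elements, a caller has to drop those results before using the entries. The following is a minimal sketch of that pattern, not the plugin's own calling code. `sketch_extract` and `sketch_collect_entries` are hypothetical names, the length check is only a placeholder for the real extractability and validation rules, and the reference is a plain file path rather than the output of XPathReferenceGenerator.

```ruby
# Hypothetical stand-in for DomTextExtractor.extract: returns an entry hash or nil.
# The length check is a placeholder for the real extractable?/validation logic.
def sketch_extract(text, file_path)
  return nil if text.strip.length < 2

  # The real reference comes from XPathReferenceGenerator; a bare path is assumed here.
  { msgid: text, msgstr: '', reference: file_path }
end

# Collect entries for a list of candidate strings, skipping nil results.
def sketch_collect_entries(texts, file_path)
  # filter_map keeps only truthy block results, so nil returns are dropped.
  texts.filter_map { |text| sketch_extract(text, file_path) }
end

candidates = ['Welcome to the docs', '', 'Read <a href="/start">this</a> first']
entries = sketch_collect_entries(candidates, 'docs/index.html')

entries.each { |entry| puts "#{entry[:reference]}: #{entry[:msgid]}" }
```

Every collected entry starts with an empty msgstr, matching the hash shape that extract returns.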

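#extract_block_text(node) ⇒ `Object`

Returns the element's text with inline tags preserved, or nil if the element contains only block-level children or the text fails validation.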

# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 58

def extract_block_text(node)
  return nil if only_contains_block_elements?(node)

  text = HtmlTextUtils.extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
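
TextValidator is not shown on this page. The sketch below illustrates the two checks the overview mentions (minimum length, non-numeric) under stated assumptions: `SketchTextValidator`, the threshold of 2 characters, and the numeric pattern are all hypothetical and may differ from the real validator.

```ruby
# Hypothetical stand-in for TextValidator; the real rules and thresholds may differ.
module SketchTextValidator
  MIN_LENGTH = 2 # assumed threshold, not taken from the source

  module_function

  def valid?(text)
    return false if text.nil?

    stripped = text.strip
    return false if stripped.length < MIN_LENGTH
    # Reject strings made only of digits, whitespace, and numeric punctuation.
    return false if stripped.match?(/\A[\d\s.,%+-]+\z/)

    true
  end
end

# Mirrors the final line of extract_block_text: keep valid text, otherwise nil.
def sketch_validated_text(text)
  SketchTextValidator.valid?(text) ? text : nil
end

p sketch_validated_text('Hello <em>world</em>')
p sketch_validated_text('3.14')
```

Returning nil rather than an empty string lets extract stop with a single `text.nil?` check.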

#extractable?(node) ⇒ `Boolean`

Returns:

(Boolean)



# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 54

def extractable?(node)
  node.element? && HtmlTextUtils::CONTENT_ELEMENTS.include?(node.name)
end
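
To illustrate the check without Nokogiri, the sketch below uses a plain Struct in place of a DOM node. `SketchNode` and `SKETCH_CONTENT_ELEMENTS` are hypothetical. The list holds only the tags named in the overview, so it is an assumed subset of HtmlTextUtils::CONTENT_ELEMENTS, not the real constant.

```ruby
# Stand-in for a DOM node; the real method receives a Nokogiri::XML::Element.
SketchNode = Struct.new(:name, :is_element) do
  def element?
    is_element
  end
end

# Assumed subset of the content elements, limited to tags named in the overview.
SKETCH_CONTENT_ELEMENTS = %w[p h1 h2 h3 h4 h5 h6 li blockquote].freeze

# Same shape as extractable?: must be an element and have a content tag name.
def sketch_extractable?(node)
  node.element? && SKETCH_CONTENT_ELEMENTS.include?(node.name)
end

paragraph = SketchNode.new('p', true)
wrapper   = SketchNode.new('div', true)
text_node = SketchNode.new('text', false)

p sketch_extractable?(paragraph) # => true
p sketch_extractable?(wrapper)   # => false
p sketch_extractable?(text_node) # => false
```

The `element?` guard runs first, so text and comment nodes are rejected before their names are compared.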

#only_contains_block_elements?(node) ⇒ `Boolean`

Returns:

(Boolean)

# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 65

def only_contains_block_elements?(node)
  node.children.each do |child|
    return false if non_empty_text?(child)
    return false if non_block_element?(child)
  end

  block_element_children?(node)
end
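
The helpers called above (non_empty_text?, non_block_element?, block_element_children?) are not shown on this page. The sketch below guesses at plausible definitions in order to trace the control flow. Every `sketch_`-prefixed method, `BlockSketchNode`, and the `SKETCH_BLOCK_ELEMENTS` list are assumptions, not the plugin's code.

```ruby
# Stand-in for a DOM node with children; kind is :text or :element.
BlockSketchNode = Struct.new(:name, :kind, :text, :children) do
  def text?
    kind == :text
  end

  def element?
    kind == :element
  end
end

# Assumed list of block-level tags; the real list may differ.
SKETCH_BLOCK_ELEMENTS = %w[div p ul ol li blockquote].freeze

def sketch_text(str)
  BlockSketchNode.new('text', :text, str, [])
end

def sketch_element(name, children = [])
  BlockSketchNode.new(name, :element, '', children)
end

# Assumed helper: a text node with non-whitespace content.
def sketch_non_empty_text?(child)
  child.text? && !child.text.strip.empty?
end

# Assumed helper: an element that is not block-level (for example em or a).
def sketch_non_block_element?(child)
  child.element? && !SKETCH_BLOCK_ELEMENTS.include?(child.name)
end

# Assumed helper: at least one block-level element child exists.
def sketch_block_element_children?(node)
  node.children.any? { |c| c.element? && SKETCH_BLOCK_ELEMENTS.include?(c.name) }
end

# Same control flow as only_contains_block_elements? above.
def sketch_only_contains_block_elements?(node)
  node.children.each do |child|
    return false if sketch_non_empty_text?(child)
    return false if sketch_non_block_element?(child)
  end

  sketch_block_element_children?(node)
end

nested_list = sketch_element('li', [sketch_text("\n  "), sketch_element('ul', [sketch_element('li', [sketch_text('Inner')])])])
inline_item = sketch_element('li', [sketch_text('Item '), sketch_element('em', [sketch_text('one')])])

p sketch_only_contains_block_elements?(nested_list) # => true
p sketch_only_contains_block_elements?(inline_item) # => false
```

In this sketch an element with no children yields false, so it would continue on to text extraction and be rejected by validation instead.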
