Module: Jekyll::L10n::DomTextExtractor

Extended by:
DomTextExtractor
Included in:
DomTextExtractor
Defined in:
lib/jekyll-l10n/extraction/dom_text_extractor.rb

Overview

Extracts text content from HTML elements for translation.

DomTextExtractor identifies content-bearing HTML elements (paragraphs, headings, list items, etc.) and extracts their text content while preserving inline HTML tags. It validates extracted text and generates file location references for debugging. Text is extracted from elements that contain text nodes or inline elements, but not from elements containing only block-level children.

Key responsibilities:

  • Identify extractable content elements (p, h1-h6, li, blockquote, etc.)

  • Extract text while preserving inline HTML structure

  • Skip elements containing only block-level children

  • Validate extracted text (minimum length, non-numeric)

  • Generate file location references for extracted strings

Examples:

entry = DomTextExtractor.extract(node, 'docs/index.html', '_site')
# Returns a hash with :msgid, :msgstr, and :reference if valid text is found, nil otherwise

Instance Method Details

#extract(node, file_path, dest) ⇒ Hash?

Extract text content from an HTML element.

Returns nil if the element is not extractable (not a content element) or if the extracted text fails validation (too short, numeric-only, etc.). For valid text, returns a hash with the msgid, an empty msgstr, and a file location reference for debugging.

Parameters:

  • node (Nokogiri::XML::Element)

    DOM element to extract from

  • file_path (String)

    Source file path (for file location reference)

  • dest (String)

    Destination directory (for file location reference)

Returns:

  • (Hash, nil)

    Hash with :msgid, :msgstr, and :reference if valid text is found; nil if the element is not extractable or the text fails validation
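
Because the return value may be nil, callers typically need to discard the misses before building a catalog. The following is a minimal sketch of that pattern, not the library's own code: `fake_extract` and `collect_entries` are hypothetical names, and `fake_extract` is a simplified stand-in that works on plain strings instead of Nokogiri nodes so the example runs without any gems.

```ruby
# Hypothetical stand-in for DomTextExtractor.extract. It returns an entry
# hash for non-blank text and nil otherwise. The real method takes a
# Nokogiri element and applies the library's own validation rules.
def fake_extract(text, file_path)
  stripped = text.strip
  return nil if stripped.empty?

  { msgid: stripped, msgstr: '', reference: file_path }
end

# filter_map (Ruby 2.7+) keeps only truthy block results, so nil returns
# from the extractor are dropped without a separate compact step.
def collect_entries(texts, file_path)
  texts.filter_map { |text| fake_extract(text, file_path) }
end

entries = collect_entries(['Hello world', '   ', 'Read the docs'], 'docs/index.html')
entries.each { |entry| puts "#{entry[:reference]}: #{entry[:msgid]}" }
```

With the real extractor, the same `filter_map` call would iterate over DOM nodes rather than strings.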



# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 44

def extract(node, file_path, dest)
  return nil unless extractable?(node)

  text = extract_block_text(node)
  return nil if text.nil?

  reference = XPathReferenceGenerator.generate(node, file_path, dest)
  { msgid: text, msgstr: '', reference: reference }
end

#extract_block_text(node) ⇒ Object

Returns the element's text with inline tags preserved, or nil if the element contains only block-level children or the text fails validation.



# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 58

def extract_block_text(node)
  return nil if only_contains_block_elements?(node)

  text = HtmlTextUtils.extract_with_inline_tags(node)
  TextValidator.valid?(text) ? text : nil
end
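
The validate-or-nil step above can be illustrated with a small sketch. The rules inside `TextValidator.valid?` are not shown on this page, so everything here is an assumption: `valid_text?`, `validated_or_nil`, the minimum length of 2, and the numeric-only pattern are hypothetical and only mirror the "minimum length, non-numeric" checks named in the overview.

```ruby
# Hypothetical validator; the real TextValidator may apply different rules.
# min_length defaults to 2, which is an assumed threshold.
def valid_text?(text, min_length: 2)
  return false if text.nil?

  stripped = text.strip
  return false if stripped.length < min_length
  # Reject strings made only of digits, whitespace, dots, and commas.
  return false if stripped.match?(/\A[\d\s.,]+\z/)

  true
end

# Mirrors the ternary in extract_block_text: pass valid text through,
# turn everything else into nil.
def validated_or_nil(text)
  valid_text?(text) ? text : nil
end

puts validated_or_nil('Hello <em>world</em>').inspect
puts validated_or_nil('42').inspect
```

Returning nil rather than raising keeps the caller's control flow simple: `extract` can bail out with a single `return nil if text.nil?`.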

#extractable?(node) ⇒ Boolean

Checks whether the node is an element whose tag name is listed in HtmlTextUtils::CONTENT_ELEMENTS.

Returns:

  • (Boolean)


# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 54

def extractable?(node)
  node.element? && HtmlTextUtils::CONTENT_ELEMENTS.include?(node.name)
end

#only_contains_block_elements?(node) ⇒ Boolean

Returns false as soon as a child is non-empty text or a non-block element; otherwise returns the result of block_element_children?(node).

Returns:

  • (Boolean)


# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 65

def only_contains_block_elements?(node)
  node.children.each do |child|
    return false if non_empty_text?(child)
    return false if non_block_element?(child)
  end

  block_element_children?(node)
end
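
The helpers called above (`non_empty_text?`, `non_block_element?`, `block_element_children?`) are not shown on this page. The sketch below gives one plausible reading of them so the control flow can be run end to end. The helper bodies, the `block_tags` list, and the hash-based stand-in nodes are all assumptions; the real code works on Nokogiri nodes and may define these differently.

```ruby
# Stand-in nodes are plain hashes: { text: '...' } for text nodes and
# { name: 'p', children: [...] } for elements.

# Assumed block-level tag list; the library's actual list may differ.
def block_tags
  %w[div p ul ol li blockquote section table]
end

def text_node?(node)
  node.key?(:text)
end

# Assumed: a text node counts only if it has non-whitespace content.
def non_empty_text?(node)
  text_node?(node) && !node[:text].strip.empty?
end

# Assumed: an element whose tag is not in the block list (e.g. a, em).
def non_block_element?(node)
  !text_node?(node) && !block_tags.include?(node[:name])
end

# Assumed: true when at least one child is a block-level element.
def block_element_children?(node)
  node[:children].any? { |child| !text_node?(child) && block_tags.include?(child[:name]) }
end

# Same control flow as the method documented above.
def only_contains_block_elements?(node)
  node[:children].each do |child|
    return false if non_empty_text?(child)
    return false if non_block_element?(child)
  end

  block_element_children?(node)
end

nested = { name: 'li', children: [{ text: "\n  " }, { name: 'p', children: [{ text: 'Inner' }] }] }
inline = { name: 'p', children: [{ text: 'Read ' }, { name: 'a', children: [{ text: 'this' }] }] }
empty  = { name: 'p', children: [] }

puts only_contains_block_elements?(nested)
puts only_contains_block_elements?(inline)
puts only_contains_block_elements?(empty)
```

Under these assumptions, a list item that wraps a paragraph is skipped so the inner paragraph can be extracted on its own, while a paragraph with direct text or inline children is extracted as a unit.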