Module: Jekyll::L10n::DomTextExtractor
- Extended by:
- DomTextExtractor
- Included in:
- DomTextExtractor
- Defined in:
- lib/jekyll-l10n/extraction/dom_text_extractor.rb
Overview
Extracts text content from HTML elements for translation.
DomTextExtractor identifies content-bearing HTML elements (paragraphs, headings, list items, etc.) and extracts their text content while preserving inline HTML tags. It validates extracted text and generates file location references for debugging. Text is extracted from elements that contain text nodes or inline elements, but not from elements containing only block-level children.
Key responsibilities:
- Identify extractable content elements (p, h1-h6, li, blockquote, etc.)
- Extract text while preserving inline HTML structure
- Skip elements containing only block-level children
- Validate extracted text (minimum length, non-numeric)
- Generate file location references for extracted strings
Instance Method Summary
- #extract(node, file_path, dest) ⇒ Hash?
  Extract text content from an HTML element.
- #extract_block_text(node) ⇒ Object
- #extractable?(node) ⇒ Boolean
- #only_contains_block_elements?(node) ⇒ Boolean
Instance Method Details
#extract(node, file_path, dest) ⇒ Hash?
Extract text content from an HTML element.
Returns nil if the element is not extractable (not a content element), if it contains only block-level children, or if the extracted text fails validation (too short, numeric-only, etc.). For valid text, returns a hash with the msgid, an empty msgstr, and a file location reference for debugging.
# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 44
def extract(node, file_path, dest)
  return nil unless extractable?(node)

  text = extract_block_text(node)
  return nil if text.nil?

  reference = XPathReferenceGenerator.generate(node, file_path, dest)
  { msgid: text, msgstr: '', reference: reference }
end
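The nil-or-hash contract of #extract lets a caller collect entries and drop rejected nodes in one pass. The following is a minimal sketch of that pattern, not the gem's code: StubElement, STUB_CONTENT_ELEMENTS, extract_stub, the validation rules, and the reference format are all hypothetical stand-ins chosen for illustration.

```ruby
# Stand-in for a parsed HTML element; the real method receives DOM nodes.
StubElement = Struct.new(:name, :text)

# Assumed subset of content elements, for illustration only.
STUB_CONTENT_ELEMENTS = %w[p h1 h2 li blockquote].freeze

# Hypothetical simplification of #extract with the same nil-or-hash contract.
def extract_stub(node, file_path)
  return nil unless STUB_CONTENT_ELEMENTS.include?(node.name)

  text = node.text.strip
  # Assumed validation rules: at least two characters, not digits only.
  return nil if text.length < 2 || text.match?(/\A\d+\z/)

  # The reference format below is made up; the real one comes from
  # XPathReferenceGenerator.
  { msgid: text, msgstr: '', reference: "#{file_path}:#{node.name}" }
end

nodes = [
  StubElement.new('p', 'Welcome to the site'),
  StubElement.new('div', 'Wrapper text'), # not a content element in this sketch
  StubElement.new('li', '42')             # numeric-only, fails validation
]

# filter_map drops the nil results and keeps only translatable entries.
entries = nodes.filter_map { |node| extract_stub(node, 'index.html') }
```

Because rejected nodes come back as nil rather than raising, the caller needs no special error handling for headings, wrappers, or numeric fragments that are not worth translating.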
#extract_block_text(node) ⇒ Object
# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 58
def extract_block_text(node)
  return nil if only_contains_block_elements?(node)

  text = HtmlTextUtils.(node)
  TextValidator.valid?(text) ? text : nil
end
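The body of TextValidator.valid? is not shown on this page. As a rough sketch of the two checks the overview names (minimum length, non-numeric), a validator could look like the following. The module name TextValidatorSketch, the MIN_LENGTH threshold, and the exact regular expression are assumptions, not the gem's actual rules.

```ruby
# Hypothetical validator illustrating "minimum length, non-numeric".
module TextValidatorSketch
  MIN_LENGTH = 2 # assumed threshold; the real value is not documented here

  def self.valid?(text)
    return false if text.nil?

    stripped = text.strip
    return false if stripped.length < MIN_LENGTH
    # Reject strings made up only of digits, whitespace, and number punctuation.
    return false if stripped.match?(/\A[\d\s.,]+\z/)

    true
  end
end

# Mirrors the ternary in #extract_block_text: keep valid text, else nil.
candidate = 'Hello <em>world</em>'
result = TextValidatorSketch.valid?(candidate) ? candidate : nil
```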
#extractable?(node) ⇒ Boolean
# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 54
def extractable?(node)
  node.element? && HtmlTextUtils::CONTENT_ELEMENTS.include?(node.name)
end
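The check above combines a node-type test with a tag-name lookup. The sketch below reproduces that logic against a stand-in node so it can be run without an HTML parser. KindNode, CONTENT_ELEMENTS_SKETCH, and extractable_sketch? are hypothetical names, and the element list is an assumed subset based on the overview (p, h1-h6, li, blockquote).

```ruby
# Stand-in node exposing the two methods #extractable? relies on.
KindNode = Struct.new(:name, :kind) do
  def element?
    kind == :element
  end
end

# Assumed subset of HtmlTextUtils::CONTENT_ELEMENTS, for illustration only.
CONTENT_ELEMENTS_SKETCH = %w[p h1 h2 h3 h4 h5 h6 li blockquote].freeze

def extractable_sketch?(node)
  node.element? && CONTENT_ELEMENTS_SKETCH.include?(node.name)
end

paragraph = KindNode.new('p', :element)
wrapper   = KindNode.new('div', :element)
text_node = KindNode.new('text', :text)

checks = [paragraph, wrapper, text_node].map { |n| extractable_sketch?(n) }
```

Testing element? first means text and comment nodes are rejected before their names are ever compared against the list.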
#only_contains_block_elements?(node) ⇒ Boolean
# File 'lib/jekyll-l10n/extraction/dom_text_extractor.rb', line 65
def only_contains_block_elements?(node)
  node.children.each do |child|
    return false if non_empty_text?(child)
    return false if non_block_element?(child)
  end

  block_element_children?(node)
end
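The helpers non_empty_text?, non_block_element?, and block_element_children? are not documented on this page. The sketch below guesses at plausible implementations so the control flow can be exercised end to end. TreeNode, BLOCK_ELEMENTS_SKETCH, and all three helper bodies are assumptions, not the gem's code.

```ruby
# Stand-in DOM node with just enough behaviour for the sketch.
TreeNode = Struct.new(:name, :kind, :content, :children) do
  def text?
    kind == :text
  end

  def element?
    kind == :element
  end
end

# Assumed list of block-level tags.
BLOCK_ELEMENTS_SKETCH = %w[div p ul ol li table section blockquote].freeze

# Hypothetical helper: a text node with something other than whitespace.
def non_empty_text?(child)
  child.text? && !child.content.strip.empty?
end

# Hypothetical helper: an element that is not block-level (em, a, span, ...).
def non_block_element?(child)
  child.element? && !BLOCK_ELEMENTS_SKETCH.include?(child.name)
end

# Hypothetical helper: at least one block-level element child exists.
def block_element_children?(node)
  node.children.any? { |c| c.element? && BLOCK_ELEMENTS_SKETCH.include?(c.name) }
end

# Same control flow as the documented method.
def only_contains_block_elements?(node)
  node.children.each do |child|
    return false if non_empty_text?(child)
    return false if non_block_element?(child)
  end

  block_element_children?(node)
end

def text_node(content)
  TreeNode.new('text', :text, content, [])
end

def element(name, children)
  TreeNode.new(name, :element, nil, children)
end

# <li>\n<ul>...</ul></li>: whitespace plus a nested list, so nothing to extract.
nested_list_item = element('li', [text_node("\n"), element('ul', [])])
# <p>Hello <em>world</em></p>: direct text, so the paragraph is extracted.
paragraph = element('p', [text_node('Hello '), element('em', [text_node('world')])])
# An empty element has no block children, so it is not "only blocks".
empty_div = element('div', [])
```

Under these assumptions, a wrapper such as the nested list item is skipped, and its text is picked up when the inner elements are visited, which avoids extracting the same string twice.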