Class: Jekyll::L10n::HtmlStringExtractor

Inherits: Object

Defined in: lib/jekyll-l10n/extraction/html_string_extractor.rb

Overview

Extracts translatable strings from HTML documents for localization.

HtmlStringExtractor walks the DOM tree of parsed HTML and extracts text from content elements and values from a configurable list of HTML attributes. It deduplicates entries by msgid and attaches a file location reference to each extracted string to aid debugging and tracking. Elements that match any of the configured CSS exclude selectors are skipped.

Key responsibilities:

Parse HTML into DOM tree
Walk DOM recursively to find translatable content
Extract text from content elements (p, h1-h6, li, etc.)
Extract attribute values (title, alt, aria-label, etc.)
Generate file location references for each extracted string
Skip elements matching exclude selectors
Deduplicate entries by msgid
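The recursive walk listed above can be illustrated with a small sketch. It is not the class's actual implementation: `Node`, `CONTENT_TAGS`, and `collect_strings` are hypothetical names, the tree is a hand-built stand-in for a parsed document, and exclusion is simplified to matching tag names rather than evaluating full CSS selectors.

```ruby
# Toy DOM node: tag name, attribute hash, own text, and child nodes.
# Hypothetical stand-in for an element of a parsed HTML document.
Node = Struct.new(:tag, :attrs, :text, :children)

# Assumed subset of "content elements"; the real list may differ.
CONTENT_TAGS = %w[p h1 h2 h3 h4 h5 h6 li].freeze

# Recursively collect translatable strings, skipping excluded subtrees.
def collect_strings(node, translatable_attrs, excluded_tags, found = [])
  # An excluded element hides itself and everything below it.
  return found if excluded_tags.include?(node.tag)

  text = node.text.to_s.strip
  found << text if CONTENT_TAGS.include?(node.tag) && !text.empty?

  # Attribute values are collected from any element that carries them.
  translatable_attrs.each do |name|
    value = node.attrs[name].to_s.strip
    found << value unless value.empty?
  end

  node.children.each do |child|
    collect_strings(child, translatable_attrs, excluded_tags, found)
  end
  found
end

tree = Node.new('body', {}, nil, [
  Node.new('p', { 'title' => 'Greeting' }, 'Hello', []),
  Node.new('script', {}, 'ignored()', []),
  Node.new('ul', {}, nil, [Node.new('li', {}, 'First item', [])])
])

strings = collect_strings(tree, %w[title alt], %w[script style])
```

In this sketch the `script` subtree contributes nothing, while the `p` element contributes both its text and its `title` attribute.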

Examples:

extractor = HtmlStringExtractor.new(['title', 'alt'], ['script', 'style'])
entries = extractor.extract(html_content, '_site', 'docs/index.html')
# Returns an array of hashes with :msgid, :msgstr, and :reference keys
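One possible way to consume the returned entries is to render them as gettext PO text. The sketch below assumes only the entry shape described in the comment above. The sample data, the reference values, and the `po_quote` helper are illustrative and not part of the library.

```ruby
# Sample entries shaped like the described return value (made-up data).
entries = [
  { msgid: 'Hello', msgstr: '', reference: 'docs/index.html' },
  { msgid: 'Say "hi"', msgstr: '', reference: 'docs/index.html' }
]

# Hypothetical helper: escape a string for use as a PO string literal.
# Block forms of gsub avoid backslash handling in replacement strings.
def po_quote(str)
  escaped = str.gsub('\\') { '\\\\' }
               .gsub('"') { '\\"' }
               .gsub("\n") { '\\n' }
  "\"#{escaped}\""
end

# One PO stanza per entry: reference comment, msgid, msgstr.
po_text = entries.map do |entry|
  [
    "#: #{entry[:reference]}",
    "msgid #{po_quote(entry[:msgid])}",
    "msgstr #{po_quote(entry[:msgstr])}"
  ].join("\n")
end.join("\n\n")
```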

Instance Attribute Summary

#exclude_selectors ⇒ Object readonly

Returns the value of attribute exclude_selectors.

#translatable_attrs ⇒ Object readonly

Returns the value of attribute translatable_attrs.

Instance Method Summary

#extract(html, dest, file_path) ⇒ Array<Hash>

Extract translatable strings from HTML.

#initialize(translatable_attrs, exclude_selectors) ⇒ HtmlStringExtractor constructor

Initialize a new HtmlStringExtractor.

Constructor Details

#initialize(translatable_attrs, exclude_selectors) ⇒ `HtmlStringExtractor`

Initialize a new HtmlStringExtractor.

Parameters:

translatable_attrs (Array<String>) —

HTML attributes to extract (e.g., ['title', 'alt', 'aria-label', 'placeholder', 'aria-description'])
exclude_selectors (Array<String>) —

CSS selectors for elements to skip during extraction (e.g., ['script', 'style', '.no-translate'])

# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 38

def initialize(translatable_attrs, exclude_selectors)
  @translatable_attrs = translatable_attrs
  @exclude_selectors = exclude_selectors
end

Instance Attribute Details

#exclude_selectors ⇒ `Object` (readonly)

Returns the value of attribute exclude_selectors.




# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 30

def exclude_selectors
  @exclude_selectors
end

#translatable_attrs ⇒ `Object` (readonly)

Returns the value of attribute translatable_attrs.




# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 30

def translatable_attrs
  @translatable_attrs
end

Instance Method Details

#extract(html, dest, file_path) ⇒ `Array<Hash>`

Extract translatable strings from HTML.

Walks the DOM tree and extracts text nodes from content elements and values from the configured attributes. Each extracted string is assigned a file location reference for debugging. Entries are deduplicated by msgid, so multiple occurrences of the same text yield a single entry.
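The deduplication rule can be sketched as follows. This is a minimal illustration, not the library's code: `dedupe_by_msgid` is a hypothetical helper, the reference values are placeholders, and keeping the first occurrence is an assumption, since the source does not say which occurrence's reference survives.

```ruby
# Hypothetical helper: keep the first entry seen for each msgid.
# Array#uniq with a block retains the earliest element per key.
def dedupe_by_msgid(entries)
  entries.uniq { |entry| entry[:msgid] }
end

# Sample entries with one repeated msgid (placeholder references).
entries = [
  { msgid: 'Read more', msgstr: '', reference: 'first occurrence' },
  { msgid: 'Contact', msgstr: '', reference: 'only occurrence' },
  { msgid: 'Read more', msgstr: '', reference: 'second occurrence' }
]

unique = dedupe_by_msgid(entries)
```

Here `unique` holds two entries, and the surviving 'Read more' entry is the one that appeared first in the input.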