Class: Jekyll::L10n::HtmlStringExtractor

Inherits:
Object
Defined in:
lib/jekyll-l10n/extraction/html_string_extractor.rb

Overview

Extracts translatable strings from HTML documents for localization.

HtmlStringExtractor walks the DOM tree of parsed HTML and extracts text from content elements and values from a configurable list of HTML attributes. Elements matching any of the configured exclude CSS selectors are skipped. Each extracted string gets a file location reference to help trace it back to its source, and the resulting entries are deduplicated by msgid.

Key responsibilities:

  • Parse HTML into DOM tree

  • Walk DOM recursively to find translatable content

  • Extract text from content elements (p, h1-h6, li, etc.)

  • Extract attribute values (title, alt, aria-label, etc.)

  • Generate file location references for each extracted string

  • Skip elements matching exclude selectors

  • Deduplicate entries by msgid
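
The walk described in the list above can be pictured with a minimal sketch. It does not use Nokogiri and is not the class's actual implementation: `Node` is a hypothetical stand-in for a parsed DOM node, `walk` is an illustrative helper, and exclusion is simplified to matching tag names rather than full CSS selectors.

```ruby
# Hypothetical stand-in for a parsed DOM node (the real class walks Nokogiri nodes).
Node = Struct.new(:name, :attrs, :text, :children)

# Recursively collects text and selected attribute values,
# skipping any subtree whose tag name is excluded.
def walk(node, translatable_attrs, excluded_names, found = [])
  return found if excluded_names.include?(node.name)

  text = node.text.to_s.strip
  found << text unless text.empty?

  translatable_attrs.each do |attr|
    value = node.attrs[attr].to_s.strip
    found << value unless value.empty?
  end

  node.children.each { |child| walk(child, translatable_attrs, excluded_names, found) }
  found
end

tree = Node.new('body', {}, nil, [
  Node.new('p', {}, 'Welcome', []),
  Node.new('img', { 'alt' => 'Logo' }, nil, []),
  Node.new('script', {}, 'var x = 1;', [])
])

strings = walk(tree, ['alt'], ['script'])
puts strings.inspect
```

The `script` subtree is never visited, so its text does not appear in the result; only the paragraph text and the `alt` value are collected.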

Examples:

extractor = HtmlStringExtractor.new(['title', 'alt'], ['script', 'style'])
entries = extractor.extract(html_content, '_site', 'docs/index.html')
# Returns array of hash entries with :msgid, :msgstr, :reference keys

Constructor Details

#initialize(translatable_attrs, exclude_selectors) ⇒ HtmlStringExtractor

Initialize a new HtmlStringExtractor.

Parameters:

  • translatable_attrs (Array<String>)

    HTML attributes to extract (e.g., ['title', 'alt', 'aria-label', 'placeholder', 'aria-description'])

  • exclude_selectors (Array<String>)

    CSS selectors for elements to skip during extraction (e.g., ['script', 'style', '.no-translate'])



# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 38

def initialize(translatable_attrs, exclude_selectors)
  @translatable_attrs = translatable_attrs
  @exclude_selectors = exclude_selectors
end

Instance Attribute Details

#exclude_selectors ⇒ Object (readonly)

Returns the CSS selectors passed to the constructor; matching elements are skipped during extraction.



# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 30

def exclude_selectors
  @exclude_selectors
end

#translatable_attrs ⇒ Object (readonly)

Returns the HTML attribute names passed to the constructor; their values are extracted as translatable strings.



# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 30

def translatable_attrs
  @translatable_attrs
end

Instance Method Details

#extract(html, dest, file_path) ⇒ Array<Hash>

Extract translatable strings from HTML.

Walks the DOM tree and extracts text nodes from content elements and values from the configured attributes. Each extraction is assigned a file location reference for debugging. Entries are deduplicated by msgid: when the same text occurs more than once, only the first occurrence is kept, along with its reference.

Parameters:

  • html (String)

    HTML content to extract from

  • dest (String)

    Destination directory path (used in file location reference generation)

  • file_path (String)

    Path to source file (used in file location reference generation)

Returns:

  • (Array<Hash>)

    Array of extraction entries, each containing:

    • :msgid [String] The text or attribute value to translate

    • :msgstr [String] Empty string (to be filled by translator)

    • :reference [String] File location reference for debugging
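
The `:msgid`, `:msgstr`, and `:reference` keys resemble the fields of a gettext PO record. As a rough sketch, assuming PO is the intended output format (this page does not say so), an entry could be serialized as below. `po_escape` and `to_po_record` are hypothetical helpers, not part of this class, and the reference value `docs/index.html:12` is an invented placeholder because the actual reference format is not documented here.

```ruby
# Hypothetical helper: escapes backslashes, double quotes, and newlines for a PO string.
def po_escape(text)
  text.gsub('\\') { '\\\\' }.gsub('"') { '\\"' }.gsub("\n") { '\\n' }
end

# Hypothetical helper: formats one extraction entry as a PO record.
def to_po_record(entry)
  [
    "#: #{entry[:reference]}",
    "msgid \"#{po_escape(entry[:msgid])}\"",
    "msgstr \"#{po_escape(entry[:msgstr])}\""
  ].join("\n")
end

# Placeholder entry; the reference string is an assumed format.
entry = { msgid: 'Say "hi"', msgstr: '', reference: 'docs/index.html:12' }
puts to_po_record(entry)
```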



# File 'lib/jekyll-l10n/extraction/html_string_extractor.rb', line 57

def extract(html, dest, file_path)
  entries = []

  doc = Nokogiri::HTML(html)
  walk_dom(doc.root, file_path, entries, dest)

  entries.uniq { |e| e[:msgid] }
end
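
The final `uniq` call keeps the first entry seen for each msgid and drops later ones, so the surviving reference points at the first occurrence. A small standalone illustration follows; the sample entries and their reference strings are made up for the example.

```ruby
# Sample entries; the :reference values are placeholders, not real output.
entries = [
  { msgid: 'Home',  msgstr: '', reference: 'first' },
  { msgid: 'About', msgstr: '', reference: 'second' },
  { msgid: 'Home',  msgstr: '', reference: 'third' }
]

# Same deduplication step as in #extract: uniqueness is judged by :msgid only.
unique = entries.uniq { |e| e[:msgid] }

unique.each { |e| puts "#{e[:msgid]} (#{e[:reference]})" }
```

If every location of a repeated string matters, the references would need to be merged before this step, since `uniq` discards the later entries entirely.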