Class: Jekyll::L10n::HtmlParser

Inherits:
Object
Defined in:
lib/jekyll-l10n/utils/html_parser.rb

Overview

Parses HTML content using Nokogiri.

HtmlParser provides a unified interface for parsing HTML either as a full document (preserving the DOCTYPE and overall structure) or as a fragment (partial HTML). It also provides a helper that strips the Content-Type meta tag that Nokogiri/libxml2 inserts during serialization.

Key responsibilities:

  • Parse full HTML documents with DOCTYPE preservation

  • Parse HTML fragments for partial content

  • Remove auto-inserted meta charset tags

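Class Method Summary

  • .parse_document(html) ⇒ Nokogiri::HTML::Document

    Parse HTML as a full document.

  • .parse_fragment(html) ⇒ Nokogiri::HTML::DocumentFragment

    Parse HTML as a fragment.

  • .remove_meta_charset(html_string) ⇒ String

    Remove auto-inserted meta charset tag from serialized HTML.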

Class Method Details

.parse_document(html) ⇒ Nokogiri::HTML::Document

Parse HTML as a full document.

Preserves the DOCTYPE, the html tag, and the document structure. Use this for complete HTML documents. The meta tag that Nokogiri/libxml2 inserts when the document is serialized back to a string can be stripped afterwards with remove_meta_charset.

Parameters:

  • html (String)

    HTML content to parse

Returns:

  • (Nokogiri::HTML::Document)

    Parsed HTML document



# File 'lib/jekyll-l10n/utils/html_parser.rb', line 24

def self.parse_document(html)
  Nokogiri::HTML(html)
end

.parse_fragment(html) ⇒ Nokogiri::HTML::DocumentFragment

Parse HTML as a fragment.

Parses partial HTML without wrapping it in html/body tags. Use this for pieces of HTML content that are not complete documents.

Parameters:

  • html (String)

    HTML fragment to parse

Returns:

  • (Nokogiri::HTML::DocumentFragment)

    Parsed HTML fragment



# File 'lib/jekyll-l10n/utils/html_parser.rb', line 35

def self.parse_fragment(html)
  Nokogiri::HTML.fragment(html)
end

.remove_meta_charset(html_string) ⇒ String

Remove auto-inserted meta charset tag from serialized HTML.

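Nokogiri/libxml2 automatically inserts a meta charset tag during serialization, even when the original HTML did not contain one. This method removes every occurrence of exactly <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, along with one trailing newline if present. The match is literal, so an identical tag that was already in the original HTML is removed as well, and a meta tag written any other way is left untouched.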

Parameters:

  • html_string (String)

    Serialized HTML

Returns:

  • (String)

    HTML with meta charset tag removed



# File 'lib/jekyll-l10n/utils/html_parser.rb', line 46

def self.remove_meta_charset(html_string)
  pattern = %r{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n?}
  html_string.gsub(pattern, '')
end
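
The behaviour of the pattern can be checked in isolation. The sketch below copies the method body into a stand-in module (HtmlParserSketch is a hypothetical name, used only so the example runs without the gem or Nokogiri) and applies it to a hand-written sample of serialized HTML.

```ruby
# Stand-in for Jekyll::L10n::HtmlParser; hypothetical name, same method body.
module HtmlParserSketch
  META_PATTERN = %r{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n?}

  def self.remove_meta_charset(html_string)
    html_string.gsub(META_PATTERN, '')
  end
end

# Hand-written sample of what serialized output might look like.
serialized = <<~HTML
  <!DOCTYPE html>
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Greeting</title>
  </head>
  <body><p>Hello</p></body>
  </html>
HTML

cleaned = HtmlParserSketch.remove_meta_charset(serialized)

# The pattern is literal, so a differently written tag is left alone.
short_form = '<meta charset="utf-8">'
unchanged  = HtmlParserSketch.remove_meta_charset(short_form)

puts cleaned
puts unchanged
```

The second call shows the main limitation of a literal pattern: it only targets the one form of the tag, so any other charset declaration passes through unmodified.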