Class: Jekyll::L10n::HtmlParser
- Inherits: Object
- Class hierarchy: Object, Jekyll::L10n::HtmlParser
- Defined in: lib/jekyll-l10n/utils/html_parser.rb
Overview
Parses HTML content using Nokogiri.
HtmlParser provides a unified interface for parsing HTML as either full documents (preserving DOCTYPE and structure) or as fragments (partial HTML). It also provides utilities for cleaning up auto-inserted meta tags that Nokogiri/libxml2 adds during serialization.
Key responsibilities:
- Parse full HTML documents with DOCTYPE preservation
- Parse HTML fragments for partial content
- Remove auto-inserted meta charset tags
Class Method Summary
- .parse_document(html) ⇒ Nokogiri::HTML::Document
  Parse HTML as a full document.
- .parse_fragment(html) ⇒ Nokogiri::HTML::DocumentFragment
  Parse HTML as a fragment.
- .remove_meta_charset(html_string) ⇒ String
  Remove auto-inserted meta charset tag from serialized HTML.
Class Method Details
.parse_document(html) ⇒ Nokogiri::HTML::Document
Parse HTML as a full document.
Preserves DOCTYPE, html tag, and document structure. Use this for complete HTML documents. Auto-inserted meta tags can be removed with remove_meta_charset.
    # File 'lib/jekyll-l10n/utils/html_parser.rb', line 24
    def self.parse_document(html)
      Nokogiri::HTML(html)
    end
.parse_fragment(html) ⇒ Nokogiri::HTML::DocumentFragment
Parse HTML as a fragment.
Parses partial HTML without wrapping in html/body tags. Use for extracting pieces of HTML content.
    # File 'lib/jekyll-l10n/utils/html_parser.rb', line 35
    def self.parse_fragment(html)
      Nokogiri::HTML.fragment(html)
    end
.remove_meta_charset(html_string) ⇒ String
Remove auto-inserted meta charset tag from serialized HTML.
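Nokogiri/libxml2 automatically inserts a meta charset tag during serialization. This method removes every occurrence of that exact tag, along with one trailing newline if present, from the serialized string, because the tag was not in the original HTML. The match is literal, so a meta tag written with different attributes, order, or casing is left in place.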
    # File 'lib/jekyll-l10n/utils/html_parser.rb', line 46
    def self.remove_meta_charset(html_string)
      pattern = %r{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n?}
      html_string.gsub(pattern, '')
    end
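The effect of the pattern can be checked without Nokogiri. The sketch below copies the regular expression from the listing into a standalone helper; the names META_CHARSET and strip_meta_charset, and the sample markup, are hypothetical and exist only for this illustration.

```ruby
# Pattern copied from the listing; the constant name is illustrative.
META_CHARSET = %r{<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n?}

# Hypothetical stand-in for HtmlParser.remove_meta_charset.
def strip_meta_charset(html_string)
  html_string.gsub(META_CHARSET, '')
end

# Sample serialized output containing the auto-inserted tag.
serialized = <<~HTML
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Demo</title>
  </head>
  <body><p>Hi</p></body>
  </html>
HTML

cleaned = strip_meta_charset(serialized)

# A differently written charset tag does not match the literal pattern
# and therefore passes through untouched.
authored = '<meta charset="utf-8">'
unchanged = strip_meta_charset(authored)

puts cleaned
puts unchanged
```

Because the match is a literal string, any change in how the serializer writes the tag (attribute order, quoting, or charset casing) would leave it in the output, so the pattern needs to track the serializer's actual form.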