Class: Canon::Formatters::HtmlFormatterBase
- Inherits: Object
  - Object
  - Canon::Formatters::HtmlFormatterBase
- Defined in: lib/canon/formatters/html_formatter_base.rb
Overview
Base class for HTML formatters with shared canonicalization logic
This abstract base class provides common HTML canonicalization logic for both HTML4 and HTML5 formatters. It handles:
- Attribute sorting for consistency
- Whitespace normalization
- Block element spacing
Canonicalization Process
- Parse HTML using format-specific parser (subclass responsibility)
- Sort all element attributes alphabetically
- Normalize whitespace (remove whitespace-only text nodes, collapse runs)
- Ensure proper spacing between block-level elements
- Serialize to HTML string
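The attribute-sorting and whitespace steps listed above can be sketched in plain Ruby. This is only an illustration, assuming attributes are represented as [name, value] pairs and text nodes as strings; the helper names sort_attribute_pairs and normalize_text_nodes are hypothetical, and the real class works on Nokogiri nodes instead.

```ruby
# Illustrative sketch only: plain Ruby data stands in for Nokogiri nodes.
# Both helper names are hypothetical and are not part of Canon.

# Sort [name, value] attribute pairs alphabetically by attribute name.
def sort_attribute_pairs(pairs)
  pairs.sort_by { |name, _value| name }
end

# Drop whitespace-only text nodes, then collapse each remaining run of
# whitespace into a single space.
def normalize_text_nodes(texts)
  texts
    .reject { |text| text.match?(/\A\s*\z/) }
    .map { |text| text.gsub(/\s+/, " ") }
end

sorted = sort_attribute_pairs([["id", "main"], ["class", "box"], ["data-x", "1"]])
# sorted == [["class", "box"], ["data-x", "1"], ["id", "main"]]

texts = normalize_text_nodes(["Hello \n\t  world", "  \n ", "ok"])
# texts == ["Hello world", "ok"]
```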
Subclass Implementation
Subclasses must implement the `parse` class method:
def self.parse(html)
  # Return Nokogiri::HTML4::Document or Nokogiri::HTML5::Document
end
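The division of labor between the base class and its subclasses follows a template-method shape, which the sketch below illustrates. SketchFormatterBase and SketchFormatter are hypothetical stand-ins, and strings take the place of Nokogiri documents, so the example runs without any gems.

```ruby
# Hypothetical stand-ins, not the Canon classes themselves.
class SketchFormatterBase
  # Template method: parse comes from the subclass, canonicalize is shared.
  def self.format(input)
    doc = parse(input)
    canonicalize(doc)
  end

  def self.parse(_input)
    raise NotImplementedError, "Subclasses must implement the parse method"
  end

  # Stand-in for the shared canonicalization logic.
  def self.canonicalize(doc)
    doc.strip
  end
end

class SketchFormatter < SketchFormatterBase
  # A real subclass would return a Nokogiri document here.
  def self.parse(input)
    input.to_s
  end
end

formatted = SketchFormatter.format("  <p>hi</p>  ")
# formatted == "<p>hi</p>"

# Calling format on the base class reaches the unimplemented parse.
base_error =
  begin
    SketchFormatterBase.format("<p>hi</p>")
    nil
  rescue NotImplementedError => e
    e
  end
```

Because format calls parse on the receiving class, a subclass only needs to override parse to get the shared canonicalization.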
Block Elements
The following elements are treated as block-level and will have spacing preserved between them: address, article, aside, blockquote, dd, details, dialog, div, dl, dt, fieldset, figcaption, figure, footer, form, h1-h6, header, hgroup, hr, li, main, nav, ol, p, pre, section, table, tbody, td, tfoot, th, thead, tr, ul
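One way such spacing could be enforced on the serialized string is a regular-expression pass. The sketch below is an assumption about the approach, not the actual ensure_block_element_spacing implementation: it only handles a closing block tag directly followed by an opening block tag, and the names BLOCK_ELEMENTS_SKETCH, BLOCK_NAME, ADJACENT_BLOCK_TAGS, and space_adjacent_block_tags are hypothetical.

```ruby
# Hypothetical sketch: the real ensure_block_element_spacing may work differently.
BLOCK_ELEMENTS_SKETCH = %w[
  address article aside blockquote dd details dialog div dl dt fieldset
  figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr li main
  nav ol p pre section table tbody td tfoot th thead tr ul
].freeze

# Alternation over every block element name.
BLOCK_NAME = Regexp.union(BLOCK_ELEMENTS_SKETCH)

# A closing block tag immediately followed by an opening block tag.
# The lookahead leaves the opening tag in place for the next match.
ADJACENT_BLOCK_TAGS = %r{(</#{BLOCK_NAME}>)(?=<#{BLOCK_NAME}[\s/>])}

# Insert a single space between adjacent block-level tags.
def space_adjacent_block_tags(html)
  html.gsub(ADJACENT_BLOCK_TAGS, '\1 ')
end

spaced = space_adjacent_block_tags("<div>a</div><p>b</p>")
# spaced == "<div>a</div> <p>b</p>"

inline = space_adjacent_block_tags("<span>a</span><em>b</em>")
# inline is unchanged because span and em are not block elements
```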
Usage
# Via subclass (Html4Formatter or Html5Formatter)
canonical_html = Canon::Formatters::Html4Formatter.format(html_string)
Direct Known Subclasses
Html4Formatter, Html5Formatter
Constant Summary
- BLOCK_ELEMENTS =
Block-level HTML elements that should preserve spacing between them
%w[ address article aside blockquote dd details dialog div dl dt fieldset figcaption figure footer form h1 h2 h3 h4 h5 h6 header hgroup hr li main nav ol p pre section table tbody td tfoot th thead tr ul ].freeze
- WHITESPACE_SENSITIVE_ELEMENTS =
HTML elements where whitespace is semantically significant and should NOT be normalized
%w[ pre code textarea script style ].freeze
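A minimal sketch of how this exemption might be applied during whitespace normalization follows. It assumes the check looks at every ancestor element of a text node, which the source does not confirm; WS_SENSITIVE_SKETCH and normalize_text are hypothetical names, and element names are passed as plain strings rather than Nokogiri nodes.

```ruby
# Hypothetical sketch: mirrors the constant above, but the lookup strategy
# (checking all ancestors) is an assumption.
WS_SENSITIVE_SKETCH = %w[pre code textarea script style].freeze

# text:      content of a text node
# ancestors: element names from the root down to the text node's parent
def normalize_text(text, ancestors)
  # Leave text untouched anywhere inside a whitespace-sensitive element.
  return text if ancestors.any? { |name| WS_SENSITIVE_SKETCH.include?(name) }

  text.gsub(/\s+/, " ")
end

in_paragraph = normalize_text("a  \n b", %w[html body p])
# in_paragraph == "a b"

in_pre = normalize_text("a  \n b", %w[html body pre])
# in_pre == "a  \n b"
```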
Class Method Summary
- .canonicalize(doc) ⇒ String
  Canonicalize HTML document.
- .format(html) ⇒ String
  Format HTML using canonical form.
- .parse(_html) ⇒ Nokogiri::HTML::Document, Nokogiri::XML::Document
  Parse HTML into a Nokogiri document.
Class Method Details
.canonicalize(doc) ⇒ String
Canonicalize HTML document
# File 'lib/canon/formatters/html_formatter_base.rb', line 78
def self.canonicalize(doc)
  # Sort attributes for consistency
  sort_attributes(doc)

  # Normalize whitespace between elements
  normalize_whitespace(doc)

  # Serialize with consistent formatting
  html = doc.to_html(
    save_with: Nokogiri::XML::Node::SaveOptions::NO_DECLARATION,
  ).strip

  # Post-process: ensure spaces between block element tags
  # This is needed because Nokogiri's serialization may remove
  # whitespace text nodes between block elements
  ensure_block_element_spacing(html)
end
.format(html) ⇒ String
Format HTML using canonical form
# File 'lib/canon/formatters/html_formatter_base.rb', line 61
def self.format(html)
  doc = parse(html)
  canonicalize(doc)
end
.parse(_html) ⇒ Nokogiri::HTML::Document, Nokogiri::XML::Document
Parse HTML into a Nokogiri document
# File 'lib/canon/formatters/html_formatter_base.rb', line 70
def self.parse(_html)
  raise NotImplementedError, "Subclasses must implement the parse method"
end