Class: Canon::Formatters::HtmlFormatterBase

Inherits:
Object
Defined in:
lib/canon/formatters/html_formatter_base.rb

Overview

Base class for HTML formatters with shared canonicalization logic

This abstract base class provides the canonicalization logic shared by the HTML4 and HTML5 formatters. It handles:

  • Attribute sorting for consistency

  • Whitespace normalization

  • Block element spacing

Canonicalization Process

  1. Parse HTML using format-specific parser (subclass responsibility)

  2. Sort all element attributes alphabetically

  3. Normalize whitespace (remove whitespace-only text nodes, collapse runs)

  4. Ensure proper spacing between block-level elements

  5. Serialize to HTML string
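
Steps 2 and 3 above can be illustrated on plain Ruby values. This is a minimal sketch that does not use Nokogiri: the helper names (`sort_attribute_pairs`, `collapse_whitespace`, `whitespace_only?`) are hypothetical and are not part of Canon's API, and the real implementation operates on document nodes rather than hashes and strings.

```ruby
# Hypothetical helpers for illustration only; not Canon's actual methods.

# Step 2: order an element's attributes alphabetically by attribute name.
def sort_attribute_pairs(attributes)
  attributes.sort_by { |name, _value| name }.to_h
end

# Step 3: collapse each run of whitespace into a single space.
def collapse_whitespace(text)
  text.gsub(/\s+/, " ")
end

# Step 3: a text node made up only of whitespace is a candidate for removal.
def whitespace_only?(text)
  text.strip.empty?
end

attrs = { "id" => "main", "class" => "box", "data-x" => "1" }
sorted = sort_attribute_pairs(attrs)
collapsed = collapse_whitespace("Hello \n\t  world")
```

Sorting by name makes two documents that differ only in attribute order serialize identically, which is the point of the canonical form.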

Subclass Implementation

Subclasses must implement the `parse` class method:

def self.parse(html)
  # Return Nokogiri::HTML4::Document or Nokogiri::HTML5::Document
end
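
The contract between the base class and its subclasses can be sketched without Canon or Nokogiri installed. In the stand-in below, `SketchFormatterBase` and `SketchPassthroughFormatter` are hypothetical names, "parsing" simply returns the input string, and `canonicalize` is a placeholder; a real subclass would presumably return a Nokogiri document as described above.

```ruby
# Stand-in for the base class so the sketch runs on its own.
class SketchFormatterBase
  # Template method: parse with the subclass hook, then canonicalize.
  def self.format(html)
    canonicalize(parse(html))
  end

  # Abstract hook: the base implementation always raises.
  def self.parse(_html)
    raise NotImplementedError, "Subclasses must implement the parse method"
  end

  # Placeholder for the shared canonicalization steps (assumption).
  def self.canonicalize(doc)
    doc.strip
  end
end

# Hypothetical subclass: supplies only the format-specific parse step.
class SketchPassthroughFormatter < SketchFormatterBase
  def self.parse(html)
    html
  end
end

result = SketchPassthroughFormatter.format("  <p>hi</p>  ")
```

Because `format` lives in the base class, a subclass only has to supply `parse`; calling `format` on the base class itself raises `NotImplementedError`.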

Block Elements

The following elements are treated as block-level and will have spacing preserved between them: address, article, aside, blockquote, dd, details, dialog, div, dl, dt, fieldset, figcaption, figure, footer, form, h1-h6, header, hgroup, hr, li, main, nav, ol, p, pre, section, table, tbody, td, tfoot, th, thead, tr, ul

Usage

# Via subclass (Html4Formatter or Html5Formatter)
canonical_html = Canon::Formatters::Html4Formatter.format(html_string)

Direct Known Subclasses

Html4Formatter, Html5Formatter, HtmlFormatter

Constant Summary

BLOCK_ELEMENTS =

Block-level HTML elements that should preserve spacing between them

%w[
  address article aside blockquote dd details dialog div dl dt
  fieldset figcaption figure footer form h1 h2 h3 h4 h5 h6
  header hgroup hr li main nav ol p pre section table tbody
  td tfoot th thead tr ul
].freeze
WHITESPACE_SENSITIVE_ELEMENTS =

HTML elements where whitespace is semantically significant and should NOT be normalized

%w[
  pre code textarea script style
].freeze
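
One way such a list might be consulted during whitespace normalization is sketched below. This is an assumption about usage, not Canon's code: `SKETCH_WHITESPACE_SENSITIVE`, `preserve_whitespace?`, and `normalize_text` are hypothetical names, and the ancestor chain is passed as a plain array of element names instead of being read from a parsed document.

```ruby
# Copy of the list above, under a hypothetical name.
SKETCH_WHITESPACE_SENSITIVE = %w[pre code textarea script style].freeze

# ancestor_names: element names from the text node's parent up to the root.
def preserve_whitespace?(ancestor_names)
  ancestor_names.any? do |name|
    SKETCH_WHITESPACE_SENSITIVE.include?(name.downcase)
  end
end

# Collapse whitespace runs unless some ancestor is whitespace-sensitive.
def normalize_text(text, ancestor_names)
  return text if preserve_whitespace?(ancestor_names)

  text.gsub(/\s+/, " ")
end

in_paragraph = normalize_text("a   b", %w[p body html])
in_pre = normalize_text("a   b", %w[code pre body html])
```

Text under `p` is collapsed, while text nested inside `pre` or `code` is returned untouched.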

Class Method Details

.canonicalize(doc) ⇒ String

Canonicalize a parsed HTML document

Parameters:

  • doc (Nokogiri::HTML::Document)

    Parsed HTML document

Returns:

  • (String)

    Canonical HTML string



# File 'lib/canon/formatters/html_formatter_base.rb', line 78

def self.canonicalize(doc)
  # Sort attributes for consistency
  sort_attributes(doc)

  # Normalize whitespace between elements
  normalize_whitespace(doc)

  # Serialize with consistent formatting
  html = doc.to_html(
    save_with: Nokogiri::XML::Node::SaveOptions::NO_DECLARATION,
  ).strip

  # Post-process: ensure spaces between block element tags
  # This is needed because Nokogiri's serialization may remove
  # whitespace text nodes between block elements
  ensure_block_element_spacing(html)
end
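
The source of `ensure_block_element_spacing` is not shown on this page. As a rough sketch of what such a post-processing step could look like, the hypothetical `add_block_spacing` below inserts one space between two directly adjacent block-level tags using a regular expression. The constant and method names are assumptions, and the actual method may behave differently.

```ruby
# Copy of BLOCK_ELEMENTS under a hypothetical name.
SKETCH_BLOCK_ELEMENTS = %w[
  address article aside blockquote dd details dialog div dl dt
  fieldset figcaption figure footer form h1 h2 h3 h4 h5 h6
  header hgroup hr li main nav ol p pre section table tbody
  td tfoot th thead tr ul
].freeze

SKETCH_BLOCK_PATTERN = SKETCH_BLOCK_ELEMENTS.join("|")

# Insert a single space after a block-level tag when the very next
# characters start another block-level tag (opening or closing).
def add_block_spacing(html)
  html.gsub(
    %r{(</?(?:#{SKETCH_BLOCK_PATTERN})\b[^>]*>)(?=</?(?:#{SKETCH_BLOCK_PATTERN})\b)}i,
    '\1 '
  )
end

spaced = add_block_spacing("<div><p>a</p><p>b</p></div>")
inline = add_block_spacing("<p><span>a</span></p>")
```

The lookahead leaves the following tag unconsumed, so a chain of adjacent block tags gets a space at every boundary, while tags already separated by whitespace and inline elements such as `span` are left alone.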

.format(html) ⇒ String

Format HTML into its canonical form by parsing it and canonicalizing the resulting document

Parameters:

  • html (String)

    HTML document to canonicalize

Returns:

  • (String)

    Canonical form of HTML



# File 'lib/canon/formatters/html_formatter_base.rb', line 61

def self.format(html)
  doc = parse(html)
  canonicalize(doc)
end

.parse(_html) ⇒ Nokogiri::HTML::Document, Nokogiri::XML::Document

Parse HTML into a Nokogiri document. The base implementation always raises; subclasses override it.

Parameters:

  • html (String)

    HTML document to parse

Returns:

  • (Nokogiri::HTML::Document, Nokogiri::XML::Document)

    Parsed HTML document

Raises:

  • (NotImplementedError)


# File 'lib/canon/formatters/html_formatter_base.rb', line 70

def self.parse(_html)
  raise NotImplementedError,
        "Subclasses must implement the parse method"
end