Class: Canon::PrettyPrinter::XmlNormalized

Inherits:

Object

Defined in: lib/canon/pretty_printer/xml_normalized.rb

Overview

Mixed-content-aware XML serializer for diff display preprocessing.

The mixed-content problem

Standard XML pretty-printers (including Nokogiri’s built-in serializer) keep elements that contain both text and child elements on a single line. They have no choice: inserting a newline between, say, `See ` and `<xref…/>` would create a new whitespace text node, changing the document’s semantic content. For line-by-line diffs, the result is that any change inside such an element marks the entire line, potentially hundreds or thousands of characters, as changed. Issue #53 documented this as “1000-character long lines” from HTML diffs.
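The effect on a line-based diff can be illustrated with a minimal sketch. The helper `changed_chars` and the sample strings are hypothetical and not part of Canon; the comparison is a deliberately naive index-by-index line diff.

```ruby
# Naive line diff: a line counts as changed when it differs from the line at
# the same index in the other version (illustration only, not Canon's diff).
def changed_chars(old_text, new_text)
  old_lines = old_text.lines(chomp: true)
  new_lines = new_text.lines(chomp: true)
  count = [old_lines.size, new_lines.size].max
  (0...count).sum do |i|
    old_lines[i] == new_lines[i] ? 0 : new_lines[i].to_s.length
  end
end

filler = "lorem ipsum " * 80 # surrounding text that does not change

# Mixed content kept on one line, as a standard pretty-printer emits it.
one_line_old = "<p>#{filler}See <xref target=\"M\"/></p>"
one_line_new = "<p>#{filler}See <xref target=\"N\"/></p>"

# The same content with every element starting on its own line.
split_old = "<p>\n  #{filler}See░\n  <xref target=\"M\"/>\n</p>"
split_new = "<p>\n  #{filler}See░\n  <xref target=\"N\"/>\n</p>"

puts changed_chars(one_line_old, one_line_new)
puts changed_chars(split_old, split_new)
```

In the single-line form the one-character attribute change flags the whole paragraph; in the split form only the `<xref>` line is flagged.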

Three-way whitespace classification

This serializer distinguishes three categories of element-level whitespace behaviour, configured via element-name lists:

Preserve (`preserve_whitespace_elements`): every whitespace character is significant. `" "` ≠ `"\n"`. Typical: `<pre>`, `<code>`, `<textarea>`. Whitespace-only text nodes are visualized character-by-character.
Collapse (`collapse_whitespace_elements`): presence ≠ absence, but all whitespace forms are equivalent: `" "` == `"\n "` == `"\t"`. Typical: `<p>`, `<li>`, `<td>`, heading elements. Whitespace-only text nodes are collapsed to a single `░` visualization, so `\n ` (indented fixture) and ` ` (compact source) both render as `░`: identical display lines, no spurious diff.
Strip (everything else, or explicit `strip_whitespace_elements`): all whitespace between child elements is structural formatting noise. `" "` == `"\n "` == nothing. Whitespace-only text nodes are silently dropped. Typical: `<section>`, `<ul>`, `<formattedref>`, `<bibitem>`.
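As a rough sketch of the three behaviours, the function below renders a whitespace-only text node for a given class. `render_whitespace` and `VIS_MAP` are hypothetical names; only the `░` marker appears in this documentation, and the newline and tab symbols are placeholders chosen for the example.

```ruby
# Hypothetical visualization map; only "░" for a space is taken from the text
# above, the other two symbols are placeholders for this sketch.
VIS_MAP = { " " => "░", "\n" => "↵", "\t" => "⇥" }.freeze

# Render a whitespace-only text node according to its whitespace class.
def render_whitespace(text, ws_class)
  case ws_class
  when :preserve then text.each_char.map { |ch| VIS_MAP.fetch(ch, ch) }.join
  when :collapse then "░" # any whitespace form becomes a single marker
  when :strip    then ""  # formatting noise, dropped entirely
  else raise ArgumentError, "unknown whitespace class: #{ws_class.inspect}"
  end
end

# Compare an indented-fixture node with a compact-source node in each class.
["\n  ", " "].each do |ws|
  rendered = %i[preserve collapse strip].map { |c| render_whitespace(ws, c) }
  puts rendered.map(&:inspect).join(" ")
end
```

Under collapse and strip the two inputs render identically, which is what removes the spurious diff; under preserve they stay distinct.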

Classification is ancestor-based: a text node’s class is determined by the closest matching ancestor. This means an unlisted element nested inside a collapse element inherits that element’s collapse (normalize) behaviour without needing to be listed explicitly.
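The closest-matching-ancestor rule can be sketched as follows. The `classify` function and the three constant sets are hypothetical; in particular, the order in which the lists are checked for a single element name is an assumption of this sketch, not documented behaviour.

```ruby
require "set"

# Example lists for the sketch; real lists come from the constructor options.
PRESERVE = Set["pre", "code"]
COLLAPSE = Set["p", "li"]
STRIP    = Set["ul"]

# ancestors: element names ordered from the nearest ancestor to the root.
def classify(ancestors)
  ancestors.each do |name|
    return :preserve if PRESERVE.include?(name)
    return :collapse if COLLAPSE.include?(name)
    return :strip    if STRIP.include?(name)
  end
  :strip # no ancestor is listed: fall back to strip
end

p classify(%w[em p section]) # em is unlisted, so the enclosing p decides
p classify(%w[code p])       # the nearest listed ancestor wins
p classify(%w[ul p])         # an explicit strip entry overrides an outer p
p classify(%w[section])      # nothing listed at all
```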

Format defaults

XML: all three lists are empty by default, so every element is whitespace-insensitive (strip). Whitespace sensitivity is opt-in, consistent with XML’s data-first usage.
HTML: built-in defaults are provided (but overridable):
- preserve: `pre`, `code`, `textarea`, `script`, `style`
- collapse: `p`, `li`, `dt`, `dd`, `td`, `th`, `h1`–`h6`, `caption`, `figcaption`, `label`, `legend`, `summary`, `blockquote`, `address`

Structural vs. content whitespace

**Structural whitespace** — indentation characters emitted by the serializer itself. These do not exist in the source document. They are rendered as ordinary ASCII space and newline characters.
**Content whitespace** — whitespace that exists as text-node content in the source document. Classification (above) decides how to render it.

The invariant is: every XML element always starts on its own line. Content whitespace is never confused with structural indentation.
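The separation between the two kinds of whitespace can be sketched with a toy serializer. `Element` and `serialize` are hypothetical and much simpler than the real class: the sketch treats all text as collapse-class and visualizes every whitespace run, which may differ from what Canon does inside running text.

```ruby
# Hypothetical minimal node type for this sketch; Canon works on parsed XML.
Element = Struct.new(:name, :children)

def serialize(node, depth = 0, indent: 2, lines: [])
  pad = " " * (indent * depth) # structural whitespace: plain ASCII spaces
  case node
  when Element
    if node.children.empty?
      lines << "#{pad}<#{node.name}/>"
    else
      lines << "#{pad}<#{node.name}>" # every element starts its own line
      node.children.each do |child|
        serialize(child, depth + 1, indent: indent, lines: lines)
      end
      lines << "#{pad}</#{node.name}>"
    end
  else
    # Content whitespace: made visible so it is never mistaken for indentation.
    lines << pad + node.to_s.gsub(/\s+/, "░")
  end
  lines
end

tree = Element.new("p", ["See ", Element.new("xref", [])])
puts serialize(tree).join("\n")
```

The leading spaces on each output line come only from `pad`; any whitespace that belongs to the document shows up as `░`.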

Example (normalize element <p>)

Input — compact source (Metanorma-style):

<p>See <xref target="M"/></p>

Input — indented fixture heredoc:

<p>
  See
  <xref target="M"/>
</p>

Both serialize to:

<p>
  See░
  <xref target="M"/>
</p>

Result: zero diff lines for a semantically identical document.

Example (insensitive element <formattedref>)

Input — compact source:

<formattedref><em>Cereals</em>.</formattedref>

Input — indented fixture:

<formattedref>
  <em>Cereals</em>.
</formattedref>

Both serialize to (whitespace-only nodes silently dropped):

<formattedref>
  <em>Cereals</em>
  .
</formattedref>

Result: zero diff lines.

Usage

printer = Canon::PrettyPrinter::XmlNormalized.new
formatted = printer.format(xml_string)

# With element lists (XML):
printer = Canon::PrettyPrinter::XmlNormalized.new(
  collapse_whitespace_elements: %w[p formattedref title],
  preserve_whitespace_elements: %w[sourcecode pre],
)

Instance Method Summary

#format(xml_string) ⇒ String

Format an XML string with mixed-content-aware serialization.
#initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false, html_mode: false) ⇒ XmlNormalized constructor

A new instance of XmlNormalized.

Constructor Details

#initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false, html_mode: false) ⇒ `XmlNormalized`

Returns a new instance of XmlNormalized.

Parameters:

indent (Integer) (defaults to: 2) —

number of indent characters per level (default 2)
indent_type (String) (defaults to: "space") —

“space” or “tab”
visualization_map (Hash, nil) (defaults to: nil) —

character visualization map
preserve_whitespace_elements (Array<String>) (defaults to: []) —

element names where every whitespace character is significant (e.g. pre, code).
collapse_whitespace_elements (Array<String>) (defaults to: []) —

element names where presence of whitespace matters but all forms are equivalent (e.g. p, li).
strip_whitespace_elements (Array<String>) (defaults to: []) —

explicit blacklist — these elements and their children always have whitespace dropped, even if an ancestor would otherwise be preserve or collapse.
pretty_printed (Boolean) (defaults to: false) —

when true, whitespace-only text nodes that begin with “\n” inside :collapse elements are treated as structural indentation and silently dropped. This matches the comparison-side behaviour activated by pretty_printed_expected / pretty_printed_received match options. Nodes under :preserve elements are always preserved; nodes under :strip elements are already dropped.
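The `pretty_printed` rule described above can be read as a small predicate. This is a sketch of that description only; `drop_whitespace_node?` is a hypothetical name and the real implementation may be organized differently.

```ruby
# Decide whether a whitespace-only text node is dropped from the output.
def drop_whitespace_node?(text, ws_class, pretty_printed:)
  return false unless text.match?(/\A\s+\z/) # only whitespace-only nodes qualify

  case ws_class
  when :strip    then true  # always dropped
  when :preserve then false # always kept
  when :collapse then pretty_printed && text.start_with?("\n")
  else raise ArgumentError, "unknown whitespace class: #{ws_class.inspect}"
  end
end

p drop_whitespace_node?("\n  ", :collapse, pretty_printed: true)
p drop_whitespace_node?("\n  ", :collapse, pretty_printed: false)
p drop_whitespace_node?(" ", :collapse, pretty_printed: true)
```

Only the collapse class is affected by the flag: a node such as `"\n  "` looks like indentation and is dropped, while a single space is still treated as content.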

# File 'lib/canon/pretty_printer/xml_normalized.rb', line 132

def initialize(indent: 2, indent_type: "space", visualization_map: nil,
               preserve_whitespace_elements: [],
               collapse_whitespace_elements: [],
               strip_whitespace_elements: [],
               pretty_printed: false,
               sort_attributes: false,
               html_mode: false)
  @indent = indent.to_i
  @indent_char = indent_type == "tab" ? "\t" : " "
  @vis_map = visualization_map || default_vis_map
  @pretty_printed = pretty_printed
  @sort_attributes = sort_attributes
  @html_mode = html_mode

  @strict_ws  = Set.new((preserve_whitespace_elements || []).map(&:to_s))
  @norm_ws    = Set.new((collapse_whitespace_elements || []).map(&:to_s))
  @insens_ws  = Set.new((strip_whitespace_elements || []).map(&:to_s))
end
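Two details of the constructor are easy to miss: the element lists accept `nil` and symbols, and any `indent_type` other than "tab" falls back to a space. The sketch below isolates both. `normalize_names` mirrors the `Set.new(...)` expression in the listing; `indent_prefix` is a hypothetical helper showing one way the indent settings could be combined per nesting level.

```ruby
require "set"

# Mirrors the list handling in the constructor: nil becomes an empty set and
# every entry is converted to a String.
def normalize_names(list)
  Set.new((list || []).map(&:to_s))
end

# Hypothetical helper: builds the indentation prefix for a nesting depth.
def indent_prefix(depth, indent: 2, indent_type: "space")
  indent_char = indent_type == "tab" ? "\t" : " "
  indent_char * (indent.to_i * depth)
end

p normalize_names(nil)
p normalize_names([:p, "p", :li])
p indent_prefix(2)
p indent_prefix(1, indent: 1, indent_type: "tab")
```

Because names are stored as strings in a `Set`, passing `:p` and `"p"` together yields a single entry, and membership checks stay constant-time.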

Instance Method Details

#format(xml_string) ⇒ `String`

Format an XML string with mixed-content-aware serialization.