Class: Canon::PrettyPrinter::XmlNormalized

Inherits:
Object
Defined in:
lib/canon/pretty_printer/xml_normalized.rb

Overview

Mixed-content-aware XML serializer for diff display preprocessing.

The mixed-content problem

Standard XML pretty-printers (including Nokogiri’s built-in serializer) keep elements that contain both text and child elements on a single line. They have no choice: inserting a newline between, say, `<p>See ` and `<xref…/>` would create a new whitespace text node, changing the document’s semantic content. For line-by-line diffs, the result is that any change inside such an element marks the entire line (potentially hundreds or thousands of characters) as changed. Issue #53 documented this as “1000-character long lines” from HTML diffs.

Three-way whitespace classification

This serializer distinguishes three categories of element-level whitespace behaviour, configured via element-name lists:

  • Preserve (`preserve_whitespace_elements`): every whitespace character is significant. `" "` ≠ `"\n"`. Typical: `<pre>`, `<code>`, `<textarea>`. Whitespace-only text nodes are visualized character by character.

  • Collapse (`collapse_whitespace_elements`): presence ≠ absence, but all whitespace forms are equivalent: `" "` == `"\n "` == `"\t"`. Typical: `<p>`, `<li>`, `<td>`, heading elements. Whitespace-only text nodes are collapsed to a single `░` visualization, so `<p>\n <em>` (indented fixture) and `<p> <em>` (compact source) both render as `<p>░<em>`: identical display lines, no spurious diff.

  • Strip (everything else, or explicit `strip_whitespace_elements`): all whitespace between child elements is structural formatting noise. `" "` == `"\n "` == nothing. Whitespace-only text nodes are silently dropped. Typical: `<section>`, `<ul>`, `<formattedref>`, `<bibitem>`.
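
The three rendering rules for a whitespace-only text node can be sketched as a single function. This is an illustration only: `hypothetical_vis_map` and `render_whitespace` are made-up names, and the glyphs for tab and newline are assumptions (the text above only confirms `░` as the collapsed form).

```ruby
# Hypothetical visualization map; the library's real default map is not shown
# in this document. Only "░" is taken from the description above.
def hypothetical_vis_map
  { " " => "░", "\t" => "→", "\n" => "↵" }
end

# Render a whitespace-only text node according to its whitespace class.
# Returns nil when the node should be dropped from the output.
def render_whitespace(text, ws_class)
  case ws_class
  when :preserve
    # Every character is significant: visualize each one individually.
    text.each_char.map { |ch| hypothetical_vis_map.fetch(ch, ch) }.join
  when :collapse
    # Presence matters, form does not: any run becomes a single marker.
    text.empty? ? nil : "░"
  else
    # :strip (the default class): formatting noise, dropped silently.
    nil
  end
end

render_whitespace("\n ", :collapse) # => "░"
render_whitespace(" ", :collapse)   # => "░"
render_whitespace(" ", :strip)      # => nil
```

Under the collapse class, both the indented and the compact form map to the same marker, which is what removes the spurious diff.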

Classification is ancestor-based: a text node’s class is determined by the closest matching ancestor. This means `<em>` inside `<p>` inherits `<p>`’s collapse behaviour without needing to be listed explicitly.
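
A minimal sketch of the closest-ancestor lookup, assuming the ancestor chain is available as a list of element names ordered from the text node's parent outward. The function name `whitespace_class` is hypothetical, and checking the strip list first when one name appears in several lists is an assumption, not confirmed behaviour.

```ruby
require "set"

# ancestors: element names from the text node's parent outward to the root.
# The first ancestor found in any list decides the class; with no match the
# node falls into the default :strip class.
def whitespace_class(ancestors, preserve:, collapse:, strip:)
  ancestors.each do |name|
    return :strip    if strip.include?(name)    # assumption: strip wins ties
    return :preserve if preserve.include?(name)
    return :collapse if collapse.include?(name)
  end
  :strip
end

lists = {
  preserve: Set["pre"],
  collapse: Set["p"],
  strip:    Set["bibitem"],
}

whitespace_class(%w[em p section], **lists) # => :collapse (inherited from <p>)
whitespace_class(%w[bibitem p], **lists)    # => :strip (closer ancestor wins)
whitespace_class(%w[section], **lists)      # => :strip (no match, default)
```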

Format defaults

  • XML: all three lists are empty by default, so every element falls into the strip class (whitespace-insensitive everywhere). Whitespace sensitivity is opt-in, consistent with XML’s data-first usage.

  • HTML: built-in defaults are provided (but overridable):

    • preserve: `pre`, `code`, `textarea`, `script`, `style`

    • collapse: `p`, `li`, `dt`, `dd`, `td`, `th`, `h1`–`h6`, `caption`, `figcaption`, `label`, `legend`, `summary`, `blockquote`, `address`
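
For reference, the HTML default lists above could be written out as plain Ruby data, for example to pass the same names through the documented `preserve_whitespace_elements:` and `collapse_whitespace_elements:` keyword arguments. The method name `html_whitespace_defaults` is hypothetical; how the library stores its defaults internally is not shown here.

```ruby
# Hypothetical helper; only the element names come from the lists above.
def html_whitespace_defaults
  {
    preserve: %w[pre code textarea script style],
    collapse: %w[p li dt dd td th] +
              (1..6).map { |n| "h#{n}" } + # h1 through h6
              %w[caption figcaption label legend summary blockquote address],
  }
end

defaults = html_whitespace_defaults
defaults[:collapse].include?("h3") # => true
```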

Structural vs. content whitespace

  • **Structural whitespace** — indentation characters emitted by the serializer itself. These do not exist in the source document. They are rendered as ordinary ASCII space and newline characters.

  • **Content whitespace** — whitespace that exists as text-node content in the source document. Classification (above) decides how to render it.

The invariant is: every XML element starts on its own line. Content whitespace is never confused with structural indentation.

Example (collapse element <p>)

Input — compact source (Metanorma-style):

<p>See <xref target="M"/></p>

Input — indented fixture heredoc:

<p>
  See
  <xref target="M"/>
</p>

Both serialize to:

<p>
  See░
  <xref target="M"/>
</p>

Result: zero diff lines for a semantically identical document.

Example (strip element <formattedref>)

Input — compact source:

<formattedref><em>Cereals</em>.</formattedref>

Input — indented fixture:

<formattedref>
  <em>Cereals</em>.
</formattedref>

Both serialize to (whitespace-only nodes silently dropped):

<formattedref>
  <em>Cereals</em>
  .
</formattedref>

Result: zero diff lines.
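
The strip behaviour in this example can be reproduced with a toy serializer over a hand-built tree. This is a sketch under stated assumptions, not the library's code: an element is modelled as `[name, children]`, a text node as a `String`, attributes are ignored, and `serialize_strip` is a hypothetical name. It also assumes that text-only elements stay on one line and that surrounding whitespace on mixed text is trimmed, which is what the output above suggests.

```ruby
# Serialize a toy tree as if every element were in the strip class.
# Returns an array of output lines.
def serialize_strip(node, depth = 0, indent: 2)
  pad = " " * (depth * indent) # structural indentation: plain ASCII spaces

  if node.is_a?(String)
    text = node.strip
    return text.empty? ? [] : [pad + text] # whitespace-only nodes vanish
  end

  name, children = node
  if children.all? { |c| c.is_a?(String) }
    # Text-only element: keep it on a single line.
    return ["#{pad}<#{name}>#{children.join.strip}</#{name}>"]
  end

  inner = children.flat_map { |c| serialize_strip(c, depth + 1, indent: indent) }
  ["#{pad}<#{name}>", *inner, "#{pad}</#{name}>"]
end

compact  = ["formattedref", [["em", ["Cereals"]], "."]]
indented = ["formattedref", ["\n  ", ["em", ["Cereals"]], ".\n"]]

puts serialize_strip(compact)
serialize_strip(compact) == serialize_strip(indented) # => true
```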

Usage

printer = Canon::PrettyPrinter::XmlNormalized.new
formatted = printer.format(xml_string)

# With element lists (XML):
printer = Canon::PrettyPrinter::XmlNormalized.new(
  collapse_whitespace_elements: %w[p formattedref title],
  preserve_whitespace_elements: %w[sourcecode pre],
)

Instance Method Summary

Constructor Details

#initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false) ⇒ XmlNormalized

Returns a new instance of XmlNormalized.

Parameters:

  • indent (Integer) (defaults to: 2)

    number of indent characters per level

  • indent_type (String) (defaults to: "space")

    “space” or “tab”

  • visualization_map (Hash, nil) (defaults to: nil)

    character visualization map

  • preserve_whitespace_elements (Array<String>) (defaults to: [])

    element names where every whitespace character is significant (e.g. pre, code).

  • collapse_whitespace_elements (Array<String>) (defaults to: [])

    element names where presence of whitespace matters but all forms are equivalent (e.g. p, li).

  • strip_whitespace_elements (Array<String>) (defaults to: [])

    explicit blacklist — these elements and their children always have whitespace dropped, even if an ancestor would otherwise be preserve or collapse.

  • pretty_printed (Boolean) (defaults to: false)

    when true, whitespace-only text nodes that begin with "\n" inside :collapse elements are treated as structural indentation and silently dropped. This matches the comparison-side behaviour activated by pretty_printed_expected / pretty_printed_received match options. Nodes under :preserve elements are always preserved; nodes under :strip elements are already dropped.
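
The `pretty_printed` rule described above can be read as a predicate over a text node and its whitespace class. The sketch below is one possible reading; `structural_indentation?` is a hypothetical name and does not appear in the library source shown on this page.

```ruby
# True when a text node should be dropped as structural indentation:
# pretty_printed is on, the node sits under a :collapse element, it starts
# with a newline, and it contains nothing but whitespace.
def structural_indentation?(text, ws_class, pretty_printed:)
  pretty_printed &&
    ws_class == :collapse &&
    text.start_with?("\n") &&
    text.match?(/\A\s+\z/)
end

structural_indentation?("\n  ", :collapse, pretty_printed: true)  # => true
structural_indentation?(" ", :collapse, pretty_printed: true)     # => false
structural_indentation?("\n  ", :collapse, pretty_printed: false) # => false
```

A single space between inline elements is still kept in this mode, because it does not begin with a newline.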



# File 'lib/canon/pretty_printer/xml_normalized.rb', line 131

def initialize(indent: 2, indent_type: "space", visualization_map: nil,
               preserve_whitespace_elements: [],
               collapse_whitespace_elements: [],
               strip_whitespace_elements: [],
               pretty_printed: false,
               sort_attributes: false)
  @indent = indent.to_i
  @indent_char = indent_type == "tab" ? "\t" : " "
  @vis_map = visualization_map || default_vis_map
  @pretty_printed = pretty_printed
  @sort_attributes = sort_attributes

  @strict_ws  = Set.new((preserve_whitespace_elements || []).map(&:to_s))
  @norm_ws    = Set.new((collapse_whitespace_elements || []).map(&:to_s))
  @insens_ws  = Set.new((strip_whitespace_elements || []).map(&:to_s))
end
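
Two details of the constructor are easy to miss: element lists may be `nil` or contain symbols (they are coerced into a `Set` of strings), and any `indent_type` other than `"tab"` falls back to a space. The standalone helpers below mirror those two expressions for illustration; the helper names are hypothetical.

```ruby
require "set"

# Mirrors Set.new((list || []).map(&:to_s)) from the constructor.
def normalize_element_list(list)
  Set.new((list || []).map(&:to_s))
end

# Mirrors indent_type == "tab" ? "\t" : " " from the constructor.
def indent_char_for(indent_type)
  indent_type == "tab" ? "\t" : " "
end

normalize_element_list(nil).empty?     # => true
normalize_element_list([:p, "li", :p]) # set containing "p" and "li"
indent_char_for("tabs")                # => " " (only the exact string "tab" selects a tab)
```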

Instance Method Details

#format(xml_string) ⇒ String

Format an XML string with mixed-content-aware serialization.

Parameters:

  • xml_string (String)

    Input XML

Returns:

  • (String)

    Serialized XML, one node per line, with content whitespace visualized at line boundaries



# File 'lib/canon/pretty_printer/xml_normalized.rb', line 153

def format(xml_string)
  doc = Nokogiri::XML(xml_string)
  lines = []

  if doc.version
    enc = doc.encoding ? " encoding=\"#{doc.encoding}\"" : ""
    lines << "<?xml version=\"#{doc.version}\"#{enc}?>"
  end

  lines << serialize_element(doc.root, 0) if doc.root
  lines.join("\n")
end
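
The declaration handling in `#format` can be isolated as a pure function, which makes its edge cases easier to see. This sketch assumes the version and encoding are passed in as plain values rather than read from a Nokogiri document; `xml_declaration` and `join_output` are hypothetical names.

```ruby
# Build the XML declaration line, or nil when the document has no version.
# The encoding attribute is emitted only when an encoding is known.
def xml_declaration(version, encoding)
  return nil unless version

  enc = encoding ? " encoding=\"#{encoding}\"" : ""
  "<?xml version=\"#{version}\"#{enc}?>"
end

# Join the optional declaration and the serialized body the way #format
# joins its lines: with "\n" and no trailing newline.
def join_output(declaration, body)
  [declaration, body].compact.join("\n")
end

xml_declaration("1.0", "UTF-8") # => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
xml_declaration("1.0", nil)     # => "<?xml version=\"1.0\"?>"
join_output(nil, "<a/>")        # => "<a/>"
```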