Class: Canon::PrettyPrinter::XmlNormalized

Inherits:
Object
Defined in:
lib/canon/pretty_printer/xml_normalized.rb

Overview

Mixed-content-aware XML serializer for diff display preprocessing.

The mixed-content problem

Standard XML pretty-printers (including Nokogiri’s built-in serializer) keep elements that contain both text and child elements on a single line. They have no choice: inserting a newline between, say, `<p>See ` and `<xref…/>` would introduce new whitespace text content, changing the document’s semantic content. For line-by-line diffs, the result is that any change inside such an element forces the entire line (potentially hundreds or thousands of characters) to be marked as changed. Issue #53 documented this as “1000-character long lines” from HTML diffs.
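To make the cost concrete, here is a minimal sketch that is not part of Canon. `changed_lines` is a hypothetical helper that stands in for a real line-based diff, and the split serialization is written by hand in the style of the examples in this document.

```ruby
# Hypothetical helper: naive positional line comparison. Returns the lines of
# new_text that differ from the line at the same index in old_text.
def changed_lines(old_text, new_text)
  old_lines = old_text.lines(chomp: true)
  new_text.lines(chomp: true)
          .each_with_index
          .reject { |line, i| line == old_lines[i] }
          .map(&:first)
end

# Mixed content kept on a single line, as a standard pretty-printer would do.
single_old = %(<p>See <xref target="M"/> for details.</p>)
single_new = %(<p>See <xref target="N"/> for details.</p>)

# The same content with one node per line (hand-written illustration).
split_old = %(<p>\n  See░\n  <xref target="M"/>\n  ░for details.\n</p>)
split_new = %(<p>\n  See░\n  <xref target="N"/>\n  ░for details.\n</p>)

whole_line = changed_lines(single_old, single_new) # the entire <p> line
one_node   = changed_lines(split_old, split_new)   # only the <xref> line
```

With one node per line, the changed region shrinks from the whole paragraph to the single element that actually differs.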

Three-way whitespace classification

This serializer distinguishes three categories of element-level whitespace behaviour, configured via element-name lists:

  • Preserve (`preserve_whitespace_elements`): every whitespace character is significant. `" "` ≠ `"\n"`. Typical: `<pre>`, `<code>`, `<textarea>`. Whitespace-only text nodes are visualized character-by-character.

  • Collapse (`collapse_whitespace_elements`): presence ≠ absence, but all whitespace forms are equivalent: `" "` == `"\n "` == `"\t"`. Typical: `<p>`, `<li>`, `<td>`, heading elements. Whitespace-only text nodes are collapsed to a single `░` visualization, so `<p>\n <em>` (indented fixture) and `<p> <em>` (compact source) both render as `<p>░<em>`: identical display lines, no spurious diff.

  • Strip (everything else, or explicit `strip_whitespace_elements`): all whitespace between child elements is structural formatting noise. `" "` == `"\n "` == nothing. Whitespace-only text nodes are silently dropped. Typical: `<section>`, `<ul>`, `<formattedref>`, `<bibitem>`.
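The three behaviours can be sketched as a single function over a whitespace-only text node. This is an illustration, not Canon’s implementation: `render_whitespace_node` and `WS_SYMBOLS` are hypothetical names, and only `░` is a symbol taken from this documentation; the newline and tab symbols are made up for the example.

```ruby
# Hypothetical visualization symbols. Only "░" appears in this documentation.
WS_SYMBOLS = { " " => "░", "\n" => "↵", "\t" => "⇥" }.freeze

# Renders a whitespace-only text node according to its class.
# Returns nil when the node produces no output at all.
def render_whitespace_node(text, ws_class)
  case ws_class
  when :preserve
    # Every character is significant: visualize each one individually.
    text.each_char.map { |c| WS_SYMBOLS.fetch(c, c) }.join
  when :collapse
    # Presence matters, form does not: always a single marker.
    "░"
  else
    # :strip, whitespace is formatting noise and is dropped.
    nil
  end
end

render_whitespace_node("\n ", :collapse) # same result as for " "
render_whitespace_node("\n ", :strip)    # nil, nothing is emitted
```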

Classification is ancestor-based: a text node’s class is determined by the closest matching ancestor. This means `<em>` inside `<p>` inherits `<p>`’s collapse behaviour without needing to be listed explicitly.
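A closest-ancestor lookup of this kind might look as follows. This is a sketch under assumptions: `classify_whitespace` is a hypothetical helper, the ancestor chain is passed in as plain element names (nearest first), and the order in which the three lists are checked for a single name is a guess, since the source does not say what happens when a name appears in more than one list.

```ruby
require "set"

# Hypothetical helper. `ancestors` lists element names from the nearest
# enclosing element outward to the root.
def classify_whitespace(ancestors, preserve:, collapse:, strip:)
  ancestors.each do |name|
    # The first ancestor found in any list decides the class.
    return :preserve if preserve.include?(name)
    return :collapse if collapse.include?(name)
    return :strip    if strip.include?(name)
  end
  :strip # no ancestor is listed: whitespace is formatting noise
end

lists = { preserve: Set["pre"], collapse: Set["p"], strip: Set["ul"] }

# Text inside <section><p><em>…: <em> is unlisted, <p> is the closest match.
classify_whitespace(%w[em p section], **lists)
```

Walking outward from the text node and stopping at the first listed name also gives an explicitly listed strip element precedence over a preserve or collapse element further up the tree.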

Format defaults

  • XML: all three lists are empty by default, so every element falls into the strip class. Whitespace sensitivity is opt-in, consistent with XML’s data-first usage.

  • HTML: built-in defaults are provided (but overridable):

    • preserve: `pre`, `code`, `textarea`, `script`, `style`

    • collapse: `p`, `li`, `dt`, `dd`, `td`, `th`, `h1`–`h6`, `caption`, `figcaption`, `label`, `legend`, `summary`, `blockquote`, `address`
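Written out as data, the HTML defaults listed above could look like this. The constant name `HTML_WHITESPACE_DEFAULTS` is hypothetical; how and where the library actually stores and applies these defaults is not shown in this documentation.

```ruby
# Hypothetical constant holding the HTML defaults listed above.
HTML_WHITESPACE_DEFAULTS = {
  preserve: %w[pre code textarea script style],
  collapse: %w[p li dt dd td th] +
            (1..6).map { |n| "h#{n}" } + # h1 through h6
            %w[caption figcaption label legend summary blockquote address],
}.freeze

# Overriding: extend the collapse list with an extra element name.
custom_collapse = HTML_WHITESPACE_DEFAULTS[:collapse] + %w[div]

# The resulting lists would then be passed to the constructor documented
# below, for example as collapse_whitespace_elements: custom_collapse.
```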

Structural vs. content whitespace

  • **Structural whitespace** — indentation characters emitted by the serializer itself. These do not exist in the source document. They are rendered as ordinary ASCII space and newline characters.

  • **Content whitespace** — whitespace that exists as text-node content in the source document. Classification (above) decides how to render it.

The invariant is: every XML element always starts on its own line. Content whitespace is never confused with structural indentation.
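The separation can be illustrated with a small sketch. `emit_text_line` is a hypothetical helper, not Canon’s serializer: it assumes collapse semantics, assumes the text contains at least one non-whitespace character, and uses `░` as the only visualization symbol.

```ruby
# Hypothetical helper: builds one output line for a text node.
# Structural indentation is plain ASCII; content whitespace at either end of
# the text is shown as a visible marker, so the two can never be confused.
def emit_text_line(text, depth, indent: 2, indent_char: " ")
  structural = indent_char * (indent * depth) # made by the serializer
  lead  = text.match?(/\A\s/) ? "░" : ""      # content whitespace before
  trail = text.match?(/\s\z/) ? "░" : ""      # content whitespace after
  "#{structural}#{lead}#{text.strip}#{trail}"
end

emit_text_line("See ", 1) # two ASCII spaces of indentation, then "See░"
```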

Example (collapse element <p>)

Input — compact source (Metanorma-style):

<p>See <xref target="M"/></p>

Input — indented fixture heredoc:

<p>
  See
  <xref target="M"/>
</p>

Both serialize to:

<p>
  See░
  <xref target="M"/>
</p>

Result: zero diff lines for a semantically identical document.

Example (strip element <formattedref>)

Input — compact source:

<formattedref><em>Cereals</em>.</formattedref>

Input — indented fixture:

<formattedref>
  <em>Cereals</em>.
</formattedref>

Both serialize to (whitespace-only nodes silently dropped):

<formattedref>
  <em>Cereals</em>
  .
</formattedref>

Result: zero diff lines.

Usage

printer = Canon::PrettyPrinter::XmlNormalized.new
formatted = printer.format(xml_string)

# With element lists (XML):
printer = Canon::PrettyPrinter::XmlNormalized.new(
  collapse_whitespace_elements: %w[p formattedref title],
  preserve_whitespace_elements: %w[sourcecode pre],
)

Instance Method Summary

Constructor Details

#initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false, html_mode: false) ⇒ XmlNormalized

Returns a new instance of XmlNormalized.

Parameters:

  • indent (Integer) (defaults to: 2)

    number of indent characters per nesting level

  • indent_type (String) (defaults to: "space")

    “space” or “tab”

  • visualization_map (Hash, nil) (defaults to: nil)

    character visualization map

  • preserve_whitespace_elements (Array<String>) (defaults to: [])

    element names where every whitespace character is significant (e.g. pre, code).

  • collapse_whitespace_elements (Array<String>) (defaults to: [])

    element names where presence of whitespace matters but all forms are equivalent (e.g. p, li).

  • strip_whitespace_elements (Array<String>) (defaults to: [])

    explicit blacklist — these elements and their children always have whitespace dropped, even if an ancestor would otherwise be preserve or collapse.

  • pretty_printed (Boolean) (defaults to: false)

    when true, whitespace-only text nodes that begin with "\n" inside :collapse elements are treated as structural indentation and silently dropped. This matches the comparison-side behaviour activated by pretty_printed_expected / pretty_printed_received match options. Nodes under :preserve elements are always preserved; nodes under :strip elements are already dropped.
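The pretty_printed rule can be read as a predicate over a text node and its class. The following is a sketch of that reading only; `drop_as_structural?` is a hypothetical name and the real check may be structured differently.

```ruby
# Hypothetical predicate: should this text node be treated as structural
# indentation and dropped because pretty_printed is enabled?
def drop_as_structural?(text, ws_class, pretty_printed:)
  pretty_printed &&
    ws_class == :collapse &&   # :preserve keeps everything, :strip already drops
    text.start_with?("\n") &&  # looks like indentation after a line break
    text.strip.empty?          # whitespace-only
end

drop_as_structural?("\n  ", :collapse, pretty_printed: true)  # dropped
drop_as_structural?(" ",    :collapse, pretty_printed: true)  # kept, shown as ░
```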



# File 'lib/canon/pretty_printer/xml_normalized.rb', line 132

def initialize(indent: 2, indent_type: "space", visualization_map: nil,
               preserve_whitespace_elements: [],
               collapse_whitespace_elements: [],
               strip_whitespace_elements: [],
               pretty_printed: false,
               sort_attributes: false,
               html_mode: false)
  @indent = indent.to_i
  @indent_char = indent_type == "tab" ? "\t" : " "
  @vis_map = visualization_map || default_vis_map
  @pretty_printed = pretty_printed
  @sort_attributes = sort_attributes
  @html_mode = html_mode

  @strict_ws  = Set.new((preserve_whitespace_elements || []).map(&:to_s))
  @norm_ws    = Set.new((collapse_whitespace_elements || []).map(&:to_s))
  @insens_ws  = Set.new((strip_whitespace_elements || []).map(&:to_s))
end
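The last three assignments normalize each element list in the same way. Pulled out into a standalone helper (the name `normalize_names` is hypothetical), the behaviour is easier to see: nil is tolerated, symbols and strings are treated alike, and duplicates disappear.

```ruby
require "set"

# Hypothetical helper mirroring the expression used in the constructor above.
def normalize_names(names)
  Set.new((names || []).map(&:to_s))
end

normalize_names([:p, "p", "li"]) # symbols and strings collapse to one entry
normalize_names(nil)             # an empty set, not an error
```

Storing the names as a Set keeps membership checks cheap when every text node has to be classified.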

Instance Method Details

#format(xml_string) ⇒ String

Format an XML string with mixed-content-aware serialization.

Parameters:

  • xml_string (String)

    Input XML

Returns:

  • (String)

    Serialized XML, one node per line, with content whitespace visualized at line boundaries



# File 'lib/canon/pretty_printer/xml_normalized.rb', line 156

def format(xml_string)
  doc = if Canon::XmlBackend.moxml?
          Canon::XmlParsing.parse(xml_string)
        elsif @html_mode
          Nokogiri::HTML5(xml_string)
        else
          Nokogiri::XML(xml_string)
        end
  lines = []

  if !@html_mode && doc.version
    enc = doc.encoding ? " encoding=\"#{doc.encoding}\"" : ""
    lines << "<?xml version=\"#{doc.version}\"#{enc}?>"
  end

  lines << serialize_element(doc.root, 0) if doc.root
  lines.join("\n")
end
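The XML declaration handling in the method above can be isolated into a pure function for illustration. `xml_declaration` is a hypothetical helper; it takes the version and encoding as plain values instead of reading them from a parsed document.

```ruby
# Hypothetical helper mirroring the declaration logic in #format.
# Returns nil when the document carries no version, in which case no
# declaration line is emitted.
def xml_declaration(version, encoding)
  return nil unless version

  enc = encoding ? %( encoding="#{encoding}") : ""
  %(<?xml version="#{version}"#{enc}?>)
end

lines = []
decl = xml_declaration("1.0", "UTF-8")
lines << decl if decl
lines << "<root/>"
output = lines.join("\n") # declaration on the first line, then the root
```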