Class: Canon::PrettyPrinter::XmlNormalized
- Inherits: Object
- Defined in: lib/canon/pretty_printer/xml_normalized.rb
Overview
Mixed-content-aware XML serializer for diff display preprocessing.
The mixed-content problem
Standard XML pretty-printers (including Nokogiri’s built-in serializer) keep elements that contain both text and child elements on a single line. They have no choice: inserting a newline between, say, `<p>See ` and `<xref…/>` would create a new whitespace text node, changing the document’s semantic content. For line-by-line diffs, the result is that any change inside such an element forces the entire line (potentially hundreds or thousands of characters) to be marked as changed. Issue #53 documented this as “1000-character long lines” from HTML diffs.
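The cost of single-line mixed content can be illustrated with a small sketch. The helper name `flagged_chars` and the sample strings are hypothetical, and the comparison is a deliberately naive pairwise line comparison rather than the diff algorithm Canon uses.

```ruby
# Count the characters a naive line-by-line diff would flag as changed:
# any line that differs at all is flagged in full.
# Assumption: both texts have the same number of lines, so lines can be
# compared pairwise; a real diff would align lines first.
def flagged_chars(old_text, new_text)
  old_lines = old_text.lines(chomp: true)
  new_lines = new_text.lines(chomp: true)
  old_lines.zip(new_lines).sum do |old_line, new_line|
    old_line == new_line ? 0 : new_line.length
  end
end

filler = "word " * 200 # 1000 characters of unchanged text

# Mixed content kept on one line: the whole line is flagged.
puts flagged_chars(
  "<p>#{filler}<xref target=\"M\"/></p>",
  "<p>#{filler}<xref target=\"N\"/></p>"
) # => 1025

# Same content with each element on its own line: only the xref line is flagged.
puts flagged_chars(
  "<p>\n#{filler}\n<xref target=\"M\"/>\n</p>",
  "<p>\n#{filler}\n<xref target=\"N\"/>\n</p>"
) # => 18
```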
Three-way whitespace classification
This serializer distinguishes three categories of element-level whitespace behaviour, configured via element-name lists:
- Preserve (`preserve_whitespace_elements`): every whitespace character is significant. `" "` ≠ `"\n"`. Typical: `<pre>`, `<code>`, `<textarea>`. Whitespace-only text nodes are visualized character-by-character.
- Collapse (`collapse_whitespace_elements`): presence ≠ absence, but all whitespace forms are equivalent: `" "` == `"\n "` == `"\t"`. Typical: `<p>`, `<li>`, `<td>`, heading elements. Whitespace-only text nodes are collapsed to a single `░` visualization, so `<p>\n <em>` (indented fixture) and `<p> <em>` (compact source) both render as `<p>░<em>`: identical display lines, no spurious diff.
- Strip (everything else, or explicit `strip_whitespace_elements`): all whitespace between child elements is structural formatting noise. `" "` == `"\n "` == nothing. Whitespace-only text nodes are silently dropped. Typical: `<section>`, `<ul>`, `<formattedref>`, `<bibitem>`.
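The three behaviours for whitespace-only text nodes can be sketched as a single dispatch. This is an illustration only: the method name `render_whitespace_only` and the symbols in `ASSUMED_VIS_MAP` are assumptions, since the serializer's real visualization map is not shown in this section. Only the `░` marker for collapsed whitespace comes from the text above.

```ruby
# Hypothetical per-character symbols for the preserve class; the real
# defaults live in the serializer's visualization map.
ASSUMED_VIS_MAP = { " " => "░", "\t" => "⇥", "\n" => "↵" }.freeze

# Marker for a collapsed whitespace run, as described in the list above.
COLLAPSED_MARK = "░"

# Render a whitespace-only text node according to its whitespace class.
# Returns nil when the node should not appear in the output at all.
def render_whitespace_only(text, ws_class)
  case ws_class
  when :preserve
    # Every character is significant, so each one gets its own symbol.
    text.each_char.map { |ch| ASSUMED_VIS_MAP.fetch(ch, ch) }.join
  when :collapse
    # Presence matters, form does not: always one marker.
    COLLAPSED_MARK
  when :strip
    # Formatting noise: dropped.
    nil
  else
    raise ArgumentError, "unknown whitespace class: #{ws_class.inspect}"
  end
end
```

With this sketch, `" "` and `"\n "` render identically under `:collapse`, differently under `:preserve`, and not at all under `:strip`.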
Classification is ancestor-based: a text node’s class is determined by the closest matching ancestor. This means `<em>` inside `<p>` inherits `<p>`’s collapse behaviour without needing to be listed explicitly.
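A minimal sketch of closest-ancestor classification, assuming the ancestor element names are available as an array ordered nearest first. The method name `whitespace_class` is hypothetical, and the precedence used when one element appears in several lists (preserve, then collapse, then strip) is an assumption rather than documented behaviour.

```ruby
require "set"

# Classify a text node from its ancestor element names, nearest first.
# The closest ancestor found in any list decides the class; if no ancestor
# matches, the node falls into the default :strip class.
# Assumption: an element listed in several lists is treated as preserve
# first, then collapse, then strip.
def whitespace_class(ancestor_names, preserve:, collapse:, strip: Set.new)
  ancestor_names.each do |name|
    return :preserve if preserve.include?(name)
    return :collapse if collapse.include?(name)
    return :strip if strip.include?(name)
  end
  :strip
end

preserve = Set["pre", "code"]
collapse = Set["p", "li"]

# Text inside <section><p><em>: <em> is unlisted, <p> is the closest match.
puts whitespace_class(%w[em p section], preserve: preserve, collapse: collapse)
# => collapse
```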
Format defaults
- XML: all three lists are empty by default, so every element falls into the strip class. Whitespace sensitivity is opt-in, consistent with XML’s data-first usage.
- HTML: built-in defaults are provided (but overridable):
  - preserve: `pre`, `code`, `textarea`, `script`, `style`
  - collapse: `p`, `li`, `dt`, `dd`, `td`, `th`, `h1`–`h6`, `caption`, `figcaption`, `label`, `legend`, `summary`, `blockquote`, `address`
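Written out as Ruby arrays, the HTML defaults listed above might look like the following sketch. The constant names are hypothetical; only the element names come from the list. Arrays of this shape could be passed as the `preserve_whitespace_elements:` and `collapse_whitespace_elements:` keywords shown in the constructor signature.

```ruby
# Hypothetical constant names; the element names are the HTML defaults
# listed above.
HTML_PRESERVE_DEFAULTS = %w[pre code textarea script style].freeze

HTML_COLLAPSE_DEFAULTS = (
  %w[p li dt dd td th] +
  (1..6).map { |level| "h#{level}" } + # expands the h1–h6 range
  %w[caption figcaption label legend summary blockquote address]
).freeze

# Overriding by extension: add an element to a default list instead of
# replacing the list wholesale.
custom_collapse = HTML_COLLAPSE_DEFAULTS + %w[title]
puts custom_collapse.include?("title") # => true
```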
Structural vs. content whitespace
- **Structural whitespace**: indentation characters emitted by the serializer itself. These do not exist in the source document. They are rendered as ordinary ASCII space and newline characters.
- **Content whitespace**: whitespace that exists as text-node content in the source document. Classification (above) decides how to render it.
The invariant is: every XML element always starts on its own line. Content whitespace is never confused with structural indentation.
Example (collapse element <p>)
Input — compact source (Metanorma-style):
<p>See <xref target="M"/></p>
Input — indented fixture heredoc:
<p>
  See
  <xref target="M"/>
</p>
Both serialize to:
<p>
  See░
  <xref target="M"/>
</p>
Result: zero diff lines for a semantically identical document.
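The following sketch reproduces this example on a hand-built tree to show how structural indentation and content whitespace stay separate. It is not the library's implementation. `Element` and `serialize_collapsed` are hypothetical names, and the boundary rule (whitespace touching the parent's opening or closing tag is dropped, every other whitespace run becomes one `░`) is an assumption chosen so that both inputs produce the output shown above.

```ruby
# Minimal tree model: an element has a name, an attribute hash, and a list
# of children that are either Element instances or plain Strings (text).
Element = Struct.new(:name, :attrs, :children)

MARK = "░" # marker for collapsed content whitespace

# Serialize an element whose text children follow collapse rules.
# Returns an array of output lines.
def serialize_collapsed(element, depth = 0, indent: 2)
  pad = " " * (indent * depth) # structural indentation: plain ASCII spaces
  attrs = element.attrs.map { |key, value| %( #{key}="#{value}") }.join
  return ["#{pad}<#{element.name}#{attrs}/>"] if element.children.empty?

  lines = ["#{pad}<#{element.name}#{attrs}>"]
  last_index = element.children.length - 1
  element.children.each_with_index do |child, index|
    if child.is_a?(Element)
      # Every element starts on its own line, one level deeper.
      lines.concat(serialize_collapsed(child, depth + 1, indent: indent))
    else
      text = child
      text = text.lstrip if index.zero?          # assumed boundary rule
      text = text.rstrip if index == last_index  # assumed boundary rule
      next if text.empty?
      # Content whitespace is visualized, never emitted as raw spaces.
      lines << "#{pad}#{' ' * indent}#{text.gsub(/\s+/, MARK)}"
    end
  end
  lines << "#{pad}</#{element.name}>"
end

xref = Element.new("xref", { "target" => "M" }, [])
compact = Element.new("p", {}, ["See ", xref])
indented = Element.new("p", {}, ["\n  See\n  ", xref, "\n"])

puts serialize_collapsed(compact) == serialize_collapsed(indented) # => true
```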
Example (strip element <formattedref>)
Input — compact source:
<formattedref><em>Cereals</em>.</formattedref>
Input — indented fixture:
<formattedref>
  <em>Cereals</em>.
</formattedref>
Both serialize to (whitespace-only nodes silently dropped):
<formattedref>
  <em>Cereals</em>
  .
</formattedref>
Result: zero diff lines.
Usage
printer = Canon::PrettyPrinter::XmlNormalized.new
formatted = printer.format(xml_string)
# With element lists (XML):
printer = Canon::PrettyPrinter::XmlNormalized.new(
  collapse_whitespace_elements: %w[p formattedref title],
  preserve_whitespace_elements: %w[sourcecode pre],
)
Instance Method Summary
- #format(xml_string) ⇒ String
  Format an XML string with mixed-content-aware serialization.
- #initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false) ⇒ XmlNormalized (constructor)
  A new instance of XmlNormalized.
Constructor Details
#initialize(indent: 2, indent_type: "space", visualization_map: nil, preserve_whitespace_elements: [], collapse_whitespace_elements: [], strip_whitespace_elements: [], pretty_printed: false, sort_attributes: false) ⇒ XmlNormalized
Returns a new instance of XmlNormalized.
# File 'lib/canon/pretty_printer/xml_normalized.rb', line 131
def initialize(indent: 2, indent_type: "space", visualization_map: nil,
               preserve_whitespace_elements: [],
               collapse_whitespace_elements: [],
               strip_whitespace_elements: [],
               pretty_printed: false, sort_attributes: false)
  @indent = indent.to_i
  @indent_char = indent_type == "tab" ? "\t" : " "
  @vis_map = visualization_map || default_vis_map
  @pretty_printed = pretty_printed
  @sort_attributes = sort_attributes
  @strict_ws = Set.new((preserve_whitespace_elements || []).map(&:to_s))
  @norm_ws = Set.new((collapse_whitespace_elements || []).map(&:to_s))
  @insens_ws = Set.new((strip_whitespace_elements || []).map(&:to_s))
end
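Two details of the constructor are easy to miss: element lists may be `nil` or contain symbols, and only the exact string `"tab"` selects tab indentation. The sketch below isolates both expressions in standalone helpers so they can be tried in isolation. The helper names `normalize_element_list` and `indent_char_for` are hypothetical and do not exist in the class.

```ruby
require "set"

# Same expression the constructor applies to each element list: nil becomes
# an empty set, and every entry is converted to a String.
def normalize_element_list(list)
  Set.new((list || []).map(&:to_s))
end

# Same expression the constructor uses for the indent character: only the
# exact String "tab" yields a tab; anything else yields a space.
def indent_char_for(indent_type)
  indent_type == "tab" ? "\t" : " "
end

p normalize_element_list(nil).empty?            # => true
p normalize_element_list([:p, "li", :p]).to_a   # => ["p", "li"]
p indent_char_for("tab")                        # => "\t"
p indent_char_for(:tab)                         # => " "
```

Under this reading, passing the symbol `:tab` silently falls back to spaces, so callers would need to pass the string form.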
Instance Method Details
#format(xml_string) ⇒ String
Format an XML string with mixed-content-aware serialization.
# File 'lib/canon/pretty_printer/xml_normalized.rb', line 153
def format(xml_string)
  doc = Nokogiri::XML(xml_string)
  lines = []

  if doc.version
    enc = doc.encoding ? " encoding=\"#{doc.encoding}\"" : ""
    lines << "<?xml version=\"#{doc.version}\"#{enc}?>"
  end

  lines << serialize_element(doc.root, 0) if doc.root
  lines.join("\n")
end
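The declaration handling in `#format` can be tried without a parser by lifting it into a standalone helper. The name `xml_declaration` is hypothetical. The string-building logic is copied from the method above, with the parsed document's version and encoding passed in as plain arguments.

```ruby
# Rebuild an XML declaration from a version string and an optional encoding
# name, mirroring the logic in #format. Returns nil when version is nil,
# which corresponds to #format emitting no declaration line.
def xml_declaration(version, encoding = nil)
  return nil unless version

  enc = encoding ? " encoding=\"#{encoding}\"" : ""
  "<?xml version=\"#{version}\"#{enc}?>"
end

puts xml_declaration("1.0", "UTF-8") # => <?xml version="1.0" encoding="UTF-8"?>
puts xml_declaration("1.0")          # => <?xml version="1.0"?>
```

Only `version` and `encoding` are written back; as the method is shown, any other declaration attribute in the input is not reproduced.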