Class: Canon::Xml::SaxBuilder

Inherits:
Object
Defined in:
lib/canon/xml/sax_builder.rb

Overview

Builds Canon::Xml::Node tree using Nokogiri SAX parser

This is MUCH faster than DOM parsing + conversion because:

  1. No intermediate Nokogiri DOM tree (saves ~60ms)

  2. No tree traversal to build Canon (saves ~1200ms)

  3. No memory overhead of two complete DOM trees

Current (SLOW): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms
Optimized (FAST): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms

Usage:

root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
# root is a Canon::Xml::Nodes::RootNode

For C14N, use strip_doctype: true to avoid DTD default attribute expansion:

root = SaxBuilder.parse(xml_string, strip_doctype: true)

Constructor Details

#initialize(preserve_whitespace: false) ⇒ SaxBuilder

Initialize the SAX builder

Parameters:

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes



# File 'lib/canon/xml/sax_builder.rb', line 81

def initialize(preserve_whitespace: false)
  super()
  @preserve_whitespace = preserve_whitespace
  @root = Nodes::RootNode.new
  @stack = [@root]
  # Track in-scope namespaces at each level
  # Each entry is a hash of prefix => uri
  @namespace_stack = [build_initial_namespaces]
  # Captured libxml errors during SAX parsing.  Surfaced on the
  # resulting RootNode so the diff report can warn the user
  # when a FATAL parse error has caused content loss
  # (see lutaml/canon#130).
  @parse_errors = []
end

Class Method Details

.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode

Parse XML string and return Canon::Xml::Node tree

Parameters:

  • xml_string (String)

    XML content to parse

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes

  • strip_doctype (Boolean) (defaults to: false)

    Strip DOCTYPE before parsing (for C14N to avoid DTD default attrs)

Returns:

  • (Nodes::RootNode)



# File 'lib/canon/xml/sax_builder.rb', line 31

def self.parse(xml_string, preserve_whitespace: false,
               strip_doctype: false)
  # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
  # This is needed for C14N which should NOT include default attributes from DTD
  # Use string methods instead of complex regex to avoid ReDoS vulnerability
  if strip_doctype
    xml_string = strip_doctype_declaration(xml_string)
  end

  builder = new(preserve_whitespace: preserve_whitespace)
  parser = Nokogiri::XML::SAX::Parser.new(builder)
  parser.parse(xml_string)
  builder.result
end

.strip_doctype_declaration(xml) ⇒ String

Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.

Parameters:

  • xml (String)

    XML string potentially containing DOCTYPE

Returns:

  • (String)

    XML string with DOCTYPE removed



# File 'lib/canon/xml/sax_builder.rb', line 51

def self.strip_doctype_declaration(xml)
  # Find DOCTYPE start (case-insensitive)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  # Find the end of DOCTYPE - it ends with >
  # Handle both simple DOCTYPE and those with internal subset [...]
  pos = doctype_start + 9 # length of "<!DOCTYPE"
  in_bracket = false

  while pos < xml.length
    char = xml[pos]
    if char == "[" && !in_bracket
      in_bracket = true
    elsif char == "]" && in_bracket
      in_bracket = false
    elsif char == ">" && !in_bracket
      # Found the end of DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..]
    end
    pos += 1
  end

  # If we didn't find a proper end, just return original
  xml
end
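
To make the bracket tracking concrete, here is a minimal standalone sketch of the same scan. The name strip_doctype_sketch is hypothetical and not part of Canon; it is only meant to show how a ">" inside an internal subset [...] is skipped while the first ">" outside it ends the DOCTYPE.

```ruby
# Hypothetical standalone copy of the scan, for illustration only.
def strip_doctype_sketch(xml)
  start = xml.upcase.index("<!DOCTYPE")
  return xml unless start

  in_bracket = false
  (start + "<!DOCTYPE".length...xml.length).each do |pos|
    case xml[pos]
    when "[" then in_bracket = true
    when "]" then in_bracket = false
    when ">"
      # A ">" inside [...] belongs to an inner declaration, not the DOCTYPE.
      return xml[0...start] + xml[(pos + 1)..] unless in_bracket
    end
  end
  xml # unterminated DOCTYPE: leave the input untouched
end

simple = %(<!DOCTYPE html><p/>)
subset = %(<?xml version="1.0"?><!DOCTYPE a [<!ATTLIST a b CDATA "x">]><a/>)

puts strip_doctype_sketch(simple)
puts strip_doctype_sketch(subset)
puts strip_doctype_sketch("<!DOCTYPE a [")
```

One caveat that may be worth checking against real inputs: the offset is computed on xml.upcase, and String#upcase can change the length of a string for a few characters (for example "ß" becomes "SS"), so such characters appearing before the DOCTYPE could shift the index.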

Instance Method Details

#characters(string) ⇒ Object

Called for text content

Parameters:

  • string (String)

    Text content



# File 'lib/canon/xml/sax_builder.rb', line 168

def characters(string)
  return if string.nil?

  parent = @stack.last

  # Capture raw text BEFORE entity resolution for accurate serialization
  raw_string = string

  # Decode numeric character references
  decoded_string = decode_character_references(string)

  # Combine with previous text node if adjacent (SAX can split text content)
  # This MUST happen before whitespace check, because SAX may split "foo "
  # into "foo" and " " callbacks - we need to combine them before deciding
  # whether to skip whitespace
  last_child = parent.children.last
  if last_child&.node_type == :text
    # Combine both raw and decoded forms
    last_child.value = last_child.value + decoded_string
    last_child.original = (last_child.original || "") + raw_string
    return
  end

  # Skip whitespace-only text nodes unless:
  # 1. preserve_whitespace is true, OR
  # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N, OR
  # 3. The content contains non-ASCII whitespace (NBSP U+00A0, ideographic
  #    space U+3000, etc.) — those are semantically meaningful content,
  #    not pretty-print indentation, and must survive parsing so the
  #    comparator can detect Unicode whitespace-type differences.
  #
  # Strip only when the node is pure ASCII whitespace (space, tab, LF);
  # a node containing CR is kept per rule 2 above.
  # This lets pretty-printed fixtures work (indent nodes stripped) while
  # preserving NBSP-only text nodes.
  ascii_whitespace_only = decoded_string.gsub(/[ \t\r\n]/, "").empty?
  if !@preserve_whitespace && ascii_whitespace_only &&
     parent.node_type == :element && !decoded_string.include?("\r")
    # Only skip if parent is an element (not root)
    return
  end

  text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
  parent.add_child(text)
end
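
The skip condition in #characters can be read as a single predicate. The sketch below is one way to express it in isolation; skip_whitespace_node? and its keyword arguments are hypothetical names introduced here, not Canon API, and the regular expression is an equivalent reformulation rather than the code the library uses.

```ruby
# Hypothetical predicate mirroring the skip condition in #characters.
def skip_whitespace_node?(text, preserve_whitespace:, parent_type:)
  return false if preserve_whitespace
  return false unless parent_type == :element
  return false if text.include?("\r") # CR (e.g. from &#xD;) is kept for C14N

  # True only when the text consists solely of space, tab, or LF.
  text.match?(/\A[ \t\n]*\z/)
end

p skip_whitespace_node?("\n  ", preserve_whitespace: false, parent_type: :element)
p skip_whitespace_node?("\u00A0", preserve_whitespace: false, parent_type: :element)
p skip_whitespace_node?(" \r\n", preserve_whitespace: false, parent_type: :element)
p skip_whitespace_node?("\n", preserve_whitespace: false, parent_type: :root)
p skip_whitespace_node?("  ", preserve_whitespace: true, parent_type: :element)
```

Only the first call reports a skippable node. Because adjacent text is merged before this check runs, a whitespace chunk that follows an existing text node is appended to it rather than tested on its own.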

#comment(string) ⇒ Object

Called for comments

Parameters:

  • string (String)

    Comment content



# File 'lib/canon/xml/sax_builder.rb', line 215

def comment(string)
  parent = @stack.last
  comment_node = Nodes::CommentNode.new(value: string)
  parent.add_child(comment_node)
end

#end_element(_name) ⇒ Object

Called when an element ends

Parameters:

  • _name (String)

    Element name (unused)



# File 'lib/canon/xml/sax_builder.rb', line 160

def end_element(_name)
  @stack.pop
  @namespace_stack.pop
end
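
The push in #start_element and the pop in #end_element are what turn a flat stream of SAX events into a tree. As a rough illustration, the sketch below feeds hand-written events to a tiny builder; TinyBuilder and the Node struct are hypothetical stand-ins, far simpler than Canon's node classes, and no parser is involved.

```ruby
# Minimal stand-in node type; Canon's real node classes are richer.
Node = Struct.new(:name, :children)

class TinyBuilder
  attr_reader :root

  def initialize
    @root = Node.new(:root, [])
    @stack = [@root] # top of the stack is always the current parent
  end

  def start_element(name)
    node = Node.new(name, [])
    @stack.last.children << node # attach to the current parent
    @stack.push(node)            # new element becomes the parent
  end

  def end_element(_name)
    @stack.pop # return to the enclosing element
  end
end

builder = TinyBuilder.new
# Events a SAX parser would emit for <a><b/><c/></a>
builder.start_element("a")
builder.start_element("b")
builder.end_element("b")
builder.start_element("c")
builder.end_element("c")
builder.end_element("a")

a = builder.root.children.first
p a.children.map(&:name)
```

Since every callback only touches the top of the stack, no second pass over the tree is needed, which is consistent with the savings described in the overview.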

#error(string) ⇒ Object

SAX callbacks for libxml errors and warnings. Without these overrides the default handlers swallow the events; with them, libxml’s “Attribute xml:lang redefined” and similar messages land in @parse_errors and ride through to ComparisonResult.



# File 'lib/canon/xml/sax_builder.rb', line 100

def error(string)
  @parse_errors << string.to_s.strip
end

#processing_instruction(name, content) ⇒ Object

Called for processing instructions

Parameters:

  • name (String)

    PI target

  • content (String)

    PI content



# File 'lib/canon/xml/sax_builder.rb', line 225

def processing_instruction(name, content)
  parent = @stack.last
  pi = Nodes::ProcessingInstructionNode.new(target: name,
                                            data: content || "")
  parent.add_child(pi)
end

#reorder_children(root) ⇒ Object

Reorder the root's children so that the document element comes first, followed by the PIs and comments that sit outside the document element.



# File 'lib/canon/xml/sax_builder.rb', line 246

def reorder_children(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  other_children = root.children.reject { |c| c.node_type == :element }
  root.children = [doc_element] + other_children
end
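
A small sketch of the reordering on plain data may help. Child and reorder_sketch are hypothetical names, and the function returns a new array instead of assigning to root.children.

```ruby
# Stand-in for a root-level child; only node_type matters for the reordering.
Child = Struct.new(:node_type, :label)

def reorder_sketch(children)
  doc_element = children.find { |c| c.node_type == :element }
  return children unless doc_element

  # Document element first, then everything else in its original order.
  [doc_element] + children.reject { |c| c.node_type == :element }
end

children = [
  Child.new(:comment, "before"),
  Child.new(:element, "doc"),
  Child.new(:processing_instruction, "after"),
]
p reorder_sketch(children).map(&:label)
```

The comment that preceded the document element ends up after it, and the relative order of the non-element nodes is kept. Note that reject removes every :element child, so the approach assumes the root holds a single document element.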

#result ⇒ Nodes::RootNode

Return the built tree

Returns:

  • (Nodes::RootNode)



# File 'lib/canon/xml/sax_builder.rb', line 235

def result
  # Reorder children so that the document element comes first,
  # followed by PIs and comments outside the document element
  # (C14N requires this ordering)
  reorder_children(@root)
  @root.parse_errors = @parse_errors if @parse_errors.any?
  @root
end

#start_element(name, attrs = []) ⇒ Object

Called when an element starts

Parameters:

  • name (String)

    Element name (may include prefix like “ns:element”)

  • attrs (Array) (defaults to: [])

    Array of [name, value] pairs



# File 'lib/canon/xml/sax_builder.rb', line 112

def start_element(name, attrs = [])
  parent = @stack.last

  # Parse namespace from name (prefix:localname or just localname)
  prefix, local_name = parse_qname(name)

  # Separate namespace declarations from regular attributes
  ns_decls, regular_attrs = separate_namespaces(attrs)

  # Check for relative namespace URIs (before building hash)
  # Convert to hash for iteration
  ns_hash = build_ns_hash(ns_decls)
  ns_hash.each_value do |uri|
    next if uri.nil? || uri.empty?

    if relative_uri?(uri)
      raise Canon::Error,
            "Relative namespace URI not allowed: #{uri}"
    end
  end

  # Push new namespace scope with declarations
  new_scope = @namespace_stack.last.merge(ns_hash)
  @namespace_stack.push(new_scope)

  # Find namespace URI from current scope
  ns_uri = new_scope[prefix.to_s]

  # Create element node
  element = Nodes::ElementNode.new(
    name: local_name,
    namespace_uri: ns_uri,
    prefix: prefix,
  )

  # Add namespace nodes from current scope
  add_namespace_nodes(element, new_scope)

  # Build and add attribute nodes (excluding xmlns declarations)
  add_attribute_nodes(element, regular_attrs)

  parent.add_child(element)
  @stack.push(element)
end
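
The namespace handling above relies on a stack of prefix-to-URI hashes, one per open element. The sketch below shows that idea on its own. split_qname is a hypothetical stand-in for parse_qname, the stack starts empty because the contents of build_initial_namespaces are not shown here, and storing the default namespace under the empty-string key is an assumption inferred from the prefix.to_s lookup.

```ruby
# Hypothetical stand-in for parse_qname: "ns:el" => ["ns", "el"], "el" => [nil, "el"].
def split_qname(name)
  prefix, local = name.split(":", 2)
  local ? [prefix, local] : [nil, prefix]
end

# Assumption: the default namespace lives under the "" key, which would
# explain the prefix.to_s lookup (nil.to_s == "").
scopes = [{}]

# <root xmlns="urn:default" xmlns:a="urn:a">
scopes.push(scopes.last.merge({ "" => "urn:default", "a" => "urn:a" }))
# <a:child xmlns:a="urn:override">
scopes.push(scopes.last.merge({ "a" => "urn:override" }))

prefix, local = split_qname("a:child")
inner_uri = scopes.last[prefix.to_s]

scopes.pop # </a:child>: the override goes out of scope
outer_uri = scopes.last[prefix.to_s]

default_prefix, = split_qname("root")
default_uri = scopes.last[default_prefix.to_s]

p [local, inner_uri, outer_uri, default_uri]
```

Because merge returns a new hash, each scope inherits its parent's bindings without mutating them, so popping the stack in #end_element restores the outer bindings.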

#warning(string) ⇒ Object

SAX callback for libxml warnings. Like #error, it appends the stripped message to @parse_errors.



# File 'lib/canon/xml/sax_builder.rb', line 104

def warning(string)
  @parse_errors << string.to_s.strip
end