Class: Canon::Xml::SaxBuilder

Inherits:
Nokogiri::XML::SAX::Document
  • Object
Defined in:
lib/canon/xml/sax_builder.rb

Overview

Builds a Canon::Xml::Node tree directly from Nokogiri SAX parser events.

This is MUCH faster than DOM parsing + conversion because:

  1. No intermediate Nokogiri DOM tree (saves ~60ms)

  2. No tree traversal to build Canon (saves ~1200ms)

  3. No memory overhead of two complete DOM trees

Current (SLOW): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms

Optimized (FAST): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms

Usage:

root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
# root is a Canon::Xml::Nodes::RootNode

For C14N, use strip_doctype: true to avoid DTD default attribute expansion:

root = SaxBuilder.parse(xml_string, strip_doctype: true)


Constructor Details

#initialize(preserve_whitespace: false) ⇒ SaxBuilder

Initialize the SAX builder

Parameters:

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes



# File 'lib/canon/xml/sax_builder.rb', line 88

def initialize(preserve_whitespace: false)
  super()
  @preserve_whitespace = preserve_whitespace
  @root = Nodes::RootNode.new
  @stack = [@root]
  # Track in-scope namespaces at each level
  # Each entry is a hash of prefix => uri
  @namespace_stack = [build_initial_namespaces]
  # Captured libxml errors during SAX parsing.  Surfaced on the
  # resulting RootNode so the diff report can warn the user
  # when a FATAL parse error has caused content loss
  # (see lutaml/canon#130).
  @parse_errors = []
end

Class Method Details

.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode

Parse XML string and return Canon::Xml::Node tree

Parameters:

  • xml_string (String)

    XML content to parse

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes

  • strip_doctype (Boolean) (defaults to: false)

    Strip DOCTYPE before parsing (for C14N to avoid DTD default attrs)

Returns:

  • (Nodes::RootNode)

    Root of the built Canon::Xml::Node tree



# File 'lib/canon/xml/sax_builder.rb', line 38

def self.parse(xml_string, preserve_whitespace: false,
               strip_doctype: false)
  # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
  # This is needed for C14N which should NOT include default attributes from DTD
  # Use string methods instead of complex regex to avoid ReDoS vulnerability
  if strip_doctype
    xml_string = strip_doctype_declaration(xml_string)
  end

  builder = new(preserve_whitespace: preserve_whitespace)
  parser = Nokogiri::XML::SAX::Parser.new(builder)
  parser.parse(xml_string)
  builder.result
end

.strip_doctype_declaration(xml) ⇒ String

Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability that comes from patterns like \s+ and [^>]*.

Parameters:

  • xml (String)

    XML string potentially containing DOCTYPE

Returns:

  • (String)

    XML string with DOCTYPE removed



# File 'lib/canon/xml/sax_builder.rb', line 58

def self.strip_doctype_declaration(xml)
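  # Find DOCTYPE start (case-insensitive). Search the original string with
  # a literal pattern: upcasing first can change the string length
  # (e.g. "ß" becomes "SS"), which would shift the returned offset.
  doctype_start = xml.index(/<!DOCTYPE/i)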
  return xml unless doctype_start

  # Find the end of DOCTYPE - it ends with >
  # Handle both simple DOCTYPE and those with internal subset [...]
  pos = doctype_start + 9 # length of "<!DOCTYPE"
  in_bracket = false

  while pos < xml.length
    char = xml[pos]
    if char == "[" && !in_bracket
      in_bracket = true
    elsif char == "]" && in_bracket
      in_bracket = false
    elsif char == ">" && !in_bracket
      # Found the end of DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..]
    end
    pos += 1
  end

  # If we didn't find a proper end, just return original
  xml
end
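
To illustrate why the scan tracks the internal subset brackets, here is a minimal standalone sketch of the same loop. The name strip_doctype_sketch is hypothetical and exists only for this example; it is not part of Canon, and it assumes an upper-case DOCTYPE keyword.

```ruby
# Hypothetical standalone version of the scan, for illustration only.
def strip_doctype_sketch(xml)
  start = xml.index("<!DOCTYPE")
  return xml unless start

  pos = start + "<!DOCTYPE".length
  in_bracket = false
  while pos < xml.length
    case xml[pos]
    when "[" then in_bracket = true
    when "]" then in_bracket = false
    when ">"
      # A ">" closes the DOCTYPE only when we are outside [...]
      return xml[0...start] + xml[(pos + 1)..] unless in_bracket
    end
    pos += 1
  end
  xml # unterminated DOCTYPE: leave the input untouched
end

simple   = %(<!DOCTYPE doc SYSTEM "doc.dtd"><doc/>)
internal = %(<!DOCTYPE doc [<!ATTLIST doc a CDATA "1">]><doc/>)
broken   = %(<!DOCTYPE doc [<doc/>)

p strip_doctype_sketch(simple)   # "<doc/>"
p strip_doctype_sketch(internal) # "<doc/>"
p strip_doctype_sketch(broken) == broken # true
```

In the second input the ">" that ends the ATTLIST declaration sits inside the brackets, so it is skipped and only the final ">" ends the DOCTYPE.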

Instance Method Details

#characters(string) ⇒ Object

Called for text content

Parameters:

  • string (String)

    Text content



# File 'lib/canon/xml/sax_builder.rb', line 175

def characters(string)
  return if string.nil?

  parent = @stack.last

  # Capture raw text BEFORE entity resolution for accurate serialization
  raw_string = string

  # Decode numeric character references
  decoded_string = decode_character_references(string)

  # Combine with previous text node if adjacent (SAX can split text content)
  # This MUST happen before whitespace check, because SAX may split "foo "
  # into "foo" and " " callbacks - we need to combine them before deciding
  # whether to skip whitespace
  last_child = parent.children.last
  if last_child&.node_type == :text
    # Combine both raw and decoded forms
    last_child.instance_variable_set(:@value,
                                     last_child.value + decoded_string)
    last_child.instance_variable_set(:@original,
                                     (last_child.original || "") + raw_string)
    return
  end

  # Skip whitespace-only text nodes unless:
  # 1. preserve_whitespace is true, OR
  # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N, OR
  # 3. The content contains non-ASCII whitespace (NBSP U+00A0, ideographic
  #    space U+3000, etc.) — those are semantically meaningful content,
  #    not pretty-print indentation, and must survive parsing so the
  #    comparator can detect Unicode whitespace-type differences.
  #
  # Strip only when the node is pure ASCII whitespace (space, tab, LF);
  # a node that contains CR is kept, per rule 2 above.
  # This lets pretty-printed fixtures work (indent nodes stripped) while
  # preserving NBSP-only text nodes.
  if !@preserve_whitespace &&
     decoded_string.gsub(/[ \t\r\n]/, "").empty? &&
     parent.node_type == :element &&
     !decoded_string.include?("\r")
    # Only skip if parent is an element (not root)
    return
  end

  text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
  parent.add_child(text)
end
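
The skip rule for whitespace-only text can be read as a single predicate. The sketch below restates it outside the class; skip_text? and its keyword arguments are hypothetical names chosen for this example, not Canon API.

```ruby
# Hypothetical predicate mirroring the skip condition in #characters.
ASCII_WHITESPACE = /[ \t\r\n]/

def skip_text?(text, preserve_whitespace:, parent_is_element:)
  return false if preserve_whitespace
  return false unless parent_is_element
  return false if text.include?("\r") # CR must survive for C14N

  # Only text made up entirely of ASCII whitespace is dropped
  text.gsub(ASCII_WHITESPACE, "").empty?
end

opts = { preserve_whitespace: false, parent_is_element: true }

p skip_text?("\n  ", **opts)     # true  (indentation)
p skip_text?("\u00A0", **opts)   # false (NBSP is content)
p skip_text?(" \r\n", **opts)    # false (contains CR)
p skip_text?("\n  ", preserve_whitespace: true, parent_is_element: true)   # false
p skip_text?("\n  ", preserve_whitespace: false, parent_is_element: false) # false
```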

#comment(string) ⇒ Object

Called for comments

Parameters:

  • string (String)

    Comment content



# File 'lib/canon/xml/sax_builder.rb', line 224

def comment(string)
  parent = @stack.last
  comment_node = Nodes::CommentNode.new(value: string)
  parent.add_child(comment_node)
end

#end_element(_name) ⇒ Object

Called when an element ends

Parameters:

  • _name (String)

    Element name (unused)



# File 'lib/canon/xml/sax_builder.rb', line 167

def end_element(_name)
  @stack.pop
  @namespace_stack.pop
end
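
The element stack is what turns a flat stream of SAX callbacks into a tree: a start event pushes a new parent, an end event pops it. The following is a minimal sketch of that idea with hand-fed events and a hypothetical SketchNode struct; Canon's real node classes and the Nokogiri callbacks are not used here.

```ruby
# Hypothetical node type for illustration; Canon's real classes differ.
SketchNode = Struct.new(:type, :name, :value, :children)

def build_tree(events)
  root = SketchNode.new(:root, nil, nil, [])
  stack = [root]
  events.each do |kind, payload|
    parent = stack.last
    case kind
    when :start
      element = SketchNode.new(:element, payload, nil, [])
      parent.children << element
      stack.push(element) # the new element becomes the current parent
    when :text
      last = parent.children.last
      if last && last.type == :text
        last.value += payload # merge text that arrived in several callbacks
      else
        parent.children << SketchNode.new(:text, nil, payload, [])
      end
    when :end
      stack.pop # return to the enclosing element
    end
  end
  root
end

events = [
  [:start, "a"], [:text, "foo"], [:text, " bar"],
  [:start, "b"], [:end, "b"],
  [:end, "a"]
]
a = build_tree(events).children.first
p a.children.map(&:type) # [:text, :element]
p a.children.first.value # "foo bar"
```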

#error(string) ⇒ Object

SAX callbacks for libxml errors and warnings. Without these overrides the default handlers swallow the events; with them, libxml’s “Attribute xml:lang redefined” and similar messages land in @parse_errors and ride through to ComparisonResult.



# File 'lib/canon/xml/sax_builder.rb', line 107

def error(string)
  @parse_errors << string.to_s.strip
end

#processing_instruction(name, content) ⇒ Object

Called for processing instructions

Parameters:

  • name (String)

    PI target

  • content (String)

    PI content



# File 'lib/canon/xml/sax_builder.rb', line 234

def processing_instruction(name, content)
  parent = @stack.last
  pi = Nodes::ProcessingInstructionNode.new(target: name,
                                            data: content || "")
  parent.add_child(pi)
end

#reorder_children(root) ⇒ Object

Reorder the root's children so that the document element comes first, followed by the PIs and comments that appear outside the document element. Does nothing if the root has no element child.



# File 'lib/canon/xml/sax_builder.rb', line 255

def reorder_children(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  other_children = root.children.reject { |c| c.node_type == :element }
  root.instance_variable_set(:@children, [doc_element] + other_children)
end
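
The effect of the reordering is easiest to see on a small list. This sketch uses a hypothetical ReorderNode struct and returns a new array instead of mutating a root node, so it only illustrates the ordering rule, not Canon's actual method.

```ruby
# Hypothetical stand-in for root children; only node_type matters here.
ReorderNode = Struct.new(:node_type, :label)

def reorder_sketch(children)
  doc_element = children.find { |c| c.node_type == :element }
  return children unless doc_element

  # Document element first; everything else keeps its relative order
  [doc_element] + children.reject { |c| c.node_type == :element }
end

children = [
  ReorderNode.new(:comment, "before"),
  ReorderNode.new(:processing_instruction, "style"),
  ReorderNode.new(:element, "doc"),
  ReorderNode.new(:comment, "after")
]

p reorder_sketch(children).map(&:label)
# ["doc", "before", "style", "after"]
```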

#result ⇒ Nodes::RootNode

Return the built tree

Returns:

  • (Nodes::RootNode)

    Root of the built tree, with parse_errors set if any were captured



# File 'lib/canon/xml/sax_builder.rb', line 244

def result
  # Reorder children so that the document element comes first,
  # followed by PIs and comments outside the document element
  # (C14N requires this ordering)
  reorder_children(@root)
  @root.parse_errors = @parse_errors if @parse_errors.any?
  @root
end

#start_element(name, attrs = []) ⇒ Object

Called when an element starts

Parameters:

  • name (String)

    Element name (may include prefix like “ns:element”)

  • attrs (Array) (defaults to: [])

    Array of [name, value] pairs



# File 'lib/canon/xml/sax_builder.rb', line 119

def start_element(name, attrs = [])
  parent = @stack.last

  # Parse namespace from name (prefix:localname or just localname)
  prefix, local_name = parse_qname(name)

  # Separate namespace declarations from regular attributes
  ns_decls, regular_attrs = separate_namespaces(attrs)

  # Check for relative namespace URIs (before building hash)
  # Convert to hash for iteration
  ns_hash = build_ns_hash(ns_decls)
  ns_hash.each_value do |uri|
    next if uri.nil? || uri.empty?

    if relative_uri?(uri)
      raise Canon::Error,
            "Relative namespace URI not allowed: #{uri}"
    end
  end

  # Push new namespace scope with declarations
  new_scope = @namespace_stack.last.merge(ns_hash)
  @namespace_stack.push(new_scope)

  # Find namespace URI from current scope
  ns_uri = new_scope[prefix.to_s]

  # Create element node
  element = Nodes::ElementNode.new(
    name: local_name,
    namespace_uri: ns_uri,
    prefix: prefix,
  )

  # Add namespace nodes from current scope
  add_namespace_nodes(element, new_scope)

  # Build and add attribute nodes (excluding xmlns declarations)
  add_attribute_nodes(element, regular_attrs)

  parent.add_child(element)
  @stack.push(element)
end
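
Namespace resolution relies on a stack of scopes: each element merges its own declarations over the enclosing scope, and the merged hash is discarded when the element ends. Below is a minimal sketch of that lookup. split_qname and resolve are hypothetical helpers written for this example, and storing the default namespace under the empty-string key is an assumption inferred from the prefix.to_s lookup above.

```ruby
# Hypothetical helper: split "ns:element" into [prefix, local_name].
def split_qname(name)
  prefix, local = name.split(":", 2)
  local ? [prefix, local] : [nil, prefix]
end

# Look up the namespace URI for a qualified name in the innermost scope.
def resolve(scopes, qname)
  prefix, local = split_qname(qname)
  [scopes.last[prefix.to_s], local] # nil prefix maps to the "" key
end

scopes = [{}]
# <root xmlns="urn:default" xmlns:p="urn:outer">
scopes.push(scopes.last.merge("" => "urn:default", "p" => "urn:outer"))
# <child xmlns:p="urn:inner">
scopes.push(scopes.last.merge("p" => "urn:inner"))

p resolve(scopes, "p:item") # ["urn:inner", "item"]
p resolve(scopes, "item")   # ["urn:default", "item"]

scopes.pop # </child>: the inner redeclaration goes out of scope
p resolve(scopes, "p:item") # ["urn:outer", "item"]
```

Because merge returns a new hash, popping a scope restores the outer bindings without any explicit undo step.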

#warning(string) ⇒ Object

Records libxml warnings in @parse_errors, in the same way as #error.



# File 'lib/canon/xml/sax_builder.rb', line 111

def warning(string)
  @parse_errors << string.to_s.strip
end