Class: Canon::Xml::SaxBuilder

Inherits:
Nokogiri::XML::SAX::Document
  • Object
Defined in:
lib/canon/xml/sax_builder.rb

Overview

Builds a Canon::Xml::Node tree directly from Nokogiri SAX parser events.

This is MUCH faster than DOM parsing + conversion because:

  1. No intermediate Nokogiri DOM tree (saves ~60ms)

  2. No tree traversal to build Canon (saves ~1200ms)

  3. No memory overhead of two complete DOM trees

Current (SLOW): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms

Optimized (FAST): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms

Usage:

root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
# root is a Canon::Xml::Nodes::RootNode

For C14N, use strip_doctype: true to avoid DTD default attribute expansion:

root = SaxBuilder.parse(xml_string, strip_doctype: true)

Constructor Details

#initialize(preserve_whitespace: false) ⇒ SaxBuilder

Initialize the SAX builder

Parameters:

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes



# File 'lib/canon/xml/sax_builder.rb', line 88

def initialize(preserve_whitespace: false)
  super()
  @preserve_whitespace = preserve_whitespace
  @root = Nodes::RootNode.new
  @stack = [@root]
  # Track in-scope namespaces at each level
  # Each entry is a hash of prefix => uri
  @namespace_stack = [build_initial_namespaces]
end
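
The constructor seeds a stack with the root node, and the element callbacks documented further down push and pop that stack. The following is a minimal sketch of that pattern in plain Ruby, assuming a hypothetical `SketchNode` struct and a hand-written event list in place of Canon's node classes and real SAX callbacks:

```ruby
# Hypothetical stand-in for Canon's node classes; not the real API.
SketchNode = Struct.new(:type, :value, :children) do
  def add_child(node)
    children << node
  end
end

# Replays SAX-like events against a stack seeded with the root,
# the same shape as `@stack = [@root]` in the constructor.
def build_tree(events)
  root = SketchNode.new(:root, nil, [])
  stack = [root]

  events.each do |kind, value|
    case kind
    when :start
      element = SketchNode.new(:element, value, [])
      stack.last.add_child(element) # attach to the innermost open element
      stack.push(element)           # it becomes the parent for what follows
    when :text
      stack.last.add_child(SketchNode.new(:text, value, []))
    when :end
      stack.pop                     # return to the enclosing element
    end
  end

  root
end

# Events for <a><b>hi</b><c/></a>
events = [
  [:start, "a"], [:start, "b"], [:text, "hi"], [:end, "b"],
  [:start, "c"], [:end, "c"], [:end, "a"]
]
tree = build_tree(events)
puts tree.children.first.children.map(&:value).inspect
```

Because the top of the stack is always the innermost open element, each callback can attach its node without walking the tree.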

Class Method Details

.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode

Parse XML string and return Canon::Xml::Node tree

Parameters:

  • xml_string (String)

    XML content to parse

  • preserve_whitespace (Boolean) (defaults to: false)

    Whether to preserve whitespace-only text nodes

  • strip_doctype (Boolean) (defaults to: false)

    Strip DOCTYPE before parsing (for C14N to avoid DTD default attrs)

Returns:

  • (Nodes::RootNode)


# File 'lib/canon/xml/sax_builder.rb', line 38

def self.parse(xml_string, preserve_whitespace: false,
               strip_doctype: false)
  # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
  # This is needed for C14N which should NOT include default attributes from DTD
  # Use string methods instead of complex regex to avoid ReDoS vulnerability
  if strip_doctype
    xml_string = strip_doctype_declaration(xml_string)
  end

  builder = new(preserve_whitespace: preserve_whitespace)
  parser = Nokogiri::XML::SAX::Parser.new(builder)
  parser.parse(xml_string)
  builder.result
end

.strip_doctype_declaration(xml) ⇒ String

Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability that patterns like \s+ and [^>]* can introduce.

Parameters:

  • xml (String)

    XML string potentially containing DOCTYPE

Returns:

  • (String)

    XML string with DOCTYPE removed



# File 'lib/canon/xml/sax_builder.rb', line 58

def self.strip_doctype_declaration(xml)
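  # Find DOCTYPE start (case-insensitive). A literal pattern keeps the offset
  # valid; upcasing the whole string can change its length (e.g. "ß" => "SS").
  doctype_start = xml.index(/<!DOCTYPE/i)
  return xml unless doctype_start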

  # Find the end of DOCTYPE - it ends with >
  # Handle both simple DOCTYPE and those with internal subset [...]
  pos = doctype_start + 9 # length of "<!DOCTYPE"
  in_bracket = false

  while pos < xml.length
    char = xml[pos]
    if char == "[" && !in_bracket
      in_bracket = true
    elsif char == "]" && in_bracket
      in_bracket = false
    elsif char == ">" && !in_bracket
      # Found the end of DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..]
    end
    pos += 1
  end

  # If we didn't find a proper end, just return original
  xml
end
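
The bracket tracking matters when the DOCTYPE carries an internal subset, because the subset itself contains `>` characters. The sketch below compares a condensed version of the scan with a naive single-regex alternative. `strip_doctype_sketch`, `DOC`, and the naive pattern are hypothetical names introduced here for illustration; the point shown is correctness on internal subsets, not timing behavior.

```ruby
# Bracket-aware scan, condensed from the method above for illustration.
def strip_doctype_sketch(xml)
  start = xml.index("<!DOCTYPE")
  return xml unless start

  in_bracket = false
  (start + "<!DOCTYPE".length...xml.length).each do |pos|
    case xml[pos]
    when "[" then in_bracket = true
    when "]" then in_bracket = false
    when ">"
      # A ">" only closes the DOCTYPE when we are outside the [...] subset.
      return xml[0...start] + xml[(pos + 1)..] unless in_bracket
    end
  end
  xml # unterminated DOCTYPE: leave the input untouched
end

DOC = <<~XML
  <!DOCTYPE doc [
    <!ATTLIST doc kind CDATA "default">
  ]>
  <doc/>
XML

# Naive alternative: stops at the first ">", which belongs to <!ATTLIST ...>.
naive = DOC.sub(/<!DOCTYPE[^>]*>/, "")
aware = strip_doctype_sketch(DOC)

puts naive.include?("]>") # leftover debris from the internal subset
puts aware.strip
```

The naive pattern leaves the tail of the internal subset (`]>`) in the document, while the scan removes the declaration as a whole.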

Instance Method Details

#characters(string) ⇒ Object

Called for text content

Parameters:

  • string (String)

    Text content



# File 'lib/canon/xml/sax_builder.rb', line 158

def characters(string)
  return if string.nil?

  parent = @stack.last

  # Capture raw text BEFORE entity resolution for accurate serialization
  raw_string = string

  # Decode numeric character references
  decoded_string = decode_character_references(string)

  # Combine with previous text node if adjacent (SAX can split text content)
  # This MUST happen before whitespace check, because SAX may split "foo "
  # into "foo" and " " callbacks - we need to combine them before deciding
  # whether to skip whitespace
  last_child = parent.children.last
  if last_child&.node_type == :text
    # Combine both raw and decoded forms
    last_child.instance_variable_set(:@value,
                                     last_child.value + decoded_string)
    last_child.instance_variable_set(:@original,
                                     (last_child.original || "") + raw_string)
    return
  end

  # Skip whitespace-only text nodes unless:
  # 1. preserve_whitespace is true, OR
  # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N
  # Only skip if parent is an element (not root)
  skippable = decoded_string.strip.empty? && !decoded_string.include?("\r")
  return if !@preserve_whitespace && skippable && parent.node_type == :element

  text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
  parent.add_child(text)
end
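
The order of the two checks above (merge first, then decide whether to skip) is easiest to see on a small input. This is a rough sketch under stated assumptions: `SketchText`, `SketchElement`, and `add_text` are hypothetical stand-ins, and character-reference decoding is left out entirely.

```ruby
# Hypothetical stand-ins for Canon's node classes, only for this sketch.
SketchText = Struct.new(:value) do
  def node_type
    :text
  end
end

SketchElement = Struct.new(:children) do
  def node_type
    :element
  end
end

# Mirrors the order of checks in #characters: merge into an adjacent text
# node first, then decide whether a whitespace-only chunk can be dropped.
def add_text(parent, chunk, preserve_whitespace: false)
  last = parent.children.last
  if last&.node_type == :text
    last.value += chunk
    return
  end

  droppable = !preserve_whitespace && chunk.strip.empty? && !chunk.include?("\r")
  return if droppable && parent.node_type == :element

  parent.children << SketchText.new(chunk)
end

# The parser may deliver "foo bar" as three separate chunks.
para = SketchElement.new([])
["foo", " ", "bar"].each { |chunk| add_text(para, chunk) }
puts para.children.map(&:value).inspect

# Indentation between elements is dropped when whitespace is not preserved.
indent = SketchElement.new([])
add_text(indent, "\n  ")
puts indent.children.length
```

If the whitespace check ran first, the lone `" "` chunk would be discarded and the merged text would lose its space.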

#comment(string) ⇒ Object

Called for comments

Parameters:

  • string (String)

    Comment content



# File 'lib/canon/xml/sax_builder.rb', line 198

def comment(string)
  parent = @stack.last
  comment_node = Nodes::CommentNode.new(value: string)
  parent.add_child(comment_node)
end

#end_element(_name) ⇒ Object

Called when an element ends

Parameters:

  • _name (String)

    Element name (unused)



# File 'lib/canon/xml/sax_builder.rb', line 150

def end_element(_name)
  @stack.pop
  @namespace_stack.pop
end

#processing_instruction(name, content) ⇒ Object

Called for processing instructions

Parameters:

  • name (String)

    PI target

  • content (String)

    PI content



# File 'lib/canon/xml/sax_builder.rb', line 208

def processing_instruction(name, content)
  parent = @stack.last
  pi = Nodes::ProcessingInstructionNode.new(target: name,
                                            data: content || "")
  parent.add_child(pi)
end

#reorder_children(root) ⇒ Object

Reorder the root's children so that the document element comes first, followed by the PIs and comments that sit outside the document element.



# File 'lib/canon/xml/sax_builder.rb', line 228

def reorder_children(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  other_children = root.children.reject { |c| c.node_type == :element }
  root.instance_variable_set(:@children, [doc_element] + other_children)
end
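
To make the resulting order concrete, here is a small sketch of the same reordering on a plain array. `SketchChild` and `reorder` are hypothetical names used only for illustration, and the sketch returns a new array where the method above replaces the root's children in place.

```ruby
# Hypothetical node stand-in; only node_type matters for the reordering.
SketchChild = Struct.new(:node_type, :label)

# Returns the children with the document element moved to the front and
# everything else kept in its original relative order.
def reorder(children)
  doc_element = children.find { |c| c.node_type == :element }
  return children unless doc_element

  [doc_element] + children.reject { |c| c.node_type == :element }
end

# Top-level nodes of: <?xml-stylesheet ...?><!-- before --><doc/><!-- after -->
children = [
  SketchChild.new(:processing_instruction, "xml-stylesheet"),
  SketchChild.new(:comment, "before"),
  SketchChild.new(:element, "doc"),
  SketchChild.new(:comment, "after")
]
puts reorder(children).map(&:label).inspect
# => ["doc", "xml-stylesheet", "before", "after"]
```

Nodes that originally preceded the document element end up after it, so their position relative to the element is not recoverable from the reordered list alone.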

#result ⇒ Nodes::RootNode

Return the built tree

Returns:

  • (Nodes::RootNode)


# File 'lib/canon/xml/sax_builder.rb', line 218

def result
  # Reorder children so that the document element comes first,
  # followed by PIs and comments outside the document element
  # (C14N requires this ordering)
  reorder_children(@root)
  @root
end

#start_element(name, attrs = []) ⇒ Object

Called when an element starts

Parameters:

  • name (String)

    Element name (may include prefix like “ns:element”)

  • attrs (Array) (defaults to: [])

    Array of [name, value] pairs



# File 'lib/canon/xml/sax_builder.rb', line 102

def start_element(name, attrs = [])
  parent = @stack.last

  # Parse namespace from name (prefix:localname or just localname)
  prefix, local_name = parse_qname(name)

  # Separate namespace declarations from regular attributes
  ns_decls, regular_attrs = separate_namespaces(attrs)

  # Check for relative namespace URIs (before building hash)
  # Convert to hash for iteration
  ns_hash = build_ns_hash(ns_decls)
  ns_hash.each_value do |uri|
    next if uri.nil? || uri.empty?

    if relative_uri?(uri)
      raise Canon::Error,
            "Relative namespace URI not allowed: #{uri}"
    end
  end

  # Push new namespace scope with declarations
  new_scope = @namespace_stack.last.merge(ns_hash)
  @namespace_stack.push(new_scope)

  # Find namespace URI from current scope
  ns_uri = new_scope[prefix.to_s]

  # Create element node
  element = Nodes::ElementNode.new(
    name: local_name,
    namespace_uri: ns_uri,
    prefix: prefix,
  )

  # Add namespace nodes from current scope
  add_namespace_nodes(element, new_scope)

  # Build and add attribute nodes (excluding xmlns declarations)
  add_attribute_nodes(element, regular_attrs)

  parent.add_child(element)
  @stack.push(element)
end
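
The namespace handling above relies on a stack of merged scopes: each element pushes its parent's scope plus its own declarations, and #end_element pops it again. The following sketch isolates that idea. `NamespaceScopes` is a hypothetical class; the empty initial scope, the `""` key for the default namespace, and splitting qualified names on the first colon are assumptions, since the private helpers (`build_initial_namespaces`, `parse_qname`, `separate_namespaces`) are not shown in this page.

```ruby
# Tracks in-scope namespaces the way @namespace_stack is used above:
# each element pushes a merged copy of its parent's scope.
class NamespaceScopes
  def initialize
    @stack = [{}] # assumption: start from an empty scope
  end

  # declarations is a hash of prefix => uri; "" stands for the default namespace.
  def push(declarations)
    @stack.push(@stack.last.merge(declarations))
  end

  def pop
    @stack.pop
  end

  # Assumption: a qualified name splits on the first colon.
  # Returns [namespace_uri, local_name].
  def resolve(qname)
    prefix, local = qname.include?(":") ? qname.split(":", 2) : [nil, qname]
    [@stack.last[prefix.to_s], local]
  end
end

scopes = NamespaceScopes.new
scopes.push("" => "urn:default", "a" => "urn:a") # <root xmlns="urn:default" xmlns:a="urn:a">
scopes.push("a" => "urn:inner")                  #   <a:child xmlns:a="urn:inner">
inner = scopes.resolve("a:child")
scopes.pop                                       #   </a:child>
outer = scopes.resolve("a:sibling")

puts inner.inspect
puts outer.inspect
```

Merging into a fresh hash per element costs a copy at every level, but it keeps lookup to a single hash access and makes restoring the parent scope a plain pop.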