Class: Canon::Xml::SaxBuilder
- Inherits: Nokogiri::XML::SAX::Document
  - Object
  - Nokogiri::XML::SAX::Document
  - Canon::Xml::SaxBuilder
- Defined in: lib/canon/xml/sax_builder.rb
Overview
Builds Canon::Xml::Node tree using Nokogiri SAX parser
This is MUCH faster than DOM parsing + conversion because:
- No intermediate Nokogiri DOM tree (saves ~60ms)
- No tree traversal to build Canon (saves ~1200ms)
- No memory overhead of two complete DOM trees
Current (SLOW): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms
Optimized (FAST): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms
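The single-pass idea behind the optimized pipeline can be sketched without Nokogiri. The block below is a minimal illustration, not Canon's implementation: `Node` and `TinyBuilder` are hypothetical names, and the callbacks are invoked by hand in the order a SAX parser would emit them for `<a><b>hi</b></a>`. It only shows how a stack of open nodes lets the target tree be built directly from events, with no intermediate DOM.

```ruby
# Hypothetical minimal node type; not Canon's actual Node classes.
Node = Struct.new(:type, :name, :value, :children) do
  def add_child(node)
    children << node
    node
  end
end

# Hypothetical handler that mirrors the SAX callback names.
class TinyBuilder
  attr_reader :root

  def initialize
    @root = Node.new(:root, nil, nil, [])
    @stack = [@root] # the innermost open node is always @stack.last
  end

  def start_element(name, _attrs = [])
    element = Node.new(:element, name, nil, [])
    @stack.last.add_child(element)
    @stack.push(element)
  end

  def characters(string)
    @stack.last.add_child(Node.new(:text, nil, string, []))
  end

  def end_element(_name)
    @stack.pop
  end
end

# Drive the callbacks manually, as a SAX parser would for <a><b>hi</b></a>.
builder = TinyBuilder.new
builder.start_element("a")
builder.start_element("b")
builder.characters("hi")
builder.end_element("b")
builder.end_element("a")

a = builder.root.children.first
puts a.name                                  # prints "a"
puts a.children.first.children.first.value   # prints "hi"
```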
Usage:
root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
# root is a Canon::Xml::Nodes::RootNode
For C14N, use strip_doctype: true to avoid DTD default attribute expansion:
root = SaxBuilder.parse(xml_string, strip_doctype: true)
Class Method Summary
- .parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode
  Parse XML string and return Canon::Xml::Node tree.
- .strip_doctype_declaration(xml) ⇒ String
  Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.
Instance Method Summary
- #characters(string) ⇒ Object
  Called for text content.
- #comment(string) ⇒ Object
  Called for comments.
- #end_element(_name) ⇒ Object
  Called when an element ends.
- #initialize(preserve_whitespace: false) ⇒ SaxBuilder (constructor)
  Initialize the SAX builder.
- #processing_instruction(name, content) ⇒ Object
  Called for processing instructions.
- #reorder_children(root) ⇒ Object
  Reorder root children so document element comes first followed by PIs and comments (outside document element).
- #result ⇒ Nodes::RootNode
  Return the built tree.
- #start_element(name, attrs = []) ⇒ Object
  Called when an element starts.
Constructor Details
#initialize(preserve_whitespace: false) ⇒ SaxBuilder
Initialize the SAX builder
# File 'lib/canon/xml/sax_builder.rb', line 88

def initialize(preserve_whitespace: false)
  super()
  @preserve_whitespace = preserve_whitespace
  @root = Nodes::RootNode.new
  @stack = [@root]
  # Track in-scope namespaces at each level
  # Each entry is a hash of prefix => uri
  @namespace_stack = [build_initial_namespaces]
end
Class Method Details
.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode
Parse XML string and return Canon::Xml::Node tree
# File 'lib/canon/xml/sax_builder.rb', line 38

def self.parse(xml_string, preserve_whitespace: false, strip_doctype: false)
  # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
  # This is needed for C14N which should NOT include default attributes from DTD
  # Use string methods instead of complex regex to avoid ReDoS vulnerability
  if strip_doctype
    xml_string = strip_doctype_declaration(xml_string)
  end

  builder = new(preserve_whitespace: preserve_whitespace)
  parser = Nokogiri::XML::SAX::Parser.new(builder)
  parser.parse(xml_string)
  builder.result
end
.strip_doctype_declaration(xml) ⇒ String
Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.

# File 'lib/canon/xml/sax_builder.rb', line 58

def self.strip_doctype_declaration(xml)
  # Find DOCTYPE start (case-insensitive)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  # Find the end of DOCTYPE - it ends with >
  # Handle both simple DOCTYPE and those with internal subset [...]
  pos = doctype_start + 9 # length of "<!DOCTYPE"
  in_bracket = false

  while pos < xml.length
    char = xml[pos]
    if char == "[" && !in_bracket
      in_bracket = true
    elsif char == "]" && in_bracket
      in_bracket = false
    elsif char == ">" && !in_bracket
      # Found the end of DOCTYPE
      return xml[0...doctype_start] + xml[(pos + 1)..]
    end
    pos += 1
  end

  # If we didn't find a proper end, just return original
  xml
end
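To see how the linear scan behaves on different inputs, here is a standalone sketch of the same logic written as a top-level method so it can be tried without loading Canon or Nokogiri. It is a restatement for experimentation, not the library's code path; the sample strings are invented for illustration.

```ruby
# Standalone restatement of the scan: one pass, no regex backtracking.
def strip_doctype_declaration(xml)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  pos = doctype_start + "<!DOCTYPE".length
  in_bracket = false # true while inside an internal subset [...]

  while pos < xml.length
    case xml[pos]
    when "[" then in_bracket = true
    when "]" then in_bracket = false
    when ">"
      # A ">" only ends the DOCTYPE when we are outside the internal subset.
      return xml[0...doctype_start] + xml[(pos + 1)..] unless in_bracket
    end
    pos += 1
  end

  xml # no terminating ">" found: leave the input untouched
end

simple      = '<!DOCTYPE html><root/>'
with_subset = '<!DOCTYPE doc [<!ATTLIST e a CDATA "x">]><doc/>'
broken      = '<!DOCTYPE doc'

puts strip_doctype_declaration(simple)       # prints "<root/>"
puts strip_doctype_declaration(with_subset)  # prints "<doc/>"
puts strip_doctype_declaration(broken)       # prints "<!DOCTYPE doc"
```

One point that may be worth checking in real inputs: `String#upcase` can change a string's length for some non-ASCII characters (for example "ß" becomes "SS"), so an index computed on the upcased copy could be shifted if such characters precede the DOCTYPE.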
Instance Method Details
#characters(string) ⇒ Object
Called for text content
# File 'lib/canon/xml/sax_builder.rb', line 158

def characters(string)
  return if string.nil?

  parent = @stack.last

  # Capture raw text BEFORE entity resolution for accurate serialization
  raw_string = string

  # Decode numeric character references
  decoded_string = decode_character_references(string)

  # Combine with previous text node if adjacent (SAX can split text content)
  # This MUST happen before whitespace check, because SAX may split "foo "
  # into "foo" and " " callbacks - we need to combine them before deciding
  # whether to skip whitespace
  last_child = parent.children.last
  if last_child&.node_type == :text
    # Combine both raw and decoded forms
    last_child.instance_variable_set(:@value,
                                     last_child.value + decoded_string)
    last_child.instance_variable_set(:@original,
                                     (last_child.original || "") + raw_string)
    return
  end

  # Skip whitespace-only text nodes unless:
  # 1. preserve_whitespace is true, OR
  # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N
  if !@preserve_whitespace && decoded_string.strip.empty? &&
     parent.node_type == :element && !decoded_string.include?("\r")
    # Only skip if parent is an element (not root)
    return
  end

  text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
  parent.add_child(text)
end
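The ordering in this method (merge adjacent text first, then decide whether to drop whitespace) can be illustrated with a reduced sketch. `TextNode` and `append_text` below are hypothetical stand-ins, and the sketch omits the raw/decoded distinction and the parent-type check; it only shows why merging comes first and why a carriage return needs its own test, since `String#strip` treats "\r" as whitespace.

```ruby
# Hypothetical stand-in for a text node; not Canon's TextNode.
TextNode = Struct.new(:value)

# Append a SAX text chunk to a list of children.
def append_text(children, chunk, preserve_whitespace: false)
  last = children.last
  if last.is_a?(TextNode)
    # Merge before any whitespace decision: the parser may deliver
    # "foo " as two callbacks, "foo" and " ".
    last.value += chunk
    return children
  end

  # Drop whitespace-only chunks, except when preserving whitespace
  # or when the chunk carries a carriage return.
  skip = !preserve_whitespace && chunk.strip.empty? && !chunk.include?("\r")
  children << TextNode.new(chunk) unless skip
  children
end

split = []
append_text(split, "foo")
append_text(split, " ")
p split.map(&:value)   # prints ["foo "]

indent = append_text([], "\n  ")
p indent.size          # prints 0

cr = append_text([], "\r")
p cr.map(&:value)      # prints ["\r"]
```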
#comment(string) ⇒ Object
Called for comments
# File 'lib/canon/xml/sax_builder.rb', line 198

def comment(string)
  parent = @stack.last
  comment_node = Nodes::CommentNode.new(value: string)
  parent.add_child(comment_node)
end
#end_element(_name) ⇒ Object
Called when an element ends
# File 'lib/canon/xml/sax_builder.rb', line 150

def end_element(_name)
  @stack.pop
  @namespace_stack.pop
end
#processing_instruction(name, content) ⇒ Object
Called for processing instructions
# File 'lib/canon/xml/sax_builder.rb', line 208

def processing_instruction(name, content)
  parent = @stack.last
  pi = Nodes::ProcessingInstructionNode.new(target: name,
                                            data: content || "")
  parent.add_child(pi)
end
#reorder_children(root) ⇒ Object
Reorder root children so document element comes first followed by PIs and comments (outside document element)
# File 'lib/canon/xml/sax_builder.rb', line 228

def reorder_children(root)
  doc_element = root.children.find { |c| c.node_type == :element }
  return unless doc_element

  other_children = root.children.reject { |c| c.node_type == :element }
  root.instance_variable_set(:@children, [doc_element] + other_children)
end
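The effect of the reordering is easiest to see on a small list. The sketch below uses a hypothetical `Child` struct and a pure function instead of mutating a root node; it assumes, as in well-formed XML, that the root has at most one element child.

```ruby
# Hypothetical stand-in for root-level nodes.
Child = Struct.new(:node_type, :label)

# Return a new list with the document element first; the relative order
# of the remaining nodes is unchanged.
def reorder(children)
  doc_element = children.find { |c| c.node_type == :element }
  return children unless doc_element

  [doc_element] + children.reject { |c| c.node_type == :element }
end

children = [
  Child.new(:comment, "before"),
  Child.new(:element, "doc"),
  Child.new(:processing_instruction, "after"),
]

p reorder(children).map(&:label)   # prints ["doc", "before", "after"]
```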
#result ⇒ Nodes::RootNode
Return the built tree
# File 'lib/canon/xml/sax_builder.rb', line 218

def result
  # Reorder children so that the document element comes first,
  # followed by PIs and comments outside the document element
  # (C14N requires this ordering)
  reorder_children(@root)
  @root
end
#start_element(name, attrs = []) ⇒ Object
Called when an element starts
# File 'lib/canon/xml/sax_builder.rb', line 102

def start_element(name, attrs = [])
  parent = @stack.last

  # Parse namespace from name (prefix:localname or just localname)
  prefix, local_name = parse_qname(name)

  # Separate namespace declarations from regular attributes
  ns_decls, regular_attrs = separate_namespaces(attrs)

  # Check for relative namespace URIs (before building hash)
  # Convert to hash for iteration
  ns_hash = build_ns_hash(ns_decls)
  ns_hash.each_value do |uri|
    next if uri.nil? || uri.empty?

    if relative_uri?(uri)
      raise Canon::Error, "Relative namespace URI not allowed: #{uri}"
    end
  end

  # Push new namespace scope with declarations
  new_scope = @namespace_stack.last.merge(ns_hash)
  @namespace_stack.push(new_scope)

  # Find namespace URI from current scope
  ns_uri = new_scope[prefix.to_s]

  # Create element node
  element = Nodes::ElementNode.new(
    name: local_name,
    namespace_uri: ns_uri,
    prefix: prefix,
  )

  # Add namespace nodes from current scope
  add_namespace_nodes(element, new_scope)

  # Build and add attribute nodes (excluding xmlns declarations)
  add_attribute_nodes(element, regular_attrs)

  parent.add_child(element)
  @stack.push(element)
end
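The namespace handling pairs a push in start_element with a pop in end_element, so each element sees its ancestors' declarations plus its own. The sketch below isolates that pattern. `ScopeTracker` and its method names are hypothetical, and two details are assumptions rather than facts from the source: that the initial scope contains only the `xml` prefix (the body of `build_initial_namespaces` is not shown), and that the default namespace is stored under the empty-string key, which is what `prefix.to_s` yields for a nil prefix.

```ruby
# Hypothetical tracker for in-scope namespaces; not part of Canon.
class ScopeTracker
  def initialize
    # Assumption: the initial scope holds only the predefined xml prefix.
    @namespace_stack = [{ "xml" => "http://www.w3.org/XML/1998/namespace" }]
  end

  # Element start: inherit the parent scope, then apply new declarations.
  def enter(declarations = {})
    @namespace_stack.push(@namespace_stack.last.merge(declarations))
  end

  # Element end: discard the element's scope.
  def leave
    @namespace_stack.pop
  end

  # Look up a prefix; nil maps to "" (assumed key for the default namespace).
  def resolve(prefix)
    @namespace_stack.last[prefix.to_s]
  end
end

tracker = ScopeTracker.new
tracker.enter("" => "urn:a")    # <root xmlns="urn:a">
tracker.enter("p" => "urn:b")   # <child xmlns:p="urn:b">
p tracker.resolve(nil)          # prints "urn:a"
p tracker.resolve("p")          # prints "urn:b"
tracker.leave                   # </child>
p tracker.resolve("p")          # prints nil
```

Because `Hash#merge` returns a new hash, each scope is an independent snapshot, which keeps the pop in the end-of-element step trivially correct at the cost of one hash allocation per element.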