Class: Canon::Xml::SaxBuilder
- Inherits: Nokogiri::XML::SAX::Document
- Inheritance chain: Object → Nokogiri::XML::SAX::Document → Canon::Xml::SaxBuilder
- Defined in: lib/canon/xml/sax_builder.rb
Overview
Builds a Canon::Xml::Node tree using the Nokogiri SAX parser.
This is much faster than DOM parsing followed by conversion because:
- No intermediate Nokogiri DOM tree (saves ~60ms)
- No tree traversal to build the Canon tree (saves ~1200ms)
- No memory overhead of two complete DOM trees
Current (slow): XML String → Nokogiri DOM (~60ms) → Canon DOM (~1200ms) = ~1260ms
Optimized (fast): XML String → Nokogiri SAX → Canon DOM (~200ms) = ~200ms
Usage:
    root = SaxBuilder.parse(xml_string, preserve_whitespace: false)
    # root is a Canon::Xml::Nodes::RootNode
For C14N, use strip_doctype: true to avoid DTD default attribute expansion:
    root = SaxBuilder.parse(xml_string, strip_doctype: true)
Class Method Summary
- .parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode
  Parse an XML string and return a Canon::Xml::Node tree.
- .strip_doctype_declaration(xml) ⇒ String
  Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.
Instance Method Summary
- #characters(string) ⇒ Object
  Called for text content.
- #comment(string) ⇒ Object
  Called for comments.
- #end_element(_name) ⇒ Object
  Called when an element ends.
- #error(string) ⇒ Object
  SAX callbacks for libxml errors and warnings.
- #initialize(preserve_whitespace: false) ⇒ SaxBuilder (constructor)
  Initialize the SAX builder.
- #processing_instruction(name, content) ⇒ Object
  Called for processing instructions.
- #reorder_children(root) ⇒ Object
  Reorder root children so the document element comes first, followed by PIs and comments (outside the document element).
- #result ⇒ Nodes::RootNode
  Return the built tree.
- #start_element(name, attrs = []) ⇒ Object
  Called when an element starts.
- #warning(string) ⇒ Object
Constructor Details
#initialize(preserve_whitespace: false) ⇒ SaxBuilder
Initialize the SAX builder
    # File 'lib/canon/xml/sax_builder.rb', line 88
    def initialize(preserve_whitespace: false)
      super()
      @preserve_whitespace = preserve_whitespace
      @root = Nodes::RootNode.new
      @stack = [@root]
      # Track in-scope namespaces at each level
      # Each entry is a hash of prefix => uri
      @namespace_stack = [build_initial_namespaces]
      # Captured libxml errors during SAX parsing. Surfaced on the
      # resulting RootNode so the diff report can warn the user
      # when a FATAL parse error has caused content loss
      # (see lutaml/canon#130).
      @parse_errors = []
    end
Class Method Details
.parse(xml_string, preserve_whitespace: false, strip_doctype: false) ⇒ Nodes::RootNode
Parse XML string and return Canon::Xml::Node tree
    # File 'lib/canon/xml/sax_builder.rb', line 38
    def self.parse(xml_string, preserve_whitespace: false,
                   strip_doctype: false)
      # Strip DOCTYPE to prevent Nokogiri SAX from expanding DTD default attributes
      # This is needed for C14N which should NOT include default attributes from DTD
      # Use string methods instead of complex regex to avoid ReDoS vulnerability
      if strip_doctype
        xml_string = strip_doctype_declaration(xml_string)
      end

      builder = new(preserve_whitespace: preserve_whitespace)
      parser = Nokogiri::XML::SAX::Parser.new(builder)
      parser.parse(xml_string)
      builder.result
    end
.strip_doctype_declaration(xml) ⇒ String
Strip the DOCTYPE declaration without using a complex regex. This avoids the ReDoS vulnerability from patterns like \s+ and [^>]*.
    # File 'lib/canon/xml/sax_builder.rb', line 58
    def self.strip_doctype_declaration(xml)
      # Find DOCTYPE start (case-insensitive)
      doctype_start = xml.upcase.index("<!DOCTYPE")
      return xml unless doctype_start

      # Find the end of DOCTYPE - it ends with >
      # Handle both simple DOCTYPE and those with internal subset [...]
      pos = doctype_start + 9 # length of "<!DOCTYPE"
      in_bracket = false

      while pos < xml.length
        char = xml[pos]

        if char == "[" && !in_bracket
          in_bracket = true
        elsif char == "]" && in_bracket
          in_bracket = false
        elsif char == ">" && !in_bracket
          # Found the end of DOCTYPE
          return xml[0...doctype_start] + xml[(pos + 1)..]
        end
        pos += 1
      end

      # If we didn't find a proper end, just return original
      xml
    end
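The bracket-aware scan in the listing above can be tried in isolation. The following is a minimal sketch that copies the same logic into a free function; the name `strip_doctype` and the sample documents are illustrative assumptions, not part of Canon.

```ruby
# Standalone copy of the scan, for illustration only.
# `strip_doctype` is a hypothetical name; the real entry point is
# Canon::Xml::SaxBuilder.strip_doctype_declaration.
def strip_doctype(xml)
  doctype_start = xml.upcase.index("<!DOCTYPE")
  return xml unless doctype_start

  pos = doctype_start + "<!DOCTYPE".length
  in_bracket = false
  while pos < xml.length
    case xml[pos]
    when "[" then in_bracket = true   # entering the internal subset
    when "]" then in_bracket = false  # leaving the internal subset
    when ">"
      # A ">" inside the internal subset does not end the declaration.
      return xml[0...doctype_start] + xml[(pos + 1)..] unless in_bracket
    end
    pos += 1
  end
  xml # unterminated DOCTYPE: return the input untouched
end

simple = %(<!DOCTYPE note SYSTEM "note.dtd"><note/>)
subset = %(<?xml version="1.0"?><!DOCTYPE doc [<!ATTLIST doc a CDATA "x">]><doc/>)

puts strip_doctype(simple) # prints <note/>
puts strip_doctype(subset) # prints <?xml version="1.0"?><doc/>
```

The second input shows why a plain search for the first ">" would not be enough: the ATTLIST declaration inside the internal subset contains its own ">".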
Instance Method Details
#characters(string) ⇒ Object
Called for text content
    # File 'lib/canon/xml/sax_builder.rb', line 175
    def characters(string)
      return if string.nil?

      parent = @stack.last

      # Capture raw text BEFORE entity resolution for accurate serialization
      raw_string = string

      # Decode numeric character references
      decoded_string = decode_character_references(string)

      # Combine with previous text node if adjacent (SAX can split text content)
      # This MUST happen before whitespace check, because SAX may split "foo "
      # into "foo" and " " callbacks - we need to combine them before deciding
      # whether to skip whitespace
      last_child = parent.children.last
      if last_child&.node_type == :text
        # Combine both raw and decoded forms
        last_child.instance_variable_set(:@value,
                                         last_child.value + decoded_string)
        last_child.instance_variable_set(:@original,
                                         (last_child.original || "") + raw_string)
        return
      end

      # Skip whitespace-only text nodes unless:
      # 1. preserve_whitespace is true, OR
      # 2. The content contains CR (from &#xD; entities) which must be preserved for C14N, OR
      # 3. The content contains non-ASCII whitespace (NBSP U+00A0, ideographic
      #    space U+3000, etc.) - those are semantically meaningful content,
      #    not pretty-print indentation, and must survive parsing so the
      #    comparator can detect Unicode whitespace-type differences.
      #
      # Strip only when the node is pure ASCII whitespace (space, tab, CR, LF).
      # This lets pretty-printed fixtures work (indent nodes stripped) while
      # preserving NBSP-only text nodes.
      if !@preserve_whitespace && decoded_string.gsub(/[ \t\r\n]/, "").empty? &&
          parent.node_type == :element && !decoded_string.include?("\r")
        # Only skip if parent is an element (not root)
        return
      end

      text = Nodes::TextNode.new(value: decoded_string, original: raw_string)
      parent.add_child(text)
    end
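The skip condition at the end of the listing above combines four checks, which can be easier to follow as a predicate. This is a sketch under assumptions: `skip_text?` and its keyword arguments are hypothetical names, and it models only the final condition, not the earlier merge with an adjacent text node.

```ruby
# Hypothetical helper mirroring the skip condition in #characters.
# Returns true when a text chunk would be dropped instead of becoming a node.
def skip_text?(text, preserve_whitespace:, parent_type:)
  return false if preserve_whitespace            # caller asked to keep everything
  return false unless parent_type == :element    # text directly under the root is kept
  return false if text.include?("\r")            # CR (e.g. from &#xD;) is kept for C14N
  text.gsub(/[ \t\r\n]/, "").empty?              # nothing left but ASCII whitespace?
end

opts = { preserve_whitespace: false, parent_type: :element }

puts skip_text?("\n    ", **opts)   # prints true  (pretty-print indentation)
puts skip_text?("\u00A0", **opts)   # prints false (NBSP is content)
puts skip_text?(" \r\n", **opts)    # prints false (contains CR)
puts skip_text?("\n", preserve_whitespace: true, parent_type: :element) # prints false
```

Because the method merges a chunk into a preceding text node before reaching this check, a whitespace-only chunk that follows other text is appended rather than skipped.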
#comment(string) ⇒ Object
Called for comments
    # File 'lib/canon/xml/sax_builder.rb', line 224
    def comment(string)
      parent = @stack.last
      comment_node = Nodes::CommentNode.new(value: string)
      parent.add_child(comment_node)
    end
#end_element(_name) ⇒ Object
Called when an element ends
    # File 'lib/canon/xml/sax_builder.rb', line 167
    def end_element(_name)
      @stack.pop
      @namespace_stack.pop
    end
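Popping the stack here is the counterpart of the push in #start_element: the last entry of the stack is always the element currently open. A minimal sketch of that pattern, replaying a hand-written event list instead of real SAX callbacks, might look as follows. `SketchNode`, `build_tree` and the event format are assumptions made for illustration.

```ruby
# Illustrative node type; Canon's real nodes live under Canon::Xml::Nodes.
SketchNode = Struct.new(:name, :children)

# Replays a list of SAX-like events and returns the root node.
def build_tree(events)
  root = SketchNode.new("#root", [])
  stack = [root]                       # stack.last is the currently open parent
  events.each do |type, name|
    case type
    when :start
      node = SketchNode.new(name, [])
      stack.last.children << node      # attach to the open parent
      stack.push(node)                 # descend into the new element
    when :end
      stack.pop                        # ascend, as #end_element does
    end
  end
  root
end

events = [[:start, "a"], [:start, "b"], [:end, "b"],
          [:start, "c"], [:end, "c"], [:end, "a"]]
tree = build_tree(events)
puts tree.children.first.children.map(&:name).inspect # prints ["b", "c"]
```

No tree traversal is needed afterwards, since each node is attached to its parent at the moment its start event arrives.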
#error(string) ⇒ Object
SAX callbacks for libxml errors and warnings. Without these overrides the default handlers swallow the events; with them, libxml’s “Attribute xml:lang redefined” and similar messages land in @parse_errors and ride through to ComparisonResult.
    # File 'lib/canon/xml/sax_builder.rb', line 107
    def error(string)
      @parse_errors << string.to_s.strip
    end
#processing_instruction(name, content) ⇒ Object
Called for processing instructions
    # File 'lib/canon/xml/sax_builder.rb', line 234
    def processing_instruction(name, content)
      parent = @stack.last
      pi = Nodes::ProcessingInstructionNode.new(target: name,
                                                data: content || "")
      parent.add_child(pi)
    end
#reorder_children(root) ⇒ Object
Reorder root children so document element comes first followed by PIs and comments (outside document element)
    # File 'lib/canon/xml/sax_builder.rb', line 255
    def reorder_children(root)
      doc_element = root.children.find { |c| c.node_type == :element }
      return unless doc_element

      other_children = root.children.reject { |c| c.node_type == :element }
      root.instance_variable_set(:@children, [doc_element] + other_children)
    end
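The effect of the reordering is easiest to see on a small list. The sketch below applies the same rule as a pure function on stand-in nodes; `DemoNode` and `document_element_first` are hypothetical names, and unlike the method above it returns a new array rather than mutating the root.

```ruby
# Illustrative stand-in for Canon's node classes.
DemoNode = Struct.new(:node_type, :label)

# Same ordering rule as #reorder_children, written as a pure function.
def document_element_first(children)
  doc_element = children.find { |c| c.node_type == :element }
  return children unless doc_element   # nothing to reorder without an element

  # Non-element nodes keep their relative order behind the element.
  [doc_element] + children.reject { |c| c.node_type == :element }
end

children = [
  DemoNode.new(:comment, "before"),
  DemoNode.new(:processing_instruction, "style"),
  DemoNode.new(:element, "doc"),
  DemoNode.new(:comment, "after"),
]

puts document_element_first(children).map(&:label).inspect
# prints ["doc", "before", "style", "after"]
```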
#result ⇒ Nodes::RootNode
Return the built tree
    # File 'lib/canon/xml/sax_builder.rb', line 244
    def result
      # Reorder children so that the document element comes first,
      # followed by PIs and comments outside the document element
      # (C14N requires this ordering)
      reorder_children(@root)
      @root.parse_errors = @parse_errors if @parse_errors.any?
      @root
    end
#start_element(name, attrs = []) ⇒ Object
Called when an element starts
    # File 'lib/canon/xml/sax_builder.rb', line 119
    def start_element(name, attrs = [])
      parent = @stack.last

      # Parse namespace from name (prefix:localname or just localname)
      prefix, local_name = parse_qname(name)

      # Separate namespace declarations from regular attributes
      ns_decls, regular_attrs = separate_namespaces(attrs)

      # Check for relative namespace URIs (before building hash)
      # Convert to hash for iteration
      ns_hash = build_ns_hash(ns_decls)
      ns_hash.each_value do |uri|
        next if uri.nil? || uri.empty?

        if relative_uri?(uri)
          raise Canon::Error,
                "Relative namespace URI not allowed: #{uri}"
        end
      end

      # Push new namespace scope with declarations
      new_scope = @namespace_stack.last.merge(ns_hash)
      @namespace_stack.push(new_scope)

      # Find namespace URI from current scope
      ns_uri = new_scope[prefix.to_s]

      # Create element node
      element = Nodes::ElementNode.new(
        name: local_name,
        namespace_uri: ns_uri,
        prefix: prefix,
      )

      # Add namespace nodes from current scope
      add_namespace_nodes(element, new_scope)

      # Build and add attribute nodes (excluding xmlns declarations)
      add_attribute_nodes(element, regular_attrs)

      parent.add_child(element)
      @stack.push(element)
    end
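The namespace handling in the listing above relies on one idea: each element's scope is its parent's scope merged with its own declarations, and the scope is discarded when the element ends. The following sketch shows that mechanism on plain hashes. `push_scope` is a hypothetical helper, and both the initial `xml` binding and the use of `""` as the key for the default namespace are assumptions, since `build_initial_namespaces` and `build_ns_hash` are not shown on this page.

```ruby
# Hypothetical helper: push the parent's scope merged with new declarations.
# Scopes are hashes of prefix => uri.
def push_scope(scope_stack, declarations)
  scope_stack.push(scope_stack.last.merge(declarations))
  scope_stack.last
end

# Assumed starting scope; the real one comes from build_initial_namespaces.
scopes = [{ "xml" => "http://www.w3.org/XML/1998/namespace" }]

outer = push_scope(scopes, { "" => "urn:outer", "a" => "urn:a" })
inner = push_scope(scopes, { "a" => "urn:a2" })   # rebinds prefix "a" only

puts inner[""]   # prints urn:outer (inherited from the parent element)
puts inner["a"]  # prints urn:a2    (overridden by the inner declaration)

scopes.pop       # inner element ends
puts scopes.last["a"] # prints urn:a (outer binding is back in effect)
```

Since `Hash#merge` returns a new hash, an inner rebinding never leaks into the parent's scope, which is what makes the simple pop in #end_element sufficient.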
#warning(string) ⇒ Object
    # File 'lib/canon/xml/sax_builder.rb', line 111
    def warning(string)
      @parse_errors << string.to_s.strip
    end